Crypto Infrastructure and DevOps Best Practices

DevOps Practices for Crypto Infrastructure, Part II: Authentication, Authorization, Networking, Monitoring, and Logging

Picking Up Where We Left Off

In part one of this two-part series, I discussed core DevOps principles that helped guide our crypto infrastructure here at PureStake, and discussed some unique considerations around version control, full stack automation, and secrets management. If you didn’t have a chance to read the first post yet, you can find it here.

In this second post, I will continue calling out principles and examining different areas that are important to consider when setting up and running secure and reliable crypto infrastructure.

Authentication and Authorization

One of the most important aspects of ensuring the security of your infrastructure is having the right authentication systems in place.

For logging into infrastructure and servers, I favor centralized/federated authentication directories over local ones. It is very important that DevOps staff have unique user accounts for logging into infrastructure, rather than using shared accounts.  Unique accounts provide a record of who logged into what, which is essential to understanding what is happening in your environment. Shared accounts, including direct use of the Administrator or root accounts on servers, become very challenging when you have turnover in your staff or, in the worst case, if there has been an incident.  It’s much cleaner to revoke access, assign rights, review past history, and understand what is happening with a centralized directory.

For the scope of authentication, I recommend a full separation of the corporate IT environment and the production infrastructure environments.  Fully separate directories are the best approach, even if your directory supports different groups and roles. This greatly reduces the chance of human error that results in too much, or incorrect, access.

However, that doesn’t mean that you shouldn’t use groups. Grouping users who need access to the infrastructure and assigning the appropriate roles to them is critical to being able to manage access in a reasonable way in crypto and other environments.  It is too complicated and too easy to make a mistake when assigning rights to individual users. Even for a small team, having at least a few roles will be appropriate, such as a role with full access for select senior DevOps staff, a role with limited access for junior DevOps staff, and perhaps a monitoring only role for managers and other technical staff.  The principle to keep in mind is that of least privilege, which states that users and groups should have as few rights as possible to do their job with a mechanism/process to escalate that can be logged/monitored. This also supports a closely-related concept of blast radius minimization. Having users with roles that employ the concept of least privilege will minimize the blast radius associated with an incident where user credentials or accounts have been compromised.
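As a minimal sketch of the role-based, least-privilege approach described above, the Python snippet below maps three roles to the actions they may perform and denies everything else by default. The role names and action names are hypothetical, chosen only to mirror the senior, junior, and monitoring-only roles mentioned above; a real deployment would define these in its directory or cloud IAM system rather than in application code.

```python
from typing import Dict, Iterable, Set

# Hypothetical roles and actions, for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "senior_devops": {"view_dashboards", "restart_node", "deploy", "manage_access"},
    "junior_devops": {"view_dashboards", "restart_node"},
    "monitoring_only": {"view_dashboards"},
}


def is_allowed(role: str, action: str) -> bool:
    """Return True only if the role explicitly grants the action (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, set())


def allowed_actions(roles: Iterable[str]) -> Set[str]:
    """Return the union of the actions granted by all of a user's roles."""
    granted: Set[str] = set()
    for role in roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted
```

Because unknown roles and unlisted actions fall through to a denial, a mistake in assigning roles tends to produce too little access rather than too much, which keeps the blast radius small.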

Using traditional passwords as a way to log into crypto infrastructure is not a good security practice.  Where passwords must be used, I recommend the use of a password manager such as Dashlane, which can be set up in a corporate configuration with shared groups, role-based access, and where a unique strong password can be used for each system.  Crypto environments require more security than this. At a minimum, all accounts must require two-factor authentication, where the first factor can be a traditional strong password, and an authenticator app is the second factor. A better setup replaces the authenticator app with a physical hardware device.

For identity management in Windows environments, Active Directory is the logical choice.  For Linux environments, OpenLDAP and Kerberos serve a similar function. Each cloud vendor has its own identity management scheme, including AWS IAM, Azure AD, and Google Cloud IAM, each with its own nuances.  Google Authenticator works very well as a second factor in 2FA setups. For a physical device second factor, YubiKey is an inexpensive option that plugs into a USB port on your computer. Requiring the YubiKey as one of the authentication factors means that the device must physically be in the possession of the user at the time of login.


Logging

A well-run infrastructure has good mechanisms in place to manage server, application, and as-a-service logs.  Logs are not only useful for troubleshooting infrastructure issues, but also provide the basis for audit control and intrusion detection.  You need reliable logs to understand what has happened in your environment.

The most important practice is to ship logs off the servers, containers, and other infrastructure elements to isolated, tamper-proof locations.  Authorization roles should be employed to isolate these log collection points to make them as tamper-proof as possible. Then the logs can be loaded into query optimized data stores to facilitate visibility, troubleshooting, and monitoring scenarios.

In particular, when running crypto nodes, sometimes logs are the only way to understand what is happening on the nodes. Critical error messages and log entries related to the crypto network protocol can be the only way you can understand that a node is running well or poorly.

In a Windows environment, events can be forwarded to an event collector.  In a Linux environment, rsyslog works well for forwarding syslog to regional and ultimately centralized data stores.  For log-based searching, troubleshooting, and time series analysis, Splunk is a Cadillac solution: tons of functionality, but at a very high price.  An alternative to Splunk is the open source ELK stack (Elasticsearch, Logstash, Kibana) which has gotten a lot better over time and offers a much less expensive way to search and troubleshoot infrastructure based on log data.
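To illustrate the idea of shipping logs off the machine, here is a small Python sketch that forwards application log records to a remote syslog collector using the standard library's SysLogHandler. This is an application-level alternative shown for illustration, not the rsyslog forwarding setup mentioned above. The function name, logger name, and collector address are assumptions, and UDP port 514 is simply the conventional syslog port.

```python
import logging
import logging.handlers


def build_remote_logger(name: str, collector_host: str, collector_port: int = 514) -> logging.Logger:
    """Return a logger that forwards records to a remote syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler sends each record as a syslog datagram to the given address.
    handler = logging.handlers.SysLogHandler(address=(collector_host, collector_port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Hypothetical collector address; replace with your own log collection point.
    log = build_remote_logger("node-health", "127.0.0.1")
    log.info("node started")
```

UDP delivery is best effort, so for audit-grade logs a forwarder with queuing and TCP or TLS transport is usually the safer choice.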

Monitoring and Alerting

If you want to run reliable crypto infrastructure, you have to know when services are not running well.  The principle here is that everything fails — but early detection allows infrastructure element failures to be remedied quickly.

With good redundancy in design, individual element failures ideally have little-to-no end user impact. For cases where there are failures that lead to end user service impact, strong automation will minimize the time to restore services.

Focusing on early detection, the best way to accomplish that is through the extensive use of monitoring at the different layers of the stack and from different locations. If you are in a colo environment and managing hardware, the monitoring of that hardware will likely require vendor specific tools and possibly the collection of SNMP traps.  For cloud environments, the providers offer native monitoring that is integrated with their service offerings. As an example, AWS offers CloudWatch for monitoring AWS based services.

There are a lot of elements in a crypto infrastructure that need to be monitored.  It’s important to choose a platform which will serve as the place where monitoring data is sent, where alerting thresholds are set and where alerts are managed.  As different monitoring checks are added over time, they can feed into that system. It is extremely difficult to manage alerts, maintenance downtime, and inventory completeness if you have multiple places to go to manage these items.

At the lower end of the stack, you will want to put basic checks in place for OS-level resources like CPU, memory, and disk.  Basic network checks would include ping/ICMP, TCP port exhaustion, and TCP service checks. Security events such as those that come off IDS and IPS systems could be fed in here as well.  Application-level checks can include HTTPS checks that hit a URL and look for status or error codes and messages.
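The basic checks described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for a monitoring platform: the function names and the warning and critical thresholds are assumptions, and the result strings simply follow the common OK / WARNING / CRITICAL convention.

```python
import shutil
import socket


def disk_used_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def tcp_service_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(value: float, warn: float, crit: float) -> str:
    """Map a measured value to a status using warning and critical thresholds."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Example thresholds (assumed): warn at 80% disk used, critical at 90%.
    print("disk:", classify(disk_used_fraction("/"), warn=0.80, crit=0.90))
```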

For crypto-specific infrastructure checks, consider that the base crypto infrastructure consists of nodes.  Crypto nodes often expose status and query interfaces via a REST API, so querying that API on a regular basis to look for status and error codes is a good start, but you should be careful that you are not exposing that API to the wider internet.  Other checks specific to crypto nodes include looking at block height on nodes, and making sure that it has an expected value. Nodes should be producing blocks on a regular cadence and, depending on the node role, may be helping to support the consensus mechanism of the network.  Using monitoring to look for deviations from normal block production or consensus participation behavior is a good early warning indicator of trouble.
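As a rough sketch of the block height checks described above, the following Python functions compare heights across a set of nodes and detect a node whose height has stopped advancing. How the heights are collected (for example, from each node's status API) is left out; the function names, node names, and the lag tolerance of 5 blocks are all assumptions made for illustration.

```python
from typing import Dict, List


def lagging_nodes(heights: Dict[str, int], max_lag: int = 5) -> List[str]:
    """Return the names of nodes more than max_lag blocks behind the highest reported height."""
    if not heights:
        return []
    best = max(heights.values())
    return sorted(name for name, height in heights.items() if best - height > max_lag)


def is_stalled(previous_height: int, current_height: int) -> bool:
    """Return True if the block height did not advance between two polls."""
    return current_height <= previous_height


if __name__ == "__main__":
    # Hypothetical readings from three nodes.
    readings = {"node-a": 789014, "node-b": 789013, "node-c": 788900}
    print(lagging_nodes(readings))
```

The polling interval for the stall check should be comfortably longer than the expected block time, otherwise a healthy node can look stalled between two blocks.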

Once you have a view from inside of your environment, it is equally important to get a point of view from outside of your environment.  This means taking the perspective of your customers and seeing if your services are performing well from their viewpoint. I’ve experienced situations where all services are green from the internal point of view, but a WAN or internet issue means that certain customers are not able to use the service. A common cause is a physical line cut that creates bad network paths to your service until traffic is rerouted. Using a cloud provider with multiple external points of presence can help provide this outside-in view of your services.

From an open source monitoring tools perspective, the old workhorse is Nagios and Nagios variants such as Checkmk, which I have used for years to monitor production environments.  These tools are starting to show their age, but they are battle-tested and reliable. A newer option getting good traction is Prometheus with its more modern-looking Grafana-based visualizations.  For a greenfield environment, Prometheus is a good choice.

Nagios/Prometheus work in a poll model, where servers provide data on a port and a centralized service routinely collects the data and makes it available.  DataDog is an example of an alternate model where the data is streamed from the server itself with an agent to a centralized location. For alerting operational staff when there are critical alarms, I have always found PagerDuty to be a good choice, but OpsGenie or VictorOps will provide similar functionality.  For external cloud based availability monitoring, ThousandEyes is a good choice, and something like Pingdom will get you basic external coverage for a low-cost entry point.

Concluding Thoughts

The two posts in this series have only scratched the surface of crypto DevOps practices.  Other areas that may be the subject of future posts include networking and VPCs, blue/green deployments, Docker vs VMs, load balancing and failover strategies, IDS/IPS, storage management for blockchain nodes, crypto key management strategies, personal security best practices for DevOps staff, and other topics.  Employing good practices across all of these areas is an important part of what it takes to provide secure and reliable crypto infrastructure.

Looking for further information about infrastructure for crypto based applications? Contact us today.


DevOps Practices for Crypto Infrastructure, Part I: Version Control, Full Stack Automation, and Secrets Management

When standing up services that will have cryptographic interactions with a blockchain, the DevOps infrastructure and practices you employ will dictate a lot about the security and reliability of those services. In this two-part series of posts, I will introduce core DevOps principles that will help guide crypto infrastructure creation. I’ll also share different DevOps infrastructure aspects that I have found to work well for me, and that could be helpful to other teams looking to stand up crypto as-a-service offerings.

Cloud vs Roll-Your-Own

For many infrastructure elements, you must choose whether to go with a cloud provider such as AWS, Azure, or Google, or to roll-your-own in a colocated data center with self-managed software. In crypto and blockchain, there are some specific requirements, particularly relating to key security, which may factor into requirements around hardware security modules (HSMs), physical servers, and tiers of colocated data centers (more on this later).

But in general, all other things being equal, if there is an option from one of the three major cloud providers for an as-a-service offering vs rolling your own with purchased hardware and self managed software, there are a lot of reasons to go for the cloud option.

In my experience, it is very easy to underestimate the investment and labor required to self-manage infrastructure elements in a high-quality way over time. Especially when the software is open source, the temptation is always to just pull down the software and start running it. Focus tends to be on the cost of the hardware instead of the cost of the cloud service. The DevOps staff that is required to manage, upgrade, performance tune, patch and evolve this infrastructure over time is almost always underestimated by startup teams and becomes baggage as the team and the company grows. For any piece of infrastructure, you really have to ask yourself if this is the best use of your team’s time.

In most cases, you will want to focus your energy on things that only you can do, and purchase services where possible from a reputable cloud provider. The three major cloud providers (AWS, Azure, Google) all have large and highly specialized teams surrounding each of their as-a-service offerings. For smaller companies, there is no way you are going to do a better job with management and security than these cloud provider teams for base/commodity offerings.

My take: go with a cloud provider or (better yet) more than one cloud provider, so you can focus on building and running things that you can’t purchase as a service and that are unique to your offering.


Version Control

In recent years, the idea of infrastructure-as-code has become a leading principle in DevOps. This is part of a larger evolution of DevOps that continues to shift the discipline towards looking more and more like a software development practice. A core part of any software development practice is storing all your software artifacts in a version control repository. Artifacts can include source code, configuration files, data files, and in general any of the inputs needed to build your software and your infrastructure environment. It seems like a given, but I have seen operational environments where not all of the artifacts necessary to build the environments were stored in source control.

The benefit of storing everything under version control is that you have a unique version for a given state of the artifacts used to build your environments. This allows for the repeatable build of environments, the implementation of processes around change to these artifacts, and the ability to roll back to any previous known good state in case there are issues. High-quality and cost-effective cloud-based services such as GitHub make this an easy choice to serve as a foundation for DevOps activity.

Full Stack Automation

One of the best things about using the cloud for your infrastructure is the programmability and APIs that the cloud vendors provide. These APIs can be used to automate the entire application stack from base layer network, DNS, storage, compute, up to operating systems and serverless functions, and all the way through to the custom code in your application. Taking an infrastructure-as-code approach means having software artifacts in your source code repository and a build process that can create an entire application environment in a fully automated way. This automation can be used to drive the initial build and incremental change to development, test, and production environments.

There are good tooling options these days to achieve this kind of infrastructure automation. At the base infrastructure level, there are solutions native to cloud provider environments such as AWS CloudFormation or Google Cloud Deployment Manager. We are fans of Terraform as it allows for the management of infrastructure in AWS, Azure, and Google from the same codebase with provider-specific modules and extensions. Once the base level infrastructure has been provisioned, Packer images combined with configuration management tools like Ansible, Chef, or Puppet can be used to configure host-based services.

There are a lot of benefits to be had from automating the full application stack. Automation eliminates the chance of manual errors and allows for a repeatable process. It also can drive the same stack into dev, test, and prod, thus minimizing the chances of environmental differences leading to surprises. Automation can also be used to support blue/green production deploys in which an entire new environment is built with updated code and then traffic is cut over from the existing to the new environment in a controlled fashion. In addition, it is easy to roll back in this model if there is a problem with the new environment.

Full stack automation also lends itself to the switch from thinking about servers as unique elements with individual character to managing servers as interchangeable elements. It becomes a straightforward proposition to rip and replace troublesome infrastructure and to use tightly-focused servers rather than sprawling snowflakes that acquire dozens of responsibilities and take on a life of their own.

Secrets Management

When you have an automated environment it is very important that the secrets that are part of your application are managed carefully. Secrets could include service passwords, API tokens, database passwords, and cryptographic keys. The management of crypto keys is particularly critical for crypto infrastructure where private keys are present, such as exchange infrastructure and validators on proof of stake networks. Read my recent blog to learn more about crypto key management using multisig accounts and offline keys.

However, a lot of the same principles apply to infrastructure, application, and crypto secrets. You want to make sure that these secrets are not in your source code repo, but rather that they are obtained at build or, better yet, at runtime in the different environments in which your application is running.
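A minimal Python sketch of the runtime approach is shown below: the application asks its environment for a secret when it starts, so the value never has to live in the source code repo. Reading from an environment variable is just the simplest stand-in here; the function name, the exception class, and the variable name EXAMPLE_DB_PASSWORD are hypothetical, and in practice the value would typically be injected by a secrets manager or fetched from one at startup.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not available at runtime."""


def get_secret(name: str) -> str:
    """Fetch a secret from the process environment, failing loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        # Fail fast rather than continuing with an empty credential.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


if __name__ == "__main__":
    # Hypothetical variable name, expected to be set by the deployment environment.
    db_password = get_secret("EXAMPLE_DB_PASSWORD")
```

Failing at startup when a secret is missing makes misconfigured environments obvious immediately, instead of surfacing later as an authentication error deep in the application.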

Software and platform native tools that help protect secrets in production environments include AWS KMS/CloudHSM, Azure Key Vault and Hashicorp Vault if you are looking for something cross platform. Some very sensitive secrets such as crypto private keys can benefit from hardware key management systems such as YubiHSM2 and Azure Dedicated HSM based on Safenet Luna hardware. The downside is that hardware solutions are generally less cloud-friendly than software ones and, while they may improve key security, some aspects of security are worsened by taking a hardware approach over a more automatable cloud-native software approach. The infrastructure costs and surface area that needs to be managed can also be far higher when taking a hardware-centric approach.

Intel SGX is a promising hardware technology that allows processes to run in secure enclaves.  A process running in a secure enclave is totally isolated from the host operating system. What this means is that, if you have access to the guest operating system, you cannot read the memory of the process running in the SGX enclave even if you have root privileges.  I am excited by the use of SGX enclaves combined with e.g. Hashicorp Vault to improve the security of software and cloud native secrets management. SGX is available today via Azure Trusted Compute but has the downside of requiring coding to the SGX APIs. We eagerly await further developments of the AWS Nitro architecture which we believe will greatly improve the security of software and cloud native secrets management. Nitro is the AWS version of providing hardware support for isolation of customer workloads on shared infrastructure.

Topics to Cover in Part II

There are many aspects to consider when thinking about secure and reliable infrastructure for crypto based applications.  We’ve only touched on a handful of areas in this article. Here are some additional areas I cover in part II:

  • Authentication
  • Authorization and Roles
  • Networking
  • Monitoring
  • Logging

Looking for further information about infrastructure for crypto-based applications? Contact us today


Participation Keys in Algorand

What Are Algorand Participation Keys?

In Algorand, there are 2 types of nodes: relay nodes and participation nodes. Relay nodes serve as network hubs in Algorand, relaying protocol messages very quickly and efficiently between participation nodes. Participation nodes support the consensus mechanism in Algorand by proposing and validating new blocks. Participation keys live on participation nodes and are used to sign consensus protocol messages.

A participation key in Algorand is distinct and totally separate from a spending key. When you have an account in Algorand there is an associated spending key (or multiple keys in the case of a multi-sig account). The spending key is needed to spend funds in the account. A participation key, on the other hand, is associated with an account and is used to bring stake online on the network. Importantly, participation keys cannot be used to spend funds in the associated account, they can only be used for helping to support the consensus protocol.

Participation Keys Are Good

Having distinct keys for spending the Algo in an account, and staking the Algo in an account, results in several key security improvements.

In any crypto network, protecting the spending keys is of the utmost importance. Situations that require having spending keys on an internet connected computer are inherently dangerous and always contain the risk of loss of funds.

In Algorand, the spending key never has to be online. The spending key can be kept on an airgapped computer or other offline setup and only used for signing transactions offline. The participation key, in contrast, lives on the participation node and signs protocol messages, but the participation key cannot spend any funds in the account.

This separation of duties in 2 different keys improves the security of Algorand infrastructure substantially. Spending keys can always be kept totally offline and an attacker, if they are able to compromise an internet connected participation node, cannot spend or steal any of the funds in the associated account.

Of course, this doesn’t mean that participation keys shouldn’t be highly protected and secured. If an attacker does compromise a participation key, they can stand up a second participation node with the same participation key. This will result in protocol messages being double-signed, which the network will see as malicious behavior and will treat the node / associated stake as offline.

There is no bonding or slashing in Algorand, and staking rewards are still coming in the future, but regardless: being forced offline due to double signing is undesirable and means that the stake in question will no longer be supporting the consensus mechanism.

Participation Key Mechanics

My examples assume Algorand Node v1 software is installed and running in a participation node configuration on the Algorand MainNet. The software is installed using the Debian package on Ubuntu 18.04, with a standard non-multi-sig Algorand account with some Algo in it, and a separate offline computer with the spending key for the account.

To create a participation key you will need to use the “goal account addpartkey” command and specify the account that you want to create the part key for and a validity range:

goal account addpartkey -a WHNXGKYOVIQADYS4VTYBG6SGWFIG6235C5LMXM76J3LHE475QJLIHUC5KY --roundFirstValid 789014 --roundLastValid 4283414

A few things to note. The account specified in the -a flag in the command above (WHNXGKYOVIQADYS4VTYBG6SGWFIG6235C5LMXM76J3LHE475QJLIHUC5KY) is made up and you would need to replace it with your account. Do not use this account as it, and the associated spending key, are not real. Any funds sent to this address will be permanently lost.

The validity range is specified in rounds. Rounds are equivalent to blocks in Algorand. So if you, for example, want to have a key that is valid from now until a point in the future, you need to find the current block height for the roundFirstValid and a future block height for the roundLastValid flag corresponding to the validity range you want.

To find the current block height you can use the “goal node status” command:

derek@algo-node:~$ goal node status
Last committed block: 789014
Time since last block: 2.4s
Sync Time: 0.0s
Last consensus protocol:
Next consensus protocol:
Round for next consensus protocol: 789015
Next consensus protocol supported: true
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

The last committed block, which is the same as the current block height, is reported as 789014, so we use that for our roundFirstValid. Figuring out the right value for the roundLastValid is a little more involved.

First, you have to determine what time range you want. It is a good practice to rotate participation keys and not to create a key with a really long validity range. In our example, we will use a time range of 6 months. What round corresponds to 6 months from now?

To figure that out, we have to do a little math. 6 months is approximately 182 days. So 182 days x 24 hours / day x 60 min / hour x 60 sec / min = 15724800 seconds. At the time of writing, each round in Algorand takes about 4.5 sec. So 15724800 seconds / 4.5 seconds per block = 3494400 blocks. Now we need to add 3494400 to the current block height to get the height 6 months from now: 3494400 + 789014 = 4283414. This is where the 4283414 in the command above comes from for the roundLastValid.

As the network grows, the 4.5 second block time may not be a safe assumption. This may make the validity range slightly different than 6 months. You need to monitor for key validity and make sure to put a new key in place before the old one expires.
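The arithmetic above, along with a simple expiry check, can be sketched in Python as follows. The function names are my own, the 4.5 second round time is the approximate figure quoted above and may drift, and the 14-day rotation warning window is an arbitrary assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rounds_for_days(days: float, seconds_per_round: float = 4.5) -> int:
    """Approximate number of rounds produced in the given number of days."""
    return round(days * SECONDS_PER_DAY / seconds_per_round)


def last_valid_round(current_round: int, days: float, seconds_per_round: float = 4.5) -> int:
    """Round to use for roundLastValid, given the current round and a validity period in days."""
    return current_round + rounds_for_days(days, seconds_per_round)


def needs_rotation(current_round: int, round_last_valid: int,
                   warn_days: float = 14, seconds_per_round: float = 4.5) -> bool:
    """True once the key is within warn_days (approximately) of expiring."""
    remaining = round_last_valid - current_round
    return remaining <= rounds_for_days(warn_days, seconds_per_round)


if __name__ == "__main__":
    print(last_valid_round(789014, 182))  # prints 4283414
```

Because the real round time varies, a check like needs_rotation is best run regularly against the node's current round rather than relying on a calendar reminder set when the key was created.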

Once the addpartkey command has executed, you can find the participation key at:


It’s beyond the scope of this article, but this file is actually a sqlite database with N number of keys in it which will be internally rotated through automatically during the validity window. This is an additional security measure that is part of Algorand, where the keys used to sign protocol messages are rotated as rounds progress.

With the participation key created, the next step is to bring the account online. An account being online in Algorand means that the Algo in the account is supporting the consensus mechanism. We bring an account online by using the “goal account changeonlinestatus” command. Note that this action requires that you have a small amount of Algo in the account to pay for the transaction. If you have the spending key for the account directly on the participation node, you can simply run this command:

goal account changeonlinestatus -a WHNXGKYOVIQADYS4VTYBG6SGWFIG6235C5LMXM76J3LHE475QJLIHUC5KY -o=1

However, having the spending key on the participation node is not recommended and kind of defeats the whole purpose of having participation keys in the first place. It is much better to have an airgapped and totally offline computer that has the spending key on it. The process is a little more involved with this setup, but it is much more secure. With this setup you would issue the following command instead:

goal account changeonlinestatus -a WHNXGKYOVIQADYS4VTYBG6SGWFIG6235C5LMXM76J3LHE475QJLIHUC5KY -o=1 -t online.tx

This will produce a transaction file called online.tx in the current directory which has an unsigned transaction to bring the account online. This transaction file then needs to be securely moved to the airgapped computer with the spending key on it. Once on the airgapped computer you can use the algokey utility to sign the transaction file. The command would be:

algokey sign -k spendingkeyfile -t online.tx -o online.tx.signed

Note that algokey is standalone and does not need a running Algorand node. Also, the spendingkeyfile is the file that has the spending key for the account. This file can be created by algokey when you first set up your account.

There is also an option to specify the spending key mnemonic instead of a file, but I find this option worse as it leaves the mnemonic in the shell history, etc. The result of this command is that online.tx.signed will be created in the current directory. This file contains the signed online transaction and it needs to be securely moved back to the running participation node.

Once you have online.tx.signed back on the participation node you can send it to the network with the following command:

goal clerk rawsend -f online.tx.signed

Wait a little bit for the transaction to be processed, and your account should now be online. The creation of a transaction file, movement to the airgapped machine to sign the transaction, movement of the signed transaction back to the online node, and then sending the signed transaction to the network is a general pattern for sending transactions in Algorand without ever putting your spending key online.

Final Thoughts on Participation Keys in Algorand

The design of Algorand using separate keys for spending funds and for participating in network consensus improves the security of nodes running on the Algorand network substantially by protecting spending keys and removing the need for them to ever be online. I think this was a good design choice and wouldn’t be surprised if other protocols adopt this approach.



Why We Started PureStake

Many of us at PureStake were just starting our careers in the mid-to-late 90s, during the first internet wave. Since then, we have spent the last 20-plus years building infrastructure, software, and cloud companies based on the possibilities opened up by the internet. I recall the atmosphere and feeling of those early internet days and, in the intervening years, I hadn’t experienced that feeling again until I started getting involved with crypto.

The crypto genie is out of the bottle, and it has unleashed forces which cannot be stopped or contained. We believe that using blockchains to move value in an open, low friction, low-cost way will have as large an impact on all of us as the internet has had in moving information in an open, low friction, low-cost way. We are only at the beginning of a historical shift where crypto networks and applications will disintermediate many existing companies, structures, and practices, replacing them with code.

While the strategic direction of this shift is clear, the particulars of how this shift will play out are harder to call. That said, we have several beliefs that we stand behind:

  1. The future will be a multi-chain future vs one-chain-to-rule-them-all. In this future, bitcoin will continue to have a foundational place in the ecosystem, but there will also be many other blockchains, each of them good at different things.
  2. Public and permissionless blockchains will lead the way in terms of innovation and interesting applications vs private and permissioned ones.
  3. Proof of Stake consensus protocols are a more scalable, more efficient, and ultimately more secure consensus mechanism versus more traditional Proof of Work consensus protocols.

As decentralized currencies, networks, and applications continue to mature and get traction, we believe there is a large opportunity to provide infrastructure as a service to support participation in and development on these decentralized networks.

We are taking all of our experience building and running cloud services and applying it to crypto infrastructure. Given that this infrastructure will be directly handling value, the security and reliability of our services must come first (and features will sometimes have to come second).

We use a software-first approach to solving problems. Treating our infrastructure as code and using software engineering best practices to deliver change to our infrastructure is one example of this. We aim to hide infrastructural complexity from our users and customers. We want to provide them with services that are simple to consume, freeing them to focus on the reasons they want to interact with the blockchain vs the details and mechanics of how to interact with the blockchain.

We will engage closely with a select number of networks that we believe in. We want to focus our energy on fewer vs more networks to be able to go deep on them to understand how they work, their nuances, their APIs, and their infrastructure needs. As we build expertise on specific networks we will be giving back to those networks in the form of services, tools, and information that help the community. Our goal is to provide secure and reliable blockchain infrastructure that participants can depend on and that developers can build upon.

The first network we are focused on is Algorand. Algorand is currently in TestNet and will be launching their MainNet soon.

Why Algorand? We personally know many of the people on the Algorand team. They have an extremely talented engineering, research, and business operational team. We believe in Silvio Micali, Steve Kokinos, and the team they have assembled. We think they can execute on a complicated and difficult roadmap in a way that other projects have historically been challenged with.

Our experience with the Algorand software and network has been similarly very positive. The quality of the code, the security and design innovations, and the rich set of financial primitives have all made a big impression. We believe the performance we have seen on the TestNet, achieved without significant sacrifices to security or decentralization, will move the needle among public blockchains and in blockchain design in general.

We are excited to be one of the companies helping to support the upcoming Algorand MainNet network launch and look forward to engaging with participants and developers in the Algorand community.

Stay tuned for updates on our journey by signing up for our newsletter, or feel free to contact us if you are developing an Algorand application or need help with blockchain infrastructure.