

DevOps Practices for Crypto Infrastructure, Part II: Authentication, Authorization, Networking, Monitoring, and Logging

Picking Up Where We Left Off

In part one of this two-part series, I discussed the core DevOps principles that help guide our crypto infrastructure here at PureStake, along with some unique considerations around version control, full stack automation, and secrets management. If you haven't had a chance to read the first post yet, you can find it here.

In this second post, I will continue calling out principles and examining different areas that are important to consider when setting up and running secure and reliable crypto infrastructure.

Authentication and Authorization

One of the most important aspects of ensuring the security of your infrastructure is having the right authentication systems in place.

For logging into infrastructure and servers, I favor centralized/federated authentication directories over local ones. It is very important that DevOps staff have unique user accounts for logging into infrastructure, rather than using shared accounts.  Unique accounts provide a record of who logged into what, which is essential to understanding what is happening in your environment. Shared accounts, including direct use of the Administrator or root accounts on servers, become very challenging when you have turnover in your staff or, in the worst case, if there has been an incident.  It’s much cleaner to revoke access, assign rights, review past history, and understand what is happening with a centralized directory.

For the scope of authentication, I recommend a full separation of the corporate IT environment and the production infrastructure environments. Fully separate directories are the best approach, even if your directory supports different groups and roles. This greatly reduces the risk of human error resulting in too much, or the wrong kind of, access.

However, that doesn't mean that you shouldn't use groups. Grouping users who need access to the infrastructure and assigning the appropriate roles to them is critical to managing access in a reasonable way in crypto and other environments. Assigning rights to individual users is too complicated and too easy to get wrong. Even for a small team, having at least a few roles is appropriate: for example, a full-access role for select senior DevOps staff, a limited-access role for junior DevOps staff, and perhaps a monitoring-only role for managers and other technical staff. The principle to keep in mind is least privilege: users and groups should have only the rights they need to do their jobs, with a logged and monitored mechanism for escalating privileges when necessary. This supports the closely related concept of minimizing blast radius: when roles follow least privilege, a compromised credential or account exposes as little of the environment as possible.

Using traditional passwords as the way to log into crypto infrastructure is not a good security practice. Where passwords must be used, I recommend a password manager such as Dashlane, which can be set up in a corporate configuration with shared groups and role-based access, and which makes it practical to use a unique, strong password for each system. Crypto environments require more security than this. At a minimum, all accounts must require two-factor authentication, where the first factor can be a traditional strong password and an authenticator app is the second factor. A better setup replaces the authenticator app with a physical hardware device.
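To illustrate what the authenticator-app factor is doing under the hood, here is a minimal sketch using the pyotp library. The enrollment flow, secret storage, and the surrounding login logic are assumptions for illustration, not a production design.

```python
# Sketch: TOTP as a second authentication factor, using the pyotp library.
# Secret provisioning, storage, and the first-factor check are illustrative.
import pyotp

def enroll_user() -> str:
    # Generate a per-user secret once and share it with the user's
    # authenticator app (for example, as a QR code). Store it server-side.
    return pyotp.random_base32()

def verify_second_factor(stored_secret: str, submitted_code: str) -> bool:
    # Accept the code only if it matches the current 30-second TOTP window.
    totp = pyotp.TOTP(stored_secret)
    return totp.verify(submitted_code)

# Usage (after the password check has already succeeded):
# secret = enroll_user()
# authenticated = verify_second_factor(secret, "123456")
```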

For identity management in Windows environments, Active Directory is the logical choice. For Linux environments, OpenLDAP and Kerberos serve a similar function. Each cloud vendor has its own identity management scheme, including AWS IAM, Azure AD, and Google Cloud IAM, each with its own nuances. Google Authenticator works very well as a second factor in 2FA setups. For a physical-device second factor, YubiKey is an inexpensive option that plugs into a USB port on your computer. Requiring the YubiKey as one of the authentication factors means that the device must be physically in the possession of the user at the time of login.
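Tying the least-privilege idea to a concrete tool, the sketch below uses boto3 against AWS IAM to create a narrowly scoped, read-only policy that could back a monitoring-only role. The policy scope, names, and account setup are assumptions for illustration.

```python
# Sketch: creating a narrowly scoped, read-only IAM policy for a
# monitoring-only role. Policy scope and names are placeholders.
import json
import boto3

iam = boto3.client("iam")

monitoring_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow reading CloudWatch metrics and alarms, nothing else.
            "Effect": "Allow",
            "Action": ["cloudwatch:GetMetricData", "cloudwatch:DescribeAlarms"],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="monitoring-read-only",
    PolicyDocument=json.dumps(monitoring_policy),
)

# The resulting policy would then be attached to a dedicated monitoring
# group or role rather than to individual users.
```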

Logging

A well-run infrastructure has good mechanisms in place to manage server, application, and as-a-service logs.  Logs are not only useful for troubleshooting infrastructure issues, but also provide the basis for audit control and intrusion detection.  You need reliable logs to understand what has happened in your environment.

The most important practice is to ship logs off the servers, containers, and other infrastructure elements to isolated, tamper-proof locations. Authorization roles should be employed to isolate these log collection points and make them as tamper-proof as possible. The logs can then be loaded into query-optimized data stores to support visibility, troubleshooting, and monitoring scenarios.
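As one small example of getting logs off the box, Python's standard library can forward application logs to a remote syslog collector (for example, an rsyslog relay). The collector hostname and logger name below are placeholders.

```python
# Sketch: forwarding application logs to a remote syslog collector so they
# never live only on the local host. The collector address is a placeholder.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("node-service")
logger.setLevel(logging.INFO)

# UDP syslog to a central collector (rsyslog, or a Logstash syslog input).
remote_handler = SysLogHandler(address=("logs.example.internal", 514))
logger.addHandler(remote_handler)

logger.info("node-service started, shipping logs off-host")
```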

In particular, when running crypto nodes, logs are sometimes the only way to understand what is happening on them: critical error messages and log entries related to the crypto network protocol may be the only signal that a node is running well or poorly.

In a Windows environment, events can be forwarded to an event collector.  In a Linux environment, rsyslog works well for forwarding syslog to regional and ultimately centralized data stores.  For log-based searching, troubleshooting, and time series analysis, Splunk is a Cadillac solution: tons of functionality, but at a very high price.  An alternative to Splunk is the open source ELK stack (Elasticsearch, Logstash, Kibana) which has gotten a lot better over time and offers a much less expensive way to search and troubleshoot infrastructure based on log data.
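If the logs land in an ELK stack, they can also be searched programmatically through the Elasticsearch search API. The sketch below uses plain HTTP; the endpoint, index pattern, and field names are assumptions about how the logs are indexed.

```python
# Sketch: searching centralized logs in Elasticsearch for recent errors.
# The endpoint, index pattern, and field names are assumptions.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "size": 20,
}

resp = requests.post(f"{ES_URL}/logs-*/_search", json=query, timeout=10)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("message"))
```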

Monitoring and Alerting

If you want to run reliable crypto infrastructure, you have to know when services are not running well.  The principle here is that everything fails — but early detection allows infrastructure element failures to be remedied quickly.

With good redundancy in design, individual element failures ideally have little-to-no end user impact. For cases where there are failures that lead to end user service impact, strong automation will minimize the time to restore services.

The best way to achieve early detection is through the extensive use of monitoring at the different layers of the stack and from different locations. If you are in a colo environment and managing hardware, monitoring that hardware will likely require vendor-specific tools and possibly the collection of SNMP traps. For cloud environments, the providers offer native monitoring that is integrated with their service offerings; as an example, AWS offers CloudWatch for monitoring AWS-based services.
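As a taste of what that native monitoring looks like in practice, the sketch below creates a CloudWatch alarm on high CPU for a single instance using boto3. The instance ID, threshold, and SNS topic ARN are placeholders.

```python
# Sketch: creating a CloudWatch alarm on high CPU for one instance.
# Instance ID, threshold, and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="node-1-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:ops-alerts"],
)
```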

There are a lot of elements in a crypto infrastructure that need to be monitored. It's important to choose a single platform that will serve as the place where monitoring data is sent, where alerting thresholds are set, and where alerts are managed. As different monitoring checks are added over time, they can feed into that system. It is extremely difficult to manage alerts, maintenance downtime, and inventory completeness if you have multiple places to go to manage these items.

At the lower end of the stack, you will want to put basic checks in place for OS-level resources like CPU, memory, and disk.  Basic network checks would include ping/ICMP, TCP port exhaustion, and TCP service checks. Security events such as those that come off IDS and IPS systems could be fed in here as well.  Application-level checks can include HTTPS checks that hit a URL and look for status or error codes and messages.
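A minimal version of the TCP and HTTPS checks described above might look like the following sketch. The hosts, ports, and health endpoint are placeholders, and a real deployment would feed the results into the monitoring platform rather than printing them.

```python
# Sketch: basic TCP port and HTTPS health checks. Hosts, ports, and the
# health endpoint are placeholders.
import socket
import requests

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    # Succeeds if a TCP connection can be established within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def https_check(url: str) -> bool:
    # Succeeds if the endpoint answers with HTTP 200.
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

print(tcp_check("node1.example.internal", 443))
print(https_check("https://api.example.internal/health"))
```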

For crypto-specific infrastructure checks, consider that the base crypto infrastructure consists of nodes. Crypto nodes often expose status and query interfaces via a REST API, so querying that API on a regular basis and looking for status and error codes is a good start, but be careful not to expose that API to the wider internet. Other checks specific to crypto nodes include looking at the block height on each node and making sure it has an expected value. Nodes should be producing blocks on a regular cadence and, depending on their role, may be helping to support the consensus mechanism of the network. Using monitoring to look for deviations from normal block production or consensus participation behavior is a good early warning indicator of trouble.
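To make the block height check concrete, the sketch below assumes an Ethereum-style JSON-RPC interface that is reachable only on an internal network; other chains expose similar status APIs with different calls. The node addresses and allowed lag are placeholders.

```python
# Sketch: comparing a node's block height against a trusted reference to
# detect a stalled or lagging node. Assumes an Ethereum-style JSON-RPC
# endpoint on an internal network; URLs and the allowed lag are placeholders.
import requests

def block_height(rpc_url: str) -> int:
    payload = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    result = requests.post(rpc_url, json=payload, timeout=5).json()["result"]
    return int(result, 16)  # result is a hex string like "0x4b7"

local = block_height("http://10.0.1.5:8545")
reference = block_height("http://10.0.2.5:8545")  # independent node we also run

MAX_LAG = 5  # blocks
if reference - local > MAX_LAG:
    print(f"ALERT: node is {reference - local} blocks behind")
```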

Once you have a view from inside of your environment, it is equally important to get a point of view from outside of it. This means taking the perspective of your customers and seeing whether your services are performing well from their viewpoint. I've experienced situations where all services are green from the internal point of view, but a WAN or internet issue means that certain customers are not able to use the service. A common cause is a physical line cut that creates bad network paths to your service until traffic is rerouted. Using a cloud-based monitoring provider with multiple external points of presence can help provide this outside-in view of your services.

From an open source monitoring tools perspective, the old workhorse is Nagios and Nagios variants such as Checkmk, which I have used for years to monitor production environments.  These tools are starting to show their age, but they are battle-tested and reliable. A newer option getting good traction is Prometheus with its more modern-looking Grafana-based visualizations.  For a greenfield environment, Prometheus is a good choice.

Nagios/Prometheus work in a poll model, where servers provide data on a port and a centralized service routinely collects the data and makes it available.  DataDog is an example of an alternate model where the data is streamed from the server itself with an agent to a centralized location. For alerting operational staff when there are critical alarms, I have always found PagerDuty to be a good choice, but OpsGenie or VictorOps will provide similar functionality.  For external cloud based availability monitoring, ThousandEyes is a good choice, and something like Pingdom will get you basic external coverage for a low-cost entry point.
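To show the poll model concretely, the sketch below uses the prometheus_client library to expose a node metric on a port that a Prometheus server would scrape. The metric name, port, and block-height lookup are illustrative.

```python
# Sketch: exposing a node metric for Prometheus to scrape (poll model).
# The block-height lookup is a placeholder; see the earlier JSON-RPC sketch.
import time
from prometheus_client import Gauge, start_http_server

block_height_gauge = Gauge("node_block_height", "Latest block height seen by the node")

def current_block_height() -> int:
    # Placeholder: in practice, query the node's status API here.
    return 1_234_567

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics
    while True:
        block_height_gauge.set(current_block_height())
        time.sleep(15)
```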

Concluding Thoughts

The two posts in this series have only scratched the surface of crypto DevOps practices. Other areas that may be the subject of future posts include networking and VPCs, blue/green deployments, Docker vs VMs, load balancing and failover strategies, IDS/IPS, storage management for blockchain nodes, crypto key management strategies, and personal security best practices for DevOps staff, among other topics. Employing good practices across all of these areas is an important part of what it takes to provide secure and reliable crypto infrastructure.

Looking for further information about infrastructure for crypto-based applications? Contact us today.


DevOps Practices for Crypto Infrastructure, Part I: Version Control, Full Stack Automation, and Secrets Management

When standing up services that will have cryptographic interactions with a blockchain, the DevOps infrastructure and practices you employ will dictate a lot about the security and reliability of those services. In this two-part series of posts, I will introduce core DevOps principles that help guide crypto infrastructure creation. I'll also share different DevOps infrastructure practices that have worked well for me and could be helpful to other teams looking to stand up crypto as-a-service offerings.

Cloud vs Roll-Your-Own

For many infrastructure elements, you must choose whether to go with a cloud provider such as AWS, Azure, or Google, or to roll-your-own in a colocated data center with self-managed software. In crypto and blockchain, there are some specific requirements, particularly relating to key security, which may factor into requirements around hardware security modules (HSMs), physical servers, and tiers of colocated data centers (more on this later).

But in general, all other things being equal, if one of the three major cloud providers has an as-a-service offering for a given infrastructure element, there are a lot of reasons to choose it over rolling your own with purchased hardware and self-managed software.

In my experience, it is very easy to underestimate the investment and labor required to self-manage infrastructure elements in a high-quality way over time. Especially when the software is open source, the temptation is always to just pull down the software and start running it. Focus tends to land on the cost of the hardware versus the cost of the cloud service, while the DevOps staff required to manage, upgrade, performance-tune, patch, and evolve that infrastructure over time is almost always underestimated by startup teams and becomes baggage as the team and the company grow. For any piece of infrastructure, you really have to ask yourself whether this is the best use of your team's time.

In most cases, you will want to focus your energy on things that only you can do, and purchase services where possible from a reputable cloud provider. The three major cloud providers (AWS, Azure, Google) all have large and highly specialized teams surrounding each of their as-a-service offerings. For smaller companies, there is no way you are going to do a better job with management and security than these cloud provider teams for base/commodity offerings.

My take: go with a cloud provider, or (better yet) more than one cloud provider, so you can focus on building and running things that you can't purchase as a service and that are unique to your offering.


Version Control

In recent years, the idea of infrastructure-as-code has become a leading principle in DevOps. This is part of a larger evolution of DevOps that continues to shift the discipline towards looking more and more like a software development practice. A core part of any software development practice is storing all your software artifacts in a version control repository. Artifacts can include source code, configuration files, data files, and in general any of the inputs needed to build your software and your infrastructure environment. It seems like a given, but I have seen operational environments where not all of the artifacts necessary to build the environments were stored in source control.

The benefit of storing everything under version control is that you have a unique version for any given state of the artifacts used to build your environments. This allows for repeatable builds of environments, the implementation of change processes around these artifacts, and the ability to roll back to any previous known-good state in case there are issues. High-quality and cost-effective cloud-based services such as GitHub make this an easy choice to serve as a foundation for DevOps activity.

Full Stack Automation

One of the best things about using the cloud for your infrastructure is the programmability and APIs that the cloud vendors provide. These APIs can be used to automate the entire application stack, from the base layer of network, DNS, storage, and compute, up through operating systems and serverless functions, all the way to the custom code in your application. Taking an infrastructure-as-code approach means having software artifacts in your source code repository and a build process that can create an entire application environment in a fully automated way. This automation can be used to drive the initial build and incremental change to development, test, and production environments.
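As a small taste of that programmability, the sketch below uses boto3 to launch and tag a single instance through the same APIs that infrastructure-as-code tools drive under the hood. The AMI ID, instance type, and tag values are placeholders.

```python
# Sketch: the cloud APIs underneath infrastructure-as-code tooling are
# directly scriptable. AMI ID, instance type, and tags are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[
        {
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "blockchain-node"}],
        }
    ],
)
print(response["Instances"][0]["InstanceId"])
```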

There are good tooling options these days to achieve this kind of infrastructure automation. At the base infrastructure level, there are solutions native to cloud provider environments such as AWS CloudFormation or Google Cloud Deployment Manager. We are fans of Terraform, as it allows for the management of infrastructure in AWS, Azure, and Google from the same codebase with provider-specific modules and extensions. Once the base-level infrastructure has been provisioned, Packer images combined with configuration management tools like Ansible, Chef, or Puppet can be used to configure host-based services.

There are a lot of benefits to be had from automating the full application stack. Automation eliminates the chance of manual errors and allows for a repeatable process. It also can drive the same stack into dev, test, and prod, thus minimizing the chances of environmental differences leading to surprises. Automation can also be used to support blue/green production deploys in which an entire new environment is built with updated code and then traffic is cut over from the existing to the new environment in a controlled fashion. In addition, it is easy to roll back in this model if there is a problem with the new environment.
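One way to perform the controlled cutover described above is with weighted DNS records. The sketch below shifts a share of traffic toward a new "green" environment using Route 53 via boto3; the hosted zone ID, record name, and endpoints are placeholders, and a load balancer based cutover is an equally valid approach.

```python
# Sketch: shifting traffic from a "blue" to a "green" environment with
# weighted DNS records in Route 53. Zone ID, record name, and endpoints
# are placeholders.
import boto3

route53 = boto3.client("route53")

def set_weight(identifier: str, target: str, weight: int) -> None:
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000000000",
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com.",
                        "Type": "CNAME",
                        "SetIdentifier": identifier,
                        "Weight": weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": target}],
                    },
                }
            ]
        },
    )

# Start with a small share of traffic on green, then ramp up.
set_weight("blue", "blue.example.internal", 90)
set_weight("green", "green.example.internal", 10)
```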

Full stack automation also lends itself to the switch from thinking about servers as unique elements with individual character to managing servers as interchangeable elements. It becomes a straightforward proposition to rip and replace troublesome infrastructure and to use tightly-focused servers rather than sprawling snowflakes that acquire dozens of responsibilities and take on a life of their own.

Secrets Management

When you have an automated environment it is very important that the secrets that are part of your application are managed carefully. Secrets could include service passwords, API tokens, database passwords, and cryptographic keys. The management of crypto keys is particularly critical for crypto infrastructure where private keys are present, such as exchange infrastructure and validators on proof of stake networks. Read my recent blog to learn more about crypto key management using multisig accounts and offline keys.

However, a lot of the same principles apply to infrastructure, application, and crypto secrets. You want to make sure that these secrets are not in your source code repo, but rather that they are obtained at build or, better yet, at runtime in the different environments in which your application is running.
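As a concrete example of runtime retrieval, an application can fetch its database password from a managed secrets store at startup instead of reading it from the repository. The sketch below uses AWS Secrets Manager via boto3; the secret name is a placeholder, and the same pattern applies to Vault or Key Vault.

```python
# Sketch: fetching a secret at runtime from AWS Secrets Manager instead of
# committing it to the repository. The secret name is a placeholder.
import boto3

secrets = boto3.client("secretsmanager")

def get_db_password() -> str:
    response = secrets.get_secret_value(SecretId="prod/api/db-password")
    return response["SecretString"]

# The value lives only in memory for the lifetime of the process.
db_password = get_db_password()
```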

Software and platform-native tools that help protect secrets in production environments include AWS KMS/CloudHSM and Azure Key Vault, or HashiCorp Vault if you are looking for something cross-platform. Some very sensitive secrets, such as crypto private keys, can benefit from hardware key management systems such as YubiHSM2 or Azure Dedicated HSM, which is based on SafeNet Luna hardware. The downside is that hardware solutions are generally less cloud-friendly than software ones and, while they may improve key security, some aspects of security are worsened by taking a hardware approach over a more automatable, cloud-native software approach. The infrastructure costs and the surface area that needs to be managed can also be far higher when taking a hardware-centric approach.

Intel SGX is a promising hardware technology that allows processes to run in secure enclaves. A process running in a secure enclave is isolated from the operating system it runs under: even with root privileges on that operating system, you cannot read the memory of a process running inside an SGX enclave. I am excited by the use of SGX enclaves combined with, for example, HashiCorp Vault to improve the security of software and cloud-native secrets management. SGX is available today via Azure Trusted Compute, but has the downside of requiring coding to the SGX APIs. We eagerly await further developments of the AWS Nitro architecture, which we believe will greatly improve the security of software and cloud-native secrets management. Nitro is the AWS approach to providing hardware support for isolating customer workloads on shared infrastructure.

Topics to Cover in Part II

There are many aspects to consider when thinking about secure and reliable infrastructure for crypto-based applications. We've only touched on a handful of areas in this article. Here are some additional areas I cover in part II:

  • Authentication
  • Authorization and Roles
  • Networking
  • Monitoring
  • Logging

Looking for further information about infrastructure for crypto-based applications? Contact us today.