DevOps Infrastructure Principles
When standing up services that interact cryptographically with a blockchain, the devops infrastructure and the practices used to create and manage it dictate much of the security and reliability of those services. In this series of posts, we will cover the devops principles that guide our thinking here at PureStake, then look at specific devops infrastructure areas and share approaches we have found to work well, which may help other teams standing up crypto as-a-service offerings.
One of our core company values is that the security and reliability of our services come first, before any features. That priority has heavily influenced the choices we have made and what we consider important.
Cloud vs Roll Your Own
For many infrastructure elements, there is a choice between a cloud provider such as AWS, Azure, or Google, and rolling your own in a colocated data center with self-managed software. Crypto has some specific requirements, particularly around key security, which may create needs for hardware security modules (HSMs), physical servers, and particular tiers of colocated data centers. More on this later.
But in general, all other things being equal, when one of the three major cloud providers offers something as a service, there are a lot of reasons to choose it over rolling your own with purchased hardware and self-managed software.
In my experience, it is very easy to underestimate the investment and labor required to self-manage infrastructure well over time. Especially when the software is open source, the temptation is to just pull it down and start running it, and the focus tends to land on the cost of the hardware vs the cost of the cloud service. Startup teams almost always underestimate the devops staff required to manage, upgrade, performance-tune, patch, and evolve that infrastructure over time, and it becomes baggage as the team and the company grow. For any piece of infrastructure, you have to ask whether running it is the best use of your team's time. In most cases you want to focus your energy on the things only you can do and purchase services where possible from a reputable cloud provider. The three major cloud providers (AWS, Azure, Google) all have large, highly specialized teams behind each of their as-a-service offerings; for base / commodity offerings, a smaller company is not going to do a better job on management and security than those teams.
Our take is to go with a cloud provider, or better yet more than one, where you can, and to focus on building and running the things you can't purchase as a service and that are unique to your offering.
Infrastructure as Code
In recent years, infrastructure-as-code has been a leading principle in devops. It is part of a larger evolution that continues to shift the discipline toward looking more and more like a software development practice. A core part of any software development practice is storing all of your artifacts in a version control repository: source code, configuration files, data files, and in general any of the inputs needed to build your software and your infrastructure environment. It seems like a given, but I have seen operational environments where not all of the artifacts needed to rebuild the environment were in source control.
The benefit of storing everything under version control is that you get a unique version for any given state of the artifacts used to build your environments. This allows for repeatable environment builds, change-control processes around these artifacts, and the ability to roll back to any previous known-good state if there are issues. High quality, cost effective cloud services such as GitHub make this an easy choice as a foundation for devops activity.
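To make the "unique version for a given state" idea concrete, here is a small Python sketch that derives a content hash over a set of artifacts, similar in spirit to how git identifies a tree of files. The file names and contents are purely hypothetical:

```python
import hashlib

def artifact_version(artifacts):
    """Derive a repeatable version id from artifact paths and contents.

    artifacts: dict mapping path (str) -> file contents (bytes).
    Any change to any artifact yields a different version id.
    """
    h = hashlib.sha256()
    for path in sorted(artifacts):       # sorted for determinism
        h.update(path.encode("utf-8"))
        h.update(artifacts[path])
    return h.hexdigest()[:12]

# Hypothetical environment artifacts
v1 = artifact_version({"main.tf": b"resource ...", "app.conf": b"port=8080"})
v2 = artifact_version({"main.tf": b"resource ...", "app.conf": b"port=9090"})
```

Because the id is a pure function of the inputs, two checkouts of the same commit always produce the same environment version, which is exactly the property that makes rollbacks and repeatable builds possible.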
Full Stack Automation
One of the best things about using the cloud for your infrastructure is the programmability the cloud vendors' APIs provide. These APIs can automate the entire application stack: base-layer network, DNS, storage, and compute, up through operating systems and serverless functions, all the way to the custom code in your application. Taking an infrastructure-as-code approach means having software artifacts in your source repository and a build process that can create an entire application environment in a fully automated way. The same automation can drive the initial build and incremental change across development, test, and production environments.
There are good tooling options to achieve this kind of infrastructure automation. At the base infrastructure level there are solutions native to each cloud, such as AWS CloudFormation or Google Cloud Deployment Manager. We are fans of Terraform, as it allows infrastructure in AWS, Azure, and Google to be managed from the same codebase via provider-specific modules and extensions. Once the base-level infrastructure has been provisioned, Packer images combined with configuration management tools like Ansible, Chef, or Puppet can be used to configure host-based services.
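The declarative model behind tools like Terraform can be reduced to a reconcile loop: compare the desired state in your repo to the actual state in the cloud, and compute the actions that close the gap. A minimal Python sketch of that idea, with toy dicts standing in for real cloud resources:

```python
def plan(desired, actual):
    """Compute the create/update/delete actions needed to make
    actual match desired (a Terraform-style plan step)."""
    creates = {k: v for k, v in desired.items() if k not in actual}
    updates = {k: v for k, v in desired.items()
               if k in actual and actual[k] != v}
    deletes = [k for k in actual if k not in desired]
    return creates, updates, deletes

def apply(actual, desired):
    """Apply the planned actions; a second apply is a no-op (idempotent)."""
    creates, updates, deletes = plan(desired, actual)
    for k in deletes:
        actual.pop(k)
    actual.update(creates)
    actual.update(updates)
    return actual

# Hypothetical resources: desired state from the repo vs what exists today
desired = {"web": "t3.small", "db": "m5.large"}
actual = {"db": "m5.medium", "legacy": "t2.micro"}
actual = apply(actual, desired)
```

The key property is idempotence: running apply again against an already-converged environment produces an empty plan, so the same code path safely drives both the initial build and later incremental changes.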
There are a lot of benefits to automating the full application stack. Automation eliminates manual errors and makes the process repeatable. It can drive the same stack into dev, test, and prod, minimizing the chance that environmental differences lead to surprises. It can also support blue/green production deploys, where an entire new environment is built with the updated code and traffic is then cut over from the existing environment to the new one in a controlled fashion. If there is a problem with the new environment, rolling back in this model is easy.
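The blue/green cutover logic can be sketched in a few lines. This is a toy model, with a dict standing in for a real traffic router (load balancer, DNS weight, etc.) and hypothetical environment names:

```python
def blue_green_deploy(router, new_env, healthy):
    """Cut traffic to new_env only if it passes its health check;
    keep the previous environment on standby for instant rollback."""
    if not healthy(new_env):
        return False                    # old environment left untouched
    router["standby"] = router["live"]  # previous env kept for rollback
    router["live"] = new_env            # the actual cutover
    return True

router = {"live": "blue-v1", "standby": None}
ok = blue_green_deploy(router, "green-v2", healthy=lambda env: True)
bad = blue_green_deploy(router, "green-v3", healthy=lambda env: False)
```

Because the old environment is never modified during the deploy, rollback is just pointing traffic back at the standby environment rather than rebuilding anything.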
Full stack automation also encourages the shift from treating servers as unique elements with individual character to managing them as interchangeable ones. It becomes straightforward to rip and replace troublesome infrastructure and to run tightly focused servers rather than sprawling snowflakes that acquire dozens of responsibilities and take on a life of their own.
Secrets Management
In an automated environment, it is very important that the secrets your application depends on are managed carefully. Secrets include service passwords, API tokens, database passwords, and cryptographic keys. Key management is particularly critical for crypto infrastructure where private keys are present, such as exchange infrastructure and validators on proof-of-stake networks; a detailed discussion of crypto private key management will be the subject of a future post. Many of the same principles apply across infrastructure, application, and crypto secrets. You want to make sure secrets are not in your source code repo, but are instead obtained at build time, or better yet at runtime, in each environment your application runs in.
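The "resolve at runtime, never commit" rule can be sketched as a simple lookup that fails loudly when a secret was not injected. The variable name and the injection step here are hypothetical; in a real deployment the orchestrator or a vault agent would set the value:

```python
import os

def get_secret(name):
    """Resolve a secret at runtime from the process environment
    (e.g. injected by the orchestrator or a vault agent at startup).
    Nothing here is ever committed to the source repo."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} was not injected at runtime")
    return value

# Simulate the platform injecting the secret (hypothetical name/value);
# application code only ever calls get_secret().
os.environ.setdefault("EXAMPLE_DB_PASSWORD", "demo-only")
password = get_secret("EXAMPLE_DB_PASSWORD")
```

Failing fast on a missing secret is deliberate: a build that boots with an empty credential is much harder to debug than one that refuses to start.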
Tools that help protect secrets in production environments include AWS KMS/CloudHSM, Azure Key Vault, and, if you are looking for something cross-platform, HashiCorp Vault. Very sensitive secrets such as crypto private keys can benefit from hardware key management systems such as the YubiHSM 2 or Azure Dedicated HSM, which is based on SafeNet Luna hardware. The downside is that hardware solutions are generally less cloud-friendly than software ones, and while they may improve key security, some aspects of security are worsened by taking a hardware approach instead of a more automatable, cloud-native software approach. The infrastructure cost and the surface area that must be managed can also be far higher with a hardware-centric approach.
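KMS products of this kind typically implement envelope encryption: each object is encrypted with a fresh data key, and only that small data key is encrypted ("wrapped") by a master key that never leaves the KMS or HSM. A toy Python sketch of the pattern follows; note the stand-in XOR stream cipher is for illustration only and must never be used as real cryptography:

```python
import hashlib
import os

def toy_cipher(key, data):
    """Stand-in stream cipher: XOR with a SHA-256-derived keystream.
    ILLUSTRATIVE ONLY -- use a real AEAD cipher in practice."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def envelope_encrypt(master_key, plaintext):
    data_key = os.urandom(32)                     # fresh key per object
    ciphertext = toy_cipher(data_key, plaintext)  # bulk data uses the data key
    wrapped = toy_cipher(master_key, data_key)    # only the data key sees the master key
    return wrapped, ciphertext

def envelope_decrypt(master_key, wrapped, ciphertext):
    data_key = toy_cipher(master_key, wrapped)    # XOR cipher is its own inverse
    return toy_cipher(data_key, ciphertext)

master = os.urandom(32)  # in reality this key lives inside the KMS/HSM
wrapped, ct = envelope_encrypt(master, b"validator signing key material")
recovered = envelope_decrypt(master, wrapped, ct)
```

The point of the pattern is that the master key only ever processes tiny data keys, so it can stay inside a hardened boundary while bulk encryption happens wherever the data lives.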
Intel SGX is a promising hardware technology that allows processes to run in secure enclaves, totally isolated from the host operating system. Even with root privileges on the guest operating system, you cannot, for example, read the memory of a process running in an SGX enclave. We are excited about combining SGX enclaves with tools like HashiCorp Vault to improve the security of software- and cloud-native secrets management. SGX is available today via Azure Trusted Compute, but it has the downside of requiring code written against the SGX APIs. We also eagerly await further developments in the AWS Nitro architecture, AWS's approach to hardware-supported isolation of customer workloads on shared infrastructure, which we believe will greatly improve the security of software- and cloud-native secrets management.
Topics to Cover in Part II
There are many areas to consider when thinking about secure and reliable infrastructure for crypto-based applications, and we've only touched on a handful in this article. Here are some additional areas I'll cover in part II:
- Authorization and Roles
Looking for further information about infrastructure for crypto-based applications? Contact us today.