Thursday, December 22, 2016

Automate Everything with the Right Tools for Security and Profit

The security of a system is only as strong as the metaphorical “weakest link” in that system. In the case of our product, the weakest link tends to be the deployment of our infrastructure. Although the engineering skills at our company are spectacular, our combined experience in DevOps is less substantial. That lack of experience, plus the startup mantra guiding us to “move fast and break things,” has resulted in an infrastructure deployment strategy that is a manual process twisted up in a mishmash of deployment technologies. Sometimes this delicate ecosystem of technologies miraculously manages to deploy a running system into the production environment, but, more often than not, mistakes made during the deployment process turn maintenance of the running system into a major headache.

In an attempt to make the deployment of the infrastructure more reliable and secure, I have targeted two rules that must be followed to improve the infrastructure deployment strategy: automate everything and use tools only as they are intended.

Automate everything


To automate everything is one of the core requirements of effective DevOps. Fully automating a process is the only way to scale that process faster than linearly with respect to the number of active human operators. In addition to the DevOps advantage of enabling a system to scale in ways that manual operation cannot, automating everything also results in several benefits directly related to security. In this section, I describe some of these benefits.

Reviewable infrastructure


At the company, we have a very strict and well-defined process for merging code into the main application. One of the steps in that process is a mandatory code review by designated owners of the specific component being modified. This code review not only ensures the quality of the submitted code changes (and, therefore, of the entire codebase), it also ensures the security of critical components of the application (such as the authentication mechanism) by requiring the security team to review the code before it is merged.

Similarly, the automated parts of our infrastructure require code reviews from owners of the specific sections of infrastructure being modified. However, since part of our infrastructure deployment is manual, those changes require a ticket to be submitted through a change management system, which eventually ends up in the inbox of a member of the operations team. That member of the operations team, who likely has complete administrator access to the system, then manually implements the change. Although we can log the changes that are manually made to the infrastructure using an auditing service like AWS CloudTrail or AWS Config, discrepancies or mistakes in the implementation can only be noticed after they have already occurred (if they are noticed at all). Fully automating the deployed infrastructure allows us to apply the same rigor of review, for both quality and security, to infrastructure changes as we do to code changes in the main application, before the changes are even applied to the system.

Auditable infrastructure


Where reviewability of infrastructure changes is useful before a change occurs, auditability of the infrastructure is useful after changes have already been applied to the system. Fully automating the infrastructure means that all of the infrastructure is recorded, and kept current, in the form of the automation code itself. When you, or an external auditor, need to review how the infrastructure is designed, you can simply refer to the latest version of the automation code in your version control system as the most up-to-date documentation of the infrastructure.

If any of the infrastructure is deployed manually, then the documentation of that part of the system must also be updated manually. The tedious, and typically less prioritized, process of manually keeping documentation in sync with an ever-changing project inevitably results in an outdated, and incorrect, description of the infrastructure. However, if the infrastructure is constructed from the documentation itself, as in a fully automated DevOps system, then the documentation will always inherently be in sync with the real system.

Disaster recovery


In addition to making the infrastructure auditable, reviewable, and version controlled, full automation also makes it significantly easier to relaunch the entire infrastructure. Even though an entire system may not need to be completely redeployed under normal conditions, a disaster may require redeploying the whole system into a different region. The recovery time objective (RTO) of a critical system is usually very short, which requires the mean time to recovery (MTTR) to be as short as possible. With a fully automated infrastructure, the MTTR can be reduced to the time it takes to press one, or maybe several, buttons (or even less if disaster failover is also automated!). Not only is the MTTR of a fully automated infrastructure shorter than that of a partially manual deployment, the redeployment is also more reliable and significantly less stressful.
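As a concrete (and deliberately simplified) sketch of what that “one button” could look like, the snippet below uses boto3 to relaunch the same version-controlled CloudFormation template in a recovery region. The template path, stack name, and region are placeholders for illustration:

```python
import boto3

# Hypothetical names: the template checked into version control and the stack it creates.
TEMPLATE_PATH = "infrastructure/production.template.json"
STACK_NAME = "product-stack"
RECOVERY_REGION = "us-west-2"  # assumed failover region

def redeploy(region=RECOVERY_REGION):
    """Relaunch the entire stack from the version-controlled template in another region."""
    with open(TEMPLATE_PATH) as f:
        template_body = f.read()

    cloudformation = boto3.client("cloudformation", region_name=region)
    cloudformation.create_stack(
        StackName=STACK_NAME,
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_IAM"],  # needed if the template creates IAM resources
    )
    # Block until the stack is fully up (or the create fails).
    waiter = cloudformation.get_waiter("stack_create_complete")
    waiter.wait(StackName=STACK_NAME)

if __name__ == "__main__":
    redeploy()
```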

Automatic rollback


One advantage of using a version control system is that each iteration of the system is recorded. Not only can you review previous versions of the system, you can also deploy previous versions of the system. This is especially useful when a mistake in the infrastructure needs to be rolled back immediately to a previous known-good state. In a manually deployed infrastructure, it can be difficult to even remember what changes were made, and even more difficult to figure out how to reverse them.
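For example, if the infrastructure is defined in a CloudFormation template stored in git, rolling back can be as simple as pulling the template as it existed at a known-good tag and re-applying it. The stack name, template path, and tag below are hypothetical:

```python
import subprocess
import boto3

STACK_NAME = "product-stack"                               # hypothetical stack name
TEMPLATE_PATH = "infrastructure/production.template.json"  # hypothetical template path

def rollback(good_ref):
    """Roll the stack back to the template as it existed at a known-good git ref (tag or commit)."""
    # Pull the old version of the template straight out of version control.
    old_template = subprocess.check_output(
        ["git", "show", f"{good_ref}:{TEMPLATE_PATH}"], text=True
    )
    cloudformation = boto3.client("cloudformation")
    cloudformation.update_stack(
        StackName=STACK_NAME,
        TemplateBody=old_template,
        Capabilities=["CAPABILITY_IAM"],
    )
    cloudformation.get_waiter("stack_update_complete").wait(StackName=STACK_NAME)

# Example: rollback("v1.4.2")
```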

No snowflake systems


Another security challenge associated with infrastructure deployment is applying security patches and configuration updates to each of the resources. For example, if a piece of software running throughout your environment requires a security patch, then that update only needs to occur once in the automation code. Similarly, if all load balancers need to be updated to use a stronger security policy, then that update only needs to occur once in the automation code. If these changes were made manually on each system, then, depending on the complexity of the change, the human operator is likely to unintentionally configure each one slightly differently. Slight differences in configuration between systems can lead to security vulnerabilities that go unnoticed for a very long time.
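To make the “change it once” idea concrete, here is a minimal sketch of generating a CloudFormation template from Python, where a single constant drives the TLS policy of every load balancer. The resource names, parameters, and the ELBSecurityPolicy-2016-08 value are illustrative assumptions:

```python
import json

# One constant drives the TLS policy for every load balancer that the template defines.
# "ELBSecurityPolicy-2016-08" is one of AWS's predefined policies; swap in whichever
# policy your security team mandates.
ELB_SECURITY_POLICY = "ELBSecurityPolicy-2016-08"

def load_balancer(subnet_refs, certificate_arn):
    """Return a classic ELB resource (CloudFormation JSON) that terminates HTTPS
    with the shared security policy."""
    return {
        "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
        "Properties": {
            "Subnets": subnet_refs,
            "Policies": [{
                "PolicyName": "TLSPolicy",
                "PolicyType": "SSLNegotiationPolicyType",
                "Attributes": [{"Name": "Reference-Security-Policy",
                                "Value": ELB_SECURITY_POLICY}],
            }],
            "Listeners": [{
                "LoadBalancerPort": "443",
                "InstancePort": "443",
                "Protocol": "HTTPS",
                "SSLCertificateId": certificate_arn,
                "PolicyNames": ["TLSPolicy"],
            }],
        },
    }

# Every load balancer in the environment is generated from the same function,
# so changing ELB_SECURITY_POLICY once updates all of them on the next deploy.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "PublicSubnetA": {"Type": "AWS::EC2::Subnet::Id"},
        "ApiCertArn": {"Type": "String"},
        "AdminCertArn": {"Type": "String"},
    },
    "Resources": {
        "ApiLoadBalancer": load_balancer([{"Ref": "PublicSubnetA"}], {"Ref": "ApiCertArn"}),
        "AdminLoadBalancer": load_balancer([{"Ref": "PublicSubnetA"}], {"Ref": "AdminCertArn"}),
    },
}

print(json.dumps(template, indent=2))
```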


Use tools as they are intended


In addition to automating the entire infrastructure, choosing the right tool for the job is very important from a security point of view. Using a tool that is designed for a specific task helps to ensure readability and reliability of the deployed infrastructure defined by that tool.

More specifically, when defining an infrastructure, use an infrastructure definition tool (CloudFormation or Terraform). When configuring servers, use a server configuration tool (Chef, Puppet, or Ansible). When defining a Docker container, use a Dockerfile. When packaging and distributing a piece of software, use a package management system (yum, apt, etc.). Although it seems obvious to use the right tool for the job, each tool requires time and effort from the human operator to learn to use effectively. Couple that learning effort with the fact that many of these tools also offer features that half-heartedly accomplish other tasks, and many human operators are tempted to stretch a single tool outside of its intended domain. Although learning and using a single tool while ignoring other, more logical options may seem like a time-saving shortcut, the added complexity of using a tool outside of its intended domain results in layers of technical debt that inevitably take more time to resolve in the future.

One example that I have seen of using a tool outside of its intended domain is using Red Hat’s Ansible, a server configuration tool, to define an infrastructure. The main difference between the configuration of an infrastructure and the configuration of a specific server is the number of configuration points in each type of system. An infrastructure has a relatively limited number of configuration points (ELB options, route tables, etc.), whereas a server has an intractably large number of configuration points (installed software, the configuration of that software, environment variables, etc.). Because of this difference, infrastructure definition templates are easier to read and understand than server configuration templates, since infrastructure templates can explicitly define every configuration point of the system. Conversely, server configuration tools can only explicitly define the desired configuration points (make sure package A is installed with a specific configuration) while ignoring any part of the system that has not been mentioned (do not uninstall package B if it was not mentioned in the server configuration template). The complexity of having to understand both the server configuration and the initial state it applies to is unavoidable for server configuration, but it is unnecessary for infrastructure definition. Therefore, using a server configuration tool to define an infrastructure introduces unnecessary complexity, and with it unnecessary risk, into the deployment of the infrastructure.

Another example that I have seen of using a tool outside of its intended domain is using Ansible (again), combined with preprocessing shell scripts, to define Docker containers. In this instance, several bash scripts would generate a Dockerfile by replacing variables in a Dockerfile.tpl file (using a combination of environment variables and variables defined in the bash scripts themselves), build the image from the newly generated Dockerfile (which, in turn, ran an Ansible playbook against the container itself), and then upload the resulting image to a remote container registry. Later, several shell scripts from another repository would pull and run that container with variables defined from the new environment. Needless to say, following the variables through this process, or recreating a simple local environment of this tightly coupled system to test the containers, proved exceedingly difficult. Given that most of this process could have been defined in a single Dockerfile (without Ansible or the complicated preprocessing scripts), accepting this high level of unnecessary complexity equates to accepting a high level of unnecessary risk in deploying the system. (In fairness to the author of this process, the system was initially created to deploy directly onto a VM. Containerization was later added as a constraint, and insufficient resources were granted to properly rewrite the process to address the new constraint.)
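For comparison, Docker’s own ARG instruction and the --build-arg flag already handle build-time variables, so the entire preprocessing layer could likely be replaced with a plain Dockerfile and a thin wrapper like the sketch below (the variable names, image name, and registry are hypothetical):

```python
import subprocess

# Hypothetical build-time variables that the bash preprocessing scripts used to splice
# into Dockerfile.tpl; declared as ARG instructions in the Dockerfile, they can be
# passed directly to `docker build` instead.
BUILD_ARGS = {
    "APP_VERSION": "1.2.3",
    "ENVIRONMENT": "production",
}

def build_and_push(image="registry.example.internal/myservice", tag="1.2.3"):
    """Build the image from a plain Dockerfile and push it to the registry."""
    build_cmd = ["docker", "build", "-t", f"{image}:{tag}", "."]
    for name, value in BUILD_ARGS.items():
        build_cmd += ["--build-arg", f"{name}={value}"]
    subprocess.run(build_cmd, check=True)
    subprocess.run(["docker", "push", f"{image}:{tag}"], check=True)

if __name__ == "__main__":
    build_and_push()
```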

Solution


Although automating all of the infrastructure and choosing the right tools for the job is difficult and time-consuming, it is necessary for creating a resilient and secure infrastructure. In this section, I describe a workable solution using several specific tools. These tools may not work for your specific system, but they may provide a good place to start.

  • Infrastructure definition - Use CloudFormation to define resources in the cloud such as VPCs, route tables, SQS queues, and EC2 instances. Include the installation of a pull-based server configuration agent on each EC2 instance you define so that the instance can configure itself when it boots (see the sketch after this list).
  • Server configuration - Use a pull-based server configuration tool, such as Chef, that can define the configuration of each server in the infrastructure based on the “role” of that server (secure transparent proxies have configuration X, bastion hosts have configuration Y, etc.). When the machines boot up from the infrastructure definition tool, they automatically pull their own configuration from the server configuration tool.
  • Container building tool - Use a Dockerfile to define how a container should be built. The additional complexity of requiring preprocessing with bash scripts or self-configuration with Ansible is likely to be a warning sign that the system is not designed properly. Reassess the design and try to follow Docker’s best practices.
  • FaaS deployment tools - I am a fan of running small services as FaaS since most of the infrastructure responsibilities are delegated to the cloud service provider. Launch these services with a FaaS deployment tool such as Serverless.
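To tie the first two bullets together, here is a minimal sketch, in the same template-generating style as before, of an EC2 resource whose UserData installs a pull-based Chef client and points it at the role for that server. The Chef server URL, role names, and AMI/subnet parameters are placeholders, and the node registration keys that a real bootstrap needs are omitted for brevity:

```python
import base64
import json

def configured_instance(role, ami_ref, subnet_ref, instance_type="t2.micro"):
    """EC2 instance resource whose UserData bootstraps a pull-based Chef client.
    The Chef server URL and role name are placeholders for illustration."""
    user_data = "\n".join([
        "#!/bin/bash -xe",
        # Install the agent, then let it pull the configuration for this server's role.
        "curl -L https://omnitruck.chef.io/install.sh | bash",
        "mkdir -p /etc/chef",
        "echo '{\"run_list\": [\"role[%s]\"]}' > /etc/chef/first-boot.json" % role,
        "chef-client --server https://chef.example.internal --json-attributes /etc/chef/first-boot.json",
    ])
    return {
        "Type": "AWS::EC2::Instance",
        "Properties": {
            "ImageId": ami_ref,
            "InstanceType": instance_type,
            "SubnetId": subnet_ref,
            "UserData": base64.b64encode(user_data.encode()).decode(),
        },
    }

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        "BaseAmi": {"Type": "AWS::EC2::Image::Id"},
        "PrivateSubnet": {"Type": "AWS::EC2::Subnet::Id"},
    },
    "Resources": {
        # Each role from the server configuration tool maps to an instance definition here.
        "BastionHost": configured_instance("bastion", {"Ref": "BaseAmi"}, {"Ref": "PrivateSubnet"}),
        "ProxyServer": configured_instance("secure_proxy", {"Ref": "BaseAmi"}, {"Ref": "PrivateSubnet"}),
    },
}

print(json.dumps(template, indent=2))
```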

Although developing a resilient and secure infrastructure is a difficult and complicated task, following these two rules will immediately take you a long way. As an added benefit, your security team and your auditors will thank you.
