Thursday, December 22, 2016

Automate Everything with the Right Tools for Security and Profit

The security of a system is only as strong as the metaphorical “weakest link” in that system. In the case of our product, the weakest link tends to be the deployment of our infrastructure. Although the engineering skills at our company are spectacular, our combined experience in DevOps is less substantial. That lack of experience, plus the startup mantra guiding us to “move fast and break things,” has resulted in an infrastructure deployment strategy that is a manual process twisted up in a mishmash of deployment technologies. Sometimes this delicate ecosystem miraculously manages to deploy a running system into the production environment, but, more often than not, mistakes made during the deployment process turn maintenance of the running system into a major headache.

In an attempt to make the deployment of the infrastructure more reliable and secure, I have targeted two rules that must be followed to improve the infrastructure deployment strategy:  automate everything and use tools only as they are intended.

Automate everything


To automate everything is one of the core requirements of effective DevOps. Fully automating a process is the only way to scale that process faster than linearly with respect to the number of active human operators. In addition to the DevOps advantage of making it possible to scale a system to unprecedented levels, automating everything also brings several benefits directly related to security. In this section, I describe some of these benefits.

Reviewable infrastructure


At the company, we have a very strict and well-defined process for merging code into the main application. One of the steps in that process is a mandatory code review from designated owners of the specific component being modified. This code review not only ensures the quality of the submitted code changes (and, therefore, of the entire codebase), but also ensures the security of critical components of the application (such as the authentication mechanism) by requiring the security team to review the code before it is merged.

Similarly, the automated parts of our infrastructure require code reviews from owners of the specific sections of infrastructure being modified. However, since part of our infrastructure deployment is manual, those changes require a ticket to be submitted through a change management system, which eventually ends up in the inbox of a member of the operations team. That member of the operations team, who likely has complete administrator access to the system, then manually implements the change. Although we can log the changes that are manually made to the infrastructure using an auditing service like AWS CloudTrail or AWS Config, discrepancies or mistakes in the implementation can only be noticed after they have already occurred (if they are noticed at all). Fully automating the deployed infrastructure allows us to apply the same rigor of review, for both quality and security, to infrastructure changes as we apply to application code changes, and to do so before the changes ever reach the system.
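
To make the “after the fact” problem concrete, here is a minimal sketch, assuming boto3 and a CloudTrail trail enabled in the account, of what auditing manual changes looks like: you can only trawl the log for changes that have already happened, whereas automation code is reviewed before anything is applied. The event name below is just one illustrative example.

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # Look for manual security group changes that have already been applied.
    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "AuthorizeSecurityGroupIngress"}
        ],
        MaxResults=50,
    )

    for event in events["Events"]:
        print(event["EventTime"], event.get("Username"), event["EventName"])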

Auditable infrastructure


Where reviewability of infrastructure changes is useful before a change occurs, auditability of the infrastructure is useful after the changes have already been applied to the system. Fully automating the infrastructure means that all of the infrastructure is recorded, in the form of the automation code, as documentation that is always current. In the case where you, or an external auditor, need to review how the infrastructure is designed, you can simply refer to the latest version of your automation code in your version control system as the most up-to-date documentation of the infrastructure.

If any of the infrastructure is deployed manually, then the documentation of that part of the system must be manually updated. The tedious, and typically less prioritized, process of manually keeping documentation in sync with an ever-changing project inevitably results in an outdated, and therefore incorrect, description of the infrastructure. However, if the infrastructure is constructed from the documentation itself, as in a fully automated DevOps system, then the documentation will always inherently be in sync with the real system.
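
As a small illustration, here is a sketch, assuming boto3 and a CloudFormation-defined stack (the stack name is hypothetical), of handing an auditor the documentation of what is actually running: the template the stack was built from is the documentation, so there is nothing to keep in sync by hand.

    import boto3

    cloudformation = boto3.client("cloudformation")

    # The template that CloudFormation is actually running is the documentation;
    # print it for the auditor, or diff it against the copy in version control.
    response = cloudformation.get_template(StackName="prod-infrastructure")
    print(response["TemplateBody"])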

Disaster recovery


In addition to being able to audit, review, and version control a fully automated infrastructure, it is also significantly easier to relaunch a fully automated infrastructure in its entirety. An entire system may never need to be completely redeployed under normal conditions, but a disaster may require redeploying it into a different region. The recovery time objective (RTO) of a critical system is usually very short, which requires the mean time to recovery (MTTR) to be as quick as possible. With a fully automated infrastructure, the MTTR can be reduced to the time it takes to press one, or maybe several, buttons (or even less if disaster failover is also automated!). Not only is the MTTR of a fully automated infrastructure quicker than that of a partially manual deployment, the redeployment is also more reliable and significantly less stressful.
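
As a rough sketch of what the “press one button” recovery could look like, assuming boto3, a CloudFormation template kept in version control, and hypothetical stack and file names, redeploying the whole infrastructure into a recovery region becomes a single API call followed by a wait:

    import boto3

    def redeploy(region="us-west-2", stack_name="prod-infrastructure"):
        cloudformation = boto3.client("cloudformation", region_name=region)
        with open("infrastructure.yaml") as template:
            cloudformation.create_stack(
                StackName=stack_name,
                TemplateBody=template.read(),
                Capabilities=["CAPABILITY_IAM"],  # needed only if the template creates IAM resources
            )
        # Block until the stack is fully created (or the creation fails).
        cloudformation.get_waiter("stack_create_complete").wait(StackName=stack_name)

    redeploy()

With something like this, the MTTR is bounded by how long the stack takes to build, not by how long it takes a human to remember and repeat every manual step.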

Automatic rollback


One advantage of using a version control system is that each iteration of the system is recorded. Not only can you review previous versions of the system, you can also deploy them. This is especially useful when a mistake makes it into the infrastructure and needs to be rolled back to a previous state immediately. In a manually deployed infrastructure, it can be difficult to even remember what changes were made, and even more difficult to figure out how to reverse them.
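
A rollback in this model is just a redeployment of an earlier revision. The sketch below assumes boto3 and hypothetical names; previous.yaml stands in for whatever revision you retrieved from version control (for example, the template as it existed one commit ago).

    import boto3

    cloudformation = boto3.client("cloudformation")

    # Redeploy the previous, known-good revision of the template.
    with open("previous.yaml") as template:
        cloudformation.update_stack(
            StackName="prod-infrastructure",
            TemplateBody=template.read(),
            Capabilities=["CAPABILITY_IAM"],
        )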

No snowflake systems


Another security challenge associated with infrastructure deployment is applying security patches and configuration changes to every resource. With a fully automated infrastructure, each such change is made once, in the automation code, and applied uniformly everywhere. For example, if a piece of software running throughout your environment requires a security patch, that update only needs to occur once in the automation code. Similarly, if all load balancers need to be updated to use a stronger security policy, that update also only needs to occur once in the automation code. If these changes were made manually on each system, then, depending on the complexity of the change, the human operator is likely to unintentionally configure each one slightly differently. Slight differences in configuration settings across systems can lead to security vulnerabilities that go unnoticed for a very long time.
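
If you suspect snowflakes already exist, a quick audit can surface them. This is a minimal sketch, assuming boto3, classic ELBs, and that the predefined policy name below is the one your automation code declares; it flags any HTTPS listener whose policy drifted from the expected one.

    import boto3

    # The security policy that the automation code declares (assumed name).
    EXPECTED_POLICY = "ELBSecurityPolicy-2016-08"

    elb = boto3.client("elb")
    for lb in elb.describe_load_balancers()["LoadBalancerDescriptions"]:
        for listener in lb["ListenerDescriptions"]:
            if listener["Listener"]["Protocol"] != "HTTPS":
                continue
            if EXPECTED_POLICY not in listener["PolicyNames"]:
                print("snowflake:", lb["LoadBalancerName"], listener["PolicyNames"])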


Use tools as they are intended


In addition to automating the entire infrastructure, choosing the right tool for the job is very important from a security point of view. Using a tool that is designed for a specific task helps to ensure readability and reliability of the deployed infrastructure defined by that tool.

More specifically, when defining an infrastructure, use an infrastructure definition tool (CloudFormation or Terraform). When configuring servers, use a server configuration tool (Chef, Puppet, or Ansible). When defining a Docker container, use a Dockerfile. When packaging and distributing a piece of software, use a package management system (yum, apt, etc.). Although it seems obvious to use the right tool for the job, each tool requires time and effort from the human operator to learn to use effectively. Couple that extra learning effort with the fact that many of these tools also offer features that half-heartedly accomplish other tasks, and many human operators are tempted to stretch a single tool outside of its intended domain. Although learning and using a single tool while ignoring other, more logical options may seem like a time saver, the added complexity of using a tool outside of its intended domain results in layers of technical debt that inevitably take more time to resolve in the future.

One example I have seen of using a tool outside of its intended domain is using Red Hat’s Ansible, a server configuration tool, to define an infrastructure. The main difference between the configuration of an infrastructure and the configuration of a specific server is the number of configuration points in each type of system. An infrastructure has a relatively limited number of configuration points (ELB options, route tables, etc.), whereas a server has an intractably large number of configuration points (installed software, the configuration of that software, environment variables, etc.). Because of this difference, infrastructure definition templates are easier to read and understand than server configuration templates: an infrastructure template can explicitly define every configuration point of the system. Server configuration tools, on the other hand, can only explicitly define the desired configuration points (make sure package A is installed with a specific configuration) while ignoring any part of the system that has not been mentioned (do not uninstall package B just because it was not mentioned in the template). The added complexity of having to understand both the server configuration and the initial state it is applied to is necessary for server configuration, but it is unnecessary for infrastructure definition. Using a server configuration tool to define an infrastructure therefore introduces unnecessary complexity, and with it unnecessary risk, into the deployment of the infrastructure.

Another example I have seen of using a tool outside of its intended domain is using Ansible (again), combined with preprocessing shell scripts, to define Docker containers. In this instance, several bash scripts would generate a Dockerfile by replacing variables in a Dockerfile.tpl file (using a combination of environment variables and variables defined in the bash scripts themselves), build the container from the newly created Dockerfile, which would run an Ansible playbook on itself, and then upload the resulting container to a remote container repository. Later, several shell scripts from another repository would pull and run that container with variables defined for the new environment. Needless to say, following the variables through this process, or recreating a simple local environment of this tightly coupled system to test the containers, proved exceedingly difficult. Given that most of this process could have been defined in a single Dockerfile (without Ansible or the complicated preprocessing scripts), accepting this high level of unnecessary complexity equates to accepting a high level of unnecessary risk in deploying the system. (In fairness to the author of this process, the system was initially created to deploy directly onto a VM. Containerization was added as a constraint later, and insufficient resources were granted to properly rewrite the process to address the new constraint.)
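
For what it is worth, most of the variable plumbing that the preprocessing scripts existed for can be expressed with plain Docker build arguments and runtime environment variables. This is a minimal sketch, assuming the Docker SDK for Python (the docker package) and hypothetical image and variable names, not a reconstruction of the original pipeline:

    import docker

    client = docker.from_env()

    # Build-time variables replace the Dockerfile.tpl preprocessing: the Dockerfile
    # declares `ARG APP_VERSION`, and the value is passed here instead of being
    # substituted into a template by shell scripts.
    client.images.build(
        path=".",
        tag="example/service:latest",
        buildargs={"APP_VERSION": "1.2.3"},
    )

    # Environment-specific values are supplied at run time, not baked in by scripts.
    client.containers.run("example/service:latest", environment={"STAGE": "staging"}, detach=True)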

Solution


Although automating all of the infrastructure and choosing the right tools for the job are difficult and time-consuming tasks, they are necessary to create a resilient and secure infrastructure. In this section, I describe a workable solution using several specific tools. These tools may not work for your specific system, but they should provide a good place to start.

  • Infrastructure definition - Use CloudFormation to define resources in the cloud such as VPCs, route tables, SQS queues, and EC2 instances. Include the installation of a pull-based server configuration agent on each EC2 instance defined so that it will be able to configure itself when it boots.
  • Server configuration - Use a pull-based server configuration tool, such as Chef, that can define the configuration of each server in the infrastructure based on the “role” of that server (secure transparent proxies have configuration X, bastion hosts have configuration Y, etc.). When the machines boot up from the infrastructure definition tool, they automatically pull their own configuration from the server configuration tool.
  • Container building tool - Use a Dockerfile to define how a container should be built. The additional complexity of requiring preprocessing with bash scripts or self-configuration with Ansible is likely to be a warning sign that the system is not designed properly. Reassess the design and try to follow Docker’s best practices.
  • FaaS deployment tools - I am a fan of running small services as FaaS since most of the infrastructure responsibilities are delegated to the cloud service provider. Launch these services with a FaaS deployment tool such as Serverless.

Although developing a resilient and secure infrastructure is a difficult and complicated task, following these two rules will immediately take you a long way. Also, as an added benefit, your security team and auditor will thank you.

Sunday, December 18, 2016

Offloading Service Deployment Responsibilities to a Cloud Service Provider

Working as an engineer at a progressive tech startup, I quickly realized that when you are tasked to build a production-level system, you are implicitly also responsible for maintaining that system. That maintenance may seem like a simple task at first (“The code runs on my machine, just run the same command on some server in the cloud. Easy”), but when it comes to technology, the seemingly simple details are usually what makes the job incredibly difficult. The task to “just run the same command on a server in the cloud” quickly unravels into a number of issues:
  • Which server do I run the command on?
  • How do I provision that server?
  • How do I get my code to that server?
  • How do I get my secret values to the service that will run on that server (API keys, certificates, etc.)?
  • How do I keep the server patched with the latest security updates, both on the OS level and the background daemons?
  • Whenever I make changes to the code in my version control system, how do I deploy those new changes to the server (CI/CD)?
And these few issues are just the tip of the iceberg. Each issue is likely to have another “just do something” solution that generates another list of issues to be resolved.

Looking at all of the issues that need to be resolved just to deploy some code can be daunting for a new engineer who is used to building software in a single comfortable language (pure Python or Java or maybe even C) and simply running it locally on his or her own system. And even while many of the issues of deploying software have been ubiquitous enough to evolve from checklists of best practices into an entire career field dubbed “devops”, sometimes it is easier to just avoid these issues entirely by offloading them to a cloud service provider.

The tradeoff of offloading responsibilities to a cloud service provider comes at the cost of losing some degree of control over the deployed environment. For certain services, it may be important to retain a high level of control over the deployed environment at the cost of retaining the responsibility of maintaining that environment. In this post, I will cover the 3 most common deployment strategies that trade off levels of control over the environment for ease of deployment and maintenance of a service in the cloud. These 3 strategies are the server level strategy, the container level strategy, and the function level strategy.

Server level

The service can be installed directly on a server (or more precisely, on a virtual machine). This is the most traditional way to deploy a service and also gives the operations team the most control over the deployed environment.

Pros

  • Access to low-level resources - Access to hardware-accelerated resources including network hardware.

Cons

  • Patch management - All patches on the server must be managed, especially security updates.
  • Development environment != production environment - Typically results in the “well it works on MY machine” argument between developers and the operations team if the development environment and production environment are not kept in sync.
  • Scaling management - Scaling of the servers in the cluster must be managed.

Use-cases

  • Network service - A packet router that forwards most packets using hardware-accelerated network interfaces, but routes certain packets to a user-level process for inspection or logging.

Container level

Giving up some control over the deployment and operation of the service (and also the associated responsibility of maintaining that control), services can be deployed at the container level. A container separates the higher-level functionality of an operating system from the lower-level kernel it runs on. For example, a virtual machine can run a certain version of the Linux kernel, but have several higher-level operating systems running on top of that kernel inside containers, such as Ubuntu 14.04, Ubuntu 16.04, CentOS 6, and another Ubuntu 14.04. Since separating the higher-level operating system from the kernel makes the deployment of each operating system much more lightweight, the operations team has the option to run many more independent operating systems than traditionally possible.
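
To make the kernel/userland split concrete, here is a small sketch, assuming Docker is installed and the Docker SDK for Python is available: two containers with completely different userlands report the very same host kernel.

    import docker

    client = docker.from_env()

    # Each image ships its own userland, but `uname -r` reports the shared host kernel.
    for image in ("ubuntu:14.04", "centos:6"):
        kernel = client.containers.run(image, "uname -r", remove=True)
        print(image, "is running on kernel", kernel.decode().strip())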

Pros

  • Development environment = Production environment - Since the environment is explicitly defined by the developer, the development environment and the production environment will always be the same. This advantage drastically reduces the “well it works on MY machine” arguments between developers and the operations team since if it works on the developer’s machine, then it is very likely to still work on the production machine.

Cons

  • Patch management - Patches must be managed both at the server level and at the container level. However, this is not as challenging as in a server-level deployment strategy since the virtual machine kernel will require significantly less maintenance than higher-level operating system services, and the higher-level operating system services will be explicitly defined and managed by the developer.
  • Scaling management - In addition to managing the scaling of the servers in the cluster, the containers within that cluster must also be managed.

Use-cases

  • Service that requires custom environment
  • Large monolithic services

Function level

Giving up even more control of the deployment and operation of the service, a service can be deployed as pure code to a function as a service (FaaS) platform. Popular examples of FaaS are AWS Lambda (https://aws.amazon.com/lambda/), Google Cloud Functions (https://cloud.google.com/functions/docs/), and Azure Functions (https://docs.microsoft.com/en-us/azure/azure-functions/functions-overview). Each of these platforms allows a user to simply upload code that can be triggered by any number of events, such as a scheduled time, an HTTP request (useful for webhooks), a certain log being generated, or a number of other platform-specific events. Although FaaS may seem like magic at first, since you are logically running pure code in the cloud without any servers, FaaS is simply an abstraction on top of containers that moves the responsibility of container management to the cloud service provider. With the least amount of control (and associated responsibility) over the deployment, FaaS is the easiest deployment method to maintain.
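
For a sense of how little there is to maintain, the entire deployable artifact can be as small as the handler below. This is a minimal sketch, assuming AWS Lambda's Python runtime behind an API Gateway HTTP trigger; the event shape and response format follow that assumption.

    import json

    def handler(event, context):
        # The provider invokes this function per request; there is no server,
        # patching, or scaling logic anywhere in the service's codebase.
        body = json.loads(event.get("body") or "{}")
        name = body.get("name", "world")
        return {
            "statusCode": 200,
            "body": json.dumps({"message": "hello, " + name}),
        }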

Pros

  • No patch management - All patches/updates are managed by the FaaS provider.
  • Automatically scales - All scaling is managed by the FaaS provider.
  • Price - The typical pay-per-execution model of FaaS means the service is only charged while the code is executing. Therefore, if a service is triggered every 5 minutes and only runs for 200ms, it is billed for roughly 0.2 seconds out of every 300, a small fraction of one percent of the cost of the same service running on a server that must be up at all times. However, if the service is running 100% of the time, FaaS pricing is usually more expensive than maintaining the system yourself (see pricing in the cons section).

Cons

  • Price - FaaS is more expensive than maintaining an identical system yourself. However, the time that is spent building and maintaining that identical system is time that the engineer could be spending on creating new features for your service. Therefore, FaaS is likely to be less expensive in the long run.
  • Constrained to a limited list of environments - FaaS provides a limited list of runtime environments, typically one per each supported programming language. The environment usually contains all the standard libraries and tools required for most applications, but if specific customizations to the environment are required to run the service, FaaS may not be an option.
  • Downloads entire service on each run - FaaS works by starting a new container that downloads the entire service code (usually compressed) and then executes that code. If the codebase is large, the download takes a long time, causing a longer delay between the time the function is triggered and the time it actually executes. However, if the code is properly minified (a standard workflow in Node.js projects), this download time is relatively negligible.
  • Development environment ~= production environment - The developer is responsible for ensuring that the development environment matches the production environment so that behavior remains the same in both. Although this is easier to accomplish in FaaS than with traditional server-level deployments, since the FaaS environments are usually well-defined, it is more difficult than with container-level deployments, where the environment is explicitly defined by the developer.

Use-cases

  • Simple services that can be run in a provided environment - One of the main principles of microservices is that a microservice should have one responsibility. Adhering to this principle fits FaaS like a glove since the services are small and the responsibilities are well-defined. If you are looking to deploy a large monolithic service, however, FaaS might not be the best option.


Although these lists and descriptions of each deployment strategy are not comprehensive, they are usually good enough to make a decision on which deployment strategy is best for a given system or service.