Zero Downtime Application Updates with Ansible

OSCON 2013 Speaker Series

Automating the configuration management of your operating systems and the rollout of your applications is one of the most important things an administrator or developer can do to avoid surprises when updating services, scaling up, or recovering from failures. However, it’s often not enough. Some of the most common operations in your datacenter (or cloud environment) involve large numbers of machines working together, with humans mediating those processes. While we have been able to remove a lot of human effort from configuration, there has been a shortage of software able to handle these higher-level operations.

I used to work for a hosted web application company where the IT process for executing an application update involved locking six people in a room for sometimes three or four hours, each person pressing the right buttons at the right time. This process almost always had a glitch somewhere: someone forgot to run the right command, or something wasn’t well tested beforehand. While some technical solutions were applied to handle configuration automation, nothing that could perform configuration could also accomplish that high-level choreography on top. This is why I wrote Ansible.

Ansible is a configuration management, application deployment, and IT orchestration system. One of Ansible’s strong points is its very simple, human-readable language – it allows users very fine, precise control over what happens on which machines at what times.

Getting started

To get started, create an inventory file, for instance ~/ansible_hosts, that defines which machines you are managing; machines are frequently organized into groups. Ansible can also pull inventory from multiple cloud sources, but an inventory file is a quick way to get started:
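The original inventory listing did not survive in this copy of the article; here is a minimal sketch of what such a file might look like. All hostnames and group names below are hypothetical:

```ini
# ~/ansible_hosts -- hostnames and group names are illustrative placeholders
[webservers]
web1.example.com
web2.example.com
web3.example.com

[lbservers]
lb1.example.com

[monitoring]
nagios1.example.com
```

Groups like these let you target a whole class of machines at once (for example, `hosts: webservers` in a playbook) rather than naming machines individually.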

Now that you have defined what machines you are managing, you have to define what you are going to do on the remote machines.

Ansible calls this description of processes a “playbook,” and you don’t have to have just one: you could have different playbooks for different kinds of tasks.

Let’s look at an example for describing a rolling update process. This example is somewhat involved because it’s using haproxy, but haproxy is freely available. Ansible also includes modules for dealing with Netscalers and F5 load balancers, so this is just an example — ordinarily you would start more simply and work up to an example like this:
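The playbook listing is missing from this copy; the following is a sketch of what a rolling-update playbook of this kind might look like, loosely modeled on the lamp_haproxy example in the ansible-examples repository. The group names, the haproxy backend name (`myapplb`), the stats socket path, and the role names are assumptions, not the article’s exact code:

```yaml
# rolling update of the 'webservers' group, 5 hosts at a time
- hosts: webservers
  user: root
  serial: 5

  pre_tasks:
    # silence monitoring alerts for the host being updated
    # (assumes a Nagios server in the hypothetical 'monitoring' group)
    - name: disable nagios alerts for this host
      nagios: action=disable_alerts host={{ inventory_hostname }} services=webserver
      delegate_to: "{{ item }}"
      with_items: groups.monitoring

    # pull this host out of the haproxy backend pool
    # ('myapplb' and the socket path are illustrative)
    - name: take the server out of the load-balanced pool
      shell: echo "disable server myapplb/{{ inventory_hostname }}" | socat stdio /var/lib/haproxy/stats
      delegate_to: "{{ item }}"
      with_items: groups.lbservers

  roles:
    - common
    - webserver

  post_tasks:
    - name: put the server back into the load-balanced pool
      shell: echo "enable server myapplb/{{ inventory_hostname }}" | socat stdio /var/lib/haproxy/stats
      delegate_to: "{{ item }}"
      with_items: groups.lbservers

    - name: re-enable nagios alerts
      nagios: action=enable_alerts host={{ inventory_hostname }} services=webserver
      delegate_to: "{{ item }}"
      with_items: groups.monitoring
```

The `pre_tasks` and `post_tasks` sections bracket the actual update (carried out by the roles), and `delegate_to` runs the load-balancer and monitoring commands on those machines rather than on the webserver being updated.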

Now, the interesting stuff is really where Ansible “roles” come into play. This is where you would put the configuration of your application. Let’s take a look at one of the files that defines the webserver role:
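The role listing is also missing here; a minimal sketch of a tasks file for such a role might look like the following. The package names, paths, and repository URL are placeholders:

```yaml
# roles/webserver/tasks/main.yml -- package names, paths, and the
# repository URL below are illustrative placeholders
- name: install apache
  yum: name=httpd state=installed

- name: deploy the application code from source control
  git: repo=git://example.com/myapp.git dest=/var/www/myapp version=HEAD
  notify: restart apache

- name: ensure apache is running and enabled at boot
  service: name=httpd state=started enabled=yes
```

A matching handler (in roles/webserver/handlers/main.yml) would restart httpd only when the `notify` line fires, i.e. only when the code actually changed.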

Ansible playbooks are intended to be straightforward descriptions of what needs to happen on your infrastructure. Ansible modules are ‘idempotent’: they declare the desired state of a system, as opposed to actions that must occur. However, Ansible also makes it easy to describe very explicit steps that must be taken; things run in the order written, and tasks can easily depend on the results of previous tasks. This makes it a good solution for application deployment as well as base OS configuration. In the age of modern Linux cloud deployments and “DevOps” collaboration, the distinction between managing the OS and the applications is all but gone.

(If you would like to see more about this playbook setup, see https://github.com/ansible/ansible-examples/tree/master/lamp_haproxy)

The update process would then be triggered like so:

ansible-playbook application.yml -i ansible_hosts

So, what happens now? Ansible will carve your webservers into batches of whatever size you choose: if you had 500 webservers and wanted to update 50 at a time, you could. In the previous playbook, we selected a “serial” value of 5, which means only 5 hosts are updated at a time.

Ansible will take 5 hosts at a time, signal a monitoring outage window, and take them out of the load-balanced pool. Updates will happen on those 5 servers; if they succeed, the outage window on those servers will end, and they will be put back into the load-balanced pool. Should a group of hosts fail, the rolling update stops: the remaining hosts are not updated, and the failed hosts are not put back into rotation, preserving your infrastructure. All of this is done in the core application, without users having to write any automation glue on top.

A unique quality of Ansible is that it can manage all of your remote machines exclusively over SSH. This means that you do not need to bootstrap any software onto your remote nodes, and you have a very small attack surface with no additional root-level agents. It’s easy to tie into Kerberos, and there’s no need for root logins.

This is just an introduction, but I hope it shows how you can go from a standard configuration management setup to a fully operational rolling update system. If you wish, you can take the next step and wire the ansible-playbook command into something like Jenkins, continuously deploying your infrastructure as fast as you would like every time there are code changes!

If you would like to learn more about Ansible, see http://ansibleworks.com and http://ansibleworks.com/docs/

NOTE: If you are interested in attending OSCON to check out Michael’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.
