Auto-scaling on Amazon EC2 with Opscode Chef

There are lots of ways to set up auto-scaling on EC2 nowadays, including Amazon's own products like the recently announced AWS OpsWorks and CloudFormation. The benefit of using these tools is their integration with other AWS services. But there are also downsides: OpsWorks currently cannot integrate with ELB, and using CloudFormation will probably involve writing funky JSON templates.

There are also third-party solutions, like the open-source Asgard from Netflix and RightScale, an enterprise cloud management service.

These services can also be used for some basic configuration management, though I feel that is not their primary purpose. We chose to go with a separate solution for that - Opscode Chef.

There are lots of guides on how to set up EC2 auto-scaling, as well as guides on integrating Chef with CloudFormation, like Amazon's own docs. However, there isn't much information on how to do this without CloudFormation. If you just want auto-scaling without the extra complexity of CloudFormation and still want to use Chef for configuration management, here's what you need to do.

Before you continue, you'll need to install and configure the Auto Scaling command line tools, which you can get from your favorite package manager or directly from Amazon.
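
If you grab the tools directly from Amazon, the setup looks roughly like this - a sketch, assuming the Java-based command line tools; the paths and the credentials file name are placeholders, so check the README that ships with the download:

# the tools are Java-based, so you need a JRE plus the unzipped tool directory
sudo apt-get install -y unzip openjdk-7-jre-headless
unzip AutoScaling-*.zip -d ~/aws-tools
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export AWS_AUTO_SCALING_HOME=~/aws-tools/AutoScaling-VERSION   # adjust to the directory unzip created
export PATH=$PATH:$AWS_AUTO_SCALING_HOME/bin

# credentials file in the format the old Java CLI tools expect
cat > ~/.aws-credentials << 'EOP'
AWSAccessKeyId=YOUR_ACCESS_KEY
AWSSecretKey=YOUR_SECRET_KEY
EOP
export AWS_CREDENTIAL_FILE=~/.aws-credentials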

Adding nodes to Chef

First you need to create a launch config for AWS to use when launching new instances. The difference here, compared to something like "knife ec2 server create", is that you need to bootstrap Chef manually. Replace the parameters with ones for your application as needed.

as-create-launch-config EXAMPLE --image-id ami-b8d147d1 --instance-type m1.large \
    --group EC2_SECURITY_GROUP --monitoring-disabled --user-data-file chef-user-data.sh

The most important part here is the "--user-data-file" parameter, as it lets us provide a script that will bootstrap Chef for us. Here's a version that works with Ubuntu 12.10 and Chef 11.4.

#!/bin/bash -v
# install pre-requisites
apt-get update
apt-get upgrade -y
apt-get install -y ruby1.9.1-dev ruby1.9.1 rubygems s3cmd
gem install ohai chef --no-rdoc --no-ri --verbose
mkdir -p /etc/chef
# write first-boot.json
(
cat << 'EOP'
{"run_list": ["role[YOUR_SERVER_ROLE]"]}
EOP
) > /etc/chef/first-boot.json
# write .s3cfg
(
cat << 'EOP'
[default]
access_key = ***
secret_key = ******
use_https = True
EOP
) > /home/ubuntu/.s3cfg
# get chef validation key from S3
s3cmd -c /home/ubuntu/.s3cfg get s3://YOUR_BUCKET/validation.pem /etc/chef/validation.pem
# write client.rb
(
cat << 'EOP'
log_level :info
log_location STDOUT
chef_server_url 'YOUR_CHEF_URL'
validation_client_name 'YOUR_PROJECT-validator'
EOP
) > /etc/chef/client.rb
# Bootstrap chef
chef-client -j /etc/chef/first-boot.json

This script fetches your Chef validation key from S3 and registers the instance with the Chef server. Next, we create the auto-scaling group (once again, change the parameters as needed).

as-create-auto-scaling-group EXAMPLE --availability-zones us-east-1a,us-east-1b \
    --launch-configuration EXAMPLE --desired-capacity 2 --min-size 2 --max-size 10 \
    --load-balancers YOUR_ELB_NAME --health-check-type ELB --grace-period 300

You can then set up triggers for scaling the group up and down using the "as-create-or-update-trigger" script, or create policies with the "as-put-scaling-policy" script and add alarms in the CloudWatch web interface.
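
For example, something along these lines - a sketch, where the policy names, thresholds and periods are made up, and the ARN that "as-put-scaling-policy" prints is what you attach the CloudWatch alarm to:

# add one instance when the scale-up alarm fires; the command prints the policy ARN
as-put-scaling-policy EXAMPLE-scale-up --auto-scaling-group EXAMPLE \
    --adjustment=1 --type ChangeInCapacity --cooldown 300

# remove one instance when the scale-down alarm fires
as-put-scaling-policy EXAMPLE-scale-down --auto-scaling-group EXAMPLE \
    --adjustment=-1 --type ChangeInCapacity --cooldown 300

# if you have the CloudWatch command line tools installed, the alarm can be
# created here as well instead of in the web interface
mon-put-metric-alarm --alarm-name EXAMPLE-high-cpu --metric-name CPUUtilization \
    --namespace "AWS/EC2" --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 75 --comparison-operator GreaterThanThreshold \
    --dimensions "AutoScalingGroupName=EXAMPLE" \
    --alarm-actions ARN_FROM_SCALE_UP_POLICY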

You should see servers booting up as soon as you create the auto-scaling group.
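
You can also watch the group work from the command line:

# shows launches, terminations and any errors for the group
as-describe-scaling-activities EXAMPLE

# lists the group's configuration and current instances
as-describe-auto-scaling-groups EXAMPLE --headers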

Deleting nodes from Chef

Now you should have your servers provisioned and correctly registered with Chef. The only missing piece is removing them from Chef when they're shut down. The simplest way to do this is to place a script in /etc/rc0.d/ that does it for you. You can set this up with the user-data script, or with a Chef recipe:

/usr/local/bin/knife node delete -y -c /root/.chef/knife.rb <%= node['fqdn'] %>
/usr/local/bin/knife client delete -y -c /root/.chef/knife.rb <%= node['fqdn'] %>
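
If you go the user-data route instead, the rendered script could look something like this - a sketch; the K01 file name is an assumption (sysvinit only runs rc0.d entries whose names start with K or S), and using "hostname -f" assumes chef-client registered under its default node name, the FQDN:

#!/bin/bash
# /etc/rc0.d/K01chef-deregister -- must be executable
NODE_NAME=$(hostname -f)
/usr/local/bin/knife node delete -y -c /root/.chef/knife.rb "$NODE_NAME"
/usr/local/bin/knife client delete -y -c /root/.chef/knife.rb "$NODE_NAME"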

It will also require you to write a knife.rb config on the server (into /root/.chef/knife.rb in this case).
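
A minimal knife.rb can be written from the same user-data script, for example - a sketch; note the unquoted heredoc so that $(hostname -f) expands at write time, and depending on your Chef server's permissions you may need a key with admin rights here rather than the node's own client.pem:

mkdir -p /root/.chef
(
cat << EOP
log_level        :info
node_name        "$(hostname -f)"
client_key       '/etc/chef/client.pem'
chef_server_url  'YOUR_CHEF_URL'
EOP
) > /root/.chef/knife.rb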

This solution is not ideal, as it won't delete your servers from Chef if an instance goes completely unresponsive (which does happen with EC2), but it's good enough. You could also set up something more advanced that sends an SQS notification when scaling down, and have a listener remove the nodes for you.

Conclusion and Notes

There you have it: a fully automated auto-scaling solution that you can still manage with Chef.

The user data that you provide to the launch config is included with every instance that is launched and can be viewed in plain text, so it's probably not a good idea to include your global AWS credentials in there.
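
Anything you put in user data is readable by every process on the instance through the metadata endpoint, for example:

# returns the entire user-data script, credentials and all
curl http://169.254.169.254/latest/user-data

A dedicated key with read-only access to just the validation key bucket limits the damage if it leaks.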

Unfortunately, if something goes wrong, the only way to debug is to check the system log of the instance. But once chef-client starts running, you're pretty much done, assuming your recipes don't produce any errors and your application boots up and responds to the ELB health check. You should probably test this beforehand with a simple "knife ec2 server create" and make sure that you can just add the instance to the ELB afterwards.
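
If you have the EC2 API tools installed, you can pull that log without opening the web console (the instance ID below is a placeholder):

# dumps the instance's console output, which is where the user-data script's
# output should end up with Ubuntu's cloud-init
ec2-get-console-output i-12345678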

I also recommend setting a high grace period for the auto-scaling group, as Chef and all of its dependencies can take a while to install (> 2 minutes).

Posted Thu 28 February 2013 by Ivan Dyedov in AWS (Amazon Web Services, Ubuntu, Opscode Chef, autoscale)