Auto Scaling on AWS is a great way to reliably host your applications with the added benefit of scaling the stateless bits.  However, Auto Scaling deploys present a unique set of challenges that you need to be aware of, especially if you choose to use AWS CodeDeploy.

What’s so hard about deploying to Auto Scaling groups?

At any moment, your Auto Scaling group (ASG) could be spinning up a new instance or terminating one.  If you fire off a deploy, you need to ensure that your ASG doesn’t do either of those things, which could leave your deployment in a weird state.  If you start a deployment and your ASG terminates one of the instances in that deployment, your deploy might fail since it didn’t successfully deploy to all the instances it intended to.  If you start a deployment and your ASG launches a new instance, that instance might grab the now “old” version of your application since the “new” version hasn’t finished deploying.  If you start a deployment and it fails on some step on an instance, what should the ASG do?  Should it terminate the instance?  Should your deployment tool redeploy the old version to get back to a known good state?

Enter AWS CodeDeploy

I’ll be the first to admit, CodeDeploy is not a perfect tool.  Like many of AWS’s non-infrastructure services (I call these sugary services), there’s a lot left to be desired.  CodeDeploy does check a few boxes for me, however.  I try very hard to fully bake AMIs using Ansible, and I don’t want to run a Chef server, a Puppet server, a server with Ansible on cron…  none of that.  When I saw that CodeDeploy could work as an artifact repository of sorts, holding the latest known good version, and gave me easy insertion points with deployment lifecycle hooks, I was pretty excited.  But once you start taking a look at how it works, you realize its integration with Auto Scaling is pretty lame and limited.  When you create a CodeDeploy deployment group with an ASG as a target, it adds an ASG lifecycle hook so that when an instance boots, CodeDeploy knows to watch for that instance ID checking in.  It essentially creates a deployment just for that instance, using the latest known good version of the application at the time the instance boots.  Even with this lifecycle hook, you’re still left with most of the issues I raised above.
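If you’re curious what that looks like, you can list the lifecycle hooks on the group after you create the deployment group.  A quick look with boto3 (the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client('autoscaling')

# Placeholder group name -- point this at the ASG your deployment group targets.
# Once CodeDeploy is wired up, you should see a hook it added for the
# autoscaling:EC2_INSTANCE_LAUNCHING transition.
hooks = autoscaling.describe_lifecycle_hooks(AutoScalingGroupName='my-app')
for hook in hooks['LifecycleHooks']:
    print(hook['LifecycleHookName'], hook['LifecycleTransition'])
```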

Sample Scripts

The first clue that you’ll need to take some manual steps to get this working as magically as you want is the sample scripts repo AWS provides.  If you look at load-balancing/elb (not elb-v2, because they haven’t made that ASG aware for some reason…?), you’ll see that they give you a pretty good starting point.  This is a collection of scripts that you bundle with your application and add to your appspec.yml file to be run during various deployment lifecycle events.  If you read through the README.md in that directory, you’ll see they point out some of the same issues I raised above under “Important notice about handling AutoScaling processes”.  I highly recommend you read through their points and the subsequent paragraphs.

The way they handle Auto Scaling processes doesn’t do enough to address even the issues they highlight, which is why we need to add more magic.  I’m going to assume from this point on that you’re using these scripts with your application and have them added to various hooks in your appspec.yml.  You’ll notice that these scripts have a flag called HANDLE_PROCS which is meant for ASGs.  This solves a couple of the issues raised in their readme, but it doesn’t stop an instance from launching during a deploy and, even worse, it leads to a very dangerous situation if a deploy fails.  Let’s say you have two instances in an ASG just for redundancy’s sake and have your min, max, and desired all set to two.  Ideally, for a zero-downtime deployment, you’d want to deploy to one instance, wait for that to succeed, and then move on to the second.  Let me take you through what happens if you use the AWS-provided sample scripts with your application in a failure scenario:

  1. You start a deploy with CodeDeploy, which finds an instance in the ASG and tells the CodeDeploy agent running on that instance to start the deployment.
  2. The sample scripts suspend the ASG processes (specifically AZRebalance, AlarmNotification, ScheduledActions, and ReplaceUnhealthy).
  3. The script checks whether it can move the instance to the standby state (which would take it out of the ELB if the ASG is tied to one), but it cannot, because doing so would drop the group below its min of two healthy instances.
  4. The script then decreases the ASG’s min to one and moves the instance to standby, which the ASG will now happily do.
  5. If your install (or any later script) fails at this point, your ASG is left with a min of one, a desired of one, and a max of two, with its processes still suspended.

I don’t think it’s good enough to say that you just need to catch all failures and resume processes on failure.  That doesn’t reset my ASG’s sizing, and you can’t do anything if the deployment fails during a lifecycle event you can’t hook into (Install, for example).
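If you want to check whether a group has been left in that half-deployed state, the symptoms are all visible through the Auto Scaling API.  A quick boto3 sketch (the group name is a placeholder):

```python
import boto3

autoscaling = boto3.client('autoscaling')

# Placeholder group name -- point this at your own ASG.
asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=['my-app'])['AutoScalingGroups'][0]

suspended = [p['ProcessName'] for p in asg['SuspendedProcesses']]
standby = [i['InstanceId'] for i in asg['Instances']
           if i['LifecycleState'] == 'Standby']

# A failed deploy with the sample scripts typically shows up as a shrunken
# min/desired, suspended processes, and an instance parked in Standby.
print('min/desired/max:', asg['MinSize'], asg['DesiredCapacity'], asg['MaxSize'])
print('suspended processes:', suspended)
print('instances in standby:', standby)
```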

Most of the problems can be addressed by some manipulation of Auto Scaling processes, but not by the instances themselves, as you can’t guarantee that your failure point will be during a deploy hook (or that someone else won’t try to add a script without catching failures).  We really need an out-of-band process to handle situations like the one I outlined above, and thankfully AWS has made this easy-ish to do with deployment group triggers, SNS, and Lambda.

Add magic to the sugar

We can set up a deployment group to send notifications to an SNS topic when a deploy starts and stops, which we can have fire Lambda functions… and ta-da, you have your out-of-band process!  You’ll need to give the service role your CodeDeploy deployment group uses permission to publish to SNS, create an SNS topic, and then create a trigger on your deployment group.
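You can do all of that from the console (below), or script it.  Here’s a rough boto3 sketch with placeholder names; the sns:Publish permission on the service role still has to be granted separately:

```python
import boto3

sns = boto3.client('sns')
codedeploy = boto3.client('codedeploy')

# Placeholder names -- swap in your own application, deployment group, and topic.
topic_arn = sns.create_topic(Name='codedeploy-deploy-events')['TopicArn']

codedeploy.update_deployment_group(
    applicationName='my-app',
    currentDeploymentGroupName='my-app-group',
    triggerConfigurations=[{
        'triggerName': 'deploy-status-events',
        'triggerTargetArn': topic_arn,
        # Fire on every deployment status change we care about.
        'triggerEvents': ['DeploymentStart', 'DeploymentSuccess',
                          'DeploymentFailure', 'DeploymentStop'],
    }],
)
```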

Deployment group triggers

For events, I chose Deployment Status (all), but do whatever makes sense for you.  Then select your new SNS topic and click “Create Trigger”.  Now your deployment group should be emitting events to your SNS topic in the form of a JSON object (some samples found here).  The next step is to create a Lambda function that your SNS topic invokes.

Obviously you’re free to do whatever you want, but here’s what I made.  Note that this assumes your ASG name is the same as your CodeDeploy application name.
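Here’s a minimal sketch of that idea in Python with boto3.  Treat it as a starting point rather than the exact script: the table name is a placeholder, and the notification fields it reads (deploymentId, applicationName, status) come from CodeDeploy’s sample SNS payloads, so double-check them against the samples linked above.

```python
import json
import boto3

# Placeholder table name; the table's hash key is assumed to be the string "deploymentId".
dynamodbTableName = 'codedeploy-asg-state'

autoscaling = boto3.client('autoscaling')
table = boto3.resource('dynamodb').Table(dynamodbTableName)

# Same processes the AWS sample scripts suspend; add 'Launch' here as well if
# you also want to block scale-out while a deploy is running.
PROCESSES = ['AZRebalance', 'AlarmNotification', 'ScheduledActions', 'ReplaceUnhealthy']


def handler(event, context):
    try:
        message = json.loads(event['Records'][0]['Sns']['Message'])
    except (KeyError, ValueError):
        return  # ignore anything that isn't a CodeDeploy JSON notification

    deployment_id = message['deploymentId']
    asg_name = message['applicationName']  # assumption: ASG name == application name
    status = message.get('status')

    if status == 'CREATED':
        on_deploy_start(asg_name, deployment_id)
    elif status == 'SUCCEEDED':
        on_deploy_finish(asg_name, deployment_id, failed=False)
    elif status in ('FAILED', 'STOPPED'):
        on_deploy_finish(asg_name, deployment_id, failed=True)


def on_deploy_start(asg_name, deployment_id):
    asg = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]
    # Remember the sizing so it can be restored when the deployment ends.
    table.put_item(Item={
        'deploymentId': deployment_id,
        'asgName': asg_name,
        'min': asg['MinSize'],
        'max': asg['MaxSize'],
        'desired': asg['DesiredCapacity'],
    })
    autoscaling.suspend_processes(AutoScalingGroupName=asg_name,
                                  ScalingProcesses=PROCESSES)


def on_deploy_finish(asg_name, deployment_id, failed):
    if failed:
        # Anything still parked in Standby is in an unknown state: terminate it
        # and let the ASG replace it once processes are resumed.
        asg = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]
        for instance in asg['Instances']:
            if instance['LifecycleState'] == 'Standby':
                autoscaling.terminate_instance_in_auto_scaling_group(
                    InstanceId=instance['InstanceId'],
                    ShouldDecrementDesiredCapacity=True)

    record = table.get_item(Key={'deploymentId': deployment_id}).get('Item')
    if record:
        # Reset the sizing to what it was when the deployment started.
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=asg_name,
            MinSize=int(record['min']),
            MaxSize=int(record['max']),
            DesiredCapacity=int(record['desired']))
        table.delete_item(Key={'deploymentId': deployment_id})

    autoscaling.resume_processes(AutoScalingGroupName=asg_name,
                                 ScalingProcesses=PROCESSES)
```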

This Lambda will look at the JSON it receives from SNS (once we’ve set up the subscription) and, on deployment start, will suspend processes on the ASG and save the ASG’s sizing to a DynamoDB table (configurable via dynamodbTableName at the top).  On deployment stop, it will reset the ASG sizing and resume processes for us, as well as delete the record from the DynamoDB table.  If there was a failure, it will check whether any instances are in the standby state and terminate them, and then proceed to reset the sizing.  This leaves your ASG in a consistent state with exactly the number of instances you want.  You’ll need to create the Lambda function with a role that has access to Auto Scaling, CodeDeploy, DynamoDB, and CloudWatch Logs.

Now we need to create the SNS topic subscription which invokes the Lambda function.  Go to your SNS topic and click “Create subscription”, choose “AWS Lambda”, select the Lambda function you just created, and then click “Create subscription”.
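Scripted, that subscription (plus the permission SNS needs to invoke the function) looks roughly like this; the ARNs are placeholders:

```python
import boto3

sns = boto3.client('sns')
awslambda = boto3.client('lambda')

# Placeholder ARNs -- use your real topic and function.
topic_arn = 'arn:aws:sns:us-east-1:123456789012:codedeploy-deploy-events'
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:codedeploy-asg-guard'

# Allow SNS to invoke the Lambda function...
awslambda.add_permission(
    FunctionName=function_arn,
    StatementId='codedeploy-sns-invoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic_arn,
)

# ...and subscribe the function to the topic.
sns.subscribe(TopicArn=topic_arn, Protocol='lambda', Endpoint=function_arn)
```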

Now that our Lambda function is handling suspending and resuming Auto Scaling processes for us, we can change HANDLE_PROCS in common_functions.sh to false.  When a deploy starts, there is a chance that the instance starts running through the deploy steps before our Lambda has fired, which could cause problems for us.

Since having the instance enter standby is one of the first things we want to do, we can add a check that the Lambda has run by looking for our deployment ID in the DynamoDB table.  I edited the autoscaling_enter_standby function to call a new wait_for_dynamodb_record function (also make sure you set TABLE_NAME at the top of common_functions.sh so it knows where to look).
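The sample scripts are shell, but the logic of wait_for_dynamodb_record is just a polling loop.  Here’s the idea sketched in Python (the table name is a placeholder, and DEPLOYMENT_ID is one of the environment variables the CodeDeploy agent exposes to lifecycle event scripts):

```python
import os
import time
import boto3

TABLE_NAME = 'codedeploy-asg-state'  # placeholder; must match the Lambda's table


def wait_for_dynamodb_record(timeout_seconds=120, poll_interval=5):
    """Block until the deploy-start Lambda has written its record for this deployment."""
    table = boto3.resource('dynamodb').Table(TABLE_NAME)
    deployment_id = os.environ['DEPLOYMENT_ID']
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if 'Item' in table.get_item(Key={'deploymentId': deployment_id}):
            return  # the Lambda has run; safe to enter standby
        time.sleep(poll_interval)
    raise RuntimeError('Timed out waiting for the deploy-start Lambda to run')
```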

Your instance will need permissions to read from that DynamoDB table if it doesn’t have them already.  I use an instance profile with a role for this, but use whatever works for you.

Finishing Up

There are quite a few moving parts here now, so let’s run through a successful deployment.

  1. You create a deployment with CodeDeploy, which sends an event to your SNS topic and starts a deploy on an instance in the ASG.
  2. The instance tries to enter standby, but waits until it sees a record in the DynamoDB table, which would indicate the Lambda function has run.
  3. The Lambda function runs, adding a record to the DynamoDB table that has the ASG’s min, max, and desired counts at the time of deploy start. It then suspends Auto Scaling processes.
  4. The instance (now seeing the record in the DynamoDB table) continues on with the deploy, completes successfully, and moves itself out of standby and back into service in the ASG.
  5. CodeDeploy sends a success event to the SNS topic, which fires our Lambda.
  6. The Lambda function sees it was a successful deployment, deletes the record from DynamoDB, and resumes Auto Scaling processes.

But what about a failure?

  1. You create a deployment with CodeDeploy, which sends an event to your SNS topic and starts a deploy on an instance in the ASG.
  2. The instance tries to enter standby, but waits until it sees a record in the DynamoDB table, which would indicate the Lambda function has run.
  3. The Lambda function runs, adding a record to the DynamoDB table that has the ASG’s min, max, and desired counts at the time of deploy start. It then suspends Auto Scaling processes.
  4. The instance (now seeing the record in the DynamoDB table) continues on with the deploy, then fails at some step that isn’t caught or handled in any way.
  5. CodeDeploy sends a failure event to the SNS topic, which fires our Lambda.
  6. The Lambda function sees it was a failed deployment, deletes the record from DynamoDB, terminates any instances left in standby, and resumes Auto Scaling processes.
  7. The ASG spins up a new instance to account for the now terminated standby instance.

At last you have the magical sugary goodness it seemed AWS was offering you when it introduced a deploy service back in 2014.

