optimizing spot instance allocation

There is surprisingly little information on how to optimize costs using the AWS spot instance market. There are several services that provide management on top of the spot market if you have an architecture that supports an interruptible workload but very little in the way of how to go about doing it yourself other than surface level advice on setting up autoscaling groups. To remedy the situation here’s an outline of how I’ve solved part of the problem for CI (continuous integration) type of workload

1. Set up autoscaling groups with Terraform in various regions for several instance types
2. Generate mixed integer program and solve it with GLPK
3. Update the autoscaling group limits using results from above solution

The above is currently in production and works well enough but it is missing what I consider a key component of taking the solution from GLPK and using it in an optimal way. The linear program generated and solved in step 2 does not account for the hourly block allocation and the continuity aspect of evolving the existing set of autoscaling limits to the optimal one. The planning component for evolving the existing limits to the solved limits can potentially be expressed as another mixed integer program but using a planner is probably a better approach because it requires interacting with a dynamic and evolving environment and the mixed integer program makes a lot of simplifying assumptions about the state of the world. The other issue is the “discontinuity” of the spot market. I’ve seen cases were the price is $0.2 for some instance type and when I update the autoscaling group limits the price jumps to $2.0 which happens to be the maximum price I have set for the bids. This happens quickly enough were the optimal solution basically becomes useless within 10 minutes of being generated. This is why I think a planner is a key component for a true optimizing spot allocation strategy. Without probing the market and updating actions accordingly it is possible to be caught off guard by discontinuous price hikes in the market.

(I’ve solved these problems since the last time I wrote the above)

But with that proviso out of the way the current set up is consistently much cheaper than on-demand allocation and even with the discontinuous price hikes the savings are still quite significant. Onto the description of the pieces involved.

Set up autoscaling groups with Terraform

Here is the outline of the template. You will need to fill in the blanks. I’m skipping the pieces that would generate the VPC, security groups, IAM instance profiles, and a whole bunch of other things that would be necessary for standing up an autoscaling group. It is a tedious and boring process and there is very little insight to be gained from spelling out those details. If you copy/paste any number of Terraform templates available out there for standing up a VPC then you will be fine. I’m using ERB for generating the Terraform template because the rest of the pipeline is written in Ruby and the declarative pieces live in Ruby instead of Terraform because it is easier for me to re-use that information from Ruby than from parsing the Terraform state files

Generate mixed integer program and solve it with GLPK

So now that we have the autoscaling groups it is time to tie them into an optimization program. Here again I use an ERB template to generate a MathProg file and then feed it into glpsol to get the solution. Most of the pieces are exactly what you’d expect and explanations follow after the template. If you’re interested in what linear and mixed integer programs are then this wikipedia article does a pretty good job

variables

The variables are what you’d expect them to be. It is a hash map that takes a name and maps it to a set of attributes. Since we are dealing with autoscaling groups, availability zones, and instance types I mangle all those pieces and the variables end up being the concatenation of all those pieces. For example, if you had an autoscaling group in us-west-2a availability zone and the instance type was m4.4xlarge then the corresponding variable would be us_west_2a_m4_4xlarge.

Each variable is mapped to a set of attributes corresponding to it like maximum price we are willing to pay for each instance, how many instances at minimum and maximum should be included in the final solution, and other potentially useful bits of information.

objective

The objective function is again what you’d expect it to be. It is the total hourly cost given the current state of the spot market and ends up being the dot product of all the variables with their associated costs. As a concrete example, 0.5924 * us_west_2a_i2_4xlarge + 0.3173 * us_west_2a_r3_8xlarge + ....

constraints

The constraints are where you have some freedom. Other than expressing the CPU and RAM requirements you can attach other kinds of constraints, e.g. spot prices should not exceed hourly thresholds, instances of the same type in various availability zones should be within a factor of 2 of each other, the bid price should not exceed a certain threshold, etc. Some constraint examples from my implementation

vcpu: 16 * us_west_2a_i2_4xlarge + 32 * us_west_2a_r3_8xlarge + ... >= 1400;

delta_us_west_2a_i2_4xlarge: (3.751 - 0.5924) * us_west_2a_i2_4xlarge, >= 0;

max_bid_us_west_2a_i2_4xlarge: (0.5 - 0.5924) * us_west_2a_i2_4xlarge, >= 0;

Those constraints are basically the common sense requirements of allocating instances. We never want to pay more than the maximum hourly cost so we have positivity constraints for the price delta. Similarly we have a maximum bid price and in this case those constraints subsume the price delta so the positivity constraint on the hourly vs spot price delta is kinda redundant.

Update the autoscaling group limits using results from above solution

Feeding the above generated program to glpsol gives us the solution in terms of the mangled variables. The rest of the pipeline takes the mangled variables, unmangles them to map back to the corresponding autoscaling groups, and finally updates the autoscaling group limits with the help of the AWS Ruby SDK. This is the part that is currently not that great in my approach. The solutions end up optimizing the cost but don’t account for the existing allocation and the hourly block allocation on the AWS side. It is possible that 30 minutes ago we allocated 20 m4.4xlarge instances in us-west-2 but now the prices have changed and the optimal solution forces 0 m4.4xlarge instances. Blindly taking this solution and applying it means you lose 20 instances even though you have already paid for another 30 minutes for them. I’m currently looking for solutions to this problem by way of non-linear planning agents that account for the price history and current state of allocations.

One thought on “optimizing spot instance allocation

  1. Pingback: optimizing spot instance allocation (part deux) | learning with scripting

Comments are closed.