See part 1 for context. This post explains the control/planner strategy I settled on to minimize cost while keeping things stable enough to absorb the disruptions caused by VMs being re-allocated.
The missing component from my previous approach was the idea of a “kill window” for a pool of instances. Spot instances are allocated in hourly blocks, assuming they are not pre-empted by higher bids from others. By grabbing the uptime information for each auto-scaling group and grouping the instances by uptime, we can incorporate the number of instances we are willing to kill into the mixed integer program. For example, if one auto-scaling group has 20 instances and 5 of those instances are within the 40-55 minute window past the hour, then we consider those instances killable and add a constraint to the mixed integer program that forces the instance count in that group to be at least 15 (20 – 5). We do this for every region and auto-scaling group so that we don’t suddenly go from 20 instances in one group to 20 instances in another group without accounting for the hourly block based allocation.
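The lower bound described above can be sketched in a few lines. This is only an illustration: the function names, the group labels, and the choice to measure the window as uptime modulo 60 with inclusive bounds are my assumptions, and the resulting numbers would still have to be fed to whatever solver is in use as per-group lower bounds.

```python
from typing import Dict, List, Tuple

# Minutes into the hourly block during which an instance counts as killable.
# The 40-55 range comes from the post; inclusive bounds are an assumption.
KILL_WINDOW: Tuple[int, int] = (40, 55)


def killable_count(uptimes_min: List[float],
                   window: Tuple[int, int] = KILL_WINDOW) -> int:
    """Count instances whose position in the current hourly block is in the window."""
    lo, hi = window
    return sum(1 for u in uptimes_min if lo <= (u % 60) <= hi)


def lower_bounds(groups: Dict[str, List[float]]) -> Dict[str, int]:
    """Minimum instance count per group: current size minus killable instances."""
    return {name: len(u) - killable_count(u) for name, u in groups.items()}


# Hypothetical data: 20 instances, 5 of them inside the kill window
# (107 and 174 minutes of uptime are 47 and 54 minutes into their hour).
groups = {
    "us-east-1/asg-a": [10.0] * 15 + [42.0, 45.0, 50.0, 107.0, 174.0],
    "us-west-2/asg-b": [5.0, 20.0, 65.0],
}
bounds = lower_bounds(groups)
print(bounds)  # {'us-east-1/asg-a': 15, 'us-west-2/asg-b': 3}
```

Each value in `bounds` would become a constraint of the form "instances in this group >= bound" in the mixed integer program.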
The “kill window” constraint creates a nice kind of continuity and allows gradual evolution. The solver runs every 5 minutes, which is long enough for small perturbations to propagate throughout AWS in terms of VM startup and spot prices. I also place upper and lower bounds on how many instances can be re-allocated. Just because 5 instances are within a kill window and the solver wants to re-allocate them doesn’t mean they will all be re-allocated at once. These continuity constraints are handled at a higher level because I couldn’t figure out a nice way of incorporating them into the optimization program. Given those 5 instances, we first re-allocate 2 and then wait 5 minutes while the instances and prices are updated. If the re-allocation was a good choice and the prices haven’t deviated too much, we continue with what is left over in small increments. A really nice side-effect of this process is that we get to see how the price fluctuates based on our decisions. I have already seen cases where re-allocating 10 instances would have caused a price spike; by re-allocating 2 at a time we were able to avoid over-paying, and at the end of the 5 minute period the other 8 instances were allocated to another region which was much cheaper.
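A rough sketch of the incremental re-allocation described above follows. The names `next_batch`, `rollout` and `prices_ok` are hypothetical, the step size of 2 is taken from the example in the text, and only the upper cap is modeled; the real loop would sleep for about 5 minutes between batches and re-run the solver rather than call a callback.

```python
from typing import Callable, List, Tuple


def next_batch(pending: int, max_step: int = 2) -> int:
    """Cap how many instances are re-allocated in a single solver cycle."""
    return max(0, min(pending, max_step))


def rollout(pending: int,
            prices_ok: Callable[[], bool],
            max_step: int = 2) -> Tuple[List[int], int]:
    """Re-allocate in small batches; stop as soon as prices look bad.

    Returns the batch sizes that were applied and how many instances
    were left for the solver to place elsewhere.
    """
    batches: List[int] = []
    while pending > 0:
        batch = next_batch(pending, max_step)
        batches.append(batch)
        pending -= batch
        # The real loop would wait ~5 minutes here for instances and prices to update.
        if pending > 0 and not prices_ok():
            break
    return batches, pending


# Prices stay stable: all 5 instances move over three cycles.
print(rollout(5, lambda: True))             # ([2, 2, 1], 0)

# Prices spike after the first batch: 8 of 10 are left for another region.
checks = iter([False])
print(rollout(10, lambda: next(checks)))    # ([2], 8)
```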
The other component is looking at the current workload and adjusting the VCPU and RAM constraints accordingly. The process here is similar to the “kill window” and gradual evolution approach. Every 5 minutes we check what the workload is and increase or decrease the requirements. We don’t commit the changes unless 15 minutes have passed and we are still seeing backlog. The increase is faster than the decrease, so we ramp up faster than we ramp down. Like scaling things up, the scale-down process also has floors. If the solver says reduce by 10, we only reduce by 2 and then wait to see what happens. It could be that in those 5 minutes our requirements go up again, so the slower approach ends up being better: it optimizes for stability in exchange for slightly higher costs.
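One way to picture this workload adjustment is a small controller that only commits a change after the same signal has been seen for three consecutive 5-minute checks (15 minutes). This is a sketch under assumptions: the class name, the `up_step` of 4, and the choice to require a fresh 15 minutes after every commit are mine; only the 5 and 15 minute intervals and the "reduce by 2" cap come from the text.

```python
from dataclasses import dataclass


@dataclass
class CapacityController:
    vcpus: int
    up_step: int = 4        # assumed value; the post only says ramp-up is faster
    down_step: int = 2      # "if the solver says reduce by 10 we only reduce by 2"
    confirm_ticks: int = 3  # 15 minutes of agreement at one check per 5 minutes
    _streak: int = 0
    _direction: int = 0

    def tick(self, desired: int) -> int:
        """Feed one 5-minute observation; return the committed VCPU requirement."""
        direction = (desired > self.vcpus) - (desired < self.vcpus)
        if direction == 0:
            self._streak, self._direction = 0, 0
            return self.vcpus
        if direction == self._direction:
            self._streak += 1
        else:
            self._direction, self._streak = direction, 1
        if self._streak >= self.confirm_ticks:
            gap = abs(desired - self.vcpus)
            step = min(gap, self.up_step if direction > 0 else self.down_step)
            self.vcpus += direction * step
            self._streak = 0  # assumption: wait another 15 minutes before the next change
        return self.vcpus


c = CapacityController(vcpus=20)
print([c.tick(30) for _ in range(3)])  # [20, 20, 24]  ramp up by 4 after 15 minutes
print([c.tick(10) for _ in range(3)])  # [24, 24, 22]  ramp down by only 2
```

The same shape would apply to the RAM requirement; a single field is used here to keep the example short.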
There are similar constraints based on pricing. If the solver proposes re-allocating a bunch of instances and the price delta is less than 20% of what we are currently paying, then the disruption is not worth it and the re-allocation doesn’t happen. This is again a stability vs cost trade-off. We could save more, but it comes at the cost of new VMs starting up and kinda ends up being a wash. The 20% mark is just a heuristic and I don’t have any basis for why I chose it. When there are indeed large price fluctuations, anecdotal evidence seems to suggest the 20% cut-off does a good enough job.
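The price check above boils down to a single comparison. A minimal sketch, where the function name and the treatment of a non-positive current cost are my assumptions and the 20% threshold is the heuristic from the text:

```python
def worth_reallocating(current_cost: float,
                       proposed_cost: float,
                       min_saving: float = 0.20) -> bool:
    """Accept a re-allocation only if it saves at least `min_saving` of the current cost."""
    if current_cost <= 0:
        return False  # assumption: nothing to save against, so don't disrupt anything
    saving = (current_cost - proposed_cost) / current_cost
    return saving >= min_saving


print(worth_reallocating(10.0, 8.5))  # False: a 15% saving is not worth the disruption
print(worth_reallocating(10.0, 7.0))  # True: a 30% saving clears the 20% bar
```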
There are a few more parameters and control points, but the above are the major ones that have allowed me to run this control/planner process in a loop and not worry about things breaking. I think the underlying theory is called “approximate dynamic programming” and there is probably a more principled approach to the entire problem instead of using heuristics for the control parameters. Nonetheless, it’s good enough for the time being and is saving significant amounts of money daily.