Modelling Rolling Deployments in Alloy


This post has a few goals. One is to outline how to think about a simple imperative process in a declarative language/system like Alloy. Another is to make a case for why this kind of formalization is a worthwhile exercise for software engineers.

Disclaimer: To get the most out of this post you need some familiarity with Alloy syntax otherwise most of it will probably not make much sense. I’ll try to explain things as I go along but promise no mind blowing revelations. As some old person used to say, “There are no royal roads”.

Rolling Deployments

Folks that are familiar with rolling deployments and have dealt with them in a practical setting can safely skip this section. Everyone else should read and make sure it makes some sense to them because this is the foundation for the Alloy specification/model presented in the next section.

There are a few ways to distribute software/code updates for a typical web application sitting behind a load balancer. One way is to provision a new/disjoint set of servers and reconfigure the load balancer to gracefully (or abruptly) transfer connections from the old servers to the new ones. This is called blue/green deployments. Another way is to not worry about any new servers and roll out the new code in stages/batches to the existing servers by taking them out of the load balancer and recommissioning them. This is called rolling deployments.

Depending on various requirements and restrictions one or the other strategy might be the only viable option but at large enough scale rolling deployments are the only reasonable option because the cost and time involved in spinning up new servers might be prohibitive or downright impossible (think about bare metal servers and account based limitations in cloud environments). The main issue is that for blue/green deployments we need 1x slack in the system which means half of our capacity must be set aside to be either green or blue and not doing any useful work behind the load balancer. For rolling deployments we only need just enough slack to update one server at a time and don’t need blue/green servers and idle capacity. There is a way to play with the blue/green ratios to reduce the required slack but then it ends up being a hybrid model of blue/green and rolling. I’m going to leave exploration of that extension as an exercise for the reader.

Most of this also holds in any environment where compute capacity can be doled out in chunks. The load balancer is just one way of distributing resources so you can also imagine rolling deployments happening in other kinds of environments that are not necessarily web application related and instead of a load balancer there is some other general purpose scheduler. The same ideas apply in the more generic context as long as there is some notion of being able to mark resources as out of commission.

With that out of the way we can start thinking about how to model what seems like an imperative process in a declarative context. Hopefully as the model is elaborated the connections between the imperative and declarative components will become clear.

Modelling Rolling Deployments in Alloy

There is a bit of an art to specification and modelling when it comes to bridging the gap between reality and formality. We want our models to tell us useful things but we also can’t model every detail so we must choose which aspects we want to focus on and decide for ourselves whether the resulting model is a good approximation of what we’re trying to understand. Fundamentally there is no substitute for common sense as a meta-theoretic guide for using a tool properly. Best advice I can give is to practice and run into enough edge and corner cases to develop a sense for what is a “good”/”bad” when abstracting details for modelling purposes.

In the previous section I mentioned load balancers and schedulers but the specific thing to be concerned about is the state of each server. We really only care about what step of the recommissioning process a server is currently at. This is a deliberate modelling choice and someone else could have made another decision and gone down another path. Although the specification below is a good model it is not the only way to specify rolling deployments.

So with that in mind I’m going to model the state of the servers as old, new, and undefined. The last state (undefined) defines the whole rigamarole of doing the necessary violence to the server to update it before putting it back into production as a new server

open util/ordering[State]

some sig Machine { }

// State is defined by the aggregate state of the machines
sig State {
  // Working with old software
  old: set Machine,
  // Working with new software
  new: set Machine,
  // Being updated and not currently doing work
  undefined: set Machine
} {
  // Machines can't randomly disappear
  Machine = old + new + undefined
  // There must always be at least 1 machine in the old or new states
  // doing some useful work
  some (old + new)
  // The machines can't be in multiple states at once
  disj[old, new, undefined]

Notice how we have several restrictions of what comprises the aggregate state of a set of servers/machines

  // Machines can't randomly disappear
  Machine = old + new + undefined
  // There must always be at least 1 machine in the old or new states
  // doing some useful work
  some (old + new)
  // The machines can't be in multiple states at once
  disj[old, new, undefined]

I initially did not have these restrictions and discovered them interactively in Alloy’s graphical analyzer. It is hard to think of all the implicit assumptions we have in our mental models and Alloy is an excellent tool for uncovering those hidden assumptions.

Next thing we need to do is specify the dynamics of rolling deployments. The way this is accomplished in most declarative languages is by having some kind of ordering between states and then specifying what is a valid transition from one state to the next by relating the ordered states in a way that respects the ordering. For our purposes we’ll only need to worry about the current and the next state

pred transition(s, s': State) {
  // New machine stay where they are.
  // This essentially pins the state of new machines. in s'.new
  // Old machines can't magically become new machines.
  no (s.old & s'.new)
  // Now we specify the mapping for what happens to machines
  // that are being updated.
  some s.undefined => {
    // If there are machines in an undefined state then they
    // must move to the new state but they don't all have to
    // move together
    some (s.undefined & s'.new)
    // We don't specify what is happening to the old machines.
    // Think about why this is ok (or not).
  } else {
    // There are no machines in an undefined state so pick
    // some from old ones and move them to an undefined state.
    some (s.old & s'.undefined)

Just like with the specification for the aggregate state I did not come up with this on the first try. Initially there were many low level details and the model was more restrictive than necessary. I had implicitly and inadvertently excluded concurrent operations. The important part of the specification is how set intersections are used to specify the compatibility between different states, e.g. some (s.undefined & s'.new). This specific condition says that on each transition there must be at least one server that moves from an undefined state to being a new/updated server. The set based codification is what allows for concurrent operations because as long as the intersection is non-empty it can have more than one element which indicates more than one server moved from one set to the next.

There is nothing left to do now other than run the specification to see what Alloy comes up with

run {
  first.old = Machine = Machine
  all s: State, s': | transition[s, s']
} for 4 State, exactly 3 Machine

I highly recommend trying this at home and seeing what transitions Alloy comes up with. It really feels kinda magical when declarative descriptions become “animated” in Alloy’s graphical analyzer. Here’s a formatted version of one of the examples Alloy came up with

State0 = {
  old: {M0, M1, M2}
State1 = {
  old: {M1}
  undefined: {M0, M2}
State2 = {
  new: {M2}
  undefined: {M0, M1}
State3 = {
  new: {M0, M1, M2}

Notice how it takes two state transitions for M0 to move from undefined to new. This tells me our model has some legs because it can model delays in the deployment process. The model does not say anything about everything happening in a single step for each machine and machines can idle in a given state for several time steps.

Final Words

Hopefully that shows some of the thinking process involved in specifying models in a context that most software engineers should be familiar with. Formal specification doesn’t have to be an academic exercise and can be fruitfully used for modeling more pedestrian processes in day to day software engineering.