Software lifecycle management as operations on attributed graphs

A confluence of events at work recently reminded me of the schema drift problem. The concrete instance, or how it comes about, doesn’t really matter because the end result is always the same: there is some code on a server somewhere that is running against a version of a schema that is no longer valid. It will continue to work as long as nothing is restarted, because the ORM only validates its assumptions during startup, but as soon as the server is rebooted everything breaks.

The fix is very simple: whenever database schemas change you do a rolling restart with updated code across your entire fleet. This is the only way to guarantee that some accidental restart in the middle of the night doesn’t lead to cascading failures.

Even though the fix is simple, it’s not clear how to enforce this requirement, and a million others like it, when managing the software lifecycle in a cloud environment. The problem is exacerbated by the fact that most of the time many of these assumptions are implicit. It could be that people consistently deploy to the entire fleet whenever there is any code change. This means you will never run into schema drift until an optimization forces you to deploy to a subset of the fleet. At that point your optimized deployment practices are inconsistent with the set of implicit assumptions that was masking the schema drift problem.

I think I have a solution that is generic enough to fix the actual root cause of an entire class of problems like schema drift, and it involves attributed graphs. I only recently learned about attributed graphs, so take whatever I say with a grain of salt. I think the seed of the idea is correct, but I haven’t implemented anything yet, so there might be some dragons I haven’t seen that still need slaying.

There are a few ways to model schema drift as an attributed graph, but the invariant we want to preserve is that each representation makes the problem obvious. The simplest way to model the problem is to have a node for each entity in our infrastructure. In this case we have code living on some servers, so we are going to have a bunch of code nodes, but for concreteness’ sake I’m going to assume we just have two: C1 and C2. These code nodes will have requirements (edges) connecting them to other entities, and in this case they will be connected to a schema S. The schema node will have a version attribute, e.g. S: {version: 1}. So far our attributed graph looks as follows: C1: {} -> S: {version: 1}, C2: {} -> S: {version: 1}. This is the steady state representation of the current state of the world. Now let’s break the steady state.
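The steady state above can be written down as a very small data structure. This is only a sketch under my own assumptions: the names Node, Edge, AttributedGraph and build_steady_state are made up for illustration and don’t come from any library, and attributes are just a plain dictionary on each node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the infrastructure plus its attributes (hypothetical shape)."""
    name: str
    attrs: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A requirement: `source` depends on `target`."""
    source: str
    target: str


class AttributedGraph:
    def __init__(self):
        self.nodes = {}   # name -> Node
        self.edges = []   # list of Edge

    def add_node(self, name, **attrs):
        self.nodes[name] = Node(name, dict(attrs))

    def add_edge(self, source, target):
        self.edges.append(Edge(source, target))

    def _fmt(self, name):
        node = self.nodes[name]
        inner = ", ".join(f"{k}: {v}" for k, v in node.attrs.items())
        return f"{node.name}: {{{inner}}}"

    def render(self):
        """Return each edge in the same notation used in the text."""
        return [f"{self._fmt(e.source)} -> {self._fmt(e.target)}" for e in self.edges]


def build_steady_state():
    """Two code nodes, each requiring schema S at version 1."""
    g = AttributedGraph()
    g.add_node("C1")
    g.add_node("C2")
    g.add_node("S", version=1)
    g.add_edge("C1", "S")
    g.add_edge("C2", "S")
    return g
```

Calling build_steady_state().render() gives back the two edges in the arrow notation from the paragraph above, which is a cheap way to check that the model and the prose agree.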

Breaking the steady state means incrementing the version of the schema so that it becomes S: {version: 2}. This change needs to be propagated throughout our existing graph. In this case it means the code edges are now invalid: C1: {} -!-> S: {version: 2}, C2: {} -!-> S: {version: 2}. We also track the reason for the edge invalidation, which is the set of attributes that changed at S to invalidate the edges. The reconciliation process to restore the steady state is simple: we need to restart the code nodes. The restart/deployment process can itself be expressed as another set of operations on an attributed graph, but for now I’m just going to assume there is a process somewhere that does the rolling restart and reports back to the graph that the steady state requirements have been fulfilled, which gets us back to the original graph structure with the new schema version.
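Here is a rough sketch of that invalidate-then-reconcile cycle using plain dictionaries. Everything in it is an assumption on my part: the function names (steady_state, update_attrs, broken_edges, report_restarted) are hypothetical, and the rule "any attribute change on a node invalidates every edge pointing at it" is the simplest policy I could think of, not the only reasonable one.

```python
def steady_state():
    """Nodes and edges for C1 -> S and C2 -> S, all edges valid."""
    nodes = {"C1": {}, "C2": {}, "S": {"version": 1}}
    edges = [
        {"source": "C1", "target": "S", "valid": True, "reason": set()},
        {"source": "C2", "target": "S", "valid": True, "reason": set()},
    ]
    return nodes, edges


def update_attrs(nodes, edges, name, **changes):
    """Change attributes on a node and invalidate every edge that targets it.

    The names of the attributes that actually changed are recorded on each
    invalidated edge as the reason for the invalidation.
    """
    changed = {k for k, v in changes.items() if nodes[name].get(k) != v}
    nodes[name].update(changes)
    if changed:
        for edge in edges:
            if edge["target"] == name:
                edge["valid"] = False
                edge["reason"] |= changed
    return changed


def broken_edges(edges):
    """Edges currently written as -!-> in the text."""
    return [e for e in edges if not e["valid"]]


def report_restarted(edges, source):
    """Called by the (assumed, external) rolling restart process once
    `source` is running with updated code."""
    for edge in edges:
        if edge["source"] == source:
            edge["valid"] = True
            edge["reason"] = set()
```

Bumping S to version 2 with update_attrs marks both edges broken with the reason {"version"}, and each report_restarted call repairs the edges of one code node, so a rolling restart shows up in the graph as the list of broken edges shrinking one node at a time.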

Now imagine your entire software system represented as an attributed graph. The management of the system becomes all about adding and removing edges and reconciling broken ones. Whenever new implicit assumptions are found, they are added to the graph as another set of attributed nodes and edges, along with the invariants and reconciliation processes for restoring those nodes and edges to a steady state.
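One way to picture "edges along with invariants and reconciliation processes" is to attach both directly to each edge as functions. Again, this is a sketch and not a design: Requirement, broken, reconcile_all and build_example are names I made up, and the loaded_schema attribute on the code nodes is an extra assumption I’m adding so the invariant has something to compare against. The "restart" here is simulated by copying the version over.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Attrs = Dict[str, object]


@dataclass
class Requirement:
    """An edge carrying its own invariant and its own way of being repaired."""
    source: str
    target: str
    invariant: Callable[[Attrs, Attrs], bool]   # True when the edge is healthy
    reconcile: Callable[[Attrs, Attrs], None]   # restores the invariant


def broken(nodes: Dict[str, Attrs], reqs: List[Requirement]) -> List[Requirement]:
    """Every requirement whose invariant does not currently hold."""
    return [r for r in reqs if not r.invariant(nodes[r.source], nodes[r.target])]


def reconcile_all(nodes: Dict[str, Attrs], reqs: List[Requirement]) -> List[Requirement]:
    """Run the reconciliation process for each broken requirement."""
    to_fix = broken(nodes, reqs)
    for r in to_fix:
        r.reconcile(nodes[r.source], nodes[r.target])
    return to_fix


def schema_matches(code: Attrs, schema: Attrs) -> bool:
    return code.get("loaded_schema") == schema.get("version")


def simulated_restart(code: Attrs, schema: Attrs) -> None:
    # Stand-in for a real restart: the code picks up the current schema version.
    code["loaded_schema"] = schema["version"]


def build_example():
    nodes = {
        "C1": {"loaded_schema": 1},
        "C2": {"loaded_schema": 1},
        "S": {"version": 1},
    }
    reqs = [
        Requirement("C1", "S", schema_matches, simulated_restart),
        Requirement("C2", "S", schema_matches, simulated_restart),
    ]
    return nodes, reqs
```

With this shape, a newly discovered implicit assumption becomes one more Requirement in the list, and the management loop stays the same: find what is broken, run its reconciliation, check again.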

If you squint just enough you’ll see the beginnings of a data flow problem expressed as an attributed graph. Combine that with some kind of graph traversal, or even an object system for passing messages back and forth along edges, and this all kinda looks like a Smalltalk environment with a graph database behind the scenes.