Of Semantic Code Vectors, Migrations, and Continuity

I’m knee-deep in some legacy build scripts and realized a few things while trying to make sense of the code. This is an outline of some of those realizations.

When we write code we perform two distinct kinds of activity. One is making assumptions and encoding them in the structure of the code. The other is building something on top of that assumed structure. Most bugs are a result of interleaving and confusing the two. If we make assumptions that are not reflected in the structure and then build on them, we are writing buggy code. If we build on assumptions that are confusing and overlap with one another, the code will necessarily be confusing too.
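To make the distinction concrete, here is a minimal sketch in Python. The names (TargetName, output_path) and the rule itself (a build target name is non-empty and has no whitespace) are hypothetical, picked only for illustration. The assumption is encoded once, in a type, and everything else builds on that type instead of silently relying on the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetName:
    """Hypothetical type encoding one assumption in the structure of the code:
    a target name is non-empty and contains no whitespace."""

    value: str

    def __post_init__(self) -> None:
        # The assumption is checked in exactly one place.
        if not self.value or any(ch.isspace() for ch in self.value):
            raise ValueError(f"invalid target name: {self.value!r}")


def output_path(target: TargetName) -> str:
    # Building on the assumption: no re-validation is needed here, because a
    # TargetName cannot be constructed unless the check above has passed.
    return f"build/{target.value}.o"
```

Had output_path taken a plain str, the same assumption would still be there, just invisible to the next reader, which is the situation described above.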

One way to write better code, then, is to make assumptions that are as orthogonal as possible and to make their structure as obvious in the code as possible. That way, when someone else comes along, they can see the semantic vectors (the independent assumptions the code is built on) and structure their own code around them. If their changes are incompatible with the existing vectors, making that incompatibility as obvious as possible will only help them.

So one way to make code better is to “orthogonalize” the semantic vectors by finding the principal vectors, that is, the principal ideas and assumptions. There is no automatic way to do this, so you have to use your judgement to figure out which migrations are necessary to make it happen. When I refactor code with the intent of orthogonalizing the semantic vectors I usually end up with more clearly structured code, so anecdotally this seems to be a good heuristic.
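As a rough illustration of what orthogonalizing can look like, suppose a build script passes around a single string like "linux-debug" that entangles two unrelated assumptions. The sketch below is hypothetical (the names Platform, Optimization, BuildConfig, and the specific flags are mine, not from any real script). It splits the string into two independent axes, so each function depends on exactly one of them.

```python
from dataclasses import dataclass
from enum import Enum


class Platform(Enum):
    LINUX = "linux"
    MACOS = "macos"


class Optimization(Enum):
    DEBUG = "debug"
    RELEASE = "release"


@dataclass(frozen=True)
class BuildConfig:
    platform: Platform
    optimization: Optimization


def parse_mode(mode: str) -> BuildConfig:
    """Split an entangled string such as 'linux-debug' into two independent axes."""
    platform, optimization = mode.split("-", 1)
    return BuildConfig(Platform(platform), Optimization(optimization))


def compiler_flags(opt: Optimization) -> list:
    # Depends only on the optimization axis.
    return ["-O0", "-g"] if opt is Optimization.DEBUG else ["-O2"]


def shared_lib_suffix(platform: Platform) -> str:
    # Depends only on the platform axis.
    return ".so" if platform is Platform.LINUX else ".dylib"
```

With this shape, adding a new platform touches shared_lib_suffix and nothing that deals with optimization, which is roughly what I mean by the vectors being orthogonal.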

This analogy with linear algebra only goes so far, though. Sometimes we also need to bring in a bit of topology and understand when we are making “discontinuous” changes. Discontinuous changes alter the space itself: the fundamental assumptions we were using to structure the code. I’m not aware of any good heuristics for this. Usually people do big-bang rewrites, but that’s probably because we don’t have a good theory of discontinuous changes for code. Without a good topological theory of code, it’s hard to know when a change is continuous and when it is not. The closest analogy I know of is database migrations. A migration can be considered a discontinuous change, and the work is about transforming the surrounding code to fit the assumptions of the new, migrated schema.
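Here is a small sketch of the migration analogy using Python’s built-in sqlite3 module. The schema (a users table whose name column is split into first_name and last_name) and the function names are assumptions made up for the example. After the migration, code written against the old schema stops working outright and has to be transformed to fit the new one.

```python
import sqlite3


def migrate_split_name(conn: sqlite3.Connection) -> None:
    """Hypothetical migration: replace users.name with first_name and last_name."""
    conn.execute(
        "CREATE TABLE users_new "
        "(id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    # Read all old rows first, then write them out in the new shape.
    rows = conn.execute("SELECT id, name FROM users").fetchall()
    for user_id, name in rows:
        first, _, last = name.partition(" ")
        conn.execute(
            "INSERT INTO users_new (id, first_name, last_name) VALUES (?, ?, ?)",
            (user_id, first, last),
        )
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")
    conn.commit()


def display_name(conn: sqlite3.Connection, user_id: int) -> str:
    # Surrounding code rewritten to fit the assumptions of the new schema.
    first, last = conn.execute(
        "SELECT first_name, last_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return f"{first} {last}".strip()
```

There is no gradual path here for a query that selects name: it either runs against the old schema or fails against the new one, which is why the change feels discontinuous to me.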

It is unfortunate that most of our tools do not expose the semantic vectors or let us operate on them as first-class citizens. As long as it is easier to build on faulty assumptions, that’s what programmers will continue to do, and we will continue to have legacy code.