infrastructure boilerplate

1 shell script

There is no automation other than running this shell script whenever a commit is pushed to whatever branches are active. You don’t need to think very hard to see how this will break down at some point.
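
To make this concrete, here is roughly what that lone script might look like; the project layout, the pip/pytest commands, and the environment variable are illustrative assumptions, not the actual script from this story.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the single CI script. Every push triggers this,
# and its exit code is the entire build result.
set -euo pipefail

git fetch origin
git checkout "${COMMIT_SHA:?need a commit to test}"

# Dependencies are installed from scratch on every run.
pip install -r requirements.txt

# Run the whole test suite.
pytest tests/
```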

Time passes and, as you predicted, that 1 shell script is starting to show some cracks. Although it is not visible to everyone, people are starting to notice that things are taking longer and longer. Commits are not making it into the mainline branch quickly enough. The lag lets changes accumulate in various branches, and when it is time to consolidate them everything conflicts with everything else. So the first order of business is to make people aware of the conflicts ahead of time instead of when it is time to consolidate into the mainline branch.

2 shell scripts

One of those scripts is the one that was running the tests and the other is a set of sanity checks to make sure a branch has not accumulated changes that conflict with mainline, because otherwise you’d be wasting resources and then wasting those resources again once the conflicts are resolved. The 2nd shell script lets you detect such waste and abort things early. This is fine but now you have introduced a visibility problem. People previously would commit changes, wait an hour or however long they were used to waiting, and then check to see what was going on. Now, with this early cancellation system, those people are basically idling. We need to notify these people about failures, but how?
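
A minimal sketch of that 2nd script, assuming mainline is called main and that a trial merge is a good enough conflict check:

```bash
#!/usr/bin/env bash
# Sketch of the early-abort check: bail out before spending test resources
# if the branch no longer merges cleanly into mainline.
set -euo pipefail

git fetch origin main

# Attempt the merge without creating a commit; a failure means conflicts.
if ! git merge --no-commit --no-ff origin/main; then
    git merge --abort
    echo "Branch conflicts with mainline; aborting before tests run." >&2
    exit 1
fi

# Clean up the trial merge (no-op if the branch was already up to date).
git merge --abort 2>/dev/null || true
```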

Most of the people working on the code are unaware of much of the surrounding infrastructure, which means the notifications will have to be on their terms. Context switching is expensive, so you will need to ping them on the same channels where the work happens. Usually those channels are GitHub and Slack. How are you going to do this, though? You will need some kind of workflow management system to track commits, figure out whether tests have passed or failed, notify the relevant stakeholders, and somehow make the whole thing a coherent experience. We’ll wave a magic wand and assume you figured out this part and it’s deployed and running as some kind of service.
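
As a sketch of the reporting half of that service, assume a GitHub commit status and a Slack incoming webhook are the two notification channels; the repo name, token, and webhook URL below are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: report a finished run back to the channels where the work happens.
set -euo pipefail

SHA="$1"     # commit that was tested
STATE="$2"   # "success" or "failure"

# Attach a commit status so the result shows up next to the commit on GitHub.
curl -sS -X POST \
  -H "Authorization: Bearer ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github+json" \
  -d "{\"state\": \"${STATE}\", \"context\": \"ci/tests\"}" \
  "https://api.github.com/repos/example-org/example-repo/statuses/${SHA}"

# Ping the team channel through a Slack incoming webhook.
curl -sS -X POST \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"CI for ${SHA} finished: ${STATE}\"}" \
  "${SLACK_WEBHOOK_URL}"
```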

2 shell scripts, 1 service

So far this is all to make the testing pipeline somewhat more efficient in terms of reporting and resource usage. We still haven’t tackled the actual issues inherent in those shell scripts. Currently, dependencies are installed from scratch every time, across however many hosts and agents the testing pipeline is running on. This was the expedient and prudent thing to do when there were few dependencies, but as the code has churned so has the set of dependencies. Installing them has ballooned into a pretty significant portion of the setup time, so it needs to be fixed somehow. The obvious thing to do is figure out how to re-use the assets from one invocation of the pipeline to the next.
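
One way this tends to look in practice, assuming the same hypothetical Python project with a requirements.txt lockfile; the cache path and hashing scheme are illustrative:

```bash
#!/usr/bin/env bash
# Sketch of per-host dependency caching: key installed dependencies by a hash
# of the lockfile and reuse them whenever nothing has changed.
set -euo pipefail

CACHE_ROOT="${HOME}/.ci-cache/deps"
KEY="$(sha256sum requirements.txt | cut -d' ' -f1)"
CACHE_DIR="${CACHE_ROOT}/${KEY}"

if [ -d "${CACHE_DIR}" ]; then
    echo "Cache hit for ${KEY}, reusing installed dependencies."
else
    echo "Cache miss for ${KEY}, installing from scratch."
    mkdir -p "${CACHE_DIR}"
    pip install --target "${CACHE_DIR}" -r requirements.txt
fi

# Make the cached packages visible to the test run.
export PYTHONPATH="${CACHE_DIR}${PYTHONPATH:+:${PYTHONPATH}}"
```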

3 shell scripts, 1 service

The setup code that runs before the tests is starting to get pretty convoluted. There are hashes and directories for stashing things all over the place. On the bright side, you have now saved several minutes per run, and when running the same code across 40-100 hosts the time saved adds up pretty quickly. This is working out pretty well, but it is still limited to a single host at a time. All these caches are nice, but populating them on a cold start still takes significant time, so sharing the caches across hosts will save even more time when things are starting cold. There is nothing on the market to do this, though. You need something like BitTorrent, but not BitTorrent exactly: something that will take a file and then allow clients to connect to the original source to get that file. We again wave a magic wand and assume this service and its associated clients are up and running.
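
The client side of such a service might look something like this; the service URL, its artifact endpoint, and the tarball layout are all made up for illustration:

```bash
#!/usr/bin/env bash
# Sketch of a client for the hypothetical cache-distribution service: fetch a
# dependency tarball keyed by the lockfile hash, fall back to a cold install,
# and publish the result so other hosts can skip the cold start.
set -euo pipefail

CACHE_SERVICE="https://cache.internal.example.com"
KEY="$(sha256sum requirements.txt | cut -d' ' -f1)"

mkdir -p ./deps
if curl -fsS -o deps.tar.gz "${CACHE_SERVICE}/artifacts/${KEY}"; then
    tar -xzf deps.tar.gz -C ./deps
else
    pip install --target ./deps -r requirements.txt
    tar -czf deps.tar.gz -C ./deps .
    # Upload so the next cold host downloads instead of reinstalling.
    curl -fsS -X PUT --data-binary @deps.tar.gz "${CACHE_SERVICE}/artifacts/${KEY}"
fi
```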

3 shell scripts, 2 services, and associated client libraries

I’d like to add that the services are pretty non-trivial. They have produced significant time savings, but now the entire pipeline is complicated enough that 1 or 2 people are working on it full-time. Our story is not finished yet, and just to run a CI pipeline at scale we have already added so much boilerplate that it requires full-time employees to manage it. Not only that, but the pipeline has become highly specialized and has diverged significantly from what someone would need to do locally to test a piece of code. So we have made things more efficient, but at the same time we have added cognitive overhead and potentially disempowered the very people who should be making changes to the CI pipeline.

N shell scripts, M services

This process continues until an entire department is dedicated to managing the services that support the software development pipeline. There is nothing agile about it. It is not elegant. No one remembers how things worked back when the first iteration was just 1 shell script, and the current status quo is accepted as the truth and the way things should be.