Lack of rigor

TL;DR The dream of DevOps has yet to be realized and we continue to lack the tools to tame and manage software complexity at every level of the stack.

I spend most of my days making software systems in the cloud more efficient and manageable. The tools I use day to day are woefully inadequate for getting the job done because they’re mostly YAML DSLs. It is unfortunate that as an industry we have settled on YAML as the DevOps standard for tools and utilities. Besides YAML there are some custom DSLs, e.g. Puppet and HCL, but they’re no better than their YAML counterparts. Fundamentally what I need is something like Prolog + SQL + MiniZinc + Alloy + Isabelle/HOL in some coherent package, but what I have is a bunch of YAML files and custom DSLs glued together with bash scripts, makefiles, and for some odd reason Go binaries. Layer the container and serverless craze on top of this and you have a pretty nice recipe for burnout. I’m going to first talk about the nonsense in the container craze and then circle back to the YAML/custom DSL problem.

As an industry we keep jumping from one fad to the next while re-creating the abstraction facilities at the lower (kernel) layers in the higher (userspace) layers. Docker and Kubernetes are a perfect example of this phenomenon. Docker abstracts dependencies by packaging the entire file system but at the same time it doesn’t quite know what to do with the network so the abstraction leaks and heroic workarounds are necessary to bridge the gap. The answer given by the container gurus is to make all containerized applications “stateless” because stateless applications don’t need to worry about the network and can technically run anywhere. But even if your application is stateless it still needs to talk to other “stateless” applications and since we have forgotten all about IP addresses and networks we now need something else to fill in that gap and that’s where all the “service mesh” and “serverless” vendors make an appearance.

Before, we had IP addresses and ports, and now we have containers and “meshes”. This doesn’t feel like we have made the right trade-offs. Because we abstracted the file system, the network abstraction started to leak, and to fix that leak we re-invented a whole new class of networking abstractions. We took two steps back, then another step back, and have yet to take the three steps forward that would put us back where we started with virtual machines. This is all at the file system and network layer, and we still haven’t even gotten to the problem of managing and controlling these containers, which for some reason now require a distributed control plane. The operating system used to manage processes just fine, but since we have shifted all those responsibilities to containers and userspace processes we now also need to address that problem at the userspace level, and this is where Docker Swarm and Kubernetes make an appearance.

Swarm and Kubernetes are glorified process managers the same way upstart and systemd are process managers. We have again shifted kernel responsibilities into userspace for basically negative gains. No one quite knows what Kubernetes and Swarm actually do. It is impossible to debug them when they do fail and the only solution is to “turn it off and on again”. I guess it’s nice to know that these systems can still be reset to a clean state but I could do that just as easily with AWS APIs for VMs so again I have to ask what exactly have we gained? Why couldn’t we have built a nicer interface to the existing cloud APIs for controlling VMs and networks? What exactly is the killer feature of containers other than abstracting the file system?

I don’t have any answers to these questions and I’ve looked far and wide. Anyone who deploys Kubernetes talks about what a nightmare it is to maintain, and everyone else seems to just offload the maintenance of these clusters to the cloud vendors. Containers and all the extra layers associated with them don’t seem to provide any killer advantages, because if they did, any company that wasn’t using them would be wiped out, and yet plenty of companies still manage to survive and be profitable without them. I think the only people that care about these things are the ones selling the equivalent of shovels for containers, because I don’t think anyone else quite gets what value there is in a distributed process manager and service mesh.

So the container ecosystem is mostly a mess and I don’t think it is going to get any better, because everything is 10x more complicated than it needs to be and built for imaginary use cases that only Google, Amazon, and Facebook seem to care about. Chumps like me who still manage fewer than 100 or so VMs are left with the likes of Ansible, Salt, Chef, Puppet, Terraform, and a few other things I’m probably forgetting. These things are broken for another reason: they’re all pseudo programming languages. At the end of the day all system management requires doing some things imperatively, but somehow all the DevOps thought leaders are convinced that “declarative” and “immutable” are better and that YAML DSLs are how you accomplish that.

Usually if you give up some capabilities you should get something in exchange, but I’ve written plenty of Puppet and Ansible and I don’t really see what I’ve gained. I guess there is an “inventory” file now so that’s nice, but these Ansible playbooks are just as much of a mess as the old Ruby and bash scripts I used to write. My CI/CD is still using bash and some other custom YAML configuration format, so it’s not like the Ansible playbooks are more portable. To configure the local Vagrant VM I still need to write a Vagrantfile and modify so many variables that the production playbooks are actually useless, and there are basically two different sets of files that need to be kept in sync manually. There is no simple way to test these things even though in theory they’re not Turing complete, so there should be some kind of abstract interpreter to tell me when things are broken, but all I get is a dry-run mode (“--check” in Ansible, “--noop” in Puppet) which doesn’t work in 90% of the cases I need it to work. So all I’ve done in the end is replace a bunch of bash and Ruby with a bunch of YAML that compiles to bash or is interpreted by some Python or Ruby process and makes a bunch of fork/exec calls.
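To make the “abstract interpreter” idea a bit more concrete, here is a minimal sketch in Python of the kind of check I have in mind. It is not how Ansible or Puppet work internally. The Task type, the requires/provides fields, and check_playbook are all hypothetical names invented for this illustration, and the example assumes each step can be boiled down to a set of facts it needs and a set of facts it guarantees. The checker walks the steps in order without executing anything and reports every step whose prerequisites nobody has established yet.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """One declarative step (hypothetical model, not a real Ansible type)."""
    name: str
    requires: frozenset = field(default_factory=frozenset)  # facts needed before running
    provides: frozenset = field(default_factory=frozenset)  # facts guaranteed afterwards


def check_playbook(tasks, initial_facts=()):
    """Walk the tasks in order without executing anything.

    Returns a list of (task name, sorted missing facts) for every task whose
    requirements are not covered by the initial facts or by an earlier task.
    """
    facts = set(initial_facts)
    problems = []
    for task in tasks:
        missing = task.requires - facts
        if missing:
            problems.append((task.name, sorted(missing)))
        # Assume the task succeeds so that later tasks are still checked.
        facts |= task.provides
    return problems


# A deliberately broken playbook: the cache update comes too late and
# nothing ever puts the nginx config in place.
playbook = [
    Task("install nginx",
         frozenset({"apt cache updated"}),
         frozenset({"nginx installed"})),
    Task("update apt cache",
         frozenset(),
         frozenset({"apt cache updated"})),
    Task("start nginx",
         frozenset({"nginx installed", "nginx config present"}),
         frozenset({"nginx running"})),
]

for name, missing in check_playbook(playbook):
    print(f"{name}: missing {missing}")
```

Running this flags “install nginx” (the apt cache is updated only afterwards) and “start nginx” (no step ever provides the config), all without touching a single machine. A real tool would need a far richer model of state than a set of strings, but even this much is more than a dry run that has to talk to the target host gives me.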

I’m pretty sure there is a better way to do all this and I’m pretty sure nothing will change as long as we continue to worship celebrities instead of learning the fundamentals.