I think I’ve now been in enough places to be able to discern some common patterns across software engineering organizations. Unfortunately I have nothing good to report. Most of the patterns I’ve seen lead to misery for the humans and bit rot for the software.
Zach Tellman has a great blog post about how senior engineers in a software organization reduce risk. Similarly, Dan Luu has an equally great blog post about why big organizations have specialists in domains that you would not imagine. You should definitely read both because each one distills a great deal of knowledge into a nice essay. The common theme in both is the value of specialists and the benefits that specialists enable at the organizational level. The unfortunate truth, though, is that most software businesses have no idea what they’re doing, what risks they’re trying to mitigate, or what pipelines they are trying to optimize to reduce costs. The survivors accidentally stumble on the building blocks that make large software organizations barely viable, and the newcomers are doomed to repeat the same mistakes and rediscover things all over again.
The main problem is that what works for small-scale software organizations does not work for large-scale ones. You have to be cognizant of all processes, tools, and incentives and be able to evolve them in a coordinated way across the entire organization. Bottom-up process and tool changes that are not aligned with existing incentives will not be enough. Similarly, top-down incentive changes that are misaligned with existing tools and processes will not work either. Whatever people are doing on the ground floor is usually the best way to get things done given the existing incentives and tools, so you can’t just change the incentives and hope for the processes and tools to smoothly catch up. At the bottom you can change processes and at the top you can change incentives, but to make effective changes you need to combine both in a coherent way. I have yet to be at any organization, large or small, that manages to pull off this trick. The changes are always a shock to the system that leads to a lot of human misery and very little positive progress.
My most recent experiences of incoherent bottom-up and top-down changes are from Zenefits. Even though I’m proud of the work I did there and the people I worked with, in hindsight, if I’m being honest, it was mostly just technical masturbation. We were able to build a pretty efficient CI and deployment pipeline running on top of AWS, GitHub, and Buildkite, but ultimately that was never the bottleneck that made Zenefits engineering horrible. We optimized the hell out of that pipeline and in the end it was way too magical for anyone to make sense of, even the people who had built it.
One reason we failed was that the developers were not incentivized to understand the underlying infrastructure and systems that supported their work. Without that back pressure, instead of being the people who built tools for the engineering organization, we became the black magicians who created a bunch of black boxes. Given the incentives in place, this was mostly inevitable. Developers were tasked with delivering features and we were tasked with getting those features into the master branch as quickly as possible, even though there were a few million lines of Python and 18k+ “unit” tests that each feature had to get through.
Our failures could have been mitigated somewhat if we had been working in an engineering organization that wasn’t so focused on shipping features at the expense of engineering proper systems. I was on a team that worked on engineering proper systems, but it was a quixotic task without the rest of the organization seeing and understanding the value of properly engineered systems. Some of the bits and pieces of the infrastructure managed to be maintainable and understandable, but the rest of it was so specialized that only one or two people ever managed to make sense of it.
It might also be the case that an engineering organization that balances shipping features and building proper supporting systems is economically and socially non-viable. It might be too expensive to hire engineers who can imbue the organization with those qualities. My opinion is slightly biased, but the infrastructure team came closest to satisfying most of those qualities. The problem there was that we had a lot of strong personalities. It seems like smart and curious engineers tend to be a little strong-willed as well, so the tools we built were embodiments of our own personalities. Things mostly worked, but because we didn’t have strong leadership in place there was nothing to tie it all together. Signals did not properly travel upstream and then downstream. All the tool and process changes in the world are meaningless if the rest of the engineering organization doesn’t get the memo and if those tools are not driving towards some singular vision.
Fundamentally, Zenefits never had technical problems. Whereas we (the infrastructure team) were trying to address everything from a technical perspective by driving process and tool changes, the rest of the engineering organization was hampered by a lack of strong leadership. Code quality was atrocious. There was never time set aside for cleaning up dead code and refactoring. Code was only added and never removed. There was much magical thinking around the next silver bullet without figuring out what made the previous silver bullet a failure. Even when we had accumulated enough political capital to push back and get the rest of the engineering organization to get their act together, the leadership balked and let everyone continue as they were. Instead of imbuing the organization with the proper qualities, we just ended up appeasing all of its worst qualities with a more efficient and optimized software delivery pipeline.