I’m currently reading “An Astronaut’s Guide to Life on Earth: What Going to Space Taught Me About Ingenuity, Determination, and Being Prepared for Anything”, and almost everything Chris Hadfield says applies not just to life in general but to software engineering in particular.
In chapter 3 he drives home the point about training and preparation. To become and remain an astronaut one must be constantly learning and practicing. The vacuum of space is hostile to life, and a wrong decision can have all sorts of cascading effects that lead to disaster. So instead of sitting around and moping, astronauts tackle the problem head on. They constantly run simulations of all sorts of disaster scenarios. They practice and drill so much that cool-headed thinking, even in the face of certain doom, becomes almost instinctive. Instead of freezing up, they “work the problem”. At one point he mentions that they even have simulations for what happens when somebody aboard the ISS (International Space Station) dies. It’s not just the people on the ISS that run this simulation; the people on Earth also go through what they would have to do if they found out a close friend or a loved one had died in space. Suffice it to say that astronauts don’t mess around when it comes to being prepared for almost anything.
How does all this apply to software engineering? If I asked you what happens when your master database dies, would you have an answer? If you don’t, then you need to start running a simulation of that exact scenario. You need to start “working the problem”. What about if one of your application servers gets hacked? Are you running the application inside a chroot jail? Is it isolated enough from the rest of the system that a compromised application server would not have long-lasting consequences? How easy is it to deploy previous versions of your code? Is your infrastructure “generative” in the sense that anyone on the team can re-create a working environment from scratch in a reasonable amount of time? What happens when one of your team members gets hit by a bus?
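One way to make sure these questions actually get asked is to turn them into a drill rotation you run on a schedule. Here is a minimal sketch of that idea in Python; the scenario names and responses are hypothetical and only mirror the questions above, not an exhaustive catalogue:

```python
import random

# Hypothetical catalogue of disaster scenarios to drill on.
# Each entry maps a scenario to the response you want to practice.
SCENARIOS = {
    "master database dies": "fail over / promote a replica",
    "application server hacked": "isolate the host, rotate credentials, redeploy",
    "bad deploy": "deploy the previous version of the code",
    "environment lost": "re-create a working environment from scratch",
    "team member hit by a bus": "someone else runs their runbooks end to end",
}

def pick_drill(rng=random):
    """Pick one scenario to simulate this week, astronaut-style."""
    scenario = rng.choice(sorted(SCENARIOS))
    return scenario, SCENARIOS[scenario]
```

The point isn’t the code itself, it’s the habit: pick a scenario, run the simulation for real, and write down every gap you hit.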
You need to be asking and answering such questions every week. You need to practice your responses to such disaster scenarios to the point that they become almost instinctive. For example, what do you do if the master database goes down? The first thing you need to do is figure out whether it went down because of high load. If it did, then you need to figure out what is causing the high load and mitigate it, because even if you fail over to a standby box, that one is going to fail as well and you’ll be back where you started. Once the source of the high load has been pinpointed and cordoned off, you can decide whether to fail over and gradually increase the load or, in the case of a DDoS attack, just ride it out. If your server failed because of a hardware issue, then you need to call the hardware support folks and get a replacement ASAP while you shift the load to the standby master. If you don’t have a standby master, then you need to promote one of the database slaves to a master while you create another master from a backup or through some other means. This is where having “generative” infrastructure helps, because anyone on your team, not just the DBA, should be able to do this.
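The database decision steps above can be written down as an explicit decision tree, which is exactly the kind of thing a runbook should contain. A minimal sketch in Python, where the failure causes and the returned actions are illustrative labels, not real tooling:

```python
from enum import Enum, auto

class FailureCause(Enum):
    HIGH_LOAD = auto()
    HARDWARE = auto()
    DDOS = auto()

def handle_master_db_failure(cause, load_mitigated=False, have_standby=True):
    """Return the next action for a master database failure,
    following the steps described above."""
    if cause is FailureCause.DDOS:
        # Failing over won't help; the standby would be hit too.
        return "ride it out"
    if cause is FailureCause.HIGH_LOAD:
        if not load_mitigated:
            # Fail over now and the standby fails the same way.
            return "pinpoint and cordon off the source of load first"
        return "fail over to standby, then gradually increase load"
    if cause is FailureCause.HARDWARE:
        if have_standby:
            return "shift load to standby; get replacement hardware ASAP"
        return "promote a replica to master; rebuild a master from backup"
```

Even a toy like this is useful in a drill: it forces the team to agree on the order of the steps before the pager goes off.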
That’s just for the database. You should be running such simulations for every component of your stack. Chances are you’ll stumble upon a few knowledge and process gaps along the way, which you’ll then hopefully rectify through better processes and tooling.