22/02/2023

Emergency Drills: Why Failing is Part of Success

Do you know the name Richard Koos? If not, you're in good company. During the Mercury, Gemini and Apollo space missions, Koos held a position that was little known outside NASA and not very glamorous compared to the "rock stars" of space travel, the astronauts. His position was "Simulation Supervisor", or SimSup for short, and without him the moon landing would very likely never have happened.

To be sure, a whole lot of people are responsible for the success of a project as big as the lunar landing - from the engineers and workers who assembled the rockets, to the programmers who broke truly new ground by developing the hardware and software for the lunar module, to the crews at ground stations around the world. Each one of these areas holds enough compelling stories for several full-length films. There's no shortage of dazzling names - whether it's Wernher von Braun, Margaret Hamilton, Katherine Johnson or household names like Armstrong, Collins and Aldrin.

The story that is the subject of this article, and whose implications still play a role today, takes place eleven days before the actual launch of the Apollo 11 mission. The group around SimSup Koos had a single purpose: to put the lunar module and mission control crews through the wringer mercilessly in realistic simulations in the three months before the launch. Workdays of 16 hours were not uncommon and left those involved "in a state beyond mere exhaustion," as Gene Kranz (among other things, Flight Director during the lunar landing and also on the later Apollo 13 mission) wrote in his memoirs entitled "Failure is not an Option."

Cut To Size

As exhausting and nerve-wracking as the simulations developed by Koos and his colleagues were, they were also a sporting competition. Successfully mastering a scenario was cause for celebration. But if you got cocky, you could be sure that you would be rigorously cut back down to size in a future simulation. At times, Koos pulled out all the stops, simultaneously presenting the astronauts and controllers with various problems with the electronics and the radio link to the simulated lander. At times, Kranz recalls, the crews had real trouble not falling behind the power curve. On one particularly stressful day, SimSup won the game hands down: the first landing simulation was a success, but two subsequent simulations ended in crashes. This significantly dampened the exuberance that Koos had already noticed cropping up in Mission Control several times.

Case 26

It was supposed to be the last simulation run before the launch of the lunar mission. But even though the vast majority of outstanding problems had been solved and this was the final rehearsal, that was no reason for Koos to show leniency. Before beginning a landing approach, he instructed his staff to "Load case number 26." He had a plan. "Let's see what these guys know about program alarms." During an approach, the Guidance and Navigation Officer received an alert message with the code "1201." No one had seen this message before. It had never appeared outside of software testing on the ground. A query from the Guidance and Navigation Officer to the associated development team yielded an answer that was of little help in the moment: "The computer is busier than hell for some reason. It doesn't have time to complete all the tasks." The realization quickly came: there was no plan, no procedure for this case.

Since the landing approach is an extremely critical phase, there wasn't much time to mull over a decision. So the confused controller reported to his flight director, whose decisions were final and unchallenged: "Flight, something's wrong with the computer - I've got a bunch of program alarms here. I think we should abort. Abort!" Startled by the dreaded word "abort," the staff member responsible for communicating with the astronauts (called "CapCom") spoke up, "Flight, do we call an abort?" Kranz quickly decided, "Affirmative, CapCom. Abort!"

Disappointment

Some of those in attendance were angry with Koos for denying them a landing and thus a successful dress rehearsal. The real shock followed in the debriefing.

Koos, who, as Kranz writes, "had a mind so razor sharp that you didn't realize you were bleeding until long after the thrust," surprised and stunned everyone by announcing, "That was not an abort scenario." Turning to Kranz, he said, "You violated one of the most fundamental rules of mission control, which says you need two key criteria to call an abort. And you only had one." That one hit home, hard. With a sinking feeling in the pit of their stomachs, the controllers conceded defeat. SimSup had won this last game. But the navigation computer software specialists immediately got to work. Quickly, a list was compiled of the alarm codes that represented an abort criterion. Despite the brutality of this last-minute wake-up call, they were eventually grateful to Koos.

When Exercise Pays Off

"If the dress rehearsal goes wrong, the premiere will be all the better," is an old superstition from the stage acting world. And this was to prove true here as well. During the real approach to the lunar surface, the computer spat out several error messages: 1202 and 1201. Neither was on the list of hard abort criteria, and the landing went ahead. The rest is history. You might be expecting this article to conclude with some variation of "and that's why emergency drills are so important - why don't you do some on the corporate IT network?" But that would be too simplistic. What this episode from history still teaches us today is that very few things are as simple as they seem.

Inseparable Units

Technology is an important component - without it, many things are simply impossible. But what was missing during the botched dress rehearsal, and what repeatedly turns out to be a factor in red-hot security incidents, are the corresponding processes. The two form an inseparable unit. If there is no suitable process behind the best technology, then the technology is of little use. Likewise, a process can never work without the appropriate means for technical implementation. The controller noticed in 1969 that he did not have a process for this situation. The only information he got was that the computer had a problem and could not complete all the tasks it was given. The experts on the ground, as well as the astronauts, were practically flying blind into unknown territory. And since the computer was one of the central components of the moon landing, a computer problem could quickly become a problem for the entire mission. Everyone else present knew this, too - and so an incomplete situation picture and a missing process turned into a command that, in a real emergency, would have caused the entire mission to fail - possibly even with a fatal outcome, because an abort during the landing approach was one of the nightmare scenarios. Every part of the abort sequence had to be one hundred percent correct, with zero margin of error, if you didn't want to risk a crash on the lunar surface or a collision with the spacecraft in lunar orbit.

Exactly the same mechanisms come into play in any major IT security incident. When the emergency occurs, defined processes and clearly regulated competencies are the most important things. And in this failed exercise, there was only one of these.

Testing Technology AND Processes

Exposing points of weakness is the only goal of exercises. That was true back then, and it's still true today. Only ever rehearsing things that have always worked well doesn't get you anywhere, even if it may feel good to succeed. Leaving your comfort zone is the point. Of course, it doesn't feel good to have your weaknesses pointed out in front of everyone. But especially in the context of a company, the best time for this is an exercise. In the case of a real attack on the corporate network, it is too late for that. The same applies to learning from exercises: it must be done as quickly as possible. If this learning effect fails to materialize, disaster is inevitable. Ruthless honesty - also towards oneself - is the only way out. Without exception, the participants reported that the real moon flight almost felt like an exercise - only the press representatives who were present and the TV coverage reminded them that this was the "real deal".

No Excuses

If something goes wrong, everyone will be aware of it and it can't be explained away. Another quote from the flight director's autobiography fits here, and I'd like to conclude with it: "There are no excuses. When you have failed, there are only two possible answers. Either 'I was wrong' or 'I don't know, but I will find out'."

 

Image credits:
Fig. 1: Steven Michael / Flickr
Fig. 2: Brett Sayles / Pexels