Mars Climate Orbiter, Python, and Type Systems…

Faithful readers know that I’m learning Scala, somewhat reluctantly. A few weeks ago, I was reading New Scientist magazine and saw this writeup of Peter Norvig, Director of Research at Google, which mentioned that he was on the review board that studied the crash of the Mars Climate Orbiter. That sparked a question in me, because I’d recently watched Norvig’s talk “What To Demand From A Scientific Computing Language,” which says “don’t spend too much time fretting about language choice” but goes on to argue that Python is a heckuva good scientific computing language. So I asked him:

The Mars Climate Orbiter disaster would seem to be a case study for a type system with more compile-time strictness. Obviously, an SI unit system is not part of Scala’s standard library, but I find it interesting that you have not cast all aside and picked up the banner of Haskell or what-have-you.

And he kindly responded with a typically pithy insight:

I don’t think the MCO incident has anything to say about language choice.  The problem involved reading data from a file and a miscommunication about what the numbers in the file were.  I don’t know of any language, no matter how type-strict, that forces you to tag the string “123.45” in a file with the units of force (newtons vs foot-pounds), nor do I know of any language, no matter how type-loose, in which you could not impose such a convention if you wanted to.

This, for me, is the last word in the argument for and against manifest static typing: you can always get around it on the one hand, and you can always enforce preconditions on the other.
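Norvig’s two-sided point can be sketched in code. The following is a minimal illustration, not any real flight-software API — the `Force` class, `read_thrust` function, and the file format are all hypothetical. Even a dynamically typed language like Python lets you enforce a units convention at the parsing boundary:

```python
# A sketch of imposing a units convention in a "type-loose" language:
# force values must arrive tagged with a unit, and untagged numbers
# are rejected at the point where the file is read.

NEWTONS_PER_POUND_FORCE = 4.4482216152605  # exact conversion factor

class Force:
    """A force value tagged with its unit at the point of data entry."""

    def __init__(self, magnitude, unit):
        if unit not in ("N", "lbf"):
            raise ValueError(f"unknown force unit: {unit!r}")
        self.magnitude = float(magnitude)
        self.unit = unit

    def to_newtons(self):
        if self.unit == "N":
            return self.magnitude
        return self.magnitude * NEWTONS_PER_POUND_FORCE

    def __add__(self, other):
        # Mixing units is now safe: both operands normalize to newtons.
        return Force(self.to_newtons() + other.to_newtons(), "N")

def read_thrust(line):
    """Parse a 'value unit' record, rejecting untagged numbers."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"force value missing a unit tag: {line!r}")
    return Force(parts[0], parts[1])

total = read_thrust("123.45 lbf") + read_thrust("10.0 N")
```

The inverse holds too: in the strictest statically typed language, nothing stops you from parsing the file straight into a bare floating-point number, which is why the convention, not the type system, is what does the work.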

Although I have a marginal preference for explicit/manifest typing in function signatures because I think that it helps readability, it’s horses for courses — if you say you think explicit typing hinders readability I can’t say you’re wrong (much less get worked up about it).
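For what it’s worth, the trade-off looks like this (a Python sketch with hypothetical function names; the same contrast applies to explicit parameter types in Scala):

```python
import math

# Inferred/implicit style: shorter, but the reader must recover the
# types (and the units!) from context.
def orbital_period(semi_major_axis, mu):
    return 2 * math.pi * math.sqrt(semi_major_axis ** 3 / mu)

# Manifest style: the signature documents itself, at the cost of noise.
def orbital_period_typed(semi_major_axis_m: float,
                         mu_m3_per_s2: float) -> float:
    return 2 * math.pi * math.sqrt(semi_major_axis_m ** 3 / mu_m3_per_s2)
```

Both compute the same thing; whether the second signature reads better is exactly the matter of taste at issue.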

Regrettably, this is not to say that I don’t get worked up in type-system discussions: I do. But I get worked up by the over-simplifications that say ‘your way is fundamentally wrong.’ We’re so beyond the days when you could make that argument: the mainstream now includes large systems written in dynamic languages and in type-inferred languages that minimize finger-typing (I’m looking at you, C#. No one’s calling you a Java clone now, are they?).

So, if the type system wasn’t at fault, what were the causes? Norvig explains:

Beyond the initial error, the reasons why the error proved fatal were more around organizational structure than around language choice:
(1) An anomaly was detected early on, but was not entered into an official issue-tracking database. Better practices would force all such things to be tracked.
(2) The team was separated between JPL in California and Lockheed-Martin in Colorado, so there were no lunch-time discussions about “hey, did you get that anomaly straightened out?  No?  Well, let’s look into it more carefully…”
(3) The faulty code was not carefully code-reviewed, because of improper code re-use.  On the previous mission, this file was just a log file, not used during flight operations, and so was not subject to careful scrutiny.  In MCO, the file and surrounding code was re-used, but then at some point they promoted it to be a part of actual navigation, and unfortunately nobody went back and subjected the relevant code to careful review.
(4) Bad onboarding process of new engineers: The faulty code was written by a new engineer — first week (or maybe first month or so) on the job. This was deemed ok because originally it was “just a log file”, not mission-critical.

There you go: It’s not type systems. It’s not where you put the brackets (Allman-style, all hail!). It’s defect management, team dynamics, code review, and onboarding. Now those things are worth getting worked up about.