Archive for July 2011

Project Daytona: Iterative MapReduce on Windows Azure – Microsoft Research

Mars Climate Orbiter, Python, and Type Systems…

Faithful readers know that I’m learning Scala, somewhat reluctantly. A few weeks ago, I was reading New Scientist magazine and saw this writeup of Peter Norvig, Director of Research at Google, which mentioned that he was on the review board that studied the crash of the Mars Climate Orbiter. That sparked a question in me, because I’d recently watched Norvig’s talk “What To Demand From A Scientific Computing Language” which says “don’t spend too much time fretting about language choice,” but further goes on to argue that Python is a heckuva good scientific computing language. So I asked him:

The Mars Climate Orbiter disaster would seem to be a case study for a type system with more compile-time strictness. Obviously, an SI unit system is not part of Scala’s standard library, but I find it interesting that you have not cast all aside and picked up the banner of Haskell or what-have-you.

And he kindly responded with a typically pithy insight:

I don’t think the MCO incident has anything to say about language choice.  The problem involved reading data from a file and a miscommunication about what the numbers in the file were.  I don’t know of any language, no matter how type-strict, that forces you to tag the string “123.45″ in a file with the units of force (newtons vs foot-pounds), nor do I know of any language, no matter how type-loose, in which you could not impose such a convention if you wanted to.

This, for me, is the last word in the argument for and against manifest static typing: you can always get around it on the one hand, and you can always enforce preconditions on the other.

Although I have a marginal preference for explicit/manifest typing in function signatures because I think that it helps readability, it’s horses for courses — if you say you think explict typing hinders readability I can’t say you’re wrong (much less get worked up about it).

Regretfully, this is not to say that I don’t get worked up in type-system discussions: I do. But I get worked up by the over-simplifications that say ‘your way is fundamentally wrong.’ We’re so beyond the days when you could make that argument: there are large systems written in dynamic languages and type-inferred languages that minimize finger-typing in the mainstream (I’m looking at you, C#. No one’s calling you a Java clone now, are they?).

So, if the type system wasn’t at fault, what were the causes? Norvig explains:

Beyond the initial error, the reasons why the error proved fatal were more around organizational structure than around language choice:
(1) An anomaly was detected early on, but was not entered into an official issue-tracking database. Better practices would force all such things to be tracked.
(2) The team was separated between JPL in California and Lockheed-Martin in Colorado, so there were no lunvh-time discussions about “hey, did you get that anomaly straighten out?  No?  Well, let’s look into it more carefully…”
(3) The faulty code was not carefully code-reviewed, because of improper code re-use.  On the previous mission, this file was just a log file, not used during flight operations, and so was not subject to careful scrutiny.  In MCO, the file and surrounding code was re-used, but then at some point they promoted it to be a part of actual navigation, and unfortunately nobody went back and subjected the relevant code to careful review.
(4) Bad onboarding process of new engineers: The faulty code was written by a new engineer — first week (or maybe first month or so — on the job. This was deemed ok because originally it was “just a log file”, not mission-critical.

There you go: It’s not type-systems. It’s not where you put the brackets (Allman-style, all hail!). Defect management, team dynamics, code review, and onboarding. Now those things are worth getting worked up about.

Getting To Know Scala: Project Euler Primes

Prime numbers are not my thing, but generating them is a common task in the early Project Euler problems. The one algorithm I know for generating primes is the Sieve of Eratosthenes, which I defined in Scala as:

def successor(n : Double) : Stream[Double] = Stream.cons(n, successor(n + 1))

def sieve(nums : Stream[Double]) : Stream[Double] = Stream.cons(nums.head, sieve ((nums tail) filter (x => x % nums.head != 0)) )

val prime_stream = sieve(successor(2))

The first function is the only function that I’ve ever written that I’m sure is properly “functional.” It’s stuck in my head from circa 1982 LISP. It uses Scala’s Stream class, which is like a List but is “lazily evaluated,” in other words, it only calculates the next value in the List when it’s needed (the Stream pattern is to create a List whose head is the next value and whose tail is a recursive call that, when executed will produce the next value).

The 2nd function sieve is my take on the Sieve of Eratosthenes. It too returns a Stream of primes. (By the way, the reason I use Double rather than an Int or Long is that one of the early Project Euler problems involves a prime larger than LONG_MAX.)

In case you’re not familiar with the algorithm, the Sieve is conceptually simple. Begin with a list containing all positive integers starting at 2 (the first prime) [2, 3, 4, ...] . Remove from the list every multiple of your current prime. The first number remaining is the next prime. For instance, after removing [2, 4, 6, ... ], the first number remaining is 3. Prime! So remove [3, 6 (already removed), 9, ... ]. Since 4 was removed as a multiple of 2, the next available is 5. Prime! Remove [5, 10 (already removed), 15 (already removed), ...] …

The 7th Project Euler problem is “What is the 10001st prime number?” Unfortunately,

scala> prime_stream take 10001 print
2.0, 3.0, 5.0, 7.0, ...SNIP ... 29059.0, 29063.0, java.lang.OutOfMemoryError: Java heap space
 at scala.Stream$cons$.apply(Stream.scala:62)
 at scala.Stream.filter(Stream.scala:381)
 at scala.Stream$$anonfun$filter$1.apply(Stream.scala:381)
 at scala.Stream$$anonfun$filter$1.apply(Stream.scala:381)
 at scala.Stream$cons$$anon$2.tail(Stream.scala:69)
 at scala.Stream$$anonfun$filter$1.apply(Stream.scala:381)
 at scala.Stream$$anonfun$...

That will never do. Obviously, I could run Scala with more heap space, but that would only be a bandage. Since a quick Google search shows that the 1000th prime number is 104,729 and I’m running out of heap space near 30K, it seems that “messing around with primes near the 10Kth mark” requires some memory optimization.

Converting the Sieve

If I really wanted to work with very large sequences of primes, I should certainly move away from the Sieve of Eratosthenes. But I’m not really interested in prime number algorithms, I’m interested in the characteristics of the Scala programming language, so I’m going to intentionally ignore better algorithms.

My first thought was “OK, I’ll allocate a chunk of memory and every time I find a prime, I’ll set every justFoundPrime-th bit to 1.” But that would depend upon my allocated memory being sufficient to hold the nth prime. With my Google-powered knowledge that the 10001st prime is only 100K or so, that would be easy enough, but (a) it seemed like cheating and (b) it would require a magic number in my code.

My next thought was “OK, when I run out of space, I’ll dynamically allocate double the space– no, wait, I only need justFoundPrime-(2 * justFoundPrime) space, since I’ve already checked the numbers up to justFoundPrime.”
My next thought was “And really I only need half that space, since I know 2 is a prime and I can just count by 2…And, y’know, I know 3 is prime too, so I can check–” At which point, I engaged in a mental battle over what was appropriate algorithmic behavior.

On the one hand, I didn’t want to change algorithms: if I moved from the Sieve to a slightly better algorithm, then wasn’t it Shameful not to move to at least a Quite Good algorithm? On the other hand, the instant I opened the door to allocating new memory, I committed to keeping around the list of already-discovered primes, since I would have to apply that list to my newly-allocated memory. But if you have a list of numbers, checking if your candidate number is a multiple of any of them can be done without consuming any additional memory. But is it the same algorithm? Isn’t the Sieve fundamentally about marking spots in a big array?

Finally, I decided that checking a candidate number against the list of already-discovered primes was the Sieve algorithm, just with a smallest possible amount of memory allocation — one number. (By the way, did you read the article in which scientists say that rational thought is just a tool for winning arguments to which you’re already emotionally committed?)

Here then, is what I wrote:

def multiple_of = (base : Long, target : Long) => target % base == 0;

val twoFilter = multiple_of(2, _ : Long)
val threeFilter = multiple_of(3, _ : Long)

println(twoFilter(4))
println(twoFilter(5))
println(twoFilter(6))
println(threeFilter(4))
println(threeFilter(5))
println(threeFilter(6))

The first function multiple_of is a function literal (?) that returns true if the target is a multiple of the base. The next two lines, where I define twoFilter and threeFilter are  an example of the functional idiom of partial function application (I think — “currying” is the use of partial function application to accomplish a goal, right?).

This is an undeniably cool feature of functional languages. Without any fuss, these lines create new functions that require one less argument to have their needed context. Once you have a twoFilter, you don’t need to keep the value “2″ around to pass in. Which might not seem like a big win, since a function named twoFilter or threeFilter is no more compact than calling multiple_of(2, x) or multiple_of(3,x). But…

def filter = (x : Long) => multiple_of(x, _ : Long);

val fs = List(filter(2), filter(3))
for(f <- fs){
   println("Eight is a multiple of this filter: " + f(8))
}

OK, now that’s nice and compact! Now we have a new tk function literal? tk called filter and rather than have a bunch of variables called twoFilter and threeFilter and fiveFilter, we just have a List of filters. With such a list in hand, it’s easy to figure out which numbers in a list are relatively prime:

def relatively_prime(fs : List[(Long)=>Boolean], target : Long) : Boolean = {
 for(f <- fs){
    if(f(target)){
      return false;
    }
 }
 return true;
}

println("4 is prime? " + relatively_prime(fs, 4))
println("5 is prime? " + relatively_prime(fs, 5))

val list = List[Long](2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
println(list.map(relatively_prime(fs, _)))

Which leads to a simple recursive function to find the next prime:

def next_prime(fs : List[(Long)=>Boolean], x : Long) : Long = {
 if (relatively_prime(fs, x)) {
    return x
 }
 return next_prime(fs, x + 1)
}

println(next_prime(fs, 4))
println(next_prime(fs, 8))

Which leads to our solution:

def primes(fs : List[(Long)=>Boolean], ps: List[Long], x : Long, nth : Long) : List[Long] = {
   if(ps.size == nth){
      return ps;
   }

   val np = next_prime(fs, x)
   val sieve = fs ::: List(filter(np));
   primes(sieve, ps ::: List(np), np + 1, nth);
}

println("Missing 3 because its in fs" + primes(fs, List[Long](2L), 2, 8))
println((primes(List(filter(2)), List(2L), 2, 8) reverse) head)

def nth_prime(nth : Long) : Long = {
   (
      primes(
         List(filter(2)),
         List(2L),
         2,
         nth
     ) reverse
   ) head
}

println(nth_prime(8))
println("The 10001st prime is " + nth_prime(10001))

BodyMedia Fit vs. Heart Rate Monitor

According to my stationary bike, I just burned 497 calories.

According to my heart rate monitor, I just burned 412 calories.

According to my BodyMedia Fit, I just burned 199 calories.

Hmm…

 

div>