Writing A Technical Article, Pt. 3: Research Begins…

I’m not one of those people for whom composition is a somewhat mechanical extension of increasingly detailed outlines. Nonetheless, research consumes 80 to 90% of the time I spend on a technical article. (Note I said time and not effort: subjectively, writing prose is two or three times harder than writing C++. You can satisfy a compiler.)

In this case, research will consist of:

  1. Integrating CodeAnalyst and Eclipse
  2. Profiling a simple application
  3. Implementing an image-processing application
  4. Profiling it to discover a performance problem
  5. Investigating that problem with CodeAnalyst
  6. Ameliorating it
  7. Evaluating the performance “win”
  8. If the win isn’t “big enough”, go to Step 3

My target sample application has two algorithms that I think will be difficult:

  1. Seam-discovery (CPU intensive)
  2. Seam-removal (Memory-manipulation intensive)

With seam-discovery, I know that even if I implement a known-good algorithm (such as graph-cuts), I am guaranteed to peg the CPUs. My big professional gamble is that I’m a good enough programmer to find a Step 6 “win” that I can pull off in a reasonable amount of time. A safer bet would be to choose an algorithm with performance problems deliberately built in: for instance, a filter that just iterates across rows and columns and, at each pixel, retrieves the neighboring pixels and averages them. Tuning that up would be easy!

Now, a confession. By the time you read this, I’ll have been somewhere not-Hawai’i for a week or so. I’ll carry printouts of this paper on content-aware image resizing and this paper on graph-cuts and study them on the plane tomorrow. I’ll have my laptop with me and may noodle around with source code, but it’s far more likely that I’ll be working with pen and paper to see if I can “get” the algorithm intuitively. Other than that, I’m going to do my darnedest not to spend too much time thinking about computers.

Coming September 19th or thereabouts: Part 4, Research Gets Underway…

Writing A Technical Article, Part 2: Gathering Tools

Probably the biggest difference between academic and commercial writing is this: academic writing is almost always concerned with algorithms and process; commercial writing is almost always driven by a specific set of technologies. In this case, I’m being paid by AMD to explain integration of CodeAnalyst and Eclipse CDT. So that dictates that I’ll be using C++, which leads to a couple of questions:

  • Linux or Windows?
  • What C++?

The Linux-or-Windows question is fairly easy for me to answer: I boot into Windows and run Linux in virtual machines. For an article on a profiler, the VM is a complication: CodeAnalyst works within a VM, of course, but since I expect this article to get down to the hardware, I don’t want any confusion about how toasty-warm my code is making my Opterons. So I’ll be running the code natively in Windows.

But when writing about Eclipse, surely I should assume that my target audience is not strictly Windows-based. So it seems logical to set up a compiler chain based on GNU C++: I know that will be portable. (Note: I’ll probably also compile with at least Microsoft C++ and compare the experience. That might give me enough material for, if not an article, at least a column.)

Since the article is supposed to emphasize multicore/multichip development in C++, it’s a no-brainer that I’ll be using OpenMP.

My next thought is unit-testing. For one thing, I just don’t develop anything without unit-testing anymore. For another, repeatability is critical to the article. For a third, I expect to hit some hard problems along the way, and unit-testing is the best way to maintain momentum in the face of brain-twisters. I have not compared C++ unit-testing frameworks, but Boost has a test component, and while there may well be better choices, I know that Boost won’t be poor.

Version control is not the same issue in an article as it is in professional development, but you can never overestimate the benefit of a backup: I’ll be using Subversion against a local server.

Since I know that I’ll be doing image processing, I might as well use a library or two to handle file loading and basic pixel manipulation. If I were building an application, I’d definitely tend towards GTK+. However, I want to keep things as simple as possible around the algorithm(s) that I’ll be developing, so I’m going to see how far I can get with CImg, which seems nice and minimal.

So my initial tool list is:

  • CodeAnalyst : Profiler
  • Eclipse CDT : IDE
  • GCC : Compiler-chain
  • OpenMP : Shared-memory parallel programming
  • Boost.Test : Unit-testing
  • Subversion : VCS
  • CImg : Image-processing library

If I need or change libraries, I’ll make mention of it in forthcoming posts.

Next: Part 3, Research Begins…

C++0x to Incorporate Standard Threading Model

The working groups of the C++0x committee are pushing hard to complete a major new standard for C++ (there’s a big meeting here in Kona in October). If you’re not intimate with C++, you may be surprised that such an important language has not had a standard threading model and that such a model is a major part of the C++0x effort. This is actually part-and-parcel of the design philosophy that made C and C++ so important: the set of libraries dictated by the C and C++ standards is much smaller than .NET’s BCL or Java SE’s libraries. That minimalism allows standard C and C++ to be available for hundreds of processors.

I recently read the public C++0x papers on threading (links below). The proposed threading model is non-radical and is based on Boost.Thread. The reasonable perspective is that this is a conservative decision thoroughly in keeping with C/C++’s long tradition of minimal hardware/OS assumptions.

The emotional perspective is that they’ve let slip a golden opportunity to incorporate the best current thinking about memory models. “Multi-threading and locking” is, I would argue, demonstrably broken for business programming. It just doesn’t work in a world of systems built from a combination of libraries and user code; while you can create large applications on this model, large multithreaded applications based on locking require not just care, but sophistication, at every level of coding. By standardizing an established systems-level model, C++0x forgoes an opportunity for leadership, albeit leadership of a radical sort.

One of the real thought leaders when it comes to more sophisticated concurrency semantics is Herb Sutter. His Concur model (here’s a talk on Concur from PDC ’05) is, I think, a substantial step forward and I’ve really hoped to see it influence language design. Is Sutter, though, just an academic with flighty thoughts and little understanding of the difficulties of implementation? It seems unlikely, given that he’s the Chair of the ISO C++ standards committee. So you can see why there might have been an opportunity.

Multithreading proposals for C++0x:

Dynamic Language Runtime / IronRuby Inst-analysis

That Microsoft was going to increase support for dynamic languages is no surprise: they’ve been talking about it since (at least) PDC ’03, and various hires and initiatives have clearly been in the works. I haven’t seen the DLR yet, but my big question is: what version / runtime / patch level of the CLR and libraries becomes the lowest common denominator for Silverlight (i.e., cross-platform, in the browser)? Because, for better or worse, that becomes the platform for dynamic languages in the .NET world.

I am surprised by the IronRuby announcement (and officially bestow the He-Man Programming Award on John Lam). I really thought we were going to see some form of Ruby# : Ruby :: C# : Java. Although I’m happy (Ruby is now my #1 administrative programming language), I was actually hoping to see a new language. Ruby’s a fine language, but it doesn’t have a good story for concurrency, it has a boring model of XML (unlike VB), and it has some unattractive Perl-isms. Most importantly, I think MS does a good job when it has the flexibility to evolve the language and, simultaneously, can devote the resources to developing the compilers, libraries, and toolsets.

Exceptions in the Manycore Era

Here’s some interesting reading on the challenges of, and possible strategies for, dealing with exceptions in concurrent versions of C++. The try…catch…finally model of exception handling introduces its own control flow. How will that interact with concurrent models in which you’re passing around a “future” (essentially, an IOU that can be cashed in for the results of a calculation)? Even more practically, as the number of cores increases, the likelihood of simultaneous exceptions rises (probably dramatically, since the worked-out assumptions of normal control flow no longer hold). Among other things, this paper proposes “reduction functions” that create “compound exceptions.” Interesting stuff.

When Things Go Awry Writing

I’ve been writing a series of articles for DevX on concurrent programming. The final installment was supposed to be “Multicore for Multimedia.” Plan A was to speed up the MAD (MPEG Audio Decoder) library using OpenMP. That went well enough, except that the code was so clean that I couldn’t get any impressive wins without committing myself to changing the design in some fairly substantial ways (like confining the processing phases of a single audio channel to a single core). So Plan B was to write a video filter. The problem with that is that Adobe Premiere Pro already uses multiple threads to perform the callbacks to the video filter, distributing one frame to one core and the next frame to the next. That’s pretty much optimal. So when I added threads, my performance actually decreased.

So now I’ve got to figure out a Plan C. Oh, and I recommend MAD and Adobe Premiere Pro: they’re well-engineered pieces of code.

Article on Using OpenMP with C++/CLI

My latest article on DevX shows how easy it is (in the best case) to use OpenMP with C++/CLI. OpenMP is a low-level API for creating concurrent operations. One of the things I talk about in the article is that concurrency is easiest at the finest grain (loops) and at the coarsest grain (service orientation). It’s when you get into concurrency while manipulating data structures (including objects) that disaster looms.

Update: If you like the article and would like to see more like it, consider digging it.

Example of Surprising Closure Behavior

What do you expect this program to output (note that line 19 captures the outer variable “i”)?


    1 using System;
    2 using System.Collections.Generic;
    3 using System.Text;
    4 using System.Threading;
    5 
    6 delegate void VoidDelegate();
    7 
    8 class Program
    9 {
   10     public static void Main(string[] args)
   11     {
   12         List<VoidDelegate> closures = new List<VoidDelegate>();
   13         //Create a bunch of closures
   14         for (int i = 0; i < 10; i++)
   15         {
   16             VoidDelegate myClosure = delegate
   17             {
   18                 //Capture outer variable
   19                 Console.WriteLine(i);
   20             };
   21             closures.Add(myClosure);
   22         }
   23 
   24         foreach (VoidDelegate closure in closures)
   25         {
   26             closure();
   27         }
   28         Console.ReadKey();
   29     }
   30 }


Contrast with the output of this Ruby program:


closures = Array.new()

#Create a bunch of closures
10.times { | i |
  myClosure = lambda {
    #Capture outer variable
    puts i
  }
  closures.push(myClosure)
}

closures.each { | myClosure |
  myClosure.call()
}

ParallelApply: Distribute Calculations Over Multicore / Processors

This code applies a BackgroundFunction to elements of an IEnumerable using the ThreadPool. If you don’t know what that means, it’s probably not of interest to you:

using System;
using System.Collections;
using System.Threading;

delegate void BackgroundFunction(object memberOfEnumerable);

class Program
{
    static void Main(string[] args)
    {
        int[] nums = new int[100];
        for (int i = 0; i < 100; i++)
        {
            nums[i] = i;
        }

        Random r = new Random();
        BackgroundFunction func = delegate(object memberOfEnumerable)
        {
            int i = (int) memberOfEnumerable;
            Thread.Sleep(r.Next(1000));
            Console.WriteLine(i);
        };

        ParallelApply(nums, func);
        Console.ReadKey();
    }

    static void ParallelApply(IEnumerable enumerable, BackgroundFunction function)
    {
        ManualResetEvent done = new ManualResetEvent(false);
        int doneRefCount = 1;

        BackgroundFunction wrappedBlock = delegate(object state)
        {
            function.DynamicInvoke(state);
            int isDone = Interlocked.Decrement(ref doneRefCount);
            if (isDone == 0)
            {
                done.Set();
            }
        };

        WaitCallback callback = new WaitCallback(wrappedBlock);
        foreach (object o in enumerable)
        {
            Interlocked.Increment(ref doneRefCount);
            ThreadPool.QueueUserWorkItem(callback, o);
        }
        int isDoneLate = Interlocked.Decrement(ref doneRefCount);
        if (isDoneLate == 0)
        {
            done.Set();
        }
        done.WaitOne();
    }
}