Literate Programming in HTML and C#


Literate C# Coding in HTML

This page includes the complete definition of a C# program called mtangle that can be used to extract code from HTML pages.
In other words, this page is an example of literate programming.

mtangle is based on the C program defined at http://axiom-developer.org/axiom-website/litprog.html. Neither this program nor that one parses the input HTML file properly; both rely on hard-coded string patterns and regular expressions. This is probably a weakness, and the string matching could be trivially replaced by XML parsing. (TODO: It would be interesting to replace the regex-based code chunks with XML-based code chunks and derive either program from the same Web page.)
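
To make the XML idea concrete, here is a minimal sketch of what an XML-based lookup might look like. It is not part of mtangle: the class and method names (XmlChunkSketch, GetChunkViaXml) are hypothetical, and it assumes the page is well-formed XML (for example XHTML), which ordinary HTML often is not.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XmlChunkSketch
{
    // Hypothetical alternative to string matching; not part of mtangle.
    // Assumes the input is well-formed XML. XDocument.Parse throws an
    // XmlException on ordinary "tag soup" HTML.
    public static string GetChunkViaXml(string xhtml, string chunkName)
    {
        var doc = XDocument.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Compare local names so a default XHTML namespace does not matter.
        var pre = doc.Descendants()
            .FirstOrDefault(e => e.Name.LocalName == "pre"
                && (string)e.Attribute("id") == chunkName);

        // Value concatenates the element's text nodes; the parser has
        // already decoded entities such as &lt; and &gt;.
        return pre == null ? "" : pre.Value;
    }
}
```

Because the XML parser decodes entities itself, a sketch like this would hand back text in which an escaped getchunk reference already appears in its tag form.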

There’s a bootstrapping issue with mtangle, so if you want to use it, you should probably clone the GitHub repository here. If you do that, rather than a single HTML or Markdown page, you’ll get a Xamarin Studio solution that includes C# files and test files, etc. Perhaps eventually I’ll change that.

Once you have compiled Program.cs into an executable called mtangle, you can check the behavior by running it over a locally-saved copy of this HTML page: `mono mtangle.exe LiterateCSharp.html Program > Program.cs`, and it should generate a new copy of itself.

The form of mtangle

At the most basic level, you just need one thing: the ability to extract a <pre> element based on an id attribute:

public static string GetChunk(string html, string chunkName)
{
    <getchunk id="GetChunkImplementation"/>
}
	

Of course, the <getchunk> line is not valid C# code. What you really want for the body of the GetChunk function is something that searches the input html for a tag of the form <pre id="chunkName">:

string chunkStart = "<pre id=\"{0}\">";
//Would be better if I could rely on XML parsing, but I'm just going to hard-code in strict text
var chunkTag = String.Format (chunkStart, chunkName);
var chunkLocation = html.IndexOf(chunkTag);
if(chunkLocation >= 0)
{
    <getchunk id="FoundChunk"/>
}
else
{
    //No chunk. Return empty (Or should it throw?)
    return "";
}

(As you can see, I’m not sure what to do when a chunk is sought that doesn’t exist. Right now, I return an empty string, but maybe I should throw an exception. What do you think?)
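
For comparison, here is a minimal sketch of the "throw" alternative. The names StrictChunkSketch and FindChunkOrThrow are hypothetical, and the choice of KeyNotFoundException is only one reasonable option, not something the original program does.

```csharp
using System;
using System.Collections.Generic;

public static class StrictChunkSketch
{
    // Hypothetical strict lookup: returns the index of the first character
    // of the chunk body, or throws if no <pre id="..."> tag is found.
    public static int FindChunkOrThrow(string html, string chunkName)
    {
        var chunkTag = String.Format("<pre id=\"{0}\">", chunkName);
        var chunkLocation = html.IndexOf(chunkTag, StringComparison.Ordinal);
        if (chunkLocation < 0)
        {
            // Assumption: a missing chunk is treated as a caller error.
            throw new KeyNotFoundException("No chunk named '" + chunkName + "'");
        }
        return chunkLocation + chunkTag.Length;
    }
}
```

Throwing makes a misspelled chunk name fail loudly instead of silently producing a program with a hole in it; returning an empty string keeps partially written pages tangling.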

Now, how should the FoundChunk be implemented? Easy, just extract the text from the original html up to its end:

string chunkEnd = "</pre>";
//Found it
var postChunk = html.Substring(chunkLocation + chunkTag.Length);
var chunk = postChunk.Substring(0, postChunk.IndexOf(chunkEnd));
return chunk;
	

So that would work, but only if the chunk contained the entire program to be extracted. We don’t want just that; we want the ability to use a <getchunk> pseudo-tag to indicate a chunk someplace else on the page. So what we really need is this:

//Found it
var postChunk = html.Substring(chunkLocation + chunkTag.Length);
var chunkWithPossibleGetChunks = postChunk.Substring(0, postChunk.IndexOf(chunkEnd));
var fixedChunk = FixHTMLCode(chunkWithPossibleGetChunks);
var chunk = ResolveGetChunks(html, fixedChunk);
return chunk;

And, of course, we have to define string ResolveGetChunks(string html, string chunk). Again, that’s easy:

public static string chunkGetForm = "<getchunk id=\"?(.*?)\"/>";
public static string ResolveGetChunks(string html, string chunk)
{
   var matches = Regex.Matches(chunk, chunkGetForm);
   if(matches.Count > 0)
   {
      var replaced = chunk;
      foreach(Match match in matches)
      {
         var innerChunkName = match.Groups[1].Value;
         var innerChunk = GetChunk(html, innerChunkName);
         replaced = replaced.Replace(match.Groups[0].Value, innerChunk);
      }
      return replaced;
   }
   else
   {
      return chunk;
   }
}

(OK, the use of regular expressions to extract these things is really getting to me as I write this page. And, yeah, if you create a cycle in your chunk references you’ll blow the stack. I wonder if a benefit of literate programming is that you really, really confront your compromises?)
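
The cycle problem can be guarded against by remembering which chunks are currently being expanded. The following is a sketch under assumptions, not the program this page defines: the class name CycleSafeTangle, the private overload, and the use of InvalidOperationException are all hypothetical, and it folds the lookup, entity unescaping, and getchunk substitution into one method to stay self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CycleSafeTangle
{
    static readonly string chunkStart = "<pre id=\"{0}\">";
    static readonly string chunkEnd = "</pre>";
    static readonly string chunkGetForm = "<" + "getchunk id=\"?(.*?)\"/>";

    public static string GetChunk(string html, string chunkName)
    {
        return GetChunk(html, chunkName, new HashSet<string>());
    }

    // 'active' holds the names of the chunks currently being expanded.
    static string GetChunk(string html, string chunkName, HashSet<string> active)
    {
        if (!active.Add(chunkName))
        {
            // The chunk is already on the expansion path, so this is a cycle.
            throw new InvalidOperationException("Cyclic chunk reference: " + chunkName);
        }
        try
        {
            var chunkTag = String.Format(chunkStart, chunkName);
            var chunkLocation = html.IndexOf(chunkTag, StringComparison.Ordinal);
            if (chunkLocation < 0) return "";

            var postChunk = html.Substring(chunkLocation + chunkTag.Length);
            var end = postChunk.IndexOf(chunkEnd, StringComparison.Ordinal);
            if (end < 0) return ""; // assumption: an unterminated chunk counts as missing

            var body = postChunk.Substring(0, end)
                .Replace("&lt;", "<")
                .Replace("&gt;", ">");

            // Expand each getchunk reference recursively, sharing 'active'.
            return Regex.Replace(body, chunkGetForm,
                m => GetChunk(html, m.Groups[1].Value, active));
        }
        finally
        {
            // Leaving this chunk: it may legitimately be referenced again elsewhere.
            active.Remove(chunkName);
        }
    }
}
```

Removing the name in the finally block means a chunk used in two different places is still allowed; only a chunk that (directly or indirectly) includes itself triggers the exception.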

Now comes something even uglier than hard-coded text.

<getchunk> is not a real tag. If you put it in as a literal element, the browser won’t render it. Instead, you need to add it using escape codes: &lt;getchunk id="ChunkName"&gt;. And, unfortunately, that goes for code that includes <s and >s. So we need to transform text containing escaped <s and >s in the source text into a tag-based form that matches chunkGetForm:

public static string FixHTMLCode(string html)
{
   var sansLT = html.Replace("&lt;", "<");
   var sansGT = sansLT.Replace("&gt;", ">");
   return sansGT;
}
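
Hand-written replacements only cover the entities you remember to list. As a sketch of an alternative, assuming System.Net.WebUtility is available on your runtime, the framework's own decoder can do the unescaping; the names EntitySketch and FixHTMLCodeViaDecode are hypothetical and not part of mtangle.

```csharp
using System;
using System.Net;

public static class EntitySketch
{
    // Hypothetical replacement for FixHTMLCode: decodes &lt;, &gt;, &amp;,
    // &quot; and numeric character references in a single pass.
    public static string FixHTMLCodeViaDecode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

A single decoding pass also avoids an ordering trap with chained Replace calls: if &amp; were replaced before &lt;, source text such as &amp;lt; would be unescaped twice.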

Let’s gather the hard-coded string literals together:

static readonly string chunkStart = "<pre id=\"{0}\">";
static readonly string chunkEnd = "</pre>";           
static readonly string chunkGetForm = "<" + "getchunk id=\"?(.*?)\"/>";

And a trivial Main function:

public static void Main (string[] args)
{
   if(args.Length != 2)
   {
      throw new ArgumentException("Usage: mtangle filename chunkname");
   }

   //Read the whole page in one call; the file is closed even if reading fails
   string html = File.ReadAllText(args[0]);
   string code = GetChunk(html, args[1]);
   Console.WriteLine (code);
}

And we’re there. Of course, we have to define our overall file structure:

using System;
using System.IO;
using System.Text.RegularExpressions;

namespace mtangle
{
   /*
   Port of http://axiom-developer.org/axiom-website/litprog.html to C# / mono
   Of course this would be better if it were XML parsed, but this is a straight port of a text strategy
   */
   public class MTangle
   {
      <getchunk id="Constants"/>
      <getchunk id="Main"/>
      <getchunk id="FixHTMLCode"/>
      <getchunk id="GetChunk"/>
      <getchunk id="ResolveGetChunks"/>
   }
}

Now, assuming that we’ve added the proper id attribute to our <pre> elements, including id’ing that last one MTangle, if we run mono mtangle.exe LiterateCSharp.html MTangle > foo.cs we end up with a poorly-formatted version of the program!
