Friday 9 March 2012

A Literate Programming Tool

This is the last blog post I'll be hand-writing HTML for. I'd like a tool that can take a markdown document (as they're easy to write) and either "weave" it into a legible HTML document or "tangle" it into source code. There are loads of tools that do this already - but I want one that requires a minimum of markdown syntax, can understand (read: syntax highlight) many different languages, and most importantly can produce more than one source file from a single markdown document (for polyglot programming, or code & config, or ANTLR grammars and other DSLs). I don't want to re-invent any more of the wheel than I have to, so I'll use redcarpet to do the markdown parsing, trollop for the command-line interface and coderay to generate syntax-highlighted HTML.

I'll start with a basic overview of the functionality:


require 'fileutils'
require 'trollop'
require 'redcarpet'
require 'coderay'

opts = Trollop::options do
  opt :weave, "Produce documentation", :short => 'w'
  opt :tangle, "Produce code", :short => 't'
  opt :outputdir, "Directory to write files to", :default => Dir.pwd, :short => 'o'
  opt :lang, "Default language of code", :default => "ruby", :short => 'l'
  opt :files, "Comma-separated list of files to process", :type => :string, :short => 'f', :required => true
end

It's pretty straightforward: I want to be able to tangle and weave in a single command, or do just one or the other, and I want to be able to override the output directory to aid integration into build chains. Coderay can't guess what (programming) language a piece of text is in, so the language must either be specified in the markdown (as part of a fenced code block) or fall back to some default. I'll also enable some other markdown extensions that seem useful:


html_opts = {
  :fenced_code_blocks => true, 
  :superscript => true, 
  :tables => true, 
  :lax_html_blocks => true, 
  :strikethrough => true
}

The markdown must specify how to glue the various bits of code together. I like to think of it in terms of "anchors", or as a tree of bits of code. Each section of code can contain one or more "anchors" (named markers), and should say which anchor it is to be attached to. Each output source file has an artificial root anchor (called *). I'll delimit anchors with at signs (as that seems to be a common feature of literate tags in other implementations), so something like @My Anchor Name@. I'll also indicate that a section of code should be added to an anchor using anchor-name-plus-equals, e.g. @My Anchor Name@ += .... A markdown code block can have zero or more of these plus-equals, and any code before the first plus-equals is added to the root (i.e. *) anchor of a file with the same name as the markdown document, but with a file extension corresponding to the default language (let's call this the "default source file"). You can add a block to the root of a specific source file by prepending the source file path (including folders) to the *, so something like @my_folder/my_output.c:*@ += .... We'll find these tags in the code using regexes:


$allowed = '(\w| )*|((.*:)?\*)'
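To see what that pattern accepts, here is a minimal sketch that wraps it in the two tag forms described above. The constant and helper names (USE_TAG, ADD_TAG, anchor_names, appended_anchors) are made up for illustration and are not part of the script.

```ruby
# The pattern from the post, copied verbatim into a constant.
ALLOWED = '(\w| )*|((.*:)?\*)'

# Hypothetical names, for illustration only.
USE_TAG = /@(#{ALLOWED})@/          # any anchor tag, e.g. @Do Stuff@
ADD_TAG = /@(#{ALLOWED})@\s*\+=/    # an anchor that code is appended to

# Names of every anchor tag that appears in a piece of code.
def anchor_names(code)
  code.scan(USE_TAG).map(&:first)
end

# Names of the anchors that a piece of code appends to.
def appended_anchors(code)
  code.scan(ADD_TAG).map(&:first)
end

p anchor_names("@Do More Stuff@ += @Do Stuff@")
p appended_anchors("@my_folder/my_output.c:*@ += int x;")
```

Note that the pattern also accepts an empty name (@@), and any two "@"s on one line with only word characters and spaces between them will be read as a tag, so code that legitimately contains "@" needs some care.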

And we'll have a poor-but-simple way of guessing the output extension for the default source file: a hardcoded hash keyed by the default language (config files, command line overrides or generally anything not hard-coded would be useful here - I haven't found a library that does this):


$ext_for_lang = {
  :ruby => 'rb',
  :c => 'c',
}
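One gap in a plain hash lookup is that an unlisted language yields nil, which would end up in the file name. A small sketch of one possible fallback, assuming (and this is only an assumption) that using the language name itself as the extension is an acceptable guess; EXT_FOR_LANG and ext_for are hypothetical names:

```ruby
# Copy of the table above under a hypothetical constant name.
EXT_FOR_LANG = { :ruby => 'rb', :c => 'c' }

# Look up the extension; for an unlisted language fall back to the
# language name itself (an assumption, not what the script does).
def ext_for(lang)
  EXT_FOR_LANG.fetch(lang.to_sym) { lang.to_s }
end

puts ext_for(:ruby)
puts ext_for('java')
```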

And as we should create any directory structure needed to write the source files to the locations given in their root tags, I'll add a helper (this should really be an optional argument to File.open - maybe I should monkey-patch it?):


def write_to path, data
 FileUtils.mkdir_p(File.dirname(path))
 File.open(path, 'w') {|f| f.write(data)}
end
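A quick way to try the helper out is to write into a temporary directory; the nested path below does not exist beforehand, so it exercises the mkdir_p call. The demo_write name and the path are made up for this sketch.

```ruby
require 'fileutils'
require 'tmpdir'

# Same helper as above, repeated so the sketch runs on its own.
def write_to(path, data)
  FileUtils.mkdir_p(File.dirname(path))
  File.open(path, 'w') { |f| f.write(data) }
end

# Writes to a nested path inside a throwaway directory and reads it back.
def demo_write
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'my_folder', 'nested', 'my_output.c')
    write_to(path, "int main() { return 0; }\n")
    File.read(path)
  end
end

puts demo_write
```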

I'll do "weaving" (converting to HTML) first. I'll use redcarpet's XHTML renderer for the most part - we just have to add some highlighting to the code blocks. I'll also use Coderay's :line_number_start feature to print the line numbers from the markdown document. There's no obvious helper method that redcarpet can give me (it's written in C, and only exposes some lightweight ruby object wrappers, so there's not much in the way of utility functions - or even any way to interact with the parsing process that I can see), so I'll just count the newlines between the start of the markdown document and this fragment of code (and yes, this is very inefficient, and incorrect if the same code block appears more than once, e.g. copy+pasted - maybe I should add a newline counter to every callback).

There's one really big problem: "@" symbols and unicode. Any "@" passed to Coderay gets highlighted (i.e. surrounded by HTML tags), which interferes with the anchor display. I want to display the tags in left and right angle brackets (a bit like the LaTeX output of Knuth's WEB), so I'll have to substitute dummy strings for the "@"s before passing the code to Coderay for formatting, and then replace the dummy strings with the HTML entity codes for the symbols I want. Ideally I'd just pass the unicode equivalents of the angle brackets, but those get tagged by Coderay too.


class Weave < Redcarpet::Render::XHTML
  attr_accessor :default_lang, :original_text
  def block_code(code, lang)
    l_ang, r_ang, equiv = "__lang__", "__rang__", "__equiv__"
    # Line on which this block first appears in the markdown source (1 if it can't be found)
    line_num = @original_text[0, @original_text.index(code) || 0].count("\n") + 1
    code = code.
      gsub(/@(#{$allowed})@\s*\+=/,"#{l_ang}\\1#{r_ang}+#{equiv}").
      gsub(/@(#{$allowed})@/,"#{l_ang}\\1#{r_ang}")
    code = CodeRay.
      scan(code, lang.nil? ? @default_lang : lang.to_sym).
      html(
        :wrap => :div, 
        :css => :style, 
        :line_numbers => :table,
        :line_number_start => line_num)
    code.
      gsub(/#{l_ang}(#{$allowed})#{r_ang}(\s*)\+#{equiv}/,'⟨\\1⟩+≡').
      gsub(/#{l_ang}(#{$allowed})#{r_ang}/,'⟨\\1⟩')
  end  
  def codespan(code); block_code(code,nil) end
end
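The two tricks in that class (counting newlines to find the starting line, and hiding the tags behind dummy strings) can be tried without redcarpet or Coderay. The sketch below is only an approximation: the helper names are hypothetical, a bare pre wrapper stands in for the highlighter, and the decimal entities for the angle brackets and the equivalence sign are my choice of encoding.

```ruby
ALLOWED = '(\w| )*|((.*:)?\*)'
L_ANG, R_ANG, EQUIV = '__lang__', '__rang__', '__equiv__'

# 1-based line on which `code` first appears in `text` (1 if absent).
def first_line_of(code, text)
  offset = text.index(code)
  offset ? text[0, offset].count("\n") + 1 : 1
end

# Swap the @...@ tags for dummy strings the highlighter will leave alone.
def protect_tags(code)
  tagged = code.gsub(/@(#{ALLOWED})@\s*\+=/) { "#{L_ANG}#{$1}#{R_ANG}+#{EQUIV}" }
  tagged.gsub(/@(#{ALLOWED})@/) { "#{L_ANG}#{$1}#{R_ANG}" }
end

# Swap the dummy strings for HTML entities once highlighting is done.
def restore_tags(html)
  restored = html.gsub(/#{L_ANG}(#{ALLOWED})#{R_ANG}\s*\+#{EQUIV}/) { "&#10216;#{$1}&#10217;+&#8801;" }
  restored.gsub(/#{L_ANG}(#{ALLOWED})#{R_ANG}/) { "&#10216;#{$1}&#10217;" }
end

# Stand-in for the highlighter: just wrap the code in a <pre> element.
highlighted = "<pre>#{protect_tags("@Do Stuff@ +=\nint x;")}</pre>"
puts restore_tags(highlighted)
puts first_line_of("code\n", "a\nb\n~~~\ncode\n~~~\n")
```

Because the dummy strings are made of word characters, they can themselves be matched by the anchor-name pattern, so a tag that is used and another that is appended to on the same line may be restored incorrectly; the sketch shares that limitation with the approach it illustrates.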

The code that calls this is pretty standard boilerplate: create a parser, iterate over the input markdown files, render them to HTML & write the output to an HTML file with the same name as the markdown file (with an HTML file extension tacked on):


if opts.weave
  r = Redcarpet::Markdown.new(Weave, html_opts)
  r.renderer.default_lang = opts.lang.to_sym
  opts.files.split(',').each{|file|
    code = File.open(file, 'r'){|f|f.read}
    r.renderer.original_text = code
    html = r.render(code)
    write_to("#{opts.outputdir}/#{file}.html", html)
  }
end

The tangle code's a bit more interesting. We have to build up a hash of what strings of code are attached to each anchor, and then run a pass after processing every file to construct the output source files (fully & recursively resolving all the anchors in the code blocks). The normalise method converts the default root node * into a root node with the default source file path - so it needs to know what the markdown file name is (@file_no_ext) and what language the code is in (if it's not given) (@default_lang).

The chunk processing code splits the code block into sections to be added to different anchors (including a synthetic anchor-add for everything before the first anchor). It picks up the start index and the length of the anchor tokens so we can remove them from the code blocks - they won't be valid C/Ruby/etc. code. There's also a synthetic entry at the end, positioned at the end of the code, to make the iteration easier.


class Tangle < Redcarpet::Render::Base
  attr_accessor :default_lang, :links, :file_no_ext
  def block_code(code, lang)
    chunks = 
      [{:start => 0, :anchor => '*', :anchor_len => 0}] + 
      code.scan(/(@(#{$allowed})@\s*\+=)/).map{|x|
        {:start => code.index(x[0]), :anchor => x[1], :anchor_len => x[0].length}
      } +
      [{:start => code.length}]
    (1..chunks.length-1).each{|index|
      last, this = chunks[index-1], chunks[index]
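      # Take the text between the end of the last tag and the start of this one
      # (a range, because String#[start, n] treats n as a length, not an end index)
      @links[normalise(last[:anchor],lang)] << code[(last[:start] + last[:anchor_len])...this[:start]]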
    }
    nil
  end
  def normalise(link_name, lang)
    if link_name == '*'
      "#{@file_no_ext}.#{$ext_for_lang[lang.nil? ? @default_lang : lang.to_sym]}:*"
    else
      link_name
    end
  end
  def codespan(code); block_code(code,nil) end
end
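The splitting step can be tried in isolation, without redcarpet. This is a sketch rather than the renderer's actual code: split_chunks is a hypothetical name, and it locates each tag through the match offsets from Regexp.last_match instead of String#index, so a tag that occurs twice in one block is found at its own position each time.

```ruby
ALLOWED = '(\w| )*|((.*:)?\*)'
ADD_TAG = /@(#{ALLOWED})@\s*\+=/

# Split a code block into [anchor, text] pairs. Text before the first
# tag belongs to the root anchor '*'; the tags themselves are dropped.
def split_chunks(code)
  chunks = []
  anchor, from = '*', 0
  code.scan(ADD_TAG) do
    m = Regexp.last_match
    chunks << [anchor, code[from...m.begin(0)]]
    anchor, from = m[1], m.end(0)
  end
  chunks << [anchor, code[from..-1]]
end

p split_chunks("x\n@A@ +=\na\n@B@ +=\nb\n")
```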

The last bit of code does the tangling. As in the weaving code it uses redcarpet to process each file, but this time we don't care about the rendered output. Instead we're left with the links hash, which holds all the sections of code to be added to each anchor. The (normalised) root anchor(s) are also keys of this hash. There's one final step - we'd like to just extract the roots and write their arrays of code snippets to files, but these snippets may still contain anchors, so we replace each anchor token with the snippets attached to that anchor, recursively, until no anchors are left. My implementation isn't very safe - you could easily make it recurse infinitely, or forget to attach snippets to an anchor due to a typo. It also doesn't warn you if an anchor has no code attached to it - another potential sign of a misspelt anchor name.


if opts.tangle
  links = Hash.new{|h,k|h[k]=[]}
  r = Redcarpet::Markdown.new(Tangle, html_opts)
  r.renderer.default_lang, r.renderer.links = opts.lang.to_sym, links
  opts.files.split(',').each{|file|
    r.renderer.file_no_ext = file[0,file.rindex('.')]
    r.render(File.open(file, 'r'){|f|f.read})
  }
  resolve = lambda{|parts| parts.join('').gsub(/@(#{$allowed})@/) {|match|resolve.call(links[match[1..-2]])} }
  links.keys.select{|k| k.end_with?(':*')}.each{|root|
    write_to(
      "#{opts.outputdir}/#{root[0..-3]}",
      resolve.call(links[root]))
  }
end
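The two safety gaps mentioned above (infinite recursion and anchors with nothing attached) could be closed along these lines. This is a sketch of one possible approach, not what the script does: the resolve method, its stack parameter and the choice of exception classes are all assumptions.

```ruby
ALLOWED = '(\w| )*|((.*:)?\*)'
USE_TAG = /@(#{ALLOWED})@/

# Expand every anchor tag under `name`, raising instead of looping
# forever on a cycle, and raising on an anchor with no code attached.
def resolve(links, name, stack = [])
  if stack.include?(name)
    raise ArgumentError, "circular anchor: #{(stack + [name]).join(' -> ')}"
  end
  raise KeyError, "no code attached to anchor: #{name}" unless links.key?(name)
  links[name].join.gsub(USE_TAG) { resolve(links, $1, stack + [name]) }
end

links = {
  'example.c:*' => ["int main() { @Body@ }"],
  'Body'        => ["return 0;"]
}
puts resolve(links, 'example.c:*')
```

Raising is the bluntest option; printing a warning and substituting an empty string would let the rest of the file be written.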

So that's it! An example: the following markdown...


### my file ###
add some *text*:

~~~~~
@*@ +=
int main() { 
  @Do Stuff@
  @Do More Stuff@
  return 0; 
}
~~~~~

and some more text. we should probably say what dostuff is; it's:

~~~~~
@Do Stuff@ +=
int* me = &1;
*me++;
~~~~~

was that clearer? More stuff is the same as the first:

~~~~
@Do More Stuff@ += @Do Stuff@
~~~~

...can be woven with literate_md --lang=c --weave -f example.md into the following html...

my file

add some text:

 5  ⟨*⟩+≡
 6  int main() { 
 7    ⟨Do Stuff⟩
 8    ⟨Do More Stuff⟩
 9    return 0; 
10  }

and some more text. we should probably say what dostuff is; it's:

16  ⟨Do Stuff⟩+≡
17  int* me = &1;
18  *me++;

was that clearer? More stuff is the same as the first:

24  ⟨Do More Stuff⟩+≡ ⟨Do Stuff⟩

...or tangled (using --tangle in place of --weave) into...


int main() { 
  
int* me = &1;
*me++;

   
int* me = &1;
*me++;


  return 0; 
}

I've put this script as a gem here - let me know if you have any improvements!

1 comment:

  1. Would be easier to learn how to use Emacs and org-mode :p http://orgmode.org/org.html#Working-With-Source-Code
