Parslet and an Indentation Syntax for Extending an Old Language

Do you know what a PEG is?

PEG What?

I like compiler technology, so naturally I had to find out. It stands for Parsing Expression Grammar, and it differs from the standard CFG used by parser generator tools like Lex and Yacc primarily in that choice is ordered: the first alternative that matches wins, which is also how precedence gets encoded. Sounds interesting. I had a new parsing project to try it on, so I found a library in the language I was using: Parslet.

Parslet has an interesting property too: it is a Ruby DSL rather than a separate definition file that gets compiled to Ruby. So there’s no extra build step, and better yet, you can use Ruby’s inheritance and metaprogramming to build up your language’s parser. That makes it similar to OMeta, which is all about parsers extending other parsers to build a language up from pieces in an object-oriented fashion.
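
To make that concrete, here is a tiny example (mine, not from the project) showing both the ordered choice that gives PEGs their precedence and a parser extending another through plain Ruby inheritance:

require 'parslet'

# Ordered choice: the first alternative that matches wins.
class WordParser < Parslet::Parser
  rule(:keyword)    { str('if') | str('do') }
  rule(:identifier) { match('[a-z]').repeat(1) }
  rule(:word)       { keyword.as(:kw) | identifier.as(:id) }
  root(:word)
end

# A derived grammar overrides a rule using ordinary Ruby inheritance.
class MoreKeywords < WordParser
  rule(:keyword) { str('if') | str('do') | str('while') }
end

WordParser.new.parse('while')   # parses as an identifier
MoreKeywords.new.parse('while') # parses as a keyword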

Cure for MUMPS

There’s an old programming language that my employer uses. Not many people have heard of it outside of the few industries that still use it. It is called MUMPS, or M (the ANSI standard calls it M). It is fascinating, with a unique heritage that differs from the C-like syntax and semantics of so many of our modern programming languages. Built into the language is the concept of a “global”, which is basically a key-value datastore. That makes it a NoSQL database solution, and the runtime we use is modern and optimized. But the language itself has relics from a bygone era that make it a little annoying to use at times.

An old language on top of a competitive NoSQL database: this sounds like a job for a compiler. I can compile a newer syntax down to ANSI M and even give it some features like objects, syntactic sugar for various kinds of loops, maybe even an expressive macro system (like Lisp and Scheme have). At the very least I can automate away a few annoyances.

One of the interesting parts of M’s syntax is the use of “.” to group blocks of code rather than curly braces. For example:

if condition=1 do
. write "True"
. if condition2=2 do
. . write "Also true"

That makes a text-based indentation guide! We don’t need that today since most good text editors that a programmer would use can show lines for indentations and/or have code folding, block highlighting, etc. to help work with blocks of code. But it still seems pretty cool to me. So when designing my updated super-M language I thought that an indentation-based syntax would be a more natural spiritual successor than a curly-braces syntax.

Indentation with PEG

So how does one make an indentation syntax parser using a PEG? Most examples I saw in Google and Stack Overflow searches involve matching the whitespace at the beginning of the line. Ideally I would have INDENT and DEDENT tokens and just work them into the grammar the same way I would for an open and a close brace.
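
For reference, here is a rough sketch of that common leading-whitespace approach (my own example, not taken from any particular answer): capture each line's indentation and work out the nesting afterwards.

require 'parslet'

# The usual suggestion: capture each line's leading whitespace and let a
# later step compare the captured depths to recover the block structure.
class LinewiseParser < Parslet::Parser
  rule(:nl)    { str("\n") }
  rule(:ws)    { match('[ \t]').repeat }
  rule(:text)  { match('[^\n]').repeat(1) }
  rule(:line)  { ws.as(:depth) >> text.as(:code) >> nl }
  rule(:lines) { line.repeat(1) }
  root(:lines)
end

LinewiseParser.new.parse("if x\n  do y\n")
# => a list of {:depth=>..., :code=>...} captures, one per line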

For my first attempt, I created a parser IndentParser that handles just the indent/dedent rules. Then my language parser inherits from that. Short lexemes for keywords, identifiers, etc. all consume whitespace immediately following them. Statements consume the newline. The newline rule is special because IndentParser uses it to check for changes in indentation:

rule(:nl)         { anynl >> dynamic {|source, ctx| checkIndentation(source,ctx) }}
rule(:unixnl)     { str("\n") }
rule(:macnl)      { str("\r") }
rule(:winnl)      { str("\r\n") }
rule(:anynl)      { unixnl | winnl | macnl }  # winnl must be tried before macnl or "\r\n" parses as two newlines

The “dynamic” rule allows calling a Ruby block at parse time to return a match. I return AlwaysMatch.new, a class I made that extends Parslet::Atoms::Base and uses Parslet’s convention for signaling a successful match.

class AlwaysMatch < Parslet::Atoms::Base
  # Always succeed without consuming any input; succ wraps the (empty)
  # result in Parslet's success tuple.
  def try(source, context, consume_all)
    succ("")
  end
end

Before returning AlwaysMatch, checkIndentation scans the whitespace at the start of the next line, skipping over lines that contain only whitespace. It keeps a stack in the class of all the matched indentation levels so we can compare and see whether the new indentation is deeper or shallower. If deeper, we push another level onto the stack and add an INDENT to a list of pending tokens to return later. If shallower, we pop the stack to see how many DEDENT tokens to add to the list. The indent and dedent rules match only when there is an INDENT or DEDENT in that internal list, consuming it. Neither consumes characters from the input string.
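
Roughly sketched, the bookkeeping looks like this; next_line_indent is a hypothetical helper standing in for however the real code peeks at the upcoming line, and the details differ from the actual implementation:

class IndentParser < Parslet::Parser
  # Called from the nl rule via dynamic. Measures the indentation of the
  # next non-blank line and records pending INDENT/DEDENT tokens.
  def checkIndentation(source, ctx)
    @levels  ||= [0]   # stack of indentation widths seen so far
    @pending ||= []    # INDENT/DEDENT tokens waiting to be consumed

    width = next_line_indent(source)   # hypothetical lookahead helper
    if width > @levels.last
      @levels.push(width)
      @pending.push(:indent)
    else
      while width < @levels.last
        @levels.pop
        @pending.push(:dedent)
      end
    end
    AlwaysMatch.new    # always succeed; consume nothing from the input
  end
end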

But now things start to fall apart. The dedent and indent rules are only checked in the block rule at the beginning and end, but we need to make sure that any statement we match within the block does not have a DEDENT token pending before it. I tried adding a SAMEDENT rule and using it like this:

rule(:block) { 
  colon >> nl >> indent >> 
  (samedent >> stmt).repeat(1).as(:stmts) >> 
  dedent
}
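
The indent, dedent, and samedent rules themselves can be written with the same dynamic trick, checking the pending list rather than the input. This is again a sketch using the names from the sketch above; FailMatch is a hypothetical counterpart to AlwaysMatch that always reports failure:

rule(:indent) do
  dynamic do |source, ctx|
    # Match only when an INDENT recorded by checkIndentation is pending.
    if (@pending ||= []).first == :indent
      @pending.shift
      AlwaysMatch.new
    else
      FailMatch.new   # hypothetical: always fails, consumes nothing
    end
  end
end

rule(:dedent) do
  dynamic do |source, ctx|
    if (@pending ||= []).first == :dedent
      @pending.shift
      AlwaysMatch.new
    else
      FailMatch.new
    end
  end
end

# SAMEDENT: succeed only when no INDENT or DEDENT is pending at all.
rule(:samedent) do
  dynamic do |source, ctx|
    (@pending ||= []).empty? ? AlwaysMatch.new : FailMatch.new
  end
end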

The samedent rule matches only if there are no INDENT or DEDENT tokens pending in the IndentParser’s internal list. But it wasn’t enough. I couldn’t quite iron out all the bugs in how this interacts with PEG parsing, which memoizes success or failure per rule per input position and backtracks freely. Perhaps it is possible to make my SAMEDENT idea work, but what I really wanted was for the INDENT/DEDENT tokens to be injected into the input stream the way they would be with a separate lexer.

Indentation via Bridge

So onto my second attempt. Since each rule in Parslet is just a Ruby function I decided to make a hand-written lexer and then bridge this over to Parslet with metaprogramming to generate functions for each token. This would include the special INDENT/DEDENT tokens of course.

The bridged parser is pretty short so I’ll show it all here:

class PEGBridgedParser < Parslet::Parser

  ##
  # Need to override this so it takes a lexer instead of a str and makes my Source subclass.
  ##
  alias :super_parse :parse
  def parse(lexer)
    if lexer.is_a?(MppFront::Lexer)
      super_parse(LexerSource.new(lexer))
    else
      super_parse(lexer) #an IO or Parslet::Source object
    end
  end

  ##
  # Define a lexer token rule.
  # 
  # @param name : sym - The lexer token's "token" value (maybe) and the rule name.
  # @param char : char - If given, the lexer token's "token" value is actually this char.
  ##
  def self.token(name, char = nil)
    # See Parslet::ClassMethods. Does memoization of the rules.
    define_method(name) do
      @rules ||= {}     # <name, rule> memoization
      return @rules[name] if @rules.has_key?(name)

      char = name if char.nil?
      @rules[name] = TokenEntity.new(name, char)
    end
  end

  def self.keyword(name)
    self.token("kw_" + name.to_s, name)
  end

  def self.tokens(*names)
    names.each {|name| self.token(name)}
  end

  def self.keywords(*names)
    names.each {|name| self.keyword(name)}
  end

end

A Parslet parser normally takes a string and makes a Source object from it. I change the parse entry point to take an instance of my lexer and build my own LexerSource subclass from it. We still have to keep track of matches at each position. The LexerSource does this in a way that lets Parslet backtrack to its heart’s content, pulling tokens from the lexer only when Parslet asks to go beyond the current list of tokens. The LexerSource also overrides consume and match, since character-by-character matching from grammar subclasses won’t be supported. It could be, if I added a feature to my lexer to hand back a character at a time instead of a token, but for my purposes I don’t need that. Parslet will only be allowed to match rules representing full tokens, or rules built from those.

More behind-the-scenes plumbing includes implementing chars_left in LexerSource to return the number of, well, presumably chars left in the input. But I’m using it as the number of tokens instead, and without running the lexer to the end I don’t know how many tokens there are. So I return 0 when there are no more tokens in the backtrack list and the lexer has hit EOF, and otherwise I return 10. I originally returned 1, to mean there is always at least one token left, but Parslet sometimes took that to mean there were not enough chars left to match some rules, so I return a larger number. I also had to make a couple of subclasses of Parslet objects to represent my tokens: TokenSlice extends Parslet::Slice and implements line_and_column to track position, + to concatenate matched pieces, and == for comparison; TokenEntity extends Parslet::Atoms::Entity and is returned as the match from my token rules. TokenEntity overrides the try method to actually talk to the lexer and see whether the desired token is the current token.
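
To give a feel for the plumbing, here is a heavily simplified sketch of the idea behind LexerSource. The lexer methods it calls (next_token, eof?) and the buffering details are assumptions of mine rather than the project's actual code, and the real class has to satisfy more of Parslet::Source's interface than is shown here:

class LexerSource < Parslet::Source
  def initialize(lexer)
    super('')          # no character input; everything comes from the lexer
    @lexer  = lexer
    @tokens = []       # every token read so far, kept so Parslet can backtrack
    @index  = 0        # current position, measured in tokens
  end

  # Pull tokens from the lexer lazily, only when Parslet wants to look
  # past what has already been buffered.
  def current_token
    @tokens << @lexer.next_token while @tokens.size <= @index && !@lexer.eof?
    @tokens[@index]
  end

  def advance_token
    tok = current_token
    @index += 1
    tok
  end

  # Character-level matching is not supported; grammars must go through
  # the token rules that PEGBridgedParser.token generates.
  def consume(_count)
    raise NotImplementedError, 'LexerSource matches whole tokens only'
  end

  def match(_pattern)
    raise NotImplementedError, 'LexerSource matches whole tokens only'
  end

  # Report "plenty left" until the lexer is exhausted; Parslet only needs
  # to know whether it can keep going (see the discussion above).
  def chars_left
    (@index < @tokens.size || !@lexer.eof?) ? 10 : 0
  end
end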

The metaprogramming magic is implemented by PEGBridgedParser::token (see above). It generates a method with the token’s name that handles memoization and returns a TokenEntity to do the matching. To help, keyword matches a token but prepends “kw_” to the token name and corresponding rule method, while tokens and keywords allow specifying several at a time, as long as the rule name is the same as the value the lexer will look for (e.g. “if”; a counterexample is “plus”, which matches “+”).

My final parser starts by defining the token and keyword rules I want.

class MppGrammar < MppFront::PEGBridgedParser

  # Tokens
  tokens(:nl, :indent, :dedent)
  keywords(:break, :continue, :else, :for, :foreach, :func, :if, :loop, :return, :sub, :var)
  keywords(:as, :class, :interface, :final, :self)
  keywords(:pass)
  tokens(:rparen, :lparen, :rbracket, :lbracket, :comma, :colon, :eql, :neql, :land, :lor)

  # Operators +!#&|*/_'<>.@-  These are returned from lexer as single-char tokens.
  token(:plus, '+')
  token(:minus, '-')
  token(:mult, '*')
  token(:div, '/')
  token(:mod, '#')
  token(:tk_or, '!')
  token(:tk_and, '&')
  token(:tk_not, "'")
  token(:pipe, '|')
  token(:under, '_')
  token(:lt, '<')
  token(:gt, '>')
  token(:dot, '.')
  token(:at, '@')

  token(:tk_identifier, :identifier)
  token(:string)
  token(:number)

  #lexer error comes in as an error token
  token(:error)
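
  # A hypothetical example (not from the real grammar) showing that the
  # generated token methods combine like ordinary Parslet atoms:
  rule(:pass_stmt) { kw_pass >> nl }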

The rest is elementary, my friend, like any other Parslet example.