• ¶

    Peasy

    Peasy means parsing is easy
    an easy but powerful parser
  • ¶

    To use Peasy, just copy this file into your project, read it, modify it, write your grammar rules, remove any parts of Peasy you don't need, and parse with the grammar.
    See here for a sample grammar in Peasy.

  • ¶

    With Peasy, you write the parser by hand, just as you write any other kind of program.
    You no longer need to juggle many balls like this:

    ballacrobatics.jpg
    You just play one ball like so:

    dolphinball.jpg

  • ¶

    You can embed other features seamlessly in the grammar rules, such as lexing, rewriting, semantic actions, error processing (error reporting, error recovery) or any other tasks; you can even modify the grammar rules dynamically, or the parsed object itself, if you wish.
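
  • ¶

    For example, here is a hedged sketch of a rule that mixes lexing, a semantic action and inline error reporting in plain code, written as if it lived inside this file (the rule name assignment, the expected input shape and the error message are invented for illustration):

    assignment = (start) ->
      cursor = start
      # lexing by hand: an identifier, optional spaces, '=', optional spaces, digits
      nameStart = cursor
      if not exports.identifier(cursor) then return
      name = text.slice(nameStart, cursor)
      exports.spaces(cursor)
      if text[cursor]!='='
        # inline error reporting
        throw new Error("expected '=' at position "+cursor)
      cursor++
      exports.spaces(cursor)
      valueStart = cursor
      cursor++ while '0'<=text[cursor]<='9'
      if cursor==valueStart then return
      # semantic action: build the result object right here
      {name: name, value: text.slice(valueStart, cursor)}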

  • ¶

    Rather than a tool or a library, Peasy is a new method for writing parsers. As a demonstration, Peasy presents itself as a single-file module, which includes some global variables used while parsing, some functions to process left recursive symbols, some matchers which parse the text, and some combinators of matchers.

  • ¶

    With the method provided by Peasy, you can parse not only text, but also arrays, nested lists, binary streams, or other data structures such as trees, graphs, and so on.
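
  • ¶

    As a hedged sketch of that idea, the matchers below treat text as an array of token objects instead of a string, while still using cursor in the usual way (the token shapes and rule names are invented for illustration):

    # assuming text is an array like [{type:'num', value:1}, {type:'op', value:'+'}, {type:'num', value:2}]
    numberToken = (start) ->
      tok = text[start]
      if tok and tok.type=='num' then cursor = start+1; tok
    plusToken = (start) ->
      tok = text[start]
      if tok and tok.type=='op' and tok.value=='+' then cursor = start+1; tok
    sum = (start) ->
      left = numberToken(start)
      if not left then return
      if not plusToken(cursor) then return left.value
      right = numberToken(cursor)
      if right then left.value + right.value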

  • ¶
    Nothing in Peasy is indispensable except the method itself:
  • ¶
    • If there are no left recursive symbols, you can remove the code about them (mainly three functions and about 80 lines).
    • If you don't need to memorize parsed results, you can remove the memorization code (mainly the function memorize).
    • The matchers (e.g. spaces, literal, identifier, etc.) and combinators (e.g. andp, orp, etc.) in this module exist just to demonstrate the method initiated by Peasy. They are all very simple; after reading them, I bet you would rather remove them and write your own by hand when you write the grammar rules.

  • ¶

    Peasy provides two methods to tell which symbols are left recursive.
    This module presents the semiautomatic method:
    First, write the grammar rules in this manner: for every left recursive circle, replace one and only one occurrence of grammar.symbol in the rules with memo(symbol). Before parsing, call initialize(), then declare which symbols left-call each other or are left recursive by calling addParentChildrens(grammar, parentChildren) and/or addRecursiveCircles(grammar, recursiveCircles...), then call computeLeftRecursives(grammar). Now everything about the left recursive symbols is computed and you can call parse(data, grammar, root) to do the work, as sketched below.
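
  • ¶

    A minimal sketch of that calling order, assuming a grammar object named rules whose only left recursive circle is A -> B -> C -> A (the same shape as the sample grammar later in this file) and some input data:

    initialize()
    addRecursiveCircles(rules, ['A', 'B', 'C'])
    computeLeftRecursives(rules)
    result = _parse(data, rules, 'A')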

  • ¶

    The notation @name in the comments below means a global variable, and $name means a parameter.

  • ¶

    global variables

    I use global variables for performance reasons; at some point I will provide another version with a Parser class for people who prefer modularization to speed. Of course, you can modify the code to add a Parser class yourself if you like. It's just a snap to do that.

  • ¶

    some global variables used while parsing

  • ¶

    the object which is being parsed
    Don't be confused by the variable name: it can be not only a text string but also an array, sequence, tree, graph, etc.

    text = ''
  • ¶

    the length of @text

    textLength = 0
  • ¶

    the current position of parsing; use text[cursor] to get the current character of the parsed stream

    cursor = 0
  • ¶

    the grammar object, which contains all of the rules defined for the symbols of the grammar.

    grammar = undefined
  • ¶

    saves the original rules of the grammar.

    originalRules = {}
  • ¶

    stores, for each left recursive symbol, its rule wrapped with recursive; grammar[symbol] is set to this wrapper before parsing. While the wrapper runs, grammar[symbol] is temporarily unwrapped to originalRules[symbol].

    recursiveRules = {}
    symbolDescedentsMap = {}
  • ¶

    If you would like to use the symbol itself as the hash head, remove the four lines below and the other related code.
    {symbol: tag}, maps a rule symbol to a shorter memo tag, for memory efficiency

    symbolToTagMap = {}
  • ¶

    {tag: true}, records the tags that have been used, to avoid conflicts

    tags = {}
    
    parseCache = {} # {tag+start: [result, cursor]}, memorized parser result
    functionCache = {} # memorized normal function result
    
    hasOwnProperty = Object.hasOwnProperty
  • ¶

    helper functions

    some functions that are helpful for building the parser.

  • ¶

    initialize: clear the global variables. You should call initialize() first.

    exports.initialize = initialize = () ->
      parseCache = {}
      functionCache = {}
      originalRules = {}
      recursiveRules = {}
      symbolDescedentsMap = {}
      symbolToTagMap = {}
      tags = {}
  • ¶

    _parse: parse $data from $root with $aGrammar.
    Before parsing, you should provide the information about left recursion in the grammar.

    exports.parse = _parse = (data, aGrammar, root) ->
      text = data
      textLength = text.length
      cursor = 0
      root = root or aGrammar.rootSymbol
      grammar = aGrammar
      grammar[root](0)
  • ¶

    parse: this is a sample parse function to demonstrate how to write your own grammar rules.
    Notice that there is indirect left recursion in this grammar.

    parse = (text) ->
  • ¶

    generate the matchers with the combinators in advance for better performance.
    If you don't care about performance, you can write them in the rules directly.

      a = char('a'); b = char('b'); x = char('x')
      memoA = memo('A')
  • ¶

    the grammar rules object.

      rules =
        A: (start) ->
  • ¶

    the trailing "or memoResult" prevents executing the nonrecursive part more than once.

          (memoResult = m = rules.B(start)) and x(cursor) and m+'x' or memoResult\
          or a(start)
        B: (start) -> rules.C(start)
        C: (start) -> memoA(start) or b(start)
        rootSymbol: 'A'
      initialize()
      addRecursiveCircles(rules, ['A', 'B', 'C'])
      computeLeftRecursives(rules)
      _parse(text, rules)
  • ¶
    stuff for left recursive symbols

    use addParentChildrens(grammar, parentChildren) or addRecursiveCircles(grammar, recursiveCircles...) to declare the left calling relations between the symbols in the grammar

  • ¶

    addParentChildrens: add the direct left call parent->children relations given in $parentChildren to grammar.parentToChildren
    e.g. addParentChildrens(grammar, {A:['B'], B:['A', 'B']})

    exports.addParentChildrens = addParentChildrens = (grammar, parentChildren) ->
      map = grammar.parentToChildren ?= {}
      for parent, children of parentChildren
        list = map[parent] ?= []
        for name in children
          if name not in list then list.push name
      null
  • ¶

    addRecursiveCircles: add left recursive parent->children relations to grammar.parentToChildren for the symbols in $recursiveCircles
    e.g. addRecursiveCircles(grammar, ['A', 'B'], ['B']) declares the same left recursive relations as the example above.

    exports.addRecursiveCircles = addRecursiveCircles = (grammar, recursiveCircles...) ->
      map = grammar.parentToChildren ?= {}
      for circle in recursiveCircles
        i = 0
        length = circle.length
        while i<length
          if i==length-1 then j = 0 else j = i+1
          name = circle[i]
          parent = circle[j]
          list = map[parent] ?= []
          if name not in list then list.push name
          i++
      null
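
  • ¶

    For the indirect circle A -> B -> C -> A used in the sample grammar above, these two calls record the same relations (rules is an assumed grammar object):

    addRecursiveCircles(rules, ['A', 'B', 'C'])
    # is the same as
    addParentChildrens(rules, {B: ['A'], C: ['B'], A: ['C']})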
  • ¶

    computeLeftRecursives: after the left recursive relations have been declared, compute the left recursive groups and wrap the symbols in them with the recursive function

    exports.computeLeftRecursives = computeLeftRecursives = (grammar) ->
      parentToChildren = grammar.parentToChildren
      addDescendents = (symbol, meetTable, descedents) ->
        children = parentToChildren[symbol] or []
        for child in children
          if child not in descedents then descedents.push child
          if not meetTable[child]
            meetTable[child] = true
            addDescendents(child, meetTable, descedents)
      symbolDescedentsMap = {}
      for symbol of parentToChildren
        meetTable = {}; meetTable[symbol] = true
        descendents = symbolDescedentsMap[symbol] = []
        addDescendents(symbol, meetTable, descendents)
        if symbol in descendents
          originalRules[symbol] = grammar[symbol]
          recursiveRules[symbol] = grammar[symbol] = recursive(symbol)
      symbolDescedentsMap
  • ¶

    recursive: this is the key function for left recursion.
    It makes $symbol a left recursive symbol, which means originalRules[symbol] is wrapped with recursive. When recursive(symbol)(start) is executed, it first sets grammar[child] = originalRules[child] for all symbols in the left recursive circles, then loops computing rule(start) until no further progress is made, and finally restores all symbols in the left recursive circles to recursiveRules[symbol].

    exports.recursive = recursive = (symbol) ->
      rule = originalRules[symbol]
      (start) ->
        for child in symbolDescedentsMap[symbol]
          grammar[child] = originalRules[child]
        hash = symbol+start
        m = parseCache[hash] ?= [undefined, -1]
        if m[1]>=0 then cursor = m[1]; return m[0]
        while 1
          result = rule(start)
          if m[1]<0
            m[0] = result
            if result then  m[1] = cursor
            else m[1] = start
          else
            if m[1]==cursor then m[0] = result; return result
            else if cursor<m[1] then m[0] = result; cursor = m[1]; return result
            else m[0] = result; m[1] = cursor
        for child in symbolDescedentsMap[symbol]
          grammar[child] = recursiveRules[child]
        result
  • ¶
    memorization
  • ¶

    memo: look up the memorized result and reached cursor for $symbol at the position $start.
    It is set up automatically by computeLeftRecursives(grammar) for the symbols which are left recursive.
    For other symbols, you can call it in a rule manually.

    exports.memo = memo = (symbol) ->
      (start) ->
        hash = symbol+start
        m = parseCache[hash]
        if m and m[1]>=0 then cursor = m[1]; m[0]
  • ¶

    setMemorizeRules: set the symbols in the grammar which should memorize their rule's result.
    This function should be used only for symbols which are not left recursive.
    You can call it after initialize() and before parse(...), as in the sketch below.

    setMemorizeRules = (grammar, symbols) ->
      for symbol in symbols
        setMemoTag(symbol)
        originalRules[symbol] = grammar[symbol]
        grammar[symbol] = memorize(symbol)
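
  • ¶

    A usage sketch, written as if inside this file; the grammar object rules and the symbol names 'Expr' and 'Program' are assumptions:

    initialize()
    setMemorizeRules(rules, ['Expr'])    # 'Expr' is not left recursive, just tried from many places
    _parse(data, rules, 'Program')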
  • ¶

    memorize: memorize the result and cursor for $symbol, which must not be left recursive.
    Symbols which are left recursive should be wrapped by recursive(symbol), not memorize(symbol)!

    memorize = (symbol) ->
      tag = symbolToTagMap[symbol]
      rule = originalRules[symbol]
      (start) ->
        hash = tag+start
        m = parseCache[hash]
        if m then cursor = m[1]; m[0]
        else
          result = rule(start)
          parseCache[hash] = [result, cursor]
          result
  • ¶

    setMemoTag: find a short prefix of the symbol to use as the head of the hash key into @parseCache
    It exists just for performance reasons. If you don't like this idea, you can remove the memo tag code yourself and just use the symbol itself as the head of the hash key.

    setMemoTag = (symbol) ->
      i = 1
      while 1
        if hasOwnProperty.call(tags, symbol.slice(0, i)) then i++
        else break
      tag = symbol.slice(0, i)
      symbolToTagMap[symbol] = tag
      tags[tag] = true
  • ¶

    matchers and matcher combinators

  • ¶

    A matcher is a function which reads the object being parsed and moves the cursor directly.
    All matchers should return a truthy value on success and a falsy value on failure.
    A combinator is a function which receives zero or more matchers as parameters (possibly plus some other parameters which are not matchers) and generates a new matcher.
    There are other matcher generators besides the standard combinators described above, like recursive, memo and memorize, which we have already met.

    isMatcher = (item) ->  typeof(item)=="function"
  • ¶

    combinator andp
    execute item(cursor) in sequence and return the result of the last one.
    When andp is used to combine matchers, the effect is the same as using short-circuit evaluation, like this:
    item1(start) and item2(cursor) ... and itemn(cursor).
    In fact, you would probably rather write item1(start) and item2(cursor) ... directly when you write the grammar rule in the manner of Peasy. That is simpler, faster and more elegant.
    And in that manner you have more control over your grammar rule, for example:
    if (x = item1(start)) and x>100 and item2(cursor) and not item3(cursor) and (y = item4(cursor)) then doSomething()

    exports.andp = (items...) ->
      items = for item in items
        if not isMatcher(item) then literal(item) else item
      (start) ->
        cursor = start
        for item in items
          if not(result = item(cursor)) then return
        return result
  • ¶

    combinator orp
    execute item(start) one by one, until the first item which evaluates to a truthy value, and return that value.
    When orp is used to combine matchers, the effect is the same as using short-circuit evaluation, like this:
    item1(start) or item2(start) ... or itemn(start)
    In fact, you would probably rather write item1(start) or item2(start) ... directly when you write the grammar rule in the manner of Peasy. That is simpler, faster and more elegant.
    And in that manner you have more control over your grammar rule, for example:
    if ((x = item1(start)) and x>100) or (item2(start) and not item3(cursor)) or (y = item4(start)) then doSomething()

    exports.orp = (items...) ->
      items = for item in items
        if not isMatcher(item) then literal(item) else item
      (start) ->
        for item in items
          if result = item(start) then return result
        result
  • ¶

    combinator notp
    notp is not useful except inside other combinators, like this: andp(item1, notp(item2)).
    It's unnecessary, inefficient and ugly to write notp(item)(start); just write not item(start).

    exports.notp = (item) ->
      if not isMatcher(item) then item = literal(item)
      (start) -> not item(start)
  • ¶

    combinator any: zero or more times of item(cursor)

    exports.any = (item) ->
      if not isMatcher(item) then item = literal(item)
      (start) ->
        result = []; cursor = start
        while ( x = item(cursor)) then result.push(x)
        result
  • ¶

    combinator some: one or more times of item(cursor)

    exports.some = (item) ->
      if not isMatcher(item) then item = literal(item)
      (start) ->
        result = []; cursor = start
        if not (x = item(cursor)) then return x
        while 1
          result.push(x)
          x = item(cursor)
          if not x then break
        result
  • ¶

    combinator may: a.k.a. optional
    try to match item(cursor); whether item(cursor) succeeds or not, may(item)(start) succeeds.

    exports.may = exports.optional = (item) ->
      if not isMatcher(item) then item = literal(item)
      (start) ->
        cursor = start
        if x = item(cursor) then x
        else cursor = start; true
  • ¶

    combinator follow
    try to match item(start); if it succeeds, return the value of item(start).
    Whether it succeeds or not, the cursor is reset to start.

    exports.follow = (item) ->
      if not isMatcher(item) then item = literal(item)
      (start) ->
        x = item(start)
        cursor = start
        x
  • ¶

    combinator times: match $n times item(cursor), n>=1

    exports.times = (item, n) ->
      if not isMatcher(item) then item = literal(item)
      (start) ->
        cursor = start; i = 0; result = []
        while i++<n
          if x = item(cursor) then result.push(x)
          else return
        result
  • ¶

    combinator seperatedList: one or more occurrences of item(cursor), separated by $separator

    exports.seperatedList = (item, separator=spaces) ->
      if not isMatcher(item) then item = literal(item)
      if not isMatcher(separator) then separator = literal(separator)
      (start) ->
        cursor = start
        result = []
        x = item(cursor)
        if not x then return
        while 1
          result.push(x)
          if not separator(cursor) then break
          if not(x = item(cursor)) then break
        result
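
  • ¶

    For example, in your grammar setup code (run after this module has finished loading, e.g. inside a parse function like the sample one above), a matcher for a comma separated identifier list could be built like this sketch; no spaces are allowed around the commas here:

    identifierList = exports.seperatedList(exports.identifier, exports.char(','))
    # identifierList(0) matches input like "a,b,c" and returns an array of results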
  • ¶

    combinator timesSeperatedList: $n occurrences of $item separated by $separator, n>=1

    exports.timesSeperatedList = (item, n, separator=spaces) ->
      if not isMatcher(item) then item = literal(item)
      if not isMatcher(separator) then separator = literal(separator)
      (start) ->
        cursor = start
        result = []
        x = item(cursor)
        if not x then return
        result.push(x)
        i = 1
        while i++<n
          if not separator(cursor) then return
          if not(x = item(cursor)) then return
          result.push(x)
        result
  • ¶

    Just as with andp and orp, in the manner of Peasy you would rather write a loop yourself than use combinators like any, some, times, seperatedList, etc.; that is simpler, faster and more elegant.
    And in that manner you have more control over your grammar rule, so you can also do other things there, including lexing, error reporting, error recovery, semantic operations, and anything else you would like to do. See the sketch below.
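
  • ¶

    For instance, instead of some(digit), the same thing written as a plain loop inside a rule might look like this sketch (the rule name number is invented), with room for extra checks or error reporting in the loop body:

    number = (start) ->
      # a hand written loop instead of some(digit): one or more digits
      cursor = start
      cursor++ while '0'<=text[cursor]<='9'
      if cursor==start
        # nothing matched: a natural place for error reporting or recovery
        return
      # semantic action: return the matched number text
      text.slice(start, cursor)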

  • ¶

    There are three motives for putting all of the stuff above here: 1. to demonstrate how to write matchers and grammar rules in the method brought by Peasy; 2. to demonstrate that Peasy can implement any component that other combinator parser libraries like pyparsing and parsec provide, but in a simpler and faster manner; 3. to show how easy it is to write matchers in the manner of Peasy.

  • ¶

    As you have seen above, all of these utilities are so simple that you can easily write them by hand yourself. In fact, you can write all of your grammar rules in the same manner.

  • ¶

    If you like, you can add a faster version of every matcher which does not pass $start around as a parameter.
    Some of the matchers below have two versions, to demonstrate how to do that. Don't use the faster versions inside orp, any, some, times, seperatedList or timesSeperatedList.

  • ¶

    some other matchers, combinators and predicates

    A predicate is a function which returns true or false, usually according to its parameters, without looking at the parsed object. Below are some little utilities that may be useful; two or three versions of some of them are provided.
    Just remove them if you don't need them, except literal, which the matchers above depend on.

  • ¶

    matcher literal, normal version
    match a text string.
    Notice: some combinators like andp, orp, notp, any, some, etc. use literal to wrap an object which is not a matcher.

    exports.literal = literal = (string) -> (start) ->
      len = string.length
      if text.slice(start, stop = start+len)==string then cursor = stop; true
  • ¶

    matcher literal_, faster version
    match a text string.

    exports.literal_ = literal_ = (string) -> (start) ->
      len = string.length
      if text.slice(cursor,  stop = cursor+len)==string then cursor = stop; true
  • ¶

    matcher char, normal version
    match one character

    exports.char = char = (c) -> (start) ->
      if text[start]==c then cursor = start+1; return c
  • ¶

    matcher char_, faster version
    match one character

    exports.char_ = (c) -> () ->
      if text[cursor]==c then cursor++; return c
  • ¶

    In spaces, spaces_, spaces1 and spaces1_, a tab ('\t') is counted as tabWidth spaces,
    which is useful for indentation based languages, such as CoffeeScript, Python, Haskell, etc.
    If you don't need this feature, you can easily rewrite these utilities to remove the tab width code yourself.
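
  • ¶

    tabWidth is used by these matchers but is never declared in this file; a minimal declaration (the value 4 is only an assumed default) would be:

    tabWidth = 4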

  • ¶

    matcher spaces, normal version
    zero or more whitespaces, ie. space or tab.

    exports.spaces = spaces = (start) ->
      len = 0
      cursor = start
      while 1
        c = text[cursor]
        if c==' ' then len++
        else if c=='\t' then len += tabWidth
        else break
        cursor++
      len
  • ¶

    matcher spaces_, faster version
    zero or more whitespaces, ie. space or tab.

    exports.spaces_ = () ->
      len = 0
      while 1
        c = text[cursor]
        if c==' ' then len++
        else if c=='\t' then len += tabWidth
        else break
        cursor++
      len
  • ¶

    matcher spaces1, normal version
    one or more whitespaces, ie. space or tab.

    exports.spaces1 = (start) ->
      len = 0
      cursor = start
      while 1
        c = text[cursor]
        if c==' ' then len++
        else if c=='\t' then len += tabWidth
        else break
        cursor++
      if len then len
  • ¶

    matcher spaces1_, faster version
    one or more whitespaces, ie. space or tab.

    exports.spaces1_ = () ->
      len = 0
      while 1
        c = text[cursor]
        if c==' ' then len++
        else if c=='\t' then len += tabWidth
        else break
        cursor++
      if len then len
  • ¶

    matcher wrap, normal version
    match left, then match item, match right at last

    exports.wrap = (item, left=spaces, right=spaces) ->
      if not isMatcher(item) then item = literal(item)
      (start) ->
        if left(start) and (result = item(cursor)) and right(cursor) then result
  • ¶

    matcher identifierLetter, normal version
    is the character a letter that can be used in an identifier?
    JavaScript style: '$' is an identifier letter

    identifierLetter = (start) ->
      cursor = start
      c = text[cursor]
      if c is '$' or c is '_' or 'a'<=c<='z' or 'A'<=c<='Z' or '0'<=c<='9'
        cursor++; true
  • ¶

    matcher identifierLetter_, faster version
    is the character a letter that can be used in an identifier?
    JavaScript style: '$' is an identifier letter

    identifierLetter_ = () ->
      c = text[cursor]
      if c is '$' or c is '_' or 'a'<=c<='z' or 'A'<=c<='Z' or '0'<=c<='9'
        cursor++; true
  • ¶

    matcher followIdentifierLetter_, faster version
    look ahead whether the following character is a letter that can be used in an identifier; it does not change the cursor.
    JavaScript style: '$' is an identifier letter

    followIdentifierLetter_ = () ->
      c = text[cursor]
      c is '$' or c is '_' or 'a'<=c<='z' or 'A'<=c<='Z' or '0'<=c<='9'
    
    isIdentifierLetter = (c) -> 'a'<=c<='z' or 'A'<=c<='Z' or '0'<=c<='9' or c=='$' or c=='_'
  • ¶

    predicate isdigit

    exports.isdigit = (c) -> '0'<=c<='9'
  • ¶

    matcher digit, normal version

    exports.digit = (start) ->
      c = text[start];  if '0'<=c<='9' then cursor = start+1
  • ¶

    matcher digit_, faster version

    exports.digit_ = () ->
      c = text[cursor];  if '0'<=c<='9' then cursor++; c
  • ¶

    predicate isletter

    exports.isletter = exports.isalpha = (c) -> 'a'<=c<='z' or 'A'<=c<='Z'
  • ¶

    matcher letter, normal version

    exports.letter = exports.alpha = (start) ->
      c = text[start]; if 'a'<=c<='z' or 'A'<=c<='Z' then cursor = start+1
  • ¶

    matcher letter_, faster version

    exports.letter_ = exports.alpha_ = () ->
      c = text[cursor]; if 'a'<=c<='z' or 'A'<=c<='Z' then cursor++; c
  • ¶

    predicate: islower

    exports.islower = (c) -> 'a'<=c<='z'
  • ¶

    matcher lower, normal version

    exports.lower = (start) ->
      c = text[start]; if 'a'<=c<='z' then cursor = start+1
  • ¶

    matcher lower_, faster version

    exports.lower_ = () ->
      c = text[cursor]; if 'a'<=c<='z' then cursor++; c
    
    exports.isupper = (c) ->'A'<=c<='Z'
  • ¶

    matcher upper, normal version

    exports.upper = (start) ->  c = text[start]; if 'A'<=c<='Z' then cursor = start+1
  • ¶

    matcher upper_, faster version

    exports.upper_ = () -> c = text[cursor]; if 'A'<=c<='Z' then cursor++; c
  • ¶

    matcher identifier, normal version

    exports.identifier = (start) ->
      cursor = start
      c = text[cursor]
      if 'a'<=c<='z' or 'A'<=c<='Z' or c=='$' or c=='_' then cursor++
      else return
      while 1
        c = text[cursor]
        if 'a'<=c<='z' or 'A'<=c<='Z' or '0'<=c<='9' or c=='$' or c=='_' then cursor++
        else break
      true
  • ¶

    matcher identifier_, faster version

    exports.identifier_ = () ->
      c = text[cursor]
      if 'a'<=c<='z' or 'A'<=c<='Z' or c=='$' or c=='_' then cursor++
      else return
      while 1
        c = text[cursor]
        if 'a'<=c<='z' or 'A'<=c<='Z' or '0'<=c<='9' or c=='$' or c=='_' then cursor++
        else break
      true
  • ¶

    The utilities above just provide some examples of how to write matchers for Peasy.
    In fact, it's really easy peasy to write the matchers for your grammar rules yourself.
    see easy peasy

  • ¶

    The utilities below exist for people who want to use this file as an independent module and put the grammar rules in a separate file.

  • ¶

    gettext: get the text being parsed
    If you use this file to contain the grammar rules, just use text directly,
    and use text[cursor] to get the current character, or text.slice(cursor, cursor+n) to get a substring, if text is a string.

    exports.gettext = () -> text
  • ¶

    If you use this file to contain the grammar rules, just use cursor directly: cursor = n, cursor++, cursor--, etc.

    exports.getcursor = exports.cur = () -> cursor
    exports.setcursor = exports.setcur = (pos) -> cursor = pos
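
  • ¶

    A hedged sketch of using Peasy from a separate grammar file; the file name './peasy' and the tiny grammar below are assumptions:

    peasy = require './peasy'

    rules =
      greeting: (start) ->
        text = peasy.gettext()
        if text.slice(start, start+5)=='hello' then peasy.setcur(start+5); 'hello'
      rootSymbol: 'greeting'

    peasy.initialize()
    peasy.parse('hello world', rules)    # returns 'hello'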