On the Design of a Parser

The concrete application we’ll be walking through is a naive parser for bbcode.

In a manner consistent with designing software as compositions of simple morphisms we start by determining the type of our solution (as illustrated by the following mockup):

-- | Our target structure -- a rose tree with an explicit terminal constructor
data DomTree = Element Text (Map Text Text) [DomTree]
             | Content Text
             deriving (Show, Eq)

bbcode :: Text -> Maybe DomTree
-- ^ Parse BBCode

Writing a parser capable of dealing with Text directly from scratch would be unnecessarily abstruse, we’ll be using the attoparsec family of parser combinators instead.

We reproduce an incomplete version of the lexer below (it’s missing tag attributes and self-closing tags).

We introduce escapedText, a helper function for extracting text until we reach one of a set of delimiting characters (exclusive). While doing this we also parse any delimiting character iff it’s prefixed with an escape character (we use \) – the escape character itself needs only be escaped if encountered directly before one of the delimiting characters.

data Token = BBOpen Text -- ^ "[open]"
           | BBClose Text -- ^ "[/close]"
           | BBStr Text -- ^ "text"

token :: Parser [Token]
token = BBClose <$ "[/" <*> escapedText' [']'] <* "]"
        <|> BBOpen <$ "[" <*> escapedText' [']'] <* "]"
        <|> BBStr <$> escapedText ['[']
		
escapedText' :: [Char] -> Parser Text
escapedText' = option "" . escapedText

escapedText :: [Char] -> Parser Text
escapedText [] = takeText -- No delimiting characters -- parse all remaining input
escapedText cs = recurse $ choice [ takeWhile1 (not . special) -- a series of characters we don't treat as special
                                  , escapeSeq -- an escaped delimiter
                                  , escapeChar' -- the escape character
                                  ]
  where
    escapeChar = '\\'
    special = inClass $ escapeChar : cs
    escapeChar' = string $ T.singleton escapeChar
    escapeSeq = escapeChar' *> (T.singleton <$> satisfy special) -- escape character followed by a special character (which encludes the escape character)
    recurse p = mappend <$> p <*> escapedText' cs -- parse a prefix and optionally append another chunk of escapedText

runTokenizer :: Text -> Maybe [Token]
runTokenizer = either (const Nothing) Just . parseOnly (many token <* endOfInput)

We have now reduced the Problem to [Token] -> DomTree. We quickly see that the structure of the problem is that of a fold.

Having realised this we require a function of type Token -> DomTree -> DomTree to recursively build up our target structure.

In general we’ll want to not only keep track of the DomTree during recursion but also maintain a reference to the position at which we’ll be inserting new tokens. This kind of problem is well understood and solved idiomatically by using a zipper (a cursory introduction).

Writing zippers tends to be tedious. We’ll therefore introduce an additional intermediate structure for which an implementation is available readily. The morphism from this new structure (Forest BBLabel) to our DomTree will be almost trivial.

import Data.Tree.Zipper (TreePos, Empty, Full)
import qualified Data.Tree.Zipper as Z

data BBLabel = BBTag Text
             | BBPlain Text

rose :: [BBToken] -> Maybe (Forest BBLabel)
rose = Z.toForest <$> foldM (flip rose') (Z.fromForest [])

rose' :: BBToken -> TreePos Empty BBLabel -> Maybe (TreePos Empty BBLabel)
rose' (BBStr t) = return . Z.nextSpace . Z.insert (Node (BBPlain t) []) -- insert a node with no children and move one step to the right in the forest we’re currently viewing
rose' (BBOpen t) = return . Z.children . Z.insert (Node (BBTag t) []) -- insert the node and move into position to insert it's first child
rose' (BBClose t) = close t -- haskell complains if multiple equations for the same function have a differing number of arguments, therefore: 'close'
  where
    close :: Text -> TreePos Empty BBLabel -> Maybe (TreePos Empty BBLabel)
    close tag pos = do
      pos' <- Z.parent pos -- fail if we're trying to close a tag that does not have a parent (this indicates imbalanced tags)
      let
        pTag = (\(BBTag t) -> t) $ Z.label pos' -- yes, this will fail unceremoniously if the parent is not a tag, this poses no problem since we're constructing the structure ourselves. The proof that this failure mode does not occur is left as an exercise for the reader.
      guard (pTag == tag) -- The structure shows that this mode of failure (opening tags content does not match the closing tags) is not logically required -- it only serves as a *notification* to the user
      return $ Z.nextSpace pos' -- move one level up and to point at the next sibling of the parent

dirty-haskell.org: On the Design of a Parser

It's not crazy — it's having fun with types.