Markdown Parsing in Dart

Markdown content has become pretty ubiquitous, and along with that comes the need to parse it into its constituent parts. Sometimes it’s useful to know what links a document contains, or to count its headings. Writing parsers is super fun in comp sci class, and not too bad with something like ANTLR, but when someone else has already done the work, so much the better. Dart has a very capable Markdown parser that is only missing a little bit of documentation to make it fit this more general requirement. This post fills in that bit of documentation.

The Basics

Markdown is a Dart package that you can find here along with good instructions on how to install it and reference it. Follow those instructions to add the package to your project. Reading the documentation makes it look like all the parser does is convert markdown to HTML for display. It does a good job of that and for many use cases that’s all you’ll need. This post covers the case where an application needs access to the content and structure for something other than display.

Note: If you want to display Markdown in Flutter, there’s a nice Flutter package with a Widget that does all the work for you; that package is here. It is not required for the parsing use case outlined below.

Parsing

Using the Markdown package for parsing instead of conversion or display requires just a little understanding of the internals of the package. In the source code examples and descriptions that follow I use the md prefix for everything that comes from the Markdown package. The key classes are:

  • md.Document – Maintains the context needed to parse a Markdown document.
  • md.NodeVisitor – Defines the interface whose methods are invoked as the nodes of the abstract syntax tree (AST) are visited. The AST describes the components and structure of the Markdown document. For example, Text is a node in the AST of a Markdown document; when a Text node is reached, the visitText method is invoked. This interface follows the Visitor Pattern.
  • md.Node – Base class for AST nodes; each node is either an md.Element or an md.Text.
  • md.Element – A named tag that can contain other nodes as its children.
  • md.Text – A plain text element.
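To get a feel for how these classes fit together, here is a minimal sketch that walks the AST by hand, without a visitor, and writes one indented line per node. The names dumpNodes and describeAst are made up for this illustration; the sketch assumes a null-safe version of the package, where md.Element.children may be null for empty elements.

```dart
import 'package:markdown/markdown.dart' as md;

/// Appends an indented outline of [nodes] to [out], one line per node.
/// (dumpNodes is a hypothetical helper, not part of the package.)
void dumpNodes(List<md.Node> nodes, StringBuffer out, [int depth = 0]) {
  final indent = '  ' * depth;
  for (final node in nodes) {
    if (node is md.Element) {
      out.writeln('$indent<${node.tag}>');
      // children can be null for elements that have no content.
      dumpNodes(node.children ?? const <md.Node>[], out, depth + 1);
    } else if (node is md.Text) {
      out.writeln('$indent"${node.text}"');
    }
  }
}

/// Parses [markdown] and returns the outline as a string.
/// (describeAst is a hypothetical helper, not part of the package.)
String describeAst(String markdown) {
  final nodes = md.Document().parseLines(markdown.split('\n'));
  final out = StringBuffer();
  dumpNodes(nodes, out);
  return out.toString();
}
```

Calling print(describeAst(someMarkdown)) on a small document is a quick way to see which elements nest inside which before writing any visitor code.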

This structure and the method of handling document parse events is pretty common, although it may seem backwards if you haven’t used it before. The process is to implement the methods in md.NodeVisitor, create an md.Document, and feed the Markdown content to that md.Document to parse. The md.Document returns the nodes it found, and calling accept on each node invokes the members of the md.NodeVisitor interface with what is in the tree.

Source Example

This example is a complete parser that will print out every element and section of text as it is encountered while parsing the document.

import 'package:markdown/markdown.dart' as md;

class MarkdownParser implements md.NodeVisitor {

  /// Parses [markdownContent] as Markdown and visits every node of the result.
  void parse(String markdownContent) {
    List<String> lines = markdownContent.split('\n');
    md.Document document = md.Document(encodeHtml: false);
    for (md.Node node in document.parseLines(lines)) {
      node.accept(this);
    }
  }

  // NodeVisitor implementation
  @override
  void visitElementAfter(md.Element element) {
    print('vea: ${element.tag}');
  }

  @override
  bool visitElementBefore(md.Element element) {
    print('veb: ${element.tag}');
    return true;
  }

  @override
  void visitText(md.Text text) {
    print('vet: ${text.textContent}');
  }
}

Most of the magic above happens in the parse method.

  • The first line splits the content into lines. Note that it is pretty naive: it doesn’t handle all the line-ending variations (CR, LF, CRLF), so it should be made more robust depending on the use case.
  • Then an md.Document object is created. There are a number of other parameters to that constructor, in particular for extension handling. See the package documentation for those.
  • The for loop on the next line is where all the parsing work happens. The lines are parsed into a collection of nodes by the md.Document.parseLines method.
  • On the next line the visitor methods are invoked by calling the md.Node.accept method. That causes the appropriate md.NodeVisitor methods to be invoked in the right sequence based on the structure of the AST.
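One way to make the line splitting less naive is to use LineSplitter from dart:convert, which treats \n, \r\n and \r as line endings. This is only a sketch of that idea; splitLines and parseMarkdown are names invented for the example.

```dart
import 'dart:convert';

import 'package:markdown/markdown.dart' as md;

/// Splits [content] on \n, \r\n, or \r, unlike a plain split('\n').
/// (splitLines is a hypothetical helper.)
List<String> splitLines(String content) =>
    const LineSplitter().convert(content);

/// Parses [markdownContent] into AST nodes regardless of its line endings.
/// (parseMarkdown is a hypothetical helper.)
List<md.Node> parseMarkdown(String markdownContent) {
  final document = md.Document(encodeHtml: false);
  return document.parseLines(splitLines(markdownContent));
}
```

With this in place, the parse method can loop over parseMarkdown(markdownContent) instead of splitting the string itself.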

Making use of this to do something useful in the context of a particular application comes down to recording the important information within the various visit methods of the md.NodeVisitor implementation. What is important will vary depending on the goal. For example, if an application needs a list of links referenced in a Markdown document then the content of the anchor element is important.
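As a sketch of the link example, the visitor below records the href attribute of every 'a' element it sees. LinkCollector is a name made up for this post, and the sketch assumes inline links written as [text](url); other link forms may need package extensions.

```dart
import 'package:markdown/markdown.dart' as md;

/// Collects the destination of every link in a Markdown document.
/// (LinkCollector is a hypothetical class, not part of the package.)
class LinkCollector implements md.NodeVisitor {
  final List<String> links = [];

  /// Parses [markdownContent] and returns the link destinations in order.
  List<String> collect(String markdownContent) {
    links.clear();
    final document = md.Document(encodeHtml: false);
    for (final node in document.parseLines(markdownContent.split('\n'))) {
      node.accept(this);
    }
    return links;
  }

  @override
  bool visitElementBefore(md.Element element) {
    // Links show up as 'a' elements with the target in the href attribute.
    final href = element.attributes['href'];
    if (element.tag == 'a' && href != null) {
      links.add(href);
    }
    return true; // keep descending; links are nested inside block elements
  }

  @override
  void visitElementAfter(md.Element element) {}

  @override
  void visitText(md.Text text) {}
}
```

Something like LinkCollector().collect(markdownContent) then gives the application its list of referenced links.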

The md.NodeVisitor methods are invoked as follows:

  • visitElementBefore – When an md.Element has been reached, before its children have been visited. Return false to skip its children.
  • visitElementAfter – When an md.Element has been reached, after its children have been visited. Not called if visitElementBefore returns false.
  • visitText – When a Text node has been reached.
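The return value of visitElementBefore is handy when an element’s children aren’t interesting on their own. The sketch below collects headings: it grabs the heading’s text in one go with textContent and returns false so the children are not visited again. HeadingCollector is a hypothetical name used only for this example.

```dart
import 'package:markdown/markdown.dart' as md;

/// Collects headings as 'tag: text' strings.
/// (HeadingCollector is a hypothetical class, not part of the package.)
class HeadingCollector implements md.NodeVisitor {
  static const _headingTags = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'};

  final List<String> headings = [];

  /// Parses [markdownContent] and returns its headings in document order.
  List<String> collect(String markdownContent) {
    headings.clear();
    final document = md.Document(encodeHtml: false);
    for (final node in document.parseLines(markdownContent.split('\n'))) {
      node.accept(this);
    }
    return headings;
  }

  @override
  bool visitElementBefore(md.Element element) {
    if (_headingTags.contains(element.tag)) {
      // textContent joins the text of all descendants, so there is no
      // need to visit the children separately.
      headings.add('${element.tag}: ${element.textContent}');
      return false; // skips the children and visitElementAfter
    }
    return true;
  }

  @override
  void visitElementAfter(md.Element element) {}

  @override
  void visitText(md.Text text) {}
}
```

Counting the headings is then just a matter of taking the length of the returned list.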

Markdown Tags

When an md.Element is visited it contains information about the tag that triggered it. That tag is available as a string in the md.Element.tag property. So far I haven’t found a definitive list of the tags that could appear there; the best potential source is probably this. From there, here’s a possible list of the block tags that matches what I’ve seen so far. I haven’t yet found a similar list of inline tags.

const _blockTags = [
  'blockquote',
  'h1',
  'h2',
  'h3',
  'h4',
  'h5',
  'h6',
  'hr',
  'li',
  'ol',
  'p',
  'pre',
  'ul',
];
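In the absence of a definitive list of inline tags, one option is to discover them empirically: visit a representative document, record every tag, and drop the ones that are in the block list. This is just a sketch of that idea; TagCollector, inlineTagsIn and the public blockTags copy of the list above are names invented for the example.

```dart
import 'package:markdown/markdown.dart' as md;

/// Copy of the block tag list above, kept here so the sketch is self-contained.
const blockTags = {
  'blockquote', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'hr', 'li', 'ol', 'p', 'pre', 'ul',
};

/// Records every distinct tag seen while visiting a document.
/// (TagCollector is a hypothetical class, not part of the package.)
class TagCollector implements md.NodeVisitor {
  final Set<String> tags = {};

  @override
  bool visitElementBefore(md.Element element) {
    tags.add(element.tag);
    return true;
  }

  @override
  void visitElementAfter(md.Element element) {}

  @override
  void visitText(md.Text text) {}
}

/// Returns the tags found in [markdownContent] that are not block tags.
/// (inlineTagsIn is a hypothetical helper.)
Set<String> inlineTagsIn(String markdownContent) {
  final collector = TagCollector();
  final document = md.Document(encodeHtml: false);
  for (final node in document.parseLines(markdownContent.split('\n'))) {
    node.accept(collector);
  }
  return collector.tags.where((tag) => !blockTags.contains(tag)).toSet();
}
```

The result only covers the syntax that actually appears in the sample document, so it is a starting point rather than a complete list.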

Hope this helps, happy parsing!
