Markdown content has become pretty ubiquitous, and along with that comes the requirement to parse it into its constituent parts. Sometimes it’s useful to know what links it contains, or to count the headings, and so on. Writing parsers is super fun in comp sci class, and not too bad with something like ANTLR, but when someone else has already done the work, so much the better. Dart has a very capable Markdown parser that is only missing a little bit of documentation to make it fit this more general requirement. This post fills in that bit of documentation.
The Basics
Markdown is a Dart package that you can find here along with good instructions on how to install it and reference it. Follow those instructions to add the package to your project. Reading the documentation makes it look like all the parser does is convert markdown to HTML for display. It does a good job of that and for many use cases that’s all you’ll need. This post covers the case where an application needs access to the content and structure for something other than display.
Note: If you want to display Markdown in Flutter, there’s a nice Flutter package with a Widget that does all the work for you. That package is here. It is not required for the parsing use case outlined below.
Parsing
Using the Markdown package for parsing instead of conversion or display requires just a little understanding of the internals of the package. In the source code examples and descriptions that follow I use the md prefix for everything that comes from the Markdown package. The key classes are:
- md.Document: Maintains the context needed to parse a Markdown document.
- md.NodeVisitor: Defines the interface that is invoked as the nodes of the abstract syntax tree (AST) are visited. The AST describes the components and structure of the Markdown document. For example, Text is a node in the AST of a Markdown document, and when a Text node is encountered the visitText method is invoked. This interface follows the Visitor Pattern.
- md.Node: Base class for AST nodes; each node is either an md.Element or an md.Text.
- md.Element: A named tag that can contain other nodes.
- md.Text: A plain text node.
This structure and the method of handling document parse events is pretty common, although it may seem backwards if you haven’t used it before. The process is to implement the methods in md.NodeVisitor, then create an md.Document and feed the Markdown content to it to parse. The md.Document returns the parsed nodes, and each node, when asked to accept the visitor, invokes the members of the md.NodeVisitor interface with what it contains.
Source Example
This example is a complete parser that will print out every element and section of text as it is encountered while parsing the document.
import 'package:markdown/markdown.dart' as md;

class MarkdownParser implements md.NodeVisitor {
  /// Parse all lines as Markdown and visit every resulting node.
  void parse(String markdownContent) {
    List<String> lines = markdownContent.split('\n');
    md.Document document = md.Document(encodeHtml: false);
    for (md.Node node in document.parseLines(lines)) {
      node.accept(this);
    }
  }

  // NodeVisitor implementation

  /// Called after the children of [element] have been visited.
  @override
  void visitElementAfter(md.Element element) {
    print('vea: ${element.tag}');
  }

  /// Called before the children of [element]; returning true visits them.
  @override
  bool visitElementBefore(md.Element element) {
    print('veb: ${element.tag}');
    return true;
  }

  /// Called for every plain text node.
  @override
  void visitText(md.Text text) {
    print('vet: ${text.textContent}');
  }
}
Most of the magic above happens in the parse method.
- The first line splits the content into lines. Note that it is pretty naive: it doesn’t handle all variations of CR/LF, so it should be made more robust depending on the use case.
- Then an md.Document object is created. There are a number of other parameters to that constructor, in particular for extension handling. See the package documentation for those.
- The for loop on the next line is where all the parsing work happens. The lines are parsed into a collection of nodes by the md.Document.parseLines method.
- On the next line the visitor methods are invoked by calling the md.Node.accept method. That causes the appropriate md.NodeVisitor methods to be invoked in the right sequence based on the structure of the AST.
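One way to make the naive split more robust is to lean on LineSplitter from dart:convert, which treats \n, \r\n and a lone \r as line breaks. This is only a sketch; the splitLines helper name is my own and not part of the Markdown package.

```dart
import 'dart:convert';

/// Hypothetical helper: splits [content] into lines, accepting
/// \n, \r\n and a lone \r as line terminators.
List<String> splitLines(String content) {
  return const LineSplitter().convert(content);
}

void main() {
  final lines = splitLines('# Title\r\nfirst\rsecond\nthird');
  print(lines); // [# Title, first, second, third]
}
```

The resulting list can be handed to md.Document.parseLines in place of the result of split('\n').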
Making use of this to do something useful in the context of a particular application comes down to recording the important information within the various visit methods of the md.NodeVisitor implementation. What is important will vary depending on the goal. For example, if an application needs a list of links referenced in a Markdown document, then the a (anchor) element and its href attribute are what matter.
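As a rough sketch of the link example, the visitor below records the href attribute of every a element it sees. The LinkCollector class and extractLinks function are names I made up for illustration; the only package pieces assumed are md.Element.tag and md.Element.attributes.

```dart
import 'package:markdown/markdown.dart' as md;

/// Hypothetical visitor that records the target of every link.
class LinkCollector implements md.NodeVisitor {
  final List<String> links = [];

  @override
  bool visitElementBefore(md.Element element) {
    if (element.tag == 'a') {
      final href = element.attributes['href'];
      if (href != null) links.add(href);
    }
    // Keep descending: links are usually nested inside other elements.
    return true;
  }

  @override
  void visitElementAfter(md.Element element) {}

  @override
  void visitText(md.Text text) {}
}

/// Hypothetical helper: parses [markdownContent] and returns its link targets.
List<String> extractLinks(String markdownContent) {
  final document = md.Document(encodeHtml: false);
  final collector = LinkCollector();
  for (final node in document.parseLines(markdownContent.split('\n'))) {
    node.accept(collector);
  }
  return collector.links;
}

void main() {
  print(extractLinks('See [Dart](https://dart.dev) and [pub](https://pub.dev).'));
}
```

The same shape works for other goals: swap the check on element.tag and decide what to record.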
The md.NodeVisitor methods are invoked as follows:
- visitElementBefore: Called when an md.Element has been reached, before its children have been visited. Return false to skip its children.
- visitElementAfter: Called when an md.Element has been reached, after its children have been visited. Not called if visitElementBefore returns false.
- visitText: Called when an md.Text node has been reached.
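To see that calling sequence in action, here is a small sketch that records an indented outline of the AST and skips the children of code blocks by returning false. OutlineVisitor and outline are hypothetical names, and choosing pre as the tag to skip is just my assumption for the example.

```dart
import 'package:markdown/markdown.dart' as md;

/// Hypothetical visitor that records an indented outline of the AST.
class OutlineVisitor implements md.NodeVisitor {
  final List<String> lines = [];
  int _depth = 0;

  @override
  bool visitElementBefore(md.Element element) {
    lines.add('${'  ' * _depth}<${element.tag}>');
    // Returning false skips the children, and visitElementAfter is not
    // called for this element, so the depth is left unchanged.
    if (element.tag == 'pre') return false;
    _depth++;
    return true;
  }

  @override
  void visitElementAfter(md.Element element) {
    _depth--;
    lines.add('${'  ' * _depth}</${element.tag}>');
  }

  @override
  void visitText(md.Text text) {
    lines.add('${'  ' * _depth}text: ${text.textContent}');
  }
}

/// Hypothetical helper: returns the outline of [markdownContent].
List<String> outline(String markdownContent) {
  final document = md.Document(encodeHtml: false);
  final visitor = OutlineVisitor();
  for (final node in document.parseLines(markdownContent.split('\n'))) {
    node.accept(visitor);
  }
  return visitor.lines;
}

void main() {
  outline('# Hi\n\nSome *styled* text.').forEach(print);
}
```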
Markdown Tags
When an md.Element is visited it contains information about the tag that triggered it. That tag is available as a string in the md.Element.tag property. So far I haven’t found a definitive list of the tags that could appear there; the best potential source is probably this. From there, here’s a possible list of the block tags that matches what I’ve seen so far. I haven’t yet found a similar list of inline tags.
const _blockTags = [
  'blockquote',
  'h1',
  'h2',
  'h3',
  'h4',
  'h5',
  'h6',
  'hr',
  'li',
  'ol',
  'p',
  'pre',
  'ul',
];
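In the absence of a definitive list, one option is to let the parser tell you which tags it produces for your own content. The sketch below collects every distinct tag in a document and reports the ones that are not in the block list above. TagCollector and collectTags are hypothetical names, and the block list is repeated here only so the example stands on its own.

```dart
import 'package:markdown/markdown.dart' as md;

// Copy of the tentative block tag list from above.
const blockTags = [
  'blockquote', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
  'hr', 'li', 'ol', 'p', 'pre', 'ul',
];

/// Hypothetical visitor that records every distinct tag it encounters.
class TagCollector implements md.NodeVisitor {
  final Set<String> tags = {};

  @override
  bool visitElementBefore(md.Element element) {
    tags.add(element.tag);
    return true;
  }

  @override
  void visitElementAfter(md.Element element) {}

  @override
  void visitText(md.Text text) {}
}

/// Hypothetical helper: returns the set of tags found in [markdownContent].
Set<String> collectTags(String markdownContent) {
  final document = md.Document(encodeHtml: false);
  final collector = TagCollector();
  for (final node in document.parseLines(markdownContent.split('\n'))) {
    node.accept(collector);
  }
  return collector.tags;
}

void main() {
  final tags = collectTags('# Title\n\nSome *emphasis* and a [link](https://dart.dev).');
  // Anything not in the block list is a candidate inline tag.
  print(tags.where((tag) => !blockTags.contains(tag)).toList());
}
```

Running this over a representative sample of your own documents gives a practical, if not definitive, list of the tags your application needs to handle.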
Hope this helps, happy parsing!