Several issues have to be dealt with in treating the mathematics. On the one hand, the TeX markup gives a pretty good indication of what the author wants the math to look like, so we would seem to have a good handle on the conversion to presentation forms. On the other hand, content formats are desirable as well, and the markup offers some clues, but too few, about the intent of the mathematics. In fact, generating even Presentation MathML of high quality requires recognizing the mathematical structure, if not the actual semantics. The mathematics processing must therefore preserve the presentational information provided by the author, while inferring, likely with some help, the mathematical content.
From a parsing point of view, the TeX-like processing serves as the lexer,
tokenizing the input, which LaTeXML then parses
(a type-analysis phase may eventually be added).
Of course, there are a few twists.
For one, the tokens, represented by XMTok, can carry extra attributes
such as font and style, but also the name, meaning and grammatical role,
with defaults that can be overridden by the author; more on those in a moment.
Another twist is that, although LaTeX’s math markup is not nearly
as semantic as we might like, there is considerable semantics and structure in the
markup that we can exploit. For example, given a
\frac, we have already
established the numerator and denominator, which can be parsed individually,
while the fraction as a whole can be directly represented as an application,
using XMApp, of a fraction operator; the resulting structure can be treated
as atomic within its containing expression. This structure-preserving character
greatly simplifies the parsing task and helps reduce misinterpretation.
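To make the \frac example concrete, here is a minimal Python sketch of the shape described above: an XMApp whose first child is a fraction operator token, followed by one XMArg per argument. The element names XMTok, XMApp and XMArg come from the text. The helper functions `tok` and `frac` are hypothetical, and the specific attribute values (`FRACOP`, `divide`, `ADDOP`, `UNKNOWN`, `italic`) are illustrative assumptions, not a statement of what LaTeXML actually emits.

```python
import xml.etree.ElementTree as ET


def tok(text=None, **attrs):
    """Build an XMTok element; attrs carry role, meaning, font and so on."""
    el = ET.Element("XMTok", attrs)
    el.text = text
    return el


def frac(numerator_tokens, denominator_tokens):
    """Wrap a fraction as XMApp(operator, XMArg, XMArg)."""
    app = ET.Element("XMApp")
    # Operator token: these attribute values are illustrative assumptions.
    app.append(tok(meaning="divide", role="FRACOP"))
    for part in (numerator_tokens, denominator_tokens):
        arg = ET.SubElement(app, "XMArg")  # each argument can be parsed on its own
        arg.extend(part)
    return app


# Top-level token list for  x + \frac{a}{b}.
# The whole fraction occupies a single slot, so the containing
# expression sees three items rather than a flat run of tokens.
top_level = [
    tok("x", role="UNKNOWN", font="italic"),
    tok("+", role="ADDOP", meaning="plus"),
    frac([tok("a", role="UNKNOWN", font="italic")],
         [tok("b", role="UNKNOWN", font="italic")]),
]

for node in top_level:
    print(ET.tostring(node, encoding="unicode"))
```

The point of the sketch is only that the parser for the outer expression never has to look inside the fraction; the numerator and denominator are separate, smaller parsing problems.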
The parser, invoked by the postprocessor, works only with the top-level lists of lexical tokens, or with those sublists contained in an XMArg. The grammar works primarily through the name and grammatical role. The name is given by an attribute, or by the content when the two are the same. The role (things like ID, FUNCTION, OPERATOR, OPEN, …) is likewise given by an attribute; if that is absent, the name is looked up in a document-specific dictionary (jobname.dict), or in a default dictionary.
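The name and role resolution just described can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than LaTeXML's implementation: the two dictionaries are plain Python stand-ins (the real format of jobname.dict is not shown here), their entries are invented for illustration, the functions `token_name` and `token_role` are hypothetical, and both the precedence of the document dictionary over the default one and the `UNKNOWN` fallback are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for jobname.dict and the default dictionary.
DOCUMENT_DICT = {"f": "FUNCTION"}
DEFAULT_DICT = {"(": "OPEN", "f": "ID", "x": "ID"}


def token_name(tok):
    """The name is the 'name' attribute, else the element's content."""
    return tok.get("name") or (tok.text or "")


def token_role(tok, document_dict=DOCUMENT_DICT, default_dict=DEFAULT_DICT):
    """The role is the 'role' attribute, else a dictionary lookup by name."""
    role = tok.get("role")
    if role is not None:
        return role
    name = token_name(tok)
    # Assumed precedence: the document-specific dictionary wins.
    if name in document_dict:
        return document_dict[name]
    # The fallback value for names found nowhere is an assumption.
    return default_dict.get(name, "UNKNOWN")


def make_tok(text, **attrs):
    """Convenience constructor for an XMTok with the given content."""
    el = ET.Element("XMTok", attrs)
    el.text = text
    return el


f = make_tok("f")                                  # role from document dictionary
paren = make_tok("(")                              # role from default dictionary
grad = make_tok("∇", name="grad", role="OPERATOR")  # role from attribute
print(token_role(f), token_role(paren), token_role(grad))
```

Here the document dictionary lets an author declare that `f` is a function in this particular document, even though the default entry would treat it as an ordinary identifier.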
Additional exceptions that need fuller explanation are:
Spacing and similar markup generates XMHint elements, which are currently ignored during parsing, but probably should not be.