3 Architecture

§ 3.1 latexml architecture

Like TeX, latexml is data-driven: the text and executable control sequences (i.e., macros and primitives) in the source file (and any packages loaded) direct the processing. In LaTeXML, the user controls and customizes the conversion by providing alternative bindings of the control sequences and packages, by declaring properties of the desired document structure, and by defining rewrite rules to be applied to the constructed document tree.

The top-level class, LaTeXML, manages the processing, providing several methods for converting a TeX document or string into an XML document, with varying degrees of postprocessing, and for writing the document to a file. It binds a (LaTeXML::Core::)State object (to $STATE) to maintain the current state of bindings for control sequence definitions and to emulate TeX’s scoping rules. The processing is broken into the following stages:

Digestion

the TeX-like digestion phase which converts the input into boxes.

Construction

converts the resulting boxes into an XML DOM.

Rewriting

applies rewrite rules to modify the DOM.

Math Parsing

parses the tokenized mathematics.

Serialization

converts the XML DOM to a string, or writes to file.

§ 3.1.1 Digestion

Digestion is carried out primarily in pull mode: the (LaTeXML::Core::)Stomach pulls expanded (LaTeXML::Core::)Tokens from the (LaTeXML::Core::)Gullet, which itself pulls Tokens from the (LaTeXML::Core::)Mouth. The Mouth converts characters from the plain text input into Tokens according to the current catcodes (category codes) assigned to them (as bound in the State). The Gullet is responsible for expanding Macros, that is, control sequences currently bound to (LaTeXML::Core::Definition::)Expandables, and for parsing sequences of tokens into common core datatypes ((LaTeXML::Common::)Number, (LaTeXML::Common::)Dimension, etc.). See 4.1.1 for how to define macros and affect expansion.
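The pull arrangement described above can be illustrated with a small Python sketch built on generators. This is a toy model only, not LaTeXML's implementation: the function names `mouth` and `gullet`, the `MACROS` table and its contents are all hypothetical, and catcodes are reduced to "a backslash followed by letters is a control sequence".

```python
# Toy model of pull-mode processing; none of these names are LaTeXML's API.

# Hypothetical macro table: control sequence -> replacement tokens.
MACROS = {"\\hello": ["H", "i", "\\bang"], "\\bang": ["!"]}


def mouth(text):
    """Convert characters into tokens, one at a time, on demand."""
    i = 0
    while i < len(text):
        if text[i] == "\\":
            j = i + 1
            while j < len(text) and text[j].isalpha():
                j += 1
            # A backslash followed by a non-letter forms a one-character name.
            if j == i + 1 and j < len(text):
                j += 1
            yield text[i:j]
            i = j
        else:
            yield text[i]
            i += 1


def gullet(tokens):
    """Pull tokens from the mouth and expand macros until none remain."""
    pushback = []  # expansions are pushed back and re-read, so they may nest
    while True:
        if pushback:
            tok = pushback.pop()
        else:
            tok = next(tokens, None)
            if tok is None:
                return
        if tok in MACROS:
            pushback.extend(reversed(MACROS[tok]))
        else:
            yield tok


# The consumer (standing in for the Stomach) drives everything by pulling.
expanded = list(gullet(mouth("a\\hello b")))
```

Nothing is tokenized until the consumer asks for the next token, which is why a change of state made while processing one token can influence how later characters are read. Unlike TeX, this sketch does not skip the space following a control word.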

The Stomach then digests these tokens by executing (LaTeXML::Core::Definition::)Primitive control sequences, usually for side effect, but often for converting material into (LaTeXML::Core::)Lists of (LaTeXML::Core::)Boxes and (LaTeXML::Core::)Whatsits. (A Macro should never digest.) Normally, textual tokens are converted to Boxes in the current font. The main (intentional) deviation of LaTeXML’s digestion from that of TeX is the introduction of a new type of definition, a (LaTeXML::Core::Definition::)Constructor, responsible for constructing XML fragments. A control sequence bound to a Constructor is digested by reading and processing its arguments and wrapping these up in a Whatsit. Before- and after-daemons, essentially anonymous primitives, associated with the Constructor are executed before and after digesting the Constructor arguments’ markup; they can affect the context of that digestion, as well as augment the Whatsit with additional properties. See 4.1.2 for how to define primitives and affect digestion.
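As a rough illustration of the distinction between primitives, constructors and daemons, the following Python sketch digests a flat token list. It is a simplified model under stated assumptions, not LaTeXML's code: the classes `Box`, `Whatsit` and `State`, the control sequences `\bf` and `\emph`, and the daemon functions are all hypothetical, and each constructor takes exactly one braced argument.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    text: str
    font: str


@dataclass
class Whatsit:
    constructor: str
    args: list
    properties: dict = field(default_factory=dict)


class State:
    def __init__(self):
        self.font = "roman"


def set_bold(state):
    """Hypothetical primitive: executed purely for its side effect."""
    state.font = "bold"


def emph_before(state):
    """Hypothetical before-daemon: changes the context for the argument."""
    saved = state.font
    state.font = "italic"
    return saved


def emph_after(state, whatsit, saved):
    """Hypothetical after-daemon: restores context, annotates the Whatsit."""
    state.font = saved
    whatsit.properties["font"] = "italic"


PRIMITIVES = {"\\bf": set_bold}
CONSTRUCTORS = {"\\emph": (emph_before, emph_after)}


def read_group(tokens, pos):
    """Return the tokens between balanced braces starting at pos."""
    if pos >= len(tokens) or tokens[pos] != "{":
        raise ValueError("expected '{'")
    depth, start = 1, pos + 1
    pos += 1
    while pos < len(tokens):
        if tokens[pos] == "{":
            depth += 1
        elif tokens[pos] == "}":
            depth -= 1
            if depth == 0:
                return tokens[start:pos], pos + 1
        pos += 1
    raise ValueError("unbalanced group")


def digest(tokens, state):
    """Turn tokens into a list of Boxes and Whatsits."""
    out, pos = [], 0
    while pos < len(tokens):
        tok = tokens[pos]
        pos += 1
        if tok in PRIMITIVES:
            PRIMITIVES[tok](state)
        elif tok in CONSTRUCTORS:
            before, after = CONSTRUCTORS[tok]
            arg, pos = read_group(tokens, pos)
            whatsit = Whatsit(tok, [])
            saved = before(state)
            whatsit.args = digest(arg, state)  # arguments are digested too
            after(state, whatsit, saved)
            out.append(whatsit)
        else:
            out.append(Box(tok, state.font))  # text becomes a Box
    return out


digested = digest(["a", "\\bf", "b", "\\emph", "{", "c", "}", "d"], State())
```

Note that the constructor produces no XML here; it only records what was seen, together with its digested argument and properties, so that the XML can be built in a later stage.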

§ 3.1.2 Construction

Given the List of Boxes and Whatsits, we proceed to constructing an XML document. This consists of creating a (LaTeXML::Core::)Document object, containing a libxml2 document, and having it absorb the digested material. Absorbing a Box converts it to text content, with provision made to track and set the current font. A Whatsit is absorbed by invoking the associated Constructor to insert an appropriate XML fragment, including elements and attributes, and recursively processing its arguments as necessary. See 4.1.3 for how to define constructors.
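The absorption step can be sketched in Python with the standard library's `xml.etree.ElementTree` standing in for libxml2. This is an illustrative model only: the `Box` and `Whatsit` classes, the `element` and `attributes` fields, and the functions `absorb` and `construct` are assumptions made for the example, and font tracking is omitted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Box:
    text: str


@dataclass
class Whatsit:
    element: str  # element the toy constructor will create
    args: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


def append_text(node, text):
    """Append text at the end of node's content."""
    if len(node):
        last = node[-1]
        last.tail = (last.tail or "") + text
    else:
        node.text = (node.text or "") + text


def absorb(node, item):
    """Absorb one digested item into the tree under node."""
    if isinstance(item, Box):
        append_text(node, item.text)  # a Box becomes text content
    else:
        # A Whatsit inserts an element, then its arguments are absorbed.
        child = ET.SubElement(node, item.element, item.attributes)
        for arg in item.args:
            absorb(child, arg)


def construct(items):
    root = ET.Element("document")
    for item in items:
        absorb(root, item)
    return root


root = construct(
    [Box("a "), Whatsit("emph", [Box("b")], {"font": "italic"}), Box(" c")]
)
xml_text = ET.tostring(root, encoding="unicode")
```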

A (LaTeXML::Common::)Model is maintained throughout the digestion phase, which accumulates any document model declarations, in particular the document type (RelaxNG is preferred, but DTD is also supported). As LaTeX markup is more like HTML than XML, additional declarations may be used (see Tag in (LaTeXML::)Package) to indicate which elements may be automatically opened or closed when needed to build a document tree that matches the document type. As an example, a <subsection> will automatically be closed when a <section> is begun. Additionally, extra bits of code can be executed whenever particular elements are opened or closed (also specified by Tag). See 4.1.4 for how to affect the schema.
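The auto-closing behaviour can be pictured with a small Python sketch. It is a toy under explicit assumptions: the `ALLOWED` content model, the `AUTO_CLOSE` set and the `Builder` class are hypothetical and far simpler than a real schema, and the `log` list merely stands in for code that would run when an element is closed.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: element -> elements it may directly contain.
ALLOWED = {
    "document": {"section"},
    "section": {"subsection", "para"},
    "subsection": {"para"},
    "para": set(),
}
# Hypothetical declaration of which elements may be closed implicitly.
AUTO_CLOSE = {"section", "subsection", "para"}


class Builder:
    def __init__(self):
        self.root = ET.Element("document")
        self.stack = [self.root]  # currently open elements
        self.log = []  # stand-in for "on close" hooks

    def close(self):
        node = self.stack.pop()
        self.log.append("closed " + node.tag)

    def open(self, tag):
        """Open tag, closing auto-closeable elements until it is allowed."""
        while tag not in ALLOWED[self.stack[-1].tag]:
            if self.stack[-1].tag not in AUTO_CLOSE:
                raise ValueError(f"<{tag}> not allowed here")
            self.close()
        node = ET.SubElement(self.stack[-1], tag)
        self.stack.append(node)
        return node


b = Builder()
for tag in ["section", "subsection", "para", "section"]:
    b.open(tag)
```

Opening the second `section` closes the open `para`, `subsection` and first `section` in turn, because none of them may contain a `section` and all are declared auto-closeable.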

§ 3.1.3 Rewriting

Once the basic document is constructed, (LaTeXML::Core::)Rewrite rules are applied, which can perform various functions. Ligatures, and the combining of mathematical digits and letters (in certain fonts) into composite math tokens, are handled this way. Additionally, declarations of the type or grammatical role of math tokens can be applied here. See 4.1.5 for how to define rewrite rules.
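To make the idea of combining digits into a composite token concrete, here is a minimal Python sketch of one such rule applied to an already built tree. The element name `tok`, the attribute `role` with value `NUMBER`, and the function `merge_digits` are assumptions chosen for the example; they are not taken from LaTeXML's rule set.

```python
import xml.etree.ElementTree as ET


def merge_digits(math):
    """Merge runs of adjacent digit tokens into a single number token."""
    merged = []
    for tok in list(math):
        is_digit = (tok.text or "").isdigit()
        prev = merged[-1] if merged else None
        if is_digit and prev is not None and prev.get("role") == "NUMBER":
            prev.text += tok.text  # extend the number being built
        else:
            if is_digit:
                tok.set("role", "NUMBER")  # declare the token's role
            merged.append(tok)
    math[:] = merged  # replace the children in place
    return math


math = ET.fromstring(
    "<math><tok>1</tok><tok>2</tok><tok>+</tok><tok>x</tok><tok>3</tok></math>"
)
merge_digits(math)
tokens = [(t.text, t.get("role")) for t in math]
```

The rule both restructures the tree (two tokens become one) and annotates it (the role attribute), which are the two kinds of work the paragraph above attributes to rewriting.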

§ 3.1.4 Math Parsing

After rewriting, a grammar-based parser is applied to the mathematical nodes in order to infer at least the structure of the expressions, if not their meaning. Mathematics parsing, and how to control it, is covered in detail in Chapter 5.
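What "inferring structure" means can be shown with a very small recursive-descent parser in Python. The grammar here (sums, products and parentheses over single tokens) and the function `parse_math` are hypothetical and bear no relation to the actual math grammar; the sketch only shows a flat token sequence becoming a tree.

```python
def parse_math(tokens):
    """Parse a flat token list into nested (operator, left, right) tuples.

    Toy grammar:
        expr   := term ('+' term)*
        term   := factor ('*' factor)*
        factor := atom | '(' expr ')'
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() is None:
            raise ValueError("unexpected end of input")
        tok = advance()
        if tok == "(":
            inner = expr()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return inner
        if tok in {"+", "*", ")"}:
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def term():
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expr():
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


tree = parse_math(["a", "+", "b", "*", "c"])
```

The result records that the product binds more tightly than the sum, which is structural information; it says nothing about what `a`, `b` or `c` mean.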

§ 3.1.5 Serialization

Here, we simply convert the DOM into string form and output it.