4.1 LaTeXML Customization

§ 4.1.3 Construction & Constructors

Constructors are where things get interesting, but also complex; they are responsible for defining how the XML is built. There are basic constructors corresponding to normal control sequences, as well as environments. Mathematics generally comes down to constructors, as well, but is covered in Chapter 5.

Here are a couple of trivial examples of constructors:

   "<ltx:emph>#1</ltx:emph>", mode=>'text');
    beforeDigest=>sub{ Let('\\\\','\@block@cr');});
    properties=> sub {
      ($_[1] ? (refnum=>$_[1]) : RefStepCounter('footnote')) });


The $replacement for a constructor describes the XML to be generated during the construction phase. It can either be a string representing the XML pattern (described below), or a subroutine CODE($document,$arg1,...props) receiving the arguments and properties from the Whatsit; it would invoke the methods of Document to construct the desired XML.

At its simplest, the XML pattern is just a serialization of the desired XML. For more expressivity, XML trees, text content, attributes and attribute values can be effectively 'interpolated' into the XML being constructed by use of the following expressions:

  • #1, #2, … #name returns the construction of the numbered argument or named property of the Whatsit;

  • &function(arg1,arg2,...) invokes the Perl function on the given arguments, arg1,…, returning the result. The arguments should be expressions for values, rather than XML subtrees.

  • ?test(if pattern) or ?test(if pattern)(else pattern) returns the result of either the if or else pattern depending on whether the result of test (typically also an expression) is non-empty;

  • %expression returns a hash (or rather assumes the result is a hash or KeyVals object); this is only allowed within an opening XML tag, where all the key-value pairs are inserted as attributes;

  • ^ if this appears at the beginning of the pattern, the replacement is allowed to float up the current tree to wherever it might be allowed.

In each case, the result of an expression is expected to be either an XML tree, a string or a hash, depending on the context it was used in. In particular, values of attributes are typically given by quoted strings, but expressions within those strings are interpolated into the computed attribute value. The special characters @ # ? % which introduce these expressions can be escaped by preceding with a backslash, when the literal character is desired.
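
As an illustration of the pattern expressions above, here is a minimal sketch of a binding fragment. The control sequences \myemph, \maybeemph and \myref are hypothetical names invented for this example, and the sketch assumes the elements ltx:emph, ltx:text and ltx:ref (with its labelref attribute) are available in the schema in use.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Plain serialization: argument 1 is interpolated as the content of ltx:emph.
# (\myemph is a hypothetical control sequence.)
DefConstructor('\myemph{}',
  "<ltx:emph>#1</ltx:emph>", mode => 'text');

# Conditional with if and else patterns: when the optional argument #1
# is non-empty the content is emphasized, otherwise it is plain text.
# (\maybeemph is a hypothetical control sequence.)
DefConstructor('\maybeemph[]{}',
  "?#1(<ltx:emph>#2</ltx:emph>)(<ltx:text>#2</ltx:text>)");

# Function call interpolated into a quoted attribute value:
# CleanLabel receives argument 1 and its result becomes the value.
# (\myref is a hypothetical control sequence.)
DefConstructor('\myref{}',
  "<ltx:ref labelref='&CleanLabel(#1)'/>");

1;
```

Each pattern is an ordinary Perl string, so double quotes are used here only to allow single-quoted attribute values inside it.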

A subroutine used as the $replacement allows programmatic insertion of XML into, or modification of, the document being constructed. Although one could use LibXML's DOM API to manipulate the document tree, it is strongly recommended to use Document's API wherever possible, as it maintains consistency and manages namespace prefixes. This is particularly true for insertion of new content, setting attributes and finding existing nodes in the tree using XPath.
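
A subroutine replacement might look roughly like the following sketch. The control sequence \mybox and the class value are hypothetical. The code assumes the Document methods openElement, absorb and closeElement, and that ltx:text is acceptable at the point of insertion.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \mybox is a hypothetical control sequence used only for illustration.
DefConstructor('\mybox{}', sub {
    my ($document, $content, %props) = @_;
    # Open the element through the Document API so that the document
    # model and namespace prefixes are handled consistently.
    $document->openElement('ltx:text', class => 'mybox');
    # Let the document construct the digested argument at this point.
    $document->absorb($content);
    $document->closeElement('ltx:text');
    return; });

1;
```

For a case this simple the string pattern "<ltx:text class='mybox'>#1</ltx:text>" would do the same job; the subroutine form pays off when the XML depends on computation or on the existing state of the tree.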


  • mode=>(’math’|’text’) switches to math or text mode, if needed;

  • requireMath=>1, forbidMath=>1 requires, or forbids, this constructor to appear in math mode;

  • bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;

  • font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color;

  • properties=> {hash} | CODE($stomach,$arg1,..) provides a set of properties to store in the Whatsit for eventual use in the constructor $replacement. If a subroutine is used, it also should return a hash of properties;

  • beforeDigest=>CODE($stomach),
    afterDigest=>CODE($stomach,$whatsit) provides code to be digested before and after digesting the arguments of the constructor, typically to alter the context of the digestion (before), or to augment the properties of the Whatsit (after);

  • beforeConstruct=>CODE($document,$whatsit),
    afterConstruct=>CODE($document,$whatsit) provides code to be run before and after the main $replacement is effected; occasionally it is convenient to use the pattern form for the main $replacement, but one still wants to execute a bit of Perl code, as well;

  • captureBody=>(1 | $token) specifies that an additional argument (like an environment body) will be read until the current TeX grouping ends, or until the specified $token is encountered. This argument is available to $replacement as $body;

  • scope=>('global'|'local'|$name) specifies whether this definition is made globally, in the current stack frame (the default), or in a named scope;

  • reversion=>$string|CODE(...), alias=>$cs can be used when the Whatsit needs to be reverted into TeX code, and the default of simply reassembling based on the prototype is not desired. See the code for examples.
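
A few of these options can be combined as in the sketch below. The control sequence \tagged and the property name cls are hypothetical, chosen only for this example; the Whatsit method getProperty is assumed to be available.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \tagged[class]{content} is a hypothetical control sequence.
DefConstructor('\tagged[]{}',
  "<ltx:text class='#cls'>#2</ltx:text>",
  mode    => 'text',                  # switch to text mode if needed
  bounded => 1,                       # keep side effects of digestion local
  font    => { series => 'bold' },    # font for any text created
  # Compute a property from the arguments; the pattern reads it as #cls.
  properties => sub {
    my ($stomach, $opt, $content) = @_;
    (cls => ($opt ? ToString($opt) : 'tagged')); },
  # Runs after the arguments are digested; here only a debugging aid.
  afterDigest => sub {
    my ($stomach, $whatsit) = @_;
    print STDERR "tagged: class is ", $whatsit->getProperty('cls'), "\n";
    return; });

1;
```

Computing cls in properties rather than in the pattern keeps the pattern a plain serialization and leaves the Perl logic in one place.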

Some additional functions useful when writing constructors:

  • ToString($stuff) converts $stuff to a string, hopefully without TeX markup, suitable for use as document content and attribute values. Note that if $stuff contains Whatsits generated by Constructors, it may not be possible to avoid TeX code. Contrast ToString to the following two functions.

  • UnTeX($stuff) returns a string containing the TeX code that would generate $stuff (this might not be the original TeX). The function Revert($stuff) returns the same information as a Tokens list.

  • Stringify($stuff) returns a string more intended for debugging purposes; it reveals more of the structure and type information of the object and its parts.

  • CleanLabel($arg), CleanIndexKey($arg), CleanBibKey($arg),
    CleanURL($arg) cleans up arguments (converting to string, handling invalid characters, etc) to make the argument appropriate for use as an attribute representing a label, index ID, etc.

  • UTF($hex) returns the Unicode character for the given codepoint; this is useful for characters below 0x100 where Perl becomes confused about the encoding.
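
The sketch below shows how some of these functions might be used together. The control sequence \mylink is hypothetical, and the example assumes that ltx:ref accepts href and title attributes in the schema in use.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \mylink{url}{text} is a hypothetical control sequence.
DefConstructor('\mylink Semiverbatim {}',
  "<ltx:ref href='#href' title='#title'>#2</ltx:ref>",
  properties => sub {
    my ($stomach, $url, $text) = @_;
    (href  => CleanURL($url),        # cleaned up for use as a URL attribute
     title => ToString($text)); },   # string form, hopefully free of TeX markup
  afterDigest => sub {
    my ($stomach, $whatsit) = @_;
    # Debugging aid: compare the TeX form with the structural form.
    print STDERR "TeX:       ", UnTeX($whatsit),     "\n";
    print STDERR "Structure: ", Stringify($whatsit), "\n";
    return; });

1;
```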


Environments are largely a special case of constructors, but the prototype starts with {envname} rather than \cmd, and the replacement will typically involve #body, representing the contents of the environment.

DefEnvironment takes the same options as DefConstructor, with the addition of

  • afterDigestBegin=>CODE($stomach,$whatsit) provides code to digest after the \begin{env} is digested;

  • beforeDigestEnd=>CODE($stomach) provides code to digest before the \end{env} is digested.
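
An environment definition along these lines might look like the following sketch. The environment name myquote and the property cls are hypothetical. The beforeDigest hook reuses the Let of \\ to \@block@cr shown in the examples at the start of this section, and the Whatsit methods getArg and setProperty are assumed.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# {myquote}[class] is a hypothetical environment.
DefEnvironment('{myquote}[]',
  "<ltx:quote class='#cls'>#body</ltx:quote>",
  # Before digesting the arguments and body: redefine \\ for block content.
  beforeDigest => sub { Let('\\\\', '\@block@cr'); },
  # After \begin{myquote}[...] is digested: derive a property from the
  # optional argument so that the pattern can use it as #cls.
  afterDigestBegin => sub {
    my ($stomach, $whatsit) = @_;
    my $opt = $whatsit->getArg(1);
    $whatsit->setProperty(cls => ($opt ? ToString($opt) : 'myquote'));
    return; });

1;
```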

For those cases where you do not want an environment to correspond to a constructor, you may still, as in LaTeX, define the two control sequences \envname and \endenvname as you like.
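
As a minimal sketch of that approach, the hypothetical environment shouting below is defined purely through macros, assuming that \begin{shouting} falls back to \shouting and \end{shouting} to \endshouting when no environment constructor exists.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# No DefEnvironment here: 'shouting' is a hypothetical environment
# defined only by its two control sequences.
DefMacro('\shouting',    '\bfseries');   # start of the environment
DefMacro('\endshouting', '');            # nothing special at the end

1;
```

No XML element is created for the environment itself in this form; its contents simply appear in the surrounding context with the font change applied.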