4 Customization

§ 4.1 LaTeXML Customization

This layer of customization deals with modifying the way a LaTeX document is transformed into LaTeXML’s XML, primarily through defining the way that control sequences are handled. In 2.1 the loading of various bindings was described. The facilities described in the following subsections apply in all such cases, whether used to customize the processing of a particular document or to implement a new LaTeX package. We make no attempt to be comprehensive here; please consult the documentation for (LaTeXML::)Global and Package, as well as the binding files included with the system for more guidance.

A LaTeXML binding is actually a Perl module, and as such, a familiarity with Perl is helpful. A binding file will look something like:

  use LaTeXML::Package;
  use strict;
  use warnings;
  # Your code here!
  1;

The final ‘1’ is required; it tells Perl that the module has loaded successfully. In between comes any Perl code you wish, along with the definitions and declarations described here.

Actually, familiarity with Perl is more than merely helpful, as is familiarity with TeX and XML! When writing a binding, you will be programming in all three languages. Of course, you need to know the TeX corresponding to the macros that you intend to implement, but sometimes it is most convenient to implement them completely, or in part, in TeX itself (eg. using DefMacro), rather than in Perl. At the other end, constructors (eg. using DefConstructor) are usually defined by patterns of XML.

§ 4.1.1 Expansion & Macros

DefMacro($prototype,$replacement,%options)

Macros are defined using DefMacro, such as the pointless:

  DefMacro('\mybold{}','\textbf{#1}');

The two arguments to DefMacro we call the prototype and the replacement. In the prototype, the {} specifies a single normal TeX parameter. The replacement is here a string which will be tokenized, and the #1 will be replaced by the tokens of the argument. Presumably the entire result will eventually be further expanded and/or processed.

TeX normally uses #1 in parameter text, and LaTeX has developed a complex scheme in which it is often necessary to peek ahead token by token to recognize optional arguments. We have attempted to develop a more suggestive, and easier to use, notation for parameters. Thus a prototype \foo{} specifies a single normal argument, whereas \foo[]{} would take an optional argument followed by a required one. More complex argument prototypes can be found in Package. As in TeX, the macro’s arguments are neither expanded nor digested until the expansion itself is further expanded or digested.

The macro’s replacement can also be Perl code, typically an anonymous sub, which gets the current Gullet followed by the macro’s arguments as its arguments. It must return a list of Tokens which will be used as the expansion of the macro. The following two examples show alternative ways of writing the above macro:

  DefMacro('\mybold{}', sub {
    my($gullet,$arg)=@_;
    (T_CS('\textbf'),T_BEGIN,$arg,T_END); });

or alternatively

  DefMacro('\mybold{}', sub {
    Invocation(T_CS('\textbf'),$_[1]); });

Generally, the body of the macro should not involve side-effects, assignments or other changes to state other than reading Tokens from the Gullet; of course, the macro may expand into control sequences which do have side-effects.
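As a further sketch, the code form is convenient when the expansion depends on an optional argument. The macro name \myemph and its ‘strong’ keyword are hypothetical, chosen only for illustration; ToString is described in 4.1.3.

```perl
# Hypothetical macro: \myemph[strong]{text} expands to \textbf{text},
# while plain \myemph{text} expands to \emph{text}.
DefMacro('\myemph[]{}', sub {
    my ($gullet, $style, $text) = @_;
    # An unsupplied optional argument arrives as undef.
    my $strong = defined $style && ToString($style) eq 'strong';
    return Invocation(T_CS($strong ? '\textbf' : '\emph'), $text); });
```

The body only inspects its arguments and builds tokens, so it stays free of side-effects as recommended above.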

Tokens, Catcodes and friends

Functions that are useful for dealing with Tokens and writing macros include the following:

  • Constants for the corresponding TeX catcodes:

       CC_ESCAPE, CC_BEGIN,  CC_END,     CC_MATH,
       CC_ALIGN,  CC_EOL,    CC_PARAM,   CC_SUPER,
       CC_SUB,    CC_IGNORE, CC_SPACE,   CC_LETTER,
       CC_OTHER,  CC_ACTIVE, CC_COMMENT, CC_INVALID
  • Constants for tokens with the appropriate content and catcode:

      T_BEGIN, T_END,   T_MATH,  T_ALIGN, T_PARAM,
      T_SUB,   T_SUPER, T_SPACE, T_CR
  • T_LETTER($char), T_OTHER($char), T_ACTIVE($char), create tokens of the appropriate catcode with the given text content.

  • T_CS($cs) creates a control sequence token; the string $cs should typically begin with a backslash.

  • Token($string,$catcode) creates a token with the given content and catcode.

  • Tokens($token,...) creates a (LaTeXML::Core::)Tokens object containing the list of Tokens.

  • Tokenize($string) converts the string to a Tokens, using TeX’s standard catcode assignments.

  • TokenizeInternal($string) like Tokenize, but treating @ as a letter.

  • Explode($string) converts the string to a Tokens where letter characters are given catcode CC_OTHER.

  • Expand($tokens) expands $tokens (a Tokens), returning a Tokens; there should be no expandable tokens in the result.

  • Invocation($cstoken,$arg,...) returns a Tokens representing the sequence needed to invoke $cstoken on the given arguments (each is a Tokens, or undef for an unsupplied optional argument).
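A few of these functions can be combined as in the following sketch. The macro names \shout, \bindingversion and \noted are hypothetical; only the token functions themselves come from the list above.

```perl
# Hypothetical macro: \shout{text} expands to \textbf{text!}
DefMacro('\shout{}', sub {
    my ($gullet, $arg) = @_;
    # '!' is an ordinary punctuation character, hence catcode "other".
    return (T_CS('\textbf'), T_BEGIN, $arg, T_OTHER('!'), T_END); });

# Hypothetical macro: \bindingversion expands to the characters 1.0,
# none of which will be treated as letters.
DefMacro('\bindingversion', sub {
    return Explode('1.0'); });

# Hypothetical macro: \noted{text} prefixes its argument with a bold label.
# Tokenize reads the string with TeX's standard catcodes; the single quotes
# keep Perl from interpreting the backslash.
DefMacro('\noted{}', sub {
    my ($gullet, $arg) = @_;
    return Tokens(Tokenize('\textbf{Note:}'), T_SPACE, $arg); });
```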

§ 4.1.2 Digestion & Primitives

Primitives are processed during the digestion phase in the Stomach, after macro expansion (in the Gullet), and before document construction (in the Document). Our primitives generalize TeX’s notion of primitive; they are used to implement TeX’s primitives, invoke other side effects and to convert Tokens into Boxes, in particular, Unicode strings in a particular font.

Here are a few primitives from TeX.pool:

  DefPrimitive('\begingroup',sub {
    $_[0]->begingroup; });
  DefPrimitive('\endgroup',  sub {
    $_[0]->endgroup; });
  DefPrimitiveI('\batchmode', undef, undef);
  DefPrimitiveI('\OE',        undef, "\x{0152}");
  DefPrimitiveI('\tiny',      undef, undef,
    font=>{size=>5});

Other than for implementing TeX’s own primitives, DefPrimitive is needed less often than DefMacro or DefConstructor. The main thing to keep in mind is that primitives are processed after macro expansion, by the Stomach. They are most useful for side-effects, changing the State.

DefPrimitive($prototype,$replacement,%options)

The replacement is either a string which will be used to create a Box in the current font, or code taking the Stomach and the control sequence arguments as arguments; like macros, these arguments are not expanded or digested by default, so they must be explicitly digested if necessary. The replacement code must either return nothing (eg. ending with return;) or return a list (ie. a Perl list (...)) of digested Boxes or Whatsits.

Options to DefPrimitive are:

  • mode=>(’math’|’text’) switches to math or text mode, if needed;

  • requireMath=>1, forbidMath=>1 requires, or forbids, this primitive to appear in math mode;

  • bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;

  • font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color;

    Note that if the font change should only affect the material digested within this command itself, then bounded=>1 should be used; otherwise, the font change will remain in effect after the command is processed.

  • beforeDigest=>CODE($stomach),
    afterDigest=>CODE($stomach) provides code to be digested before and after processing the main part of the primitive.
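The interplay of font and bounded might be sketched as follows. The control sequences \loud and \loudly are hypothetical, and the sketch assumes that the font option takes effect before the replacement code runs; Digest is described under Other Utilities for Digestion.

```perl
# Hypothetical declaration: \loud switches to bold for the rest of the
# enclosing group, in the style of \tiny above.
DefPrimitiveI('\loud', undef, undef, font => { series => 'bold' });

# Hypothetical command: \loudly{text} digests its argument in bold.
# bounded=>1 wraps the digestion in an implicit group, so the font
# change should not remain in effect after the command.
DefPrimitive('\loudly{}', sub {
    my ($stomach, $arg) = @_;
    # Arguments are not digested by default; do it explicitly.
    return Digest($arg); },
  font => { series => 'bold' }, bounded => 1);
```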

DefRegister(…)

Needs description!

Other Utilities for Digestion

Other functions useful for dealing with digestion and state are important for writing before & after daemons in constructors, as well as in Primitives; we give an overview here:

  • Digest($tokens) digests $tokens (a (LaTeXML::Core::)Tokens), returning a list of Boxes and Whatsits.

  • Let($token1,$token2) gives $token1 the same meaning as $token2, like \let.

Bindings

The following functions are useful for accessing and storing information in the current State. It maintains a stack-like structure that mimics TeX’s approach to binding; braces { and } open and close stack frames. (The Stomach methods bgroup and egroup can be used when explicitly needed.)

  • LookupValue($symbol), AssignValue($symbol,$value,$scope) maintain arbitrary values in the current State, looking up or assigning the current value bound to $symbol (a string). For assignments, the $scope can be ’local’ (the default, if $scope is omitted), which changes the binding in the current stack frame. If $scope is ’global’, it assigns the value globally by undoing all bindings. The $scope can also be another string, which indicates a named scope; that is a more advanced topic.

  • PushValue($symbol,$value,...), PopValue($symbol),
    UnshiftValue($symbol,$value,...), ShiftValue($symbol) These maintain the value of $symbol as a list, with the operations having the same sense as in Perl; modifications are always global.

  • LookupCatcode($char), AssignCatcode($char,$catcode,$scope) maintain the catcodes associated with characters.

  • LookupMeaning($token), LookupDefinition($token) look up the current meaning of the token, being any executable definition bound for it. If there is no such definition, LookupMeaning returns the token itself, whereas LookupDefinition returns undef.
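The following sketch shows how these functions might be used together; all control sequence names and the keys my_short_title and my_keywords are hypothetical.

```perl
# Hypothetical: \setshorttitle{...} records its (undigested) argument.
# The 'global' scope makes the value survive the end of the current group.
DefPrimitive('\setshorttitle{}', sub {
    my ($stomach, $title) = @_;
    AssignValue(my_short_title => $title, 'global');
    return; });

# \shorttitle expands to the stored tokens, or to nothing if unset.
# Reading the State is acceptable inside a macro; changing it is not.
DefMacro('\shorttitle', sub {
    my $title = LookupValue('my_short_title');
    return (defined $title ? $title : ()); });

# Hypothetical: \keyword{...} appends a string to a list value.
# PushValue always modifies the list globally.
DefPrimitive('\keyword{}', sub {
    my ($stomach, $word) = @_;
    PushValue(my_keywords => ToString($word));
    return; });

# Hypothetical: \strongemph makes \emph behave like \textbf until the
# enclosing group closes, since Let acts like \let.
DefPrimitive('\strongemph', sub {
    Let('\emph', '\textbf');
    return; });
```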

Counters

The following functions maintain LaTeX-like counters, and generally also associate an ID with them. A counter’s print form (ie. \theequation for equations) often ends up on the refnum attribute of elements; the associated ID is used for the xml:id attribute.

  • NewCounter($name,$within,options) creates a LaTeX-style counter. When $within is used, the given counter will be reset whenever the counter $within is incremented. This also causes the associated ID to be prefixed with $within’s ID. The option idprefix=>$string causes the ID to be prefixed with that string. For example,

      NewCounter('section', 'document', idprefix=>'S');
      NewCounter('equation','document', idprefix=>'E',
        idwithin=>'section');

    would cause the third equation in the second section to have ID=’S2.E3’.

  • CounterValue($name) returns the Number representing the current value.

  • ResetCounter($name) resets the counter to 0.

  • StepCounter($name) steps the counter (and resets any others ‘within’ it), and returns the expansion of \the$name.

  • RefStepCounter($name) steps the counter and any ID’s associated with it. It returns a hash containing refnum (expansion of \the$name) and id (expansion of \the$name@ID)

  • RefStepID($name) steps the ID associated with the counter, without actually stepping the counter; this is useful for unnumbered units that normally would have both a refnum and ID.
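A counter is typically stepped from the properties of a constructor (see 4.1.3), so that the resulting refnum and id are available to the XML pattern. The sketch below assumes a hypothetical \exercise command and counter, and assumes that ltx:note accepts the mark and xml:id attributes used here.

```perl
# Hypothetical counter: reset at each section, with IDs prefixed by 'X'.
NewCounter('exercise', 'section', idprefix => 'X');

# Hypothetical constructor: RefStepCounter returns a hash containing
# refnum and id, which become properties named #refnum and #id.
DefConstructor('\exercise{}',
  "<ltx:note class='exercise' mark='#refnum' xml:id='#id'>#1</ltx:note>",
  mode       => 'text',
  properties => sub { RefStepCounter('exercise'); });
```

For an unnumbered variant, RefStepID('exercise') could be used in place of RefStepCounter so that an ID is still generated.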

§ 4.1.3 Construction & Constructors

Constructors are where things get interesting, but also complex; they are responsible for defining how the XML is built. There are basic constructors corresponding to normal control sequences, as well as environments. Mathematics generally comes down to constructors, as well, but is covered in Chapter 5.

Here are a couple of trivial examples of constructors:

  DefConstructor('\emph{}',
    "<ltx:emph>#1</ltx:emph>", mode=>'text');
  DefConstructor('\item[]',
    "<ltx:item>?#1(<ltx:tag>#1</ltx:tag>)");
  DefEnvironment('{quote}',
    '<ltx:quote>#body</ltx:quote>',
    beforeDigest=>sub{ Let('\\\\','\@block@cr');});
  DefConstructor('\footnote[]{}',
    "<ltx:note class='footnote' mark='#refnum'>#2</ltx:note>",
    mode=>'text',
    properties=> sub {
      ($_[1] ? (refnum=>$_[1]) : RefStepCounter('footnote')) });

DefConstructor($prototype,$replacement,%options)

The $replacement for a constructor describes the XML to be generated during the construction phase. It can either be a string representing the XML pattern (described below), or a subroutine CODE($document,$arg1,...props) receiving the arguments and properties from the Whatsit; it would invoke the methods of Document to construct the desired XML.

At its simplest, the XML pattern is just a serialization of the desired XML. For more expressivity, XML trees, text content, attributes and attribute values can be effectively ‘interpolated’ into the XML being constructed by use of the following expressions:

  • #1, #2, … #name returns the construction of the numbered argument or named property of the Whatsit;

  • &function(arg1,arg2,...) invokes the Perl function on the given arguments, arg1,…, returning the result. The arguments should be expressions for values, rather than XML subtrees.

  • ?test(if pattern) or ?test(if pattern)(else pattern) returns the result of either the if or else pattern depending on whether the result of test (typically also an expression) is non-empty;

  • %expression returns a hash (or rather assumes the result is a hash or KeyVals object); this is only allowed within an opening XML tag, where all the key-value pairs are inserted as attributes;

  • ^ if this appears at the beginning of the pattern, the replacement is allowed to float up the current tree to wherever it might be allowed.

In each case, the result of an expression is expected to be either an XML tree, a string or a hash, depending on the context it was used in. In particular, values of attributes are typically given by quoted strings, but expressions within those strings are interpolated into the computed attribute value. The special characters @ # ? % which introduce these expressions can be escaped by preceding with a backslash, when the literal character is desired.

A subroutine used as the $replacement allows programmatic insertion of XML into, or modification of, the document being constructed. Although one could use LibXML’s DOM API to manipulate the document tree, it is strongly recommended to use Document’s API wherever possible, as it maintains consistency and manages namespace prefixes. This is particularly true for insertion of new content, setting attributes and finding existing nodes in the tree using XPath.
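The two forms of $replacement might look as follows. The commands \term and \termx are hypothetical, and the choice of ltx:text and ltx:emph is only for illustration.

```perl
# Pattern form: ?test(if pattern)(else pattern).
# With an optional argument, it is interpolated into the class attribute;
# without one, the text is simply emphasized.
DefConstructor('\term[]{}',
  "?#1(<ltx:text class='#1'>#2</ltx:text>)"
    . "(<ltx:emph>#2</ltx:emph>)");

# Code form: the same kind of element built through Document's API.
DefConstructor('\termx{}', sub {
    my ($document, $arg, %props) = @_;
    $document->openElement('ltx:text', class => 'term');
    $document->absorb($arg);    # construct the digested argument here
    $document->closeElement('ltx:text'); });
```

The pattern form is usually shorter; the code form is worth its extra length mainly when the structure depends on computation that a pattern cannot express.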

Options:

  • mode=>(’math’|’text’) switches to math or text mode, if needed;

  • requireMath=>1, forbidMath=>1 requires, or forbids, this constructor to appear in math mode;

  • bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;

  • font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color;

  • properties=> {hash} | CODE($stomach,$arg1,..). provides a set of properties to store in the Whatsit for eventual use in the constructor $replacement. If a subroutine is used, it also should return a hash of properties;

  • beforeDigest=>CODE($stomach),
    afterDigest=>CODE($stomach,$whatsit) provides code to be digested before and after digesting the arguments of the constructor, typically to alter the context of the digestion (before), or to augment the properties of the Whatsit (after);

  • beforeConstruct=>CODE($document,$whatsit),
    afterConstruct=>CODE($document,$whatsit) provides code to be run before and after the main $replacement is effected; occasionally it is convenient to use the pattern form for the main $replacement, but one still wants to execute a bit of Perl code, as well;

  • captureBody=>(1 | $token) specifies that an additional argument (like an environment body) will be read until the current TeX grouping ends, or until the specified $token is encountered. This argument is available to a pattern $replacement as #body;

  • scope=>(’global’|’local’|$name) specifies whether this definition is made globally, or in the current stack frame (default), (or in a named scope);

  • reversion=>$string|CODE(...), alias=>$cs can be used when the Whatsit needs to be reverted into TeX code, and the default of simply reassembling based on the prototype is not desired. See the code for examples.
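Several of these options can be combined in one definition. In this sketch the command \alert and its class values are hypothetical; setProperty and getArg are assumed to be available on the Whatsit.

```perl
DefConstructor('\alert{}',
  "<ltx:text class='#class'>#1</ltx:text>",
  mode       => 'text',     # switch to text mode if needed
  bounded    => 1,          # keep side-effects of digestion local
  properties => { class => 'alert' },
  afterDigest => sub {
    my ($stomach, $whatsit) = @_;
    # Hypothetical refinement: mark alerts whose argument is empty.
    $whatsit->setProperty(class => 'alert empty')
      unless ToString($whatsit->getArg(1));
    return; });
```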

Some additional functions useful when writing constructors:

  • ToString($stuff) converts $stuff to a string, hopefully without TeX markup, suitable for use as document content and attribute values. Note that if $stuff contains Whatsits generated by Constructors, it may not be possible to avoid TeX code. Contrast ToString to the following two functions.

  • UnTeX($stuff) returns a string containing the TeX code that would generate $stuff (this might not be the original TeX). The function Revert($stuff) returns the same information as a Tokens list.

  • Stringify($stuff) returns a string more intended for debugging purposes; it reveals more of the structure and type information of the object and its parts.

  • CleanLabel($arg), CleanIndexKey($arg), CleanBibKey($arg),
    CleanURL($arg) cleans up arguments (converting to string, handling invalid characters, etc) to make the argument appropriate for use as an attribute representing a label, index ID, etc.

  • UTF($hex) returns the Unicode character for the given codepoint; this is useful for characters below 0x100 where Perl becomes confused about the encoding.

DefEnvironment($prototype,$replacement,%options)

Environments are largely a special case of constructors, but the prototype starts with {envname}, rather than \cmd; the replacement will also typically involve #body, representing the contents of the environment.

DefEnvironment takes the same options as DefConstructor, with the addition of

  • afterDigestBegin=>CODE($stomach,$whatsit) provides code to digest after the \begin{env} is digested;

  • beforeDigestEnd=>CODE($stomach) provides code to digest before the \end{env} is digested.

For those cases where you do not want an environment to correspond to a constructor, you may still (as in LaTeX), define the two control sequences \envname and \endenvname as you like.
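Both approaches are sketched below for two hypothetical environments, {aside} and {smallquote}; the class attribute on ltx:quote is an assumption made for illustration.

```perl
# A constructor-style environment: the digested contents replace #body.
DefEnvironment('{aside}',
  "<ltx:quote class='aside'>#body</ltx:quote>");

# A macro-style environment, as in LaTeX: \begin{smallquote} invokes
# \smallquote and \end{smallquote} invokes \endsmallquote.
DefMacro('\smallquote',    '\begin{quote}\small');
DefMacro('\endsmallquote', '\end{quote}');
```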

§ 4.1.4 Document Model

The following declarations are typically only needed when customizing the schema used by LaTeXML.

  • RelaxNGSchema($schema,namespaces) declares that the created XML document should conform to the RelaxNG schema in $schema; a file $schema.rng should be findable in the current search paths. (Note that currently, LaTeXML is unable to directly parse compact notation.)

  • RegisterNamespace($prefix,$url) associates the prefix with the given namespace url. This allows you to use $prefix as a namespace prefix when writing Constructor patterns or XPath expressions.

  • Tag($tag,properties) specifies properties for the given XML $tag. Recognized properties include: autoOpen=>1 indicates that the tag can automatically be opened if needed to create a valid document; autoClose=>1 indicates that the tag can automatically be closed if needed to create a valid document; afterOpen=>$code specifies code to be executed after opening the tag; the code is passed the Document being constructed as well as the Box (or Whatsit) responsible for its creation; afterClose=>$code is similar to afterOpen, but executed after closing the element.
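A minimal sketch of these declarations, assuming a hypothetical namespace URL and a hypothetical element ex:aside that the schema in use has been extended to allow:

```perl
# Hypothetical namespace; the URL is a placeholder.
RegisterNamespace('ex' => 'http://example.org/ns/ex');

# Allow the (hypothetical) element to be closed automatically when
# following content could not otherwise be inserted validly.
Tag('ex:aside', autoClose => 1);

# The registered prefix can now be used in constructor patterns.
DefConstructor('\aside{}', "<ex:aside>#1</ex:aside>");
```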

§ 4.1.5 Rewriting

The following functions are a bit tricky to use (and describe), but can be quite useful in some circumstances.

DefLigature($regexp,%options)

applies a regular expression to substitute text nodes after they are closed; the only option is fontTest=>$code which restricts the ligature to text nodes where the current font passes &$code($font).

DefMathLigature($code,%options)

allows replacement of sequences of math nodes. It applies $code to the current Document and each sequence of math nodes encountered in the document; if a replacement should occur, $code should return a list of the form ($n,$string,attributes) in which case, the text content of the first node is replaced by $string, the given attributes are added, and the following $n-1 nodes are removed.

DefRewrite(%spec)

defines document rewrite rules. These specifications describe what document nodes match:

  • label=>$label restricts to nodes contained within an element whose labels include $label;

  • scope=>$scope generalizes label; the most useful form is a string like ’section:1.3.2’ where it matches the section element whose refnum is 1.3.2;

  • xpath=>$xpath selects nodes matching the given XPath;

  • match=>$tex selects nodes that look like what processing the TeX string $tex would produce;

  • regexp=>$regexp selects text nodes that match the given regular expression.

The following specifications describe what to do with the matched nodes:

  • attributes=>{attr} adds the given attributes to the matching nodes;

  • replace=>$tex replaces the matching nodes with the result of processing the TeX string $tex.
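Combining a selector with an action might look like the following untested sketch; the label sec:intro and the class value are hypothetical.

```perl
# Add an attribute to every ltx:emph element in the document.
DefRewrite(xpath => '//ltx:emph', attributes => { class => 'highlight' });

# Within the unit labelled sec:intro (a hypothetical label), replace
# whatever \TeX produces by whatever \LaTeX produces.
DefRewrite(label => 'sec:intro', match => '\TeX', replace => '\LaTeX');
```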

§ 4.1.6 Packages and Options

The following declarations are useful for defining LaTeXML bindings, including option handling. As when defining LaTeX packages, the following, if needed at all, need to appear in the order shown.

  • DeclareOption($option,$handler) specifies the handler for $option when it is passed to the current package or class. If $option is undef, it defines the default handler, for options that are otherwise unrecognized. $handler can be either a string to be expanded, or a sub which is executed like a primitive.

  • PassOptions($name,$type,options) specifies that the given options should be passed to the package (if $type is sty) or class (if $type is cls) $name, if it is ever loaded.

  • ProcessOptions(keys) processes any options that have been passed to the current package or class. If inorder=>1 is specified, the options will be processed in the order passed to the package (\ProcessOptions*); otherwise they will be processed in the declared order (\ProcessOptions).

  • ExecuteOptions(options) executes the handlers for the given set of options.

  • RequirePackage($pkgname,keys) loads the specified package. The keyword options have the following effect: options=>$options can provide an explicit array of strings specifying the options to pass to the package; withoptions=>1 means that the options passed to the currently loading class or package should be passed to the requested package; type=>$ext specifies the type of the package file (default is sty); raw=>1 specifies that reading the raw style file (eg. pkg.sty) is permissible if there is no specific LaTeXML binding (eg. pkg.sty.ltxml); after=>$after specifies a string or (LaTeXML::Core::)Tokens to be expanded after the package has finished loading.

  • LoadClass($classname,keys) Similar to RequirePackage, but loads a class file (type=>’cls’).

  • AddToMacro($cstoken,$tokens) a little-used utility to add material to the expansion of $cstoken, like an \edef; typically used to add code to a class or package hook.
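Put together in the required order, a binding for a hypothetical package mypkg (a file mypkg.sty.ltxml) could be sketched as follows. The option name draft, the key mypkg_draft and the macro \mypkgname are assumptions; forwarding \CurrentOption in the default handler follows the usual LaTeX idiom.

```perl
use LaTeXML::Package;
use strict;
use warnings;

# 1. Declare the handlers. 'draft' is a hypothetical option.
DeclareOption('draft', sub {
    AssignValue(mypkg_draft => 1, 'global');
    return; });

# Default handler: forward any unrecognized option to graphicx.
DeclareOption(undef, sub {
    PassOptions('graphicx', 'sty',
      ToString(Expand(T_CS('\CurrentOption'))));
    return; });

# 2. Run the handlers for the options actually passed to mypkg.
ProcessOptions();

# 3. Load dependencies; graphicx receives the forwarded options.
RequirePackage('graphicx');

# The package's own definitions follow.
DefMacro('\mypkgname', 'mypkg');
1;
```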

§ 4.1.7 Miscellaneous

Other useful stuff:

  • RawTeX($texstring) expands and processes the $texstring; this is typically useful to include definitions copied from a TeX style file, when they are appropriate for LaTeXML as is. Single-quoting the $texstring is useful, since it isn’t interpolated by Perl, and avoids having to double all the backslashes!
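For instance (the macro names here are hypothetical), a one-line definition can be single-quoted directly, while a longer fragment fits naturally in a non-interpolating here-document:

```perl
# Single quotes: Perl leaves the backslashes alone.
RawTeX('\newcommand{\half}{\frac{1}{2}}');

# A quoted here-document (<<'EOTeX') is likewise not interpolated.
RawTeX(<<'EOTeX');
\newcommand{\vect}[1]{\mathbf{#1}}
\newif\ifmydraft
EOTeX
```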