This layer of customization deals with modifying the way a LaTeX document is transformed into LaTeXML’s XML, primarily through defining the way that control sequences are handled. In 2.1 the loading of various bindings was described. The facilities described in the following subsections apply in all such cases, whether used to customize the processing of a particular document or to implement a new LaTeX package. We make no attempt to be comprehensive here; please consult the documentation for (LaTeXML::)Global and Package, as well as the binding files included with the system for more guidance.
A LaTeXML binding is actually a Perl module, and as such, a familiarity with Perl is helpful. A binding file will look something like:
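A minimal sketch of such a file follows. The file name mypackage.sty.ltxml and the comment placeholder are illustrative assumptions; the package line and the use of LaTeXML::Package follow the convention of the binding files shipped with the system.

```perl
# mypackage.sty.ltxml : hypothetical binding for a hypothetical mypackage.sty
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Definitions and declarations (DefMacro, DefConstructor, ...) go here.

1;
```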
The final ‘1’ is required; it tells Perl that the module has loaded successfully. In between comes any Perl code you wish, along with the definitions and declarations described here.
Actually, familiarity with Perl is more than merely helpful, as is familiarity with TeX and XML! When writing a binding, you will be programming in all three languages. Of course, you need to know the TeX corresponding to the macros that you intend to implement, but sometimes it is most convenient to implement them completely, or in part, in TeX itself (eg. using DefMacro), rather than in Perl. At the other end, constructors (eg. using DefConstructor) are usually defined by patterns of XML.
Macros are defined using DefMacro, such as the pointless:
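As a sketch of such a definition, the binding below defines a macro with one normal argument whose replacement is a plain string. The macro name \mybold is an assumption chosen for illustration.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \mybold{text} expands to \textbf{text}; '{}' in the prototype is one normal argument.
DefMacro('\mybold{}', '\textbf{#1}');

1;
```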
The two arguments to DefMacro are called the prototype and the replacement. In the prototype, the {} specifies a single normal TeX parameter. The replacement here is a string, which will be tokenized; each #1 in it will be replaced by the tokens of the argument. Presumably the entire result will eventually be further expanded and/or processed.
Whereas TeX normally uses #1, and LaTeX has developed a complex scheme where it is often necessary to peek ahead token by token to recognize optional arguments, we have attempted to develop a suggestive, and easier to use, notation for parameters. Thus a prototype \foo{} specifies a single normal argument, whereas \foo[]{} would take an optional argument followed by a required one. More complex argument prototypes can be found in Package.
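To illustrate the optional-argument notation, here is a small sketch. The macro name \note is hypothetical; when the optional argument is omitted, #1 is expected to contribute no tokens to the expansion.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \note[Hint]{text} expands to \textbf{Hint} text
# \note{text}       expands to \textbf{} text   (#1 is empty)
DefMacro('\note[]{}', '\textbf{#1} #2');

1;
```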
As in TeX, the macro’s arguments are neither expanded nor digested until the expansion itself is further expanded or digested.
The macro’s replacement can also be Perl code, typically an anonymous sub, which receives the current Gullet followed by the macro’s arguments. It must return a list of Tokens, which will be used as the expansion of the macro. The following two examples show alternative ways of writing the above macro:
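A sketch of the first form, assuming the macro in question is the illustrative \mybold that expands to \textbf{...}; the sub builds the expansion token by token.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefMacro('\mybold{}', sub {
    my ($gullet, $arg) = @_;    # $arg is a Tokens object holding the argument
    # Return the tokens \textbf { ...argument... }
    return (T_CS('\textbf'), T_BEGIN, $arg->unlist, T_END); });

1;
```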
or alternatively
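A sketch of the second form, again assuming the illustrative \mybold. It relies on Invocation from LaTeXML::Package, which assembles the tokens needed to call a control sequence with the given arguments.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefMacro('\mybold{}', sub {
    my ($gullet, $arg) = @_;
    # Let Invocation wrap the argument in braces as the prototype of \textbf requires.
    return Invocation(T_CS('\textbf'), $arg); });

1;
```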
Generally, the body of the macro should not involve side-effects, assignments or other changes to state other than reading Tokens from the Gullet; of course, the macro may expand into control sequences which do have side-effects.
Functions that are useful for dealing with Tokens and writing macros include the following:
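Constants for the corresponding TeX catcodes: CC_ESCAPE, CC_BEGIN, CC_END, CC_MATH, CC_ALIGN, CC_EOL, CC_PARAM, CC_SUPER, CC_SUB, CC_IGNORE, CC_SPACE, CC_LETTER, CC_OTHER, CC_ACTIVE, CC_COMMENT, CC_INVALID.
Constants for tokens with the appropriate content and catcode: T_BEGIN, T_END, T_MATH, T_ALIGN, T_PARAM, T_SUB, T_SUPER, T_SPACE, T_CR.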
T_LETTER($char), T_OTHER($char), T_ACTIVE($char), create tokens of the appropriate catcode with the given text content.
T_CS($cs) creates a control sequence token; the string $cs should typically begin with a backslash.
Token($string,$catcode) creates a token with the given content and catcode.
Tokens($token,...) creates a (LaTeXML::Core::)Tokens object containing the list of Tokens.
Tokenize($string) converts the string to a Tokens, using TeX’s standard catcode assignments.
TokenizeInternal($string) like Tokenize, but treating @ as a letter.
Explode($string) converts the string to a Tokens where letter characters are given catcode CC_OTHER.
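The functions above might be combined as in the following sketch. All three macro names are hypothetical; the first two are intended to produce the same expansion, one assembled from individual tokens and one tokenized from a string.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Build \textbf{!} from individual tokens.
DefMacro('\viaTokens', sub {
    return (T_CS('\textbf'), T_BEGIN, T_OTHER('!'), T_END); });

# Build the same expansion by tokenizing a string with the standard catcodes.
DefMacro('\viaTokenize', sub {
    return Tokenize('\textbf{!}')->unlist; });

# Explode gives the letters catcode CC_OTHER, so no control sequence is formed.
DefMacro('\inertname', sub {
    return Explode('textbf')->unlist; });

1;
```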
Primitives are processed during the digestion phase in the Stomach, after macro expansion (in the Gullet), and before document construction (in the Document). Our primitives generalize TeX’s notion of primitive; they are used to implement TeX’s primitives, invoke other side effects and to convert Tokens into Boxes, in particular, Unicode strings in a particular font.
Here are a few primitives from TeX.pool:
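The sketch below is written in the style of the grouping primitives; it is not a verbatim copy of TeX.pool. The names \mybegingroup and \myendgroup are hypothetical, chosen so that the sketch does not redefine the real primitives, and the begingroup and endgroup methods of the Stomach are used for the side effect.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Each primitive receives the Stomach as its first argument.
DefPrimitive('\mybegingroup', sub {
    my ($stomach) = @_;
    $stomach->begingroup;
    return; });    # return nothing: no Boxes are produced

DefPrimitive('\myendgroup', sub {
    my ($stomach) = @_;
    $stomach->endgroup;
    return; });

1;
```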
Other than for implementing TeX’s own primitives, DefPrimitive is needed less often than DefMacro or DefConstructor. The main thing to keep in mind is that primitives are processed after macro expansion, by the Stomach. They are most useful for side-effects, changing the State.
The replacement is either a string which will be used to create a Box in the current font, or can be code taking the Stomach and the control sequence arguments as argument; like macros, these arguments are not expanded or digested by default, they must be explicitly digested if necessary. The replacement code must either return nothing (eg. ending with return;) or should return a list (ie. a Perl list (...)) of digested Boxes or Whatsits.
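Both kinds of replacement are sketched below. The control sequences \myendash and \setdraftmode and the state key mypkg_draftmode are hypothetical names used only for illustration.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# String replacement: becomes a Box containing an en dash in the current font.
DefPrimitive('\myendash', "\x{2013}");

# Code replacement: the argument arrives undigested, so convert it to a string,
# record it in the State, and return nothing.
DefPrimitive('\setdraftmode{}', sub {
    my ($stomach, $mode) = @_;
    AssignValue('mypkg_draftmode' => ToString($mode), 'global');
    return; });

1;
```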
Options to DefPrimitive are:
mode=>(’math’|’text’) switches to math or text mode, if needed;
requireMath=>1, forbidMath=>1 requires, or forbids, this primitive to appear in math mode;
bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;
font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color;
Note that if the font change should only affect the material digested within this command itself, then bounded=>1 should be used; otherwise, the font change will remain in effect after the command is processed.
beforeDigest=>CODE($stomach), afterDigest=>CODE($stomach) provides code to be digested before and after processing the main part of the primitive.
Needs description!
Other functions useful for dealing with digestion and state are important for writing before & after daemons in constructors, as well as in Primitives; we give an overview here:
Digest($tokens) digests $tokens (a (LaTeXML::Core::)Tokens), returning a list of Boxes and Whatsits.
Let($token1,$token2) gives $token1 the same meaning as $token2, like \let.
The following functions are useful for accessing and storing information in the current State. It maintains a stack-like structure that mimics TeX’s approach to binding; braces { and } open and close stack frames. (The Stomach methods bgroup and egroup can be used when explicitly needed.)
LookupValue($symbol), AssignValue($string,$value,$scope) maintain arbitrary values in the current State, looking up or assigning the current value bound to $symbol (a string). For assignments, the $scope can be ’local’ (the default, if $scope is omitted), which changes the binding in the current stack frame. If $scope is ’global’, it assigns the value globally by undoing all bindings. The $scope can also be another string, which indicates a named scope — but that is a more advanced topic.
PushValue($symbol,$value,...), PopValue($symbol), UnshiftValue($symbol,$value,...), ShiftValue($symbol) maintain the value of $symbol as a list, with the operations having the same sense as in Perl; modifications are always global.
LookupCatcode($char), AssignCatcode($char,$catcode,$scope) maintain the catcodes associated with characters.
LookupMeaning($token), LookupDefinition($token) look up the current meaning of the token, being any executable definition bound for it. If there is no such definition, LookupMeaning returns the token itself, whereas LookupDefinition returns undef.
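A brief sketch of how the state functions above might be used together. The control sequences \rememberauthor and \lastauthor and the keys mypkg_author and mypkg_authors are hypothetical.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefPrimitive('\rememberauthor{}', sub {
    my ($stomach, $name) = @_;
    my $string = ToString($name);
    AssignValue('mypkg_author' => $string);     # 'local' scope: current stack frame only
    PushValue('mypkg_authors' => $string);      # list-valued; always global
    return; });

# Expands to the most recently remembered author still in scope, or to nothing.
DefMacro('\lastauthor', sub {
    my $name = LookupValue('mypkg_author');
    return (defined $name ? Explode($name)->unlist : ()); });

1;
```

Because the first assignment is local, \lastauthor would no longer see the value once the enclosing TeX group has closed, whereas the list under mypkg_authors would persist.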
The following functions maintain LaTeX-like counters, and generally also associate an ID with them. A counter’s print form (ie. \theequation for equations) often ends up on the refnum attribute of elements; the associated ID is used for the xml:id attribute.
NewCounter($name,$within,options) creates a LaTeX-style counter. When $within is used, the given counter will be reset whenever the counter $within is incremented. This also causes the associated ID to be prefixed with $within’s ID. The option idprefix=>$string causes the ID to be prefixed with that string. For example,
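A sketch of counter declarations consistent with this description follows. It assumes a counter named document already exists to serve as the outermost $within; the class bindings shipped with the system normally declare the section and equation counters themselves, so this is purely illustrative.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Sections are numbered within the document; their IDs start with 'S'.
NewCounter('section', 'document', idprefix => 'S');

# Equations are reset by each new section; their IDs start with 'E'
# and are prefixed by the ID of the enclosing section.
NewCounter('equation', 'section', idprefix => 'E');

1;
```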
would cause the third equation in the second section to have ID=’S2.E3’.
CounterValue($name) returns the Number representing the current value.
ResetCounter($name) resets the counter to 0.
StepCounter($name) steps the counter (and resets any others ‘within’ it), and returns the expansion of \the$name.
RefStepCounter($name) steps the counter and any IDs associated with it. It returns a hash containing refnum (the expansion of \the$name) and id (the expansion of \the$name@ID).
RefStepID($name) steps the ID associated with the counter, without actually stepping the counter; this is useful for unnumbered units that normally would have both a refnum and ID.
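One common use of these functions is to step a counter while a constructor is digested and pass the result on as properties. The sketch below assumes a hypothetical remark counter and \remark command, that a section counter already exists, and that the schema permits xml:id on ltx:para.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Remarks are numbered within sections; IDs would look like S2.R1.
NewCounter('remark', 'section', idprefix => 'R');

DefConstructor('\remark{}',
    "<ltx:para xml:id='#id'><ltx:p>#1</ltx:p></ltx:para>",
    # The hash returned by RefStepCounter (refnum, id) becomes the Whatsit's properties.
    properties => sub { RefStepCounter('remark'); });

1;
```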
Constructors are where things get interesting, but also complex; they are responsible for defining how the XML is built. There are basic constructors corresponding to normal control sequences, as well as environments. Mathematics generally comes down to constructors, as well, but is covered in Chapter 5.
Here are a couple of trivial examples of constructors:
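Two sketches of simple constructors follow. The control sequences \highlight and \keyword are hypothetical; ltx:emph and ltx:text are elements of the LaTeXML schema.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# The argument is digested in text mode and placed inside an emph element.
DefConstructor('\highlight{}', "<ltx:emph>#1</ltx:emph>", mode => 'text');

# A fixed attribute value written directly in the pattern.
DefConstructor('\keyword{}', "<ltx:text class='keyword'>#1</ltx:text>");

1;
```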
The $replacement for a constructor describes the XML to be generated during the construction phase. It can either be a string representing the XML pattern (described below), or a subroutine CODE($document,$arg1,...props) receiving the arguments and properties from the Whatsit; it would invoke the methods of Document to construct the desired XML.
At its simplest, the XML pattern is just a serialization of the desired XML. For more expressivity, XML trees, text content, attributes and attribute values can be effectively ‘interpolated’ into the XML being constructed by use of the following expressions:
#1, #2, … #name returns the construction of the numbered argument or named property of the Whatsit;
&function(arg1,arg2,...) invokes the Perl function on the given arguments, arg1,…, returning the result. The arguments should be expressions for values, rather than XML subtrees.
?test(if pattern) or ?test(if pattern)(else pattern) returns the result of either the if or else pattern depending on whether the result of test (typically also an expression) is non-empty;
%expression returns a hash (or rather assumes the result is a hash or KeyVals object); this is only allowed within an opening XML tag, where all the key-value pairs are inserted as attributes;
^ if this appears at the beginning of the pattern, the replacement is allowed to float up the current tree to wherever it might be allowed;
In each case, the result of an expression is expected to be either an XML tree, a string or a hash, depending on the context it was used in. In particular, values of attributes are typically given by quoted strings, but expressions within those strings are interpolated into the computed attribute value. The special characters @ # ? % which introduce these expressions can be escaped by preceding with a backslash, when the literal character is desired.
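The pattern expressions might be combined as in this sketch, which uses a numbered argument, a named property interpolated into an attribute value, and a conditional on an optional argument. The command \keyterm and the property name kind are hypothetical.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \keyterm[gloss]{term}:
#   #kind  is the property supplied below, interpolated into the class attribute;
#   #2     is the required argument;
#   ?#1(...) inserts the emph element only when the optional argument is non-empty.
DefConstructor('\keyterm[]{}',
    "<ltx:text class='#kind'>#2 ?#1(<ltx:emph>#1</ltx:emph>)</ltx:text>",
    properties => { kind => 'keyterm' });

1;
```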
A subroutine used as the $replacement allows programmatic insertion of XML into, or modification of, the document being constructed. Although one could use LibXML’s DOM API to manipulate the document tree, it is strongly recommended to use Document’s API wherever possible, as it maintains consistency and manages namespace prefixes. This is particularly true for insertion of new content, setting attributes and finding existing nodes in the tree using XPath.
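A sketch of a subroutine replacement that uses the Document methods openElement, absorb and closeElement rather than the DOM directly. The command name \shout and the class value are hypothetical.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefConstructor('\shout{}', sub {
    my ($document, $arg, %props) = @_;
    # Open the element, let the Document absorb the digested argument, then close it.
    $document->openElement('ltx:text', class => 'shout');
    $document->absorb($arg);
    $document->closeElement('ltx:text');
    return; });

1;
```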
Options:
mode=>(’math’|’text’) switches to math or text mode, if needed;
requireMath=>1, forbidMath=>1 requires, or forbids, this constructor to appear in math mode;
bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;
font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color;
properties=> {hash} | CODE($stomach,$arg1,..) provides a set of properties to store in the Whatsit for eventual use in the constructor $replacement. If a subroutine is used, it also should return a hash of properties;
beforeDigest=>CODE($stomach), afterDigest=>CODE($stomach,$whatsit) provides code to be digested before and after digesting the arguments of the constructor, typically to alter the context of the digestion (before), or to augment the properties of the Whatsit (after);
beforeConstruct=>CODE($document,$whatsit), afterConstruct=>CODE($document,$whatsit) provides code to be run before and after the main $replacement is effected; occasionally it is convenient to use the pattern form for the main $replacement, but one still wants to execute a bit of Perl code, as well;
captureBody=>(1 | $token) specifies that an additional argument (like an environment body) will be read until the current TeX grouping ends, or until the specified $token is encountered. This argument is available to $replacement as #body;
scope=>(’global’|’local’|$name) specifies whether this definition is made globally, in the current stack frame (the default), or in a named scope;
reversion=>$string|CODE(...), alias=>$cs can be used when the Whatsit needs to be reverted into TeX code, and the default of simply reassembling based on the prototype is not desired. See the code for examples.
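Several of these options are combined in the sketch below. The command \alert, the property name level and the length threshold are hypothetical; the afterDigest hook uses the Whatsit methods getArg and setProperty to adjust a property after the argument has been digested.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefConstructor('\alert{}', "<ltx:text class='#level'>#1</ltx:text>",
    mode       => 'text',
    bounded    => 1,                        # keep the font change local to this command
    font       => { series => 'bold' },
    properties => { level => 'alert' },
    afterDigest => sub {
        my ($stomach, $whatsit) = @_;
        # Augment the properties once the digested argument is available.
        if (length(ToString($whatsit->getArg(1))) > 40) {
            $whatsit->setProperty(level => 'alert-long'); }
        return; });

1;
```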
Some additional functions useful when writing constructors:
ToString($stuff) converts $stuff to a string, hopefully without TeX markup, suitable for use as document content and attribute values. Note that if $stuff contains Whatsits generated by Constructors, it may not be possible to avoid TeX code. Contrast ToString with the following two functions.
UnTeX($stuff) returns a string containing the TeX code that would generate $stuff (this might not be the original TeX). The function Revert($stuff) returns the same information as a Tokens list.
Stringify($stuff) returns a string more intended for debugging purposes; it reveals more of the structure and type information of the object and its parts.
CleanLabel($arg), CleanIndexKey($arg), CleanBibKey($arg), CleanURL($arg) cleans up arguments (converting to string, handling invalid characters, etc) to make the argument appropriate for use as an attribute representing a label, index ID, etc.
UTF($hex) returns the Unicode character for the given codepoint; this is useful for characters below 0x100 where Perl becomes confused about the encoding.
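As an illustration of ToString, the sketch below turns a digested argument into a plain string for use as an attribute value. The command \tagged and the property name kind are hypothetical.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# \tagged{Kind}{text}: the first argument only determines the class attribute.
DefConstructor('\tagged{}{}', "<ltx:text class='#kind'>#2</ltx:text>",
    properties => sub {
        my ($stomach, $kind, $text) = @_;
        # ToString strips the boxes down to their text; lc normalizes the case.
        return (kind => lc(ToString($kind))); });

1;
```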
Environments are largely a special case of constructors, but the prototype starts with {envname}, rather than \cmd; the replacement will also typically involve #body, representing the contents of the environment.
DefEnvironment takes the same options as DefConstructor, with the addition of
afterDigestBegin=>CODE($stomach,$whatsit) provides code to digest after the \begin{env} is digested;
beforeDigestEnd=>CODE($stomach) provides code to digest before the \end{env} is digested.
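A sketch of an environment definition with one of the additional hooks follows. The environment name aside, the class value and the state key mypkg_in_aside are hypothetical.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefEnvironment('{aside}', "<ltx:quote class='aside'>#body</ltx:quote>",
    afterDigestBegin => sub {
        my ($stomach, $whatsit) = @_;
        # Record, with the default 'local' scope, that an aside is being digested.
        AssignValue('mypkg_in_aside' => 1);
        return; });

1;
```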
For those cases where you do not want an environment to correspond to a constructor, you may still (as in LaTeX) define the two control sequences \envname and \endenvname as you like.
The following declarations are typically only needed when customizing the schema used by LaTeXML.
RegisterNamespace($prefix,$url) associates the prefix with the given namespace url. This allows you to use $prefix as a namespace prefix when writing Constructor patterns or XPath expressions.
Tag($tag,properties) specifies properties for the given XML $tag. Recognized properties include: autoOpen=>1 indicates that the tag can automatically be opened if needed to create a valid document; autoClose=>1 indicates that the tag can automatically be closed if needed to create a valid document; afterOpen=>$code specifies code to be executed after opening the tag; the code is passed the Document being constructed as well as the Box (or Whatsit) responsible for its creation; afterClose=>$code similar to afterOpen, but executed after closing the element.
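These declarations might appear in a binding as sketched below. The prefix my and its namespace URL are made-up values, and the choice of ltx:note for autoClose is only illustrative.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Make the prefix 'my' usable in constructor patterns and XPath expressions.
RegisterNamespace('my', 'http://example.com/ns/mypackage');

# Allow an open note element to be closed automatically when validity requires it.
Tag('ltx:note', autoClose => 1);

1;
```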
The following functions are a bit tricky to use (and describe), but can be quite useful in some circumstances.
DefLigature($regexp,options) applies a regular expression to substitute text nodes after they are closed; the only option is fontTest=>$code which restricts the ligature to text nodes where the current font passes &$code($font).
DefMathLigature($code) allows replacement of sequences of math nodes. It applies $code to the current Document and each sequence of math nodes encountered in the document; if a replacement should occur, $code should return a list of the form ($n,$string,attributes) in which case, the text content of the first node is replaced by $string, the given attributes are added, and the following $n-1 nodes are removed.
DefRewrite(specification), DefMathRewrite(specification) defines document rewrite rules. These specifications describe what document nodes match:
label=>$label restricts to nodes contained within an element whose labels includes $label;
scope=>$scope generalizes label; the most useful form is a string like ’section:1.3.2’ where it matches the section element whose refnum is 1.3.2;
xpath=>$xpath selects nodes matching the given XPath;
match=>$tex selects nodes that look like what processing the TeX string $tex would produce;
regexp=>$regexp selects text nodes that match the given regular expression.
The following specifications describe what to do with the matched nodes:
attributes=>{attr} adds the given attributes to the matching nodes;
replace=>$tex replaces the matching nodes with the result of processing the TeX string $tex.
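A sketch of a rewrite rule built only from the specification keys listed above. The XPath expression and the class value are illustrative assumptions; the rule is meant to add a class attribute to every ltx:emph element.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Select nodes by XPath, then add the given attributes to each match.
DefRewrite(
    xpath      => 'descendant-or-self::ltx:emph',
    attributes => { class => 'stress' });

1;
```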
The following declarations are useful for defining LaTeXML bindings, including option handling. As when defining LaTeX packages, the following, if needed at all, need to appear in the order shown.
DeclareOption($option,$handler) specifies the handler for $option when it is passed to the current package or class. If $option is undef, it defines the default handler, for options that are otherwise unrecognized. $handler can be either a string to be expanded, or a sub which is executed like a primitive.
PassOptions($name,$type,options) specifies that the given options should be passed to the package (if $type is sty) or class (if $type is cls) $name, if it is ever loaded.
ProcessOptions(keys) processes any options that have been passed to the current package or class. If inorder=>1 is specified, the options will be processed in the order passed to the package (\ProcessOptions*); otherwise they will be processed in the declared order (\ProcessOptions).
ExecuteOptions(options) executes the handlers for the specific set of options options.
RequirePackage($pkgname,keys) loads the specified package. The keyword options have the following effect: options=>$options can provide an explicit array of strings specifying the options to pass to the package; withoptions=>1 means that the options passed to the currently loading class or package should be passed to the requested package; type=>$ext specifies the type of the package file (default is sty); raw=>1 specifies that reading the raw style file (eg. pkg.sty) is permissible if there is no specific LaTeXML binding (eg. pkg.sty.ltxml); after=>$after specifies a string or (LaTeXML::Core::)Tokens to be expanded after the package has finished loading.
LoadClass($classname,keys) Similar to RequirePackage, but loads a class file (type=>’cls’).
AddToMacro($cstoken,$tokens) a little-used utility to add material to the expansion of $cstoken, like an \edef; typically used to add code to a class or package hook.
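The option-handling declarations might be combined as in this sketch of a binding for a hypothetical mypkg.sty. The option names, the state key mypkg_draft, and the choice of graphicx as the required package are assumptions.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Handlers given as subs are executed like primitives.
DeclareOption('draft', sub { AssignValue('mypkg_draft' => 1, 'global'); return; });
DeclareOption('final', sub { AssignValue('mypkg_draft' => 0, 'global'); return; });

# Default handler (undef): silently ignore any option not declared above.
DeclareOption(undef, sub { return; });

# Run the handlers for the options actually passed, in the order given by the user.
ProcessOptions(inorder => 1);

# Load another package, passing it an explicit list of options.
RequirePackage('graphicx', options => ['final']);

1;
```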
Other useful stuff:
RawTeX($texstring) expands and processes the $texstring; this is typically useful to include definitions copied from a TeX style file, when they are appropriate for LaTeXML as is. Single-quoting the $texstring is useful, since it isn’t interpolated by Perl, and avoids having to double all the backslashes!
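A sketch of RawTeX using a single-quoted here-document, so that neither backslashes nor # need escaping. The \vect and \abs definitions stand in for whatever would be copied from a style file.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# The quoted terminator 'EoTeX' makes Perl pass the text through uninterpolated.
RawTeX(<<'EoTeX');
\newcommand{\vect}[1]{\mathbf{#1}}
\newcommand{\abs}[1]{\left|#1\right|}
EoTeX

1;
```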