An XML Parser For Python

This document describes (yet) another API for processing XML documents. It focuses on simplifying the kind of processing where a largely linear document is being converted to another largely linear document, such as where a paper or article is being translated to HTML or to print form.


1. Why Another XML Parser Interface?

There are lots of XML parser APIs around, most notably SAX and the W3C's DOM. Any new API must provide significant and needed new functionality, must significantly simplify processing, or must be the basis for explaining otherwise difficult concepts.

The current paper and its accompanying implementation, for use with Python, describe a new XML parser API that's based generally on the commonly used SAX API but seems, at least for a large number of applications, to have advantages:

The XML parser API implementation that comes with this paper should be considered a prototype, and an alpha implementation at that. It's a work in progress. It has been used to successfully convert an XML document to HTML, but that's it.


2. Creating An XML Parser

An XML parser is encapsulated by the anXMLParser class. Creating and initializing an XML parser is done by invoking the anXMLParser's creator method. It has three optional arguments:


3. The XML Parser Object

An anXMLParser object has two useful properties and two useful methods:


4. Using The XML Parser In Python

The primary functionality of an anXMLParser object is to return a stream of XML markup tokens. For those familiar with the SAX API, there's one object corresponding to each of SAX's call-back functions/methods, plus a few others for convenience.

4.1 A Simple Example

You create an anXMLParser object, something like this:

 parser = anXMLParser (open ("mydocument.xml"))

In this case, just the input XML document is provided. You can then start getting the markup tokens from the parser by invoking the parser object as if it were a function (this is a Python thing, but it works well in some cases such as this):

 for token in parser ():
    if token.isStartElement ():
       print "Starting element", token.name
    elif token.isEndElement ():
       print "Ending element", token.name
    elif token.isCharacters ():
       print "Characters:", token.characters

(The methods and properties defined for the tokens are described in a later section.)
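
The anXMLParser implementation needs Stackless Python, so the loop above can't be tried on a stock interpreter. As a rough stand-in, and not the implementation that comes with this paper, the following sketch uses the standard library's xml.sax to collect SAX call-backs into a list of simple tokens, and then loops over them in the same style. The names Token, TokenCollector and tokenize are assumptions made up for the illustration. Unlike the real parser, this version buffers the whole document instead of handing out tokens on demand.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Token(object):
    """Hypothetical stand-in for a markup token (not the real classes)."""

    def __init__(self, kind, name=None, characters=None):
        self.kind = kind
        self.name = name
        self.characters = characters

    def isStartElement(self, *names):
        return self.kind == "start" and (not names or self.name in names)

    def isEndElement(self, *names):
        return self.kind == "end" and (not names or self.name in names)

    def isCharacters(self):
        return self.kind == "chars"


class TokenCollector(ContentHandler):
    """Turns each SAX call-back into one Token appended to a list."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(Token("start", name=name))

    def endElement(self, name):
        self.tokens.append(Token("end", name=name))

    def characters(self, content):
        self.tokens.append(Token("chars", characters=content))


def tokenize(text):
    """Parse a complete document and return all of its tokens."""
    collector = TokenCollector()
    xml.sax.parseString(text.encode("utf-8"), collector)
    return collector.tokens


events = []
for token in tokenize("<doc><p>Hi</p></doc>"):
    if token.isStartElement():
        events.append("start " + token.name)
    elif token.isEndElement():
        events.append("end " + token.name)
    elif token.isCharacters():
        events.append("chars " + token.characters)
```

Note that this flat loop sees end-element tokens for every element, because nothing here plays the role of a nested generator.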

4.2 A Realistic Example

The first example simply displays the start- and end-tags and any characters found. But a typical application doesn't treat all elements equally -- where they appear, and in what context, determines how they are going to be used. So here's a more realistic example:

 for token in parser ():
    if token.isStartElement ("section"):
       for token in token.children ():
          def outputParaContent (token):
             for token in token.children ("W"):
                if token.isCharacters ():
                  out.write (token.characters)
          if token.isStartElement ("title"):
             out.write ("<H2>")
             outputParaContent (token)
             out.write ("</H2>\n")
          elif token.isStartElement ("para"):
             out.write ("<P>")
             outputParaContent (token)
             out.write ("</P>")

Here, once the start of a "section" element is encountered, the content of the element is asked for using a nested invocation of the parser -- that's what "token.children ()" does. In the example:

4.3 Advantages Of This XML Processing Model

The nice thing about this way of processing is that:

Or to put it another way, it's simpler than either the SAX API or DOM API, it makes available significant DOM-like functionality to SAX-like applications, and it adds some advantages of using declarative XML processing to a main-stream-like programming language.

There is a down-side to this XML processing model: random reordering of the document is not as well supported as by the DOM model. For many applications this difficulty can be dealt with by having a facility for reordering the output as described in Patched Output.


5. What's Returned From The XML Parser

Once you've created an anXMLParser object, you can call the object, and what you'll get is a generator that returns markup tokens from the parser until it runs out (or, when you've made a second, nested call, until it encounters the end of the current element's content). The following subsections describe the different kinds of tokens that you might receive. Each kind of token has a test method that tells you what kind of token you've got, plus a few other properties.

All tokens support the ".__str__" method, so that you can pass a token to Python's "str" function or use it in any context in which a string coercion is forced. What's returned is a representation of the token more-or-less as it appears in an XML document.
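
To give a feel for what such a string form might look like, here is a minimal sketch. The three classes are hypothetical stand-ins, not the markupToken classes themselves, and for brevity the attributes are held in a plain dictionary rather than a list of markupAttribute objects. The escaping comes from the standard library's xml.sax.saxutils module.

```python
from xml.sax.saxutils import escape, quoteattr


class StartElementToken(object):
    """Hypothetical start-element token with an XML-like string form."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}

    def __str__(self):
        parts = [self.name]
        # Sort the attribute names so the result is predictable.
        for key in sorted(self.attrs):
            parts.append("%s=%s" % (key, quoteattr(self.attrs[key])))
        return "<" + " ".join(parts) + ">"


class EndElementToken(object):
    """Hypothetical end-element token."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "</" + self.name + ">"


class CharactersToken(object):
    """Hypothetical character-data token; markup characters are escaped."""

    def __init__(self, characters):
        self.characters = characters

    def __str__(self):
        return escape(self.characters)


tokens = [StartElementToken("a", {"href": "x&y"}),
          CharactersToken("1 < 2"),
          EndElementToken("a")]
rebuilt = "".join(str(token) for token in tokens)
```

Joining the string forms of a run of tokens gives back something close to the original markup, which is handy when part of a document is to be copied through unchanged.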

5.1 Characters

 token.isCharacters ()

Returns True only if the token is a markupTokenCharacters object.

 token.characters

The characters returned from the XML Parser.

5.2 End Document

 token.isEndDocument (*names)

Returns True only if the token is a markupTokenEndDocument object.

Only returned if the parser is created with the "S" include option.

5.3 End Element

 token.isEndElement (*names)

Returns True only if the token is a markupTokenEndElement object. If one or more arguments are given, it only returns True if the name of the element being ended is one of those given. If no argument is given, any element will do.

 token.name

The name of the element being ended.

 token.elementDepth

The nesting depth of the element being ended. The root element is of depth 1.

 token.usedAsEnd

A markupTokenEndElement object is not returned to the application if "parser ()" or "token.children ()" is used to generate the content of an element. However, if such a generator is not used, a markupTokenEndElement object will be returned. There are two such cases:

5.4 Ignorable Whitespace

 token.isIgnorableWhitespace ()

Returns True only if the token is a markupTokenIgnorableWhitespace object.

 token.characters

The ignorable whitespace characters returned from the XML Parser.

Only returned if the parser is created with the "I" include option.

5.5 Processing Instruction

 token.isProcessingInstruction (*targets)

Returns True only if the token is a markupTokenProcessingInstruction object. If one or more arguments are given, it only returns True if the target of the processing instruction (name at the start of the processing instruction), if any, is one of those given. If no argument is given, any processing instruction will do. If one or more arguments are given, and there is no target, False will be returned.

 token.target

The target of the processing instruction, if any. If there isn't one, the value is None.

 token.data

The data of the processing instruction (the part of the processing instruction following the target and its terminating space character), if any. If there isn't one, the value is None.

Only returned if the parser is created with the "P" include option.
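
The matching rule for targets can be sketched in a few lines. The class below is a hypothetical stand-in for markupTokenProcessingInstruction, written only to show the three cases described above: no arguments, arguments with a target, and arguments without a target.

```python
class ProcessingInstructionToken(object):
    """Hypothetical stand-in for a processing instruction token."""

    def __init__(self, target=None, data=None):
        self.target = target
        self.data = data

    def isProcessingInstruction(self, *targets):
        if not targets:
            return True       # no names given: any processing instruction will do
        if self.target is None:
            return False      # names given, but there is no target to compare
        return self.target in targets


stylesheet = ProcessingInstructionToken("xml-stylesheet", 'href="a.css"')
untargeted = ProcessingInstructionToken()
```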

5.6 Set Document Locator

 token.isSetDocumentLocator ()

Returns True only if the token is a markupTokenSetDocumentLocator object. This object type is an artifact of the SAX API.

 token.columnNumber

The column number of the input when the token is returned.

 token.lineNumber

The line number of the input when the token is returned.

 token.publicId

The public identifier of the input being returned. Its value is None if no public identifier is available.

 token.systemId

The system identifier of the input being returned. Its value is None if no system identifier is available.

Only returned if the parser is created with the "L" include option.

5.7 Skipped Entity

 token.isSkippedEntity (*names)

Returns True only if the token is a markupTokenSkippedEntity object. If one or more arguments are given, it only returns True if the name of the entity is one of those given. If no argument is given, any entity will do.

 token.name

The name of the entity.

5.8 Start Document

 token.isStartDocument (*names)

Returns True only if the token is a markupTokenStartDocument object.

Only returned if the parser is created with the "S" include option.

5.9 Start Element

 token.isStartElement (*names)

Returns True only if the token is a markupTokenStartElement object. If one or more arguments are given, it only returns True if the name of the element being started is one of those given. If no argument is given, any element will do.

 token.name

The name of the element being started.

 token.attrs

The attributes, if any, specified or defaulted for the element being started. token.attrs is a list of markupAttribute objects, each of which has the following properties:

 token.elementDepth

The nesting depth of the element being started. The root element is of depth 1.

 token.usedAsEnd

True if the end of this element will serve as the end of a generated sequence of tokens for overlapped markup. Otherwise it's False.

5.10 Exception

 token.isException (*severities)

Returns True only if the token is a markupTokenException object. If one or more arguments are given, it only returns True if the severity code of the exception is one of those given. If no argument is given, any exception will do.

 token.severity

The severity of the exception:

The text of the error message.

 token.columnNumber

The column number of the input when the token is returned.

 token.lineNumber

The line number of the input when the token is returned.

 token.publicId

The public identifier of the input being returned. Its value is None if no public identifier is available.

 token.systemId

The system identifier of the input being returned. Its value is None if no system identifier is available.

Only returned for warnings (token.severity == 0) if the parser is created with the "W" include option.

5.11 Resolvable Entity

 token.isEntity (*names)

Returns True only if the token is a markupTokenEntity object. If one or more arguments are given, it only returns True if the name of the entity that needs to be resolved by the application is one of those given. If no argument is given, any entity will do.

 token.name

The name of the entity that needs resolving.

 token.publicId

The public identifier of the input being returned. Its value is None if no public identifier is available.

 token.systemId

The system identifier of the input being returned. Its value is None if no system identifier is available.

 token.setEntity (inputFile)

A method of one argument, providing the XML Parser with a string, Unicode string or input-file-like object that is the text of the resolvable entity described by the token. This method, or the correspondingly-named method of the generating anXMLParser object, should be called whenever this token is returned to the application.

Only returned if the "entityResolver" argument wasn't specified when the parser was created.

5.12 Notation Declaration

 token.isNotationDecl (*names)

Returns True only if the token is a markupTokenNotationDecl object. If one or more arguments are given, it only returns True if the name of the notation being declared is one of those given. If no argument is given, any notation will do.

 token.name

The name of the notation being declared.

 token.publicId

The public identifier of the input being returned. Its value is None if no public identifier is available.

 token.systemId

The system identifier of the input being returned. Its value is None if no system identifier is available.

Only returned if the parser is created with the "D" include option.

5.13 Unparsed Entity Declaration

 token.isUnparsedEntityDecl (*names)

Returns True only if the token is a markupTokenUnparsedEntityDecl object. If one or more arguments are given, it only returns True if the name of the entity being declared is one of those given. If no argument is given, any entity will do.

 token.name

The name of the entity being declared.

 token.publicId

The public identifier of the input being returned. Its value is None if no public identifier is available.

 token.systemId

The system identifier of the input being returned. Its value is None if no system identifier is available.

Only returned if the parser is created with the "D" include option.

 token.ndata

The value of the SAX "ndata" argument if the declaration is for an NDATA entity. None if not.

5.14 The Begining

This token type is an artifact of this parsing model. It is only returned if the parser needs to ask the application for a document entity. If returned, it is the very first token returned.

 token.isTheBegining ()

Returns True only if the token is a markupTokenTheBegining object.

 token.setEntity (inputFile)

A method of one argument, providing the XML Parser with a string, Unicode string or input-file-like object that is the text of the document entity to be parsed. This method, or the correspondingly-named method of the generating anXMLParser object, should be called whenever this token is returned to the application.

Not returned if the "documentEntity" argument is specified when creating the anXMLParser object.

5.15 The End

This token type is an artifact of this parsing model. It is returned on "normal" termination of the parser, following all other tokens for a parse. It won't be returned in some cases of a very severe exception (when the exception's token.severity == 3). This token can typically be ignored.

 token.isTheEnd ()

Returns True only if the token is a markupTokenTheEnd object.


6. Retrieving Tokens From The XML Parser

There are a number of ways in which tokens can be retrieved from the XML parser, either one at a time, or through a generator. A generator can:

6.1 Invoking The XML Parser As A Token Generator

The simplest way to invoke the XML parser as an XML token generator is to "call" it:

 for token in parser ():

When called like this, an XML parser object generates all the XML tokens from the document being parsed.

When called a second time, when tokens are already being returned from the parser, the tokens returned from the call are just the children of the currently opened element, as in:

 for token in parser ():
    if token.isStartElement ("section"):
       for token in parser ():
          def outputParaContent ():
             for token in parser ("W"):
                if token.isCharacters ():
                  out.write (token.characters)
          if token.isStartElement ("title"):
             out.write ("<H2>")
             outputParaContent ()
             out.write ("</H2>\n")
          elif token.isStartElement ("para"):
             out.write ("<P>")
             outputParaContent ()
             out.write ("</P>")

An invocation of the parser when one or more elements are open will return all the tokens up to, but not including, that for the end tag of the most deeply nested currently open element. The end tag is used by the token generator to indicate that it's at the end of what it has to generate. It's not returned to the application. So in the example, there are three uses of "for token in parser ()":

Equivalent to invoking the parser object with a call is invoking the ".children" method of a token returned by another XML token generator. So in the above example, the nested uses of "parser ()" can be equivalently replaced by "token.children ()" -- so long as "token" is available.
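
The way nested generators share one stream of tokens can be imitated in ordinary Python. The sketch below is an assumption-laden model, not the real parser: tokens are plain (kind, value) pairs, and the hypothetical TokenStream class keeps count of the open elements so that each call of its "children" generator knows which end tag closes it.

```python
class TokenStream(object):
    """Hypothetical model of nested token generators over one shared stream."""

    def __init__(self, tokens):
        self._source = iter(tokens)
        self.depth = 0  # number of currently open elements

    def _read(self):
        kind, value = next(self._source)
        if kind == "start":
            self.depth += 1
        elif kind == "end":
            self.depth -= 1
        return kind, value

    def children(self):
        level = self.depth  # open elements at the time of the call
        while True:
            try:
                kind, value = self._read()
            except StopIteration:
                return  # ran out of input
            if kind == "end" and self.depth < level:
                return  # the end tag that closes this content: swallow it
            yield kind, value


document = [("start", "section"),
            ("start", "title"), ("chars", "Intro"), ("end", "title"),
            ("start", "para"), ("chars", "Text"), ("end", "para"),
            ("end", "section")]

out = []
stream = TokenStream(document)
for kind, value in stream.children():
    if kind == "start" and value == "section":
        for kind, value in stream.children():
            if kind == "start" and value == "title":
                text = "".join(v for k, v in stream.children() if k == "chars")
                out.append("<H2>" + text + "</H2>")
            elif kind == "start" and value == "para":
                text = "".join(v for k, v in stream.children() if k == "chars")
                out.append("<P>" + text + "</P>")
```

In this model an end tag is only passed on to the caller when nobody asked for the content of its element with a nested call, which is the behaviour described above for the real generators.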

There are other sets of XML tokens that can be generated by an XML parser: see Overlapped Markup.

6.2 The Whitespace Option

When invoking an anXMLParser object to generate its XML tokens, or invoking the ".children" method of a token, options can be specified as the first argument, as in the innermost generator in the previous example (the one within outputParaContent):

 for token in parser ("W"):

One specifiable option is the letter "W", upper- or lower-case. By default, invoking a parser with no option does not return character tokens consisting only of whitespace. If "W" is specified, all character data is returned.

The "W" option is especially useful when parsing without the help of a DTD or schema, or where the application has needs in addition to those specified in the DTD or schema.

The first argument of the XML token generator can be specified as "" (as in parser ("")) if no options are wanted.
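
The effect of the option can be pictured as a filter over the token stream. In this sketch, which is only a model of the behaviour described above, tokens are plain (kind, value) pairs and filter_tokens is a hypothetical helper name.

```python
def filter_tokens(tokens, options=""):
    """Drop whitespace-only character tokens unless "W" or "w" is given."""
    keep_whitespace = "w" in options.lower()
    for kind, value in tokens:
        # XML whitespace is space, tab, carriage return and line feed.
        if kind == "chars" and not keep_whitespace and value.strip(" \t\r\n") == "":
            continue
        yield kind, value


tokens = [("start", "para"), ("chars", "  \n"), ("chars", "Hi"), ("end", "para")]
default_view = list(filter_tokens(tokens))
full_view = list(filter_tokens(tokens, "W"))
```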

6.3 Special Token Methods

There are three methods available for all returned tokens.

6.3.1 Retrieving Just One Token

 token.next ()

Return the next token from the XML token generator that returned this token. This token is "consumed" and will not be returned again.

If there are no tokens available (at the end of the document entity or at the end of a generator for an element's content) this method raises the Python StopIteration exception, as is normal for a generator.

6.3.2 Looking Ahead

 token.peek ([count])

Return the next token from the XML token generator that returned this token. This token is not "consumed" and will be returned again.

If the "count" argument is specified, then skip that many as-yet unreturned tokens and return the one following. token.peek (0) returns the next token (just as does token.peek ()), and token.peek (1) returns the token after that.

If there are not enough tokens available (at the end of the document entity or at the end of a generator for an element's content) this method returns a markupToken object that refuses to be recognized as any particular token type -- all of the usual ".is" methods will return False. This makes things easy if you're looking for a particular kind of token up ahead.
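
Here is a sketch of how look-ahead of this kind can be built on top of any Python iterator. It is a model only: the hypothetical PeekableTokens class puts "next" and "peek" on the stream rather than on each token, and Token and NoToken are minimal stand-ins with just two test methods.

```python
from collections import deque


class Token(object):
    """Minimal hypothetical token with two of the usual test methods."""

    def __init__(self, kind, name=None):
        self.kind = kind
        self.name = name

    def isStartElement(self, *names):
        return self.kind == "start" and (not names or self.name in names)

    def isEndElement(self, *names):
        return self.kind == "end" and (not names or self.name in names)


class NoToken(object):
    """What peek returns past the end: every test method says False."""

    def isStartElement(self, *names):
        return False

    def isEndElement(self, *names):
        return False


class PeekableTokens(object):
    """Hypothetical wrapper that adds look-ahead to a token iterator."""

    def __init__(self, tokens):
        self._source = iter(tokens)
        self._pending = deque()  # tokens read ahead but not yet consumed

    def next(self):
        if self._pending:
            return self._pending.popleft()
        return next(self._source)  # raises StopIteration at the end

    def peek(self, count=0):
        while len(self._pending) <= count:
            try:
                self._pending.append(next(self._source))
            except StopIteration:
                return NoToken()
        return self._pending[count]


stream = PeekableTokens([Token("start", "title"), Token("end", "title")])
```

Because the placeholder answers False to every test, code such as "if stream.peek (1).isEndElement ():" needs no special case for running off the end.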

6.3.3 Generating Content

 token.children ()

As described earlier, the ".children" method of any returned token can be used to generate tokens from the XML parser. The allowed arguments of ".children" are the same as for invoking the parser, as described in Invoking The XML Parser As A Token Generator, The Whitespace Option and Overlapped Markup.

6.4 Ignoring Children

(Not recommended outside of XML applications.)

Not all components of a marked up document are of interest. The end tag of an element, for example, marks the end of the element's content, but typically doesn't serve any other purpose -- which is why a simple XML token generator in this implementation doesn't bother to return it to the user, and by default skips comments and other non-structure markup and skips character tokens consisting entirely of white space.

Applications will want to ignore some elements, especially in overlapped markup applications. There are a variety of ways of doing this, based on application needs:

Even for an empty element, that should have no content, this technique is preferable to gobbling the next XML token on the assumption that it's an end tag token. If a non-validating XML parser is being used and an element has whitespace in its content, a token may be returned for that white space:

 <marker>  </marker>
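
The difference between the two approaches shows up in a small model. Tokens here are plain (kind, value) pairs, and skip_children is a hypothetical helper, not part of the parser's API. It consumes everything up to the end tag that matches a start tag the caller has already read.

```python
def skip_children(tokens):
    """Consume tokens up to and including the matching end tag."""
    depth = 1  # the caller has already read the start tag
    for kind, value in tokens:
        if kind == "start":
            depth += 1
        elif kind == "end":
            depth -= 1
            if depth == 0:
                return


# <marker>  </marker>after, as a non-validating parser might report it
tokens = iter([("start", "marker"), ("chars", "  "), ("end", "marker"),
               ("chars", "after")])
start = next(tokens)
skip_children(tokens)
following = next(tokens)
```

Had the application simply taken one more token after the start tag, it would have received the whitespace rather than the end tag, and everything after that would have been off by one.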

6.3 Overlapped Markup

There are currently a variety of experiments going on attempting to come up with XML encodings for data, especially text, with overlapped structure. An example of overlapped structure is the plays of Shakespeare wherein a line of spoken verse is split between two speakers, and both the speaker structure and verse structure need to be captured.

6.3.1 Marking Up Overlapped Structure

A number of techniques for marking up overlapped text have been devised. A number of these techniques use empty elements to mark the start and end of alternative structures. Identification of start/end pairs uses both element names and attribute values:

Element names can do the job alone, or in combination with labeling attribute values.

There are two ways in which this kind of technique can be used:

In all these techniques, once the starting boundary of an overlapping structure has been recognized, the identity of its ending boundary is known: the name of the ending element, and, if used, the name and required value of the labeling attribute. The labeling attribute value is typically that of a same named or other attribute of the starting boundary. The element name and the attribute name may be the same or different. However, a single model describes all these techniques.

An extension to this way of doing things is to allow the ending element not to be an empty element -- to allow it to have content. Doing this doesn't seem to introduce any difficulty in recognizing the end boundary of a component of overlapped markup -- in effect, the overlapped structure ends at the end of the ending element.

6.3.2 Generating Overlapped Structure Content

One kind of processing possible using documents with overlapped markup as described above is to simply extract one of the structures, or at least to make one dominate (contain) the other. This kind of processing can make use of generators of the sort described earlier for element content: the ending condition for the generator is the ending boundary of one overlapped structure. The parser and ".children" generators can recognize such boundaries as follows:

A second argument need not be specified if an end boundary is to be identified solely on the basis of an attribute value. Alternatively, a third argument need not be specified if the element name is sufficient identification. In any case, a first argument of "" can be used if no options are to be specified.

Unlike the case for generating the content of an element, the end boundary element is returned by the generator: its start element token, at least, can have attributes of interest. In addition, the end boundary element can have content -- it's the end of the end boundary element that is deemed to end the overlapped structure. The ending element is returned as if it were content of the overlapped structure. Its markupTokenEndElement object is returned unless the end boundary element uses a generator for its content that suppresses the end tag token.

To help clients recognize an overlapped structure end boundary, both the markupTokenStartElement and markupTokenEndElement objects have a boolean ".usedAsEnd" property that is True only if the element has been recognized as an overlapped structure end boundary.
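
The boundary test can be modelled as follows. This is a simplified sketch under several assumptions: tokens are plain (kind, value, attributes) triples, until_boundary is a hypothetical name, the "used as end" flag travels beside each token instead of being a property of it, and the ending level questions discussed in the next subsection are ignored. At least one of the name and the attributes should be given, or the first start tag seen will be taken as the boundary.

```python
def until_boundary(tokens, name=None, attrs=None):
    """Yield (token, used_as_end) pairs up to the end of the boundary element."""
    tokens = iter(tokens)  # share the caller's iterator if it is one
    wanted = attrs or {}
    for kind, value, token_attrs in tokens:
        matches = (kind == "start"
                   and (name is None or value == name)
                   and all(token_attrs.get(k) == v for k, v in wanted.items()))
        if not matches:
            yield (kind, value, token_attrs), False
            continue
        # Found the end boundary: return its start tag, flagged ...
        yield (kind, value, token_attrs), True
        depth = 1
        # ... then its content, and finally its own end tag, flagged too.
        for kind, value, token_attrs in tokens:
            if kind == "start":
                depth += 1
            elif kind == "end":
                depth -= 1
            yield (kind, value, token_attrs), depth == 0
            if depth == 0:
                return
        return


# What follows the start tag of <q sId="a"/> in
# <q sId="a"/>To be, <l/>or not<q eId="a"/> to be
rest = iter([("end", "q", {}),
             ("chars", "To be, ", {}),
             ("start", "l", {}), ("end", "l", {}),
             ("chars", "or not", {}),
             ("start", "q", {"eId": "a"}), ("end", "q", {}),
             ("chars", " to be", {})])
inside = list(until_boundary(rest, "q", {"eId": "a"}))
text = "".join(token[1] for token, used_as_end in inside if token[0] == "chars")
boundary_kinds = [token[0] for token, used_as_end in inside if used_as_end]
```

The tokens after the boundary element are left in the stream for whatever generator was running outside this one.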

This form of processing overlapped markup isn't intended to deal with all processing issues for overlapped markup. Just as the DOM model of processing XML documents is often superior to the sequential processing model, other processing models for overlapped markup are often more appropriate than that described here. On the other hand, there are many cases where sequential processing works -- producing a published form of the document, for instance -- so this approach is worth consideration.

6.3.3 Ending Level Issues

This is where things become a bit tricky. Skip to the examples for cases if this subsection gets a bit heavy.

In XML documents, things are nested within other things -- they each occur at some level of nesting. The issue is at what level of nesting is the end markup to be recognized.

In "normal", well formed markup, it's expected that the end tag will appear at the same level of markup as the corresponding start tag. For overlapped markup there are a number of possibilities:

This situation is exacerbated by the fact that the XML parser is going to validate the well formed markup, but isn't necessarily going to validate the matching of overlapped markup start and end boundary tags. So the model we use has to allow for missing and extra end boundary markers.

To support the possible alternatives, the XML token generator supports three options in addition to the "W" option:

6.4 XML Token Generation Examples

The following examples explain most of the useful cases of XML token generation. However, they should not be considered exhaustive.

It should be noted that non-overlapped and overlapped token generation can be combined in many ways. They are in no way mutually exclusive. It's quite possible for a document's structure to be primarily "well formed", with overlapped markup used in a few key places.

6.4.1 Non-Overlapped Content

First, a simple example of non-overlapped markup. Assuming that you've got a well-formed element like the following:

 <name>...</name>

The recognition of the <name> element and the processing of its children will look like the following:

 if token.isStartElement ("name"):
    for token in token.children ():
       # process children

Where "process children" occurs you'll put the recognition and processing logic for the <name> element's children.

As noted earlier, the </name> token doesn't need to be dealt with -- it's used by the XML token generator to signal the end of token generation.

6.4.2 Empty Elements

Empty elements are supposed to have no content:

 <name/>

It would be nice if one just got one XML token for this. However, there are a few reasons why the XML parser doesn't always do this for you:

So one has to deal with the content of empty elements, mostly just ignoring the end tag token. One can safely, as described in Ignoring Children, do this:

 if token.isStartElement ("name"):
    for token in token.children (): pass

6.4.3 Element-Name Based Overlapped Markup

So now for overlapped markup. The simplest way of using XML elements to mark overlapped boundaries is to have matching start and end elements:

 <start/>...<end/>

Telling the generator what element terminates the current content does the job:

 if token.isStartElement ("start"):
    for token in token.children ("", "end"): ...

That's all that's needed. (Note that the first argument needs to be specified in this case.)

By default, when you specify an end element name, the XML token generator looks for it in the content of the parent element of where the generator was invoked. So if the parent element terminates without there being an end element, the generator will terminate without returning a token for the end element.

There are two other ways in which you can deal with this situation. Firstly, you can exit the start element yourself, and then specify that the generator terminates at the end of what then is the currently opened element. This is equivalent to the default ending condition without specifying an option:

 if token.isStartElement ("start"):
    for token in token.children ():
       pass # skip any content and the end token for <start/>
    for token in token.children ("e", "end"): ...

Alternatively, you can specify that well formed element boundaries are to be ignored and that you want that end element:

 if token.isStartElement ("start"):
    for token in token.children ("z", "end"): ...

In this latter case the end of the input will terminate the generator too.

6.4.4 Labels With Different Element Names

A more usable way of marking structure boundaries is to identify them with labels:

 <start id="thisstuff"/>...<end id="thisstuff"/>

In this case the third argument of the generator can be used to specify the required label value for the end boundary:

 if token.isStartElement ("start"):
    for token in token.children ("", "end",
                                 {"id": token.attrs ["id"].value}): ...

6.4.5 Labels On The Same Element Name

As an alternative to using different element names for the start and end boundaries of overlapped structure, you can use the same element name with different attribute names for the labels:

 <q sId="thisstuff"/>...<q eId="thisstuff"/>

The generation of this kind of overlapped content is the same as when using different element names, except that the element and attribute names change:

 if token.isStartElement ("q") and token.attrs.has_key ("sId"):
    for token in token.children ("", "q",
                                {"eId": token.attrs ["sId"].value}): ...

One advantage of using the same name for start and end boundaries is that you can combine the processing of different elements. In the following case, it's assumed that there are a variety of "quote" structures, all with the same or similar processing. The idea is that whatever the start element's name, the end element must have the same name:

 if token.isStartElement ("q", "q1", "q2", "speech") and \
          token.attrs.has_key ("sId"):
    for token in token.children ("", token.name,
                                 {"eId": token.attrs ["sId"].value}): ...

You can even dynamically maintain a list of currently interesting overlapping element boundaries, and recognize them by "flattening" the list into multiple arguments for the ".isStartElement" method:

 boundaryElements = ["q", "q1", "q2", "speech"]
 if token.isStartElement (*boundaryElements) and token.attrs.has_key ("sId"):
    for token in token.children ("", token.name,
                                 {"eId": token.attrs ["sId"].value}): ...

6.4.6 Just The Attributes

You can also just use labels, and ignore the element names. Note the absence of a specified element name when using the ".isStartElement" method and the use of "None" for the second argument, indicating in both cases that the element name is not interesting:

 if token.isStartElement () and token.attrs.has_key ("sId"):
    for token in token.children ("", None,
                                 {"eId": token.attrs ["sId"].value}): ...

6.4.7 Annotation Elements In Overlapped Markup

The ending boundary element can have content, used as an "annotation" of the overlapped content in some models.

 <start id="thisstuff"/>...<end id="thisstuff">...</end>

The ".usedAsEnd" property is useful in identifying the annotation. Python's "else" option for "for" loops is also useful, as it is only performed if the loop "dropped through" and was not exited by a "break":

 if token.isStartElement ("start"):
    for token in token.children ("", "end",
                                 {"id": token.attrs ["id"].value}):
       if token.usedAsEnd:
          # process the "annotation" element used as the end boundary
          for token in token.children ():
             pass # annotation content
          break
       # processing for the "content" elements
    else:
       pass # you'll only get here if there was no end boundary element


7. Terminating An XML Parser

An anXMLParser object can be terminated simply by deleting it (using Python's "del" statement) or letting all references to it disappear (go out of scope). Doing either of these things terminates the coroutine in which the parser is running.
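
Stackless coroutines aren't available on a stock Python, but ordinary generators show a similar clean-up pattern, and the analogy may help. In this sketch, which says nothing about how the parser itself is implemented, token_source is a hypothetical generator whose "finally" clause stands in for shutting the parser down. It runs when the generator is closed before it has been used up.

```python
log = []


def token_source():
    """Hypothetical token generator with clean-up on early termination."""
    try:
        yield "start"
        yield "end"
    finally:
        log.append("parser terminated")


tokens = token_source()
first = next(tokens)  # take one token, then lose interest
tokens.close()        # explicit; deleting the last reference has the same effect in CPython
```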

At present, some kinds of errors in the user's program can cause the program to terminate without terminating the XML parser. In this case the program as a whole may hang and need "terminating" using the operating system's facilities.


8. Patched Output

Processing XML documents often requires that data be reordered: the order in which things appear in the input is not necessarily that in which they are to appear in the output. There are different ways of achieving this reordering.

Both the serial XML processing model described in this document and SAX have the difficulty of not having the whole document available during client processing. To a large extent this difficulty can be offset by tools that allow the client to write output in semi-random order. This approach works, in part, because:

The accompanying module, "patchedoutput.py", is one tool for dealing with the reordering problem. A client application writes data and "labels" to a patchedoutput in the application-appropriate order, and defines the values associated with the labels when the information becomes available, in any order. The patchedoutput then writes the final form of the output in the order in which the data and labels are written, replacing the labels with their associated values.
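
The idea can be sketched in a few lines. The class below is not the interface of "patchedoutput.py", which isn't reproduced in this document. The names PatchedOutput, write, writeLabel, define and getvalue are assumptions made up for the illustration, and everything is simply buffered in memory.

```python
class PatchedOutput(object):
    """Hypothetical in-memory patched output: data and labels, filled in later."""

    def __init__(self):
        self._pieces = []  # (is_label, text) pairs, kept in output order
        self._values = {}  # label name -> replacement text

    def write(self, text):
        self._pieces.append((False, text))

    def writeLabel(self, name):
        self._pieces.append((True, name))

    def define(self, name, value):
        self._values[name] = value

    def getvalue(self):
        # Raises KeyError if a label was written but never defined.
        return "".join(self._values[text] if is_label else text
                       for is_label, text in self._pieces)


out = PatchedOutput()
out.write("<H1>Contents</H1>\n")
out.writeLabel("toc")  # a placeholder: the titles aren't known yet
titles = []
for title in ["Why", "How"]:
    titles.append(title)
    out.write("<H2>%s</H2>\n" % title)
out.define("toc", "".join("<LI>%s</LI>\n" % t for t in titles))
result = out.getvalue()
```

A table of contents is the classic case: it appears at the front of the output, but its entries only become known as the sections go by.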

There are a number of ways in which the data and labels can be buffered:

"patchedoutput.py" is simple enough that programmers can modify it to suit their own requirements.


9. Getting The Parser

To use this parser, you need to have Python version 2.3 or later installed, and a Python SAX parser available -- there's one in the Python library. You also need to be using Christian Tismer's Stackless Python. It can be found at www.stackless.com.

The new Python API is available as a ZIP file: xmlparser.zip. This ZIP file includes:


10. Other Work On Serial XML Parsing

There are other approaches to serial XML parsing. The most important alternative to the model presented here is the one in which an XML element or other structure invokes a rule or method (depending on the language), and in which the user can indicate where the structure's content is to be processed. Two examples of this alternative are:


11. Updates

The following updates have been made since this document was first posted:

13 August 2004:

A variety of minor organizational changes and clarifications have been made. As well:

22 July 2004:

The xml2htmlp.py program has been removed. I'll put it back when I've time to put a bit more work into it.

13 July 2004:

The section titled "Patched Output" and the accompanying patchedoutput.py module have been added.

© copyright 2004 by Sam Wilmott, All Rights Reserved

Thu Sep 09 20:36:58 2004