Archive for the ‘XML’ Category

Encoding and decoding XML data as path sequences

Friday, July 4th, 2008 by Chris Catalfo

Lately I’ve been thinking about how to represent information about XML paths and data as a string.

For example, I’d like to be able to record the origin of this data:


<titleInfo type="alternative">
<title>Special edition using XSLT</title>
</titleInfo>

as something like this (with id and data as properties in a JSON object; the backslash path delimiter has to be escaped as \\ inside a JSON string):


{"id":"titleInfo-2@type=alternative\\title-1","data":"Special+edition+using+XSLT"}

I could then take the preceding id string, extract the provenance of the data, and recreate the original XML document.

Here’s how I’ve tried encoding the XML path and data using an XSLT stylesheet:

For each text element, create an id consisting of:

  1. Each ancestor (except the root)
  2. A dash to delimit the ancestor element’s name from its position
  3. The integer position of that node in the XML file (using )
  4. Each of the ancestor’s attributes, in the form @attrname=attrvalue
  5. A backslash to be used as a path delimiter
  6. The text element’s name
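
The steps above can be sketched in Python rather than XSLT. This is only an illustration under a few assumptions of mine: "position" is taken to mean the 1-based index among same-named siblings, the leaf element gets a position too (as in the title-1 example earlier), the data is form-encoded with quote_plus, and the mods root element and sample document are made up for the demo.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus

# Hypothetical sample document; the <mods> root is an assumption.
SAMPLE = """<mods>
  <titleInfo><title>XSLT</title></titleInfo>
  <titleInfo type="alternative">
    <title>Special edition using XSLT</title>
  </titleInfo>
</mods>"""


def step(elem, parent, with_attrs):
    """Build one path step: name-position, plus @attr=value pairs if requested."""
    # Assumption: position is the 1-based index among same-named siblings.
    same_name = [child for child in parent if child.tag == elem.tag]
    text = "%s-%d" % (elem.tag, same_name.index(elem) + 1)
    if with_attrs:
        text += "".join("@%s=%s" % (k, v) for k, v in elem.attrib.items())
    return text


def encode(root):
    """Return one {"id": ..., "data": ...} dict per element that holds text."""
    records = []

    def walk(elem, parent, ancestors):
        if parent is not None:  # the root itself is left out of the id
            if elem.text and elem.text.strip():
                # Only ancestors carry attributes, as in the list of steps.
                leaf = step(elem, parent, with_attrs=False)
                records.append({
                    "id": "\\".join(ancestors + [leaf]),
                    "data": quote_plus(elem.text.strip()),
                })
            ancestors = ancestors + [step(elem, parent, with_attrs=True)]
        for child in elem:
            walk(child, elem, ancestors)

    walk(root, None, [])
    return records


records = encode(ET.fromstring(SAMPLE))
for record in records:
    print(json.dumps(record))
```

Note that json.dumps writes the backslash delimiter as \\, since a bare backslash is an escape character inside a JSON string.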

With this id, I believe I now have everything I need to reconstruct the node that the data referenced by that id came from.
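
To check that claim, here is a rough Python sketch of the reverse direction: rebuilding a tree from id/data records. It rests on assumptions that are mine, not part of the scheme as described: the root name has to be supplied separately (the id omits it), positions count same-named siblings starting at 1, and neither element names nor attribute values contain @ or a backslash.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote_plus


def parse_step(step):
    """Split 'name-2@type=alternative' into ('name', 2, {'type': 'alternative'})."""
    head, *attr_parts = step.split("@")
    name, position = head.rsplit("-", 1)
    attrs = dict(part.split("=", 1) for part in attr_parts)
    return name, int(position), attrs


def nth_child(parent, name, position, attrs):
    """Return the position-th child called name, creating children as needed."""
    matches = parent.findall(name)
    while len(matches) < position:
        # Earlier siblings we have no record for become empty placeholders.
        matches.append(ET.SubElement(parent, name))
    child = matches[position - 1]
    child.attrib.update(attrs)
    return child


def decode(records, root_name="root"):
    """Rebuild a tree from records; root_name is an assumed, caller-supplied name."""
    root = ET.Element(root_name)
    for record in records:
        node = root
        for step in record["id"].split("\\"):
            node = nth_child(node, *parse_step(step))
        node.text = unquote_plus(record["data"])
    return root


records = [
    {"id": "titleInfo-2@type=alternative\\title-1",
     "data": "Special+edition+using+XSLT"},
]
rebuilt = decode(records, root_name="mods")
print(ET.tostring(rebuilt, encoding="unicode"))
```

One limitation shows up right away: because positions are counted per element name, the original ordering between differently named siblings is not recoverable from the ids alone.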

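After playing around with this a bit, I realized that what I’d done was basically reinvent XPath! In XPath, where positions are 1-based and attribute tests are written as predicates, the preceding path in the id string would be represented as (with * standing in for the root element that the id omits):

/*/titleInfo[2][@type='alternative']/title[1]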

OK… so the next idea is to see if there are libraries out in the wild wild web for creating XML documents from XPath expressions (and not just querying XML documents). I see that the Perl module XML::XPath may offer a solution.

I also wonder if this is how XForms libraries keep track of what parts of an XML document have been edited….

Making the MARC21 specification usable via XML

Monday, April 7th, 2008 by Chris Catalfo

As part of my work on the Biblios project, I need access to the MARC21 specification in a machine-readable form so that Biblios can provide context-sensitive help. To this end, I’ve been extracting field and subfield names, descriptions, and valid values from the MARC21 specification in HTML form, available at the Library of Congress website here. I thought other folks might be interested in this, hence this post.

I enjoy using Python and so I thought I’d try to whip something up in it. I discovered the BeautifulSoup Python library for extracting data from HTML pages and it sounded perfect for this task.

A glance at the specification page for the MARC21 Leader field (here) shows that, happily, the pages are constructed fairly logically and their contents can be extracted based on CSS class.

I put together a few regular expressions like this one to parse out the key data:


# get the character position like this:
# 18-21 - Illustrations (006/01-04)
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')
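
To see what the named groups capture, here is the same pattern applied to the sample line from the comment above. The variable names match and fields are just for this demo.

```python
import re

# Same pattern as above: position (with optional "-NN" extent), " - ", then the name.
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')

match = charposmatch.match("18-21 - Illustrations (006/01-04)")
fields = match.groupdict()

print(fields["position"])  # the full range, e.g. start and end together
print(fields["extent"])    # only the "-NN" part, or None for a single position
print(fields["name"])      # everything after the " - " separator
```

For a single-position line such as "05 - Record status", the extent group does not participate and comes back as None.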

BeautifulSoup lets you write code like this to walk through an HTML document and extract parts:


for charpos in soup.findAll('div', {'class':'characterposition'}):
    try:
        text = charpos.findNext('strong', recursive=False).contents[0].rstrip()
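
For readers without BeautifulSoup installed, the same walk can be approximated with the standard library's html.parser. To be clear, this is a stand-in and not the BeautifulSoup API, and the sample markup is my guess at the page structure (a strong element inside each div with class characterposition), not a copy of the real pages.

```python
from html.parser import HTMLParser

# Assumed markup, loosely modelled on the excerpt above; not the real spec pages.
SAMPLE_HTML = """
<div class="characterposition"><strong>00-04 - Record length </strong>
  <div class="description">Length of the entire record.</div></div>
<div class="characterposition"><strong>05 - Record status</strong></div>
"""


class CharPosParser(HTMLParser):
    """Collect the text of the first <strong> after each div.characterposition."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self.want_strong = False  # saw a matching div, no <strong> yet
        self.in_strong = False    # currently inside the <strong> we want

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "characterposition" in classes:
            self.want_strong = True
        elif tag == "strong" and self.want_strong:
            self.want_strong = False
            self.in_strong = True
            self.labels.append("")

    def handle_data(self, data):
        if self.in_strong:
            self.labels[-1] += data

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False


parser = CharPosParser()
parser.feed(SAMPLE_HTML)
labels = [label.rstrip() for label in parser.labels]
print(labels)
```

Each collected label has the "position - name" shape that the regular expression shown earlier is meant to pick apart.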

I wrote a little Python script to download the relevant specification files and ran my extraction scripts. Out came something like this:


<marc21spec>
   <tag code="000">
      <position description="Computer-generated, five-character number equal to the length of the entire
                              record, including itself and the record terminator. The number is right justified
                              and unused positions contain zeros." name="Record length" position="00-04"/>
      <position description="One-character alphabetic code that indicates the relationship of the record to a
                              file for file maintenance purposes." name="Record status" position="05">
         <value code="a" description="Increase in encoding level"/>
         <value code="c" description="Corrected or revised"/>
         <value code="d" description="Deleted"/>
         <value code="n" description="New"/>
         <value code="p" description="Increase in encoding level from prepublication"/>
      </position>

The MARC21 specification in all its glory, as XML!
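
As a rough idea of how an application might use this output for context-sensitive help, here is a small Python lookup over a trimmed copy of the excerpt above. The describe function is hypothetical, and the closing tag and marc21spec elements are my assumption, since the excerpt is cut off before them.

```python
import xml.etree.ElementTree as ET

# Trimmed from the excerpt above; the closing </tag> and </marc21spec> are assumed.
SPEC = """<marc21spec>
   <tag code="000">
      <position description="One-character alphabetic code that indicates the relationship of the record to a file for file maintenance purposes." name="Record status" position="05">
         <value code="a" description="Increase in encoding level"/>
         <value code="c" description="Corrected or revised"/>
      </position>
   </tag>
</marc21spec>"""


def describe(spec_root, tag_code, position, value_code=None):
    """Return the description of a character position, or of one of its values.

    Hypothetical helper: returns None when nothing matches.
    """
    node = spec_root.find(
        "tag[@code='%s']/position[@position='%s']" % (tag_code, position)
    )
    if node is not None and value_code is not None:
        node = node.find("value[@code='%s']" % value_code)
    return None if node is None else node.get("description")


spec_root = ET.fromstring(SPEC)
print(describe(spec_root, "000", "05"))
print(describe(spec_root, "000", "05", "c"))
```

The lookups rely on the limited XPath subset that ElementTree supports, which includes the [@attribute='value'] predicate used here.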

I have no doubt that there are still some incorrect or missing parts in the generated XML files. If you’re interested in double-checking the files or, even better, improving the extraction scripts, you can download them here:
marc21controlfields.xml
marc21varfields.xml
extractControlTags.py
extractVariableTags.py

To run the extraction scripts, pass them the path containing the .html files (or a single file) and the name of the file to write the XML output to:

python extractControlTags.py . marc21controlfields.xml
python extractVariableTags.py . marc21variablefields.xml