Archive for the ‘MARC21’ Category

‡biblios.net collaborative sharing and editing of records

Monday, December 8th, 2008 by Chris Catalfo

LibLime is developing a new open platform for sharing bibliographic records as part of its forthcoming ‡biblios.net project (now in beta at http://beta.biblios.net). As part of this platform, LibLime is making bibliographic records available via several protocols: Z39.50 and SRU for starters, and eventually a REST protocol and OAI-PMH.

LibLime is also allowing write access to the records via the protocol ‡biblios uses to interact with Koha. This API (documented more fully at the biblios.org site) looks like the following:

authenticate: POST username=x password=y. Receives: a cookie for session use.
bib_profile: GET. Receives: an XML document describing the MARC21 fields in use by the backend Koha server.
bib/xxx: GET. Retrieves the MARCXML document at bib number xxx.
bib/xxx: POST. Saves the posted MARCXML document to the server.
new_bib: POST. Saves a MARCXML document (new to the database) to the server.
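
To make the shape of these calls concrete, here is a minimal sketch in Python that only builds the HTTP requests and does not send them. The base URL, the function names, and the `Content-Type` header are assumptions for illustration; the real service URLs had not been announced at the time of writing, and the actual API may expect different details.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL: the post does not give the real service address.
BASE_URL = "http://example.org/biblios-api"


def authenticate_request(username, password):
    """POST username and password; the server is expected to answer with a session cookie."""
    body = urlencode({"username": username, "password": password}).encode("utf-8")
    return Request(f"{BASE_URL}/authenticate", data=body, method="POST")


def bib_profile_request():
    """GET the XML document describing the MARC21 fields used by the backend."""
    return Request(f"{BASE_URL}/bib_profile", method="GET")


def get_bib_request(bib_number):
    """GET the MARCXML document stored at the given bib number."""
    return Request(f"{BASE_URL}/bib/{bib_number}", method="GET")


def save_bib_request(bib_number, marcxml):
    """POST an edited MARCXML document back to an existing bib number."""
    return Request(
        f"{BASE_URL}/bib/{bib_number}",
        data=marcxml.encode("utf-8"),
        headers={"Content-Type": "text/xml"},  # assumed, not confirmed by the post
        method="POST",
    )


def new_bib_request(marcxml):
    """POST a MARCXML document that is new to the database."""
    return Request(
        f"{BASE_URL}/new_bib",
        data=marcxml.encode("utf-8"),
        headers={"Content-Type": "text/xml"},  # assumed, not confirmed by the post
        method="POST",
    )
```

To actually send these, one option is an opener built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor())`, so that the session cookie returned by the authenticate call is carried along on later requests.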

In the coming weeks we will announce publicly available URLs for these access services, so sharpen your pencils and get ready to collaboratively share and edit some records!

Making the MARC21 specification usable via XML

Monday, April 7th, 2008 by Chris Catalfo

As part of my work on the Biblios project, I need access to the MARC21 specification in a machine-readable form so that Biblios can provide context-sensitive help. To this end, I’ve been extracting field and subfield names, descriptions, and valid values from the MARC21 specification in HTML form, available at the Library of Congress website here. I thought other folks might be interested in this, hence this post.

I enjoy using Python, so I thought I’d try to whip something up in it. I discovered the BeautifulSoup Python library for extracting data from HTML pages, and it sounded perfect for this task.

A glance at the specification page for the MARC21 Leader field (here) shows that, happily, the pages are constructed fairly logically and their contents can be extracted based on CSS class.

I put together a few regular expressions like this one to parse out the key data:


import re

# get the character position from a heading like this:
# 18-21 - Illustrations (006/01-04)
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')
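
As a quick illustration of what that pattern captures, the sketch below applies it to the sample heading from the comment. The `parse_heading` helper and the `is_range` key are names made up for this example; they are not part of the extraction scripts.

```python
import re

charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')


def parse_heading(text):
    """Split a heading such as '18-21 - Illustrations (006/01-04)' into its parts."""
    match = charposmatch.match(text)
    if match is None:
        return None  # not a character-position heading
    return {
        "position": match.group("position"),             # e.g. '18-21' or '05'
        "is_range": match.group("extent") is not None,   # True when a '-NN' part is present
        "name": match.group("name"),
    }


print(parse_heading("18-21 - Illustrations (006/01-04)"))
print(parse_heading("05 - Record status"))
```

The named groups keep the calling code readable: `position` holds the full position or range, while `extent` is only set when the heading covers more than one character position.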

BeautifulSoup lets you write code like this to walk through an HTML document and extract parts:


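# soup is a BeautifulSoup object built from one specification page
for charpos in soup.findAll('div', {'class': 'characterposition'}):
    try:
        text = charpos.findNext('strong', recursive=False).contents[0].rstrip()
    except (AttributeError, IndexError, TypeError):
        continue  # no usable <strong> heading after this div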
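
For readers who want to try the same idea without installing anything, here is a rough, self-contained sketch that swaps BeautifulSoup for the standard library's `html.parser`. It is not the code from the extraction scripts: the `CharPosParser` class, the `extract_positions` function, and the sample HTML are all invented for illustration, and the real Library of Congress markup may be structured differently.

```python
import re
from html.parser import HTMLParser

CHARPOS = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')


class CharPosParser(HTMLParser):
    """Collect the text of the first <strong> that follows each div.characterposition."""

    def __init__(self):
        super().__init__()
        self.headings = []        # heading strings found so far
        self._pending = False     # saw a characterposition div, waiting for <strong>
        self._in_strong = False   # currently inside the <strong> being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'div' and 'characterposition' in classes:
            self._pending = True
        elif tag == 'strong' and self._pending:
            self._in_strong = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_strong:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'strong' and self._in_strong:
            self.headings.append(''.join(self._buffer).strip())
            self._in_strong = False
            self._pending = False


# Invented sample markup, loosely modelled on the class name used above.
SAMPLE_HTML = """
<div class="characterposition"><strong>05 - Record status </strong>
  <p>One-character alphabetic code.</p></div>
<div class="characterposition"><strong>18-21 - Illustrations (006/01-04)</strong></div>
"""


def extract_positions(html):
    """Return (position, name) pairs for every heading the regex recognises."""
    parser = CharPosParser()
    parser.feed(html)
    parser.close()
    pairs = []
    for heading in parser.headings:
        match = CHARPOS.match(heading)
        if match:
            pairs.append((match.group('position'), match.group('name')))
    return pairs


print(extract_positions(SAMPLE_HTML))
```

BeautifulSoup remains the more forgiving choice for real-world pages with messy markup; the event-driven parser here just keeps the example dependency-free.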

I wrote a little Python script to download the relevant specification files and ran my extraction scripts. Out came something like this:


<marc21spec>
   <tag code="000">
      <position description="Computer-generated, five-character number equal to the length of the entire
                              record, including itself and the record terminator. The number is right justified
                              and unused positions contain zeros." name="Record length" position="00-04"/>
      <position description="One-character alphabetic code that indicates the relationship of the record to a
                              file for file maintenance purposes." name="Record status" position="05">
         <value code="a" description="Increase in encoding level"/>
         <value code="c" description="Corrected or revised"/>
         <value code="d" description="Deleted"/>
         <value code="n" description="New"/>
         <value code="p" description="Increase in encoding level from prepublication"/>
      </position>

The MARC21 specification in all its glory, as XML!
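
Since the point of the XML is context-sensitive help, here is a small sketch of how a program might look up a coded value in it, using `xml.etree.ElementTree`. The element and attribute names follow the excerpt above; the `describe_value` function and the trimmed sample document are assumptions for illustration, not part of Biblios.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed sample in the same shape as the generated file.
SAMPLE = """<marc21spec>
   <tag code="000">
      <position description="Computer-generated, five-character number equal to the length of the entire record."
                name="Record length" position="00-04"/>
      <position description="One-character alphabetic code that indicates the relationship of the record to a file."
                name="Record status" position="05">
         <value code="c" description="Corrected or revised"/>
         <value code="d" description="Deleted"/>
         <value code="n" description="New"/>
      </position>
   </tag>
</marc21spec>"""


def describe_value(root, tag_code, position, value_code):
    """Return (position name, value description), or None if nothing matches."""
    for tag in root.findall('tag'):
        if tag.get('code') != tag_code:
            continue
        for pos in tag.findall('position'):
            if pos.get('position') != position:
                continue
            for value in pos.findall('value'):
                if value.get('code') == value_code:
                    return pos.get('name'), value.get('description')
    return None


root = ET.fromstring(SAMPLE)
print(describe_value(root, '000', '05', 'd'))
```

With the full file, `ET.parse('marc21controlfields.xml').getroot()` would take the place of `ET.fromstring(SAMPLE)`.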

I have no doubt that there are still some incorrect or missing parts in the generated XML files. If you’re interested in double-checking the files, or even better, improving the extraction scripts, you can download them here:
marc21controlfields.xml
marc21varfields.xml
extractControlTags.py
extractVariableTags.py

To run the extraction scripts, pass them the path containing the .html files (or a single file) and the name of the file to write the XML output to:

python extractControlTags.py . marc21controlfields.xml
python extractVariableTags.py . marc21varfields.xml
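
For anyone adapting the scripts, the "directory or single file" argument can be handled along these lines. This is a guess at one reasonable approach, not the scripts' actual code: `collect_html_files` and `demo` are hypothetical names, the file names in the demo are made up, and the directory listing is assumed to be non-recursive.

```python
import os
import tempfile


def collect_html_files(path):
    """Return the .html files in a directory, or the path itself if it is a single file."""
    if os.path.isdir(path):
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if name.lower().endswith('.html')
        )
    return [path]


def demo():
    """Exercise both cases against a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ('bd008.html', 'bdleader.html', 'notes.txt'):
            open(os.path.join(tmp, name), 'w').close()
        from_dir = [os.path.basename(p) for p in collect_html_files(tmp)]
        from_file = [os.path.basename(p)
                     for p in collect_html_files(os.path.join(tmp, 'bdleader.html'))]
        return from_dir, from_file


print(demo())
```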