Archive for the ‘Biblios’ Category

Deciding on an API for Biblios

Saturday, May 24th, 2008 by Chris Catalfo

As I continue to work on Biblios in anticipation of its release (soon, I hope!), it is about time to decide on an API.

I have already put in place a simple macro system for batch editing of bibliographic records. The macro language is JavaScript, and it uses a MarcRecord JavaScript object to manipulate MARCXML records.

Here is a simple example (record is a MarcRecord instance):


// Check to see if record has 856.  If so, add subfield $u with url.  If not, add a new 856 with url.
if( record.hasField('856') ) {
    record.field('856').subfield('u', 'http://www.google.com');
}
else {
    record.addField( new Field('856', '', '', [ new Subfield('u', 'http://www.google.com')]) );
}

I would like to provide access to Biblios’ main functions for use by plugins. Here are a few ideas for API functions:

  • Run a search
  • Run the current search but limited to something
  • Save all search results to a folder
  • Save record with id n to a particular folder
  • Edit record with id n
  • Run a macro on all records in a folder

I’d be interested to hear what others think: what they’re used to in other cataloging software, and what commands or tools that software is missing that could ultimately be included in Biblios.

Making the MARC21 specification usable via XML

Monday, April 7th, 2008 by Chris Catalfo

As part of my work on the Biblios project, I need access to the MARC21 specification in a machine-readable form so that Biblios can provide context-sensitive help. To this end, I’ve been extracting field and subfield names, descriptions, and valid values from the MARC21 specification in HTML form, available at the Library of Congress website here. I thought other folks might be interested in this, hence this post.

I enjoy using Python, so I thought I’d try to whip something up in it. I discovered the BeautifulSoup Python library for extracting data from HTML pages, and it sounded perfect for this task.

A glance at the specification page for the MARC21 Leader field (here) shows that, happily, the pages are constructed fairly logically and their contents can be extracted based on CSS class.

I put together a few regular expressions like this one to parse out the key data:


# get the character position like this:
# 18-21 - Illustrations (006/01-04)
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')
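
As a quick check, here is a minimal sketch of that pattern applied to a few labels. The `parse_charpos` helper and the second and third sample strings are assumptions for illustration; they are not taken from the extraction scripts.

```python
import re

# Same pattern as above: position, optional "-NN" extent, then the name.
charposmatch = re.compile(
    r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)'
)

def parse_charpos(label):
    """Return (position, name) for a label, or None if it does not match."""
    m = charposmatch.match(label)
    if m is None:
        return None
    return m.group('position'), m.group('name')

print(parse_charpos('18-21 - Illustrations (006/01-04)'))
# ('18-21', 'Illustrations (006/01-04)')
print(parse_charpos('05 - Record status'))
# ('05', 'Record status')
print(parse_charpos('Character Positions'))
# None
```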

BeautifulSoup lets you write code like this to walk through an HTML document and extract parts:


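for charpos in soup.findAll('div', {'class': 'characterposition'}):
    try:
        # the position label is the text of the next <strong> element
        text = charpos.findNext('strong', recursive=False).contents[0].rstrip()
    except (AttributeError, IndexError):
        continue  # no <strong> label found for this div; skip it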
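
For readers who want to try the idea without installing anything, here is a rough sketch of the same walk using the standard library's `html.parser` in place of BeautifulSoup. The sample markup, the `CharPosParser` class, and the `extract_labels` helper are all made up for illustration; the real specification pages may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real Library of Congress pages may differ.
SAMPLE_HTML = """
<div class="characterposition"><strong>05 - Record status</strong>
  <div class="description">One-character alphabetic code.</div>
</div>
<div class="characterposition"><strong>06 - Type of record </strong></div>
"""

class CharPosParser(HTMLParser):
    """Collect the text of the first <strong> after each div.characterposition."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._want_strong = False  # saw a characterposition div, label not yet read
        self._in_strong = False    # currently inside the <strong> we want

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "characterposition":
            self._want_strong = True
        elif tag == "strong" and self._want_strong:
            self._in_strong = True
            self.labels.append("")

    def handle_data(self, data):
        if self._in_strong:
            self.labels[-1] += data

    def handle_endtag(self, tag):
        if tag == "strong" and self._in_strong:
            self._in_strong = False
            self._want_strong = False
            self.labels[-1] = self.labels[-1].rstrip()

def extract_labels(html):
    parser = CharPosParser()
    parser.feed(html)
    parser.close()
    return parser.labels

print(extract_labels(SAMPLE_HTML))
```

Each label collected this way can then be split into position and name with a regular expression.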

I wrote a little Python script to download the relevant specification files and ran my extraction scripts. Out came something like this:


<marc21spec>
   <tag code="000">
      <position description="Computer-generated, five-character number equal to the length of the entire
                              record, including itself and the record terminator. The number is right justified
                              and unused positions contain zeros." name="Record length" position="00-04"/>
      <position description="One-character alphabetic code that indicates the relationship of the record to a
                              file for file maintenance purposes." name="Record status" position="05">
         <value code="a" description="Increase in encoding level"/>
         <value code="c" description="Corrected or revised"/>
         <value code="d" description="Deleted"/>
         <value code="n" description="New"/>
         <value code="p" description="Increase in encoding level from prepublication"/>
      </position>

The MARC21 specification in all its glory, as XML!
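
To give a feel for how this XML could drive context-sensitive help, here is a minimal sketch that looks up a value's description with Python's `xml.etree.ElementTree`. The sample document is a trimmed, closed-off copy of the excerpt above (descriptions shortened), and `describe_value` is a hypothetical helper, not part of Biblios.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above, with the closing tags added.
SAMPLE_XML = """<marc21spec>
   <tag code="000">
      <position name="Record length" position="00-04"
                description="Computer-generated, five-character number."/>
      <position name="Record status" position="05"
                description="One-character alphabetic code.">
         <value code="c" description="Corrected or revised"/>
         <value code="d" description="Deleted"/>
         <value code="n" description="New"/>
      </position>
   </tag>
</marc21spec>"""

def describe_value(root, tag_code, position, value_code):
    """Return (position name, value description), or None if not found."""
    for tag in root.findall("tag"):
        if tag.get("code") != tag_code:
            continue
        for pos in tag.findall("position"):
            if pos.get("position") != position:
                continue
            for value in pos.findall("value"):
                if value.get("code") == value_code:
                    return pos.get("name"), value.get("description")
    return None

root = ET.fromstring(SAMPLE_XML)
print(describe_value(root, "000", "05", "n"))
# ('Record status', 'New')
```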

I have no doubt that there are still some incorrect or missing parts in the generated XML files. If you’re interested in double-checking the files or, even better, improving the extraction scripts, you can download them here:
marc21controlfields.xml
marc21varfields.xml
extractControlTags.py
extractVariableTags.py

To run the extraction scripts, pass them the path to the directory containing the .html files (or to a single file) and the name of the file to write the XML output to:

python extractControlTags.py . marc21controlfields.xml
python extractVariableTags.py . marc21varfields.xml
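
Since the first argument can be either a directory or a single file, a script has to resolve it into a list of files before parsing. Here is a minimal sketch of one way to do that; `collect_html_files` is a hypothetical helper and is not copied from the extraction scripts.

```python
import os

def collect_html_files(path):
    """Return the .html files in a directory (sorted), or [path] for a single file."""
    if os.path.isdir(path):
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if name.lower().endswith(".html")
        )
    # Anything that is not a directory is treated as one input file.
    return [path]
```

Called with `.` it returns every .html file in the current directory; called with a file name it returns just that file.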

Biblios at Code4LibCon 2008

Friday, March 14th, 2008 by Chris Catalfo

I attended my first Code4Lib Conference a few weeks ago and did a presentation on “Biblios”, the web-based cataloging software I’ve been working on here at LibLime. I will be posting slides of the presentation in the next few days.

I am very sorry to have missed the presentation at Code4LibCon 2008 on a MODS editor written in XForms (link to slides available here). This looks like a very promising approach to editing XML documents, and XForms is an attractive technology that I plan to look into.

A web site for Biblios is in the works and should go live next week, with links to downloads, a demo, and documentation. As soon as it’s ready I will post links on this blog.