Archive for April, 2008

How Good is Google Book Services? Ask your mother.

Tuesday, April 29th, 2008 by atz

Despite not being even remotely Irish, my mother likes to make a traditional corned-beef and cabbage dinner for our family on St. Patrick’s Day, and this year was no exception. (Sorry, no pix.) My mother is a five-foot tall head reference librarian in a local public library system and she commands the type of incredible memory that you would expect from her profession. She makes use of this in another tradition, the singing of a particular St. Patrick’s Day hymn taught to her by nuns in grade school. Suffice to say, my mother has not been in grade school for a while, and for a song sung only once a year, it seems remarkable to me that anybody can remember all the words without any help. I mean, how far can you get into Auld Lang Syne or Good King Winceslas? If you say “all the way, no problem”, please remind yourself if you are currently, have even been, or are about to become a reference librarian.

So this song is essentially about St. Patrick and the persevering quality of Ireland’s faith in him, but pace is pretty quick and the lyrics are twisty and complicated. I’ve never heard this song anywhere else, and I didn’t know the title, but I wanted to find the lyrics. Since Google Book Services had just come out, I decided this would by my test. Searching on “St. Patrick” was never going to do it. The best I could do is remember the small, odd phrase “is bright with us yet.” In fact it was that phrase that made me want to revisit the lyrics in the first place.

GBS’s first hit was The Hymn-book of the Modern Church: Studies of Hymns and Hymn-writers (1905) by Arthur Edwin Gregory, a compendium of various hymns including two of the “Romanist” verses I was looking for, and a discussion of the author and his relative merits. Interestingly, it does not offer the title of the song, but Google’s results took me directly to the correct page so that immediately I was looking at the relevant content. Very impressive.

The second hit was even better: The Parochial Hymn Book (1897). 3 verses and a complete four-part vocal arrangement! The title, as it turns out, is “St. Patrick’s Day”, not the most helpful string to search against.

The specificity of the texts themselves is most remarkable. When the first rounds of “electronic texts” were circulated, many people were unimpressed with the experience of reading screens of flat ascii text, objecting to the sterile quality as not-bookish enough. The difference between that time and today is stark. Google’s text is not an abstract vision of the PHB’s content, rather it is photographic images from a particular book in a particular stack, with all the peculiarities of the physical original (save, perhaps, smell).

It has provenance: a hand-inked calligraphic block claims it for Andover-Harvard Theological Library of the Harvard Divinity School, with both a stamp and a bookplate noting that it comes from the estate of one Rev. Charles Hutchins on May 24th, 1939. You can even see Harvard’s call number penciled in on the verso page. The effect of these details is to greatly reinforce the validity of the text. Contrast this with a posting on any of a thousand interchangeable lyrics sites. Which would you regard as desireable or authoritative?

My experience was very much like those I suspect any fan of libraries has enjoyed, the feeling of discovering a tangible artifact of another time and place that was produced and preserved specifically so that you might encounter it, and have the information sought. Even 111 years later.

I chatted up my colleague Chris in New Zealand, one of the original Koha developers, about my GBS results. He was fairly impressed. For his test, he put in his father’s name, Ian Cormack, and was promptly returned as the first hit a link to his academic article “Creating an Effective Learning Environment for Maori Students” in Mai i Rangiatea: Maori Wellbeing and Development (1997). So there you have it: two texts separated by 100 years of time, and half the Earth in distance, accurately and immediately retrieved from one repository, for free.

Note: Google has built on to Book Services a bunch of other features, including ratings, reviews, tags and a My Library feature. In my My Library, you can see the three texts I mentioned. There is also a “Find this book in a library” link to OCLC’s WorldCat that tells me the closest (known) copy is 160 miles away.

For comparison, the my search at amazon was empty. Which suggests another question….

Paranoia Alert!! Assume your library had a copy of the same song in one hymnbook or another. Without GBS, based on my limited query data, how long would it take me to retrieve it at your library? I should add that my search began at 10:30PM.

Perhaps most striking is that GBS would be preferable even if I was sitting in Harvard’s library where the original still resides!

From libraries to Skynet

Monday, April 28th, 2008 by Galen Charlton

Who added AI to Koha?

Hint: Git tries its best to properly assign credit to patches, but it doesn’t always get it right.

Library jargon, or translating Koha from English to English

Sunday, April 13th, 2008 by Galen Charlton

As Andrew S. Tanenbaum said, “the nice thing about standards is that there are so many of them to choose from.” Good old non-standardized library jargon provides an even richer field of variation. Do libraries serve members, patrons, clients, or customers? Is a patron placing a hold request or a reservation? When the item arrives and the patron checks it out, do we call the transaction a loan, a checkout, or an issue? Can the library issue an issue to patron? How many synonyms have I missed so far?

Koha’s base HTML templates use “English”; translations to other languages are generated by extracting strings from the templates and giving them to the translators. The files containing translated strings are then used to create a set of HTML templates in the desired language.

I put “English” in scare quotes because while nominally the language of coding is (I think) the New Zealand variant of the Queen’s English, in practice it is a mixture of NZ English, UK English, US English, and so forth. That already opens the door to potentially desirable localizations — after all, one really ought to put one’s “colour” and “flavor” in the right sociogeolinguistic buckets.

Which brings us back to library jargon — a “reservation” in one country is another’s “hold request”. An academic library’s “recall request” is a public library’s “you’ve gotta be kidding!”. A bright idea! Let’s convene an international committee to standardize English-language library jargon! I’m holding my breath with anticipation …

Still holding — but why not expand the scope of the committee and handle French library jargon?

*thunk*

later

OK, so that didn’t work. For now, it looks like a better solution is to embrace the differences and set up en-NZ, en-US, en-GB, etc. as defined translations for Koha, per some recent traffic on the koha-devel list. Localization ultimately doesn’t apply to just language and country; think of en-US-academic_library, en-US-small_public, etc.

Making the MARC21 specification usable via XML

Monday, April 7th, 2008 by Chris Catalfo

As part of my work on the Biblios project, I need access to the MARC21 specification in a machine-readable form so that Biblios can provide context-sensitive help. To this end, I’ve been extracting field and subfield names, descriptions, and valid values from the MARC21 specification in HTML form, available at the Library of Congress website here. I thought other folks might be interested in this, hence this post.

I enjoy using Python and so I thought I’d try whip something up in it. I discovered the BeautifulSoup Python library for extracting data from HTML pages and it sounded perfect for this task.

A glance at the specification page for the MARC21 Leader field (here) shows that, happily, the pages are constructed fairly logically and can be extracted based on their css class.

I put together a few regular expressions like this one to parse out the key data:


# get the character position like this:
# 18-21 - Illustrations (006/01-04)
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')

BeautifulSoup lets you write code like this to walk through an HTML document and extract parts:


 for charpos in soup.findAll('div', {'class':'characterposition'}):
        try:
            text = charpos.findNext('strong', recursive=False).contents[0].rstrip()

I wrote a little Python script to download the relevant specification files and ran my extraction scripts. Out came something like this:


<marc21spec>
   <tag code="000">
      <position description="Computer-generated, five-character number equal to the length of the entire
                              record, including itself and the record terminator. The number is right justified
                              and unused positions contain zeros." name="Record length" position="00-04"/>
      <position description="One-character alphabetic code that indicates the relationship of the record to a
                              file for file maintenance purposes." name="Record status" position="05">
         <value code="a" description="Increase in encoding level"/>
         <value code="c" description="Corrected or revised"/>
         <value code="d" description="Deleted"/>
         <value code="n" description="New"/>
         <value code="p" description="Increase in encoding level from prepublication"/>
      </position>

The MARC21 specification in all it’s glory, as xml!

I have no doubt that there are still some incorrect or missing parts in the generated xml files. If you’re interested in double checking the files, or even better, improving the extraction scripts, you can download them here:
marc21controlfields.xml
marc21varfields.xml
extractControlTags.py
extractVariableTags.py

To run the extraction scripts, pass them the path containing the .html files (or a single file) and a filename to output the xml content to:

python extractControlTags.py . marc21controlfields.xml
python extractVariableTags.py . marc21variablefields.xml