From libraries to Skynet
April 28th, 2008 by Galen Charlton
Who added AI to Koha?
Hint: Git tries its best to properly assign credit to patches, but it doesn’t always get it right.
As Andrew S. Tanenbaum said, “the nice thing about standards is that there are so many of them to choose from.” Good old non-standardized library jargon provides an even richer field of variation. Do libraries serve members, patrons, clients, or customers? Is a patron placing a hold request or a reservation? When the item arrives and the patron checks it out, do we call the transaction a loan, a checkout, or an issue? Can the library issue an issue to a patron? How many synonyms have I missed so far?
Koha’s base HTML templates use “English”; translations to other languages are generated by extracting strings from the templates and giving them to the translators. The files containing translated strings are then used to create a set of HTML templates in the desired language.
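The extract-translate-regenerate cycle can be pictured with a few lines of Python. This is a deliberately naive sketch and not Koha's actual translation tooling: the `extract_strings` and `translate_template` helpers, the sample template, and the French strings are all made up for illustration, and a real tool would need to handle template directives and attributes as well.

```python
from html.parser import HTMLParser


class StringExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML template."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strings.append(text)


def extract_strings(template):
    """Return the translatable strings found in a template, in document order."""
    parser = StringExtractor()
    parser.feed(template)
    parser.close()
    return parser.strings


def translate_template(template, translations):
    """Naively substitute each extracted string with its translation, if any."""
    for source in extract_strings(template):
        template = template.replace(source, translations.get(source, source))
    return template


# Hypothetical template and translations, for illustration only.
TEMPLATE = "<h1>Place a hold</h1><p>Your holds</p>"
FRENCH = {"Place a hold": "Faire une réservation", "Your holds": "Vos réservations"}

print(extract_strings(TEMPLATE))
print(translate_template(TEMPLATE, FRENCH))
```

Strings without a translation are left in the source language, so a partially translated catalog still yields a usable template.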
I put “English” in scare quotes because while nominally the language of coding is (I think) the New Zealand variant of the Queen’s English, in practice it is a mixture of NZ English, UK English, US English, and so forth. That already opens the door to potentially desirable localizations — after all, one really ought to put one’s “colour” and “flavor” in the right sociogeolinguistic buckets.
Which brings us back to library jargon — a “reservation” in one country is another’s “hold request”. An academic library’s “recall request” is a public library’s “you’ve gotta be kidding!”. A bright idea! Let’s convene an international committee to standardize English-language library jargon! I’m holding my breath with anticipation …
…
Still holding — but why not expand the scope of the committee and handle French library jargon?
…
*thunk*
later
OK, so that didn’t work. For now, it looks like a better solution is to embrace the differences and set up en-NZ, en-US, en-GB, etc. as defined translations for Koha, per some recent traffic on the koha-devel list. Localization ultimately doesn’t apply to just language and country; think of en-US-academic_library, en-US-small_public, etc.
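One way to picture that layering is a fallback chain: a string is looked up in the most specific variant first, then in progressively more general ones. The sketch below is only an illustration; the `fallback_chain` and `lookup` helpers and the sample catalogs are hypothetical, not anything Koha ships.

```python
def fallback_chain(locale):
    """Return the locale followed by its progressively more general parents."""
    parts = locale.split("-")
    return ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]


def lookup(catalogs, locale, key):
    """Find the first translation of key along the fallback chain, else the key itself."""
    for candidate in fallback_chain(locale):
        strings = catalogs.get(candidate, {})
        if key in strings:
            return strings[key]
    return key


# Hypothetical sample catalogs, for illustration only.
CATALOGS = {
    "en": {"hold": "hold request", "patron": "patron"},
    "en-NZ": {"hold": "reservation", "patron": "member"},
    "en-US-academic_library": {"hold": "recall request"},
}

print(fallback_chain("en-US-academic_library"))
print(lookup(CATALOGS, "en-US-academic_library", "patron"))
```

With this arrangement a specialized variant only has to list the terms it changes; everything else falls through to the broader variants.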
As part of my work on the Biblios project, I need access to the MARC21 specification in a machine-readable form so that Biblios can provide context-sensitive help. To this end, I’ve been extracting field and subfield names, descriptions, and valid values from the MARC21 specification in HTML form, available at the Library of Congress website here. I thought other folks might be interested in this, hence this post.
I enjoy using Python, so I thought I’d try to whip something up in it. I discovered the BeautifulSoup Python library for extracting data from HTML pages, and it sounded perfect for this task.
A glance at the specification page for the MARC21 Leader field (here) shows that, happily, the pages are constructed fairly logically and their contents can be extracted based on CSS class.
I put together a few regular expressions like this one to parse out the key data:
import re

# get the character position and name from a heading like this:
# 18-21 - Illustrations (006/01-04)
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')
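To see what the named groups capture, here is a small self-contained check of that expression against the sample line from the comment. The `parse_heading` helper is an illustrative name, and the second sample line, `05 - Record status`, is assumed to follow the same layout.

```python
import re

# Same pattern as above: position, optional "-NN" extent, then the name.
charposmatch = re.compile(r'^(?P<position>\d{1,2}(?P<extent>-\d{2})*)\s-\s(?P<name>.*)')


def parse_heading(line):
    """Split a heading into (position, name), or return None if it does not match."""
    m = charposmatch.match(line)
    if m is None:
        return None
    return m.group('position'), m.group('name')


print(parse_heading('18-21 - Illustrations (006/01-04)'))
# ('18-21', 'Illustrations (006/01-04)')
print(parse_heading('05 - Record status'))
# ('05', 'Record status')
```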
BeautifulSoup lets you write code like this to walk through an HTML document and extract parts:
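for charpos in soup.findAll('div', {'class':'characterposition'}):
    try:
        # the heading text lives in the next <strong> element
        text = charpos.findNext('strong', recursive=False).contents[0].rstrip()
    except (AttributeError, IndexError):
        continue  # no usable heading for this character position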
I wrote a little Python script to download the relevant specification files and ran my extraction scripts. Out came something like this:
<marc21spec>
<tag code="000">
<position description="Computer-generated, five-character number equal to the length of the entire
record, including itself and the record terminator. The number is right justified
and unused positions contain zeros." name="Record length" position="00-04"/>
<position description="One-character alphabetic code that indicates the relationship of the record to a
file for file maintenance purposes." name="Record status" position="05">
<value code="a" description="Increase in encoding level"/>
<value code="c" description="Corrected or revised"/>
<value code="d" description="Deleted"/>
<value code="n" description="New"/>
<value code="p" description="Increase in encoding level from prepublication"/>
</position>
The MARC21 specification in all its glory, as XML!
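Once the data is in this shape, pulling out help text for a given position takes only a few lines with the standard library's ElementTree. The following is a rough sketch that assumes the element and attribute names shown in the excerpt above; the `describe` helper and the inline sample document (shortened, and closed off so that it parses) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample modeled on the excerpt above, closed so it is well-formed.
SAMPLE = """
<marc21spec>
  <tag code="000">
    <position name="Record length" position="00-04" description="Length of the entire record."/>
    <position name="Record status" position="05" description="Relationship of the record to a file.">
      <value code="c" description="Corrected or revised"/>
      <value code="n" description="New"/>
    </position>
  </tag>
</marc21spec>
"""


def describe(root, tag_code, position, value_code=None):
    """Return the name of a position, or the description of one of its values."""
    for tag in root.findall('tag'):
        if tag.get('code') != tag_code:
            continue
        for pos in tag.findall('position'):
            if pos.get('position') != position:
                continue
            if value_code is None:
                return pos.get('name')
            for value in pos.findall('value'):
                if value.get('code') == value_code:
                    return value.get('description')
    return None


root = ET.fromstring(SAMPLE)
print(describe(root, '000', '05'))       # Record status
print(describe(root, '000', '05', 'n'))  # New
```

A lookup that finds nothing returns None, which leaves it to the caller to decide what "no help available" should look like.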
I have no doubt that there are still some incorrect or missing parts in the generated XML files. If you’re interested in double-checking the files or, even better, improving the extraction scripts, you can download them here:
marc21controlfields.xml
marc21varfields.xml
extractControlTags.py
extractVariableTags.py
To run the extraction scripts, pass them the path containing the .html files (or a single file) and a filename to write the XML output to:
python extractControlTags.py . marc21controlfields.xml
python extractVariableTags.py . marc21variablefields.xml
Last month in Portland, Oregon at the code4lib 2008 convention (a most impressive assemblage of library geekiness), a few of us broke out to the 24-hour bakery Voodoo Doughnut for this:
A donut, with bacon on it. A bacon donut.
I’d expected it to be strange, but the remarkable thing about the bacon donut is how unsurprising the taste is. The sweet maple and salty flavors are, as it turns out, very compatible. So it strikes me that the work I’ve been doing on Koha recently is a lot like the bacon donut: take two things people already like, do the voodoo, and make them work together in a new way.
For the OPAC, the place where this comes up most often is external content, like book cover images. Koha libraries have been using jacket images from Amazon for some time in production, internationally. It’s free and it’s broadly populated: a great feature, especially for small libraries who don’t have the advantage of a lot of subscription content services. Using their API, we can also pull and display content like user reviews, really fleshing out OPAC content.
I recently completed some commissioned Koha code for integrating Baker & Taylor images and content as an alternative to Amazon. Koha can now link to B&T ContentCafe excerpts, ratings, etc. and to their MyLibrary BookStore retail site. For design, my code followed the Amazon model, and certainly something similar could be crafted for other proprietary sources like Blackwell, Syndetics, etc. But upon reflection, I think that the entire model is already on its way out!
Enter Google Book Services. I’ll have more to say about GBS later, but suffice it to say we now have a second, very widely available source of free book jacket images. (In fact, it may be enough to deflect calls some have been making for the Library of Congress to provide access to cover images like they do for other metadata.) The Google API is essentially JavaScript-based and remarkably easy to integrate. How easy? Code4lib members were posting working example code back and forth within hours, and then within a day or two, other Koha users adapted their own servers to start using Google’s images. This is a great example of how OSS enables agility and adaptability.
So pretty soon we should expect that every current OPAC will have some images from somewhere, and that won’t be a distinguishing feature anymore. The next model to evolve will be to allow ajaxy failover from a ranked menu of many possible image sources (both free and subscription/keyed like B&T/Syndetics). In fact, several coders have reported implementing this for their favorite sources already! I’m looking forward to seeing this code synthesized, providing the broadest possible coverage for images. Then we can start to get some abstraction around the other data in common, like reviews, ratings, etc.
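The ranking logic itself is simple, wherever it ends up running. Here is a minimal sketch in Python rather than browser-side JavaScript; the provider functions are stand-ins that return canned data, and none of the names correspond to a real Amazon, Google, or Baker & Taylor API.

```python
def first_cover(isbn, providers):
    """Try each (name, lookup) pair in ranked order; return the first hit."""
    for name, lookup in providers:
        try:
            url = lookup(isbn)
        except Exception:
            # A failing source should not stop us from trying the next one.
            continue
        if url:
            return name, url
    return None


# Hypothetical stand-ins for real image sources, with canned answers.
def subscription_source(isbn):
    raise RuntimeError("no subscription key configured")


def free_source_a(isbn):
    return None  # this source has no image for the title


def free_source_b(isbn):
    return "http://covers.example.org/%s.jpg" % isbn


RANKED = [
    ("subscription", subscription_source),
    ("free-a", free_source_a),
    ("free-b", free_source_b),
]

print(first_cover("9780000000002", RANKED))
```

Because each source is just a name and a callable, adding another one or reordering the ranking is a change to the list, not to the failover loop.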
Some of my colleagues have already started on LibraryThing and xISBN. If you have other external data sources you would like to see integrated in Koha, feel free to mention them here!
I’m happy to announce that a packaged beta release of Koha 3 is now available. You can download from the usual location:
http://download.koha.org/koha-3.00.00-beta.tar.gz
http://download.koha.org/koha-3.00.00-beta.tar.gz.sig
You can check the integrity of the package either by verifying the provided GPG signature (.sig) or by comparing the MD5 checksum:
84f6ec3615155cfa755a9e7139bd07df koha-3.00.00-beta.tar.gz
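If you would rather not rely on a separate md5sum tool, the same comparison can be done with Python's standard hashlib module. This is only a sketch: `md5_of` and `matches` are illustrative names, and it assumes the tarball has already been downloaded to the current directory.

```python
import hashlib

# Published checksum for koha-3.00.00-beta.tar.gz, as listed above.
EXPECTED = "84f6ec3615155cfa755a9e7139bd07df"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected=EXPECTED):
    """True when the file's MD5 digest equals the expected checksum."""
    return md5_of(path) == expected.lower()
```

Calling `matches("koha-3.00.00-beta.tar.gz")` should then return True for an intact download. An MD5 match only guards against a corrupted transfer; the GPG signature is the stronger check.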
I’ve also tagged this in Git as v3.00.00-beta (“version 3.00.00 beta”).
This is the second packaged release of Koha 3. Prior to the official stable release of Koha 3.0, software issues, bugs, and unimplemented features must be addressed. These are documented on Koha’s Bugzilla:
and organized on the 3.0 RM’s QA notes Wiki page:
http://wiki.koha.org/doku.php?id=en:development:qanotes3.0
The release notes for this beta version are pasted in an email to the koha-devel and main koha user lists, and will also be on the koha.org website sometime over this weekend.
The folks at Index Data have switched from CVS to git for many of their open source products, including YAZ, Zebra, and PazPar2. Visit their gitweb or clone from their repository (git://git.indexdata.com/project) for some distributed version control goodness. They use submodules, so git version 1.5.3 or higher is required.
Recently, on the Koha list, one of the users asked about the possibility of adding a ‘browse’ feature to the detail of a given record. The idea is, you might want to see what books appear on the shelf before and after that item, in a given location and shelf. As it turns out, it was a fairly trivial exercise: I spent Sunday afternoon whipping up a basic, browser-degradable shelf browser, and Owen Leonard, Koha’s Interface Designer, made it look pretty :-).
Why degradable? Glad you asked. One of the goals of the Koha project from the beginning has been that all of the interfaces are fully degradable and will work in any browser. So whenever we code a new feature, we write it for that environment first, then we slap on any additional functionality to make it prettier or more Ajaxy, etc.
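As for the shelf browser itself, the core idea is just a window onto a sorted list of call numbers. A rough sketch of that idea follows. It is not Koha's implementation: the `neighbours` function and the sample call numbers are invented, and plain string sorting stands in for proper call-number collation.

```python
import bisect


def neighbours(call_numbers, current, before=3, after=3):
    """Return the items shelved just before and just after `current`."""
    shelf = sorted(call_numbers)
    index = bisect.bisect_left(shelf, current)
    if index == len(shelf) or shelf[index] != current:
        raise ValueError("%r is not on this shelf" % current)
    return shelf[max(0, index - before):index], shelf[index + 1:index + 1 + after]


# Invented sample data; plain string order stands in for real call-number sorting.
SHELF = ["025.3 KOH", "025.04 BAT", "005.133 PYT", "025.3 MAR", "020 LIB", "641.5 BAC"]

print(neighbours(SHELF, "025.3 KOH", before=2, after=2))
# (['020 LIB', '025.04 BAT'], ['025.3 MAR', '641.5 BAC'])
```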
Anyway … Here’s a basic screenshot of the display:
Hi, I’m Andrew Moore, the newest addition to the LibLime development team. After my first day at LibLime yesterday, I’ve actually made my first tiny addition to Koha today. I’m excited to work with the rest of the team and help improve Koha as much as I can. Although I’ve been writing Perl for a few years, I don’t have much experience with library technologies. I fully expect to goof something up spectacularly real soon. So, keep an eye out for that and please go easy on me when I do!
I attended my first Code4Lib Conference a few weeks ago and did a presentation on “Biblios”, the web-based cataloging software I’ve been working on here at LibLime. I will be posting slides of the presentation in the next few days.
I am very sorry to have missed the presentation at Code4LibCon 2008 on a MODS editor written in XForms (link to slides available here). This looks like a very promising approach for editing XML documents. XForms is an attractive technology that I plan on looking into.
A web site for Biblios is in the works and should go live next week, with links to downloads, a demo, and documentation. As soon as it’s ready I will post links on this blog.
Last Wednesday I gave a lightning talk at Code4LibCon on some musings about Git qua distributed version control system and ideas for distributed cataloging. Check out my slides.
Slides from the other lightning talks are being posted here. Be sure to check out Andy Mullen’s presentation when his slides and the video are posted — making player piano MIDI files from OCRs of scanned scores is special enough, but his sense of dramatic timing during his presentation was marvelous.
Crossposted at Meta Interchange