Archive for the ‘Code4Lib’ Category

Code4Lib 2008: The Internet Archive

Wednesday, February 27th, 2008 by Nicole C. Engard

What a great way to open a conference like Code4Lib. The first keynote was presented by Brewster Kahle of the Internet Archive.

Brewster started by reminding us that the reason he was there talking to us, and the reason he is working on the Internet Archive, is that the library metaphor easily translates to the Internet - as librarians we’re paid to give stuff away! We work in a $12 billion a year industry which supports the publishing infrastructure. With the Internet Archive, Brewster is not suggesting that we spend less money - but that we spend it better.

He started with a slide of the Boston Public Library which has “Free to All” carved in stone. Brewster says that what people carve in stone is taken seriously - and so this is a great example of what libraries stand for. Our opportunity now is to go digital: to provide free digital content in addition to the traditional content we have been providing. I loved that he then said that this is not just a time for us to be friendly together as librarians - but to work together as a community and build something that can be offered freely to all!

He went on to say that what happens to libraries is that they burn - they tend to get burned by governments who don’t want them around. The Library of Alexandria is probably best known for not being here anymore. This is why keeping lots of copies keeps stuff safe. Along those lines, the Internet Archive makes sure to store its data in mirror locations - and by providing information to the archive we’re ensuring that our data is also kept safe and available. This idea of large scale swap agreements (us sharing with the Internet Archive, us sharing with other libraries, etc.) in different geographical regions gives us some level of preservation.

How it started

The Internet Archive started by collecting the World Wide Web - every 2 months taking a snapshot of the web. Brewster showed Yahoo! as it looked 10 years ago - ironically a bit of data that even Yahoo! didn’t have - so for their 10 year anniversary they had to ask the Internet Archive for a copy of what their site looked like! He showed us the first version of Code4Lib’s site and exclaimed “Gosh is that geeky!” because it was a simple black text on white background page.

While it may have seemed a bit ambitious to archive the web, the Wayback Machine gets about 500 hits a second. And it turns out that the out of print materials on the web are often just as valuable as the in print information on the web. People are looking for the way things were for historical or cultural research reasons and this tool makes it possible.

Audio

The Grateful Dead started a tradition in the 60s of allowing people to record their concerts and share them with others - this tradition of tape trading caught on and lots of bands were doing this. Following in this tradition, the Internet Archive decided to offer unlimited storage and unlimited bandwidth for free to any band who wanted to provide recordings of their concerts to the archive. It’s a bit different than tape trading, but an amazing idea! They are getting 1 or 2 bands a day - around 30,000 concerts now and it’s working! Overall the community is building the best metadata Brewster’s ever seen - beautiful work supported by a community - just what I love to hear!!

This shows that librarians can provide a role other than providing information - they can provide back end storage for information. By giving people like these bands a place to store their music for free, the Internet Archive made it so that concerts are now available online for those in search of them!

Moving Images

1000 movies that are out of copyright are available via the Internet Archive. Interestingly, the things that are popular are movies you can’t get any other way - movies you wouldn’t expect people to be interested in at all - government films, social behavior films like the ones you saw in high school when you had a substitute teacher - they’re fantastically popular. Brewster theorizes, and I tend to agree, that people are using these videos as research tools to see what things were like culturally at different times in history.

Brewster is a follower of the “it’s easier to apologize than ask permission” philosophy and it has worked very well for him and the organization. You probably have a closet of video tapes that are just waiting to go online - so put them online and if people ask you to take them down - take them down. One example that most of us have probably seen is the Lego movies. Brewster found this genre of movies fascinating - but he mentions that if it weren’t for the free storage on the archive (pre-YouTube) these movies may never have been so widely spread. He described this as the library supporting a community that had no home before. We’re here to put things on shelves and give things away - so why not put things online and give them away?

Television

The Internet Archive only has 1 week of TV available so far - 9/11 - 9/18/2001. This shows a full picture of what people were watching during that horrible week. (update: I may have misunderstood - as I view the archive site I see more than just this….)

Apparently there is someone in North Carolina recording TV nonstop on 20 channels in DVD quality. Apparently it costs him about $15 per video hour to digitize, and he has over 50,000 videos in his archive. You can’t get just one point of view (you need multiple channels) - the news may say it’s fair and balanced, but it’s not - and you don’t just want Jon Stewart as your archive of news :)

Software

Not much because of licensing issues - it’s doable - just not legal yet.

Text

This is where Brewster sees the biggest opportunity for traditional libraries to participate. We have in our charge the responsibility to distribute print/books.

We, as librarians, have to work very hard on text. Look at what we did with journals - we handed them to many corporations and now we have to rent them back from them :( if we had never let it happen in the first place we wouldn’t be wondering how to digitize our journals now. The same thing is going on with monographs now - we’re handing them over to corporations - we should be doing this ourselves instead and the Internet Archive wants to help.

There are 26 million books in the Library of Congress. One book is about 1MB, so that’s 26TB in the Library of Congress. For $60,000 you could have the entire Library of Congress digitized.
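As a quick sanity check on those numbers, here is a minimal Python sketch of the arithmetic. It only uses the figures quoted above; the decimal conversion (1 TB = 1,000,000 MB) and the idea of dividing the $60,000 by the total size to get a per-terabyte figure are my own assumptions, not something Brewster spelled out.

```python
# Figures quoted in the talk notes above.
BOOKS_IN_LOC = 26_000_000      # books in the Library of Congress
MB_PER_BOOK = 1                # rough size of one book, in megabytes
QUOTED_COST_USD = 60_000       # dollar figure quoted in the notes

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def collection_size_tb(books: int, mb_per_book: float) -> float:
    """Return the total size of a collection in terabytes."""
    return books * mb_per_book / MB_PER_TB


size_tb = collection_size_tb(BOOKS_IN_LOC, MB_PER_BOOK)

# Assumption: spread the quoted dollar figure evenly over the collection.
implied_usd_per_tb = QUOTED_COST_USD / size_tb

print(f"Collection size: {size_tb:.0f} TB")
print(f"Implied cost: ${implied_usd_per_tb:,.0f} per TB")
```

The size works out to the 26TB quoted in the talk; the per-terabyte figure is only what the two quoted numbers imply when taken at face value.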

Brewster’s goal sounds like a simple one - “one webpage for every book ever published.” What would it take to do this?

First off, we’d have to scan a whole heck of a lot of books - and get the catalog data.

The archive has experimented with a few methods. First they worked with the Million Book Project - they shipped their books to India and they learned not to ship their books to India. Brewster recommends that you have the Indians scan the books they like - but keep your books to yourself. Instead they found that for 10 cents a page they could scan their own items in house. They came up with a scanner and have a person turn the pages of the book - they tried the robots but they weren’t great (may be better now). At the University of Toronto this method produces a million pages a month.
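To put the scanning numbers side by side, here is a rough Python sketch. The 10 cents a page and the million pages a month come from the talk; applying the 10 cent rate to the Toronto output and the 300 pages per book average are my own assumptions, purely for illustration.

```python
# Figures quoted in the talk notes above.
CENTS_PER_PAGE = 10            # in-house scanning cost per page
PAGES_PER_MONTH = 1_000_000    # monthly output quoted for the University of Toronto

# Hypothetical average book length; not a number from the talk.
ASSUMED_PAGES_PER_BOOK = 300


def monthly_cost_usd(pages: int, cents_per_page: int) -> int:
    """Return the scanning cost in whole dollars for a given page count."""
    return pages * cents_per_page // 100


def books_per_month(pages: int, pages_per_book: int) -> int:
    """Return how many complete books fit in a given page count."""
    return pages // pages_per_book


cost = monthly_cost_usd(PAGES_PER_MONTH, CENTS_PER_PAGE)
books = books_per_month(PAGES_PER_MONTH, ASSUMED_PAGES_PER_BOOK)

print(f"Monthly scanning cost: ${cost:,}")
print(f"Books per month at {ASSUMED_PAGES_PER_BOOK} pages each: {books:,}")
```

Working in whole cents keeps the arithmetic exact; change the assumed book length to match your own collection.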

So, for the cost of copying a page at Kinkos you can digitize it and add MARC records and share with the world. Most importantly it’s being done by librarians - out of the corporate sphere. We need to demand the right to give our books away - not have our books owned by corporations who will rent the content to us with exceptions tied to it.

Some quotes from Brewster: “Please help support these scanning centers while they’re up and running … take collections that you’ve got and have them digitized and start building services around them.” If we’re going to build one web page for every book, we’re going to have to scan a lot of books. One option of a service you could add is a scan on demand link to your catalog. Have patrons click this link to have a book scanned - same cost as ILL - might as well scan it and put it on the web for anyone to use.

Then you can provide your digital copies via ILL. Brewster states: “I don’t know what loan means in the digital world - but let’s figure it out!” Why wait for someone else to tell us?

Next, let’s scan all the microfilm. Someone came up to Brewster after one of his talks and said - “we’ve done this before - it’s called microfilm.” So why not digitize our microfilm as well? For less than 10 cents a page they can do all microfilm. The Internet Archive is actually doing a large scale microfilm scanning project right now using the Carnegie model. Apparently Carnegie would build your library for you if you promised to stock it with books and materials. So the Kahle/Austin Foundation will donate a microfilm scanner to your organization for X years if you, the library, will keep it up and running for X hours a week. This only costs labor and time and no money has to change hands. In the end we’ve digitized all of our microfilm and made it more accessible.

This made me think of a question - if years ago people said you should microfilm everything and now everyone’s saying you should digitize it - what’s to say that in another 50 years there won’t be another format? This sounds to me like a never ending loop - but at the same time it sounds like such an obvious progression given the technology we have and the types of users we’re dealing with.

Next, we need better selection - right now we’re just digitizing whatever we’re handed - this means we don’t have full collections. Because of this the Internet Archive now has 90 sponsor collections. “We need help!” - Brewster asks that we pick an area of cataloged material and share that digitally - think outside of your own library. For some reason librarians seem to think that they’re only responsible for digital copies of materials they have in their own library. Keep digital copies of things from other libraries too - why only have digital copies of items you have in print? You want a full collection on your area of study for your library. This was something I was working on at the Seminary. I was finding digital copies of materials I thought would be of interest to our students and importing those OCLC records into our catalog. Just another way to provide access to data.

The next step according to Brewster is to build the catalog and “we finally need to do this FRBR thing - come on guys, it’s not that hard!!!” Even if the digital copy of the book isn’t available yet, it makes sense to provide pages for the book with catalog data that pulls information from sites like Amazon and other book information sites.


Code4Lib - Day 1
Originally uploaded by nengard

When the books are available, we need to work on our displays. Many of our displays are lacking. We need better search functions and open APIs to allow people to re-purpose our data in ways that make sense for them. We also need to make book images with pages that flip, and provide the ability to zoom in and print. In fact the Internet Archive offers a service where people can print books out in real paperback-looking formats.

Another option is to use the One Laptop per Child as an ebook reader. The Kindle handles ASCII formats okay - but not the types of images that we’re creating for our digital collections.

Conclusions

We have to work together on building this! We can’t just check back in a year and see what’s happening - instead of waiting for others to do the work - why not contribute? We want to be able to build some great services that will allow people to bulk download these materials and re-purpose them if they want.

One way is to join the Open Content Alliance - there are over 80 libraries now. It’s free to join, you just have to contribute.

The next step is to get service layers in place - this is where the code4libers come in. We have the skills to make the Internet Archive even more accessible and valuable.

Questions & Answers

Dan Chudnov asked what he called “tough questions” - now that some companies like Reed Elsevier are trying to change their business models from journal sales to other routes, is there an opportunity to go and buy up their journal services so we get our data back?

Brewster’s answer: there is a way to do this - some people are trying - but until it comes to the point where they aren’t making money any more we’re going to have to keep scanning ourselves.

Dan’s other question - is power an issue?

Brewster - power is costly, but not running out any time soon.

Another question: the data is only good as long as the disks are still spinning - how do you make it last for years?

Brewster: the question is a good one - the real way to have long term preservation is to have access - access drives preservation. Dark archives lead to data being lost. We have to replace our machines every few years to keep up. Tapes suck! Have you ever tried to read them back??? If there are at least 5 copies at 5 organizations, then I can sleep.

Real Conclusion

“if you’re frustrated enough - please come and help!” — Brewster

What an amazing way to stop! What an amazing way to start the conference! So many people were completely inspired, I can’t wait to see what comes of this talk - I hope some amazing APIs start popping up!

Open Source at Code4Lib

Tuesday, February 19th, 2008 by Nicole C. Engard

I’m very excited to be attending Code4Lib this year in Portland, Oregon (someplace I’ve never been). If you’re attending you’ll want to keep an eye out for these great LibLime open-source events:

OSS Web-based cataloging tool
with Chris Catalfo - Programmer, LibLime
Wednesday, Feb 27 • 9:45-10:05

This presentation introduces a new open source, web-based cataloging application, started for the 2007 Google Summer of Code and currently developed at LibLime. It provides a full featured, customizable, fast application for original and copy cataloging. It uses the ExtJS user interface toolkit, Google Gears for local storage of bibliographic records, PazPar2 for searching multiple Z39.50 servers, and it will feature an integrated Jabber client for exchanging records.

Git: Lightning Talk at Code4Lib
with Galen Charlton - Koha Application Developer, LibLime
Lightning Talk at Code4Lib

Git is a distributed revision control system created in 2005 and is most notably used by the Linux kernel project. In mid-2007, Git was adopted by the Koha open source ILS project, replacing CVS. Galen will discuss Git’s distributed repository model and the Koha developers’ experience adjusting to it, then end with some speculation about how decentralized information exchange applies to library metadata by playing with the metaphor of LC as a central CVS repository.

Koha Camp at Code4Lib!
with LibLime staff
Monday, Feb 25 • 9:30-4:30

Koha Camp is a unique first opportunity for systems librarians, library software developers and designers to come together for an open source experience with Koha Library Integrated System. The next Koha Camp will be held the day before Code4Lib in Portland, Oregon, on Monday, February 25. For an agenda… check out Koha Camp at Code4Lib 2008!

In addition, the entire first day will be filled with pre-conference events about open-source goodies like Evergreen, LibraryFind and Zotero (I don’t see descriptions on the conference site for these events). Lastly, these other open-source presentations sound promising as well:

From Idea to Open Source
with Andrew Nagy - Villanova University
Tuesday, Feb 26 • 11:40-12:00

Last year I spoke about my research and initial investigations of building a “Next Generation Catalog” using XML technologies coined as the MyResearch Portal. The software has since progressed into an open source project known as VuFind. In this presentation I will talk about architecture and design decisions that were made to turn VuFind into a viable open source project and what future plans are in store, as well as how making the project open source has aided the project (and put me into project leader overtime).

Show Your Stuff, using Omeka
with Dave Lester - Web Developer, Center for History and New Media, George Mason University & Jeremy Boggs - Creative Lead, Web Designer/Developer, Center for History and New Media, George Mason University
Wednesday, Feb 27 • 1:20-1:40

Libraries need a simple solution for sharing and publishing collections on the web. Omeka can help. Open source, robust, and easy to install, Omeka gives cultural and academic institutions the means to publish archived content into beautiful, customizable web sites and exhibits. We’ll show you how Omeka works, and how to extend it with plugins and custom themes. Finally, we’ll explore the possibilities for migrating and publishing existing collections from other management systems using Omeka.

Zotero and You, or Bibliography on the Semantic Web
with Trevor Owens - Technology Evangelist, Center for History and New Media, George Mason University
Tuesday, Feb 26 • 1:00-1:20

Representatives from the Center for History and New Media will introduce Zotero, a free and open source extension for Firefox that allows you to collect, organize and archive your research materials. After a brief demo and explanation, we will discuss best practices for making your projects “Zotero ready” and other opportunities to integrate with your digital projects through the Zotero API.

Now that I’ve whetted your appetite, I should let you know that registration is closed for this event, but in the spirit of openness I will be blogging every event I can! For those who are attending, the full schedule can be found on the Code4Lib site.