You are here
The 2010 Code4Lib Conference
Media, Blacklight, and Viewers Like You
- Chris Beer, WGBH, firstname.lastname@example.org
Code4Lib 2010 - Wednesday, February 24 - 10:35-10:55
There are many shared problems (and solutions) for libraries and archives in the interest of helping the user. There are also many "new" developments in the archives world that the library communities have been working on for ages, including item-level cataloging, metadata standards, and asset management. Even with these similarities, media archives have additional issues that are less relevant to libraries: the choice of video players, large file sizes, proprietary file formats, challenges of time-based media, etc. In developing a web presence, many archives, including the WGBH Media Library and Archives, have created custom digital library applications to expose material online. In 2008, we began a prototyping phase for developing scholarly interfaces by creating a custom-written PHP front-end to our Fedora repository.
In late 2009, we finally saw the (black)light, and after some initial experimentation, decided to build a new, public website to support our IMLS-funded /Vietnam: A Television History/ archive (as well as existing legacy content). In this session, we will share our experience of and challenges with customizing Blacklight as an archival interface, including work in rights management, how we integrated existing Ruby on Rails user-generated content plugins, and the development of media components to support a rich user experience.
Slides in PDF (2.61 MB)
I Am Not Your Mother: Write Your Test Code
- Naomi Dushay, Stanford University, email@example.com
- Willy Mene, Stanford University, firstname.lastname@example.org
- Jessie Keck, Stanford University, email@example.com
Code4Lib 2010 - Wednesday, February 24 - 09:55-10:15
How is it worth it to slow down your code development to write tests? Won't it take you a long time to learn how to write tests? Won't it take longer if you have to write tests AND develop new features, fix bugs? Isn't it hard to write test code? To maintain test code? We will address these questions as we talk about how test code is crucial for our software. By way of illustration, we will show how it has played a vital role in making Blacklight a true community collaboration, as well as how it has positively impacted coding projects in the Stanford Libraries.
Vampires vs. Werewolves: Ending the War Between Developers and Sysadmins with Puppet
- Bess Sadler, University of Virginia, firstname.lastname@example.org
Code4Lib 2010 - Wednesday, February 24 - 09:35-09:55
Developers need to be able to write software and deploy it, and often require cutting edge software tools and system libraries. Sysadmins are charged with maintaining stability in the production environment, and so are often resistant to rapid upgrade cycles. This has traditionally pitted us against each other, but it doesn't have to be that way. Using tools like puppet for maintaining and testing server configuration, nagios for monitoring, and hudson for continuous code integration, UVA has brokered a peace that has given us the ability to maintain stable production environment with a rapid upgrade cycle. I'll discuss both the individual tools, our server configuration, and the social engineering that got us here.
iBiblio copy of presentation (PDF)
Iterative Development Done Simply
- Emily Lynema, North Carolina State University Libraries, email@example.com
Code4Lib 2010 - Wednesday, February 24 - 09:15-09:35
With a small IT unit and a wide array of projects to support, requests for development from business stakeholders in the library can quickly spiral out of control. To help make sense of the chaos, increase the transparency of the IT "black box," and shorten time lag between requirements definition and functional releases, we have implemented a modified Agile/SCRUM methodology within the development group in the IT department at NCSU Libraries.
This presentation will provide a brief overview of the Agile methodology as an introduction to our simplified approach to iteratively handling multiple projects across a small team. This iterative approach allows us to regularly re-evaluate requested enhancements against institutional priorities and more accurately estimate timelines for specific units of functionality. The presentation will highlight how we approach each development cycle (from planning to estimating to re-aligning) as well as some of the actual tools and techniques we use to manage work (like JIRA and Greenhopper). It will identify some challenges faced in applying an established development methodology to a small team of multi-tasking developers, the outcomes we've seen, and the areas we'd like to continue improving. These types of iterative planning/development techniques could be adapted by even a single developer to help manage a chaotic workplace.
Slides in Powerpoint (2.15 MB)
Metadata Editing – A Truly Extensible Solution
- David Kennedy, Duke University, firstname.lastname@example.org
- David Chandek-Stark, Duke University, email@example.com
Code4Lib 2010 - Tuesday, February 23 - 14:00-14:20
We set out in the Trident project to create a metadata tool that scales. In doing so we have conceived of the metadata application profile, a profile which provides instructions for software on how to edit metadata. We have built a set of web services and some web-based tools for editing metadata. The metadata application profile allows these tools to extend across different metadata schemes, and allows for different rules to be established for editing items of different collections. Some features of the tools include integration with authority lists, auto-complete fields, validation and clean integration of batch editing with Excel. I know, I know, Excel, but in the right hands, this is a powerful tool for cleanup and batch editing.
In this talk, we want to introduce the concepts of the metadata application profile, and gather feedback on its merits, as well as demonstrate some of the tools we have developed and how they work together to manage the metadata in our Fedora repository.
Link: Trident Project site
Slides in Google Docs
HIVE: A New Tool for Working With Vocabularies
- Ryan Scherle, National Evolutionary Synthesis Center, firstname.lastname@example.org
- Jose Aguera, Universitty of North Carolina, email@example.com
Code4Lib 2010 - Tuesday, February 23 - 13:40-14:00
HIVE is a toolkit that assists users in selecting vocabulary and ontology terms to annotate digital content. HIVE combines the ease of folksonomies with the rigor of traditional vocabularies. By combining semantic web standards with text mining techniques, HIVE will improve the effectiveness of subject metadata generation, allowing users to search and browse terms from a variety of vocabularies and ontologies. Documents can be submitted to HIVE to automatically generate suggested vocabulary terms.
Your system can interact with common vocabularies such as LCSH and MESH via the central HIVE server, or you can install a local copy of HIVE with your own custom set of vocabularies. This talk will give an overview of the current features of HIVE and describe how to build tools that use the HIVE services.
Slides in PowerPoint
Matching Dirty Data – Yet Another Wheel
- Anjanette Young, University of Washington Libraries, younga3 at u washington edu
- Jeff Sherwood, University of Washington Libraries, jeffs3 at u washington edu
Code4Lib 2010 - Tuesday, February 23 - 13:20-13:40
Regular expressions is a powerful tool to identify matching data between similar files. When one or both of these files has inconsistent data due to differing character encodings or miskeying, the use of regular expressions to find matches becomes impractically complex.
The Levenshtein distance (LD) algorithm is a basic sequence comparison technique that can be used to measure word similarity more flexibly. Employing the LD to calculate difference eliminates the need to identify and code into regex patterns all of the ways in which otherwise matching strings might be inconsistent. Instead, a similarity threshold is tuned to identify close matches while eliminating false positives.
Recently, the UW Libraries began an effort to store Electronic Theses and Dissertations (ETD) in our institutional repository which runs on DSpace. We received 6,756 PDFs along with a file of UMI-created MARC records which needed to be matched to our library's custom MARC records (60,175 records). Once matched, merged information from both records would be used to create the dublin_core.xml file needed for batch ingest into DSpace. Unfortunately, records within the MARC data had no common unique identifiers to facilitate matching. Direct matching by title or author was impractical due to slight inconsistencies in data entry. Additionally, one of the files had "flattened" characters in title and author fields to ASCII. We successfully employed LD to match records between the two files before merging them.
This talk demonstrates one method of matching sets of MARC records that lack common unique identifiers and might contain slight differences in the matching fields. It will cover basic usage of several python tools. No large stack traces, just the comfort of pure python and basic computational algorithms in a step-by-step presentation on dealing with an old library task: matching dirty data. While much literature exists on matching/merging duplicate bibliographic records, most of this literature does not specify how to accomplish the task, just reports on the efficiency of the tools used to accomplish the task, often within a larger system such as an ILS.
Slides on Slideshare
Presentation Slides (PDF)
Taking Control of Library Metadata and Websites Using the eXtensible Catalog
- Jennifer Bowen, University of Rochester, firstname.lastname@example.org
Code4Lib 2010 - Tuesday, February 23 - 13:00-13:20
The eXtensible Catalog Project has developed four open-source software toolkits that enable libraries to build and share their own web- and metadata-focused applications on top of a service-oriented architecture that incorporates Solr in Drupal, a robust metadata management platform, and OAI-PMH and NCIP-compatible tools that interact with legacy library systems in real-time.
XC's robust metadata management platform allows libraries to orchestrate and sequence metadata processing services on large batches of metadata. Libraries can build their own services using the available "service-writers toolkit" or choose from our initial set of metadata services that clean up and "FRBRize" MARC metadata. Another service will aggregate metadata from multiple repositories to prepare it for use in unified discovery applications. XC software provides an RDA metadata test bed and a Solr-based metadata "navigator" that can aggregate and browse metadata (or data) in any XML format. XC's user interface platform is the first suite of Drupal modules that treat both web content and library metadata as native Drupal nodes, allowing libraries to build web-applications that interact with metadata from library catalogs and institutional repositories as well as with library web pages. XC's Drupal modules enable Solr in a FRBRized data environment, as a first step toward a full implementation of RDA. Other currently-available XC toolkits expose legacy ILS metadata, circulation, and patron functionality via web services for III, Voyager and Aleph (to date) using standard protocols (OAI-PMH and NCIP), allowing libraries to easily and regularly extract MARC data from an ILS in valid MARCXML and keep the metadata in their discovery applications "in sync" with source repositories.
This presentation will showcase XC's metadata processing services, the metadata "navigator" and the Drupal user interface platform. The presentation will also describe how libraries and their developers can get started using and contributing to the XC code.
7 Ways to Enhance Library Interfaces with OCLC Web Services
- Karen A. Coombs, OCLC, email@example.com
Code4Lib 2010 - Tuesday, February 23 - 11:40-12:00
OCLC Web Services such as xISSN, WorldCat Search API, WorldCat Identities, and the WorldCat Registry provide a variety of data which can be used to enhance and improve current library interfaces. This talk will discuss several simple ideas to improve current users interfaces using data from these services.
Code Samples includes all demos (zip of version 1.0 code)
Handout Explaining Code
Public Datasets in the Cloud
- Rosalyn Metz, Wheaton College, firstname.lastname@example.org
- Michael B. Klein, Oregon State University, Michael.Klein@oregonstate.edu
Code4Lib 2010 - Tuesday, February 23, 2010 - 11:20-11:40
When most people think about cloud computing (if they think about it at all), it usually takes one of two forms: Infrastructure Services, such as Amazon EC2 and GoGrid, which provide raw, elastic computing capacity in the form of virtual servers, and Platform Services, such as Google App Engine and Heroku, which provide preconfigured application stacks and specialized deployment tools. Several providers, however, offer access to large public datasets that would be impractical for most organizations to download and work with locally. From a 67-gigabyte dump of DBpedia's structured information store to the 180-gigabyte snapshot of astronomical data from the Sloan Digital Sky Survey, chemistry and biology to economic and geographic data, these datasets are available instantly and backed by enough pay-as-you-go server capacity to make good use of them. We will present an overview of currently-available datasets, what it takes to create and use snapshots of the data, and explore how the library community might push some of its own large stores of data and metadata into the cloud.
Slides in PowerPoint (1.32 MB)