Last year Declan Fleming presented ALL TEH METADATAS and reviewed our UC San Diego Library Digital Asset Management system and RDF data model. You may be shocked to hear that all that metadata wasn't quite enough to handle increasingly complex digital library and research data in an elegant way. Our ad-hoc, 8-year-old data model has also been added to in inconsistent ways and our librarians and developers have not always been perfectly in sync in understanding how the data model has evolved over time.
In this presentation we'll review our process of locking a team of librarians and developers in a room to figure out a new data model, from domain definition through building and testing an OWL ontology. We'll also cover the challenges we ran into, including the review of existing controlled vocabularies and ontologies, or lack thereof, and the decisions made to cover the gaps. Finally, we'll discuss how we engaged the digital library community for feedback and what we have to do next. We all know that Things Fall Apart; this is our attempt at Doing Better This Time.
Richard Wolf, University of Illinois at Chicago, firstname.lastname@example.org
Mobile is the new hotness ... and you can't be one of the cool kids unless you've got your own mobile app ... but the road to mobility is daunting. I'll argue that it's actually easier than it seems ... and that the simplest way to mobility is to bring your data to the party, create a REST API around the data, tell developers about your API, and then let the magic happen. To make my argument concrete, I'll show (lord help me!) how to go from an interesting REST API to a fun iOS tool for librarians and the general public in twenty minutes.
Bess Sadler, Stanford University Library, email@example.com
The difference between an open source software project that gets new adopters and new contributing community members (which is to say, a project that goes on existing for any length of time) and a project that doesn't, often isn't a question of superior design or technology. It's more often a question of whether the advocates for the project can convince institutional leaders AND front line developers that a project is stable and trustworthy. What are successful strategies for attracting development partners? I'll try to answer that and talk about what we could do as a community to make collaboration easier.
Shawn Averkamp, University of Iowa, shawn-averkamp at uiowa.edu
Matthew Butler, University of Iowa, matthew-butler at uiowa.edu
After a low-tech experiment in crowdsourced transcription grew into a surprisingly successful library initiative and demanded new commitments to user engagement, we found ourselves looking for a more efficient and user-friendly solution. We customized CHNM’s Scripto community transcription tool and various other Omeka plugins to develop a new site: DIYHistory.
We often receive questions about the technical side of both platforms, usually (to our dismay) from libraries who already assume they don't have the IT resources to pursue their own crowdsourcing initiatives. But we found that the software makes up only half of the recipe for success. Do you have compelling content? A long-term commitment to engaging with your users? Are you ready to promote your project far and wide? If so, then deploying a crowdsourcing initiative may be easier than you think.
Our very small development team, which consisted of a healthy mix of technologists and other stakeholders, worked closely and collaboratively on all aspects of the site. We’ll talk about customizing open-source software--how we scaled up functionality and scaled back design to improve user experience and production-level workflows--and how that process served to gently introduce collaborative software practices, such as using Git for version control, into a small but agile organization ready to grow. Finally, we'll share our transcription starter kit of forked Scripto and Omeka code and associated documentation for those interested in doing it themselves.
Hands off! Best Practices and Top Ten Lists for Code Handoffs
Naomi Dushay, Stanford University Library, firstname.lastname@example.org
Transition points in who is the primary developer on an actively developing code base can be a source of frustration for everyone involved. We've tried to minimize that pain point as much as possible through the use of agile methods like test driven development, continuous integration, and modular design. Has optimizing for developer happiness brought us happiness? What's worked, what hasn't, and what's worth adopting? How do you keep your project in a state where you can easily hand it off?
Adam Wead, Rock and Roll Hall of Fame and Museum, email@example.com
At the Library and Archives of the Rock and Roll Hall of Fame, we use available tools such as Archivists' Toolkit to create EAD finding aids of our collections. However, managing digital content created from these materials and the born-digital content that is also part of these collections represents a significant challenge. In my presentation, I will discuss how we solve the problem of our hybrid collections by using Hydra as a digital asset manager and Blacklight as a unified presentation and discovery interface for all our materials.
Our strategy centers on indexing EAD XML into Solr as multiple documents: one for each collection, and one for every series, sub-series and item contained within a collection. For discovery, we use this strategy to leverage item-level searching of archival collections alongside our traditional library content. For digital collections, we use this same technique to represent a finding aid in Hydra as a set of linked objects using RDF. New digital items are then linked to these parent objects at the collection and series level. Once this is done, the items can be exported back out to the Blacklight Solr index and the digital content appears along with the rest of the items in the collection.
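To make the "one finding aid, many Solr documents" idea concrete, here is a minimal sketch. It is not the Rock Hall's actual indexing code: the sample EAD is un-namespaced and heavily trimmed, and the field names (`id`, `level`, `title`, `parent_id`, `collection_id`) and the ID scheme are assumptions chosen for illustration. It only builds the list of documents; posting them to Solr is left out.

```python
import xml.etree.ElementTree as ET

# A tiny, simplified EAD sample (assumption: no XML namespace, minimal fields).
SAMPLE_EAD = """
<ead>
  <eadheader><eadid>ARC-0001</eadid></eadheader>
  <archdesc level="collection">
    <did><unittitle>Example Collection</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="item">
          <did><unittitle>Letter, 1965</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def title_of(node):
    """Return the unittitle text of a collection or component node."""
    title = node.find("did/unittitle")
    return title.text.strip() if title is not None and title.text else ""


def is_component(tag):
    """True for EAD component tags: <c> or numbered <c01>..<c12>."""
    return tag == "c" or (tag.startswith("c") and tag[1:].isdigit())


def ead_to_docs(xml_text):
    """Flatten one finding aid into a list of Solr-style documents."""
    root = ET.fromstring(xml_text.strip())
    ead_id = root.findtext("eadheader/eadid").strip()
    archdesc = root.find("archdesc")
    docs = [{
        "id": ead_id,
        "level": "collection",
        "title": title_of(archdesc),
        "parent_id": None,
        "collection_id": ead_id,
    }]

    def walk(node, parent_id):
        position = 0
        for child in node:
            if not is_component(child.tag):
                continue
            position += 1
            doc_id = f"{parent_id}-{position}"  # hypothetical ID scheme
            docs.append({
                "id": doc_id,
                "level": child.get("level", "unknown"),
                "title": title_of(child),
                "parent_id": parent_id,
                "collection_id": ead_id,
            })
            walk(child, doc_id)  # recurse into sub-series and items

    dsc = archdesc.find("dsc")
    if dsc is not None:
        walk(dsc, ead_id)
    return docs


docs = ead_to_docs(SAMPLE_EAD)
for doc in docs:
    print(doc["id"], doc["level"], doc["title"])
```

Keeping `parent_id` and `collection_id` on every document is one way to let an item-level hit link back to its series and collection; other layouts are equally plausible.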
Citation search in SOLR and second-order operators
Roman Chyla, Astrophysics Data System, roman.chyla AT (cfa.harvad.edu|gmail.com)
Citation search is basically about connections (Is the paper read by a friend of mine more important than others? Get me a paper read by somebody who cites many papers/is cited by many papers?), but the implementation of the citation search is surprisingly useful in many other areas.
I will show the 'guts' of the new citation search for astrophysics. It is generic and can be applied recursively to any Lucene query. Some people would call it a second-order operation because it works with the results of the previous (search) function. The talk will cover technical details of the special query class, its collectors, how to add a new search operator, and how to influence relevance scores. Then you can type with me: friends_of(friends_of(cited_for(keyword:"black holes") AND keyword:"red dwarf"))
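The idea of an operator that consumes the results of another query can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the ADS Lucene implementation: the toy paper graph and the operator names `citations` and `references` are hypothetical, and they are not claimed to match the semantics of `friends_of` or `cited_for` in the query above. Relevance scoring is left out entirely.

```python
# Toy corpus (made up): each paper has keywords and a list of papers it cites.
PAPERS = {
    "A": {"keywords": {"black holes"}, "cites": []},
    "B": {"keywords": {"red dwarf"}, "cites": ["A"]},
    "C": {"keywords": {"galaxies"}, "cites": ["B"]},
    "D": {"keywords": {"black holes"}, "cites": ["A", "C"]},
}


def keyword(term):
    """First-order query: papers tagged with the given keyword."""
    return {pid for pid, paper in PAPERS.items() if term in paper["keywords"]}


def references(results):
    """Second-order operator: papers cited by any paper in `results`."""
    return {cited for pid in results for cited in PAPERS[pid]["cites"]}


def citations(results):
    """Second-order operator: papers that cite any paper in `results`."""
    return {
        pid for pid, paper in PAPERS.items()
        if any(cited in results for cited in paper["cites"])
    }


# Operators take result sets and return result sets, so they nest freely
# and combine with ordinary boolean logic (set intersection here).
inner = citations(keyword("black holes")) & keyword("red dwarf")
outer = citations(inner)
print(sorted(inner), sorted(outer))
```

Because each operator maps a result set to a result set, adding a new operator is just adding another function with the same shape.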
Jeremy Nelson, Colorado College, firstname.lastname@example.org
Sheila Yeh, University of Denver, Sheila.Yeh@du.edu
The current state of technology in library automation is not keeping pace with the explosive growth in information storage and retrieval systems. The lag is costly for institutions and hampers users' resource discovery. To address this problem, we should look into how successful enterprises such as Craigslist and StackOverflow manage and scale their enormous volumes of data. The key lies in Redis, an open-source NoSQL advanced key-value data structure server. Colorado College and the University of Denver, along with the Colorado Alliance of Research Libraries, are therefore exploring and co-developing a MARCR Redis Datastore. It is a peer-to-peer bibliographic datastore, modeled using the Library of Congress Bibliographic Framework's new Linked Data based MARC 21 replacement, called MARCR (MARC Resources). The structure of MARCR lends itself to an advanced consortium catalog where a Work is cataloged once and multiple institutions have complete control over their own Instances of the Work, de-duplicating cataloging efforts while supporting real-time resource sharing between the Instances. Control, access, and discovery of records in the proposed MARCR Redis Datastore are provided through lightweight HTML5 responsive apps built with Django, Bootstrap, and KnockoutJS that also integrate with both open-source and commercial discovery products.
Redis offers many advantages for a shared MARCR bibliographic datastore, such as speed, scalability, and ease of deployment. In particular, it can support multiple cloud models, which benefits institutions of various sizes and budgets. We will demonstrate an MVP (Minimum Viable Product) iteration of this MARCR Datastore, loading transformed MARC 21 records from Colorado College and the University of Denver into Redis, with coordination by the Colorado Alliance of Research Libraries.
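The "catalog a Work once, let each institution own its Instances" idea can be sketched without a running server. The example below is an assumption-heavy illustration, not the MARCR Redis Datastore itself: plain Python dicts and sets stand in for Redis hashes and sets, and the key names (`marcr:work:N`, `marcr:instance:N`), the matching rule (normalized title plus creator), and the sample records are all hypothetical.

```python
class SharedCatalog:
    """In-memory stand-in for a key-value store holding Works and Instances."""

    def __init__(self):
        self.hashes = {}       # key -> dict of fields (like a Redis hash)
        self.sets = {}         # key -> set of member keys (like a Redis set)
        self.work_lookup = {}  # (title, creator) -> Work key, for de-duplication
        self.counters = {"work": 0, "instance": 0}

    def _new_key(self, kind):
        self.counters[kind] += 1
        return f"marcr:{kind}:{self.counters[kind]}"  # hypothetical key scheme

    def add_work(self, title, creator):
        """Return the existing Work key if already cataloged, else create one."""
        lookup = (title.strip().lower(), creator.strip().lower())
        if lookup in self.work_lookup:
            return self.work_lookup[lookup]
        key = self._new_key("work")
        self.hashes[key] = {"title": title, "creator": creator}
        self.sets[key + ":instances"] = set()
        self.work_lookup[lookup] = key
        return key

    def add_instance(self, work_key, institution, call_number):
        """Each institution records its own Instance of a shared Work."""
        key = self._new_key("instance")
        self.hashes[key] = {
            "work": work_key,
            "institution": institution,
            "call_number": call_number,
        }
        self.sets[work_key + ":instances"].add(key)
        return key

    def holdings(self, work_key):
        """List the institutions holding an Instance of the Work."""
        members = self.sets[work_key + ":instances"]
        return sorted(self.hashes[k]["institution"] for k in members)


catalog = SharedCatalog()
work_a = catalog.add_work("Example Title", "Example, Author")
work_b = catalog.add_work("example title ", "example, author")  # same Work
catalog.add_instance(work_a, "Colorado College", "CC-0001")
catalog.add_instance(work_b, "University of Denver", "DU-0001")
print(work_a == work_b, catalog.holdings(work_a))
```

With a real Redis server, the `hashes` and `sets` dictionaries would map naturally onto Redis hash and set types; the matching rule used for de-duplication here is deliberately naive.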
Jay Luker, IT Specialist, Smithsonian Astrophysics Data System, email@example.com
When it comes to author names, the disconnect between our metadata and what a user might enter into a search box presents challenges when trying to maximize both precision and recall. When indexing a paper written by "Wäterwheels, A", a goal should be to preserve as much of the original information as possible. However, users searching by author name may frequently omit the diaeresis and search for simply "Waterwheels". The reverse of this scenario is also possible, i.e., your decrepit metadata contains only the ASCII "Supybot, Zoia", whereas the user enters "Supybot, Zóia". If recall is your highest priority, the simple solution is to always downgrade to ASCII when indexing and querying. However, this strategy sacrifices precision, as you will be unable to provide an "exact" search, necessary in cases where "Hacker, J" and "Häcker, J" really are two distinct authors.
This talk will describe the strategy ADS has devised for addressing common and edge-case problems in author name indexing and searching. I will cover not only our approach to the transliteration issue described above, but also how we deal with author initials vs. full first and/or middle names, authors who have published under different forms of their name, and authors who change their names (wha? people get married?!). Our implementation relies on Solr/Lucene, but my goal is an 80/20 mix of high- vs. low-level details to keep things both useful and stackgnostic.
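The precision/recall trade-off described above can be illustrated with a small sketch that keeps two indexes side by side: one with the original name and one folded to ASCII. This is not the ADS Solr/Lucene implementation; the `AuthorIndex` class and its methods are hypothetical, and the folding only strips combining marks (it would not transliterate characters such as "ø" or "ß").

```python
import unicodedata
from collections import defaultdict


def fold_to_ascii(name):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def exact_key(name):
    """Normalize to NFC so visually identical names compare equal."""
    return unicodedata.normalize("NFC", name).lower()


class AuthorIndex:
    """Toy author index with an exact form and an ASCII-folded form."""

    def __init__(self):
        self.exact = defaultdict(set)
        self.folded = defaultdict(set)

    def add(self, paper_id, author):
        self.exact[exact_key(author)].add(paper_id)
        self.folded[fold_to_ascii(author).lower()].add(paper_id)

    def search(self, author, exact=False):
        if exact:
            # Precision: match only the name exactly as recorded.
            return set(self.exact.get(exact_key(author), set()))
        # Recall: fold the query the same way the index was folded.
        return set(self.folded.get(fold_to_ascii(author).lower(), set()))


index = AuthorIndex()
index.add("paper-1", "Hacker, J")
index.add("paper-2", "Häcker, J")

print(sorted(index.search("Hacker, J")))              # both papers
print(sorted(index.search("Häcker, J")))              # both papers
print(sorted(index.search("Häcker, J", exact=True)))  # only paper-2
```

Folding at both index and query time gives recall in either direction (accented query against ASCII metadata, or the reverse), while the separate exact index keeps the option of distinguishing "Hacker, J" from "Häcker, J".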
Tom Burton-West, University of Michigan Library, firstname.lastname@example.org
HathiTrust Full-text search indexes the full-text and metadata for over 10 million books. There are many challenges in tuning relevance ranking for a collection of this size. This talk will discuss some of the underlying issues, some of our experiments to improve relevance ranking, and our ongoing efforts to develop a principled framework for testing changes to relevance ranking.
Some of the topics covered will include:
Length normalization for indexing the full-text of book-length documents
Indexing granularity for books
Testing new features in Solr 4.0:
New ranking formulas that should work better with book-length documents: BM25 and DFR.
Grouping/Field Collapsing. Can we index 3 billion pages and then use Solr's field collapsing feature to rank books according to the most relevant page(s)?
Finite State Automata/Block Trees for storing the in-memory index to the index. Will this let us support wildcards/truncation despite over 2 billion unique terms per index?
Relevance testing methodologies: Query log analysis, Click models, Interleaving, A/B testing, and Test collection based evaluation.
Testing of a new high-performance storage system to be installed in early 2013. We will report on any tests we are able to run prior to conference time.
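To show why length normalization matters for book-length documents, here is a minimal sketch of a per-term BM25 score, one of the ranking formulas named in the list above. It follows the commonly cited textbook form of BM25; Solr's implementation may differ in details, and the collection statistics used below are made-up illustrative numbers, not HathiTrust figures. The parameters `k1` and `b` use common default values.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document using textbook BM25.

    tf          -- occurrences of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    k1          -- term-frequency saturation parameter
    b           -- strength of length normalization (0 disables it)
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Illustrative statistics only (assumptions, not real collection data).
stats = {"avg_doc_len": 100_000, "n_docs": 1_000_000, "doc_freq": 5_000}

short_book = bm25_term_score(tf=10, doc_len=20_000, **stats)
long_book = bm25_term_score(tf=10, doc_len=500_000, **stats)
print(short_book > long_book)  # the same tf counts for less in a longer book

# With b=0, document length has no effect on the score.
flat_short = bm25_term_score(tf=10, doc_len=20_000, b=0.0, **stats)
flat_long = bm25_term_score(tf=10, doc_len=500_000, b=0.0, **stats)
print(flat_short == flat_long)
```

The `b` parameter is the knob that controls how strongly long documents are penalized, which is why it is a natural candidate for tuning when documents are whole books rather than web pages or abstracts.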