HathiTrust Large Scale Search: Scalability meets Usability - Tom Burton-West

Tom Burton-West, DLPS, University of Michigan Library, tburtonw AT umich edu
Slides

code4lib 2012, Tuesday 7 February 2012, 13:00-13:20

HathiTrust Large Scale Search provides full-text search over nearly 10 million books, using Solr as the back end. Our index is around 5-6 TB in size, and each shard contains over 3 billion unique terms because the content spans more than 400 languages and includes dirty OCR.

Searching the full text of 10 million books often produces very large result sets. By conference time, a number of features designed to help users narrow down large result sets and do exploratory searching will be either in production or in preparation for release. Implementing desirable user features often has to be traded off against keeping response time reasonable, in addition to the traditional search trade-off of precision versus recall.

We will discuss various scalability and usability issues including:

Trade-offs between desirable user features and keeping response time reasonable and scalable
Our solution for searching across all 10 million books and also within each individual book
Migrating the personal collection builder application from a separate Solr instance to one that uses the same back end as full-text search
Design of a scalable multilingual spelling suggester
Providing advanced search features combining MARC metadata with OCR
The dismax mm and tie parameters
Weighting issues and tuning relevance ranking
Displaying only the most "relevant" facets
Dirty OCR issues
CJK tokenizing and other multilingual issues
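
As background for the dismax mm and tie item above, here is a minimal Python sketch of what the two parameters control. It is an illustration under stated assumptions, not HathiTrust's configuration and not Solr's code: the function names and the example field scores are hypothetical, the scoring ignores Lucene's other normalization factors, and only the simple mm forms (an integer or a percentage, positive or negative) are handled, not conditional specs such as "2<75%".

```python
def dismax_term_score(field_scores, tie=0.0):
    """Combine one query term's per-field scores, dismax style.

    The best-scoring field counts in full; every other field is
    scaled by `tie`. tie=0.0 is a pure max, tie=1.0 is a plain sum.
    """
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def min_should_match(num_clauses, mm):
    """Resolve a simple mm spec to the number of clauses that must match.

    Simplified assumption: handles "2", "-1", "75%" and "-25%" only.
    Percentages are rounded down; negative values state how many
    clauses are allowed to be missing.
    """
    spec = mm.strip()
    if spec.endswith("%"):
        pct = int(spec[:-1])
        if pct >= 0:
            required = (num_clauses * pct) // 100
        else:
            required = num_clauses - (num_clauses * (-pct)) // 100
    else:
        n = int(spec)
        required = n if n >= 0 else num_clauses + n
    # Clamp so the result is never negative or more than the clause count.
    return max(0, min(num_clauses, required))


# Hypothetical scores for one term: strong title match, weak OCR match.
scores = [1.0, 0.2]
pure_max = dismax_term_score(scores, tie=0.0)    # title match only
blended = dismax_term_score(scores, tie=0.1)     # OCR match nudges the score
plain_sum = dismax_term_score(scores, tie=1.0)   # both fields count fully

# A four-word query with mm="75%" requires three of the words to match.
needed = min_should_match(4, "75%")
```

A low tie lets the strongest field (for example a MARC title) dominate while still letting weaker matches (for example OCR text) break ties; a higher mm trades recall for precision by requiring more of the query words to match.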