HathiTrust Large Scale Search: Scalability meets Usability

  • Tom Burton-West, DLPS, University of Michigan Library, tburtonw AT umich edu
  • Slides

code4lib 2012, Tuesday 7 February 2012, 13:00-13:20

HathiTrust Large Scale Search provides full-text search over nearly 10 million books, using Solr as the back-end. Our index is around 5-6 TB in size, and each shard contains over 3 billion unique terms because the content spans more than 400 languages and includes dirty OCR.

Searching the full text of 10 million books often produces very large result sets. By conference time, a number of features designed to help users narrow down large result sets and do exploratory searching will be either in production or in preparation for release. In addition to the traditional search trade-off of precision versus recall, there are often trade-offs between implementing desirable user features and keeping response time reasonable.

We will discuss various scalability and usability issues including:

  • Trade-offs between desirable user features and keeping response time reasonable and scalable
  • Our solution to providing the ability to search within the 10 million books and also search within each book
  • Migrating the personal collection builder application from a separate Solr instance to an app that uses the same back-end as full-text search
  • Design of a scalable multilingual spelling suggester
  • Providing advanced search features combining MARC metadata with OCR
  • The dismax mm and tie parameters
  • Weighting issues and tuning relevance ranking
  • Displaying only the most "relevant" facets
  • Dirty OCR issues
  • CJK tokenizing and other multilingual issues
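
As background for the dismax item above, here is a minimal sketch of what the mm and tie parameters control. It is not taken from the HathiTrust configuration: the field names `ocr` and `title`, the boosts, and the parameter values are hypothetical and chosen only for illustration. The scoring function mirrors the disjunction-max rule, in which a term's score across several fields is the best field score plus tie times the sum of the other field scores.

```python
from urllib.parse import urlencode


def dismax_score(field_scores, tie):
    """Combine one term's per-field scores the disjunction-max way:
    best field score + tie * (sum of the remaining field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def dismax_params(user_query, qf, mm, tie):
    """Build a URL query string for a dismax request.

    qf  : fields to search, with optional boosts
    mm  : "minimum should match" spec for optional clauses
    tie : tie-breaker weight between 0.0 and 1.0
    """
    return urlencode({
        "defType": "dismax",
        "q": user_query,
        "qf": qf,
        "mm": mm,
        "tie": tie,
    })


# Hypothetical field names and values, for illustration only.
params = dismax_params(
    "history of printing",
    qf="ocr^1 title^10",
    mm="2<75%",  # up to 2 clauses: all required; more than 2: 75% required
    tie=0.1,
)

# tie=0.0 keeps only the best-matching field; tie=1.0 sums all fields.
scores = [2.0, 1.0, 0.5]
only_best = dismax_score(scores, 0.0)
all_fields = dismax_score(scores, 1.0)
```

With tie near 0, a book is ranked by its single best-matching field; raising tie rewards books that match in several fields (for example both the MARC title and the OCR text). A stricter mm narrows large result sets at the cost of recall.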