Practical Relevance Ranking for 10 million books.

Tom Burton-West, University of Michigan Library, tburtonw@umich.edu

HathiTrust Full-text search indexes the full text and metadata of more than 10 million books. Tuning relevance ranking for a collection of this size poses many challenges. This talk will discuss some of the underlying issues, some of our experiments to improve relevance ranking, and our ongoing efforts to develop a principled framework for testing changes to relevance ranking.

Some of the topics covered will include:

Length normalization for indexing the full-text of book-length documents
Indexing granularity for books
Testing new features in Solr 4.0:
- New ranking formulas that should work better with book-length documents: BM25 and DFR.
- Grouping/Field Collapsing. Can we index 3 billion pages and then use Solr's field collapsing feature to rank books according to the most relevant page(s)?
- Finite State Automata/Block Trees for storing the in-memory index to the index. Will this let us support wildcards/truncation despite over 2 billion unique terms per index?
Relevance testing methodologies: query log analysis, click models, interleaving, A/B testing, and test-collection-based evaluation.
Testing of a new high-performance storage system to be installed in early 2013. We will report on any tests we are able to run prior to conference time.

http://www.hathitrust.org/documents/HathiTrust-Code4LIB-201302.pptx
