Practical Relevance Ranking for 10 million books
- Tom Burton-West, University of Michigan Library, firstname.lastname@example.org
HathiTrust Full-text search indexes the full text and metadata of over 10 million books. Tuning relevance ranking for a collection of this size poses many challenges. This talk will discuss some of the underlying issues, some of our experiments to improve relevance ranking, and our ongoing efforts to develop a principled framework for testing changes to relevance ranking.
Topics covered will include:
- Length normalization for indexing the full text of book-length documents
- Indexing granularity for books
- Testing new features in Solr 4.0:
  - New ranking formulas that should work better with book-length documents: BM25 and DFR.
  - Grouping/Field Collapsing. Can we index 3 billion pages and then use Solr's field collapsing feature to rank books according to the most relevant page(s)?
  - Finite State Automata/Block Trees for storing the in-memory index to the index. Will this let us support wildcards/truncation despite over 2 billion unique terms per index?
- Relevance testing methodologies: query log analysis, click models, interleaving, A/B testing, and test-collection-based evaluation.
- Testing of a new high-performance storage system to be installed in early 2013. We will report on any tests we are able to run prior to conference time.
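
To illustrate why length normalization matters for book-length documents, here is a minimal sketch of a single-term BM25 score, assuming the common formulation with parameters k1 and b (the defaults 1.2 and 0.75 are conventional values, not settings taken from HathiTrust). The function name and every number in the usage lines are hypothetical and chosen only for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with BM25.

    tf          -- occurrences of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    k1, b       -- assumed conventional defaults; b controls length normalization
    """
    # Inverse document frequency: rarer terms contribute more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized,
    # with b=0 disabling the penalty and b=1 applying it fully.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term-frequency saturation: the score is bounded above by idf * (k1 + 1).
    return idf * tf * (k1 + 1) / (tf + norm)

# Hypothetical comparison: same term frequency in a short and a very long document.
short_doc = bm25_term_score(tf=5, doc_len=1_000, avg_doc_len=100_000,
                            n_docs=10_000_000, doc_freq=1_000)
long_doc = bm25_term_score(tf=5, doc_len=500_000, avg_doc_len=100_000,
                           n_docs=10_000_000, doc_freq=1_000)
print(short_doc > long_doc)
```

With the same term frequency, the shorter document scores higher; how strongly length should count for whole books versus individual pages is the kind of question the b parameter exposes.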
Download slides: http://www.hathitrust.org/documents/HathiTrust-Code4LIB-201302.pptx