code4lib 2013

All Teh Metadatas Re-Revisited

  • Esme Cowles, UC San Diego Library, escowles AT ucsd DOT edu
  • Matt Critchlow, UC San Diego Library, mcritchlow AT ucsd DOT edu
  • Bradley Westbrook, UC San Diego Library, bdwestbrook AT ucsd DOT edu

Slides, Video

The Care and Feeding of a Crowd

  • Shawn Averkamp, University of Iowa, shawn-averkamp at uiowa.edu
  • Matthew Butler, University of Iowa, matthew-butler at uiowa.edu

After a low-tech experiment in crowdsourced transcription grew into a surprisingly successful library initiative and demanded new commitments to user engagement, we found ourselves looking for a more efficient and user-friendly solution. We customized CHNM’s Scripto community transcription tool and various other Omeka plugins to develop a new site: DIYHistory.

We often receive questions about the technical side of both platforms, usually (to our dismay) from libraries who already assume they don't have the IT resources to pursue their own crowdsourcing initiatives. But we found that the software makes up only half of the recipe for success. Do you have compelling content? A long-term commitment to engaging with your users? Are you ready to promote your project far and wide? If so, then deploying a crowdsourcing initiative may be easier than you think.

Citation search in SOLR and second-order operators

  • Roman Chyla, Astrophysics Data System, roman.chyla AT (cfa.harvard.edu|gmail.com)

Citation search is basically about connections (Is a paper read by a friend of mine more important than others? Can I get a paper read by somebody who cites many papers, or who is cited by many papers?), but the implementation of citation search is surprisingly useful in many other areas.

I will show the 'guts' of the new citation search for astrophysics; it is generic and can be applied recursively to any Lucene query. Some people would call it a second-order operation because it works with the results of the previous (search) function. The talk will cover technical details of the special query class, its collectors, how to add a new search operator, and how to influence relevance scores. Then you can type with me: friends_of(friends_of(cited_for(keyword:"black holes") AND keyword:"red dwarf"))
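The idea of an operator that consumes the results of another query can be illustrated with a small in-memory sketch. This is not the talk's Lucene query class or its collectors: the corpus is made up, and the operator names `references_of` and `citing` are hypothetical stand-ins, since the abstract does not define what `friends_of` or `cited_for` compute.

```python
from typing import Dict, Set

# Made-up toy corpus: document id -> keywords and the ids it cites.
DOCS: Dict[str, Dict[str, Set[str]]] = {
    "A": {"keywords": {"black holes"}, "cites": {"B", "C"}},
    "B": {"keywords": {"red dwarf"}, "cites": {"C"}},
    "C": {"keywords": {"black holes", "red dwarf"}, "cites": set()},
    "D": {"keywords": {"galaxies"}, "cites": {"A"}},
}


def keyword(term: str) -> Set[str]:
    """First-order query: ids of documents tagged with `term`."""
    return {doc_id for doc_id, doc in DOCS.items() if term in doc["keywords"]}


def references_of(results: Set[str]) -> Set[str]:
    """Hypothetical second-order operator: documents cited by any hit."""
    found: Set[str] = set()
    for doc_id in results:
        found |= DOCS[doc_id]["cites"]
    return found


def citing(results: Set[str]) -> Set[str]:
    """Hypothetical second-order operator: documents that cite any hit."""
    return {doc_id for doc_id, doc in DOCS.items() if doc["cites"] & results}


# Operators take a result set and return a result set, so they nest freely.
nested = citing(references_of(keyword("black holes")) & keyword("red dwarf"))
```

Because every operator maps a result set to a result set, the output of one can feed the next, which is the property that makes recursive application possible. Relevance scoring is left out of this sketch entirely.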

n Characters in Search of an Author

  • Jay Luker, IT Specialist, Smithsonian Astrophysics Data System, jluker@cfa.harvard.edu

When it comes to author names, the disconnect between our metadata and what a user might enter into a search box presents challenges when trying to maximize both precision and recall [0]. When indexing a paper written by "Wäterwheels, A", a goal should be to preserve as much of the original information as possible. However, users searching by author name may frequently omit the diaeresis and search simply for "Waterwheels". The reverse of this scenario is also possible, i.e., your decrepit metadata contains only the ASCII "Supybot, Zoia", whereas the user enters "Supybot, Zóia". If recall is your highest priority, the simple solution is to always downgrade to ASCII when indexing and querying. However, this strategy sacrifices precision, as you will be unable to provide an "exact" search, which is necessary in cases where "Hacker, J" and "Häcker, J" really are two distinct authors.
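One common way to keep both options open is to index each name twice, once as written and once folded to ASCII. The sketch below illustrates that trade-off in plain Python; it is an assumption about how the problem could be approached, not the approach presented in the talk. The `AuthorIndex` class and the paper ids are hypothetical, and the folding only strips combining marks, so letters such as "ø" or "ß" that have no decomposition are left unchanged.

```python
import unicodedata
from collections import defaultdict
from typing import DefaultDict, Set


def fold(name: str) -> str:
    """Strip combining marks (e.g. the diaeresis) and ignore case."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


class AuthorIndex:
    """Hypothetical index keeping an exact key and a folded key per name."""

    def __init__(self) -> None:
        self.exact: DefaultDict[str, Set[str]] = defaultdict(set)
        self.folded: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, paper_id: str) -> None:
        # NFC keeps composed and decomposed spellings of the same name equal.
        self.exact[unicodedata.normalize("NFC", name)].add(paper_id)
        self.folded[fold(name)].add(paper_id)

    def search(self, name: str, exact: bool = False) -> Set[str]:
        if exact:
            return set(self.exact.get(unicodedata.normalize("NFC", name), set()))
        return set(self.folded.get(fold(name), set()))


index = AuthorIndex()
index.add("Hacker, J", "paper-1")
index.add("Häcker, J", "paper-2")
index.add("Supybot, Zoia", "paper-3")
```

A default search goes through the folded keys and favors recall, so "Supybot, Zóia" still finds the ASCII-only record. Passing `exact=True` uses the preserved form and separates "Hacker, J" from "Häcker, J".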

Practical Relevance Ranking for 10 million books

  • Tom Burton-West, University of Michigan Library, tburtonw@umich.edu

HathiTrust Full-text search indexes the full-text and metadata for over 10 million books. There are many challenges in tuning relevance ranking for a collection of this size. This talk will discuss some of the underlying issues, some of our experiments to improve relevance ranking, and our ongoing efforts to develop a principled framework for testing changes to relevance ranking.

Some of the topics covered will include:

  • Length normalization for indexing the full-text of book-length documents
  • Indexing granularity for books
  • Testing new features in Solr 4.0:
    • New ranking formulas that should work better with book-length documents: BM25 and DFR.
    • Grouping/Field Collapsing. Can we index 3 billion pages and then use Solr's field collapsing feature to rank books according to the most relevant page(s)?
    • Finite State Automata/Block Trees for storing the in-memory index to the index. Will this let us support wildcards/truncation despite over 2 billion unique terms per index?
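To make the length-normalization topic above concrete, here is a minimal sketch of a single-term BM25 score, using the idf variant found in Lucene's BM25Similarity and its default parameters k1 = 1.2 and b = 0.75. The function name and the example numbers are illustrative assumptions, not figures from HathiTrust.

```python
import math


def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one query term against one document with BM25."""
    # Rarer terms (small doc_freq) get a larger weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly long documents are penalized (b=0 disables it).
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


# Same term frequency in an average-length document and one 100x longer.
# All numbers here are made up for illustration.
average_doc = bm25_term_score(tf=5, doc_len=1_000, avg_doc_len=1_000,
                              n_docs=10_000, doc_freq=100)
long_doc = bm25_term_score(tf=5, doc_len=100_000, avg_doc_len=1_000,
                           n_docs=10_000, doc_freq=100)
```

With the default b, the much longer document scores lower for the same term frequency, and setting b to 0 removes the length effect. Choosing how much of that penalty is appropriate is one of the tuning questions that book-length documents raise.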
