
Feed aggregator

DuraSpace News: Telling VIVO Stories at Duke University with Julia Trimmer

planet code4lib - Wed, 2015-09-02 00:00

“Telling VIVO Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing details about VIVO implementations for the community and beyond. The following interview includes personal observations that may not represent the opinions and views of Duke University or the VIVO Project. Carol Minton Morris from DuraSpace interviewed Julia Trimmer from Duke University to learn about Scholars@Duke.

SearchHub: Better Search with Fusion Signals

planet code4lib - Tue, 2015-09-01 21:45

Signals in Lucidworks Fusion leverage information about external activity, e.g., information collected from logfiles and transaction databases, to improve the quality of search results. This post follows on my previous post, Basics of Storing Signals in Solr with Fusion for Data Engineers, which showed how to index and aggregate signal data. In this post, I show how to write and debug query pipelines using this aggregated signal information.

User clicks provide a link between what people ask for and what they choose to view, given a set of search results, usually with product images. In the aggregate, if users have winnowed the search results for a given kind of thing down to a set of products that are exactly that kind of thing (for example, if the logfile entries link queries for “Netgear”, “router”, or “netgear router” to clicks on products that really are routers), then this information can be used to improve new searches over the product catalog.

The Story So Far

To show how signals can be used to improve search in an e-commerce application, I created a set of Fusion collections:

  • A collection called “bb_catalog”, which contains Best Buy product data, a dataset comprised of over 1.2M items, mainly consumer electronics such as household appliances, TVs, computers, and entertainment media such as games, music, and movies. This is the primary collection.
  • An auxiliary collection called “bb_catalog_signals”, created from a synthetic dataset over Best Buy query logs from 2011. This is the raw signals data, meaning that each logfile entry is stored as an individual document.
  • An auxiliary collection called “bb_catalog_signals_aggr” derived from the data in “bb_catalog_signals” by aggregating all raw signal records based on the combination of search query, field “query_s”, item clicked on, field “doc_id_s”, and search categories, field “filters_ss”.

All documents in collection “bb_catalog” have a unique product ID stored in field “id”. All items belong to one or more categories, which are stored in the field “categories_ss”.

The following screenshot shows the Fusion UI search panel over collection “bb_catalog”, after using the Search UI Configuration tool to limit the document fields displayed. The gear icon next to the search box toggles this control open and closed. The “Documents” settings are set so that the primary field displayed is “name_t”, the secondary field is “id”, and additional fields are “name_t”, “id”, and “category_ss”. The document in the yellow rectangle is a Netgear router with product id “1208844”.

For collection “bb_catalog_signals”, the search query string is stored in field “query_s”, the timestamp is stored in field “tz_timestamp_txt”, the id of the document clicked on is stored in field “doc_id_s”, and the set of category filters are stored in fields “filters_ss” as well as “filters_orig_ss”.

The following screenshot shows the results of a search for raw signals where the id of the product clicked on was “1208844”.

The collection “bb_catalog_signals_aggr” contains aggregated signals. In addition to the fields “doc_id_s”, “query_s”, and “filters_ss”, aggregated click signals contain the following fields:

  • “count_i” – the number of raw signals found for this query, doc, filter combo.
  • “weight_d” – a real number used as a multiplier to boost the score of these documents.
  • “tz_timestamp_txt” – all timestamps of raw signals, stored as a list of strings.

The following screenshot shows aggregated signals for searches for “netgear”. There were 3 raw signals where the search query “netgear” and some set of category choices resulted in a click on the item with id “1208844”:
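The screenshot is not reproduced here, but such an aggregated record might look roughly like the following sketch. The field names and the count come from the description above; the id, weight, filter, and timestamp values are illustrative placeholders, not rows from the actual dataset.

    # Illustrative only: an aggregated click signal for the query "netgear" and
    # product id "1208844". Field names follow the post; the id, weight, filter,
    # and timestamp values are placeholders rather than real data.
    aggregated_signal = {
        "id": "netgear/1208844/aggr",        # hypothetical record id
        "query_s": "netgear",                # the search query
        "doc_id_s": "1208844",               # the product that was clicked
        "filters_ss": ["abcat0503000"],      # category filters (placeholder)
        "count_i": 3,                        # three raw signals were aggregated
        "weight_d": 0.024,                   # placeholder boost weight
        "tz_timestamp_txt": [                # raw-signal timestamps (placeholders)
            "2011-09-01T09:12:45Z",
            "2011-10-03T17:03:10Z",
            "2011-10-11T20:44:02Z",
        ],
    }
    print(aggregated_signal["count_i"])      # -> 3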

Using Click Signals in a Fusion Query Pipeline

Fusion’s Query Pipelines take a set of search terms as input and process them into a Solr query request. The Fusion UI Search panel has a control which allows you to choose the processing pipeline. In the following screenshot of the collection “bb_catalog”, the query pipeline control is just below the search input box. Here the pipeline chosen is “bb_catalog-default” (circled in yellow):

The pre-configured default query pipelines consist of 3 stages:

  • A Search Fields query stage, used to define common Solr query parameters. The initial configuration specifies that the 10 best-scoring documents should be returned.
  • A Facet query stage which defines the facets to be returned as part of the Solr search results. No facet field names are specified in the initial defaults.
  • A Solr query stage which transforms a query request object into a Solr query and submits the request to Solr. The default configuration specifies the HTTP method as a POST request.

In order to get text-based search over the collection “bb_catalog” to work as expected, the Search Fields query stage must be configured to specify the set of fields that contain relevant text. For the majority of the 1.2M products in the product catalog, the item name, found in field “name_t”, is the only field amenable to free-text search. The following screenshot shows how to add this field to the Search Fields stage by editing the query pipeline via the Fusion 2 UI:

The search panel on the right displays the results of a search for “ipad”. There were 1,359 hits for this query, which far exceeds the number of items that are actually Apple iPads. The best-scoring items contain “iPad” in the title, sometimes twice, but these are all iPad accessories, not the device itself.

Recommendation Boosting query stage

A Recommendation Boosting stage uses aggregated signals to selectively boost items in the set of search results. The following screenshot shows the results of the same search after adding a Recommendations Boosting stage to the query pipeline:

The edit pipeline panel on the left shows the updated query pipeline “bb_catalog-default” after adding a “Recommendations Boosting” stage. All parameter settings for this stage have been left at their default values. In particular, the recommendation boosts are applied to field “id”. The search panel on the right shows the updated results for the search query “ipad”. Now the three most relevant items are Apple iPads. They are iPad 2 models because the click dataset used here is based on logfile data from 2011, and at that time the iPad 2 was the most recent iPad on the market. There were more clicks on the 16GB iPad than on the more expensive 32GB model, and more on the black model than on the white.

Peeking Under the Hood

Of course, under the hood, Fusion is leveraging the awesome power of Solr. To see how this works, I show both the Fusion query and the JSON of the Solr response. To display the Fusion query, I open the Search UI Configuration and, under the “General” settings, check the “Show Query URL” option. To see the Solr response in JSON format, I change the display control from “Results” to “JSON”.

The following screenshot shows the Fusion UI search display for “ipad”:

The query “ipad” entered via the Fusion UI search box is transformed into the following request sent to the Fusion REST-API:
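The request itself appears in the original post only as a screenshot. As a rough, hand-written equivalent, a call to the Query Pipelines API might look like the sketch below; the host, port, API path, credentials, and the wt parameter are assumptions based on Fusion 2.x defaults, not values taken from the post.

    # Hedged sketch of the query-pipeline request; the base URL and credentials
    # are assumed Fusion 2.x defaults, not taken from the original post.
    import requests

    FUSION_API = "http://localhost:8764/api/apollo"   # assumed Fusion API base URL
    PIPELINE = "bb_catalog-default"
    COLLECTION = "bb_catalog"

    resp = requests.get(
        "{}/query-pipelines/{}/collections/{}/select".format(FUSION_API, PIPELINE, COLLECTION),
        params={"q": "ipad", "debug": "true", "wt": "json"},
        auth=("admin", "password123"),                # placeholder credentials
    )
    resp.raise_for_status()
    data = resp.json()
    print(data["response"]["numFound"])               # number of matching documents
    print(data.get("debug", {}).get("parsedquery"))   # parsed Solr query (debug=true)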


This request to the Query Pipelines API sends a query through the query pipeline “bb_catalog-default” for the collection “bb_catalog” using the Solr “select” request handler, where the search query parameter “q” has value “ipad”. Because the parameter “debug” has value “true”, the Solr response contains debug information, outlined by the yellow rectangle. The “bb_catalog-default” query pipeline transforms the query “ipad” into the following Solr query:

"parsedquery": "(+DisjunctionMaxQuery((name_t:ipad)) id:1945531^4.0904393 id:2339322^1.5108471 id:1945595^1.0636971 id:1945674^0.4065684 id:2842056^0.3342921 id:2408224^0.4388061 id:2339386^0.39254773 id:2319133^0.32736558 id:9924603^0.1956079 id:1432551^0.18906432)/no_coord"

The outer part of this expression, “( … )/no_coord”, is a reporting detail, indicating Solr’s “coord scoring” feature wasn’t used.

The enclosed expression consists of:

  • The search: “+DisjunctionMaxQuery(name_t:ipad)”.
  • A set of selective boosts to be applied to the search results

The field name “name_t” is supplied by the set of search fields specified by the Search Fields query stage. (Note: if no search fields are specified, the default search field name “text” is used. Since the documents in collection “bb_catalog” don’t contain a field named “text”, this stage must be configured with the appropriate set of search fields.)

The Recommendations Boosting stage was configured with the default parameters:

  • Number of Recommendations: 10
  • Number of Signals: 100

There are 10 documents boosted, with ids (1945531, 2339322, 1945595, 1945674, 2842056, 2408224, 2339386, 2319133, 9924603, 1432551). This set of 10 documents represents documents which had at least 100 clicks where “ipad” occurred in the user search query. The boost factor is a number derived from the aggregated signals by the Recommendation Boosting stage. If those documents contain the term “name_t:ipad”, then they will be boosted. If those documents don’t contain the term, then they won’t be returned by the Solr query.
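One way to see what the stage is doing is to approximate it by hand: pull the top aggregated signals for “ipad” from the signals collection and fold them into the catalog query as optional boost clauses. The sketch below does exactly that against a plain Solr instance; the Solr URL, the sort on “weight_d”, and the query construction are assumptions made for illustration, not Fusion’s actual implementation.

    # Hand-rolled approximation of the Recommendation Boosting behaviour
    # described above, against a plain Solr instance (assumed at localhost:8983).
    import requests

    SOLR = "http://localhost:8983/solr"   # assumed Solr base URL

    # 1. Pull the top aggregated signals for the query "ipad".
    aggr_docs = requests.get(
        "{}/bb_catalog_signals_aggr/select".format(SOLR),
        params={"q": 'query_s:"ipad"', "sort": "weight_d desc", "rows": 10, "wt": "json"},
    ).json()["response"]["docs"]

    # 2. Turn each (doc_id_s, weight_d) pair into an optional boost clause.
    boosts = " ".join("id:{}^{}".format(d["doc_id_s"], d["weight_d"]) for d in aggr_docs)

    # 3. Require a match on name_t:ipad (the "+") and add the boost clauses as
    #    optional terms, mirroring the parsed query shown earlier.
    results = requests.get(
        "{}/bb_catalog/select".format(SOLR),
        params={"q": "+name_t:ipad {}".format(boosts), "rows": 10, "wt": "json"},
    ).json()["response"]["docs"]

    for doc in results:
        print(doc["id"], doc.get("name_t"))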

To summarize: adding in the Recommendations Boosting stage results in a Solr query where selective boosts will be applied to 10 documents, based on clickstream information from an undifferentiated set of previous searches. The improvement in the quality of the search results is dramatic.

Even Better Search

Adding more processing to the query pipeline allows for user-specific and search-specific refinements. Like the Recommendations Boosting stage, these more complex query pipelines leverage Solr’s expressive query language, flexible scoring, and lightning-fast search and indexing. Fusion query pipelines plus aggregated signals give you the tools you need to rapidly improve the user search experience.

The post Better Search with Fusion Signals appeared first on Lucidworks.

FOSS4Lib Recent Releases: Koha - 3.20.3, 3.18.10, 3.16.14

planet code4lib - Tue, 2015-09-01 19:27
Package: Koha
Release Date: Monday, August 31, 2015

Last updated September 1, 2015. Created by David Nind on September 1, 2015.

Monthly maintenance releases for Koha.

See the release announcements for the details:

DPLA: New Exhibitions from the Public Library Partnerships Project

planet code4lib - Tue, 2015-09-01 15:44

We are pleased to announce the publication of 10 new exhibitions created by DPLA Hubs and public librarian participants in our Public Library Partnerships Project (PLPP), funded by the Bill and Melinda Gates Foundation. Over the course of the last six months, curators from Digital Commonwealth, Digital Library of Georgia, Minnesota Digital Library, the Montana Memory Project, and Mountain West Digital Library researched and built these exhibitions to showcase content digitized through PLPP. Through this final phase of the project, public librarians had the opportunity to share their new content, learn exhibition curation skills, explore Omeka for future projects, and contribute to an open peer review process for exhibition drafts.

Congratulations to all of our curators and, in particular, our exhibition organizers: Greta Bahnemann, Jennifer Birnel, Hillary Brady, Anna Fahey-Flynn, Greer Martin, Mandy Mastrovita, Anna Neatrour, Carla Urban, Della Yeager, and Franky Abbott.

Thanks to the following reviewers who participated in our open peer review process: Dale Alger, Cody Allen, Greta Bahnemann, Alexandra Beswick, Jennifer Birnel, Hillary Brady, Wanda Brown, Anne Dalton, Carly Delsigne, Liz Dube, Ted Hathaway, Sarah Hawkins, Jenny Herring, Tammi Jalowiec, Stef Johnson, Greer Martin, Sheila McAlister, Lisa Mecklenberg-Jackson, Tina Monaco, Mary Moore, Anna Neatrour, Michele Poor, Amy Rudersdorf, Beth Safford, Angela Stanley, Kathy Turton, and Carla Urban.

For more information about the Public Library Partnerships Project, please contact PLPP project manager, Franky Abbott:

District Dispatch: Momentum, we have it!

planet code4lib - Tue, 2015-09-01 15:27

Source: Real Momentum

As you may have read here, school libraries are well represented in S. 1177, the Every Child Achieves Act.  In fact, we were more successful with this bill than we have been in recent history and this is largely due to your efforts in contacting Congress.

Currently, the House Committee on Education and the Workforce (H.R. 5, the Student Success Act) and the Senate Committee on Health, Education, Labor and Pensions are preparing to go to “conference” in an attempt to work out differences between the two versions of the legislation and reach agreement on reauthorization of ESEA. ALA is encouraged that provisions included under S. 1177 would support effective school library programs. In particular, ALA is pleased that effective school library program provisions were adopted unanimously during HELP Committee consideration of an amendment offered by Senator Whitehouse (D-RI) and on the Senate floor with an amendment offered by Senators Reed (D-RI) and Cochran (R-MS).

ALA is asking (with your help!) that any conference agreement to reauthorize ESEA maintain the following provisions that were overwhelmingly adopted by the HELP Committee and the full Senate under S. 1177, the Every Child Achieves Act:

  1. Title V, Part H – Literacy and Arts Education – Authorizes activities to promote literacy programs that support the development of literacy skills in low-income communities (similar to the Innovative Approaches to Literacy program that has been funded through appropriations) as well as activities to promote arts education for disadvantaged students.
  2. Title I – Improving Basic Programs Operated by State and Local Educational Agencies – Under Title I of ESEA, State Educational Agencies (SEAs) and local educational agencies (LEAs) must develop plans on how they will implement activities funded under the Act.
  3. Title V, Part G – Innovative Technology Expands Children’s Horizons (I-TECH) – Authorizes activities to ensure all students have access to personalized, rigorous learning experiences that are supported through technology and to ensure that educators have the knowledge and skills to use technology to personalize learning.

Now is the time to keep the momentum going! Contact your Senators and Representative to let them know that you support the effective school library provisions found in the Senate bill and they should too!

A complete list of school library provisions found in S.1177 can be found here.

The post Momentum, we have it! appeared first on District Dispatch.

Access Conference: Call for Convenors

planet code4lib - Mon, 2015-08-31 22:41

Do you want to be part of the magic of AccessYYZ? Well, aren’t you lucky? Turns out we’re  looking for some convenors!

Convening isn’t much work (not that we think you’re a slacker or anything)–all you have to do is introduce the name of the session, read the bio of the speaker(s), and thank any sponsors. Oh, and facilitate any question and answer segments. Which doesn’t actually mean you’re on the hook to come up with questions (that’d be rather unpleasant of us) so much as you’ll repeat questions from the crowd into the microphone. Yup, that’s it. We’ll give you a script and everything!

In return, you’ll get eternal gratitude from the AccessYYZ Organizing Committee. And also a high five! If you’re into that sort of thing. Even if you’re not, you’ll get to enjoy the bright lights and the glory that comes with standing up in front of some of libraryland’s most talented humans for 60 seconds. Sound good? We thought so.

You can dibs a session by filling out the Doodle poll.

Peter Sefton: Supporting ProseMirror inline HTML editor

planet code4lib - Mon, 2015-08-31 22:00

The world needs a good, sane in-browser editing component, one that edits document structure (headings, lists, quotes etc) rather than format (font, size etc). I’ve been thinking for a while that an editing component based around Markdown (or Commonmark) would be just the thing. Markdown/Commonmark is effectively a spec for the minimal sensible markup set for documents; it’s more than adequate for articles, theses, reports etc. And it can be extended with document semantics.

Anyway, there’s a crowdfunding campaign going on for an editor called ProseMirror that does just that, and promises collaborative editing as well. It’s beta quality but looks promising; I chipped in 50 Euros to help get it over the line to be released as open source.

The author says:

Who I am

This campaign is being run by Marijn Haverbeke, author of CodeMirror, a widely used in-browser code editor, Eloquent JavaScript, a freely available JavaScript book, and Tern, which is an editor-assistance engine for JavaScript coding that I also crowd-funded here. I have a long history of releasing and maintaining solid open software. My work on CodeMirror (which you might know as the editor in Chrome and Firefox’s dev tools) has given me lots of experience with writing a fast, solid, extendable editor. Many of the techniques that went into ProseMirror have already proven themselves in CodeMirror.

There’s a lot to like with this editor - it has a nice floating toolbar that pops up at the right of the paragraph, with a couple of not-quite-standard behaviours that just might catch on. It mostly works, but has some really obvious usability issues, like when I try to make a nested list it produces Commonmark like this:

* List item
* List item
* * List item

And it even renders the two bullets side by side in the HTML view. Even though that is apparently supported by Commonmark, for a prose editor it’s just wrong. Nobody means two bullets unless they’re up to no good, typographically speaking.

The editor should do the thing you almost certainly mean. Something like:

* List item
* List item
  * List item

But, if that stuff gets cleaned up then this will be perfect for producing Scholarly Markdown and Scholarly HTML. The $84 AUD means I’ll get priority on reporting a bug, assuming it reaches its funding goal.

SearchHub: Apache Solr for Multi-language Content Discovery Through Entity Driven Search

planet code4lib - Mon, 2015-08-31 19:21
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Alessandro Benedetti’s session on using entity driven search for multi-language content discovery and search.

This talk describes the implementation of a semantic search engine based on Solr. Meaningfully structuring content is critical; Natural Language Processing and Semantic Enrichment are becoming increasingly important to improve the quality of Solr search results. Our solution is based on three advanced features:
  1. Entity-oriented search – Searching not by keyword, but by entities (concepts in a certain domain)
  2. Knowledge graphs – Leveraging relationships amongst entities: Linked Data datasets (Freebase, DbPedia, Custom …)
  3. Search assistance – Autocomplete and Spellchecking are now common features, but using semantic data makes it possible to offer smarter features, driving the users to build queries in a natural way.
The approach includes unstructured data processing mechanisms integrated with Solr to automatically index semantic and multi-language information. Smart Autocomplete will complete users’ queries with entity names and properties from the domain knowledge graph. As the user types, the system will propose a set of named entities and/or a set of entity types across different languages. As the user accepts a suggestion, the system will dynamically adapt the following suggestions and return relevant documents. Semantic More Like This will find documents similar to a seed document, based on the underlying knowledge in the documents instead of tokens.

Alessandro Benedetti is a search expert and semantic technology enthusiast, working in the R&D division of Zaizi. His favorite work is R&D on information retrieval, NLP and machine learning, with a big emphasis on data structures, algorithms and probability theory. Alessandro earned his Masters in Computer Science with full grade in 2009, then spent 6 months with Universita’ degli Studi di Roma working on his master’s thesis around a new approach to improve semantic web search. Alessandro spent 3 years with Sourcesense as a Search and Open Source consultant and developer.

Multi-language Content Discovery Through Entity Driven Search: Presented by Alessandro Benedetti, Zaizi from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Apache Solr for Multi-language Content Discovery Through Entity Driven Search appeared first on Lucidworks.

FOSS4Lib Recent Releases: Pazpar2 - 1.12.2

planet code4lib - Mon, 2015-08-31 12:43

Last updated August 31, 2015. Created by Peter Murray on August 31, 2015.

Package: Pazpar2
Release Date: Monday, August 31, 2015

SearchHub: Mining Events for Recommendations

planet code4lib - Mon, 2015-08-31 10:33
Summary: The “EventMiner” feature in Lucidworks Fusion can be used to mine event logs to power recommendations. We describe how the system uses graph navigation to generate diverse and high-quality recommendations.

User Events

The log files that most web services generate are a rich source of data for learning about user behavior and modifying system behavior based on it. For example, most search engines will automatically log details on user queries and the resulting clicked documents (URLs). We can define a (user, query, click, time) record which records a unique “event” that occurred at a specific time in the system. Other examples of event data include e-commerce transactions (e.g. “add to cart”, “purchase”), call data records, financial transactions, etc. By analyzing a large volume of these events we can “surface” implicit structures in the data (e.g. relationships between users, queries and documents), and use this information to make recommendations, improve search result quality and power analytics for business owners. In this article we describe the steps we take to support this functionality.

1. Grouping Events into Sessions

Event logs can be considered a form of “time series” data, where the logged events are in temporal order. We can then make use of the observation that events close together in time will be more closely related than events further apart. To do this we need to group the event data into sessions. A session is a time window for all events generated by a given source (like a unique user ID). If two or more queries (e.g. “climate change” and “sea level rise”) frequently occur together in a search session then we may decide that those two queries are related. The same would apply for documents that are frequently clicked on together.

A “session reconstruction” operation identifies users’ sessions by processing raw event logs and grouping them based on user IDs, using the time intervals between events. If two events triggered by the same user occur too far apart in time, they will be treated as coming from two different sessions. For this to be possible we need some kind of unique ID in the raw event data that allows us to tell that two or more events are related because they were initiated by the same user within a given time period. However, from a privacy point of view, we do not need an ID which identifies an actual real person with all their associated personal information. All we need is an (opaque) unique ID which allows us to track an “actor” in the system.

2. Generating a Co-Occurrence Matrix from the Session Data

We are interested in entities that frequently co-occur, as we might then infer some kind of interdependence between those entities. For example, a click event can be described using a click(user, query, document) tuple, and we associate each of those entities with each other and with other similar events within a session. A key point here is that we generate the co-occurrence relations not just between the same field types, e.g. (query, query) pairs, but also “cross-field” relations, e.g. (query, document) and (document, user) pairs. This will give us an N x N co-occurrence matrix, where N = all unique instances of the field types that we want to calculate co-occurrence relations for. Figure 1 below shows a co-occurrence matrix that encodes how many times different characters co-occur (appear together in the text) in the novel “Les Miserables”.
Each colored cell represents two characters that appeared in the same chapter; darker cells indicate characters that co-occurred more frequently. The diagonal line going from the top left to the bottom right shows that each character co-occurs with itself. You can also see that the character named “Valjean”, the protagonist of the novel, appears with nearly every other character in the book.

Figure 1. “Les Miserables” Co-occurrence Matrix by Mike Bostock.

In Fusion we generate a similar type of matrix, where each of the items is one of the types specified when configuring the system. The value in each cell will then be the frequency of co-occurrence for any two given items e.g. a (query, document) pair, a (query, query) pair, a (user, query) pair etc.

For example, if the query “Les Mis” and a click on the web page for the musical appear together in the same user session then they will be treated as having co-occurred. The frequency of co-occurrence is then the number of times this has happened in the raw event logs being processed.
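Steps 1 and 2 can be illustrated with a small, self-contained sketch: group raw (user, query, click, time) events into sessions with an idle timeout, then count co-occurrences between every pair of entities that share a session. The 30-minute gap and the toy event data are assumptions for illustration; this is a stand-in for what Fusion does at scale, not its actual code.

    # Simplified sketch of session reconstruction (step 1) and co-occurrence
    # counting (step 2). The 30-minute session gap and the events are assumptions.
    from collections import defaultdict
    from itertools import combinations

    SESSION_GAP = 30 * 60   # seconds of inactivity that ends a session (assumed)

    # (user_id, query, clicked_doc, unix_timestamp) raw event records.
    events = [
        ("u1", "netgear", "1208844", 1000),
        ("u1", "netgear router", "1208844", 1600),
        ("u1", "ipad", "1945531", 90000),   # far apart in time -> new session
        ("u2", "ipad", "1945531", 1200),
    ]

    # Step 1: group each user's events into sessions using the idle timeout.
    sessions = []
    by_user = defaultdict(list)
    for user, query, doc, ts in sorted(events, key=lambda e: (e[0], e[3])):
        by_user[user].append((query, doc, ts))
    for user_events in by_user.values():
        current, last_ts = [], None
        for query, doc, ts in user_events:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append((query, doc))
            last_ts = ts
        sessions.append(current)

    # Step 2: count co-occurrences of all entities (queries and documents)
    # appearing in the same session, including cross-field pairs.
    cooccur = defaultdict(int)
    for session in sessions:
        entities = set()
        for query, doc in session:
            entities.add(("query", query))
            entities.add(("doc", doc))
        for a, b in combinations(sorted(entities), 2):
            cooccur[(a, b)] += 1

    for pair, count in cooccur.items():
        print(pair, count)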

3. Generating a Graph from the Matrix

The co-occurrence matrix from the previous step can also be treated as an “adjacency matrix”, which encodes whether two vertices (nodes) in a graph are “adjacent” to each other, i.e. have a link or “co-occur”. This matrix can then be used to generate a graph, as shown in Figure 2:

Figure 2. Generating a Graph from a Matrix.

Here the values in the matrix are the frequency of co-occurrence for those two vertices. We can see that in the graph representation these are stored as “weights” on the edge (link) between the nodes e.g. nodes V2 and V3 co-occurred 5 times together.
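To make the matrix-to-graph step concrete, here is a minimal sketch that turns co-occurrence counts (such as those produced by the step 2 sketch above) into a weighted adjacency-list graph. It is only an illustration of the idea, not Fusion’s internal representation.

    # Sketch: turn co-occurrence counts into a weighted adjacency-list graph,
    # where the edge weight is the co-occurrence frequency.
    from collections import defaultdict

    def build_graph(cooccur):
        """cooccur: {(entity_a, entity_b): count}, as produced in step 2."""
        graph = defaultdict(dict)
        for (a, b), count in cooccur.items():
            graph[a][b] = count   # weight on the edge a -> b
            graph[b][a] = count   # co-occurrence is symmetric
        return graph

    # Example matching Figure 2: nodes V2 and V3 co-occurred 5 times.
    graph = build_graph({("V2", "V3"): 5, ("V1", "V2"): 2})
    print(graph["V2"])            # {'V3': 5, 'V1': 2}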

We encode the graph structure in a collection in Solr using a simple JSON record for each node. Each record contains fields that list the IDs of other nodes that point “in” at this record, or which this node points “out” to.

Fusion provides an abstraction layer which hides the details of constructing queries to Solr to navigate the graph. Because we know the IDs of the records we are interested in we can generate a single boolean query where the individual IDs we are looking for are separated by OR operators e.g. (id:3677 OR id:9762 OR id:1459). This means we only make a single request to Solr to get the details we need.
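A minimal sketch of that storage and lookup pattern follows: each node is a small document listing the IDs of its neighbours, and the neighbours’ details are fetched with one Boolean OR query. The collection name, field names, and Solr URL are illustrative assumptions rather than Fusion’s actual schema; the neighbour IDs are the ones used in the example above.

    # Hedged sketch: a graph node stored as a Solr document, plus a single
    # OR query to fetch all of its neighbours. The collection/field names and
    # Solr URL are illustrative assumptions, not Fusion's real schema.
    import requests

    SOLR = "http://localhost:8983/solr"   # assumed Solr base URL

    # Example node record: a query node pointing "out" at clicked documents.
    node = {
        "id": "query/midnight club",
        "type_s": "query",
        "out_ids_ss": ["3677", "9762", "1459"],   # nodes this node points out to
        "in_ids_ss": [],                          # nodes pointing in at this node
    }

    # Fetch every neighbour's details in one request, e.g.
    # (id:3677 OR id:9762 OR id:1459), instead of one request per neighbour.
    neighbour_query = " OR ".join("id:{}".format(i) for i in node["out_ids_ss"])
    resp = requests.get(
        "{}/eventminer_graph/select".format(SOLR),   # assumed collection name
        params={"q": neighbour_query, "rows": len(node["out_ids_ss"]), "wt": "json"},
    )
    for doc in resp.json()["response"]["docs"]:
        print(doc["id"])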

In addition, the fact that we are only interested in the neighborhood graph around a start point means the system does not have to store the entire graph (which is potentially very large) in memory.

4. Powering Recommendations from the Graph

At query/recommendation time we can use the graph to make suggestions on which other items in that graph are most related to the input item, using the following approach (a sketch of the merge step appears after the list):

  1. Navigate the co-occurrence graph out from the seed item to harvest additional entities (documents, users, queries).
  2. Merge the lists of entities harvested from different nodes in the graph so that the more lists an entity appears in, the more weight it receives and the higher it rises in the final output list.
  3. Weights are based on the reciprocal rank of the overall rank of the entity. The overall rank is calculated as the sum of the rank of the result the entity came from and the rank of the entity within its own list.
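The merge-and-weight steps (2 and 3) can be sketched as follows. This is a minimal illustration of the reciprocal-rank scheme described above, with made-up candidate lists; it is not Fusion’s actual scoring code.

    # Illustrative reciprocal-rank merge of candidate lists harvested from
    # different neighbouring nodes in the graph. Not Fusion's exact scoring.
    from collections import defaultdict

    def merge_by_reciprocal_rank(candidate_lists):
        """candidate_lists: ranked lists of entity IDs, one list per graph node."""
        scores = defaultdict(float)
        for list_rank, candidates in enumerate(candidate_lists, start=1):
            for item_rank, entity in enumerate(candidates, start=1):
                # Overall rank = rank of the source list + rank within that list;
                # the added weight is its reciprocal, so entities appearing in
                # many lists (and near the top) rise in the final ordering.
                scores[entity] += 1.0 / (list_rank + item_rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Example: candidates harvested from three neighbours of a seed document.
    lists = [
        ["doc_42", "doc_17", "doc_99"],
        ["doc_17", "doc_42", "doc_7"],
        ["doc_7", "doc_17"],
    ]
    print(merge_by_reciprocal_rank(lists))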

The following image shows the graph surrounding the document “Midnight Club: Los Angeles” from a sample data set:

Figure 3. An Example Neighborhood Graph.

Here the relative size of the nodes shows how frequently they occurred in the raw event data, and the size of the arrows is a visual indicator of the weight or frequency of co-occurrence between two elements.

For example, we can see that the query “midnight club” (blue node on bottom RHS) most frequently resulted in a click on the “Midnight Club: Los Angeles Complete Edition Platinum Hits” product (as opposed to the original version above it). This is the type of information that would be useful to a business analyst trying to understand user behavior on a site.

Diversity in Recommendations

For a given item, we may only have a small number of items that co-occur with it (based on the co-occurrence matrix). By adding in the data from navigating the graph (which comes from the matrix), we increase the diversity of suggestions. Items that appear in multiple source lists then rise to the top. We believe this helps improve the quality of the recommendations & reduce bias. For example, in Figure 4 we show some sample recommendations for the query “Call of Duty”, where the recommendations are coming from a “popularity-based” recommender i.e. it gives a large weight to items with the most clicks. We can see that the suggestions are all from the “Call of Duty” video game franchise:

Figure 4. Recommendations from a “popularity-based” recommender system.

In contrast, in Figure 5 we show the recommendations from EventMiner for the same query:

Figure 5. Recommendations from navigating the graph.

Here we can see that the suggestions are now more diverse, with the first two being games from the same genre (“First Person Shooter” games) as the original query.

In the case of an e-commerce site, diversity in recommendations can be an important factor in suggesting items to a user that are related to their original query, but which they may not be aware of. This in turn can help increase the overall CTR (Click-Through Rate) and conversion rate on the site, which would have a direct positive impact on revenue and customer retention.

Evaluating Recommendation Quality

To evaluate the quality of the recommendations produced by this approach we used CrowdFlower to get user judgements on the relevance of the suggestions produced by EventMiner. Figure 6 shows an example of how a sample recommendation was presented to a human judge:

Figure 6. Example relevance judgment screen (CrowdFlower).

Here the original user query (“resident evil”) is shown, along with an example recommendation (another video game called “Dead Island”). We can see that the judge is asked to select one of four options, which is used to give the item a numeric relevance score:

  1. Off Topic
  2. Acceptable
  3. Good
  4. Excellent
In this example the user might judge the relevance for this suggestion as “good”, as the game being recommended is in the same genre (“survival horror”) as the original query. Note that the product title contains no terms in common with the query, i.e. the recommendations are based purely on the graph navigation and do not rely on an overlap between the query and the document being suggested.

In Table 1 we summarize the results of this evaluation:

Items   Judgements   Users   Avg. Relevance (1–4)
1000    2319         30      3.27

Here we can see that the average relevance score across all judgements was 3.27 i.e. “good” to “excellent”.

Conclusion

If you want an “out-of-the-box” recommender system that generates high-quality recommendations from your data please consider downloading and trying out Lucidworks Fusion.

The post Mining Events for Recommendations appeared first on Lucidworks.

Hydra Project: Michigan becomes the latest Hydra Partner

planet code4lib - Mon, 2015-08-31 08:49

We are delighted to announce that the University of Michigan has become the latest formal Hydra Partner.  Maurice York, their Associate University Librarian for Library Information Technology, writes:

“The strength, vibrancy and richness of the Hydra community is compelling to us.  We are motivated by partnership and collaboration with this community, more than simply use of the technology and tools. The interest in and commitment to the community is organization-wide; last fall we sent over twenty participants to Hydra Connect from across five technology and service divisions; our showing this year will be equally strong, our enthusiasm tempered only by the registration limits.”

Welcome Michigan!  We look forward to a long collaboration with you.

Eric Hellman: Update on the Library Privacy Pledge

planet code4lib - Mon, 2015-08-31 02:39
The Library Privacy Pledge of 2015, which I wrote about previously, has been finalized. We got a lot of good feedback, and the big changes have focused on the schedule.

Now, any library, organization, or company that signs the pledge will have 6 months from the effective date of its signature to implement HTTPS. This should give everyone plenty of margin to do a good job on the implementation.

We pushed back our launch date to the first week of November. That's when we'll announce the list of "charter signatories". If you want your library, company or organization to be included in the charter signatory list, please send an e-mail to

The Let's Encrypt project will be launching soon. They are just one certificate authority that can help with HTTPS implementation.

I think this is a very important step for the library information community to take, together. Let's make it happen.

Here's the finalized pledge:

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. It’s just a first step: HTTPS is a privacy prerequisite, not a privacy solution. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.
We focus on HTTPS as a first step because of its timeliness. The Let's Encrypt initiative of the Electronic Frontier Foundation will soon launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.
The 3rd article of the American Library Association Code of Ethics sets a broad objective:
“We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.”

It's not always clear how to interpret this broad mandate, especially when everything is done on the internet. However, one principle of implementation should be clear and uncontroversial: Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.
The current best practice dictated by this principle is as follows: Libraries and vendors that serve libraries and library patrons should require HTTPS for all services and resources delivered via the web.
The Pledge for Libraries:
1. We will make every effort to ensure that web services and information resources under direct control of our library will use HTTPS within six months. [ dated______ ]
2. Starting in 2016, our library will assure that any new or renewed contracts for web services or information resources will require support for HTTPS by the end of 2016.
The Pledge for Service Providers (Publishers and Vendors):
1. We will make every effort to ensure that all web services that we (the signatories) offer to libraries will enable HTTPS within six months. [ dated______ ]
2. All web services that we (the signatories) offer to libraries will default to HTTPS by the end of 2016.
The Pledge for Membership Organizations:
1. We will make every effort to ensure that all web services that our organization directly control will use HTTPS within six months. [ dated______ ]
2. We encourage our members to support and sign the appropriate version of the pledge.
There's a FAQ available, too. All this will soon be posted on the Library Freedom Project website.

Harvard Library Innovation Lab: Link roundup August 30, 2015

planet code4lib - Sun, 2015-08-30 17:41

This is the good stuff.

Rethinking Work

When employees negotiate, they negotiate for improved compensation, since nothing else is on the table.

Putting Elon Musk and Steve Jobs on a Pedestal Misrepresents How Innovation Happens

“Rather than placing tech leaders on a pedestal, we should put their successes”

Lamp Shows | HAIKU SALUT

Synced lamps as part of a band’s performance

Lawn Order | 99% Invisible

Jail time for a brown lawn? A wonderfully weird dive into the moral implications of lawncare

Sky-high glass swimming pool created to connect south London apartment complex

Swim through the air

DuraSpace News: Cineca DSpace Service Provider Update

planet code4lib - Sun, 2015-08-30 00:00

From Andrea Bollini, Cineca

It has been a hot and productive summer here at Cineca; we have carried out several DSpace activities, together with the go-live of the National ORCID Hub to support the adoption of ORCID in Italy [1][2].

Ed Summers: iSchool

planet code4lib - Sat, 2015-08-29 20:05

As you can see, I’ve recently changed things around here. Yeah, it’s looking quite spartan at the moment, although I’m hoping that will change in the coming year. I really wanted to optimize this space for writing in my favorite editor, and making it easy to publish and preserve the content. Wordpress has served me well over the last 10 years and up till now I’ve resisted the urge to switch over to a static site. But yesterday I converted the 394 posts, archived the Wordpress site and database, and am now using Jekyll. I haven’t been using Ruby as much in the past few years, but the tooling around Jekyll feels very solid, especially given GitHub’s investment in it.

Honestly, there was something that pushed me over the edge to do the switch. Next week I’m starting in the University of Maryland iSchool, where I will be pursuing a doctoral degree. I’m specifically hoping to examine some of the ideas I dredged up while preparing for my talk at NDF in New Zealand a couple years ago. I was given almost a year to think about what I wanted to talk about – so it was a great opportunity for me to reflect on my professional career so far, and examine where I wanted to go.

After I got back I happened across a paper by Steven Jackson called Rethinking Repair, which introduced me to what felt like a very new and exciting approach to information technology design and innovation that he calls Broken World Thinking. In hindsight I can see that both of these things conspired to make returning to school at 46 years of age look like a logical thing to do. If all goes as planned I’m going to be doing this part-time while also working at the Maryland Institute for Technology in the Humanities, so it’s going to take a while. But I’m in a good spot, and am not in any rush … so it’s all good as far as I’m concerned.

I’m planning to use this space for notes about what I’m reading, papers, reflections etc. I thought about putting my citations, notes into Evernote, Zotero, Mendeley etc, and I may still do that. But I’m going to try to keep it relatively simple and use this space as best I can to start. My blog has always had a navel gazy kind of feel to it, so I doubt it’s going to matter much.

To get things started I thought I’d share the personal statement I wrote for admission to the iSchool. I’m already feeling more focus than when I wrote it almost a year ago, so it will be interesting to return to it periodically. The thing that has become clearer to me in the intervening year is that I’m increasingly interested in examining the role that broken world thinking has played in both the design and evolution of the Web.

So here’s the personal statement. Hopefully it’s not too personal :-)

For close to twenty years I have been working as a software developer in the field of libraries and archives. As I was completing my Masters degree in the mid-1990s, the Web was going through a period of rapid growth and evolution. The computer labs at Rutgers University provided me with what felt like a front row seat to the development of this new medium of the World Wide Web. My classes on hypermedia and information seeking behavior gave me a critical foundation for engaging with the emerging Web. When I graduated I was well positioned to build a career around the development of software applications for making library and archival material available on the Web. Now, after working in the field, I would like to pursue a PhD in the UMD iSchool to better understand the role that the Web plays as an information platform in our society, with a particular focus on how archival theory and practice can inform it. I am specifically interested in archives of born digital Web content, but also in what it means to create a website that gets called an archive. As the use of the Web continues to accelerate and proliferate it is more and more important to have a better understanding of its archival properties.

My interest in how computing (specifically the World Wide Web) can be informed by archival theory developed while working in the Repository Development Center under Babak Hamidzadeh at the Library of Congress. During my eight years at LC I designed and built both internally focused digital curation tools as well as access systems intended for researchers and the public. For example, I designed a Web based quality assurance tool that was used by curators to approve millions of images that were delivered as part of our various digital conversion projects. I also designed the National Digital Newspaper Program’s delivery application, Chronicling America, that provides thousands of researchers access to over 8 million pages of historic American newspapers every day. In addition, I implemented the data management application that transfers and inventories 500 million tweets a day to the Library of Congress. I prototyped the Library of Congress Linked Data Service which makes millions of authority records available using Linked Data technologies.

These projects gave me hands-on, practical experience using the Web to manage and deliver Library of Congress data assets. Since I like to use agile methodologies to develop software, this work necessarily brought me into direct contact with the people who needed the tools built, namely archivists. It was through these interactions over the years that I began to recognize that my Masters work at Rutgers University was in fact quite biased towards libraries, and lacked depth when it came to the theory and praxis of archives. I remedied this by spending about two years of personal study focused on reading about archival theory and practice with a focus on appraisal, provenance, ethics, preservation and access. I also became a participating member of the Society of American Archivists.

During this period of study I became particularly interested in the More Product Less Process (MPLP) approach to archival work. I found that MPLP had a positive impact on the design of archival processing software since it oriented the work around making content available, rather than on often time consuming preservation activities. The importance of access to digital material is particularly evident since copies are easy to make, but rendering can often prove challenging. In this regard I observed that requirements for digital preservation metadata and file formats can paradoxically hamper preservation efforts. I found that making content available sooner rather than later can serve as an excellent test of whether digital preservation processing has been sufficient. While working with Trevor Owens on the processing of the Carl Sagan collection we developed an experimental system for processing born digital content using lightweight preservation standards such as BagIt in combination with automated topic model driven description tools that could be used by archivists. This work also leveraged the Web and the browser for access by automatically converting formats such as WordPerfect to HTML, so they could be viewable and indexable, while keeping the original file for preservation.

Another strand of archival theory that captured my interest was the work of Terry Cook, Verne Harris, Frank Upward and Sue McKemmish on post-custodial thinking and the archival enterprise. It was specifically my work with the Web archiving team at the Library of Congress that highlighted how important it is for record management practices to be pushed outwards onto the Web. I gained experience in seeing what makes a particular web page or website easier to harvest, and how impractical it is to collect the entire Web. I gained an appreciation for how innovation in the area of Web archiving was driven by real problems such as dynamic content and social media. For example I worked with the Internet Archive to archive Web content related to the killing of Michael Brown in Ferguson, Missouri by creating an archive of 13 million tweets, which I used as an appraisal tool, to help the Internet Archive identify Web content that needed archiving. In general I also saw how traditional, monolithic approaches to system building needed to be replaced with distributed processing architectures and the application of cloud computing technologies to easily and efficiently build up and tear down such systems on demand.

Around this time I also began to see parallels between the work of Matthew Kirschenbaum on the forensic and formal materiality of disk based media and my interests in the Web as a medium. Archivists usually think of the Web content as volatile and unstable, where turning off a web server can result in links breaking, and content disappearing forever. However it is also the case that Web content is easily copied, and the Internet itself was designed to route around damage. I began to notice how technologies such as distributed revision control systems, Web caches, and peer-to-peer distribution technologies like BitTorrent can make Web content extremely resilient. It was this emerging interest in the materiality of the Web that drew me to a position in the Maryland Institute for Technology in the Humanities where Kirschenbaum is the Assistant Director.

There are several iSchool faculty that I would potentially like to work with in developing my research. I am interested in the ethical dimensions to Web archiving and how technical architectures embody social values, which is one of Katie Shilton’s areas of research. Brian Butler’s work studying online community development and open data is also highly relevant to the study of collaborative and cooperative models for Web archiving. Ricky Punzalan’s work on virtual reunification in Web archives is also of interest because of its parallels with post-custodial archival theory, and the role of access in preservation. And Richard Marciano’s work on digital curation, in particular his recent work with the NSF on Brown Dog, would be an opportunity for me to further my experience building tools for digital preservation.

If admitted to the program I would focus my research on how Web archives are constructed and made accessible. This would include a historical analysis of the development of Web archiving technologies and organizations. I plan to look specifically at the evolution and deployment of Web standards and their relationship to notions of impermanence, and change over time. I will systematically examine current technical architectures for harvesting and providing access to Web archives. Based on user behavior studies I would also like to reimagine what some of the tools for building and providing access to Web archives might look like. I expect that I would spend a portion of my time prototyping and using my skills as a software developer to build, test and evaluate these ideas. Of course, I would expect to adapt much of this plan based on the things I learn during my course of study in the iSchool, and the opportunities presented by working with faculty.

Upon completion of the PhD program I plan to continue working on digital humanities and preservation projects at MITH. I think the PhD program could also qualify me to help build the iSchool’s new Digital Curation Lab at UMD, or similar centers at other institutions. My hope is that my academic work will not only theoretically ground my work at MITH, but will also be a source of fruitful collaboration with the iSchool, the Library and larger community at the University of Maryland. I look forward to helping educate a new generation of archivists in the theory and practice of Web archiving.

Cherry Hill Company: Learn About Islandora at the Amigos Online Conference

planet code4lib - Fri, 2015-08-28 19:42

On September 17, 2015, I'll be giving the presentation "Bring Your Local, Unique Content to the Web Using Islandora" at the Amigos Open Source Software and Tools for the Library and Archive online conference. Amigos is bringing together practitioners from around the library field who have used open source in projects at their library. My talk will be about the Islandora digital asset management system, the fundamental building block of the Cherry Hill LibraryDAMS service.

Every library has content that is unique to itself and its community. Islandora is open source software that enables libraries to store, present, and preserve that unique content to their communities and to the world. Built atop the popular Drupal content management system and the Fedora digital object repository, Islandora powers many digital projects on the...


SearchHub: How Shutterstock Searches 35 Million Images by Color Using Apache Solr

planet code4lib - Fri, 2015-08-28 18:00
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Shutterstock engineer Chris Becker’s session on how they use Apache Solr to search 35 million images by color.

This talk covers some of the methods they’ve used for building color search applications at Shutterstock using Solr to search 40 million images. A couple of these applications can be found in Shutterstock Labs, notably Spectrum and Palette. We’ll go over the steps for extracting color data from images and indexing it into Solr, as well as looking at some ways to query color data in your Solr index. We’ll cover issues such as what relevance means when you’re searching for colors rather than text, and how you can achieve various effects by ranking on different visual attributes.

At the time of this presentation, Chris was the Principal Engineer of Search at Shutterstock, a stock photography marketplace selling over 35 million images, where he had worked on image search since 2008. In that time he worked on all the pieces of Shutterstock’s search technology ecosystem, from the core platform to relevance algorithms, search analytics, image processing, similarity search, internationalization, and user experience. He started using Solr in 2011 and has used it for building various image search and analytics applications.

Searching Images by Color: Presented by Chris Becker, Shutterstock from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Shutterstock Searches 35 Million Images by Color Using Apache Solr appeared first on Lucidworks.

DPLA: DPLA Welcomes Four New Service Hubs to Our Growing Network

planet code4lib - Fri, 2015-08-28 16:50

The Digital Public Library of America is pleased to announce the addition of four Service Hubs that will be joining our Hub network. The Hubs represent Illinois, Michigan, Pennsylvania and Wisconsin.  The addition of these Hubs continues our efforts to help build local community and capacity, and further efforts to build an on-ramp to DPLA participation for every cultural heritage institution in the United States and its territories.

These Hubs were selected from the second round of our application process for new DPLA Hubs.  Each Hub has a strong commitment to bring together the cultural heritage content in their state to be a part of DPLA, and to build community and data quality among the participants.

In Illinois, the Service Hub responsibilities will be shared by the Illinois State Library, the Chicago Public Library, the Consortium of Academic and Research Libraries of Illinois (CARLI), and the University of Illinois at Urbana Champaign. More information about the Illinois planning process can be found here. Illinois plans to make available collections documenting coal mining in the state, World War II photographs taken by an Illinois veteran and photographer, and collections documenting rural healthcare in the state.

In Michigan, the Service Hub responsibilities will be shared by the University of Michigan, Michigan State University, Wayne State University, Western Michigan University, the Midwest Collaborative for Library Services and the Library of Michigan.  Collections to be shared with the DPLA cover topics including the history of the Motor City, historically significant American cookbooks, and Civil War diaries from the Midwest.

In Pennsylvania, the Service Hub will be led by Temple University, Penn State University, University of Pennsylvania and Free Library of Philadelphia in partnership with the Philadelphia Consortium of Special Collections Libraries (PACSCL) and the Pennsylvania Academic Library Consortium (PALCI), among other key institutions throughout the state.  More information about the Service Hub planning process in Pennsylvania can be found here.  Collections to be shared with DPLA cover topics including the Civil Rights Movement in Pennsylvania, Early American History, and the Pittsburgh Iron and Steel Industry.

The final Service Hub, representing Wisconsin, will be led by Wisconsin Library Services (WiLS) in partnership with the University of Wisconsin-Madison, Milwaukee Public Library, University of Wisconsin-Milwaukee, Wisconsin Department of Public Instruction and Wisconsin Historical Society.  The Wisconsin Service Hub will build off of the Recollection Wisconsin statewide initiative.  Materials to be made available document the American Civil Rights Movement’s Freedom Summer and the diversity of Wisconsin, including collections documenting the lives of Native Americans in the state.

“We are excited to welcome these four new Service Hubs to the DPLA Network,” said Emily Gore, DPLA Director for Content. “These four states have each led robust, collaborative planning efforts and will undoubtedly be strong contributors to the DPLA Hubs Network.  We look forward to making their materials available in the coming months.”

DPLA: The March on Washington: Hear the Call

planet code4lib - Thu, 2015-08-27 19:00

Fifty-two years ago this week, more than 200,000 Americans came together in the nation’s capital to rally in support of the ongoing Civil Rights movement. It was at that march that Martin Luther King Jr.’s iconic “I Have A Dream” speech was delivered. And it was at that march that the course of American history was forever changed, in an event that resonates with protests, marches, and movements for change around the country decades later.

Get a new perspective on the historic March on Washington with this incredible collection from WGBH via Digital Commonwealth. This collection of audio pieces, 15 hours in total, offers uninterrupted coverage of the March on Washington, recorded by WGBH and the Educational Radio Network (a small radio distribution network that later became part of National Public Radio). This type of coverage was unprecedented in 1963, and offers a wholly unique view on one of the nation’s most crucial historic moments.

In this audio series, you can hear Martin Luther King Jr.’s historic speech, along with the words of many other prominent civil rights leaders–John Lewis, Bayard Rustin, Jackie Robinson, Roy Wilkins,  Rosa Parks, and Fred Shuttlesworth. There are interviews with Hollywood elite like Marlon Brando and Arthur Miller, alongside the complex views of the “everyman” Washington resident. There’s also the folk music of the movement, recorded live here, of Joan Baez, Bob Dylan, and Peter, Paul, and Mary. There are the stories of some of the thousands of Americans who came to Washington D.C. that August–teachers, social workers, activists, and even a man who roller-skated to the march all the way from Chicago.

Hear speeches made about the global nonviolence movement, the labor movement, and powerful words from Holocaust survivor Joachim Prinz. Another notable moment in the collection is an announcement of the death of W.E.B DuBois, one of the founders of the NAACP and an early voice for civil rights issues.

These historic speeches are just part of the coverage, however. There are fascinating, if more mundane, announcements, too, about the amount of traffic in Washington and issues with both marchers’ and commuters’ travel (though they reported that “north of K Street appears just as it would on a Sunday in Washington”). Another big, though less notable, issue of the day, according to WGBH reports, was food poisoning from the chicken in boxed lunches served to participants at the march. There is also information about the preparation for the press, which a member of the march’s press committee says included more than 300 “out-of-town correspondents.” This was in addition to the core Washington reporters, radio stations like WGBH, TV networks, and international stations from Canada, Japan, France, Germany and the United Kingdom. These types of minute details and logistics offer a new window into a complex historic event, bringing together thousands of Americans in the nation’s capital (though, as WGBH reported, not without its transportation hurdles!).

At the end of the demonstration, you can hear for yourself a powerful pledge, recited from the crowd, to further the mission of the march. It ends poignantly: “I pledge my heart and my mind and my body unequivocally and without regard to personal sacrifice, to the achievement of social peace through social justice.”

Hear the pledge, alongside the rest of the march as it was broadcast live, in this inspiring and insightful collection, courtesy of WGBH via Digital Commonwealth.

Banner image courtesy of the National Archives and Records Administration.

A view of the March on Washington, showing the Reflecting Pool and the Washington Monument. Courtesy of the National Archives and Records Administration.

