planet code4lib


Open Library Data Additions: Amazon Crawl: part ee

Fri, 2015-05-15 09:17

Part ee of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

District Dispatch: How do library patrons feel about digital content?

Fri, 2015-05-15 06:26

Join a panel of Book Industry Study Group (BISG) and American Library Association (ALA) leaders at the 2015 ALA Annual Conference in San Francisco, where they will discuss the results of a newly released study on public library patrons’ use of digital content.

During the conference session “Digital Content in Public Libraries: What Do Patrons Think?” panelists will discuss the results of a new study by the BISG and ALA that was designed to provide invaluable insight into how readers interact with e-books in a library environment. The session takes place from 3:00 to 4:00 p.m. on Sunday, June 28, 2015, at the Moscone Convention Center in room 131 of the North Building.

The digital content survey was developed to understand the behavior of library patrons, including their use of digital resources and other services offered by public libraries. The study examined the impact of digital consumption behaviors, including the adoption of new business models, on library usage across America.

  • Kathy Rosa, director, Office for Research and Statistics, American Library Association
  • Carrie Russell, program director, Public Access to Information, Office for Information Technology Policy, American Library Association
  • Nadine Vassallo, project manager, Research & Information, Book Industry Study Group

View all ALA Washington Office conference sessions

The post How do library patrons feel about digital content? appeared first on District Dispatch.

Jonathan Rochkind: On approaches to Bridging the Gap in access to licensed resources

Thu, 2015-05-14 20:57

A previous post I made reviewing the Ithaka report “Streamlining access to Scholarly Resources” got a lot of attention. Thanks!

The primary issue I’m interested in there: Getting our patrons from a paywalled scholarly citation on the open unauthenticated web, to an authenticated library-licensed copy, or other library services. “Bridging the gap”.

Here, we use Umlaut to turn our “link resolver” into a full-service landing page offering library services for both books and articles:  Licensed online copies, local print copies, and other library services.

This means we’ve got the “receiving” end taken care of — here’s a book and an article example of an Umlaut landing page — the problem reduces to getting the user from the open unauthenticated web to an Umlaut page for the citation in question.

Which is still a tricky problem.  In this post, brief discussion of two things: 1) The new “Google Scholar Button” browser extension from Google, which is interesting in this area, but I think ultimately not enough of a solution to keep me from looking for more, and 2) Possibilities of Zotero open source code toward our end.

The Google Scholar Button

In late April Google released a browser plugin for Chrome and Firefox called the “Google Scholar Button”.

This plugin will extract the title of an article from a page (either text you’ve selected on the page first, or it will try to scrape a title from HTML markup), and give you search results for that article title from Google Scholar, in a little popup window.

Interestingly, this is essentially the same thing a couple of third party software packages have done for a while: The LibX “Magic Button”, and Lazy Scholar.  But now we get it in an official Google release, instead of hacky workarounds to Google’s lack of API from open source.

The Google Scholar Button is basically trying to bridge the same gap we are; it provides a condensed version of Google Scholar search results, with a link to an open access PDF if Google knows about one (I am still curious how many of these open access PDFs are not-entirely-licensed copies put up by authors or professors without publisher permissions).

And in some cases it provides an OpenURL link to a library link resolver, which is just what we’re looking for.

However, it’s got some limitations that keep me from considering it a satisfactory ‘Bridging the Gap’ solution:

  • In order to get the OpenURL link to your local library link resolver while you are off campus, you have to set your Google Scholar preferences in your browser, which is pretty confusing to do.
  • The title has to match in Google Scholar’s index of course. Which is definitely extensive enough to still be hugely useful, as evidenced by the open source predecessors to Google Scholar Button trying to do the same thing.
  • But most problematic of all, Google Scholar Button results will only show the local library link resolver link for some citations: the ones that have been registered as having institutional fulltext access in your institutional holdings registered with Google.  I want to get users to the Umlaut landing page for any citation they want, even if we don’t have licensed fulltext (and we might even if Google doesn’t think we do; the holdings registrations are not always entirely accurate). I want to show them local physical copies (especially for books), and ILL and other document delivery services.
    • The full Google Scholar gives a hard-to-find but at least it’s there OpenURL link for “no local fulltext” under a ‘more’ link, but the Google Scholar Button version doesn’t offer even this.
    • Books/monographs might not be the primary use case, but I really want a solution that works for books too — and books are something users may be especially interested in a physical copy instead of online fulltext for, and books are also something that our holdings registration with Google pretty much doesn’t include, even ebooks.  And book titles are a lot less likely to return hits in Google Scholar at all.

I really want a solution that works all or almost all of the time to get the patron to our library landing page, not just some of the time, and my experiments with Google Scholar Button revealed more of a ‘sometimes’ experience.

I’m not sure if the LibX or Lazy Scholar solutions can provide an OpenURL link in all cases, regardless of Google institutional holdings registration.  They are both worth further inquiry for sure.  But Lazy Scholar isn’t open source and I find its UI not great for our purposes. And I find LibX a bit too heavyweight for solving this problem, and have some other concerns about it.

So let’s consider another avenue for “Bridging the Gap”….

Zotero’s scraping logic

Instead of trying to take a title and find a hit in a mega-corpus of scholarly citations  like the Google Scholar Button approach, another approach would be to try to extract the full citation details from the source page, and construct an OpenURL to send straight to our landing page.
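To make the idea concrete, here is a minimal sketch of the second half of that approach: taking already-scraped citation elements and building an OpenURL from them. It assumes the OpenURL 1.0 (Z39.88-2004) key/encoded-value journal format. The resolver address, the `build_openurl` function, and the citation field names on the left side of `key_map` are all hypothetical, chosen only for illustration; the scraping step itself is not shown.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; substitute your institution's link resolver.
RESOLVER_BASE = "https://resolver.example.edu/openurl"


def build_openurl(citation, base=RESOLVER_BASE):
    """Build an OpenURL 1.0 (Z39.88-2004) key/encoded-value link for an article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    # Map our own (assumed) citation field names to OpenURL journal keys.
    key_map = {
        "article_title": "rft.atitle",
        "journal_title": "rft.jtitle",
        "author_last": "rft.aulast",
        "author_first": "rft.aufirst",
        "year": "rft.date",
        "volume": "rft.volume",
        "issue": "rft.issue",
        "start_page": "rft.spage",
        "issn": "rft.issn",
    }
    for field, key in key_map.items():
        value = citation.get(field)
        if value:  # skip elements the scraper could not find
            params.append((key, str(value)))
    if citation.get("doi"):
        params.append(("rft_id", "info:doi/" + citation["doi"]))
    return base + "?" + urlencode(params)
```

Elements the scraper could not find are simply left out, so the resolver still receives whatever partial citation is available. How well a partial citation resolves depends on the link resolver, which this sketch does not address.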

And, hey, it has occurred to me, there’s some software that already can scrape citation data elements from quite a long list of web sites our patrons might want to start from.  Zotero. (And Mendeley too for that matter).

In fact, you could use Zotero as a method of ‘Bridging the Gap’ right now. Sign up for a Zotero account, install the Zotero extension. When you are on a paywalled citation page on the unauthenticated open web (or a search results page on Google Scholar, Amazon, or other places Zotero can scrape from), first import your citation into Zotero. Then go into your Zotero library, find the citation, and — if you’ve properly set up your OpenURL preferences in Zotero — it’ll give you a link to click on that will take you to your institutional OpenURL resolver. In our case, our Umlaut landing page.

We know from some faculty interviews that some faculty definitely use Zotero, hard to say if a majority do or not. I do not know how many have managed to set up their OpenURL preferences in Zotero, if this is part of their use of it.

Even of those who have, I wonder how many have figured out on their own that they can use Zotero to “bridge the gap” in this way.  But even if we undertook an education campaign, it is a somewhat cumbersome process. You might not want to actually import into your Zotero library, you might want to take a look at the article first. And not everyone chooses to use Zotero, and we don’t want to require them to for a ‘bridging the gap’ solution.

But that logic is there in Zotero, the pretty tricky task of compiling and maintaining ‘scraping’ rules for a huge list of sites likely to be desirable as ‘Bridging the Gap’ sources. And Zotero is open source, hmm.

We could imagine adding a feature to Zotero that let the user choose to go right to an institutional OpenURL link after scraping, instead of having to import and navigate to their Zotero library first.  But I’m not sure such a feature would match the goals of the Zotero project, or how to integrate it into the UX in a clear way without distracting from Zotero’s core functionality.

But again, it’s open source.  We could imagine ‘forking’ Zotero, or extracting just the parts of Zotero that matter for our goal, into our own product that did exactly what we wanted. I’m not sure I have the local resources to maintain a ‘forked’ version of plugins for several browsers.

But Zotero also offers a bookmarklet, which doesn’t have as good a UI as the browser plugins, and which doesn’t support all of the scrapers. But unlike a browser plugin, you can install it on iOS and Android mobile browsers (although it’s a bit confusing to do so, at least it’s possible).  And it’s probably ‘less expensive’ for a developer to maintain a ‘fork’ of: we really just want to take Zotero’s scraping behavior, implemented via bookmarklet, and completely replace what you do with the citation after it’s scraped, sending it to our institutional OpenURL resolver instead.

I am very intrigued by this possibility, it seems at least worth some investigatory prototypes to have patrons test.  But I haven’t yet figured out where to actually find the bookmarklet code, and related code in Zotero that may be triggered by it, let alone the next step of figuring out if it can be extracted into a ‘fork’.  I’ve tried looking around on the Zotero repo, but I can’t figure out what’s what.  (I think all of Zotero is open source?).

Anyone know the Zotero devs, and want to see if they want to talk to me about it with any advice or suggestions? Or anyone familiar with the Zotero source code themselves and want to talk to me about it?

Filed under: General

Patrick Hochstenbach:

Thu, 2015-05-14 15:31
Our webshop at Filed under: Comics Tagged: cartoon, logo, webshop

Patrick Hochstenbach: Figure drawing on mondays

Thu, 2015-05-14 14:31
Filed under: Figure Drawings Tagged: art, art model, Nude, Nudes, sketchbook

Patrick Hochstenbach: Homework assignment #4 Sketchbookskool

Thu, 2015-05-14 14:26
Filed under: Comics Tagged: cartoon, comic, copic, Photoshop, sketchbook, sketchbookskool

Patrick Hochstenbach: Homework assignment #3 Sketchbookskool

Thu, 2015-05-14 14:23
Use a child’s drawing as a basis and complete the drawing.
Filed under: Comics Tagged: brushpen, cartoon, portret, sketch, sketchbook, sketchbookskool, urbansketching

Patrick Hochstenbach: Homework assignment #2 Sketchbookskool

Thu, 2015-05-14 14:20
Filed under: Doodles Tagged: crosshatching, ipad, ipad paper crosshatching, sketchbook, sketchbookskool, urbansketching

FOSS4Lib Updated Packages: Binder

Thu, 2015-05-14 14:17

Last updated May 14, 2015. Created by Peter Murray on May 14, 2015.

Binder is an open source digital repository management application, designed to meet the needs and complex digital preservation requirements of museum collections. Binder was created by Artefactual Systems and the Museum of Modern Art.

Binder aims to facilitate digital collections care, management, and preservation for time-based media and born-digital artworks, and is built by integrating functionality of the Archivematica and AtoM projects.

A presentation on Binder's functionality (Binder was formerly known as the
DRMC during development) can be found here:

Slides from a presentation at Code4LibBC 2014, including screenshots from the
application, can be found here:

Further resources

Package Type: Archival Record Manager and Editor
License: GPLv3
Development Status: In Development
Operating System: Browser/Cross-Platform
Technologies Used: XSLT
Programming Language: PHP
Database: MySQL
Works well with: Archivematica

District Dispatch: U.S. House poised to pass real reforms to USA PATRIOT Act

Wed, 2015-05-13 19:41

[FBI, child, library bookdrop], June 25, 2002. Brush and ink and opaque white over pink pencil on bristol board. Prints and Photographs Division, Library of Congress, LC-DIG-ppmsca-04691; LC-USZ62-134267. Courtesy of Tribune Media Services (31)

Section 215 of the USA PATRIOT Act became, and remains, known as the “library provision” of that law because of intense and ongoing librarian opposition to the sweeping power it grants the government to compel libraries, without a probable cause-based search warrant, to divulge personal patron reading and internet usage records, and to the “gag orders” associated with Section 215 and “National Security Letters” (NSLs) that impede judicial and public oversight of such activity.

Tonight, the House of Representatives will vote on the USA FREEDOM Act of 2015, H.R. 2048, to finally ban the “bulk collection” of Americans’ personal communications records (library, telephone and otherwise) under Section 215. Critically, it also would preclude the use of other surveillance laws (related to “pen registers”) and NSLs to get around that prohibition and would bring the “gag order” provisions of the USA PATRIOT Act into compliance with the First Amendment by permitting them to be meaningfully challenged in court.
The bill, not incidentally, also permits phone and internet companies to publish information (in a sufficiently specific form to be useful) about the number of requests they receive from the government to produce personal subscriber information.  It also, for the first time, would create opportunities for specially cleared civil liberties advocates to appear before the secret Foreign Intelligence Surveillance Act (FISA) court that authorizes surveillance activities.  The bill also makes important “first step” reforms to privacy-hostile provisions, including Section 702, of the FISA Amendments Act.

ALA and its many public and private sector coalition partners strongly support passage of H.R. 2048.  That message was underscored by the more than 400 librarian lobbyists who took to Capitol Hill on May 5, during the American Library Association’s (ALA) National Library Legislative Day.  They carried with them a stirring and emphatic OpEd urging real reform entitled “Long Lines for Freedom” by ALA President Courtney Young, which was published that morning in The Hill, a Congress-centric newspaper widely read by Members of Congress, their staffs and the national press.

While House passage of the USA FREEDOM Act is widely expected, its fate in the Senate is uncertain at best. Stay tuned for more on how you can help!

The post U.S. House poised to pass real reforms to USA PATRIOT Act appeared first on District Dispatch.

LITA: Jobs in Information Technology: May 13

Wed, 2015-05-13 19:38

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Automated Services Manager, Fairbanks, AK

Emerging Technologies/Learning Technologies Librarian, Queensborough Community College, Bayside, NY

Web Site Designer / Developer, Senior, University of Arizona, Tucson, AZ

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Nicole Engard: Building Robots in Pasco County Library

Wed, 2015-05-13 19:13

Today I got to attend a talk by Pasco County Library system at the Florida Library Association conference on how they are building robots in the library. They work with a non-profit called First that helps get kids excited in areas of STEM. Pasco is the only public library in the US doing this and has named their team Edgar Allan Ohms.

It’s important to not be scared of this. You don’t have to be an engineer to participate in this program, it’s about more than robot building. The students build these robots, compete with them and then can apply for scholarships through First. The students run the entire program. They build the website, design the logos and signs, build the robots, etc.

How did Pasco do this? They converted a space in their library to a makerspace with outlets, tools and even non-robotic tools like sewing machines and AutoCAD tools. Of course they are trying to do this as cheaply and quickly as possible – this too teaches the kids how to use ‘found’ items to make these things happen. Another skill they’re teaching the kids is how to sell themselves, how to fund-raise, and how to talk to people to get funding and promotion. It’s so much more than kids just sitting around playing games all day – they are learning real life skills.

How do librarians (with no engineering background) do this? You go out in to your community and find people who want to help out! They are using family members, community members, library fans, and local businesses to help provide tools, supplies and services. People know about First and so everyone wants to help. In some cases people will come to you and offer to help if they hear about what you’re doing. If you can’t find anyone yourself First will help you.

They start each August, and this year there are so many interested that they will be interviewing kids to find those who will commit. They attend workshops weekly and bi-weekly August through December to talk about the rules and plans. In January they go in to competition mode – this is when they start to build the robot. This video shows the rules that the team had to use last year in order to build their robot – this was shown to all teams at the exact same time and from then they spend the next 6 weeks building.

They need to start with some planning based on the rules in the video. The kids will start designing on CAD, testing it in modeling software online, and go from there to building something that will run.

Everything these groups are doing is open and shared. This means that the kids are learning job skills not just in engineering but in marketing, writing and other areas. The groups that will be competing go out on scouting missions where they see what other groups have done and learn from them.

So, if you want to do this in your library how do you get funding and approval from your lawyers? First off explain that you will get some funding from the program itself. Next show that this program is going to help the community members by offering scholarships to the kids, teaching them real skills and bringing the kids out into the community. Think about it this way – how much does a high school pay for a football team? For a fraction of that you can bring together 25 kids and teach them a skill for life whereas most of those kids who play football in high school don’t end up in the NFL. For the lawyers the library basically said that this is a valuable program and went to bat to get it to go through. In the end the lawyers wrote up a disclaimer that all the kids have to sign in order to participate.

This is the kind of program that more libraries should be offering to encourage kids to learn about STEM and bring library awareness to the entire community – our libraries are about so much more than books and DVDs and this is a great way to show that.

The post Building Robots in Pasco County Library appeared first on What I Learned Today....

Related posts:

  1. Keynote: Licensing Models and Building an Open Source Community
  2. How To Get More Kids To Code
  3. SxSW: Building the Open Source Society

HangingTogether: Shift to Linked Data for production

Wed, 2015-05-13 18:54


That was the topic discussed several times recently by OCLC Research Library Partners metadata managers, initiated by Philip Schreur of Stanford, who is also involved in the Linked Data for Libraries (LD4L) project.  Linked data may well be the next common infrastructure both for communicating library data and embedding it into the fabric of the semantic web. There have been a number of different models developed: Digital Public Library of America’s Metadata Application Profile,, BIBFRAME, etc. Much of a research library’s routine production is tied directly to its local system and makes use of MARC for internal and external data communication.  Linked data offers an opportunity to go beyond the library domain and authority files to draw on information about entities from diverse sources.

Publishing metadata for digital collections as linked data directly, bypassing MARC record conversion, may offer more flexibility and accuracy. (An example of losing information when converting from one format to another is documented in Jean Godby’s 2012 report, A Crosswalk from ONIX 3.0 for Books to MARC 21.) Stanford is pulling together information about faculty members and publications in a way that they could never do without utilizing linked data.

Some of the issues raised in the focus group discussions included:

Critical components in linked data that could be started now: Including persistent identifiers in the MARC bibliographic and authority records created now will help in transitioning to a future linked data environment. The entities are more clearly identified in authority records than in bibliographic records where it’s not always clear which elements represent a work versus an expression of a work. OCLC is already adding FAST identifiers in the $0 subfield (the authority control number or standard number) in the subject fields of WorldCat records. The British Library expects to launch a pilot this summer to match the LC/NACO authority file against the ISNI database and add ISNI identifiers to the authority record’s 024 field. Adding $4 role codes in personal name added entries will help establish relationships among name entities in the future. Creating identifiers for entities that do not yet have them will build a larger pool of data to help disambiguate them later. The community could also consider a wider range of authorities beyond the LC/NACO authority file for re-using existing identifiers (e.g., VIAF, ISNI and identifiers in other national authority files) and “get us into the habit”.

Provenance:  How to resolve or reconcile conflicts between statements? We will likely see different types of inconsistencies than we see now with, for example, different birthdates. OCLC has been looking at the work of Google researchers on a “knowledge graph” (the basis of knowledge cards). As Google harvests the Web, it comes across incorrect or conflicting statements. Researchers have documented using algorithms based on frequency and the source of links to come up with a “confidence measure”.  (Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion.) Aggregations such as WorldCat, VIAF and Wikidata may allow the library community to view statements from these sources with more confidence than others.
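As a rough illustration of what a frequency-and-source-based confidence measure could look like, the sketch below scores conflicting statements about a single property (say, a birthdate) by the share of source weight behind each value. This is not the algorithm from the Knowledge Vault paper or anything OCLC has described; the `score_statements` function and the trust weights in `SOURCE_WEIGHT` are invented purely for the example.

```python
from collections import defaultdict

# Assumed trust weights per source; the values are invented for illustration.
SOURCE_WEIGHT = {"VIAF": 0.9, "Wikidata": 0.8, "unknown-website": 0.3}


def score_statements(statements, default_weight=0.1):
    """Return {value: confidence} for conflicting statements about one property.

    Each statement is a (source, value) pair. A value's confidence is the share
    of total source weight that asserts it, so values asserted often and by
    trusted sources score highest.
    """
    totals = defaultdict(float)
    for source, value in statements:
        totals[value] += SOURCE_WEIGHT.get(source, default_weight)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {value: weight / grand_total for value, weight in totals.items()}
```

With two trusted sources asserting one birthdate and one low-trust source asserting another, the first value takes most of the confidence. A production approach would need far more than this, including provenance tracking for each statement.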

Importance of holdings data in a linked data environment: Metadata managers see the need to communicate both the availability and eligibility of the resource being described. A W3C document, Holdings via Offer, recommends mappings from bibliographic holdings data to

Impact on workflow:  In the next phase of the Linked Data for Libraries project, six libraries (Columbia, Cornell, Harvard, Stanford, Princeton and the Library of Congress) hope to figure out how to use linked data in production using BIBFRAME. They will be looking at how to link into acquisitions and circulation as well as cataloging workflows, and hope to collaborate with cataloging and local system vendors. Metadata managers noted it’s important to collaborate with the book vendors that supply them with MARC records now – even if they cannot generate linked data themselves, perhaps they could enhance MARC records so that transforming them into BIBFRAME is cleaner. Linked data may also encourage more sharing of metadata via statements rather than copy-cataloging a record that is then maintained as a local copy that is not shared with others.


  • During this transition period the environment and standards are a moving target.
  • It’s unclear how libraries will share “statements” rather than records in a linked data environment
  • How to involve the many vendors which supply or process MARC records now? Working with others in the linked data environment involves people unfamiliar with the library environment, requiring metadata specialists to explain what their needs are in terms non-librarians can understand.
  • Differing interpretations of what is a “work” may hamper the ability to re-use data created elsewhere.

Success metrics: Moving into a production linked data environment will take time, and each institution may well have a different timetable. Discussions indicated that linked data experiments could be considered successful if:

  • The data is more integrated than it is now.
  • Data created by different workflows are interoperable.
  • Libraries can offer users new, valued services that current data models can’t support.
  • The resource descriptions are more machine-actionable than current standards.
  • Outside parties use library resource descriptions more.
  • The data is better and richer because more parties share in its creation.


Graphic: Partial view of Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak.

About Karen Smith-Yoshimura

Karen Smith-Yoshimura, program officer, works on topics related to renovating descriptive and organizing practices with a focus on large research libraries and area studies requirements.


Open Knowledge Foundation: Announcing the new open data handbook

Wed, 2015-05-13 13:29

We are thrilled to announce that the Open Data Handbook, the premier guide for open data newcomers and veterans alike, has received a much-needed update! The Open Data Handbook, originally published in 2012, has become the go-to resource for the open data community. It was written by expert members of the open data community and has been translated into over 18 languages. Read it now »

The Open Data Handbook elaborates on the what, why & how of open data. In other words – what data should be open, what are the social and economic benefits of opening that data, and how to make effective use of it once it is opened.

The handbook is targeted at a broad audience, including civil servants, journalists, activists, developers, and researchers as well as open data publishers. Our aim is to ensure open data is widely available and applied in as many contexts as possible, and we welcome your efforts to grow the open knowledge movement in this way!

The idea of open data is really catching on and we have learned many important lessons over the past three years. We believe that it is time that the Open Data Handbook reflect these learnings. The revised Open Data Handbook has a number of new features and plenty of ways to contribute your experience and knowledge. Please do!

 Inspire Open Data Newcomers

The original open data guide discussed the theoretical reasons for opening up data – increasing transparency and accountability of government, improving public and commercial services, stimulating innovation etc. We have now reached a point where we are able to go beyond theoretical arguments: we have real stories that document the benefits open data has on our lives. The Open Data Value Stories are use cases from across the open knowledge network that highlight the social and economic value and the varied applications of open data in the world.

This is by no means an exhaustive list; in fact just the beginning! If you have an open data value story that you would like to contribute, please get in touch.

 Learn How to Publish & Use Open Data

The Open Data Guide remains the premier open data how-to resource and in the coming months we will be adding new sections and features! For the time being, we have moved the guide to Github to streamline contributions and facilitate translation. We will be reaching out to the community shortly to determine what new content we should be prioritising.

While in 2012, when we originally published the open data guide, the open data community was still emerging and resources remained scarce, today the global open data community is mature, international and diverse, and resources now exist that reflect this maturity and diversity. The Open Data Resource Library is a curated collection of resources, including articles, longer publications, how-to guides, presentations and videos, produced by the global open data community, now available all in one place! If you want to contribute a resource, you can do so here! We are particularly interested in expanding the number of resources we have in languages other than English so please add them if you have them!

Finally, as we are probably all aware, the open data community likes its jargon! While the original open data guide had a glossary of terms, it was far from exhaustive — especially for newcomers to the open data movement. In the updated version we have added over 80 new terms and concepts with easy to understand definitions! Have we missed something out? Let us know what we are missing here.

The updated Open Data Handbook is a living resource! In the coming months, we will be adding new sections to the Open Data Guide and producing countless more value stories! We invite you to contribute your stories, your resources and your ideas! Thank you for your contributions past, present and future and your continued efforts in pushing this movement forward.

SearchHub: Query Autofiltering Revisited – Lets be more precise!!!

Wed, 2015-05-13 10:58
In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find.  A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this.

A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep into a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words.

The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns.  The traditional way to do this is to use Natural Language Processing or NLP to parse the user query.  This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario – which goes back to the early AI days – has become much more important now that we can talk to our cell phones.

For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what or who they want to find, or where) – e.g. “restaurant” and “Cleveland” – and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the user’s intent should be interpreted as AND.

It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search. The goal of query autofiltering is to use that built-in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this.

The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query match field values in the search index (using a cool Lucene feature that enables us to mine the “uninverted” index for all of the values that were indexed in a field).
For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it will infer that the user is looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnnie Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …).

It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens one at a time, in isolation from the tokens around them. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP, which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata.

Searching vs Navigating

A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links.
In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators – an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “mark up” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically, that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will.)

A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching:

In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use <field>:<value> syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with but not something that we can expect of the average user. However, string fields are very good for faceted navigation.
Text fields on the other hand are analyzed (tokenized and filtered) and can be searched with “freetext” queries – our little box in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words) and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case, the facets are based on individual tokens which don’t tend to be too useful.

So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are but we typically don’t in the easier-to-search type. A classic trade-off scenario.

Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender. But the information is still somewhere in the search index, we just need a way to get it back at “query time”. Then, we can both have our cake AND eat it!

Noun Phrases and the Hierarchy of meta information

When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs.
There are, however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth. Within some of these business types there may be refining information, like restaurant type (“Mexican”, “Chinese”, “Italian”, etc.) for restaurants, or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types.

Rebuilding Context: Identifying field name patterns to find relevant phrase patterns

So now it’s time to put Humpty Dumpty back together again. With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example.
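
To make the idea of associating metadata values with their source fields concrete, here is a minimal Python sketch. It is only an illustration, not the component's code: the records, the field names (business_type, restaurant_type, brand) and the helper build_value_map are all hypothetical, and a real index would supply the values instead of a hand-written list.

```python
from collections import defaultdict

# Hypothetical local-business records; field names are illustrative only.
RECORDS = [
    {"business_type": "restaurant", "restaurant_type": "Mexican"},
    {"business_type": "hardware store", "brand": "Tru Value"},
    {"business_type": "drug store", "brand": "CVS"},
    {"business_type": "filling station", "brand": "Exxon"},
]


def build_value_map(records):
    """Map each lowercased field value to the set of fields it occurs in."""
    value_to_fields = defaultdict(set)
    for record in records:
        for field, value in record.items():
            value_to_fields[value.lower()].add(field)
    return dict(value_to_fields)


value_map = build_value_map(RECORDS)
print(value_map["tru value"])       # the field(s) that contain "Tru Value"
print(value_map["hardware store"])  # the field(s) that contain "hardware store"
```

A set is used per value because the same string can legitimately occur in more than one field, which is exactly the ambiguity that phrase context has to resolve.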
The Query AutoFiltering Search Component

To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things. The SearchComponent base class defines a number of methods one of which – prepare() – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud).

The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene uninverted index and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer.
Assuming that we have indexed the following records:

Record   color   product_type        brand
1        red     shoes
2        red     socks
3        brown   socks
4        green   socks               red lion
5        blue    socks               red lion
6        blue    socks               red dragon
7                pizza               red baron
8                whiskey             red label
9                smoke detector      red light
10               yeast               red star
11               red wine            gallo
12               red wine vinegar    heinz
13               red grapes          dole
14               red brick           acme
15               red pepper          dole
16               red pepper flakes   mccormick

This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilteringComponent, only 2 results are returned (4 and 5) for this query. Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12.

What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values. So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supersede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too. If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”.
It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query:

q=blue red lion socks

to a filter query:

fq=color:blue&fq=brand:"red lion"&fq=product_type:socks

This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% (1 relevant record out of 16 returned) to 100%!

Adding case insensitivity and synonym support:

One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future.

Source Code

For those who are interested in how the autofiltering component works or would like to use it in their search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it).
The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above.

Conclusions

The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns a free-form query into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter “boostFactor” which, when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.
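
For readers who want to experiment with the longest-contiguous-phrase idea outside of Solr, the following Python sketch approximates it. This is not the component's actual code: the VALUE_TO_FIELD map is hand-written from a subset of the sample records (the real component derives it from the index), the function names are hypothetical, and the way the fq and bq parameters are rendered (including the match-all q=*:* when every token is consumed) is an assumption made for illustration.

```python
# Hypothetical inverse map (indexed value -> field), hand-written from a
# subset of the sample records; the real component builds it from the index.
VALUE_TO_FIELD = {
    "red": "color", "brown": "color", "green": "color", "blue": "color",
    "shoes": "product_type", "socks": "product_type",
    "red wine": "product_type", "red wine vinegar": "product_type",
    "red lion": "brand", "red dragon": "brand", "red baron": "brand",
}


def autofilter(query, value_to_field=VALUE_TO_FIELD):
    """Return (filters, leftover_tokens) for a free-text query."""
    tokens = query.lower().split()
    filters, leftover = [], []
    i = 0
    while i < len(tokens):
        # Try the longest phrase starting at token i first, then shrink it.
        for end in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:end])
            field = value_to_field.get(phrase)
            if field is not None:
                filters.append((field, phrase))
                i = end  # skip past every token the phrase consumed
                break
        else:
            # No indexed value starts here; keep the token in the main query.
            leftover.append(tokens[i])
            i += 1
    return filters, leftover


def to_params(query, boost_factor=None):
    """Render request parameters: fq mode by default, bq mode if boosted."""
    def term(field, value):
        return f'{field}:"{value}"' if " " in value else f"{field}:{value}"

    filters, leftover = autofilter(query)
    if boost_factor is None:
        # Filter mode: matched phrases leave q and become fq clauses.
        q = " ".join(leftover) or "*:*"
        return [("q", q)] + [("fq", term(f, v)) for f, v in filters]
    # Boost mode: keep the original query and boost the exact field matches.
    bq = " ".join(f"{term(f, v)}^{boost_factor}" for f, v in filters)
    return [("q", query), ("bq", bq)]


print(to_params("blue red lion socks"))
print(to_params("red lion socks", boost_factor=100))
```

Trying the longest candidate first means a phrase such as “red wine vinegar” is never split into “red” plus leftovers, which mirrors the "longest contiguous match wins" behavior described above; whether the real component searches in this order is not something this sketch claims.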

The post Query Autofiltering Revisited – Lets be more precise!!! appeared first on Lucidworks.

LibUX: 018: The Kano Model is Awesome – Really …

Wed, 2015-05-13 04:49

Okay, so we found it sort of tricky to explain, but the Kano Model really is awesome. In this episode, we try our best to tell you that the Kano Model is a sophisticated tool used to measure the impact of service features on the user experience. It is a way that you and your stakeholders can visualize the weight of a new feature, whether it will produce delight but require a huge investment, or that carousel will make you rue the day.

The post 018: The Kano Model is Awesome – Really … appeared first on LibUX.

DuraSpace News: GET READY for Fedora Camp

Wed, 2015-05-13 00:00

Winchester, MA  Save the dates now! The Fedora Project is pleased to announce that the first Fedora Camp will be offered November 16-18 (Monday-Wednesday) at Duke University (specifically The Edge: The Ruppert Commons for Research, Technology, and Collaboration [1]).

HangingTogether: I Come Neither to Praise Nor Bury

Tue, 2015-05-12 22:21

I come to bury Caesar, not to praise him. – Antony, in The Tragedy of Julius Caesar, William Shakespeare

My esteemed colleague Thom Hickey, who knows the MARC format more intimately than I ever will, has penned a defense of that venerable metadata format. He was kind enough to cite a column I wrote in 2002 for Library Journal. But even back then, my opinion had changed, so much so that I wrote a much longer and more thorough piece that laid out the bibliographic future I wished to see. The journal in which it was published thought highly enough of it to award it the paper of the year award. I think my bribe helped.

Thom’s post lays out a pretty compelling use case for MARC, and that’s awesome. Frankly, if MARC wasn’t as good as it was it would not have lasted as long as it has. And let’s be clear, it’s far from dead.

But that is a fairly specific use case, and such specific use cases may still apply long after MARC is replaced with BIBFRAME (which is the intent of the Library of Congress). Or, perhaps, something else yet to be determined.

But I’m more concerned about the broader ecology of library bibliographic data, and how we fit within the even larger ecology of non-library bibliographic data. And there MARC is showing its age. I still think we will likely need to have a fairly complex metadata element set for library work, and a much simplified version for syndicating out in the world. And I think that a very good choice for that much simpler format for syndicating is At least that’s what we’re presently going with.

Meanwhile, we at OCLC will be consuming and offering MARC as well as other formats for some undetermined length of time to come. I come to neither praise nor bury MARC. I come to help create a bibliographic infrastructure that will take us into the future by accommodating many strategies, tools, and formats.

About Roy Tennant

Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.


FOSS4Lib Upcoming Events: Knoxville Fedora Workshop

Tue, 2015-05-12 21:05
Date: Friday, June 26, 2015 - 08:00 to 17:00
Supports: Fedora Repository

Last updated May 12, 2015. Created by Peter Murray on May 12, 2015.

From the announcement:

Join us in beautiful Knoxville, Tennessee for an all-day workshop on Fedora, the open source digital content repository system.


The workshop will occur from 9 AM to 5 PM on Friday, June 26, with a break for lunch.

Library of Congress: The Signal: Nominations Now Open for the 2015 NDSA Innovation Awards

Tue, 2015-05-12 20:38

Elise Depew Strang L’Esperance (1878-1959), Cornell University, shown here in 1951 with her Lasker Clinical Medical Research Award, was a pioneer in cancer treatment for women and had received the award jointly with Catherine Macfarlane. Smithsonian Institution Archives. Image SIA2008-5264

The National Digital Stewardship Alliance Innovation Working Group is proud to open the nominations for the 2015 NDSA Innovation Awards. As a diverse membership group with a shared commitment to digital preservation, the NDSA understands the importance of innovation and risk-taking in developing and supporting a broad range of successful digital preservation activities. These awards are an example of the NDSA’s commitment to encourage and recognize innovation in the digital stewardship community.

This slate of annual awards highlights and commends creative individuals, projects, organizations and future stewards demonstrating originality and excellence in their contributions to the field of digital preservation. The program is administered by a committee drawn from members of the NDSA Innovation Working Group.

Last year’s winners are exemplars of the diversity and collaboration essential to supporting the digital stewardship community as it works to preserve and make available digital materials.

The NDSA Innovation Awards focus on recognizing excellence in one or more of the following areas:

  • Individuals making a significant, innovative contribution to the field of digital preservation;
  • Projects whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship;
  • Organizations taking an innovative approach to providing support and guidance to the digital preservation community;
  • Future stewards, especially students, but including educators, trainers or curricular endeavors, taking a creative approach to advancing knowledge of digital preservation theory and practices.

Acknowledging that innovative digital stewardship can take many forms, eligibility for these awards has been left purposely broad. Nominations are open to anyone or anything that falls into the above categories and any entity can be nominated for one of the four awards. Nominees should be US-based people and projects or collaborative international projects that contain a US-based partner. This is your chance to help us highlight and reward novel, risk-taking and inventive approaches to the challenges of digital preservation.

Nominations are now being accepted and you can submit a nomination using this quick, easy online submission form. You can also submit a nomination by emailing a brief description, justification and the URL and/or contact information of your nominee to ndsa (at)

Nominations will be accepted until Tuesday, June 30 and winners announced in mid-July. Help us recognize and reward innovation in digital stewardship and submit a nomination!