Feed aggregator

Patrick Hochstenbach: Homework assignment #4 Sketchbookskool

planet code4lib - Thu, 2015-05-14 14:26
Filed under: Comics Tagged: cartoon, comic, copic, Photoshop, sketchbook, sketchbookskool

Patrick Hochstenbach: Homework assignment #3 Sketchbookskool

planet code4lib - Thu, 2015-05-14 14:23
Use a child's drawing as a basis and complete the drawing. Filed under: Comics Tagged: brushpen, cartoon, portret, sketch, sketchbook, sketchbookskool, urbansketching

Patrick Hochstenbach: Homework assignment #2 Sketchbookskool

planet code4lib - Thu, 2015-05-14 14:20
Filed under: Doodles Tagged: crosshatching, ipad, ipad paper crosshatching, sketchbook, sketchbookskool, urbansketching

FOSS4Lib Updated Packages: Binder

planet code4lib - Thu, 2015-05-14 14:17

Last updated May 14, 2015. Created by Peter Murray on May 14, 2015.

Binder is an open source digital repository management application, designed
to meet the needs and complex digital preservation requirements of museum
collections. Binder was created by
Artefactual Systems and the
Museum of Modern Art.

Binder aims to facilitate digital collections care, management, and
preservation for time-based media and born-digital artworks and is built
from integrating functionality of the
Archivematica and
AtoM projects.

A presentation on Binder's functionality (Binder was formerly known as the
DRMC during development) can be found here:

Slides from a presentation at Code4LibBC 2014, including screenshots from the
application, can be found here:

Further resources

Package Type: Archival Record Manager and Editor
License: GPLv3
Development Status: In Development
Operating System: Browser/Cross-Platform
Technologies Used: XSLT
Programming Language: PHP
Database: MySQL
Works well with: Archivematica

District Dispatch: U.S. House poised to pass real reforms to USA PATRIOT Act

planet code4lib - Wed, 2015-05-13 19:41

[FBI, child, library bookdrop], June 25, 2002. Brush and ink and opaque white over pink pencil on bristol board. Prints and Photographs Division, Library of Congress, LC-DIG-ppmsca-04691; LC-USZ62-134267. Courtesy of Tribune Media Services (31)

Section 215 of the USA PATRIOT Act became, and remains, known as the “library provision” of that law because of intense and ongoing librarian opposition to the sweeping power it grants the government to compel libraries, without a probable cause-based search warrant, to divulge personal patron reading and internet usage records, and to the “gag orders” associated with Section 215 and “National Security Letters” (NSLs) that impede judicial and public oversight of such activity.

Tonight, the House of Representatives will vote on the USA FREEDOM Act of 2015, H.R. 2048 to finally ban the “bulk collection” of Americans’ personal communications records (library, telephone and otherwise) under Section 215. Critically, it also would preclude the use of other surveillance laws (related to “PEN registers”) and NSLs to get around that prohibition and would bring the “gag order” provisions of the USA PATRIOT Act into compliance with the First Amendment by permitting them to be meaningfully challenged in court.
The bill, not incidentally, also permits phone and internet companies to publish information (in a sufficiently specific form to be useful) about the number of requests they receive from the government to produce personal subscriber information.  It also, for the first time, would create opportunities for specially cleared civil liberties advocates to appear before the secret Foreign Intelligence Surveillance Act (FISA) court that authorizes surveillance activities.  The bill also makes important “first step” reforms to privacy-hostile provisions, including Section 702, of the FISA Amendments Act.

ALA and its many public and private sector coalition partners strongly support passage of H.R. 2048.  That message was underscored by the more than 400 librarian lobbyists who took to Capitol Hill on May 5, during the American Library Association’s (ALA) National Library Legislative Day.  They carried with them a stirring and emphatic OpEd urging real reform entitled “Long Lines for Freedom” by ALA President Courtney Young, which was published that morning in The Hill, a Congress-centric newspaper widely read by Members of Congress, their staffs and the national press.

While House passage of the USA FREEDOM Act is widely expected, its fate in the Senate is uncertain at best. Stay tuned for more on how you can help!

The post U.S. House poised to pass real reforms to USA PATRIOT Act appeared first on District Dispatch.

LITA: Jobs in Information Technology: May 13

planet code4lib - Wed, 2015-05-13 19:38

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Automated Services Manager, Fairbanks, AK

Emerging Technologies/Learning Technologies Librarian, Queensborough Community College, Bayside, NY

Web Site Designer / Developer, Senior, University of Arizona, Tucson, AZ

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Nicole Engard: Building Robots in Pasco County Library

planet code4lib - Wed, 2015-05-13 19:13

Today I got to attend a talk by the Pasco County Library system at the Florida Library Association conference on how they are building robots in the library. They work with a non-profit called First that helps get kids excited about STEM. Pasco is the only public library in the US doing this and has named their team Edgar Allan Ohms.

It’s important to not be scared of this. You don’t have to be an engineer to participate in this program, it’s about more than robot building. The students build these robots, compete with them and then can apply for scholarships through First. The students run the entire program. They build the website, design the logos and signs, build the robots, etc.

How did Pasco do this? They converted a space in their library to a makerspace with outlets, tools and even non-robotic equipment like sewing machines and AutoCAD workstations. Of course they are trying to do this as cheaply and quickly as possible – this too teaches the kids how to use ‘found’ items to make these things happen. Another skill they’re teaching the kids is how to sell themselves, how to fund-raise, and how to talk to people to get funding and promotion. It’s so much more than kids just sitting around playing games all day – they are learning real life skills.

How do librarians (with no engineering background) do this? You go out in to your community and find people who want to help out! They are using family members, community members, library fans, and local businesses to help provide tools, supplies and services. People know about First and so everyone wants to help. In some cases people will come to you and offer to help if they hear about what you’re doing. If you can’t find anyone yourself First will help you.

They start each August, and this year there are so many interested kids that they will be interviewing them to find those who will commit. They attend workshops weekly and bi-weekly from August through December to talk about the rules and plans. In January they go into competition mode – this is when they start to build the robot. This video shows the rules that the team had to use last year in order to build their robot – it was shown to all teams at the exact same time, and from then on they spend the next six weeks building.

They need to start with some planning based on the rules in the video. The kids will start designing on CAD, testing it in modeling software online, and go from there to building something that will run.

Everything these groups are doing is open and shared. This means that the kids are learning job skills not just in engineering but in marketing, writing and more. The groups that will be competing go out on scouting missions where they see what other groups have done and learn from them.

So, if you want to do this in your library, how do you get funding and approval from your lawyers? First off, explain that you will get some funding from the program itself. Next, show that this program is going to help community members by offering scholarships to the kids, teaching them real skills and bringing the kids out into the community. Think about it this way – how much does a high school pay for a football team? For a fraction of that you can bring together 25 kids and teach them a skill for life, whereas most of the kids who play football in high school don’t end up in the NFL. For the lawyers, the library basically said that this is a valuable program and went to bat to get it approved. In the end the lawyers wrote up a disclaimer that all the kids have to sign in order to participate.

This is the kind of program that more libraries should be offering to encourage kids to learn about STEM and bring library awareness to the entire community – our libraries are about so much more than books and DVDs and this is a great way to show that.

The post Building Robots in Pasco County Library appeared first on What I Learned Today....

Related posts:

  1. Keynote: Licensing Models and Building an Open Source Community
  2. How To Get More Kids To Code
  3. SxSW: Building the Open Source Society

HangingTogether: Shift to Linked Data for production

planet code4lib - Wed, 2015-05-13 18:54


That was the topic discussed several times recently by OCLC Research Library Partners metadata managers, initiated by Philip Schreur of Stanford, who is also involved in the Linked Data for Libraries (LD4L) project.  Linked data may well be the next common infrastructure both for communicating library data and for embedding it into the fabric of the semantic web. A number of different models have been developed: the Digital Public Library of America’s Metadata Application Profile, BIBFRAME, etc. Much of a research library’s routine production is tied directly to its local system and makes use of MARC for internal and external data communication.  Linked data offers an opportunity to go beyond the library domain and authority files to draw on information about entities from diverse sources.

Publishing metadata for digital collections as linked data directly, bypassing MARC record conversion, may offer more flexibility and accuracy. (An example of losing information when converting from one format to another is documented in Jean Godby’s 2012 report, A Crosswalk from ONIX 3.0 for Books to MARC 21.) Stanford is pulling together information about faculty members and publications in a way that they could never do without utilizing linked data.

Some of the issues raised in the focus group discussions included:

Critical components in linked data that could be started now: Including persistent identifiers in the MARC bibliographic and authority records created now will help in transitioning to a future linked data environment. The entities are more clearly identified in authority records than in bibliographic records where it’s not always clear which elements represent a work versus an expression of a work. OCLC is already adding FAST identifiers in the $0 subfield (the authority control number or standard number) in the subject fields of WorldCat records. The British Library expects to launch a pilot this summer to match the LC/NACO authority file against the ISNI database and add ISNI identifiers to the authority record’s 024 field. Adding $4 role codes in personal name added entries will help establish relationships among name entities in the future. Creating identifiers for entities that do not yet have them will build a larger pool of data to help disambiguate them later. The community could also consider a wider range of authorities beyond the LC/NACO authority file for re-using existing identifiers (e.g., VIAF, ISNI and identifiers in other national authority files) and “get us into the habit”.
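As a concrete (if simplified) illustration of "including persistent identifiers now", here is a Python sketch that appends a $0 subfield to a subject field. The record structure is a bare-bones stand-in for a real MARC library, and the URI is invented for illustration, not a real FAST lookup:

```python
# Simplified sketch: a MARC field modeled as (tag, list of
# (subfield code, value) pairs). add_identifier appends a $0
# (authority record control number or standard number) subfield,
# mirroring the FAST-identifier practice described above.

def add_identifier(field, uri):
    tag, subfields = field
    if any(code == "0" for code, _ in subfields):
        return field  # already carries an identifier; leave it alone
    return (tag, subfields + [("0", uri)])

subject = ("650", [("a", "Digital preservation")])
enriched = add_identifier(subject, "http://example.org/fast/0000000")
print(enriched)
```

The idempotence check matters in practice: records pass through many hands, and a re-run enrichment should not stack duplicate identifiers.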

Provenance:  How to resolve or reconcile conflicts between statements? We will likely see different types of inconsistencies than we see now with, for example, different birthdates. OCLC has been looking at the work of Google researchers on a “knowledge graph” (the basis of knowledge cards). As Google harvests the Web, it comes across incorrect or conflicting statements. Researchers have documented using algorithms based on frequency and the source of links to come up with a “confidence measure”.  (Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion.) Aggregations such as WorldCat, VIAF and Wikidata may allow the library community to view statements from these sources with more confidence than others.
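In the spirit of the Knowledge Vault work cited above, a toy sketch of a frequency-and-source-based "confidence measure": each source asserts a value for the same statement (say, an author's birth year), sources carry a prior trust weight, and the value with the highest combined weight wins. The source names and weights here are invented for illustration, not drawn from any real system.

```python
from collections import defaultdict

def fuse(statements, source_trust):
    """Pick the best-supported value among conflicting statements.
    statements: list of (source, value); source_trust: prior weights."""
    scores = defaultdict(float)
    for source, value in statements:
        scores[value] += source_trust.get(source, 0.1)  # unknown sources get a small default
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total  # value plus a 0..1 confidence

trust = {"viaf": 0.9, "wikidata": 0.8, "scraped-web": 0.2}
statements = [("viaf", "1898"), ("wikidata", "1898"), ("scraped-web", "1889")]
value, confidence = fuse(statements, trust)
print(value, round(confidence, 2))
```

Two well-trusted aggregations agreeing on "1898" easily outweigh one low-trust web scrape asserting "1889" – which is exactly the intuition behind trusting WorldCat, VIAF or Wikidata statements more than arbitrary ones.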

Importance of holdings data in a linked data environment: Metadata managers see the need to communicate both the availability and eligibility of the resource being described. A W3C document, Holdings via Offer, recommends mappings from bibliographic holdings data to schema.org.
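A rough sketch of what such a mapping can look like in schema.org JSON-LD: a holding expressed as an Offer that communicates both availability and who may borrow the item. The property choices and values below are illustrative in the spirit of that document, not quotations from it.

```json
{
  "@context": "http://schema.org",
  "@type": "Book",
  "name": "Example title",
  "offers": {
    "@type": "Offer",
    "offeredBy": {
      "@type": "Library",
      "name": "Example University Library"
    },
    "availability": "http://schema.org/InStock",
    "eligibleCustomerType": "http://example.org/vocab/CurrentStudentsAndStaff"
  }
}
```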

Impact on workflow:  In the next phase of the Linked Data for Libraries project, six libraries (Columbia, Cornell, Harvard, Stanford, Princeton and the Library of Congress) hope to figure out how to use linked data in production using BIBFRAME. They will be looking at how to link into acquisitions and circulation as well as cataloging workflows, and hope to collaborate with cataloging and local system vendors. Metadata managers noted it’s important to collaborate with the book vendors that supply them with MARC records now – even if they cannot generate linked data themselves, perhaps they could enhance MARC records so that transforming them into BIBFRAME is cleaner. Linked data may also encourage more sharing of metadata via statements rather than copy-cataloging a record that is then maintained as a local copy that is not shared with others.


  • During this transition period the environment and standards are a moving target.
  • It’s unclear how libraries will share “statements” rather than records in a linked data environment.
  • How to involve the many vendors that supply or process MARC records now? Working with others in the linked data environment involves people unfamiliar with the library world, requiring metadata specialists to explain their needs in terms non-librarians can understand.
  • Differing interpretations of what is a “work” may hamper the ability to re-use data created elsewhere.

Success metrics: Moving into a production linked data environment will take time, and each institution may well have a different timetable. Discussions indicated that linked data experiments could be considered successful if:

  • The data is more integrated than it is now.
  • Data created by different workflows are interoperable.
  • Libraries can offer users new, valued services that current data models can’t support.
  • The resource descriptions are more machine-actionable than current standards.
  • Outside parties use library resource descriptions more.
  • The data is better and richer because more parties share in its creation.


Graphic: Partial view of Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak.

About Karen Smith-Yoshimura

Karen Smith-Yoshimura, program officer, works on topics related to renovating descriptive and organizing practices with a focus on large research libraries and area studies requirements.

Mail | Web | Twitter | More Posts (58)

Open Knowledge Foundation: Announcing the new open data handbook

planet code4lib - Wed, 2015-05-13 13:29

We are thrilled to announce that the Open Data Handbook, the premier guide for open data newcomers and veterans alike, has received a much-needed update! The Open Data Handbook, originally published in 2012, has become the go-to resource for the open data community. It was written by expert members of the open data community and has been translated into over 18 languages. Read it now »

The Open Data Handbook elaborates on the what, why & how of open data. In other words – what data should be open, what are the social and economic benefits of opening that data, and how to make effective use of it once it is opened.

The handbook is targeted at a broad audience, including civil servants, journalists, activists, developers, and researchers as well as open data publishers. Our aim is to ensure open data is widely available and applied in as many contexts as possible, and we welcome your efforts to grow the open knowledge movement in this way!

The idea of open data is really catching on and we have learned many important lessons over the past three years. We believe it is time for the Open Data Handbook to reflect those lessons. The revised Open Data Handbook has a number of new features and plenty of ways to contribute your experience and knowledge, so please do!

Inspire Open Data Newcomers

The original open data guide discussed the theoretical reasons for opening up data – increasing transparency and accountability of government, improving public and commercial services, stimulating innovation, etc. We have now reached a point where we can go beyond theoretical arguments: we have real stories that document the benefits open data has on our lives. The Open Data Value Stories are use cases from across the open knowledge network that highlight the social and economic value and the varied applications of open data in the world.

This is by no means an exhaustive list; in fact, it is just the beginning! If you have an open data value story that you would like to contribute, please get in touch.

Learn How to Publish & Use Open Data

The Open Data Guide remains the premier open data how-to resource and in the coming months we will be adding new sections and features! For the time being, we have moved the guide to Github to streamline contributions and facilitate translation. We will be reaching out to the community shortly to determine what new content we should be prioritising.

In 2012, when we originally published the open data guide, the open data community was still emerging and resources remained scarce. Today the global open data community is mature, international and diverse, and resources now exist that reflect this maturity and diversity. The Open Data Resource Library is a curated collection of resources, including articles, longer publications, how-to guides, presentations and videos, produced by the global open data community – now available all in one place! If you want to contribute a resource, you can do so here! We are particularly interested in expanding the number of resources we have in languages other than English, so please add them if you have them!

Finally, as we are probably all aware, the open data community likes its jargon! While the original open data guide had a glossary of terms, it was far from exhaustive — especially for newcomers to the open data movement. In the updated version we have added over 80 new terms and concepts with easy to understand definitions! Have we missed something out? Let us know what we are missing here.

The updated Open Data Handbook is a living resource! In the coming months, we will be adding new sections to the Open Data Guide and producing countless more value stories! We invite you to contribute your stories, your resources and your ideas! Thank you for your contributions past, present and future and your continued efforts in pushing this movement forward.

SearchHub: Query Autofiltering Revisited – Let’s be more precise!

planet code4lib - Wed, 2015-05-13 10:58
In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find. A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this.

A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep into a result set because they share some words with the best answers. Search applications with well-tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words.

The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns. The traditional way to do this is to use Natural Language Processing or NLP to parse the user query. This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario – which goes back to the early AI days – has become much more important now that we can talk to our cell phones.
For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases”, consisting of a word that means a person, place or thing (what or who they want to find, or where) – e.g. “restaurant” and “Cleveland” – and some words that add precision to the query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator: OR when the user’s intent should be interpreted as AND.

It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search. The goal of query autofiltering is to use that built-in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (say, simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this.

The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query match field values in the search index (using a cool Lucene feature that enables us to mine the “uninverted” index for all of the values that were indexed in a field).
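That single-token approach can be sketched in a few lines of Python. The field values below are invented for illustration, standing in for what the uninverted index would supply; the sketch shows both the mechanism and the weakness with phrases that the rest of this post addresses:

```python
# Hypothetical sketch of the "Simple" Category Extraction idea: each
# query token is looked up, in isolation, against the set of values
# indexed in each field (a stand-in for Lucene's uninverted index).

index_values = {
    "color": {"red", "blue", "green"},
    "product_type": {"pizza", "sofa"},
    "brand": {"red baron"},  # a phrase: invisible to token-at-a-time lookup
}

def simple_extract(query):
    filters = []
    for token in query.lower().split():
        for field, values in index_values.items():
            if token in values:
                filters.append((field, token))
    return filters

# Works when "red" is a qualifier: color:red + product_type:sofa
print(simple_extract("red sofa"))
# Misfires on a brand: "red" is still (wrongly) treated as a color
print(simple_extract("red baron pizza"))
```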
For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it will infer that the user is looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name like “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better; maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …).

It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens in isolation from the tokens around them. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP, which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata.

Searching vs Navigating

A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links.
In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators – an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits, but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “mark up” their query with quotation marks or boolean expressions or ‘+/-’ signs, but trust me – they won’t do that either (typically, that is).

What this means is that the little search box – love it or hate it – is our main entry point. We have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will.)

A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching

In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use <field>:<value> syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with, but not something that we can expect of the average user. However, string fields are very good for faceted navigation.
Text fields, on the other hand, are analyzed (tokenized and filtered) and can be searched with “free-text” queries – our little box, in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words), and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case the facets are based on individual tokens, which don’t tend to be too useful. So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are, but we typically don’t in the easier-to-search type. A classic trade-off scenario.

Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender. But the information is still somewhere in the search index; we just need a way to get it back at “query time”. Then we can both have our cake AND eat it!

Noun Phrases and the Hierarchy of Meta Information

When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs.
There are, however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth. Within some of these business types, like restaurant, there may be refining information like restaurant type (“Mexican”, “Chinese”, “Italian”, etc.) or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types.

Rebuilding Context: Identifying field name patterns to find relevant phrase patterns

So now it’s time to put Humpty Dumpty back together again. With Solr/Lucene, it is likely that the information we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question, such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter, for example.
The Query AutoFiltering Search Component

To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, or return the set of terms that are indexed in a field or the term vectors (term frequency statistics), among other things. The SearchComponent interface defines a number of methods, one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list, it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud).

The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UninvertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair, and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer.
Assuming that we have indexed the following records:

Record    color    product_type         brand
1         red      shoes
2         red      socks
3         brown    socks
4         green    socks                red lion
5         blue     socks                red lion
6         blue     socks                red dragon
7                  pizza                red baron
8                  whiskey              red label
9                  smoke detector       red light
10                 yeast                red star
11                 red wine             gallo
12                 red wine vinegar     heinz
13                 red grapes           dole
14                 red brick            acme
15                 red pepper           dole
16                 red pepper flakes    mccormick

This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the out-of-the-box Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilteringComponent, only 2 results (records 4 and 5) are returned for this query. Furthermore, searching for “red wine” will only bring back one record (11), whereas searching for “red wine vinegar” brings back just record 12.

What the filter does is match terms with fields, trying to find the longest contiguous phrases that match mapped field values. So for the query “red lion socks”, it will first discover that “red” is a color, but then it will discover that “red lion” is a brand, and this will supersede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type, then “red wine vinegar” == product_type, and the final match wins because it is the longest contiguous match. It works across fields too. If the query is “blue red lion socks”, it will discover that “blue” is a color, then that “blue red” is nothing, so it will move on to the next unmatched token – “red”.
It will then, as before, discover that “red lion” is a brand, reject “red lion socks”, which doesn’t map to anything, and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query:

q=blue red lion socks

to a filter query:

fq=color:blue&fq=brand:"red lion"&fq=product_type:socks

This final query brings back just 1 result, as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%!

Adding case sensitivity and synonym support

One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural), as well as support for synonyms, was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat, but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at the application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that guides users toward precise and actionable queries. I hope to have a post on this in the near future.

Source Code

For those who are interested in how the autofiltering component works or would like to use it in a search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it).
The source code on github is in two versions: one that compiles and runs with Solr 4.x, and another that uses the new UninvertingReader API that must be used in Solr 5.0 and above.

Conclusions

The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns free-form queries into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the boost query (bq) rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers, whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter, “boostFactor”, which, when set, will cause it to operate in boost mode, so that records with exact matches in identified fields are boosted over records with random or partial token hits.
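Putting the pieces of this post together, here is a minimal, self-contained sketch of the filter- and boost-mode translations described above, using a hand-built inverse map (the real component derives its map from the Lucene index and also handles analysis, synonyms, and distributed mode; the boost factor of 10 is an arbitrary illustrative value):

```python
# Toy inverse map of term value -> index field, mirroring the example records.
VALUE_TO_FIELD = {
    "blue": "color", "red": "color", "green": "color", "brown": "color",
    "red lion": "brand", "red dragon": "brand",
    "socks": "product_type", "shoes": "product_type",
    "red wine": "product_type", "red wine vinegar": "product_type",
}

def find_matches(query):
    """Longest-contiguous-match scan over the query tokens."""
    tokens = query.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        best = None
        for j in range(i + 1, len(tokens) + 1):
            phrase = " ".join(tokens[i:j])
            if phrase in VALUE_TO_FIELD:
                best = (VALUE_TO_FIELD[phrase], phrase, j)  # longest hit wins
        if best:
            matches.append(best[:2])
            i = best[2]
        else:
            i += 1  # token maps to nothing; leave it and move on
    return matches

def quote(value):
    return f'"{value}"' if " " in value else value

def as_filter_queries(matches):
    return "&".join(f"fq={field}:{quote(value)}" for field, value in matches)

def as_boost_query(matches, boost_factor=10):
    return "bq=" + " ".join(f"{field}:{quote(value)}^{boost_factor}"
                            for field, value in matches)

m = find_matches("blue red lion socks")
print(as_filter_queries(m))  # fq=color:blue&fq=brand:"red lion"&fq=product_type:socks
print(as_boost_query(m))     # bq=color:blue^10 brand:"red lion"^10 product_type:socks^10
```

The same scan also reproduces the “red wine vinegar” behavior: the longest contiguous match (product_type:"red wine vinegar") supersedes the shorter “red” and “red wine” hits.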

The post Query Autofiltering Revisited – Lets be more precise!!! appeared first on Lucidworks.

LibUX: 018: The Kano Model is Awesome – Really …

planet code4lib - Wed, 2015-05-13 04:49

Okay, so we found it sort of tricky to explain, but the Kano Model really is awesome. In this episode, we try our best to tell you that the Kano Model is a sophisticated tool used to measure the impact of service features on the user experience. It is a way that you and your stakeholders can visualize the weight of a new feature, whether it will produce delight but require a huge investment, or whether that carousel will make you rue the day.

The post 018: The Kano Model is Awesome – Really … appeared first on LibUX.

DuraSpace News: GET READY for Fedora Camp

planet code4lib - Wed, 2015-05-13 00:00

Winchester, MA  Save the dates now! The Fedora Project is pleased to announce that the first Fedora Camp will be offered November 16-18 (Monday-Wednesday) at Duke University (specifically The Edge: The Ruppert Commons for Research, Technology, and Collaboration [1]).

HangingTogether: I Come Neither to Praise Nor Bury

planet code4lib - Tue, 2015-05-12 22:21

I come to bury Caesar, not to praise him. – Antony, in The Tragedy of Julius Caesar, William Shakespeare

My esteemed colleague Thom Hickey, who knows the MARC format more intimately than I ever will, has penned a defense of that venerable metadata format. He was kind enough to cite a column I wrote in 2002 for Library Journal. But even back then, my opinion had changed such that I wrote a much longer and more thorough piece that laid out the bibliographic future I wished to see. The journal in which it was published thought highly enough of it to award it the paper of the year award. I think my bribe helped.

Thom’s post lays out a pretty compelling use case for MARC, and that’s awesome. Frankly, if MARC wasn’t as good as it was it would not have lasted as long as it has. And let’s be clear, it’s far from dead.

But that is a fairly specific use case, and such specific use cases may still apply long after MARC is replaced with BIBFRAME (which is the intent of the Library of Congress). Or, perhaps, something else yet to be determined.

But I’m more concerned about the broader ecology of library bibliographic data, and how we fit within the even larger ecology of non-library bibliographic data. And there MARC is showing its age. I still think we will likely need to have a fairly complex metadata element set for library work, and a much simplified version for syndicating out in the world. And I think that a very good choice for that much simpler format for syndicating is At least that’s what we’re presently going with.

Meanwhile, we at OCLC will be consuming and offering MARC as well as other formats for some undetermined length of time to come. I come to neither praise nor bury MARC. I come to help create a bibliographic infrastructure that will take us into the future by accommodating many strategies, tools, and formats.

About Roy Tennant

Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.


FOSS4Lib Upcoming Events: Knoxville Fedora Workshop

planet code4lib - Tue, 2015-05-12 21:05
Date: Friday, June 26, 2015 - 08:00 to 17:00
Supports: Fedora Repository

Last updated May 12, 2015. Created by Peter Murray on May 12, 2015.

From the announcement:

Join us in beautiful Knoxville, Tennessee for an all-day workshop on Fedora, the open source digital content repository system.


The workshop will occur from 9 AM to 5 PM on Friday, June 26, with a break for lunch.

Library of Congress: The Signal: Nominations Now Open for the 2015 NDSA Innovation Awards

planet code4lib - Tue, 2015-05-12 20:38

Elise Depew Strang L’Esperance (1878-1959), Cornell University, shown here in 1951 with her Lasker Clinical Medical Research Award, was a pioneer in cancer treatment for women and had received the award jointly with Catherine Macfarlane. Smithsonian Institution Archives. Image SIA2008-5264

The National Digital Stewardship Alliance Innovation Working Group is proud to open the nominations for the 2015 NDSA Innovation Awards. As a diverse membership group with a shared commitment to digital preservation, the NDSA understands the importance of innovation and risk-taking in developing and supporting a broad range of successful digital preservation activities. These awards are an example of the NDSA’s commitment to encourage and recognize innovation in the digital stewardship community.

This slate of annual awards highlights and commends creative individuals, projects, organizations and future stewards demonstrating originality and excellence in their contributions to the field of digital preservation. The program is administered by a committee drawn from members of the NDSA Innovation Working Group.

Last year’s winners are exemplars of the diversity and collaboration essential to supporting the digital stewardship community as it works to preserve and make available digital materials.

The NDSA Innovation Awards focus on recognizing excellence in one or more of the following areas:

  • Individuals making a significant, innovative contribution to the field of digital preservation;
  • Projects whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship;
  • Organizations taking an innovative approach to providing support and guidance to the digital preservation community;
  • Future stewards, especially students, but including educators, trainers or curricular endeavors, taking a creative approach to advancing knowledge of digital preservation theory and practices.

Acknowledging that innovative digital stewardship can take many forms, eligibility for these awards has been left purposely broad. Nominations are open to anyone or anything that falls into the above categories and any entity can be nominated for one of the four awards. Nominees should be US-based people and projects or collaborative international projects that contain a US-based partner. This is your chance to help us highlight and reward novel, risk-taking and inventive approaches to the challenges of digital preservation.

Nominations are now being accepted and you can submit a nomination using this quick, easy online submission form. You can also submit a nomination by emailing a brief description, justification and the URL and/or contact information of your nominee to ndsa (at)

Nominations will be accepted until Tuesday, June 30 and winners announced in mid-July. Help us recognize and reward innovation in digital stewardship and submit a nomination!

SearchHub: Rule-Based Replica Assignment for SolrCloud

planet code4lib - Tue, 2015-05-12 17:43

When Solr needs to assign nodes to collections, it can either assign them automatically and randomly, or the user can specify a set of nodes where it should create the replicas. With very large clusters, it is hard to specify exact node names, and doing so still does not give you fine-grained control over how nodes are chosen for a shard. The user should be in complete control of where the nodes are allocated for each collection, shard and replica. This helps to optimally allocate hardware resources across the cluster.

Rule-based replica assignment is a new feature coming to Solr 5.2 that allows the creation of rules to determine the placement of replicas in the cluster. In the future, this feature will help to automatically add/remove replicas when systems go down or when higher throughput is required. This enables a more hands-off approach to administration of the cluster.

This feature can be used in the following instances:

  • Collection creation
  • Shard creation
  • Shard splitting
  • Replica creation
Common use cases
  • Don’t assign more than 1 replica of this collection to a host
  • Assign all replicas to nodes with more than 100GB of free disk space, or assign replicas to the nodes with the most free disk space
  • Do not assign any replica on a given host because I want to run an overseer there
  • Assign only one replica of a shard in a rack
  • Assign replicas to nodes hosting fewer than 5 cores, or to the nodes hosting the fewest cores
What is a rule?

A rule is a set of conditions that a node must satisfy before a replica core can be created there. A rule consists of three conditions:
  • shard – the name of a shard, or a wildcard (* means for each shard). If shard is not specified, the rule applies to the entire collection
  • replica – a number, or a wildcard (* means any number from zero to infinity)
  • tag – an attribute of a node in the cluster that can be used in a rule, e.g. “freedisk”, “cores”, “rack”, “dc”, etc. The tag name can be a custom string. If creating a custom tag, a Solr plugin called a snitch is responsible for providing tags and values
Operators

A condition can have one of four operators:
  • equals (no operator required): tag:x means the tag value must be equal to ‘x’
  • greater than (>): tag:>x means the tag value must be greater than ‘x’; x must be a number
  • less than (<): tag:<x means the tag value must be less than ‘x’; x must be a number
  • not equal (!): tag:!x means the tag value must NOT be equal to ‘x’; the equals check is performed on the String value
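As a rough illustration (not Solr’s actual Java implementation), the four operators could be parsed and checked against a node’s tag values like this:

```python
def parse_condition(cond):
    """Split a 'tag:value' condition and detect its operator."""
    tag, value = cond.split(":", 1)
    if value.startswith(">"):
        return tag, "gt", float(value[1:])
    if value.startswith("<"):
        return tag, "lt", float(value[1:])
    if value.startswith("!"):
        return tag, "ne", value[1:]
    return tag, "eq", value  # equals needs no operator

def node_satisfies(node_tags, cond):
    """True if a node's tag value satisfies a single condition."""
    tag, op, expected = parse_condition(cond)
    actual = node_tags.get(tag)
    if op == "gt":
        return float(actual) > expected
    if op == "lt":
        return float(actual) < expected
    if op == "ne":
        return str(actual) != expected  # not-equal compares String values
    return str(actual) == expected      # and so does equals

node = {"freedisk": "150", "cores": "3", "rack": "730"}
print(node_satisfies(node, "freedisk:>100"))  # True
print(node_satisfies(node, "cores:<5"))       # True
print(node_satisfies(node, "rack:!730"))      # False
```

A full rule would simply be a set of such conditions that a candidate node must satisfy together.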
Examples

Example 1: keep less than 2 replicas (at most 1 replica) of this collection on any node

replica:<2,node:*

Example 2: for a given shard, keep less than 2 replicas on any node

shard:*,replica:<2,node:*

Example 3: assign all replicas in shard1 to rack 730

shard:shard1,replica:*,rack:730

The default value of replica is *, so it can be omitted and the rule can be reduced to:

shard:shard1,rack:730

Note: this means that there should be a snitch which provides values for the tag ‘rack’.

Example 4: create replicas only on nodes with less than 5 cores

replica:*,cores:<5

or, simplified:

cores:<5

Example 5: do not create any replicas in host

host:!

Fuzzy operator (~)

This can be used as a suffix to any condition. It first tries to satisfy the rule strictly; if Solr can’t find enough nodes to match the criterion, it tries to find the next best match, which may not satisfy the criterion. For example, freedisk:>200~ tries to assign replicas of this collection to nodes with more than 200GB of free disk space, and if that is not possible, chooses the node with the most free disk space.

Choosing among equals

The nodes are sorted first, and the rules are used to sort them. This ensures that even if many nodes match the rules, the best nodes are picked for assignment. For example, if there is a rule that says freedisk:>20, nodes are sorted on free disk space descending and the node with the most disk space is picked first. If the rule is cores:<5, nodes are sorted by number of cores ascending and the node with the fewest cores is picked first.

Snitch

Tag values come from a plugin called a Snitch. If there is a tag called ‘rack’ in a rule, there must be a Snitch which provides the value of ‘rack’ for each node in the cluster. A snitch implements the Snitch interface. Solr, by default, provides a default snitch which provides the following tags:
  • cores: the number of cores on the node
  • freedisk: disk space available on the node
  • host: host name of the node
  • node: node name
  • sysprop.{PROPERTY_NAME} : These are values available from system properties. sysprop.key means a value that is passed to the node as -Dkey=keyValue during the node startup. It is possible to use rules like sysprop.key:expectedVal,shard:*
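The “choosing among equals” ordering described above can be sketched as follows (a toy illustration with invented node names; the actual sorting lives in Solr’s rule engine):

```python
# Candidate nodes with their snitch-provided tag values (invented data).
nodes = [
    {"name": "node1", "freedisk": 80,  "cores": 6},
    {"name": "node2", "freedisk": 250, "cores": 2},
    {"name": "node3", "freedisk": 120, "cores": 4},
]

def order_for_rule(nodes, tag, op):
    # ">" rules sort descending (most first); "<" rules sort ascending (least first).
    return sorted(nodes, key=lambda n: n[tag], reverse=(op == ">"))

print([n["name"] for n in order_for_rule(nodes, "freedisk", ">")])
# ['node2', 'node3', 'node1']
print([n["name"] for n in order_for_rule(nodes, "cores", "<")])
# ['node2', 'node3', 'node1']
```

So a freedisk:>20 rule picks the node with the most free disk space first, and a cores:<5 rule picks the node with the fewest cores first, exactly as described above.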
How are Snitches configured?

It is possible to use one or more snitches for a set of rules. If the rules only need tags from the default snitch, it need not be explicitly configured. Example:

snitch=class:fqn.ClassName,key1:val1,key2:val2,key3:val3

How does the system collect tag values?
  1. Identify the set of tags used in the rules
  2. Create instances of the Snitches specified; the default snitch is always created
  3. Ask each snitch if it can provide values for any of the tags; if even one tag does not have a snitch, the assignment fails
  4. After identifying the snitches, ask them to provide the tag values for each node in the cluster
  5. If the value for a tag cannot be obtained for a given node, that node cannot participate in the assignment
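The collection steps above can be sketched as follows (snitches are modeled as plain dicts for illustration; in Solr a snitch is a Java plugin implementing the Snitch interface):

```python
def collect_tags(rule_tags, snitches, nodes):
    # Step 1 is assumed done: rule_tags is the set of tags used in the rules.
    coverage = {}
    for tag in rule_tags:
        # Step 3: ask each snitch whether it can provide this tag.
        provider = next((s for s in snitches if tag in s["provides"]), None)
        if provider is None:
            raise ValueError(f"no snitch provides tag {tag!r}: assignment fails")
        coverage[tag] = provider
    # Step 4: gather the tag values for each node in the cluster.
    return {
        node: {tag: coverage[tag]["values"][node][tag] for tag in rule_tags}
        for node in nodes
    }

# Step 2: the default snitch is always created (modeled here as a dict).
default_snitch = {
    "provides": {"cores", "freedisk", "host", "node"},
    "values": {"node1": {"cores": 3, "freedisk": 120}},
}
print(collect_tags(("cores", "freedisk"), [default_snitch], ["node1"]))
# {'node1': {'cores': 3, 'freedisk': 120}}
```

Asking for a tag no snitch provides (say, "rack" with only the default snitch configured) fails the assignment, matching step 3.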
How to configure rules?

Rules are specified per collection, as request parameters during collection creation. It is possible to specify multiple ‘rule’ and ‘snitch’ params, as in this example:

snitch=class:EC2Snitch&rule=shard:*,replica:1,dc:dc1&rule=shard:*,replica:<2,dc:dc3

These rules are persisted in the cluster state in ZooKeeper and are available throughout the lifetime of the collection. This enables the system to perform any future node allocation without direct user interaction.
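For illustration, the request parameters for the example above could be assembled with Python’s standard library (the collection name "mycollection" is invented; this only builds the URL-encoded query string and makes no call to a Solr server):

```python
from urllib.parse import urlencode

# A sequence of pairs (rather than a dict) lets the 'rule' key repeat.
params = [
    ("action", "CREATE"),
    ("name", "mycollection"),
    ("snitch", "class:EC2Snitch"),
    ("rule", "shard:*,replica:1,dc:dc1"),
    ("rule", "shard:*,replica:<2,dc:dc3"),
]
query = urlencode(params)
print(query)  # special characters like ':' and '<' come out percent-encoded
```

The resulting string would be appended to the Collections API endpoint when creating the collection.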

The post Rule-Based Replica Assignment for SolrCloud appeared first on Lucidworks.

District Dispatch: Some debates don’t have E-asy winners

planet code4lib - Tue, 2015-05-12 15:24

A major focus of ALA’s Office for Information Technology Policy (OITP) of late has been recasting the common perception of libraries among many decision makers and influencers to reflect current reality. To this end, OITP has started a Policy Revolution!—the punchy name of our most recent grant project—through which we aim to increase our community’s visibility and capacity for engagement in national policymaking. To bring the “Revolution” to fruition, we are producing a national public policy agenda for libraries, building the capacity of library advocates to communicate effectively with beltway decision makers, and devising new strategies for library advocacy at the national level. As part of these coordinated efforts, OITP (with Senior Counsel Alan Fishel) coined The E’s of Libraries trademark—a pithy shorthand for what today’s libraries do for people: Education, Employment, Entrepreneurship, Empowerment, and Engagement.

Since we began the Policy Revolution! initiative, we’ve been soliciting a wide range of perspectives on the services modern libraries provide—so ALA was eager to help moderate a “tri-bate” on “The E’s” between local area high school students last Tuesday at the Washington, D.C.-based law firm Arent Fox. Participants in the tri-bate were assigned to one of three teams, each of which represented a particular component of the E’s. Each team was asked to argue that their component represented the area in which libraries provide the most benefit to the public: Side 1—Employment and Entrepreneurship; Side 2—Education; Side 3—Engagement and Empowerment.

Tri-baters included Penelope Blackwell, Crystal Carter, Diamond Green, Taylor McDonald, Zinquarn Wright and Devin Wingfield of McKinley Technology High School; and Amari Hemphill, Lauren Terry, Layonna Mathis, Jacques Doby, Davon Thomas, Malik Morris and David Johnson of Eastern Senior High School. OITP’s Larra Clark and Marijke Visser, and ALA Executive Director Keith Michael Fiels comprised the panel of judges.

The discussion was spirited, with each team demonstrating clear strengths. The Employment and Entrepreneurship teams had a strong command of library statistics, citing data from the Institute of Museum and Library Services (IMLS) and the ALA/University of Maryland Digital Inclusion Survey (e.g., 75% of libraries provide assistance with preparing resumes, and 96% offer online job and employment resources). They made a strong case for libraries not just as employment hubs, but also as trusted workforce development centers, where people of all ages can build the skills and competencies needed to be competitive in the digital age. Their arguments were particularly apropos, given ALA’s ongoing efforts to ensure libraries are recognized as eligible participants in the skills training activities recently authorized by the Workforce Investment and Opportunity Act (WIOA).

The Education teams did a strong job of describing libraries as safe spaces that support learning within and beyond school walls. They shared a clear understanding that libraries not only provide opportunities for K-12 students, but also to non-traditional students seeking to gain skills and credentials critical for participation in today’s economy. Perhaps the most impressive aspect of one team’s performance was their decision to describe education as the foundation that undergirds every other aspect of life in which libraries provide assistance: “How can you create a resume if you can’t read or write,” the team asked in their rebuttal, providing one of the lines of the day.

The Engagement and Empowerment teams also turned in impressive performances. Despite their formidable task of describing library involvement in two hard-to-define areas, the team met the challenge by depicting libraries as places where people have the freedom and the resources to pursue individual passions and interests. They also displayed a strong understanding of the modern library as a one-stop community hub, explaining that libraries of all kinds are secure spaces that keep young people on the path to productivity, and provide all people the opportunity to participate in society.

As impressed as we were by the students’ firm grasp of the resources and services today’s libraries provide, the day was fundamentally not about gauging their ability to articulate what our community does on a national scale. It was rather about gaining their personal perspectives on the strengths and challenges of library service, and their expectations for what libraries should do to meet the needs of communities of all kinds.

The discussions the judges had with the students following the conclusion of the tri-bate were particularly informative in this regard. Several students suggested that libraries should find new ways to engage young people, which we at OITP particularly appreciate, given our ongoing work to build a new program on children and youth. Students called for librarians to connect with them at recreation centers and other non-library spaces to raise awareness and connect library services in new (and fun!) ways. Others suggested that library professionals should continue and even enhance their focus on providing instruction in digital technologies and basic computer and internet skills.

We found the students’ perspectives and input invaluable, and we look forward to using it to inform our continued work to raise awareness of all today’s libraries do for the public, and to increase the library community’s profile in national public policy debates. We want to thank Arent Fox for hosting the day’s session, and—most importantly—all of the students who participated for an invigorating and informative discussion. Great job to all!

The post Some debates don’t have E-asy winners appeared first on District Dispatch.

David Rosenthal: Potemkin Open Access Policies

planet code4lib - Tue, 2015-05-12 15:00
Last September Cameron Neylon had an important post entitled Policy Design and Implementation Monitoring for Open Access that started:
We know that those Open Access policies that work are the ones that have teeth. Both institutional and funder policies work better when tied to reporting requirements. The success of the University of Liege in filling its repository is in large part due to the fact that works not in the repository do not count for annual reviews. Both the NIH and Wellcome policies have seen substantial jumps in the proportion of articles reaching the repository when grantees’ final payments or ability to apply for new grants was withheld until issues were corrected.

He points out that:
Monitoring Open Access policy implementation requires three main steps. The steps are:
  1. Identify the set of outputs are to be audited for compliance
  2. Identify accessible copies of the outputs at publisher and/or repository sites
  3. Check whether the accessible copies are compliant with the policy
Each of these steps is difficult or impossible in our current data environment. Each of them could be radically improved with some small steps in policy design and metadata provision, alongside the wider release of data on funded outputs.

He makes three important recommendations:
  • Identification of Relevant Outputs: Policy design should include mechanisms for identifying and publicly listing outputs that are subject to the policy. The use of community standard persistable and unique identifiers should be strongly recommended. Further work is needed on creating community mechanisms that identify author affiliations and funding sources across the scholarly literature.
  • Discovery of Accessible Versions: Policy design should express compliance requirements for repositories and journals in terms of metadata standards that enable aggregation and consistent harvesting. The infrastructure to enable this harvesting should be seen as a core part of the public investment in scholarly communications.
  • Auditing Policy Implementation: Policy requirements should be expressed in terms of metadata requirements that allow for automated implementation monitoring. RIOXX and ALI proposals represent a step towards enabling automated auditing but further work, testing and refinement will be required to make this work at scale.
What he is saying is that defining policies that mandate certain aspects of Web-published materials without mandating that they conform to standards that make them enforceable over the Web is futile. This should be a no-brainer. The idea that, at scale, without funding, conformance will be enforced manually is laughable. The idea that researchers will voluntarily comply when they know that there is no effective enforcement is equally laughable.

LITA: LITA Presents Two Webinars on Kids, Technology and Libraries

planet code4lib - Tue, 2015-05-12 13:00

Technology and Youth Services Programs: Early Literacy Apps and More
Wednesday May 20, 2015
1:00 pm – 2:00 pm Central Time


After Hours: Circulating Technology to Improve Kids’ Access
Wednesday May 27, 2015
1:00 pm – 2:00 pm Central Time

Register now for either or both of these exciting and lively webinars

For Technology and Youth Services Programs, join Claire Moore, Head of Children’s Services at Darien Library (CT). In this digital age it has become increasingly important for libraries to infuse technology into their programs and services. Youth services librarians are faced with many technology routes to consider and app options to evaluate and explore. Claire will discuss innovative and effective ways the library can create opportunities for children, parents and caregivers to explore new technologies.

For After Hours: Circulating Technology to Improve Kids’ Access, join Megan Egbert, the Youth Services Manager for the Meridian Library District (ID). For years libraries have been providing access and training to technology through their services and programs. Kids can learn to code, build a robot, and make a movie with an iPad at the library. But what can they do when they get home? The Meridian Library (ID) has chosen to start circulating new types of technology such as Arduinos, Raspberry Pis, robots, iPads and apps. Join Megan to discover benefits, opportunities and best practices.

Register for either one or both of the webinars

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.


LITA Member: $45
Non-Member: $105
Group: $196

Registration Information

Register Online page arranged by session date (login required)
Mail or fax form to ALA Registration
Call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4269 or Mark Beatty,

