Feed aggregator

SearchHub: Parallel Computing in SolrCloud

planet code4lib - Mon, 2016-08-15 16:30

As we count down to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Joel Bernstein’s session about Parallel Computing in SolrCloud.

This presentation provides a deep dive into SolrCloud’s parallel computing capabilities – breaking down the framework into four main areas: shuffling, worker collections, the Streaming API, and Streaming Expressions. The talk describes how each of these technologies works individually and how they interact with each other to provide a general purpose parallel computing framework.

Also included is a discussion of some of the core use cases for the parallel computing framework. Use cases involving real-time map reduce, parallel relational algebra, and streaming analytics will be covered.

Joel Bernstein is a Solr committer and search engineer for the open source ECM company Alfresco.

Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

Islandora: Meet Your New Technical Lead

planet code4lib - Mon, 2016-08-15 13:23

Hi, I'm Danny, and I'm the newest hire of the Islandora Foundation. My role within the Foundation is to serve as Technical Lead, and I want to take some time to introduce myself and inform everyone of just exactly what I'll be doing for them.

I guess for starters, I should delve a bit into my background. Since a very young age, I've always considered myself to be pretty nerdy. As soon as I learned how to read, my father had me in front of a 386 and taught me DOS. In high school, I discovered Linux and was pretty well hooked. It was around that time I started working with HTML, and eventually JavaScript and Flash. I graduated with honors from East Tennessee State University with a B.Sc. in Mathematics and a Physics minor, and was exposed to a lot of C++ and FORTRAN. I specialized in Graph Theory, which I didn't think at the time would lead to a career as a programmer, since I had decided to be an actuary after completing school.

Fast forward a few years, and I have a couple of actuarial exams under my belt and have become well versed in VBA programming and Microsoft Office. But I didn't really like it, and wanted to do more than spreadsheet automation. So I moved to Canada and went back to school for Computer Science, but quickly found my way into the workforce for pragmatic reasons (read: I had bills to pay). I managed to score a job in the small video game industry that's evolved on PEI. I did more Flash (sadly) but was also exposed to web frameworks like Ruby on Rails and Django. A lot of my time was spent writing servers for Facebook games, and I tackled everything from game logic to payment systems. But that career eventually burned me out, as it eventually does to most folks, and I applied for a position at a small company named discoverygarden that I heard was a great place to work.

And that's how I first met Islandora. I was still pretty green for the transition from Drupal 6 to 7, but shortly after the port I was able to take on more meaningful tasks. After learning tons about the stack while working with discoverygarden, I was given the opportunity to work on what would eventually become CLAW. And now I'm fortunate enough to have the opportunity to see that work through as an employee of the Islandora Foundation. So before I start explaining my duties as Tech Lead, I'd really like to thank the Islandora Foundation for hiring me, and discoverygarden for helping me gain the skills I needed to grow into this position.

Now, as is tradition in Islandora, I'll describe my roles as hats.  I'm essentially wearing three of them:

  • Hat #1: 7.x-1.x guy 
    1. We have increasingly well-defined processes and workflows, and I'm committed to making sure those play out the way they should. But if, for whatever reason, a pull request has sat for too long and the community hasn't responded, I will make sure it is addressed: I will try to find a community member with the time and interest to look at it, and if that's not possible, I will review it myself.
    2. I will take part in and help chair committers' calls every other Thursday.
    3. I will attend a handful of Interest Group meetings. There are too many for me to attend them all, but I'm starting with the IR and Security interest groups.
    4. Lastly, I will be serving as Release Manager for the next release, and will be working towards continuing to document and standardize the process to the best of my abilities, so that it's easier for other community members to take part in and lead that process from here on out.

  • Hat #2: CLAW guy
    1. We're currently in the process of pivoting from a Drupal 7 to a Drupal 8 codebase, and I'm going to be shepherding that process as transparently as possible. This means I will be working with community members to develop a plan for the minimum viable product (or MVP for short). This will help defend against scope creep, and force us as a community to work out what all these changes mean. Between Fedora 4, PCDM, and Drupal 8, there's a lot that's different, and we all need to be on the same page. For everyone's sake, this work will be conducted as much as possible through conversations on the mailing lists, instead of solely at the tech calls. In the Apache world, if it doesn't happen on the list, it never happened. And I think that's a good approach to making sure people can at least track down the rationale for why we're doing certain things in the codebase.
    2. Using the MVP, I will be breaking down the work into the smallest consumable pieces possible. In the past few years I've learned a lot about working with volunteer developers, and I fully understand that people have day jobs with other priorities. By making the units of work as small as possible, we have a better chance of receiving more contributions from interested community members. In practice, I think this means I will be writing a lot of project templates to decrease ramp-up time for people, filling in boilerplate, and ideally even providing tests beforehand.
    3. I will be heavily involved in Sprint planning, and will be running community sprints for CLAW.
    4. I will be chairing and running CLAW tech calls, along with Nick Ruest, the CLAW Project Director.

  • Hat #3: UBC project guy
    • As part of a grant, the Foundation is working with the University of British Columbia Library and Press to integrate Fedora 4 with a front end called Scalar. They will also be using CLAW as a means of ingesting multi-pathway books. So I will be overseeing contractor work for the integration with Scalar, while also pushing CLAW towards full book support.

I hope that's enough for everyone to understand what I'll be doing for them, and how I can be of help if anyone needs it.  If you need to get in touch with me, I can be found on the lists, in #islandora on IRC as dhlamb, or at  I look forward to working with everyone in the future to help continue all the fantastic work that's been done by everyone out there.


LibUX: Content Style Guide – University of Illinois Library

planet code4lib - Mon, 2016-08-15 00:36

University of Illinois Library has made their content style guide available through a Creative Commons license. I feel like I point to Suzanne Chapman‘s work more than anyone else’s. I saw her credited in the site’s footer and thought, “oh, yeah – of course.” She’s pretty great.

One of the best ways to ensure that our website is user-friendly is to follow industry best practices, keep the content focused on key user tasks, and keep our content up-to-date at all times.

Also, walk through their 9 Principles for Quality Content.

We have many different users with many different needs. They are novice and expert users, desktop and mobile users, people with visual, hearing, motor, or cognitive impairments, non-native English speakers, and users with different cultural expectations. Following these guidelines will help ensure a better experience for all our users. They will also help us create a more sustainable website.

  1. Content is in the right place
  2. Necessary, needed, useful, and focused on patron needs
  3. Unique
  4. Correct and complete
  5. Consistent, clear, and concise
  6. Structured
  7. Discoverable and makes sense out of context
  8. Sustainable (future-friendly)
  9. Accessible

All of these are elaborated on and link out to a rabbit hole of further reading.

Jonathan Rochkind: UC Berkeley Data Science intro to programming textbook online for free

planet code4lib - Sat, 2016-08-13 22:36

Looks like a good resource for library/information professionals who don’t know how to program, but want to learn a little bit of programming along with (more importantly) computational and inferential thinking, to understand the technological world we work in. As well as those who want to learn ‘data science’!

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study—whether it’s the Earth’s weather, the world’s markets, political polls, or the human mind—data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them—a programming language.

Filed under: General

Jenny Rose Halperin: Hello world!

planet code4lib - Fri, 2016-08-12 22:30

Welcome to WordPress. This is your first post. Edit or delete it, then start writing!

LITA: Using Text Editors in Everyday Work

planet code4lib - Fri, 2016-08-12 14:35

In the LITA Blog Transmission featuring yours truly, I fumbled in trying to explain a time logging feature in the Windows native program Notepad. You can see in the screenshot above that the syntax is .LOG and you put it at the top of the file. Then every time you open the file it adds a time stamp at the end of the file and places the cursor there so you can begin writing. This specific file is where I keep track of my continuing education work. Every time I finish a webinar or class I open this file and write it down (I’ll be honest, I’ve missed a few). At the end of the year I’ll have a nice document with dates and times that I can use to write my annual report.
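If you’re curious how that behaves, the feature is easy to approximate in a few lines of Python. This is just a rough sketch of the behavior described above, not Notepad’s actual implementation, and the timestamp format is only approximate:

```python
from datetime import datetime

def open_log(path):
    """Mimic Notepad's .LOG feature: if the file's first line is .LOG,
    append a timestamp so new notes can be written right after it."""
    with open(path, "a+", encoding="utf-8") as f:
        f.seek(0)
        first_line = f.readline().strip()
        if first_line == ".LOG":
            f.seek(0, 2)  # jump to the end of the file
            stamp = datetime.now().strftime("%I:%M %p %m/%d/%Y")
            f.write("\n" + stamp + "\n")
    # the caller can now open the file in an editor, cursor at the end
```

Each call appends a fresh timestamp, so the file accumulates a dated log over time, just like the continuing-education file in the screenshot.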

I use Microsoft Notepad several times a day. In addition to its cool logging function I find it a dependable, simple text editor. Notepad is great for distraction-free, no-frills writing. If I have to copy and paste something from the web into a document—or from one document to another—and I don’t want to drag in all the formatting from the source I put the text in Notepad first. It cleans all the formatting off the text and lets me use it as I need. I use it as a quick way to create to-do lists or jot down notes while working on something else. It launches quickly and lets me get to work right away. Plus, if you delete all the text you’ve written—after you’re done working of course—you can close Notepad and there’s no dialog box asking if you want to save the file.

Prior to becoming a librarian I worked as a programmer writing code. Every single coder I worked with used Notepad to create, revise, and edit code. Sure, you can work in the program you’re writing and your office’s text editor—and you often do; we used something like the vi text editor—but sometimes you need to think through your code and you can’t do that in an executable. I used to have several Notepad files of handy code so that I could reference it quickly without needing to search through source code for it.

I’ve been thinking about Notepad more and more as I prepare for a coding program at my library. A good text editor is essential to writing code. Once you start using one you’ll find yourself reaching for it all the time. But it isn’t all Notepad all the time. If I actually have to troubleshoot code—which these days is mostly things in WordPress—I use Notepad++:

You can see the color highlighting that Notepad++ uses, which is a great visual way to see if there are problems in your code without even reading it. It also features a document map, which is a high-level view of your entire document on the right-hand side of the screen that highlights where you are in the code. There’s a function list that lists all the functions called in the file. Notepad++ has some other cool text editor functions like multi-editing (editing in several places in the file at the same time), and column mode editing (where you can select a column of text to edit instead of entire lines of code). It’s a very handy tool when you’re working on code.

These are not the only text editors out there. A quick search for lists of text editors gives you more choices than you need. Notepad++ is at the top of several lists and I have to say that I like it better than others I’ve tried. The best thing is most of these text editors are free so they’re easy to try out and see what works for you. They all have very similar feature sets so it often comes down to the user interface. While these two options are Windows operating system only, there are plenty of good text editors for Mac users, too.

Text editors won’t be the starting point for my coding program. We’ll focus on some non-tech coding exercises and some online tools like Scratch or Tynker and some physical items like Sphero or LEGO Mindstorms. While these are geared towards children they are great for adults who have never interacted with code. (Sphero and Mindstorms do have a cost associated with them.) When I get to the point in our coding program where I want to talk about text editors I’ll focus on Notepad and Notepad++ but let people know there are other options. If I know my patrons, they’ll have suggestions for me.

Do you have any cool tips for your favorite text editor or perhaps just a recommendation?

SearchHub: Pivoting to the Query: Using Pivot Facets to build a Multi-Field Suggester

planet code4lib - Fri, 2016-08-12 13:43

Suggesters, also known as autocomplete, typeahead or “predictive search”, are powerful ways to accelerate the conversation between user and search application. Querying a search application is a little like a guessing game – the user formulates a query that they hope will bring back what they want – but sometimes there is an element of “I don’t know what I don’t know” – so the initial query may be a bit vague or ambiguous. Subsequent interactions with the search application are sometimes needed to “drill in” to the desired information. Faceted navigation and query suggestions are two ways to ameliorate this situation. Facets generally work after the fact – after an initial attempt has been made – whereas suggesters seek to provide feedback in the act of composing the initial query, to improve its precision from the start. Facets also provide a contextual multi-dimensional visualization of the result set that can be very useful in the “discovery” mode of search.

A basic tenet of suggester implementations is to never suggest queries that will not bring back results. To do otherwise is pointless (it also does not inspire confidence in your search application!). Suggestions can come from a number of sources – previous queries that were found to be popular, suggestions that are intended to drive specific business goals and suggestions that are based on the content that has been indexed into the search collection. There are also a number of implementations that are available in Solr/Lucene out-of-the-box.

My focus here is on providing suggestions that go beyond the single term query – that provide more detail on the desired results and combine the benefits of multi-dimensional facets with typeahead. Suggestions derived from query logs can have this context but these are not controlled in terms of their structure. Suggestions from indexed terms or field values can also be used but these only work with one field at a time. Another focus of this and my previous blogs is to inject some semantic intelligence into the search process – the more the better. One way to do that is to formulate suggestions that make grammatical sense – constructed from several metadata fields – that create query phrases that clearly indicate what will be returned.

So what do I mean by “suggestions that make grammatical sense”? Just that we can think of the metadata that we may have in our search index (and if we don’t have, we should try to get it!) as attributes or properties of some items or concepts represented by indexed documents. There are potentially a large number of permutations of these attribute values, most of which make no sense from a grammatical perspective. Some attributes describe the type of thing involved (type attributes), and others describe the properties of the thing. In a linguistic sense, we can think of these as noun and adjective properties respectively.

To provide an example of what I mean, suppose that I have a search index about people and places. We would typically have fields like first_name, last_name, profession, city and state. We would normally think of these fields in this order or maybe last_name, first_name city, state – profession as in:

Jones, Bob Cincinnati, Ohio – Accountant

or

Bob Jones, Accountant, Cincinnati, Ohio

But we would generally not use:

Cincinnati Accountant Jones Ohio Bob

Even though this is a valid mathematical permutation of field value ordering. So if we think of all of the possible ways to order a set of attributes, only some of these “make sense” to us as “human-readable” renderings of the data.

Turning Pivot Facets “Around” – Using Facets to generate query suggestions

While facet values by themselves are a good source of query suggestions because they encapsulate a record’s “aboutness”, they can only do so one attribute at a time. This level of suggestion is already available out-of-the-box with Solr/Lucene Suggester implementations which use the same field value data that facets do in the form of a so-called uninverted index (aka the Lucene FieldCache or indexed Doc Values). But what if we want to combine facet fields as above? Solr pivot facets (see “Pivot Facets Inside And Out” for background on pivot facets) provide one way of combining an arbitrary set of fields to produce cascading or nested sets of field values. Think of it as a way of generating a facet value “taxonomy” – on the fly. How does this help us? Well, we can use pivot facets (at index time) to find all of the permutations for a compound phrase “template” composed of a sequence of field names – i.e. to build what I will call “facet phrases”. Huh? Maybe an example will help.

Suppose that I have a music index, which has records for things like songs, albums, musical genres and the musicians, bands or orchestras that performed them as well as the composers, lyricists and songwriters that wrote them. I would like to search for things like “Jazz drummers”, “Classical violinists”, “progressive rock bands”, “Rolling Stones albums” or “Blues songs” and so on. Each of these phrases is composed of values from two different index fields – for example “drummer”, “violinist” and “band” are musician or performer types. “Rolling Stones” are a band which as a group is a performer (we are dealing with entities here which can be single individuals or groups like the Stones). “Jazz”, “Classical”, “Progressive Rock” and “Blues” are genres and “albums” and “songs” are recording types (“song” is also a composition type). All of these things can be treated as facets. So if I create some phrase patterns for these types of queries like “musician_type, recording_type” or “genre, musician_type” or “performer, recording_type” and submit these as pivot facet queries, I can construct many examples of the above phrases from the returned facet values. So for example, the pivot pattern “genre, musician_type” would return things like “jazz pianist”, “rock guitarist”, “classical violinist”, “country singer” and so on – as long as I have records in the collection for each of these category combinations.
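To make this concrete, here is a minimal sketch of how a pivot facet response could be flattened into facet phrases. The nested node shape follows Solr’s facet.pivot JSON format, but the field names, values and counts below are made up for illustration:

```python
def pivot_phrases(pivots, min_count=1):
    """Flatten a Solr facet.pivot response into 'facet phrase' suggestions.
    Each node looks like {"field": ..., "value": ..., "count": ...,
    "pivot": [child nodes]}; a phrase is the path of values root-to-leaf."""
    phrases = []

    def walk(node, prefix):
        if node["count"] < min_count:
            return  # never suggest a phrase that brings back no results
        path = prefix + [str(node["value"])]
        children = node.get("pivot", [])
        if children:
            for child in children:
                walk(child, path)
        else:
            phrases.append(" ".join(path))

    for root in pivots:
        walk(root, [])
    return phrases

# Hypothetical response for facet.pivot=genre,musician_type
sample = [
    {"field": "genre", "value": "jazz", "count": 42, "pivot": [
        {"field": "musician_type", "value": "pianist", "count": 7},
        {"field": "musician_type", "value": "drummer", "count": 5},
    ]},
    {"field": "genre", "value": "classical", "count": 17, "pivot": [
        {"field": "musician_type", "value": "violinist", "count": 3},
    ]},
]
print(pivot_phrases(sample))
# ['jazz pianist', 'jazz drummer', 'classical violinist']
```

Because pivot facets only report value combinations that actually co-occur in documents, every phrase produced this way is guaranteed to hit at least one record.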

Once I have these phrases, I can use them as query suggestions by indexing them into a collection that I use for this purpose. It would also be nice if the precision that I am building into my query suggestions was honored at search time. This can be done in several ways. When I build my suggester collection using these pivot patterns, I can capture the source fields and send them back with the suggestions. This would enable precise filter or boost queries to be used when they are submitted by the search front end. One potential problem here is if the user types the exact same query that was suggested – i.e. does not select from the typeahead dropdown list. In this case, they wouldn’t get the feedback from the suggester but we want to ensure that the results would be exactly the same.

The query autofiltering technique that I have been developing and blogging about is another solution to matching the precision of the response with the added precision of these multi-field queries. It would work whether or not the user clicked on a suggestion or typed in the phrase themselves and hit “enter”. Some recent enhancements to this code that enable it to respond to verbs, prepositions or adjectives and to adjust the “context” of the generated filter or boost query provide another layer of precision that we can use in our suggestions. That is, suggestions can be built from templates or patterns in which we can add “filler” terms such as the verbs, prepositions and adjectives that the query autofilter now supports.

Once again, an example may help to clear up the confusion. In my music ontology, I have attributes for “performer” and “composer” on documents about songs or recordings of songs. Many artists whom we refer to as “singer-songwriters” for example, occur as both composers and performers. So if I want to search for all of their songs regardless of whether they wrote or performed them, I can search for something like:

Jimi Hendrix songs

If I want to just see the songs that Jimi Hendrix wrote, I would like to search for

“songs Jimi Hendrix wrote” or “songs written by Jimi Hendrix”

which should return titles like “Purple Haze”, “Foxy Lady” and “The Wind Cries Mary”

In contrast, the query:

“songs Jimi Hendrix performed”

should include covers like “All Along the Watchtower” (for your listening pleasure, here’s a link), “Hey Joe” and “Sgt Peppers Lonely Hearts Club Band”


“songs Jimi Hendrix covered”

would not include his original compositions.

In this case, the verb phrases “wrote” or “written by”, “performed” or “covered” are not field values in the index but they tell us that the user wants to constrain the results either to compositions or to performances. The new features in the query autofilter can handle these things now but what if we want to make suggestions like this?

To do this, we write pivot template patterns like these:

${composition_type} ${composer} wrote

${composition_type} written by ${composer}

${composition_type} ${performer} performed
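A rough sketch of how templates like these might be expanded (the ${...} placeholder syntax mirrors the patterns above, but the expansion code and the field values are my own illustration, not the code from the github project):

```python
import re

def expand_template(template, field_values):
    """Expand a pivot template like '${composition_type} written by ${composer}'
    into concrete suggestions, one per combination of field values."""
    fields = re.findall(r"\$\{(\w+)\}", template)
    suggestions = [template]
    for field in fields:
        next_round = []
        for partial in suggestions:
            for value in field_values[field]:
                # substitute one placeholder per round, keeping filler words
                next_round.append(partial.replace("${%s}" % field, value, 1))
        suggestions = next_round
    return suggestions

values = {"composition_type": ["songs"],
          "composer": ["Jimi Hendrix", "John Lennon"]}
print(expand_template("${composition_type} written by ${composer}", values))
# ['songs written by Jimi Hendrix', 'songs written by John Lennon']
```

In practice the value combinations would come from pivot facet results rather than a flat dict, so only combinations that actually co-occur in the index would be expanded.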

Code to do Pivot Facet Mining

The source code to build multi-field suggestions using pivot facets is available on github. The code runs as a Java main client that builds a suggester collection in Solr.

The design of the suggester builder includes one or more “query collectors” that feed query suggestions to a central “suggester builder” that a) validates the suggestions against the target content collection and b) can obtain context information from the content collection using facet queries (see below). One of the implementations of query collector is the PivotFacetQueryCollector. Other implementations can get suggestions from query logs, files, Fusion signals and so on.
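In outline, that design looks something like the following sketch. This is Python pseudocode for the Java classes described above; the class names follow the blog’s description, and the real implementation on github differs in detail:

```python
class QueryCollector:
    """Source of candidate suggestions (pivot facets, query logs, files...)."""
    def collect(self):
        raise NotImplementedError

class PivotFacetQueryCollector(QueryCollector):
    """Feeds phrases mined from pivot facet queries to the builder."""
    def __init__(self, phrases):
        self.phrases = phrases       # e.g. output of a facet.pivot request
    def collect(self):
        return list(self.phrases)

class SuggesterBuilder:
    """Validates each candidate against the content collection and keeps
    only the survivors for indexing into the suggester collection."""
    def __init__(self, collectors, validate):
        self.collectors = collectors
        self.validate = validate     # callable: suggestion -> hit count

    def build(self):
        suggestions = []
        for collector in self.collectors:
            for candidate in collector.collect():
                if self.validate(candidate) > 0:  # never suggest zero-hit queries
                    suggestions.append(candidate)
        return suggestions

hits = {"jazz pianist": 7, "polka theremin": 0}  # hypothetical hit counts
builder = SuggesterBuilder([PivotFacetQueryCollector(hits)], lambda s: hits[s])
print(builder.build())  # ['jazz pianist']
```

The validation callable stands in for a rows=0 query against the content collection; in the real builder this same query also carries facet parameters to gather context, as discussed below.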

The github distribution includes the music ontology dataset that was used for this blog article and a sample configuration file to build a set of suggestions on the music data. The ontology itself is also on github as a set of XML files that can be used to create a Solr collection but note that some preprocessing of the ontology was done to generate these files. The manipulations that I did on the ontology to ‘denormalize’ or flatten it will be the subject of a future blog post as it relates to techniques that can be used to ‘surface’ interesting relationships and make them searchable without the need for complex graph queries.

Using facets to obtain more context about the suggestions

The notion of “aboutness” introduced above can be very powerful. Once we commit to building a special Solr collection (also known as a ‘sidecar’ collection) just for typeahead, there are other powerful search features that we now have to work with. One of them is contextual metadata. We can get this by applying facets to the query that the suggester builder uses to validate the suggestion against the content collection. One application of this is to generate security trimming ACL values for a suggestion by getting the set of ACLs for all of the documents that a query suggestion would hit on – using facets. Once we have this, we can use the same security trimming filter query on the suggester collection that we use on the content collection. That way we never suggest a query to a user that cannot return any results for them – in this case because they don’t have access to any of the documents that the query would return. Another thing we can do when we build the suggester collection is to use facets to obtain context about various suggestions. As discussed in the next section, we can use this context to boost suggestions that share contextual metadata with recently executed queries. 
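As an illustration of that validation-plus-context step, the query for each candidate suggestion might carry facet parameters like these. The parameter names are standard Solr; the field names, including the ACL field, are assumptions for the sake of the example:

```python
def validation_query(suggestion, context_fields):
    """Build Solr params that both validate a candidate suggestion
    (numFound > 0) and collect contextual metadata via facets."""
    return {
        "q": suggestion,
        "rows": 0,                      # we only need counts and facets
        "facet": "true",
        "facet.limit": 10,
        "facet.field": context_fields,  # repeated param: ACLs, genres, performers
    }

params = validation_query("jazz pianist",
                          ["acl_ss", "genres_ss", "hasPerformer_ss"])
print(params["facet.field"])  # ['acl_ss', 'genres_ss', 'hasPerformer_ss']
```

The facet counts that come back tell us whether the suggestion hits anything at all, which ACLs govern the matching documents (for security trimming), and what contextual metadata to store alongside the suggestion in the sidecar collection.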

Dynamic or On-The-Fly Predictive Analytics

One of the very powerful and extremely user-friendly things that you can do with typeahead is to make it sensitive to recently issued queries. Typeahead is one of those use cases where getting good relevance is critical because the user can only see a few results and can’t use facets or paging to see more. Relevance is often dynamic in a search session, meaning that what the user is looking for can change – even in the course of a single session. Since typeahead starts to work with only a few characters entered, the queries start at a high level of ambiguity. If we can make relevance sensitive to recently searched things we can save the user a lot of a) work and b) grief. Google seems to do just this. When I was building the sample Music Ontology, I was using Google and Wikipedia (yes, I did contribute!) to look up songs and artists and to learn or verify things like who was/were the songwriter(s) etc. I found that if I was concentrating on a single artist or genre, after a few searches, Google would start suggesting long song titles with just two or three characters entered!! It felt as if it “knew” what my search agenda was! Honestly, it was kinda spooky but very satisfying.

So how can we get a little of Google’s secret sauce in our own typeahead implementations? Well the key here is context. If we can know some things about what the user is looking for we can do a better job of boosting things with similar properties. And we can get this context using facets when we build the suggestion collection! In a nutshell, we can use facet field values to build boost queries to use in future queries in a user session. The basic data flow is shown below:



This requires some coordination between the suggester builder and the front-end (typically JavaScript-based) search application. The suggester builder extracts context metadata for each query suggestion using facets obtained from the source or content collection and stores these values with the query suggestions in the suggester collection. To demonstrate how this contextual metadata can be used in a typeahead app, I have written a simple Angular JS application that uses this facet-based metadata in the suggester collection to boost suggestions that are similar to recently executed queries. When a query is selected from a typeahead list, the metadata associated with that query is cached and used to construct a boost query on subsequent typeahead actions.

So, for example if I type in the letter ‘J’ into the typeahead app, I get

Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Juke Joint Johnny

But if I have just searched for ‘Paul McCartney’, typing in ‘J’ now brings back:

John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs
John Lennon Originals
Hey Jude

The app has learned something about my search agenda! To make this work, the front end application caches the returned metadata for previously executed suggester results and stores this in a circular queue on the client side. It then uses the most recently cached sets of metadata to construct a boost query for each typeahead submission. So when I executed the search for “Paul McCartney”, the returned metadata was:

genres_ss:Rock,Rock & Roll,Soft Rock,Pop Rock
hasPerformer_ss:Beatles,Paul McCartney,José Feliciano,Jimi Hendrix,Joe Cocker,Aretha Franklin,Bon Jovi,Elvis Presley ( … and many more)
composer_ss:Paul McCartney,John Lennon,Ringo Starr,George Harrison,George Jackson,Michael Jackson,Sonny Bono

From this returned metadata – taking the top results, the cached boost query was:

genres_ss:"Rock"^50 genres_ss:"Rock & Roll"^50 genres_ss:"Soft Rock"^50 genres_ss:"Pop Rock"^50
hasPerformer_ss:"Beatles"^50 hasPerformer_ss:"Paul McCartney"^50 hasPerformer_ss:"José Feliciano"^50 hasPerformer_ss:"Jimi Hendrix"^50
composer_ss:"Paul McCartney"^50 composer_ss:"John Lennon"^50 composer_ss:"Ringo Starr"^50 composer_ss:"George Harrison"^50
memberOfGroup_ss:"Beatles"^50 memberOfGroup_ss:"Wings"^50

And since John Lennon is both a composer and a member of the Beatles, records with John Lennon are boosted twice, which is why these records now top the typeahead list. (I'm not sure why James P Johnson snuck in there, except that there are two ‘J’s in his name.)
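The client-side logic described above (a circular queue of cached facet metadata, turned into a boost query on each typeahead submission) can be sketched as follows. The queue capacity, the top-N cutoff and the ^50 boost factor are assumptions for illustration, not the actual values used in the Angular JS app:

```javascript
// Sketch of the client-side boost-query logic. Field names come from the
// example above; queue capacity, topN and the boost value are assumptions.
class MetadataQueue {
  constructor(capacity = 3) {
    this.capacity = capacity;
    this.entries = [];
  }
  // Cache the facet metadata returned with an executed query.
  push(metadata) {
    this.entries.push(metadata);
    if (this.entries.length > this.capacity) this.entries.shift();
  }
  // Build a boost query from the most recently cached metadata.
  boostQuery(boost = 50, topN = 4) {
    const clauses = [];
    for (const metadata of this.entries) {
      for (const [field, values] of Object.entries(metadata)) {
        for (const value of values.slice(0, topN)) {
          clauses.push(`${field}:"${value}"^${boost}`);
        }
      }
    }
    return clauses.join(' ');
  }
}

const queue = new MetadataQueue();
queue.push({
  genres_ss: ['Rock', 'Rock & Roll', 'Soft Rock', 'Pop Rock'],
  composer_ss: ['Paul McCartney', 'John Lennon', 'Ringo Starr', 'George Harrison'],
});
console.log(queue.boostQuery());
```

In a Solr setup, a string like this could then be passed along with the suggest request, for example as an edismax `bq` parameter.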

This demonstrates how powerful the use of context can be. In this case, the context is based on the user’s current search patterns. Another takeaway here is that facets, beyond their traditional use as a UI navigation aid, are a powerful way to build context into a search application. In this case, they were used in several ways: to create the pivot patterns for the suggester, to associate contextual metadata with suggester records and finally to use this context in a typeahead app to boost records that are relevant to the user’s most recent search goals. (The source code for the Angular JS app is also included in the github repository.)

We miss you Jimi – thanks for all the great tunes! (You are correct, I listened to some Hendrix – Beatles too – while writing this blog – is it that obvious?)


The post Pivoting to the Query: Using Pivot Facets to build a Multi-Field Suggester appeared first on

LibUX: How to Talk about User Experience — The Webinar!

planet code4lib - Fri, 2016-08-12 04:28

Hey there. My writeup (“How to Talk about User Experience“) is now a 90-minute LITA webinar. I have pretty strong ideas about treating the “user experience” as a metric and I am super grateful to my friends at LITA for another opportunity to make the case.


The explosion of new library user experience roles, named and unnamed, the community growing around it, the talks, conferences, and corresponding literature signal a major shift. But the status of library user experience design as a professional field is impacted by the absence of a single consistent definition of the area. While we can workshop card sorts and pick apart library redesigns, even user experience librarians can barely agree about what it is they do – let alone why it’s important. How we talk about the user experience matters. So, in this 90 minute talk, we’ll fix that.

  • September 7, 2016
  • 1 – 2:30 p.m. (Eastern)
  • $45 – LITA Members
  • $105 – Non-members
  • $196 – Groups


Cynthia Ng: A Reflection on Two Years as Content Coordinator

planet code4lib - Fri, 2016-08-12 04:26
After a little over two years, my time at the BC Libraries Cooperative (the co-op) working on the National Network for Equitable Library Service (NNELS) project will be coming to an end. As I prepare to leave, I thought I would reflect on my work while at the co-op. As always, I’ll start with my … Continue reading A Reflection on Two Years as Content Coordinator

LibUX: Jobs to be Done and New Feature Planning – Workshop

planet code4lib - Fri, 2016-08-12 04:11

I am teaching a 90 minute (!) online workshop on September 13th on Jobs to be Done and New Feature Planning, where — yep! — I will be talking about the Kano model. Those of you not familiar with the jobs to be done framework might still have heard

People don’t want a quarter-inch drill, they want a quarter-inch hole. Theodore Levitt, paraphrasing

— the observation being that people buy services not for the services themselves but to get jobs done. There is actually a lot written that takes this further, noting that while demographics and characteristics play some minor role, the task or job at hand is largely independent of them, rendering feature-planning around demographics — you might know them as personas — sort of useless. We’ll try to reconcile that, too.


Core to improving the library user experience is identifying need and introducing new and useful services, features, and content, but the risk of failure sometimes trumps our willingness to try anything out of the ordinary. What a shame, right? In this workshop, Michael Schofield — a developer, librarian, and chief #libuxer — will introduce you to methods and models for identifying the tasks patrons want to perform (or, their “jobs to be done”), and for determining whether a new service or feature might actually have a negative impact on the overall library user experience.

  • September 13, 2016
  • 2 – 3:30 p.m. (Eastern)

The cost is free to library staff in the state of Florida, but you might still try to give it a whirl and let us know in the comments.


LibUX: Service Design for Libraries Workshop

planet code4lib - Fri, 2016-08-12 03:47

Hey there. Michael here. I am running an online service design workshop courtesy of my amaaaazing friends at NEFLIN called Service Design for Libraries: From Map to Blueprint.


In this practical workshop, Michael Schofield — a developer, librarian, and chief #libuxer — introduces service design, its place in the user experience zeitgeist, and its role in deconstructing library services to hammer out the kinks. A brief conceptual overview is followed by a hands-on workshop in which attendees create a customer journey map before morphing it into a practical service blueprint.

  • August 23, 2016
  • 2 – 3:30 p.m. (Eastern)
  • Free to library staff in the state of Florida (although, I think it’s free otherwise, too. Give it a whirl and let us know in the comments).


FOSS4Lib Recent Releases: Sufia - 7.1.0

planet code4lib - Thu, 2016-08-11 22:29

Last updated August 11, 2016. Created by Peter Murray on August 11, 2016.

Package: Sufia
Release Date: Thursday, August 11, 2016

SearchHub: Queue Based Indexing & Collection Management at Gannett

planet code4lib - Thu, 2016-08-11 18:52

As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Devansh Dhutia’s session on how Gannett manages schema changes to large Solr collections.

Deploying schema changes to Solr collections with large volumes of data can be problematic when the reindex activity can take almost a whole day. Keeping in mind that Gannett’s 16 million document index grows by approximately 800,000 documents per month, the status quo isn’t satisfactory. A side effect of the current architecture is that during a Solr outage, not only are all reindex activities paused, but upstream authoring engines suffer from latency issues.

This talk demonstrates how Gannett is switching to a queue based solution with creative use of collections & aliases to dramatically improve the deployment, reindex, and authoring experiences. The solution also incorporates keeping a pair of Solr clouds in geographically dispersed data centers in an eventually synchronized state.
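A common way to realise the “creative use of collections & aliases” mentioned above is to reindex into a fresh collection and then atomically repoint a serving alias. Here is a minimal sketch of building the Solr Collections API calls involved; the host, collection, configset and alias names are hypothetical, while `CREATE` and `CREATEALIAS` are standard Collections API actions:

```javascript
// Sketch of zero-downtime schema deployment via collection aliases.
// Host and all names are hypothetical.
const solr = 'http://localhost:8983/solr';

// 1. Create a new collection using the updated schema's configset.
function createCollectionUrl(name, configSet) {
  return `${solr}/admin/collections?action=CREATE&name=${name}` +
         `&collection.configName=${configSet}&numShards=2&replicationFactor=2`;
}

// 2. After the queue-driven reindex into the new collection completes,
//    atomically repoint the serving alias; readers never see downtime.
function createAliasUrl(alias, collection) {
  return `${solr}/admin/collections?action=CREATEALIAS` +
         `&name=${alias}&collections=${collection}`;
}

console.log(createCollectionUrl('articles_v2', 'articles_conf_v2'));
console.log(createAliasUrl('articles', 'articles_v2'));
```

Because queries always go through the alias, the old collection can be dropped once the swap is verified.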

Devansh joined the Gannett family in 2006 and has been an active contributor to Gannett’s search strategy, starting with Lucene and, for the last 2 years, Solr. Devansh was one of the primary developers involved in switching Gannett from the traditional master-slave Solr setup to a geo-replicated Solr Cloud environment. When Devansh isn’t working with Solr, he enjoys spending time with his wife and 3-year-old daughter and trying new recipes.

Queue Based Solr Indexing with Collection Management: Presented by Devansh Dhutia, Gannett Co. from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Queue Based Indexing & Collection Management at Gannett appeared first on

LITA: LITA online continuing education for September 2016

planet code4lib - Thu, 2016-08-11 17:40

Start out the fall with these all new sessions, including a web course and two webinars:

Web Course:

Social Media For My Institution; from “mine” to “ours”
Instructor: Plamen Miltenoff
Starting Wednesday September 21, 2016, running for 4 weeks
Register Online, page arranged by session date (login required)

A course for librarians who want to explore the institutional application of social media. Based on an established academic course at St. Cloud State University, “Social Media in Global Context” (more information at ). A theoretical introduction will help participants distinguish the private use of social media from a structured approach to social media for an educational institution. Legal and ethical issues will be discussed, as will future trends and management issues. The course will include hands-on exercises on the creation and dissemination of textual and multimedia content and on patron engagement, along with brainstorming on strategies suitable for the institution regarding resources (human and technological), workload sharing, storytelling, and branding.

This is a blended format web course:

The course will be delivered as 4 separate live webinar lectures, one per week on:

Wednesdays, September 21, 28, October 5 and 12
2:00 – 3:00 pm Central

You do not have to attend the live lectures in order to participate. The webinars will be recorded and distributed through the web course platform, Moodle, for asynchronous participation. The web course space will also contain the exercises and discussions for the course.

Details here and Registration here


How to Talk About User Experience
Presenter: Michael Schofield
Wednesday September 7, 2016
Noon – 1:30 pm Central Time
Register Online, page arranged by session date (login required)

The explosion of new library user experience roles, named and unnamed, the community growing around it, the talks, conferences, and corresponding literature signal a major shift. But the status of library user experience design as a professional field is impacted by the absence of a single consistent definition of the area. While we can workshop card sorts and pick apart library redesigns, even user experience librarians can barely agree about what it is they do – let alone why it’s important. How we talk about the user experience matters. So, in this 90 minute talk, we’ll fix that.

Details here and Registration here

Online Productivity Tools: Smart Shortcuts and Clever Tricks
Presenter: Jaclyn McKewan
Tuesday September 20, 2016
11:00 am – 12:30 pm Central Time
Register Online, page arranged by session date (login required)

Become a lean, mean productivity machine! In this 90 minute webinar we’ll discuss free online tools that can improve your organization and productivity, both at work and home. We’ll look at to-do lists, calendars, and other programs. We’ll also explore ways these tools can be connected, as well as the use of widgets on your desktop and mobile device to keep information at your fingertips.

Details here and Registration here

And don’t miss the other upcoming LITA fall continuing education offerings:


Beyond Usage Statistics: How to use Google Analytics to Improve your Repository, with Hui Zhang
Offered: Tuesday October 11, 2016, 11:00 am – 12:30 pm

Web courses:

Project Management for Success, with Gina Minks
Offered: October 2016, runs for 4 weeks

Contextual Inquiry: Using Ethnographic Research to Impact your Library UX, with Rachel Vacek and Deirdre Costello
Offered: October 2016, running for 6 weeks.

Check the Online Learning web page for more details as they become available.

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

Hydra Project: Hydra Connect 2016 Program news!

planet code4lib - Thu, 2016-08-11 15:30

The Hydra Connect 2016 Program Committee thought that you might appreciate an update on how planning is going, so…

The list of workshops for Monday has been available on the wiki for some time now.  We shall shortly be asking delegates to indicate which sessions they hope to attend so that we can allocate appropriately sized rooms and so that convenors can send out any pre-workshop materials to them.

The conference proper will start on Tuesday with a plenary session, a mix of key presentations and lightning talks as at previous Connects.  On Tuesday afternoon we shall have the very popular poster session for which we ask a poster from every attending institution – please start planning!  As last year, we shall arrange for printing at a FedEx branch near the conference venue for those who prefer not to travel with a poster tube!  Details soon.

We received far more suggestions for Connect sessions than we have had in the past – in particular there were a lot of suggestions for panels and breakouts.  We’re pleased to report that by extending the “traditional” Wednesday morning parallel tracks into the afternoon we have managed to accommodate everyone’s requests.  We’ve timetabled presentations in 30-minute slots (a 20-minute presentation, 5 minutes or so for questions and a bit of time for possible movement between rooms).  Panel and breakout sessions have been timetabled in one hour slots (50-55 minutes plus movement time).  If you are involved in presenting or facilitating any of these sessions you should hear from us with confirmation at the end of next week when we have finished tweaking the timetable.  We have included a number of slots for lightning talks and we’ll start soliciting these at the end of the month.  We anticipate having the Tuesday and Wednesday programs on the wiki in ten days’ time or so and you’ll find there is so much to choose from that, inevitably, you will have to make some hard choices about which sessions to attend.  We are hoping (though this is yet to be confirmed) that we may be able to make, and subsequently post, audio recordings of all the sessions so that you can listen to those that you couldn’t attend once you return home.

Thursday morning has been given over to unconference sessions and we hope to make “Sessionizer” available to delegates in about three weeks’ time so that you can start requesting slots.  Thursday afternoon is available for Interest Groups and Working Groups to have face-time.  We shall make any spare room capacity on Thursday available for booking to allow ad-hoc gatherings, Birds of a Feather sessions, and the like.

Booking is beginning to fill up and if you haven’t yet registered now would be a good time to do so!  Full details of registration and the conference hotel are on the wiki. Please note that the specially negotiated hotel rate is only valid until September 6th and you must register by that same date to receive a Hydra t-shirt!

If you can only make it to one Hydra meeting in 2016/17, this is the one to attend! 

Open Knowledge Foundation: Update on OpenTrialsFDA: finalist for the Open Science Prize

planet code4lib - Thu, 2016-08-11 11:59

In May, the OpenTrialsFDA team (a collaboration between Erick Turner, Dr. Ben Goldacre and the  OpenTrials team at Open Knowledge) was selected as a finalist for the Open Science Prize. This global science competition is focused on making both the outputs from science and the research process broadly accessible to the public. Six finalists will present their final prototypes at an Open Science Prize Showcase in early December 2016, with the ultimate winner to be announced in late February or early March 2017.

As the name suggests, OpenTrialsFDA is closely related to OpenTrials, a project funded by The Laura and John Arnold Foundation that is developing an open, online database of information about the world’s clinical research trials. OpenTrialsFDA will work on increasing access, discoverability and opportunities for re-use of a large volume of high quality information currently hidden in user-unfriendly Food and Drug Administration (FDA) drug approval packages (DAPs).

The FDA publishes these DAPs as part of the general information on drugs via its data portal known as Drugs@FDA. These documents contain detailed information about the methods and results of clinical trials, and are unbiased, compared to reports of clinical trials in academic journals. This is because FDA reviewers require adherence to the outcomes and analytic methods prespecified in the original trial protocols, so, in contrast to most journal editors, they are unforgiving of practices such as post hoc switching of outcomes and changes to the planned statistical analyses. These review packages also often report on clinical trials that have never been published.

A more complete picture: contrasting the journal version of antidepressant trials with the FDA information (image: Erick Turner, adapted from

However, despite their high value, these FDA documents are notoriously difficult to access, aggregate, and search. The website itself is not easy to navigate, and much of the information is stored in PDFs or non-searchable image files for older drugs. As a consequence, they are rarely used by clinicians and researchers. OpenTrialsFDA will work on improving this situation, so that valuable information that is currently hidden away can be discovered, presented, and used to properly inform evidence-based treatment decisions.

The team has started to scrape the FDA website, extracting the relevant information from the PDFs through a process of OCR (optical character recognition). A new OpenTrialsFDA interface will be developed to explore and discover the FDA data, with application programming interfaces (APIs) allowing third party platforms to access, search, and present the information, thus maximising discoverability, impact, and interoperability. In addition, the information will be integrated into the OpenTrials database, so that for any trial for which a match exists, users can see the corresponding FDA data.
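As a rough illustration of the final integration step (matching scraped FDA documents to OpenTrials records), here is a hypothetical sketch. The field names, normalisation rule, and sample data are all invented and are not part of the actual OpenTrialsFDA pipeline:

```javascript
// Hypothetical sketch of linking an FDA review document to an
// OpenTrials record via a normalised drug name. Fields are invented.
function normalize(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

function matchTrials(fdaDocs, trials) {
  // Index trials by normalised drug name for O(1) lookups.
  const index = new Map(trials.map(t => [normalize(t.drugName), t]));
  return fdaDocs
    .map(doc => ({ doc, trial: index.get(normalize(doc.drug)) }))
    .filter(pair => pair.trial !== undefined);
}

const matches = matchTrials(
  [{ drug: 'Fluoxetine HCl' }, { drug: 'Unknownium' }],
  [{ drugName: 'fluoxetine hcl', id: 'trial-1' }]
);
console.log(matches.length); // 1
```

Real-world matching is of course messier (brand vs. generic names, multiple trials per drug), which is part of why the project exposes the data via APIs rather than a single fixed join.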

Future progress will be shared both through this blog and the OpenTrials blog: you can also sign up for the OpenTrials newsletter to receive regular updates and news. More information about the Open Science Prize and the other finalists is available from

Twitter: @opentrials

SearchHub: 10 Things You Don’t Want to Miss at Lucene/Solr Revolution 2016

planet code4lib - Wed, 2016-08-10 16:51

Are we the only ones who feel like this summer is flying by? While the thought of saying goodbye to warm summer days pains us a little inside, we are excited that Lucene/Solr Revolution 2016 is just two short months away! The conference will be held October 11-14 in Boston, MA. If you haven’t secured your spot yet, don’t wait. Here’s a list of 10 things (in no particular order) that you won’t want to miss at this year’s conference:

1. Mix and Mingle with the Brains Behind Solr

Hear talks from and mingle with those who shape the Apache Solr project – the committers! Lucene/Solr Revolution is the unofficial annual gathering of Solr committers from around the world. It’s not often that this many committers gather at one event, so don’t miss the chance to meet the people who know Solr best.

2. Learn From the Best

Over 50 breakout sessions from Solr users across all industries. Learn how companies like Salesforce, IBM, Bloomberg, Sony, The Home Depot, Microsoft, Allstate Insurance Company, and more use Solr to solve business problems. Check out the agenda here.

3. Very Happy Hours

Happy Hours to network with attendees, speakers, sponsors, and committers. There is a ton of content to digest at Lucene/Solr Revolution and we guarantee you will learn a lot. However, we aren’t skimping on the fun!

4. Superstar Keynotes

Keynotes from Cathy Polinsky, SVP of Search at Salesforce, and Sridhar Sudarsan, CTO of Product Management and Partnerships at IBM Watson. Hear about the role Solr plays in the strategy and execution of two of the world’s leading enterprises.

5. World-Class Training

Pre-conference training to polish your skills and prime your brain for all of the information coming your way during the conference. Two-day hands-on Solr training is offered on October 11th & 12th. Check out the course listings here.

6. Ask the Experts – in Person!

Meet with experts on specific questions you have about Solr during Office Hours.

7. Stump the Chump!

Catch the popular “Stump the Chump” session with Solr Committer, Chris “Hoss” Hostetter. Be prepared for a laugh as you watch Hoss answer questions from attendees and community members trying to stump him. Check out last year’s video here. Want to Stump the Chump? Submit questions to before October 12.

8. Get a New Profile Pic

A professional headshot – come on, you know you need one. Get yours for free at the conference. Your LinkedIn followers will thank you.

9. Party in the Sky

Conference party at Skywalk Boston – check out 360 degree views of Boston from the top of the Prudential Center while enjoying food, drinks, games, and music. It’s sure to be a good time!

10. Chat with Partners and Sponsors

Visit our sponsor showcase to learn about products and services for your search and big data needs or about industry job opportunities. Participate in contests to win prizes.

Register today to join the fun and spend the week with us learning about all things Solr.

The post 10 Things You Don’t Want to Miss at Lucene/Solr Revolution 2016 appeared first on

Open Knowledge Foundation: Why we chose Slack to update our team page.

planet code4lib - Wed, 2016-08-10 13:14

After reading Mor’s post on the recent website update, I thought I’d elaborate a little on the team page, and how we ended up using Slack to update it. The following is from a post on my personal blog.

I recently undertook the task of redesigning a couple of key pages for the Open Knowledge International website. A primary objective here was to present ourselves as people, as much as an organisation. After all, it’s the (incredible) people that make Open Knowledge International what it is. One of the pages to get some design time was the team page. Here I wanted to show that behind this very static page, were real people, doing real stuff. I started to explore the idea of status updates for each staff member. This would, if all goes to plan, keep the content fresh, while hopefully making us a little more relatable.

My work here wasn’t done. In this scenario, my colleagues become “users”, and if this idea had any chance of working it needed a close to effortless user experience. Expecting anyone other than a few patient techies to interact with the website’s content management system (CMS) on a regular basis just isn’t realistic.

As it happens, my colleagues and I were already creating the content I was looking for. We chat daily using an internal instant messaging app (we use Slack). As well as discussing work related issues, people often share water-cooler items such as an interesting article they have read, or a song that captures how they are feeling. Sharing stuff like this can be as easy as copying and pasting a link. Slack will grab the title of the page and present it nicely for us. So what if, at that moment of sharing, you could choose to share more widely, via the website? After some discussions, we introduced a system that facilitated just this: if you add one of a few specific hashtags to your Slack message, it gets pushed to the website and becomes your most recent status. The implementation still needs a little polishing, but I’m happy to say that after a few weeks of use, it seems to be working well, in terms of uptake at least. Whether anyone visiting the site really cares remains to be proven.
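The hashtag-triggered flow described above can be sketched like this; the hashtag names and the message shape are assumptions, not the ones actually used on the Open Knowledge International site:

```javascript
// Sketch of the hashtag-triggered status update. Tag names and the
// message shape are assumptions for illustration.
const STATUS_TAGS = ['#status', '#nowplaying', '#reading'];

function extractStatus(message) {
  const tag = STATUS_TAGS.find(t => message.text.includes(t));
  if (!tag) return null; // ordinary chat message: ignore
  return {
    user: message.user,
    tag,
    // Strip the hashtag so only the shared content becomes the status.
    text: message.text.replace(tag, '').trim(),
  };
}

console.log(extractStatus({ user: 'sam', text: 'Loving this article #reading' }));
// → { user: 'sam', tag: '#reading', text: 'Loving this article' }
```

A small bot listening to the Slack API would run something like this on each message and push any non-null result to the website.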

I really like this solution. I like that it achieves its objective of not requiring additional effort, of course. Moreover, I like that it doesn’t introduce any barriers. It doesn’t assume that anyone wanting to contribute has a certain amount of knowledge (outside of what is already proven) or is happy to learn a new tool. It doesn’t ask anyone to change their way of working. It makes me wonder, how far could you take this model? It’s a big leap, but could we expand on this to the point where the interface being interacted with is that of whatever application the content creator sees fit? Just as the (slightly modified) action of sending a message in Slack became enough to make this small content change, could/should the action of saving a Word document to your local drive be enough to publish a blog post? (That particular example is not difficult to imagine, if you assume it’s happening within a walled-off Microsoft infrastructure, but that of course would be contrary to what I’m pondering here.)

Originally posted on

Richard Wallis: Hidden Gems in the new Schema.org 3.1 Release

planet code4lib - Wed, 2016-08-10 11:55

I spend a significant amount of time working with Google folks, especially Dan Brickley, and others on the supporting software, vocabulary contents, and application of Schema.org. So it is with great pleasure, and a certain amount of relief, that I share the announcement of the release of Schema.org 3.1.

That announcement lists several improvements, enhancements and additions to the vocabulary that appeared in versions 3.0 & 3.1. These include:

  • Health Terms – A significant reorganisation of the extensive collection of medical/health terms, that were introduced back in 2012, into the ‘health-lifesci’ extension, which now contains 99 Types, 179 Properties and 149 Enumeration values.
  • Finance Terms – Following an initiative and work by the Financial Industry Business Ontology (FIBO) project (which I have the pleasure to be part of), in support of the W3C Financial Industry Business Ontology Community Group, several terms have been added to improve the capability for describing things such as banks, bank accounts, financial products such as loans, and monetary amounts.
  • Spatial and Temporal and Datasets – CreativeWork now includes spatialCoverage and temporalCoverage, which I know my cultural heritage colleagues and clients will find very useful. Like many enhancements in the community, this work came out of a parallel interest, in which Dataset has received some attention.
  • Hotels and Accommodation – Substantial new vocabulary for describing hotels and accommodation has been added, and documented.
  • Pending Extension – Version 3.0 introduced a special extension called “pending”, which provides a place for newly proposed terms to be documented, tested and revised. The anticipation is that this area will be updated with proposals relatively frequently, in between formal releases.
  • How We Work – A HowWeWork document has been added to the site. This comprehensive document details the many aspects of the operation of the community, the site, the vocabulary etc. – a useful way in for casual users through to those who want to immerse themselves in the vocabulary, its use and development.

For fuller details on what is in 3.1 and other releases, check out the Releases document.

Hidden Gems

Often working in the depths of the vocabulary, and the site that supports it, I get up close to improvements that are not obvious on the surface. Here are a few that those who immerse themselves may find interesting:

  • Snappy Performance – The Schema.org site, a Python app hosted on the Google App Engine, is, shall we say, a very popular site. Over the last 3-4 releases I have been working on taking full advantage of multi-threaded, multi-instance, memcache, and shared datastore capabilities. Add in page caching improvements plus an implementation of ETags, and we can see improved site performance which can best be described as snappiness. The only downsides: to see a new version update you sometimes have to hard reload your browser page, and I have learnt far more about these technologies than I ever thought I would need!
  • Data Downloads – We are often asked for a copy of the latest version of the vocabulary so that people can examine it, develop from it, build tools on it, or whatever takes their fancy. This has been partially possible in the past, but now we have introduced (on a developers page we hope to expand with other useful stuff in the future – suggestions welcome) a download area for vocabulary definition files. From here you can download, in your favourite format (Triples, Quads, JSON-LD, Turtle), files containing the core vocabulary, individual extensions, or the whole vocabulary. (Tip: The page displays the link to the file that will always return the latest version.)
  • Data Model Documentation – Version 3.1 introduced updated contents to the Data Model documentation page, especially in the area of conformance. I know from working with colleagues and clients that it is sometimes difficult to get your head around Schema.org’s use of Multi-Typed Entities (MTEs) and the ability to use a Text, or a URL, or Role for any property value. It is good to now have somewhere to point people when they question such things.
  • Markdown – This is a great addition for those enhancing, developing and proposing updates to the vocabulary. The rdfs:comment section of term definitions is now passed through a Markdown processor. This means that any formatting or links to be embedded in a term description do not have to be escaped with horrible coding such as &amp; and &gt;. So for example a link can be input as [The Link]( and italic text would be input as *italic*. The processor also supports WikiLinks style links, which enables direct linking to a page within the site, so [[CreativeWork]] will result in the user being taken directly to the CreativeWork page via a correctly formatted link. This makes the correct formatting of term descriptions a much nicer experience, as it does my debugging of the definition files.
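The WikiLinks-style shorthand in the last bullet amounts to a simple substitution pass over the comment text. A sketch follows; the exact link target format the site generates is an assumption here:

```javascript
// Sketch of WikiLinks expansion: [[CreativeWork]] becomes a Markdown
// link to that term's page. The "/Term" target format is an assumption.
function expandWikiLinks(comment) {
  return comment.replace(/\[\[([A-Za-z0-9]+)\]\]/g, '[$1](/$1)');
}

console.log(expandWikiLinks('See [[CreativeWork]] for details.'));
// → 'See [CreativeWork](/CreativeWork) for details.'
```

The result is then handed to the ordinary Markdown processor, so authors never have to write raw HTML links in term descriptions.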

I could go on, but won’t. If you are new to Schema.org, or very familiar with it, I suggest you take a look.


Subscribe to code4lib aggregator