Planet Code4Lib

Karen Coyle: More is more

Tue, 2016-02-16 14:26
Here's something that drives me nuts:

These are two library catalog displays for Charles Darwin's "On the Origin of Species". One shows a publication date of 2015, the other a date of 2003. Believe me, neither of them anywhere lets the catalog user know that these are editions of a book first published in 1859. Nor do they anywhere explain that this book can be considered the founding text for the science of evolutionary biology. Imagine a user coming to the catalog with no prior knowledge of Darwin (*) - they might logically conclude that this is the work of a current scientist, or even a synthesis of arguments around the issue of evolution. From the second book above one could conclude that Darwin hangs with Richard Dawkins; maybe they have offices near each other in the same university.

This may seem absurd, but it is no more absurd than the paucity of information that we offer to users of our catalogs. The description of these books might be suitable to an inventory of the warehouse, but it's hardly what I would consider to be a knowledge organization service. The emphasis in cataloging on description of the physical item may serve librarians and a few highly knowledgeable users, but the fact that publications are not put into a knowledge context makes the catalog a dry list of uninformative items for many users. There are, however, cataloging practices that do not consider describing the physical item the primary purpose of the catalog. One only needs to look at archival finding aids to see how much more we could tell users about the collections we hold. Another area of more enlightened cataloging takes place in the non-book world.

The BIBFRAME AV Modeling Study was commissioned by the Library of Congress to look at BIBFRAME from the point of view of libraries and archives whose main holdings are not bound volumes. The difference between book cataloging and the collections covered by the study is much more than a difference in the physical form of the library's holdings. What the study revealed to me was that, at least in some cases, the curators of the audio-visual materials have a different concept of the catalog's value to the user. I'll give a few examples.

The Online Audiovisual Catalogers have a concept of primary expression, which is something like first edition for print materials. The primary expression becomes the representative of what FRBR would call the work. In the Darwin example, above, there would be a primary expression that is the first edition of Darwin's work. The AV paper says "...the approach...supports users' needs to understand important aspects of the original, such as whether the original release version was color or black and white." (p.13) In our Darwin case, including information about the primary expression would place the work historically where it belongs.

Another aspect of the AV cataloging practice that is included in the report is their recognition that there are many primary creator roles. AV catalogers recognize a wider variety of creation than standards like FRBR and RDA allow. With a film, for example, the number of creators is both large and varied: director, editor, writer, music composer, etc. The book-based standards have a division between creators and "collaborators" that not all agree with, in particular when it comes to translators and illustrators. Although some translations are relatively mundane, others could surely be elevated to a level of being creative works of their own, such as translations of poetry.

The determination of primary creative roles and roles of collaboration is not one that can be made across the board; not all translators should necessarily be considered creators, and not all sound incorporated into a film deserves top billing. The AV study recognizes that different collections have different needs for description of materials. This brings out the tension in the library and archives community between data sharing and local needs. We have to allow communities to create their own data variations and still embrace their data for linking and sharing. If, instead, we go forward with an inflexible data model, we will lose access to valuable collections within our own community.

(*) You, dear reader, may live in a country where the ideas of Charles Darwin are openly discussed in the classroom, but in some of the United States there are or have been in the recent past restrictions on imparting that information to school children.

Islandora: Islandora CLAW Lessons: Starting March 1st

Tue, 2016-02-16 13:49

Looking ahead to Fedora 4? Interested in working with Islandora CLAW? Want to help out but don't know where to start? Want to adopt it and need some training? CLAW Committer Diego Pino will be giving a several-part series of lessons on how to develop in the CLAW project, starting March 1st at 11AM EST and continuing weekly until you're all CLAW experts. These will be held as interactive lessons via Google Hangouts (class size permitting). Registration is completely free but spaces may be limited. Sign up here to take part.

Journal of Web Librarianship: OAI-PMH Harvested Collections and User Engagement

Tue, 2016-02-16 09:26
DeeAnn Allison

Journal of Web Librarianship: A Review of “Electronic Resource Management”

Tue, 2016-02-16 09:26
Charlie Sicignano

Journal of Web Librarianship: A Review of "Learning JavaScript Design Patterns"

Tue, 2016-02-16 09:25
John Rodzvilla

FOSS4Lib Recent Releases: CollectionSpace - 4.3

Mon, 2016-02-15 22:36

Last updated February 15, 2016. Created by Peter Murray on February 15, 2016.

Package: CollectionSpace
Release Date: Monday, February 15, 2016

LITA: Level Up – Gamification for Promotion in the Academic Library

Mon, 2016-02-15 15:59
Kirby courtesy of Torzk

Let me tell you the truth: I didn’t begin to play games until my late twenties. In my youth, I resisted the siren call of Super Nintendo and Sega Genesis. As an adult, I studiously avoided PlayStation and Xbox. When the Wii came out, I caved. I am very glad I did, because finding games in my twenties proved to be a tremendous stress reducer, community builder, and creative outlet. I cannot imagine completing my MLIS while working full-time and planning my wedding without Super Smash Bros.

It was a time in my life when I really needed to punch something.

In case you are wondering, I specialize as Kirby and I am a crusher. Beyond video games, I like board games (mainly cooperative ones, like Pandemic) and trivia. Lately, I have also been toying with getting into Dungeons & Dragons because what I really need is more hobbies.

More to the point – This isn’t the first time I’ve talked about the value of gamification or my interest in it on the LITA Blog, but this is a first for me in that I am offering gamification as a tool towards a specific professional goal, namely promotion in the academic library.

A quick note: gamification doesn’t necessarily require technology, though I do recommend apps for this process. In writing this blog post, my key aim is to offer academic librarians who are looking for a natural starting place a recommended way to apply gamification in their professional lives.

SuperBetter by Jane McGonigal

In the course of pursuing promotion in an academic library or seeking professional development opportunities in the workplace, it can be easy to feel overwhelmed, isolated, and even paralyzed. What if, instead of binging on Girl Scout cookies and listening to sad Radiohead (this may just be me), we chose to work gamefully? What if we framed promotion as a mission for an epic win, with quests, battles, and rewards along the way?

In her book, SuperBetter: A Revolutionary Approach to Getting Stronger, Happier, Braver, and More Resilient — Powered by the Science of Games (phew), Jane McGonigal boldly posits, “Work ethic is not a moral virtue that can be cultivated simply by wanting to be a better person. It’s actually a biochemical condition that can be fostered, purposefully, through activity that increases dopamine levels in the brain.”

She goes on to provide seven rules for implementing her SuperBetter method which are:

  • Challenge yourself.
  • Collect and activate power-ups.
  • Find and battle bad guys.
  • Seek out and complete quests.
  • Recruit allies.
  • Adopt a secret identity.
  • Go for the epic win.

Gamification is still something most of us are figuring out how to incorporate into library programming and services; however, I can think of no better way to begin to understand gamification as a learning theory than to apply it towards your work. Seeing how gamification can help you structure the steps it takes to be promoted in your library will offer inspiration. In the process, you will naturally think of ways to apply gamification to instruction, collection engagement, and other library outreach.

Think of the promotion process through the lens of SuperBetter’s rules. Quests might include identifying and contacting collaborators (allies) for your research project or a coach/mentor for your promotion process. You might make a spreadsheet of conferences you want to present at in the next two or three years. Is there a particularly impressive journal where you would like to publish? That’s a fine quest.

Make sure that as you complete these quests, all part of your effort for the eventual “epic win,” you track your efforts. The road to promotion is one that requires a well-rounded portfolio of activities, and gamifying each will keep you on track. Remember that each quest you complete provides you with a power-up, leaving you with more professional clout and experience, extending your network and leaving you SuperBetter. The quest is its own reward.

Habit RPG – I am a Level 10 Mage with a Panda Companion!

One tool I have found tremendously helpful for framing my own quests towards my promotion is Habit RPG, an app I have mentioned in previous posts. With Habit RPG, I can put all my quests and daily tasks in an already gamified context where I earn fancy armor and other gear for my avatar. SuperBetter has an app component which also looks great. Whether or not you are interested in investigating an app, I would encourage you to read SuperBetter, which is an excellent starting place for thinking about gamification and provides plenty of examples and starter quests. Not a reader? No problem. Jane McGonigal has a TED Talk that sums up the ideas very neatly.

Ultimately, the road to tenure can feel lonely. The solo nature of the pursuit means that no one’s experience is exactly the same. However, by approaching the process through gamification, you can put the joy back into the job. Get questing, and let me know your thoughts on gamifying promotion.

Suzanne Chapman: Web Content Debt is the New Technical Debt

Mon, 2016-02-15 14:37

We worry a lot about “technical debt” in the IT world. The classic use of this metaphor describes the phenomenon where messy code is written in the interest of quick execution, causing a debt that will need to be repaid (time spent fixing the code later) or it will accumulate interest (additional work on the system will be complicated by the messy code). “Technical debt” is also used more broadly to describe the ongoing maintenance of legacy systems that we spend a great deal of time just keeping alive.

Technical & content debt holds us back from doing new and better things.

But in addition to technical debt, organizations (like libraries) with large websites have a growing problem with what I’ve started calling “content debt.” And like with “deferred maintenance” of buildings (the practice of postponing repairs to save costs), allowing too much technical debt and/or content debt will result in costing you much more in the long run. Beyond the costs, the big problem with technical & content debt is that they hold us back from doing new and better things.

Take for example the website I’m working on right now that currently has over 16,000 pages of content that were created by hundreds of different people over many many years. Redesigning this website isn’t just a matter of developing a new CMS with a more modern design and then hiring a room full of interns to copy and paste from the old CMS to the new CMS. We also need to look closely at all of the existing pages to evaluate what needs to be done differently this time around to ensure a more user-friendly and future-friendly site. It’s no easy task to detangle this mass of pages and the organic processes that generated them.

Some might say that you should just set the old stuff aside and start from scratch, but if you don’t take the time to discover what’s causing your problems, you’ve little chance of not replicating them. The Wikipedia page for technical debt offers some common causes, many of which also fit with my concept of content debt. Here’s my revised version to help illustrate the similarities:

  • Business pressures: organization favors speed of releasing code [or content] as more important than a complete and quality release.
  • Lack of shared standards/best practices: no shared standards/best practices for developers [or content authors] means inconsistent quality of output.
  • Lack of alignment to standards: code [or content] standards/best practices aren’t followed.
  • Lack of knowledge: despite trying to follow standards, the developer [or content author] still doesn’t have the skills to do it properly.
  • Lack of collaboration: code [or content] activities aren’t done together, and new contributors aren’t properly mentored.
  • Lack of process or understanding: organization is blind to debt issues and makes code [or content] without understanding long-term implications beyond the immediate need.
  • Parallel development: parallel efforts, in isolation, to create similar code [or content] result in debt for time to merge things later (e.g., multiple units creating their own (redundant) pages about how to renew books, where to pay fines, how to use ILL, etc.).
  • Delayed refactoring: as a project evolves and issues with code [or content] become unwieldy, the longer remediation is delayed and more code [or content] is added, the debt becomes exponentially larger.
  • Lack of a test suite: results in release of messy code [or content] (e.g., I once worked on a large website with no pre-release environment for testing or training which resulted in a TON of published pages that said things like “looks like I can put some text here”).
  • Lack of ownership: outsourced software [or content] efforts result in extra effort to fix and recreate properly (e.g., content outsourced to interns).
  • Lack of leadership: technical [or UX/content strategy] leadership isn’t present or doesn’t train/encourage/enforce adherence to coding [or content] standards/best practices.

I also find this list useful because when talking about content issues, there’s a risk of seeming judgmental towards the individuals who made said content– but the reality is that there are tons of factors that lead to this “debt” situation. Approaching the problem from all the angles will lead to a more well-rounded solution.

Open Knowledge Foundation: Introducing Viderum

Mon, 2016-02-15 10:00

Ten years ago, Rufus started CKAN as an “apt-get for data” in order to enable governments and corporations to provide their data as truly open data. Today, CKAN is used by countless open data publishers around the globe and has become the de facto standard.

With CKAN as the technical foundation, Open Knowledge has offered commercial services to governments and public institutions within its so-called Services division for many years. Some of the most prominent open data portals around the world have been launched by the team.

Today, we’re spinning off this division into its own company: Viderum.

We’re doing this because we want to lend a stronger focus on further development and promotion of these services without distracting Open Knowledge’s core mission as an advocate for openness and transparency. We’ve also heard from our customers that they are asking for a commercial-grade service offering that is best realized in an organization dedicated to that end.

Viderum’s mission will be simple: to make the world’s public data discoverable and accessible to everyone. They will provide services and products to further expand the reach of open data around the world.

Sebastian Moleski, CEO of Viderum, says:

I’m personally very excited about this opportunity to bring open data publishing to the next level. In all reality, the open data revolution has only just begun. As it moves further, it is imperative to build on core principles of openness and interoperability. When it comes to open data, there is no good reason to use closed, proprietary, and expensive solutions that tie governments and public institutions to particular vendors. Viderum will help prove that point again and again.

As a first step in fulfilling their mission, Viderum is offering a cloud-based, multi-tenant solution to host CKAN that has been live since mid-November. This allows anyone to get their own CKAN instance and publish data without the hassle, cost, and learning curve involved in setting one up individually. By lowering technological barriers, we believe there are now even more reasons for governments, institutions, and local authorities to publish open data for everyone’s use.

Viderum has set up an office in Berlin and is currently hiring developers! If you know anyone who’s passionate about building software and the infrastructure for open data around the world, please pass the link along to them.

To find out more about Viderum, check out their website, read the FAQ or contact the team at


FOSS4Lib Recent Releases: Koha - 3.22.3

Mon, 2016-02-15 09:53
Package: Koha
Release Date: Friday, February 12, 2016

Last updated February 15, 2016. Created by David Nind on February 15, 2016.

Koha 3.22.3 is a security release. It includes one security fix, four enhancements and 57 bug fixes.

As this is a security release, we strongly recommend anyone running Koha 3.22.* to upgrade.

See the release announcements for the details:

Terry Reese: MarcEdit: Thinking about Charactersets and MARC

Sun, 2016-02-14 23:48

The topic of charactersets is likely something most North American catalogers rarely give a second thought to.  Our tools and systems are all built around a very Anglo-centric world-view that assumes data is primarily structured in MARC21 and recorded in either MARC-8 or UTF-8.  However, when you get outside of North America, the question of characterset, and even MARC flavor for that matter, becomes much more relevant.  While many programmers and catalogers who work with library data would like to believe that most data follows a fairly regular set of common rules and encodings, the reality is that it doesn’t.  MARC21 is the primary MARC encoding for North American and many European libraries, but it is just one of around 40+ different flavors of MARC; and while MARC-8 and UTF-8 are the predominant charactersets in libraries coding in MARC21, move outside of North America and OCLC and you will run into Big5, Cyrillic (codepage 1251), Central European (codepage 1250), ISO-5426, Arabic (codepage 1256), and a range of other localized codepages in use today.  So while UTF-8 and MARC-8 are the predominant encodings in countries using MARC21, a large portion of the international metadata community still relies on localized codepages when encoding their library metadata.  And this can be a problem for any North American library looking to utilize metadata encoded in one of these local codepages, or to share data with a library utilizing one of them.

For years, MarcEdit has included a number of tools for handling this soup of character encodings – tools that work at different levels to allow the tool to handle data from across the spectrum of different metadata rules, encodings, and markups.  These get broken into two different types of processing algorithms.

Characterset Identification:

This algorithm is internal to MarcEdit and vital to how the tool handles data at a byte level.  When working with file streams for rendering, the tool needs to decide if the data is in UTF-8 or something else (for mnemonic processing); otherwise, data won’t render correctly in the graphical interface.  For a long time (and honestly, this is still true today), the byte in the LDR of a MARC21 record that indicates whether a record is encoded in UTF-8 simply hasn’t been reliable.  It’s getting better, but a good number of systems and tools simply forget (or ignore) this value.  More important for MarcEdit, this value is only useful for MARC21: the encoding byte sits in a different field/position within each different flavor of MARC.  In order for MarcEdit to handle this correctly, a small, fast algorithm needed to be created that could reliably identify UTF-8 data at the binary level.  And that’s what’s used: a heuristic algorithm that reads bytes to determine if the characterset might be UTF-8 or something else.
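
For MARC21 specifically, that encoding flag lives at position 9 of the record leader: 'a' means UCS/Unicode (UTF-8), and a blank means MARC-8. A minimal Python sketch of that check (the function name and sample leaders are mine; as noted above, the flag can only be treated as a hint, since many systems set it incorrectly):

```python
def leader_claims_utf8(record: bytes) -> bool:
    """Return True if a MARC21 record's leader flags it as UTF-8.

    Leader position 9 holds the character coding scheme:
    'a' = UCS/Unicode (UTF-8), ' ' (blank) = MARC-8.
    """
    if len(record) < 24:
        raise ValueError("record is shorter than a 24-byte MARC21 leader")
    return record[9:10] == b"a"

# A fabricated leader with 'a' in position 9 (claims UTF-8):
print(leader_claims_utf8(b"00026nam a2200000 a 4500"))  # True
# The same leader with a blank in position 9 (claims MARC-8):
print(leader_claims_utf8(b"00026nam  2200000 a 4500"))  # False
```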

Might be?  Sadly, yes.  There is no way to reliably auto-detect a characterset.  It just can’t happen.  Each codepage reuses the same codepoints; they just assign different characters to those codepoints depending on which encoding is in use.  So a tool won’t know how to display textual data without first knowing the set of codepoint rules the data was encoded under.  It’s a real pain in the backside.

To solve this problem, MarcEdit uses the following code in an identification function:

    int x = 0;
    int lLen = 0;
    int iEType = RET_VAL_ANSI;

    try {
        while (x < p.Length) {
            if (p[x] <= 0x7F) {                    // plain ASCII byte
                x++;
                continue;
            }
            else if ((p[x] & 0xE0) == 0xC0) { lLen = 2; }  // 110xxxxx: 2-byte sequence
            else if ((p[x] & 0xF0) == 0xE0) { lLen = 3; }  // 1110xxxx: 3-byte sequence
            else if ((p[x] & 0xF8) == 0xF0) { lLen = 4; }
            else if ((p[x] & 0xFC) == 0xF8) { lLen = 5; }
            else if ((p[x] & 0xFE) == 0xFC) { lLen = 6; }
            else {
                return RET_VAL_ANSI;               // invalid UTF-8 lead byte
            }

            // Every continuation byte must match the 10xxxxxx pattern.
            while (lLen > 1) {
                x++;
                if (x >= p.Length || (p[x] & 0xC0) != 0x80) {
                    return RET_VAL_ERROR;
                }
                lLen--;
            }
            iEType = RET_VAL_UTF_8;                // saw a valid multi-byte sequence
            x++;
        }
    } catch (System.Exception) {
        iEType = RET_VAL_ERROR;
    }
    return iEType;

This function allows the tool to quickly evaluate any data at a byte level and identify if that data might be UTF-8 or not.  Which is really handy for my usage.
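
In a higher-level language, the same byte-walking check can be approximated by simply attempting a strict UTF-8 decode: malformed multi-byte sequences fail, while well-formed ones (and pure ASCII, which the code above likewise treats as ANSI-compatible) pass. A hedged Python sketch, not MarcEdit's actual code:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Rough equivalent of the lead-byte/continuation-byte walk above:
    a strict decode succeeds only if every multi-byte sequence is
    well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("Les Éditions".encode("utf-8")))   # True
print(looks_like_utf8("Les Éditions".encode("cp1252")))  # False: 0xC9 is a
# lead byte expecting a 10xxxxxx continuation, but 'd' (0x64) follows it
```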

Character Conversion

MarcEdit has also included a tool that allows users to convert data from one character encoding to another.

This tool requires users to identify the original characterset encoding of the file to be converted.  Without that information, MarcEdit would have no idea which set of rules to apply when shifting the data around, based on how characters have been assigned to their various codepoints.  Unfortunately, a common problem that I hear from librarians, especially librarians in the United States who don’t have to deal with this problem regularly, is that they don’t know the file’s original characterset encoding, or how to find it.  It’s a common problem, especially when retrieving data from some Eastern European and Asian publishers.  In many of these cases, users send me files and, based on my experience looking at different encodings, I can make a couple of educated guesses and generally figure out how the data might be encoded.
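
The conversion itself is mechanical once the source encoding is known; the hard part, as described above, is identifying it. A minimal Python sketch (the Czech sample text and codepage choice are mine):

```python
# Bytes as they might arrive from a Central European publisher,
# encoded in Windows-1250 (codepage 1250).
raw = "Karel Čapek: Válka s mloky".encode("cp1250")

# Decode using the (correctly identified) source codepage...
text = raw.decode("cp1250")

# ...then re-encode as UTF-8. Note that decoding with the *wrong*
# single-byte codepage would not raise an error here -- it would
# silently produce mojibake, which is exactly why identification matters.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.decode("utf-8"))  # Karel Čapek: Válka s mloky
```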

Automatic Character Detection

Obviously, it would be nice if MarcEdit could provide some kind of automatic characterset detection.  The problem is that this is a process that is always fraught with errors.  Since there is no way to definitively determine the characterset of a file or data simply by looking at the binary data – we are left having to guess.  And this is where heuristics comes in again.

Current-generation web browsers automatically set character encodings when rendering pages.  They do this based on the presence of metadata in the header, information from the server, and a heuristic analysis of the data prior to rendering.  This is why everyone has seen pages that the browser believes are in one character set but are actually in another, making the data unreadable when it renders.  However, the process that browsers currently use is, as sad as this may be, the best we’ve got.

And so, I’m going to be pulling this functionality into MarcEdit.  Mozilla has made the algorithm that they use public, and some folks have ported that code into C#.  The library can be found on GitHub here:  I’ve tested it; it works pretty well, though it is not even close to perfect.  Unfortunately, this type of process works best when you have lots of data to evaluate, but most MARC records are just a few thousand bytes, which just isn’t enough data for a proper analysis.  However, it does provide something — and maybe that something will give users working with data in unknown character encodings a way to actually figure out how their data might be encoded.
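
The reason any detector can only ever guess is easy to demonstrate: the same bytes decode without error under most single-byte codepages, and only statistics over the decoded text can suggest which one was intended. A small Python illustration (the sample text is mine):

```python
# "Darwin" in Russian, encoded with the Cyrillic codepage 1251.
mystery = "Дарвин".encode("cp1251")

# Every one of these decodes "successfully" -- no exception is raised --
# but only the first produces the intended text.
print(mystery.decode("cp1251"))  # Дарвин  (correct)
print(mystery.decode("cp1252"))  # Äàðâèí  (mojibake, silently)
print(mystery.decode("cp1250"))  # mojibake again, also silently
```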

The new character detection tools will be added to the next official update of MarcEdit (all versions).

And as I noted, this is a tool that will be added to give users one more tool for evaluating their records.  While detection may still only be a best guess, it’s likely a pretty good guess.

The MARC8 problem

Of course, not all is candy and unicorns.  MARC-8, the lingua franca for a wide range of ILS systems and libraries — well, it complicates things.  Unlike many of the localized codepages, which are actually well-defined standards in use by a wide range of users and communities around the world, MARC-8 is not.  MARC-8 is essentially a made-up encoding; it simply doesn’t exist outside the small world of MARC21 libraries.  To a heuristic parser evaluating character encoding, MARC-8 looks like one of four different charactersets: US-ASCII, codepage 1252, ISO-8859, and UTF-8.  The problem is that MARC-8, as an escape-based encoding, reuses parts of a couple of different encodings.  This really complicates the identification of MARC-8, especially in a world where other encodings may (and probably will) be present.  To that end, I’ve had to add a secondary set of heuristics that evaluates data after detection, so that if the data is identified as one of these four types, some additional evaluation is done looking specifically for MARC-8’s fingerprints.  This allows, most of the time, for MARC-8 data to be correctly identified — but again, not always.  It just looks too much like other standard character encodings.  Again, it’s a good reminder that this tool is just a best guess at the characterset encoding of a set of records, not a definitive answer.

Honestly, I know a lot of people would like to see MARC as a data structure retired.  They write about it, talk about it, hope that BIBFRAME might actually do it.  I get their point: MARC as a structure isn’t well suited for the way we process metadata today.  Most programmers simply don’t work with formats like MARC, and fewer tools exist that make MARC easy to work with.  Likewise, most evolving metadata models recognize that metadata lives within a larger context, and they take advantage of semantic linking to encourage the linking of knowledge across communities.  These are things libraries would like in their metadata models as well, and libraries will get there, though I think in baby steps.  When you consider what a train wreck RDA adoption and development was for what we got out of it (at a practical level), making a radical move like BIBFRAME will require a radical change (and maybe an event that causes that change).

But I think that there is a bigger problem that needs more immediate action.  The continued reliance on MARC-8 poses a bigger threat to the long-term health of library metadata.  MARC, as a structure, is easy to parse.  MARC-8, as a character encoding, is essentially a virus, one that we are continuing to let corrupt our data and lock it away from future generations.  The sooner we can toss this encoding on the trash heap, the better it will be for everyone — especially since we are likely only one generation away from losing the knowledge of how this made-up character encoding actually works.  And when that happens, it won’t matter how the record data is structured, because we won’t be able to read it anyway.


Suzanne Chapman: UX photo booth 2011 (My ideal library…)

Sat, 2016-02-13 23:15

A few weeks ago I helped out again with the MLibrary Undergraduate library’s annual “Party for Your Mind” event to welcome the students back and introduce new students to the library.

Like last year, I did a photo booth where I asked the students to complete the sentence “My ideal library ______” and like last year, I got a lovely combination of silly and serious responses. Quiet/Loud and food/sleeping were again popular themes!

My ideal library… loud & fun!

See the full set here.

Suzanne Chapman: MLibrary – by the numbers

Sat, 2016-02-13 23:14

Last fall I created some graphics for a slide show for our annual library reception event to demonstrate some of what we do via stats and graphics. This was such a fun side project and I couldn’t have done it without the data gathering help of Helen Look and others.

Full set here:

Suzanne Chapman: Guiding principles for a shared understanding

Sat, 2016-02-13 17:49

One of the biggest things I’ve learned over the years in working on library web projects is that you can’t make assumptions that the teams, internal stakeholders, or higher-ups have a shared understanding of how the website work is done and priorities are set. Often, by the time this is figured out, a great deal of time and energy has been spent going around and around in circles. Establishing a set of shared guiding principles is a great way to make sure everyone is on the same page, or, in the worst case scenario, establish that the team isn’t actually able to agree so you can figure out where to go from there.

Guiding principles can be anything you want but I think they work best when they balance theoretical with practical. My approach to developing my guiding principles began by taking a step back to think about all the challenges, roadblocks, and repetitive conversations that naturally tend to occur in the process of designing and developing interfaces. With universal design principles and UX best practices in mind, I then developed themes and a vision for how we should be operating.

Library Web Presence – Guiding Principles

Having just started a new job in August at the University of Illinois Library, I decided to share this with the new team I’m working with to help jump start conversations and get a better understanding of how they like to work. After some great conversation and minor language adjustments to fit this new context, we were able to unanimously agree that we should formally adopt it.

I hope that others find it useful and make their own versions as well! If you do, please let me know!

William Denton: The Triple Staple

Sat, 2016-02-13 03:10

On Wednesday 10 February 2016, at about 3:45 pm, at the ref desk at Steacie, I hit a new record: three consecutive questions about refilling staplers. We have three staplers here and they all emptied within minutes of each other. Three questions, three different staplers. I call this the Triple Staple.

“Ready, aye, ready”

I’ve had a few Double Staples in my time—nothing to brag about, really—but this was a personal best I will never forget.

Has anyone ever done a Quadruple Staple? Discussion among my colleagues revealed that at least one person refills the staplers at the start of his shift, and hence he has never even had a Double Staple. This was new to me: certainly no such practice was discussed at library school, but in our profession one continues to learn every year, and this is a fine example of how we share tacit knowledge. I am curious to know about your local practice.

SearchHub: Solr’s DateRangeField, How Does It Perform?

Sat, 2016-02-13 02:15
Solr’s DateRangeField

I have to credit David Smiley as co-author here. First of all, he’s largely responsible for the spatial functionality, and second, he’s been very generous in explaining some of the details here. Mistakes are my responsibility, of course. Solr has had a DateRangeField for quite some time (since 5.0; see SOLR-6103). DateRangeField is built on more of the magic of Solr Spatial and allows some very interesting ways of working with dates. Here are a couple of references to get you started: “Working with Dates” in the Solr Reference Guide, and “Spatial for Time Durations”.

About DateRangeField:
  • It is a fieldType
  • It supports friendlier date specifications; that is, you can form queries like q=field:[2000-11-01 TO 2014-12-01] or q=field:2000-11
  • It supports indexing a date range in a single field. For instance, a field could be added to a document in SolrJ as solrInputDocument.addField("dateRange", "[2000 TO 2014-05-21]") or in XML as <field name="dateRange">[2000 TO 2014-05-21]</field>
  • It supports multi-valued date ranges. This has always been difficult to do with Solr/Lucene. To index a range, one had to use two fields, say “date_s” and “date_e”. It was straightforward to query for docs spanning some date with something like q=date_s:[* TO target] AND date_e:[target TO *]. This worked fine if the document had only one range, but the approach falls down when two or more ranges are needed: if date_s and date_e have multiValued="true", the query above matches the doc if any entry in date_s is before the target date and any entry in date_e is after it, even when those entries belong to different ranges.
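To make that failure mode concrete, here is a tiny standalone sketch (plain Python, not Solr; the field names mirror the example above, and years stand in for full dates) of why the two-field approach breaks with multiple ranges:

```python
# Simulate the old two-field approach to multi-valued date ranges.
# A document stores range starts and ends in two parallel multi-valued fields.
doc = {"date_s": [2000, 2010], "date_e": [2005, 2015]}  # ranges 2000-2005 and 2010-2015

def naive_match(doc, target):
    """The old query: any start <= target AND any end >= target."""
    return any(s <= target for s in doc["date_s"]) and \
           any(e >= target for e in doc["date_e"])

def correct_match(doc, target):
    """What we actually want: some single range contains target."""
    return any(s <= target <= e
               for s, e in zip(doc["date_s"], doc["date_e"]))

# 2007 falls in the gap between the two ranges, yet the naive query
# still matches because start 2000 pairs up with end 2015.
print(naive_match(doc, 2007))    # True  (false positive)
print(correct_match(doc, 2007))  # False
```

DateRangeField sidesteps this by indexing each range as a single value, so no cross-pairing of starts and ends can occur.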

Minor rant: I really approve of Solr requiring full date specifications in UTC, but I admit it is sometimes a bit awkward, so the ability to specify partial dates is pretty cool. DateRangeField more naturally expresses some of the concepts we often need to support with dates in documents, for instance “this document is valid from dates A to B, C to D, and M to N”. There are other very interesting things that can be done with this “spatial” stuff; see Hossman’s Spatial for Non Spatial. Enough of the introduction. In the Reference Guide, there’s the comment “Consider using this [DateRangeField] even if it’s just for date instances, particularly when the queries typically fall on UTC year/month/day/hour etc. boundaries.” The follow-on question is “well, how does it perform?” I recently had to answer that question and realized I had no references, so I set out to make some. The result is this blog.
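For reference, wiring this up is small. The sketch below shows a minimal schema declaration and a few query shapes; the field name “validity” is illustrative, and the op local-param values (Intersects, Contains, Within) come from the Reference Guide:

```xml
<!-- schema sketch: declare the type, then a (possibly multi-valued) field of that type -->
<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="validity" type="dateRange" indexed="true" stored="true" multiValued="true"/>

<!-- example query shapes against the field:
     q=validity:2000-11                               any stored range intersecting Nov 2000
     q=validity:[2000-11-01 TO 2014-12-01]            intersects the given span (default op)
     q={!field f=validity op=Contains}[2000 TO 2014]  stored range contains the whole span
-->
```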


For this test, there are a few things to be aware of.

  • This test does not use the fancy range capabilities. There are some problems that are much easier if you can index a range, but this test is intended to evaluate the “just for date instances” use from the quote above, so it is somewhat apples-to-oranges. What it is intended to help evaluate is the consequences of using DateRangeField as a direct substitute for TrieDate (with or without DocValues).
  • David has a series of improvements in mind that will change some of these measurements, particularly the JVM heap necessary. These will probably not require re-indexing.
  • The setup has 25M documents in the index. A series of 1,000 different queries is sent to the server and the results tallied. Measurements aren’t taken until after 100 warmup queries have executed. Each group of 1,000 queries follows one of these patterns:
    • q=field:date. These are removed from the results since they aren’t interesting; the response times are all near 0 milliseconds after warmup.
    • simple q=field:[date1 TO date2]. These are not included in the graph either; they are all satisfied too quickly to be of consequence.
    • interval facets, facet=true&facet.range=field&facet.range.start=date1&facet.range.end=date2&facet.range.gap=+1DAY (or +1MINUTE, etc.).
    • 1-5 facet.query clauses where q=*:*
    • The setup is not SolrCloud, as that shouldn’t really impact the results.
  • The queries were run with 1, 10, 20, and 50 threads to see if there was some weirdness when the Solr instances got really busy. There wasn’t; the runs produced essentially the same graphs, so the graph below is for the 10-thread version.
  • DateRangeField was compared to:
    • TrieDate, indexed="true" docValues="false" (TrieDate for the rest of this document)
    • TrieDate, indexed="true" docValues="true" (DocValues for the rest of this document)
  • I had three cores, one for each type. Each core held identical, very simple documents: basically the ID field and the dateRange field (well, the _version_ field was defined too). For each test:
    • Only the core under test was active; the other two were not loaded (some trickery, if you must know)
    • At the end of each test I measured memory consumption, but the scale is too small to draw firm conclusions. What I _can_ report is that DateRangeField is not wildly different at this point. That said, see the filterCache notes in David’s comments below.
    • Statistics were gathered on an external client where QTimes were recorded
  • As the graph a bit later shows, DateRangeField out-performed both TrieDate and DocValues in general.
  • The number of threads made very little difference in the relative performance of DateRangeField vs. the other two. Of course the absolute response time will increase once enough threads are executing at once that the CPU gets saturated.
  • DateRangeFields have a fairly constant improvement when measured against TrieDate fields and TrieDate+DocValues.
  • The facet.range.method=dv option was not enabled on these tests. For small numbers of hits, specifying this value may well significantly improve performance, but this particular test uses a minimum bucket size of 1M which empirically is beyond the number of matches where specifying that parameter is beneficial. I’ll try to put together a follow-on blog with smaller numbers of hits in the future.
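To make the facet patterns in the list above concrete, here is roughly what the two request shapes look like (URL-style parameters, abridged and split across lines for readability; the field name and bucket sizes are illustrative):

```text
# interval (range) faceting, one bucket per day:
q=*:*&rows=0&facet=true
   &facet.range=dateRange
   &facet.range.start=2000-01-01T00:00:00Z
   &facet.range.end=2015-01-01T00:00:00Z
   &facet.range.gap=%2B1DAY        (+1DAY, URL-encoded)

# the facet.query variant, one explicit clause per bucket:
q=*:*&rows=0&facet=true
   &facet.query=dateRange:[2000-01-01T00:00:00Z TO 2005-01-01T00:00:00Z]
   &facet.query=dateRange:[2005-01-01T00:00:00Z TO 2010-01-01T00:00:00Z]
```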
The Graph

These will take a little explanation. Important notes.

  • The interval and query facets are over the entire 25M documents. These are the points on the extreme right of the graph. These really show that for interval and query facets, in terms of query time, the difference isn’t huge.
  • The rest of the marks (0-24 on the X axis) are performance over hits of that many million docs for day, minute and millisecond ranges. So a value of 10 on the x axis is the column for result sets of 10M documents for TrieDate and TrieDate+DocValues.
  • The few marks above 1 (100%) are instances where DateRangeFields were measured as a bit slower. This may be a test artifact.
  • The Y-axis is the percentage of the time the DateRange fields took vs. TrieDate (green Xs) and TrieDate+DocValues (red Xs).

Index and memory size

The scale is too small to report on index and memory differences. At this size (25M date fields), the difference between index only and docValues in both memory and disk sizes (not counting DateRangeField) was small enough that it was buried in the noise so even though I looked at it, it’s misleading at best to report, say, a difference of 1%. See David’s comments below. We do know that DocValues will increase the on-disk index size and decrease JVM memory required by roughly the size of the *.dvd files on disk.

David Smiley’s enrichment

Again, thanks to David for his tutorials. Here are some things to keep in mind:

  • The expectation is that DateRangeField should be faster than TrieDateField for ranges that are aligned to units of a second or coarser than that; but perhaps not any coarser than a hundred years apart.
  • So if you expect to do range queries from some random millisecond to another, you should continue to use TrieDate; otherwise consider DateRangeField.
  • [EOE] I have to emphasize again that DateRangeField has applications to types of queries that were not exercised by this test. There are simply problems that are much easier to solve using DateRangeField. This exercise was just to stack DateRangeField up against the other variants.
  • TrieDate+DocValues does not use the filterCache at all, whereas TrieDate-only and DateRangeField do. At present there isn’t a way to tell DateRangeField not to use the filterCache. That said, one of the future enhancements is to enable facet.range with DateRangeField to use the low-level facet implementation shared with spatial heatmap faceting, which would result in DateRangeField not using the filterCache.
    • [EOE] If you want more details on this, ask David, he’s the wizard. I’ll add that the heatmap stuff is very cool, I saw someone put this in their application in 2 hours one day (not DateRangeField, just a heatmap). Admittedly the browser display bits were an off-the-shelf bit of code.
  • Another thing on the radar worth mentioning is the advent of “PointValues” (formerly known as DimensionalValues) in Lucene 6. It would stack up like a much faster TrieDateField (without DocValues).
  • Discussions pertaining to memory use or realtime search mostly just apply to facet.range. For a plain ol’ range-query search, docValues don’t even apply and there are no memory requirements or concerns.
Closing remarks

As always, your mileage may vary when it comes to using DateRangeFields.

  • For realtime searches, docValues are preferred over both TrieDate-only and DateRangeFields, although in the future that may change.
  • As more work is done here more functionality will be pushed down into the OS’s memory so the JVM usage by DateRangeField will be reduced.
  • If your problem maps more fully onto the enhanced capabilities of DateRangeField, it should be preferred. Performance will not suffer (at least as measured by these tests), but you will pay a memory cost over TrieDate+DocValues.
  • I had the filterCache turned off for this exercise. This is likely a closer simulation of NRT setups; in a relatively static index, DateRangeField’s use of the filterCache needs to be evaluated in your specific situation to determine the consequences.

Originally, I wanted to compare memory usage, disk space etc. There’s a tendency to try to pull information that just isn’t there out of a limited test. After I dutifully gathered many of those bits of information I realized that… there wasn’t enough information there to extract any generalizations from. Anything I could say based on this data other than what David provided and what I know to be true (e.g. docValues increase index size on disk but reduce JVM memory) would not be particularly relevant… As always, please post any comments you have, especially Mr. Smiley! Erick Erickson


Code4Lib: Systems and Web Services Librarian

Fri, 2016-02-12 18:56

Concordia College is seeking a collaborative, innovative, and service-oriented Systems and Web Services Librarian who will have primary responsibility for the library’s technology infrastructure (library management system, digital collections, discovery solutions, and related tools) and integration with campus systems (Moodle, campus content management system, financial and student systems). This 10-month faculty position will begin August 2016.

The successful candidate will have a critical role in ensuring that library systems support and enhance student learning. The primary responsibility of the position will be oversight of the technical aspects of library systems across functional areas including acquisitions, cataloging, circulation, serials, digital collections, and metadata. This responsibility will include management of the library website and collaboration with other library staff on usability and the development of virtual resources and services.

The successful candidate will work collaboratively with campus IT and Communications Departments to ensure efficiency and interoperability among library and campus systems.

The successful candidate will take a leadership role in implementing and evaluating emerging technologies and services as they pertain to the library.

As a secondary level of responsibility, the successful candidate will participate, at some level, in the standard duties of all librarians (reference services, library instruction, liaison to academic departments, collection development, and outreach).

Within the tradition of the liberal arts, Concordia College is dedicated to student learning and becoming responsibly engaged in the world. Current initiatives at the college include integrative learning, online learning, and digital humanities. The Carl B. Ylvisaker Library is located in the heart of campus and is recognized for its emphasis on teaching and engagement with student research. The library maintains a complex array of integrated systems, resources, and services in support of its mission.

Minimum Qualifications:
• ALA-accredited master’s degree or equivalent
• Experience working with library systems
• Knowledge of database structures, creation, application and maintenance
• Demonstrated knowledge of emerging technologies
• A demonstrated commitment to continuous learning
• A strong commitment to user services
• Strong interpersonal and communication skills
• Commitment to working in a collaborative environment

Preferred Qualifications
• Academic library experience
• Experience in an online learning environment
• Knowledge of instructional design
• Knowledge of user analysis and usability testing
• Knowledge of web design and scripting languages
• Project management skills
• Reference and instruction experience
• STEM background

Screening begins 02/26/2016

District Dispatch: Copyright Abandonment

Fri, 2016-02-12 17:10

Could the House Judiciary subcommittee consider tailoring back copyright for the greater good or will they abandon the public interest and leave it, like an old couch, in the front yard?

The Cato Institute sponsored a program yesterday, Intellectual Property and First Principles, on the differing opinions of conservatives and libertarians on intellectual property [sic]. The conservative argued that copyright was a natural right and therefore intellectual property was indeed property worthy of government protection. The libertarian, on the other hand, argued that copyright is a creation of Congress— a grant of limited monopoly rights. Just like the post office (Article I, Section 8, Clause 9), copyright (Article I, Section 8, Clause 8) is a government program.

I agreed with the libertarian. This means that Congress determines if copyright even exists. Congress has the authority to change the law, although most changes seem to benefit only a select group of people (consider term extension, DMCA, “automatic” copyright protection and so on).

There was also some discussion about abandoning property. People in my neighborhood tend to abandon property by leaving beds, couches and Lazy Boys in their front yards in the hope that someone will take the property, saving them a trip to the dump. Sometimes to encourage the taking and clarify intent, a sign will be posted or pinned to the property, “free” (or if you really want to get rid of it, “$10.”) This led to a pondering: if copyright is property, can you abandon it? If copyright is a natural right, how do you abandon yourself? After flashing to the old Calgon commercial (“take me away”), I was thinking about people just lying in their front yards, “Take me, please. Save me a trip to the dump. Free.”

Putting all silliness aside, the real reason for this post is a reminder that copyright exists “to advance the progress of science and useful arts.” The Framers believed that by giving creators monopoly rights, they would more likely make their creative works and inventions available to the public thereby benefiting society as a whole. Therefore, exclusive rights should only be as broad as they need be to incentivize creation. It is as simple as that. During this multi-year copyright review, might the House Judiciary subcommittee consider tailoring back copyright for the greater good or will they abandon the public interest and leave it in the front yard?

The post Copyright Abandonment appeared first on District Dispatch.