planet code4lib

Open Knowledge Foundation: Global Open Data Index 2015 – United Kingdom Insight

Wed, 2015-12-09 09:12

This post was written by Owen Boswarva

For the third year running the United Kingdom has come out at or near the top of the Global Open Data Index. Unlike many of the countries that did well in previous years, the UK’s overall standing has not been greatly affected by the addition of five new categories. This demonstrates the broad scope of the UK’s open data programme. Practitioners within UK government who work to develop and release open datasets have much to be proud of.

However the UK’s role as an open data leader also carries the risk of overconfidence. Policymakers can easily be tempted to rest on their laurels. If we look in more detail at this year’s submissions we can find plenty of learning points and areas for further development. There are also some signs the UK open data agenda may be losing momentum.

The biggest gap this year is in election results data, with the Electoral Commission dataset disqualified because it only reports down to constituency level. The criteria have changed from previous years, so this decision may seem a little harsh. But globally most electoral fraud takes place at the polling station. The UK is a mature democracy and should set an example by publishing more granular data.

There is a similar weakness in UK public data on water quality, which is available only at a high level in annual reports from regulators. Environmental data in general has been a mixed bag in 2015. Ordnance Survey, which maps most of the UK, published the first detailed open map of the river network; and the environment department Defra announced an ambition to release 8,000 open datasets. However there is a noticeable absence of open bulk data for historical weather observations and air pollution measurements.

UK progress on open data is also held back by the status of land ownership data. Ownership records and land boundaries are maintained by Land Registry and other government agencies. But despite (or perhaps because of) the considerable public interest in understanding how property wealth is distributed in the UK, this invaluable data is accessible only on commercial terms.

In other categories we can see deteriorations in the quality of UK open data.

National Archives is struggling to maintain its much-admired dataset. The latest version of Contracts Finder, an open search facility for public sector procurement contracts, no longer offers bulk downloads. Government digital strategy is turning steadily towards APIs and away from support for analytic re-use of public data.

Can the UK sustain its record of achievement in open data policy? Most of the central funding streams that supported open data release in recent years came to an end in 2015. A number of user engagement groups and key initiatives have either been wound up or left to drift. Urban and local open data hubs are thriving, but political devolution and lack of centralised collection are creating regional disparities in the availability of open data. Truly national datasets, those that help us understand the UK as a nation, are becoming harder to find.

UK open data policy may play well on the international stage, but at home there is still plenty of work for campaigners to do.

Open Knowledge Foundation: The Global Open Data Index 2015 is live – what is your country status?

Wed, 2015-12-09 08:22

We are excited to announce that we have published the third annual Global Open Data Index. This year’s Index showed impressive gains from non-OECD countries with Taiwan topping the Index and Colombia and Uruguay breaking into the top ten at four and seven respectively. Overall, the Index evaluated 122 places and 1586 datasets and determined that only 9%, or 156 datasets, were both technically and legally open.

The Index ranks countries based on the availability and accessibility of data in thirteen key categories, including government spending, election results, procurement, and pollution levels. Over the summer, we held a public consultation, which saw contributions from individuals within the open data community as well as from key civil society organisations across an array of sectors. As a result of this consultation, we expanded the 2015 Index to include public procurement data, water quality data, land ownership data and weather data; we also decided to remove transport timetables due to the difficulties faced when comparing transport system data globally.

Open Knowledge International began to systematically track the release of open data by national governments in 2013 with the objective of measuring whether governments were releasing the key datasets of high social and democratic value as open data. That enables us to better understand the current state of play and in turn work with civil society actors to address the gaps in data release. Over the course of the last three years, the Global Open Data Index has become more than just a benchmark – we noticed that governments began to use the Index as a reference to inform their open data priorities, and civil society actors began to use the Index as an advocacy tool to encourage governments to improve their performance in releasing key datasets.

That said, indices such as the Global Open Data Index are not without their challenges. The Index measures the technical and legal openness of datasets deemed to be of critical democratic and social value – it does not measure the openness of a given government. It should be clear that the release of a few key datasets is not a sufficient measure of the openness of a government. The blurring of lines between open data and open government is nothing new and has been hotly debated by civil society groups and transparency organisations since the sharp rise in popularity of open data policies over the last decade.

While the goal of the Index has never been to measure the openness of governments, we have been working in collaboration with others to make the Index more than just a benchmark of data release. This year, by collaborating with topical experts across an array of sectors, we were able to improve our dataset category definitions to ensure that we are measuring data that civil society groups require rather than simply the data that governments happen to be collecting.

Next year we will be doubling down on this effort to work in collaboration with topical experts to go beyond a “baseline” of reference datasets which are widely held to be important, to tracking the release of datasets deemed critical by the civil society groups working in a given field. This effort is both experimental and ambitious. Measuring open data is not trivial and we are keenly aware of the balance that needs to be struck between international comparability and local context and we will continue to work to get this balance right. Join us on the Index forum to join these future discussions.

Open Knowledge Foundation: Government contracts: Still a long way from open

Wed, 2015-12-09 07:55

This blog post was written by Georg Neumann from Open Contracting

Global Open Data Index: State of disclosure and data of government contracts

For the first time ever, the Global Open Data Index is assessing open data on tenders and awards in this year’s index. This is crucial information. Government deals with companies amount to $9.5 trillion globally. That’s about 15% of global GDP. Schools, hospitals, roads, street lights, paper clips: all of these are managed through public procurement.

Public procurement is the number one risk of corruption and fraud in government. Too often, when government and business meet, public interest is not the highest priority. Scandals from failed contracting processes abound: ‘tofu’ schools, constructed to substandard specifications in an earthquake zone, that fell down on their students; provision of fake medicine and medical equipment that kills patients; phantom contracts siphoning off billions of dollars dedicated to national security; or kick-back schemes in contracts that steal direly needed monies from school children.

To make sure this money is spent fairly and honestly, it is essential that governments disclose how much they spend, when, and with whom.

This is why we were delighted that the Open Contracting Partnership was asked to evaluate the submissions from the more than 120 countries participating in this year’s Index. The Index assesses whether monthly information on both tenders and awards is available and whether basic information is disclosed. This should include the name, description and status of a tender, as well as the name, description, value and supplier information of an award or a contract. Ideally, this information should not only be public, but also machine-readable and downloadable as a complete dataset.
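As a rough illustration of that disclosure checklist, here is a minimal Python sketch. The field names paraphrase the Index criteria described above; they are not an official schema, and the sample records are invented:

```python
# Basic disclosure fields the Index looks for (paraphrased, not an official schema).
TENDER_FIELDS = {"name", "description", "status"}
AWARD_FIELDS = {"name", "description", "value", "supplier"}

def missing_fields(record, required):
    """Return which of the basic disclosure fields are absent from a record."""
    return sorted(required - record.keys())

# Invented sample records for illustration:
tender = {"name": "School meals 2016", "description": "Catering", "status": "open"}
award = {"name": "School meals 2016", "value": 120000, "supplier": "Acme Catering"}

assert missing_fields(tender, TENDER_FIELDS) == []            # fully disclosed
assert missing_fields(award, AWARD_FIELDS) == ["description"]  # one gap
```

A real assessment would, of course, also check machine-readability, licensing, and bulk availability, which a field check alone cannot capture.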

Here are some of the highlights and aggregated findings of what we learned.

The bad news first: Government contracts are a long way from being open. Less than 10% of the countries surveyed provide timely, machine-readable and openly licensed data on tenders and awards of government contracts. Without this information available to the public, there is little opportunity to scrutinise and monitor public spending.

Now the good news: We can see a trend towards publishing more data and doing so in an open format. Eleven countries is definitely progress, and it is clear that some global champions are beginning to emerge. Some of the credit, we hope, could go to the Open Contracting Data Standard, which is being implemented in countries such as Canada and Paraguay and in the ProZorro procurement portal currently being developed in Ukraine. Other countries are implementing open data on contracting after committing to it in their National Action Plans under the Open Government Partnership, such as Colombia, Mexico and Romania.

The Index doesn’t score quality, so it is important to keep this in mind when looking at the data. At the Open Contracting Partnership, we are aware that the quality of disclosed data on tenders and contracts varies strongly. The better the quality of contract information and the more recent it is, the more valuable it is for evaluation, analysis, and monitoring. While the timeliness of the data was in general relatively good, the quality of the open data is still problematic.

Digiwhist, a research initiative, has looked at the amount of information actually available in tenders and contracts in the EU and found that common IDs for matching tender and award data, government agencies and businesses are rarely available, a key quality issue with this data.

Other findings of the Open Data Index include:

More countries publish tenders rather than awards data: Focusing on publishing tender data makes business sense, as more competition drives savings. In Slovakia, opening up the procurement process led to a doubling of the number of bidders and reduced costs. But to be able to track public spending, matching awards to tenders is crucial. Over 100 countries and places now publish both tender and award data.

Information can be public, but is not necessarily open. The majority of countries have well-developed e-procurement platforms through which they publish information. In most cases, these systems are closed and so is the data. You could get a lot more public openness from a relatively small investment in publishing the data in reusable, accessible formats. Contract data also needs to be aggregated to ensure all information is available in one place; in many countries only some sectors of government publish tenders and awards data.

Countries shouldn’t hide behind official thresholds above which contracts must be published. All contracts should be made available publicly. However, thresholds for publication of tenders and awards vary strongly by country. EU countries, for example, have to publish tenders above specific amounts via the electronic tender database TED. Systems for publishing nationally or locally are much less standardised. This allows all EU submissions to qualify as publishing open procurement data even though some countries, such as Germany, do not publish award values for contracts below those thresholds, and others have closed systems for accessing specific information on awarded contracts.

Timeliness and completeness of downloadable data on awarded contracts is an issue. Formats for downloads are diverse as well, ranging from a PDF download to a full set of XML or JSON data. We have given brownie points to the 18 countries that publish some bulk downloads, but this criterion definitely needs tightening. To qualify, datasets needed to be updated more frequently than yearly to ensure the data is relevant for analysing recent contracts.

Lots of work is needed, as these results of the Index show. Looking forward, we at the Open Contracting Partnership believe that next year’s results can only be improved by implementing the following policies.

Open data on contracts is a key dataset that adds value for all stakeholders involved. This is why contracts should not be legal and enforceable until published. Too few countries have made publishing contract data open by default. Credit to those who have, such as Colombia, Georgia, Slovakia and, hopefully, the Czech Republic.

Public procurement is a process that starts at the planning stage and ends only when the final service or construction has been delivered. To get the most out of open data, engagement and feedback channels need to be available to get problems fixed. Data on complaints and contract adjustments needs to be made public as well.

By publishing more data on government contracts, citizens will be able to follow the money from planning to implementation of policies, and from the budget to the actual contracts, showing how taxes are being spent and how this benefits citizens.

While this year’s Open Data Index, in its first assessment of open data on awards and tenders, highlights many deficiencies, some positive trends can be detected where information has been made public. Everyone has something to gain from open and better quality data on government contracts. Governments receive more value for money in their investments. Businesses, especially small and medium enterprises and minority-owned enterprises, gain access to a fair marketplace of contracting opportunities with government. Citizens can evaluate how money is spent and managed. Going the extra mile to make government deals with companies truly open might be just the way to show the real impact of open data.

Library of Congress: The Signal: Acquiring at Digital Scale: Harvesting the Collection

Tue, 2015-12-08 21:23

This post was originally published on the Folklife Today blog, which features folklife topics, highlighting the collections of the Library of Congress, especially the American Folklife Center and the Veterans History Project.  In this post, Nicole Saylor, head of the American Folklife Center Archive, talks about the mobile app and interviews Kate Zwaard and David Brunton, who manage the Library of Congress’s Digital Library software development, about the transfer of the StoryCorps collection to the Library.

Thanks to The Great Thanksgiving Listen, the StoryCorps collection of interviews has doubled! Since the launch of the mobile app in March, more than 68,000 interviews have been uploaded as of today—the vast majority of them in the few days following Thanksgiving.

The American Folklife Center at the Library of Congress is the archival home for StoryCorps, one of the largest oral history projects in existence, and the interviews are regularly featured on NPR’s Morning Edition. Since 2003, StoryCorps has collected more than 50,000 interviews via listening booths throughout the country.

However, with the advent of the mobile app, StoryCorps has created a global platform where anyone in the world can record and upload an oral history interview. This effort is a wish come true for StoryCorps founder Dave Isay, the 2015 recipient of the TED Prize. The prize comes with $1 million to invest in a powerful idea. Dave’s was to create an app with a companion website.

The surge in mobile interviews is the result of StoryCorps’ efforts to partner with major national education organizations and a half dozen of the nation’s biggest school districts. More than 20,000 teacher toolkits were downloaded, the initiative was featured on the homepage of Google, and in both the Apple and Google app stores, not to mention remarkable media coverage. (There is already talk of The Great Thanksgiving Listen 2016.)

At the Library, we are able to meet the challenge of acquiring tens of thousands of interviews at a time thanks to the ability to harvest them via the web. The process involves using StoryCorps’ application programming interface (API) to download the data. For the last several months, Kate Zwaard and David Brunton, who manage the Library of Congress’s Digital Library software development, have been working with Dean Haddock, StoryCorps’ lead developer, to perfect this means of transfer. This interview with Zwaard and Brunton explains that process, provides advice for those who want to do similar projects, and ponders the future of scaling archival acquisitions.

Nicole: Can you explain, in layman’s terms, the technology that makes this automated acquisition possible?

David: To get collections out of StoryCorps we use a Python program called Fetcher. Part of what made this project so fun was that we got to use our lessons learned from past experiences. We worked with the fine folks at StoryCorps and you guys in Folklife to define the StoryCorps API. We worked iteratively with StoryCorps developers on customizing the API.

The Python program connects to the StoryCorps API and downloads the content into a bagged format, which is handed off to the Library’s ingest service, which we use to move and inventory digital collections in the Library. Curatorial staff kick off automated processes that verify collections, make copies onto a durable storage system, then verify the copies were made and release the space on the virtual loading docks. The ingest of collections happens once a month. After we’re collecting successfully for a while, we will turn off the manual parts of the process. We’re now manually kicking off the retrieval and the rest is automated. Eventually, we will make the retrieval automated and the system will notify us if there is a problem.
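As a rough illustration of the pull-and-bag flow David describes, the sketch below builds a BagIt-style checksum manifest over a set of fetched payload files. The names, data, and manifest layout are hypothetical; this is not the Library’s actual Fetcher code, and a real harvest would download the bytes from the API rather than hold them in memory:

```python
import hashlib

def bag_manifest(items):
    """Build a BagIt-style manifest line ("<checksum>  data/<path>") per payload file.

    `items` maps a payload filename to its raw bytes, standing in for content
    downloaded from an API (hypothetical; not the real StoryCorps API client).
    """
    lines = []
    for name, data in sorted(items.items()):
        digest = hashlib.sha256(data).hexdigest()
        lines.append(f"{digest}  data/{name}")
    return "\n".join(lines)

# One invented interview plus its metadata record:
items = {
    "interview-0001.mp3": b"\x00\x01fake audio bytes",
    "interview-0001.json": b'{"title": "Thanksgiving interview"}',
}
print(bag_manifest(items))
```

An ingest service can later recompute each digest against the stored copy and compare it to this manifest, which is what makes the hand-off verifiable.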

Nicole: You worked with’s developer to customize the way the collection is exposed on the web. What special requirements had to be met to accommodate the Library’s needs and why?

Kate: At the Library of Congress, we have to make sure that items we are collecting are exactly what the producer intended them to be. That’s why we ask partners to provide checksums, a digital fingerprint, a sequence of characters that’s unique to a digital file. This way, we can make sure no mistakes have been made in the transmission, and it allows us to establish a chain of custody.
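A transmission check of the kind Kate describes can be sketched in a few lines of Python. SHA-256 is an illustrative choice here; the post does not name the algorithm the Library actually uses:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest: a character sequence unique
    (for practical purposes) to the exact bytes of a file."""
    return hashlib.sha256(data).hexdigest()

original = b"StoryCorps interview audio"
received = b"StoryCorps interview audio"
corrupted = b"StoryCorps interview audi0"  # a single flipped character

assert fingerprint(received) == fingerprint(original)   # transfer intact
assert fingerprint(corrupted) != fingerprint(original)  # corruption detected
```

Because even a one-byte change produces a completely different digest, comparing the producer’s checksum with one computed on arrival confirms the item is exactly what the producer intended.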

David: The best moment for me during this process was when we asked StoryCorps for checksums, and they told us they could provide them for all but one item type, which was hosted by a third-party vendor. StoryCorps told the vendor that the Library of Congress was asking, and the vendor added checksums. It was really satisfying.

Nicole: What take-aways do you have for donors and archivists who want to collaborate in this way?

Kate: We’ve done digital collecting in a number of ways. Fetcher is a good model.

David: Yes, don’t let people push content to us; let us pull it. Another take-away is that in order to accession something, it had to be fixed, and that’s not how StoryCorps thought about their content. They thought about it as fluid.

Kate: With the popularity of digital photography and recent news events, the concept of metadata has moved from being library jargon into the common lexicon. Most people understand now that metadata (information about the content, like the name of a subject or the time of a phone call) can be just as important as the data it describes. I would encourage donors to think not only about their content but about what metadata they would like preserved alongside it. We had a discussion about whether to archive the comments about the interviews and the “likes.” Likes are so fluid that a snapshot at an arbitrary time might not say much; tags, however, could be useful to researchers.

Nicole: While StoryCorps provides public access to this collection, what happens to the interviews when they arrive at the Library to ensure long-term preservation and access?

David: Copies are stored in multiple locations with multiple copies. Curatorial staff manages the collection with our inventory system, which keeps track of every StoryCorps file. There are two big classes of errors we try to protect against: 1) a single file or tape gets destroyed—the checksum reveals this, and then we can replace the bad copy with a good copy; and 2) losing access to collections through correlated errors – a class of mistakes where somebody followed a bad process or relied on bad code. In that case, we can use our inventory to identify those kinds of problems.

Kate: Tooling allows us to establish repeatable and automated processes that allow us to identify mistakes. Another thing cultural heritage institutions are worried about is the usability of digital objects over time. We’ve seen file formats go obsolete in a few years. We’re lucky that this collection is in commonly used formats.

David: Yes, the JPG and MP3 file formats have been around for decades, are continually available, and have broad use.

Nicole: This is the second time that you have helped AFC acquire a collection in this way. (The earlier effort was harvesting Flickr photos showing how people celebrate Halloween.) How prevalent is this acquisition method and how do you see it shaping 21st century archival collecting?

Kate: We would like to establish an area of practice. Born-digital material is currently a small part of our collection that will grow explosively. We can make tools available to enable its processing. We already have a robust web archiving program, which focuses on collecting websites as publications in themselves. Born-digital collections differ in that they are focused on collecting items (photographs, tweets, books, blogs) that are published on the web. There are huge economies of scale in this kind of acquisition, and the results can be extraordinarily useful to researchers.

District Dispatch: ESSA passes the Senate!

Tue, 2015-12-08 17:54

Your voices have been heard! With a vote today of 84-12, the Senate moved us one step closer to a reauthorization of the Elementary and Secondary Education Act (ESEA).  As we mentioned last week, the Every Student Succeeds Act (ESSA) is a bipartisan bill that recognizes the vital role a school library and its librarian play in a student’s education. 

As we await the President’s signature to turn this bill into law, please take the time to thank those in Congress who voted in favor of ESSA. To learn if your Representative voted in favor of the bill visit this page (we’ll update this post with a link to the Senate votes once it has been posted).

If you would like to learn more about the library provisions in ESSA, this document may prove useful.  ALA is also working on compiling a document that explains how your school library can reap the positive effects of ESSA.  That document will be available and given out during the Midwinter conference. Stay tuned!

The post ESSA passes the Senate! appeared first on District Dispatch.

SearchHub: Visualizing search results in Solr: /browse and beyond

Tue, 2015-12-08 16:00
The Series

This is the second in a three part series demonstrating how it’s possible to build a real application using just a few simple commands.  The three parts to this are:

  • Getting data into Solr using bin/post
  • ==> (you are here) Visualizing search results: /browse and beyond
  • Up next: Putting it together realistically: example/files – a concrete useful domain-specific example of bin/post and /browse
/browse – A simple, configurable, built-in templated results view

We got to this point in the previous article on bin/post by running these commands:

$ bin/solr create -c solr_docs
$ bin/post -c solr_docs docs/

And here we are: http://localhost:8983/solr/solr_docs/browse?q=faceting

Or sticking with the command-line, this will get you there:

$ open http://localhost:8983/solr/solr_docs/browse?q=faceting

The legacy “collection1”, also known as techproducts

Seasoned Solr developers probably have seen the original incarnation of /browse. Remember /collection1/browse with the tech products indexed? With Solr 5, things got a little cleaner with this example and it can easily be launched with the -e switch:

$ bin/solr start -e techproducts

The techproducts example will not only create a techproducts collection, it will also index a set of example documents, the equivalent of running:

$ bin/solr create -c techproducts -d sample_techproducts_configs
$ bin/post -c techproducts example/exampledocs/*.xml

You’re ready to /browse techproducts.   This can be done using “open” from the command-line:

$ open http://localhost:8983/solr/techproducts/browse

An “ipod” search results in:

The techproducts example is the fullest-featured /browse interface, but it suffers from kitchen sink syndrome.  It’s got some cool things in there, like as-you-type term suggestions (type “ap” and pause, and you’ll see “apple” appear), geographic search (products have contrived associated “store” locations), results grouping, faceting, more-like-this links, and “did you mean?” suggestions.   While those are all great features often desired in our search interfaces, the techproducts /browse has been overloaded to support not just the tech products example data, but also the example books data (also in example/exampledocs/), and it was even made to demonstrate rich text files (note the content_type facet).  It’s convoluted to start with the techproducts templates and trim them down to your own needs, so the out-of-the-box experience got cleaned up for Solr 5.

New and… generified

With Solr 5, /browse has been designed to come out of the box with the default configuration, data_driven_configs (aka “schema-less”).  The techproducts example has its own separate configuration (sample_techproducts_configs) and custom set of templates, and those were left alone, as you saw above.  In order to make the templates work generically for almost any type of data you’ve indexed, the default templates were stripped down to the basics and baked in.  The first example above, solr_docs, illustrates the out-of-the-box “data driven” experience with /browse.  It doesn’t matter what data you put into a data-driven collection; the /browse experience starts with the basic search box and results display.  Let’s delve into the /browse side of things with some very simple data in a fresh collection:

$ bin/solr create -c example
$ bin/post -c example -params "f.tags.split=true" -type text/csv \
  -d $'id,title,tags\n1,first document,A\n2,second document,"A,B"\n3,third document,B'
$ open http://localhost:8983/solr/example/browse

This generic interface shows search results from a query specified in the search box, displays stored field values, includes paging controls, has debugging/troubleshooting features (covered below) and includes a number of other capabilities that aren’t initially apparent.


Because the default templates make no assumptions about the type of data or the values in fields, no faceting is enabled by default, but the templates support it.  Add facet.field=tags to a /browse request, such as http://localhost:8983/solr/example/browse?facet.field=tags, and it will render as shown here.

Clicking the value of a facet filters the results as naturally expected, using Solr’s fq parameter.  The built-in, generic /browse templates, as of Solr 5.4, only support field faceting. Other faceting options (range, pivot, and query) are not supported by the templates – they simply won’t render in the UI.  You’ll notice as you click around after manually adding “facet.field=tags” that the links do not include the manually added parameter.  We’ll see below how to go about customizing the interface, including how to add a field facet to the UI.  But let’s first delve into how /browse works.
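For illustration, composing such a request programmatically might look like the following minimal Python sketch. The base URL assumes the “example” collection created above; /browse simply accepts ordinary Solr parameters:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/example/browse"  # collection from the article

def browse_url(params):
    """Compose a /browse request URL from ordinary Solr parameters."""
    return BASE + "?" + urlencode(params)

# Manually add a field facet, then filter on one of its values via fq:
url = browse_url({"q": "document", "facet.field": "tags", "fq": "tags:A"})
print(url)
# → http://localhost:8983/solr/example/browse?q=document&facet.field=tags&fq=tags%3AA
```

Opening that URL in a browser against a running Solr would show the “tags” facet and the results filtered to tag A, exactly as clicking the facet value does.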

Note: the techproducts templates do have some hard-coded support for other facets, which can be borrowed from as needed; continue on to see how to customize the view to suit your needs.

What makes /browse work?

In Solr technical speak, /browse is a search request handler, just like /select – in fact, on any /browse request you can set wt=xml to see the standard results that drive the view.   The difference is that /browse has some additional parameters defined as defaults to enhance querying, faceting, and response writing.  Queries are configured to use the edismax query parser.  Faceting is turned on, though no fields are specified initially, and facet.mincount=1 is set so as not to show zero-count buckets.  Response-writing tweaks are the secret sauce of /browse; otherwise it’s just a glorified /select.


Requests to /browse are standard Solr search requests with the addition of three parameters:

  • wt=velocity: Use the VelocityResponseWriter for generating the HTTP response from the internal SolrQueryRequest and SolrQueryResponse objects
  • v.template=browse: The name of the template to render
  • v.layout=layout: The name of the template to use as a “layout”, a wrapper around the main v.template specified

Solr generally returns search results as data: XML, JSON, CSV, or other data formats.  At the end of search request processing, the response object is handed off to a QueryResponseWriter to render.  In the data formats, the response object is simply traversed and wrapped with angle, square, and curly brackets.  The VelocityResponseWriter is a bit different, handing the response data object off to a flexible templating system called Velocity.

“Velocity”?  Woah, slow down!   What is this ancient technology of which you speak?  Apache Velocity has been around for a long time; it’s a top-notch, flexible, templating library.  Velocity lives up to its name – it’s fast too.  A good starting point to understanding Velocity is an article I wrote many (fractions of) light years ago: “Velocity: Fast Track to Templating”.  Rather than providing a stand-alone Velocity tutorial here, we’ll do it by example in the context of customizing the /browse view.  Refer to the VelocityResponseWriter documentation in the Reference Guide for more detailed information.

Note: Unless you’ve taken other precautions, users that can access /browse could also add, modify, delete or otherwise affect collections, documents, and all kinds of things, opening the possibility of data security leaks, denial-of-service attacks, or the wiping out of partial or complete collections.   That sounds bad, but it’s nothing new or different for /browse compared to /select; /browse just looks prettier, and is user-friendly enough that you may want to expose it to non-developers.

Customizing the view

There are several ways to customize the view; it ultimately boils down to the Velocity templates rendering what you want.   Not all modifications require template hacking though.  The built-in /browse handler uses a relatively new Solr feature called “param sets”, which debuted in Solr 5.0.   The handler is defined like this:

<requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse"/>

The useParams setting specifies which param set(s) to use as default parameters, allowing them to be grouped and controlled through an HTTP API.  An implementation detail, but param sets are defined in a conf/params.json file, and the default set of parameters is spelled out as such:

{"params": {
    "query": {
      "defType": "edismax",
      "q.alt": "*:*",
      "rows": "10",
      "fl": "*,score",
      "": {"v": 0}
    },
    "facets": {
      "facet": "on",
      "facet.mincount": "1",
      "": {"v": 0}
    },
    "velocity": {
      "wt": "velocity",
      "v.template": "browse",
      "v.layout": "layout",
      "": {"v": 0}
    }
}}

The various sets keep parameters grouped by function.  Note that the “browse” param set is not defined; it is a placeholder set name that can be filled in later.  So far these are straightforward, typical Solr parameters.  Again, ultimately everything that renders is a result of the template driving it.  In the case of facets, all field facets in the Solr response will be rendered (by facets.vm).   Using the param set API, we can add the “tags” field to the “facets” param set:

$ curl http://localhost:8983/solr/example/config/params -H 'Content-type:application/json' -d '{ "update" : { "facets": { "facet.field":"tags" } } }'

Another nicety about param sets – their effect is immediate, whereas changes to request handler definitions require the core to be reloaded or Solr to be restarted.  Just hit refresh in your browser on /browse, and the new tags facet will appear without being explicitly specified in the URL.

See also the file example/films/README.txt for an example adding a facet field and query term highlighting.  The built-in templates are already set up to render field facets and field highlighting when enabled, making it easy to do some basic domain-specific adjustments without having to touch a template directly.

At this point, /browse is equivalent to this /select request: http://localhost:8983/solr/example/select?defType=edismax&q.alt=*:*&rows=10&fl=*,score&facet=on&facet.mincount=1&wt=velocity&v.template=browse&v.layout=layout&facet.field=tags

Again, set wt=xml or wt=json and see the standard Solr response.
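To see how those param sets expand into a concrete request, here is a small illustrative Python sketch (the host, port, and collection name follow the example above) that assembles the equivalent /select URL, with wt flipped to json:

```python
from urllib.parse import urlencode

# Defaults pulled from the "query" and "facets" param sets, plus the
# facet.field we added via the params API; wt=json replaces the velocity set.
params = {
    "defType": "edismax",
    "q.alt": "*:*",
    "rows": "10",
    "fl": "*,score",
    "facet": "on",
    "facet.mincount": "1",
    "facet.field": "tags",
    "wt": "json",
}

url = "http://localhost:8983/solr/example/select?" + urlencode(params)
print(url)
```

Swap wt back to velocity (and restore v.template/v.layout) to get the HTML view again.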

Overriding built-in templates

VelocityResponseWriter has a somewhat sophisticated mechanism for locating templates to render.  Using a “resource loader” search path chain, it can get templates from a file system directory, the classpath, a velocity/ subdirectory of the conf/ directory (either on the file system or in ZooKeeper), and even optionally from request parameters.  By default, templates are only configured to come from Solr’s resource loader, which pulls from conf/velocity/ or from the classpath (including solrconfig.xml configured JAR files or directories).  The built-in templates live within the solr-velocity JAR file.  These templates can be extracted, even as Solr is running, to conf/velocity so that they can be adjusted.  To extract the built-in templates to your collection’s conf/velocity directory, use the following command (assuming the “example” collection we’re working with here):

$ unzip dist/solr-velocity-*.jar velocity/*.vm -d server/solr/example/conf/

This trick works when Solr is running in standalone mode.  In SolrCloud mode, conf/ (including conf/velocity/ and the underlying template files) lives in ZooKeeper; if you’re not seeing your changes to a template, be sure the template is where Solr is looking for it, which may require uploading it to ZooKeeper.  With these templates extracted from the JAR file, you can now edit them to suit your needs.  Template files use the extension .vm, which stands for “Velocity macro” (“macro” is a bit overloaded, unfortunately; these are really best called “template” files).  Let’s demonstrate changing the Solr logo in the upper left to a magnifying glass clip art image.   Open server/solr/example/conf/velocity/layout.vm with a text editor, change the <div id="head"> to the following, save the file, and refresh /browse in your browser:

<div id="head"> <a href="#url_for_home"> <!-- Borrowed from --> <img src=""/> </a> </div>

#protip: your boss will love seeing the company logo on your quick Solr prototype.  Don’t forget the colors too: the CSS styles can be customized in head.vm.

Customizing results list

The /browse results list is rendered by results_list.vm, which simply iterates over all “hits” in the response (the current page of documents), rendering hit.vm for each result.  The rendering of a document in the search results is commonly an area that needs domain-specific attention. The templates that were extracted will now be used, overriding the built-in ones.  Any templates that you don’t need to customize can be removed, falling back to the default ones.  In this example, the template change was specific to the “example” core; newly created collections, even data-driven ones, won’t have this template change.
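To make this concrete, here is a minimal hit.vm sketch.  It assumes the name and genre fields from the films example data and the $esc escaping tool available in the Velocity context; adjust field names to your own schema:

```velocity
## Minimal hit.vm sketch: $doc is the SolrDocument for the current hit.
<div class="result-document">
  <b>$esc.html($doc.getFirstValue('name'))</b>
  <ul>
  #foreach($genre in $doc.getFieldValues('genre'))
    <li>$esc.html($genre)</li>
  #end
  </ul>
</div>
```

Save the file and refresh /browse, just as with the logo change above, to see the effect.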

NOTE: changes made will be lost if you delete the example collection – see the -Dvelocity.template.base.dir technique to externalize templates from the configuration.


I like using /browse for debugging and troubleshooting.  In the footer of the default view there is an “enable debug” link that adds debug=true to the current search request.  The /browse templates add a “toggle parsed query” link under the search box and a “toggle explain” link by each search result hit. Searching for “faceting”, enabling debug, and toggling the parsed query tells us how the user’s query was interpreted, including which field(s) come into play and any analysis transformations, like stemming or stop word removal, that took place.

Toggling the explain on a document provides a detailed, down-to-the-Lucene-level explanation of how the document matched and how its relevancy score was computed.  As shown below, “faceting” appears in the _text_ field (a data_driven_configs copyField destination for all fields, making everything searchable).  “faceting” appears 4 times in this particular document (tf, term frequency), and appears in 24 total documents (docFreq, document frequency).  The fieldNorm factor can be a particularly important one: it is based on the number of terms in the field, generally giving shorter fields a relevancy advantage over longer ones.



VelocityResponseWriter: it’s not for everyone or every use case.  Neither is wt=xml, for that matter.  Over the years, /browse has gotten flak for being a “toy” or not “production-ready”.  It’s both of those, and then some.  VelocityResponseWriter has been used for:

  • effective sales demos
  • rapid prototyping
  • powering the entire UI of one of our Lucidworks Fusion customers, through the secure Fusion proxy
  • and even generating nightly e-mails from a job search site!

Ultimately, wt=velocity is for generating textual (not necessarily HTML) output from a Solr request.  

The post Visualizing search results in Solr: /browse and beyond appeared first on

David Rosenthal: National Hosting with LOCKSS Technology

Tue, 2015-12-08 16:00
For some years now the LOCKSS team has been working with countries to implement National Hosting of electronic resources, including subscription e-journals and e-books. JISC's SafeNet project in the UK is an example. Below the fold I look at the why, what and how of these systems.

Why National Hosting?
The details of how universities, research institutes and libraries acquire subscription electronic resources vary considerably between countries. But in most cases they share an important feature: the money may flow through the system in different ways, but it starts from the government. This leads governments to focus on a number of issues in the system.
Post-Cancellation Access
In the paper world, a library's subscription to a journal purchased a copy of the content. It provided the library's readers with access until the library decided to de-accession it, and it could be used to assist other libraries via inter-library loan. The advent of the Web made journal content vastly more accessible and useful, but it forced libraries to switch from purchasing a copy to leasing access to the publisher's copy. If the library stopped paying the subscription, their readers lost access not just to content published in the future, but also to content published in the past while the subscription was being paid. This problem became known as post-cancellation access (PCA), and elicited over time a number of different responses:
  • Some publishers promised to provide PCA themselves, providing access to paid-for content without further payment, but librarians were rightly skeptical of these unfunded promises.
  • The enthusiasm for open access led HighWire Press to pioneer and some publishers to adopt the moving wall concept. Because the value of content decays through time, providing open access to content older than, for example, 12 months does little to reduce the motivation for libraries to subscribe, while rendering PCA less of an issue.
  • Several efforts took place to establish e-journal archives, in order that PCA not be at the whim of the publisher. I posted a historical overview of e-journal archiving to my blog back in 2011.
It is becoming clear that the sweet spot for providing PCA is at a national level, matching the source of most of the subscription funds. A global third-party archive such as ITHAKA's Portico poses essentially the same problem as the original publishers do: access is contingent on continued subscription payment. National archives match national publisher licensing, and are large enough to operate efficiently.
Fault-Tolerant Access
It is tempting to assume that once the subscription has been paid the content will be available from the publisher, but there are many reasons this may not always be the case:
  • Publishers' subscription management and access systems are typically separate, and errors can occur as information is transferred between them, denying subscribers access.
  • Publishers' access systems are not completely reliable. Among the publishers who have recently suffered significant outages are JSTOR, Elsevier, Springer, Nature, IEEE, Taylor & Francis, and Oxford.
  • The Domain Name System that links readers via URLs to publishers is brittle, subject to failures such as that which took down DOI resolution, and which enable journal websites to be hijacked.
  • Publishers often fail to maintain their URL space consistently, leading to the reference rot and content drift that impacts at least 20% of journal articles.
  • The Internet that connects readers to publishers' Web sites is a best-efforts network passing through many links and routers on the way. There are no guarantees that requested content will be delivered completely, or at all. The LOCKSS system's crawlers constantly detect complete or partial failure to deliver content. These could be caused by errors at the publisher, or along the network path to the crawler. Experience suggests that the fewer hops along the route the fewer errors, so at least some errors are network problems.
The ability to fail over from the publisher to a National Hosting Network (NHN), which is closer to the readers, and operates independently of the publisher, can significantly enhance the availability of the content.
Extra-territoriality
For most countries, the publishers of the vast majority of the content to which they subscribe are located in, or have substantial business interests in, the USA. Interestingly, the USA has long had a National Hosting program. Los Alamos National Labs collects all e-journals Federal researchers could access and re-hosts them. The goal is to ensure that details of what their researchers, possibly working on classified projects, look at are not visible outside the Federal government.

Edward Snowden's revelations make it clear that the NSA and GCHQ devote massive efforts to ensuring that they can capture Internet traffic, especially traffic that crosses national borders. This includes traffic to e-journal publishers. Details of almost everything your national researchers look at are visible to US and UK authorities, who are not above using this access for commercial advantage. And, of course, many academic publishers are based, or have much of their infrastructure, in the US or the UK, so they can be compelled to turn over access information even if the traffic is encrypted.

An even greater concern is that the recent legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. The EU has also passed laws claiming extra-territorial jurisdiction over data, so is in a poor position to object to US claims. Note that software is also data in this context.

Thus the access rights notionally conveyed by subscription agreements are actually subject to a complex, opaque and evolving legal framework to which the subscriber, and indeed the government footing the bill, are not parties. The US is claiming the unilateral right to terminate access to content held anywhere by any organization with significant business in the US. This clearly includes all major e-journal publishers, such as EBSCO, Elsevier, Wiley, Taylor & Francis, Nature, AAAS and JSTOR.
Tracking Value For Money
As governments, and thus taxpayers, are the ultimate source of most of the approximately €7.6B/yr spent on academic publishing, concern for the value obtained in return is natural. Many countries, such as the UK, negotiate nationally with publishers. Institutions obtain access by opting in to the national license terms and paying the publisher. The institution can obtain COUNTER usage statistics from the publisher, so can determine its local cost per usage. But the opt-in license and direct institution-to-publisher interaction make it difficult to aggregate this information at a national level.
Preserving National Open-Access Output
Governments also fund most of the research that appears in smaller, locally published open-access journals. Much of this is, for content or language reasons, of little interest outside the country of origin. It is thus unlikely to be preserved unless NHNs, such as Brazil's Cariniana, take on the task.
What Does National Hosting Need To Do?
To address these issues, the goals of a National Hosting system are to ensure that:
  • Copies of subscribed content are maintained by national institutions within the national boundaries.
  • Using software that is open source, or of national origin.
  • Upon terms allowing access by institutions to content to which they subscribed if, for any reason, it is not available to them from the publisher.
Accomplishing these goals requires:
  1. A database (the entitlement registry) tracking the content to which, at any given time, each institution subscribes.
  2. A system for collecting and preserving for future use the content to which any institution subscribes.
  3. A system for delivering content to readers that they cannot access from the publisher.
A number of countries are at various stages of implementing NHNs using the LOCKSS technology to satisfy these requirements.
How Does LOCKSS Implement National Hosting?
The LOCKSS Program at the Stanford Libraries started in 1998. It was the first to demonstrate practical technology for e-journal archiving, the first to enter production, and the first to achieve economic sustainability without grant funding (in 2007). The basic idea was to restore libraries' ability to purchase rather than lease content, by building technology that worked for the Web as the stacks worked for paper.

Libraries run a "LOCKSS box", a low-cost computer that uses open-source software. Just as in the paper world, each library's box individually collects the subscribed content by crawling the publisher's Web site. Boxes extract Dublin Core and DOI metadata so that the library's readers (but no-one else) can access the box's content if for any reason it is ever not available from the publisher. Boxes holding the same content cooperate to:
  • Detect and correct any content that a box failed to collect completely.
  • Detect and repair any damage to content over time.
Most publishers allowed libraries to use LOCKSS but, for various reasons, some of the large publishers were not happy with this model. They took the initiative to set up the CLOCKSS archive, which uses the same technology to implement a dark archive holding each publisher's total output in a network of currently 12 boxes around the world. A National Hosting Network would be similar, but with perhaps half as many boxes.

Full details of how the CLOCKSS archive works are in the documentation that supported its recent successful TRAC audit. Briefly, CLOCKSS ingests content in two ways:
  • A network of 4 LOCKSS boxes harvests content from publishers' Web sites and collaborates to detect and correct collection failures.
  • Some publishers supply content via file transfer to a machine which adds metadata and packages it.
The LOCKSS daemon software on each of the 12 production boxes then collects both kinds of content from the ingest machines by crawling them, under control of the appropriate plugin. Subject to publisher agreement, the same ingest technology can be used to ingest content into NHNs. This would satisfy requirement #2 above.

Requirement #1, the entitlement registry, involves setting up a database of quads [Institution, Journal ID, start date, end date]. The UK's SafeNet project has done so, and is working to populate it with subscription information. They have defined an API to this database. Countries that already have a suitable database can implement this API; countries that don't can use the UK's open-source implementation.
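As a sketch of the check such a registry supports (the sample rows and the in-memory representation are my own illustration, not SafeNet's actual schema or API):

```python
from datetime import date

# Entitlement registry as a list of quads: (institution, journal ID, start, end).
# The sample rows below are invented for illustration.
REGISTRY = [
    ("univ-a", "1234-5678", date(2005, 1, 1), date(2012, 12, 31)),
    ("univ-a", "8765-4321", date(2010, 1, 1), date(2015, 12, 31)),
]

def is_entitled(institution, journal_id, pub_date):
    """True if some quad covers this institution, journal, and publication date."""
    return any(
        inst == institution and jid == journal_id and start <= pub_date <= end
        for inst, jid, start, end in REGISTRY
    )
```

In production, the LOCKSS software would perform this check against the registry's HTTP API rather than an in-memory list.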

The LOCKSS software is being enhanced to query the entitlement registry via this API before supplying content to readers, thus satisfying requirement #3 above. Readers can access the content via their institution's OpenURL resolver, or via a national DOI resolver.
A Typical National Hosting Network
This diagram shows the configuration of the ingest and preservation aspects of a typical Private LOCKSS Network (PLN). This one has a configuration server controlling a network of 6 LOCKSS boxes, each collecting its content from a content source, and communicating with the other boxes to detect and correct any missing or damaged content. LOCKSS-based NHNs are a type of PLN, and typically would look exactly like the diagram, with 6 or more boxes scattered around the country managed by a central configuration server.

Just as with the CLOCKSS Archive, some content for an NHN would be available via harvest, and some via file transfer. The file transfer ingest pipeline for CLOCKSS is shown in this diagram, the pipeline for an NHN would be similar. For security reasons, content that the publisher uploads to a CLOCKSS FTP server goes to an isolated machine for that purpose only, and is then downloaded from it to the main ingest machine via rsync. Content that publishers make available on their FTP servers, or via rsync, is downloaded to the main ingest machine directly. Two backup copies, one on-site and one off-site are maintained via rsync to ensure that the content is safe while in transit to the production LOCKSS boxes.

The CLOCKSS archive ingests harvest content using a separate ingest network of 4 machines which crawl the publishers' sites, and are then crawled by the 12 production CLOCKSS boxes. The boxes in an NHN would typically harvest content directly, by crawling the publishers' sites, so an ingest network would not be needed.

The CLOCKSS Archive is dark; the content in production CLOCKSS boxes is never accessed by readers. If content is ever triggered, the CLOCKSS team executes a trigger process to copy content out of the network to a set of triggered content servers, which make it openly accessible. NHNs need to make their content accessible only to authorized users, so they need an access path similar to that for LOCKSS. It must consult the entitlement registry to determine whether the requested content was covered by the requester's institutional subscription.

The LOCKSS software can disseminate harvest content in four ways:
  1. By acting as an HTTP proxy, in which case the reader whose browser is proxied via their institution's LOCKSS box will access the content at the publisher's URL but, if the publisher does not supply the content, it will be supplied by the LOCKSS box.
  2. By acting as an HTTP server, in which case the reader will access the content at a URL pointing to the LOCKSS box but including the publisher's URL.
  3. By resolving OpenURL queries, using either the LOCKSS daemon's internal resolver, or an institutional resolver such as SFX. LOCKSS boxes can output KBART reports detailing their holdings, which can be input to an institutional resolver. The LOCKSS box will then appear as an alternate source for the resolved content. Access will be at a URL pointing to the LOCKSS box.
  4. By resolving DOI queries, using the LOCKSS daemon's internal resolver. Access will be at a URL pointing to the LOCKSS box.
In each case, the LOCKSS box is configured by default to forward the request to the publisher's Web site and supply content only if the publisher does not.

Methods 1 & 2 rely on knowing the original publisher's URL for the content. For file transfer content, this is generally not available, so method 3 or 4 must be used. The format in which a publisher supplies file transfer content varies, but it generally includes metadata sufficient to resolve OpenURL or DOI queries, and a PDF that can be returned as a result.
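As an illustration of method 3, an OpenURL article query is just a set of key-value pairs sent to the box. In this Python sketch the host and servlet path are placeholders (assumptions, not a documented endpoint), and the citation values are invented:

```python
from urllib.parse import urlencode

# Hypothetical OpenURL (Z39.88-2004) journal-article query aimed at a LOCKSS box.
params = {
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft.issn": "1234-5678",
    "rft.volume": "12",
    "rft.spage": "34",
    "rft.date": "2014",
}
url = "http://lockss-box.example.org:8080/ServeContent?" + urlencode(params)
print(url)
```

A DOI query (method 4) would be similar, passing an identifier such as info:doi/... instead of the citation fields.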

FOSS4Lib Updated Packages: VIPS

Tue, 2015-12-08 15:21

Last updated December 8, 2015. Created by Peter Murray on December 8, 2015.

VIPS is a free image processing system. It includes a range of filters, arithmetic operations, colour processing, histograms, and geometric transforms. It supports ten pixel formats, from 8-bit unsigned int to 128-bit complex. As well as the usual JPEG, TIFF, PNG and WebP images, it also supports scientific formats like FITS, OpenEXR, Matlab, Analyze, PFM, Radiance, OpenSlide and DICOM (via libMagick). Compared to similar libraries, VIPS is fast and does not need much memory, see the Speed and Memory Use page.
It comes in two main parts: libvips is the image-processing library and nip2 is the graphical user-interface. libvips has a simple GObject-based API and comes with interfaces for C, C++, the command-line and Python. Other bindings have been made, including Ruby and Go. libvips is used as an image processing engine by sharp (on node.js), carrierwave-vips, mediawiki, photoflow and others.
The official VIPS GUI, nip2, aims to be about half-way between Photoshop and Excel. It is very bad at retouching photographs, but very handy for the many other imaging tasks that programs like Photoshop are used for.
Both work on Linux/Unix (with convenient packages for most popular distributions, see links), Windows XP and later and MacOS 10.2 and later.

Package Type: Image Display and Manipulation
License: GPLv2
Development Status: Production/Stable
Operating System: Linux, Mac, Windows
Programming Language: C

HangingTogether: What do MARC descriptions of archival materials really look like?

Mon, 2015-12-07 21:28

Taking Our Pulse showed that 44% of those in the surveyed institutions had no online record. Colleagues worldwide are working hard to improve this sorry situation. In the meantime, I hope you agree with me that doing the work as effectively and efficiently as possible is essential.

At the same time, our professional community is developing new data structures for storing and communicating library and archival descriptions to meet 21st-century needs. This includes enabling improved discovery through the promise of linked open data. As we move forward in that work, what can we learn from past practice? As my OCLC Research colleague Karen Smith-Yoshimura showed in her 2010 report on MARC field usage, fewer than thirty of the more than two hundred fields in MARC21 have been used in 10% or more of all WorldCat records. It makes one wonder: should we be carrying all those data elements forward? Do we need such granular data structures if so many bits are little-used? Would simpler approaches improve workflow efficiencies without sacrificing effectiveness for identification and discovery of unique materials?

These are among the questions lurking behind my current project to look at data element usage in the four million WorldCat records that describe archival materials. Some others:

  • How do we define “archival materials” for purposes of such a study?
  • Is archival use of MARC accurate and fulfilling its potential?
  • How does archival description differ across types of material?
  • Are archival materials usually described as collections?
  • Does the archival control byte (Leader 08) capture all archival descriptions?
  • How often is DACS specified as the content standard?
  • To what extent have DACS minimum requirements been met?

And, referring back to my comments above, the bonus question: What implications for “next-gen” cataloging do the data suggest?
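As background for the two Leader questions above, both bytes are trivial to inspect. In this Python sketch the position meanings and codes are from MARC 21; treating these particular record types as "archival-ish" is my assumption for illustration, not the study's published definition:

```python
# MARC 21 Leader, 0-indexed: position 06 = type of record, 08 = type of control.
MANUSCRIPT_TYPES = {
    "p",  # mixed materials
    "t",  # manuscript language material
    "d",  # manuscript notated music
    "f",  # manuscript cartographic material
}

def leader_profile(leader):
    """Return (record type, under archival control?) for a 24-byte MARC leader."""
    return leader[6], leader[8] == "a"

def describes_archival_material(leader):
    """Archival control set, or a manuscript/mixed-materials record type."""
    record_type, archival_control = leader_profile(leader)
    return archival_control or record_type in MANUSCRIPT_TYPES
```

For example, a Leader containing "p" at 06 and "a" at 08 profiles as mixed materials under archival control, while a typical book record ("a" at 06, blank at 08) is neither.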

Last Thursday I presented an OCLC Research Library Partnership work-in-progress webinar to offer a first look at the data, solicit feedback on how I’ve approached the analysis, and lob a few tentative recommendations. Take a look at the slides and the recording.

First, the overall profile of the dataset by broad types of material: Visual materials are represented by the largest number of records (1.4 million), followed by mixed (1.3 million) and textual materials (800,000). More than 300,000 recordings are included, over 120,000 music scores, and much smaller quantities of several other types of material.

Here are a few of the data points I find interesting:

  • The record type byte (Leader 06) is used incorrectly in some significant ways.
  • Archival descriptive standards are specified in only 20% of records.
  • Twenty-five percent of mixed materials are described as items, as are up to 95% of materials in other formats.
  • Some format-specific note fields are greatly underutilized.
  • Archival control is specified in only 28% of records.
  • Cataloging practices reveal format-specific silos.
  • One-third of records link (856) to digital content.

What do these data suggest for the likely success of archival descriptions to connect users with materials? How should practice change going forward?

I’d love to hear your feedback after you take a look at the webinar outputs. A report will be published early in 2016, so please get in touch soon.

And, in the meantime, I wish you and yours a joyful holiday season!

About Jackie Dooley

Jackie Dooley leads OCLC Research projects to inform and improve archives and special collections practice.

Mail | Web | Twitter | Facebook | More Posts (18)

DPLA: DPLA Board Call: Tuesday, December 15, 3:00 PM Eastern

Mon, 2015-12-07 18:01

The next DPLA Board of Directors call is scheduled for Tuesday, December 15 at 3:00 PM Eastern. Agenda and dial-in information is included below. This call is open to the public, except where noted.

  1. [Public] General updates from Executive Director
  2. [Public] Questions/Comments from the public
  3. [Executive session] Membership model update
Dial in

FOSS4Lib Recent Releases: Jpylyzer - 1.16.0

Mon, 2015-12-07 17:45

Last updated December 7, 2015. Created by Peter Murray on December 7, 2015.

Package: Jpylyzer
Release Date: Monday, December 7, 2015

Manage Metadata (Diane Hillmann and Jon Phipps): The Jane-athons continue!

Mon, 2015-12-07 16:19

The Jane-athon series is alive, well, and expanding its original vision. I wrote about the first ‘official’ Jane-athon earlier this year, after the first event at Midwinter 2015.

Since then the excitement generated at the first one has spawned others:

  • the Ag-athon in the UK in May 2015, sponsored by CILIP
  • the Maurice Dance in New Zealand (October 16, 2015 at the National Library of New Zealand in Wellington, focused on Maurice Gee)
  • the Jane-in (at ALA San Francisco at Annual 2015)
  • the RLS-athon (November 9, 2015, Edinburgh, Scotland), following the JSC meeting there and focused on Robert Louis Stevenson
Like good librarians, we have an archive of the Jane-athon materials, for use by anyone who wants to look at or use the presentations or the data created at the Jane-athons.

We’re still at it: the next Jane-athon in the series will be the Boston Thing-athon at Harvard University on January 7, 2016. Looking at the list of topics gives a good idea of how the Jane-athons are morphing to a broader focus than that of a single creator, while still training folks to create data with RIMMF. The first three topics (which may change; watch this space) focus not on specific creators, but on moving forward on topics identified as being of interest to a broader community.

  • Strings vs things. A focus on replacing strings in metadata with URIs for things.
  • Institutional repositories, archives and scholarly communication. A focus on issues in relating and linking data in institutional repositories and archives with library catalogs.
  • Rare materials and RDA. A continuing discussion on the development of RDA and DCRM2 begun at the JSC meeting and the international seminar on RDA and rare materials held in November 2015.

For beginners new to RDA and RIMMF but with an interest in creating data, we offer:
  • Digitization. A focus on how RDA relates metadata for digitized resources to the metadata for original resources, and how RIMMF can be used to improve the quality of MARC 21 records during digitization projects.
  • Undergraduate editions. A focus on issues of multiple editions that have little or no change in content vs. significant changes in content, and how RDA accommodates the different scenarios.

Further on the horizon is a recently approved Jane-athon for the AALL conference in July 2016, focusing on Hugo Grotius (inevitably, a Hugo-athon, but there’s no link yet).

NOTE: The Thing-athon coming up at ALA Midwinter is being held on Thursday rather than the traditional Friday, to open attendance to those who have other commitments on Friday. Another new wrinkle is the venue: an actual library, away from the conference center! Whether you’re a cataloger or not-a-cataloger, there will be plenty of activities and discussions that should pique your interest. Do yourself a favor and register for a fun and informative day at the Thing-athon to begin your Midwinter experience!

Instructions for registering (whether or not you plan to register for MW) can be found on the Toolkit Blog.

Ariadne Magazine: Figshare Fest 2015

Mon, 2015-12-07 14:57

Gary Brewerton reports on figshare fest 2015, held in London on 12th October 2015.

Figshare [1] is a cloud-hosted repository where users can upload their various research outputs (e.g. figures, datasets, presentations, etc.) and make them publicly available so they are discoverable, shareable and citable. Read more about Figshare Fest 2015

Issue number: 75. Date published: Mon, 12/07/2015

FOSS4Lib Recent Releases: Jpylyzer - 1.15.1

Mon, 2015-12-07 13:35

Last updated December 7, 2015. Created by Peter Murray on December 7, 2015.

Package: Jpylyzer
Release Date: Thursday, December 3, 2015

LITA: LITA Bloggers Reflect on LITA Forum 2015

Mon, 2015-12-07 13:00
LITA bloggers, L-R: Whitni, Lindsay, Brianna, Bill, Michael, Jacob

Connections – Michael Rodriguez

Several LITA bloggers, including myself, attended our first-ever LITA Forum in November 2015. For me, the Forum was a phenomenal experience. I had a great time presenting on OCLC products, open access integration, and technology triage, with positive, insightful audience questions and feedback. The sessions were excellent, the hotel was amazing, the Minneapolis location was perfect, but best of all, LITA was a superb networking conference. With about 300 attendees, it was small enough for us to meet everyone, but large enough to offer diverse perspectives. I got to meet dozens of people, including LITA bloggers Bill, Jacob, and Whitni, whom I knew via LITA or via Twitter but had never met IRL. I got to reenergize old comradeships with Lindsay and Brianna and finally meet the hard-working LITA staff, Mark Beatty and Jenny Levine. I formed an astonishing number of new connections over breakfast, lunch, dinner, and water coolers. Our connections were warm and revitalizing and will be with us lifelong. Thanks, LITA!

To Name – Jacob Shelby

LITA Forum 2015 was my first professional library conference to attend, and I will say that it was an amazing experience. The conference was just the right size! I was fortunate to meet some awesome, like-minded people who inspired me at the conference, and who continue to inspire me in my daily work. There were so many great sessions that it was a real challenge choosing which ones to go to! My particular favorite (if I had to choose only one) was Mark Matienzo’s keynote: To Hell With Good Intentions: Linked Data, Community and the Power to Name. As a metadata and cataloging professional, I thought it was enlightening to think about how we “name” communities and to consider how we can give the power to name and tell stories back to the communities. In all, I made connections with some wonderful professionals and picked up some great ideas to bring back to my library. Thanks for an awesome experience, LITA!

Game On – Lindsay Cronk

    A conference is an investment for many of us, and so we always look for ROI. We fret about costs and logistics. We expect to be stimulated by and learn from speakers and presentations. We hope for networking opportunities. At LITA Forum, my expectations and hopes were met and exceeded. Then I got to go to Game Night. What better way to reward a conferenced-out brain than with a few rounds of Love Letter and a full game of Flash Point? I had a terrific time talking shop and then just playing around with fellow librarians and library tech folks. It reminded me that play and discovery are always touted as critical instructional tools. At this point I’m going to level a good-natured accusation- LITA Forum gamified my conference experience, and I loved it. I hope you’ll come out and play next year, LITA Blog readers!

    No, get YOUR grub on! – Whitni Watkins

    As someone on the planning committee for LITA Forum, I spent a decent amount of time doing my civic duty and making sure things were in place. After a couple of years of conference-heavy attending, I learned that you cannot do it all and come out on top. I was selective this year: I attended a few sessions that piqued my interest and spent a few hours discussing a project I was working on in the Poster session. I’ve learned that conferences are best for networking, for finding people with the same passion to help you hack things in the library (and not-so-library) world. My fondest memory of this year’s LITA Forum was the passionate discussion we had during one of our networking dinners on the hierarchy in libraries, how we can break it, and why it is important to do so. Also, afterwards meeting up as LITA Bloggers and hanging out with each other IRL. A great group of people behind the screen, happy to be a part of it.

    Did you attend this year’s LITA Forum? What was your experience like?

    DuraSpace News: UPDATE: Fedora Community Book Club Kick-off

    Mon, 2015-12-07 00:00

    From Andrew Woods, Fedora Tech Lead

    Winchester, MA  An organizing group met recently to discuss and define the logistics for discussions of the Fedora Community Book Club selection: "Semantic Web for the Working Ontologist". Here are the details:

    Ted Lawless: Python ETL and JSON-LD

    Sat, 2015-12-05 05:00

    I've written an extension to petl, a Python ETL library, that applies JSON-LD contexts to data tables for transformation into RDF.

    The problem

    Converting existing data to RDF, such as for VIVO, often involves taking tabular data exported from a system of record, transforming or augmenting it in some way, and then mapping it to RDF for ingest into the platform. The W3C maintains an extensive list of tools designed to map tabular data to RDF.

    General purpose CSV to RDF tools, however, almost always require some advanced preparation or cleaning of the data. This means that developers and data wranglers often have to write custom code. This code can quickly become verbose and difficult to maintain. Using an ETL toolkit can help with this.

    ETL with Python

    One such ETL tool that I'm having good results with is petl, Python ETL. petl started at an informatics group at the University of Oxford and is maintained by Alistair Miles. It has clear documentation and is available under an open license.

    petl provides adapters for reading data from a variety of sources - csv, Excel, databases, XML - and many utilities for cleaning, transforming, and validating. For example adding a column of static values to a petl table is as simple as:

    etl.addfield(table1, 'type', 'person')

    petl and JSON-LD for RDF

    petl, however, doesn't have utilities for outputting tables to RDF. To add this functionality, I've written a small extension, called petl-ld, to use JSON-LD contexts to map petl's table data structure to RDF. This allows the developer to clean, enhance, and validate the incoming data with petl functionality and patterns - and then, as a final step, apply a JSON-LD context to create an RDF serialization.

    The JSON-LD transformation utilizes the rdflib-jsonld extension to RDFLib maintained by Niklas Lindström.

    Here is an example:

    import petl as etl
    import petlld

    # set up a petl table to demonstrate
    table1 = [['uri', 'name'],
              ['n1', "Smith, Bob"],
              ['n2', "Jones, Sally"],
              ['n3', "Adams, Bill"]]

    # use petl utilities to add a column with our data type - foaf:Person
    table2 = etl.addfield(table1, 'a', 'foaf:Person')

    # a JSON-LD context for our data
    ctx = {
        "@base": "",
        "a": "@type",
        "uri": "@id",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "name": "rdfs:label"
    }

    # serialize the data as JSON-LD
    table2.tojsonld(ctx, indent=2)

    The JSON-LD output:

    {
      "@context": {
        "a": "@type",
        "foaf": "http://xmlns.com/foaf/0.1/",
        "name": "rdfs:label",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "uri": "@id",
        "@base": ""
      },
      "@graph": [
        {
          "a": "foaf:Person",
          "uri": "n1",
          "name": "Smith, Bob"
        },
        {
          "a": "foaf:Person",
          "uri": "n2",
          "name": "Jones, Sally"
        },
        {
          "a": "foaf:Person",
          "uri": "n3",
          "name": "Adams, Bill"
        }
      ]
    }
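    Under the hood, the transformation is conceptually simple. Here is a standard-library-only sketch of the same table-to-JSON-LD idea (my own illustration, not petl-ld's actual implementation): the header row supplies the keys, and each remaining row becomes a resource in the @graph.

    ```python
    import json

    def table_to_jsonld(table, ctx):
        # Header row supplies the keys; each remaining row becomes one resource.
        header, *rows = table
        graph = [dict(zip(header, row)) for row in rows]
        return {"@context": ctx, "@graph": graph}

    table = [["uri", "name", "a"],
             ["n1", "Smith, Bob", "foaf:Person"],
             ["n2", "Jones, Sally", "foaf:Person"]]
    ctx = {"uri": "@id", "a": "@type", "name": "rdfs:label"}
    print(json.dumps(table_to_jsonld(table, ctx), indent=2))
    ```

    petl-ld layers this behind a petl table method, so the mapping composes with the rest of an ETL pipeline.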

    If you would rather output an RDFLib Graph for serialization in another format, that is possible too.

    graph = table2.tograph(ctx)
    print graph.serialize(format='turtle')

    The turtle output:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xml: <http://www.w3.org/XML/1998/namespace> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <n1> a foaf:Person ;
        rdfs:label "Smith, Bob" .

    <n2> a foaf:Person ;
        rdfs:label "Jones, Sally" .

    <n3> a foaf:Person ;
        rdfs:label "Adams, Bill" .

    Summary

    If you are working with Python and converting tabular data to RDF, take a look at petl-ld and see if it helps you write less, and more readable, code. Feedback is welcome.

    William Denton: Diana Athill on reading

    Sat, 2015-12-05 03:02

    From Mariella Frostrup’s Open Book interview with Diana Athill:

    Mariella Frostrup: What do you feel reading’s given to you over the many decades?

    Diana Athill: Pretty well everything. I mean, an enormous amount of one’s experience comes from books.

    Diana Athill is about to turn 98.

    LibUX: Core Content Audit

    Fri, 2015-12-04 21:59

    A content audit produces an inventory of your website’s media, articles, ad copy, posts, pages, status updates — the whole shebang. It is your bird’s-eye view of the message, its voice and tone, an objective representation of the quality of your content and the architecture of your site. And audits are just no fun at all.

    For sites of any scale — particularly those with handfuls of content creators, subject specialists, instructors, sometimes representing totally disparate departments serving niche user needs — having that snapshot of your sprawl can make inroads for everyone’s sanity. Organizations with any level of churn need to be able to identify and assign orphaned content so that its disrepair doesn’t misinform or otherwise endanger their credibility. More importantly, content-making without a disciplined editorial process can build up redundant content, like plaque on the gums. The difficulty of keeping this duplicate content current rises exponentially, and as it does, the integrity of your content overall breaks down that much faster.

    Just the process of building the inventory can be the worst ever, but what most often follows is a frank sense of hard work gone down the memory hole. “Whew, we’re done – uh, what’s next?” Having the inventory implies nothing about what then to do with it, and it is tempting to look ahead to the so-what anticlimax of an audit and decide that the undertaking isn’t really worth the time.

    My thinking was that the completed audit, then, should be a tool – not just a reference. It needs to offer some sort of answer to the inevitable question – now what?

    Pinpoint problem content by assigning a score

    Because no one wants to be the sucker on lone audit duty, it dilutes the ugggh if we can use a system that doesn’t rely on the work of trained content strategists but can be crowdsourced using “objective-ish” heuristics judging the quality of a piece of content.

    Ideally, content that needs attention bubbles to the surface.

    Imagine a piece of content — an event — compared against a rubric that measured whether it was good. It gets a grade, a solid B — 80%. Another event ranks lower. This comparison, however simple, implies forward movement: time is better spent to improve the content of the second event because the first is good enough.

    Can we determine objectively good content?

    My high-school English teacher once told me that only when you know the rules of writing can you break them. The value of writing is largely subjective; it says more about the state of the reader than the quality of the writer. But there are nevertheless base axioms you adhere to when you learn: i-before-e-except-after-c, the five-paragraph essay, thesis in the first sentence, and so on.

    In the same way, there are objective-ish markers for quality web content supported by user experience research. I stress the research part, because determining and enforcing a score-based content audit can hurt some feelings. What is important about emphasizing the UX when you begin this process is to communicate that scoring isn’t an attack on a person’s ability to write, but a means for optimizing content for user behavior so that the content meets a business goal.

    I keep using “objective-ish” because, well, of course this is subjective. The objective-ness in the end doesn’t really matter except as a gut-check, a gut-instinct that helps improve the value of a content audit as an internal tool. We use it to help us make decisions about what content needs work.

    Here are qualities I think we can score: the content is satisficing; it answers a question or serves a demonstrable need; it is mobile first, concise, accurate, and without confusion; and it has an appropriate voice and tone.

    Satisficing

    This is really a thing and by grace of English a word: a mashup of “satisfying” and “suffice” that blah blah blah means most folks will skim content for a good-enough answer rather than read and digest the whole thing. On a medium where the average page visit is less than a minute, this kind of hit-it-and-quit-it behavior makes sense. In most cases, the speed of the customer journey is as important if not more than its quality: speed is the quality (#perfmatters). Our users are primed to get what they need and move on.

    Unless the metric that matters for our content is time engaged, such as for a video, it behooves creators to craft content that is optimized for this behavior. In most cases, we are judging scannability. The heuristic for a satisficing user experience involves the presence of markers such as the inverted pyramid, simple descriptive headlines, and frequent relevant links in the text, which people use as indicators of particularly relevant areas.

    What Question Does it Answer?

    An exercise I think is particularly useful is to try to determine what question does this content answer? Where it’s difficult to determine, there is a problem with the content. Your page about business hours answers the question: “when are you open?” The most satisficing answer, of course, is “right now” — right at the top.

    Relevance
    The questioning of relevancy can be a little tender and awkward, but it can be insightful to determine whether there is demonstrable need for that piece of content, particularly in terms of business goals.

    Having to produce clickbait to get eyeballs on ads is totally legit. Relevance is probably unique to the organization and its success metrics, but not all content is created equal, some content has larger audiences, greater demonstrable need, and it’s useful to identify these – if only to know what’s not working.

    Mobile first

    Mobile-first content is tailored for an increasingly mobile audience. Its value is manifold, from future-proofing to how that content is delivered; the point of mobile-firstness is that it is point-of-need.

    Brief, true, clear, appropriate

    — or “everything else.” The content should be as brief as necessary. Where facts are presented it needs to be accurate. The message should be clear. The content needs to be relevant.

    A simple rubric

    The idea is that by grading these qualities on a scale from 0 – 5 — low to high — determining the average and then dividing by our chosen scale (5), we assign a score to that piece of content. In Excel, the formula is something like =AVERAGE(E2:K2)/5, which calculates a percentage. An “About Us” page with a score of 75% needs less attention than a “Contact” page scoring in the low 40s.

    That’s all there is to it. Two pieces of content compared against one another to give stakeholders an idea where there are opportunities for improvement.
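    The same arithmetic as a quick Python sketch (the function name and sample ratings are mine, not part of the original spreadsheet):

    ```python
    def content_score(ratings):
        """Average a list of 0-5 rubric ratings and normalize by the scale,
        like the spreadsheet formula =AVERAGE(E2:K2)/5."""
        return sum(ratings) / len(ratings) / 5

    # Seven illustrative ratings: audience, accuracy, voice/tone, clarity,
    # concision, satisficing, mobile-first.
    about_us = [4, 5, 4, 4, 3, 4, 4]
    print(f"{content_score(about_us):.0%}")  # prints 80%
    ```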


    For my needs, I score content based on my gut feeling, using the simple markers I mentioned above. This of course can be as informal or strict as needed. And even though I set up the inventory, I don’t actually want to be the one performing it (yuck, amirite?), so I drummed up the following scale so that multiple scorers can be in the same ballpark.

    This scale is for a large academic library website, so the language reflects that. Note that I use “audience” rather than “relevance” to illustrate or communicate demonstrable need without creating insult.


    Audience

    A 5-point scale. Who is the content meant for; for which audiences is there demonstrable need?

    Score: Guideline
    5: All users, all ages. Often basic content every library website should have. Not having this content would generate complaints from all audiences.
    4: Important to a large audience (e.g., all undergraduates). Content with a lot at stake, either because of high traffic or because it is crucial to the success of library services. Not every library has to have this content, but it has become important to our identity / brand / purpose / mission.
    3: This content extends our purpose and is targeted to a specific but needful audience. Most course or subject guides, or information for just a demographic.
    2: This content is nice to have, provides an aesthetic, or is otherwise “bonus.” Not necessarily a sufficiently demonstrable need, but it adds to the credibility and success of the library. It could probably be removed any time without huge repercussions.
    1: There is barely any rationale for this content existing. The library gains nothing in particular for providing this content. Or, this content is needlessly redundant.
    0: This is a picture of a goat.

    Accurate

    A 5-point scale. Is the content on this page up to date and accurate?

    Score: Guideline
    5: All information is correct and up to date.
    4: There are one or two statements that could use an update, but they are not crucial or misleading.
    3: There is some misleading information and this needs attention.
    2 – 0: Crucial information is incorrect, such as building hours. Adjust based on severity.

    Voice / Tone

    A 5-point scale. Is the voice / tone appropriate for this kind of content?

    Score: Guideline
    5: The voice and tone is consistent with the overall voice, as well as appropriately casual/stiff/excited/silly/warning for this content.
    4: Voice and tone is appropriate, but maybe leaning a little too far to one or another extreme.
    3: I can’t really detect any particular voice. E.g., this is a table of information.
    2: Voice and tone is inconsistent throughout the content.
    1 – 0: Voice and tone is wildly inappropriate. It’s a severe weather warning with emoji.

    Clear

    A 5-point scale. There’s nothing particularly confusing about this content, right?

    Score: Guideline
    5: Crystal clear!
    4: Clear! – but with a few non-crucial spotty areas that I didn’t quite understand.
    3: Uh oh, this is sounding a tad academic. Or there is a lot of information to digest, or the information is presented strangely.
    2: The important information on this page is muddled.
    1: Are you speaking Elvish?
    0: … Klingon?

    Concise

    A 5-point scale. Is the content as brief as possible while being comprehensive?

    Score: Guideline
    5: Yep
    4: There are a few lingering, irrelevant items in the sidebar, or maybe the author is a little wordy.
    3: This content could be cut in half.
    2: Irrelevant information is mixed in with the relevant, making the content long and potentially impacting its clarity.
    1 – 0: Does this have to be an essay?

    Satisficing

    A 5-point scale, with some explanation. Can the user find or do what they want without a lot of effort? is it skimmable? Can they get the information they want within the Average Time on Page?

    Here are some markers for “satisficing” content: descriptive page and section titles; most important content first, followed by more in-depth supporting information; the path to completion is clear.

    Score: Guideline
    5: The answer to the question the user has is right at the top. There are clear section headings, it is easy to scan.
    4: The answer to the question the user has is easy to locate, but maybe buried under something else.
    3: It takes longer than 5 seconds to locate the answer to the question that brought the user. There are section titles but they are not overly descriptive. This isn’t very scannable.
    2 – 0: Just, no. Adjust for severity.

    Mobile First

    A 5-point scale, with some explanation. The page layout and content is mobile first. Meaning that on a small screen things are legible without eyestrain, can be interacted with by touch, align in the correct order.

    Score: Guideline
    5: Print is legible, navigable, and content is presented in order of importance.
    4 – 3: Print may be a tad small, or the order a little unclear, but the website is responsive and the content is legible.
    2: There are unresponsive elements (such as tables) on this page that make it difficult to navigate, or the print is too small.
    1: The website is not responsive.
    0: The website redirects to an app, m-dot website, or something equally aggressive.
    The qualitative “core” in “core content”

    Ida Aalen captured my imagination with “The Core Model: Designing Inside Out for Better Results,” published January 2015 in A List Apart. I jumped on her instructions for performing a content modelling workshop which forces you to align user goals with business goals by identifying the essential parts of a piece of content, where it exists in a user flow (inward and forward paths), and how it should be crafted for small screens.

    A Core Model Handout

    It’s useful to represent these and similar attributes in the core content audit to add context to the rubric.

    • What question does it answer?
    • Inward paths — how users discover or navigate to the content
    • Forward paths — what opportunities are we explicitly presenting for forward action (e.g., sign-up for a mailing list, navigating to extra content, etc.)
    • Content type — page, event, listicle, video, ….
    Integrating a quantitative component, like Google Analytics

    The rubric presents the illusion of objectivity, but the conversations that result from the audit can be kept constructive by bolstering them with select page-level data from an analytics tool that you think helps support an argument. I chose the number of sessions per month, average time on page in seconds, and bounce rate.

    This isn’t intended to be a traffic report; rather, the numbers provide additional context describing the content in question. Sessions specifically add weight to the score given to the “audience” rating, whereas time on page can indicate either success or failure depending on the type of content: 54 seconds average may be just the time required to find a number or look up directions, but on a video it may suggest that most users don’t make it to the pitch or watch the last 80% of the video. Bounce rate, too, depends on the content itself: high bounce when looking for hours of operation is dandy, unless this page needs to lead to a product or contact form.

    I’ve been thinking about adding Speed Index, and at a certain index — say, 3000 — start paring off percentage points from the overall score.
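    One hedged sketch of how that paring-off could work; the threshold and penalty rate here are arbitrary placeholders of my own, not settled numbers:

    ```python
    def speed_penalty(score, speed_index, threshold=3000, per_1000=0.05):
        """Pare percentage points off an overall content score once the page's
        Speed Index exceeds a threshold (hypothetical rule, tunable knobs)."""
        over = max(0, speed_index - threshold)
        return max(0.0, score - per_1000 * (over / 1000))

    print(speed_penalty(0.8, 2500))  # under the threshold: score unchanged
    print(speed_penalty(0.8, 5000))  # 2000 over: five points pared off per 1000
    ```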


    Integrating numbers from Google Analytics during a page audit adds a lot of time to that audit, though. For me, the important part is that there is an inventory, and with lots of people working on this project in just their spare time I’ve opted to make these qualitative and quantitative contextual fields optional. I can’t — or don’t know how to — make these automatic.

    What I can do, however, is use services like Zapier or IFTTT to automatically create an entry for a new content item in the Google Spreadsheet once it is published. This relies on our content management system broadcasting a feed or other API on-publish, but this is common enough that I choose to delude myself by thinking it’s a sure thing.

    My feeling is that at the point where the rate of content creation overpowers our ability to maintain an inventory is the point where using a CMS is necessary, but that’s not always the case for smaller or less frequently updated web presences.

    Nevertheless, I work with a central WordPress Network as the hub of a COPE — “create once, publish everywhere” — system, which utilizes feeds and APIs to syndicate content across platforms. These feeds can be hooks to automatically populate rows into the audit.
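    As a sketch of that hook, assuming an RSS 2.0 feed (the feed shape and field names here are illustrative; in practice Zapier, IFTTT, or a cron job would append these rows to the spreadsheet):

    ```python
    import xml.etree.ElementTree as ET

    def feed_items(rss_xml):
        """Yield (title, link, pubDate) for each <item> in an RSS 2.0 feed."""
        root = ET.fromstring(rss_xml)
        for item in root.iter("item"):
            yield (item.findtext("title", ""),
                   item.findtext("link", ""),
                   item.findtext("pubDate", ""))

    def new_inventory_rows(rss_xml, seen_links):
        """Build unscored audit rows for items not already in the inventory."""
        return [[title, link, date, ""]  # trailing "" = score left blank
                for title, link, date in feed_items(rss_xml)
                if link not in seen_links]

    sample = """<rss><channel>
      <item><title>New post</title><link>http://example.org/p1</link>
      <pubDate>Fri, 04 Dec 2015</pubDate></item>
    </channel></rss>"""
    print(new_inventory_rows(sample, seen_links=set()))
    ```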

    This is in-progress

    I am so far pretty happy with how the core content audit works, and I now use it for several sites. However, just in the course of this writing I’ve tweaked things as they occurred to me; something like this is most definitely a work in progress. I am anxious to hear your thoughts and ideas.

    In the meantime, you can fork this Google Spreadsheet.

    For my needs, this has been super successful. I can basically get away using otherwise untrained student workers to perform the nitty-gritty data entry and trust that by their following my rubric we are getting a solid visualization of the quality of our content. Problem items bubble to the top. Every so often, it’s a false alarm because of the nature of the content, but it nevertheless inspires someone – me, the content author, or another onlooker – to at least fact-check content that might have otherwise been published and forgotten.

    New posts, regardless of their content type, are automatically inserted when published, without any scoring. While this automation only extends to a fraction of the content we produce, it at least ensures that the audit keeps growing and remains useful as a reference.

    Hopefully, you can use it too.

    The post Core Content Audit appeared first on LibUX.

    pinboard: Twitter

    Fri, 2015-12-04 08:47
    It is official: #catmandu has a #code4lib #c4l16 pre-conference slot :)