Feed aggregator

ACRL TechConnect: Collaborative UX Testing: Cardigans Included

planet code4lib - Tue, 2015-08-04 14:00

Understanding and responding to user needs has always been at the heart of librarianship, although in recent years this has taken a more intentional approach through the development of library user experience positions and departments. Such positions are a mere fantasy, though, for many smaller libraries, where librarian teams of three or four run the entire show. For the twenty-three member libraries of the Private Academic Library Network of Indiana (PALNI) consortium this is regularly the case, with each school having an average of four librarians on staff. However, by leveraging existing collaborative relationships, utilizing recent changes in library systems and consortium staffing, and (of course) picking up a few new cardigans, PALNI has begun studying library user experience at scale with a collaborative usability testing model.

With four library testing locations spread over 200 miles in Indiana, multiple facilitators conducted testing of the consortial discovery product, OCLC’s WorldCat Discovery. Using WebEx to record each participant’s screen and project the testing into a library staff observation room, we observed 30 participants complete three general tasks with multiple parts, helping us assess user needs and participant behavior.

Collaborative testing had clear advantages over the traditional, siloed approach, shown most obviously in the amount and type of data we received. The most important opportunity was the ability to test different setups of the same product. This type of comparative data led to conclusive setup recommendations and showed which problems were unique to an institution versus general user problems. Testing at multiple schools also provided much more data, which reduced the likelihood of testing only outliers.

The second major advantage of collaborative testing was the ability to work as a team. From a physical standpoint, working as a team allowed us to spread the testing out, keeping it fresh in our minds and giving us enough time in between to fix scripts and materials. This also allowed us to test before and after technical upgrades. From a relational perspective, sharing the work and offering continual support reduced burnout during the testing. When it came to analyzing the data, different people brought different skill sets. Our particular team consisted of a graphic/interface designer, a sympathetic ear, and a master editor, all of whom played important roles when it came to analyzing and writing the report. Simply put, it was an enjoyable experience which resulted in valuable, comparative data – one that could not have happened if the libraries had taken a siloed approach.

When we were designing our test, we met with Arnold Arcolio, a User Researcher in OCLC’s User Experience and Information Architecture Group. He gave us many great pieces of advice. Some of them we found to work well in our testing, while others we rejected. The most valuable piece of advice he gave us was to start with the end in mind: make sure you have clear objectives for what data you are trying to obtain. If you leave your objectives open-ended, you will spend the rest of your life reviewing the data and learning interesting things about your users every time.

His recommendations, and what we decided, were as follows:

  • He recommended testing at least two users of the same type, which helps avoid outliers. For us, that meant testing at least two first-year students and two seniors.
  • He recommended testing users on their own devices. We found this to be impractical for our purposes, as all devices used for testing had to have web conferencing software that allowed us to record users’ screens.
  • He recommended having the participants read the tasks out loud, a technique that we used and recommend as well.
  • He recommended using low-tech solutions for our testing, rather than expensive software and eye tracking software. This was a huge relief to PALNI’s executive director, who manages our budget.
  • He recommended testing participants where they would normally do their research: in dorm rooms, faculty offices, etc. We did not take this recommendation due to time and privacy concerns.
  • He was very concerned about our use of multiple facilitators, so we standardized our testing as much as possible. First, we chose uniforms for our facilitators. Being librarians, the obvious choice was cardigans. We ordered matching, logoed cardigans from Lands’ End and wore those to conduct our testing. This allowed us to look as similar as possible and avoid skewing participants’ impressions. We chose cardigans in blue because color theory suggests that blue persuades participants to trust the facilitator while feeling calm and confident. We also worked together to create a very detailed script that was used by each facilitator for each test.

Our next round of usability testing will incorporate many of the same recommendations provided by our usability expert, discussed above, with a few additions and changes. This fall, we will be including a mobile device portion using a camera mount (Mr. Tappy) to screen record, testing different tasks, and working with different libraries. Our libraries’ staff also recommended making the report more action-oriented, with best setup practices and highlighted instructional needs. We are also developing a list of common solutions for participant problems, such as when to redirect or correct misspellings. Finally, as much as we love the cardigans, we will be wearing matching logoed polos underneath for those test rooms that mirror the climate of the Sahara Desert.

We have enjoyed our usability experiences immensely; they are a great chance to visit with library staff, faculty, and students from other institutions in our consortium. Working collaboratively proved to be a success in our consortium, where smaller libraries, short staffing, and minimal resources would otherwise have made it impossible to conduct large-scale usability testing. Plus, we welcome having another cardigan in our wardrobe.

More detailed information about our Spring 2015 study can be found in our report, “PALNI WorldCat Discovery Usability Report.”

About our guest authors:

Eric Bradley is Head of Instruction and Reference at Goshen College and an Information Fluency Coordinator for PALNI.  He has been at Goshen since 2013.  He does not moonlight as a Mixed Martial Arts fighter or Los Angeles studio singer.

Ruth Szpunar is an Instruction and Reference Librarian at DePauw University and an Information Fluency Coordinator for PALNI. She has been at DePauw since 2005. In her spare time she can be found munching on chocolate or raiding the aisles at the Container Store.

Megan West has been the Digital Communications Manager at PALNI since 2011. She specializes in graphic design, user experience, project management and has a strange addiction to colored pencils.

SearchHub: Solr 5’s new ‘bin/post’ utility

planet code4lib - Tue, 2015-08-04 00:11
Series Introduction

This is the first in a three part series demonstrating how it’s possible to build a real application using just a few simple commands.  The three parts to this are:

  • Getting data into Solr using bin/post
  • Visualizing search results: /browse and beyond
  • Putting it together realistically: example/files – a concrete useful domain-specific example of bin/post and /browse
Introducing bin/post: a built-in Solr 5 data indexing tool

In the beginning was the command-line… As part of the ease of use improvements in Solr 5, the bin/post tool was created to allow you to more easily index data and documents. This article illustrates and explains how to use this tool.

Those (pre-5.0) Solr veterans who have most likely run Solr’s “example” will be familiar with post.jar, under example/exampledocs. You may have only used it when firing up Solr for the first time, indexing example tech products or book data. Even if you haven’t been using post.jar, give this new interface a try, even if just for the occasional sending of administrative commands to your Solr instances. See below for some interesting simple tricks that can be done using this tool.

Let’s get started by firing up Solr and creating a collection:

  $ bin/solr start
  $ bin/solr create -c solr_docs

The bin/post tool can index a directory tree of files, and the Solr distribution has a handy docs/ directory to demonstrate this capability:

  $ bin/post -c solr_docs docs/
  java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar -Dauto=yes -Dc=solr_docs -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool /Users/erikhatcher/solr-5.3.0/docs/
  SimplePostTool version 5.0.0
  Posting files to [base] url http://localhost:8983/solr/solr_docs/update...
  Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
  Entering recursive mode, max depth=999, delay=0s
  Indexing directory /Users/erikhatcher/solr-5.3.0/docs (3 files, depth=0)
  . . .
  3575 files indexed.
  COMMITting Solr index changes to http://localhost:8983/solr/solr_docs/update...
  Time spent: 0:00:30.705

30 seconds later we have Solr’s docs/ directory indexed and available for searching. Foreshadowing the next post in this series, check out http://localhost:8983/solr/solr_docs/browse?q=faceting to see what you’ve got.

Is there anything bin/post can do that clever curling can’t do? Not a thing, though you’d have to iterate over a directory tree of files, or do web crawling and parse out links to follow, for entirely comparable capabilities. bin/post is meant to simplify the (command-line) interface for many common Solr ingestion and command needs.

Usage

The tool provides solid -h help, with the abbreviated usage specification being:

  $ bin/post -h
  Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
      or post -help
  collection name defaults to DEFAULT_SOLR_COLLECTION if not specified
  ...

See the full bin/post -h output for more details on parameters and example usages. A collection, or URL, must always be specified, either with -c (or DEFAULT_SOLR_COLLECTION set in the environment) or with -url. There are parameters to control the base Solr URL using -host, -port, or the full -url. Note that when using -url it must be the full URL, including the core name all the way through to the /update handler, such as -url http://staging_server:8888/solr/core_name/update.

Indexing “rich documents” from the file system or web crawl

File system indexing was demonstrated above, indexing Solr’s docs/ directory, which includes a lot of HTML files. Another fun example is to index your own documents folder like this:

  $ bin/solr create -c my_docs
  $ bin/post -c my_docs ~/Documents
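Under the hood, each rich file ends up as a POST to Solr’s extract handler, which is the sort of thing you could also hand-roll with curl. A rough sketch for a single HTML file from the distribution might look like the following; the literal.id value and the form field name are arbitrary placeholders for this illustration:

  # sketch: roughly what bin/post does for one rich document
  $ curl "http://localhost:8983/solr/solr_docs/update/extract?literal.id=doc1&commit=true" \
        -F "myfile=@docs/index.html"

bin/post simply saves you from writing that loop (and the content-type and link-following logic) yourself.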
There’s a constrained list of file types (by file extension) that bin/post will pass on to Solr, skipping the others; bin/post -h provides the default list used. To index a .png file, for example, set the -filetypes parameter: bin/post -c test -filetypes png image.png. To not skip any files, use “*” for the filetypes setting: bin/post -c test -filetypes "*" docs/ (note the double-quotes around the asterisk; otherwise your shell may expand it to a list of files and not operate as intended).

Browse and search your own documents at http://localhost:8983/solr/my_docs/browse

Rudimentary web crawl

Careful now: crawling web sites is no trivial task to do well. The web crawling available from bin/post is very basic, single-threaded, and not intended for serious business. But it sure is fun to be able to fairly quickly index a basic web site and get a feel for the types of content processing and querying issues to face as a production-scale crawler or other content acquisition means are in the works:

  $ bin/solr create -c site
  $ bin/post -c site -recursive 2 -delay 1   # (this will take some minutes)

Web crawling adheres to the same content/file type filtering as the file crawling mentioned above; use -filetypes as needed. Again, check out /browse; for this example try http://localhost:8983/solr/site/browse?q=revolution

Indexing CSV (column/delimited) files

Indexing CSV files couldn’t be easier! It’s just this, where data.csv is a standard CSV file:

  $ bin/post -c collection_name data.csv

CSV files are handed off to the /update handler with a content type of “text/csv”; bin/post detects that it is a CSV file by the .csv file extension. Because the file extension is used to pick the content type, and there is currently only a fixed “.csv” mapping to text/csv, you will need to explicitly set the content type with -type if the file has a different extension:

  $ bin/post -c collection_name -type text/csv data.file

If the delimited file does not have a first line of column names, some columns need excluding or name mapping, the file is tab rather than comma delimited, or you need to specify any of the various options to the CSV handler, the -params option can be used. For example, to index a tab-delimited file, set the separator parameter like this:

  $ bin/post -c collection_name data.tsv -type text/csv -params "separator=%09"

The key=value pairs specified in -params must be URL encoded and ampersand separated (tab is URL encoded as %09). If the first line of a CSV file is data rather than column names, or you need to override the column names, you can provide the fieldnames parameter, setting header=true if the first line should be ignored:

  $ bin/post -c collection_name data.csv -params "fieldnames=id,foo&header=true"

Here’s a neat trick you can do with CSV data: add a “data source”, or some other field to identify which file or data set each document came from, with a literal.<field_name>= parameter like this:

  $ bin/post -c collection_name data.csv -params "literal.data_source=temp"

Provided your schema allows for a data_source field to appear on documents, each file or set of files you load gets tagged to some scheme of your choosing, making it easy to filter, delete, and operate on that data subset. Another literal field name could be the filename itself; just be sure that the file being loaded matches the value of the field (it’s easy to up-arrow and change one part of the command-line but not another that should be kept in sync).
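Once documents are tagged this way, that field works like any other at query time. A quick sketch of filtering on it, reusing the collection_name and temp values from above:

  # sketch: retrieve only the documents tagged with data_source=temp
  $ curl "http://localhost:8983/solr/collection_name/select?q=*:*&fq=data_source:temp&wt=json&indent=on"

The same fq also makes a handy preview before running the delete-by-query command shown later in this post.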
Indexing JSON

If your data is in Solr JSON format, it’s just bin/post -c collection_name data.json. Arbitrary, non-Solr, JSON can be mapped as well. Using the exam grade data and example from here, the splitting and mapping parameters can be specified like this:

  $ bin/post -c collection_name grades.json -params "split=/exams&f=first:/first&f=last:/last&f=grade:/grade&f=subject:/exams/subject&f=test:/exams/test&f=marks:/exams/marks&json.command=false"

Note that json.command=false had to be specified so the JSON is interpreted as data, not as potential Solr commands.
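The exam grade data itself isn’t reproduced here, but the shape those split and f mappings assume (a parent object with an exams array) would look roughly like this made-up example of a grades.json entry:

  {
    "first": "John",
    "last": "Doe",
    "grade": 8,
    "exams": [
      { "subject": "Maths",   "test": "term1", "marks": 90 },
      { "subject": "Biology", "test": "term1", "marks": 86 }
    ]
  }

With split=/exams, each element of the exams array becomes its own Solr document, carrying along the mapped top-level first, last, and grade values.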
Indexing Solr XML

Good ol’ Solr XML, easy peasy: bin/post -c collection_name example/exampledocs/*.xml. If you don’t know what Solr XML is, have a look at Solr’s example/exampledocs/*.xml files. Alas, there are currently no splitting and mapping capabilities for arbitrary XML using bin/post; use the Data Import Handler with the XPathEntityProcessor to accomplish this for now. See SOLR-6559 for more information on this future enhancement.

Sending commands to Solr

Besides indexing documents, bin/post can also be used to issue commands to Solr. Here are some examples:

  • Commit: bin/post -c collection_name -out yes -type application/json -d '{commit:{}}' Note: For a simple commit, no data/command string is actually needed. An empty, trailing -d suffices to force a commit, like this – bin/post -c collection_name -d
  • Delete a document by id: bin/post -c collection_name -type application/json -out yes -d '{delete: {id: 1}}'
  • Delete documents by query: bin/post -c test -type application/json -out yes -d '{delete: {query: "data_source:temp"}}'
The -out yes echoes the HTTP response body from the Solr request, which generally isn’t any more helpful with indexing errors, but is nice to see with commands like commit and delete, even on success.

Commands, or even documents, can be piped through bin/post when -d dangles at the end of the command-line:

  # Pipe a commit command
  $ echo '{commit: {}}' | bin/post -c collection_name -type application/json -out yes -d

  # Pipe and index a CSV file
  $ cat data.csv | bin/post -c collection_name -type text/csv -d

Inner workings of bin/post

The bin/post tool is a straightforward Unix shell script that processes and validates command-line arguments and launches a Java program to do the work of posting the file(s) to the appropriate update handler end-point. Currently, SimplePostTool is the Java class used to do the work (the core of the infamous post.jar of yore). Actually, post.jar still exists and is used under bin/post, but this is an implementation detail that bin/post is meant to hide.

SimplePostTool (not the bin/post wrapper script) uses file extensions to determine the Solr end-point to use for each POST. There are three special types of files that POST to Solr’s /update end-point: .json, .csv, and .xml. All other file extensions will get posted to the URL+/extract end-point, which richly parses a wide variety of file types. If you’re indexing CSV, XML, or JSON data and the file extension doesn’t match, or the data isn’t actually a file (if you’re using the -d option), be sure to explicitly set the -type to text/csv, application/xml, or application/json.

Stupid bin/post tricks

Introspect rich document parsing and extraction

Want to see how Solr’s rich document parsing sees your files? Not a new feature, but a neat one that can be exploited through bin/post by sending a document to the extract handler in a debug mode, returning an XHTML view of the document, metadata and all. Here’s an example, setting -params with some extra settings explained below:

  $ bin/post -c test -params "extractOnly=true&wt=ruby&indent=yes" -out yes docs/SYSTEM_REQUIREMENTS.html
  java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=yes -Dout=yes -Dc=test -Ddata=files org.apache.solr.util.SimplePostTool /Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html
  SimplePostTool version 5.0.0
  Posting files to [base] url http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes...
  Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
  POSTing file SYSTEM_REQUIREMENTS.html (text/html) to [base]/extract
  {
    'responseHeader'=>{
      'status'=>0,
      'QTime'=>3},
    ''=>'<?xml version="1.0" encoding="UTF-8"?>
  <html xmlns="">
  <head>
  <meta name="stream_size" content="1100"/>
  <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
  <meta name="X-Parsed-By" content="org.apache.tika.parser.html.HtmlParser"/>
  <meta name="stream_content_type" content="text/html"/>
  <meta name="dc:title" content="System Requirements"/>
  <meta name="Content-Encoding" content="UTF-8"/>
  <meta name="resourceName" content="/Users/erikhatcher/solr-5.2.0/docs/SYSTEM_REQUIREMENTS.html"/>
  <meta name="Content-Type" content="text/html; charset=UTF-8"/>
  <title>System Requirements</title>
  </head>
  <body>
  <h1>System Requirements</h1>
  ...
  </body>
  </html>
  ',
    'null_metadata'=>[
      'stream_size',['1100'],
      'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
        'org.apache.tika.parser.html.HtmlParser'],
      'stream_content_type',['text/html'],
      'dc:title',['System Requirements'],
      'Content-Encoding',['UTF-8'],
      'resourceName',['/Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html'],
      'title',['System Requirements'],
      'Content-Type',['text/html; charset=UTF-8']]}
  1 files indexed.
  COMMITting Solr index changes to http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes...
  Time spent: 0:00:00.027

Setting extractOnly=true instructs the extract handler to return the structured parsed information rather than actually index the document. Setting wt=ruby (ah yes! go ahead, try it in json or xml :) and indent=yes allows the output (be sure to specify -out yes!) to render readably in a console.

Prototyping, troubleshooting, tinkering, demonstrating

It’s really handy to be able to test and demonstrate a feature of Solr by “doing the simplest possible thing that will work”, and bin/post makes this a real joy. Here are some examples –

Does it match?

This technique allows you to easily index data and quickly see how queries work against it. Create a “playground” index and post a single document with fields id, description, and value:

  $ bin/solr create -c playground
  $ bin/post -c playground -type text/csv -out yes -d $'id,description,value\n1,are we there yet?,0.42'

Unix note: that dollar-sign before the single-quoted CSV string is crucial for the new-line escaping to pass through properly. Alternatively, one could post the same data by putting the field names into a separate parameter, avoiding the need for a new-line and the associated escaping issue:

  $ bin/post -c playground -type text/csv -out yes -params "fieldnames=id,description,value" -d '1,are we there yet?,0.42'

Does it match a fuzzy query? their~, in the /select request below, is literally a FuzzyQuery, and ends up matching the document indexed (based on string edit distance fuzziness); rows=0 is set so we just see the numFound and the debug=query output:

  $ curl 'http://localhost:8983/solr/playground/select?q=their~&wt=ruby&indent=on&rows=0&debug=query'
  {
    'responseHeader'=>{
      'status'=>0,
      'QTime'=>0,
      'params'=>{
        'q'=>'their~',
        'debug'=>'query',
        'indent'=>'on',
        'rows'=>'0',
        'wt'=>'ruby'}},
    'response'=>{'numFound'=>1,'start'=>0,'docs'=>[]
    },
    'debug'=>{
      'rawquerystring'=>'their~',
      'querystring'=>'their~',
      'parsedquery'=>'_text_:their~2',
      'parsedquery_toString'=>'_text_:their~2',
      'QParser'=>'LuceneQParser'}}

Have fun with your own troublesome text, simply using an id field and any/all fields involved in your test queries, and quickly get some insight into how documents are indexed, text analyzed, and queries match. You can use this CSV trick for testing out a variety of scenarios, including complex faceting, grouping, highlighting, etc., often with just a small bit of representative CSV data.
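For instance, a small faceting scenario can be mocked up by tacking on one more made-up column. The category field and its values below are purely illustrative, and how it is typed will depend on what the data-driven schema guesses:

  # sketch: add a couple more rows with a hypothetical "category" column...
  $ bin/post -c playground -type text/csv -out yes -params "fieldnames=id,category,description,value" -d $'2,book,are we lost?,0.10\n3,movie,are we there yet?,0.99'

  # ...then facet on it
  $ curl 'http://localhost:8983/solr/playground/select?q=*:*&rows=0&facet=true&facet.field=category&wt=ruby&indent=on'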

Windows, I’m sorry. But don’t despair. bin/post is a Unix shell script. There’s no comparable Windows command file, like there is for bin/solr. The developer of bin/post is a grey beard Unix curmudgeon and scoffs “patches welcome” when asked where the Windows version is.  But don’t despair, before there was bin/post there was post.jar. And there still is post.jar.  See the Windows support section of the Reference Guide for details on how to run the equivalent of everything bin/post can do. Future What more could you want out of a tool to post content to Solr? Turns out a lot! Here’s a few ideas for improvements:
  • For starters, SolrCloud support is needed. Right now the exact HTTP end-point is needed, whereas SolrCloud indexing is best done with ZooKeeper awareness. Perhaps this fits under SOLR-7268.
  • SOLR-7057: Better content-type detection and handling (.tsv files could be considered delimited with separator=%09 for example)
  • SOLR-6994: Add a comparable Windows command file
  • SOLR-7042: Improve bin/post’s arbitrary JSON handling
  • SOLR-7188: And maybe, just maybe, this tool could also be the front-end to client-side Data Import Handler
And no doubt there are numerous other improvements to streamline the command-line syntax and hardiness of this handy little tool.

Conclusion

  $ bin/post -c your_collection your_data/

No, bin/post is not necessarily the “right” way to get data into your system considering streaming Spark jobs, database content, heavy web crawling, or other Solr-integrating connectors. But pragmatically, maybe it’s just the ticket for sending commit/delete commands to any of your Solr servers, or for doing some quick tests. And, say, if you’ve got a nightly process that produces new data as CSV files, a cron job to bin/post the new data would be as pragmatic and “production-savvy” as anything else (see the sketch at the end of this post).

Next up…

With bin/post, you’ve gotten your data into Solr in one simple, easy-to-use command. That’s an important step, though only half of the equation. We index content in order to be able to query it, analyze it, and visualize it. The next article in this series delves into Solr’s templated response writing capabilities, providing a typical (and extensible) search results user interface.
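As for that nightly-CSV scenario, the cron job could be as simple as the sketch below; the install path, collection name, export location, and schedule are all hypothetical:

  # hypothetical crontab entry: index the latest nightly CSV export at 2:15 AM
  15 2 * * * /opt/solr/bin/post -c nightly_data /data/exports/latest.csv >> /var/log/solr_post.log 2>&1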

The post Solr 5’s new ‘bin/post’ utility appeared first on Lucidworks.

FOSS4Lib Recent Releases: Hydra - 9.2.2

planet code4lib - Mon, 2015-08-03 23:44

Last updated August 3, 2015. Created by Peter Murray on August 3, 2015.

Package: Hydra
Release Date: Monday, August 3, 2015

Nicole Engard: Bookmarks for August 3, 2015

planet code4lib - Mon, 2015-08-03 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Pydio The mature open source alternative to Dropbox and

Digest powered by RSS Digest

The post Bookmarks for August 3, 2015 appeared first on What I Learned Today....

Related posts:

  1. Students get extra Dropbox space
  2. Project Gutenberg to Dropbox
  3. Open Source Options for Education

FOSS4Lib Upcoming Events: VuFind Summit 2015

planet code4lib - Mon, 2015-08-03 19:12
Date: Monday, October 12, 2015 - 08:00 to Tuesday, October 13, 2015 - 17:00
Supports: VuFind

Last updated August 3, 2015. Created by Peter Murray on August 3, 2015.

From the announcement:

Registration is now open for the 2015 VuFind Summit held Monday October 12 and Tuesday October 13, 2015 at Villanova University (in Villanova, PA). Registration will be $45 for two days of events, with breakfast/lunch included. You can register here:

As usual, the event will be a combination of structured talks, planning sessions and free-form hacking.

Evergreen ILS: Hack-A-Way 2015

planet code4lib - Mon, 2015-08-03 18:40

After much deliberation, forecasting, and haruspicy, the dates for the 2015 Hack-A-Way in Danvers, MA (just north of Boston) have been selected: November 4th – November 6th.

Lukas Koster: Maps, dictionaries and guidebooks

planet code4lib - Mon, 2015-08-03 14:51

Interoperability in heterogeneous library data landscapes

Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “data maze”. The level and magnitude of this opacity and heterogeneity varies with the amount of content types and the number of services that the library is responsible for. Academic and national libraries are possibly dealing with more extensive mazes than small public or company libraries.

In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. In order to successfully carry out these tasks, they manage a lot of data. Data can be regarded as the signals between collections and services.

These collections and services are administered using dedicated systems with dedicated datastores. The data formats in these dedicated datastores are tailored to perform the dedicated services that these dedicated systems are designed for. In order to use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual ones or as automated utilities. These transformation procedures function as translators of the signals in the form of data.

Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and implicit data redundancies using a number of different data formats, some of which systems are talking to each other in some way. This is not only confusing for end users but also for library system staff. End users lack clarity about user interfaces to use, and are missing relevant results from other sources and possible related information. Libraries need licenses and expertise for ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments elsewhere.

© Ron Zack

To take the linguistic analogy further, systems make use of a specific language (data format) to code their signals in. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC, EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for Primo). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variations in formulations and formats.

Translation does not only require applying a dictionary, but also interpretation of the context, syntax, local variations and transcriptions. Consequently much is lost in translation.

The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases even two separate translators are needed if source and target system do not speak each other’s language or dialect. The source signals are translated to some common language which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract Transform Load). Moreover, most translators only know a subset of the source and target language dependent on the data signals needed by the provided services. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals. It is essential to add the selections and transformations needed as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.

To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Or even worse, when systems’ borders are completely closed and no access whatsoever is possible, not even with a passport. Usually, this last situation is referred to with the term “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.

Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish maintenance, costs and vulnerability? Yes there are.

First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.

Secondly it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.

And finally it is essential to know which transformations are taking place en route. A guidebook should be incorporated in the repository, describing selections and transformations for every data flow.

You could leave it there and be satisfied with these guiding tools to help you getting around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions, that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (SOA) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology and vendor agnostic way using interoperable building blocks (“services”). In this definition “services” refer to reusable dataflows between systems, rather than to useful results for end users. I would prefer a definition of SOA to mean “a data and utilities architecture focused on delivering optimal end user services no matter what”.

Broadly speaking there are four main routes to establish a SOA-like condition, all of which can theoretically be implemented on a global, intermediate or local level.

  1. Single Store/Single Format: A single universal integrated datastore using a universal data format. No need for dataflows and translations. This would imply some sort of linked (open) data landscape with RDF as universal language and serving all systems and services. A solution like this would require all providers of relevant systems and databases to commit to a single universal storage format. Unrealistic in the short term indeed, but definitely something to aim for, starting at the local level.
  2. Multiple Stores/Shared Format: A heterogeneous system and datastore landscape with a universal communication language (a lingua franca, like English) for dataflows. No need for countless translators between individual systems. This universal format could be RDF in any serialization. A solution like this would require all providers of relevant systems and databases to commit to a universal exchange format. Already a bit less unrealistic.
  3. Shared Store/Shared Format: A heterogeneous system and datastore landscape with a central shared intermediate integrated datastore in a single shared format. Translations from different source formats to only one shared format. Dataflows run to and from the shared store only. For instance with RDF functioning as Esperanto, the artificial language which is actually sometimes used as “Interlingua” in machine translation. A solution like this does not require a universal exchange format, only a translator that understands and speaks all formats, which is the basis of all ETL tools. This is much more realistic, because system and vendor dependencies are minimized, except for variations in syntax and vocabularies. The platform itself can be completely independent.
  4. Multiple Stores/Single Translation Pool: or what is known as an Enterprise Service Bus (ESB). No translations are stored, no data is integrated. Simultaneous point to point translations between systems happen on the fly. Looks very much like the existing data maze, but with all translators sitting together in one cubicle. This solution is not a source of much relief, or as one large IT vendor puts it: “Using an ESB can become problematic if large volumes of data need to be sent via the bus as a large number of individual messages. ESBs should never replace traditional data integration like ETL tools. Data replication from one database to another can be resolved more efficiently using data integration, as it would only burden the ESB unnecessarily.”.

Overlooking the possible routes out of the data maze, it seems that the first step should be employing the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that the only feasible road on the short term is the intermediate integrated Shared Store/Shared Format solution.

Library of Congress: The Signal: The Personal Digital Archiving 2015 Conference

planet code4lib - Mon, 2015-08-03 12:13

“Washington Square Park” by Jean-Christophe Benoist. On Wikimedia.

The annual Personal Digital Archiving conference is about preserving any digital collection that falls outside the purview of large cultural institutions. Considering the expanding range of interests at each subsequent PDA conference, the meaning of the word “personal” has become thinly stretched to cover topics such as family history, community history, genealogy and digital humanities.

New York University hosted Personal Digital Archiving 2015 this past April, during a chilly snap over an otherwise perfect Manhattan spring weekend. The event attracted about 150 people, including more students than in the past.

Each year, depending on the conference’s location and the latest events or trends, the PDA audience and topics vary. But the presenters and attendees always share the same core interest: taking action about digital preservation.

PDA conferences glimpse at projects that are often created by citizen archivists, people who have taken on the altruistic task of preserving a digital collection simply because they recognized the importance of the content and felt that someone should step up and save it. PDA conferences are a chance for trained archivists, amateur archivists and accidental archivists to share information about their projects, about their challenges, about what worked and what didn’t, and about lessons learned.

Videos from Day 1 and Day 2 are online at the Internet Archive.

Howard Besser. Photo by Jasmyn Castro.

Howard Besser, professor of Cinema Studies at NYU and founding director of the NYU Moving Image Archiving and Preservation Program, set the tone in his welcome speech by talking about the importance of digital video as evidence and as cultural records, especially regarding eyewitness video of news events.

Keynote speaker Don Perry, a documentary film producer, talked about “What Becomes of the Family Album in the Digital Age?” (PDF) and his work with “Digital Diaspora Family Reunion.” He talked about preserving digital photos and the cultural impact of sharing photos with friends and family. He stressed that this work applies to every individual, family and community, and that everyone should consider the cultural importance of their digital photos. “The value of the artifact – and what we keep trying to tell young people – is that they are the authors of a history in the making,” said Perry. “And that they need to consider the archives that they’re creating as exactly the same kinds of images that filmmakers like us use to make a documentary. People in the future will be looking through their images to try to understand who we are today.”

Panel: Personal Tools and Methods for PDA. Photo by Jasmyn Castro.

Preserving digital photos is always popular at PDA conferences. It is a common interest that binds us together as stakeholders, especially since the advent of mobile phone digital cameras. Presentations related to digital photos included:

The digital preservation of art (material and digital) is quickly emerging as an area of concern and archivist activism. PDA 2015 had these art-related presentations:

There was a noticeable absence of commercial products and digital scrapbooks at PDA 2015. Instead, presentations, workshops and posters shared practical information about projects that used open-source tools:

Another emerging trend at PDA conferences is toward Digital Humanities and Social Sciences research. Some researchers analyzed and pondered human behavior and digital collections, while others compiled data into presentations of historical events. Presentations included:

I wrote in the beginning of this post that the meaning of the word “personal” is getting stretched at PDA conferences; it’s more like the concept of “personal” is expanding. Personal photos mingle with family personal photos to become a larger archive, a family archive. Facebook has spawned a “local history” phenomenon, where members of a community post their personal photos and comments, and the individual personal contributions congeal organically into a community history site. PDA 2015 had several community-related presentations:

Increasingly we hear from colleges and universities, usually — though not exclusively — from their librarians, expressing concern that students and faculty may not be aware of the need to preserve their digital stuff. PDA 2015 hosted a panel, titled “Reflections on Personal Digital Archiving Day on the College Campus” (PDF), comprising representatives from five colleges who spoke about their on-campus outreach experiences:

  • Rachel Appel, Bryn Mawr College
  • Amy Bocko, Wheaton College
  • Joanna DiPasquale, Vassar College
  • Sarah Walden, Amherst College
  • Kevin Powell, Brown University.

We featured a follow-up post on their work for The Signal titled, “Digital Archiving Programming at Four Liberal Arts Colleges.”

A visiting scholar from China, Xiangjun Feng, was scheduled to deliver a presentation on a similar subject — personal digital archiving and students and scholars at her university — but she had to cancel her trip. We put her presentation online, “The Behavior and Perception of Personal Digital Archiving of Chinese University Students.” (PDF)

Howard Besser gave the keynote address on Day 2 of the conference, along with his fellow video-preservation pioneer, Rick Prelinger. It was more like a jam session between two off-beat scholars. Each showed a video clip; Besser showed “Why Archive Video?” and Prelinger showed the infamous “Duck and Cover,” a 1951 public service film aimed at school children that advises them to take shelter under their desks during a nuclear attack.

Other presentations during the conference also touched on video preservation:

The third day of the conference was set aside for hands-on workshops:

  • Courtney Mumma, “Archivematica and AtoM: End-to-End Digital Curation for Diverse Collections”
  • Peter Chan, “Appraise, Process, Discover & Deliver Email”
  • Cal Lee and Kam Woods, “Curating Personal Digital Archives Using BitCurator and BitCurator Access Tools”
  • Yvonne Ng, Marie Lascu and Maggie Schreiner, “Do-It-Yourself Personal Digital Archiving.”

Ng, who is with the human-rights organization Witness, also gave a presentation during the conference titled, “Evaluating the Effectiveness of a PDA Resource.” (PDF)

Perhaps the conference is also expanding past the “preservation” part of its name into usage; after all, preservation and access are two sides of the same coin. It’s a pleasure every year to see the new ways that people address access and usability.

We have yet to hear much from the genealogy community, or from community historians and public librarians, about preserving family history and community history. The same goes for the healthcare, medical, and personal-health communities, though it’s just a matter of time before they join the conversation.

Cliff Lynch, director of the Coalition for Networked Information, wrote in his essay titled “The Future of Personal Digital Archiving: Defining the Research Agendas,” (published in the book, Personal Archiving), “In the near future, medical records will commonly include genotyping or gene sequencing data, detailed machine-readable medical history records, perhaps prescription or insurance claim information, tests, and imaging. Whether the individual is dead or alive, this is prime material for data mining on a large scale…We could imagine a very desirable — though perhaps currently impossible – future option where an individual could choose to place his or her medical records (before and after death) in a genuinely public research commons, perhaps somewhat like signing up to become an organ donor.”

“…personal collections, and now personal digital archives, are the signature elements that distinguish many of the genuinely great research collections housed  in libraries and archives…We need policy discussions about…what organizations should take responsibility for collecting them. This conversation has connection to the evolving mission and strategies not just of national and research libraries, but of local historical societies, public libraries, and similar groups.”

The conversation will continue at Personal Digital Archiving 2016, hosted by the University of Michigan in Ann Arbor.

Terry Reese: MarcEdit Mac Preview Update

planet code4lib - Sun, 2015-08-02 23:42

MarcEdit Mac users, a new preview update has been made available.  This is getting pretty close to the first “official” version of the Mac version.  And for those that may have forgotten, the preview designation will be removed on Sept. 1, 2015.

So what’s been done since the last update?  Well, I’ve pretty much completed the last of the work that was scheduled for the first official release.  At this point, I’ve completed all the planned work on the MARC Tools and the MarcEditor functions.  For this release, I’ve completed the following:

** 1.0.9 ChangeLog

  • Bug Fix: Opening Files — you cannot select any files but a .mrc extension. I’ve changed this so the open dialog can open multiple file types.
  • Bug Fix: MarcEditor — when resizing the form, the filename in the status can disappear.
  • Bug Fix: MarcEditor — when resizing, the # of records per page moves off the screen.
  • Enhancement: Linked Data Records — Tool provides the ability to embed URI endpoints to the end of 1xx, 6xx, and 7xx fields.
  • Enhancement: Linked Data Records — Tool has been added to the Task Manager.
  • Enhancement: Generate Control Numbers — globally generates control numbers.
  • Enhancement: Generate Call Numbers/Fast Headings – globally generates call numbers/fast headings for selected records.
  • Enhancement: Edit Shortcuts — added back the tool to enable Record Marking via a comment.

Over the next month, I’ll be working on trying to complete four other components prior to the first “official” release Sept. 1.  This means that I’m anticipating at least 1, maybe 2 more large preview releases before Sept. 1, 2015.  The four items I’ll be targeting for completion will be:

  1. Export Tab Delimited Records Feature — this feature allows users to take MARC data and create delimited files (often for reporting or loading into a tool like Excel).
  2. Delimited Text Translator — this feature allows users to generate MARC records from a delimited file.  The Mac version will not, at least initially, be able to work with Excel or Access data.  The tool will be limited to working with delimited data.
  3. Update Preferences windows to expose MarcEditor preferences
  4. OCLC Metadata Framework integration…specifically, I’d like to re-integrate the holdings work and the batch record download.

How do you get the preview?  If you have the current preview installed, just open the program and as long as you have the notifications turned on – the program will notify that an update is available.  Download the update, and install the new version.  If you don’t have the preview installed, just go to: and select the Mac app download.

If you have any questions, let me know.


Roy Tennant: The Oldest Internet Publication You’ve Never Heard Of

planet code4lib - Fri, 2015-07-31 18:48

Twenty-five years ago I started a library current awareness service called Current Cites. The idea was to have a team of volunteers monitor library and information technology literature and cite only the best publications in a monthly publication (see the first page of the inaugural issue pictured). Here is the latest issue. TidBITS is, I think, the only Internet publication that is older, and they beat us only by a few months.

Originally, the one-paragraph description accompanying the bibliographic details was intended to summarize the contents. However, we soon allowed each reviewer latitude in using humor and personal insights to provide context and an individual voice.

Although we began publication in print only and for an intended audience of UC Berkeley Library staff, we quickly realized that the audience could be global and the technologies were coming to make it available for free to such a worldwide audience. If you’re curious, you can read more about how Current Cites came to be as well as its early history.

Ever since, we have published every month without fail. Current Cites has weathered my paternity leave (twins, with one now graduated from college and the other soon to be), the turnover of many reviewers, and several changes of sponsoring organization. We have had only three editors in all that time: David F.W. Robison, Teri Rinne, and myself.

On our 20th anniversary I wrote some of my thoughts about longevity and what contributes to it, which still applies. But then I’ve always been hard to dump, as Library Journal can attest. I’ve been writing for them since 1997.

So please bear with me as I mark this milestone. With only about 3,300 subscribers to the mailing list distribution (we also have an RSS feed and I tweet a link to each issue), we are probably the longest-lived Internet publication you’ve never heard of. Until now.

Here for your edification is the current number of subscribers by country:

United States: 2,476; Canada: 210; Australia: 134; United Kingdom: 69; Netherlands: 40; New Zealand: 33; Spain: 32; Germany: 28; Italy: 26; Taiwan: 20; Sweden: 18; Israel: 17; Brazil: 16; Norway: 15; Japan: 14; France: 13; Belgium: 11; ???: 11; India: 10; Ireland: 10; South Africa: 8; Finland: 7; Denmark: 6; Portugal: 6; Hungary: 5; Singapore: 5; Switzerland: 5; Mexico: 4; Peru: 4; Austria: 3; Croatia: 3; Greece: 3; Lebanon: 3; Republic of Korea: 3; Saudi Arabia: 3; United Arab Emirates: 3; Argentina: 2; Chile: 2; China: 2; Colombia: 2; Federated States of Micronesia: 2; Kazakhstan: 2; Lithuania: 2; Philippines: 2; Poland: 2; Slovakia: 2; Trinidad and Tobago: 2; Turkey: 2; Botswana: 1; Czech Republic: 1; Estonia: 1; Hong Kong: 1; Iceland: 1; Islamic Republic of Iran: 1; Jamaica: 1; Malaysia: 1; Morocco: 1; Namibia: 1; Pakistan: 1; Qatar: 1; Uruguay: 1

Harvard Library Innovation Lab: Link roundup July 31, 2015

planet code4lib - Fri, 2015-07-31 15:21

This is the good stuff.

The Factory of Ideas: Working at Bell Labs

Technology is cyclical. Timesharing is cloud computing.

The UK National Videogame Arcade is the inspirational mecca that gaming needs | Ars Technica

UK’s National Videogame Arcade is a sort of interactive art installation allowing visitors to tweak and play games

I Can Haz Memento

include the hash tag “#icanhazmemento” in a tweet with a link and a service replies with an archive

A Graphical Taxonomy of Roller Derby Skate Names

Dewey Decimator or Dewey Decimauler? Hmmm, maybe Scewy Decimal.

The White House’s Alpha Geeks — Backchannel — Medium

Making tech happen inside gov

HangingTogether: Current Cites – the amazing 25th anniversary

planet code4lib - Fri, 2015-07-31 15:00

I suspect that a large part of the audience for this blog also subscribes to Current Cites the “annotated bibliography of selected articles, books, and digital documents on information technology” as the masthead describes it. Those of us who subscribe would describe it as “essential”. Those of us who publish newsletters describe the fact that as of August 2015 it will have been published continuously for twenty five years as “amazing”. Those of us who know the editor, our pal and colleague, Roy Tennant, describe the feat he has performed as “stunning” and him as “indefatigable“.

And if you are not a subscriber to this essential, amazing, and stunning newsletter you should be clicking right here. And then you should congratulate Roy in a comment below. Do that right now.


By Mireia Garcia Bermejo (Own work)  via Wikimedia Commons

About Jim Michalko

Jim coordinates the OCLC Research office in San Mateo, CA, focusing on relationships with research libraries and on work that renovates the library value proposition in the current information environment.


Open Knowledge Foundation: Launch of timber tracking dashboard for Global Witness

planet code4lib - Fri, 2015-07-31 14:53

Open Knowledge has produced an interactive trade dashboard for anti-corruption NGO Global Witness to supplement their exposé on EU and US companies importing illegal timber from the Democratic Republic of Congo (DRC).

The DRC Timber Trade Tracker consumes open data to visualise where in the world Congolese timber is going. The dashboard makes it easy to identify countries that are importing large volumes of potentially illegal timber, and to see where timber shipped by companies accused of systematic illegal logging and social and environmental abuses is going.

Global Witness has long campaigned for greater oversight of the logging industry in DRC, which is home to two thirds of the world’s second largest rainforest. The industry is mired in corruption, with two of the DRC’s biggest loggers allegedly complicit in the beating and raping of local populations. Alexandra Pardal, campaign leader at Global Witness, said:

We knew that DRC logging companies were breaking the law, but the extent of illegality is truly shocking. The EU and US are failing in their legal obligations to keep timber linked to illegal logging, violence and intimidation off our shop floors. Traders are cashing in on a multi-million dollar business that is pushing the world’s vanishing rainforests to extinction.

The dashboard is part of a long term collaboration between Open Knowledge and Global Witness through which they have jointly created a series of interactives and data-driven investigations around corruption and conflict in the extractives industries.

To read the full report and see the dashboard go here.

If you work for an organisation that wants to make its data come alive on the web, get in touch with our team through

LibUX: “The User Experience” in Public Libraries Magazine

planet code4lib - Fri, 2015-07-31 14:17

Toby Greenwalt asked Amanda and me — um, Michael — to guest-write about the user experience for his The Wired Library column in Public Libraries Magazine. Our writeup was just published online after appearing in print a couple of months ago.

“The Wired Library” in Public Libraries Magazine, vol. 54, no. 3

We were pretty stoked to have an opportunity to jabber outside our usual #libux echo chamber to evangelize a little and rejigger woo-woo ideas about the user experience for real-world use — it’s catching on.

Such user experience is holistic, negatively or positively impacted at every interaction point your patron has with your library. The brand spanking new building loses its glamour when the bathrooms are filthy; the breadth of the collection loses its meaning when the item you drove to the library for isn’t on the shelf; an awesome digital collection just doesn’t matter if it’s hard to access; the library that literally pumps joy through its vents nets a negative user experience when the hump of the doorframe makes it hard to enter with a wheelchair.

The rest of the post has to do with simple suggestions for improving the website, but the big idea stuff is right up top. Knowing what we know about how folks read on the web, we still get to flashbake some neurons even if this is a topic readers don’t care about.

Read the “The User Experience” over at Public Libraries Online.

I'm just going to go ahead and take a little credit for the way they referred to THE user experience here.

— Michael Schofield (@schoeyfield) July 22, 2015

I write the Web for Libraries each week — a newsletter chock-full of data-informed commentary about user experience design, including the bleeding-edge trends and web news I think user-oriented thinkers should know.


The post “The User Experience” in Public Libraries Magazine appeared first on LibUX.

Islandora: Meet Your Developer: Will Panting

planet code4lib - Fri, 2015-07-31 13:50

A Meet Your Developer double feature this week, as we introduce another instructor for the upcoming Islandora Conference: Will Panting. A Programmer/Analyst at discoverygarden, Inc., Will is a key member of the Committers Group and one of the most stalwart defenders of best practices and backwards compatibility in Islandora. If you adopt a brand new module and it doesn't break anything, you may well have Will to thank.

Please tell us a little about yourself. What do you do when you’re not at work?

I went to UPEI and have a major in Comp Sci and a minor in Business. Before DGI I had a short stint at the University. As well as all the normal things like friends and family I spend my spare time developing some personal projects and brewing beer. I've been trying to get my brown recipe right for years now.

How long have you been working with Islandora? How did you get started?

More than four years, for as long as I've been with DGI. I had heard about the company through UPEI. I find working on Islandora very rewarding; I think this space is of some very real value.

Sum up your area of expertise in three words:

Complete Islandora Stack

What are you working on right now?

A complex migration from a custom application. It's a good one, using most of the techniques we've had to use in the past.

What contribution to Islandora are you most proud of?

I've been in just about every corner of the code base and written tons of peripheral modules and customizations. I think the thing that I'm most proud of isn't a thing, but a consistent push for sustainable practice.

What new feature or improvement would you most like to see?

I'm divided between a viewer framework, an XSLT management component and the generic graph traversal hooks: all basic technology that would create greater consistency and speed up development.

What’s the one tool/software/resource you cannot live without?

Box provisioning; absolutely crucial to our rate of development.

If you could leave the community with one message from reading this interview, what would it be?

Commit. Dive deep into the code, let it cut you up, then stitch the wounds and do it again. It's great to see new committers.

FOSS4Lib Recent Releases: Fedora Repository - 4.3.0

planet code4lib - Fri, 2015-07-31 12:55

Last updated July 31, 2015. Created by Peter Murray on July 31, 2015.

Package: Fedora Repository
Release Date: Friday, July 24, 2015

FOSS4Lib Recent Releases: Siegfried - 1.2.0

planet code4lib - Fri, 2015-07-31 12:53

Last updated July 31, 2015. Created by Peter Murray on July 31, 2015.

Package: Siegfried
Release Date: Friday, July 31, 2015

DPLA: Seeking Balance in Copyright and Access

planet code4lib - Thu, 2015-07-30 17:25

The most important word in discussions around copyright in the United States is balance. Although there are many, often strong disagreements between copyright holders and those who wish to provide greater access to our cultural heritage, few dispute that the goal is to balance the interests of the public with those of writers, artists, and other creators.

Since the public is diffuse and understandably pays little attention to debates about seemingly abstract topics like copyright, it has been hard to balance their interests with those of rightsholders, especially corporations, who have much more concentrated attention and financial incentives to tilt the scale. (Also, lawyers.) Unsurprisingly, therefore, the history of copyright is one of a repeated lengthening of copyright terms and greater restrictions on public use.

The U.S. Copyright Office has spent the last few years looking at possible changes to the Copyright Act, given that we are now a quarter-century into the age of the web, with its new forms of access to culture enabled by mass digitization. Most recently, the Office issued a report with recommendations about what to do about orphan works and the mass digitization of copyrighted works. The Office has requested feedback on its proposal, as well as on other specific questions regarding copyright and visual works and a proposed “making available” right (something that DPLA has already responded to). Each of these studies and proposals impacts the Digital Public Library of America and our 1,600 contributing institutions, as well as many other libraries, archives, and museums that seek to bring their extensive collections online.

We greatly appreciate that the Office is trying to tackle these complex issues, given how difficult it is to ascertain the copyright status of many works created in the last century. As the production of books, photographs, audio, and other types of culture exploded, often by orders of magnitude, and as rights no longer had to be registered, often changed hands in corporate deals, and passed to estates (since copyright terms now long outlast the creators), we inherited an enormous problem of unclear rights and “orphan works” where rightsholders cannot easily—or ever—be found. This problem will only worsen now that digital production has given the means to billions of people to become creators, and not just consumers, of culture.

Although we understand the complexity and many competing interests that the Office has tried to address in the report, we do not believe their recommendations achieve that critical principle of balance. In our view, the recommendations unfortunately put too many burdens on the library community, and thus too many restrictions on public access. The report seeks to establish a lengthy vetting process for scanned items that is simply unworkable and extraordinarily expensive for institutions that are funded by, and serve, the public.

Last week, with the help of DPLA’s Legal Advisory Committee co-chair Dave Hansen, we filed a response to one of the Office’s recent inquiries, focusing on how the copyright system can be improved for visual works like photographs. As our filing details, DPLA’s vast archive of photographs from our many partners reveals how difficult it would be for cultural heritage institutions to vet the rights status of millions of personal, home, and amateur photographs, as well as millions of similar items in the many local collections contained in DPLA.

These works can provide candid insights into our shared cultural history…[but] identifying owners and obtaining permissions is nearly impossible for many personal photographs and candid snapshots…Even if creators are identifiable by name, they are often not locatable. Many are dead, raising complicated questions about whether rights were transferred to heirs, or perhaps escheated to the state. Because creators of many of these works never thought about the rights that they acquired in their visual works, they never made formal plans for succession of ownership.

Thus, as the Office undertakes this review, we urge it to consider whether creators, cultural heritage institutions, and the public at large would be better served by a system of protection that explicitly seeks to address the needs, expectations, and motivations of the incredibly large number of creators of these personal, home and amateur visual works, while appropriately accommodating those creators for whom copyright incentives do matter and for whom licensing and monetization are important.

Rather than placing burdens on libraries and archives for clearing use of visual works, we recommend that the Copyright Office focus on the creation of better copyright status and ownership information by encouraging rightsholders, who are in the best position to provide that information, to step forward. You can read more about our position in the full filing.

When we launched in 2013, one of the most gratifying responses we received was an emotional email from an Australian who found a photograph of his grandmother, digitized by an archive in Utah and made discoverable through DPLA. It’s hard to put a price on such a discovery, but surely we must factor such moments into any discussion of copyright and access. We should value more greatly the public’s access to our digitized record, and find balanced ways for institutions to provide such access.

Library of Congress: The Signal: Mapping Libraries: Creating Real-time Maps of Global Information

planet code4lib - Thu, 2015-07-30 13:43

The following is a guest post by Kalev Hannes Leetaru, a data scientist and Senior Fellow at George Washington University Center for Cyber & Homeland Security. In a previous post, he introduced us to the GDELT Project, a platform that monitors the news media, and presented how mass translation of the world’s information offers libraries enormous possibilities for broadening access. In this post, he writes about re-imagining information geographically.

Why might geography matter to the future of libraries?

Information occurs against a rich backdrop of geography: every document is created in a location, intended for an audience in the same or other locations, and may discuss yet other locations. The importance of geography in how humans understand and organize the world (PDF) is underscored by its prevalence in the news media: a location is mentioned every 200-300 words in the typical newspaper article of the last 60 years. Social media embraced location a decade ago through transparent geotagging, with Twitter proclaiming in 2009 that the rise of spatial search would fundamentally alter how we discovered information online. Yet the news media has steadfastly resisted this cartographic revolution, continuing to organize itself primarily through coarse editorially-assigned topical sections and eschewing the live maps that have redefined our ability to understand global reaction to major events. Using journalism as a case study, what does the future of mass-scale mapping of information look like and what might we learn of the future potential for libraries?

What would it look like to literally map the world’s information as it happens? What if we could reach across the world’s news media each day in real time and put a dot on a map for every mention in every article, in every language of any location on earth, along with the people, organizations, topics, and emotions associated with each place? For the past two years this has been the focus of the GDELT Project and through a new collaboration with online mapping platform CartoDB, we are making it possible to create rich interactive real-time maps of the world’s journalistic output across 65 languages.

Leveraging more than a decade of work on mapping the geography of text, GDELT monitors local news media from throughout the globe, live translates it, and performs “full-text geocoding” in which it identifies, disambiguates, and converts textual descriptions of location into mappable geographic coordinates. The result is a real-time multilingual geographic index over the world’s news that reflects the actual locations being talked about in the news, not just the bylines of where articles were filed. Using this platform, this geographic index is transformed into interactive animated maps that support spatial interaction with the news.
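
As a rough illustration of what such a full-text geocoding step involves, here is a hypothetical Java sketch; every name in it (Gazetteer, Place, locationMentions) is an invented stand-in for this post, not GDELT's actual pipeline or API.

import java.util.ArrayList;
import java.util.List;

/** Sketch: take the location mentions found in an article, disambiguate each
 *  against a gazetteer, and emit coordinates that can be placed on a map. */
public class FullTextGeocodingSketch {

    static class Place {
        final String name;
        final double lat, lon;
        Place(String name, double lat, double lon) {
            this.name = name; this.lat = lat; this.lon = lon;
        }
    }

    interface Gazetteer {
        List<Place> candidates(String mention);  // "Paris" matches many Parises
        Place disambiguate(String mention, String articleText, List<Place> candidates);
    }

    /** locationMentions would come from a named-entity recognizer run over the text. */
    static List<Place> geocodeArticle(String articleText, List<String> locationMentions,
                                      Gazetteer gazetteer) {
        List<Place> resolved = new ArrayList<>();
        for (String mention : locationMentions) {
            resolved.add(gazetteer.disambiguate(mention, articleText,
                    gazetteer.candidates(mention)));
        }
        return resolved;
    }
}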

What becomes possible when the world’s news is arranged geographically? At the most basic level, it allows organizing search results on a map. The GDELT Geographic News Search allows a user to search by person, organization, theme, news outlet, or language (or any combination therein) and instantly view a map of every location discussed in context with that query, updated every hour. An animation layer shows how coverage has changed over the last 24 hours and a clickable layer displays a list of all matching coverage mentioning each location over the past hour.

Figure 1 – GDELT’s Geographic News Search showing geography of Portuguese-language news coverage during a given 24 hour period

Selecting a specific news outlet as the query yields an instant geographic search interface to that outlet’s coverage, which can be embedded on any website. Imagine if every news website included a map like this on its homepage that allowed readers to browse spatially and find its latest coverage of rural Brazil, for example. The ability to filter news at the sub-national level is especially important when triaging rapidly-developing international stories. A first responder assisting in Nepal is likely more interested in the first glimmers of information emerging from its remote rural areas than the latest on the Western tourists trapped on Mount Everest.

Coupling CartoDB with Google’s BigQuery database platform, it becomes possible to visualize large-scale geographic patterns in coverage. The map below visualizes all of the locations mentioned in news monitored by GDELT from February to May 2015 relating to wildlife crime. Using the metaphor of a map, this list of 30,000 articles in 65 languages becomes an intuitive clickable map.

Figure 2 – Global discussion of wildlife crime

Exploring how the news changes over time, it becomes possible to chart the cumulative geographic focus of a news outlet, or to compare two outlets. Alternatively, looking across global coverage holistically, it becomes possible to instantly identify the world’s happiest and saddest news, or to determine the primary language of news coverage focusing on a given location. By arraying emotion on a map it becomes possible to instantly spot sudden bursts of negativity that reflect breaking news of violence or unrest. Organizing by language, it becomes possible to identify the outlets and languages most relevant to a given location, helping a reader find relevant sources about events in that area. Even the connections among locations in terms of how they are mentioned together in the news yields insights into geographic contextualization. Finally, by breaking the world into a geographic grid and computing the topics trending in each location, it becomes possible to create new ways of visualizing the world’s narratives.

Figure 3 – All locations mentioned in the New York Times (green) and BBC (yellow/orange) during the month of March 2015

Figure 4 – Click to see a live animated map of the average “happy/sad” tone of worldwide news coverage over the last 24 hours mentioning each location

Figure 5 – Click to see a live animated map of the primary language of worldwide news coverage over the last 24 hours mentioning each location

Figure 6 – Interactive visualization of how countries are grouped together in the news media

Turning from global news to domestic television news, these same approaches can be applied to television closed captioning, making it possible to click on a location and view the portion of each news broadcast mentioning events at that location.

Figure 7 – Mapping the locations mentioned in American television news

Turning back to the question that opened this post – why might geography matter to the future of libraries? As news outlets increasingly cede control over the distribution of their content, they do so not only to reach a broader audience, but to leverage more advanced delivery platforms and interfaces. Libraries are increasingly facing identical pressures as patrons turn towards services (PDF) like Google Scholar, Google Books, and Google News instead of library search portals. If libraries embraced new forms of access to their content, such as the kinds of geographic search capabilities outlined in this post, users might find those interfaces more compelling than those of non-library platforms. The ability of ordinary citizens to create their own live-updating “geographic mashups” of library holdings opens the door to engaging with patrons in ways that demonstrate the value of libraries beyond that of a museum of physical artifacts, and to connecting individuals across national or international lines. As more and more library holdings, from academic literature to the open web itself, are geographically indexed, libraries stand poised to lead the cartographic revolution, opening the geography of their vast collections to search and visualization, and making it possible for the first time to quite literally map our world’s libraries.

State Library of Denmark: Sampling methods for heuristic faceting

planet code4lib - Thu, 2015-07-30 10:25

Initial experiments with heuristic faceting in Solr were encouraging: Using just a sample of the result set, it was possible to get correct facet results for large result sets, reducing processing time by an order of magnitude. Alas, further experimentation unearthed that the sampling method was vulnerable to clustering. While heuristic faceting worked extremely well for most of the queries, it failed equally hard for a few of the queries.

The problem

Abstractly, faceting on Strings is a function that turns a collection of documents into a list of top-X terms plus the number of occurrences of these terms. In Solr the collection of documents is represented with a bitmap: one bit per document; if the bit is set, the document is part of the result set. The result set of 13 hits for an index with 32 documents could look like this:

00001100 01010111 00000000 01111110

Normally the faceting code would iterate all the bits, get the terms for the ones that are set and update the counts for those terms. The iteration of the bits is quite fast (1 second for 100M bits), but getting the terms (technically the term ordinals) and updating the counters takes more time (100 seconds for 100M documents).
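
To make the cost model concrete, here is a minimal Java sketch of that exact, non-sampled counting loop. It is not Solr's actual faceting code: ordinalsForDoc() is a hypothetical stand-in for the expensive lookup of a document's term ordinals, and counts[] for the counter structure.

import java.util.BitSet;

/** Minimal sketch of exact (non-sampled) String faceting over a result bitmap.
 *  Not the actual Solr implementation: ordinalsForDoc() stands in for the
 *  costly term-ordinal lookup, counts[] for the counter structure. */
public class FullFacetSketch {

    /** Hypothetical lookup of the term ordinals referenced by a document. */
    static int[] ordinalsForDoc(int docID) {
        return new int[0]; // placeholder
    }

    static long[] countAll(BitSet hits, int numUniqueTerms) {
        long[] counts = new long[numUniqueTerms];
        // Iterating the set bits is cheap; resolving ordinals and updating
        // the counters is where nearly all of the time goes.
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (int ordinal : ordinalsForDoc(doc)) {
                counts[ordinal]++;
            }
        }
        return counts;
    }
}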

Initial attempt: Sample the full document bitmap

The initial sampling was done by dividing the result set into chunks and only visiting part of each chunk. If we wanted to sample 50% of our result set and wanted to use 4 chunks, the part of the result set to visit could be the first half of each chunk (the first 4 bits of each group of 8 below):

4 chunks: 00001100 01111110 00000000 01010111

As can be counted, the sampling hit 5 documents out of 13. Had we used 2 chunks, visiting the first half of each 16-bit chunk, the result could be

2 chunks: 00001100 01111110 00000000 01010111

Only 2 hits out of 13, and not very representative. A high chunk count is needed: for 100M documents, 100K chunks worked fairly well. The law of large numbers helps a lot, but in the case of document clusters (groups of very similar documents indexed at the same time) we still need both a lot of chunks and a high sampling percentage to have a good chance of hitting them. This sampling is prone to completely missing or over-representing clusters.
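
A hypothetical sketch of that chunk-based sampling, reusing the ordinalsForDoc() stand-in from the sketch above (the real sparse-faceting code is different; only the skipping pattern matters here):

import java.util.BitSet;

/** Sketch of the initial approach: split the document bitmap into chunks and
 *  only visit the first part of each chunk. Illustrative only. */
public class ChunkSampledFacetSketch {

    static void countChunkSampled(BitSet hits, int maxDoc, int chunks,
                                  double sampleRate, long[] counts) {
        int chunkSize = (maxDoc + chunks - 1) / chunks;
        int visitPerChunk = (int) (chunkSize * sampleRate);
        for (int chunk = 0; chunk < chunks; chunk++) {
            int from = chunk * chunkSize;
            int to = Math.min(from + visitPerChunk, maxDoc);
            for (int doc = hits.nextSetBit(from); doc >= 0 && doc < to;
                 doc = hits.nextSetBit(doc + 1)) {
                // Term lookup and counting exactly as in the full sketch.
                for (int ordinal : FullFacetSketch.ordinalsForDoc(doc)) {
                    counts[ordinal]++;
                }
            }
        }
        // A cluster of similar documents that falls entirely inside a skipped
        // part of a chunk is never sampled; one inside a visited part is
        // over-represented. That is the weakness described above.
    }
}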

Current solution: Sample the hits

Remember that iterating over the result bitmap itself is relatively fast. Instead of processing chunks of the bitmap and skipping between them, we iterate over all the hits and only update counts for some of them.

If the sampling rate is 50%, every second hit (set bit) would be used as the sample:

50% sampling: 00001100 01111110 00000000 01010111

If the sampling rate is 33%, every third hit would be used as the sample:

33% sampling: 00001100 01111110 00000000 01010111

This way of sampling is a bit slower than sampling on the full document bitmap as all bits must be visited, but it means that the distribution of the sampling points is as fine-grained as possible. It turns out that the better distribution gives better results, which means that the size of the sample can be lowered. Lower sample rate = higher speed.
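
A corresponding sketch of sampling on the hits, again with the hypothetical ordinalsForDoc(); skipFactor expresses the sample rate (2 for the 50% example above, 3 for 33%):

import java.util.BitSet;

/** Sketch of the current approach: every set bit is visited (cheap), but the
 *  expensive ordinal lookup and counter update is only performed for every
 *  skipFactor'th hit. Illustrative only, not the shipped code. */
public class HitSampledFacetSketch {

    static void countHitSampled(BitSet hits, int skipFactor, long[] counts) {
        int hitNo = 0;
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            if (hitNo++ % skipFactor != 0) {
                continue; // skipped hit: visited, but not counted
            }
            for (int ordinal : FullFacetSketch.ordinalsForDoc(doc)) {
                counts[ordinal]++;
            }
        }
    }
}

In the tests below, the sample is a fixed number of hits rather than a fixed rate; in terms of this sketch that corresponds to deriving skipFactor from the result set size divided by the desired sample size.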

Testing validity

A single shard from the Net Archive Search was used for testing. The shard was 900GB with 250M documents. Faceting was performed on the field links, which contains all outgoing links from indexed webpages. There are 600M unique values in that field and each document in the index contains an average of 25 links. For a full search on *:* that means 6 billion updates of the counter structure.

For this test, we look for the top-25 links. To get a baseline, a full facet count was issued for the top-50 links for a set of queries. A heuristic facet call was then issued for the same queries, also for the top-50. For each pair of results, the number of entries until the first discrepancy was counted; the ones with a count beneath 25 were considered faulty. The reason for the over-provisioning was to raise the probability of correct results, which of course comes with a performance penalty.

The sampling size was set to 1/1000 of the number of documents, or roughly 200K hits. Only result set sizes above 1M are relevant for validity, as those below take roughly the same time to calculate with and without sampling.
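
For clarity, here is a small hypothetical helper expressing that validity check (not the actual test harness): it counts how many leading entries of the full and the heuristic top lists agree before the first discrepancy, and flags the result as faulty if fewer than 25 agree.

import java.util.List;

/** Sketch of the validity check used in the test description above. */
public class ValidityCheckSketch {

    static boolean isFaulty(List<String> fullTop, List<String> sampledTop, int required) {
        int agree = 0;
        int limit = Math.min(fullTop.size(), sampledTop.size());
        while (agree < limit && fullTop.get(agree).equals(sampledTop.get(agree))) {
            agree++;
        }
        return agree < required; // required = 25 in this test
    }
}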

Heuristic validity for top 25/50

While the result looks messy, the number of faulty results was only 6 out of 116 for result set sizes above 1M. For the other 110 searches, the top-25 terms were correct. Raising the over-provisioning to top-100 imposes a larger performance hit, but reduces the number of faulty results to 0 for this test.

Heuristic validity for top 25/100

Testing performance

The response times for full-count faceting and heuristic faceting on the links field with an over-provisioning of 50 are as follows:

Heuristic speed for top 25/50

Switching from linear to logarithmic plotting for the y-axis makes the difference immediately apparent:

Heuristic speed for top 25/50, logarithmic y-axis

It can be seen that full counting rises linearly with result size, while sampling time is near-constant. This makes sense, as the sampling was done by updating counts for a fixed number of documents. Other strategies, such as making the sample size a fraction of the result size, should be explored further, but as the validity plot shows, the fixed strategy works quite well.

The performance chart for an over-provisioning of 100 looks very much like the one for 50, only with slightly higher response times for sampling. As the number of non-valid results is markedly lower for an over-provisioning of 100, this seems like the best speed/validity trade-off for our concrete setup.

Heuristic speed for top 25/100, logarithmic y-axis


Heuristic faceting with sampling on hits gives a high probability of correct results. The speed-up relative to full facet counting rises with result set size, as sampling has near-constant response times. Using over-provisioning allows for fine-grained tweaking between performance and the chance of correct results. Heuristic faceting is expected to become the default for interactive use with the links field. The viability of heuristic faceting for smaller fields is currently being investigated.

As always, there is full source code and a drop-in sparse faceting Solr 4.10 WAR at GitHub.

