planet code4lib


District Dispatch: OITP welcomes summer intern

Tue, 2016-05-31 15:34

Brian Clark

On June 6, Brian M. Clark will begin an internship with ALA’s Office for Information Technology Policy (OITP) for the summer. Brian just completed his junior year at Elon University in North Carolina where he is majoring in media analytics and minoring in business administration. At Elon, Brian has completed coursework in Web and mobile communications, Creating multimedia content, Applied media/data analytics, Media law and ethics, Statistics, Economics, Finance, Marketing, Management, and Accounting.

Not surprisingly, Brian’s projects this summer will focus on social media and the web generally and how ALA can better leverage communications technologies to achieve more effective policy advocacy. In addition, Brian will participate in selected D.C. activities and ALA meetings to develop some appreciation of public policy advocacy and lobbying. Activities include attending hearings of the Congress and/or federal regulatory agencies and attending events of think tanks and advocacy groups. Such participation might be in conjunction with ALA’s Google Policy Fellow Nick Gross, who will also be in residence this summer.

Please join me in welcoming Brian to ALA, the Washington Office, the realm of information policy, and libraryland.

The post OITP welcomes summer intern appeared first on District Dispatch.

District Dispatch: Bring federal job training funding home to your library

Tue, 2016-05-31 14:28

Jobseekers at Redding Library job fair.

Do you know how to secure funding for job training services and programs in your library? Learn how to secure workforce support funding for your library at this year’s 2016 American Library Association (ALA) Annual Conference in Orlando, Fla. at the Washington Update session “Concrete Tips to Take Advantage of Workforce Funding.” During the session, a panel of library and workforce leaders will discuss best practices for supporting jobseekers. The session takes place on Saturday, June 25, 2016, from 8:30-10:00 a.m. in the Orange County Convention Center, room W303.

Session participants will learn about effective job training from two different panel discussions and discuss activities, classes and programs they can offer in their own libraries. Conference session attendees will also discuss new workforce support opportunities as the federal government rolls out the new Workforce Innovation and Opportunity Act (WIOA). The U.S. Department of Labor expects to release Workforce Innovation and Opportunity Act regulations on June 30, 2016.

The program will include a number of dynamic speakers, including Mimi Coenen, chief operating officer of CareerSource Central Florida; Tonya Garcia, director of the Long Branch Public Library in New Jersey; Stephen Parker, legislative director of the National Governors Association; Alta Porterfield, Delaware Statewide Coordinator of the Delaware Libraries Inspiration Space; and Renae Rountree, director of the Washington County Public Library in Florida.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Bring federal job training funding home to your library appeared first on District Dispatch.

Islandora: Guest Blog: So you want to get started working on CLAW?

Tue, 2016-05-31 13:54

We have a very exciting entry for this week's Islandora update: a guest blog from Islandora CLAW sprinter Ben Rosner (Barnard College). Ben has been following along with the project for a while now and started his first sprint last month. He has been an awesome addition to the team and he was kind enough to share his experiences as a 'newbie' to CLAW, and explain how you can join in too:

So you want to get started working on CLAW?

Lessons learned from a first-time contributor, hopefully to ease your transition into the land of CLAW.

Foremost, before any listing of resources, pages, and things to understand: please know that IRC is the place to be. Even if the channel is quiet, someone is ALWAYS lurking and will help you through any issue. Without the #islandora channel I'm not sure this beginner would have had the fortitude to stick with what can sometimes be challenging, but very rewarding, work. Also, the weekly CLAW calls are great if you have the time, even if just to listen in. I didn't understand a flippin' thing the first time I joined, but I stuck with it as time allowed and it has paid off. Lastly, there are the Google Groups, both islandora and islandora-dev, just so you can get a 'feel' for what's going on in the community.

The stuff to bookmark now, and remember for later!

If you're going to work on or with the microservices you need this guide: READ THIS GUIDE, LEARN THIS GUIDE, LIVE THIS GUIDE. curl so many times it hurts, then some more. While you're doing all this 'curling', watch and explore http://localhost:8080/fcrepo/rest/, and query Blazegraph with a simple SELECT * WHERE { ?s ?p ?o }. What's happening? Why? HOW!? Is that a camel? No way, get out, there's a camel? Yup. We've got that.
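If it helps to see what that poking around looks like in practice, here is a minimal command-line sketch. The Fedora URL comes straight from the paragraph above; the Blazegraph endpoint path is an assumption, so point it at whatever SPARQL endpoint your CLAW vagrant actually exposes.

```bash
#!/usr/bin/env bash
# Peek at the root container of the Fedora 4 repository (URL taken from the text above).
curl -s -H 'Accept: text/turtle' http://localhost:8080/fcrepo/rest/

# Ask Blazegraph what triples it holds. The endpoint below is a guess;
# adjust host, port, and path to match your own vagrant install.
SPARQL_ENDPOINT='http://localhost:8080/bigdata/sparql'
curl -s -G "$SPARQL_ENDPOINT" \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 25'
```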

Here are some dummy objects I've created to quickly populate my repo that might be handy when you're curling. Note: the instructions may be outdated as we continue work on the Islandora microservices.

If you've never used Vagrant before, look inside the install folder in the main CLAW repo and follow the README. Note to Windows users: run the command prompt/git shell/whatever and VirtualBox as an ADMIN before typing vagrant up. You've been warned!

Have an idea of the PSR coding standards - when you're not in Drupal land you'll be using PSR-2 while working on CLAW. Just like in college when you had to write in APA (or any of the lesser styles, muhahaha), PSR-2 is a style guide. Here's the guide, and here's something to fix your code for you (also see below about picking your editor of choice).
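As an illustration only (the post's own linked guide and fixer are the canonical route), here is one hedged way to check and auto-fix PSR-2 compliance from the command line using PHP_CodeSniffer; the src/ path is a placeholder for wherever your code lives.

```bash
# Install PHP_CodeSniffer globally (one option among several PSR-2 tools).
composer global require "squizlabs/php_codesniffer"

# Report PSR-2 violations (replace src/ with the path to your code).
phpcs --standard=PSR2 src/

# Let the companion fixer correct what it can automatically.
phpcbf --standard=PSR2 src/
```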

There's more?! Diego Pino (@DiegoPino) has been amazing and hosted a series of live learning tutorials that are recorded and available on YouTube. There are links in the main CLAW repository, and here are the slides that were presented (or at least some of 'em).

Finding a comfortable editing space.

Pick your editor of choice - personally I'm a fan of Sublime Text 3 with a few plugins, mainly phpfmt/SublimeLinter (for PSR2 compliance and auto-formatting), bracket highlighter (distinguishes which block of code you're working on/in), and markdown preview (sometimes you want it to look nice before you push that commit). There are a few linters out there - some of them MUCH better than phpfmt (like SublimeLinter), and they work great, but rtm before you get started using them.

If you prefer a full-fledged IDE, PhpStorm is my recommendation - educational users should be able to swing a free license, though do check with them as I'm not a lawyer... It does your typical IDE stuff (formatting, code completion, intellisense, etc.).

If you love yourself some Vim, look through @nruest's repos for his configs and I'm told you'll be all set!

Finally, if you're new to GitHub, don't panic, but learning it is a practical exercise regardless of how many guides you read. Here is one such guide, but go ahead and Google 'github guide' and you'll be inundated with everyone's "best guide to learn GitHub." Depending on your learning style, all of it is rubbish. GitHub is a PRACTICAL application. Just mess around with it and when you don't know something, speak up in IRC! Also, don't let your IDE do your git stuff; the commands are more powerful than what an IDE exposes.
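If it helps to see the 'fork and hack away' loop spelled out, here is a bare-bones sketch of the command-line workflow. The fork URL and branch name are placeholders; substitute your own GitHub account and the repository you are actually working on.

```bash
# Clone your fork of the CLAW repository (replace <your-username> with your GitHub account).
git clone<your-username>/CLAW.git
cd CLAW

# Work on a topic branch rather than directly on the default branch.
git checkout -b my-first-claw-fix

# ...hack away, then stage and commit your changes...
git add -A
git commit -m "Short description of what changed and why"

# Push the branch to your fork, then open a pull request on GitHub.
git push origin my-first-claw-fix
```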

Okay enough ranting. Here's my "CLAW story."

A nice way to start contributing is through sprints, which are held during the first (or last, I forget) two weeks of each month. There are a number of open issues/tickets that you can peruse. There is a sprint kick-off call where folks discuss what they want to tackle, and if you're not sure, you can be aimed towards the items that will interest you. I've learned this: EVERY LITTLE BIT HELPS. No task is too small. (See the last CLAW Tutorial/Lesson for more about sprints and getting involved.)

So I started in Sprint 06, which ended in April 2016. Here is my "journal entry" from then:

Thus far my #CLAW experience has been using and helping update the Vagrant (for Windows compatibility, sigh) and working on the PHP microservices (it's an iterative process; still workin' on my first PR).

I'm working primarily on #150 and having some success. Lots of testing using error.log and watching what's happening in Apache/FCRepo/Blazegraph. I think commit 64df023 is gonna crack this nut.

The Vagrant refuses to symlink on Windows machines, and so our Composer architecture, which allows live editing, is failing to work. Since I work in a mixed coding environment (which includes a Windows machine) this is fairly important. It turns out we may need to run VMWare with elevated privs for proper linking to occur. shrug

And my work continued, even outside of any true sprint: small things like creating a .gitconfig to handle line endings properly, resolving other issues with the Vagrant, and small documentation tasks. Again, little things that I could do as time allowed.

Then Sprint 007 (I added the extra 0, because who doesn't love a Bond reference?) came along, and I missed the kickoff call due to other obligations. But I asked for a ticket ~ and got one that was like "woot", this is amazing. However! Due to my own overzealousness and over-obligation I did not get to work on my issue as much as I would have liked :(. And guess what?! No one judged me, everyone understood, and it was just great to be part of the community and realize we really are in it together. I plan to continue my work with CLAW, and to take on as much as I possibly can! It's a great project, with great people (Nick, Diego, Jared - the main committers - are so supportive and collaborative), and something I am happy to be part of.

Ben out!

Wait Ben! Help me get started!

Have you curled yet? Get on IRC! Chat with us! Ping nruest a lot, he loves helping folks get started. Check the issues on GitHub; no one will stop you from trying to tackle one! Just fork and hack away :).

Eric Lease Morgan: Catholic Pamphlets and the Catholic Portal: An evolution in librarianship

Tue, 2016-05-31 12:34

This blog posting outlines, describes, and demonstrates how a set of Catholic pamphlets were digitized, indexed, and made accessible through the Catholic Portal. In the end it advocates an evolution in librarianship.

A few years ago, a fledgling Catholic pamphlets digitization process was embarked upon. [1] In summary, a number of different library departments were brought together, a workflow was discussed, timelines were constructed, and in the end approximately one third of the collection was digitized. The MARC records pointing to the physical manifestations of the pamphlets were enhanced with URLs pointing to their digital surrogates and made accessible through the library catalog. [2] These records were also denoted as being destined for the Catholic Portal by adding a value of CRRA to a local note. Consequently, each of the Catholic Pamphlet records also made their way to the Portal. [3]

Because the pamphlets have been digitized, and because the digitized versions of the pamphlets can be transformed into plain text files using optical character recognition, it is possible to provide enhanced services against this collection, namely, text mining services. Text mining is a digital humanities application rooted in the counting and tabulation of words. By counting and tabulating the words (and phrases) in one or more texts, it is possible to “read” the texts and gain a quick & dirty understanding of their content. Probably the oldest form of text mining is the concordance, and each of the digitized pamphlets in the Portal is associated with a concordance interface.

For example, the reader can search the Portal for something like “is the pope always right”, and the result ought to return a pointer to a pamphlet named Is the Pope always right? of papal infallibility. [4] Upon closer examination, the reader can download a PDF version of the pamphlet as well as use a concordance against it. [5, 6] Through the use of the concordance the reader can see that the words church, bill, charlie, father, and catholic are the most frequently used, and by searching the concordance for the phrase “pope is”, the reader gets a single sentence fragment in the result, “…ctrine does not declare that the Pope is the subject of divine inspiration by wh…” And upon further investigation, the reader can see this phrase is used about 80% of the way through the pamphlet.
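For readers who want to try the "counting and tabulating of words" against their own plain-text copy of a pamphlet, here is a rough command-line sketch. It is not the Portal's concordance software, just the same idea in miniature, and pamphlet.txt is a hypothetical file name.

```bash
# Tabulate word frequencies: split on non-letters, lowercase, count, and sort by frequency.
tr -sc '[:alpha:]' '\n' < pamphlet.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort | uniq -c | sort -rn | head -n 10

# A crude keyword-in-context view of the phrase "pope is", with about 40 characters of context.
grep -o -i '.\{0,40\}pope is.\{0,40\}' pamphlet.txt
```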

The process of digitizing library materials is very much like the workflows of medieval scriptoriums, and the process is well understood. Description of, and access to, digital versions of original materials is well accommodated by the exploitation of MARC records. The next step for the profession is to move beyond find & get and towards use & understand. Many people can find many things with relative ease. The next step for librarianship is to provide services against the things readers find so they can more easily learn & comprehend. Save the time of the reader. The integration of the University of Notre Dame's Hesburgh Libraries' Catholic Pamphlets Collection into the Catholic Portal is one possible example of how this evolutionary process can be implemented.


[1] digitization process –

[2] library catalog –

[3] Catholic Portal –

[4] “Of Papal Infallibility” –

[5] PDF version –

[6] concordance interface –

HangingTogether: A story about the spirituality of libraries

Tue, 2016-05-31 09:02

At the AMICAL 2016 conference, I heard an inspiring story about the cyclical destruction and revival of libraries. Dr. Richard Hodges, President of the AU of Rome, began his welcome message as follows: "Unlike what many of you may believe, you don't come from Silicon Valley, you come from the monks". He went on to explain that libraries were created at the end of the 8th century by monasteries. It was then that monks started to make books, besides growing food and brewing beer. They crafted the leather covers, the straps, the folded leaves of vellum, and all the instrumentation necessary to write books. They created the blueprint of the library. In subsequent years, full-blown libraries developed, like the ones in Saint Denis and Montecassino. Like with all things successful, once they grow, you need to sustain them. The monks thus devised a model to attract donors with the lure of a counter-gift: a hand-crafted book. When, in the mid-9th century, the Vikings and Saracens destroyed many monasteries and their libraries, the books survived in the hands of those who had been donors. And so the spirituality of what libraries stood for, as preservers of intellectual heritage, survived the destruction and was the seed for the new learning of the Renaissance. The storyteller hinted at globalization as a similar wave of destruction, which might leave us without libraries but with the promise that the spirituality of what libraries stand for will resurface in a new guise.

According to keynote speaker Jim Groom this new guise is the Archiving Movement. He painted the Web landscape as a wonderful space of many small initiatives with do-it-yourself blogs and a Wiki-infra on which one can build an entire curriculum for free, with fascinating open technology and with exciting new learning experiences. In his view, it is all about our individual content, building domains of our own and leaving our personal digital footprints. He advocated the need for individuals to become archivists, reclaiming ownership and control over their data from the “big companies”. In this world of the small against the giants, “rogue Internet archivists” (or morphing librarians, as you wish) are excavating and rescuing the remains of parts of the web, that are dying and being destroyed.

In my presentation on "Adapting to the new scholarly record" I talked about shifting trends in the research ecosystem and disturbances which are disrupting the tasks and responsibilities of librarians as stewards of the record of science. I conveyed the concerns of experts and practitioners in the field, who met during a series of OCLC Research workshops on this matter. They talked about the short-term need for a demonstrable pay-off by universities and funding agencies; the diverse concerns on campus around image, IPR, and compliance; the emergence of new digital platforms like ResearchGate and others that lure researchers into providing data to them, bypassing their institutional repositories; etc. All these forces at play are distracting libraries from safeguarding the record for future scholarship. These observations beg the question, which came from the audience: "What can we do about it?" and, in particular, "What can we do as AMICAL libraries?"

I had been impressed by the information literacy (IL) session the day before. AMICAL libraries from Paris to Sharjah presented their efforts to engage faculty and to broaden the understanding of IL within the university. Many of the libraries face challenges with their student population, such as reluctance and resistance to reading, deficiencies in academic writing skills, inexperience with information retrieval, and ineffective search practices. The session concluded on the desirability of integrating IL into the curriculum.

So, I answered my audience without hesitation: Please continue the good work you are doing in IL! Why do we hear so little about IL at other library conferences in Europe? Isn’t IL a core part of that spirituality Richard Hodges talked about – a core part of what libraries stand for? The next generation needs to be prepared for the new learning in the digital information age. This requires education and training. People are not born being-digital!

About Titia van der Werf

Titia van der Werf is a Senior Program Officer in OCLC Research based in OCLC's Leiden office. Titia coordinates and extends OCLC Research work throughout Europe and has special responsibilities for interactions with OCLC Research Library Partners in Europe. She represents OCLC in European and international library and cultural heritage venues.


District Dispatch: Finding the “big picture” on big data at the 2016 ALA Annual Conference

Tue, 2016-05-31 06:50

Photo by Rowan University Publication via Flickr

Every day, technology is making it possible to collect and analyze ever more data about students' performance and behavior, including their use of library resources. The use of "big data" in the educational environment, however, raises thorny questions and deep concerns about individual privacy and data security. California responded to these concerns by passing the Student Online Personal Information Protection Act, and student data privacy is also now the focus of several bills in Congress.

Participate in a discussion on the big picture on student data privacy at the conference session “Student Privacy: The Big Picture on Big Data,” which takes place during the 2016 American Library Association (ALA) Annual Conference in Orlando, Fla.  During the session, Khaliah Barnes, associate director of the Electronic Privacy Information Center and Director of its Student Privacy Project, will discuss how the growing use of big data threatens student privacy and how evolving state and federal data privacy laws impact school and academic libraries. The session takes place on Monday, June 27, 2016, 10:00-11:30 a.m., in the Orange County Convention Center, room W206A.

As director of the EPIC Student Privacy Project, Khaliah created the Student Privacy Bill of Rights. Khaliah defends student privacy rights before federal regulatory agencies and federal court. She has testified before states and local districts on the need to safeguard student records. Khaliah is a frequent panelist, commentator, and writer on student data collection. Khaliah has provided expert commentary to local and national media, including CBS This Morning, the New York Times, the Washington Post, NPR, Fox Business, CNN, Education Week, Politico, USA Today, and Time Magazine.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Finding the “big picture” on big data at the 2016 ALA Annual Conference appeared first on District Dispatch.

DuraSpace News: Managed by DuraCloud: 150+ Terabytes of Data

Tue, 2016-05-31 00:00


Breakout of content stored in DuraCloud based on storage provider

DuraSpace News: Collaboration Between DSpace Registered Service Providers To Integrate And Deploy a Module On a Tight Deadline

Tue, 2016-05-31 00:00

From Peter Dietz, Longsight

Independence, Ohio  Longsight manages DSpace for the Woods Hole Open Access System (WHOAS), a repository of marine biology publications and datasets. To get as much value as possible out of this rich data, WHOAS has added a Linked Data module to DSpace to allow researchers to query their data.

DuraSpace News: Building an Authority Source For People

Tue, 2016-05-31 00:00

From Peter Dietz, Longsight

DuraSpace News: Fedora at Open Repositories: Hands-on Fedora 4, RepoRodeo, API Extension, State of the CLAW, Hydra at 30

Tue, 2016-05-31 00:00

Austin, TX  In two weeks the open repository community will gather at the Open Repositories Conference in Dublin, Ireland to share ideas and catch up with old friends and colleagues. The Fedora community will be on hand to participate and offer insights into current and future development of the flexible and extensible open source repository platform used by leading academic institutions.

Introduction to Fedora 4 Workshop

DuraSpace News: Save the Date for Open Repositories 2017

Tue, 2016-05-31 00:00

From the organizers of Open Repositories 2017

Brisbane, Australia  The Open Repositories (OR) Steering Committee in conjunction with the University of Queensland (UQ), Queensland University of Technology (QUT) and Griffith University are delighted to inform you that Brisbane will host the annual Open Repositories 2017 Conference.

It's exciting to have Open Repositories return to Australia, where it all began in 2006.  

DuraSpace News: New DSpace Repository for The Natural History Museum

Tue, 2016-05-31 00:00

From Lisa Cardy, Library Services Manager, Natural History Museum

LibUX: Announcing W3 Radio

Mon, 2016-05-30 18:21

So, LibUX has been a super vehicle for me — hi, I’m Michael — to talk, write, and make friends around the user experience design and development of libraries, non-profits, and the higher-ed web. These are special niches wherein the day-to-day challenges pervading the web are compounded by unique hyperlocal user bases and ethical imperatives largely without parallel on the commercial web.

And because there is so much to mine, so many hours in the day, and the LibUX audience is so varied — designers, developers, enthusiasts, dabblers, directors, big-bosses, students, vendors — I curate against topics that interest me-the-developer but maybe aren’t exactly relevant to me-the-librarian. There is soooo much happening in this space that captures my imagination, so of course I thought I’d start a new podcast: W3 Radio — you know, as in “world wide web.”

Do you need your web design news right now, in ten minutes or less? Well, it's just your luck that soon I, Michael Schofield, am starting a new podcast: W3 Radio – bite-sized best-of the world wide web. You'll soon be able to tune in to W3 Radio on your podcatcher of choice, and real soon at

The gimmick is that it’s just a weekly recap of headlines in under ten minutes, which I think makes it perfect for playing catch-up – oh, also, I am pretending to be an old-timey radio anchor, and I am almost positive that won’t get old.

Download the MP3

Availability update (June 1, 2016)

W3 Radio is now available in Google Play and still pending in iTunes. Of course, you can always subscribe to the direct feed.

So, as of this writing, publishing this announcement generates the feed I use to populate the various podcatchers. It will likely pend for a day or two before they make it available. Stay tuned to this space for links.

The post Announcing W3 Radio appeared first on LibUX.

LITA: Travel Apps!

Mon, 2016-05-30 17:00

With ALA’s annual conference in Orlando just around the corner, travel is in the plans for many librarians and staff. Fortunately, as I live in Florida, I don’t have that far to go. But if you do, then you’re going to need some good apps.

I travel frequently and have a few of my favorite apps that I use for travel, and I’d like to share them with you:

Airline App of Choice

I personally only use two airlines so I can only speak to their particular apps, but seriously, if you have a smartphone and you aren’t using it to hold tickets or boarding passes, you’re missing out. You can also use your app to check flight times and delays, book future travel, or just to play around (one of my airline’s apps lets you send virtual postcards).


Even if it's just a weekend trip, this app is great at letting you know what you should bring depending on the weather and your activities. You can adjust the lists according to your preferences as well. This is the free version; there is a paid version that lets you save your packing lists to Evernote. (Android)


I only use Foursquare when I travel. It got put through its paces in Boston when I needed to find a place to eat near my location or was looking for a historic site I hadn’t been to. It also helped in giving me tips about the place: what to order, when to avoid the place, how the staff was. On top of that, it links with your Map App of Choice (Google Maps FTW!) to give you directions and contact information. It’s not Yelp, but I feel it’s more genuine. (Android)


Take it from someone who lived in Orlando: driving in that city is not fun. This is why you want Waze: it can show you directions as well as let you input traffic accidents you happen across as you drive (well, maybe after you drive). It even helps out with finding cheap gas. (Android)

Photo-Editing App of Choice

You're no doubt going to be taking a lot of photos on your trip, so why not spice them up with some creative edits and share them? There are a plethora of photo apps out there to choose from, the most ubiquitous being Instagram (Android), but I love Hipstamatic (paid, iOS only) because you can randomize your filters and get a totally unexpected result every shot. Other apps that are fun are Pixlr (Android) (there's a desktop version, too!) and Photoshop Express (Android).

What are some travel apps that you cannot live without? Post them in the comments or tweet them my way @LibrarianStevie!

Tim Ribaric: Code4Lib North Strikes Again

Mon, 2016-05-30 13:57

Hotel had putt-putt; not relevant, but interesting nonetheless.


LibUX: 039 – Jean Felisme

Mon, 2016-05-30 04:50

I've known Jean Felisme for a while through WordCamp Miami. We see each other quite a bit at meetups and he's a ton of fun – he's also been pretty hardcore about evangelizing freelance. Recently he made the switch from freelance into the very special niche that is the higher-ed web, so when he was just six weeks into his new position at the School of Computing and Information Sciences at Florida International University I took the opportunity to pick his brain.

Hope you enjoy.

If you like, you can download the MP3 or subscribe to LibUX on Stitcher, iTunes, Google Play Music, or just plug our feed straight into your podcatcher of choice. Help us out and say something nice. Your sharing and positive reviews are the best marketing we could ask for.

Here’s what we talked about
  • 1:08 – WP Campus is coming up!
  • 2:45 – All about Jean
  • 4:28 – How the trend toward building-out in-house teams will impact freelance
  • 9:38 – What is the day-to-day like just six weeks in?
  • 12:03 – Student-hosted applications and content – scary
  • 13:09 – The makeup of Jean’s team
  • 17:43 – Are you playing with any web technology you haven’t before?
  • 19:37 – The tight relationship with the students
  • 20:31 – On web design curriculum
  • 28:00 – We fail to wrap up and keep talking about freelance for a few more minutes.

The post 039 – Jean Felisme appeared first on LibUX.

District Dispatch: Next CopyTalk on Government Overreach

Fri, 2016-05-27 16:04

Please join us on June 2 for a free webinar on another form of copyright creep. This one is on recent efforts to copyright state government works.

Issues behind State governments copyrighting government works

The purpose of copyright is to provide incentives for creativity in exchange for a time-limited, government-provided monopoly. When drafting the federal copyright law, Congress explicitly prohibited the federal government, as well as employees of the federal government, from having the authority to claim copyright in government-created works. However, the federal law is silent on state government power to create, hold, and enforce copyrights. This has resulted in a patchwork of varying state copyright laws across all fifty states.

Currently, California favors the approach where the vast majority of works created by state and local government are by default in the public domain. An ongoing debate is happening now as to whether California should end the public domain status of most state and local government works. The state legislature is contemplating a bill (AB 2880) that would grant copyright authority to all state agencies, local governments, and political subdivisions. In recent years, entities of state government have attempted to rely on copyright as a means to suppress the dissemination of taxpayer-funded research and to chill criticism, but they failed in the courts due to a lack of copyright authority. Ernesto Falcon, legislative counsel with the Electronic Frontier Foundation, will review the status of the legislation, the court decisions that led to its creation, and the debate that now faces the California legislature.

Day/Time: Thursday, June 2 at 2pm Eastern/11am Pacific for our hour-long free webinar.

Go to and sign in as a guest. You’re in.

This program is brought to you by OITP’s copyright education subcommittee.

The post Next CopyTalk on Government Overreach appeared first on District Dispatch.

FOSS4Lib Recent Releases: VuFind - 3.0.1

Fri, 2016-05-27 15:44
Package: VuFind
Release Date: Friday, May 27, 2016

Last updated May 27, 2016. Created by Demian Katz on May 27, 2016.

Bug fix release.

Eric Lease Morgan: VIAF Finder

Fri, 2016-05-27 13:34

This posting describes VIAF Finder. In short, given the values from MARC fields 1xx$a, VIAF Finder will try to find and record a VIAF identifier. [0] This identifier, in turn, can be used to facilitate linked data services against authority and bibliographic data.

Quick start

Here is the way to quickly get started:

  1. download and uncompress the distribution to your Unix-ish (Linux or Macintosh) computer [1]
  2. put a file of MARC records named authority.mrc in the ./etc directory, and the file name is VERY important
  3. from the root of the distribution, run ./bin/

VIAF Finder will then commence to:

  1. create a “database” from the MARC records, and save the result in ./etc/authority.db
  2. use the VIAF API (specifically the AutoSuggest interface) to identify VIAF numbers for each record in your database, and if numbers are identified, then the database will be updated accordingly [3] (a command-line sketch of this lookup appears just after this list)
  3. repeat Step #2 but through the use of the SRU interface
  4. repeat Step #3 but limiting searches to authority records from the Vatican
  5. repeat Step #3 but limiting searches to the authority named ICCU
  6. done
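For the curious, here is roughly what the AutoSuggest lookup in Step #2 boils down to when done by hand. The endpoint and parameter name reflect the public VIAF API as I understand it (verify them against the current VIAF documentation), and the heading is just an example.

```bash
# Query the VIAF AutoSuggest interface for a heading; the response is JSON.
curl -s -G '' \
  --data-urlencode 'query=Merton, Thomas' \
  | python -m json.tool   # optional pretty-printing
```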

Once done the reader is expected to programmatically loop through ./etc/authority.db to update the 024 fields of their MARC authority data.


Here is a listing of the VIAF Finder distribution:

  • 00-readme.txt – this file
  • bin/ – “One script to rule them all”
  • bin/ – reads MARC records and creates a simple “database”
  • bin/ – used to create a distribution of this system
  • bin/ – rudimentary use of the SRU interface to query VIAF
  • bin/ – rudimentary use of the AutoSuggest interface to query VIAF
  • bin/ – sort of demonstrates how to update MARC records with 024 fields
  • bin/ – extracts the first n number of MARC records from a set of MARC records, and useful for creating smaller, sample-sized datasets
  • etc – the place where the reader is expected to save their MARC files, and where the database will (eventually) reside
  • lib/ – a tiny set of… subroutines used to read and write against the database

If the reader hasn’t figured it out already, in order to use VIAF Finder, the Unix-ish computer needs to have Perl and various Perl modules — most notably, MARC::Batch — installed.

If the reader puts a file named authority.mrc in the ./etc directory, and then runs ./bin/, then the system ought to run as expected. A set of 100,000 records over a wireless network connection will finish processing in a matter of many hours, if not the better part of a day. Speed will be increased over a wired network, obviously.

But in reality, most people will not want to run the system out of the box. Instead, each of the individual tools will need to be run individually. Here’s how:

  1. save a file of MARC (authority) records anywhere on your file system
  2. not recommended, but optionally edit the value of DB in bin/
  3. run ./bin/ feeding it the name of your MARC file, as per Step #1
  4. if you edited the value of DB (Step #2), then edit the value of DB in bin/, and then run ./bin/
  5. if you want to possibly find more VIAF identifiers, then repeat Step #4 but with ./bin/ and with the “simple” command-line option
  6. optionally repeat Step #5, but this time use the “named” command-line option, and the possible named values are documented as a part of the VIAF API (e.g., “bav” denotes the Vatican)
  7. optionally repeat Step #6, but with other “named” values
  8. optionally repeat Step #7 until you get tired
  9. once you get this far, the reader may want to edit bin/, specifically configuring the value of MARC, and running the whole thing again — “one script to rule them all”
  10. done

A word of caution is now in order. VIAF Finder reads & writes to its local database. To do so it slurps up the whole thing into RAM, updates things as processing continues, and periodically dumps the whole thing just in case things go awry. Consequently, if you want to terminate the program prematurely, try to do so a few steps after the value of “count” has reached the maximum (500 by default). A few times I have prematurely quit the application at the wrong time and blew my whole database away. This is the cost of having a “simple” database implementation.

To do

Alas, the SRU search script contains a memory leak. It makes use of the SRU interface to VIAF, and my SRU queries return XML results. It then uses the venerable XML::XPath Perl module to read the results. Well, after a few hundred queries the totality of my computer’s RAM is taken up, and the script fails. One work-around would be to request the SRU interface to return a different data structure. Another solution is to figure out how to destroy the XML::XPath object. Incidentally, because of this memory leak, the integer fed to the script was implemented to allow the reader to restart the process at a different point in the dataset. Hacky.


The use of the database is key to the implementation of this system, and the database is really a simple tab-delimited table with the following columns:

  1. id (MARC 001)
  2. tag (MARC field name)
  3. _1xx (MARC 1xx)
  4. a (MARC 1xx$a)
  5. b (MARC 1xx$b and usually empty)
  6. c (MARC 1xx$c and usually empty)
  7. d (MARC 1xx$d and usually empty)
  8. l (MARC 1xx$l and usually empty)
  9. n (MARC 1xx$n and usually empty)
  10. p (MARC 1xx$p and usually empty)
  11. t (MARC 1xx$t and usually empty)
  12. x (MARC 1xx$x and usually empty)
  13. suggestions (a possible sublist of names, Levenshtein scores, and VIAF identifiers)
  14. viafid (selected VIAF identifier)
  15. name (authorized name from the VIAF record)

Most of the fields will be empty, especially fields b through x. The intention is/was to use these fields to enhance or limit SRU queries. Field #13 (suggestions) is for future, possible use. Field #14 is key, literally. Field #15 is a possible replacement for MARC 1xx$a. Field #15 can also be used as a sort of sanity check against the search results. “Did VIAF Finder really identify the correct record?”

Consider pouring the database into your favorite text editor, spreadsheet, database, or statistical analysis application for further investigation. For example, write a report against the database allowing the reader to see the details of the local authority record as well as the authority data in VIAF. Alternatively, open the database in OpenRefine in order to count & tabulate variations of the data it contains. [4] Your eyes will widen, I assure you.
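If the reader wants a quick look before firing up OpenRefine, the tab-delimited layout described above lends itself to one-liners. A minimal sketch, assuming the database lives at ./etc/authority.db as described earlier:

```bash
# How many records received a VIAF identifier? Column 14 is viafid, per the table above.
awk -F'\t' '$14 != "" { n++ } END { printf "%d of %d records have a VIAF identifier\n", n, NR }' ./etc/authority.db

# Peek at the id (column 1), the 1xx$a heading (column 4), and the selected viafid (column 14).
cut -f1,4,14 ./etc/authority.db | head
```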


First, this system was written during my “artist’s education adventure” which included a three-month stint in Rome. More specifically, this system was written for the good folks at Pontificia Università della Santa Croce. “Thank you, Stefano Bargioni, for the opportunity, and we did some very good collaborative work.”

Second, I first wrote the SRU interface script, and I was able to find VIAF identifiers for about 20% of my given authority records. I then enhanced it to include limitations to specific authority sets. I then wrote the AutoSuggest interface script, and not only was the result many times faster, but the result was just as good, if not better, than the previous one. This felt like two steps forward and one step back. Consequently, the reader may not ever need nor want to run the SRU script.

Third, while the AutoSuggest interface was much faster, I was not able to determine how suggestions were made. This makes the AutoSuggest interface seem a bit like a “black box”. One of my next steps, during the copious spare time I still have here in Rome, is to investigate how to make my scripts smarter. Specifically, I hope to exploit the use of the Levenshtein distance algorithm. [5]

Finally, I would not have been able to do this work without the “shoulders of giants”. Specifically, Stefano and I took long & hard looks at the code of people who have done similar things. For example, the source code of Jeff Chiu’s OpenRefine Reconciliation service demonstrates how to use the Levenshtein distance algorithm. [6] And we found Jakob Voß’s code useful for pointing out AutoSuggest as well as elegant ways of submitting URLs to remote HTTP servers. [7] “Thanks, guys!”

Fun with MARC-based authority data!


[0] VIAF –

[1] VIAF Finder distribution –

[2] VIAF API –

[4] OpenRefine –

[5] Levenshtein distance –

[6] Chiu’s reconciliation service –

[7] Voß’s –