
Feed aggregator

Hydra Project: Sufia 6.1 released

planet code4lib - Mon, 2015-07-06 14:22

We are pleased to announce that Sufia 6.1 has been released. A set of release notes can be found here.

If you are currently using Sufia 6.0, we recommend upgrading to Sufia 6.1 as soon as possible. Beyond the additional features in the release, a number of bugs were fixed.

Thanks to the 16 contributors for this release, which comprised 139 commits touching 187 files: Adam Wead, Michael Tribone, Gregorio Luis Ramirez, Justin Coyne, Nathan Rogers, Michael J. Giarlo, Carolyn Cole, Trey Terrell, Colin Brittle, Anna Headley, Hector Correa, E. Lynette Rayle, Chris Beer, Jeremy Friesen, Colin Gross, and Tricia Jenkins.

Hydra Project: Open Repositories 2017

planet code4lib - Mon, 2015-07-06 14:21

Of potential interest to Hydranauts

Call for Expressions of Interest in hosting the annual Open Repositories Conference, 2017

The Open Repositories Steering Committee seeks Expressions of Interest from candidate host organizations for the 2017 Open Repositories Annual Conference. Proposals from all geographic areas will be given consideration.

Important dates

The Open Repositories Steering Committee is accepting Expressions of Interest (EoI) to host the OR2017 conference until August 31st, 2015.  Shortlisted sites will be notified by the end of September 2015.


Candidate institutions must have the ability to host at least a four-day conference of approximately 300-500 attendees (OR2015 held in Indianapolis, USA drew more than 400 people). This includes appropriate access to conference facilities, lodging, and transportation, as well as the ability to manage a range of supporting services (food services, internet services, and conference social events; conference web site; management of registration and online payments; etc.). The candidate institutions and their local arrangements committee must have the means to support the costs of producing the conference through attendee registration and independent fundraising. Fuller guidance is provided in the Open Repositories Conference Handbook on the Open Repositories wiki.

Expressions of Interest Guidelines

Organisations interested in proposing to host the OR2017 conference should follow the steps listed below:

  1. Expressions of Interest (EoIs) must be received by August 31st, 2015. Please direct these EoIs and any enquiries to OR Steering Committee Chair William Nixon <>.
  2. As noted above, the Open Repositories wiki has a set of pages, the Open Repositories Conference Handbook, which offer guidelines for organising an Open Repositories conference. Candidate institutions should pay particular attention to the pages listed at “Preparing a bid” before submitting an EoI.
  3. The EoI must include:

* the name of the institution (or institutions in the case of a joint bid)

* an email address as a first point of contact

* the proposed location for the conference venue with a brief paragraph describing the local amenities that would be available to delegates, including its proximity to a reasonably well-served airport

  4. The OR Steering Committee will review proposals and may seek advice from additional reviewers. Following the review, one or more institutions will be invited to submit a detailed proposal.
  5. Invitations to submit a detailed proposal will be issued by the end of September 2015; institutions whose interest will not be taken up will also be notified at that time. The invitations sent out will provide a timeline for submitting a formal proposal and details of additional information available to the shortlisted sites for help in the preparation of their bid. The OR Steering Committee will be happy to answer specific queries whilst proposals are being prepared.

About Open Repositories

Since 2006 Open Repositories has hosted an annual conference that brings together users and developers of open digital repository platforms. For further information about Open Repositories and links to past conference sites, please visit the OR home page:

Subscribe to announcements about Open Repositories conferences by joining the OR Google Group

Mark E. Phillips: Punctuation in DPLA subject strings

planet code4lib - Mon, 2015-07-06 14:00

For the past few weeks I’ve been curious about the punctuation characters that are being used in the subject strings in the DPLA dataset I’ve been using for some blog posts over the past few months.

This post is an attempt to find out the range of punctuation characters used in these subject strings and is carried over from last week’s post related to subject string metrics.

What got me started was that in the analysis used for last week’s post, I noticed that there were a number of instances of em dashes “—” (528 instances) and en dashes “–” (822 instances) being used in place of double hyphens “--” in subject strings from The Portal to Texas History. These were most likely copied from some other source. Here is a great subject string that contains all three characters listed above.

Real Property — Texas –- Zavala County — Maps

It turns out this isn’t just something that happened in the Portal data; here is an example from the Mountain West Digital Library.

Highway planning--Environmental aspects–Arizona—Periodicals
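Stray dashes like these can be folded back into the conventional double-hyphen subdivision separator with a small normalization pass. Here is a minimal sketch of my own (not part of the original analysis, and the choice of `--` as the replacement is an assumption):

```python
import re

# En dash (U+2013) and em dash (U+2014), with any surrounding whitespace,
# folded back into the double hyphen used to separate subject subdivisions.
DASHES = re.compile(r"\s*[\u2013\u2014]\s*")

def normalize_subject_dashes(subject):
    return DASHES.sub("--", subject)

normalize_subject_dashes("Highway planning--Environmental aspects\u2013Arizona\u2014Periodicals")
# -> 'Highway planning--Environmental aspects--Arizona--Periodicals'
```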

To get the analysis started, the first thing I needed to do was establish what I’m considering punctuation characters, because that definition can change depending on who you are talking to and what language you are using. For this analysis I’m using the punctuation listed in the Python string module.

>>> import string
>>> print string.punctuation
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

So this gives us 32 characters that I’m considering to be punctuation characters for the analysis in this post.

The first thing I wanted to do was to get an idea of which of the 32 characters were present in the subject strings, and how many instances there were. In the dataset I’m using there are 1,871,877 unique subject strings. Of those, 1,496,769, or 80%, have one or more punctuation characters present.

Here is the breakdown of the number of subjects that have a specific character present. One thing to note: during processing, repeated instances of a character within a subject were reduced to a single instance. This doesn’t affect the analysis; it is just something to keep in mind.
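That counting step can be sketched roughly like this (a Python 3 illustration of my own; `subjects` is assumed to be a list of subject strings):

```python
import string
from collections import Counter

def subjects_with_character(subjects):
    """For each punctuation character, count how many subjects contain it.

    set(subject) collapses repeated instances of a character within one
    subject to a single instance, matching the processing described above.
    """
    punct = set(string.punctuation)  # the 32 characters considered here
    counts = Counter()
    for subject in subjects:
        counts.update(set(subject) & punct)
    return counts

counts = subjects_with_character([
    "Real Property -- Texas -- Maps",   # '-' counted once despite repeats
    "Highways--Arizona--Periodicals",
    "Birds (United States)",
])
# counts['-'] == 2, counts['('] == 1, counts[')'] == 1
```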

Character   Subjects with Character
!           72
"           1,066
#           432
$           57
%           16
&           33,825
'           22,671
(           238,252
)           238,068
*           451
+           81
,           607,849
-           954,992
.           327,404
/           3,217
:           10,774
;           5,166
<           1,028
=           1,027
>           1,027
?           7,005
@           53
[           9,872
]           9,893
\           32
^           1
_           80
`           99
{           9
|           72
}           9
~           4

One thing that I found interesting is that the characters () and [] have different numbers of instances, suggesting there are unbalanced brackets and parentheses in subjects somewhere.
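One way to hunt those down is a simple per-subject balance check; this is a sketch of my own, not something from the original analysis:

```python
def unbalanced_pairs(subject, pairs=(("(", ")"), ("[", "]"))):
    """Report opener/closer pairs whose counts differ in a subject string.

    A plain count comparison: it flags 'Fort McHenry (Md.' but would miss
    a pathological ')(' ordering, which is acceptable for a first pass.
    """
    return [(o, c) for o, c in pairs if subject.count(o) != subject.count(c)]

unbalanced_pairs("Fort McHenry (Md.")   # -> [('(', ')')]
unbalanced_pairs("Flags [replicas]")    # -> []
```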

Another interesting note is that there are 72 instances of subjects that use the pipe character “|”. The pipe is often used by programmers and developers as a delimiter because it “is rarely used in the data values”. This analysis confirms that it is indeed rarely used, but it should be kept in mind that it is sometimes used.

Next up was to look at how punctuation was distributed across the various Hubs.

In the table below I’ve pulled out the total number of unique subjects per Hub in the DPLA dataset.  I show the number of subjects without punctuation and the number of subjects with some sort of punctuation and finally display the percentage of subjects with punctuation.

Hub Name   Unique Subjects   Subjects without Punctuation   Subjects with Punctuation   Percent with Punctuation
ARTstor   9,560   6,093   3,467   36.3%
Biodiversity_Heritage_Library   22,004   14,936   7,068   32.1%
David_Rumsey   123   106   17   13.8%
Harvard_Library   9,257   553   8,704   94.0%
HathiTrust   685,733   56,950   628,783   91.7%
Internet_Archive   56,910   17,909   39,001   68.5%
J._Paul_Getty_Trust   2,777   375   2,402   86.5%
National_Archives_and_Records_Administration   7,086   2,150   4,936   69.7%
Smithsonian_Institution   348,302   152,850   195,452   56.1%
The_New_York_Public_Library   69,210   9,202   60,008   86.7%
United_States_Government_Printing_Office_(GPO)   174,067   14,525   159,542   91.7%
University_of_Illinois_at_Urbana-Champaign   6,183   2,132   4,051   65.5%
University_of_Southern_California._Libraries   65,958   37,237   28,721   43.5%
University_of_Virginia_Library   3,736   1,099   2,637   70.6%
Digital_Commonwealth   41,704   8,381   33,323   79.9%
Digital_Library_of_Georgia   132,160   9,876   122,284   92.5%
Kentucky_Digital_Library   1,972   579   1,393   70.6%
Minnesota_Digital_Library   24,472   16,555   7,917   32.4%
Missouri_Hub   6,893   2,410   4,483   65.0%
Mountain_West_Digital_Library   227,755   84,452   143,303   62.9%
North_Carolina_Digital_Heritage_Center   99,258   9,253   90,005   90.7%
South_Carolina_Digital_Library   23,842   4,002   19,840   83.2%
The_Portal_to_Texas_History   104,566   40,310   64,256   61.5%

To make it a little easier to see, I made a graph of this same data and divided it into two groups: on the left are the Content-Hubs and on the right are the Service-Hubs.

Percent of Subjects with Punctuation

I don’t see a huge difference between the two groups in the percentage of punctuation in subjects, at least just from looking at the graph.

Next I wanted to see, out of the 32 characters that I’m considering in this post, how many are present in a given hub’s subjects. That data is in the table and graph below.

Hub Name   Characters Present
ARTstor   19
Biodiversity_Heritage_Library   20
David_Rumsey   7
Digital_Commonwealth   21
Digital_Library_of_Georgia   22
Harvard_Library   12
HathiTrust   28
Internet_Archive   26
J._Paul_Getty_Trust   11
Kentucky_Digital_Library   11
Minnesota_Digital_Library   16
Missouri_Hub   14
Mountain_West_Digital_Library   30
National_Archives_and_Records_Administration   10
North_Carolina_Digital_Heritage_Center   23
Smithsonian_Institution   26
South_Carolina_Digital_Library   16
The_New_York_Public_Library   18
The_Portal_to_Texas_History   22
United_States_Government_Printing_Office_(GPO)   17
University_of_Illinois_at_Urbana-Champaign   12
University_of_Southern_California._Libraries   25
University_of_Virginia_Library   13

Here is this data in a graph grouped in Content and Service Hubs.

Unique Punctuation Characters Present

Mountain West Digital Library had the widest coverage, with 30 of the 32 possible punctuation characters. On the low end was the David Rumsey collection, with only 7 characters represented in its subject data.

The final thing is to see the character usage for all characters divided by hub; the following graphic presents that data. I tried to do a little coloring of the table to make it a bit easier to read, though I don’t know how well I accomplished that.

Punctuation Character Usage (click to view larger image)

So it looks like the characters ' ( ) , - . are present in all of the hubs. The characters % / ? : are present in almost all of the hubs (each is missing from just one hub).

The least used character is the ^, which is used by only one hub in one record. The characters ~ and @ are each used in only two hubs.

I’ve found this quick look at punctuation usage in subjects pretty interesting so far. I unearthed some anomalies in the Portal dataset with this work that we now have on the board to fix; they aren’t huge issues, but they are the sort of things that would probably stick around for quite some time in a set of records without specific identification.

For me, the next step is to see if there is a way to identify punctuation characters that are used incorrectly and to flag those fields and records in some way to report back to metadata creators.

Let me know what you think via Twitter if you have questions or comments.


Terry Reese: MarcEdit OSX Public Preview 1

planet code4lib - Mon, 2015-07-06 03:40

It’s with a little trepidation that I’m formally making the first Public Preview of the MarcEdit OSX version available for download and use.  In fact, as of today, this version is now *the* OSX download available on the downloads page.  I will no longer be building the old code-base for use on OSX.

When I first started this project around mid-April, I knew going in that this process would take some time. I’ve been working on MarcEdit continuously for a little over 16 years. It’s gone through one significant rewrite (when the program moved from Assembly to C#) and has had way too many revisions to count. In agreeing to take on the porting work, I’d hoped that I could port a significant portion of the program over the course of about 8 months, and that by the end of August I could produce a version of MarcEdit covering the 80% or so of the commonly used application toolset. To do this, it meant porting the MARC Tools portion of the application and the MarcEditor.

Well, I’m ahead of schedule. Since about 2014, I’ve been reworking a good deal of the application to support a smoother porting process sometime in the future — though, honestly, I wasn’t sure that I’d ever actually do the porting work. Pleasantly, this early work has made a good deal of the porting easier, allowing me to move faster than I’d anticipated. As of this posting, a significant portion of that 80% has been converted, and I think that for many people, most of what they probably use daily has been implemented. And while I’m ahead of schedule and have been happy with how the porting process has gone, make no mistake — it’s been a lot of work, and a lot of code. Even though this work has primarily been centered around rewriting just the UI portions of MarcEdit, you are still talking, as of today, close to 200,000 lines of code. This doesn’t include the significant amount of work I’ve done around the general assemblies that has provided improvements to all MarcEdit users. Because of that, I need to start getting feedback from users. While the general assemblies go through an automated testing process, I haven’t yet come up with an automated testing process for the OSX build. This means that I’m testing things manually and simply cannot go through the same level of testing that I do each time I build the Windows version. Most folks may not realize it, but it takes about a day to build the Windows version, as the program goes through various unit tests processing close to 25 million records. I simply don’t have an equivalent of that process yet, so I’m hoping that everyone interested in this work will give it a spin, use it for real work, and let me know if/when things fall down.

In creating the Preview, I’ve tried to make the process for users as easy as possible. Users interested in running the program simply need to be running at least OSX 10.8 and download the dmg found here:  Once downloaded, run the dmg and a new disk image will mount called MarcEdit OSX. Run this file, and you’ll see the following installer:

MarcEdit OSX installer

Drag the MarcEdit icon into the Applications folder and the application will either install, or overwrite an existing version.  That’s it.  No other downloads are necessary.  On first run, the program will generate a marcedit folder under /users/[yourid]/marcedit.  I realize that this isn’t completely normal — but I need the data accessible outside of the normal app sandbox to easily support updates.  I’d also considered the User Documents folder, but the configuration data probably shouldn’t live there either.  So, this is where I ended up putting it.

So what’s been completed? Essentially, all of the MARC Tools functions and a significant amount of the MarcEditor. There are some conspicuous functions that are absent at this point, though: the Call Number and Fast Heading generation, the Delimited Text Translator and Exporter, the Select and Delete Selected Records, everything Z39.50-related, as well as the Linked Data tools and the integration work with OCLC and Koha. None of these are currently available, but they will be worked on. At this point, what users can do is start letting me know which absent components are impacting them the most, and I’ll see how they fit into the current development roadmap.

Anyway — that’s it. I’m excited to let you all give this a try, and a little nervous as well. This has been a significant undertaking which has definitely pushed me a bit, requiring me to learn Objective-C in a short period of time, as well as quickly assimilate a significant portion of Apple’s SDK documentation relating to UI design. I’m sure I’ve missed things, but it’s time to let other folks start working with it.

If you have been interested in this work — download the installer, kick the tires, and give feedback.  Just remember to be gentle.  


Download URL:


Terry Reese: MarcEdit 6.1 Update

planet code4lib - Mon, 2015-07-06 03:33

This was something I’d hoped to get into the last update, but I didn’t have time to test it, so I got it done now. While at the first MarcEdit User Group meeting at ALA, there was a question about supporting 880 fields when exporting data via tab-delimited format. When you use the tool right now, the program will export all the 880 fields, not a specific 880 field. This update changes that. After the update, when you select the 880 field in the Export Tab Delimited tool, the program will ask you for the linking field. The program will then match the 880$6[linkingfield] and pull the selected subfield. I’m not sure how often this comes up, but it certainly made a lot of sense when the problem was described to me.

You can pick up the download at:


DuraSpace News: Quarterly Report from Fedora, January - June 2015

planet code4lib - Mon, 2015-07-06 00:00

From The Fedora Leadership Group

Fedora Development

In the past two quarters, the development team released three new versions of Fedora 4; detailed release notes are here:

Open Library Data Additions: PACER dump of cofc (Court of Federal Claims)

planet code4lib - Sun, 2015-07-05 17:56

PACER records from the U.S. Court of Federal Claims.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata

Patrick Hochstenbach: Homework assignment #7 Sketchbookskool #BootKamp

planet code4lib - Sun, 2015-07-05 07:15
Filed under: Sketchbook Tagged: sketch, sketchbook, sketchbookskool, urbansketching

Jonathan Rochkind: Long-standing bug in Chrome (WebKit?) on page not being drawn, scroll:auto, retina

planet code4lib - Sat, 2015-07-04 13:57

In a project I’m recently working on, I ran into a very odd bug in Chrome (may reproduce in other WebKit browsers, not sure).

My project loads some content via AJAX into a portion of the page. In some cases the loaded content is not properly displayed: it isn’t actually painted by the browser. Space is taken up by it on the page, and it’s kind of as if it had `display:none` set, though not quite, since sometimes _some_ of the content is displayed and some is not.

Various user interactions will force the content to paint, including resizing the browser window.

Googling around, there are various people who have been talking about this bug, or possibly similar bugs, for literally years. Including here and maybe this is the same thing or related, hard to say.

I think the conditions that trigger the bug in my case may include:

  • A Mac “retina” screen, the bug may not trigger on ordinary resolutions.
  • Adding/changing content via Javascript in a block on the page that has been set to `overflow: auto` (or just overflow-x or overflow-y auto).

I think both of these things are involved, and it’s got something to do with Chrome/WebKit getting confused calculating whether a scrollbar is necessary (and whether space has to be reserved for it) on a high-resolution “retina” screen when dynamically loading content.

It’s difficult to google around for this, because nobody seems to quite understand the bug. It’s a bit dismaying, though, that this bug — or at least related bugs with retina screens, scrollbar calculation, dynamic content, etc. — seems likely to have existed in Chrome/WebKit for possibly many years. I am not certain whether any tickets have been filed in the Chrome/WebKit bug tracker for this (or whether anyone’s figured out exactly what causes it from Chrome’s point of view). (This ticket is not quite the same thing, but it is also about overflow calculations and retina screens, so it could be caused by a common underlying bug.)

There are a variety of workarounds suggested on Google for bugs with Chrome not properly painting dynamically loaded content. Some of them didn’t seem to work for me; others cause a white flash even in browsers that wouldn’t otherwise be affected by the bug; others were inconvenient to apply in my context or required a really unpleasant `timeout` in JS code to tell Chrome to do something a few dozen/hundred ms after the dynamic content was loaded. (I think Chrome/WebKit may be smart enough to ignore changes that you immediately undo in some cases, so they don’t trigger any rendering redraw; but here we want to trick Chrome into doing a rendering redraw without actually changing the layout, so, yeah.)

Here’s the hacky lesser-evil workaround which seems to work for me. Immediately after dynamically loading the content, do this to its parent div:

$("#parentDiv").css("opacity", 0.99999).css("opacity", 1.0);

It does leave a `style` attribute setting opacity to 1.0 sitting around on your parent container after you’re done; oh well.

I haven’t actually tried the solution suggested here, to a problem which may or may not be the same one I have — of simply adding `-webkit-transform: translate3d(0,0,0)` to relevant elements.

One of the most distressing things about this bug is that if you aren’t testing on a retina screen (and why/how would you, unless your workstation happens to have one), you may never notice or be able to reproduce it, yet you may be ruining the interface for users on retina screens. And if those users do report it, you may find their bug report completely unintelligible and unreproducible, whether or not they mention they have a retina screen when they file it — which they probably won’t, since they may not even know what that is, let alone guess it’s a pertinent detail.

Also that the solutions are so hacky that I am not confident they won’t stop working in some future version of Chrome that still exhibits the bug.

Oh well, so it goes. I really wish Chrome/WebKit would notice and fix though. Probably won’t happen until someone who works on Chrome/WebKit gets a retina screen and happens to run into the bug themselves.

Filed under: General

FOSS4Lib Recent Releases: ePADD - 1.0

planet code4lib - Fri, 2015-07-03 21:25

Last updated July 3, 2015. Created by Peter Murray on July 3, 2015.

Package: ePADD
Release Date: Wednesday, July 1, 2015

FOSS4Lib Updated Packages: ePADD

planet code4lib - Fri, 2015-07-03 21:22

Last updated July 3, 2015. Created by Peter Murray on July 3, 2015.

ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.

The software is comprised of four modules:

Appraisal: Allows donors, dealers, and curators to easily gather and review email archives prior to transferring those files to an archival repository.

Processing: Provides archivists with the means to arrange and describe email archives.

Discovery: Provides the tools for repositories to remotely share a redacted view of their email archives with users through a web server discovery environment. (Note that this module is downloaded separately).

Delivery: Enables archival repositories to provide moderated full-text access to unrestricted email archives within a reading room environment.

Package Type: Archival Record Manager and Editor
License: Apache 2.0
Development Status: Production/Stable
Operating System: Linux, Mac, Windows
Technologies Used: Solr, Tomcat
Programming Language: Java

FOSS4Lib Recent Releases: Sufia - 6.1.0

planet code4lib - Fri, 2015-07-03 21:15

Last updated July 3, 2015. Created by Peter Murray on July 3, 2015.

Package: Sufia
Release Date: Thursday, July 2, 2015

Open Library Data Additions: Amazon Crawl: part if

planet code4lib - Fri, 2015-07-03 13:58

Part if of the Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

DPLA: Preserving the Star-Spangled Banner

planet code4lib - Fri, 2015-07-03 12:00

The tune of the “Star-Spangled Banner” is one that will be played at picnics, fireworks displays, and other Fourth of July celebrations across the country this weekend. But the “broad stripes and bright stars” of the original flag that flew over Fort McHenry in 1814, inspiring Francis Scott Key to pen the iconic poem, have required some refreshing over the years. While recent conservation efforts have made the flag a centerpiece of the Smithsonian’s climate-controlled Flag Hall at the National Museum of American History, that wasn’t the only big upkeep project on the flag. Here’s the story behind the 1914 conservation effort spearheaded by a talented embroidery teacher to bring new life to an American icon.

Women at work repairing the Star-Spangled Banner, 1914. Courtesy of The New York Public Library.

The Star-Spangled Banner first came to the Smithsonian in 1907 and was formally gifted a few years later by the family of Lieutenant Colonel George Armistead. When it came to the museum, the flag had already seen significant damage. In addition to the battle it survived at Fort McHenry, pieces of the flag had been given out as mementos by Armistead’s family to friends, war veterans, and politicians (legend has it even to Abraham Lincoln, though his rumored piece has never been found).

The original “Star-Spangled Banner.” Courtesy of The New York Public Library.

By the time the Smithsonian’s first conservation efforts began, the flag was 100 years old and in fragile condition. In 1914, the Smithsonian brought on embroidery teacher and professional flag restorer Amelia Fowler (who had experience fixing historic flags at the US Naval Academy) to undertake the Star-Spangled Banner project. Fowler, alongside her team of ten needlewomen, spent eight weeks in the humid early summer restoring the flag. The team took off a canvas backing that had been attached in the 1870s, when the flag was displayed at the Boston Navy Yard. Fowler attached a new linen backing, with approximately 1,700,000 stitches, in a unique honeycomb pattern, a preservation technique Fowler herself patented. For the project, Fowler was paid $500 and her team split an additional $500. The newly preserved flag was on display for the next fifty years.

Fowler’s flag restoration, which she said would “defy the test of time,” did last until 1999, when conservation efforts began again during the “Save America’s Treasures” preservation campaign. The extensive work that Fowler completed to revive the Star-Spangled Banner, those millions of stitches, took conservators almost two years to remove. The iconic flag remains on display in the Smithsonian’s National Museum of American History, inspiring new generations in “the land of the free and the home of the brave.”

Featured image, 1839 sheet-music for “The Star-Spangled Banner,” courtesy of the University of North Carolina at Chapel Hill via North Carolina Digital Heritage Center.

Jonathan Rochkind: “Dutch universities start their Elsevier boycott plan”

planet code4lib - Fri, 2015-07-03 01:42

“We are entering a new era in publications”, said Koen Becking, chairman of the Executive Board of Tilburg University in October. On behalf of the Dutch universities, he and his colleague Gerard Meijer negotiate with scientific publishers about an open access policy. They managed to achieve agreements with some publishers, but not with the biggest one, Elsevier. Today, they start their plan to boycott Elsevier.

Dutch universities start their Elsevier boycott plan

Filed under: General

Mark E. Phillips: Characteristics of subjects in the DPLA

planet code4lib - Thu, 2015-07-02 14:49

There are still a few things that I have been wanting to do with the subject data from the DPLA dataset that I’ve been working with for the past few months.

This time I wanted to take a look at some of the characteristics of the subject strings themselves, and see if there is any information there that is helpful or useful as an indicator of quality for the metadata record associated with that subject.

I took a look at the following metrics for each subject string: length, percentage integer, number of tokens, length of anagram, anagram complexity, and number of non-alphanumeric characters (punctuation).

In the tables below I present a few of the more interesting selections from the data.

Subject Length

This is calculated by stripping whitespace from the ends of each subject, and then counting the number of characters that are left in the string.
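That computation, along with the summary statistics reported in the table below, can be sketched in Python 3 (my own illustration; the sample subjects are hypothetical):

```python
import statistics

def subject_length(subject):
    # Strip whitespace from both ends, then count the characters left.
    return len(subject.strip())

# Hypothetical sample; the real analysis runs over all unique subjects.
subjects = ["Maps", "  Real Property -- Texas -- Zavala County  ", "Flags"]
lengths = [subject_length(s) for s in subjects]
summary = {
    "minimum": min(lengths),
    "median": statistics.median(lengths),
    "maximum": max(lengths),
    "average": statistics.mean(lengths),
    "stddev": statistics.stdev(lengths),
}
```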

Hub   Unique Subjects   Minimum Length   Median Length   Maximum Length   Average Length   stddev
ARTstor   9,560   3   12.0   201   16.6   14.4
Biodiversity_Heritage_Library   22,004   3   10.5   478   16.4   10.0
David_Rumsey   123   3   18.0   30   11.3   5.2
Digital_Commonwealth   41,704   3   17.5   3490   19.6   26.7
Digital_Library_of_Georgia   132,160   3   18.5   169   27.1   14.1
Harvard_Library   9,257   3   17.0   110   30.2   12.6
HathiTrust   685,733   3   31.0   728   36.8   16.6
Internet_Archive   56,910   3   152.0   1714   38.1   48.4
J._Paul_Getty_Trust   2,777   4   65.0   99   31.6   15.5
Kentucky_Digital_Library   1,972   3   31.5   129   33.9   18.0
Minnesota_Digital_Library   24,472   3   19.5   199   17.4   10.2
Missouri_Hub   6,893   3   182.0   525   30.3   40.4
Mountain_West_Digital_Library   227,755   3   12.0   3148   27.2   25.1
National_Archives_and_Records_Administration   7,086   3   19.0   166   22.7   17.9
North_Carolina_Digital_Heritage_Center   99,258   3   9.5   3192   25.6   20.2
Smithsonian_Institution   348,302   3   14.0   182   24.2   11.9
South_Carolina_Digital_Library   23,842   3   26.5   1182   35.7   25.9
The_New_York_Public_Library   69,210   3   29.0   119   29.4   13.5
The_Portal_to_Texas_History   104,566   3   16.0   152   17.7   9.7
United_States_Government_Printing_Office_(GPO)   174,067   3   39.0   249   43.5   18.1
University_of_Illinois_at_Urbana-Champaign   6,183   3   23.0   141   23.2   14.3
University_of_Southern_California._Libraries   65,958   3   13.5   211   18.4   10.7
University_of_Virginia_Library   3,736   3   40.5   102   31.0   17.7

My takeaway from this is that three characters is just about the shortest subject that one is able to include; not an absolute rule, but that is the low end for this data.

The average length ranges from 11.3 average characters for the David Rumsey hub to 43.5 characters on average for the United States Government Printing Office (GPO).

Put into a graph, you can see the average subject length across the Hubs a bit more easily.

Average Subject Length

The length of a field can be helpful for finding values that are a bit outside of the norm. For example, you can see that there are five Hubs that have maximum character lengths of over 1,000 characters. A quick investigation of these values suggests they are abstracts and content descriptions accidentally coded as subjects.

Maximum Subject Length

For the Portal to Texas History, which had a few subjects that came in at over 152 characters long, it turns out that these are incorrectly formatted subject fields where a user has included a number of subjects in one field instead of separating them out into multiple fields.

Percent Integer

For this metric I stripped whitespace characters, and then divided the number of digit characters by the number of total characters in the string to come up with the percentage integer.
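A rough sketch of that calculation in Python 3 (my own illustration of the metric as described):

```python
def percent_integer(subject):
    """Digit characters as a percentage of all non-whitespace characters."""
    chars = [c for c in subject if not c.isspace()]
    if not chars:
        return 0.0
    return 100.0 * sum(c.isdigit() for c in chars) / len(chars)

percent_integer("World War, 1914-1918")  # 8 digits out of 18 characters
percent_integer("Maps")                  # -> 0.0
```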

Hub   Unique Subjects   Maximum % Integer   Average % Integer   stddev
ARTstor   9,560   61.5   1.3   5.2
Biodiversity_Heritage_Library   22,004   92.3   2.2   11.1
David_Rumsey   123   36.4   0.5   4.2
Digital_Commonwealth   41,704   66.7   1.6   6.0
Digital_Library_of_Georgia   132,160   87.5   1.7   6.2
Harvard_Library   9,257   44.4   4.6   9.0
HathiTrust   685,733   100.0   3.5   8.4
Internet_Archive   56,910   100.0   4.1   9.4
J._Paul_Getty_Trust   2,777   50.0   3.6   8.0
Kentucky_Digital_Library   1,972   63.6   5.7   9.9
Minnesota_Digital_Library   24,472   80.0   1.1   5.1
Missouri_Hub   6,893   50.0   2.9   7.5
Mountain_West_Digital_Library   227,755   100.0   1.1   5.5
National_Archives_and_Records_Administration   7,086   42.1   4.7   9.4
North_Carolina_Digital_Heritage_Center   99,258   100.0   1.5   5.9
Smithsonian_Institution   348,302   100.0   1.1   3.6
South_Carolina_Digital_Library   23,842   57.1   2.3   6.5
The_New_York_Public_Library   69,210   100.0   12.0   13.5
The_Portal_to_Texas_History   104,566   100.0   0.4   3.7
United_States_Government_Printing_Office_(GPO)   174,067   80.0   0.4   2.4
University_of_Illinois_at_Urbana-Champaign   6,183   50.0   6.1   10.9
University_of_Southern_California._Libraries   65,958   100.0   1.3   6.4
University_of_Virginia_Library   3,736   72.7   1.8   6.8

Average Percent Integer

If you group these into the Content-Hub and Service-Hub categories you can see things a little better.

It appears that the Content-Hubs on the left trend a bit higher than the Service-Hubs on the right. This probably has to do with the use of dates in subject strings, a common practice in metadata based on bibliographic catalogs that isn't as common in metadata created for the more heterogeneous collections of content we see in the Service-Hubs.


For the tokens metric I replaced each punctuation character with a single space character and then used the nltk word_tokenize function to return a list of tokens. I then took the length of the resulting list as the metric.
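A sketch of that process (the post used nltk's word_tokenize; since punctuation has already been replaced with spaces, a plain whitespace split gives comparable counts, so this version avoids the nltk dependency; the function name is mine):

```python
import string

# Translate every punctuation character to a single space, as described above.
PUNCT_TO_SPACE = str.maketrans({p: " " for p in string.punctuation})

def token_count(subject):
    """Number of whitespace-delimited tokens after punctuation removal."""
    cleaned = subject.translate(PUNCT_TO_SPACE)
    return len(cleaned.split())

token_count("World War, 1939-1945")  # four tokens
```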

Hub | Unique Subjects | Maximum Tokens | Average Tokens | stddev
ARTstor | 9,560 | 31 | 2.36 | 2.12
Biodiversity_Heritage_Library | 22,004 | 66 | 2.29 | 1.46
David_Rumsey | 123 | 5 | 1.63 | 0.94
Digital_Commonwealth | 41,704 | 469 | 2.78 | 3.70
Digital_Library_of_Georgia | 132,160 | 23 | 3.70 | 1.72
Harvard_Library | 9,257 | 17 | 4.07 | 1.77
HathiTrust | 685,733 | 107 | 4.75 | 2.31
Internet_Archive | 56,910 | 244 | 5.06 | 6.21
J._Paul_Getty_Trust | 2,777 | 15 | 4.11 | 2.14
Kentucky_Digital_Library | 1,972 | 20 | 4.65 | 2.50
Minnesota_Digital_Library | 24,472 | 25 | 2.66 | 1.54
Missouri_Hub | 6,893 | 68 | 4.30 | 5.41
Mountain_West_Digital_Library | 227,755 | 549 | 3.64 | 3.51
National_Archives_and_Records_Administration | 7,086 | 26 | 3.48 | 2.93
North_Carolina_Digital_Heritage_Center | 99,258 | 493 | 3.75 | 2.64
Smithsonian_Institution | 348,302 | 25 | 3.29 | 1.56
South_Carolina_Digital_Library | 23,842 | 180 | 4.87 | 3.45
The_New_York_Public_Library | 69,210 | 20 | 4.28 | 2.14
The_Portal_to_Texas_History | 104,566 | 23 | 2.69 | 1.36
United_States_Government_Printing_Office_(GPO) | 174,067 | 41 | 5.31 | 2.28
University_of_Illinois_at_Urbana-Champaign | 6,183 | 26 | 3.35 | 2.11
University_of_Southern_California._Libraries | 65,958 | 36 | 2.66 | 1.51
University_of_Virginia_Library | 3,736 | 15 | 4.62 | 2.84

Average number of tokens

The token counts end up looking very similar to the overall character lengths of the subjects. If I were to do more processing I would probably divide the length by the number of tokens to get an average word length for the tokens in the subjects. That might be interesting.


I’ve always found anagrams of values in metadata to be interesting, sometimes helpful and sometimes completely useless. For this value I folded the case of the subject string, converted letters with diacritics to their ASCII versions, and then created an anagram of the resulting letters. I used the length of this anagram as the metric.
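The 26-letter ceiling in the table below suggests the "anagram" keeps one instance of each distinct letter. A sketch under that assumption (the function name and the NFKD-based diacritic stripping are mine, not necessarily the exact implementation used):

```python
import unicodedata

def anagram_length(subject):
    """Number of distinct ASCII letters in a case-folded, de-accented subject."""
    # Case-fold, then decompose and drop diacritics down to plain ASCII.
    folded = unicodedata.normalize("NFKD", subject.casefold())
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    # One sorted instance of each distinct letter forms the "anagram";
    # its length can never exceed 26.
    anagram = "".join(sorted({ch for ch in ascii_text if ch.isalpha()}))
    return len(anagram)
```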

Hub | Unique Subjects | Min Anagram Length | Median Anagram Length | Max Anagram Length | Avg Anagram Length | stddev
ARTstor | 9,560 | 2 | 8 | 23 | 8.93 | 3.63
Biodiversity_Heritage_Library | 22,004 | 0 | 7.5 | 23 | 9.33 | 3.26
David_Rumsey | 123 | 3 | 12 | 13 | 7.93 | 2.28
Digital_Commonwealth | 41,704 | 0 | 9 | 26 | 9.97 | 3.01
Digital_Library_of_Georgia | 132,160 | 0 | 9.5 | 23 | 11.74 | 3.18
Harvard_Library | 9,257 | 3 | 11 | 21 | 12.51 | 2.92
HathiTrust | 685,733 | 0 | 14 | 25 | 13.56 | 2.98
Internet_Archive | 56,910 | 0 | 22 | 26 | 12.41 | 3.96
J._Paul_Getty_Trust | 2,777 | 3 | 19 | 21 | 13.02 | 3.60
Kentucky_Digital_Library | 1,972 | 2 | 14.5 | 22 | 13.02 | 3.28
Minnesota_Digital_Library | 24,472 | 0 | 12 | 22 | 9.76 | 3.00
Missouri_Hub | 6,893 | 0 | 22 | 25 | 11.09 | 4.06
Mountain_West_Digital_Library | 227,755 | 0 | 7 | 26 | 11.85 | 3.54
National_Archives_and_Records_Administration | 7,086 | 3 | 11 | 22 | 10.01 | 3.09
North_Carolina_Digital_Heritage_Center | 99,258 | 0 | 6 | 26 | 11.00 | 3.54
Smithsonian_Institution | 348,302 | 0 | 8 | 23 | 11.53 | 3.42
South_Carolina_Digital_Library | 23,842 | 1 | 12 | 26 | 13.08 | 3.67
The_New_York_Public_Library | 69,210 | 0 | 10 | 24 | 11.45 | 3.17
The_Portal_to_Texas_History | 104,566 | 0 | 10.5 | 23 | 9.78 | 2.98
United_States_Government_Printing_Office_(GPO) | 174,067 | 0 | 14 | 24 | 14.56 | 2.80
University_of_Illinois_at_Urbana-Champaign | 6,183 | 3 | 7 | 21 | 10.42 | 3.46
University_of_Southern_California._Libraries | 65,958 | 0 | 9 | 23 | 9.81 | 3.20
University_of_Virginia_Library | 3,736 | 0 | 9 | 22 | 12.76 | 4.31

Average anagram length

I find it interesting that several of the Hubs (Digital_Commonwealth, Internet Archive, Mountain West Digital Library, and South Carolina Digital Library) have a single subject instance that contains all 26 letters. That’s just neat. I didn’t look to see if these are the same subject instances that were themselves 3,000+ characters long.




It can be interesting to see what punctuation is used in a field, so I extracted all non-alphanumeric characters from the string, which left me with the punctuation characters. I took the number of unique punctuation characters as the metric.
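A sketch of that extraction (the function name is mine; I'm also assuming whitespace is excluded along with alphanumerics, which the description leaves open):

```python
def unique_punctuation(subject):
    """Distinct punctuation characters in a subject string."""
    # Anything that is neither alphanumeric nor whitespace is treated
    # as punctuation.
    return {ch for ch in subject if not ch.isalnum() and not ch.isspace()}

def unique_punctuation_count(subject):
    return len(unique_punctuation(subject))
```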

Hub Name | Unique Subjects | min | median | max | mean | stddev
ARTstor | 9,560 | 0 | 0 | 8 | 0.73 | 1.22
Biodiversity Heritage Library | 22,004 | 0 | 0 | 8 | 0.59 | 1.02
David Rumsey | 123 | 0 | 0 | 4 | 0.18 | 0.53
Digital Commonwealth | 41,704 | 0 | 1.5 | 10 | 1.21 | 1.10
Digital Library of Georgia | 132,160 | 0 | 1 | 7 | 1.34 | 0.96
Harvard_Library | 9,257 | 0 | 0 | 6 | 1.65 | 1.02
HathiTrust | 685,733 | 0 | 1 | 9 | 1.63 | 1.16
Internet_Archive | 56,910 | 0 | 2 | 11 | 1.47 | 1.75
J_Paul_Getty_Trust | 2,777 | 0 | 2 | 6 | 1.58 | 0.99
Kentucky_Digital_Library | 1,972 | 0 | 1.5 | 5 | 1.50 | 1.38
Minnesota_Digital_Library | 24,472 | 0 | 0 | 7 | 0.42 | 0.74
Missouri_Hub | 6,893 | 0 | 3 | 7 | 1.24 | 1.37
Mountain_West_Digital_Library | 227,755 | 0 | 1 | 8 | 0.97 | 1.04
National_Archives_and_Records_Administration | 7,086 | 0 | 3 | 7 | 1.68 | 1.61
North_Carolina_Digital_Heritage_Center | 99,258 | 0 | 0.5 | 7 | 1.34 | 0.93
Smithsonian_Institution | 348,302 | 0 | 2 | 7 | 0.84 | 0.96
South_Carolina_Digital_Library | 23,842 | 0 | 3.5 | 8 | 1.68 | 1.41
The_New_York_Public_Library | 69,210 | 0 | 1 | 7 | 1.57 | 1.12
The_Portal_to_Texas_History | 104,566 | 0 | 1 | 7 | 0.84 | 0.91
United_States_Government_Printing_Office_(GPO) | 174,067 | 0 | 2 | 7 | 1.38 | 0.99
University_of_Illinois_at_Urbana-Champaign | 6,183 | 0 | 2 | 6 | 1.31 | 1.25
University_of_Southern_California_Libraries | 65,958 | 0 | 0 | 7 | 0.75 | 1.09
University_of_Virginia_Library | 3,736 | 0 | 5 | 7 | 1.67 | 1.58
(unnamed) | 63 | 0 | 2 | 5 | 1.17 | 1.31

Average Punctuation Characters

Again, I don’t have much to say about this one. I do plan to take a look at which punctuation characters are being used by which Hubs. I have a feeling this could be very useful in identifying problems with mapping from one metadata world to another. For example, I know the subject values in the DPLA dataset contain character patterns that resemble sub-field indicators from a MARC record (‡, |, and —); how many there are is something to look at.
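Flagging those suspected MARC leftovers could be a simple scan. A sketch (the marker list comes from the characters named above; the function name is mine):

```python
# Characters that resemble MARC sub-field markers: the double dagger
# (\u2021), the vertical bar, and the em dash (\u2014).
MARC_MARKERS = ("\u2021", "|", "\u2014")

def marc_residue(subjects):
    """Return the subjects that contain a suspected MARC sub-field marker."""
    return [s for s in subjects if any(m in s for m in MARC_MARKERS)]
```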

Let me know if there are other pieces that you think might be interesting to look at related to this subject work with the DPLA metadata dataset and I’ll see what I can do.

Let me know what you think via Twitter if you have questions or comments.

Open Knowledge Foundation: Just Released: “Where Does Europe’s Money Go? A Guide to EU Budget Data Sources”

planet code4lib - Thu, 2015-07-02 11:57

The EU has committed to spending €959,988 million (roughly €960 billion) between 2014 and 2020. This money is disbursed through over 80 funds and programmes that are managed by over 100 different authorities. Where does this money come from? How is it allocated? And how is it spent?

Today we are delighted to announce the release of “Where Does Europe’s Money Go? A Guide to EU Budget Data Sources”, which aims to help civil society groups, journalists and others to navigate the vast landscape of documents and datasets in order to “follow the money” in the EU. The guide also suggests steps that institutions should take in order to enable greater democratic oversight of EU public finances. It was undertaken by Open Knowledge with support from the Adessium Foundation.

As we have seen from projects like Farm Subsidy and journalistic collaborations around the EU Structural Funds it can be very difficult and time-consuming to put together all of the different pieces needed to understand flows of EU money.

Groups of journalists on these projects have spent many months requesting, scraping, cleaning and assembling data to get an overview of just a handful of the many different funds and programmes through which EU money is spent. The analysis of this data has led to many dozens of news stories, and in some cases even criminal investigations.

Better data, documentation, advocacy and journalism around EU public money is vital to addressing the “democratic deficit” in EU fiscal policy. To this end, we make the following recommendations to EU institutions and civil society organisations:

  1. Establish a single central point of reference for data and documents about EU revenue, budgeting and expenditure, and ensure all the information at this domain (e.g. a dedicated website) is kept up to date. At the same time, ensure all EU budget data are available from the EU open data portal as open data.
  2. Create an open dataset with key details about each EU fund, including name of the fund, heading, policy, type of management, implementing authorities, link to information on beneficiaries, link to legal basis in Eur-Lex and link to regulation in Eur-Lex.
  3. Extend the Financial Transparency System to all EU funds by integrating or federating detailed data expenditures from Members States, non-EU Members and international organisations. Data on beneficiaries should include, when relevant, a unique European identifier of company, and when the project is co-financed, the exact amount of EU funding received and the total amount of the project.
  4. Clarify and harmonise the legal framework regarding transparency rules for the beneficiaries of EU funds.
  5. Support and strengthen funding for civil society groups and journalists working on EU public finances.
  6. Conduct a more detailed assessment of beneficiary data availability for all EU funds and for all implementing authorities – e.g., through a dedicated “open data audit”.
  7. Build a stronger central base of evidence about the uses and users of EU fiscal data – including data projects, investigative journalism projects and data users in the media and civil society.

Our intention is that the material in this report will become a living resource that we can continue to expand and update. If you have any comments or suggestions, we’d love to hear from you.

If you are interested in learning more about Open Knowledge’s other initiatives around open data and financial transparency you can explore the Where Does My Money Go? project, the OpenSpending project, read our other previous guides and reports or join the Follow the Money network.

Peter Murray: Thursday Threads: New and Interesting from ALA Exhibits

planet code4lib - Thu, 2015-07-02 10:51

I’m just home from the American Library Association meeting in San Francisco, so this week’s threads are just a brief view of new and interesting things I found on the exhibit floor.

Funding for my current position at LYRASIS ran out at the end of June, so I am looking for new opportunities and challenges for my skills. Check out my resume/c.v. and please let me know of job opportunities in library technology, open source, and/or community engagement.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted to are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.


Book-Donations-Processing-as-a-Service. See something new everyday. #alaac15

— Peter Murray (@DataG) June 28, 2015

I didn’t get to talk to anyone at this booth, but I was interested in the concept. I remember donations processing being such a hassle: analyzing each book for its value, deciding whether it fits your collection policy, determining where to sell it, managing the sale, and so forth. American Book Drive seems to offer such a service. Right now their service is limited to California. I wonder if it will expand, or if there are similar service providers in other parts of the country.

Free Driver’s Ed Resources for Libraries

Free driver's ed resources for librs. Group has a great story. – Another first at #alaac15

— Peter Murray (@DataG) June 28, 2015

This exhibitor had a good origin story. A family coming to the U.S. had a difficult time getting their driver’s licenses, so they created an online resource covering the details for all 50 states. They’ve had success with the business side of their service, so they decided to give it away to libraries for free.

Free Online Obituaries Service from Orange County Library

Orange County Public Library offering free obituary service and publicizing through libraries. Via @us_imls

— Peter Murray (@DataG) June 28, 2015

With newspapers charging more for printing obituaries, important community details are no longer being printed. The Epoch Project from the Orange County (FL) Library System provides a simple service with text and media to capture this cultural heritage information. Funded initially by an IMLS grant [PDF], they are now in the process of rounding up partners in each state to be ambassadors to bring the service to other libraries around the country.

Link to this post!

Open Library Data Additions: HathiTrust Metadata

planet code4lib - Thu, 2015-07-02 06:46

Metadata records from

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata

