planet code4lib

District Dispatch: CopyTalk: the libertarians are coming! The libertarians are coming!

Tue, 2016-03-29 19:18

CopyTalk is back.

When it comes to copyright policy, you will not find a sharp partisan divide between Republicans and Democrats. Both parties tend to focus on maintaining the business models they are familiar with: do no harm to the content companies, help the starving artists without understanding why they are not making money, and make Internet companies and anyone who provides Internet services more accountable for copyright piracy (aka infringement). But not everyone is a Republican or Democrat; there are others, like Independents, Tea Party people and Libertarians, who favor small government, individual freedom, the market economy and private property. Of course, like all political parties, there is a spectrum of thought among libertarians: there are libertarians, and then there are libertarians. But in general, what do Libertarians think about copyright law? Were copyright ever to be reformed, what reform would a libertarian want to see?

ALA is a founding member of Re:Create: a coalition of industry, civil society, trade associations, and libertarians who seek balanced copyright, creativity, freedom of speech and a more understandable copyright law.

The April 7th CopyTalk will feature coalition partner R Street, “a free market think tank advancing real solutions for complex public policy problems.” This will be our first policy-related webinar, so get ready for a trip to Wonky Town!

Details: Thursday, April 7th at 2:00 pm Eastern / 11:00 am Pacific. This is the URL that will get you into the webinar. Register as a guest and you’re in. Yes, it’s still FREE, because the Office for Information Technology Policy and the Copyright Education Subcommittee want to expand copyright awareness and education opportunities.

The post CopyTalk: the libertarians are coming! The libertarians are coming! appeared first on District Dispatch.

LITA: 2016 LITA Forum – Call for Proposals

Tue, 2016-03-29 15:38

The 2016 LITA Forum Committee seeks proposals for the 19th Annual Forum of the Library Information and Technology Association in Fort Worth, Texas, November 17-20, 2016, at the Omni Fort Worth Hotel.

Submit your proposal at this site

The Forum Committee welcomes proposals for full-day pre-conferences, concurrent sessions, or poster sessions related to all types of libraries: public, school, academic, government, special, and corporate. Collaborative and interactive concurrent sessions, such as panel discussions or short talks followed by open moderated discussions, are especially welcomed. We deliberately seek and strongly encourage submissions from underrepresented groups, such as women, people of color, the LGBT community and people with disabilities.

The submission deadline is Friday, April 29, 2016.

Proposals could relate to, but are not restricted to, any of the following topics:

  • Discovery, navigation, and search
  • Practical applications of linked data
  • Library spaces (virtual or physical)
  • User experience
  • Emerging technologies
  • Cybersecurity and privacy
  • Open content, software, and technologies
  • Assessment
  • Systems integration
  • Hacking the library
  • Scalability and sustainability of library services and tools
  • Consortial resource and system sharing
  • “Big Data” — work in discovery, preservation, or documentation
  • Library I.T. competencies

Proposals may cover projects, plans, ideas, or recent discoveries. We accept proposals on any aspect of library and information technology. The committee particularly invites submissions from first time presenters, library school students, and individuals from diverse backgrounds.

Vendors wishing to submit a proposal should partner with a library representative who is testing/using the product.

Presenters will submit final presentation slides and/or electronic content (video, audio, etc.) to be made available on the web site following the event. Presenters are expected to register and participate in the Forum as attendees; a discounted registration rate will be offered.

If you have any questions, contact Tammy Allgood Wolf, Forum Planning Committee Chair, at

Submit your proposal at this site

More information about LITA is available from the LITA website, Facebook and Twitter.

Tim Ribaric: A hot take on discovery system results

Tue, 2016-03-29 15:36

Here's an example of Google doing better than a discovery system. First and foremost, your mileage may vary. This is a very specific example, but it is emblematic of the landscape we find ourselves in.


David Rosenthal: Following Up On The Emulation Report

Tue, 2016-03-29 15:00
A meeting was held at the Mellon Foundation to follow up on my report Emulation and Virtualization as Preservation Strategies. I was asked to provide a brief introduction to get discussion going. The discussions were confidential, but below the fold is an edited text of my introduction with links to the sources.

I think the two most useful things I can do this morning are:
  • A quick run-down of developments I'm aware of since the report came out.
  • A summary of the key problem areas and recommendations from the report.
I'm going to ignore developments by the teams represented here. Not that they aren't important, but they can explain them better than I can.
Emulators

First, the emulators themselves. Reports of new, enthusiast-developed emulators continue to appear. Among recent ones are:

The quality of the emulators, especially when running legacy artefacts, is a significant concern. A paper at last year's SOSP by Nadav Amit et al. entitled Virtual CPU Verification casts light on the causes and cures of fidelity failures in emulators. They observed that the problem of verifying virtualized or emulated CPUs is closely related to the problem of verifying a real CPU. Real CPU vendors sink huge resources into verifying their products, and this team from the Technion and Intel was able to base its research into X86 emulation on the tools that Intel uses to verify its CPU products.
The quality of the emulators, especially when running legacy artefacts, is a significant concern. A paper at last year's SOSP by Nadav Amit et al entitled Virtual CPU Verification casts light on the causes and cures of fidelity failures in emulators. They observed that the problem of verifying virtualized or emulated CPUs is closely related to the problem of verifying a real CPU. Real CPU vendors sink huge resources into verifying their products, and this team from the Technion and Intel were able to base their research into X86 emulation on the tools that Intel uses to verify its CPU products.

Although QEMU running on an X86 tries hard to virtualize rather than emulate, it is capable of emulating and the team were able to force it into emulation mode. Using their tools, they were able to find and analyze 117 bugs in QEMU, and fix most of them. Their testing also triggered a bug in the VM BIOS:
But the VM BIOS can also introduce bugs of its own. In our research, as we addressed one of the disparities in the behavior of VCPUs and CPUs, we unintentionally triggered a bug in the VM BIOS that caused the 32-bit version of Windows 7 to display the so-called blue screen of death.

Having Intel validate the open source hypervisors, especially doing so by forcing them to emulate rather than virtualize, would be a big step forward. To what extent the validation process would test the emulation of the hardware features of legacy CPUs important for preservation is uncertain, though the fact that their verification caught a bug that was relevant only to Windows 7 is encouraging.

QEMU is supported via the Software Freedom Conservancy, which backed Christoph Hellwig's lawsuit against VMware for GPL violations. As a result, the Conservancy is apparently seeing corporate support evaporate, placing its finances in jeopardy.
Frameworks

Second, the frameworks. The performance of the Internet Archive's JSMESS framework, now called Emularity, depends completely on the performance of the JavaScript virtual machine. Other frameworks are less dependent on it, but its performance is still important to them. The movement, supported by major browser vendors, to replace this virtual machine with a byte-code virtual machine called WebAssembly has borne fruit. A week ago four major browsers announced initial support, all running the same game, a port of Unity's Angry Bots. This should greatly reduce the pressure for multi-core and parallelism support in JavaScript, which was always likely to be a kludge. Improved performance is also likely to make in-browser emulation more competitive with techniques that need software installation and/or cloud infrastructure, reducing the barrier to entry.

The report discusses the problems GPUs pose for emulation and the efforts to provide paravirtualized GPU support in QEMU. This limited but valuable support is now mainstreamed in the Linux 4.4 kernel.

Mozilla, among others, has been working to change the way Web pages are rendered in the browser to exploit the capabilities of GPUs. Their experimental "servo" rendering engine gains a huge performance advantage by doing so. For us, this is a double-edged sword. It makes the browser dependent on GPU support in a way it wasn't before, and thus makes the task of emulating browsers harder. If, on the other hand, it means that GPU capabilities will be exposed to WebAssembly, it raises the prospect of worthwhile GPU-dependent emulations running in browsers, further reducing the barrier to entry.
Collections

Third, the collections. The Internet Archive has continued to release collections of legacy software using Emularity. The Malware Museum, a collection of currently 47 viruses from the '80s and '90s, has proven very popular, with over 850K views in about 6 weeks. The Windows 3.X Showcase, a curated sample of the over 1,500 Windows emulations in the collection, has received 380K views in the same period. It is particularly interesting because it includes a stock install of Windows 3.11. Despite that, the team has yet to receive a takedown request from Microsoft.

About the same time as my report, a team at Cornell led by Oya Rieger and Tim Murray produced a white paper for the National Endowment for the Humanities entitled Preserving and Emulating Digital Art Objects. I blogged about it. To summarize my post, I believe that outside their controlled "reading room" conditions the concern they express for experiential fidelity is underestimated, because smartphones and tablets are rapidly replacing PCs. But two other concerns, for emulator obsolescence and the fidelity of access to web resources, are overblown.
Tools

Fourth, the tools. The Internet Archive has a page describing how DOS software to be emulated can be submitted. Currently about 65 submissions a day are being received, despite the somewhat technical process it lays out. Each is given minimal initial QA to ensure that it comes up, and is then fed into the crowd-sourced QA process described in the report. It seems clear that improved tooling, especially automating the process via an interactive Web page that ran the emulation locally before submission, would result in more and better-quality submissions.
Internet of Things

The Internet of Things has been getting a lot of attention, especially the catastrophic state of IoT security. Updating the software of Things in the Internet to keep them even marginally secure is often impossible, because the Things are so cheap there are no dollars for software support and updates, and because customers have no way to tell that one device is less insecure than another. This is exactly the problem faced by preserved software that connects to the Internet, as discussed in the report. Thus efforts to improve the security of the IoT and efforts such as Freiburg's to build an "Internet Emulator" to protect emulations of preserved software may be highly synergistic.

Off on a tangent, it is worth thinking about the problems of preserving the Internet of Things. The software and hardware are intimately linked, even more so than smartphone apps. So does preserving the Internet of Things reduce to preserving the Things in the Internet, or does emulation have a role to play?
The To-Do List

To refresh your memories, here are the highlights of the To-Do List that ends the report, with some additional commentary. I introduce the list by pointing out the downsides of the lack of standardization among the current frameworks, in particular:
  • There will be multiple emulators and emulation frameworks, and they will evolve through time. Re-extracting or re-packaging preserved artefacts for different, or different versions of, emulators or emulation frameworks would be wasted effort.
  • The most appropriate framework configuration for a given user will depend on many factors, including the bandwidth and latency of their network connection, and the capabilities of their device. Thus the way in which emulations are advertised to users, for example by being embedded in a Web page, should not specify a particular framework or configuration; this should be determined individually for each access.
I stressed that:
If the access paths to the emulations link directly to evanescent services emulating the preserved artefacts, not to the artefacts themselves, the preserved artefacts are not themselves discoverable or preservable.

In summary, the To-Do list was:
  1. Standardize Preserved System Images so that the work of preparing preserved system images for emulation will not have to be redone repeatedly as emulation technology evolves, and
  2. Standardize Access To System Images and
  3. Standardize Invoking Emulators so that the work of presenting emulations of preserved system images to the "reader" will not have to be redone repeatedly as emulation technology evolves.
  4. Improve Tools For Preserving System Images: The Internet Archive's experience shows that even minimal support for submission of system images can be effective. Better support should be a high priority. If the format of system images could be standardized, submissions would be available to any interested archive.
  5. Enhance Metadata Databases: These tools, and standardized methods for invoking emulators, rely on metadata databases, which need significant enhancement for this purpose.
  6. Support Emulators: The involvement of Intel in QA-ing QEMU is a major step forward, but it must be remembered that most emulations of old software depend on enthusiast-supported emulators such as MAME/MESS. Supporting ways to improve emulator quality, such as external code reviews to identify critical quality issues and a "bounty" program for fixing them, should be a high priority. It would be important that any such program be "bottom-up"; a "top-down" approach would not work in the enthusiast-dependent emulator world.
  7. Develop Internet Emulators: Freiburg's work is already demonstrating the value of emulating software that connects to the Internet. Doing so carries significant risks, and developing technology to address them (before the risks become real and cause a backlash) needs high priority. The synergies between this and the security of the Internet of Things should be explored urgently.
  8. Tackle Legalities: As always, the legal issues are the hardest to address. I haven't heard that the PERSIST meeting in Paris last November came up with any new ideas in this area. The lack of a reaction to the Internet Archive's Windows 3.X Showcase is encouraging, and I'm looking forward to hearing whether others have made progress in this area.

Open Library Data Additions: Amazon Crawl: part o-4

Tue, 2016-03-29 14:59

Part o-4 of the Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Library of Congress: The Signal: Digital Preservation at the State Library of Massachusetts

Tue, 2016-03-29 12:47

This is a guest post by Stefanie Ramsay.

The State Library in 1890. Courtesy of the State Library of Massachusetts Special Collections.

How do you capture, preserve and make accessible thousands of born-digital documents produced by state agencies, published to various websites without notification or consistency, and often relocated or removed over time? This is the complex task that the State Library of Massachusetts faces in its legal digital preservation mandate.

My National Digital Stewardship residency involves conducting a comprehensive assessment of existing state publications, assessing how other state libraries and archives handle this challenge and establishing best practices and procedures for the preservation and accessibility of electronic state publications at the State Library. In this post, I’ll cover how we’ve approached this project, as well as our next steps.

State agencies publish thousands of documents for the public, and a legal mandate requires that they send this content to the library for preservation. Unfortunately this is a rare occurrence, leaving library staff to retrieve content using various other methods. The staff relies on a homegrown web crawler to capture publications from agency websites, but they also comb through individual agency pages and check social media and the news to spot mentions of agency publications.

Creative as these approaches may be, they do not form a sustainable practice for handling the large amounts of content that agencies produce. Before establishing a better workflow, however, the library needed a better understanding of how much material is published, what kinds of material are published, and how best to capture these materials for long-term access and preservation. With this data in hand, we can then begin to build an effective digital preservation program.

The State Library today after a complete renovation. Photo by Stefanie Ramsay.

At the beginning of my residency, we began using web statistics collected from the Massachusetts government’s main portal. The statistics show publications requested by users of the site per month. Having the URL allows us to see where these are posted and to ascertain the types of documents agencies are publishing.

We found a wide range of documents, such as annual reports, meeting materials, Executive Orders, and my personal favorite, a guide to resolving conflict between people and beavers.

After categorizing the content, we needed to narrow down a collection scope. Rather than attempting to capture every publication, I thought it best to define what types of documents are most valuable to the staff and library users and to focus our efforts on those documents (which is not to say that the lower priority items will be ignored, but that the higher priority items will be handled first, then we will develop a plan for the rest). To determine what documents were high and low priorities, we implemented a ranking process.

Each staff member ranked the publications for individual agencies on a scale of 1-5 (1 being lowest priority, 5 being highest) on shared spreadsheets, and our collective averages started to filter what was most valuable. Documents such as reports, project documents and topical issues rose to the top, while items such as draft reports, requests for proposals and ephemeral information sunk to the low priority tier.
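The averaging-and-tiering step described above can be sketched as follows. The publication types, scores, and tier cutoffs here are made up for illustration; they are not the library's actual data or thresholds:

```python
from statistics import mean

# Hypothetical staff rankings (1 = lowest priority, 5 = highest).
rankings = {
    "annual reports": [5, 4, 5],
    "meeting materials": [4, 3, 4],
    "draft reports": [2, 1, 2],
}

def tier(avg):
    """Bucket an average score into a priority tier (illustrative cutoffs)."""
    if avg >= 4:
        return "high"
    if avg <= 2:
        return "low"
    return "medium"

# List publication types from highest collective average to lowest.
for pub_type, scores in sorted(rankings.items(), key=lambda kv: -mean(kv[1])):
    print(f"{pub_type}: {mean(scores):.2f} ({tier(mean(scores))})")
```

In practice the shared spreadsheets serve the same purpose; the point is simply that collective averages, not any single ranking, determine the tier.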

This process formed the basis of our collection policy statement to be used as a guide when identifying and selecting content for ingestion. This statement is regularly updated as we continue to determine our priorities. We also began collecting metrics on the total number of documents captured by the statistics and the number of documents that fell into each priority tier. This gives us a sense of the bigger picture of not only the amount of content, but how much needs to be handled quickly and forms the basis of an argument for increased resources.

This issue is not unique to Massachusetts; every state library has a mandate to capture state government information and every state takes a different approach based on their resources, staff expertise, and constituents. In my research of how other state libraries and archives handle this mandate, one common thread emerged: at least 24 states use Archive-It as a means for capturing digital content. I was eager to investigate this, as I hoped it could be another resource for Massachusetts to use as well.

The IT department of the Executive Branch of Massachusetts state government, MassIT, has an Archive-It account and has been crawling state websites since 2008. Though the account was publicly available, MassIT had not advertised the site, as their focus was on ensuring capture of content rather than accessibility. Seizing this opportunity for collaboration, we reached out to MassIT, who granted us access to the site. We worked together to customize the metadata, and I wrote some language for the library’s website that provided instructions for our patrons on how to use Archive-It to access state publications.

The State Library stacks. Courtesy of the State Library of Massachusetts.

Our situation is a bit different in that we do not exclusively maintain the Archive-It account. However, we are using this resource in a similar way to many other state libraries and archives. Archive-It will not be our main repository for accessing state publications; the library has a DSpace digital repository that has been in place since 2006, and it will continue to be our central portal for providing enhanced access to high-priority publications.

Archive-It will act as a service for crawling, thereby ensuring the capture of more documents than we could hope to collect on our own and allowing us another means of finding material we may not have captured in the past. Using the two in concert goes a long way towards meeting the legal mandate.

With just a few months left in the residency, there is still much work to be done. We’re testing a workflow for batch downloading PDFs using a Firefox add-on called DownThemAll!, investigating how to streamline the cataloging process and conducting outreach efforts to state agencies.
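For workflows like the batch downloading mentioned above, a small script can serve as an alternative to a browser add-on. This is a minimal sketch, not the library's actual tooling; the `downloads` folder name is arbitrary and the URLs would come from the web statistics:

```python
import os
import urllib.request
from urllib.parse import urlparse

def dest_path(url, folder="downloads"):
    """Derive a local file path from a publication URL."""
    return os.path.join(folder, os.path.basename(urlparse(url).path))

def fetch_all(urls, folder="downloads"):
    """Download each PDF once, skipping files already captured."""
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        dest = dest_path(url, folder)
        if not os.path.exists(dest):  # avoid re-downloading on reruns
            urllib.request.urlretrieve(url, dest)
```

The skip-if-present check matters for this use case: crawls are repeated monthly, and only newly published documents should be fetched.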

Outreach is crucial in raising awareness of the library’s resources and services, as well as in reminding agencies about that pesky law regarding their publications. These steps form the foundation of a more sustainable digital preservation program at the State Library.

Open Knowledge Foundation: Open Data Day Buenos Aires – planning the open data agenda for 2016

Tue, 2016-03-29 12:30

This blog was written by Yamila Garcia, Open Knowledge ambassador in Argentina 

For the third time, we celebrated Open Data Day in Argentina, and we invited different groups to celebrate it with us: members of the official open government office; transparency, open data and freedom of information activists; civic innovators; journalists; and anyone interested in the progress of 21st-century open government.

March 5th marked a day of open data deliberations, where we explored the importance of open data through three main pillars: release, reuse and impact. It is a day to share ideas and projects and to open up channels of dialogue about open public information, promotion of freedom of information laws, open government in the three branches of the state, strengthening democracy, promoting citizen participation, and the generation of public and civic innovation. Fundación Conocimiento Abierto, together with the Argentinian civil society organizations Democracia en Red, Asociación Civil por la Igualdad y la Justicia, Directorio Legislativo and FOPEA, had the honour of receiving 250 participants at #ODD16. The event was supported by ILDA.

In the last two years, we invited open data projects in Argentina and practitioners from different fields, such as academia, government (in all branches), journalism and civic hacking, to join us under the same roof and present their projects. This year we decided to shake things up and held an event with the following activities:

  • Panels: we had four central panels with the following topics:
    • “Progress for a law on access to information” with Laura Alonso (anticorruption office holder), Fernando Sanchez (national legislature), Government officials and José Crettaz (Journalist of La Nación Newspaper).
    • “Challenges for open government” with Rudi Borrmann (National Subsecretary of Innovation and open government ), Carolina Cornejo (ACIJ), Alvaro Herrero ( City government) and Gustavo Bevilaqcua (national legislature).
    • “Civic hackers and open data” with four well-known civic hackers
    • “Local governments and opening up information” with five representatives of local government innovation


  • An open space of dialogue with mentors in the following  topics:
    • Innovating in the public sector,
    • Challenges for a law on access to public information
    • Municipalities’ progress in open government: how to achieve citizen participation channels?
    • OGP agenda in Argentina
    • Challenges for open municipal government with open data
    • The impact of open data: How to measure results?
    • Codeando Argentina: cooperation between governments and civic hackers, and open parliament

We accomplished the goal of gathering all the areas that work on open data to shape the Open Data Agenda for 2016 in a collaborative way. Each year this community grows more and more. In 2017, we expect to hold Open Data Day events in other parts of the country, not only in Buenos Aires. Every year brings more challenges, and we are happy to have Open Data Day to tackle them.




Terry Reese: MarcEdit Mac: Export Tab Delimited Records

Tue, 2016-03-29 04:15

As part of the last update, I added a new feature that is only available in the Mac version of MarcEdit at this point. One of the things that had been missing in the Export Tab Delimited tool was the ability to save and load one’s export settings. I added that as part of the most recent update. At the same time, I thought that the ability to batch process multiple files using the same criteria might be useful as well. So this has been added to the Mac interface as well.

In the image above, you initiate batch processing mode by checking the batch process checkbox. This changes the MARC file and save file textboxes and buttons to directory paths. You will also be prompted to select a file extension to process.

I’m not sure if this will be useful — but as I’m working through new functionality, I’ll be noting changes being made to the MarcEdit Mac version.  And this is notable, because this is the first time that the Mac version contains functionality that is not in the Windows version.


Terry Reese: MarcEdit Bug Fix Update

Tue, 2016-03-29 04:05

I’ve posted a new update for all versions of MarcEdit this afternoon.  Last night, when I posted the new update, I introduced a bug into the RDA Helper that rendered it basically unusable.  When adding functionality to the tool to enable support for abbreviations at a subfield level, I introduced a problem that removed the subfield codes from fields where abbreviations would take place.

So what does this bug look like?  When processing data, a set field that would look like this:
=300  \\$a1 vol $b ill.

would be replaced to look like:
=300  \\a1 volume b illustrations

As one can see, the delimiter symbol “$” has been removed. This occurred in all fields where data abbreviations were taking place. It has been corrected with this update.
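The general idea of expanding abbreviations while leaving subfield codes intact can be sketched like this. This is not MarcEdit's actual implementation, and the abbreviation table is a toy; it just illustrates why the replacement logic must treat `$`-prefixed subfield codes as off-limits:

```python
import re

# Toy abbreviation table; a real RDA abbreviation list is much larger.
ABBREVIATIONS = {"vol": "volume", "ill.": "illustrations"}

def expand(field):
    """Expand abbreviations in a mnemonic MARC field while leaving
    subfield delimiters ($a, $b, ...) untouched."""
    def repl(match):
        return ABBREVIATIONS.get(match.group(0), match.group(0))
    # Match word tokens (optionally ending in a period), but never a
    # token that immediately follows a "$" subfield delimiter.
    return re.sub(r"(?<!\$)\b\w+\.?", repl, field)

print(expand(r"=300  \\$a1 vol $b ill."))
# prints: =300  \\$a1 volume $b illustrations
```

The bug described above corresponds to a replacement pattern that matched the subfield codes themselves, stripping the `$` along with the abbreviated text.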

You can get the update from the downloads page: or via the automated update tool.


DuraSpace News: VIVO/Fedora Integration Forum

Tue, 2016-03-29 00:00

From Andrew Woods, Fedora Tech Lead

Austin, TX  There has been increasing interest and discussion around the opportunities of an integration between VIVO and Fedora 4. In an effort to provide a public forum for detailing use cases, identifying related initiatives, gaining consensus on integration patterns, etc., the following mailing list has been created:

DuraSpace News: NOW AVAILABLE: DSpace-CRIS 5.5.0

Tue, 2016-03-29 00:00

From Andrea Bollini, Head of Open Source and Open Standards Strategy, Cineca

Rome, Italy  I'm glad to announce the availability of the 5.5.0 version of DSpace-CRIS built on top of DSpace JSPUI 5.5.

Open Library Data Additions: Amazon Crawl: part dn

Mon, 2016-03-28 20:54

Part dn of the Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

LITA: Universal Design for Libraries and Librarians, an important LITA web course

Mon, 2016-03-28 19:01

Consider this important new LITA web course:
Universal Design for Libraries and Librarians

Instructors: Jessica Olin, Director of the Library, Robert H. Parker Library, Wesley College; and Holly Mabry, Digital Services Librarian, Gardner-Webb University

Offered: April 11 – May 27, 2016
A Moodle based web course with asynchronous weekly content lessons, tutorials, assignments, and groups discussion.

Register Online, page arranged by session date (login required)

Universal Design is the idea of designing products, places, and experiences to make them accessible to as broad a spectrum of people as possible, without requiring special modifications or adaptations. This course will present an overview of universal design as a historical movement, as a philosophy, and as an applicable set of tools. Students will learn about the diversity of experiences and capabilities that people have, including disabilities (e.g. physical, learning, cognitive, resulting from age and/or accident), cultural backgrounds, and other abilities. The class will also give students the opportunity to redesign specific products or environments to make them more universally accessible and usable.


By the end of this class, students will be able to…

  • Articulate the ethical, philosophical, and practical aspects of Universal Design as a method and movement – both in general and as it relates to their specific work and life circumstances
  • Demonstrate the specific pedagogical, ethical, and customer service benefits of using Universal Design principles to develop and recreate library spaces and services in order to make them more broadly accessible
  • Integrate the ideals and practicalities of Universal Design into library spaces and services via a continuous critique and evaluation cycle

Here’s the Course Page

Jessica Olin

Jessica Olin is the Director of the Library, Robert H. Parker Library, Wesley College. Ms. Olin received her MLIS from Simmons College in 2003 and an MAEd, with a concentration in Adult Education, from Touro University International. Her first position in higher education was at Landmark College, a college specifically geared to meeting the unique needs of people with learning differences. While at Landmark, Ms. Olin learned about the ethical, theoretical, and practical aspects of universal design. She has since taught an undergraduate course on the subject for both the education and the entrepreneurship departments at Hiram College.

Holly Mabry

Holly Mabry received her MLIS from UNC-Greensboro in 2009. She is currently the Digital Services Librarian at Gardner-Webb University where she manages the university’s institutional repository, and teaches the library’s for-credit online research skills course. She also works for an international virtual reference service called Chatstaff. Since finishing her MLIS, she has done several presentations at local and national library conferences on implementing universal design in libraries with a focus on accessibility for patrons with disabilities.


February 29 – March 31, 2016


  • LITA Member: $135
  • ALA Member: $195
  • Non-member: $260

Technical Requirements:

Moodle login info will be sent to registrants the week prior to the start date. The Moodle-developed course site will include weekly new content lessons and is composed of self-paced modules with facilitated interaction led by the instructor. Students regularly use the forum and chat room functions to facilitate their class participation. The course web site will be open for 1 week prior to the start date for students to have access to Moodle instructions and set their browser correctly. The course site will remain open for 90 days after the end date for students to refer back to course material.

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty.

Mark E. Phillips: Beginning to look at the description field in the DPLA

Mon, 2016-03-28 15:00

Last year I took a look at the subject field and the date fields in the Digital Public Library of America (DPLA).  This time around I wanted to begin looking at the description field and see what I could see.

Before diving into the analysis, I think it is important to take a look at a few things. First, if you reference the DPLA Metadata Application Profile v4, you may notice that the description field is not a required field; in fact, the field doesn’t show up in APPENDIX B: REQUIRED, REQUIRED IF AVAILABLE, AND RECOMMENDED PROPERTIES. From that you can assume that this field is very optional. Also, the description field, when present, is often used to communicate a variety of information to the user. The DPLA data has examples that are clearly rights statements, notes, physical descriptions of the item, content descriptions of the item, and in some instances a place to store identifiers or names. Of all of the fields that one will come into contact with in the DPLA dataset, I would imagine that the description field is probably one of the ones with the highest variability of content. So with that giant caveat, let’s get started.

So on to the data.

The DPLA makes available a data dump of the metadata in their system. Last year I was analyzing just over 8 million records; this year the collection has grown to more than 11 million records (11,654,800 in the dataset I’m using).

The first thing that I had to accomplish was to pull out just the descriptions from the full json dataset that I downloaded.  I was interested in three values for each record, specifically the Provider or “Hub”, the DPLA identifier for the item and finally the description fields.  I finally took the time to look at jq, which made this pretty easy.

For those that are interested here is what I came up with to extract the data I wanted.

zcat all.json.gz | jq -nc --stream --compact-output '. | fromstream(1|truncate_stream(inputs)) | {provider: (._source.provider["@id"]), id: ._id, descriptions: ._source.sourceResource.description?}'

This results in an output that looks like this.

{"provider":"","id":"4fce5c56d60170c685f1dc4ae8fb04bf","descriptions":["Lang: Charles Aikin Collection"]}
{"provider":"","id":"bca3f20535ed74edb20df6c738184a84","descriptions":["Lang: Maire, graveur."]}
{"provider":"","id":"76ceb3f9105098f69809b47aacd4e4e0","descriptions":null}
{"provider":"","id":"88c69f6d29b5dd37e912f7f0660c67c6","descriptions":null}

From there my plan was to write some short python scripts that can read a line, convert it from json into a python object and then do programmy stuff with it.
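As a sketch of that per-line approach (the file name and tallying logic here are my own assumptions for illustration, not the exact scripts used in the post):

```python
import json
from collections import Counter

def count_descriptions(lines):
    """Tally, per provider, how many records have at least one
    description field and how many have none."""
    with_desc, without_desc = Counter(), Counter()
    for line in lines:
        record = json.loads(line)
        # Records whose provider was lost in ingest show up with an
        # empty string; group them under "undefined_provider".
        provider = record["provider"] or "undefined_provider"
        if record["descriptions"]:
            with_desc[provider] += 1
        else:
            without_desc[provider] += 1
    return with_desc, without_desc

# In practice the lines would come from the jq output saved to a file:
#   with open("descriptions.jsonl") as fh:
#       has_desc, no_desc = count_descriptions(fh)
sample = [
    '{"provider":"example-hub","id":"abc","descriptions":["A note"]}',
    '{"provider":"example-hub","id":"def","descriptions":null}',
]
has_desc, no_desc = count_descriptions(sample)
print(has_desc["example-hub"], no_desc["example-hub"])  # 1 1
```

Since each jq output line is one compact JSON object, the whole dump can be processed line by line without loading it all into memory.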

Who has what?

After parsing the data a bit I wanted to remind myself of the spread of the data in the DPLA collection.  There is a page on the DPLA’s site that shows you how many records have been contributed by which Hub in the network.  This is helpful but I wanted to draw a bar graph to give a visual representation of this data.

DPLA Partner Records

As has been the case since it was added, HathiTrust is the biggest provider of records to the DPLA with over 2.4 million records. Pretty amazing!

There are three other Hubs/Providers that contribute over 1 million records each: the Smithsonian, New York Public Library, and the University of Southern California Libraries. Below those, three more contribute over half a million records each: Mountain West Digital Library, the National Archives and Records Administration (NARA), and The Portal to Texas History.

There were 11,410 records (coded as undefined_provider) that are not currently associated with a Hub/Provider,  probably a data conversion error somewhere during the record ingest pipeline.

Which have descriptions?

After the reminder about the size and shape of the Hubs/Providers in the DPLA dataset, we can dive right into the data and see quickly how well represented in the data the description field is.

We can start off with another graph.

Percent of Hubs/Providers with and without descriptions

You can see that some of the Hubs/Providers have very few records (< 2%) with descriptions (Kentucky Digital Library, NARA) while others had a very high percentage (> 95%) of records with description fields present (David Rumsey, Digital Commonwealth, Digital Library of Georgia, J. Paul Getty Trust, Government Publishing Office, The Portal to Texas History, Tennessee Digital Library, and the University of Illinois at Urbana-Champaign).

Below is a full breakdown for each Hub/Provider showing how many and what percentage of the records have zero descriptions, or one or more descriptions.

Provider                      Records     0 Descriptions  1+ Descriptions  0 Desc. %  1+ Desc. %
artstor                       107,665     40,851          66,814           37.94%     62.06%
bhl                           123,472     64,928          58,544           52.59%     47.41%
cdl                           312,573     80,450          232,123          25.74%     74.26%
david_rumsey                  65,244      168             65,076           0.26%      99.74%
digital-commonwealth          222,102     8,932           213,170          4.02%      95.98%
digitalnc                     281,087     70,583          210,504          25.11%     74.89%
esdn                          197,396     48,660          148,736          24.65%     75.35%
georgia                       373,083     9,344           363,739          2.50%      97.50%
getty                         95,908      229             95,679           0.24%      99.76%
gpo                           158,228     207             158,021          0.13%      99.87%
harvard                       14,112      3,106           11,006           22.01%     77.99%
hathitrust                    2,474,530   1,068,159       1,406,371        43.17%     56.83%
indiana                       62,695      18,819          43,876           30.02%     69.98%
internet_archive              212,902     40,877          172,025          19.20%     80.80%
kdl                           144,202     142,268         1,934            98.66%     1.34%
mdl                           483,086     44,989          438,097          9.31%      90.69%
missouri-hub                  144,424     17,808          126,616          12.33%     87.67%
mwdl                          932,808     57,899          874,909          6.21%      93.79%
nara                          700,948     692,759         8,189            98.83%     1.17%
nypl                          1,170,436   775,361         395,075          66.25%     33.75%
scdl                          159,092     33,036          126,056          20.77%     79.23%
smithsonian                   1,250,705   68,871          1,181,834        5.51%      94.49%
the_portal_to_texas_history   649,276     125             649,151          0.02%      99.98%
tn                            151,334     2,463           148,871          1.63%      98.37%
uiuc                          18,231      127             18,104           0.70%      99.30%
undefined_provider            11,422      11,410          12               99.89%     0.11%
usc                           1,065,641   852,076         213,565          79.96%     20.04%
virginia                      30,174      21,081          9,093            69.86%     30.14%
washington                    42,024      8,838           33,186           21.03%     78.97%

With so many of the Hub/Providers having a high percentage of records with descriptions, I was curious about the overall records in the DPLA.  Below is a pie chart that shows you what I found.

DPLA records with and without descriptions

Almost 2/3 of the records in the DPLA have at least one description field. This is more than I would have expected for an un-required, un-recommended field, but I think it is probably a good thing.

Descriptions per record

The final thing I wanted to look at in this post was the average number of description fields for each of the Hubs/Providers.  This time we will start off with the data table below.

Provider                      Records     min  median  max  mean  stddev
artstor                       107,665     0    1       5    0.82  0.84
bhl                           123,472     0    0       1    0.47  0.50
cdl                           312,573     0    1       10   1.55  1.46
david_rumsey                  65,244      0    3       4    2.55  0.80
digital-commonwealth          222,102     0    2       17   2.01  1.15
digitalnc                     281,087     0    1       19   0.86  0.67
esdn                          197,396     0    1       1    0.75  0.43
georgia                       373,083     0    2       98   2.32  1.56
getty                         95,908      0    2       25   2.75  2.59
gpo                           158,228     0    4       65   4.37  2.53
harvard                       14,112      0    1       11   1.46  1.24
hathitrust                    2,474,530   0    1       77   1.22  1.57
indiana                       62,695      0    1       98   0.91  1.21
internet_archive              212,902     0    2       35   2.27  2.29
kdl                           144,202     0    0       1    0.01  0.12
mdl                           483,086     0    1       1    0.91  0.29
missouri-hub                  144,424     0    1       16   1.05  0.70
mwdl                          932,808     0    1       15   1.22  0.86
nara                          700,948     0    0       1    0.01  0.11
nypl                          1,170,436   0    0       2    0.34  0.47
scdl                          159,092     0    1       16   0.80  0.41
smithsonian                   1,250,705   0    2       179  2.19  1.94
the_portal_to_texas_history   649,276     0    2       3    1.96  0.20
tn                            151,334     0    1       1    0.98  0.13
uiuc                          18,231      0    3       25   3.47  2.13
undefined_provider            11,422      0    0       4    0.00  0.08
usc                           1,065,641   0    0       6    0.21  0.43
virginia                      30,174      0    0       1    0.30  0.46
washington                    42,024      0    1       1    0.79  0.41
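Summary statistics of this kind can be reproduced with Python’s statistics module once you have a list of per-record description counts for a provider. A minimal sketch (the sample counts are invented, and since the post doesn’t say which standard deviation was used, this uses the population version):

```python
import statistics

def summarize(counts):
    """Return min, median, max, mean and population stddev for a list
    of per-record description counts from one Hub/Provider."""
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "mean": round(statistics.mean(counts), 2),
        "stddev": round(statistics.pstdev(counts), 2),
    }

# Example: four records with 0, 1, 1 and 2 description fields
print(summarize([0, 1, 1, 2]))
```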

This time with an image

Average number of descriptions per record

You can see that there are several Hubs/Providers that have multiple descriptions per record, with the Government Publishing Office coming in at 4.37 descriptions per record.

I found it interesting that when you exclude the two Hubs/Providers that don’t really do descriptions (KDL and NARA), you see two that have a very low standard deviation from their mean: Tennessee Digital Library at 0.13 and The Portal to Texas History at 0.20 don’t drift much from their almost one description per record for Tennessee and almost two descriptions per record for Texas. It makes me think that this is probably a set of records that each of those Hubs/Providers would like to have identified so they could go in and add a few descriptions.


Well, that wraps up this post, which I hope is the first in a series about the description field in the DPLA dataset. In subsequent posts we will move away from record-level analysis of description fields and get down to the field level to do some analysis of the descriptions themselves. I have a number of predictions, but I will hold onto those for now.

If you have questions or comments about this post,  please let me know via Twitter.

Open Knowledge Foundation: Open Data Day 2016 Malaysia Data Expedition – Measuring Provision of Public Services for Education

Mon, 2016-03-28 10:06

This blog post was written by the members of the Sinar project in Malaysia 

In Malaysia, Sinar Project with the support of Open Knowledge International organised a one-day data expedition based on the guide from School of Data to search for data related to government provision of health and education services. This brought together a group of people with diverse skills to formulate questions of public interest. The data sourced would be used for analysis and visualisation in order to provide answers.

Data Expedition

A data expedition is a quest to explore uncharted areas of data and report on those findings. The participants with different skillsets gathered throughout the day at the Sinar Project office. Together they explored data relating to schools and clinics to see what data and analysis methods are available to gain insights on the public service provision for education and health.

We used the guides and outlines for the data expedition from School of Data website. The role playing guides worked as a great ice breaker. There was healthy competition on who could draw the best giraffes for those wanting to prove their mettle as a designer for the team.



Deciding what to explore, education or health?

The storyteller in the team, who was a professional journalist, started out with a few questions to explore.

  • Are there villages or towns which are far away from schools?
  • Are there villages or towns which are far away from clinics and hospitals?
  • What is the population density and provision of clinics and schools?

The scouts then went on a preliminary exploration to see whether this data exists.

Looking for the Lost City of Open Data

The Scouts, with the aid of the rest of the team, looked for data that could answer the questions. They found a lot of usable data on the Malaysian government open data portal. This data included lists of all public schools and clinics with addresses, as well as numbers of teachers for each district.

It was decided by the team that given the time limitation, the focus would be to answer the questions on education data. Another priority was to find data relating to class sizes to see if schools are overcrowded or not. Below you can see the data that the team found. 

Education Open Data

Data in Reports



Not all schools are created equal; there are different types, and some are considered high-achieving schools, or Sekolah Berprestasi Tinggi.

Health Open Data

GIS


Other Data

CIDB Construction Projects contains relevant information, such as the construction of schools and clinics (a script to import the data into Elasticsearch is available).


Sinar Project had some budgets available as open data, at state and federal levels, that could be used as an additional reference point. These were created as part of the Open Spending project.

Selangor State Government

Federal Government: Higher Education, Education

Methodology

The team opted to focus on the available datasets to answer questions about education provision, by first geocoding all school addresses, and then joining up data to find the relationship between enrollments and school and teacher ratios.

Joining up data

To join up the different data sets, such as teacher numbers and schools, the VLOOKUP function in Excel was used to join rows by school code.
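The same VLOOKUP-style join can be sketched in Python with a dictionary keyed on school code (the column names SchoolCode, Name and Teachers are assumptions for illustration, not the actual headers in the datasets):

```python
import csv
import io

def join_by_school_code(schools_csv, teachers_csv):
    """Left-join teacher counts onto school rows by school code,
    mirroring what VLOOKUP does in Excel."""
    teachers = {row["SchoolCode"]: row["Teachers"]
                for row in csv.DictReader(io.StringIO(teachers_csv))}
    joined = []
    for row in csv.DictReader(io.StringIO(schools_csv)):
        # Schools with no matching teacher row get an empty cell,
        # like an unmatched VLOOKUP.
        row["Teachers"] = teachers.get(row["SchoolCode"], "")
        joined.append(row)
    return joined

schools = "SchoolCode,Name\nB01,SK Example\nB02,SMK Example\n"
teacher_counts = "SchoolCode,Teachers\nB01,42\n"
for row in join_by_school_code(schools, teacher_counts):
    print(row["Name"], row["Teachers"])
```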

Converting Address to geolocation (latlong)

To convert street addresses to latitude/longitude coordinates, we used the dataset with the cleansed addresses along with the geocoding tool csvgeocode:

./node_modules/.bin/csvgeocode ./input.csv ./output.csv --url "{{Alamat}}&key=" --verbose

Convert the completed CSV to GeoJSON points

Use the csv2geojson tool:

csv2geojson --lat "Lat" --lon "Lng" Selangor_Joined_Up_Moe.csv

To get population by PBT

Use the data from the state economic planning unit agency site for socio-economic data, specifically section Jadual 8.

To get all the schools separated by individual PBT (District)

Use the GeoJSON of the schools data and the PBT boundaries loaded into QGIS, and use the Vector > Geo-processing > Intersect option.

A post from Stack Exchange suggests it might be better to use the Vector > Spatial Query > Spatial Query option.

Open Datasets Generated

The cleansed and joined-up datasets created during this expedition are made available on GitHub. While the focus was on education, due to the similarity in the available data the methods were also applied to clinics. See it on our repository.

Visualizations

All Primary and Secondary Schools on a Map with Google Fusion Tables

Teacher to Students per School Ratios


  • Teachers vs enrollment did not provide data relating to class size or overcrowding
  • Demographic datasets to measure schools to eligible population
  • More school datasets required for teachers, specifically by subject and class ratios
  • Methods used for location of schools can also be applied to clinics & hospital data

It was discovered that additional data was needed to provide useful information on the quality of education. There was not enough demographic data found to check against the number of schools in a particular district. The teacher-to-student ratio was also not a good indicator of the problems reported in the news: the teacher-to-enrollment ratios were generally very low, with a mean of 13 and a median of 14. What was needed was a ratio by subject teacher, class size, or a comparison against the population of eligible children in each area, to provide better insights.

Automatically calculating the distance from points was also considered and matched up with whether there are school bus operators in the area. This was discussed because the distance from schools may not be relevant for rural areas, where there were not enough children to warrant a school within the distance policy. A tool to check distance from a point to the nearest school could be built with the data made available. This could be useful for civil society to use data as evidence to prove that distance was too far or transport not provided for some communities.
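One sketch of such a nearest-school check uses the haversine great-circle distance over the geocoded points (the school names and coordinates below are fabricated for illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_school(lat, lon, schools):
    """Return (name, distance_km) of the closest school to a point."""
    return min(
        ((name, haversine_km(lat, lon, slat, slon))
         for name, slat, slon in schools),
        key=lambda pair: pair[1],
    )

# Fabricated example points around Kuala Lumpur
schools = [("School A", 3.1390, 101.6869), ("School B", 3.0738, 101.5183)]
name, dist = nearest_school(3.1400, 101.6900, schools)
print(name, round(dist, 2))
```

Running this over every village or town centroid against the geocoded school list would give the distance-as-evidence tool described above.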

Demographic data was found for local councils; this could be used by researchers using local council boundary data on whether there were enough schools against the population of local councils. Interestingly in Malaysia, education is under Federal government and despite having state and local education departments, the administrative boundaries do not match up with local council boundaries or electoral boundaries. This is a planning coordination challenge for policy makers. Administrative local council boundary data was made available as open data thanks to the efforts of another civil society group Tindak Malaysia, which scanned and digitized the electoral and administrative boundaries manually.

Running future expeditions

This was a one day expedition so it was time limited. For running these brief expeditions we learned the following:

  • Focus and narrow down expedition to specific issue
  • Be better prepared, scout for available datasets beforehand and determine topic
  • Focus on central repository or wiki of available data

Thank you to all of the wonderful contributors to the data expedition:

  • Lim Hui Ying (Storyteller)
  • Haris Subandie (Engineer)
  • Jack Khor (Designer)
  • Chow Chee Leong (Analyst)
  • Donaldson Tan (Engineer)
  • Michael Leow (Engineer)
  • Sze Ming (Designer)
  • Swee Meng (Engineer)
  • Hazwany (Nany) Jamaluddin (Analyst)
  • Loo (Scout)

Terry Reese: MarcEdit Updates

Mon, 2016-03-28 05:24

I spent some time this week working through a few updates based on some feedback I’ve gotten over the past couple of weeks. Most of the updates at this point are focused on the Windows/Linux builds, but the Mac build has been updated as well, and all new functionality found in the linking libraries and RDA changes applies there too. I’ll be spending this week focusing on the Mac MarcEdit UI to continue working towards functional parity with the Windows version.

Windows/Linux Updates:

* 6.2.100
** Bug Fix: Build Links Tool — when processing a FAST heading without a control number, the search would fail.  This has been corrected.
** Bug Fix: MarcEditor — when using the convenience function that allows you to open mrc files directly into the MarcEditor and saving directly back to the mrc file — when using a task, this function would be disconnected.  This has been corrected.
** Enhancement: ILS Integration — added code to enable the use of profiles.
** Enhancement: ILS Integration — added a new select option so users can select from existing Z39.50 servers.
** Enhancement: OAI Harvesting — Added a debug URL string so that users can see the URL MarcEdit will be using to query the users server.
** UI Change: OAI Harvesting — UI has been changed to have the data always expanded.
** Enhancement: MarcValidator — Rules file has been updated to include some missing fields.
** Enhancement: MarcValidator — Rules file includes a new parameter: subfield, which defines the valid subfields within a field.  If a subfield appears that is not in this list, it will mark the record as an error.
** Enhancement: Task Menu — Task menu items have been truncated according to Windows convention.  I’ve expanded those values so users can see approximately 45 characters of a task name.
** Cleanup: Validate Headings — did a little work on the validate headings to clean up some old code.  Finishing prep to start allowing indexes beyond LCSH based on the rules file developed for the build links tool.


Mac Updates:

** 1.4.43 ChangeLog
* Bug Fix: Build Link Tool: Generating FAST headings would work when an identifier was in the record, but the tool wasn’t correctly finding the data when searching without one.
* Enhancement: RDA Helper:  Rules file has been updated and code now exists to allow users to define subfields that are valid.
* Bug Fix: RDA Helper: Updated library to correct a processing error when handling unicode replacement of characters in the 264.
* Enhancement: RDA Helper: Users can now define fields by subfield.  I.E. =245$c and abbreviation expansion will only occur over the defined subfields.

MarcValidator Changes:

One of the significant changes in the program this time around has been a change in how the Validator works.  The Validator currently looks at data present, and determines if that data has been used correctly.  I’ve added a new field in the validator rules file called subfield (Example block):

# Uncomment these lines and add validation routines like:
#valida    [^0-9x]    Valid Characters
#validz    [^0-9x]    Valid Characters
ind1    blank    Undefined
ind2    blank    Undefined
subfield    acqz68    Valid Subfields
a    NR    International Standard Book Number
c    NR    Terms of availability
q    R    Qualifier
z    R    Canceled/invalid ISBN
6    NR    Linkage
8    R    Field link and sequence number

The new block is the subfield item – here the tool defines all the subfields that are valid for this field.  If this element is defined and a subfield shows up that isn’t defined, you will receive an error message letting you know that the record has a field with an improper subfield in it.
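To illustrate the idea, here is a short sketch of the check (my own illustration of the rule, not MarcEdit’s actual implementation):

```python
def invalid_subfields(field_subfields, allowed):
    """Return the subfield codes present in a field that are not in the
    rule's allowed list, e.g. allowed='acqz68' for the block above."""
    return [code for code in field_subfields if code not in set(allowed)]

# A field carrying subfields $a, $q and $b, checked against 'acqz68'
print(invalid_subfields(["a", "q", "b"], "acqz68"))  # ['b']
```

Any code returned would trigger the “improper subfield” error message described above.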

RDA Helper

The other big change came in the RDA Helper.  Here I added the ability for the abbreviation field to be defined at a finer granularity.  Up to this point, abbreviation definitions happened at the field or field group level.  Users can now define down to the subfield level.  For example, if the user wanted to target just the 245$c for abbreviations but leave all other 245 subfields alone, one would just define =245$c in the abbreviation field definition file.  If you want to define multiple subfields for processing, define each as its own unit, i.e.:

You can get the download from the MarcEdit website or via the MarcEdit automatic download functionality.

Questions, let me know.


DuraSpace News: VIVO Updates for March 27–User Group Meeting Details, Membership Focus

Mon, 2016-03-28 00:00

From Mike Conlon, VIVO Project Director

VIVO User Group Meeting.  Registration is open.  VIVO User Group Meeting #1 will be held May 5-6 at the Galter Health Science Library at Northwestern in Chicago!

Open Library Data Additions: Western Washington University MARC

Sun, 2016-03-27 05:32

MARC records from Western Washington University. A late addition to marc_oregon_summit_records.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Metadata

Ed Summers: Revisiting Archive Collections

Sun, 2016-03-27 04:00

For my Qualitative Research Methods class this week I was asked to find a research article in my field of interest that uses focus groups, and to write up a short summary of the article and critique their use of focus groups as a research method.

After what seemed like a bit too much searching around in Google Scholar I eventually ran across an article written by Jon Newman of Lambeth Archive in 2012 about the experiences that a group of archives in the South East of England had with participatory cataloging of their collections (Newman, 2012). The archives Newman discusses were all participants in the Mandeville Legacy Project who were attempting to provide better access to their archival collections related to the topic of disabilities and rehabilitation. They chose to use a technique pioneered in the museum community called Revisiting Collections which was adapted specifically for the archives as Revisiting Archive Collections or RAC.

RAC is a technique that was designed by the Collections Trust in the UK to try to make archival descriptions more inclusive, accurate and complete by including the contributions of individuals outside of the archival profession. In the words of the RAC Toolkit:

A key strength of Revisiting Collections is that it provides a framework for embedding new understanding and perspectives on objects and records directly within the museum or archive’s collection knowledge management system, ensuring that it forms part of the story about the collections that is recorded and made accessible to all.

RAC’s framework includes community based focus groups which bring individuals into contact with archival materials and elicit knowledge sharing as well as new narrative and documentation. RAC is similar in spirit to other methods for achieving participatory archives, but is different because it uses actual focus groups rather than a Web based crowd sourcing approach. The framework includes detailed instructions for running a RAC focus group including:

  • how to select participants
  • how to select materials
  • consent
  • prompt questions
  • data collection
  • attribution
  • room setup
  • starting/ending the session
  • follow up after the session

The essential idea is that people who have direct experience of the subject material have much to offer in the description of the records. Newman connects RAC’s theoretical stance of involving more voices in the production of archives with the work of Terry Cook, Tom Nesmith, Verne Harris, Wendy Duff and Eric Ketelaar. This constellation of archival theory has been actively dismantling the Jenkinsonian notion of the archivist as a neutral, informed, anonymous and monolithic voice. It is not simply a stylistic choice, but a foundational point about recognizing the archivist’s and archive’s role in shaping the historical record. RAC is an example of connecting this theory to actual practice.

So Newman isn’t using focus groups as a research method in this study, but is instead reflecting on the use of focus groups as a technique for generating more complete and useful archival descriptions. To do this he provides case studies that reflect the implementation of RAC in 5 county records offices.

He found that in all these cases work still needed to be done to integrate the results of the focus group sessions into the archival descriptions themselves. Part of the problem lay in how well the archival standards and systems accommodate this new type of community- or user-centered information. Museums in the UK (at least in 2012) have SPECTRUM, a standard that includes guidance for adding user-generated information, and the standard is implemented in museum collection management systems. Newman found that the guidance on how RAC fits with ISAD(G)-based archival description systems was not enough to get the newly acquired information into archival systems.

However Newman also found that the focus group sessions generated powerful, revealing and creative descriptions of the records which were highly valuable. The interactions between the archive and the external partners led to increased levels of engagement and trust that was deemed extremely useful by both parties. Using visual material from the archives was an effective way to generate discussion in the focus groups.

Newman noted that some archivists had uncertainty about how to add the emotive content of these contributions to the archival description. To my eye this seemed like perhaps some were still clinging to the notion that archival descriptions were unbiased and neutral. Indeed, I noticed that the RAC guidelines themselves recommended only adding acquired content if it was deemed neutral:

Information that is destined for the ISAD(G) catalogue may be used verbatim if you consider it to be neutral, factual and verifiable. It is more likely, however, to be a trigger for the archivist to revisit the catalogue, investigate or authenticate the new information that has been offered and rework the existing description. (p. 24)

Of course revisiting the catalog to revise is a bit of a luxury, especially when many archives have large backlogs of records that lack any description at all.

The RAC guidelines also require attribution when adding to the official archival description. This in turn requires obtaining consent from the focus group participants. But Newman observed that there was occasionally some uncertainty about how this consent and attribution worked in situations, like students’ names, where privacy came into play.

A big part of the work of conducting the focus groups is in the data analysis afterwards. RAC provides guidance on how to mark up the focus group transcripts using 5 categories:

  • ISAD(G) catalog
  • keywords
  • subject guide
  • free text

As any researcher will tell you this markup process in itself can be highly time consuming. I think it would’ve benefited Newman’s article to examine how participating archives were able to perform this step: how much they did it, and what categories of information were most acquired. Some basic statistics such as the number of focus groups conducted by institution, the number, ages, backgrounds of participants, and time spent may have been difficult to acquire but would’ve helped get more of a sense of the scope of the work. In addition it would’ve been interesting to learn more about how focus group participants were selected.

Despite these shortcomings I enjoyed Newman’s analysis, and am sympathetic to the theoretical goals of the RAC project. It is a useful example of putting post-Foucauldian critiques of the archive into practice, without waving the crowdsourcing magic wand. I think a useful extension of this work would be to dive a bit deeper into how participating archives routed around their archival systems by adding content to websites and/or subject guides, and to contemplate how archival description could be linked to that larger body of documentation.


Newman, J. (2012). Revisiting archive collections: Developing models for participatory cataloguing. Journal of the Society of Archivists, 33(1), 57–73.