Feed aggregator

Journal of Web Librarianship: A Review of "Library Analytics and Metrics"

planet code4lib - Wed, 2016-04-06 08:17
Robert J. Vander Hart

Journal of Web Librarianship: A Review of "Responsive Web Design in Practice"

planet code4lib - Wed, 2016-04-06 08:17
Rachel E. Vacek

Journal of Web Librarianship: A Review of "Mobile Technologies for Every Library"

planet code4lib - Wed, 2016-04-06 08:16
Mat T. Wilson

DuraSpace News: The DSpace Community Comes Together Around a New Vision and Mission Statement

planet code4lib - Wed, 2016-04-06 00:00

Austin, TX - The DSpace community has adopted a new mission and vision statement, developed by the mission and vision working group and based on background work completed by the community over the past several years.

DuraSpace News: OR2016 keynotes and Accepted Contributions Announced; Early Bird Deadline is April 13

planet code4lib - Wed, 2016-04-06 00:00

From Dermot Frost, Chair, OR2016 Host Committee and David Minor, Matthias Razum, and Sarah Shreeves, Co-Chairs, OR2016 Program Committee

Dublin, Ireland - Open Repositories 2016, to be held in Dublin, Ireland, June 13th-16th, is pleased to announce our opening and closing keynote speakers: Laura Czerniewicz and Rufus Pollock. Read below for more information about both.

Open Knowledge Foundation: Diplohack in Brussels – The first hack in the Council of the European Union

planet code4lib - Tue, 2016-04-05 22:27

For the first time in history, we can hack from inside the Council of the European Union building! Join us at #Diplohack in Brussels in the Council of the European Union on the 29-30 of April.

We invite everyone to take part, whether you’re a diplomat, developer, designer, citizen, student, journalist or activist. We will connect different profiles together in teams to use European data for good.

The idea is that you create a prototype or MVP (minimum viable product) with this data in just 24 hours that is focused on transparency and decision-making. We will support you in any way possible, explain the data and help you get started.

Diplohack, as the hackathon is called, forms part of the transparency strategy of the Dutch Presidency of the Council of the European Union. The Brussels Diplohack will run for 24 hours straight and is one of several Diplohacks across Europe. These hackathons are intended to make the EU more transparent.

Tech developers, EU diplomats, journalists, citizen activists, social entrepreneurs, data experts and many more will join forces and think of transparency applications to make decision making in the EU searchable and understandable.

Everybody interested in the EU data can enter the hackathon. The winners of the diplohack will be invited to compete in a European final in Amsterdam during the TransparencyCamp Europe Unconference.

The Diplohack event is organised by the Council of the European Union, the Dutch EU Presidency and Open Knowledge Belgium. Get your free ticket for the #Diplohack!

The Diplohack will be preceded by a webinar in which EU data experts explain more about the data. You can join even if you don’t participate in the Diplohack itself. Register here. Check the discuss forum thread for more info on the programme and the Eventbrite page for more practical information.

District Dispatch: Reminder: CopyTalk this Thursday with the Libertarians

planet code4lib - Tue, 2016-04-05 21:14

This month’s CopyTalk will be unlike any before.  This one will be about our understanding of what copyright is and why we have it in the first place.  That’s right, we’re talking copyright policy. ALA’s policy is that copyright was created by Congress to advance the dissemination of information, creative arts, and knowledge for the benefit of the public. Libraries are important vehicles for advancing the purpose of copyright because they are sites of learning and personal enrichment. We lawfully acquire copyright resources so more people have access to them.  We replace and preserve these resources under copyright exceptions also to benefit the public.  We do other things as well, but this is meant to be a short blog post.

Of course, copyright means different things to different people and stakeholder groups. This Thursday, libertarians from R Street will share their thoughts on ways to look at U.S. copyright policy.  Join us for a wonky time!

Thursday, April 7th 2016 2pm (Eastern)/11am (Pacific) use this URL to access the webinar. Register as a guest and you’re in.  Yes, it’s still FREE because the Office for Information Technology Policy and the Copyright Education Subcommittee want to expand copyright awareness and education opportunities.

And yes, we archive the webinars!

FOSS4Lib Recent Releases: pycounter - 0.13.0

planet code4lib - Tue, 2016-04-05 20:23

Last updated April 5, 2016. Created by wooble on April 5, 2016.

Package: pycounter
Release Date: Tuesday, April 5, 2016

NYPL Labs: Together We Listen: Make Hundreds of NYC Stories Accessible—One Word at a Time

planet code4lib - Tue, 2016-04-05 16:05

The following blog post is co-authored by Willa Armstrong (NYPL Labs) and Alex Kelly (Adult Programming and Outreach Services).

Are you familiar with the NYPL Community Oral History Project? Take a few moments to listen to some highlights or just dive right into our full collection of stories.

The NYPL Community Oral History Project is truly the people’s project. It’s powered by the public, as hundreds of engaged community members come together to gather oral histories from each other in order to preserve the rich and constantly changing history of New York City. Beginning in 2013 at Jefferson Market Library in Greenwich Village and building momentum, oral histories have been collected in six additional neighborhoods. Visible Lives, an oral history project on the disability experience, is another large-scale collection effort, based out of the Andrew Heiskell Braille and Talking Book Library. Read more about our growing collection at

To date, the Community Oral History collection contains over 1,000 stories, with more on the way! This is very exciting, but here’s the issue: We’re faced with the challenge of making this large corpus of audio accessible and searchable to the public. It’s a challenge faced by many organizations and institutions with audio-based collections and archives.

When working to make audio accessible, transcripts are important because they make audio content searchable online and accessible to people with hearing disabilities. Recent advances in speech-to-text technologies have made great progress in opening audio to the web, but the transcripts they produce are still error-prone and can only be considered first drafts. Though they’re a good start, careful human editing is required to polish these computer-generated drafts and ensure accurate, high-quality transcripts.

So, how are we going to transcribe these hundreds of audio hours quickly and cost effectively?  People and computers need to collaborate.

And this brings us to our big announcement: NYPL Labs has built a brand new Open Transcript Editor to engage the public in helping to make our oral history collection accessible—one word at a time. The Open Transcript Editor is an interactive transcript editor allowing multiple people to perform the final layer of polish and proofreading on computer-generated transcripts. It’s a big undertaking and we’re inviting the public to pitch in and help correct computer-generated transcripts from our NYPL Community Oral History Project.

Visit to get started and help make this public treasure trove of NYC stories accessible.

Since we’re not the only ones tackling this challenge, the NYPL has teamed up with The Moth, a live storytelling organization with its own growing audio archive, for Together We Listen, a community program that invites our respective audiences to correct transcripts online using this new tool. To get our initial, computer-generated transcripts, both partners have sent their stories through Pop Up Archive, a speech-to-text service that works extensively with the public media and cultural heritage sectors. This project was made possible with generous support provided by the Knight Foundation Prototype Fund, an initiative of the John S. and James L. Knight Foundation. The Open Transcript Editor codebase itself is open source and is available to be used and further developed around other audio archives.

Anyone can contribute to this effort online and, for the first time ever, NYPL is also organizing in-person events at various library locations to encourage people to get together and help transcribe the oral histories for their own communities. We hope you can join us at an event in your neighborhood!

By editing transcripts, you're helping to create truly accurate transcripts in order to share hundreds of stories from the ongoing Community Oral History Project. Once they have been edited with agreement from enough contributors, completed transcripts will be available to read and download at, along with the audio recordings.

Pitch in now: Tune in and transcribe!   Join the Together We Listen project!

Want to receive updates about NYPL digital initiatives? Sign up for our e-mail newsletter.

David Rosenthal: The Curious Case of the Outsourced CA

planet code4lib - Tue, 2016-04-05 15:00
I took part in the Digital Preservation of Federal Information Summit, a pre-meeting of the CNI Spring Membership Meeting. Preservation of government information is a topic that the LOCKSS Program has been concerned with for a long time; my first post on the topic was nine years ago. In the second part of the discussion I had to retract a proposal I made in the first part that had seemed obvious. The reasons why the obvious was in fact wrong are interesting. The explanation is below the fold.

One major difficulty in preserving the Federal Web presence is finding it. The idea that all Federal websites live under .gov or .mil, or even that somewhere in the Federal government is a complete, accurate and up-to-date list of them is just wrong. How does a Web crawler know that some random site in .com, .org or .us is actually part of the Federal Web presence?

Connecting to GeoTrust

In a praiseworthy attempt to protect Federal websites from the all-powerful Chinese hackers and the dreaded terrorists, a decree has gone forth that, come December 31st this year, all of them must be HTTPS-only. An HTTPS website must have a certificate carrying a chain of signatures terminating in one from a root Certificate Authority (CA) that browsers trust. Here, for example, is the website of GeoTrust, a commercial CA that browsers trust. The certificate chain is:
  • is certified by:
  • GeoTrust Extended Validation SHA256 SSL CA, which is certified by:
  • GeoTrust Primary Certification Authority G3, which is a CA browsers trust.
The list of CAs that my Firefox trusts is here, all 188 of them. Note that because GeoTrust is in the list, it can certify its own website. I wrote about the issues around CAs back in 2013, notably that:
  • The browser trusts all of them equally.
  • The browser trusts CAs that the CAs on the list delegate trust to. Back in 2010, the EFF found more than 650 organizations that Internet Explorer and Firefox trusted.
  • Commercial CAs on the list, and CAs they delegate to, have regularly been found to be issuing false or insecure certificates.
Among the CAs on the list are agencies of many governments, such as the Dutch, Chinese, Hong Kong, and Japanese governments.

I assumed that the US government would be on the list too. My obvious idea was that government websites outside .gov and .mil could be found by crawling other domains looking for HTTPS sites whose certificate's signature chain ended at the US government root CA. This would solve a big problem for collecting and preserving Federal government information. Alas, I under-estimated the mania for outsourcing government functions to for-profit companies.
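The crawl test described above can be sketched roughly as follows. This is an illustration only, under stated assumptions: each certificate is represented in the dictionary form returned by Python's `ssl.SSLSocket.getpeercert()`, the chain is supplied leaf-first, and the helper names (`name_field`, `chain_ends_at`) and the sample organization names are hypothetical. Fetching the full chain from a live server is left out.

```python
from typing import Optional


def name_field(name, key: str) -> Optional[str]:
    """Look up one attribute in a distinguished name in getpeercert() form.

    A name is a tuple of RDNs; each RDN is a tuple of (attribute, value) pairs.
    """
    for rdn in name:
        for attr, value in rdn:
            if attr == key:
                return value
    return None


def chain_ends_at(chain, root_org: str) -> bool:
    """Return True if the last certificate of a leaf-first chain was issued by root_org."""
    if not chain:
        return False
    issuer = chain[-1].get("issuer", ())
    return name_field(issuer, "organizationName") == root_org


# Hypothetical two-certificate chain, leaf first (all names are made up).
sample_chain = [
    {
        "subject": ((("commonName", "www.example.gov"),),),
        "issuer": ((("organizationName", "Example Intermediate CA"),),),
    },
    {
        "subject": ((("organizationName", "Example Intermediate CA"),),),
        "issuer": (
            (("countryName", "US"),),
            (("organizationName", "Example Root CA"),),
        ),
    },
]
```

A crawler would run `chain_ends_at` on each HTTPS site it encounters and keep the sites for which it returns true.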

Connecting to the LoC website

As an example, visit Your browser will be redirected to, the home page of the Library of Congress website. It will display a green padlock icon, showing that the connection is secure and the browser has verified the certificate upon which the connection's security depends. So far so good. Now click on the green padlock to reveal the details of this verification. The certificate chain looks like:
  • * is certified by:
  • Entrust Certification Authority - L1K, which is certified by:
  • Entrust Root Certification Authority G2, which is on the browser's trusted CA list.
What this means is that the Library of Congress is paying a commercial CA to reassure citizens that their website is what it claims to be, and is secure.

Connecting to the DHS website

If you visit you will be redirected to but unlike the Library of Congress you won't get the reassuring green padlock. There are two reasons:
  • The images in the page are delivered via HTTP, so:
    Your connection to this site is private, but someone on the network might be able to change the look of the page.
  • The browser doesn't like the HTTPS connection because:
    Your connection to is encrypted using an obsolete cipher suite.
But the browser has verified the signature chain. It looks like:
  • is certified by:
  • GeoTrust SSL CA - G3, which is certified by:
  • GeoTrust Global CA, which is on the browser's trusted CA list.
So the Department of Homeland Security is paying a different commercial CA to reassure citizens that their website is what it claims to be, and that it is not very secure.
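The first of the two reasons above, images delivered over plain HTTP from an HTTPS page, is something a crawler could check for itself. The following is a minimal sketch using only Python's standard library; the class and function names are hypothetical, and it looks only at `img` tags, not at scripts, stylesheets or other embedded resources.

```python
from html.parser import HTMLParser


class InsecureImageFinder(HTMLParser):
    """Collect the src of every <img> whose URL uses plain http://."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value and value.lower().startswith("http://"):
                self.insecure.append(value)


def find_insecure_images(html: str):
    """Return the list of image URLs in the page that are fetched over HTTP."""
    finder = InsecureImageFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure
```

An empty result does not prove a page is free of mixed content; it only means no `img` element had an explicit `http://` source.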

Why is this? Is it because the Library of Congress believes that Entrust is more trustworthy than the US Government? I hope not; Entrust is one of the CAs whose delegated CAs have been caught issuing bogus certificates. It is because, as far as I can tell, the list of 188 CAs that browsers trust contains no US Government controlled CA.

So, your browser trusts the government of the People's Republic of China but not the government of the United States of America!

It isn't that the Federal government doesn't trust itself to run a secure root CA. There is a Federal root CA, the Common Policy Root CA, which is clearly regarded as secure since it is used to control access to Federal systems. But it isn't in the browser's list of trusted CAs, so it isn't any use for outward-facing services such as websites. If it were, Federal websites could be Federally certified, just as GeoTrust websites are GeoTrust certified.

In what world does this make sense? One in which there's money to be made selling services to the Federal government. By failing to follow the example of other governments by putting a root CA that they control into the list, the government arranges for funds to flow to for-profit companies who can protect the cash flow by lobbying, and arranging a warm welcome on the other side of the revolving door for the decision makers. All the for-profit CAs need to do is to make sure they stay in GSA's list of approved vendors, like DigiCert.

So, apart from the waste of taxpayer money, and the failure of my idea for finding government websites, what is the downside of this situation? CAs sometimes misbehave, as DigiNotar and StartSSL did. The result is a dispute between the guilty CAs and the browser vendors, resolved by removing them from the list, as StartSSL was, or by explicitly distrusting their root certificates, as DigiNotar's were. If this happened to one of the CAs Federal websites use, the dispute to which the Feds were not a party would result in the websites using the guilty CA becoming unavailable until the affected certificates could be replaced with new ones from a different CA in GSA's list. The browser vendors control the trusted CA list, so among other things they control citizens' access to government information. Since they're all based in the US, there would be good reasons why they'd be reluctant to remove a US government CA from the list.

I'm naturally reluctant to trust the Federal government, but I'm a whole lot more reluctant to trust for-profit CAs. It looks like I'm out of luck; the policy about public access to the Federal root CA is up on the Web:
Does the US government operate a publicly trusted certificate authority?

No, not as of early 2016, and this is unlikely to change in the near future.

The Federal PKI root is trusted by some browsers and operating systems, but is not contained in the Mozilla Trusted Root Program. The Mozilla Trusted Root Program is used by Firefox, as well as a wide variety of devices and operating systems. This means that the Federal PKI is not able to issue certificates for use in TLS/HTTPS that are trusted widely enough to secure a web service used by the general public.

The Federal PKI has an open application to the Mozilla Trusted Root Program. However, even if the Federal PKI’s application is accepted, it will take a significant amount of time for the Federal PKI’s root certificate to actually be shipped onto devices and propagate widely around the world.

DPLA: DPLA staff attend LDCX at Stanford University

planet code4lib - Tue, 2016-04-05 14:00

Four DPLA staff members — Mark Breedlove, Tom Johnson, Mark Matienzo, and I — recently attended LDCX at Stanford University. The annual conference is a chance for those in the library, archive, and museum (LAM) communities who work with technology to collaborate on solutions to common problems. Since most of the staff attending live on the East Coast and have been weathering the last dregs of winter, the trip injected a much needed dose of sunshine too!

LDCX is an “unconference” meaning the agenda is decided on by the attendees on the morning of the first day. Topics are ad hoc discussions and geared towards coming up with actual solutions using shared standards and infrastructures. This year’s conference saw work on emerging standards like the Portland Common Data Model (PCDM), the International Image Interoperability Framework (IIIF), along with more general and perennial topics like mentoring and training and the use of standard technologies like Ruby on Rails.

DPLA staff in particular were heavily involved in discussions on data quality and methods for analysis, and on the use of the DPLA’s Metadata Application Profile (MAP) and how it fits into the broader scheme of data models in the cultural heritage community, such as Europeana’s Data Model (EDM) as well as PCDM. Further discussions of DPLA MAP explored the possibilities of further extensions and adoption of DPLA MAP in other venues. (For more information about DPLA’s MAP, please see

In general, one of the best things about LDCX is the chance to meet face to face with colleagues and collaborators in the LAM technology community to discuss our challenges and work together on our shared opportunities.

Several other staff members joined the group starting on Wednesday for a series of meetings related to the Hydra open source repository system. Dan Cohen, Rachel Frick, and Audrey Altman are all part of the Hydra-in-a-Box project and came out to participate in team meetings. More information about the outcomes of that work will be posted soon to the Hydra-in-a-Box blog. Not only was the “Hybox” team in high-gear, but Thursday and Friday also saw meetings of two Hydra-related groups: the Hydra Developers Congress and the Hydra Power Steering Meeting. The Developers Congress is a regular event in the Hydra community, held in different locations around the country, that provides a chance for developers in the Hydra Open Source community to work together on code for Hydra and its components. The Hydra Power Steering meeting is an annual meeting of the Hydra Steering group dedicated to strategic planning for the community.

It was a productive and fun week for those DPLA-ers that attended — a chance to recharge our creative batteries for some, work through vital and important topics with trusted colleagues for others, and maybe to chase away the winter blues for a week as well!

Conal Tuohy: Visualizing Government Archives through Linked Data

planet code4lib - Tue, 2016-04-05 13:41

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client, the Public Record Office Victoria, the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written consists of a web application which I wrote using a programming language for data pipelines called XProc. The software itself is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.
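The point about each graph being individually addressable can be illustrated with a short sketch. The actual pipeline is written in XProc; the Python below is not that code. It only shows the `?graph=` and `?default` addressing convention of the SPARQL 1.1 Graph Store HTTP Protocol. The function names are hypothetical, and the store location and graph URI used with them would be placeholders. The request is built but never sent.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def graph_store_url(store_url: str, graph_uri: Optional[str] = None) -> str:
    """Return the URL addressing one named graph, or the default graph if none is given."""
    if graph_uri is None:
        return f"{store_url}?default"
    # The graph URI is percent-encoded and passed as the "graph" query parameter.
    return f"{store_url}?{urlencode({'graph': graph_uri})}"


def build_put_request(store_url: str, graph_uri: str, rdf_xml: bytes) -> Request:
    """Build (without sending) a PUT that would replace one named graph with RDF/XML."""
    return Request(
        graph_store_url(store_url, graph_uri),
        data=rdf_xml,
        method="PUT",
        headers={"Content-Type": "application/rdf+xml"},
    )
```

Storing one graph per harvested record in this way means a re-harvested record simply overwrites its own graph, while queries can still run across the whole dataset.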

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.
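For readers unfamiliar with OAI-PMH, the harvesting loop behind such an interface looks roughly like the sketch below. Again, this is Python rather than the XProc of the real application, and the function names, the base URL and the metadata prefix passed in are all assumptions for illustration. It shows the two protocol details that matter: the `ListRecords` request, and the `resumptionToken` that a server returns when more records remain.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlencode

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str, token: Optional[str] = None) -> str:
    """Build a ListRecords request URL.

    When continuing a harvest, the resumption token replaces the other
    arguments; the protocol does not allow it to be combined with metadataPrefix.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return the record identifiers in one response page and the next token, if any."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    token = None
    if token_element is not None and token_element.text:
        token = token_element.text.strip()
    return identifiers, token
```

A harvester would fetch `list_records_url(...)`, parse the response, and repeat with the returned token until the token is absent or empty.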

At this stage of the project, the RDF graph is only used internally to PROV, where it functions purely as an intermediate between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But later I have no doubt that the RDF data will be published directly as Linked Open Data, opening it up, and allowing it to be connected into a world-wide web of data.

Open Library Data Additions: Amazon Crawl: part bn

planet code4lib - Tue, 2016-04-05 13:39

Part bn of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Tod Robbins: Log 4213 by todrobbins

planet code4lib - Tue, 2016-04-05 03:37

Trying to fix #homebrew #tmate #askpass and feeling swallowed by my lack of #unix know-how. Ha! See:

District Dispatch: Learn and network at ALA’s National Policy Convening

planet code4lib - Mon, 2016-04-04 21:49

ALA will hold a National Policy Convening on April 12th and 13th in Washington, D.C.

Come join us for ALA’s first-ever National Policy Convening in Washington, D.C. on April 12-13. Given that a new Administration and Congress will be coming to town, it is timely to discuss and debate information policy and the public interest:

  • How can we advance creative and innovative learning for our children?
  • What can be done to advance entrepreneurship and small business in local communities?
  • What are the big issues for the digital age and how could the Library of Congress be best leveraged for the public interest?

Come join us for some answers! ALA’s president, Sari Feldman, will chair this National Policy Convening with featured policy players from the public, private and non-profit sectors, including:

Youth Engagement with Technology

  • Senator Angus King, I-Maine
  • Alan Fishel, Partner, Arent Fox, LLP
  • Tiffany Moore, Vice President, Congressional Affairs, Consumer Technology Association
  • Stephan Turnipseed, Executive Vice President & Chief Strategy Officer, Destination Imagination

Advancing Economic Opportunity in Communities

  • Maureen Conway, Vice President, Aspen Institute
  • Darryl L. DePriest, Chief Counsel, Office of Advocacy, U.S. Small Business Administration
  • Sari Feldman, President, American Library Association
  • Russell D. Greiff, Managing Director and General Partner, 1776 Ventures
  • Emily Robbins, Principal Associate, National League of Cities
  • Moderated by Larra Clark, Deputy Director, Office for Information Technology Policy, American Library Association

Future Directions for the Library of Congress

  • Robert Darnton, Carl H. Pforzheimer University Professor and University Librarian, Emeritus, Harvard University
  • Katie Oyama, Senior Policy Counsel, Google, Inc.
  • Sascha Meinrath, Palmer Chair in Telecommunications, The Pennsylvania State University
  • Moderated by Alan S. Inouye, Director, Office for Information Technology Policy, American Library Association.

Convening Schedule

Tuesday, April 12, 2016:

  • 5:00 – 5:30 pm           Registration and Check-In
  • 5:30 – 6:30 pm           Youth Engagement with Technology
  • 6:30 – 8:00 pm           Reception

Wednesday, April 13, 2016:

ALA appreciates in-kind and financial support from Arent Fox LLP, the Bill & Melinda Gates Foundation, and Google, Inc.

Register Now for the convening!

Tod Robbins: Log 4209 by todrobbins

planet code4lib - Mon, 2016-04-04 21:14

Thinking through packaging #Adobe #Premiere and #AfterEffects projects at the end of a post-production phase.

  • What kinds of file checks (#MD5)?
  • Various ways of grouping media assets
  • Would sidecar #metadata files be helpful?
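One possible answer to the first and last questions, sketched in Python: compute an MD5 checksum per media file and write it to a small JSON sidecar next to the file. The function names and the `.md5.json` sidecar naming are hypothetical choices, not an established convention.

```python
import hashlib
import json
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path: Path) -> Path:
    """Write <name>.md5.json beside the file, recording its name, size and checksum."""
    sidecar = path.with_name(path.name + ".md5.json")
    record = {
        "file": path.name,
        "bytes": path.stat().st_size,
        "md5": md5_of_file(path),
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Re-running `md5_of_file` after a transfer and comparing against the sidecar value is enough to detect accidental corruption, though MD5 is not suitable where deliberate tampering is a concern.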

Nicole Engard: Bookmarks for April 4, 2016

planet code4lib - Mon, 2016-04-04 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Mattermost Mattermost is an open source, self-hosted Slack-alternative
  • mBlock Program your app, Arduino projects and robots by dragging & dropping
  • Fidus Writer Fidus Writer is an online collaborative editor especially made for academics who need to use citations and/or formulas.
  • Beek Social network for booklovers
  • Open eBooks Open eBooks is a partnership between Digital Public Library of America, The New York Public Library, and First Book, with content support from digital books distributor Baker & Taylor.

Tod Robbins: Log 4208 by todrobbins

planet code4lib - Mon, 2016-04-04 18:42

Estimating transfer size/time for project archives. Needs to be 10-12TB to fit on three 4TB drives. #digipres
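A back-of-the-envelope version of this estimate, as a small Python sketch. The function names are hypothetical and the throughput figure is whatever sustained rate you actually measure; sizes use decimal terabytes (10^12 bytes), and usable drive capacity after formatting will be somewhat lower than the nominal figure.

```python
def transfer_hours(size_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to copy size_tb terabytes at a sustained MB/s rate."""
    size_bytes = size_tb * 1e12
    seconds = size_bytes / (throughput_mb_per_s * 1e6)
    return seconds / 3600


def fits_on_drives(size_tb: float, drive_tb: float = 4.0, drive_count: int = 3) -> bool:
    """Check the archive against the combined nominal capacity of the drives."""
    return size_tb <= drive_tb * drive_count
```

For example, `transfer_hours(12, 100)` estimates the time to move 12 TB at an assumed 100 MB/s, and `fits_on_drives(12)` checks that size against three 4 TB drives.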

Harvard Library Innovation Lab: How We’re Freeing the Law, Part 1: Books

planet code4lib - Mon, 2016-04-04 18:29

Adi Kamdar is a 1L at Harvard Law School and our embedded reporter on the Free the Law project. In this first post, he tracks the progress of a casebook through our scanning process from start to finish.

Harvard Law Library is one of the few collections with nearly every law reporter—roughly 40,000 books in total. The Free the Law project’s goal is to put the court decisions inside these volumes online, so anyone can access the precedents that shape the American legal system. Right now, the project is about halfway through, and within the next couple of years the team will have completed this monumental task.

But how exactly does a book become a byte? And what happens to these physical texts after they’ve been digitized?

Harvard Depository

The project begins each week with a book order—a 600 book order, to be exact, for law reporters that chronicle U.S. legal history since the country’s inception.

The law reporters are held in a sprawling warehouse 30 miles away from the law school—the Harvard Depository. With over 200,000 square feet of storage space, the climate-controlled Depository’s mission is pure efficiency: each book—and there are over nine million—is sorted and stored by size, rather than by name or author, in order to maximize space.

But it turns out law reporters are the packing peanuts of the Harvard Depository. When the reporters were first sent over to the warehouse, staff did not store them normally; instead, they kept the volumes in the packaging room. Whenever they filled a cardboard box with other books for storage, they would throw in a reporter or two if there was any extra space that needed to be filled. No one thought the print reporters would be that useful anymore, so making them easily available in bulk was a low priority. Plus, the library had decided to cancel print runs of reporters in 2010, saving valuable shelf space, especially when digital copies were easily available online.

Because of this tactic, law reporters are spread all throughout the Depository. Asking for, say, Michigan’s volumes isn’t as simple as pulling out a handful of boxes—it’s a hunt.

Langdell Library

Every Wednesday, the team receives the 600 volumes of case reporters. They line the hallway of the ground floor of Langdell, filling shelf after shelf. One by one, each book is examined before it can be taken apart. (Some books—for example, volumes with marginalia—are flagged for archival purposes.) Each volume is then catalogued and given a unique barcode so it can be tracked throughout the whole process.


The books are then taken to the Prep Room where, ironically, they’re repaired before they’re chopped up. Damaged pages are taped together, book bindings are cut off by hand, and the remaining sheets are taken over to a guillotine. Once aligned, the operator has to press two separate buttons underneath the cutting table at the same time to make sure her hands aren’t under the blade. The result? Cleanly cut pages.


Next, the bundle of pages is hauled over to the Scanning Room. Here, six employees work overlapping shifts to ensure that pages are being scanned every day, 14 hours a day. Roughly 200 documents per minute are fed through the machine, which has a camera on top and bottom to image both sides of the page.


Now that the books are chopped and scanned, what happens to the physical pages? After all, the purpose of this project is to digitize the law. Plus, according to circulation records, very few people were reading the old reporters anyway. Rebinding them and keeping them in the library would be a waste of space, time, and money. But just in case anyone questions the authenticity of the scans, Harvard decided it would be valuable to have the physical copies accessible. So the project decided to vacuum seal the pages. Once the pages are jogged together (using a state-of-the-art paper-jogging machine) and placed back inside their book jacket, the volumes are taken over to one last room—where they will be put inside a meat packing device. Yes, it turns out that the meat industry unwittingly stumbled across the best way to preserve books. The machine shrink wraps the pages, maintaining the integrity of the volume while handily adding an extra layer of protection from mold, humidity, and bugs.

The re-bound volumes are then re-shelved, where they await being shipped off to…


Louisville, Kentucky

Because of the Harvard Law Library’s limited shelf capacity, the newly packaged pages will soon be loaded onto trucks and shipped down to Louisville.

Why Kentucky? Well, because of Underground Vaults & Storage, a company that has been storing all manner of things in Louisville’s old limestone mines. The sealed books will be stored there (where they will “fear no tornado, wildfire, flood or other natural disaster”) until the rare instance that they need to be recalled.

And that’s the story of these legal volumes—from one massive depository to another, by way of a guillotine, a scanner, and a meat packer. In our next post, we’ll explore what happens after they become digital images, and how Free the Law is building the largest free database of legal opinions in the world.

