Feed aggregator

District Dispatch: ALA encouraged by “fair use” decision in Georgia State case

planet code4lib - Mon, 2014-10-20 16:13

Georgia State University Library. Photo by Jason Puckett via flickr.

On Friday, the U.S. Court of Appeals for the 11th Circuit handed down an important decision in Cambridge University Press et al. v. Carl V. Patton et al. concerning the permissible “fair use” of copyrighted works in electronic reserves for academic courses. Although publishers sought to bar the uncompensated excerpting of copyrighted material for “e-reserves,” the court rejected all such arguments and provided new guidance in the Eleventh Circuit for how “fair use” determinations by educators and librarians should best be made. Remanding to the lower court for further proceedings, the court ruled that fair use decisions should be based on a flexible, case-by-case analysis of the four factors of fair use rather than rigid “checklists” or “percentage-based” formulae.

Courtney Young, president of the American Library Association (ALA), responded to the ruling by issuing a statement.

The appellate court’s decision emphasizes what ALA and other library associations have always supported: thoughtful analysis of fair use and a rejection of highly restrictive fair use guidelines promoted by many publishers. Critically, this decision confirms the importance of flexible limitations on publishers’ rights, such as fair use. Additionally, the appeals court’s decision offers important guidance for reevaluating the lower court’s ruling. The court agreed that the non-profit educational nature of the e-reserves service is inherently fair, and that teachers’ and students’ needs should be the real measure of any limits on fair use, not any rigid mathematical model. Importantly, the court also acknowledged that educators’ use of copyrighted material would be unlikely to harm publishers financially when schools aren’t offered the chance to license excerpts of copyrighted work.

Moving forward, educational institutions can continue to operate their e-reserve services because the appeals court rejected the publishers’ efforts to undermine those e-reserve services. Nonetheless, institutions inside and outside the appeals court’s jurisdiction (which includes Georgia, Florida and Alabama) may wish to evaluate and ultimately fine-tune their services to align with the appeals court’s guidance. In addition, institutions that employ checklists should ensure that the checklists are not applied mechanically.

In 2008, publishers Cambridge, Oxford University Press, and SAGE Publishers sued Georgia State University for copyright infringement. The publishers argued that the university’s use of copyright-protected materials in course e-reserves without a license was a violation of the copyright law. Previously, in May 2012, Judge Orinda Evans of the U.S. District Court ruled in favor of the university in a lengthy 350-page decision that reviewed the 99 alleged infringements, finding all but five infringements to be fair uses.

The post ALA encouraged by “fair use” decision in Georgia State case appeared first on District Dispatch.

HangingTogether: Evolving Scholarly Record workshop (Part 3)

planet code4lib - Mon, 2014-10-20 16:00

This is the third of three posts about the workshop.

Part 1 introduced the Evolving Scholarly Record framework.  Part 2 described the two plenary discussions.  This part summarizes the breakout discussions.

Following the presentations, attendees divided into breakout groups.  There were a variety of suggested topics, but the discussions took on lives of their own.  The breakout discussions surfaced many themes that may merit further attention:

Support for researchers

It may be the institution’s responsibility to provide infrastructure to support compliance with mandates, but it is certainly the library’s role to assist researchers in depositing their content somewhere and to ensure that deposits are discoverable.  We should establish trust by offering our expertise and familiarity with reliable external repositories, deposit, compliance with mandates, selection, description …  and support the needs of researchers during and after their projects.  Access to research outcomes involves both helping researchers find and access information they need as inputs to their work and helping them to ensure that their outputs are discovered and accessible by others.  We should also find ways to ensure portability of research outputs throughout a researcher’s career.  We need to partner with faculty and help them take the long view.  We cannot do this by making things harder for the researcher, but by making it seamless, building on the ways they prefer to work.

Adapting to the challenge

We need to retool and reskill to add new expertise:  ensuring that processes are retained along with data, promoting licenses that allow reusability, thinking about what repositories can deliver back to research, and adding developers to our teams.  When we extend beyond existing library standards, we need to look elsewhere to see what we can adopt rather than create.  We need to leverage and retain the trust in libraries, but need resources to do the work.  While business models don’t exist yet, we need to find ways to rebalance resources and contain costs.  One of the ways we might do that is to build library expertise and funding into the grant proposal process, becoming an integral part of the process from inception to dissemination of results.


Academic libraries should first collect, preserve, and provide access to materials created by those at their institution.  How do libraries put a value on assets (to the institution, to researchers, to the wider public)?  Not just outputs but also the evidence-base and surrounding commentary.  What should proactively be captured from active research projects?   How many versions should be retained?  What role should user-driven demand play?  What is needed to ensure we have evidence for verification and retain results of failed experiments?  What need not be saved (locally or at all)?  When is sampling called for?  What about deselection?  While we can involve researchers in identifying resources for preservation, in some cases we may have to be proactive and hunt them down and harvest them ourselves.


Competitiveness (regarding tenure, reputation, IP, and scooping) can inhibit sharing.  Timing of data sharing can be important, sometimes requiring an embargo.  Privacy issues regarding research subjects must be considered.  Researchers may be sensitive about sharing “personal” scientific notes – or sharing data before their research is published.  Different disciplines have different traditions about sharing.

Collaboration with others in the university

Policy and financial drivers (mandates, ROI expectations, reputation and assessment) will motivate a variety of institutional stakeholders in various ways.  How can expertise be optimized and duplication be minimized?  Libraries can’t change faculty behaviors, so they need to join with those who have more influence.  When Deans see that libraries can address parts of the challenge, they will welcome involvement.  When multiple units are employing different systems and services, IT departments and libraries may become key players.  There are limits to institutional capacity, so cooperating with other institutions is also necessary.

Collaboration with other stakeholders in a distributed archive across publishers, subjects, nations

The variety and quantity of information now making up the scholarly record is far greater than it used to be and publishers are no longer the gatekeepers of the scholarly record.  This is a time to restructure the role of libraries vis-à-vis the rest of the ecosystem.  We need to function as part of an ecosystem that includes commercial and governmental entities.  We need to act locally, think globally, employing DOIs and Researcher IDs to interoperate with other systems and to appeal to scholars.  Help researchers negotiate on IP rights, terms of use, privacy, and other issues when engaging with environments like GitHub, SlideShare, and publishers’ systems, being aware that, while others will engage with identifiers, metadata systems, discovery, etc., few may be committed to preservation.  How do we decide who are trustworthy stewards and what kinds of assurances are needed?

Technical approaches

We need to understand various solutions for fixity, versioning, and citation. We need to accommodate persistent object identifiers and multiple researcher name identifiers.  We need to explore ways to link the various research materials related to the same project.  We need to coordinate metadata in objects (e.g., an instrument’s self-generated metadata) with metadata about the objects and metadata about the context.  Embedded links need to be maintained.  Campus systems may need to interoperate with external systems (such as SHARE).  We should help find efficient metrics for assessing researcher impact and enhancing institutional reputation.  We should consider collaborating on processes to capture content from social media.  In doing these things we should be contributing to developing standards, best practices, and tools.

Policy issues

What kinds of statements of organizational responsibility are needed:  a declaration of intent covering what we will collect, a service agreement covering what services we will provide to whom, terms of use, and explicit assertions about which parts of the university are doing what?  Are there changes to copyright needed; does Creative Commons licensing work for researchers?  What about legal deposit for digital content, based on the print model?  What happens when open access policies conflict with publisher agreements?


Attendees of the workshop feel that stewardship efforts will evolve from informal to more formal.  Mandates, cost-savings, and scale will motivate this evolution.  It is a common good to have a demonstrable historical record to document what is known, to protect against fraud, and for future research to build upon.  Failure to act is a risk for libraries, for research, and for the scholarly record.

Future Evolving Scholarly Record workshops will expand the discussion and contribute to identifying topics for further investigation.  The next scheduled workshops will be in Washington DC on December 10, 2014 and in San Francisco on June 4, 2015.  Watch for more details and for announcements of other workshops on the OCLC Research events page.

About Ricky Erway

Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.


David Rosenthal: Journal "quality"

planet code4lib - Mon, 2014-10-20 15:00
Anurag Acharya and co-authors from Google Scholar have a pre-print entitled Rise of the Rest: The Growing Impact of Non-Elite Journals in which they use article-level metrics to track the decreasing importance of the top-ranked journals in their respective fields from 1995 to 2013. I've long argued that the value that even the globally top-ranked journals add is barely measurable and may even be negative; this research shows that the message is gradually getting out. Authors of papers subsequently found to be "good" (in the sense of attracting citations) are slowly but steadily choosing to publish away from the top-ranked journals in their field. You should read the paper, but below the fold I have some details.

Acharya et al:
"attempt to answer two questions. First, what fraction of the top-cited articles are published in non-elite journals and how has this changed over time. Second, what fraction of the total citations are to non-elite journals and how has this changed over time."
For the first question they observe that:
"The number of top-1000 papers published in non-elite journals for the representative subject category went from 149 in 1995 to 245 in 2013, a growth of 64%. Looking at broad research areas, 4 out of 9 areas saw at least one-third of the top-cited articles published in non-elite journals in 2013. For 6 out of 9 areas, the fraction of top-cited papers published in non-elite journals for the representative subject category grew by 45% or more."
and for the second that:
"Considering citations to all articles, the percentage of citations to articles in non-elite journals went from 27% in 1995 to 47% in 2013. Six out of nine broad areas had at least 50% of citations going to articles published in non-elite journals in 2013."
They summarize their method as:
"We studied citations to articles published in 1995-2013. We computed the 10 most-cited journals and the 1000 most-cited articles each year for all 261 subject categories in Scholar Metrics. We marked the 10 most-cited journals in a category as the elite journals for the category and the rest as non-elite."
In a post to liblicense, Ann Okerson asks:
  • Any thoughts about the validity of the findings? Google has access to high-quality data, so it is unlikely that they are significantly mis-characterizing journals or papers. They examine the questions separately in each of their 261 subject categories, and re-evaluate the top-ranked papers and journals each year.
  • Do they take into account the overall growth of article publishing in the time frame examined? Their method excludes all but the most-cited 1000 papers in each year, so they consider a decreasing fraction of the total output each year:
    • The first question asks what fraction of the top-ranked papers appear in top-ranked journals, so the total volume of papers is irrelevant.
    • The second question asks what fraction of all citations (from all journals, not just the top 1000) are to top-ranked journals. Increasing the number of articles published doesn't affect the proportion of them in a given year that cite top-ranked journals.
  • What's really going on here? Across all fields, the top-ranked 10 journals in their respective fields contain a gradually but significantly decreasing fraction of the papers subsequently cited. Across all fields, a gradually but significantly decreasing fraction of citations are to the top-ranked 10 journals in their respective fields.  This means that authors of cite-worthy papers are decreasingly likely to publish in, read from, and cite papers in their field's top-ranked journals. In other words, whatever value that top-ranked journals add to the papers they publish is decreasingly significant to authors.
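The method quoted above (mark the 10 most-cited journals in a category as elite, then measure what falls outside them) can be illustrated with a minimal Python sketch. This is not the paper's code: the function names, the input shape (a list of journal and citation-count pairs for one category and year), and the choice to rank journals by summed citations are all assumptions made for illustration.

```python
from collections import Counter

def non_elite_share(articles, top_journals=10, top_articles=1000):
    """articles: list of (journal, citations) pairs for one category and year.

    Returns (fraction of top-cited articles in non-elite journals,
             fraction of all citations going to non-elite journals).
    Assumes the list is non-empty and has at least one citation.
    """
    # Assumption: "most-cited journals" means highest summed citations.
    per_journal = Counter()
    for journal, citations in articles:
        per_journal[journal] += citations
    elite = {journal for journal, _ in per_journal.most_common(top_journals)}

    # Question 1: share of the top-cited articles published outside the elite set.
    ranked = sorted(articles, key=lambda a: a[1], reverse=True)[:top_articles]
    top_fraction = sum(1 for journal, _ in ranked if journal not in elite) / len(ranked)

    # Question 2: share of all citations that go to non-elite journals.
    total = sum(per_journal.values())
    non_elite = sum(c for journal, c in per_journal.items() if journal not in elite)
    return top_fraction, non_elite / total

def growth_percent(old, new):
    """Relative growth from old to new, as a percentage."""
    return (new - old) / old * 100
```

With the figures quoted above, growth_percent(149, 245) is about 64.4, which matches the paper's reported growth of 64%.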
Much of the subsequent discussion on liblicense misinterprets the paper, mostly by assuming that when the paper refers to "elite journals" it means Nature, NEJM, Science and so on. As revealed in the quote above, the paper uses "elite" to refer to the top-ranked 10 journals in each of the individual 261 fields. It seems unlikely that a broad journal such as Nature would publish enough articles in any of the 261 fields to be among the top-ranked 10 in that field. Looking at Scholar Metrics, I compiled the following list, showing all the categories (Scholar Metrics calls them subcategories) which currently have one or more global top-10 journals among their "elite journals" in the paper's sense:
  • Life Sciences & Earth Sciences (general): Nature, Science, PNAS
  • Health & Medical Sciences (general): NEJM, Lancet, PNAS
  • Cell Biology: Cell
  • Molecular Biology: Cell
  • Oncology: Journal of Clinical Oncology
  • Chemical & Material Sciences (general): Chemical Reviews, Journal of the American Chemical Society
  • Physics & Mathematics (general): Physical Review Letters
Only 7 of the 261 categories currently have one or more global top-10 journals among their "elite". Only 3 categories are specific; the other 4 are general. The impact of the global top-10 journals on the paper's results is minimal.

Let's look at this another way. No matter how well their work is regarded by others in their field, researchers in the vast majority of fields have no prospect of ever publishing in a global top-10 journal because those journals effectively don't publish papers in those fields. And if they ever did, the paper is likely to be junk, as illustrated by my favorite example, because the global top-10 journal's stable of reviewers don't work in that field. The global top-10 journals are important to librarians, because they look at scholarly communication from the top down, to publishers, because they are important to librarians so they anchor the "big deals", and to researchers in a small number of important fields. To everyone else, they may be interesting but they are not important.

Acharya et al conclude:
"First, the fraction of top-cited articles published in non-elite journals increased steadily over 1995-2013. While the elite journals still publish a substantial fraction of high-impact articles, many more authors of well-regarded papers in diverse research fields are choosing other venues.

Second, now that finding and reading relevant articles in non-elite journals is about as easy as finding and reading articles in elite journals, researchers are increasingly building on and citing work published everywhere."
Both seem right to me, which reinforces the message that, even on a per-field basis, highly rated journals are not adding as much value as they did in the past (which was much less than commonly thought). Authors of other papers are the ultimate judge of the value of a paper (they are increasingly awarding citations to papers published elsewhere), and of the value of a journal (they are increasingly publishing work that other authors value elsewhere).

District Dispatch: ALA, Depts. of Ed. and Labor to Host Webinar on Workforce Funding

planet code4lib - Mon, 2014-10-20 14:26

On October 27, 2014, the American Library Association (ALA) will host “$2.2 Billion Reasons to Pay Attention to WIOA,” an interactive webinar that will explore ways that public and community college libraries can receive funding for employment skills training and job search assistance from the recently-passed Workforce Innovation and Opportunity Act. The no-cost webinar, which includes speakers from the U.S. Departments of Education and Labor, takes place Oct 27, 2014, from 2:00–3:00 p.m. EDT.

The Workforce Innovation and Opportunity Act allows public and community college libraries to be considered additional One-Stop partners and authorizes adult education and literacy activities provided by public and community college libraries as an allowable statewide employment and training activity. Additionally, the law defines digital literacy skills as a workforce preparation activity.

Speakers include:

  • Moderator: Sari Feldman, president-elect, American Library Association and executive director, Cuyahoga County Public Library
  • Susan Hildreth, director, Institute of Museum and Library Services
  • Heidi Silver-Pacuilla, team leader, Applied Innovation and Improvement, Office of Career, Technical, and Adult Education, U.S. Department of Education
  • Kimberly Vitelli, chief of Division of National Programs, Employment and Training Administration, U.S. Department of Labor

Register now as space is limited. The webinar will be archived and emailed to subscribers of the District Dispatch, ALA’s policy blog.

The post ALA, Depts. of Ed. and Labor to Host Webinar on Workforce Funding appeared first on District Dispatch.

Aaron Schmidt: Showering in the Library

planet code4lib - Mon, 2014-10-20 14:00

It was a hot, dusty day in Moab, Utah. I drove into town from my beautiful campsite overlooking the La Sal Mountains, where I’d been cycling and exploring the beautiful country. I was taking a few days off from work, and even though I was relaxing, I had a phone call I didn’t want to reschedule. So back to town I went, straight to—naturally—the public library. I had fond memories of the library from a previous visit a few years back: a beautiful building with reliable Wi-Fi. Aside from not being allowed to bring coffee inside, it would be a great place to check email and take a call on the bench outside.

As I entered the library, I decided that transitioning from adventure mode to work mode required, at least, washing some of Moab’s ample sand and dust off of my hands. I washed my hands and what happened next I did automatically, without consideration or contemplation: I cupped my hands and splashed some water on my face. Refreshing! I then wet a paper towel to wipe the sunscreen off of the back of my neck.

It was at about this point that I realized just what was going on; I was the guy bathing in the library restroom!

Half shocked, half amused by my actions, I quickly made sure I didn’t drip anywhere and sully the otherwise very clean and pleasant basin.

Contextually appropriate

I can’t say I’m proud of my mindless act, but it did get me thinking about the very sensitive issue of appropriate behavior in libraries.

I’m not going on a campaign encouraging libraries to offer showers to their patrons, but not because I think the idea is ridiculous. I actually think it is a legitimate potential service offering. That such a service would likely be useful for only a very small segment of library users is one reason why it isn’t worth pursuing.

But as a theoretical concept, I find nothing inherently wrong or illogical with the idea of a library offering showers. It is simply an idea that hasn’t found many appropriate contexts.

Even so, with the smallest amount of imagination I can think of contexts in which this could work. What about a multiuse facility that houses a restaurant, a gym, a coworking space, and a library? Seems like an amazing place. And don’t forget that the new central library in Helsinki, Finland, to be completed in 2017, will feature sauna facilities. These will be contextually and culturally appropriate.

Challenging assumptions

This is about more than showers and saunas. It is about our long-held assumptions and how we react to new ideas. When we’re closed off to concepts without examining them fully, or without exploring the frameworks in which they exist, we’re unlikely truly to innovate or create any radically meaningful experiences. When evaluating new initiatives, we should consider the library less and our communities more. Without this sort of thinking, we’d have never realized libraries with popular materials, web access, and instructional classes, let alone cafés, gaming nights, and public health nurses.

Learning about our contexts—our communities—takes more than facilitating surveys and leading focus groups. After all, those techniques put less emphasis on people and more on their opinions. Even though extra work is required, the techniques aren’t mysterious. There are well-established methods we can use to learn about the individuals in our areas and then design contextually appropriate programs and services.

To the Grand County Public Library in Moab, my apologies for the slight transgression. I did leave the restroom in the same shape as I found it. To everyone else, if you’re in Moab, visit the library. But if you need a place to clean up in that city, try the aquatic center. It has nice pools and clean showers.

Library of Congress: The Signal: The ePADD Team on Processing and Accessing Email Archives

planet code4lib - Mon, 2014-10-20 13:09

As archives increasingly process born-digital collections, one thing is clear: processing digital collections often involves working with tons of email. There is already some great work exploring how to deal with email, but given that it is such a significant problem area it is great to see work focused on developing tools to make sense of this material. Of particular concern is how email is simultaneously so ubiquitous and so messy. I’ve heard cases of repositories needing to deal with hundreds of millions of email objects in a single collection. Beyond that, in actual practice people use email for just about everything, so email records are often a messy mixture of public, private, personal and professional material.

Email? from user tamaleaver on Flickr.

To this end, the ePADD project at Stanford, with the help of an NHPRC grant, is working to produce an open-source tool that will allow repositories and individuals to interact with email archives before and after they have been transferred to a repository. I was lucky enough to sit in on a presentation from the project’s technical advisor, Dr. Sudheendra Hangal, on the status of the project and am thrilled to have this opportunity to discuss work on it with him and his colleagues, Glynn Edwards and Peter Chan, as part of our Insights Interview series. Glynn is the Head of Technical Services in the Manuscripts division in Stanford Libraries and the Manager of the Born-Digital Program, and Peter is a digital archivist at Stanford Libraries.

Trevor: Could you briefly describe the scope and objective of the ePADD project? Specifically, what problem are you working to solve and how are you going about solving it?

Glynn: The ePADD project grew out of earlier experimentation during the Mellon-funded AIMS grant. One of the collections contained 50,000 unique email messages. Peter, who was our new digital archivist, experimented with Gephi (exporting header information to create social network graphs) and Forensic Toolkit. Neither option managed to provide a suitable tool for processing or to facilitate discovery. FTK did not allow us to flag individual messages that contained personal identifying information for restriction, and neither tool provided a view of the entities within the corpus or exposed them to remote researchers.

Most of the previous email projects and tools we researched were focused specifically on acquisition and preservation. They did not address other core functions of stewardship – appraisal, processing and access (discovery & delivery).

During this experimental period, Peter discovered MUSE (Memories USing Email), a research project in the Mobisocial group of Stanford’s Computer Science Dept. Using NLP and a built-in lexicon, it allowed us to extract entities and to view messages by correspondent or through a graphical visualization of sentiments based on the lexical terms. This was a step in the right direction and we began a multiyear collaboration with Sudheendra Hangal, MUSE’s creator.

The objectives are to create an open-source Java-based software program built on MUSE that supports different activities aligned with core functions of the digital curation lifecycle: appraisal, accessioning, processing, discovery and delivery. In effect it would allow a user, anyone from the creator, donor, curator, archivist or researcher, to use the collection both before and after transfer to a repository.

Stanford, along with our collaborating partners (NYU, Smithsonian, Columbia, and Bodleian @ Oxford), created and prioritized a set of specifications for the initial development cycle, funded by NHPRC. We also developed and published a beta site to demonstrate our concept for exporting entities and correspondents to facilitate discovery. We have been steadily receiving more email collections. Our most recent acquisition contains over 600,000 unique messages. The grant states that the program will handle at least 250,000 messages – so this latest archive will be more than adequate as a stress test!

Trevor: Could you tell us a bit about the design of the workflow in the tool? How are you envisioning donors and processing archivists working with it?

Peter: The workflow is designed as follows:

Creators of email archives use the appraisal module to scan their archives and identify messages they don’t want to transfer. They can also flag messages as “restricted” and enter annotations to specify the terms of the restriction. The files exported from ePADD will NOT contain the messages flagged as “Do Not Transfer.”

After receiving the files from donors, processing archivists will then identify messages to be restricted according to the policy of their institutions and communication with the donor. Depending on the resources available, processing archivists may want to confirm the email addresses of correspondents suggested by ePADD. Archivists may also want to reconcile the correspondents/person entities extracted with authority records suggested by ePADD. After they finish processing, archivists will output two versions of the archive from ePADD. Neither set contains any restricted messages.

The first set is designed for the discovery module, with all messages redacted barring identified entities (people, places, organizations) and email addresses with masked domain names. This version will be stored on the web server to be used for the discovery module. Public researchers with internet access can browse and search the archives using the discovery module. They will only see a redacted version of the original messages containing extracted entities, but this is still useful to them to get a sense of the entities present in the archive, without being able to see what is said about them.

The second version, designed for the delivery module, will be stored in the reading room computer designated for email delivery. Researchers using the designated computer in the reading room will be able to browse and search the archives.  The messages, when displayed, will be the full messages without redaction. Researchers can define their own lexicon to analyze the collection. They may request copies by flagging the messages they need. Public service archivists/librarians can then give the researchers the files according to the policy of their institutions.
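The two outputs Peter describes can be pictured with a small Python sketch. It is only an illustration of the idea, not ePADD's implementation (ePADD is Java-based): the Message class, the function names, and the "[masked]" placeholder are all hypothetical, and entity extraction is assumed to have happened in an earlier step.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    sender: str
    body: str
    entities: list = field(default_factory=list)  # assumed output of an earlier extraction step
    restricted: bool = False

def mask_domain(address):
    """Keep the local part of an address and hide the domain (placeholder is hypothetical)."""
    local, _, _domain = address.partition("@")
    return f"{local}@[masked]"

def discovery_view(messages):
    """Redacted records for public browsing: entities and masked sender, no message text."""
    return [
        {"sender": mask_domain(m.sender), "entities": list(m.entities)}
        for m in messages
        if not m.restricted
    ]

def delivery_view(messages):
    """Full messages for reading-room use; restricted messages are still left out."""
    return [m for m in messages if not m.restricted]
```

Both functions drop restricted messages first, which mirrors the statement that neither exported set contains them; only the discovery view additionally strips the message text.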

Glynn: I would only add that the appraisal module is meant to make it possible for a creator/donor to review their email archives, to create their own lexicon if desired, and prepare the files for export and final transfer to a repository. During this process they may mark specific messages (individually) or sets of messages (in bulk, by topic or correspondent) as restricted or as not to be transferred. We felt this functionality was important to offer a donor for two reasons. First, in the hope that they weed out irrelevant messages or spam! Second, there may be individuals they correspond with that do not want their messages archived – this is the case for one of our collections.

Trevor: How do you imagine archivists using this tool? Further, how do you see it fitting in with the ecosystem of other open-source tools and platforms that act as digital repository platforms and other tools for processing and working with born-digital archival materials like BitCurator and Archivematica?

The social life of email at Enron – a new study from user chieftech on Flickr.

Peter: I consider “processing” of born-digital materials to include both identifying restricted materials AND arranging/describing the intellectual content of the materials. My understanding of BitCurator and Archivematica is that neither offers tools to arrange/describe the intellectual content of the materials. ePADD offers four tools to arrange/describe the intellectual content of email archives. First, it uses a natural language processing library to extract personal names, organizational names and locations in email archives to give researchers a sense of people, organizations and locations in the archives. Second, it gathers all image files in one place for researchers to browse and if necessary go to the messages containing the images.

Third, it offers user-definable lexicons which contain categories of words the system will use to search against the emails so that researchers/archivists can browse emails according to the lexicons they defined. Finally, ePADD reconciles the correspondents and personal names mentioned in messages with the FAST (Faceted Application of Subject Terminology) dataset, which is derived from the Library of Congress Subject Headings. Archivists can then confirm the matches suggested by ePADD. If none of the suggestions are correct, they can enter their own links to the authority records.
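To make the lexicon idea concrete, here is a minimal Python sketch of matching categories of words against message text. It is an assumption-laden illustration, not ePADD's code: the function name, the dictionary layout of the lexicon, and whole-word case-insensitive matching are choices made here for the example.

```python
import re

def lexicon_hits(lexicon, messages):
    """Map each lexicon category to the indexes of messages containing any of its words.

    lexicon: dict of category name -> list of words (hypothetical layout)
    messages: list of message body strings
    """
    hits = {}
    for category, words in lexicon.items():
        if not words:
            hits[category] = []
            continue
        # Whole-word, case-insensitive match on any word in the category.
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
            re.IGNORECASE,
        )
        hits[category] = [i for i, text in enumerate(messages) if pattern.search(text)]
    return hits
```

A researcher who defines a "family" category with words such as "mother" and "sister" could then browse only the messages listed under that category.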

I can see people using ePADD to appraise, process, discover and deliver emails and sending the files generated for delivery and discovery to systems using Archivematica for long term preservation.

Trevor: In Sudheendra’s presentation I saw some really interesting things happening with approaches to identifying different distinct email addresses that are associated with the same individual over time in a collection, and some interesting approaches to associating the names of individuals with canonical data for names of people. I think he also illustrated ways that the content of the messages could be identified and associated with subjects. Could you tell us a bit about how this works and how you are thinking about the possibilities and impact on things like archival description that these approaches could have?

Glynn: With email archives – or any born-digital materials – archivists need automated methods to get through large amounts of data. ePADD incorporates several methods of automation to assist with processing of email. Here are three:

1. Correspondents & name resolution

During ingest, ePADD gathers all correspondents and recipients from email headers and performs basic name resolution tasks. When your cursor rolls over a name, different versions that were aggregated appear in a pop-up window. The archivist can go into the back end and override or edit the addresses that are associated with a specific name.

I would direct you to the wonderful documentation on processing and using email archives on the MUSE website. Regarding the resolution of correspondents, Sudheendra states (PDF) in the “MUSE: Reviving Memories Using Email Archives” report that “MUSE performs entity resolution by unifying names and email addresses in email headers when either the name or email address (as specified in the RFC-822 email header) is equivalent. This is essential since email addresses and even name spellings for a person are likely to change in a long-term archive.”

ePADD performs this process during ingest and allows the donor (appraisal module) or archivist (processing module) to correct or edit the email aliases that are automatically bundled together by ePADD at ingestion.
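The unification rule quoted above (treat two header entries as the same correspondent when either the name or the address matches) can be sketched with a small union-find structure. This is only an illustration under stated assumptions: `AliasResolver` and its methods are hypothetical names, the matching is a plain case-insensitive comparison, and none of it is taken from the MUSE or ePADD code.

```ruby
# Toy entity resolution: merge (name, email) header pairs that share
# either a name or an email address. Hypothetical sketch, not MUSE/ePADD code.
class AliasResolver
  def initialize
    @parent = {} # key => parent key (union-find forest)
  end

  # Record one header entry, linking its name and its address.
  def add(name, email)
    union("name:#{name.strip.downcase}", "email:#{email.strip.downcase}")
  end

  # Return the groups of names and addresses that resolved to one correspondent.
  def groups
    @parent.keys.group_by { |key| find(key) }.values.map(&:sort)
  end

  private

  # Find the representative key of a group, compressing the path as we go.
  def find(key)
    @parent[key] ||= key
    @parent[key] = find(@parent[key]) unless @parent[key] == key
    @parent[key]
  end

  def union(a, b)
    root_a = find(a)
    root_b = find(b)
    @parent[root_a] = root_b unless root_a == root_b
  end
end

resolver = AliasResolver.new
resolver.add("Bob Creeley", "bob@example.edu")
resolver.add("Robert Creeley", "bob@example.edu") # same address, new spelling
resolver.add("Robert Creeley", "rc@example.org")  # same name, new address
resolver.groups.each { |group| puts group.join(", ") }
```

Because the merge is fully automatic, a correction step like the one described above (the donor or archivist editing the bundled aliases) is still needed for cases where two different people share a name.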

2. Entity extraction & disambiguation

ePADD extracts entities from the email corpus using the Apache OpenNLP library and checks them against OCLC’s FAST database to identify authorities. In the case of multiple hits on a name, it shows all the matching records and can read data from DBpedia to automatically rank the likelihood of each record being the correct one. The archivist finally confirms which authority record is correct.

Example of how ePADD connects to third party data to disambiguate, aggregate and link names & identities.

Algorithms are also used to help the archivist or researcher understand context while reading a message. For example, suppose a conversation mentions Bob, which could refer to any number of Bobs present in the archive. ePADD analyzes the occurrences of Bob throughout the archive with respect to the text and headers of this message, and thinks: “Hmm…when the name Bob is used with the people copied on this email, and when these other names appear in the message, it’s more likely to be Bob Creeley than other Bobs in the archive like Dylan or Woodward.” It displays a popup with the ranked list of possibilities (see image).

The colored bar underneath each full name indicates the likelihood of that association. This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive’s contents better. If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.

Example of identification of named entities in the Robert Creeley email archive
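One way to picture the context-based ranking described above is a simple co-occurrence score: count how often each candidate has appeared alongside the people on the current message, then normalize the counts into likelihoods. The interview does not describe ePADD's actual algorithm, so the scoring rule, the `rank_candidates` name, and the counts in the usage lines are all assumptions made up for illustration.

```ruby
# Rank candidate full names for an ambiguous first name by how often each
# candidate has appeared alongside the people on the current message.
# Hypothetical scoring rule; not ePADD's actual algorithm.
def rank_candidates(candidates, cooccurrence, context_people)
  scores = candidates.map do |candidate|
    counts = cooccurrence.fetch(candidate, {})
    [candidate, context_people.sum { |person| counts.fetch(person, 0) }]
  end
  total = scores.sum { |_candidate, score| score }
  scores
    .map { |candidate, score| [candidate, total.zero? ? 0.0 : score.to_f / total] }
    .sort_by { |_candidate, likelihood| -likelihood }
end

# Made-up counts: how many earlier messages mention each candidate together
# with each (hypothetical) correspondent.
cooccurrence = {
  "Bob Creeley"  => { "Correspondent A" => 12, "Correspondent B" => 7 },
  "Bob Dylan"    => { "Correspondent C" => 3 },
  "Bob Woodward" => { "Correspondent A" => 1 }
}
ranking = rank_candidates(cooccurrence.keys, cooccurrence,
                          ["Correspondent A", "Correspondent B"])
ranking.each { |name, likelihood| puts format("%-13s %.2f", name, likelihood) }
```

The normalized values play the role of the colored likelihood bars: they only order the candidates, and the archivist or researcher still makes the final call.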

3. Lexical searches & review

The archivist can use the built-in lexicons or create one in order to tease out the subjects or topics in the archive. MUSE came with a “sentiment” lexicon and ePADD will include another default lexicon based on searching for Personally Identifiable Information and sensitive material. This will include the ability to use regular expressions to identify patterns such as credit card or social security numbers, as well as material that may be governed by FERPA or HIPAA. These lexicons are editable, or one could start from scratch and create a specialized one. The beauty of this is that once the terms are indexed by ePADD the user can view the messages individually or in a visualization graph.

An ePADD discovery module visualization graph.
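A lexicon of this kind can be thought of as a mapping from category names to search patterns. The sketch below is a minimal illustration, not ePADD's lexicon format: the category names, the patterns, and the helper names `lexicon_hits` and `index_by_category` are all assumptions, and real detection of sensitive data would need stricter rules than these regular expressions.

```ruby
# Hypothetical lexicon: category name => patterns. The patterns are
# illustrative only and would produce false positives on real mail.
SENSITIVE_LEXICON = {
  "ssn"         => [/\b\d{3}-\d{2}-\d{4}\b/],
  "credit_card" => [/\b(?:\d{4}[ -]?){3}\d{4}\b/],
  "health"      => [/\bdiagnos(?:is|ed)\b/i, /\bprescription\b/i]
}.freeze

# Return the lexicon categories whose patterns match a message body.
def lexicon_hits(body, lexicon = SENSITIVE_LEXICON)
  lexicon.select { |_category, patterns| patterns.any? { |re| re.match?(body) } }.keys
end

# Build an index of category => ids of the messages that matched it.
def index_by_category(messages, lexicon = SENSITIVE_LEXICON)
  index = Hash.new { |hash, key| hash[key] = [] }
  messages.each do |id, body|
    lexicon_hits(body, lexicon).each { |category| index[category] << id }
  end
  index
end

messages = {
  1 => "My SSN is 123-45-6789, please keep it private.",
  2 => "Lunch on Friday?",
  3 => "The prescription is ready for pickup."
}
index_by_category(messages).each { |category, ids| puts "#{category}: #{ids.inspect}" }
```

An index like this is what makes bulk review practical: the reviewer can open one category and act on every message in it at once.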


Trevor: As a follow-up to that question, how is the project conceptualizing the role of the archivist engaging with some of these automated processes for description? Sudheendra showed how an archivist could intervene and accept/reject or tweak the resulting bundling of email addresses and associate them with named entities. With that said, I imagine it would be a huge undertaking, and one that seems inconsistent with an MPLP approach, to have an archivist review all of this metadata. To that end, are there ways the project can enable some level of review of particularly important figures and still communicate which part is automated and which part has been reviewed? Or are there other ways the team is thinking about this kind of issue?

Peter: In view of the large number of correspondents and personal names mentioned in an email archive, reviewing ALL name entities is usually not feasible. Depending on resources we have for each archive, we can review, say the top 1000 most mentioned names in an archive.

Glynn: Agreed. This is similar to processing the analog or paper correspondence in a collection. The archivist usually selects correspondents that are either well known or that have substantive letters, whether in form or extent. Not all correspondents in a collection make it into the finding aid as added entries, into folder-level description, or even into a detailed index. With ePADD the top 50 or 100 correspondents (in extent) are easily and automatically identified.

However, because researchers may be interested in entities/correspondents that we do not “process,” we are considering allowing them much of the same functionality in the full text access module in the reading room. One example would be allowing the researcher to create a new lexicon and search by their terms.

Identifying what’s been processed is a work in progress. We still need to build in some administrative features, such as scope and content notes, to let the researcher know the types/depth of actions performed.

Trevor: How are you thinking about authenticity of records in the context of this project? That is, what constitutes the original and authentic format of these records, and how does the project work to ensure the integrity of those records over time? Similarly, how are you thinking about documenting decisions and actions taken in the appraisal process on the records?

Peter: According to “ISO 15489-1, Information and documentation–Records Management,” an authentic (electronic) record is one that can be proven:

a) to be what it purports to be,

b) to have been created or sent by the person purported to have created or sent it, and

c) to have been created or sent at the time purported.

Format is not part of the requirements for an authentic electronic record. One of the reasons is that electronically produced documents actually are not objects at all but rather, by their nature, products that have to be processed each time they are used. There is no transfer, no reading without a re-creation of the information. Furthermore, electronic records are at risk because of technical obsolescence as newer formats replace older ones.

ePADD does not address the issue of authenticity in this round of funding. This issue is definitely important and complicated and I would like to address it in the future.

Trevor: What lessons has the team learned so far about working with email archives? Are there any assumptions or thoughts you had about working with email as records that have evolved or changed while working on the project?

Peter: Conversion of old archived emails can be tricky. Even though normalization is not within the scope of ePADD, people still need to convert emails to mbox format before ePADD can work on them. One of our partners found headers missing from emails when looking at them using ePADD. The emails came from an old GroupWise system, were migrated into Outlook and were then converted to mbox. Is this a conversion error from converting GroupWise emails to Outlook? Or from converting the Outlook emails to mbox? Or from ePADD parsing the emails?

Attachment files come in diverse file formats. The ability to view files in attachments is an important feature for a system like ePADD. Apache OpenOffice can read files in ~50 file formats. On the other hand, QuickView Plus can view files in ~500 file formats. Should we integrate commercial software into ePADD in order to view files in the ~450 file formats which Apache OpenOffice is not capable of reading? If yes, ePADD will not be an open-source project anymore. If no, ePADD users have to face the fact that there are files they are not able to view.

Glynn: The sheer volume of data to review can be very daunting. The more specific the lexicon terms used for automated indexing of messages, the better. You want to discover messages that should be restricted but not have too many false positives to wade through during review.

The ability to process in bulk cannot be stressed too much. When performing actions on a set of messages (whether from a lexical result, a correspondent or a user-defined search), ePADD allows you to apply any action to that entire subset. You can also apply actions to original folders. For example, if messages are organized into a folder marked “human resources,” the archivist or donor may choose to flag all the messages in that folder as “restricted until 2050.”

Trevor: What are the next steps for the project? What sorts of things are you exploring for the future?

Peter: I would like to look at the topics/concepts exchanged in emails (and match them against the Library of Congress Subject Heading – Topical). It would be interesting to know what books and movies were mentioned in emails. Publishing extracted entities as linked open data is definitely one thing I would like to do as well. However, it all depends on funding.

Glynn: This is the fun part – envisioning what else is needed or desired in future iterations. It is, however, reliant on funding and collaboration. Input is needed across different types of institutions – museums, government, academic, corporate to name a few. While many of the use cases would be similar, there are unique aspects or goals for different institutions.

Over the past few weeks, we’ve taken part in the NDSA-sponsored meeting (see Chris Prom’s blog post) and held ePADD’s first Advisory Group meeting. These sparked some wonderful discussions and ideas about next steps for greater discovery, delivery and collaboration.

There is a definite need in the profession to begin defining and documenting use cases, to analyze and document life cycles of email archives and existing tools in order to evaluate gaps and future needs, to further discovery through exporting correspondents and extracted entities from ePADD and publishing them with a dynamic search interface across archives. Other avenues we would like to explore are the ability to process and deliver other document types (beyond email), including social media.

The final delivery or access module is intended for reading room access, and we hope to provide more robust tools to allow user interaction with the archives. Additionally, we would like to offer data dumps for text mining/analysis or extractions of header information for social network analysis. Currently these are managed by correspondence through Special Collections.

One suggestion from our Advisory Group was to broaden use of ePADD before final release in the summer (2015). By allowing other repositories to use ePADD for processing we would expose more email collections for researcher use and hopefully get more feedback for the development and specification teams. This will be a better demonstration and test of the program. To this end we plan to release ePADD beyond our grant collaborators to other institutions that have already expressed strong interest.

Sudheendra: Glynn and Peter have answered your questions wonderfully, so I’ll just jump in with a little bit of speculation. In the last couple of decades, we’re seeing that a lot of our lives are reflected in our online activities, be it email, blogs, Facebook, Twitter or any other medium. A small example: a cousin of mine spent almost a year organizing a major dance performance for her daughter. She was reflecting on the effort, and exclaimed to me: “That was so much work. You know, I should save all those emails!” I think that is very telling. All of us have wonderful stories in our lives. There are moments of joy and exasperation, love and sorrow, accomplishment and failure, and they are often captured in our electronic communications. We should be able to preserve them, reflect on them, and hand them over to future generations. We already do this with photographs, which are wonderful. However, text-based communications are complementary to images because they capture thoughts, feelings and intentions in a way that images do not.

Unfortunately, the misuse of personal data for commercial or surveillance reasons is causing many people to be wary of preserving their own records, and even to go out of their way to delete them. This is a pity, because there is so much value buried in archives, if only users could keep their data under their own control, and have good tools with which to make sense of it. So in the next decade, I predict that individuals and families will routinely use tools like ePADD to preserve history important to them. We’re all archivists in that sense.

LITA: E-Learning in the Library

planet code4lib - Mon, 2014-10-20 12:00
Pixabay, 2008

Online education has extended its presence to public libraries. Online learning and career training, through services such as Ed2Go and Lynda, are usually offered free of charge to college and university students. Similar services such as Gale Courses, Universal Class and Treehouse are geared toward public library use.
Gale Courses is a subscription service of Cengage Learning. It is a hybrid of Ed2Go, offering courses that range from GED preparation to PC Security. Courses are six weeks in length and are instructor led.
Universal Class offers hundreds of courses on a variety of topics, including dog obedience training, to patrons of diverse interests. Courses are self-paced and users can begin a course at any time.
Treehouse is uniquely geared toward web design, development and programming for personal computers and mobile device applications. Users can select self-paced educational Tracks that are focused on a specific development area.

An alternative to MOOCs
A considerable population of the general public cannot afford to pursue a formal education. Extending the services of the library into web-based learning, online courses provide access to continuing education for the general public. The mention of free online education is not complete without a nod to massive open online courses (MOOCs). MOOCs can be non-profit or commercial. They offer free or affordable online education, of varying course structure, to students around the world. Though MOOCs and open courseware are comparable alternatives, library-hosted continuing education offers incentives beyond those of most freely available online courses.

Education as a service
One advantage to using a service provided by the public library is that patrons can use the computers available on site. Patrons lacking home computer access can thereby incorporate another library service into their education. Continuing education courses are free to library card holders at participating libraries. If your regional library does not offer the service, you can always purchase a library card from a participating library. Considering that each course can range from $50 to the mid $100s, the benefit of access to hundreds of courses outweighs the cost of purchasing a library card. Patrons will receive a certificate of completion for each completed course, and in the case of Universal Class they will receive continuing education units that are approved by the International Association for Continuing Education and Training (IACET). Treehouse opts for a point-based system and Badges, digital awards which signify a user’s progress. Online education also helps to highlight the public library as an evolving source of public information.
All three continuing education providers offer free trials and demo courses for anyone interested in their services.

Jonathan Rochkind: ActiveRecord Concurrency in Rails4: Avoid leaked connections!

planet code4lib - Mon, 2014-10-20 03:35

My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.

I’m in the middle of updating my app that uses multi-threaded concurrency in unusual ways to Rails4.   The good news is that the significant bugs I ran into in Rails 3.1 etc., reported in the earlier post, have been fixed.

However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.

Background: The ActiveRecord Concurrency Model

Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.

Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.

You can check out a connection explicitly using `checkout` and `checkin` methods. Or, better yet use the `with_connection` method to wrap database use.  So far so good.
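To make the checkout/checkin contract concrete without needing a database, here is a toy pool in plain Ruby. It is a stand-in and an assumption throughout: `ToyPool` is a hypothetical class that only borrows the method names mentioned above, and it says nothing about how ActiveRecord's ConnectionPool is actually implemented.

```ruby
# Toy stand-in for a connection pool; NOT ActiveRecord's implementation.
class ToyPool
  def initialize(size)
    @idle = Queue.new
    size.times { |i| @idle << "conn-#{i}" }
  end

  # Take a connection out of the pool (blocks until one is free).
  def checkout
    @idle.pop
  end

  # Hand a connection back so another thread can use it.
  def checkin(conn)
    @idle << conn
  end

  # Block-scoped use: the ensure clause checks the connection back in
  # even if the block raises.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end

  def idle_count
    @idle.size
  end
end

pool = ToyPool.new(2)
pool.with_connection { |conn| puts "using #{conn}, #{pool.idle_count} idle" }
puts "#{pool.idle_count} idle after the block"
```

The block form is the safer habit for the same reason `File.open` with a block is: the code that acquires the resource is also the code that guarantees its return.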

But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.

And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)

And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before.  A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.

And after the request had been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back connections, if any, that were checked out.

The danger of leaked connections

So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).

But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do.  If you proceed blithely to use AR like you are used to in Rails, but have created Threads yourself, then connections will be automatically checked out to you when needed…. and never checked back in.

The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.

And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it.  If you leak one connection like this, that’s one less connection available in the ConnectionPool.  If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.
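The failure mode described above can be reproduced in miniature with a plain Ruby `Queue` standing in for the pool. This is a sketch under assumptions: `leak_demo` is a hypothetical helper, the 0.1 second timeout is chosen only to keep the demo quick, and `Timeout::Error` stands in for the ConnectionTimeout that ActiveRecord raises.

```ruby
require "timeout"

# Toy pool: a Queue holding a few "connections". Stand-in only, not ActiveRecord.
def leak_demo(pool_size: 2, checkout_timeout: 0.1)
  pool = Queue.new
  pool_size.times { |i| pool << "conn-#{i}" }

  # Each thread takes a connection and then exits without returning it.
  pool_size.times.map { Thread.new { pool.pop } }.each(&:join)

  # Nothing will ever be pushed back, so the next checkout can only time out.
  begin
    Timeout.timeout(checkout_timeout) { pool.pop }
  rescue Timeout::Error
    :connection_timeout
  end
end

puts leak_demo.inspect
```

Nothing in the toy pool knows that the threads holding the connections are gone, which is exactly why a leaked connection stays lost unless something goes looking for it.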

In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would  go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed.   You could call this method from time to time yourself to try and clean up after yourself.
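The cross-referencing idea described above (compare checked-out connections against the threads that are still alive) can be sketched in a few lines. `CheckoutLedger` and its methods are hypothetical names for illustration; this is not the Rails 3.x source, only the shape of the technique.

```ruby
# Toy bookkeeping of which thread holds which connection. Not ActiveRecord code.
class CheckoutLedger
  def initialize
    @owners = {} # connection => Thread that checked it out
    @mutex = Mutex.new
  end

  # Note that the given thread now holds this connection.
  def record(conn, thread = Thread.current)
    @mutex.synchronize { @owners[conn] = thread }
  end

  # Return (and forget) the connections whose owning thread is no longer alive.
  def reclaim_stale
    @mutex.synchronize do
      stale = @owners.select { |_conn, thread| !thread.alive? }.keys
      stale.each { |conn| @owners.delete(conn) }
      stale
    end
  end
end

ledger = CheckoutLedger.new
Thread.new { ledger.record("conn-leaked") }.join # thread exits while "holding" it
ledger.record("conn-live")                       # held by the still-running main thread
puts ledger.reclaim_stale.inspect
```

Walking every checked-out connection under a lock is part of why this kind of sweep is costly on a busy pool.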

And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.

But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually.  As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.

So this makes it pretty important to avoid leaking connections.

(Note: There is still a method `clear_stale_cached_connections!` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup.  That it uses the same method name, I think, is based on misunderstanding by Rails devs of what it’s doing. See Fear the Reaper below. )

Monkey-patch AR to avoid leaked connections

I understand where Rails is coming from with the ‘implicit checkout’ thing.  For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).

So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.

The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection, that never gets checked back in, gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.

That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, so that code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).

So, here, in a gist, is a couple-dozen-line monkey patch to ActiveRecord that lets you, on a thread-by-thread basis, disable the “implicit checkout”.  Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:

    Thread.new do
      ActiveRecord::Base.forbid_implicit_checkout_for_thread!
      # stuff
    end

Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.

If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.

This way you can enforce your code to only use `with_connection` like it should.
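The fail-fast behavior described above can be illustrated with a toy in plain Ruby. To be clear about the assumptions: this is not the gist's code and not ActiveRecord's API. `ToyConnections`, its `connection` method and `ImplicitCheckoutForbidden` are hypothetical stand-ins; only the name `forbid_implicit_checkout_for_thread!` is borrowed from the text.

```ruby
# Toy illustration of a per-thread "no implicit checkout" switch.
# Hypothetical names; not the gist's code or ActiveRecord's API.
class ImplicitCheckoutForbidden < StandardError; end

class ToyConnections
  def initialize
    @mutex = Mutex.new
    @held = {} # Thread => connection currently checked out to it
  end

  # Mark the calling thread so that implicit checkouts raise instead.
  def forbid_implicit_checkout_for_thread!
    Thread.current.thread_variable_set(:forbid_implicit_checkout, true)
  end

  # Explicit, block-scoped checkout: always allowed, always checked back in.
  def with_connection
    conn = @mutex.synchronize { @held[Thread.current] = "conn-#{Thread.current.object_id}" }
    yield conn
  ensure
    @mutex.synchronize { @held.delete(Thread.current) }
  end

  # What model code would call under the hood: reuse a held connection, or
  # check one out implicitly unless this thread has forbidden that.
  def connection
    @mutex.synchronize do
      return @held[Thread.current] if @held.key?(Thread.current)
      if Thread.current.thread_variable_get(:forbid_implicit_checkout)
        raise ImplicitCheckoutForbidden, "wrap this call in with_connection"
      end
      @held[Thread.current] = "conn-#{Thread.current.object_id}"
    end
  end
end

connections = ToyConnections.new
Thread.new do
  connections.forbid_implicit_checkout_for_thread!
  connections.with_connection { connections.connection } # fine: explicit checkout
  begin
    connections.connection                                # implicit: fails fast
  rescue ImplicitCheckoutForbidden => e
    puts "caught: #{e.message}"
  end
end.join
```

The flag is stored with `thread_variable_set` rather than `Thread.current[]` because the latter is fiber-local in Ruby, and the switch is meant to cover the whole thread.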

Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.

DO fear the Reaper

In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections.  In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”

The problem is, as far as I can tell by reading the code, it simply does not do this.

What does the reaper do?  As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.

A leaked connection hasn’t necessarily dropped its network connection. That really depends on the database and its settings; most databases will drop unused connections after a certain idle timeout, by default often hours long.  A leaked connection probably hasn’t yet had its network connection closed, and a properly checked out not-leaked connection can have its network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).

The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network, not checked out but never checked in leaked connections). Dropped network is a legit problem you want to be handled gracefully; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, I have not put it through its paces myself). But it’s got nothing to do with leaked connections.

Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)

So, yeah, Rails4 has redefined the existing `clear_stale_cached_connections!` method to do something entirely different than it did in Rails3, and it’s triggered in entirely different circumstances. Yeah, kind of confusing.

Oh, maybe fear ruby 1.9.3 too

When I was working on upgrading the app I’m working on, I was occasionally getting a mysterious deadlock exception:

ThreadError: deadlock; recursive locking:

In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, that my errors resulted in that exception rather than a more meaningful one may possibly have been a bug in ruby 1.9.3 that’s fixed in ruby 2.0.

If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.

Can you use an already loaded AR model without a connection?

Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?

Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.

Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.

But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).

Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection, it doesn’t really consider that one way or another.

Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.

Concurrency Patterns to Avoid in ActiveRecord?

Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).

So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.

If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.

I think you’re probably fairly safe doing that too, and it is the way background task pools are often set up.

That’s not what my app does.  I wouldn’t necessarily design my app the same way today if I was starting from scratch (the app was originally written for Rails 1.0, which gives you a sense of how old some of its design choices are; although the concurrency related stuff really only dates from relatively recent rails 2.1 (!)).

My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote APIs, which is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communications through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)

So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine, just wrap all the AR access in a `with_connection`.  But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and just parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.

I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point,  I’d recommend avoiding using ActiveRecord concurrency this way.

What to do?

What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.

At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.”  DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).

Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).

I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel).  I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.

So what would I do differently? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other safer main thread, convert it to a pure business object without any AR, and pass that into my worker threads.  Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for its entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated threadpool of writers.
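
The shape of that design can be sketched with nothing but Ruby's built-in `Thread` and `Queue`. This is only an illustration under stated assumptions, not code from my app: `run_pipeline`, the `STOP` sentinel, and the block standing in for the ActiveRecord write are all hypothetical names, and the "HTTP work" is faked with a string operation.

```ruby
# Sentinel pushed onto the queue to tell a writer thread to exit (hypothetical name).
STOP = :stop

# Runs one short-lived worker thread per input and a fixed pool of writer threads.
# The block is a stand-in for whatever would persist a result; in a real app
# each writer would hold one database connection for its whole lifetime.
def run_pipeline(inputs, writer_count: 2, &persist)
  queue = Queue.new # Queue is thread-safe; pop blocks until an item arrives

  writers = Array.new(writer_count) do
    Thread.new do
      while (item = queue.pop) != STOP
        persist.call(item)
      end
    end
  end

  workers = inputs.map do |input|
    Thread.new do
      # Stand-in for slow HTTP I/O. The worker only sees a plain value,
      # never a database-backed model, and only ever talks to the queue.
      queue << { input: input, result: input.to_s.upcase }
    end
  end

  workers.each(&:join)                 # wait for all work to be queued
  writer_count.times { queue << STOP } # one sentinel per writer
  writers.each(&:join)                 # wait for the queue to drain
end

saved = []
lock = Mutex.new
run_pipeline(%w[a b c]) { |row| lock.synchronize { saved << row } }
```

The point of the sketch is that the number of workers can vary freely while the number of threads touching the database stays fixed at `writer_count`.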

That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I use now, which it officially supports but isn’t very happy about.


Jason Ronallo: A Plugin For Mediaelement.js For Preview Thumbnails on Hover Over the Time Rail Using WebVTT

planet code4lib - Mon, 2014-10-20 01:18

The time rail or progress bar on video players gives the viewer some indication of how much of the video they’ve watched, what portion of the video remains to be viewed, and how much of the video is buffered. The time rail can also be clicked on to jump to a particular time within the video. But figuring out where in the video you want to go can feel kind of random. You can usually hover over the time rail and move from side to side and see the time that you’d jump to if you clicked, but who knows what you might see when you get there.

Some video players have begun to use the time rail to show video thumbnails on hover in a tooltip. For most videos these thumbnails give a much better idea of what you’ll see when you click to jump to that time. I’ll show you how you can create your own thumbnail previews using HTML5 video.

TL;DR Use the time rail thumbnails plugin for Mediaelement.js.

Archival Use Case

We usually follow agile practices in our archival processing. This style of processing became popularized by the article More Product, Less Process: Revamping Traditional Archival Processing by Mark A. Greene and Dennis Meissner. For instance, we don’t read every page of every folder in every box of every collection in order to describe it well enough for us to make the collection accessible to researchers. Over time we may decide to make the materials for a particular collection or parts of a collection more discoverable by doing the work to look closer and add more metadata to our description of the contents. But we try not to let the perfect be the enemy of the good enough. Our goal is to make the materials accessible to researchers and not hidden in some box no one knows about.

Some of our collections of videos are highly curated, such as our video oral histories. We’ve created transcripts for the whole video. We extract the most interesting or on-topic clips. For each of these video clips we create a WebVTT caption file and an interface to navigate within the video from the transcript.

At NCSU Libraries we have begun digitizing more archival videos. And for these videos we’re much more likely to treat them like other archival materials. We’re never going to watch every minute of every video about cucumbers or agricultural machinery in order to fully describe the contents. Digitization gives us some opportunities to automate the summarization that would be manually done with physical materials. Many of these videos don’t even have dialogue, so even when automated video transcription is more accurate and cheaper we’ll still be left with only the images. In any case, the visual component is a good place to start.

Video Thumbnail Previews

When you hover over the time rail on some video viewers, you see a thumbnail image from the video at that time. YouTube does this for many of its videos. I first saw that this would be possible with HTML5 video when I saw the JW Player page on Adding Preview Thumbnails. From there I took the idea to use an image sprite and a WebVTT file to structure which media fragments from the sprite to use in the thumbnail preview. I’ve implemented this as a plugin for Mediaelement.js. You can see detailed instructions there on how to use the plugin, but I’ll give the summary here.

1. Create an Image Sprite from the Video

This uses ffmpeg to take a snapshot every 5 seconds in the video and then uses montage (from ImageMagick) to stitch them together into a sprite. This means that only one file needs to be downloaded before you can show the preview thumbnail.

    ffmpeg -i "video-name.mp4" -f image2 -vf fps=fps=1/5 video-name-%05d.jpg
    montage video-name*jpg -tile 5x -geometry 150x video-name-sprite.jpg

2. Create a WebVTT metadata file

This is just a standard WebVTT file except the cue text is metadata instead of captions. The URL is to an image and uses a spatial Media Fragment for what part of the sprite to display in the tooltip.
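
Each cue's text pairs the sprite URL with a spatial fragment of the form `#xywh=x,y,width,height`. Writing those by hand gets tedious, so here is a minimal Ruby sketch of generating such a file. It is an illustration under assumptions, not part of the plugin: the `thumbnail_vtt` and `vtt_time` helpers and the example.com URL are hypothetical, the 5-column layout mirrors the `-tile 5x` option above, and the 100 pixel height is assumed (montage derives the real height from the video's aspect ratio).

```ruby
# Format a whole number of seconds as a WebVTT timestamp (HH:MM:SS.mmm).
def vtt_time(seconds)
  format("%02d:%02d:%02d.000", seconds / 3600, (seconds % 3600) / 60, seconds % 60)
end

# Build a WebVTT metadata file with one cue per thumbnail in the sprite.
# Thumbnails are assumed to be laid out left to right, top to bottom.
def thumbnail_vtt(sprite_url:, count:, interval: 5, columns: 5, width: 150, height: 100)
  cues = Array.new(count) do |i|
    x = (i % columns) * width   # column offset within the sprite
    y = (i / columns) * height  # row offset within the sprite
    "#{vtt_time(i * interval)} --> #{vtt_time((i + 1) * interval)}\n" \
      "#{sprite_url}#xywh=#{x},#{y},#{width},#{height}"
  end
  # WebVTT requires the header first and a blank line between cues.
  (["WEBVTT"] + cues).join("\n\n") + "\n"
end

puts thumbnail_vtt(sprite_url: "http://example.com/video-name-sprite.jpg", count: 6)
```

With six thumbnails, the sixth cue wraps onto the second row of the sprite, so its fragment starts at x=0, y=100.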

    WEBVTT

    00:00:00.000 --> 00:00:05.000,0,150,100
    00:00:05.000 --> 00:00:10.000,0,150,100
    00:00:10.000 --> 00:00:15.000,0,150,100
    00:00:15.000 --> 00:00:20.000,0,150,100
    00:00:20.000 --> 00:00:25.000,0,150,100
    00:00:25.000 --> 00:00:30.000,100,150,100

3. Add the Video Thumbnail Preview Track

Put the following within the <video> element.

    <track kind="metadata" class="time-rail-thumbnails" src=""></track>

4. Initialize the Plugin

The following assumes that you’re already using Mediaelement.js, jQuery, and have included the vtt.js library.

    $('video').mediaelementplayer({
      features: ['playpause','progress','current','duration','tracks','volume', 'timerailthumbnails'],
      timeRailThumbnailsSeconds: 5
    });

The Result

Your browser won’t play an MP4. You can [download it instead](/video/mep-feature-time-rail-thumbnails-example.mp4).

See Bug Sprays and Pets with sound.


The plugin can either be installed using the Rails gem or the Bower package.


One of the DOM API features I hadn’t used before is MutationObserver. One thing the thumbnail preview plugin needs to do is know what time is being hovered over on the time rail. I could have calculated this myself, but I wanted to rely on MediaElement.js to provide the information. Maybe there’s a callback in MediaElement.js for when this is updated, but I couldn’t find it. Instead I use a MutationObserver to watch for when MediaElement.js changes the DOM for the default display of a timestamp on hover. Looking at the time code there then allows the plugin to pick the correct cue text to use for the media fragment. MutationObserver is more performant than the now deprecated MutationEvents. I’ve experienced very little latency using a MutationObserver which allows it to trigger lots of events quickly.

The plugin currently only works in the browsers that support MutationObserver, which is most current browsers. In browsers that do not support MutationObserver the plugin will do nothing at all and just show the default timestamp on hover. I’d be interested in other ideas on how to solve this kind of problem, though it is nice to know that plugins that rely on another library have tools like MutationObserver around.

Other Caveats

This plugin is brand new and works for me, but there are some caveats. All the images in the sprite must have the same dimensions. The durations for each thumbnail must be consistent. The timestamps currently aren’t really used to determine which thumbnail to display; instead, the lookup is faked by relying on the consistent durations. The plugin just does some simple arithmetic and plucks out the correct thumbnail from the array of cues. Hopefully in future versions I can address some of these issues.
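
To make that shortcut concrete, here is the kind of lookup involved, sketched in Ruby so it can be run outside a browser. This is not the plugin's actual code (the plugin is JavaScript), and `cue_index` is a hypothetical name; the only thing taken from the plugin is the idea of a fixed number of seconds per thumbnail, as set by `timeRailThumbnailsSeconds`.

```ruby
# Pick which cue to show for a hovered time, assuming every cue covers
# the same number of seconds and the cues are stored in order.
def cue_index(hover_seconds, seconds_per_thumbnail, cue_count)
  index = (hover_seconds / seconds_per_thumbnail).floor
  # Clamp so a hover at or past the end of the video still maps to the last cue.
  index.clamp(0, cue_count - 1)
end

cues = %w[cue0 cue1 cue2 cue3 cue4 cue5]
puts cues[cue_index(12.4, 5, cues.length)]
```

Because the cue timestamps are never consulted, a WebVTT file with uneven durations would show the wrong thumbnail, which is why the consistent-duration caveat matters.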


With this feature available for our digitized video, we’ve already found things in our collection that we wouldn’t have seen before. You can see how a “Profession with a Future” evidently involves shortening your life by smoking (at about 9:05). I found a spinning spherical display of Soy-O and synthetic meat (at about 2:12). Some videos switch between black & white and color, which you wouldn’t know just from the poster image. And there are some videos, like talking heads, that appear from the thumbnails to have no surprises at all. But maybe you like watching boiling water for almost 13 minutes.

OK, this isn’t really a discovery in itself, but it is fun to watch a head banging JFK as you go back and forth over the time rail. He really likes milk. And Eisenhower had a different speaking style.

You can see this in action for all of our videos on the NCSU Libraries’ Rare & Unique Digital Collections site and make your own discoveries. Let me know if you find anything interesting.

Preview Thumbnail Sprite Reuse

Since we already had the sprite images for the time rail hover preview, I created another interface to allow a user to jump through a video. Under the video player is a control button that shows a modal with the thumbnail sprite. The sprite alone provides a nice overview of the video that allows you to see very quickly what might be of interest. I used an image map so that the rather large sprite images would only have to be in memory once. (Yes, image maps are still valid in HTML5 and have their legitimate uses.) jQuery RWD Image Maps allows the map area coordinates to scale up and down across devices. Hovering over a single thumb will show the timestamp for that frame. Clicking a thumbnail will set the current time for the video to be the start time of that section of the video. One advantage of this feature is that it doesn’t require the kind of fine motor skill necessary to hover over the video player time rail and move back and forth to show each of the thumbnails.

This feature was added and deployed to production just this week, so I’m looking for feedback on whether folks find this useful, how to improve it, and any bugs that are encountered.

Summarization Services

I expect that automated summarization services will become increasingly important for researchers as archives do more large-scale digitization of physical collections and collect more born digital resources in bulk. We’re already seeing projects like fondz, which autogenerates archival description by extracting the contents of born digital resources. At NCSU Libraries we’re working on other ways to summarize the metadata we create as we ingest born digital collections. As we learn more about which summarization services and interfaces are useful for researchers, I hope to see more work done in this area. And this is just the beginning of what we can do with summarizing archival video.

Tara Robertson: Digesting the Gender and Sexuality in Information Studies Colloquium

planet code4lib - Mon, 2014-10-20 00:57

Most of the conferences I go to are technology ones that are focused on practical applications and knowledge sharing on how we have solved specific technical problems or figured out new, more efficient ways to do old things. It’s been a long time since I’ve been to a conference that’s about broader ideas and a much longer time since I’ve been to an academic conference. This was outside my comfort zone and it was an extremely worthwhile experience.

I was unbelievably excited to see the program for the first Gender and Sexuality in Information Studies colloquium. Also, Emily Drabinski and Lisa Sloniowski were involved, so I knew it was going to be great.

There were 100 attendees. I’d estimate that library and information studies professors and PhD students made up 50%, library school grad students made up 25%, and the other 25% of us were practitioners, who work almost exclusively in academic settings. The conference participants had the best selection of glasses, and I was inspired to document some of them.

The program was great and I had a very hard time picking which of the 3 streams I wanted to attend. A few people scampered between rooms to catch papers in different streams. Program highlights for me were the panel on porn in the library and the panel on gender and content. My thoughts on the porn in the library panel became a bit long, so I’ll post those tomorrow.

In my opinion it was a shame that most of the presenters defaulted to a traditional academic style of conference presentation, that is, they stood at the front of the room and read their papers to the audience without making much eye contact. For me the language was sometimes unnecessarily dense, and many of the theoretical concepts discussed would’ve been more successful if expressed in plain English.

I was also disappointed that there wasn’t a plan to post the papers online. Lisa explained to me that for those librarians and scholars in a university environment publications are important to tenure and promotion. Conference presentations count, but not as much as peer reviewed publications, which don’t count as much as book publications. I know there’s a plan in the works for an edition of Library Trends that will be published in 2 years. Also, I know from the interest on Twitter that there are many people who weren’t able to travel to Toronto and attend in person who are very hungry to read these papers. For the technology conferences I go to it is standard to share as much as possible: to livestream the conference, to archive the Twitter stream, to post presentations online, and to make code public too. I hope that most of the presenters will figure out a way to share their work openly without it costing them in academic prestige. There’s got to be a way to do this.

There was a really magical feeling at this first colloquium on gender and sexuality in LIS. Everyone brought their smarts, ideas and generous spirits. I think a lot of us have been starved for this kind of environment, engagement and community.

My brain, heart and sinuses are full. I’m exhausted and heading home to Vancouver. This one day of connections and ideas will keep me going for another year. Kudos to the organizers Emily Drabinski, Patrick Keilty and Litwin Books for organizing this. I’m hungry for more.

Galen Charlton: Tips and tricks for leaking patron information

planet code4lib - Sat, 2014-10-18 00:39

Here is a partial list of various ways I can think of to expose information about library patrons and their search and reading history by use (and misuse) of software used or recommended by libraries.

  • Send a patron’s ebook reading history to a commercial website…
    • … in the clear, for anybody to intercept.
  • Send patron information to a third party…
    • … that does not have an adequate privacy policy.
    • … that has an adequate privacy policy but does not implement it well.
    • … that is sufficiently remote that libraries lack any leverage to punish it for egregious mishandling of patron data.
  • Use an unencrypted protocol to enable a third-party service provider to authenticate patrons or look them up…
    • … such as SIP2.
    • … such as SIP2, with the patron information response message configured to include full contact information for the patron.
    • … or many configurations of NCIP.
    • … or web services accessible over HTTP (as opposed to HTTPS).
  • Store patron PINs and passwords without encryption…
    • … or using weak hashing.
  • Store the patron’s Social Security Number in the ILS patron record.
  • Don’t require HTTPS for a patron to access her account with the library…
    • … or if you do, don’t keep up to date with the various SSL and TLS flaws announced over the years.
  • Make session cookies used by your ILS or discovery layer easy to snoop.
  • Use HTTP at all in your ILS or discovery layer – as oddly enough, many patrons will borrow the items that they search for.
  • Send an unencrypted email…
    • … containing a patron’s checkouts today (i.e., an email checkout receipt).
    • … reminding a patron of his overdue books – and listing them.
    • … listing the titles of the patron’s available hold requests.
  • Don’t encrypt connections between an ILS client program and its application server.
  • Don’t encrypt connections between an ILS application server and its database server.
  • Don’t notice that a rootkit has been running on your ILS server for the past six months.
  • Don’t notice that a keylogger has been running on one of your circulation PCs for the past three months.
  • Fail to keep up with installing operating system security patches.
  • Use the same password for the circulator account used by twenty circulation staff (and 50 former circulation staff) – and never change it.
  • Don’t encrypt your backups.
  • Don’t use the feature in your ILS to enable severing the link between the record of a past loan and the specific patron who took the item out…
    • … sever the links, but retain database backups for months or years.
  • Don’t give your patrons the ability to opt out of keeping track of their past loans.
  • Don’t give your patrons the ability to opt in to keeping track of their past loans.
  • Don’t give the patron any control or ability to completely sever the link between her record and her past circulation history whenever she chooses to.
  • When a patron calls up asking “what books do I have checked out?” … answer the question without verifying that the patron is actually who she says she is.
  • When a parent calls up asking “what books does my teenager have checked out?”… answer the question.
  • Set up your ILS to print out hold slips… that include the full name of the patron. For bonus points, do this while maintaining an open holds shelf.
  • Don’t shred any circulation receipts that patrons leave behind.
  • Don’t train your non-MLS staff on the importance of keeping patron information confidential.
  • Don’t give your MLS staff refreshers on professional ethics.
  • Don’t shut down library staff gossiping about a patron’s reading preferences.
  • Don’t immediately sack a library staff member caught misusing confidential patron information.
  • Have your ILS or discovery interface hosted by a service provider that makes one or more of the mistakes listed above.
  • Join a committee writing a technical standard for library software… and don’t insist that it take patron privacy into account.

Do you have any additions to the list? Please let me know!

Of course, I am not actually advocating disclosing confidential information. Stay tuned for a follow-up post.

Harvard Library Innovation Lab: Link roundup October 17, 2014

planet code4lib - Fri, 2014-10-17 21:23

This is the good stuff.

UNIX: Making Computers Easier To Use — AT&T Archives film from 1982, Bell Laboratories

Love the idea that UNIX and computing should be social. Building things, together.

Digital Public Library of America » GIF IT UP

The @DPLA @digitalnz GIF IT UP competition is the funnest thing in libraries right now. Love it.

physical-web/ at master · google/physical-web

URLs emitted from physical world devices. This is the right way to think about phone/physical world interfaces.

Forty Portraits in Forty Years –

Gotta love the Brown sisters. Photos from our archives are neat. Stitching together a time lapse would be amazing.

Peter Thiel Thinks We All Can Do Better | On Point with Tom Ashbrook


Roy Tennant: A Tale of Two Records

planet code4lib - Fri, 2014-10-17 20:46

Image courtesy Wikipedia, public domain.

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of BIBFRAME,
it was the season of RDA,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to metadata Heaven,
we were all going direct the other way–
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.

There were a MARC with a large set of tags and an RDA with a plain face, on the throne of library metadata; there were a with a large following and a JSON-LD with a fair serialization, on the throne of all else. In both camps it was clearer than crystal to the lords of the Library preserves of monographs and serials, that things in general were settled for ever.


But they weren’t. Oh were they not. It mayhaps would have been pleasant, back in 2014, to have settled everything for all time, but such things were not to be.

The library guilds united behind the RDA wall, where they frantically ran MARC records through the furnace to forge fresh new records of RDA, employed to make the wall ever thicker and higher.

The Parliamentary Library assaulted their ramparts with the BIBFRAME, but the stones flung by that apparatus were insufficient to breach the wall of RDA.

Meanwhile, the vast populace in neither camp employed, to garner the attention of the monster crawlers and therefore their many minions, ignoring the internecine squabbles over arcane formats.

Eventually warfare settled down to a desultory, almost emotionless flinging of insults and the previous years of struggle were rendered meaningless.

So now we, the occupants of mid-century modernism, are left to contemplate the apparent fact that formats never really mattered at all. No, dear reader, they never did. What mattered was the data, and the parsing of it, and its ability to be passed from hand to hand without losing meaning or value.

One wonders what those dead on the Plain of Standards would say if they could have lived to see this day.


My humble and abject apologies to Mr. Charles Dickens, for having been so bold as to damage his fine work with my petty scribblings.

District Dispatch: Free webinar: Giving legal advice to patrons

planet code4lib - Fri, 2014-10-17 20:05

Reference librarian assisting readers. Photo by the Library of Congress.

Every day, public library staff are asked to answer legal questions. Since these questions are often complicated and confusing, and because there are frequent warnings about not offering legal advice, reference staff may be uncomfortable addressing legal reference questions. To help reference staff build confidence in responding to legal inquiries, the American Library Association (ALA) and iPAC will host the free webinar “Connecting Patrons with Legal Information” on Wednesday, November 12, 2014, from 2:00–3:00 p.m. EDT.

The session will offer information on laws, legal resources and legal reference practices. Participants will learn how to handle a law reference interview, including where to draw the line between information and advice, key legal vocabulary and citation formats. During the webinar, leaders will offer tips on how to assess and choose legal resources for patrons. Register now as space is limited.

Catherine McGuire, head of Reference and Outreach at the Maryland State Law Library, will lead the free webinar. McGuire currently plans and presents educational programs to Judiciary staff, local attorneys, public library staff and members of the public on subjects related to legal research and reference. She currently serves as Vice Chair of the Conference of Maryland Court Law Library Directors and the co-chair of the Education Committee of the Legal Information Services to the Public Special Interest Section (LISP-SIS) of the American Association of Law Libraries (AALL).

Webinar: Connecting Patrons with Legal Information
Date: Wednesday, November 12, 2014
Time: 2:00–3:00 p.m. EDT

The archived webinar will be emailed to District Dispatch subscribers.

The post Free webinar: Giving legal advice to patrons appeared first on District Dispatch.

OCLC Dev Network: A Close Look at the WorldCat::Discovery Ruby Gem

planet code4lib - Fri, 2014-10-17 18:15

This is the third installment in our deep dive series on the WorldCat Discovery API. This week we will be taking a close look at some of the demo code we have written ourselves to exercise the API throughout its development process. We have decided to share our work through our OCLC Developer Network account.

pinboard: NCSA Brown Dog

planet code4lib - Fri, 2014-10-17 16:53
@todrobbins saw you were asking about browndog in #code4lib did you find this already?

OCLC Dev Network: Systems Maintenance on October 19

planet code4lib - Fri, 2014-10-17 16:30

Web services that require user level authentication will be down for Identity Management system (IDM) updates beginning Sunday, October 19th, at 3:00 am local data center time.


HangingTogether: Evolving Scholarly Record workshop (Part 2)

planet code4lib - Fri, 2014-10-17 16:00

This is the second of three posts about the workshop.

Part 1 introduced the Evolving Scholarly Record framework.  This part summarizes the two plenary discussions.

Research Records and Artifact Ecologies
Natasa Miliç-Frayling, Principal Researcher, Microsoft Research Cambridge

Natasa illustrated the diversity and complexity of digital research information comparing it to a rainbow and asking how do we preserve a rainbow?  She began with the question, How can we support the reuse of scientific data, tools, and resources to facilitate new scientific discoveries?  We need to take a sociological point of view because scientific discovery is a social enterprise within communities of practice – and the information takes a complex journey from the lab to the paper, evolving en route.  When teams consist of distributed scientists, notions of ownership and sharing are challenged.  We need to be attuned to the interplay between technology and collaborative practices as it affects the information artifacts.

Natasa encouraged a shift in thinking from the record to the ecology, as she shared her study of the artifacts ecology of a particular nanotechnology endeavor.  Their ecosystem has electronic lab books, includes tools, ingests sensor data, and incorporates analysis and interpretation.  This ecosystem provides context for understanding the data and other artifacts, but scientists want help linking these artifacts and overcoming limitations of physical interaction.  They want content extraction and format transformation services.  They want to create project maps and overviews to support their work in order to convey meaning to guide third party reuse of the artifacts.  Preservation is not just persistence; it requires a connection with the contemporary ecosystem.  A file and an application can persist and be completely unusable.  They need to be processed and displayed to be experienced and this requires preserving them in their original state and virtualising the old environments on future platforms.  She acknowledged the challenges in supporting research, but implored libraries to persevere.

A Perspective on Archiving the Evolving Scholarly Record
Herbert Van de Sompel, Scientist, Los Alamos National Laboratory

Herbert took a web-focused view, saying that not only is nearly everything digital, it is nearly all networked, which must be taken into account when we talk about archiving.  His presentation reflected thinking in progress with Andrew Treloar, of the Australian National Data Service.  Herbert highlighted the “collect” and “fix” roles and how the materials will be obtained by archives.  He used Roosendaal and Geurts’s functions of scholarly communication to structure his talk: Registration (the claim, with its related objects), Certification (peer review and other validation), Awareness (alerts and discovery of new claims), and Archiving (preserving over time), emphasizing that there is no scholarly record without archiving.  The four functions had been integrated in print journal publishing, but now the functions are disaggregated and distributed among many entities.

Herbert then characterized the future environment as the Web of Objects.  Scholarly communication is becoming more visible, continuous, informal, instant, and content-driven.  As a result, research objects are more varied, compound, diverse, networked, and open.  He discussed several challenges this presents to libraries.  Archiving must take into account that objects are often hosted on common web platforms (e.g., GitHub, SlideShare, WordPress), which are not necessarily dedicated to scholarship.  We archive only 50% of journal articles and they tend to be the easy, low-risk titles.  “Web at Large” resources are seldom archived.   Today’s approach to archiving focuses on atomic objects and loses context.  We need to move toward archiving compound objects in various states of flux, as resources on the web rather than as files in file systems.  He distinguished between recording (short-term, no guarantees, many copies, and tied to the scholarly process) and archiving (longer-term, guarantees, one copy, and part of the scholarly record).  Curatorial decisions need to be made to transfer materials from the recording infrastructures to an archival infrastructure through collaborations, interoperability, and web-scale processes.

Part 3 will summarize the breakout discussions.

About Ricky Erway

Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.
