Feed aggregator

DPLA: DPLA and the Digital Library Federation team up to offer special DPLAfest travel grants

planet code4lib - Tue, 2015-02-10 16:00

The Digital Public Library of America (DPLA)  is excited to work with the Digital Library Federation (DLF) program of the Council on Library and Information Resources to offer DPLA + DLF Cross-Pollinator Travel Grants. The purpose of these grants is to extend the opportunity to attend DPLAfest 2015 to four DLF community members. Successful applicants should be able to envision and articulate a connection between the DLF community and the work of DPLA.

We believe that the sustainability of large-scale national efforts requires robust community support. Connecting DPLA’s work to the energetic and talented DLF community is a positive way to increase serendipitous collaboration around this shared digital platform.

The goal of the DPLA + DLF Travel Grants is to bring cross-pollinators to the conference: active DLF community contributors who can provide unique perspectives on our work and share the vision of DPLA from their own perspective. By teaming up with the DLF to provide travel grants, we hope to engage DLF community members and connect them to exciting areas of growth and opportunity at DPLA.

The travel grants include DPLAfest 2015 conference registration, travel costs, meals, and lodging in Indianapolis.

The DPLA + DLF Cross-Pollinator Travel Grants are the first in a series of collaborations between CLIR/DLF and DPLA.


Four awards of up to $1,250 each to go towards the travel, board, and lodging expenses of attending DPLAfest 2015. Additionally, the grantees will each receive a complimentary full registration to the event ($75). Recipients will be required to write a blog post about their experience subsequent to DPLAfest; this blog post will be co-published by DLF and DPLA.


Applicants must be a staff member of a current DLF member organization and not currently working on a DPLA hub team.


Send an email by March 5th, 5 pm EDT, containing the following items (in one document) to, with the subject “DPLAFest Travel Grant: [Full Name]”:

  • Cover letter of nomination from the candidate’s supervisor/manager or an institutional executive, including acknowledgement that candidate would not have been funded to attend DPLAfest.
  • Personal statement from the candidate (ca. 500 words) explaining their educational background, what their digital library/collections involvement is, why they are excited about digital library/collections work, and how they see themselves benefiting from and participating in DPLAfest.
  • A current résumé.

Applications may be addressed to the DPLA + DLF Cross-Pollinator Committee.


Candidates will be selected by DPLA and DLF staff. In assessing the applications, we will look for a demonstrated commitment to digital work, and will consider the degree to which participation might enhance communication and collaboration between the DLF and DPLA communities. Applicants will be notified of their status no later than March 16, 2015.

These fellowships are generously supported by the Council on Library and Information Resources’ Digital Library Federation program.

David Rosenthal: The Evanescent Web

planet code4lib - Tue, 2015-02-10 16:00
Papers drawing attention to the decay of links in academic papers have quite a history; I blogged about three relatively early ones six years ago. Now Martin Klein and a team from the Hiberlink project have taken the genre to a whole new level with a paper in PLoS One entitled Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. Their dataset is 2-3 orders of magnitude bigger than those of previous studies, their methods are far more sophisticated, and they study both link rot (links that no longer resolve) and content drift (links that now point to different content). There's a summary on the LSE's blog.

Below the fold, some thoughts on the Klein et al paper.

As regards link rot, they write:
In order to combat link rot, the Digital Object Identifier (DOI) was introduced to persistently identify journal articles. In addition, the DOI resolver for the URI version of DOIs was introduced to ensure that web links pointing at these articles remain actionable, even when the articles change web location.

But even when used correctly, such as, DOIs introduce a single point of failure. This became obvious on January 20th when the domain name briefly expired. DOI links all over the Web failed, illustrating yet another fragility of the Web. It hasn't been a good time for access to academic journals for other reasons either. Among the publishers unable to deliver content to their customers in the last week or so were Elsevier, Springer, Nature, HighWire Press and Oxford Art Online.

I've long been a fan of Herbert van de Sompel's work, especially Memento. He's a co-author on the paper and we have been discussing it. Unusually, we've been disagreeing. We completely agree on the underlying problem of the fragility of academic communication in the Web era as opposed to its robustness in the paper era. Indeed, in the introduction of another (but much less visible) recent paper entitled Towards Robust Hyperlinks for Web-Based Scholarly Communication Herbert and his co-authors echo the comparison between the paper and Web worlds from the very first paper we published on the LOCKSS system a decade and a half ago. Nor am I critical of the research underlying the paper, which is clearly of high quality and which reveals interesting and disturbing properties of Web-based academic communication. All I'm disagreeing with Herbert about is the way the research is presented in the paper.

My problem with the presentation is that this paper, which has a far higher profile than other recent publications in this area, and which comes at a time of unexpectedly high visibility for web archiving, seems to me to be excessively optimistic, and to fail to analyze the roots of the problem it is addressing. It thus fails to communicate the scale of the problem.

The paper is, for very practical reasons of publication in a peer-reviewed journal, focused on links from academic papers to the web-at-large. But I see it as far too optimistic in its discussion of the likely survival of the papers themselves, and the other papers they link to (see Content Drift below). I also see it as far too optimistic in its discussion of proposals to fix the problem of web-at-large references that it describes (see Dependence on Authors below).

All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself (see Getting Links Right below) that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.

In addition, the paper fails to even mention the elephant in the room, the fact that both the papers and the web-at-large content are copyright. The archives upon which the proposed web-at-large solutions rest, such as the Internet Archive, are themselves fragile. Not just for the normal economic and technical reasons we outlined nearly a decade ago, but because they operate under the DMCA's "safe harbor" provision and thus must take down content upon request from a claimed copyright holder. The archives such as Portico and LOCKSS that preserve the articles themselves operate instead with permission from the publisher, and thus must impose access restrictions.

This is the root of the problem. In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.

None of this is to suggest that developing and deploying partial solutions is a waste of time. It is what I've been doing the last quarter of my life. There cannot be a single comprehensive technical solution. The best we can do is to combine a diversity of partial solutions. But we need to be clear that even if we combine everything anyone has worked on we are still a long way from solving the problem. Now for some details.

Content Drift

As regards content drift, they write:
Content drift is hardly a matter of concern for references to journal articles, because of the inherent fixity that, especially PDF-formated, articles exhibit. Nevertheless, special-purpose solutions for long-term digital archiving of the digital journal literature, such as LOCKSS, CLOCKSS, and Portico, have emerged to ensure that articles and the articles they reference can be revisited even if the portals that host them vanish from the web. More recently, the Keepers Registry has been introduced to keep track of the extent to which the digital journal literature is archived by what memory organizations. These combined efforts ensure that it is possible to revisit the scholarly context that consists of articles referenced by a certain article long after its publication.

While I understand their need to limit the scope of their research to web-at-large resources, the last sentence is far too optimistic.

First, research using the Keepers Registry and other resources shows that at most 50% of all articles are preserved. So future scholars depending on archives of digital journals will encounter large numbers of broken links.

Second, even the 50% of articles that are preserved may not be accessible to a future scholar. CLOCKSS is a dark archive and is not intended to provide access to future scholars unless the content is triggered. Portico is a subscription archive; future scholars' institutions may not have a subscription. LOCKSS provides access only to readers at institutions running a LOCKSS box. These restrictions are a response to the copyright on the content and are not susceptible to technical fixes.

Third, the assumption that journal articles exhibit "inherent fixity" is, alas, outdated. Both the HTML and PDF versions of articles from state-of-the-art publishing platforms contain dynamically generated elements, even when they are not entirely generated on-the-fly. The LOCKSS system encounters this on a daily basis. As each LOCKSS box collects content from the publisher independently, each box gets content that differs in unimportant respects. For example, the HTML content is probably personalized ("Welcome Stanford University") and updated ("Links to this article"). PDF content is probably watermarked ("Downloaded by"). Content elements such as these need to be filtered out of the comparisons between the "same" content at different LOCKSS boxes. One might assume that the words, figures, etc. that form the real content of articles do not drift, but in practice it would be very difficult to validate this assumption.
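To make the filtering step concrete, here is a minimal Python sketch of comparing two copies of the "same" page after stripping elements that vary per fetch. It is not the LOCKSS implementation; the patterns, the function name `normalized_digest` and the sample pages are hypothetical, and real filters have to be written per publisher platform.

```python
import hashlib
import re

# Hypothetical patterns for per-fetch elements; real filters are
# publisher-specific and considerably more involved.
DYNAMIC_PATTERNS = [
    re.compile(r"Welcome [^<\n]*"),        # personalization banner
    re.compile(r"Downloaded by [^<\n]*"),  # watermark text
]


def normalized_digest(content: str) -> str:
    """Hash the content after removing elements expected to differ between fetches."""
    for pattern in DYNAMIC_PATTERNS:
        content = pattern.sub("", content)
    # Collapse whitespace so the removals leave no spacing differences behind.
    content = " ".join(content.split())
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


# Two copies of one article as collected at two different institutions.
copy_a = "<p>Welcome Stanford University</p><p>Results: x = 3</p>"
copy_b = "<p>Welcome Example College</p><p>Results: x = 3</p>"

print(normalized_digest(copy_a) == normalized_digest(copy_b))
```

A filter like this only hides the differences it was told about, which is why validating that the real content has not drifted remains hard.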

Soft-404 Responses

I've written before about the problems caused for archiving by "soft-403 and soft-404" responses by Web servers. These result from Web site designers who believe their only audience is humans, so instead of providing the correct response code when they refuse to supply content, they return a pretty page with a 200 response code indicating valid content. The valid content is a refusal to supply the requested content. Interestingly, PubMed is an example, as I discovered when clicking on the (broken) PubMed link in the paper's reference 58.

Klein et al define a live web page thus:
On the one hand, the HTTP transaction chain could end successfully with a 2XX-level HTTP response code. In this case we declared the URI to be active on the live web.

Their estimate of the proportion of links which are still live is thus likely to be optimistic, as they are likely to have encountered at least soft-404s if not soft-403s.
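One common heuristic for spotting soft-404s, which is not the method used in the paper, is to request a path on the same host that almost certainly does not exist and see whether the server answers it the same way as the link under test. The Python sketch below assumes a caller-supplied `fetch` function returning a status code and body; that function, `looks_like_soft_404` and the exact-match comparison are all illustrative simplifications.

```python
import uuid
from typing import Callable, Tuple
from urllib.parse import urlsplit, urlunsplit

# fetch(url) -> (HTTP status code, response body); supplied by the caller.
Fetch = Callable[[str], Tuple[int, str]]


def looks_like_soft_404(url: str, fetch: Fetch) -> bool:
    """Guess whether a 2XX response is really an error page in disguise."""
    status, body = fetch(url)
    if not 200 <= status < 300:
        return False  # an honest error code, not a soft one

    # Build a probe URL on the same host with a random, surely-missing path.
    parts = urlsplit(url)
    probe = urlunsplit((parts.scheme, parts.netloc, "/" + uuid.uuid4().hex, "", ""))
    probe_status, probe_body = fetch(probe)

    # If the nonexistent path also "succeeds" with the same body, the
    # original response was probably the site's generic error page.
    return 200 <= probe_status < 300 and probe_body == body
```

In practice the comparison would need to be a similarity measure rather than strict equality, since error pages often embed the requested URL or a timestamp.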

Getting Links Right

Even when the resolver is working, its effectiveness in persisting links depends on its actually being used. Klein et al discover that in many cases it isn't:
one would assume that URI references to journal articles can readily be recognized by detecting HTTP URIs that carry a DOI, e.g., However, it turns out that references rather frequently have a direct link to an article in a publisher's portal, e.g., instead of the DOI link.

The direct link may well survive relocation of the content within the publisher's site. But journals are frequently bought and sold between publishers, causing the link to break. I believe there are two causes for these direct links: publishers' platforms inserting them so as not to risk losing the reader, but more importantly the difficulty for authors of creating correct links. Cutting and pasting from the URL bar in their browser necessarily gets the direct link; creating the correct one via requires the author to know that it should be hand-edited, and to remember to do it.

Attempts to ensure linked materials are preserved suffer from a similar problem:
The solutions component of Hiberlink also explores how to best reference archived snapshots. The common and obvious approach, followed by Webcitation and, is to replace the original URI of the referenced resource with the URI of the Memento deposited in a web archive. This approach has several drawbacks. First, through removal of the original URI, it becomes impossible to revisit the originally referenced resource, for example, to determine what its content has become some time after referencing. Doing so can be rather relevant, for example, for software or dynamic scientific wiki pages. Second, the original URI is the key used to find Mementos of the resource in all web archives, using both their search interface and the Memento protocol. Removing the original URI is akin to throwing away that key: it makes it impossible to find Mementos in web archives other than the one in which the specific Memento was deposited. This means that the success of the approach is fully dependent on the long term existence of that one archive. If it permanently ceases to exist, for example, as a result of legal or financial pressure, or if it becomes temporally inoperative as a result of technical failure, the link to the Memento becomes rotten. Even worse, because the original URI was removed from the equation, it is impossible to use other web archives as a fallback mechanism. As such, in the approach that is currently common, one link rot problem is replaced by another.

The paper, and a companion paper, describe Hiberlink's solution, which is to decorate the link to the original resource with an additional link to its archived Memento. Rene Voorburg of the KB has extended this by implementing robustify.js:
robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. The script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (as part of the Hiberlink project) in how it tries to discover an archived version of the page. As a default, it will use the Memento Time Travel service as a fallback. You can easily implement robustify.js on your web pages in so that it redirects pages to your preferred web archive.

Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies.
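As an illustration of what link decoration means in practice, the Python sketch below emits an anchor that keeps the original URI in `href` and adds the snapshot's URI and date alongside it. The attribute names `data-versionurl` and `data-versiondate` reflect my understanding of the Link Decoration specification and should be checked against it; the function name and the example URLs are hypothetical.

```python
from html import escape


def decorate_link(original_url: str, memento_url: str,
                  version_date: str, text: str) -> str:
    """Build an anchor that keeps the original URI and adds archive hints.

    Attribute names are assumed from the Robust Links specification;
    verify them against the specification before relying on this.
    """
    return (
        f'<a href="{escape(original_url, quote=True)}" '
        f'data-versionurl="{escape(memento_url, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


# Hypothetical URLs, for illustration only.
print(decorate_link(
    "http://example.org/page",
    "http://archive.example/20150210/http://example.org/page",
    "2015-02-10",
    "the referenced page",
))
```

Because the original URI survives in `href`, a reader or a script can still look for snapshots in other archives if the one named in the decoration disappears.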

Dependence on Authors

Many of the solutions that have been proposed to the problem of reference rot also suffer from dependence on authors:
Webcitation was a pioneer in this problem domain when, years ago, it introduced the service that allows authors to archive, on demand, web resources they intend to reference. ... But Webcitation has not been met with great success, possibly the result of a lack of authors' awareness regarding reference rot, possibly because the approach requires an explicit action by authors, likely because of both.

Webcitation is not the only one:
To a certain extent, portals like FigShare and Zenodo play in this problem domain as they allow authors to upload materials that might otherwise be posted to the web at large. The recent capability offered by these systems that allows creating a snapshot of a GitHub repository, deposit it, and receive a DOI in return, serves as a good example. The main drivers for authors to do so is to contribute to open science and to receive a citable DOI, and, hence potentially credit for the contribution. But the net effect, from the perspective of the reference rot problem domain, is the creation of a snapshot of an otherwise evolving resource. Still, these services target materials created by authors, not, like web archives do, resources on the web irrespective of their authorship. Also, an open question remains to which extent such portals truly fulfill a long term archival function rather than being discovery and access environments.

Hiberlink is trying to reduce this dependence:
In the solutions thread of Hiberlink, we explore pro-active archiving approaches intended to seamlessly integrate into the life cycle of an article and to require less explicit intervention by authors. One example is an experimental Zotero extension that archives web resources as an author bookmarks them during note taking. Another is HiberActive, a service that can be integrated into the workflow of a repository or a manuscript submission system and that issues requests to web archives to archive all web at large resources referenced in submitted articles.

But note that these services (and Voorburg's) depend on the author or the publisher installing them. Experience shows that authors are focused on getting their current paper accepted, large publishers are reluctant to implement extensions to their publishing platforms that offer no immediate benefit, and small publishers lack the expertise to do so.

Ideally, these services would be back-stopped by a service that scanned recently-published articles for web-at-large links and submitted them for archiving, thus requiring no action by author or publisher. The problem is that doing so requires the service to have access to the content as it is published. The existing journal archiving services, LOCKSS, CLOCKSS and Portico have such access to about half the published articles, and could in principle be extended to perform this service. In practice doing so would need at least modest funding. The problem isn't as simple as it appears at first glance, even for the articles that are archived. For those that aren't, primarily from less IT-savvy authors and small publishers, the outlook is bleak.
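The scanning half of such a back-stop service is the easy part. A rough Python sketch, assuming the article text is already in hand: pull out every HTTP link, set aside those that go through a DOI resolver, and queue the rest for archiving. The resolver host names, the regular expression and the function name are my assumptions, not anything described in the paper.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL matcher; real article markup needs a proper parser.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Assumed DOI resolver hosts; links through them point at journal
# articles rather than at web-at-large resources.
DOI_RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}


def web_at_large_links(article_text: str) -> list:
    """Return the distinct non-DOI links found in the text, in order of appearance."""
    links = []
    for match in URL_PATTERN.finditer(article_text):
        url = match.group(0).rstrip(".,;)")  # drop trailing punctuation
        host = (urlsplit(url).hostname or "").lower()
        if host not in DOI_RESOLVER_HOSTS and url not in links:
            links.append(url)
    return links


# Hypothetical reference text, for illustration only.
sample = (
    "Data are at http://example.org/data.csv, see also "
    "https://doi.org/10.9999/example. Mirror: http://example.org/data.csv"
)
print(web_at_large_links(sample))
```

Everything after this step, getting access to the article at publication time and getting an archive to actually collect each link, is where the real difficulty lies.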

Archiving

Finally, the solutions assume that submitting a URL to an archive is enough to ensure preservation. It isn't. The referenced web site might have a robots.txt policy preventing collection. The site might have crawler traps, exceed the archive's crawl depth, or use Javascript in ways that prevent the archive collecting a usable representation. Or the archive may simply not process the request in time to avoid content drift or link rot.
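The robots.txt case is at least easy to test for in advance. As a small sketch using Python's standard `urllib.robotparser`, the function below reports whether a given policy would let a crawler collect a URL; the policy text, user-agent string and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical policy for illustration.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(may_collect(ROBOTS, "example-archiver", "http://example.org/private/data.html"))
print(may_collect(ROBOTS, "example-archiver", "http://example.org/public/paper.html"))
```

A check like this says nothing about crawler traps, crawl depth or Javascript, which can only be discovered by attempting the crawl and inspecting the result.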

Acknowledgement

I have to thank Herbert van de Sompel for greatly improving this post through constructive criticism. But it remains my opinion alone.

DuraSpace News: UPDATE: Open Repositories 2015 DEVELOPER TRACK

planet code4lib - Tue, 2015-02-10 00:00

Adam Field and Claire Knowles, OR2015 Developer Track Co-Chairs; Cool Tools, Daring Demos and Fab Features

Indianapolis, IN  The OR2015 developer track presents an opportunity to share the latest developments across the technical community. We will be running informal sessions of presentations and demonstrations showcasing community expertise and progress:

SearchHub: Mark Your Calendar: Lucene/Solr Revolution 2015

planet code4lib - Mon, 2015-02-09 21:42
If you attended Lucene/Solr Revolution 2014 and took the post-event survey, you’ll know that we polled attendees on where they would like to see the next Revolution take place. Austin was our winner, and we couldn’t be more excited!

Mark your calendar for October 13-16 for Lucene/Solr Revolution 2015 at the Hilton Austin for four days packed with hands-on developer training and multiple educational tracks led by industry experts focusing on Lucene/Solr in the enterprise, case studies, large-scale search, and data integration. We had a blast in DC for last year’s conference, and this year we’re adding even more opportunities to network and interact with other Solr enthusiasts, experts, and committers.

Registration and Call for Papers will open this spring. To stay up-to-date on all things Lucene/Solr Revolution, visit and follow @LuceneSolrRev on Twitter.

Revolution 2014 Resources

Videos, presentations, and photos from last year’s conference are now available. Check them out at the links below.

  • View presentation recordings from Lucene/Solr Revolution 2014
  • Download slides from Lucene/Solr Revolution 2014 presentations
  • View photos from Lucene/Solr Revolution 2014

The post Mark Your Calendar: Lucene/Solr Revolution 2015 appeared first on Lucidworks.

LibraryThing (Thingology): New “More Like This” for LibraryThing for Libraries

planet code4lib - Mon, 2015-02-09 18:04

We’ve just released “More Like This,” a major upgrade to LibraryThing for Libraries’ “Similar items” recommendations. The upgrade is free and automatic for all current subscribers to LibraryThing for Libraries Catalog Enhancement Package. It adds several new categories of recommendations, as well as new features.

We’ve got text about it below, but here’s a short (1:28) video:

What’s New

Similar items now has a See more link, which opens More Like This. Browse through different types of recommendations, including:

  • Similar items
  • More by author
  • Similar authors
  • By readers
  • Same series
  • By tags
  • By genre

You can also choose to show one or several of the new categories directly on the catalog page.

Click a book in the lightbox to learn more about it—a summary when available, and a link to go directly to that item in the catalog.

Rate the usefulness of each recommended item right in your catalog—hovering over a cover gives you buttons that let you mark whether it’s a good or bad recommendation.

Try it Out!

Click “See more” to open the More Like This browser in one of these libraries:

Find out more

Find more details for current customers on what’s changing and what customizations are available on our help pages.

For more information on LibraryThing for Libraries or if you’re interested in a free trial, email, visit, or register for a webinar.

Library of Congress: The Signal: DPOE Interview: Three Trainers Launch Virtual Courses

planet code4lib - Mon, 2015-02-09 15:56

The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress.

This is the first post in a series about digital preservation training inspired by the Library’s Digital Preservation Outreach & Education (DPOE) Program. Today I’ll focus on some exceptional individuals who, among other things, have completed one of the DPOE Train-the-Trainer workshops and delivered digital preservation training. I am interviewing Stephanie Kom, North Dakota State Library; Carol Kussmann, University of Minnesota Libraries; and Sara Ring, Minitex (a library network providing continuing education and other services to MN, ND and SD), who recently led an introductory virtual course on digital preservation.

Barrie: Carol, you attended the inaugural DPOE Train-the-Trainer Workshop in Washington, and Stephanie and Sara, you attended the first regional event at the Indiana State Archives during the summer of 2012, correct? Can you tell the readers about your experiences and how you and others have benefited as a result?

Carol: In addition to learning about the DPOE curriculum itself, the most valuable aspect of these Train-the-Trainer workshops was meeting new people and building relationships. In the inaugural workshop, we met people from across the country, many of whom I have looked to for advice or worked with on other projects. Because of the Indiana regional training, we now have a sizable group of trainers in the Midwest with whom I feel comfortable talking about DPOE and other electronic record issues. We work with each other and provide feedback and assistance when we go out and train others or work on digital preservation issues in our own roles.

Stephanie: We were just starting a digital program at my institution so the DPOE training was beyond helpful in just informing me what needed to be done to preserve our future digital content. It gave me the tools to explain our needs to our IT department. I also echo Carol’s thoughts on the networking opportunities. It was a great way to meet people in the region that are working with the same issues.

Sara: As my colleagues mentioned, in addition to learning the DPOE curriculum, what was most valuable to me was meeting new colleagues and forming relationships to build upon after the workshop. Shortly after the training, about eight of us began meeting virtually on a regular basis to offer our first digital preservation course (using the DPOE curriculum). Our small upper Midwest collaborative included trainers from North Dakota, South Dakota, Minnesota and Wisconsin. We had trainers from libraries, state archives and a museum participating, and we found we all had different strengths to share with our audience. Our first virtual course, “Managing Digital Content Over Time: An Introduction to Digital Preservation,” reached about 35 organizations of all types, and our second virtual course reached about 20 organizations in the region.

Barrie: Since becoming official DPOE trainers, you have developed a virtual course to provide an introduction to digital preservation. Can you provide a few details about the course, and have you developed any other training materials from the DPOE Curriculum?

Stephanie, Carol, Sara: The virtual course we offered was broken up as three sessions, scheduled every other week. Each session covered two of the DPOE modules. Using the DPOE workshop materials as a starting point we added local examples from our own organizations and built in discussion questions and polls for the attendees so that we had plenty of interaction.

Evaluations from this first offering informed us that people wanted to know more about various tools used to manage and preserve digital content. In response, in our second offering of the course we built in more demonstrations of tools to help identify, manage and monitor digital content over time. Since we were discussing and demonstrating tools that dealt with metadata, we added more content about technical and preservation metadata standards. We also built in take-home exercises for attendees to complete between sessions. Attendees have responded well to these changes and find the take-home exercises that we have built in really useful.

We also created a Google Site for this course, with an up-to-date list of resources, best practices and class exercises. Carol created step-by-step guides that people can follow for understanding and using tools that can assist with managing and preserving their electronic records. These can be found on the University of Minnesota Libraries Digital Preservation Page.

Working through Minitex, we have developed three different classes related to digital preservation: An Introduction to Digital Preservation (webinar); the DPOE virtual course that was mentioned; and a full day in-person DPOE-based workshop. We have presented each of these at least two times.

Tools Quick Reference Guide, provided to attendees of “Managing Digital Content Over Time.”

Barrie: The DPOE curriculum, which is built upon the OAIS Reference Model, recently underwent a revision. Have you noticed any significant changes in the materials since you attended the workshop in 2011 or 2012? What improvements have you observed?

Carol: What I like about DPOE is that it provides a framework for people to talk about common issues related to digital preservation. The main concepts have not changed, which is good, but there has been a significant increase in the number of examples and resources. The “Digital Preservation Trends” slides were not available in the 2011 training. Keeping up to date on what people are doing, exploring new resources and tools, and following changing best practices is very important as digital preservation continues to be a moving target.

Sara, Stephanie: We found the “Digital Preservation Trends” slides, the final module covered in the DPOE workshop, to be a nice addition to the baseline curriculum. We don’t think they existed when we attended the DPOE train-the-trainer workshop back in 2012. We both especially like the “Engaging with the Digital Preservation Community” section which lists some of the organizations, listservs, and conferences that would be of interest to digital preservation practitioners. When you’re new to digital preservation (or the only one at your organization working with digital content), it can be overwhelming knowing where to start. Providing resources like this offers a way to get involved in the digital preservation community; to learn from each other. We always try to close our digital preservation classes by providing community resources like this.

Barrie: Regarding training opportunities, could you compare the strengths and challenges of traditional in-person learning environments to distance learning options?

Stephanie, Carol, Sara: Personally we all prefer in-person learning environments over virtual and believe that most people would agree. We saw this preference echoed in the DPOE 2014 Training Needs Assessment Survey (PDF).

The main strength of in-person is the interaction with the presenter and other participants; as a presenter you can adjust your presentation immediately based on audience reactions and their specific needs and understanding. As a participant you can meet and relate to other people in similar situations, and there are more opportunities at in-person workshops for having those types of discussions with colleagues during breaks or during lunch.

However, in-person learning is not always feasible given travel time and costs, and in this part of the country, weather often gets in the way (we have all had our share of driving through blizzard conditions in Minnesota and North Dakota). Convenience and timeliness are definitely benefits of distance learning; more people from a single institution can often attend for little or no additional cost. As trainers we have worked really hard to build hands-on activities into our virtual digital preservation courses, but could probably do a lot more to encourage networking among the attendees.

Barrie: Are there plans to convene the “Managing Digital Content Over Time” series in 2015?

Stephanie, Carol, Sara: Yes, we plan on offering at least one virtual course this spring. We’ll be checking in with our upper Midwest collaborative of trainers to see who is interested in participating this time around. Minitex provides workshops on request, so we may do more virtual or in-person classes if there is demand.

One of the hands-on activities for the in-person “Managing Digital Content Over Time” course.

Barrie: How has the DPOE program influenced and/or affected the work that you do at your organization?

Carol: The inaugural DPOE Training (2011) took place while I was working on an NDIIPP project led by the Minnesota State Archives to preserve and provide access to government digital records. The training provided me with additional tools to work with during the project. After the project ended, I continued to use the information I learned during the project and DPOE training to develop a workflow for processing and preserving digital records at the Minnesota State Archives.

Since then, I became a Digital Preservation Analyst at the University of Minnesota Libraries where I continue to focus on digital preservation workflows, education and training, and other related activities. Overall, the DPOE training helped to build a foundation from which to discuss digital preservation with others whether in a classroom setting, conference presentation or one-on-one conversations. I look forward to continuing to work with members of the DPOE community.

Sara: As a digitization and metadata training coordinator at Minitex, a large part of my job is developing and presenting workshops for library professionals in our region. Participating in the DPOE training (2012) was one of the first steps I took to build and expand our training program at Minitex to include digital preservation. The DPOE program has also given me the opportunity to build up our own small cohort of DPOE trainers in the region, so we can schedule regular workshops based on who is available to present at the time.

Stephanie: I started the digitization program at our institution in 2012. Digital preservation has become a main component of that program and I am still working to get a full-fledged plan moving. Our institution is responsible for preserving other digital content and I would like our preservation plan to encompass all aspects of our work here at the library. I think one of the great things about the DPOE training is that the different pieces can be implemented before starting to produce digital content or it can be retrofitted into an already-established digital program. It can be more work when you already have a lot of digital content but the training materials make each step seem manageable.

Open Knowledge Foundation: Pakistan Data Portal

planet code4lib - Mon, 2015-02-09 11:29

December 2014 saw the Sustainable Development Policy Institute and Alif Ailaan launch the Pakistan Data Portal at the 30th Annual Sustainable Development Conference. The portal, built using CKAN by Open Knowledge, provides an access point for viewing and sharing data relating to all aspects of education in Pakistan.

A particular focus of this project was to design an open data portal that could be used to support advocacy efforts by Alif Ailaan, an organisation dedicated to improving education outcomes in Pakistan.

The Pakistan Data Portal (PDP) is the definitive collection of information on education in Pakistan and collates datasets from private and public research organisations on topics including infrastructure, finance, enrollment, and performance, to name a few. The PDP is a single point of access against which change in Pakistani education can be tracked and analysed. Users, who include teachers, parents, politicians and policy makers, are able to browse historical data and compare and contrast it across regions and years to reveal a clear, customizable picture of the state of education in Pakistan. From this clear overview, the drivers and constraints of reform can be identified, which allows Alif Ailaan and others pushing for change in the country to focus their reform efforts.

Pakistan is facing an education emergency. It is a country with 25m children out of education, and 50% of girls of school age do not attend classes. A census has not been completed since 1998 and there are problems with the data that is available. It is outdated, incomplete, error-ridden and only a select few have access to much of it. An example that highlights this is a recent report from ASER, which estimates the number of children out of school at 16 million fewer than the number computed by Alif Ailaan in another report. NGOs and other advocacy groups have tended to only be interested in data when it can be used to confirm that the funds they are utilising are working. Whilst there is agreement on the overall problem, if people cannot agree on its scale, how can a consensus solution be hoped for?

Alif Ailaan believe that if you can’t measure the state of education in the country, you can’t hope to fix it. This forms the focus of their campaigning efforts. So whilst the quality of the data is a problem, some data is better than no data, and the PDP forms a focus for gathering quality information together and a platform from which to promote policy change, so that policy makers can make accurate decisions backed up by data.

The data accessible through the portal is supported by regular updates from the PDP team, who draw attention to timely key issues and analyse the data. A particular subject or dataset will be explored from time to time, and these general blog posts are supported by “The Week in Education”, which summarises the latest education news, data releases and publications.

CKAN was chosen as the portal best placed to meet the needs of the PDP. Open Knowledge were tasked with customising the portal and providing training and support to the team maintaining it. A custom dashboard system was developed for the platform in order to present data in an engaging visual format.

As explained by Asif Mermon, Associate Research Fellow at SDPI, the genius of the portal is the shell. As institutions start collecting data, or old data is uncovered, it can be added to the portal to continually improve the overall picture.

The PDP is in constant development to further promote the analysis of information in new ways and to improve the visualizations on offer. There are also plans to expand the scope of the portal, so that areas beyond education can also reap its benefits. A further benefit is that the shell can then be exported around the world so other countries will be able to benefit from the development.

The PDP initiative is part of the multi-year DFID-funded Transforming Education Pakistan (TEP) campaign aiming to increase political will to deliver education reform in Pakistan. Accadian, on behalf of HTSPE, appointed the Open Knowledge Foundation to build the data observatory platform and provide support in managing the upload of data including onsite visits to provide training in Pakistan.


Hydra Project: Announcing Hydra 9.0.0

planet code4lib - Mon, 2015-02-09 09:56

We’re pleased to announce the release of Hydra 9.0.0.  This Hydra gem brings together a set of compatible gems for working with Fedora 4. Amongst others it bundles Hydra-head 9.0.1 and Active-Fedora 9.0.0.  In addition to working with Fedora 4, Hydra 9 includes many improvements and bug fixes. Especially notable is the ability to add RDF properties on repository objects themselves (no need for datastreams) and large-file streaming support.

The new gem represents almost a year of effort – our thanks to all those who made it happen!

Release notes:

DuraSpace News: Fedora 4 Makes Islandora Even Better!

planet code4lib - Mon, 2015-02-09 00:00

Combining Islandora 7 and Fedora 4 offers key advantages for users and developers.

Charlottetown, PEI, CA  Islandora is an open source software framework for managing and discovering digital assets utilizing a best-practices framework that includes Drupal, Fedora, and Solr. Islandora is implemented and built by an ever-growing international community.

CrossRef: Geoffrey Bilder will be at the 10th IDCC in London tomorrow

planet code4lib - Sun, 2015-02-08 21:58

Geoffrey Bilder @gbilder will be part of a panel entitled Why is it taking so long?. The panel will explore why some types of change in curation practice take so long and why others happen quickly. The panel will be moderated by Carly Strasser @carlystrasser, Manager of Strategic Partnerships for DataCite. The panel will take place on Monday, February 9th at 16:30 at 30 Euston Square in London. Learn more. #idcc15

Patrick Hochstenbach: Homework assignment #7 Sketchbookskool

planet code4lib - Sun, 2015-02-08 16:43
Filed under: Doodles Tagged: Photoshop, sketchbookskool, staedtler, urbansketching

Patrick Hochstenbach: Homework assignment #6 Sketchbookskool

planet code4lib - Sun, 2015-02-08 16:41
Filed under: Doodles Tagged: brushpen, ostrich, sketchbookskool, toy

Patrick Hochstenbach: Homework assignment #5 Sketchbookskool

planet code4lib - Sun, 2015-02-08 16:40
Filed under: Doodles Tagged: brushpen, copic, fudensuke, moleskine, pencil, sketchbookskool

David Rosenthal: It takes longer than it takes

planet code4lib - Sun, 2015-02-08 03:24
I hope it is permissible to blow my own horn on my own blog. Two concepts recently received official blessing after a good long while, for one of which I'm responsible, and for the other of which I'm partly responsible. The mysteries are revealed below the fold.

The British Parliament is celebrating the 800th anniversary of Magna Carta:
On Thursday 5 February 2015, the four surviving original copies of Magna Carta were displayed in the Houses of Parliament – bringing together the documents that established the principle of the rule of law in the place where law is made in the UK today.

The closing speech of the ceremony in the House of Lords was given by Sir Tim Berners-Lee, who is reported to have said:

I invented the acronym LOCKSS more than a decade and a half ago. Thank you, Sir Tim!

On October 24, 2014 Linus Torvalds added overlayfs to release 3.18 of the Linux kernel. Various Linux distributions have implemented various versions of overlayfs for some time, but now it is an official part of Linux. Overlayfs is a simplified implementation of union mounts, which allow a set of file systems to be superimposed on a single mount point. This is useful in many ways, for example to make a read-only file system such as a CD-ROM appear to be writable by mounting a read-write file system "on top" of it.

Other Unix-like systems have had union mounts for a long time. BSD systems first implemented it in 4.4BSD-Lite two decades ago. The concept traces back five years earlier to my paper for the Summer 1990 USENIX Conference Evolving the Vnode Interface which describes a prototype implementation of "stackable vnodes". Among other things, it could implement union mounts as shown in the paper's Figure 10:
This use of stackable vnodes was in part inspired by work at Sun two years earlier on the Translucent File Service, a user-level NFS service by David Hendricks that implemented a restricted version of union mounts. All I did was prototype the concept, and like many of my prototypes it served mainly to discover that the problem was harder than I initially thought. It took others another five years to deploy it in SunOS and BSD. Because they weren't hamstrung by legacy code and semantics, by far the most elegant and sophisticated implementation was the one built around the same time by Rob Pike and the Plan 9 team. Instead of being a bolt-on addition, union mounting was fundamental to the way Plan 9 worked.

About five years later Erez Zadok at Stony Brook led the FiST project, a major development of stackable file systems including two successive major releases of unionfs, a unioning file system for Linux.

About the same time I tried to use OpenBSD's implementation of union mounts early in the boot sequence to construct the root directory by mounting a RAM file system over a read-only root file system on a CD, but gave up on encountering deadlocks.

In 2009 Valerie Aurora published a truly excellent series of articles going into great detail about the difficult architectural and implementation issues that arise when implementing union mounts in Unix kernels. It includes the following statement, with which I concur:
The consensus at the 2009 Linux file systems workshop was that stackable file systems are conceptually elegant, but difficult or impossible to implement in a maintainable manner with the current VFS structure. My own experience writing a stacked file system (an in-kernel chunkfs prototype) leads me to agree with these criticisms.

Note that my original paper was only incidentally about union mounts; it was a critique of the then-current VFS structure, and a suggestion that stackable vnodes might be a better way to go. It was such a seductive suggestion that it took nearly two decades to refute it! My apologies for pointing down a blind alley.

The overlayfs implementation in 3.18 is minimal:
Overlayfs allows one, usually read-write, directory tree to be overlaid onto another, read-only directory tree. All modifications go to the upper, writable layer.

But given the architectural issues, doing one thing really well has a lot to recommend it over doing many things fairly well. This is, after all, the use case from my paper.
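The layering rule quoted above can be illustrated with a toy model. This is only a sketch in Python that uses dictionaries in place of directory trees; the class name `OverlayView` and its methods are hypothetical and say nothing about how the kernel code is written. The "whiteout" set used to hide deleted lower-layer entries is a simplification of the general idea and is an assumption of this sketch.

```python
from types import MappingProxyType


class OverlayView:
    """Toy model of an overlay: a writable upper layer over a read-only lower layer."""

    def __init__(self, lower):
        # Wrap a copy of the lower layer so it cannot be modified through this view.
        self.lower = MappingProxyType(dict(lower))
        self.upper = {}
        # Names deleted through the view; they hide entries in the lower layer.
        self.whiteouts = set()

    def read(self, name):
        # The upper layer wins; otherwise fall through to the lower layer.
        if name in self.upper:
            return self.upper[name]
        if name in self.whiteouts or name not in self.lower:
            raise FileNotFoundError(name)
        return self.lower[name]

    def write(self, name, data):
        # All modifications go to the upper, writable layer.
        self.upper[name] = data
        self.whiteouts.discard(name)

    def delete(self, name):
        if name not in self.listing():
            raise FileNotFoundError(name)
        self.upper.pop(name, None)
        if name in self.lower:
            # The lower layer cannot change, so record that the name is hidden.
            self.whiteouts.add(name)

    def listing(self):
        visible_lower = set(self.lower) - self.whiteouts
        return sorted(visible_lower | set(self.upper))
```

In this model a read-only medium such as a CD-ROM would be the `lower` mapping: it appears writable through the view, yet its contents are never touched.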

It took a quarter of a century, but the idea has finally been accepted. And, even though I had to build a custom 3.18 kernel to do so, I am using it on a Raspberry Pi serving as part of the CLOCKSS Archive.

Thank you, Linus! And everyone else who worked on the idea during all that time!


Mark E. Phillips: How we assign unique identifiers

planet code4lib - Sun, 2015-02-08 03:01

The UNT Libraries have made use of the ARK identifier specification for a number of years and have used these identifiers throughout our infrastructure on a number of levels.  This post gives a little background about where, when, why and a little about how we assign our ARK identifiers.


The first thing we need to do is get some terminology out of the way so that we can talk about the parts consistently.  This is taken from the ARK documentation.  Reading an ARK from left to right, the parts are:
  • Name Mapping Authority (NMA), which is replaceable
  • ARK Label
  • Name Assigning Authority Number (NAAN)
  • Name (NAA-assigned)
  • Qualifier (NMA-supported)

The ARK syntax can be summarized,


For the UNT Libraries we were assigned a Name Assigning Authority Number (NAAN) of 67531, so all of our identifiers start like this: ark:/67531/
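The parts named in the terminology above can be pulled apart mechanically. The following is a minimal sketch in Python, not the UNT Libraries' code: the function name `parse_ark` is hypothetical, and the pattern assumes the general shape of an optional `http://NMA/` prefix, then `ark:/NAAN/Name`, then an optional qualifier beginning with `/` or `.`.

```python
import re

# Assumed shape: [http(s)://NMA/]ark:/NAAN/Name[Qualifier]
ARK_PATTERN = re.compile(
    r"^(?:(?P<nma>https?://[^/]+)/)?"      # optional, replaceable Name Mapping Authority
    r"ark:/(?P<naan>[^/]+)"                # ARK label and NAAN
    r"/(?P<name>[^/.]+)"                   # Name assigned by the NAA
    r"(?P<qualifier>[/.].*)?$"             # optional qualifier
)


def parse_ark(identifier):
    """Split an ARK string into its parts; missing optional parts are None."""
    match = ARK_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not an ARK: {identifier!r}")
    return match.groupdict()


parts = parse_ark("ark:/67531/metapth123")
print(parts["naan"], parts["name"])
```

Because the NMA is replaceable, a comparison of two ARKs under these assumptions would look only at the `naan` and `name` fields.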

We mint Names for our ARKs locally with a home-grown system called a “Number Server”.  This Python Web service receives a request for a new number, assigns that number a prefix based on which instance we pull from, and returns the new Name.
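The minting step can be sketched as a counter per namespace. This is an illustrative in-memory stand-in written in Python, not the actual Number Server: the class and method names are hypothetical, starting each counter at 1 is an assumption, and a real service would need to persist its counters so that a number is never handed out twice across restarts.

```python
import itertools


class NumberServer:
    """In-memory stand-in for a name minter: one incrementing counter per namespace."""

    def __init__(self, namespaces, naan="67531"):
        self.naan = naan
        # Each namespace gets its own independent counter (assumed to start at 1).
        self._counters = {ns: itertools.count(1) for ns in namespaces}

    def mint(self, namespace):
        """Return the next Name for a namespace; counters only ever move forward."""
        if namespace not in self._counters:
            raise KeyError(f"unknown namespace: {namespace}")
        return f"{namespace}{next(self._counters[namespace])}"

    def ark(self, name):
        """Build the full ARK for a minted Name."""
        return f"ark:/{self.naan}/{name}"


server = NumberServer(["metapth", "metadc"])
name = server.mint("metapth")
print(server.ark(name))
```

Since the counters never rewind, deleting an item simply leaves a gap in the sequence rather than freeing the number for reuse.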


We have four different namespaces that we use for minting identifiers: metapth, metadc, metarkv, and coda.  Additionally we have a metatest namespace, which we use when we need to test things out, but it isn’t used that often.  Finally we have a historic namespace, metacrs, that is no longer used.  Here is the breakdown of how we use these namespaces.

We try to assign all items that end up on The Portal to Texas History Names from the metapth namespace whenever possible.  We assign all other public-facing digital objects Names from the metadc namespace.  This means that the UNT Digital Library and The Gateway to Oklahoma History both share Names from the metadc namespace.  The metarkv namespace is used for “archive only” objects that go directly into our archival repository system; these include large Web archiving datasets.  The coda namespace is used within our archival repository, called Coda.  As stated earlier, the metatest namespace is only used for testing, and these items are thrown away after processing.

Name assignment

We assign Names in our systems programmatically; this is always done as part of our digital item ingest process.  We tend to process items in batches: most often several hundred items at a time, and sometimes several thousand.  Items are processed in parallel, so there is no logical order to how the Names are assigned to objects.  They are in the order that they were processed but may have no logical order past that.

We also don’t assume that our Names are continuous.  If you have the identifiers metapth123 and metapth125, we don’t assume that there is an item metapth124; it may be there, but it also may never have been assigned.  When we first started with these systems we would get worked up if we assigned several hundred or a few thousand identifiers and then had to delete those items.  Now this isn’t an issue at all, but that took some time to get over.

Another assumption that can’t be made in our systems is that related items get adjacent Names.  If Newspaper Vol. 1 Issue 2 has an identifier of metapth333, there is no guarantee that Newspaper Vol. 1 Issue 3 will have metapth334; it might, but it isn’t guaranteed.  Items can also be shared between systems, and membership in the Portal, the UNT Digital Library or the Gateway is noted in the descriptive metadata.  Therefore you can’t say that all metapth* identifiers are Portal or that all metadc* identifiers are not the Portal; you have to look them up based on the metadata.

Once a number is assigned it is never assigned again.  This sounds like a silly thing to say but it is important to remember,  we don’t try and save identifiers, or reuse them as if we will run out of them.

Level of assignment

We currently assign an ARK identifier at the level of the intellectual object.  So, for example, a newspaper issue gets an ARK, a photograph gets an ARK, and a book, a map, a report, an audio recording, or a video recording each gets an ARK.  The sub-parts of an item are not given further unique identifiers because the way that we tend to interface with them is in the form of formatted URLs such as those described here or from other URL-based patterns such as the URLs we use to retrieve items from Coda.

Lessons Learned

Things I would do again.
  • I would most likely use just an incrementing counter for assigning identifiers.  Name minters such as Noid are also an option but I like the numbers with a short prefix.
  • I would not use a prefix such as UNT, to stay away from branding as much as possible.  Even metapth is way too branded (see below).
Things I would change in our implementation.
  • I would only have one namespace for non-archival items.  Two namespaces for production data just invite someone to screw up (usually me) and then suddenly the reason for having one namespace over the other is meaningless.  Just manage one namespace and move on.
  • I would not have a six or seven character prefix.  metapth and metadc came as baggage from our first system; we decided that the 30k identifiers we had already minted had set our path.  Now, after 1,077,975 identifiers in those namespaces, it seems a little silly that the first 3% of our items would still have such an effect on us today.
  • I would not brand our namespaces so closely to our system names, as with metapth, metadc, and the legacy metacrs; people read too much into the naming convention.  This is a big reason for opaque Names in the first place, and is pretty important.
Things I might change in a future implementation.
  • I would probably pad my identifiers out to eight digits.  While you can’t rely on the ARKs to be generated in a given order, once they are assigned it is helpful to be able to sort by them and have a consistent order; metapth1, metapth100, metapth100000 don’t always sort nicely like metapth00000001, metapth00000100, metapth00100000 do.  But then again, longer runs of zeros are harder to transcribe, and I had a tough time just writing this example.  Maybe I wouldn’t do this.
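The sorting trade-off in the last point is easy to see in a few lines of Python. This is a small sketch with hypothetical helper names (`pad_name`, `natural_key`); it assumes every Name is a lowercase prefix followed by digits, as in the examples above. It shows both options: padding the number to eight digits, or leaving Names unpadded and sorting with a numeric-aware key.

```python
import re

NAME_PATTERN = re.compile(r"([a-z]+)(\d+)")


def split_name(name):
    """Split a Name such as 'metapth100' into ('metapth', 100)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected Name format: {name!r}")
    return match.group(1), int(match.group(2))


def pad_name(name, width=8):
    """Zero-pad the numeric part so that plain string sorting matches numeric order."""
    prefix, number = split_name(name)
    return f"{prefix}{number:0{width}d}"


def natural_key(name):
    """Sort key for unpadded Names: compare the prefix, then the number as an integer."""
    return split_name(name)


names = ["metapth100", "metapth2", "metapth100000", "metapth1"]
print(sorted(names))                         # plain string sort
print(sorted(names, key=natural_key))        # numeric-aware sort of unpadded Names
print(sorted(pad_name(n) for n in names))    # padded Names under a plain string sort
```

A sort key like `natural_key` gives the consistent ordering without the long runs of zeros, at the cost of needing that key everywhere the Names are sorted.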

I don’t think any of this post applies only to ARK identifiers, as most identifier schemes at some level require a decision about how you are going to mint unique names for things.  So hopefully this is useful to others.

If you have any specific questions for me let me know on twitter.

FOSS4Lib Recent Releases: Hydra - 9.0

planet code4lib - Sat, 2015-02-07 17:46
Package: Hydra
Release Date: Thursday, February 5, 2015

Last updated February 7, 2015. Created by Peter Murray on February 7, 2015.

From the release announcement

I'm pleased to announce the release of Hydra 9.0.0! This is the first release of the Hydra gem for Fedora 4 and represents almost a year of effort. In addition to working with Fedora 4, Hydra 9 includes many improvements and bug fixes. Especially notable is the ability to add RDF properties on repository objects themselves (no need for datastreams) and large-file streaming support.

CrossRef: Join us for the first CrossRef Taxonomies Webinar - March 3rd at 11:00 am ET

planet code4lib - Fri, 2015-02-06 21:15

Semantic enrichment is an active area of development for many publishers. Our enrichment processes are based on the use of different Knowledge Models (e.g., an ontology or thesaurus) which provide the terms required to describe different subject disciplines.

The CrossRef Taxonomy Interest Group is a collaboration among publishers, and sponsored by CrossRef, to share the Knowledge Models they are using, creating opportunities for standardization, collaboration and interoperability. Please join the webinar to get an introduction to the work this group is doing, use cases for the information collected and learn how your organization can contribute to the project.

Christian Kohl - Director Information and Publishing Technology, De Gruyter
Graham McCann - Head of Content and Platform Management, IOP Publishing

The webinar will take place on Tuesday, March 3rd at 11 am ET.

Register today!

SearchHub: Enabling SSL on Fusion Admin UI

planet code4lib - Fri, 2015-02-06 20:40
Lucidworks Fusion can encrypt communications to and from clients with SSL. This section describes enabling SSL on the Fusion Admin UI with the Jetty server using a self-signed certificate.

Basic SSL Setup

Generate a self-signed certificate and a key

To generate a self-signed certificate and a single key that will be used to authenticate both the server and the client, we’ll use the JDK keytool command and create a separate keystore. This keystore will also be used as a truststore below. It’s possible to use the keystore that comes with the JDK for these purposes, and to use a separate truststore, but those options aren’t covered here.

Run the commands below in the $FUSION_HOME/jetty/ui/etc directory in the binary Fusion distribution. The “-ext SAN=…” keytool option allows you to specify all the DNS names and/or IP addresses that will be allowed during hostname verification.

keytool -genkeypair -alias fusion -keyalg RSA -keysize 2048 -keypass secret -storepass secret -validity 9999 -keystore fusion.keystore.jks -ext SAN=DNS:localhost,IP: -dname "CN=localhost, OU=Organizational Unit, O=Organization, L=Location, ST=State, C=Country"

The above command will create a keystore file named fusion.keystore.jks in the current directory.

Convert the certificate and key to PEM format for use with cURL

cURL isn’t capable of using JKS formatted keystores, so the JKS keystore needs to be converted to PEM format, which cURL understands. First convert the JKS keystore into PKCS12 format using keytool:

keytool -importkeystore -srckeystore fusion.keystore.jks -destkeystore fusion.keystore.p12 -srcstoretype jks -deststoretype pkcs12

The keytool application will prompt you to create a destination keystore password and to enter the source keystore password, which was set when creating the keystore (“secret” in the example shown above).

Next convert the PKCS12 format keystore, including both the certificate and the key, into PEM format using the openssl command:

openssl pkcs12 -in fusion.keystore.p12 -out fusion.pem

Configure Fusion

First, copy jetty-https.xml and jetty-ssl.xml from $FUSION_HOME//jetty/home/etc to $FUSION_HOME/jetty/ui/etc.

Next, edit jetty-ssl.xml and change the keyStore values to point to the JKS keystore created above; the result should look like this:

Edit ui file (not under $FUSION_HOME/bin and add the following 3 lines:
  1. "https.port=$HTTP_PORT" \
  2. "$JETTY_BASE/etc/jetty-ssl.xml" \
  3. "$JETTY_BASE/etc/jetty-https.xml"

Run Fusion using SSL

To start all services, run $FUSION_HOME/bin/fusion start. This will start Solr, the Fusion API, the Admin UI, and Connectors, which each run in their own Jetty instances and on their own ports.

bin/fusion start

After that, trust the Fusion website in your browser (this is because we are on the local machine).

Finally, the Fusion Admin UI is available with SSL.

The post Enabling SSL on Fusion Admin UI appeared first on Lucidworks.

CrossRef: New CrossRef Members

planet code4lib - Fri, 2015-02-06 20:10

Updated February 3, 2015

Voting Members
Academy of Medical and Health Research
Agrivita, Journal of Agricultural Science (AJAS)
Eurasian Scientific and Industrial Chamber, Ltd.
Hitte Journal of Science and Education
Institute of Mathematical Problems of Biology of RAS (IMPB RAS)
MIM Research Group
Tomsk State University
Universitas Pendidikan Indonesia (UPI)

Represented Members
Amasya Universitesi Egitim Fakultesi Dergisi
Hikmet Yurdu Dusunce-Yorum Sosyal Bilimler Arastirma Dergisi
Necatibey Faculty of Education Electronics Journal of Science and Mathematics Education
Optimum Journal of Economics and Management Sciences

Last update January 26, 2015

Voting Members
Academy Publication
Escola Bahiana de Medicine e Saude Publica
Escola Superior de Educacao de Paula Frassinetti
Lundh Research Foundation
RFC Editor

Represented Members
ABRACICON: Academia Brasileira de Ciencias Contabeis
Biodiversity Science
Canakkale Arastirmalari Turk Yilligi
Chinese Journal of Plant Ecology
Dergi Karadeniz
Eskisehir Osmangazi University Journal of Social Sciences
Geological Society of India
Instituto do Zootecnia
Journal of Social Studies Education Research
Journal Press India
Kahramanmaras Sutcu Imam Universitesi Tip Fakultesi Dergisi
Nitte Management Review
Sanat Tasarim Dergisi
Sociedade Brasileira de Virologia
The Apicultural Society of Korea
The East Asian Society of Dietary Life
The Korea Society of Aesthetics and Science of Art
Turkish History Education Journal

CrossRef: CrossRef Indicators

planet code4lib - Fri, 2015-02-06 17:25

Updated February 3, 2015

Total no. participating publishers & societies 5772
Total no. voting members 3058
% of non-profit publishers 57%
Total no. participating libraries 1926
No. journals covered 37,687
No. DOIs registered to date 72,062,095
No. DOIs deposited in previous month 471,657
No. DOIs retrieved (matched references) in previous month 41,726,414
DOI resolutions (end-user clicks) in previous month 134,057,984

