Feed aggregator

Hydra Project: Share your Cool Tools, Daring Demos and Fab Features at Open Repositories 2015

planet code4lib - Wed, 2015-02-11 11:02

Of interest to the community:


Open Repositories 2015 DEVELOPER TRACK

June 8-11, 2015, Indianapolis, Indiana

*** Deadline 13th March 2015 ***

Cool Tools, Daring Demos and Fab Features

The OR2015 developer track presents an opportunity to share the latest developments across the technical community. We will be running informal sessions of presentations and demonstrations showcasing community expertise and progress:

What cool development tools, frameworks, languages and technologies could you not get on without?

Is there a particular technique or process that you find apt for solving particular day-to-day repository problems?  Demonstrate it to the community.  Extra credit for command-line shenanigans and live debugging.

What new features (however small) have you added to your organisation’s repository?  What technologies were used and how did you arrive at your solution?

Presentations will be flexibly timed (5 to 20 minutes). Live demos, code repositories, ssh, hacking and audience participation are encouraged.

Submissions should take the form of a title and short paragraph detailing what will be shared with the community (including the specific platform and/or technologies you will be showcasing). Please also give an estimate of the duration of your demonstration.

Submit your proposal here: by March 13, 2015

Ideas Challenge

The Developer Challenge this year has been replaced by the more inclusive IDEAS CHALLENGE. We would like to encourage teams to form before and during the conference to propose an innovative solution to a real-world problem that repository users currently face.  Each team should include members from both the developer and user community, and represent more than one institution.

Teams’ ideas will be presented to the conference and prizes will be awarded based on the nature of the problem, the quality of the solution and the make-up of the team. Find out more at

For inquiries, please contact the Developer Track Co-Chairs, Adam Field and Claire Knowles  at af05[AT] and claire.knowles[AT]

CrossRef: February 2015: CrossRef deploys Co-Access Functionality for Books

planet code4lib - Wed, 2015-02-11 08:08

Co-Access functionality has been set up by CrossRef to address the issues caused when there are multiple organisations involved in the hosting and distribution of a given book.

Co-Access is effectively an extension of CrossRef's Multiple Resolution (MR) functionality which allows multiple URLs to be assigned to a single DOI. MR functionality relies on title ownership to try to avoid conflicts and maintain the uniqueness of DOIs so that they can be queried effectively.

MR has worked well for journals, but because many book publishers outsource the hosting of their content to multiple aggregators and platforms, they need a separate process that allows independent transactions on the part of the primary publisher and any secondary content hosts, rather than interactions having to be co-ordinated by the primary publisher (who may not be depositing DOIs and metadata with CrossRef at all).

Co-Access will allow multiple parties to deposit DOIs for a single publication, and have the CrossRef system automatically resolve overlaps (we call them conflicts) and establish MR for any DOI where multiple target URLs exist for the same book or part of a book (e.g. chapters). This arrangement would be set-up between the primary publisher of a work and a set of approved participants who could also deposit (and update) DOIs for a publication. This relationship can be enabled either between prefixes or within a single prefix.

This change would allow Co-Access members to operate independently of one another when assigning DOIs to book content, and aims to reduce the amount of coordination required between the primary publisher and the secondary content hosts.

CrossRef is currently embarking on a pilot of this functionality - we've done our own testing based on some known scenarios - but now need publishers to trial this on some of their own publications and provide feedback on how well it works for them. CrossRef would need publishers to be able to identify a title that will need a Co-Access arrangement set up with another party (and who the party or parties would be) to commence the testing process. If your organisation would like to be involved, please contact for further information.

Galen Charlton: Ogres, hippogriffs, and authorized Koha service providers

planet code4lib - Wed, 2015-02-11 01:14

What do ogres, hippogriffs, and authorized Koha service providers have in common?

Each of them is an imaginary creature.

20070522 Madrid: hippogriff — image by Larry Wentzel on Flickr (CC-BY)

Am I saying that Koha service providers are imaginary creatures? Not at all — at the moment, there are 54 paid support providers listed on the Koha project’s website.

But not a one of them is “authorized”.

I bring this up because a friend of mine in India (full disclosure: who himself offers Koha consulting services) ran across this flyer by Avior Technologies:

The bit that I’ve highlighted is puffery at best, misleading at worst. The Koha website’s directory of paid support providers is one thing, and one thing only: a directory. The Koha project does not endorse any vendors listed there — and neither the project nor the Horowhenua Library Trust in New Zealand (which holds various Koha trademarks) authorizes any firm to offer Koha services.

If you want your firm to get included in the directory, you need only do a few things:

  1. Have a website that contains an offer of services for Koha.
  2. Ensure that your page that offers services links back to
  3. Make a public request to be added to the directory.

That’s it.

Not included on this list of criteria:

  • Being good at offering services for Koha libraries.
  • Contributing code, documentation, or anything else to the Koha project.
  • Having any current customers who are willing to vouch for you.
  • Being alive at present (although eventually, your listing will get pulled for lack of response to inquiries from Koha’s webmasters).

What does this mean for folks interested in getting paid support services?  There is no shortcut to doing your due diligence — it is on you to evaluate whether a provider you might hire is competent and able to keep their customers reasonably happy. The directory on the Koha website exists as a convenience for folks starting a search for a provider, but beyond that: caveat emptor.

I know nothing about Avior Technologies. They may be good at what they do; they may be terrible — I make no representation either way.

But I do know this: while there are some open source projects where the notion of an “authorized” or “preferred” support provider may make some degree of sense, Koha isn’t such a project.

And that’s generally to the good of all: if you have Koha expertise or can gain it, you don’t need to ask anybody’s permission to start helping libraries run Koha — and get paid for it.  You can fill niches in the market that other Koha support providers cannot or do not fill.

You can in time become the best Koha vendor in your niche, however you choose to define it.

But authority? It will never be bestowed upon you. It is up to you to earn it by how well you support your customers, and by how much you contribute to the global Koha project.


DuraSpace News: AVAILABLE: Fedora 4.1.0 Release

planet code4lib - Wed, 2015-02-11 00:00

Winchester, MA  The Fedora team is proud to announce that Fedora 4.1.0 was released on February 4, 2015 and is now available.

pinboard: Presentation Index

planet code4lib - Tue, 2015-02-10 23:12
RT @ronallo: Detailed version of my talk on WebVTT with full speaker notes and audience handouts here: #c4l15 #code4lib

Tara Robertson: Developing a culture of consent at code4lib

planet code4lib - Tue, 2015-02-10 23:01

I love code4lib. code4lib is not a formal organization, it’s more of a loose collective of folks. The culture is very DIY. If you see something that needs doing, someone needs to step up and do it. I love that part of our culture is reflecting on our culture and thinking of ways to improve it. At this year’s conference we made some improvements to our culture.

Galen Charlton kicked this discussion off with an email on the code4lib list by suggesting we institute a policy like the Evergreen conference (which was informed by work done by The Ada Initiative) where “consent be explicitly given to be photographed or recorded”.

Kudos to the local organizing committee for moving quickly (like just over 3 hours from Galen’s initial email). They purchased coloured lanyards to indicate participants’ views on being photographed: red means don’t photograph me, yellow means ask me before photographing me, and green means go ahead and photograph me. This is an elegant and simple solution.

Over the past few years, streaming the conference presentations has become standard, as has publishing the videos to the web after the conference. This is awesome and important, because not everyone can travel to attend the conference. This allows us to learn faster and build better things. I suggested that it was time to explicitly obtain speakers’ consent to stream their presentations and archive the video online.

At first I was disheartened by some of the comments on the list:

  • “This needs to be opt out, not opt in.”
  • “An Opt-Out policy would be more workable for presenters.”
  • “requiring explicit permission from presenters is overly burdensome for the (streaming) crew that is struggling to get the recordings”
  • “i enjoy taking candid photos of people at the conference and no one seems to mind”
  • “my old Hippy soul cringes at unnecessary paperwork. A consent form means nothing. Situations change. Even a well-intended agreement sometimes needs to be reneged on.”

The lack of understanding about informed consent means a few things about the code4lib community:

  1. there’s a lack of connection to feminist organizing that has a long history of collective organizing and consent
  2. the laissez-faire approach to consent (opt-out) centres male privilege
  3. this community still has work to do around rape culture. 

It was awesome to get support from the Programming Committee, the local organizers and some individuals. We managed to update the consent form we used for Access to be specific to code4lib and get it out to speakers in just over a week. Ranti quickly stepped up and volunteered to help me obtain consent forms from all of the speakers. As this is a single stream conference there were only 39 people so it wasn’t that much work to do. 

Here’s the consent form we used.  A few people couldn’t agree to the copyright bits, so they crossed that part out. I’m sure this form will evolve to become better.

code4lib 2015 in Portland was the first time we were explicit about consent. The colour-coded lanyards and speaker consent forms are an important part of building a culture of consent.

Thanks to my smart friend Eli Manning (not the football player) for giving me feedback on this.

Open Library Data Additions: OL.150101

planet code4lib - Tue, 2015-02-10 21:48

OL Output of MARC records from Toronto Public Library.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

CrossRef: Introduction to CrossRef: Technical Basics webinar is tomorrow

planet code4lib - Tue, 2015-02-10 20:19

New to CrossRef? Interested in learning more about the technical aspects of CrossRef? Please join us for one of our upcoming Introduction to CrossRef Technical Basics webinars. This webinar will provide a technical introduction to CrossRef and a brief outline of how CrossRef works. New members are especially encouraged to register. All of our webinars are free to attend. If the time of the webinar is not convenient for you, we will be recording the sessions and they will be made available after the webinars for those who register.


Introduction to CrossRef Technical Basics
Date: Wednesday, Feb 11, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Patricia Feeney

Introduction to CrossRef Technical Basics
Date: Wednesday, Mar 18, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Patricia Feeney


Additional CrossRef webinars are also listed on our webinar page.

We look forward to having you join us!

CrossRef: If you are a CrossCheck Administrator - this webinar is for you.

planet code4lib - Tue, 2015-02-10 19:02

CrossCheck: iThenticate Admin Webinar

Date: Thursday, Feb 19, 2015
Time: 7:00 am (San Francisco), 10:00 am (New York), 3:00 pm (London)
Register today!

CrossCheck, powered by iThenticate, now has over 600 members using the service to screen content for originality. In response to demand from these members, CrossRef is trialling a webinar for CrossCheck administrators and more experienced users that will cover:

- The scale of the current CrossCheck database
- CrossCheck participation and usage
- An overview of the newest features in iThenticate
- A run-through of more administrator-specific features in iThenticate
- Advice on interpreting the reports and common issues
- Details on support resources available for publishers
- Q&A session

Representatives from CrossRef and iParadigms will run the webinar, which will last one hour. We hope you can join us!

If you can't make this webinar, visit our webinar page for additional webinars.

Eric Hellman: "Passwords are stored in plain text."

planet code4lib - Tue, 2015-02-10 16:24
Many states have "open records" laws which mandate public disclosure of business proposals submitted to state agencies. When a state library or university requests proposals for library systems or databases, the vendor responses can be obtained and reviewed. When I was in the library software business, it was routine to use these laws to do "competitor intelligence". These disclosures can often reveal inner workings of proprietary vendor software that have implications for information privacy and security.

Consider, for example, this request for "eResources for Minitex". Minitex is a "publicly supported network of academic, public, state government, and special libraries working cooperatively to improve library service for their users in Minnesota, North Dakota and South Dakota" and it negotiates database licenses for libraries throughout the three states.

Question number 172 in this Request for Proposals (RFP) was: "Password storage. Indicate how passwords are stored (e.g., plain text, hash, salted hash, etc.)."

To provide context for this question, you need to know just a little bit of security and cryptography.

I'll admit to having written code 15 years ago that saved passwords as plain text. This is a dangerous thing to do, because if someone were to get unauthorized access to the computer where the passwords were stored, they would have a big list of passwords. Since people tend to use the same password on multiple systems, the breached password list could be used, not only to gain access to the service that leaked the password file, but also to other services, which might include banks, stores and other sites of potential interest to thieves.

As a result, web developers are now strongly admonished never to save passwords as plain text. Doing so in a new system should be considered negligent, and could easily result in liability for the developer if the system security is breached. Unfortunately, many businesses would rather risk paying lawyers a lot of money to defend themselves should something go wrong than bite the bullet and pay some engineers a little money now to patch up the older systems.

To prevent the disclosure of passwords, the current standard practice is to "salt and hash" them.

A cryptographic hash function mixes up a password so that the password cannot be reconstructed. So, for example, the hash of 'my_password' is 'a865a7e0ddbf35fa6f6a232e0893bea4'. When a user enters their password, the hash of the password is recalculated and compared to the saved hash to determine whether the password is correct.
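The store-the-hash, recompute-and-compare flow described above can be sketched as follows. This is a minimal illustration only: it uses MD5 because that is the algorithm used for the examples in this post, not because MD5 is a good choice for passwords, and the function names are made up for the sketch.

```python
import hashlib
import hmac


def md5_hex(password: str) -> str:
    """Return the hex MD5 digest of a password (illustration only; MD5 is weak)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def check_password(candidate: str, stored_hash: str) -> bool:
    """Re-hash the candidate and compare it to the stored digest in constant time."""
    return hmac.compare_digest(md5_hex(candidate), stored_hash)


# The service keeps only the digest, never the password itself.
stored = md5_hex("my_password")

print(check_password("my_password", stored))  # True
print(check_password("wrong_guess", stored))  # False
```

Note that the service never needs the original password again: every login attempt is hashed and the two digests are compared.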

As a result of this strategy, the password can't be recovered. But it can be reset, and the fact that no one can recover the password eliminates a whole bunch of "social engineering" attacks on the security of the service.

Given a LOT of computer power, there are brute force attacks on the hash, but the easiest attack is to compute the hashes for the most common passwords. In a large file of passwords, you should be able to find some accounts that are breachable, even with the hashing. And so a "salt" is added to the password before the hash is applied. In the example above, a hash would be computed for 'SOME_CLEVER_SALTmy_password'. Which, of course, is '52b71cb6d37342afa3dd5b4cc9ab4846'.
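To see why unsalted hashes of common passwords are easy to break, here is a small sketch of the "compute the hashes for the most common passwords" attack. The leaked table, the user names, and the list of common passwords are all hypothetical data invented for the example.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest, matching the unsalted scheme described above."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical leaked table of user -> unsalted hash.
leaked = {
    "alice": md5_hex("123456"),
    "bob": md5_hex("correct horse battery staple"),
    "carol": md5_hex("password"),
}

# Hypothetical short list of common passwords an attacker might try.
common_passwords = ["123456", "password", "qwerty", "letmein"]

# Hash each common password once; the table is reusable against every account
# and every other unsalted password file that uses the same hash function.
lookup = {md5_hex(p): p for p in common_passwords}

cracked = {user: lookup[h] for user, h in leaked.items() if h in lookup}
print(cracked)  # {'alice': '123456', 'carol': 'password'}
```

The attacker never reverses the hash; they only match precomputed digests, which is exactly the shortcut a salt is meant to take away.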

To attack the salted password file, you'd need to know that salt. And since every application uses a different salt, each file of salted passwords is completely different. A successful attack on one hashed password file won't compromise any of the others.
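A sketch of salting in code follows. It deliberately swaps in a different technique from the MD5 examples above: a random salt per user (the approach the LearningExpress response below describes) combined with PBKDF2 from Python's standard library, which is a deliberately slow hash. The iteration count and function names are assumptions chosen for illustration, not a recommendation taken from the RFP.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed work factor, chosen only for this sketch


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


# Two users with the same password end up with different stored digests.
salt_a, digest_a = hash_password("my_password")
salt_b, digest_b = hash_password("my_password")
print(digest_a != digest_b)                                 # True
print(verify_password("my_password", salt_a, digest_a))     # True
print(verify_password("not_my_password", salt_a, digest_a)) # False
```

Because each stored digest depends on its own salt, a table of precomputed hashes for common passwords is of no use; the attacker has to redo the work for every account.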

Another standard practice for user-facing password management is to never send passwords unencrypted. The best way to do this is to use HTTPS, since web browser software alerts the user that their information is secure. Otherwise, any server between the user and the destination server (there might be 20-40 of these for typical web traffic) could read and store the user's password.

The Minitex RFP covers reference databases. For this reason, only a small subset of services offered to libraries are covered here. The authentication for these sorts of systems typically doesn't depend on the user creating a password; user accounts are used to save the results of a search, or to provide customization features. A Minitex patron can use many of the offered databases without providing any sort of password.

So here are the verbatim responses received for the Minitex RFP:

LearningExpress, LLC
Response: "All passwords are stored using a salted hash. The salt is randomly generated and unique for each user."
My comment: This is a correct answer. However, the LearningExpress login sends passwords in the clear over HTTP.

Response: "Passwords are md5 hashed."
My comment: MD5 is the hash algorithm I used in my examples above. It's not considered very secure (see comments). OCLC Firstsearch does not force HTTPS and can send login passwords in the clear.

Response: "N/A"
My comment: This just means that no passwords are used in the service.

Infogroup Library Division
Response: "Passwords are currently stored as plain text. This may change once we develop the customization for users within ReferenceUSA. Currently the only passwords we use are for libraries to access usage stats."
My comment: The user customization now available for ReferenceUSA appears at first glance to be done correctly.

EBSCO Information Services
Response: "EBSCOhost passwords in EBSCOadmin are stored in plain text."
My comment: I should note that EBSCOadmin is not an end-user facing system. So if the EBSCO systems were compromised, only library administrator credentials would be exposed.

Encyclopaedia Britannica, Inc.
Response: "Passwords are stored as plain text."
My comment: I wonder if EB has an article on network security?

Response: "We store all passwords as plain text."
My comment: The ProQuest service available through my library creates passwords over HTTP but uses some client-side encryption. I have not evaluated the security of this encryption.

Scholastic Library Publishing, Inc.
Response: "Passwords are not stored. FreedomFlix offers a digital locker feature and is the only digital product that requires a login and password. The user creates the login and password. Scholastic Library Publishing, Inc does not have access to this information."
My comment: The "FreedomFlix" service not only sends user passwords unencrypted over HTTP, it sends them in a GET query string. This means that not only can anyone see the user passwords in transit, but log files will capture and save them for long-term perusal. Third-party sites will be sent the password in referrer headers. When used on a shared computer, subsequent users will easily see the passwords. "Scholastic Library Publishing" may not have access to user passwords, but everyone else will have them.
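The difference between sending credentials in a GET query string and sending them in a request body can be sketched with the standard library alone. The host name, parameter names, and credentials below are hypothetical; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical credentials and endpoint, for illustration only.
credentials = {"user": "patron1", "password": "s3cret"}

# GET-style login: the credentials become part of the URL itself.
get_url = "http://example.org/login?" + urlencode(credentials)
print(get_url)  # http://example.org/login?user=patron1&password=s3cret

# Anything that records the URL (server logs, browser history, referrer
# headers) can recover the password from it.
leaked = parse_qs(urlsplit(get_url).query)["password"][0]
print(leaked)  # s3cret

# POST-style login over HTTPS: the URL carries no credentials; they travel
# in the encrypted request body instead.
post_url = "https://example.org/login"
post_body = urlencode(credentials).encode("ascii")
print("s3cret" in post_url)  # False
```

Moving the credentials into the body does not help on its own if the connection is plain HTTP; it only keeps them out of the places where URLs are routinely stored.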

Cengage Learning
Response: "Passwords are stored in plain text."
My comment: Like FreedomFlix, the Gale Infotrac service from Cengage sends user passwords in the clear in a GET query string. But it asks the user to enter their library barcode in the password field, so users probably wouldn't be exposing their personal passwords.

So, to sum up, adoption of up-to-date security practices is far from complete in the world of library databases. I hope that the laggards have improved since the submission date of this RFP (roughly a year ago) or at least have plans in place to get with the program. I would welcome comments to this post that provide updates. Libraries themselves deserve a lot of the blame, because for the most part the vendors that serve them respond to their requirements and priorities.

I think libraries issuing RFPs for new systems and databases should include specific questions about security and privacy practices, and make sure that contracts properly assign liability for data breaches with the answers to these questions in mind.

Note: This post is based on information shared by concerned librarians on the LITA Patron Privacy Technologies Interest Group list. Join if you care about this.

DPLA: DPLA and the Digital Library Federation team up to offer special DPLAfest travel grants

planet code4lib - Tue, 2015-02-10 16:00

The Digital Public Library of America (DPLA)  is excited to work with the Digital Library Federation (DLF) program of the Council on Library and Information Resources to offer DPLA + DLF Cross-Pollinator Travel Grants. The purpose of these grants is to extend the opportunity to attend DPLAfest 2015 to four DLF community members. Successful applicants should be able to envision and articulate a connection between the DLF community and the work of DPLA.

It is our belief that the sustainability of large-scale national efforts requires robust community support. Connecting DPLA’s work to the energetic and talented DLF community is a positive way to increase serendipitous collaboration around this shared digital platform.

The goal of the DPLA + DLF Travel Grants is to bring cross-pollinators—active DLF community  contributors who can provide unique perspectives to our work and share the vision of DPLA from their perspective—to the conference. By teaming up with the DLF to provide travel grants, it is our hope to engage DLF community members and connect them to exciting areas of growth and opportunity at DPLA.

The travel grants include DPLAfest 2015 conference registration, travel costs, meals, and lodging in Indianapolis.

The DPLA + DLF Cross-Pollinator Travel Grants are the first in a series of collaborations between CLIR/DLF and DPLA.


Four awards of up to $1,250 each to go towards the travel, board, and lodging expenses of attending DPLAfest 2015. Additionally, the grantees will each receive a complimentary full registration to the event ($75). Recipients will be required to write a blog post about their experience subsequent to DPLAfest; this blog post will be co-published by DLF and DPLA.


Applicants must be staff members of a current DLF member organization and must not currently be working on a DPLA hub team.


Send an email by March 5th, 5 pm EDT, containing the following items (in one document) to, with the subject “DPLAFest Travel Grant: [Full Name]”:

  • Cover letter of nomination from the candidate’s supervisor/manager or an institutional executive, including acknowledgement that candidate would not have been funded to attend DPLAfest.
  • Personal statement from the candidate (ca. 500 words) explaining their educational background, what their digital library/collections involvement is, why they are excited about digital library/collections work, and how they see themselves benefiting from and participating in DPLAfest.
  • A current résumé.

Applications may be addressed to the DPLA + DLF Cross-Pollinator Committee.


Candidates will be selected by DPLA and DLF staff. In assessing the applications, we will look for a demonstrated commitment to digital work, and will consider the degree to which participation might enhance communication and collaboration between the DLF and DPLA communities. Applicants will be notified of their status no later than March 16, 2015.

These fellowships are generously supported by the Council on Library and Information Resources’ Digital Library Federation program.

David Rosenthal: The Evanescent Web

planet code4lib - Tue, 2015-02-10 16:00
Papers drawing attention to the decay of links in academic papers have quite a history; I blogged about three relatively early ones six years ago. Now Martin Klein and a team from the Hiberlink project have taken the genre to a whole new level with a paper in PLoS One entitled Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. Their dataset is 2-3 orders of magnitude bigger than previous studies, their methods are far more sophisticated, and they study both link rot (links that no longer resolve) and content drift (links that now point to different content). There's a summary on the LSE's blog.

Below the fold, some thoughts on the Klein et al paper.

As regards link rot, they write:
"In order to combat link rot, the Digital Object Identifier (DOI) was introduced to persistently identify journal articles. In addition, the DOI resolver for the URI version of DOIs was introduced to ensure that web links pointing at these articles remain actionable, even when the articles change web location."

But even when used correctly, such as, DOIs introduce a single point of failure. This became obvious on January 20th when the domain name briefly expired. DOI links all over the Web failed, illustrating yet another fragility of the Web. It hasn't been a good time for access to academic journals for other reasons either. Among the publishers unable to deliver content to their customers in the last week or so were Elsevier, Springer, Nature, HighWire Press and Oxford Art Online.

I've long been a fan of Herbert van de Sompel's work, especially Memento. He's a co-author on the paper and we have been discussing it. Unusually, we've been disagreeing. We completely agree on the underlying problem of the fragility of academic communication in the Web era as opposed to its robustness in the paper era. Indeed, in the introduction of another (but much less visible) recent paper entitled Towards Robust Hyperlinks for Web-Based Scholarly Communication Herbert and his co-authors echo the comparison between the paper and Web worlds from the very first paper we published on the LOCKSS system a decade and a half ago. Nor am I critical of the research underlying the paper, which is clearly of high quality and which reveals interesting and disturbing properties of Web-based academic communication. All I'm disagreeing with Herbert about is the way the research is presented in the paper.

My problem with the presentation is that this paper, which has a far higher profile than other recent publications in this area, and which comes at a time of unexpectedly high visibility for web archiving, seems to me to be excessively optimistic, and to fail to analyze the roots of the problem it is addressing. It thus fails to communicate the scale of the problem.

The paper is, for very practical reasons of publication in a peer-reviewed journal, focused on links from academic papers to the web-at-large. But I see it as far too optimistic in its discussion of the likely survival of the papers themselves, and the other papers they link to (see Content Drift below). I also see it as far too optimistic in its discussion of proposals to fix the problem of web-at-large references that it describes (see Dependence on Authors below).

All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself (see Getting Links Right below) that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.

In addition, the paper fails to even mention the elephant in the room, the fact that both the papers and the web-at-large content are copyright. The archives upon which the proposed web-at-large solutions rest, such as the Internet Archive, are themselves fragile. Not just for the normal economic and technical reasons we outlined nearly a decade ago, but because they operate under the DMCA's "safe harbor" provision and thus must take down content upon request from a claimed copyright holder. The archives such as Portico and LOCKSS that preserve the articles themselves operate instead with permission from the publisher, and thus must impose access restrictions.

This is the root of the problem. In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.

None of this is to suggest that developing and deploying partial solutions is a waste of time. It is what I've been doing the last quarter of my life. There cannot be a single comprehensive technical solution. The best we can do is to combine a diversity of partial solutions. But we need to be clear that even if we combine everything anyone has worked on we are still a long way from solving the problem. Now for some details.

Content Drift

As regards content drift, they write:
"Content drift is hardly a matter of concern for references to journal articles, because of the inherent fixity that, especially PDF-formated, articles exhibit. Nevertheless, special-purpose solutions for long-term digital archiving of the digital journal literature, such as LOCKSS, CLOCKSS, and Portico, have emerged to ensure that articles and the articles they reference can be revisited even if the portals that host them vanish from the web. More recently, the Keepers Registry has been introduced to keep track of the extent to which the digital journal literature is archived by what memory organizations. These combined efforts ensure that it is possible to revisit the scholarly context that consists of articles referenced by a certain article long after its publication."

While I understand their need to limit the scope of their research to web-at-large resources, the last sentence is far too optimistic.

First, research using the Keepers Registry and other resources shows that at most 50% of all articles are preserved. So future scholars depending on archives of digital journals will encounter large numbers of broken links.

Second, even the 50% of articles that are preserved may not be accessible to a future scholar. CLOCKSS is a dark archive and is not intended to provide access to future scholars unless the content is triggered. Portico is a subscription archive; future scholars' institutions may not have a subscription. LOCKSS provides access only to readers at institutions running a LOCKSS box. These restrictions are a response to the copyright on the content and are not susceptible to technical fixes.

Third, the assumption that journal articles exhibit "inherent fixity" is, alas, outdated. Both the HTML and PDF versions of articles from state-of-the-art publishing platforms contain dynamically generated elements, even when they are not entirely generated on-the-fly. The LOCKSS system encounters this on a daily basis. As each LOCKSS box collects content from the publisher independently, each box gets content that differs in unimportant respects. For example, the HTML content is probably personalized ("Welcome Stanford University") and updated ("Links to this article"). PDF content is probably watermarked ("Downloaded by"). Content elements such as these need to be filtered out of the comparisons between the "same" content at different LOCKSS boxes. One might assume that the words, figures, etc. that form the real content of articles do not drift, but in practice it would be very difficult to validate this assumption.
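The filtering step described above can be illustrated with a minimal Python sketch. It is not the LOCKSS implementation; the filter patterns and the `filtered_digest` name are hypothetical, chosen only to mirror the "Welcome" and "Downloaded by" examples in the text. The idea is to remove elements expected to differ between copies, then compare digests of what remains.

```python
import hashlib
import re

# Hypothetical filter rules; each real publishing platform would need its own.
FILTERS = [
    re.compile(r"Welcome [^<\n]*"),        # personalization banner
    re.compile(r"Downloaded by [^<\n]*"),  # watermark text
]


def filtered_digest(content: str) -> str:
    """Hash content after removing elements expected to differ between copies."""
    for pattern in FILTERS:
        content = pattern.sub("", content)
    # Collapse whitespace so removed elements do not leave differing gaps.
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Two copies of the "same" page as collected by two different boxes.
copy_a = "<p>Welcome Stanford University</p><p>The article text.</p>"
copy_b = "<p>Welcome Example College</p><p>The article text.</p>"

print(filtered_digest(copy_a) == filtered_digest(copy_b))
```

A filter like this only hides the differences it was written for, which is why validating that the real content never drifts remains hard.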

Soft-404 Responses

I've written before about the problems caused for archiving by "soft-403 and soft-404" responses by Web servers. These result from Web site designers who believe their only audience is humans, so instead of providing the correct response code when they refuse to supply content, they return a pretty page with a 200 response code indicating valid content. The valid content is a refusal to supply the requested content. Interestingly, PubMed is an example, as I discovered when clicking on the (broken) PubMed link in the paper's reference 58.

Klein et al define a live web page thus:
On the one hand, the HTTP transaction chain could end successfully with a 2XX-level HTTP response code. In this case we declared the URI to be active on the live web.

Their estimate of the proportion of links which are still live is thus likely to be optimistic, as they are likely to have encountered at least soft-404s if not soft-403s.
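One common heuristic for spotting soft-404s, which is not something Klein et al describe, is to request a sibling URL that almost certainly does not exist and compare the two responses. The sketch below assumes a caller-supplied `fetch` function returning a status code and body; `looks_like_soft_404` and the 0.9 threshold are illustrative choices, not an established standard.

```python
import difflib
import uuid
from typing import Callable, Tuple
from urllib.parse import urljoin

# fetch(url) -> (HTTP status code, response body); supplied by the caller.
Fetch = Callable[[str], Tuple[int, str]]


def looks_like_soft_404(url: str, fetch: Fetch, threshold: float = 0.9) -> bool:
    """Guess whether a 200 response is really a "not found" page."""
    status, body = fetch(url)
    if status != 200:
        return False  # a real error code, not a "soft" one
    # Probe a sibling URL with a random name that should not exist.
    probe = urljoin(url, uuid.uuid4().hex)
    probe_status, probe_body = fetch(probe)
    if probe_status != 200:
        return False  # the server reports missing pages honestly
    # If the page looks like the page for a nonsense URL, treat it as soft-404.
    similarity = difflib.SequenceMatcher(None, body, probe_body).ratio()
    return similarity >= threshold


def fake_fetch(url: str) -> Tuple[int, str]:
    """Stand-in for a server that returns 200 for everything."""
    if url.endswith("/real"):
        return 200, "Actual article content about preservation " * 5
    return 200, "Sorry, we could not find that page."


print(looks_like_soft_404("http://example.org/gone", fake_fetch))
```

A check like this doubles the number of requests and can still be fooled by pages that embed the requested URL, so it lowers the over-count of live links rather than eliminating it.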

Getting Links Right

Even when the resolver is working, its effectiveness in persisting links depends on its actually being used. Klein et al discover that in many cases it isn't:

one would assume that URI references to journal articles can readily be recognized by detecting HTTP URIs that carry a DOI, e.g., However, it turns out that references rather frequently have a direct link to an article in a publisher's portal, e.g., instead of the DOI link.

The direct link may well survive relocation of the content within the publisher's site. But journals are frequently bought and sold between publishers, causing the link to break. I believe there are two causes for these direct links: publishers' platforms inserting them so as not to risk losing the reader and, more importantly, the difficulty for authors of creating correct links. Cutting and pasting from the URL bar in their browser necessarily gets the direct link; creating the correct one via requires the author to know that it should be hand-edited, and to remember to do it.

Attempts to ensure linked materials are preserved suffer from a similar problem:
The solutions component of Hiberlink also explores how to best reference archived snapshots. The common and obvious approach, followed by Webcitation and, is to replace the original URI of the referenced resource with the URI of the Memento deposited in a web archive. This approach has several drawbacks. First, through removal of the original URI, it becomes impossible to revisit the originally referenced resource, for example, to determine what its content has become some time after referencing. Doing so can be rather relevant, for example, for software or dynamic scientific wiki pages. Second, the original URI is the key used to find Mementos of the resource in all web archives, using both their search interface and the Memento protocol. Removing the original URI is akin to throwing away that key: it makes it impossible to find Mementos in web archives other than the one in which the specific Memento was deposited. This means that the success of the approach is fully dependent on the long term existence of that one archive. If it permanently ceases to exist, for example, as a result of legal or financial pressure, or if it becomes temporally inoperative as a result of technical failure, the link to the Memento becomes rotten. Even worse, because the original URI was removed from the equation, it is impossible to use other web archives as a fallback mechanism. As such, in the approach that is currently common, one link rot problem is replaced by another.

The paper, and a companion paper, describe Hiberlink's solution, which is to decorate the link to the original resource with an additional link to its archived Memento. Rene Voorburg of the KB has extended this by implementing robustify.js:
robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. The script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (as part of the Hiberlink project) in how it tries to discover an archived version of the page. As a default, it will use the Memento Time Travel service as a fallback. You can easily implement robustify.js on your web pages in so that it redirects pages to your preferred web archive.

Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies.
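To make the link-decoration idea concrete, here is a small Python sketch that keeps the original URI as the link target and adds the archived copy alongside it. The attribute names `data-versionurl` and `data-versiondate` reflect my understanding of the Robust Links specification and should be checked against it; `decorate_link`, `resolve` and the example URLs are hypothetical, and this is not how robustify.js itself is written.

```python
from html import escape
from typing import Callable


def decorate_link(original_url: str, memento_url: str,
                  version_date: str, text: str) -> str:
    """Return an anchor that keeps the original URI and adds the archived copy."""
    return (
        f'<a href="{escape(original_url, quote=True)}" '
        f'data-versionurl="{escape(memento_url, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


def resolve(original_url: str, memento_url: str,
            is_live: Callable[[str], bool]) -> str:
    """Pick the link target: the original if it is live, else the archived copy."""
    return original_url if is_live(original_url) else memento_url


anchor = decorate_link(
    "http://example.org/page",
    "http://archive.example/2015/http://example.org/page",  # placeholder archive
    "2015-02-10",
    "a referenced page",
)
print(anchor)
```

Because the original URI survives in `href`, other archives can still be searched for it if the one named in the decoration disappears. The `is_live` check is where soft-404s would mislead the fallback.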

Dependence on Authors

Many of the solutions that have been proposed to the problem of reference rot also suffer from dependence on authors:
Webcitation was a pioneer in this problem domain when, years ago, it introduced the service that allows authors to archive, on demand, web resources they intend to reference. ... But Webcitation has not been met with great success, possibly the result of a lack of authors' awareness regarding reference rot, possibly because the approach requires an explicit action by authors, likely because of both.

Webcitation is not the only one:
To a certain extent, portals like FigShare and Zenodo play in this problem domain as they allow authors to upload materials that might otherwise be posted to the web at large. The recent capability offered by these systems that allows creating a snapshot of a GitHub repository, deposit it, and receive a DOI in return, serves as a good example. The main drivers for authors to do so is to contribute to open science and to receive a citable DOI, and, hence potentially credit for the contribution. But the net effect, from the perspective of the reference rot problem domain, is the creation of a snapshot of an otherwise evolving resource. Still, these services target materials created by authors, not, like web archives do, resources on the web irrespective of their authorship. Also, an open question remains to which extent such portals truly fulfill a long term archival function rather than being discovery and access environments.

Hiberlink is trying to reduce this dependence:
In the solutions thread of Hiberlink, we explore pro-active archiving approaches intended to seamlessly integrate into the life cycle of an article and to require less explicit intervention by authors. One example is an experimental Zotero extension that archives web resources as an author bookmarks them during note taking. Another is HiberActive, a service that can be integrated into the workflow of a repository or a manuscript submission system and that issues requests to web archives to archive all web at large resources referenced in submitted articles.

But note that these services (and Voorburg's) depend on the author or the publisher installing them. Experience shows that authors are focused on getting their current paper accepted, large publishers are reluctant to implement extensions to their publishing platforms that offer no immediate benefit, and small publishers lack the expertise to do so.

Ideally, these services would be back-stopped by a service that scanned recently-published articles for web-at-large links and submitted them for archiving, thus requiring no action by author or publisher. The problem is that doing so requires the service to have access to the content as it is published. The existing journal archiving services, LOCKSS, CLOCKSS and Portico, have such access to about half the published articles, and could in principle be extended to perform this service. In practice doing so would need at least modest funding. The problem isn't as simple as it appears at first glance, even for the articles that are archived. For those that aren't, primarily from less IT-savvy authors and small publishers, the outlook is bleak.
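The core of such a back-stop service might look like the Python sketch below: pull the URLs out of an article's text, set aside links that go through a DOI resolver, and hand the rest to an archive. Everything here is an assumption made for illustration, including the URL pattern, the short `EXCLUDED_HOSTS` list, and the `submit` callback standing in for whatever submission interface a given archive offers.

```python
import re
from typing import Callable, List, Tuple
from urllib.parse import urlparse

# Deliberately simple pattern; real article markup would need a proper parser.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

# Hosts treated as scholarly resolvers rather than web-at-large.
# An assumed, incomplete list.
EXCLUDED_HOSTS = {"doi.org", "dx.doi.org"}


def web_at_large_links(article_text: str) -> List[str]:
    """Return the distinct non-DOI URLs in the text, in order of appearance."""
    seen: List[str] = []
    for match in URL_PATTERN.findall(article_text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        host = (urlparse(url).hostname or "").lower()
        if host in EXCLUDED_HOSTS:
            continue
        if url not in seen:
            seen.append(url)
    return seen


def submit_all(article_text: str,
               submit: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Call submit(url) for each link; submit is a placeholder for an archive's API."""
    return [(url, submit(url)) for url in web_at_large_links(article_text)]


sample = ("See http://example.org/data.csv, and https://doi.org/10.1000/xyz. "
          "Also (http://example.org/data.csv).")
print(web_at_large_links(sample))
```

The extraction is the easy part. Getting timely access to the published content, and paying for the service, are the obstacles the paragraph above describes.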

Archiving

Finally, the solutions assume that submitting a URL to an archive is enough to ensure preservation. It isn't. The referenced web site might have a robots.txt policy preventing collection. The site might have crawler traps, exceed the archive's crawl depth, or use Javascript in ways that prevent the archive collecting a usable representation. Or the archive may simply not process the request in time to avoid content drift or link rot.
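The robots.txt case is the easiest of these obstacles to check in advance. As a rough sketch, Python's standard `urllib.robotparser` can evaluate a policy against a URL; the user-agent string and the sample policy below are made up, and whether a given archive honors robots.txt is a matter of that archive's policy.

```python
from urllib.robotparser import RobotFileParser


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A made-up policy that blocks one directory for all crawlers.
robots = """User-agent: *
Disallow: /private/
"""

print(may_collect(robots, "example-archive-bot", "http://example.org/private/x"))
print(may_collect(robots, "example-archive-bot", "http://example.org/public/x"))
```

Crawler traps, crawl depth limits and Javascript-dependent pages have no comparably simple test.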

Acknowledgement

I have to thank Herbert van de Sompel for greatly improving this post through constructive criticism. But it remains my opinion alone.

DuraSpace News: UPDATE: Open Repositories 2015 DEVELOPER TRACK

planet code4lib - Tue, 2015-02-10 00:00

Adam Field and Claire Knowles, OR2015 Developer Track Co-Chairs; Cool Tools, Daring Demos and Fab Features

Indianapolis, IN  The OR2015 developer track presents an opportunity to share the latest developments across the technical community. We will be running informal sessions of presentations and demonstrations showcasing community expertise and progress:

SearchHub: Mark Your Calendar: Lucene/Solr Revolution 2015

planet code4lib - Mon, 2015-02-09 21:42
If you attended Lucene/Solr Revolution 2014 and took the post-event survey, you’ll know that we polled attendees on where they would like to see the next Revolution take place. Austin was our winner, and we couldn’t be more excited!

Mark your calendar for October 13-16 for Lucene/Solr Revolution 2015 at the Hilton Austin for four days packed with hands-on developer training and multiple educational tracks led by industry experts focusing on Lucene/Solr in the enterprise, case studies, large-scale search, and data integration.

We had a blast in DC for last year’s conference, and this year we’re adding even more opportunities to network and interact with other Solr enthusiasts, experts, and committers. Registration and Call for Papers will open this spring. To stay up-to-date on all things Lucene/Solr Revolution, visit and follow @LuceneSolrRev on Twitter.

Revolution 2014 Resources

Videos, presentations, and photos from last year’s conference are now available. Check them out at the links below.

  • View presentation recordings from Lucene/Solr Revolution 2014
  • Download slides from Lucene/Solr Revolution 2014 presentations
  • View photos from Lucene/Solr Revolution 2014

The post Mark Your Calendar: Lucene/Solr Revolution 2015 appeared first on Lucidworks.

LibraryThing (Thingology): New “More Like This” for LibraryThing for Libraries

planet code4lib - Mon, 2015-02-09 18:04

We’ve just released “More Like This,” a major upgrade to LibraryThing for Libraries’ “Similar items” recommendations. The upgrade is free and automatic for all current subscribers to LibraryThing for Libraries Catalog Enhancement Package. It adds several new categories of recommendations, as well as new features.

We’ve got text about it below, but here’s a short (1:28) video:

What’s New

Similar items now has a See more link, which opens More Like This. Browse through different types of recommendations, including:

  • Similar items
  • More by author
  • Similar authors
  • By readers
  • Same series
  • By tags
  • By genre

You can also choose to show one or several of the new categories directly on the catalog page.

Click a book in the lightbox to learn more about it—a summary when available, and a link to go directly to that item in the catalog.

Rate the usefulness of each recommended item right in your catalog—hovering over a cover gives you buttons that let you mark whether it’s a good or bad recommendation.

Try it Out!

Click “See more” to open the More Like This browser in one of these libraries:

Find out more

Find more details for current customers on what’s changing and what customizations are available on our help pages.

For more information on LibraryThing for Libraries or if you’re interested in a free trial, email, visit, or register for a webinar.

Library of Congress: The Signal: DPOE Interview: Three Trainers Launch Virtual Courses

planet code4lib - Mon, 2015-02-09 15:56

The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress.

This is the first post in a series about digital preservation training inspired by the Library’s Digital Preservation Outreach & Education (DPOE) Program.  Today I’ll focus on some exceptional individuals, who among other things, have completed one of the DPOE Train-the-Trainer workshops and delivered digital preservation training. I am interviewing Stephanie Kom, North Dakota State Library; Carol Kussmann, University of Minnesota Libraries; and Sara Ring, Minitex (a library network providing continuing education and other services to MN, ND and SD), who recently led an introductory virtual course on digital preservation.

Barrie: Carol, you attended the inaugural DPOE Train-the-Trainer Workshop in Washington, and Stephanie and Sara, you attended the first regional event at the Indiana State Archives during the summer of 2012, correct? Can you tell the readers about your experiences and how you and others have benefited as a result?

Carol Kussmann

Carol: In addition to learning about the DPOE curriculum itself, the most valuable aspect of these Train-the-Trainer workshops was meeting new people and building relationships. In the inaugural workshop, we met people from across the country, many of whom I have looked to for advice or worked with on other projects. Because of the Indiana regional training, we now have a sizable group of trainers in the Midwest that I feel comfortable talking with about DPOE and other electronic record issues. We work with each other and provide feedback and assistance when we go out and train others or work on digital preservation issues in our own roles.

Stephanie Kom

Stephanie: We were just starting a digital program at my institution so the DPOE training was beyond helpful in just informing me what needed to be done to preserve our future digital content. It gave me the tools to explain our needs to our IT department. I also echo Carol’s thoughts on the networking opportunities. It was a great way to meet people in the region that are working with the same issues.

Sara: As my colleagues mentioned, in addition to learning the DPOE curriculum, what was most valuable to me was meeting new colleagues and forming relationships to build upon after the workshop. Shortly after the training, about eight of us began meeting virtually on a regular basis to offer our first digital preservation course (using the DPOE curriculum). Our small upper Midwest collaborative included trainers from North Dakota, South Dakota, Minnesota and Wisconsin. We had trainers from libraries, state archives and a museum participating, and we found we all had different strengths to share with our audience. Our first virtual course, “Managing Digital Content Over Time: An Introduction to Digital Preservation,” reached about 35 organizations of all types, and our second virtual course reached about 20 organizations in the region.

Sara Ring

Barrie: Since becoming official DPOE trainers, you have developed a virtual course to provide an introduction to digital preservation. Can you provide a few details about the course, and have you developed any other training materials from the DPOE Curriculum?

Stephanie, Carol, Sara: The virtual course we offered was broken up as three sessions, scheduled every other week. Each session covered two of the DPOE modules. Using the DPOE workshop materials as a starting point we added local examples from our own organizations and built in discussion questions and polls for the attendees so that we had plenty of interaction.

Evaluations from this first offering informed us that people wanted to know more about various tools used to manage and preserve digital content. In response, in our second offering of the course we built in more demonstrations of tools to help identify, manage and monitor digital content over time. Since we were discussing and demonstrating tools that dealt with metadata, we added more content about technical and preservation metadata standards. We also built in take-home exercises for attendees to complete between sessions. Attendees have responded well to these changes and find the take-home exercises that we have built in really useful.

We also created a Google Site for this course, with an up-to-date list of resources, best practices and class exercises. Carol created step-by-step guides that people can follow for understanding and using tools that can assist with managing and preserving their electronic records. These can be found on the University of Minnesota Libraries Digital Preservation Page.

Working through Minitex, we have developed three different classes related to digital preservation: An Introduction to Digital Preservation (webinar); the DPOE virtual course that was mentioned; and a full day in-person DPOE-based workshop. We have presented each of these at least two times.

Tools Quick Reference Guide, provided to attendees of “Managing Digital Content Over Time.”

Barrie: The DPOE curriculum, which is built upon the OAIS Reference Model, recently underwent a revision. Have you noticed any significant changes in the materials since you attended the workshop in 2011 or 2012? What improvements have you observed?

Carol: What I like about DPOE is that it provides a framework for people to talk about common issues related to digital preservation. The main concepts have not changed – which is good, but there has been a significant increase to the number of examples and resources. The “Digital Preservation Trends” slides were not available in the 2011 training. Keeping up to date on what people are doing, exploring new resources and tools, and following changing best practices is very important as digital preservation continues to be a moving target.

Sara, Stephanie: We found the “Digital Preservation Trends” slides, the final module covered in the DPOE workshop, to be a nice addition to the baseline curriculum. We don’t think they existed when we attended the DPOE train-the-trainer workshop back in 2012. We both especially like the “Engaging with the Digital Preservation Community” section which lists some of the organizations, listservs, and conferences that would be of interest to digital preservation practitioners. When you’re new to digital preservation (or the only one at your organization working with digital content), it can be overwhelming knowing where to start. Providing resources like this offers a way to get involved in the digital preservation community; to learn from each other. We always try to close our digital preservation classes by providing community resources like this.

Barrie: Regarding training opportunities, could you compare the strengths and challenges of traditional in-person learning environments to distance learning options?

Stephanie, Carol, Sara: Personally we all prefer in-person learning environments over virtual and believe that most people would agree. We saw this preference echoed in the DPOE 2014 Training Needs Assessment Survey (PDF).

The main strength of in-person is the interaction with the presenter and other participants; as a presenter you can adjust your presentation immediately based on audience reactions and their specific needs and understanding. As a participant you can meet and relate to other people in similar situations, and there are more opportunities at in-person workshops for having those types of discussions with colleagues during breaks or during lunch.

However, in-person learning is not always feasible with travel time and costs, and in this part of the country, weather often gets in the way (we have all had our share of driving through blizzard conditions in Minnesota and North Dakota). Convenience and timeliness are definitely benefits of long distance learning; more people from a single institution can often attend for little or no additional cost. As trainers we have worked really hard to build in hands-on activities in our virtual digital preservation courses, but could probably do a lot more to encourage networking among the attendees.

Barrie: Are there plans to convene the “Managing Digital Content Over Time” series in 2015?

Stephanie, Carol, Sara: Yes, we plan on offering at least one virtual course this spring. We’ll be checking in with our upper Midwest collaborative of trainers to see who is interested in participating this time around. Minitex provides workshops on request, so we may do more virtual or in-person classes if there is demand.

One of the hands-on activities for the in-person “Managing Digital Content Over Time” course.

Barrie: How has the DPOE program influenced and/or affected the work that you do at your organization?

Carol: The inaugural DPOE Training (2011) took place while I was working on an NDIIPP project led by the Minnesota State Archives to preserve and provide access to government digital records, and it provided me with additional tools to work with during the project. After the project ended, I continued to use the information I learned during the project and DPOE training to develop a workflow for processing and preserving digital records at the Minnesota State Archives.

Since then, I became a Digital Preservation Analyst at the University of Minnesota Libraries where I continue to focus on digital preservation workflows, education and training, and other related activities. Overall, the DPOE training helped to build a foundation from which to discuss digital preservation with others whether in a classroom setting, conference presentation or one-on-one conversations. I look forward to continuing to work with members of the DPOE community.

Sara: As a digitization and metadata training coordinator at Minitex, a large part of my job is developing and presenting workshops for library professionals in our region. Participating in the DPOE training (2012) was one of the first steps I took to build and expand our training program at Minitex to include digital preservation. The DPOE program has also given me the opportunity to build up our own small cohort of DPOE trainers in the region, so we can schedule regular workshops based on who is available to present at the time.

Stephanie: I started the digitization program at our institution in 2012. Digital preservation has become a main component of that program and I am still working to get a full-fledged plan moving. Our institution is responsible for preserving other digital content and I would like our preservation plan to encompass all aspects of our work here at the library. I think one of the great things about the DPOE training is that the different pieces can be implemented before starting to produce digital content or it can be retrofitted into an already-established digital program. It can be more work when you already have a lot of digital content but the training materials make each step seem manageable.

Open Knowledge Foundation: Pakistan Data Portal

planet code4lib - Mon, 2015-02-09 11:29

December 2014 saw the Sustainable Development Policy Institute and Alif Ailaan launch the Pakistan Data Portal at the 30th Annual Sustainable Development Conference. The portal, built using CKAN by Open Knowledge, provides an access point for viewing and sharing data relating to all aspects of education in Pakistan.

A particular focus of this project was to design an open data portal that could be used to support advocacy efforts by Alif Ailaan, an organisation dedicated to improving education outcomes in Pakistan.

The Pakistan Data Portal (PDP) is the definitive collection of information on education in Pakistan and collates datasets from private and public research organisations on topics including infrastructure, finance, enrollment, and performance to name a few. The PDP is a single point of access against which change in Pakistani education can be tracked and analysed. Users, who include teachers, parents, politicians and policy makers, are able to browse historical data and compare and contrast it across regions and years to reveal a clear, customizable picture of the state of education in Pakistan. From this clear overview, the drivers and constraints of reform can be identified, which allows Alif Ailaan and others pushing for change in the country to focus their reform efforts.

Pakistan is facing an education emergency. It is a country with 25m children out of education, and 50% of girls of school age do not attend classes. A census has not been completed since 1998 and there are problems with the data that is available. It is outdated, incomplete, error-ridden and only a select few have access to much of it. An example that highlights this is a recent report from ASER, which estimates the number of children out of school at 16 million fewer than the number computed by Alif Ailaan in another report. NGOs and other advocacy groups have tended to only be interested in data when it can be used to confirm that the funds they are utilising are working. Whilst there is agreement on the overall problem, if people cannot agree on its scale, how can a consensus solution be hoped for?

Alif Ailaan believe that if you can’t measure the state of education in the country, you can’t hope to fix it. This forms the focus of their campaigning efforts. So whilst the quality of the data is a problem, some data is better than no data, and the PDP forms a focus for gathering quality information together and for building a platform from which to build and promote policy change, so that policy makers can make accurate decisions which are backed up by data.

The data accessible through the portal is supported by regular updates from the PDP team who draw attention to timely key issues and analyse the data. A particular subject or dataset will be explored from time to time and these general blog posts are supported by “The Week in Education” which summarises the latest education news, data releases and publications.

CKAN was chosen as the portal best placed to meet the needs of the PDP. Open Knowledge were tasked with customising the portal and providing training and support to the team maintaining it. A custom dashboard system was developed for the platform in order to present data in an engaging visual format.

As explained by Asif Mermon, Associate Research Fellow at SDPI, the genius of the portal is the shell. As institutions start collecting data, or old data is uncovered, it can be added to the portal to continually improve the overall picture.

The PDP is in constant development to further promote the analysis of information in new ways and build on the improvement of the visualizations on offer. There are also plans to expand the scope of the portal, so that areas beyond education can also reap its benefits. A further benefit is that the shell can then be exported around the world so other countries will be able to benefit from the development.

The PDP initiative is part of the multi-year DFID-funded Transforming Education Pakistan (TEP) campaign aiming to increase political will to deliver education reform in Pakistan. Accadian, on behalf of HTSPE, appointed the Open Knowledge Foundation to build the data observatory platform and provide support in managing the upload of data including onsite visits to provide training in Pakistan.


Hydra Project: Announcing Hydra 9.0.0

planet code4lib - Mon, 2015-02-09 09:56

We’re pleased to announce the release of Hydra 9.0.0.  This Hydra gem brings together a set of compatible gems for working with Fedora 4. Amongst others it bundles Hydra-head 9.0.1 and Active-Fedora 9.0.0.  In addition to working with Fedora 4, Hydra 9 includes many improvements and bug fixes. Especially notable is the ability to add RDF properties on repository objects themselves (no need for datastreams) and large-file streaming support.

The new gem represents almost a year of effort – our thanks to all those who made it happen!

Release notes:

DuraSpace News: Fedora 4 Makes Islandora Even Better!

planet code4lib - Mon, 2015-02-09 00:00

Combining Islandora 7 and Fedora 4 offers key advantages for users and developers.

Charlottetown, PEI, CA  Islandora is an open source software framework for managing and discovering digital assets utilizing a best-practices framework that includes Drupal, Fedora, and Solr. Islandora is implemented and built by an ever-growing international community.

CrossRef: Geoffrey Bilder will be at the 10th IDCC in London tomorrow

planet code4lib - Sun, 2015-02-08 21:58

Geoffrey Bilder @gbilder will be part of a panel entitled Why is it taking so long?. The panel will explore why some types of change in curation practice take so long and why others happen quickly. The panel will be moderated by Carly Strasser @carlystrasser, Manager of Strategic Partnerships for DataCite. The panel will take place on Monday, February 9th at 16:30 at 30 Euston Square in London. Learn more. #idcc15

