Planet Code4Lib

DPLA: DPLA releases Krikri 0.1.3

Wed, 2015-02-11 18:51

The Digital Public Library of America (DPLA) is happy to announce the release of Krikri version 0.1.3, a Ruby on Rails engine for metadata aggregation, enhancement, and quality control. DPLA uses Krikri as part of Heiðrún, its new metadata ingestion system.

Krikri 0.1.3 includes the following features:
  • Harvesting metadata from OAI-PMH providers, and support for building other harvesters
  • Creating RDF metadata models, with specific support for the DPLA Metadata Application Profile
  • Parsing metadata and mapping to RDF graphs using a Domain Specific Language
  • Persistence for graphs and objects using the Linked Data Platform specification
  • Enrichments for mapped metadata, including date parsing and normalization, stripping and splitting on punctuation, and more
  • Queuing and association of jobs to metadata using provenance information
  • A basic quality assurance interface, including record browse and search, a record-graph comparison view, and reports on conformance to your metadata application profile
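For readers unfamiliar with OAI-PMH, the harvesting in the first bullet works over plain HTTP: the harvester issues requests such as ListRecords and pages through results using resumption tokens. Here is a minimal sketch in Python (not Krikri's actual Ruby implementation; only the standard OAI-PMH request parameters and namespace are assumed):

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # When resuming, the token is the only argument allowed besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_text):
    """Extract (identifier, datestamp) pairs and any resumptionToken."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI_NS + "record"):
        header = rec.find(OAI_NS + "header")
        records.append((header.findtext(OAI_NS + "identifier"),
                        header.findtext(OAI_NS + "datestamp")))
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    return records, token
```

A real harvester like Krikri layers scheduling, error handling, and persistence on top of this request/parse loop.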
Krikri and Heiðrún are open source software, released under the MIT License. Krikri and Heiðrún are built on top of other open source components, including Apache Marmotta, Apache Solr, ActiveTriples, Blacklight, and Resque. More information about Krikri and Heiðrún can be found at the following links:

LITA: Jobs in Information Technology: February 11

Wed, 2015-02-11 18:43

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Digital Curation Librarian, Fort Hays State University-Forsyth Library, Hays, KS

Emerging Technologies Coordinator, Science and Engineering Division, Columbia University Libraries/Information Services, New York, NY

Librarian – Systems and Technologies, Santa Barbara City College Luria Library, Santa Barbara, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

District Dispatch: ALA applauds legislation for increased Wi-Fi spectrum

Wed, 2015-02-11 16:16

Today, Senators Marco Rubio (R-FL) and Cory Booker (D-NJ) reintroduced the Wi-Fi Innovation Act (S.424), which would help ensure that our nation’s libraries and their communities have access to the spectrum needed to meet growing demands for wireless access. The legislation would require the Federal Communications Commission (FCC) to conduct a feasibility study on providing additional unlicensed spectrum in the upper 5 GHz spectrum band.

“We welcome this bipartisan effort from Senators Rubio and Booker to improve access to the Internet,” said ALA President Courtney Young in a statement. “Libraries are first responders in providing information and services for people across the country, and robust Wi-Fi is an increasingly important library service. By offering no-fee public access to the Internet via wireless connections, libraries serve as community technology hubs that enable digital opportunity and full participation in the nation’s economy.”

Public libraries are the most common public Wi-Fi access point for African-Americans and Latinos—with roughly one-third of these communities using public library Wi-Fi. This is true for 23 percent of white people, who list school as their top public Wi-Fi spot. Virtually all (98 percent) public libraries now offer Wi-Fi, up from 18 percent a decade ago.

“There is increasing demand to support the growing universe of wireless devices and services, and making more unlicensed spectrum available is critical,” Young concluded.

Companion legislation was introduced in the U.S. House of Representatives by Congressman Bob Latta (R-OH), and cosponsored by Congressman Darrell Issa (R-CA), Congresswoman Anna G. Eshoo (D-CA), Congresswoman Doris Matsui (D-CA) and Congresswoman Suzan DelBene (D-WA).

The post ALA applauds legislation for increased Wi-Fi spectrum appeared first on District Dispatch.


Hydra Project: Share your Cool Tools, Daring Demos and Fab Features at Open Repositories 2015

Wed, 2015-02-11 11:02

Of interest to the community:


Open Repositories 2015 DEVELOPER TRACK

June 8-11, 2015, Indianapolis, Indiana,

*** Deadline 13th March 2015 ***

Cool Tools, Daring Demos and Fab Features

The OR2015 developer track presents an opportunity to share the latest developments across the technical community. We will be running informal sessions of presentations and demonstrations showcasing community expertise and progress:

What cool development tools, frameworks, languages and technologies could you not get on without?

Is there a particular technique or process that you find apt for solving particular day-to-day repository problems?  Demonstrate it to the community.  Extra credit for command-line shenanigans and live debugging.

What new features (however small) have you added to your organisation’s repository?  What technologies were used and how did you arrive at your solution?

Presentations will be flexibly timed (5 to 20 minutes). Live demos, code repositories, ssh, hacking and audience participation are encouraged.

Submissions should take the form of a title and short paragraph detailing what will be shared with the community (including the specific platform and/or technologies you will be showcasing). Please also give an estimate of the duration of your demonstration.

Submit your proposal here: by March 13, 2015

Ideas Challenge

The Developer Challenge this year has been replaced by the more inclusive IDEAS CHALLENGE. We would like to encourage teams to form before and during the conference to propose an innovative solution to a real-world problem that repository users currently face.  Each team should include members from both the developer and user community, and represent more than one institution.

Teams’ ideas will be presented to the conference and prizes will be awarded based on the nature of the problem, the quality of the solution and the make-up of the team. Find out more at

For inquiries, please contact the Developer Track Co-Chairs, Adam Field and Claire Knowles  at af05[AT] and claire.knowles[AT]

CrossRef: February 2015: CrossRef deploys Co-Access Functionality for Books

Wed, 2015-02-11 08:08

Co-Access functionality has been set up by CrossRef to address the issues caused when there are multiple organisations involved in the hosting and distribution of a given book.

Co-Access is effectively an extension of CrossRef's Multiple Resolution (MR) functionality which allows multiple URLs to be assigned to a single DOI. MR functionality relies on title ownership to try to avoid conflicts and maintain the uniqueness of DOIs so that they can be queried effectively.

MR has worked well for journals, but because many book publishers outsource the hosting of their content to multiple aggregators and platforms, they need a separate process that allows independent transactions on the part of the primary publisher and any secondary content hosts, rather than interactions having to be co-ordinated by the primary publisher (who may not be depositing DOIs and metadata with CrossRef at all).

Co-Access will allow multiple parties to deposit DOIs for a single publication, and have the CrossRef system automatically resolve overlaps (we call them conflicts) and establish MR for any DOI where multiple target URLs exist for the same book or part of a book (e.g. chapters). This arrangement would be set up between the primary publisher of a work and a set of approved participants who could also deposit (and update) DOIs for a publication. This relationship can be enabled either between prefixes or within a single prefix.

This change would allow Co-Access members to operate independently of one another when assigning DOIs to book content, and aims to reduce the amount of coordination required between the primary publisher and the secondary content hosts.
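A rough way to picture the behaviour described above: each party deposits (DOI, URL) pairs, the system merges them, and any DOI that ends up with more than one distinct target becomes a multiple-resolution DOI rather than a flagged conflict. This is only an illustrative model in Python, not CrossRef's actual implementation (the data shapes are invented):

```python
from collections import defaultdict

def merge_deposits(deposits):
    """Merge DOI deposits from multiple parties.

    deposits: iterable of (depositor, doi, target_url) tuples.
    Returns a dict mapping each DOI to its sorted list of target URLs.
    """
    targets = defaultdict(set)
    for depositor, doi, url in deposits:
        targets[doi].add(url)
    return {doi: sorted(urls) for doi, urls in targets.items()}

def multiple_resolution_dois(merged):
    """DOIs for which more than one target URL exists (i.e. MR candidates)."""
    return [doi for doi, urls in merged.items() if len(urls) > 1]
```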

CrossRef is currently embarking on a pilot of this functionality - we've done our own testing based on some known scenarios - but now needs publishers to trial it on some of their own publications and provide feedback on how well it works for them. To commence the testing process, CrossRef needs publishers to identify a title that will need a Co-Access arrangement set up with another party (and who the party or parties would be). If your organisation would like to be involved, please contact for further information.

Galen Charlton: Ogres, hippogriffs, and authorized Koha service providers

Wed, 2015-02-11 01:14

What do ogres, hippogriffs, and authorized Koha service providers have in common?

Each of them is an imaginary creature.

20070522 Madrid: hippogriff — image by Larry Wentzel on Flickr (CC-BY)

Am I saying that Koha service providers are imaginary creatures? Not at all — at the moment, there are 54 paid support providers listed on the Koha project’s website.

But not a one of them is “authorized”.

I bring this up because a friend of mine in India (full disclosure: who himself offers Koha consulting services) ran across this flyer by Avior Technologies:

The bit that I’ve highlighted is puffery at best, misleading at worst. The Koha website’s directory of paid support providers is one thing, and one thing only: a directory. The Koha project does not endorse any vendors listed there — and neither the project nor the Horowhenua Library Trust in New Zealand (which holds various Koha trademarks) authorizes any firm to offer Koha services.

If you want your firm to get included in the directory, you need only do a few things:

  1. Have a website that contains an offer of services for Koha.
  2. Ensure that your page that offers services links back to
  3. Make a public request to be added to the directory.

That’s it.

Not included on this list of criteria:

  • Being good at offering services for Koha libraries.
  • Contributing code, documentation, or anything else to the Koha project.
  • Having any current customers who are willing to vouch for you.
  • Being alive at present (although eventually, your listing will get pulled for lack of response to inquiries from Koha’s webmasters).

What does this mean for folks interested in getting paid support services?  There is no shortcut to doing your due diligence — it is on you to evaluate whether a provider you might hire is competent and able to keep their customers reasonably happy. The directory on the Koha website exists as a convenience for folks starting a search for a provider, but beyond that: caveat emptor.

I know nothing about Avior Technologies. They may be good at what they do; they may be terrible — I make no representation either way.

But I do know this: while there are some open source projects where the notion of an “authorized” or “preferred” support provider may make some degree of sense, Koha isn’t such a project.

And that’s generally to the good of all: if you have Koha expertise or can gain it, you don’t need to ask anybody’s permission to start helping libraries run Koha — and get paid for it.  You can fill niches in the market that other Koha support providers cannot or do not fill.

You can in time become the best Koha vendor in your niche, however you choose to define it.

But authority? It will never be bestowed upon you. It is up to you to earn it by how well you support your customers, and by how much you contribute to the global Koha project.


DuraSpace News: AVAILABLE: Fedora 4.1.0 Release

Wed, 2015-02-11 00:00

Winchester, MA  The Fedora team is proud to announce that Fedora 4.1.0 was released on February 4, 2015 and is now available.

pinboard: Presentation Index

Tue, 2015-02-10 23:12
RT @ronallo: Detailed version of my talk on WebVTT with full speaker notes and audience handouts here: #c4l15 #code4lib

Tara Robertson: Developing a culture of consent at code4lib

Tue, 2015-02-10 23:01

I love code4lib. code4lib is not a formal organization; it’s more of a loose collective of folks. The culture is very DIY: if you see something that needs doing, someone needs to step up and do it. I love that part of our culture is reflecting on our culture and thinking of ways to improve it. At this year’s conference we made some improvements to our culture.

Galen Charlton kicked this discussion off with an email on the code4lib list by suggesting we institute a policy like the Evergreen conference (which was informed by work done by The Ada Initiative) where “consent be explicitly given to be photographed or recorded”.

Kudos to the local organizing committee for moving quickly (just over 3 hours from Galen’s initial email). They purchased coloured lanyards to indicate participants’ views on being photographed: red means don’t photograph me, yellow means ask me before photographing me, and green means go ahead and photograph me. This is an elegant and simple solution.

Over the past few years, streaming the conference presentations has become standard, as has publishing these videos to the web after the conference. This is awesome and important—not everyone can travel to attend the conference. This allows us to learn faster and build better things. I suggested that it was time to explicitly obtain speakers’ consent to stream their presentations and archive the videos online.

At first I was disheartened by some of the comments on the list:

  • “This needs to be opt out, not opt in.”
  • “An Opt-Out policy would be more workable for presenters.”
  • “requiring explicit permission from presenters is overly burdensome for the (streaming) crew that is struggling to get the recordings”
  • “i enjoy taking candid photos of people at the conference and no one seems to mind”
  • “my old Hippy soul cringes at unnecessary paperwork. A consent form means nothing. Situations change. Even a well-intended agreement sometimes needs to be reneged on.”

The lack of understanding about informed consent suggests a few things about the code4lib community:

  1. there’s a lack of connection to feminist organizing that has a long history of collective organizing and consent
  2.  the laissez-faire approach to consent (opt-out) centres male privilege
  3. this community still has work to do around rape culture. 

It was awesome to get support from the Programming Committee, the local organizers and some individuals. We managed to update the consent form we used for Access to be specific to code4lib and get it out to speakers in just over a week. Ranti quickly stepped up and volunteered to help me obtain consent forms from all of the speakers. As this is a single-stream conference, there were only 39 people, so it wasn’t that much work to do.

Here’s the consent form we used.  A few people couldn’t agree to the copyright bits, so they crossed that part out. I’m sure this form will evolve to become better.

code4lib 2015 in Portland was the first time we were explicit about consent. The colour-coded lanyards and speaker consent forms are an important part of building a culture of consent.

Thanks to my smart friend Eli Manning (not the football player) for giving me feedback on this.

Open Library Data Additions: OL.150101

Tue, 2015-02-10 21:48

OL Output of MARC records from Toronto Public Library.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

CrossRef: Introduction to CrossRef: Technical Basics webinar is tomorrow

Tue, 2015-02-10 20:19

New to CrossRef? Interested in learning more about the technical aspects of CrossRef? Please join us for one of our upcoming Introduction to CrossRef Technical Basics webinars. This webinar will provide a technical introduction to CrossRef and a brief outline of how CrossRef works. New members are especially encouraged to register. All of our webinars are free to attend. If the time of the webinar is not convenient for you, we will be recording the sessions and they will be made available after the webinars for those who register.


Introduction to CrossRef Technical Basics
Date: Wednesday, Feb 11, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Patricia Feeney

Introduction to CrossRef Technical Basics
Date: Wednesday, Mar 18, 2015
Time: 8:00 am (San Francisco), 11:00 am (New York), 4:00 pm (London)
Moderator: Patricia Feeney


Additional CrossRef webinars are also listed on our webinar page.

We look forward to having you join us!

CrossRef: If you are a CrossCheck Administrator - this webinar is for you.

Tue, 2015-02-10 19:02

CrossCheck: iThenticate Admin Webinar

Date: Thursday, Feb 19, 2015
Time: 7:00 am (San Francisco), 10:00 am (New York), 3:00 pm (London)
Register today!

CrossCheck, powered by iThenticate, now has over 600 members using the service to screen content for originality. Through demand from these members, CrossRef is trialling a webinar for CrossCheck administrators and more experienced users that will cover:

- The scale of the current CrossCheck database
- CrossCheck participation and usage
- An overview of the newest features in iThenticate
- A run-through of more administrator-specific features in iThenticate
- Advice on interpreting the reports and common issues
- Details on support resources available for publishers
- Q&A session

Representatives from CrossRef and iParadigms will run the webinar which will last one hour. We hope you can join us!

If you can't make this webinar visit our webinar page for additional webinars.

Eric Hellman: "Passwords are stored in plain text."

Tue, 2015-02-10 16:24
Many states have "open records" laws which mandate public disclosure of business proposals submitted to state agencies. When a state library or university requests proposals for library systems or databases, the vendor responses can be obtained and reviewed. When I was in the library software business, it was routine to use these laws to do "competitor intelligence". These disclosures can often reveal the inner workings of proprietary vendor software that implicate information privacy and security.

Consider, for example, this request for "eResources for Minitex". Minitex is a "publicly supported network of academic, public, state government, and special libraries working cooperatively to improve library service for their users in Minnesota, North Dakota and South Dakota", and it negotiates database licenses for libraries throughout the three states.

Question number 172 in this Request for Proposals (RFP) was: "Password storage. Indicate how passwords are stored (e.g., plain text, hash, salted hash, etc.)."

To provide context for this question, you need to know just a little bit of security and cryptography.

I'll admit to having written code 15 years ago that saved passwords as plain text. This is a dangerous thing to do, because if someone were to get unauthorized access to the computer where the passwords were stored, they would have a big list of passwords. Since people tend to use the same password on multiple systems, the breached password list could be used, not only to gain access to the service that leaked the password file, but also to other services, which might include banks, stores and other sites of potential interest to thieves.

As a result, web developers are now strongly admonished never to save passwords as plain text. Doing so in a new system should be considered negligent, and could easily result in liability for the developer if the system security is breached. Unfortunately, many businesses would rather risk paying lawyers a lot of money to defend themselves should something go wrong than bite the bullet and pay some engineers a little money now to patch up the older systems.

To prevent the disclosure of passwords, the current standard practice is to "salt and hash" them.

A cryptographic hash function mixes up a password so that the password cannot be reconstructed. So, for example, the hash of 'my_password' is 'a865a7e0ddbf35fa6f6a232e0893bea4'. When a user enters their password, the hash of the password is recalculated and compared to the saved hash to determine whether the password is correct.

As a result of this strategy, the password can't be recovered. But it can be reset, and the fact that no one can recover the password eliminates a whole bunch of "social engineering" attacks on the security of the service.

Given a LOT of computer power, there are brute force attacks on the hash, but the easiest attack is to compute the hashes for the most common passwords. In a large file of passwords, you should be able to find some accounts that are breachable, even with the hashing. And so a "salt" is added to the password before the hash is applied. In the example above, a hash would be computed for 'SOME_CLEVER_SALTmy_password'. Which, of course, is '52b71cb6d37342afa3dd5b4cc9ab4846'.

To attack the salted password file, you'd need to know that salt. And since every application uses a different salt, each file of salted passwords is completely different. A successful attack on one hashed password file won't compromise any of the others.
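The hashing and salting described in the last few paragraphs can be sketched with Python's standard hashlib. MD5 appears here only because it is the algorithm in the examples above; the second function shows the now-preferred variant of a random per-user salt combined with a deliberately slow hash (PBKDF2):

```python
import hashlib
import os

def hash_password(password, salt=""):
    """Hash a (salt + password) string, as in the examples above.
    MD5 is shown for illustration only; never use it in a new system."""
    return hashlib.md5((salt + password).encode("utf-8")).hexdigest()

def make_salted_record(password):
    """Safer sketch: a random per-user salt stored alongside a slow hash."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

def verify(password, salt, digest):
    """Recompute the slow hash and compare it to the stored digest."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000) == digest
```

Because each salt is different, the same password hashes to different values in different systems, so a cracked hash from one leaked file is useless against another.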

Another standard practice for user-facing password management is to never send passwords unencrypted. The best way to do this is to use HTTPS, since web browser software alerts the user that their information is secure. Otherwise, any server between the user and the destination server (there might be 20-40 of these for typical web traffic) could read and store the user's password.

The Minitex RFP covers reference databases, so only a small subset of services offered to libraries is covered here. The authentication for these sorts of systems typically doesn't depend on the user creating a password; user accounts are used to save the results of a search, or to provide customization features. A Minitex patron can use many of the offered databases without providing any sort of password.

So here are the verbatim responses received for the Minitex RFP:

LearningExpress, LLC
Response: "All passwords are stored using a salted hash. The salt is randomly generated and unique for each user."
My comment: This is a correct answer. However, the LearningExpress login sends passwords in the clear over HTTP.

OCLC
Response: "Passwords are md5 hashed."
My comment: MD5 is the hash algorithm I used in my examples above. It's not considered very secure (see comments). OCLC Firstsearch does not force HTTPS and can send login passwords in the clear.

Response: "N/A"
My comment: This just means that no passwords are used in the service.

Infogroup Library Division
Response: "Passwords are currently stored as plain text. This may change once we develop the customization for users within ReferenceUSA. Currently the only passwords we use are for libraries to access usage stats."
My comment: The user customization now available for ReferenceUSA appears at first glance to be done correctly.

EBSCO Information Services
Response: "EBSCOhost passwords in EBSCOadmin are stored in plain text."
My comment: Note that EBSCOadmin is not an end-user-facing system, so if the EBSCO systems were compromised, only library administrator credentials would be exposed.

Encyclopaedia Britannica, Inc.
Response: "Passwords are stored as plain text."
My comment: I wonder if EB has an article on network security?

ProQuest
Response: "We store all passwords as plain text."
My comment: The ProQuest service available through my library creates passwords over HTTP but uses some client-side encryption. I have not evaluated the security of this encryption.

Scholastic Library Publishing, Inc.
Response: "Passwords are not stored. FreedomFlix offers a digital locker feature and is the only digital product that requires a login and password. The user creates the login and password. Scholastic Library Publishing, Inc does not have access to this information.”
My comment: The "FreedomFlix" service not only sends user passwords unencrypted over HTTP, it sends them in a GET query string. This means that not only can anyone see the user passwords in transit, but log files will capture and save them for long-term perusal. Third-party sites will be sent the password in referrer headers. When used on a shared computer, subsequent users will easily see the passwords. "Scholastic Library Publishing" may not have access to user passwords, but everyone else will have them.
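To see concretely why a password in a GET query string is so damaging: the credentials become part of the URL itself, which is exactly what server logs, browser histories, and Referer headers retain. A short Python illustration (the field names are invented):

```python
from urllib.parse import urlencode, urlparse, parse_qs

def login_url_get(base, username, password):
    """What a GET login form produces: credentials embedded in the URL."""
    return base + "?" + urlencode({"user": username, "pass": password})

url = login_url_get("http://example.org/login", "alice", "s3cret")

# The password is now recoverable from the bare URL -- exactly what a
# log file, proxy, or shared computer's history would retain.
recovered = parse_qs(urlparse(url).query)["pass"][0]
```

With a POST over HTTPS, by contrast, the credentials travel in the encrypted request body and never appear in the URL at all.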

Cengage Learning
Response: "Passwords are stored in plain text."
My comment: Like FreedomFlix, the Gale Infotrac service from Cengage sends user passwords in the clear in a GET query string. But it asks the user to enter their library barcode in the password field, so users probably wouldn't be exposing their personal passwords.

So, to sum up, adoption of up-to-date security practices is far from complete in the world of library databases. I hope that the laggards have improved since the submission date of this RFP (roughly a year ago) or at least have plans in place to get with the program. I would welcome comments to this post that provide updates. Libraries themselves deserve a lot of the blame, because for the most part the vendors that serve them respond to their requirements and priorities.

I think libraries issuing RFPs for new systems and databases should include specific questions about security and privacy practices, and make sure that contracts properly assign liability for data breaches with the answers to these questions in mind.

Note: This post is based on information shared by concerned librarians on the LITA Patron Privacy Technologies Interest Group list. Join if you care about this.

DPLA: DPLA and the Digital Library Federation team up to offer special DPLAfest travel grants

Tue, 2015-02-10 16:00

The Digital Public Library of America (DPLA) is excited to work with the Digital Library Federation (DLF) program of the Council on Library and Information Resources to offer DPLA + DLF Cross-Pollinator Travel Grants. The purpose of these grants is to extend the opportunity to attend DPLAfest 2015 to four DLF community members. Successful applicants should be able to envision and articulate a connection between the DLF community and the work of DPLA.

It is our belief that the key to the sustainability of large-scale national efforts is robust community support. Connecting DPLA’s work to the energetic and talented DLF community is a positive way to increase serendipitous collaboration around this shared digital platform.

The goal of the DPLA + DLF Travel Grants is to bring cross-pollinators—active DLF community contributors who can provide unique perspectives to our work and share the vision of DPLA from their perspective—to the conference. By teaming up with the DLF to provide travel grants, it is our hope to engage DLF community members and connect them to exciting areas of growth and opportunity at DPLA.

The travel grants include DPLAfest 2015 conference registration, travel costs, meals, and lodging in Indianapolis.

The DPLA + DLF Cross-Pollinator Travel Grants are the first of a series of collaborations between CLIR/DLF and DPLA.


Four awards of up to $1,250 each to go towards the travel, board, and lodging expenses of attending DPLAfest 2015. Additionally, the grantees will each receive a complimentary full registration to the event ($75). Recipients will be required to write a blog post about their experience subsequent to DPLAfest; this blog post will be co-published by DLF and DPLA.


Applicants must be a staff member of a current DLF member organization and not currently working on a DPLA hub team.


Send an email by March 5th, 5 pm EDT, containing the following items (in one document) to, with the subject “DPLAfest Travel Grant: [Full Name]”:

  • Cover letter of nomination from the candidate’s supervisor/manager or an institutional executive, including acknowledgement that candidate would not have been funded to attend DPLAfest.
  • Personal statement from the candidate (ca. 500 words) explaining their educational background, what their digital library/collections involvement is, why they are excited about digital library/collections work, and how they see themselves benefiting from and participating in DPLAfest.
  • A current résumé.

Applications may be addressed to the DPLA + DLF Cross-Pollinator Committee.


Candidates will be selected by DPLA and DLF staff. In assessing the applications, we will look for a demonstrated commitment to digital work, and will consider the degree to which participation might enhance communication and collaboration between the DLF and DPLA communities. Applicants will be notified of their status no later than March 16, 2015.

These fellowships are generously supported by the Council on Library and Information Resources’ Digital Library Federation program.

David Rosenthal: The Evanescent Web

Tue, 2015-02-10 16:00
Papers drawing attention to the decay of links in academic papers have quite a history; I blogged about three relatively early ones six years ago. Now Martin Klein and a team from the Hiberlink project have taken the genre to a whole new level with a paper in PLoS One entitled Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. Their dataset is 2-3 orders of magnitude bigger than previous studies, their methods are far more sophisticated, and they study both link rot (links that no longer resolve) and content drift (links that now point to different content). There's a summary on the LSE's blog.
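The two failure modes can be modelled simply: link rot is a cited URL that no longer resolves, while content drift is a URL that still resolves but returns something other than what was cited. The toy classifier below is my sketch, not Hiberlink's methodology (which uses far more sophisticated similarity measures than an exact hash match):

```python
import hashlib

def classify_reference(status_code, cited_digest, current_body):
    """Classify one cited link given a fresh fetch of its URL.

    status_code: HTTP status from re-fetching the cited URL.
    cited_digest: sha256 hex digest of the content at citation time.
    current_body: bytes retrieved now (ignored unless the fetch succeeded).
    """
    if status_code >= 400:
        return "link rot"
    current_digest = hashlib.sha256(current_body).hexdigest()
    if current_digest != cited_digest:
        return "content drift"
    return "intact"
```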

Below the fold, some thoughts on the Klein et al paper.

As regards link rot, they write:
In order to combat link rot, the Digital Object Identifier (DOI) was introduced to persistently identify journal articles. In addition, the DOI resolver for the URI version of DOIs was introduced to ensure that web links pointing at these articles remain actionable, even when the articles change web location. But even when used correctly, DOIs introduce a single point of failure. This became obvious on January 20th when the doi.org domain name briefly expired. DOI links all over the Web failed, illustrating yet another fragility of the Web. It hasn't been a good time for access to academic journals for other reasons either. Among the publishers unable to deliver content to their customers in the last week or so were Elsevier, Springer, Nature, HighWire Press and Oxford Art Online.

I've long been a fan of Herbert van de Sompel's work, especially Memento. He's a co-author on the paper and we have been discussing it. Unusually, we've been disagreeing. We completely agree on the underlying problem of the fragility of academic communication in the Web era as opposed to its robustness in the paper era. Indeed, in the introduction of another (but much less visible) recent paper entitled Towards Robust Hyperlinks for Web-Based Scholarly Communication Herbert and his co-authors echo the comparison between the paper and Web worlds from the very first paper we published on the LOCKSS system a decade and a half ago. Nor am I critical of the research underlying the paper, which is clearly of high quality and which reveals interesting and disturbing properties of Web-based academic communication. All I'm disagreeing with Herbert about is the way the research is presented in the paper.

My problem with the presentation is that this paper, which has a far higher profile than other recent publications in this area, and which comes at a time of unexpectedly high visibility for web archiving, seems to me to be excessively optimistic, and to fail to analyze the roots of the problem it is addressing. It thus fails to communicate the scale of the problem.

The paper is, for very practical reasons of publication in a peer-reviewed journal, focused on links from academic papers to the web-at-large. But I see it as far too optimistic in its discussion of the likely survival of the papers themselves, and the other papers they link to (see Content Drift below). I also see it as far too optimistic in its discussion of proposals to fix the problem of web-at-large references that it describes (see Dependence on Authors below).

All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself (see Getting Links Right below) that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.

In addition, the paper fails to even mention the elephant in the room, the fact that both the papers and the web-at-large content are copyright. The archives upon which the proposed web-at-large solutions rest, such as the Internet Archive, are themselves fragile. Not just for the normal economic and technical reasons we outlined nearly a decade ago, but because they operate under the DMCA's "safe harbor" provision and thus must take down content upon request from a claimed copyright holder. The archives such as Portico and LOCKSS that preserve the articles themselves operate instead with permission from the publisher, and thus must impose access restrictions.

This is the root of the problem. In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.

None of this is to suggest that developing and deploying partial solutions is a waste of time. It is what I've been doing the last quarter of my life. There cannot be a single comprehensive technical solution. The best we can do is to combine a diversity of partial solutions. But we need to be clear that even if we combine everything anyone has worked on we are still a long way from solving the problem. Now for some details.

Content Drift

As regards content drift, they write:

Content drift is hardly a matter of concern for references to journal articles, because of the inherent fixity that, especially PDF-formated, articles exhibit. Nevertheless, special-purpose solutions for long-term digital archiving of the digital journal literature, such as LOCKSS, CLOCKSS, and Portico, have emerged to ensure that articles and the articles they reference can be revisited even if the portals that host them vanish from the web. More recently, the Keepers Registry has been introduced to keep track of the extent to which the digital journal literature is archived by what memory organizations. These combined efforts ensure that it is possible to revisit the scholarly context that consists of articles referenced by a certain article long after its publication.

While I understand their need to limit the scope of their research to web-at-large resources, the last sentence is far too optimistic.

First, research using the Keepers Registry and other resources shows that at most 50% of all articles are preserved. So future scholars depending on archives of digital journals will encounter large numbers of broken links.

Second, even the 50% of articles that are preserved may not be accessible to a future scholar. CLOCKSS is a dark archive and is not intended to provide access to future scholars unless the content is triggered. Portico is a subscription archive, future scholars' institutions may not have a subscription. LOCKSS provides access only to readers at institutions running a LOCKSS box. These restrictions are a response to the copyright on the content and are not susceptible to technical fixes.

Third, the assumption that journal articles exhibit "inherent fixity" is, alas, outdated. Both the HTML and PDF versions of articles from state-of-the-art publishing platforms contain dynamically generated elements, even when they are not entirely generated on-the-fly. The LOCKSS system encounters this on a daily basis. As each LOCKSS box collects content from the publisher independently, each box gets content that differs in unimportant respects. For example, the HTML content is probably personalized ("Welcome Stanford University") and updated ("Links to this article"). PDF content is probably watermarked ("Downloaded by"). Content elements such as these need to be filtered out of the comparisons between the "same" content at different LOCKSS boxes. One might assume that the words, figures, etc. that form the real content of articles do not drift, but in practice it would be very difficult to validate this assumption.

Soft-404 Responses

I've written before about the problems caused for archiving by "soft-403 and soft-404" responses by Web servers. These result from Web site designers who believe their only audience is humans, so instead of providing the correct response code when they refuse to supply content, they return a pretty page with a 200 response code indicating valid content. The valid content is a refusal to supply the requested content. Interestingly, PubMed is an example, as I discovered when clicking on the (broken) PubMed link in the paper's reference 58.

Klein et al define a live web page thus:
On the one hand, the HTTP transaction chain could end successfully with a 2XX-level HTTP response code. In this case we declared the URI to be active on the live web.

Their estimate of the proportion of links which are still live is thus likely to be optimistic, as they are likely to have encountered at least soft-404s if not soft-403s.
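A common heuristic for catching soft-404s is to fetch a deliberately nonsensical URL on the same site and compare its body to the page under test; if the two are near-identical, the "200" is really an error page. A sketch of the comparison step (the 0.9 threshold is an arbitrary assumption, and production crawlers combine this with other signals):

```python
from difflib import SequenceMatcher

def looks_like_soft_404(page_body: str, bogus_body: str,
                        threshold: float = 0.9) -> bool:
    """Heuristic: a page whose body is nearly identical to the body
    returned for a known-bogus URL on the same host is probably an
    error page served with a misleading 200 status code."""
    similarity = SequenceMatcher(None, page_body, bogus_body).ratio()
    return similarity >= threshold
```

In practice the bogus body comes from requesting a random path that cannot exist on the same host, so that the site's generic "not found" page is captured for comparison.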

Getting Links Right

Even when the resolver is working, its effectiveness in persisting links depends on its actually being used. Klein et al discover that in many cases it isn't:
one would assume that URI references to journal articles can readily be recognized by detecting HTTP URIs that carry a DOI. However, it turns out that references rather frequently have a direct link to an article in a publisher's portal instead of the DOI link.

The direct link may well survive relocation of the content within the publisher's site. But journals are frequently bought and sold between publishers, causing the link to break. I believe there are two causes for these direct links: publishers' platforms inserting them so as not to risk losing the reader, but more importantly the difficulty for authors of creating correct links. Cutting and pasting from the URL bar in their browser necessarily gets the direct link; creating the correct one requires the author to know that it should be hand-edited, and to remember to do it.
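A reference-checking tool can recover the persistent link only if the DOI appears somewhere in the reference text. A sketch of that normalization, using a common heuristic pattern for DOIs (the regex is a simplification, not a full implementation of the DOI syntax, and the example DOI uses the reserved 10.1000 demonstration prefix):

```python
import re
from typing import Optional

# Matches most DOIs in running text; a heuristic, not the full DOI grammar.
DOI_PATTERN = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')

def to_doi_link(reference_text: str) -> Optional[str]:
    """If the reference contains a DOI, return the resolver URL for it;
    otherwise return None (a bare publisher link cannot be repaired)."""
    match = DOI_PATTERN.search(reference_text)
    if match is None:
        return None
    # Strip trailing punctuation that the citation dragged along.
    return "https://doi.org/" + match.group(1).rstrip('.,;')
```

References that carry only a direct publisher URL fall through to None, which is exactly the failure mode Klein et al observe: once the journal changes hands, there is nothing left to repair the link with.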

Attempts to ensure linked materials are preserved suffer from a similar problem:
The solutions component of Hiberlink also explores how to best reference archived snapshots. The common and obvious approach, followed by Webcitation, is to replace the original URI of the referenced resource with the URI of the Memento deposited in a web archive. This approach has several drawbacks. First, through removal of the original URI, it becomes impossible to revisit the originally referenced resource, for example, to determine what its content has become some time after referencing. Doing so can be rather relevant, for example, for software or dynamic scientific wiki pages. Second, the original URI is the key used to find Mementos of the resource in all web archives, using both their search interface and the Memento protocol. Removing the original URI is akin to throwing away that key: it makes it impossible to find Mementos in web archives other than the one in which the specific Memento was deposited. This means that the success of the approach is fully dependent on the long term existence of that one archive. If it permanently ceases to exist, for example, as a result of legal or financial pressure, or if it becomes temporarily inoperative as a result of technical failure, the link to the Memento becomes rotten. Even worse, because the original URI was removed from the equation, it is impossible to use other web archives as a fallback mechanism. As such, in the approach that is currently common, one link rot problem is replaced by another.

The paper, and a companion paper, describe Hiberlink's solution, which is to decorate the link to the original resource with an additional link to its archived Memento. Rene Voorburg of the KB has extended this by implementing robustify.js:
robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. The script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (as part of the Hiberlink project) in how it tries to discover an archived version of the page. As a default, it will use the Memento Time Travel service as a fallback. You can easily implement robustify.js on your web pages so that it redirects pages to your preferred web archive.

Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies.
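The fallback lookup that robustify.js performs is a standard Memento protocol request: ask a TimeGate for the archived copy of the original URI closest to a desired datetime. A sketch of building such a request (the Time Travel TimeGate endpoint shown is my assumption about the service's entry point, so treat the URL as illustrative):

```python
from datetime import datetime, timezone
from email.utils import format_datetime  # RFC 1123 dates, as Memento requires

# Assumed entry point of the Memento Time Travel aggregator's TimeGate.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def memento_request(original_uri: str, when: datetime):
    """Build the URL and Accept-Datetime header for a Memento lookup:
    'give me the archived copy of original_uri closest to `when`'.
    The key point: the ORIGINAL URI is the lookup key, which is why
    replacing it with an archive-specific URI throws that key away."""
    url = TIMEGATE + original_uri
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return url, headers
```

Because the lookup is keyed on the original URI, a decorated link can fall back to any Memento-compliant archive, whereas a reference rewritten to one archive's snapshot URI is tied to that archive's survival.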

Dependence on Authors

Many of the solutions that have been proposed to the problem of reference rot also suffer from dependence on authors:
Webcitation was a pioneer in this problem domain when, years ago, it introduced the service that allows authors to archive, on demand, web resources they intend to reference. ... But Webcitation has not been met with great success, possibly the result of a lack of authors' awareness regarding reference rot, possibly because the approach requires an explicit action by authors, likely because of both.

Webcitation is not the only one:
To a certain extent, portals like FigShare and Zenodo play in this problem domain as they allow authors to upload materials that might otherwise be posted to the web at large. The recent capability offered by these systems that allows creating a snapshot of a GitHub repository, deposit it, and receive a DOI in return, serves as a good example. The main drivers for authors to do so is to contribute to open science and to receive a citable DOI, and, hence potentially credit for the contribution. But the net effect, from the perspective of the reference rot problem domain, is the creation of a snapshot of an otherwise evolving resource. Still, these services target materials created by authors, not, like web archives do, resources on the web irrespective of their authorship. Also, an open question remains to which extent such portals truly fulfill a long term archival function rather than being discovery and access environments.

Hiberlink is trying to reduce this dependence:
In the solutions thread of Hiberlink, we explore pro-active archiving approaches intended to seamlessly integrate into the life cycle of an article and to require less explicit intervention by authors. One example is an experimental Zotero extension that archives web resources as an author bookmarks them during note taking. Another is HiberActive, a service that can be integrated into the workflow of a repository or a manuscript submission system and that issues requests to web archives to archive all web at large resources referenced in submitted articles.But note that these services (and Voorburg's) depend on the author or the publisher installing them. Experience shows that authors are focused on getting their current paper accepted, large publishers are reluctant to implement extensions to their publishing platforms that offer no immediate benefit, and small publishers lack the expertise to do so.

Ideally, these services would be back-stopped by a service that scanned recently-published articles for web-at-large links and submitted them for archiving, thus requiring no action by author or publisher. The problem is that doing so requires the service to have access to the content as it is published. The existing journal archiving services (LOCKSS, CLOCKSS and Portico) have such access to about half the published articles, and could in principle be extended to perform this service. In practice doing so would need at least modest funding. The problem isn't as simple as it appears at first glance, even for the articles that are archived. For those that aren't, primarily from less IT-savvy authors and small publishers, the outlook is bleak.
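Such a back-stop service would, for each newly collected article, pull out the web-at-large references (skipping DOI links, which the journal archives already cover) and hand them to a web archive's on-demand interface. A sketch of the link-extraction step using only the standard library (the filtering rule is a deliberate simplification):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from an article's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def web_at_large_links(article_html: str) -> list:
    """Return links that point outside the journal literature,
    i.e. every HTTP link that is not a DOI resolver link."""
    parser = LinkExtractor()
    parser.feed(article_html)
    return [u for u in parser.links
            if "doi.org/" not in u and u.startswith("http")]
```

Each returned URL would then be submitted for capture; the Internet Archive's on-demand "Save Page Now" interface is one such target, though as the Archiving section below notes, submission alone does not guarantee a usable capture.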

Archiving

Finally, the solutions assume that submitting a URL to an archive is enough to ensure preservation. It isn't. The referenced web site might have a robots.txt policy preventing collection. The site might have crawler traps, exceed the archive's crawl depth, or use Javascript in ways that prevent the archive collecting a usable representation. Or the archive may simply not process the request in time to avoid content drift or link rot.
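The robots.txt obstacle, at least, is cheap for a submission service to detect before wasting a capture request; a sketch using the standard library parser (which crawlers honor robots.txt, and under what user-agent string, varies by archive, so the names here are illustrative):

```python
from urllib.robotparser import RobotFileParser

def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a site's robots.txt policy for the archive's crawler
    before submitting the URL for capture. A disallowed URL will
    never yield a Memento, however many times it is submitted."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The other failure modes, crawler traps, crawl depth, and Javascript-dependent pages, are only discovered after the fact, by checking whether the archive actually produced a usable capture.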

Acknowledgement

I have to thank Herbert van de Sompel for greatly improving this post through constructive criticism. But it remains my opinion alone.

DuraSpace News: UPDATE: Open Repositories 2015 DEVELOPER TRACK

Tue, 2015-02-10 00:00

Adam Field and Claire Knowles, OR2015 Developer Track Co-Chairs; Cool Tools, Daring Demos and Fab Features

Indianapolis, IN  The OR2015 developer track presents an opportunity to share the latest developments across the technical community. We will be running informal sessions of presentations and demonstrations showcasing community expertise and progress:

SearchHub: Mark Your Calendar: Lucene/Solr Revolution 2015

Mon, 2015-02-09 21:42
If you attended Lucene/Solr Revolution 2014 and took the post-event survey, you’ll know that we polled attendees on where they would like to see the next Revolution take place. Austin was our winner, and we couldn’t be more excited! Mark your calendar for October 13-16 for Lucene/Solr Revolution 2015 at the Hilton Austin for four days packed with hands-on developer training and multiple educational tracks led by industry experts focusing on Lucene/Solr in the enterprise, case studies, large-scale search, and data integration.

We had a blast in DC for last year’s conference, and this year we’re adding even more opportunities to network and interact with other Solr enthusiasts, experts, and committers. Registration and Call for Papers will open this spring. To stay up-to-date on all things Lucene/Solr Revolution, visit and follow @LuceneSolrRev on Twitter.

Revolution 2014 Resources

Videos, presentations, and photos from last year’s conference are now available. Check them out at the links below.

  • View presentation recordings from Lucene/Solr Revolution 2014
  • Download slides from Lucene/Solr Revolution 2014 presentations
  • View photos from Lucene/Solr Revolution 2014

The post Mark Your Calendar: Lucene/Solr Revolution 2015 appeared first on Lucidworks.

LibraryThing (Thingology): New “More Like This” for LibraryThing for Libraries

Mon, 2015-02-09 18:04

We’ve just released “More Like This,” a major upgrade to LibraryThing for Libraries’ “Similar items” recommendations. The upgrade is free and automatic for all current subscribers to LibraryThing for Libraries Catalog Enhancement Package. It adds several new categories of recommendations, as well as new features.

We’ve got text about it below, but here’s a short (1:28) video:

What’s New

Similar items now has a See more link, which opens More Like This. Browse through different types of recommendations, including:

  • Similar items
  • More by author
  • Similar authors
  • By readers
  • Same series
  • By tags
  • By genre

You can also choose to show one or several of the new categories directly on the catalog page.

Click a book in the lightbox to learn more about it—a summary when available, and a link to go directly to that item in the catalog.

Rate the usefulness of each recommended item right in your catalog—hovering over a cover gives you buttons that let you mark whether it’s a good or bad recommendation.

Try it Out!

Click “See more” to open the More Like This browser in one of these libraries:

Find out more

Find more details for current customers on what’s changing and what customizations are available on our help pages.

For more information on LibraryThing for Libraries or if you’re interested in a free trial, email, visit, or register for a webinar.