news aggregator

Hellman, Eric: Library Group to Acquire Readmill Assets

planet code4lib - Tue, 2014-04-01 01:26
Techcrunch is reporting that a group of librarians known as the "ALA ThinkTank" has acquired the assets of shuttered startup Readmill. The new owners will turn the website and apps into a "library books in the cloud" site for library patrons.

Over the weekend, Readmill announced that it had been "acqui-hired" by cloud storage startup DropBox, and that its app and website would cease functioning on July 1. "Many challenges in the world of ebooks remain unsolved, and we failed to create a sustainable platform for reading" said Readmill founder Henrik Berggren, in his farewell message to site members. "Failure to have a sustainable platform for reading really resonates with librarians" responded ThinkTank co-founder J.P. Porcaro. "It's a match made in heaven - devoted users, quixotic economics, and lots of books to distract the staff." Porcaro will serve as CEO of the new incarnation of Readmill.

New Readmill CEO J. P. Porcaro

The acquisition also solves a problem Porcaro had been wrestling with: how to spend the group's Bitcoin millions. Far from its present incarnation as a Social Enterprise/Facebook Group hybrid, ALA ThinkTank originated as a solution for housing destitute librarians from New Jersey during the twice-yearly conventions of the American Library Association. The group figured that by renting a house instead of renting hotel rooms, they could save money, learn from peers and throw great parties. The accompanying off-the-grid commerce in "assets" was never intended; it just sort of happened.

One of the librarians was friends with Penn State grad student Ross Ulbricht, who convinced the group to use Bitcoin for the purchase and sale of beer, pizza and "ebooks". "He kept talking about  piracy and medieval trade routes" reported Porcaro, "We thought he was normal ... though in retrospect it was kinda weird when he asked about using hitmen to collect overdue book fines."

The 10,000 fold increase in the value of ThinkTank's Bitcoin account over the past four years caught almost everyone completely off guard. The parties, which in past years were low rent, jeans-and-cardigan affairs, have morphed into multi-story "party hearty" extravaganzas packed with hipster librarians body-pierced with bitcoin encrusted baubles and wearing precious-metal badge ribbons.

Porcaro expects that Readmill's usage will skyrocket with the new management. He thinks that ALA ThinkTank's heady mix of critical pedagogy, "weeding" advice, gaming makerspaces, drink-porn, management theory, gender angst and a whiff of scandal are sure to "make it happen" for the moribund social reading site, which has suffered from the general boringness of books.

ThinkTank members are already hard at work planning the transition. A 13-step procedure that will allow Readmill users to keep their books exactly as they are has been spec-ed out by one library vendor. "If you like your ebooks you can keep them" Porcaro assured me. "If you don't like them, we can send them to India for you. Or Lafourche, Louisiana, your choice."

The backlash against the new Readmill has already begun. "Library books in the cloud is the dumbest thing I've ever heard of. How will people know which bits are theirs, and which need to be returned? How will we do inter-library loan? What will happen if it rains?" complained one senior library director who declined to be identified. "How will we get our books returned then?" she asked. "I don't even know HOW to hire a hitman."

In a press release, Scott Turow, past president of the Authors Guild, expressed his horror at the idea of "library books in the cloud." "Once again, librarians are scheming to take food out of the mouths of authors emaciated by hunger. These poor authors are dying miserable deaths, knowing that their copyrighted works are being misused and unread in this way. Library books in a cloud of nerve gas, more like!"

The American Library Association, which is completely unaffiliated with ALA ThinkTank, has formed a committee to study the cloud library ebook phenomenon.

ALA Equitable Access to Electronic Content: A booklover in Paris

planet code4lib - Mon, 2014-03-31 20:34

Last week, Alan Inouye, director of the American Library Association’s Office for Information Technology Policy, traveled to France to discuss the U.S. library ebook lending ecosystem at the Salon du Livre (in English, the “Paris Book Fair”). There, he served on a lively digital book panel along with Johanna Brinton, a business development executive from OverDrive and Maja Thomas, who is the former senior vice president at Hachette Book Group.

Livre Presentation Area

Inouye shared his experiences at the book fair in detail on the American Libraries magazine E-Content blog. Here’s a snippet from his article:

The ebook market in France is much smaller than in the United States, by roughly an order of magnitude. This contrast was clear as I walked down every aisle of the fair—I encountered only a handful of ebook vendors and saw little presence from technology companies in general. This was very different from my experience at the 2013 Book Expo America in New York City.

Given the state of the French ebook market overall, it is not surprising to learn that the French library ebook evolution is in its infancy. However, I did see keen interest from the full amphitheater, with librarians comprising about 20% of the attendees (I asked, and said it was “magnifique” in response to the good showing).

My remarks centered around ALA’s activities and experiences during the past few years. Our fundamental and current strategy is direct engagement with publishers and other players in the reading ecosystem. I recounted the dark days of 2011 and 2012, the improvements in 2013, and the lessons learned along the way. The questions were wide-ranging, from detailed queries about ebook distributor platforms (glad that a representative from OverDrive was present) to addressing the library ebook lending issue via legislative means.

The Ministry is taking a great interest in the library ebook issue. In addition to featuring the issue at this fair, the Ministry established last fall a working group of key stakeholders and is hopeful for expeditious work and progress.

View slides from Inouye’s presentation:

The post A booklover in Paris appeared first on District Dispatch.

James Cook University, Library Tech: Windows 8, EZproxy and Secret Sauce AdWare

planet code4lib - Mon, 2014-03-31 20:06
As of writing we've had 3 reported cases of Windows 8 not displaying content through EZproxy. Basically any proxied content doesn't display - no errors, no friendly error messages, just a blank screen. Everything else 'webby' works fine. With 360 Link 1-Click with the helper frame you see the helper frame (it's not proxied by EZproxy) but the iframe content is blank - viewing source shows that

ALA Equitable Access to Electronic Content: Are young Americans losing their religion with the Copyright Act?

planet code4lib - Mon, 2014-03-31 17:41

Image via David Blackwell.

It’s the early 1990s. You’re driving home from school with your friend when “Losing My Religion” by REM comes on the radio. “I love this song!” she yells, knocking your awkwardly oblong “car phone” out of the cigarette lighter as she lunges for the volume dial. Deciding to do your friend a solid, you plug a blank tape into your cassette recorder upon returning home and proceed to make her an REM mixtape with “Losing My Religion” as track number one. You take it to her the following night with a six-pack of Zima Gold and a large pizza. You’ve just completed an information transfer via the “sneakernet”: a colloquial term describing the process of sharing electronic information by physically transporting it from one location to another. Record labels, production companies and other copyright-sensitive rights holders have seldom taken issue with this sort of sharing.

Fast forward about twenty years. Your daughter starts a business that allows individuals to transfer their MP3 files to a central server and share in the profit when their files are purchased from the server by someone else. Despite being careful to stipulate that her customers must delete their files upon transferring them to her company’s server, your daughter only makes sales to a few customers —including, for symmetry’s sake, your REM-fan-friend’s daughter—before she is sued for copyright infringement. Why did your daughter’s sharing activity result in legal action while yours did not? Both you and your daughter transferred information on a small scale. Both you and your daughter seemed to engage in distribution activities that fell within the bounds of our Copyright Act’s “first sale doctrine” (the principle that allows an individual who has lawfully received a copyrighted work to by and large dispose of that work as he or she sees fit).

The answer: the growth of new technologies has made finding copyright infringement immeasurably easier. During the “sneakernet” era, there was virtually no way for a third party to know about a small-scale transfer of electronic information. Today, transfers of MP3s and other digital files are easily traceable. How do the younger generations that are in the vanguard of the culture that has taken shape around digitization feel about this development? At an American Bar Association program last Wednesday, entitled, “The Politics of Copyright,” ALA Office for Information Technology Policy (OITP) Consultant Jonathan Band suggested that our Copyright Act may be experiencing a “crisis of credibility” among young Americans. The men and women of this country who are in their twenties and younger grew up with digital technology. To many of these individuals, the litigious environment surrounding file-sharing is vexing, as are recent court rulings against the application of the “first-sale doctrine” to transactions taking place in the digital realm.

Here at the American Library Association, we feel it is very important to promote public understanding of the Copyright Act. However, many young Americans grow increasingly disillusioned as they familiarize themselves with copyright law and jurisprudence. In light of this trend, we must consider reforming our copyright law so that it better addresses the unique issues surrounding the movement of digital content. This means reexamining policy changes that the American Library Association has long supported. Let’s talk about shortening copyright terms, introducing new formalities regimes and reducing statutory damages. Admittedly, achieving any one of these goals is a formidable task, given current domestic political realities and international legal frameworks. However, if our discussions yielded even marginal progress, we would improve attitudes toward copyright not just among young people, but among all people who believe in the importance of promoting public access to information.

The post Are young Americans losing their religion with the Copyright Act? appeared first on District Dispatch.

Rochkind, Jonathan: Thank you again, Edward Snowden

planet code4lib - Mon, 2014-03-31 16:24

According to this Reuters article, the NSA intentionally weakened encryption in popular encryption software from the company RSA.

They did this because they wanted to make sure they could continue eavesdropping on us all, but in the process they made us more vulnerable to eavesdropping from other attackers too. Once you put in a backdoor, anyone else who figures it out can access it too; it wasn't some kind of NSA-only backdoor. I bet, for instance, China's hackers and mathematicians are as clever as ours.

“We could have been more sceptical of NSA’s intentions,” RSA Chief Technologist Sam Curry told Reuters. “We trusted them because they are charged with security for the U.S. government and U.S. critical infrastructure.”

I’m not sure if I believe him — the $10 million NSA paid RSA for inserting the mathematical backdoors probably did a lot to assuage their skepticism too. What did they think NSA was paying for?

On the other hand, sure, the NSA is charged with improving our security, and does have expertise in that.  It was fairly reasonable to think that’s what they were doing. Suggesting they were intentionally putting some backdoors in instead would have probably got you called paranoid… pre-Snowden.  Not anymore.

It is thanks only to Edward Snowden that nobody will be making that mistake again for a long time. Edward Snowden, thank you for your service.


Filed under: General

Tennant, Roy: It Didn’t Start With You

planet code4lib - Mon, 2014-03-31 14:30

Recently a couple of things happened that made me despair of people ever building on prior work instead of repeating it.

The first incident was at a large library conference at the beginning of the year, with a panel about aggregating metadata from multiple contributors. The room overflowed with attendees, as the topic was much more popular than the conference had allowed for. I first sat on the floor, then stood, but I was pained not by my lack of a chair, but by the proceedings.

We listened to individuals who seemed to imagine that this was the first time anyone had attempted to aggregate metadata from diverse contributors. I was in awe of Diane Hillmann, who was on the panel. She sat there mute and expressionless as the other panelists completely failed to acknowledge the work of the National Science Digital Library that had preceded their attempt by some years and that Diane had been deeply involved with. How she remained silent I will never know, and I am embarrassed that I failed to speak up on behalf of her and her colleagues whose prior work in the field was unacknowledged.

But it wasn’t just the NSDL work that was being ignored — no one was cited as having contributed anything to their thinking about the issues or their strategies for dealing with them. Europeana is an obvious example. UKOLN perhaps. JISC is another. As is CIC. And CDL. You can’t take a step in a Google search of “aggregating metadata”  without falling over prior art.

Then a few months later I was at another event where something similar happened. It was an informal discussion about data aggregation issues. In this case someone was describing writing code from scratch to clean up dates, which are well known to be a big problem in metadata aggregations — particularly ones that include MARC data. It turns out they had never heard about the utility that CDL developed many years ago to do the same basic thing. Because they hadn't even looked. A simple Google search on "date normalization" turns it up as the first hit.

When faced with a challenge, the very first step should be to investigate what has come before. The very first. Only by doing so can you avoid relearning lessons that were learned years ago, and that could help you avoid the bumps and bruises that learning as you go will entail. This will not always be the case — you may need to do things completely differently and create things from scratch. But if you do, you will know with a certainty that you must.

Believe it or not, it is possible to learn things from the previous experiences of others. Sometimes, just sometimes, it didn’t start with you.

Open Knowledge Foundation: Tackling the Resource Curse: Civil Society’s Fight for Better Access to Information and Open Contracting in Côte d’Ivoire

planet code4lib - Mon, 2014-03-31 13:23

This is a guest blog from our campaign partner Integrity Action, adapted from the original post on their website here. This is the first in a series of blog posts from partner organisations of our #SecretContracts campaign. If you have stories to share about the problems of secrecy in contracting, get in touch with contact@stopsecretcontracts.org

Tackling the Resource Curse: Civil Society’s Fight for Better Access to Information and Open Contracting in Côte d’Ivoire

Today, many natural-resource-rich countries are plagued by rampant corruption, repression and poverty. We seem to have become accustomed to reading about tiny oil-rich countries such as Equatorial Guinea – surely one of the world's best examples of the resource curse: a country where large oil reserves fund the lavish lifestyles of the elite while the basic human and economic rights of the majority of the population go unmet.

Yet the picture is not all ‘doom and gloom’. Shifting the focus to countries such as Côte d’Ivoire allows for a different picture to emerge. Here, civil society organisations (CSOs) such as Social Justice have been working hard to ensure that extractive sector revenues benefit all members of society.

So how did local CSOs in Côte d’Ivoire bring about this incremental step change? In Jacqueville and d’Angovia, they began by working with people of influence to lobby local government and corporates to ensure improved access to contracts. Upon receipt of the contractual information, they organised information meetings with the communities and helped citizens develop strategies to better negotiate their entitlements, thereby ensuring that health care centres, maternity hospitals, schools and water towers were built.

Ensuring that communities have access to contractual information has been far from easy to achieve. Social Justice and other CSOs in Côte d'Ivoire have encountered frequent resistance from corporates as well as government officials, citing a lack of laws and regulations as reasons why open contracting has not become mainstream practice within the sector. Social Justice and other CSOs have tirelessly communicated the tangible benefits for local communities if open contracting were institutionalised and properly regulated. Moreover, they encouraged the formation of a resource centre tasked with working on access to information and freedom of information issues.

A significant step toward open contracting in Côte d'Ivoire is the recent adoption of an access-to-information law. Social Justice and other CSOs now rely heavily on the new law, as well as the Extractive Industries Transparency Initiative Standard adopted in May 2013, in their continuing demand for open contracting.

There is no doubt that it would be all too easy to remain skeptical, yet Social Justice's work in Côte d'Ivoire shows that access to contractual information enables communities impacted by extractive-sector activities to ensure that key stakeholders within the sector live up to their social and economic responsibilities.

The Stop Secret Contracts campaign is designed to push the issue of open contracting up the international policy agenda. Join the campaign by signing the petition and spreading the word at StopSecretContracts.org

Find out more about Social Justice here: https://www.facebook.com/socialjusticecotedivoire2

Photo: Monitors from Social Justice at the Logement de Maitre in Adjue, Jacqueville, funded by gas company Foxtrot

Rosenthal, David: The Half-Empty Archive

planet code4lib - Mon, 2014-03-31 09:00
Cliff Lynch invited me to give one of UC Berkeley iSchool's "Information Access Seminars" entitled The Half-Empty Archive. It was based on my brief introductory talk at ANADP II last November, an expanded version given as a staff talk at the British Library last January, and the discussions following both. An edited text with links to the sources is below the fold.

I'm David Rosenthal from the LOCKSS Program at the Stanford University Libraries, which last October celebrated its 15th birthday. The demands of LOCKSS and CLOCKSS mean that I won't be able to do a lot more work on the big picture of preservation in the near future. So it is time for a summing-up, trying to organize the various areas I've been looking at into a coherent view of the big picture.

How Well Are We Doing?

To understand the challenges we face in preserving the world's digital heritage, we need to start by asking "how well are we currently doing?" I noted some of the attempts to answer this question in my iPRES talk:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Luis Faria and co-authors (PDF) compare information extracted from publishers' web sites with the Keepers Registry and conclude: "We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added."
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL (a minimal sketch of this protocol exchange follows the list).
    Their results are somewhat difficult to interpret, but for their two more random samples they report: URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
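
For context, the check the study performed maps onto a simple HTTP exchange defined by the Memento protocol (RFC 7089): ask a TimeGate for a capture of a URL and see whether it can hand back a memento. The sketch below is illustrative only; the Internet Archive TimeGate endpoint and the use of the Python requests library are assumptions, not details from the paper.

```python
# Minimal sketch of a Memento (RFC 7089) lookup: ask a TimeGate whether any
# capture of a URL exists. The TimeGate endpoint below is an assumption made
# for illustration, not necessarily what Ainsworth et al. used.
import requests

TIMEGATE = "https://web.archive.org/web/"  # assumed Internet Archive TimeGate

def is_archived(url: str) -> bool:
    """Return True if content negotiation ends at a memento of the URL."""
    response = requests.head(
        TIMEGATE + url,
        headers={"Accept-Datetime": "Tue, 01 Apr 2014 00:00:00 GMT"},
        allow_redirects=True,
        timeout=30,
    )
    # A memento identifies itself with the Memento-Datetime response header.
    return response.ok and "Memento-Datetime" in response.headers

if __name__ == "__main__":
    for candidate in ("http://example.com/", "http://example.com/unlikely-page"):
        print(candidate, "archived?", is_archived(candidate))
```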
Somewhat less than half sounds as though we have made good progress. Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.

An Optimistic Assessment

First, the assessment isn't risk-adjusted:
  • As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation, so what have those resources achieved for future readers?
  • As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
Second, the assessment isn't adjusted for difficulty:
  • A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
  • The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, whereas negotiating once with a small publisher gains a small amount of high-risk content.
  • Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking:
  • As regards scholarly communication it looks only at the traditional forms, books and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly both at much higher risk of loss than the traditional forms that are being preserved, because they lack well-established and robust business models, and much more difficult to preserve, since the legal framework is unclear and the content is either much larger, or much more dynamic, or in some cases both.
  • As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.

Fourth, the assessment is likely to suffer measurement bias:
  • The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
  • As regards Web content, Ainsworth et al describe various forms of bias in their paper.
As Cliff pointed out in his summing-up of the recent IDCC conference, the scholarly literature and the surface Web are genres of content for which the denominator of the fraction being preserved (the total amount of genre content) is fairly well known, even if it is difficult to measure the numerator (the amount being preserved). For many other important genres, even the denominator is becoming hard to estimate as the Web enables a variety of distribution channels:
  • Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
  • YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
  • Much music now happens on YouTube (e.g. Pomplamoose).
  • Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories much still resides on grad students' laptops.
Of course, "what we should be preserving" is a judgement call, but clearly even purists who wish to preserve only stuff to which future scholars will undoubtedly require access would be hard pressed to claim that half that stuff is preserved.

Looking Forward

Each unit of the content we are currently not preserving will be more expensive than a similar unit of the content we currently are preserving. I don't know anyone who thinks digital preservation is likely to receive a vast increase in funding; we'll be lucky to maintain the current level. So if we continue to use our current techniques the long-term rate of content loss to future readers from failure to collect will be at least 50%. This will dwarf all other causes of loss.

If we are going to preserve the other more than half, we need a radical re-think of the way we currently work. Even ignoring the issues above, we need to more than halve the cost per unit of content.

I'm on the advisory board of the EU's "4C" project, which aims to pull together the results of the wide range of research into the costs of digital curation and preservation into a usable form. My rule of thumb, based on my reading of the research, is that in the past ingest has taken about one-half, preservation about one-third, and access about one-sixth of the total cost. What are the prospects for costs in each of these areas going forward? How much more than halving the cost do we need?

Future Costs: Ingest

Increasingly, the newly created content that needs to be ingested needs to be ingested from the Web. As we've discussed at two successive IIPC workshops, the Web is evolving from a set of hyper-linked documents to being a distributed programming environment, from HTML to Javascript. In order to find the links much of the collected content now needs to be executed as well as simply being parsed. This is already significantly increasing the cost of Web harvesting, both because executing the content is computationally much more expensive, and because elaborate defenses are required to protect the crawler against the possibility that the content might be malign.
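
To make the parse-versus-execute point concrete, here is a hedged sketch, not any production crawler, that compares the links visible in the raw HTML of a page with those present only after a headless browser has executed its Javascript. Selenium with headless Chrome and the target URL are illustrative assumptions.

```python
# Illustration only: links a parse-only crawler would see versus links that
# appear only after the page's Javascript has run in a headless browser.
from html.parser import HTMLParser

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class LinkParser(HTMLParser):
    """Collect href values from static HTML, as a classic crawler would."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.update(v for k, v in attrs if k == "href" and v)


def static_links(url):
    parser = LinkParser()
    parser.feed(requests.get(url, timeout=30).text)
    return parser.links


def rendered_links(url):
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)  # fetches the page and executes its Javascript
        return {a.get_attribute("href")
                for a in driver.find_elements("css selector", "a[href]")}
    finally:
        driver.quit()


if __name__ == "__main__":
    url = "https://example.com/"  # placeholder target
    extra = rendered_links(url) - static_links(url)
    print(f"{len(extra)} links are reachable only by executing the page")
```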

The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.

The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

Future Costs: Preservation

The major cost of the preservation phase is storage. The cost of storing the collected content for the long term has not been a significant concern for digital preservation. Kryder's Law, the exponential increase in bit density of magnetic media such as disks and tape, stayed in force for three decades and resulted in the cost per byte of storing data halving roughly every two years. Thus, if you could afford to store a collection for the next few years you could afford to store it forever, assuming Kryder's Law continued in force for a fourth decade.

As XKCD points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve. In late 2011 Kryder's Law was abruptly violated as the floods in Thailand destroyed 40% of the world's capacity to build disks, and the cost per byte of disk doubled almost overnight. This graph, from Preeti Gupta at UC Santa Cruz, shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs are now, two and a half years after the floods, more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • If the industry projections pan out, as shown by the red lines, by 2020 disk costs will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
Long-term storage isn't free; it has to be paid for. There are three possible business models for doing so:
  • It can be monetized, as with Google's Gmail service, which funds storing your e-mail without charging you by selling ads alongside it.
  • It can be rented, as with Amazon's S3 and Glacier services, for a monthly payment per Terabyte.
  • It can be endowed, deposited together with a capital sum thought to be enough, with the interest it earns, to pay for storage "forever".
Archived content is rarely accessed by humans, so monetization is unlikely to be an effective sustainability strategy. In effect, a commitment to rent storage for a collection for the long term is equivalent to endowing it with the Net Present Value (NPV) of the stream of future rent payments. I started blogging about endowing data in 2007. In 2010 Serge Goldstein of Princeton described their endowed data service, based on his analysis that if they charged double the initial cost they could store data forever. I was skeptical, not least because what Princeton actually charged was $3K/TB. Using their cost model, this meant either that they were paying $1.5K/TB for disk at a time when Fry's was selling disks for $50/TB, or that they were skeptical too.
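
The arithmetic behind the "charge double" model is a one-line geometric series; the sketch below assumes each successive storage generation costs half as much as the one before, which is roughly what a two-year cost-halving Kryder rate implies.

```latex
% If the first generation of storage costs C and each replacement generation
% costs a fraction r of its predecessor, storing the data "forever" costs
\[
  C \sum_{k=0}^{\infty} r^{k} \;=\; \frac{C}{1-r}, \qquad
  \text{so } r = \tfrac{1}{2} \;\Rightarrow\; \text{total} = 2C .
\]
% As r approaches 1 (a low Kryder rate) the sum grows without bound, which is
% where the model breaks down.
```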

Earlier in 2010 I started predicting, for interconnected technological and business reasons, that Kryder's Law would slow down. I expressed skepticism about Princeton's model in a talk at the 2011 Personal Digital Archiving conference, and started work building an economic model of long-term storage.

A month before the Thai floods I presented initial results at the Library of Congress. A month after the floods I was able to model the effect of price spikes, and demonstrate that the endowment needed for a data collection depended on the Kryder rate in a worrying way. At the Kryder rates we were used to, with cost per byte dropping rapidly, the endowment needed was small and not very sensitive to the exact rate. As the Kryder rate decreased, the endowment needed rose rapidly and became very sensitive to the exact rate.
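
A toy version of that dependence, not the actual model I built, treats the endowment as the net present value of annual storage payments that shrink at the Kryder rate; the 4% discount rate and 100-year horizon below are illustrative assumptions.

```python
# Toy endowment calculation, for illustration only: annual storage cost falls
# at the Kryder rate, and the endowment is the discounted sum of those costs.
def endowment_multiple(kryder_rate, discount_rate=0.04, years=100):
    """Endowment needed, as a multiple of the first year's storage cost."""
    return sum(((1.0 - kryder_rate) / (1.0 + discount_rate)) ** year
               for year in range(years))

for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    print(f"Kryder rate {rate:.0%}: endow ~{endowment_multiple(rate):.1f}x the year-one cost")
```

In this toy model, dropping the Kryder rate from 40% to 10% roughly triples the required endowment, and small changes in the rate matter far more at the low end, which is exactly the sensitivity described above.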

Since the floods, the difficulty and cost of the next generation of disk technology and the consolidation of the disk drive industry have combined to make it clear that future Kryder rates will be much lower than they were in the past. Thus storage costs will be much higher than they were expected to be, and much less predictable.

You may think I'm a pessimist about storage costs. So lets look at what the industry analysts are saying:
  • According to IDC, the demand for storage each year grows about 60%.
  • According to IHS iSuppli, the bit density on the platters of disk drives will grow no more than 20%/year for the next 5 years.
  • According to computereconomics.com, IT budgets in recent years have grown between 0%/year and 2%/year.
Here's a graph that projects these three numbers out for the next 10 years. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at computereconomics.com's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. On these numbers, if storage cost as a proportion of your budget is not to increase, your collections cannot grow at more than 22%/yr.
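
A back-of-the-envelope reconstruction of that arithmetic, under the simplifying assumption that total stored data grows at the 60% IDC figure while cost per byte falls at the 20% Kryder rate and the budget grows 2%:

```python
# Back-of-the-envelope version of the projection above (illustration only).
data_growth, kryder_rate, budget_growth = 0.60, 0.20, 0.02

# Each year, storage cost as a share of the IT budget compounds by this factor:
share_factor = (1 + data_growth) / (1 + kryder_rate) / (1 + budget_growth)
print(f"storage's share of the budget grows ~{share_factor - 1:.0%} per year")

# Collection growth rate at which the share of the budget holds steady:
break_even = (1 + kryder_rate) * (1 + budget_growth) - 1
print(f"break-even collection growth: ~{break_even:.0%} per year")
```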

Recent industry analysts' projections of the Kryder rate have proved to be consistently optimistic. My industry contacts have recently suggested that 12% may be the best we can expect. My guess is that if your collection grows more than 10%/yr storage cost as a proportion of the total budget will increase.

Higher costs will lead to a search for economies of scale. In Cliff Lynch's summary of the ANADP I meeting, he pointed out:
When resources are very scarce, there's a great tendency to centralize, to standardize, to eliminate redundancy in the name of cost effectiveness. This can be very dangerous; it can produce systems that are very brittle and vulnerable, and that are subject to catastrophic failure.

But monoculture is not the only problem. As I pointed out at the Preservation at Scale workshop, the economies of scale are often misleading. Typically they are an S-curve, and the steep part of the curve is at a fairly moderate scale. And the bulk of the economies end up with commercial suppliers operating well above that scale rather than with their customers. These vendors have large marketing budgets with which to mislead about the economies. Thus "the cloud" is not an answer to reducing storage costs for long-term preservation.

The actions of the Harper government in Canada demonstrate clearly why redundancy and diversity in storage is essential, not just at the technological but also at the organizational level. Content is at considerable risk if all its copies are under the control of a single institution, particularly these days a government vulnerable to capture by a radical ideology.

Cliff also highlighted another area in particular in which creating this kind of monoculture causes a serious problem:
I'm very worried that as we build up very visible instances of digital cultural heritage that these collections are going to become subject [to] attack in the same way the national libraries, museums, ... have been subject to deliberate attack and destruction throughout history.
...
Imagine the impact of having a major repository ... raided and having a Wikileaks type of dump of all of the embargoed collections in it. ... Or imagine the deliberate and systematic modification or corruption of materials.

Edward Snowden's revelations have shown the attack capabilities that nation-state actors had a few years ago. How sure are you that no nation-state actor is a threat to your collections? A few years hence, many of these capabilities will be available in the exploit market for all to use. Right now, advanced persistent threat technology only somewhat less effective than that which recently compromised Stanford's network is available in point-and-click form. Protecting against these very credible threats will increase storage costs further.

Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than magnetic or solid state storage. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the archive problem. But access will always be slow enough that the data would have to be really cold before being committed to DNA.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we shall see, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze, points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)

Future Costs: Access

It has always been assumed that the vast majority of archival content is rarely accessed. Research at UC Santa Cruz showed that the majority of accesses to archived data are for indexing and integrity checks. This is supported by the relatively small proportion of total costs that access accounts for in the preservation cost studies.

But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 5.5 times as expensive.

An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the not insignificant costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year.

Implications

Earlier I showed that even if we assume that the other half of the content costs no more to preserve than the low-hanging fruit we're already preserving we need preservation techniques that are at least twice as cost-effective as the ones we currently have. But since then I've shown that:
  • The estimate of half is optimistic.
  • The rest of the content will be much more expensive to ingest.
  • The costs of storing even the content we're currently preserving have been underestimated.
  • The access that scholars will require to future digital collections will be much more expensive than that they required in the past.
Thus, we need a radical re-think of our entire set of digital preservation techniques with the aim of vastly reducing their cost. We probably need a cost reduction of at least 4 and maybe 10 times. I certainly don't know how to do this. Let's brainstorm some ideas that might help toward this goal.

Reducing Costs: Ingest

Much of the discussion of digital preservation concerns metadata. For example, of the 52 criteria in Section 4 of the Trusted Repository Audit standard, ISO 16363, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
  • Manually creating metadata is impractical at scale.
  • Extracting metadata from the content scales better, but it is still expensive.
  • In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
We need less metadata so we can have more data. Two questions need to be asked:
  • When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
  • How important is the metadata to the task of preservation? Generating metadata because it's possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, collecting it is a waste of resources. And if validating the formats of incoming content with error-prone tools is used to reject allegedly non-conforming content, it is counter-productive.
The LOCKSS and CLOCKSS systems take a very parsimonious approach to format metadata. Nevertheless, the requirements of ISO 16363 forced us to expend resources implementing and using FITS, whose output does not in fact contribute to our preservation strategy, and whose binaries are so large that we have to maintain two separate versions of the LOCKSS daemon, one with FITS for internal use and one without for actual preservation. Further, the demands we face for bibliographic metadata mean that metadata extraction is a major part of ingest costs for both systems. These demands come from requirements for:
  • Access via bibliographic (as opposed to full-text) search via, for example, OpenURL resolvers.
  • Meta-preservation services such as the Keepers Registry.
  • Competitive marketing.
Bibliographic search, preservation tracking and bragging about how many articles and books your system preserves are all important, but whether they justify the considerable cost involved is open to question.

It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. Where we simply can't ingest the content, preserving it in place may be the best we can do: creating a legal framework in which the owner of the dataset commits, in return for some consideration such as a tax advantage, to preserve the data and to allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.

Reducing Costs: Preservation

Perfect preservation is a myth, as I have been saying for at least 7 years using "A Petabyte for a Century" as a theme. Current storage technologies are about a million times too unreliable to keep a Petabyte intact for a century; stuff is going to get lost.
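
To see the scale of the problem, here is a back-of-the-envelope version of the "Petabyte for a Century" arithmetic. It is a sketch under the simplifying assumption of independent, uniform annual bit losses, not the full argument.

    import math

    # "A Petabyte for a Century", back of the envelope: what annual per-bit
    # loss probability gives a 50% chance that a petabyte survives a century
    # with no bit lost at all? Assumes independent, uniform bit losses.
    bits = 8 * 10**15        # one petabyte, in bits
    years = 100
    target_survival = 0.5    # want a 50% chance the whole petabyte stays intact

    # Need (1 - p) ** (bits * years) == target_survival; solve for p in a
    # numerically stable way, since p is far smaller than double-precision
    # epsilon relative to 1.
    p_per_bit_year = -math.expm1(math.log(target_survival) / (bits * years))
    print(f"Required annual per-bit loss probability: {p_per_bit_year:.2e}")
    # Roughly 9e-19 per bit-year, many orders of magnitude beyond the
    # reliability any real storage system has demonstrated.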

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
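
The arithmetic behind the 1.89 figure is easy to reproduce. Below is a minimal sketch assuming equal annual spending for ten years, a cheap system that buys twice as many bytes per dollar, and a 1% loss applied at the end of each year to everything it holds, including that year's additions. The 30-year figure depends on further assumptions about budgets and storage prices that this sketch does not attempt to model.

    # Minimal sketch of the comparison above: equal annual budgets for ten
    # years; the cheap system buys twice as many bytes per dollar but loses
    # 1% of everything it holds (including the current year's additions) at
    # the end of each year.
    def surviving_content(years, units_per_year, annual_loss_rate):
        total = 0.0
        for _ in range(years):
            total += units_per_year           # buy this year's storage
            total *= 1.0 - annual_loss_rate   # lose a fraction of everything held
        return total

    lossless = surviving_content(10, 1.0, 0.00)  # reliable system, 1 unit/year
    cheap = surviving_content(10, 2.0, 0.01)     # cheap system, 2 units/year, 1%/yr loss
    print(f"Content preserved after a decade: {cheap / lossless:.2f}x")  # ~1.89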

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2752nd.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive; it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.

Reducing Costs: Access

The Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. In most cases so far the cost of an access to an individual document has been small enough that archives have not charged the reader. But access to individual documents is not the way future scholars will want to access the collections. Either transferring a copy, typically by shipping a NAS box, or providing data-mining infrastructure at the archive is so expensive that scholars must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars.

The real problem here is that scholars are used to having free access to library collections, but what they increasingly want to do with those collections is expensive. A charging mechanism is needed to pay for the infrastructure and, because scholars' access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores the data, charging scholars who access it for the computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. Right now it would be costing $7692/mo. Each month this would increase by $319. So a year would cost $115,272. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 outstanding requests would each need to pay about $300 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach.
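
As a quick sanity check on that arithmetic, under the simplifying assumption that the $319 of growth is simply added to each successive month's bill, the totals land close to the figures quoted above; the small remaining difference comes down to exactly when within each month the feed's growth is billed.

    # Sanity check of the S3 Reduced Redundancy Storage estimate above,
    # assuming each month's bill is simply the previous month's plus $319.
    initial_monthly = 7692   # $/month for the collection as it stands
    monthly_growth = 319     # extra $/month added as the feed grows
    scholars = 400           # outstanding access requests

    year_total = sum(initial_monthly + monthly_growth * month for month in range(12))
    print(f"First-year storage cost: ${year_total:,}")                     # ~$113k
    print(f"Per scholar, if charged back: ${year_total / scholars:,.0f}")  # ~$283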

Recently Twitter offered a limited number of scholars access to its infrastructure to data-mine from the feed, but this doesn't really change the argument.

There are potential storage technologies that combine computation and storage in a cost-effective way. Colleagues at UC Santa Cruz and I proposed one such architecture, which we called DAWN (Durable Array of Wimpy Nodes) in 2011. Architectures of this kind might significantly reduce the cost of providing scholars with data-mining access to the collections. The evolution of storage media is pointing in this direction. But there are very considerable business model difficulties in the way of commercializing such technologies.

Marketing

Any way of making preservation cheaper can be spun as "doing worse preservation". Jeff Rothenberg's Future Perfect 2012 keynote is an excellent example of this spin in action.

We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS is currently undergoing to obtain this check-off item will be well in excess of 10% of its annual budget.

Making the tradeoff of preserving more stuff using worse preservation would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.

Conclusion

Most current approaches to digital preservation aim to ensure that, once ingested, content is effectively immune from bit-rot and the mostly hypothetical threat of format obsolescence, and that future readers will be able to access individual documents via metadata. There are four major problems with these approaches:
  • Reading individual documents one-at-a-time is unlikely to be the access mode future scholars require.
  • At the scale required against the threats they address, bit-rot and format obsolescence, the effectiveness of current techniques is limited.
  • Against other credible threats, such as external attack and insider abuse, the effectiveness of current techniques is doubtful.
  • Current techniques are so expensive that by far the major cause of future scholars' inability to access content will be that the content was not collected in the first place.
We need a complete re-think of the techniques for digital preservation that accepts a much higher risk of loss to preserved content and in return allows us to preserve much more content.

Engard, Nicole: Bookmarks for March 30, 2014

planet code4lib - Sun, 2014-03-30 20:30

Today I found the following resources and bookmarked them online.

Digest powered by RSS Digest

The post Bookmarks for March 30, 2014 appeared first on What I Learned Today....

Related posts:

  1. WordPress bookshelf plugin
  2. Bulk WordPress Plugin Installer
  3. WordPress Automatic Upgrade

Morgan, Eric Lease: Three RDF data models for archival collections

planet code4lib - Sun, 2014-03-30 18:49

Listed and illustrated here are three examples of RDF data models for archival collections. It is interesting to literally see the complexity or thoroughness of each model, depending on your perspective.


This one was designed by Aaron Rubinstein. I don’t know whether or not it was ever put into practice.


This is the model used in Project LOACH by the Archives Hub.


This final model — OAD — is being implemented in a project called ReLoad.

There are other ontologies of interest to cultural heritage institutions, but these three seem to be the most apropos to archivists.
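
For readers who have not worked with linked data, here is a minimal, hypothetical sketch of what an RDF description of an archival collection can look like in code. It deliberately uses rdflib and generic Dublin Core terms rather than any of the three models above, and the URIs and class name are made up for illustration.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    # Hypothetical sketch: describe a collection and one of its items using
    # generic Dublin Core terms. A real implementation would use one of the
    # ontologies discussed above (Rubinstein's, the LOACH model, or OAD).
    EX = Namespace("http://example.org/archives/")

    g = Graph()
    collection = EX["collection/42"]
    item = EX["item/42-001"]

    g.add((collection, RDF.type, EX.Collection))  # made-up class, for illustration
    g.add((collection, DCTERMS.title, Literal("Papers of an Example Donor")))
    g.add((collection, DCTERMS.extent, Literal("12 linear feet")))
    g.add((collection, DCTERMS.hasPart, item))

    g.add((item, DCTERMS.title, Literal("Correspondence, 1921-1925")))
    g.add((item, DCTERMS.isPartOf, collection))

    print(g.serialize(format="turtle"))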

This work is a part of a yet-to-be published book called the LiAM Guidebook, a text intended for archivists and computer technologists interested in the application of linked data to archival description.

Rochkind, Jonathan: Academic freedom in Israel and Palestine

planet code4lib - Sun, 2014-03-30 16:03

While I mostly try to keep this blog focused on professional concerns, I do think academic freedom is a professional concern for librarians, and I’m going to again use this platform to write about an issue of concern to me.

On December 17th, 2013, the American Studies Association membership endorsed a Resolution on Boycott of Israeli Academic Institutions. This resolution endorses and joins in a campaign organized by Palestinian civil society organizations for boycott of Israel for human rights violations against Palestinians — and specifically, for an academic boycott called for by Palestinian academics.

In late December and early January, very many American university presidents released letters opposing and criticizing the ASA boycott resolution, usually on the grounds that the ASA action threatened the academic freedom of Israeli academics.

Here at Johns Hopkins, the President and Provost issued such a letter on December 23rd. I am quite curious about what organizing took place that resulted in letters from so many university presidents within a few weeks. Beyond letters of disapproval from presidents, there has also been organizing to prevent scholars, departments, and institutions from affiliating with the ASA, or to retaliate against scholars who do so (such efforts are, ironically, quite a threat to academic freedom themselves).

The ASA resolution (and the Palestinian academic boycott campaign in general) does not call for a prohibition on cooperation with Israeli academics, only on formal collaborations with Israeli academic institutions — and in the case of the ASA, only formal partnerships by the ASA itself; the ASA is not trying to require any particular actions by members as a condition of membership. You can read more about the parameters of the ASA resolution, and the motivation that led to it, on the ASA's FAQ on the subject, a concise and well-written document I definitely recommend reading.

So I don’t actually think the ASA resolution will have significant effect on academic freedom for scholars at Israeli institutions.  It’s mostly a symbolic action, although the fierce organizing against it shows how threatening the symbolic action is to the Israeli government and those who would like to protect it from criticism.

But, okay, especially if academic boycott of Israel continues to gain strength, then some academics at Israeli institutions will, at the very least, be inconvenienced in their academic affairs.  I can understand why some people find academic boycott an inappropriate tactic — even though I disagree with them.

But here’s the thing. The academic freedom of Palestinian scholars and students has been regularly, persistently, and severely infringed for quite some time.  In fact, acting in solidarity with Palestinian colleagues facing restrictions on freedom of movement and expression and inquiry was the motivation of the ASA’s resolution in the first place, as they write in their FAQ and the language of the resolution itself.

You can read more about restrictions in Palestinian academic freedom, and the complicity of Israeli academic institutions in these restrictions, in a report from Palestinian civil society here; or this campaign web page from Birzeit University and other Palestinian universities;  this report from the Israeli Alternative Information Center;  or in this 2006 essay by Judith Butler; or this 2011 essay by Riham Barghouti, one of the founding members of the Palestinian Campaign for the Academic and Cultural Boycott of Israel.

What are we to make of the fact that so many university presidents spoke up in alarm at an early sign of possible, in their view, impingements on the academic freedom of scholars at Israeli institutions, but none have spoken up to defend significantly beleaguered Palestinian academic freedom?

Here at Hopkins, Students for Justice in Palestine believes that we do all have a responsibility to speak up in solidarity with our Palestinian colleagues, students and scholars, whose freedoms of inquiry and expression are severely curtailed; and that administrators' silence on the issue does not in fact represent our community. Hopkins SJP thinks the community should speak out in concern and support for Palestinian academic freedom, and they've written a letter Hopkins affiliates can sign on to.

I’ve signed the letter. I’d urge any readers who are also affiliated to Hopkins to read it, and consider it signing it as well. Here it is.


Filed under: General