planet code4lib

Planet Code4Lib - http://planet.code4lib.org

Rosenthal, David: What Could Possibly Go Wrong?

Mon, 2014-04-07 09:00
I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.

I'm David Rosenthal and I'm an engineer. I'm about two-thirds of a century old. I wrote my first program almost half a century ago, in Fortran for an IBM 1401. Eric Allman invited me to talk; I've known Eric for more than a third of a century. About a third of a century ago Bob Sproull recruited me for the Andrew project at C-MU, where I worked on the user interface with James Gosling. I followed James to Sun to work on window systems, both X, which you've probably used, and a more interesting one called NeWS that you almost certainly haven't. Then I worked on operating systems with Bill Shannon, Rob Gingell and Steve Kleiman. More than a fifth of a century ago I was employee #4 at NVIDIA, helping Curtis Priem architect the first chip. Then I was an early employee at Vitria, the second company of JoMei Chang and Dale Skeen, founders of the company now called Tibco. One seventh of a century ago, after doing 3 companies, all of which IPO-ed, I was burnt out and decided to ease myself gradually into retirement.

Academic Journals and the Web

It was a total failure. I met Vicky Reich, the wife of the late Mark Weiser, CTO of Xerox PARC. She was a librarian at Stanford, and had been part of the team which, nearly a fifth of a century ago, started Stanford's HighWire Press and pioneered the transition of academic journals from paper to the Web.

In the paper world, librarians saw themselves as having two responsibilities, to provide current scholars with the materials they needed, and to preserve their accessibility for future scholars. They did this through a massively replicated, loosely coupled, fault-tolerant, tamper-evident system of mutually untrusting but cooperating peers that had evolved over centuries. Libraries purchased copies of journals, monographs and books. The more popular the work, the more replicas were stored in the system. The storage of each replica was not very reliable; libraries put them in the stacks and let people take them away. Most times the replicas came back, sometimes they had coffee spilled on them, and sometimes they vanished. Damage could be repaired via inter-library loan and copy. There was a market for replicas; as the number of replicas of a work decreased, the value of a replica in this market increased, encouraging librarians who had a replica to take more care of it, by moving it to more secure storage. The system resisted attempts at censorship or re-writing of history precisely because it was a loosely coupled peer-to-peer system; although it was easy to find a replica, it was hard to find all the replicas, or even to know exactly how many there were. And although it was easy to destroy a replica, it was fairly hard to modify one undetectably.

The transition of academic journals from paper to the Web destroyed two of the pillars of this system, ownership of copies, and massive replication. In the excitement of seeing how much more useful content on the Web was to scholars, librarians did not think through the fundamental implications of the transition. The system that arose meant that they no longer purchased a copy of the journal, they rented access to the publisher's copy. Renting satisfied their responsibility to current scholars, but it couldn't satisfy their responsibility to future scholars.

Librarians' concerns reached the Mellon Foundation, who funded exploratory work at Stanford and five other major research libraries. In what can only be described as a serious failure of systems analysis, the other five libraries each proposed essentially the same system, in which they would take custody of the journals. Other libraries would subscribe to this third-party archive service. If they could not get access from the original publisher and they had a current subscription to the third-party archive they could access the content from the archive. None of these efforts led to a viable system because they shared many fundamental problems including:
  • Libraries such as Harvard were reluctant to outsource a critical function to a competing library such as Yale. On the other hand, funders were reluctant to pay for more than one archive.
  • Publishers were reluctant to deliver their content to a library in order that the library might make money by re-publishing the content to others. This made the contract negotiations necessary to obtain content from the publishers time-consuming and expensive.
  • The concept of a subscription archive was not a solution to the problem of post-cancellation access; it was merely a second instance of exactly the same problem.
One of the problems I had been interested in at Sun and then again at Vitria was fault-tolerance. To a computer scientist, it was a solved problem. Byzantine Fault Tolerance (BFT) could prove that 3f+1 replicas could survive f simultaneous faults. To an engineer, it was not a solved problem. Two obvious questions were:
  • What is the probability that my system will encounter f simultaneous faults?
  • How could my system recover if it did?
There's a very good reason why suspension bridges use stranded cables. A solid rod would be cheaper, but the bridge would then have the same unfortunate property as BFT. It would work properly up to the point of failure, which would be sudden, catastrophic and from which recovery would be impossible.
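
To make the engineer's first question concrete, here is a minimal sketch (my illustration, not anything from the talk) that assumes each of the 3f+1 replicas fails independently with some probability p during a window of vulnerability, and computes the chance that more than f are faulty at once. The independence assumption is exactly what correlated real-world threats violate, which is why answers of this kind are usually far too optimistic.

    from math import comb

    def prob_more_than_f_faults(f, p):
        """Chance that a BFT system with n = 3f+1 replicas, each faulty
        independently with probability p, sees more than f simultaneous faults."""
        n = 3 * f + 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(f + 1, n + 1))

    for f in (1, 2, 3):
        for p in (0.01, 0.05, 0.10):
            print(f"f={f} p={p:.2f}  P(>f faults) = {prob_more_than_f_faults(f, p):.2e}")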

I have long thought that the fundamental challenge facing system architects is to build systems that fail gradually, progressively, and slowly enough for remedial action to be effective, all the while emitting alarming noises to attract attention to impending collapse. In a post-Snowden world it is perhaps superfluous to say that these properties are especially important for failures caused by external attack or internal subversion.

The LOCKSS System

As Vicky explained the paper library system to me, I came to see two things:
  • It was a system in the physical world that had a very attractive set of fault-tolerance properties.
  • An analog of the paper system in the Web world could be built that retained those properties.
With a small grant from Michael Lesk, then at the NSF, I built a prototype system called LOCKSS (Lots Of Copies Keep Stuff Safe), modelled on the paper library system. By analogy with the stacks, libraries would run what you can think of as a persistent Web cache with a Web crawler which would pre-load the cache with the content to which the library subscribed. The contents of each cache would never be flushed, and would be monitored by a peer-to-peer anti-entropy protocol. Any damage detected would be repaired by the Web analog of inter-library copy. Because the system was an exact analog of the existing paper system, the copyright legalities were very simple.

The Mellon Foundation, and then Sun and the NSF funded the work to throw my prototype away and build a production-ready system. The interesting part of this started when we discovered that, as usual with my prototypes, the anti-entropy protocol had gaping security holes. I worked with Mary Baker and some of her students in CS, Petros Maniatis, Mema Roussopoulos and TJ Giuli, to build a real P2P anti-entropy protocol, for which we won Best Paper at SOSP a tenth of a century ago.

The interest in this paper is that it shows a system, albeit in a restricted area of application, that has a high probability of failing slowly and gradually, and of generating alarms in the case of external attack, even from a very powerful adversary. It is a true P2P system with no central control, because that would provide a focus for attack. It uses three major defensive techniques, sketched in toy form after the list:
  • Effort-balancing, to ensure that the computational cost of requesting a service from a peer exceeds the computational cost of satisfying the request. If this condition isn't true in a P2P network, the bad guy can wear the good guys down.
  • Rate-limiting, to ensure that the rate at which the bad guy can make bad things happen can't make the system fail quickly.
  • Lots of copies, so that the anti-entropy protocol can work with samples of the population of copies. Randomly sampling the peers makes it hard for the bad guy to know which peers are involved in which operations.
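
Here is the toy sketch promised above. It is not the actual LOCKSS protocol from the SOSP paper, which adds effort-balancing, reputation and much more; every name and parameter below is made up. It only illustrates the combination of lots of copies, random sampling and rate-limiting: a peer occasionally polls a random sample of other peers about its content, repairs itself only on a landslide disagreement, and raises an alarm when a poll is inconclusive.

    import hashlib, random, time

    class Peer:
        def __init__(self, name, content, network):
            self.name, self.content, self.network = name, content, network  # content is bytes
            self.last_poll = 0.0

        def digest(self):
            return hashlib.sha256(self.content).hexdigest()

        def poll(self, sample_size=10, min_interval=3600, landslide=0.75):
            now = time.time()
            if now - self.last_poll < min_interval:
                return                                   # rate-limiting: don't poll too often
            self.last_poll = now
            others = [p for p in self.network if p is not self]
            sample = random.sample(others, min(sample_size, len(others)))
            agree = sum(1 for p in sample if p.digest() == self.digest())
            frac = agree / len(sample)
            if frac >= landslide:
                return                                   # landslide agreement: content looks good
            if frac <= 1 - landslide:                    # landslide disagreement: repair from a peer
                donor = random.choice([p for p in sample if p.digest() != self.digest()])
                self.content = donor.content
            else:
                print(f"{self.name}: inconclusive poll, raising an alarm")
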
Recent DDoS attacks, such as the 400Gbps NTP Reflection attack on CloudFlare, have made clear the importance of rate-limiting to services such as DNS and NTP.

Now, our free, open source, peer-to-peer digital preservation system is in use at around 150 libraries worldwide. The program has been economically self-supporting for nearly 7 years using the "RedHat" model of free software and paid support. In addition to our SOSP paper, the program has published research into many aspects of digital preservation.

The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at current prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk. The median data item in the Global LOCKSS network has copies at a couple of dozen peers.

I doubt that copyright is high on your list of design problems. You may be wrong about that, but I'm not going to argue with you. So, the rest of this talk will not be about the LOCKSS system as such, but about the lessons we've learned in the last 15 years that are applicable to everyone who is trying to store digital information for the long term. The title of this talk is the question that you have to keep asking yourself over and over again as you work on digital preservation: "what could possibly go wrong?" Unfortunately, once I started writing this talk, it rapidly grew far too long for lunch. Don't expect a comprehensive list; you're only getting edited low-lights.

Stuff is going to get lost

Let's start by examining the problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
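
The arithmetic behind that claim, as I reconstruct it (a sketch under the assumption that bit flips behave like independent exponential decay):

    from math import log

    bits = 8e15                     # one Petabyte
    years = 100.0
    p_intact = 0.5                  # want a 50% chance that nothing flips

    # P(no flip) = (2 ** (-years / half_life)) ** bits = p_intact
    half_life = -years * bits * log(2) / log(p_intact)
    age_of_universe = 13.8e9        # years
    print(f"required bit half-life: {half_life:.1e} years "
          f"(~{half_life / age_of_universe:.0e} times the age of the universe)")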

At scale, storing realistic amounts of data for human timescales is an unsolvable problem. Some stuff is going to get lost. This shouldn't be a surprise; even in the days of paper, stuff got lost. But the essential information needed to keep society running, to keep science progressing, to keep the populace entertained was stored very robustly, with many copies on durable, somewhat tamper-evident media in a fault-tolerant, peer-to-peer, geographically and administratively diverse system.

This is no longer true. The Internet has, in the interest of reducing costs and speeding communication, removed the redundancy, the durability and the tamper-evidence from the system that stores society's critical data. It's now all on spinning rust, with hopefully at least one backup on tape covered in rust.

Two weeks ago, researchers at Berkeley co-authored a paper in which they reported that:
a rapid succession of coronal mass ejections ... sent a pulse of magnetized plasma barreling into space and through Earth's orbit. Had the eruption come nine days earlier, when the ignition spot on the solar surface was aimed at Earth, it would have hit the planet, potentially wreaking havoc with the electrical grid, disabling satellites and GPS, and disrupting our increasingly electronic lives. ... A study last year estimated that the cost of a solar storm like [this] could reach $2.6 trillion worldwide.

Most of the information needed to recover from such an event exists only in digital form on magnetic media. These days, most of it probably exists only in "the cloud", which is this happy place immune from the electromagnetic effects of coronal mass ejections and very easy to access after the power grid goes down.

How many of you have read the science fiction classic The Mote In God's Eye by Larry Niven and Jerry Pournelle? It describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles; the best they can do is to try to ensure that each collapse starts from a higher level than the one before by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.

Larry and Jerry were writing a third of a century ago, but in the light of this week's IPCC report, they are starting to look uncomfortably prophetic. The problem we face is that, with no collective memory of a societal collapse, no-one is willing to pay either to fend it off or to build the museums to pass knowledge to the successor society.

Why is stuff going to get lost?

One way to express the "what could possibly go wrong?" question is to ask "against what threats are you trying to preserve data?" The threat model of a digital preservation system is a very important aspect of the design which is, alas, only rarely documented. In 2005 we did document the LOCKSS threat model. Unfortunately, we didn't consider coronal mass ejections or societal collapse from global warming.

We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure 
How much stuff is going to get lost?

The more we spend per byte, the safer the bytes are going to be. Unfortunately, this is subject to the Law of Diminishing Returns; each successive nine of reliability is exponentially more expensive than the last. We don't have an unlimited budget, so we're going to have to trade off cost against the probability of data loss. To do this we need models to predict the cost of storing data using a given technology, and models to predict the probability of that technology losing data. I've worked on both kinds of model and can report that they're both extremely difficult.

Models of Data Loss

There's quite a bit of research, from among others Google, C-MU and BackBlaze, showing that failure rates of storage media in service are much higher than the rates claimed by the manufacturers' specifications. Why is this? For example, the Blu-Ray disks Facebook is experimenting with for cold storage claim a 50-year data life. No-one has seen a 50-year-old DVD disk, so how do they know?

The claims are based on a model of the failure mechanisms and data from accelerated life testing, in which batches of media are subjected to unrealistically high temperature and humidity. The model is used to extrapolate from these unrealistic conditions to the conditions to be encountered in service. There are two problems: the conditions in service typically don't match those assumed by the models, and the models only capture some of the failure mechanisms.
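
To give a feel for the kind of extrapolation involved, here is a minimal sketch using the common Arrhenius acceleration model with entirely assumed numbers (activation energy, temperatures, test length); it is not any vendor's actual model. Failure mechanisms the model doesn't capture simply never show up in the claimed life.

    from math import exp

    K_B = 8.617e-5                  # Boltzmann constant, eV/K
    Ea = 0.9                        # assumed activation energy, eV
    T_stress = 273.15 + 85          # accelerated test at 85C
    T_service = 273.15 + 25         # assumed service temperature, 25C
    test_hours = 2000               # assumed length of the accelerated test

    # Arrhenius acceleration factor: how much faster the stress conditions
    # age the media compared with service conditions.
    accel = exp((Ea / K_B) * (1 / T_service - 1 / T_stress))
    print(f"acceleration factor ~{accel:.0f}")
    print(f"claimed service life ~{test_hours * accel / 8766:.0f} years "
          f"from a {test_hours}-hour test")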

These problems are much worse when we try to model not just failures of individual media, but of the entire storage system. Research has shown that media failures account for less than half the failures encountered in service; other components of the system such as buses, controllers, power supplies and so on contribute the other half. But even models that include these components exclude many of the threats we identified, from operator errors to coronal mass ejections.

Even more of a problem is that the threats, especially the low-probability ones, are highly correlated. Operators are highly likely to make errors when they are stressed coping with, say, an external attack. The probability of economic failure is greatly increased by, say, insider abuse. Modelling these correlations is a nightmare.

It turns out that economics are by far the largest cause of data failing to reach future readers. A month ago I gave a seminar in the I-school entitled The Half-Empty Archive, in which I pulled together the various attempts to measure how much of the data that should be archived is being collected by archives, and assessed that it was much less than half. No-one believes that archiving budgets are going to double, so we can be confident that the loss rate from being unable to afford to collect is at least 50%. This dwarfs all other causes of data loss.

Let's Keep Everything For Ever!

Digital preservation has three cost areas: ingest, preservation and dissemination. In the seminar I looked at the prospects for radical cost decreases in all three, but I assume that the one you are interested in is storage, which is the main cost of preservation. Everyone knows that, if not actually free, storage is so cheap that we can afford to store everything for ever. For example, Dan Olds at The Register comments on an interview with Dr. Peter Fader, co-director of the Wharton School Customer Analytics Initiative:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."

Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely.

Kryder's Law

They believe this because they lived through a third of a century of Kryder's Law, the analog of Moore's Law for disks. Kryder's Law predicted that the bit density on the platters of disk drives would more than double every 18 months, leading to a consistent 30-40%/yr drop in cost per byte. Thus, long-term storage was effectively free. If you could afford to store something for a few years, you could afford to store it for ever. The cost would have become negligible.
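
A worked version of that claim (my sketch, with an assumed 35%/yr cost decline): the cost of holding a fixed amount of data for ever is a convergent geometric series, only a small multiple of the first year's cost.

    kryder = 0.35                   # assumed annual decline in cost per byte
    # year n costs (1 - kryder)**n times year 0; summing the series:
    total_forever = 1.0 / kryder    # = 1 + 0.65 + 0.65**2 + ... ~ 2.9
    print(f"storing for ever costs about {total_forever:.1f}x the first year's cost")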

As Randall Munroe points out, in the real world exponential growth can't continue for ever. It is always the first part of an S-curve. One of the things that most impressed me about Krste Asanović's keynote on the ASPIRE Project at this year's FAST conference was that their architecture took for granted that Moore's Law was in the past. Kryder's Law is also flattening out.

Here's a graph, from Preeti Gupta at UCSC, showing that in 2010, even before the floods in Thailand doubled $/GB overnight, the Kryder curve was flattening. Currently, disk is about 7 times as expensive as it would have been had the pre-2010 Kryder's Law continued. Industry projections are for 10-20%/yr going forward - the red lines on the graph show that in 2020 disk is now expected to be 100-300 times more expensive than pre-2010 expectations.

Industry projections have a history of optimism, but if we believe that data grows at IDC's 60%/yr, disk density grows at IHS iSuppli's 20%/yr, and IT budgets are essentially flat, the annual cost of storing a decade's accumulated data is 20 times the first year's cost. If at the start of the decade storage is 5% of your budget, at the end it is more than 100% of your budget. So the Big Data zealot has an affordability problem.
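
One way to reproduce that figure (a sketch using only the growth rates quoted above, ignoring everything else in the budget): if the data being kept grows 60%/yr while cost per byte falls only 20%/yr, the annual storage bill grows by a factor of 1.6/1.2 each year, so after a decade it is roughly 18x the first year's bill, which the talk rounds to "20 times".

    data_growth = 1.60              # IDC's 60%/yr growth in data kept
    density_growth = 1.20           # IHS iSuppli's 20%/yr fall in $/byte
    decade_factor = (data_growth / density_growth) ** 10
    print(f"year-10 bill ~{decade_factor:.0f}x year 1's")   # ~18, i.e. the 'about 20 times' above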

Why Is Kryder's Law Slowing?

It is easy to, and we often do, conflate Kryder's Law, which describes the increase in the areal density of bits on disk platters, with the cost of disk storage in $/GB. We wave our hands and say that it roughly mapped one-for-one into a decrease in the cost of disk drives. We are not alone in using this approximation, Mark Kryder himself does (PDF):
Density is viewed as the most important factor ... because it relates directly to cost/GB and in the HDD marketplace, cost/GB has always been substantially more important than other performance parameters. To compare cost/GB, the approach used here was to assume that, to first order, cost/GB would scale in proportion to (density)⁻¹

My co-author Daniel Rosenthal (no relation) has investigated the relationship between bits/in² and $/GB over the last couple of decades. Over that time, it appears that about 3/4 of the decrease in $/GB can be attributed to the increase in bits/in². Where did the rest of the decrease come from? I can think of three possible causes:
  • Economies of scale. For most of the last two decades the unit shipments of drives have been increasing, resulting in lower fixed costs per drive. Unfortunately, unit shipments are currently declining, so this effect has gone into reverse. In 2005 Mark Kryder was quoted as predicting "In a few years the average U.S. consumer will own 10 to 20 disk drives in devices that he uses regularly," but what is in those devices now is flash. The remaining market for disks is the cloud; they are no longer a consumer technology.
  • Manufacturing technology. The technology to build drives has improved greatly over the last couple of decades, resulting in lower variable costs per drive. Unfortunately HAMR, the next generation of disk drive technology has proven to be extraordinarily hard to manufacture, so this effect has gone into reverse.
  • Vendor margins. Over the last couple of decades disk drive manufacturing was a very competitive business, with numerous competing vendors. This gradually drove margins down and caused the industry to consolidate. Before the Thai floods, there were only two major manufacturers left, with margins in the low single digits. Unfortunately, the lack of competition and the floods have led to a major increase in margins, so this effect has gone into reverse.
But these factors only account for the 1/4 of the cost decrease that didn't come from density. What about the density-driven 3/4? Here is a 2008 graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology. Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. Instead, the industry has resorted to stretching PMR by shingling (which increases the density) and helium (which increases the number of platters).

Each technology generation has to stay in the market long enough to earn a return on the cost of the transition from its predecessor. There are two problems:
  • The return it needs to earn is, in effect, the margins the vendors enjoy. The higher the margins, the longer the technology needs to be in the market. Margins have increased.
  • As technology advances, the easier problems get solved first. So each technology transition involves solving harder and harder problems, so it costs more. The transition from PMR to HAMR has turned out to be vastly more expensive than the industry expected. Getting the laser and the magnetics in the head assembly to cooperate is very hard, the transition involves a huge increase in the production of the lasers, and so on.
According to Dave's 6-year-old graph, we should now be almost done with HAMR and starting the transition to Bit Patterned Media (BPM). It is already clear that the HAMR-BPM transition will be even more expensive and thus even more delayed than the PMR-HAMR transition. So the projected 20%/yr Kryder rate is unlikely to be realized. The one good thing, if you can call it that, about the slowing of the Kryder rate for disk is that it puts off the day when the technology hits the superparamagnetic limit. This is when the shrinking magnetic domains become unstable at the temperatures encountered inside an operating disk, which are quite warm.

We'll Just Use Tape Instead of Disk

About 70% of all bytes of storage produced each year are disk, the rest being tape and solid state. Tape has been the traditional medium for long-term storage. Its recording technology lags about 8 years behind disk; it is unlikely to run into the problems plaguing disk for some years. We can expect its relative cost per byte advantage over disk to grow in the medium term. But tape is losing ground in the market. Why is this?

In the past, the access patterns to archived data were stable. It was rarely accessed, and accesses other than integrity checks were sparse. But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until last week it was 5.5 times.

An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the considerable costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year. Recently, Twitter issued a call for a "small number of proposals to receive free datasets", but even Twitter can't support 400.

Thus future archives will need to keep at least one copy of their content on low-latency, high-bandwidth storage, not tape.

We'll Just Use Flash Instead

Flash memory's advantages, including low power, physical robustness and low access latency, have overcome its higher cost per byte in many markets, such as tablets and servers. But there is no possibility of flash replacing disk in the bulk storage market; that would involve trebling the number of flash fabs. Even if we ignore the lead time to build the new fabs, the investment to do so would not pay dividends. Everyone understands that shrinking flash cells much further will impair their ability to store data. Increasing the number of levels per cell, stacking cells in 3D and increasingly desperate signal processing in the flash controller will keep density going for a little while, but not long enough to pay back the investment in the fabs.

We'll Just Use Non-volatile RAM Instead

There are many technologies vying to be the successor to flash, and most can definitely keep scaling beyond the end of flash provided the semiconductor industry keeps on its road-map.  They all have significant advantages over flash, in particular they are byte- rather than block-addressable. But analysis by Mark Kryder and Chang Soo Kim (PDF) at Carnegie-Mellon is not encouraging about the prospects for either flash or the competing solid state technologies beyond the end of the decade.

We'll Just Use Metal Tape, Stone DVDs, Holographic DVDs or DNA Instead

Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
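
The space arithmetic behind those numbers (my sketch; the talk's figures are rounded):

    for kryder in (0.10, 0.30):
        shrink = (1 - kryder) ** 10             # fraction of today's space needed in 10 years
        print(f"{kryder:.0%}/yr: same data in {shrink:.2f} of the space, "
              f"or {1 / shrink:.0f}x the data in the same space")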

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than Kryder's or Moore's Laws. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the cold storage problem. But DNA access will always be slow enough that it can't store the only copy.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)

Note that this analysis assumes that the drives fail under warranty. One thing the drive vendors did to improve their margins after the floods was to reduce the length of warranties.
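
Checking the arithmetic in that quote (a sketch using only the numbers Wilson gives):

    drives = 30_000
    fleet_cost = 4_000_000                      # dollars
    fail_rate = 0.02                            # 2% of drives fail per year
    swap_minutes = 15

    hours = drives * fail_rate * swap_minutes / 60
    print(f"replacement labour: {hours:.0f} hours/yr")                 # 150 hours
    salary_saved = 5_000                        # halving the failure rate, per the quote
    print(f"worth {salary_saved / fleet_cost:.3%} of the fleet cost")  # ~0.125%, i.e. 1/10 of 1%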

Does Kryder's Law Slowing Matter?

Figures from SDSC suggest that media cost is about 1/3 of the lifecycle cost of storage, although figures from BackBlaze suggest a much higher proportion. As a rule of thumb, the research into digital preservation costs suggests that ingesting the content costs about 1/2 the total lifecycle costs, preserving it costs about 1/3 and disseminating it costs about 1/6. Media cost is thus roughly 1/3 of 1/3, or about 1/9 of the total. So why are we worrying about a slowing of the decrease in 1/9 of the total cost?

Different technologies with different media service lives involve spending different amounts of money at different times during the life of the data. To make apples-to-apples comparisons we need to use the equivalent of Discounted Cash Flow to compute the endowment needed for the data. This is the capital sum which, deposited with the data and invested at prevailing interest rates, would be sufficient to cover all the expenditures needed to store the data for its life.

We built an economic model of the cost of long-term storage. Here it is, from 15 months ago, plotting the endowment needed for 3 replicas of a 117TB dataset to have a 98% chance of not running out of money over 100 years, against the Kryder rate, using costs from Backblaze. Each line represents a policy of keeping the drives for 1, 2 ... 5 years before replacing them.
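
Since the model itself is not reproduced here, the following is a much-simplified, deterministic sketch of the same kind of calculation, not the actual model behind the graph (which is stochastic and far more detailed); every parameter value is an illustrative assumption. It discounts a century of media purchases and running costs back to the present, with cheaper media arriving at the assumed Kryder rate.

    def endowment(initial_media_cost, kryder_rate, interest_rate,
                  replace_every=4, running_frac=0.10, years=100):
        """Capital needed now to fund media replacement plus running costs."""
        total, media_cost = 0.0, initial_media_cost
        for year in range(years):
            spend = media_cost if year % replace_every == 0 else 0.0
            spend += running_frac * initial_media_cost   # power, space, admin (held constant)
            total += spend / (1 + interest_rate) ** year # discount to the present
            media_cost *= (1 - kryder_rate)              # next year's media are cheaper
        return total

    for kryder in (0.40, 0.20, 0.10, 0.00):
        print(f"Kryder {kryder:.0%}/yr -> endowment ~{endowment(100.0, kryder, 0.02):.0f} "
              f"per 100 of initial media cost")

Even this toy version shows the shape of the real curves: the endowment rises sharply, and becomes far more sensitive to the exact rate, as the Kryder rate falls.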

In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
  • The endowment needed increases sharply.
  • The endowment needed becomes harder to predict, because it depends strongly on the precise Kryder rate.
The reason to worry is that the cost of storing data for the long term depends strongly on the Kryder rate if it falls much below 20%, which it has. Everyone's storage expectations, and budgets, are based on their pre-2010 experience, and on a belief that the effect of the floods was a one-off glitch, after which the industry would quickly get back to historic Kryder rates. It wasn't, and it won't.

Does Losing Stuff Matter?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2752nd.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.

Can We Do Better?

In the short term, the inertia of manufacturing investment means that things aren't going to change much. Bulk data is going to be on disk; it can't compete with other uses for the higher-value space on flash. But looking out to the end of the decade and beyond, we're going to be living in a world of much lower Kryder rates. What does this mean for storage system architectures?

The reason disks have a five-year service life isn't an accident of technology. Disks are engineered to have a five-year service life because, with a 40%/yr Kryder rate, it is uneconomic to keep the data on the drive for longer than 5 years. After 5 years the data would take up only about 8% of a same-cost replacement drive.
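
The arithmetic, as I read it (assuming a steady 40%/yr fall in cost per byte):

    kryder = 0.40
    fraction = (1 - kryder) ** 5    # share of a same-cost replacement drive after 5 years
    print(f"after 5 years the data fills only {fraction:.1%} of a same-cost new drive")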

At lower Kryder rates the media, whatever they are, will be in service longer. That means that running cost will be a larger proportion of the total cost. It will be worth while to spend more on purchasing the media to spend less on running them. Three years ago Ian Adams, Ethan Miller and I were inspired by the FAWN paper from Carnegie-Mellon to do an analysis we called DAWN: Durable Array of Wimpy Nodes. In it we showed that, despite the much higher capital cost, a storage fabric consisting of a very large number of very small nodes each with a very low-power system-on-chip and a small amount of flash memory would be competitive with disk.

The reason was that DAWN's running cost would be so much lower, and its economic media life so much longer, that it would repay the higher initial investment. The more the Kryder rate slows, the better our analysis looks. DAWN's better performance was a bonus. To the extent that successors to flash behave like RAM, and especially if they can be integrated with the system-on-chip, they strengthen the case further with lower costs and an even bigger performance edge.
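
To show the shape of that trade-off, here is a toy total-cost-of-ownership comparison in the spirit of the DAWN argument. Every number is a made-up placeholder, not a figure from the DAWN paper; the point is only that higher capital cost can win when running cost is much lower and the economic media life much longer.

    def tco_per_tb(capital, running_per_yr, media_life_yrs, horizon_yrs=12):
        """Total cost of ownership per TB over the horizon: re-buy the media
        whenever its economic life expires, plus annual running costs."""
        rebuys = horizon_yrs / media_life_yrs
        return capital * rebuys + running_per_yr * horizon_yrs

    disk = tco_per_tb(capital=30.0, running_per_yr=10.0, media_life_yrs=5)
    dawn = tco_per_tb(capital=90.0, running_per_yr=1.0, media_life_yrs=12)
    print(f"disk ~${disk:.0f}/TB over 12 years vs DAWN-style fabric ~${dawn:.0f}/TB")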


Summing Up

Expectations for future storage technologies and costs were built up during three decades of extremely rapid cost per byte decrease. We are now 4 years into a period of much slower cost decrease, but expectations remain unchanged. Some haven't noticed the change, some believe it is temporary and the industry will return to the good old days of 40%/yr Kryder rates.

Industry insiders are projecting no more than 20%/yr rates for the rest of the decade. Technological and market forces make it likely that, as usual, they are being optimistic. Lower Kryder rates greatly increase both the cost of long-term storage, and the uncertainty in estimating it.

Lower Kryder rates mean that the economic service life of media will be longer, placing more emphasis on lower running cost than on lower purchase cost. This is particularly true since bulk storage media are no longer a consumer product; businesses are better placed to make this trade-off. But they may not do so (see the work of Andrew Haldane and Richard Davies at the Bank of England, and Doyne Farmer of the Santa Fe Institute and John Geanakoplos of Yale).

The idea that archived data can live on long-latency, low-bandwidth media no longer holds. Future archival storage architectures must deliver adequate performance to sustain data-mining as well as low cost. Bundling computation into the storage medium is the way to do this.

Discussion

As usual, I was too busy answering questions to remember most of them. Here are the ones I remember, rephrased, with apologies to the questioners whose contributions slipped my memory:
  • Won't the evolution of flash technology drive its price down more quickly than disk? The problem is that the manufacturing capacity doesn't, and won't exist for flash to displace disk in the bulk storage space. Flash is a better technology than disk for many applications, so it is likely always to command a premium over disk.
  • Isn't DNA a really noisy technology to build long-term memory from? At the raw media level, all storage technologies are unpleasantly noisy. The signal processing that goes on inside your disk or flash controller is amazing. DNA has the advantage that the signal processing has a vast number of replicas to work with.
  • Doesn't experience with flash suggest that it isn't capable of storing data reliably for the long term? The way current flash controllers use the raw medium optimizes for things other than data retention, such as performance (for SSDs) and low cost (for SD cards, see Bunnie Huang and xobs' talk at the Chaos Communication Congress). That doesn't mean it isn't possible, with alternate flash controller technology, to optimize for data retention.

Rosenthal, David: What Could Possibly Go Wrong?

Mon, 2014-04-07 09:00
I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.

I'm David Rosenthal and I'm an engineer. I'm about two-thirds of a century old. I wrote my first program almost half a century ago, in Fortran for an IBM1401. Eric Allman invited me to talk; I've known Eric for more than a third of a century. About a third of a century ago Bob Sproull recruited me for the Andrew project at C-MU, I where I worked on the user interface with James Gosling. I followed James to Sun to work on window systems, both X, which you've probably used, and a more interesting one called NeWS that you almost certainly haven't. Then I worked on operating systems with Bill Shannon, Rob Gingell and Steve Kleiman. More than a fifth of a century ago I was employee #4 at NVIDIA, helping Curtis Priem architect the first chip. Then I was an early employee at Vitria, the second company of JoMei Chang and Dale Skeen, founders of the company now called Tibco. One seventh of a century ago, after doing 3 companies, all of which IPO-ed, I was burnt out and decided to ease myself gradually into retirement.

Academic Journals and the Web

It was a total failure. I met Vicky Reich, the wife of the late Mark Weiser, CTO of Xerox PARC. She was a librarian at Stanford, and had been part of the team which, nearly a fifth of a century ago, started Stanford's HighWire Press and pioneered the transition of academic journals from paper to the Web.

In the paper world, librarians saw themselves as having two responsibilities, to provide current scholars with the materials they needed, and to preserve their accessibility for future scholars. They did this through a massively replicated. loosely coupled, fault-tolerant, tamper-evident, system of mutually untrusting but cooperating peers that had evolved over centuries. Libraries purchased copies of journals, monographs and books. The more popular the work, the more replicas were stored in the system. The storage of each replica was not very reliable; libraries put them in the stacks and let people take them away. Most times the replicas came back, sometimes they had coffee spilled on them, and sometimes they vanished. Damage could be repaired via inter-library loan and copy. There was a market for replicas; as the number of replicas of a work decreased, the value of a replica in this market increased, encouraging librarians who had a replica to take more care it, by moving it to more secure storage. The system resisted attempts at censorship or re-writing of history precisely because it was a loosely coupled peer-to-peer system; although it was easy to find a replica, it was hard to find all the replicas, or even to know exactly how many there were. And although it was easy to destroy a replica, it was fairly hard to modify one undetectably.

The transition of academic journals from paper to the Web destroyed two of the pillars of this system, ownership of copies, and massive replication. In the excitement of seeing how much more useful content on the Web was to scholars, librarians did not think through the fundamental implications of the transition. The system that arose meant that they no longer purchased a copy of the journal, they rented access to the publisher's copy. Renting satisfied their responsibility to current scholars, but it couldn't satisfy their responsibility to future scholars.

Librarians' concerns reached the Mellon Foundation, who funded exploratory work at Stanford and five other major research libraries. In what can only be described as a serious failure of systems analysis, the other five libraries each proposed essentially the same system, in which they would take custody of the journals. Other libraries would subscribe to this third-party archive service. If they could not get access from the original publisher and they had a current subscription to the third-party archive they could access the content from the archive. None of these efforts led to a viable system because they shared many fundamental problems including:
  • Libraries such as Harvard were reluctant to outsource a critical function to a competing library such as Yale. On the other hand, funders were reluctant to pay for more than one archive.
  • Publishers were reluctant to deliver their content to a library in order that the library might make money by re-publishing the content to others. This made the contract negotiations necessary to obtain content from the publishers time-consuming and expensive.
  • The concept of a subscription archive was not a solution to the problem of post-cancellation access; it was merely a second instance of exactly the same problem.
One of the problems I had been interested in at Sun and then again at Vitria was fault-tolerance. To a computer scientist, it was a solved problem. Byzantine Fault Tolerance (BFT) could prove that 3f+1 replicas could survive f simultaneous faults. To an engineer, it was not a solved problem. Two obvious questions were:
  • What is the probability that my system will encounter f simultaneous faults?
  • How could my system recover if it did?
There's a very good reason why suspension bridges use stranded cables. A solid rod would be cheaper, but the bridge would then have the same unfortunate property as BFT. It would work properly up to the point of failure, which would be sudden, catastrophic and from which recovery would be impossible.

I have long thought that the fundamental challenge facing system architects is to build systems that fail gradually, progressively, and slowly enough for remedial action to be effective, all the while emitting alarming noises to attract attention to impending collapse. In a post-Snowden world it is perhaps superfluous to say that these properties are especially important for failures caused by external attack or internal subversion.

The LOCKSS System

As Vicky explained the paper library system to me, I came to see two things:
  • It was a system in the physical world that had a very attractive set of fault-tolerance properties.
  • An analog of the paper system in the Web world could be built that retained those properties.
With a small grant from Michael Lesk, then at the NSF, I built a prototype system called LOCKSS (Lots Of Copies Keep Stuff Safe), modelled on the paper library system. By analogy with the stacks, libraries would run what you can think of as a persistent Web cache with a Web crawler which would pre-load the cache with the content to which the library subscribed. The contents of each cache would never be flushed, and would be monitored by a peer-to-peer anti-entropy protocol. Any damage detected would be repaired by the Web analog of inter-library copy. Because the system was an exact analog of the existing paper system, the copyright legalities were very simple.

The Mellon Foundation, and then Sun and the NSF funded the work to throw my prototype away and build a production-ready system. The interesting part of this started when we discovered that, as usual with my prototypes, the anti-entropy protocol had gaping security holes. I worked with Mary Baker and some of her students in CS, Petros Maniatis, Mema Roussopoulos and TJ Giuli, to build a real P2P anti-entropy protocol, for which we won Best Paper at SOSP a tenth of a century ago.

The interest in this paper is that it shows a system, albeit in a restricted area of application, that has a high probability of failing slowly and gradually, and of generating alarms in the case of external attack, even from a very powerful adversary. It is a true P2P system with no central control,  because that would provide a focus for attack. It uses three major defensive techniques:
  • Effort-balancing, to ensure that the computational cost of requesting a service from a peer exceeds the computational cost of satisfying the request. If this condition isn't true in a P2P network, the bad guy can wear the good guys down.
  • Rate-limiting, to ensure that the rate at which the bad guy can make bad things happen can't make the system fail quickly.
  • Lots of copies, so that the anti-entropy protocol can work with samples of the population of copies. Randomly sampling the peers makes it hard for the bad guy to know which peers are involved in which operations.
Recent DDoS attacks, such as the 400Gbps NTP Reflection attack on CloudFlare, have made clear the importance of rate-limiting to services such as DNS and NTP.

Now, our free, open source, peer-to-peer digital preservation system is in use at around 150 libraries worldwide. The program has been economically self-supporting for nearly 7 years using the "RedHat" model of free software and paid support. In addition to our SOSP paper, the program has published research into many aspects of digital preservation.

The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at current prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk. The median data item in the Global LOCKSS network has copies at a couple of dozen peers.

I doubt that copyright is high on your list of design problems. You may be wrong about that, but I'm not going to argue with you. So, the rest of this talk will not be about the LOCKSS system as such, but about the lessons we've learned in the last 15 years that are applicable to everyone who is trying to store digital information for the long term. The title of this talk is the question that you have to keep asking yourself over and over again as you work on digital preservation, "what could possibly go wrong?" Unfortunately, once I started writing this talk, it rapidly grew far too long for lunch. Don't expect a comprehensive list, you're only getting edited low-lights.

Stuff is going to get lost

Lets start by examining the problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.

At scale, storing realistic amounts of data for human timescales is an unsolvable problem. Some stuff is going to get lost. This shouldn't be a surprise, even in the days of paper stuff got lost. But the essential information needed to keep society running, to keep science progressing, to keep the populace entertained was stored very robustly, with many copies on durable, somewhat tamper-evident media in a fault-tolerant, peer-to-peer, geographically and administratively diverse system.

This is no longer true. The Internet has, in the interest of reducing costs and speeding communication, removed the redundancy, the durability and the tamper-evidence from the system that stores society's critical data. Its now all on spinning rust, with hopefully at least one backup on tape covered in rust.

Two weeks ago, researchers at Berkeley co-authored a paper in which they reported that:
a rapid succession of coronal mass ejections ... sent a pulse of magnetized plasma barreling into space and through Earth’s orbit. Had the eruption come nine days earlier, when the ignition spot on the solar surface was aimed at Earth, it would have hit the planet, potentially wreaking havoc with the electrical grid, disabling satellites and GPS, and disrupting our increasingly electronic lives. ... A study last year estimated that the cost of a solar storm like [this] could reach $2.6 trillion worldwide.Most of the information needed to recover from such an event exists only in digital form on magnetic media. These days, most of it probably exists only in "the cloud", which is this happy place immune from the electromagnetic effects of coronal mass ejections and very easy to access after the power grid goes down.

How many of you have read the science fiction classic The Mote In God's Eye by Larry Niven and Jerry Pournelle? It describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles, the best they can do is to try to ensure that each collapse starts from a higher level than the one before by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.

Larry and Jerry were writing a third of a century ago, but in the light of this week's IPCC report, they are starting to look uncomfortably prophetic. The problem we face is that, with no collective memory of a societal collapse, no-one is willing to pay either to fend it off or to build the museums to pass knowledge to the successor society.

Why is stuff going to get lost?

One way to express the "what could possibly go wrong?" question is to ask "against what threats are you trying to preserve data?" The threat model of a digital preservation system is a very important aspect of the design which is, alas, only rarely documented. In 2005 we did document the LOCKSS threat model. Unfortunately, we didn't consider coronal mass ejections or societal collapse from global warming.

We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure 
How much stuff is going to get lost?

The more we spend per byte, the safer the bytes are going to be. Unfortunately, this is subject to the Law of Diminishing Returns; each successive nine of reliability is exponentially more expensive than the last. We don't have an unlimited budget, so we're going to have to trade off cost against the probability of data loss. To do this we need models to predict the cost of storing data using a given technology, and models to predict the probability of that technology losing data. I've worked on both kinds of model and can report that they're both extremely difficult.

Models of Data Loss

There's quite a bit of research, from among others Google, C-MU and BackBlaze, showing that failure rates of storage media in service are much higher than the rates claimed by the manufacturers specifications. Why is this? For example, the Blu-Ray disks Facebook is experimenting with for cold storage claim a 50-year data life. No-one has seen a 50-year-old DVD disk, so how do they know?

The claims are based on a model of the failure mechanisms and data from accelerated life testing, in which batches of media are subjected to unrealistically high temperature and humidity. The model is used to extrapolate from these unrealistic conditions to the conditions to be encountered in service. There are two problems, the conditions in service typically don't match those assumed by the models, and the models only capture some of the failure mechanisms.

These problems are much worse when we try to model not just failures of individual media, but of the entire storage system. Research has shown that media failures account for less than half the failures encountered in service; other components of the system such as buses, controllers, power supplies and so on contribute the other half. But even models that include these components exclude many of the threats we identified, from operator errors to coronal mass ejections.

Even more of a problem is that the threats, especially the low-probability ones, are highly correlated. Operators are highly likely to make errors when they are stressed coping with, say, an external attack. The probability of economic failure is greatly increased by, say, insider abuse. Modelling these correlations is a nightmare.
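
A toy Monte Carlo sketch of why this matters; the probabilities are invented purely to illustrate the effect of correlation, not measured rates:

    import random

    random.seed(42)
    YEARS = 1_000_000
    P_ATTACK = 0.01          # chance of an external attack in a given year
    P_ERROR = 0.01           # baseline chance of a serious operator error
    P_ERROR_STRESSED = 0.30  # error chance while coping with an attack

    def joint_loss_rate(correlated):
        both = 0
        for _ in range(YEARS):
            attack = random.random() < P_ATTACK
            p_err = P_ERROR_STRESSED if (correlated and attack) else P_ERROR
            if attack and (random.random() < p_err):
                both += 1
        return both / YEARS

    print(f"independent: {joint_loss_rate(False):.6f}")   # ~0.0001
    print(f"correlated:  {joint_loss_rate(True):.6f}")    # ~0.003, about 30x worse

Treating the two threats as independent under-estimates the chance of suffering both in the same year by more than an order of magnitude.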

It turns out that economic factors are by far the largest cause of data failing to reach future readers. A month ago I gave a seminar in the I-school entitled The Half-Empty Archive, in which I pulled together the various attempts to measure how much of the data that should be archived is actually being collected by archives, and assessed that it was much less than half. No-one believes that archiving budgets are going to double, so we can be confident that the loss rate from being unable to afford to collect is at least 50%. This dwarfs all other causes of data loss.

Let's Keep Everything For Ever!

Digital preservation has three cost areas: ingest, preservation and dissemination. In the seminar I looked at the prospects for radical cost decreases in all three, but I assume that the one you are interested in is storage, which is the main cost of preservation. Everyone knows that, if not actually free, storage is so cheap that we can afford to store everything for ever. For example, Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."

Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is that he and his audience believe that this small value outweighs the cost of keeping the data indefinitely.

Kryder's Law

They believe this because they lived through a third of a century of Kryder's Law, the analog of Moore's Law for disks. Kryder's Law predicted that the bit density on the platters of disk drives would more than double every 18 months, leading to a consistent 30-40%/yr drop in cost per byte. Thus, long-term storage was effectively free. If you could afford to store something for a few years, you could afford to store it for ever. The cost would have become negligible.
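
The arithmetic behind "effectively free" is a convergent geometric series. A quick sketch, taking 35%/yr as a representative rate from the quoted range:

    def perpetual_cost_multiple(kryder_rate, years=100):
        """Total cost of keeping a fixed dataset, as a multiple of the
        first year's cost, if $/GB falls by kryder_rate every year."""
        return sum((1.0 - kryder_rate) ** n for n in range(years))

    print(perpetual_cost_multiple(0.35))   # ~2.9: storing it 'for ever' costs
                                           # about three times the first year
    print(perpetual_cost_multiple(0.10))   # ~10: a much less comfortable bet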

As Randall Munroe points out, in the real world exponential growth can't continue for ever. It is always the first part of an S-curve. One of the things that most impressed me about Krste Asanović's keynote on the ASPIRE Project at this year's FAST conference was that their architecture took for granted that Moore's Law was in the past. Kryder's Law is also flattening out.

Here's a graph, from Preeti Gupta at UCSC, showing that in 2010, even before the floods in Thailand doubled $/GB overnight, the Kryder curve was flattening. Currently, disk is about 7 times as expensive as it would have been had the pre-2010 Kryder's Law continued. Industry projections are for 10-20%/yr going forward - the red lines on the graph show that in 2020 disk is now expected to be 100-300 times more expensive than pre-2010 expectations.

Industry projections have a history of optimism, but if we believe that data grows at IDC's 60%/yr, disk density grows at IHS iSuppli's 20%/yr, and IT budgets are essentially flat, the annual cost of storing a decade's accumulated data is 20 times the first year's cost. If at the start of the decade storage is 5% of your budget, at the end it is more than 100% of your budget. So the Big Data zealot has an affordability problem.
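
A back-of-the-envelope sketch of that arithmetic under the stated assumptions (60%/yr data growth, roughly 20%/yr $/GB decline); the exact multiplier depends on how you compound the rates, but it lands in the same range as the figure above:

    DATA_GROWTH = 0.60   # IDC: data grows 60%/yr
    KRYDER_RATE = 0.20   # industry projection: $/GB falls ~20%/yr

    def annual_storage_cost(year):
        """Cost of holding all data accumulated through `year`,
        relative to the first year's cost (illustrative model)."""
        accumulated = sum((1 + DATA_GROWTH) ** y for y in range(year))
        price_per_byte = (1 - KRYDER_RATE) ** (year - 1)
        return accumulated * price_per_byte

    print(annual_storage_cost(1))    # 1.0 by construction
    print(annual_storage_cost(10))   # ~24x the first year's cost

At that multiple, the 5%-of-budget line in year one is comfortably past 100% of a flat budget by year ten.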

Why Is Kryder's Law Slowing?

It is easy to, and we often do, conflate Kryder's Law, which describes the increase in the areal density of bits on disk platters, with the cost of disk storage in $/GB. We wave our hands and say that it roughly mapped one-for-one into a decrease in the cost of disk drives. We are not alone in using this approximation, Mark Kryder himself does (PDF):
Density is viewed as the most important factor ... because it relates directly to cost/GB and in the HDD marketplace, cost/GB has always been substantially more important than other performance parameters. To compare cost/GB, the approach used here was to assume that, to first order, cost/GB would scale in proportion to (density)^-1

My co-author Daniel Rosenthal (no relation) has investigated the relationship between bits/in² and $/GB over the last couple of decades. Over that time, it appears that about 3/4 of the decrease in $/GB can be attributed to the increase in bits/in². Where did the rest of the decrease come from? I can think of three possible causes:
  • Economies of scale. For most of the last two decades the unit shipments of drives have been increasing, resulting in lower fixed costs per drive. Unfortunately, unit shipments are currently declining, so this effect has gone into reverse. In 2005 Mark Kryder was quoted as predicting "In a few years the average U.S. consumer will own 10 to 20 disk drives in devices that he uses regularly," but what is in those devices now is flash. The remaining market for disks is the cloud; they are no longer a consumer technology.
  • Manufacturing technology. The technology to build drives has improved greatly over the last couple of decades, resulting in lower variable costs per drive. Unfortunately HAMR, the next generation of disk drive technology, has proven to be extraordinarily hard to manufacture, so this effect has gone into reverse.
  • Vendor margins. Over the last couple of decades disk drive manufacturing was a very competitive business, with numerous competing vendors. This gradually drove margins down and caused the industry to consolidate. Before the Thai floods, there were only two major manufacturers left, with margins in the low single digits. Unfortunately, the lack of competition and the floods have led to a major increase in margins, so this effect has gone into reverse.
But these factors account for only about 1/4 of the cost decrease; the other 3/4 came from the increase in density itself, so to understand the slowdown we have to look at the density roadmap. Here is a 2008 graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology. Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. Instead, the industry has resorted to stretching PMR by shingling (which increases the density) and helium (which increases the number of platters).

Each technology generation has to stay in the market long enough to earn a return on the cost of the transition from its predecessor. There are two problems:
  • The return it needs to earn is, in effect, the margins the vendors enjoy. The higher the margins, the longer the technology needs to be in the market. Margins have increased.
  • As technology advances, the easier problems get solved first, so each successive transition involves solving harder problems and thus costs more. The transition from PMR to HAMR has turned out to be vastly more expensive than the industry expected. Getting the laser and the magnetics in the head assembly to cooperate is very hard, the transition involves a huge increase in the production of the lasers, and so on.
According to Dave's 6-year-old graph, we should now be almost done with HAMR and starting the transition to Bit Patterned Media (BPM). It is already clear that the HAMR-BPM transition will be even more expensive and thus even more delayed than the PMR-HAMR transition. So the projected 20%/yr Kryder rate is unlikely to be realized. The one good thing, if you can call it that, about the slowing of the Kryder rate for disk is that it puts off the day when the technology hits the superparamagnetic limit. This is when the shrinking magnetic domains become unstable at the temperatures encountered inside an operating disk, which are quite warm.

We'll Just Use Tape Instead of Disk

About 70% of all bytes of storage produced each year are disk, the rest being tape and solid state. Tape has been the traditional medium for long-term storage. Its recording technology lags about 8 years behind disk; it is unlikely to run into the problems plaguing disk for some years. We can expect its relative cost per byte advantage over disk to grow in the medium term. But tape is losing ground in the market. Why is this?

In the past, the access patterns to archived data were stable. It was rarely accessed, and accesses other than integrity checks were sparse. But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until last week it was 5.5 times.

An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the considerable costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year. Recently, Twitter issued a call for a "small number of proposals to receive free datasets", but even Twitter can't support 400.

Thus future archives will need to keep at least one copy of their content on low-latency, high-bandwidth storage, not tape.

We'll Just Use Flash Instead

Flash memory's advantages, including low power, physical robustness and low access latency, have overcome its higher cost per byte in many markets, such as tablets and servers. But there is no possibility of flash replacing disk in the bulk storage market; that would involve trebling the number of flash fabs. Even if we ignore the lead time to build the new fabs, the investment to do so would not pay dividends. Everyone understands that shrinking flash cells much further will impair their ability to store data. Storing more bits per cell, stacking cells in 3D and increasingly desperate signal processing in the flash controller will keep density going for a little while, but not long enough to pay back the investment in the fabs.

We'll Just Use Non-volatile RAM Instead

There are many technologies vying to be the successor to flash, and most can definitely keep scaling beyond the end of flash provided the semiconductor industry keeps on its road-map.  They all have significant advantages over flash, in particular they are byte- rather than block-addressable. But analysis by Mark Kryder and Chang Soo Kim (PDF) at Carnegie-Mellon is not encouraging about the prospects for either flash or the competing solid state technologies beyond the end of the decade.

We'll Just Use Metal Tape, Stone DVDs, Holographic DVDs or DNA Instead

Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than Kryder's or Moore's Laws. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the cold storage problem. But DNA access will always be slow enough that it can't store the only copy.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)

Note that this analysis assumes that the drives fail under warranty. One thing the drive vendors did to improve their margins after the floods was to reduce the length of warranties.
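
Writing Wilson's arithmetic out (these are his round numbers, including the roughly $5,000 of saved labour):

    DRIVES = 30_000
    FLEET_COST = 4_000_000        # dollars for the whole fleet
    MINUTES_PER_SWAP = 15

    def swap_hours(annual_failure_rate):
        return DRIVES * annual_failure_rate * MINUTES_PER_SWAP / 60

    print(swap_hours(0.02), swap_hours(0.01))   # 150.0 vs 75.0 hours/yr

    saving = 5_000                # Wilson's estimate for the saved labour
    print(f"{saving / FLEET_COST:.2%}")         # ~0.12% of the fleet cost

Halving the failure rate is worth roughly a tenth of one percent of the purchase price, which is why designing for failure and buying cheap wins.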

Does Kryder's Law Slowing Matter?

Figures from SDSC suggest that media cost is about 1/3 of the lifecycle cost of storage, although figures from BackBlaze suggest a much higher proportion. As a rule of thumb, the research into digital preservation costs suggests that ingesting the content costs about 1/2 the total lifecycle costs, preserving it costs about 1/3 and disseminating it costs about 1/6. So why are we worrying about a slowing of the decrease in 1/9 of the total cost?

Different technologies with different media service lives involve spending different amounts of money at different times during the life of the data. To make apples-to-apples comparisons we need to use the equivalent of Discounted Cash Flow to compute the endowment needed for the data. This is the capital sum which, deposited with the data and invested at prevailing interest rates, would be sufficient to cover all the expenditures needed to store the data for its life.

We built an economic model of the cost of long-term storage. Here it is from 15 months ago plotting the endowment needed for 3 replicas of a 117TB dataset to have a 98% chance of not running out of money over 100 years, against the Kryder rate, using costs from Backblaze. Each line represents a policy of keeping the drives for 1, 2, ... 5 years before replacing them.
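
The real model is a Monte Carlo simulation over uncertain rates, but a stripped-down deterministic sketch shows the shape of the calculation; the interest rate, running-cost fraction and replacement cycle below are illustrative assumptions, not the model's actual inputs:

    def endowment(first_media_cost, kryder_rate, interest_rate,
                  replace_every=4, horizon=100, running_frac=0.3):
        """Discounted total cost of keeping a dataset for `horizon` years:
        media are re-bought every `replace_every` years at a price falling
        by `kryder_rate` per year, plus annual running costs taken as a
        fraction of the current media price. Illustrative sketch only."""
        total = 0.0
        for year in range(horizon):
            media_price = first_media_cost * (1 - kryder_rate) ** year
            spend = running_frac * media_price
            if year % replace_every == 0:
                spend += media_price
            total += spend / (1 + interest_rate) ** year
        return total

    for rate in (0.40, 0.20, 0.10, 0.0):
        print(f"Kryder {rate:.0%}: endowment = "
              f"{endowment(1.0, rate, 0.02):.1f}x the first media purchase")

Even this toy version reproduces the qualitative behaviour of the graph: the endowment grows modestly as the rate falls from 40% to 20%, then takes off as it approaches zero.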

In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
  • The endowment needed increases sharply.
  • The endowment needed becomes harder to predict, because it depends strongly on the precise Kryder rate.
The reason to worry is that the cost of storing data for the long term depends strongly on the Kryder rate if it falls much below 20%, which it has. Everyone's storage expectations, and budgets, are based on their pre-2010 experience, and on a belief that the effect of the floods was a one-off glitch and that the industry will quickly get back to historic Kryder rates. It wasn't, and they won't.

Does Losing Stuff Matter?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
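
A sketch of that comparison under the simplest model consistent with the description (the cheaper system ingests twice as much each year and then loses 1% of everything it holds); the ten-year figure reproduces the 1.89 above:

    def preserved(years, ingest_per_year, annual_loss):
        """Content surviving after `years` of ingest and loss (toy model)."""
        held = 0.0
        for _ in range(years):
            held = (held + ingest_per_year) * (1 - annual_loss)
        return held

    reliable = preserved(10, 1.0, 0.00)   # 10.0 units
    cheap = preserved(10, 2.0, 0.01)      # ~18.9 units
    print(cheap / reliable)               # ~1.89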

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1,231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2,752nd.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.

Can We Do Better?

In the short term, the inertia of manufacturing investment means that things aren't going to change much. Bulk data is going to be on disk; it can't compete with other uses for the higher-value space on flash. But looking out to the end of the decade and beyond, we're going to be living in a world of much lower Kryder rates. What does this mean for storage system architectures?

The reason disks have a five-year service life isn't an accident of technology. Disks are engineered to have a five-year service life because, with a 40%/yr Kryder rate, it is uneconomic to keep the data on the drive for longer than 5 years. After 5 years the data will occupy only about 8% of its replacement drive.

At lower Kryder rates the media, whatever they are, will be in service longer. That means that running cost will be a larger proportion of the total cost. It will be worthwhile to spend more on purchasing the media in order to spend less on running them. Three years ago Ian Adams, Ethan Miller and I were inspired by the FAWN paper from Carnegie-Mellon to do an analysis we called DAWN: Durable Array of Wimpy Nodes. In it we showed that, despite the much higher capital cost, a storage fabric consisting of a very large number of very small nodes, each with a very low-power system-on-chip and a small amount of flash memory, would be competitive with disk.

The reason was that DAWN's running cost would be so much lower, and its economic media life so much longer, that it would repay the higher initial investment. The more the Kryder rate slows, the better our analysis looks. DAWN's better performance was a bonus. To the extent that successors to flash behave like RAM, and especially if they can be integrated with the system-on-chip, they strengthen the case further with lower costs and an even bigger performance edge.
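
A toy total-cost-of-ownership comparison illustrating the trade-off; every number below is invented for illustration and none of them comes from the DAWN paper:

    def tco(capex, annual_running_cost, media_life_years, horizon=12):
        """Cost of keeping one unit of data for `horizon` years when the
        hardware must be re-bought every `media_life_years`. Toy model."""
        replacements = -(-horizon // media_life_years)   # ceiling division
        return capex * replacements + annual_running_cost * horizon

    disk_like = tco(capex=100, annual_running_cost=25, media_life_years=5)
    dawn_like = tco(capex=250, annual_running_cost=3,  media_life_years=12)
    print(disk_like, dawn_like)   # 600 vs 286

The longer the media stay in service and the cheaper they are to run, the more the higher purchase price is repaid, which is the heart of the DAWN argument.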


Summing Up

Expectations for future storage technologies and costs were built up during three decades of extremely rapid cost per byte decrease. We are now 4 years into a period of much slower cost decrease, but expectations remain unchanged. Some haven't noticed the change; others believe it is temporary and that the industry will return to the good old days of 40%/yr Kryder rates.

Industry insiders are projecting no more than 20%/yr rates for the rest of the decade. Technological and market forces make it likely that, as usual, they are being optimistic. Lower Kryder rates greatly increase both the cost of long-term storage, and the uncertainty in estimating it.

Lower Kryder rates mean that the economic service life of media will be longer, placing more emphasis on lower running cost than on lower purchase cost. This is particularly true since bulk storage media are no longer a consumer product; businesses are better placed to make this trade-off. But they may not do so (see the work of Andrew Haldane and Richard Davies at the Bank of England, and Doyne Farmer of the Santa Fe Institute and John Geanakoplos of Yale).

The idea that archived data can live on long-latency, low-bandwidth media no longer holds. Future archival storage architectures must deliver adequate performance to sustain data-mining as well as low cost. Bundling computation into the storage medium is the way to do this.

Discussion

As usual, I was too busy answering questions to remember most of them. Here are the ones I remember, rephrased, with apologies to the questioners whose contributions slipped my memory:
  • Won't the evolution of flash technology drive its price down more quickly than disk? The problem is that the manufacturing capacity doesn't, and won't exist for flash to displace disk in the bulk storage space. Flash is a better technology than disk for many applications, so it is likely always to command a premium over disk.
  • Isn't DNA a really noisy technology to build long-term memory from? At the raw media level, all storage technologies are unpleasantly noisy. The signal processing that goes on inside your disk or flash controller is amazing. DNA has the advantage that the signal processing has a vast number of replicas to work with.
  • Doesn't experience with flash suggest that it isn't capable of storing data reliably for the long term? The way current flash controllers use the raw medium optimizes things other than data retention, such as performance (for SSDs) and low cost (for SD cards, see Bunnie Huang and xobs' talk at the Chaos Communication Congress). That doesn't mean it isn't possible, with alternate flash controller technology, to optimize for data retention.

ALA Equitable Access to Electronic Content: Reminder: Last chance to apply for Google summer fellowship

Mon, 2014-04-07 06:30

Google Fellows visit the ALA Washington Office for a luncheon last year.

The American Library Association’s Washington Office is calling for graduate students, especially those in library and information science-related academic programs, to apply for the 2014 Google Policy Fellows program. Applications are due by Monday, April 14, 2014.

For the summer of 2014, the selected fellow will spend 10 weeks in residence at the ALA Washington Office to learn about national policy and complete a major project. Google provides the $7,500 stipend for the summer, but the work agenda is determined by the ALA and the selected fellow.

The Google Washington office provides an educational program for all of the fellows, such as lunchtime talks and interactions with Google Washington staff.

The fellows work in diverse areas of information policy that may include digital copyright, e-book licenses and access, future of reading, international copyright policy, broadband deployment, telecommunications policy (including e-rate and network neutrality), digital divide, access to information, free expression, digital literacy, online privacy, the future of libraries generally, and many other topics.

Jamie Schleser, a doctoral student at American University, served as the ALA 2013 Google Policy Fellow. Schleser worked with OITP to apply her dissertation research regarding online-specific digital libraries to articulate visions and strategies for the future of libraries.

Further information about the program and host organizations is available at the Google Public Policy Fellowship website.

The post Reminder: Last chance to apply for Google summer fellowship appeared first on District Dispatch.

Murray, Peter: Rocks and Roles

Sun, 2014-04-06 17:51

Rebecca has rocks in her head and they are not coming out. This will not be a post on library technology.

Rebecca is the five-year-old daughter of Kat and Eric, friends in Cleveland that I met while I worked at Case Western Reserve University. This week Kat and Eric told Rebecca that another tumor had grown in her brain, that it could not be removed, and that their search for a drug or a technique to shrink it would probably be fruitless. It was Rebecca who knew this meant she was going to die. Earlier than any child should.

To Kat and Eric: thank you for living this chapter of your life in the open. I hope putting your thoughts and feelings on to the internet have been a helpful form of therapy. I hope the comments from around the world have been a source of buoyancy and strength to you and by extension your family.

Please know your loving response is an inspiration. I’m writing this post in part to add my voice to the chorus of support, and in part to celebrate you as a role model for parents in similar situations. You are in my thoughts, hopes and dreams.

Link to this post!

Reese, Terry: Regular Expression Recursive Replacement in MarcEdit

Sat, 2014-04-05 03:44

One of the new functions in MarcEdit is the inclusion of a Field Edit Data function.  For the most part, batch edits on field data have been handled primarily via regular expressions using the traditional replace function.  For example, if I had the following field:

=999  \\$zdata1$ydata2

And I wanted to swap the subfield order, I’d use a regular expression in the Replace function and construct the following:
Find: (=999.{4})(\$z.*[^$])(\$y.*)
Replace: $1$3$2
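
For readers who prefer to prototype outside MarcEdit, the same swap can be sketched with Python's re module; note that Python backreferences are written \1 \3 \2 where MarcEdit's replacement uses $1$3$2, and this is only an illustration, not how MarcEdit handles it internally:

    import re

    field = r"=999  \\$zdata1$ydata2"

    # Same capture groups as the Find expression above.
    swapped = re.sub(r"(=999.{4})(\$z.*[^$])(\$y.*)", r"\1\3\2", field)
    print(swapped)   # =999  \\$ydata2$zdata1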

This regular expression approach works well when you need to do advanced edits.  The problem was that any field edits that didn't fit into the Edit Subfield tool needed to be done as a regular expression.  In an effort to simplify this process, I've introduced an Edit Field Data tool.

This tool exposes the data, following the indicators, for edit.  So, in the following field:
=999  \\$aTest Data

The Edit Field Data tool could interact with the data: “$aTest Data”.  Ideally, this will potentially simplify the process of doing most field edits.  However, it also opens up the opportunity to do recursive group and replacements.

When harvesting data, subjects are often concatenated with a delimiter; for example, a single 650 field may represent multiple subjects separated by semicolons.  The new function will allow users to capture data and recursively create new fields from the groups.  So for example, if I had the following data:
=650  \7$adata – data; data – data; data – data;

And I wanted the output to look like:
=650  \7$adata – data
=650  \7$adata – data
=650  \7$adata – data

I could now use this function to achieve that result.  Using a simple regular expression, I can create a recursively matching group, and then generate new fields using the “/r” parameter.  So, to do this, I would use the following arguments:
Field: 650
Find: (data – data[; ]?)+
Replace: $$a$+/r
Check Use Regular Expressions.

 

The important part of the above expression is the replacement syntax: the mnemonic /r at the end of the replacement string tells MarcEdit that each recursive match should result in a new line.
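
Outside MarcEdit, the effect of the recursive replacement is easy to picture as "find every repeated group and emit one new field per group". A rough Python sketch of the 650 example above, purely to illustrate the behaviour (it is not MarcEdit's implementation):

    import re

    field = r"=650  \7$adata – data; data – data; data – data;"

    # One new 650 field per captured group, mirroring Replace: $$a$+/r
    groups = re.findall(r"data – data", field)
    for g in groups:
        print(r"=650  \7$a" + g)
    # prints "=650  \7$adata – data" three times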

This new function will be available for use as of 4/7/2014. 

–tr

Prom, Chris: Job Announcement – University of Illinois

Fri, 2014-04-04 21:50

I would like to take the liberty of noting something off topic: the University of Illinois at Urbana-Champaign is recruiting to fill the position of Archives and Literary Manuscript Specialist in our Rare Book and Manuscript Library. This is a new position reporting to the head of the Rare Book and Manuscript Library, and I am chairing the search committee. The position description, required/preferred qualifications, and application instructions may be found at https://jobs.illinois.edu/academic-job-board/job-details?jobID=38533.

This is an exciting opportunity to work with world-class manuscript collections in a top-tier research library. Over time, the incumbent will work not only with paper-based materials, but also with born-digital personal papers.

I am happy to answer any questions from potential candidates who are interested in applying.

Morgan, Eric Lease: What is linked data and why should I care?

Fri, 2014-04-04 20:51

“Tell me about Rome. Why should I go there?”

Linked data is a standardized process for sharing and using information on the World Wide Web. Since the process of linked data is woven into the very fabric of the way the Web operates, it is standardized and will be applicable as long as the Web is applicable. The process of linked data is domain agnostic meaning its scope is equally apropos to archives, businesses, governments, etc. Everybody can and everybody is equally invited to participate. Linked data is application independent. As long as your computer is on the Internet and knows about the World Wide Web, then it can take advantage of linked data.

Linked data is about sharing and using information (not mere data but data put into context). This information takes the form of simple “sentences” which are intended to be literally linked together to communicate knowledge. The form of linked data is similar to the forms of human language, and like human languages, linked data is expressive, nuanced, dynamic, and exact all at once. Because of its atomistic nature, linked data simultaneously simplifies and transcends previous information containers. It reduces the need for profession-specific data structures, but at the same time it does not negate their utility. This makes it easy for you to give your information away, and for you to use other people’s information.

The benefits of linked data boil down to two things: 1) it makes information more accessible to both people and computers, and 2) it opens the doors to any number of knowledge services limited only by the power of human imagination. Because it is standardized, agnostic, and independent, and because it mimics human expression, linked data is more universal than the current processes of information dissemination. Universality implies decentralization, and decentralization promotes dissemination. On the Internet anybody can say anything at any time. In the aggregate, this is a good thing and it enables information to be combined in ways yet to be imagined. Publishing information as linked data enables you to seamlessly enhance your own knowledge services as well as simultaneously enhance the knowledge of others.

“Rome is the Eternal City. After visiting Rome you will be better equipped to participate in the global conversation of the human condition.”

Prom, Chris: Selected Email Preservation Resources

Fri, 2014-04-04 20:45

This past Monday, I spoke at the Museums and the Web “Deep Dive” on email preservation.  At the session, I distributed the following handout, which is drawn largely from my Digital Preservation Coalition Tech Watch Report.  I am posting it here, in response to a request at the seminar.

Selected Email Preservation Resources

Key Readings

David Bearman,  “Managing Electronic Mail.”  Archives and Manuscripts 22/1 (1994), pp. 28–50: outlines the major social, technical and legal issues that an email preservation project must address; is particularly useful in suggesting ways that system designs can support the effective implementation of policies.

Maureen Pennock, “Curating E-Mails: A Life-cycle Approach to the Management and Preservation of E-mail Messages,” 2006: Reviews the major challenges to email preservation and summarises some prospective approaches, with particular emphasis on the need to manage email effectively during its period of creation and active use; also outlines the major conceptual approaches that can be used to preserve email, with somewhat less description of particular tools or services. http://www.dcc.ac.uk/resources/curation-reference-manual/completed-chapters/curating-e-mails

Richard Cox, “Electronic Mail and Personal Recordkeeping,” in Personal Archives and a New Archival Calling: Readings, Reflections and Ruminations. Duluth, Minnesota: Litwin Books, pp. 201–42. Reviews the history of attempts that the archival profession has made in preserving email messages and their content, suggesting that the best approaches will understand and preserve them as the organic outcome of our professional and personal lives. Cox suggests that those wishing to preserve email draw on concepts and procedures from both the records management and manuscript archives traditions, but the chapter contains relatively little direct implementation advice.

Gareth Knight, InSPECT: Investigating Significant Properties of Electronic Content 2009: A report on email migration tools, completed for the InSPECT project, includes a description and analysis of the structure of an email message, identifying 14 properties of the message header and 50 properties of the message body that must be maintained during migration if an email is to be considered authentic and complete. The report also outlines a procedure for testing whether particular email migration tools preserve those properties and applies that procedure to three specific tools.  http://www.significantproperties.org.uk/

Christopher Prom, Preserving Email, Digital Preservation Coalition Technology Watch Report: Provides a summary of social, legal, and technical challenges and opportunities for email preservation, reviews and explains internet standards and technologies for email exchange and storage, and recommends particular approaches to consider in an email preservation project. http://dx.doi.org/10.7207/twr11-01.

Useful Tools

 Glossary

Exchange Server: A proprietary application developed and licensed by Microsoft Corporation, providing server-based email, calendar, contact and task management features. Exchange servers are typically used in conjunction with Microsoft Outlook or the Outlook Express web agent. Exchange servers use a proprietary storage format and messages sent using Exchange typically include extensive changes to the header of the file. Calendar entries, contacts, and tasks are also managed via extensions to the email storage packet. Depending on local system configuration, users may be able to connect to a specific Exchange server using an IMAP-aware client application.

Internet Message Access Protocol (IMAP): A code of procedures and behaviours regulating one method by which email user agents may connect with email servers and message transfer agents, allowing an individual to view, create, transfer, manage and delete messages. Typically contrasted with the POP3 protocol, IMAP is defined in the IETF’s RFC 3501. Email clients connecting to a server using IMAP usually leave a copy of the message on the server, unless the user explicitly deletes a message or has configured the client software with rules that automatically delete messages meeting defined criteria.

Multipurpose Internet Mail Extensions (MIME): A protocol for including non-ASCII information in email messages. Specified in IETF RFC 2045, 2046, 2047, 4288, 4289 and 2049, MIME defines the precise method by which non-Latin characters, multipart bodies, attachments and inline images may be included in email messages. MIME is necessary because email supports only seven-bit, not eight-bit ASCII characters. It is also used in other communication exchange mechanisms, such as HTTP. Software such as message transfer agents, email clients, and web browsers typically include interpreters that convert MIME content to and from its native format, as needed.

PST: .pst is a file extension for local ‘personal stores’ written by the program Microsoft Outlook. PST files contain email messages and calendar entries using a proprietary but open format, and they may be found on local or networked drives of email end users. Several tools can read and migrate PST files to other formats.

Simple Mail Transfer Protocol (SMTP): A set of rules that defines how outgoing email messages are transmitted from one Mail Transfer Agent to another across the Internet, until they reach their final destination. Defined most recently in IETF RFC 5321.

Morgan, Eric Lease: Impressed with ReLoad

Fri, 2014-04-04 13:56

I’m impressed with the linked data project called ReLoad. Their data is robust, complete, and full of URIs as well as human-readable labels. From the project’s home page:

The ReLoad project (Repository for Linked open archival data) will foster experimentation with the technology and methods of linked open data for archival resources. Its goal is the creation of a web of linked archival data.
LOD-LAM, which is an acronym for Linked Open Data for Libraries, Archives and Museums, is an umbrella term for the community and active projects in this area.

The first experimental phase will make use of W3C semantic web standards, mash-up techniques, software for linking and for defining the semantics of the data in the selected databases.

The archives that have made portions of their institutions’ data and databases openly available for this project are the Central State Archive, and the Cultural Heritage Institute of Emilia Romagna Region. These will be used to test methodologies to expose the resources as linked open data.

For example, try these links:

Their data is rich enough so things like LodLive can visualize resources well:

State Library of Denmark: Sparse everything

Fri, 2014-04-04 13:08

The SOLR-5894 issue “Speed up high-cardinality facets with sparse counters” is close to being functionally complete (facet.method=fcs and facet.sort=index still pending). This post explains the different tricks used in the implementation and their impact on performance.

Baseline

Most of the different Solr faceting methods (fc & fcs; with and without doc-values; single- and multi-value) use the same overall principle for counting tag occurrences in facets:

  1. Allocate one or more counters of total size #unique_tags
  2. Fill the counters by iterating a hit queue (normally a bitmap) and getting corresponding counter indexes from a mapper
  3. Extract top-x tags with highest count by iterating all counters

There are 3 problems with this 3 step process: Allocation of a (potentially large) structure from memory, iteration of a bitmap with #total_documents entries and iteration of a counter with #unique_tags. Ideally this would be no allocation, iteration of just the IDs of the matched documents and iteration of just the tag counters that were updated. Sparse facet counting solves 2 out of the 3 problems.

Sparse

In this context sparse is seen as performance-enhancing, not space-reducing. SOLR-5894 solves the extraction time problem by keeping track of which counters are updated. With this information, the extraction process no longer needs to visit all counters.  A detailed explanation can be found at fast-faceting-with-high-cardinality-and-small-result-set. However, there are some peculiarities to sparse tracking that must be considered.

Processing overhead

The black line is Solr field faceting on a multi-valued field (3 values/document); the red line is the sparse implementation on the same field. When the result set is small, sparse processing time is markedly lower than standard, but when the result set is > 10% of all documents, it becomes slower. When the result set reaches 50%, sparse takes twice as long as standard.

This makes sense when one consider that both updating and extraction of a single counter has more processing overhead for sparse: When the number of hits rises, the accumulated overhead gets bad.

Maximum sparse size

Okay, so tracking does not make much sense past a given point. Besides, having a tracker the size of the counters themselves (100% overhead) seems a bit wasteful. Fixing the tracker size to the cross-over point is the way to go. We choose 8% here. Thanks to the beauty of the tracking mechanism, exceeding the tracker capacity does not invalidate the collected results; it just means a logical switch to non-track-mode.
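
A minimal sketch of the idea in Python (the real implementation is Java inside Solr and far more careful about memory layout): the counter remembers which slots it has touched, and silently degrades to plain counting when the tracker overflows, so the counts themselves are never invalidated.

    class SparseCounter:
        """Toy tag counter that tracks which slots were updated, so
        extraction only visits touched slots. Illustrative sketch."""
        def __init__(self, num_tags, tracker_fraction=0.08):
            self.counts = [0] * num_tags
            self.touched = []                     # tracker: updated indexes
            self.max_tracked = int(num_tags * tracker_fraction)
            self.sparse = True                    # False after overflow

        def inc(self, tag_index):
            if self.sparse and self.counts[tag_index] == 0:
                if len(self.touched) < self.max_tracked:
                    self.touched.append(tag_index)
                else:
                    self.sparse = False           # overflow: counts stay valid
            self.counts[tag_index] += 1

        def top(self, x):
            candidates = self.touched if self.sparse else range(len(self.counts))
            return sorted(candidates, key=lambda i: -self.counts[i])[:x]

    c = SparseCounter(num_tags=1_000_000)
    for doc_tags in ([3, 17], [17, 42], [3]):     # a tiny hit set
        for t in doc_tags:
            c.inc(t)
    print(c.top(2))   # e.g. [3, 17], without scanning a million counters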

No doubt about where the sparse counter switches to non-sparse mode. Note how the distance from Solr standard (black line) to sparse with tracker-overflow (red line past 8%) is near-constant: Up until 8% there is an overhead for updating the tracker. When the tracker has overflowed that overhead disappears for the rest of the counter updates, but the cycles used for tracking up to that point are wasted.

Selection strategy

So the memory overhead has been reduced to 8%, and performance for very high hit counts is better than before, but still quite a bit worse than standard Solr. If only we could foresee whether the sparse tracker would overflow.

We cannot determine with certainty whether the tracker will overflow (at least not in the general case), but we can guess. Under the assumption that the references from documents to tags are fairly uniformly distributed, we can use the hit count (which we know when we start the facet calculation) to guess whether the number of updated tag counters will exceed the tracker capacity.
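A hedged sketch of such a guess; the method and its parameters are illustrative, not actual Solr options, and the cut-off would have to be tuned per corpus:

    // Decide up front whether a sparse attempt is likely to pay off. Under a
    // roughly uniform distribution of document-to-tag references, the fraction
    // of documents hit is a reasonable proxy for the fraction of counters touched.
    static boolean attemptSparse(long hitCount, long maxDoc, double cutoffFraction) {
        return (double) hitCount / maxDoc <= cutoffFraction;   // e.g. cutoffFraction around 0.08
    }

If the guess says no, the code simply falls back to standard (non-tracked) counting from the start, avoiding the wasted tracking overhead.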

The chart demonstrates how a bad guess of the result size affects performance. The conservative guess (red line) means that many of the faceting calls fall back to standard Solr and the sparse speed-up is wasted. The optimistic guess (cyan line) carries a higher risk of failed sparse attempts, which means bad performance. In this example, the bad guess occurred around 10%. Still, even with such hiccups, the overall positive effect of using sparse counting is clear.

Good guessing

The best cut-off point for attempting sparse versus non-sparse counting depends on the corpus and the searches, as well as on the willingness to risk increased response times. Properly tuned, and with a corpus without extreme outliers (such as a single very popular document referencing 10% of all tags), the result is very satisfying.

For the low price of 8% memory overhead we get much better performance for small result sets and no penalty for larger result sets (under the assumption of correct guessing).

Counter allocation

A little instrumentation makes it clear that it is by no means free to allocate a new counter structure for each facet call and throw it away after use. In the example above, 5M * 3 * 4 bytes = 60MB are used for a counter. With a 2GB heap and an otherwise untweaked run of Solr’s start.jar, the average time needed to allocate the 60MB was 13ms!

An alternative strategy is to keep a pool of counters and re-use them. This means that counters must be cleared after use, but that is markedly faster than allocating new ones. Furthermore, the clearing can be done by a background thread, so that the client gets the response immediately after the extraction phase. With this enabled, the picture gets even better.
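One possible shape of such a pool, sketched under the assumption of a fixed number of pre-allocated counters; this is not the actual SOLR-5894 implementation:

    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Counters are borrowed from a pool and cleared by a background thread after
    // use, so the caller pays for neither the large allocation nor the clearing.
    final class CounterPool {
        private final BlockingQueue<int[]> free;
        private final ExecutorService cleaner = Executors.newSingleThreadExecutor();

        CounterPool(int poolSize, int uniqueTags) {
            free = new ArrayBlockingQueue<>(poolSize);
            for (int i = 0; i < poolSize; i++) {
                free.add(new int[uniqueTags]);   // allocated once, up front
            }
        }

        int[] acquire() throws InterruptedException {
            return free.take();                  // blocks if every counter is in use
        }

        void release(int[] counter) {
            cleaner.execute(() -> {              // clear off the request thread, then recycle
                Arrays.fill(counter, 0);
                free.add(counter);
            });
        }
    }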

For very small result sets there is virtually no performance penalty for faceting.


Miedema, John: Pier Gerlofs Donia, “Grutte Pier”: Legendary warrior, video game hero, my ancestor?

Fri, 2014-04-04 12:35

Pier Gerlofs Donia was a sixteenth century warrior in Friesland, best known as Grutte Pier (Big Pier).

A tower of a fellow as strong as an ox, of dark complexion, broad shouldered, with a long black beard and moustache. A natural rough humorist, who through unfortunate circumstances was recast into an awful brute. Out of personal revenge for the bloody injustice that befell him (in 1515) with the killing of kinsfolk and destruction of his property he became a freedom fighter of legendary standing. (Pier Gerlofs Donia).

Grutte Pier just might be an ancestor. He fought for his home in Friesland, the northern Dutch province where my folks came from. I visited Friesland in 2000. It is a pastoral province, with rolling fields and cows and churches, a lot like Prince Edward Island in Canada. It is natural that Pier began his life as a farmer. His home was destroyed and his family killed by the Black Band, a violent military regiment. Pier led a rebellion. “Leaver Dea as Slaef” (rather dead than slave) (Battle of Warns).  Pier was legendary for his height and strength, wielding a massive long sword that could take down many enemies with a single stroke.

Braveheart. Gladiator. It is time to tell the story of Grutte Pier. Cross of the Dutchman is an upcoming video game by Triangle. Amused by my remote ancestral connection and sharing an interest in warrior sports, I made a small donation to the project. My Friesian name, Miedema, will appear in the end credits.