news aggregator

Summers, Ed: Where Brooklyn At?

planet code4lib - Tue, 2014-04-08 19:41

As a follow up to my last post I added a script to my fork of Aaron’s py-flarchive that will load up a Redis instance with comments, notes, tags and sets for Flickr images that were uploaded by Brooklyn Museum. The script assumes you’ve got a snapshot of the archived metadata, which I downloaded as a tarball. It took several hours to unpack the tarball on a medium ec2 instance; so if you want to play around and just want the redis database let me know and I’ll get it to you.

Once I loaded up Redis I was able to generate some high level stats:

  • images: 5,697
  • authors: 4,617
  • tags: 6,132
  • machine tags: 933
  • comments: 7,353
  • notes: 963
  • sets: 141
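For readers curious how counts like these can fall out of archived metadata, here is a minimal sketch that uses plain Python sets where the script would use Redis. The record layout (dicts with `kind`, `author` and `image` keys) and the sample values are assumptions for illustration, not the format py-flarchive actually writes.

```python
from collections import Counter

# Hypothetical annotation records; the real archive's layout may differ.
annotations = [
    {"kind": "tag", "author": "user-a", "image": "img-1"},
    {"kind": "comment", "author": "user-b", "image": "img-1"},
    {"kind": "note", "author": "user-a", "image": "img-2"},
    {"kind": "comment", "author": "user-c", "image": "img-2"},
]


def summarize(records):
    """Count annotations per kind, plus unique authors and annotated images."""
    stats = Counter(record["kind"] for record in records)
    # A set keeps each author once, however many annotations they added.
    stats["authors"] = len({record["author"] for record in records})
    stats["annotated_images"] = len({record["image"] for record in records})
    return dict(stats)


summary = summarize(annotations)
```

Note that this only sees images that received at least one annotation; a full image count would come from the image records themselves.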

Given how many images there were, that is an astonishing number of authors: unique people who added tags, comments or notes. If you are curious, I generated a list of the tags and saved them as a Google Doc. The machine tags were particularly interesting to me. The majority (849) of them look like Brooklyn Museum IDs of some kind, for example:

bm:unique=S10_08_Thebes/9928

But there were also 51 geotags, and what looks like 23 links to items in Pleiades, for example:

tag:pleiades:depicts=721417202

If I had to guess I’d say this particular machine tag indicated that the Brooklyn Museum image depicted Abu Simbel. Now there weren’t tons of these machine tags but it’s important to remember that other people use Flickr as a scratch space for annotating images this way.
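Flickr machine tags follow a namespace:predicate=value pattern, so the kinds of machine tags above can be tallied by splitting each raw tag. The sketch below is one way to do that. Treating everything before the last colon as the namespace (so that `tag:pleiades:depicts` yields the namespace `tag:pleiades`) is an assumption, and the function and variable names are made up for illustration.

```python
from collections import Counter
from typing import NamedTuple, Optional


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


def parse_machine_tag(raw: str) -> Optional[MachineTag]:
    """Split 'namespace:predicate=value'; return None for ordinary tags."""
    key, sep, value = raw.partition("=")
    if not sep or ":" not in key:
        return None
    # Assumption: the predicate is whatever follows the last colon.
    namespace, _, predicate = key.rpartition(":")
    if not namespace or not predicate:
        return None
    return MachineTag(namespace, predicate, value)


raw_tags = [
    "bm:unique=S10_08_Thebes/9928",
    "tag:pleiades:depicts=721417202",
    "egypt",  # an ordinary tag, skipped below
]
parsed = [tag for tag in map(parse_machine_tag, raw_tags) if tag is not None]
by_namespace = Counter(tag.namespace for tag in parsed)
```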

If you aren’t familiar with them, Flickr notes are annotations of an image, where the user has attached a textual note to a region in the image. Just eyeballing the list, it appears that there is quite a bit of diversity in them, ranging from the whimsical:

  • cool! they look soo surreal
  • teehee somebody wrote some graffiti in greek
  • Lol are these painted?
  • Steaks are ready!

to the seemingly useful:

  • Hunter’s Island
  • Ramesses III Temple
  • Lapland Village
  • Lake Michigan
  • Montuemhat Crypt
  • Napoleon’s troops are often accused of destroying the nose, but they are not the culprits. The nose was already gone during the 18th century.

Similarly the general comments run the gamut from:

  • very nostalgic…
  • always wanted to visit Egypt

to:

  • Just a few points. This is not ‘East Jordan’ it is in the Hauran region of southern Syria. Second it is not Qarawat (I guess you meant Qanawat) but Suweida. Third there is no mention that the house is enveloped by the colonnade of a Roman peripteral temple.
  • The fire that destroyed the buildings was almost certainly arson. it occurred at the height of the Pullman strike and at the time, rightly or wrongly, the strikers were blamed.
  • You can see in the background, the TROCADERO with two towers .. This “medieval city” was built on the right bank where are now buildings in modern art style erected for the exposition of 1937.

Brooklyn Museum pulled 48 tags over from Flickr before they deleted the account. That’s just 0.7% of the tags that were there. None of the comments or notes were moved over.

In the data that Aaron archived there was one indicator of user engagement: the datetime included with comments. Combined with the upload time for the images it was possible to create a spreadsheet that correlates the number of comments with the number of uploads per month:

I’m guessing the drop off in December of 2013 is due to that being the last time Aaron archived Brooklyn Museum’s metadata. You can see that there was a decline in user engagement: the peak in late 2008 / early 2009 was never matched again. I was half expecting to see that user engagement fell off when Brooklyn Museum’s interest in the platform (uploads) fell off. But you can see that they continued to push content to Flickr, without seeing much of a reward, at least in the shape of comments. It’s impossible now to tell if tagging, notes or sets trended differently.
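The monthly comparison described above boils down to bucketing two lists of timestamps by year and month. A minimal sketch, assuming the timestamps are Unix epoch seconds; the sample values and names are made up for illustration and are not taken from the archive.

```python
from collections import Counter
from datetime import datetime, timezone


def month_key(epoch_seconds: int) -> str:
    """Return the 'YYYY-MM' bucket (in UTC) for a Unix timestamp."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m")


def monthly_counts(upload_times, comment_times):
    """Return (month, uploads, comments) rows sorted by month."""
    uploads = Counter(month_key(t) for t in upload_times)
    comments = Counter(month_key(t) for t in comment_times)
    months = sorted(set(uploads) | set(comments))
    # A Counter returns 0 for months that only appear in the other series.
    return [(month, uploads[month], comments[month]) for month in months]


# Hypothetical sample data: two uploads and three comments.
upload_times = [1228089600, 1230768000]
comment_times = [1228176000, 1228262400, 1230854400]
rows = monthly_counts(upload_times, comment_times)
```

Each row can then be written out as a spreadsheet line, one month per row.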

Since Flickr includes the number of times each image was viewed, it’s possible to add up the views across all the images. The answer?

9,193,331

Not a bad run for 5,697 images. I don’t know if Brooklyn Museum downloaded their metadata prior to removing their account. But luckily Aaron did.

Morgan, Eric Lease: The 3D Printing Working Group is maturing, complete with a shiny new mailing list

planet code4lib - Tue, 2014-04-08 19:28

A couple of weeks ago Kevin Phaup took the lead in facilitating a 3D printing workshop here in the Libraries’ Center For Digital Scholarship. More than a dozen students from across the University participated. Kevin presented them with an overview of 3D printing, pointed them towards an online 3D image editing application (Shapeshifter), and everybody created various objects which Matt Sisk has been diligently printing. The event was deemed a success, and there will probably be more specialized workshops scheduled for the Fall.

Since the last blog posting there has also been another Working Group meeting. A short dozen of us got together in Stinson-Remick where we discussed the future possibilities for the Group. The consensus was to create a more formal mailing list, maybe create a directory of people with 3D printing interests, and see about doing something more substantial, with a purpose, for the University.

To those ends, a mailing list has been created. Its name is 3D Printing Working Group. The list is open to anybody, and its purpose is to facilitate discussion of all things 3D printing around Notre Dame and the region. To subscribe, address an email message to listserv@listserv.nd.edu, and in the body of the message include the following command:

subscribe nd-3d-printing Your Name

where Your Name is… your name.

Finally, the next meeting of the Working Group has been scheduled for Wednesday, May 14. It will be sponsored by Bob Sutton of Springboard Technologies, held in Innovation Park across from the University, and run from 11:30 to 1 o’clock. I’m pretty sure lunch will be provided. The purpose of the meeting will be to continue outlining the future directions of the Group as well as to see a demonstration of a printer called the Isis3D.

Tennant, Roy: Being a Savvy Social Media User

planet code4lib - Tue, 2014-04-08 16:25

Recently my colleague Karen Smith-Yoshimura noted a blog post that demonstrates effective traits for using social media on behalf of an organization. Titled “Social Change”, the post documents the choices that Brooklyn Museum staff made recently to pare down their social media participation to venues that they find most effective. As they put it:

There comes a moment in every trajectory where one has to change course.  As part of a social media strategic plan, we are changing gears a bit to deploy an engagement strategy which focuses on our in-building audience, closely examines which channels are working for us, and aligns our energies in places where we feel our voice is needed, but allows for us to pull away where things are happening on their own.

This clearly indicates that it doesn’t make a lot of sense to simply get an account on every social media site out there and let’er rip. For one thing, it is highly unlikely that your organization has the bandwidth to engage effectively on every platform. For another, without the ability to engage effectively, it’s best to not even attempt it. Having a moribund presence on a social platform is worse than having no presence at all.

Therefore, being a savvy social media user means consciously reviewing your social media use periodically to:

  • Identify venues that are no longer useful to you and either shut down the account or put it on ice.
  • Identify venues that you find useful and maintain or increase your use of those venues.
  • Consider whether the nature of your engagement should change. For example, should you use more pictures to make your posts more engaging? Should you craft messages that are more intriguing than informative, thus potentially increasing visits to your site?

Kudos to the Brooklyn Museum for doing this right. Read the post, and understand what it means to be a thoughtful social media user. We should all be so savvy.

 

Image courtesy of Brantley Davidson, Creative Commons Attribution 2.0 Generic License

Library Hackers Unite blog: OpenSSL Vulnerability

planet code4lib - Tue, 2014-04-08 15:51

SSL certificates can be compromised using a new vulnerability that shipped on currently supported versions of Debian, Ubuntu, CentOS, Fedora, the BSDs, etc.

Time to update your servers, regenerate certs and, if you are being rigorous about it, go through the certificate revocation process for your old ones. BUT, be careful that you have OpenSSL 1.0.1g available (or newer, should there be one). Versions previous to 1.0.1 are NOT vulnerable to heartbleed. Though many of these old versions are vulnerable to other bugs, you would not want to update from 1.0.0 for the sole purpose of avoiding heartbleed if you are only going to land on 1.0.1e, thereby introducing the problem.
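The version reasoning above can be written down as a small check. This is only a sketch: it looks at the upstream version string alone (1.0.1 through 1.0.1f affected, 1.0.1g fixed, anything before 1.0.1 unaffected), and distributions often backport fixes without changing that string, so treat the result as a hint rather than proof. The function name is made up for illustration.

```python
import re

# Letter suffixes of the 1.0.1 series affected by Heartbleed:
# the initial 1.0.1 release (no suffix) through 1.0.1f.
AFFECTED_SUFFIXES = {"", "a", "b", "c", "d", "e", "f"}


def looks_heartbleed_vulnerable(version: str) -> bool:
    """Guess from an upstream OpenSSL version string such as '1.0.1e'."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]*)", version.strip())
    if match is None:
        # Beta or distro-specific strings are out of scope for this sketch.
        raise ValueError(f"unrecognized OpenSSL version: {version!r}")
    numbers = tuple(int(match.group(i)) for i in (1, 2, 3))
    suffix = match.group(4)
    return numbers == (1, 0, 1) and suffix in AFFECTED_SUFFIXES


for candidate in ("1.0.0", "1.0.1e", "1.0.1g"):
    print(candidate, looks_heartbleed_vulnerable(candidate))
```

The version string itself can be read from `openssl version` on the command line or from `ssl.OPENSSL_VERSION` in Python.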

Considering the widespread deployment of OpenSSL, it is hard to overstate how common this bug is online.

Miedema, John: Cognitive technologies can eliminate the silly amount of time we spend sifting through search results. ‘Whatson’ success criteria revisited.

planet code4lib - Tue, 2014-04-08 03:21

My first build of ‘Whatson’ left me wanting. I felt I needed to better define how cognitive technology differed from good-old-fashioned-search, like Google. On one level, cognitive technology is, well, more mental. It uses more than keyword matching and regular expressions; but then so does Google. It uses language analysis; so does Google. It succeeds using very large unstructured data sets. So too Google. So what distinguishes cognitive technology like Watson, and must be wired into the bone of my Whatson?

Some of my posts pointed to the deeper feature I was struggling to find. An early post asked, “Natural Language Processing — Can we do away with Unique Identifiers?” Another asked, “You have just been beamed aboard the starship Enterprise. You can ask one question of the ship’s computer. What would it be?” A more conclusive post was entitled, “Good-bye database design and unique identifiers. Strong NLP and the singularity of Watson.”

I benefited from reading Final Jeopardy: Man vs. Machine and the Quest to Know Everything, by Stephen Baker. The difference between search and cognitive technology is the difference between a set of search results and a single correct answer, between looking and finding, seeking versus knowing. Google provides a “vague pointer” to the answer. Watson provides a single, precise answer. Many versus one.

There is rarely one right answer to a question. The essence of critical thinking is the ability to find other ways of thinking about a problem. Google stacks up a list of results and assigns a confidence level to each one. So do cognitive technologies. Unlike Google, cognitive technologies have to be good enough that the top answer is right most of the time. Watson made its public debut playing the game of Jeopardy. Part of its smarts was knowing when to pass a turn, but it had to be able to answer quickly and correctly most of the time or it would lose the game. Cognitive technology raises the bar. It must use more sophisticated language analysis to really understand a human question. It has to be better at pattern recognition. It must employ more thoughtful decision making and follow a big picture strategy.

We have become so used to Google that we are content with a list of search results. What would it be like if we could answer a question on the first try? Would that be it? Done? Not quite. A game can have one right answer, but not the real world. What cognitive technologies can do is eliminate the silly amount of time we spend sifting through search results. We could ask a question and get a satisfactory answer, and then, just like a dialog with a person, we would ask another question. Beautiful.

Ronallo, Jason: Questions Asked During the Presentation Websockets For Real-time And Interactive Interfaces At Code4lib 2014

planet code4lib - Mon, 2014-04-07 23:30

During my presentation on WebSockets, there were a couple points where folks in the audience could enter text in an input field that would then show up on a slide. The data was sent to the slides via WebSockets. It is not often that you get a chance to incorporate the technology that you’re talking about directly into how the presentation is given, so it was a lot of fun. At the end of the presentation, I allowed folks to anonymously submit questions directly to the HTML slides via WebSockets.

I ran out of time before I could answer all of the questions that I saw. I’ll try to answer them now.

Questions From Slides

In the YouTube video you can see the following questions come in at the end of my presentation (at 1h38m26s). ([Full presentation starts here](https://www.youtube.com/watch?v=_8MJATYsqbY&feature=share&t=1h25m37s).) Some lines that came in were not questions at all. For those that are really questions, I’ll answer them now, even if I already answered them.

Are you a trained dancer?

No. Before my presentation I was joking with folks about how little of a presentation I’d have, at least for the interactive bits, if the wireless didn’t work well enough. Tim Shearer suggested I just do an interpretive dance in that eventuality. Luckily it didn’t come to that.

When is the dance?

There was no dance. Initially I thought the dance might happen later, but it didn’t. OK, I’ll admit it, I was never going to dance.

Did you have any efficiency problems with the big images and chrome?

On the big video walls in Hunt Library we often use Web technologies to create the content and Chrome for displaying it on the wall. For the most part we don’t have issues with big images or lots of images on the wall. But there’s a bit of a trick happening here. For instance, when we display images for My #HuntLibrary on the wall, they’re just images from Instagram, so only 600x600px. We initially didn’t know how these would look blown up on the video wall, but they end up looking fantastic. So you don’t necessarily need super high resolution images to make a very nice looking display.

Upstairs on the Visualization Wall, I display some digitized special collections images. While the possible resolution on the display is higher, the current effective resolution is only about 202px wide for each MicroTile. The largest image is then only 404px wide. In this case we are also using a Djatoka image server to deliver the images. Djatoka has an issue with the quality of its scaling between quality levels, where the algorithm chosen can make the images look very poor. How I usually work around this is to pick the quality level that is just above the width required to fit whatever design. Then the browser scales the image down and does a better job making it look OK than the image server would. I don’t know which of these factors affects the look on the Visualization Wall the most, but some images have a stair stepping look on some lines. This especially affects line drawings with diagonal lines, while photographs can look totally acceptable. We’ll keep looking for how to improve the look of images on these walls, especially in the browser.
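The workaround above can be sketched as follows, assuming each level halves the width of the one above it (as JPEG 2000 resolution levels do) and numbering levels from 0 (smallest) up to the full-size level. The function names and the 3232px example width are hypothetical, not values from our image server.

```python
import math
from typing import List


def level_widths(full_width: int, top_level: int) -> List[int]:
    """Widths for levels 0..top_level, halving at each step below the top."""
    return [
        math.ceil(full_width / 2 ** (top_level - level))
        for level in range(top_level + 1)
    ]


def pick_level(full_width: int, top_level: int, target_width: int) -> int:
    """Return the smallest level whose width is at least target_width."""
    for level, width in enumerate(level_widths(full_width, top_level)):
        if width >= target_width:
            return level
    # The target is wider than the source image: fall back to full size.
    return top_level


# Hypothetical 3232px-wide source with levels 0 through 5.
widths = level_widths(3232, 5)
level_for_wall = pick_level(3232, 5, 404)
```

The image is requested at the chosen level and the browser does the final downscale to the exact width the design needs.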

Have you got next act after Wikipedia?

This question is referring to the adaptation of Listen to Wikipedia for the Immersion Theater. You can see video of what this looks like on the big Hunt Library Immersion Theater wall.

I don’t currently have solid plans for developing other content for any of the walls. Some of the work that I and others in the Libraries have done early on has been to help see what’s possible in these spaces and begin to form the cow paths for others to produce content more easily. We answered some big questions. Can we deliver content through the browser? What templates can we create to make this work easier? I think the next act is really for the NCSU Libraries to help more students and researchers to publish and promote their work through these spaces.

Is it lunchtime yet?

In some time zone somewhere, yes. Hopefully during the conference lunch came soon enough for you and was delicious and filling.

Could you describe how testing worked more?

I wish I could think of some good way to test applications that are destined for these kinds of large displays. There’s really no automated testing that is going to help here. BrowserStack doesn’t have a big video wall that they can take screenshots on. I’ve also thought that it’d be nice to have a webcam trained on the walls so that I could make tweaks from a distance.

But Chrome does have its screen emulation developer tools which were super helpful for this kind of work. These kinds of tools are useful not just for mobile development, which is how they’re usually promoted, but for designing for very large displays as well. Even on my small workstation monitor I could get a close enough approximation of what something would look like on the wall. Chrome will shrink the content to fit to the available viewport size. I could develop for the exact dimensions of the wall while seeing all of the content shrunk down to fit my desktop. This meant that I could develop and get close enough before trying it out on the wall itself. Being able to design in the browser has huge advantages for this kind of work.

I work at DH Hill Library while these displays are in Hunt Library. I don’t get over there all that often, so I would schedule some time to see the content on the walls when I happened to be over there for a meeting. This meant that there’d often be a lag of a week or two before I could get over there. This was acceptable as this wasn’t the primary project I was working on.

By the time I saw it on the wall, though, we were really just making tweaks for design purposes. We wanted the panels to the left and right of the Listen to Wikipedia visualization to fall along the bezel. We would adjust font sizes for how they felt once you’re in the space. The initial, rough cut work of modifying the design to work in the space was easy, but getting the details just right required several rounds of tweaks and testing. Sometimes I’d ask someone over at Hunt to take a picture with their phone to ensure I’d fixed an issue.

While it would have been possible for me to bring my laptop and sit in front of the wall to work, I personally didn’t find that to work well for me. I can see how it could work to make development much faster, though, and it is possible to work this way.

Race condition issues between devices?

Some spaces could allow you to control a wall from a kiosk and completely avoid any possibility of a race condition. When you allow users to bring their own device as a remote control to your spaces you have some options. You could allow the first remote to connect and lock everyone else out for a period of time. Because of how subscriptions and presence notifications work this would certainly be possible to do.
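The lock-out option could look something like the sketch below: the first remote to ask gets control, and everyone else is refused until the hold period runs out. The class name, method name and client ids are made up for illustration, and the current time is passed in explicitly so the behavior is easy to follow; a real server might use `time.monotonic()`.

```python
class RemoteLock:
    """Grant control of the wall to one remote at a time for a fixed period."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, client_id: str, now: float) -> bool:
        """Return True if client_id may control the wall at time `now`."""
        if self.owner is None or now >= self.expires_at:
            # Lock is free or has expired: hand it to whoever asked.
            self.owner = client_id
            self.expires_at = now + self.hold_seconds
            return True
        # Lock is held: only the current owner gets through.
        return self.owner == client_id


lock = RemoteLock(hold_seconds=30)
first = lock.try_acquire("phone-a", now=0)    # phone-a takes the lock
second = lock.try_acquire("phone-b", now=10)  # refused, phone-a still holds it
third = lock.try_acquire("phone-b", now=30)   # hold period over, phone-b takes it
```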

For Listen to Wikipedia we allow more than one user to control the wall at the same time. Then we use WebSockets to try to keep multiple clients in sync. Even though we attempt to quickly update all the clients, it is certainly possible that there could be race conditions, though it seems unlikely. Because we’re not dealing with persisting data, I don’t really worry about it too much. If one remote submits just after another but before it is synced, then the wall will reflect the last to submit. That’s perfectly acceptable in this case. If a client were to get out of sync with what is on the wall, then any change by that client would just be sent to the wall as is. There’s no attempt to make sure a client had the most recent, freshest version of the data prior to submitting.
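The last-to-submit-wins behavior described above can be modeled in a few lines. This is a simplified in-memory sketch rather than the actual Listen to Wikipedia code: direct attribute assignment stands in for the WebSocket broadcast, and the class names and the `volume` setting are hypothetical.

```python
class Remote:
    """A remote control that keeps a local copy of the wall's settings."""

    def __init__(self, name: str):
        self.name = name
        self.view = {}


class Wall:
    """Holds the displayed settings and pushes every change to all remotes."""

    def __init__(self, settings: dict):
        self.settings = dict(settings)
        self.remotes = []

    def connect(self, remote: Remote) -> None:
        self.remotes.append(remote)
        remote.view = dict(self.settings)

    def submit(self, changes: dict) -> None:
        # No freshness check: whatever arrives last overwrites earlier values.
        self.settings.update(changes)
        for remote in self.remotes:
            # Stand-in for broadcasting the new state over WebSockets.
            remote.view = dict(self.settings)


wall = Wall({"volume": 5})
remote_a, remote_b = Remote("a"), Remote("b")
wall.connect(remote_a)
wall.connect(remote_b)

wall.submit({"volume": 7})  # sent by remote a
wall.submit({"volume": 2})  # sent by remote b just afterwards; this one wins
```

Because nothing is persisted, the cost of losing the earlier submission is only that the wall briefly showed a value one user did not expect.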

While this could be an issue for other use cases, it does not adversely affect the experience here. We do an alright job keeping the clients in sync, but don’t shoot for perfection.
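The "last to submit wins" behavior can be sketched in a few lines. This is an in-memory stand-in, assuming a hypothetical `WallState` class and a made-up `language` setting; the real application pushes updates over WebSockets rather than copying dictionaries.

```python
class WallState:
    """Last-write-wins state shared by a wall and its remotes (illustrative)."""

    def __init__(self, settings=None):
        self.settings = dict(settings or {})
        self.clients = {}  # client_id -> the state that client last received

    def connect(self, client_id):
        # A newly connected remote starts with the wall's current state.
        self.clients[client_id] = dict(self.settings)

    def submit(self, client_id, changes):
        # No freshness check: whatever the client sends is applied as is,
        # even if that client's own copy of the state was stale.
        self.settings.update(changes)
        self._broadcast()

    def _broadcast(self):
        # Stand-in for pushing the new state to every connected client.
        for client_id in self.clients:
            self.clients[client_id] = dict(self.settings)


wall = WallState({"language": "en"})
wall.connect("remote-1")
wall.connect("remote-2")
wall.submit("remote-1", {"language": "fr"})
wall.submit("remote-2", {"language": "de"})  # submitted last, so it wins
```

Nothing here is persisted, which is why skipping the freshness check is tolerable; for data that must survive, you would want a version check before applying a change.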

How did you find the time to work on this?

At the time I worked on these I had at least a couple other projects going. When waiting for someone else to finish something before being able to make more progress or on a Friday afternoon, I’d take a look at one of these projects for a little while. It meant the progress was slow, but these also weren’t projects that anyone was asking to be delivered on a deadline. I like to have a couple projects of this nature around. If I’ve got a little time, say before a meeting, but not enough for something else, I can pull one of these projects out.

I wonder, though, if this question isn’t more about why I did these projects. There were multiple motivations. A big motivation was to learn more about WebSockets and how the technology could be applied in the library context. I always like to have a reason to learn new technologies, especially Web technologies, and see how to apply them to other types of applications. And now that I know more about WebSockets I can see other ways to improve the performance and experience of other applications in ways that might not be as overt in their use of the technology as these projects were.

The real-time digital collections view is integrated into an application I’ve developed, and it did not take much to begin adding in some new functionality. We do a great deal of business analytic tracking for this application. The site has excellent SEO for the kind of content we have. I wanted to explore other types of metrics of our success.

The video wall projects allowed us to explore several different questions. What does it take to develop Web content for them? What kinds of tools can we make available for others to develop content? What should the interaction model be? What messaging is most effective? How should we kick off an interaction? Is it possible to develop bring-your-own-device interactions? Answering these kinds of questions will help us to make better use of these kinds of spaces.

Speed of an unladen swallow?

I think you’d be better off asking a scientist or a British comedy troupe.

Questions From Twitter

Mia (@mia_out) tweeted at 11:47 AM on Tue, Mar 25, 2014
@ostephens @ronallo out of curiosity, how many interactions compared to visitor numbers? And in-app or relying on phone reader?

sebchan (@sebchan) tweeted at 0:06 PM on Tue, Mar 25, 2014
@ostephens @ronallo (but) what are the other options for ‘interacting’?

This question was in response to how 80% of the interactions with the Listen to Wikipedia application are via QR code. We placed a URL and QR code on the wall for Listen to Wikipedia not knowing which would get the most use.

Unfortunately there’s no simple way I know of to kick off an interaction in these spaces when the user brings their own device. Once when there was a stable exhibit for a week we used a kiosk iPad to control a wall so that the visitor did not need to bring a device. We are considering how a kiosk tablet could be used more generally for this purpose. In cases where the visitor brings their own device it is more complicated. The visitor either must enter a URL or scan a QR code. We try to make the URLs short, but because we wanted to use some simple token authentication they’re at least 4 characters longer than they might otherwise be. I’ve considered using geolocation services as the authentication method, but they are not as exact as we might want them to be for this purpose, especially if the device uses campus wireless rather than GPS. We also did not want to have a further hurdle of asking for permission of the user and potentially being rejected. For the QR code the visitor must have a QR code reader already on their device. The QR code includes the changing token. Using either the URL or QR code sends the visitor to a page in their browser.
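One way to get a short, changing token like this is to derive it from a secret and the current time window. The sketch below is an assumption about how that could be done, not a description of the scheme used on the walls: the function names, the five-minute `period` and the four-character length are all made up for illustration, and a token this short only deters casual misuse.

```python
import base64
import hashlib
import hmac


def current_window(now, period=300):
    """Number of whole periods (in seconds) elapsed at time `now`."""
    return int(now // period)


def token_for_window(secret, window, length=4):
    """Derive a short lowercase token for one time window."""
    digest = hmac.new(secret, str(window).encode("ascii"), hashlib.sha256).digest()
    return base64.b32encode(digest).decode("ascii")[:length].lower()


def is_valid(secret, token, now, period=300, length=4):
    """Accept the token for the current window or the one just before it."""
    window = current_window(now, period)
    for candidate_window in (window, window - 1):
        expected = token_for_window(secret, candidate_window, length)
        # Compare as bytes so unusual user input cannot raise an error.
        if hmac.compare_digest(token.encode("utf-8"), expected.encode("ascii")):
            return True
    return False
```

The wall would render `token_for_window(...)` into both the short URL and the QR code, and the server would call `is_valid(...)` when a remote connects. Accepting the previous window gives a visitor who scanned the code just before it changed a little grace.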

Because the walls I’ve placed content on are in public spaces there is no good way to know how many visitors there are compared to the number of interactions. One interesting thing about the Immersion Theater is that I’ll often see folks standing outside of the opening to the space looking in, so even if there were some way to track folks going in and out of the space, that would not include everyone who has viewed the content.

Other Questions

If you have other questions about anything in my presentation, please feel free to ask. (If you submit them through the slides I won’t ever see them, so better to email or tweet at me.)

ALA Equitable Access to Electronic Content: Two billion for E-rate provides “2-for-1” benefits

planet code4lib - Mon, 2014-04-07 22:04

Today, the American Library Association (ALA) called on the Federal Communications Commission (FCC) to deploy newly identified E-rate program funding to boost library broadband access and alleviate historic shortfalls in funding for internal connections. In response to the FCC’s March Public Notice, the ALA seeks to leverage existing high-speed, scalable networks to increase library broadband speeds, improve area networks and further explore cost efficiencies that could be enabled through new consortium approaches.

ALA proposes:

  • Supporting school-library wide-area network partnerships to better leverage local E-rate investments and support community use of high-capacity connections during non-school hours;
  • Providing short-term funding focused on deployment where libraries are in close proximity to providers that can ensure scalable broadband at affordable construction charges and recurring costs over time; and
  • Advancing cost-efficient library network development with new diagnostic and technical support provided at the state level.

“ALA welcomes this new $2 billion investment to support broadband networks in our nation’s libraries and schools so we may meet growing community demand for services ranging from interactive online learning to videoconferencing to downloading and streaming increasingly digital collections,” said ALA President Barbara Stripling. “This infusion can provide ‘two-for-one’ benefits by advancing library broadband to and within our buildings immediately and continuing to improve the E-rate program in the near future.”

Read the ALA press release

The post Two billion for E-rate provides “2-for-1” benefits appeared first on District Dispatch.

Engard, Nicole: Bookmarks for April 7, 2014

planet code4lib - Mon, 2014-04-07 20:30

Today I found the following resources and bookmarked them:

  • Lubuntu: Lubuntu is a fast and lightweight operating system developed by a community of Free and Open Source enthusiasts. The core of the system is based on Linux and Ubuntu.
  • Lubuntu XP: three flavors of XP themes for Lubuntu to help people transition to Linux.

Digest powered by RSS Digest

The post Bookmarks for April 7, 2014 appeared first on What I Learned Today....

Related posts:

  1. Can you say Kebberfegg 3 times fast
  2. What’s new in Ubuntu?
  3. Amazon’s bestselling laptop is open source!

Summers, Ed: Glass Houses

planet code4lib - Mon, 2014-04-07 16:29

You may have noticed Brooklyn Museum’s recent announcement that they have pulled out of Flickr Commons. Apparently they’ve seen a “steady decline in engagement level” on Flickr, and decided to remove their content from that platform, so they can focus on their own website as well as Wikimedia Commons.

Brooklyn Museum announced three years ago that they would be cross-posting their content to Internet Archive and Wikimedia Commons. Perhaps I’m not seeing their current bot, but they appear to have two, neither of which has done an upload since March of 2011, based on their user activity. It’s kind of ironic that content like this was uploaded to Wikimedia Commons by Flickr Uploader Bot and not by one of their own bots.

The announcement stirred up a fair bit of discussion about how an institution devoted to the preservation and curation of cultural heritage material could delete all the curation that has happened at Flickr. The concern is that all the comments, tagging and annotation that have happened on Flickr have not been migrated to Wikimedia Commons. I’m not even sure if there’s a place where this structured data could live at Wikimedia Commons. Perhaps some sort of template could be created, or it could live in Wikidata?

Fortunately, Aaron Straup-Cope has a backup copy of Flickr Commons metadata, which includes a snapshot of the Brooklyn Museum’s content. He’s been harvesting this metadata out of concern for Flickr’s future, but surprise, surprise — it was an organization devoted to preservation of cultural heritage material that removed it. It would be interesting to see how many comments there were. I’m currently unpacking a tarball of Aaron’s metadata on an ec2 instance just to see if it’s easy to summarize.

But:

I’m pretty sure I’m living in one of those.

I agree with Ben:

@edsu @textfiles @dantobias @waxpancake @brooklynmuseum Yep. Unfortunately this is a blind spot even for orgs doing things relatively well

— Ben Fino-Radin (@benfinoradin)

April 7, 2014

It would help if we had a bit more method to the madness of our own Web presence. Too often the Web is treated as a marketing platform instead of our culture’s predominant content delivery mechanism. Brooklyn Museum deserves a lot of credit for talking about this issue openly. Most organizations just sweep it under the carpet and hope nobody notices.

What do you think? Is it acceptable that Brooklyn Museum discarded the user contributions that happened on Flickr, and that all the people who happened to be pointing at said content from elsewhere now have broken links? Could Brooklyn Museum instead have decided to leave the content there, with a banner of some kind indicating that it is no longer actively maintained? Don’t lots of copies keep stuff safe?

Or perhaps having too many copies detracts from the perceived value of the currently endorsed places for finding the content? Curators have too many places to look, which aren’t synchronized, which adds confusion and duplication. Maybe it’s better to have one place where people can focus their attention?

Perhaps these two positions aren’t at odds, and what’s actually at issue is a framework for thinking about how to migrate Web content between platforms. And different expectations about content that is self hosted, and content that is hosted elsewhere?

Rosenthal, David: What Could Possibly Go Wrong?

planet code4lib - Mon, 2014-04-07 09:00

I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.

I'm David Rosenthal and I'm an engineer. I'm about two-thirds of a century old. I wrote my first program almost half a century ago, in Fortran for an IBM 1401. Eric Allman invited me to talk; I've known Eric for more than a third of a century. About a third of a century ago Bob Sproull recruited me for the Andrew project at C-MU, where I worked on the user interface with James Gosling. I followed James to Sun to work on window systems, both X, which you've probably used, and a more interesting one called NeWS that you almost certainly haven't. Then I worked on operating systems with Bill Shannon, Rob Gingell and Steve Kleiman. More than a fifth of a century ago I was employee #4 at NVIDIA, helping Curtis Priem architect the first chip. Then I was an early employee at Vitria, the second company of JoMei Chang and Dale Skeen, founders of the company now called Tibco. One seventh of a century ago, after doing 3 companies, all of which IPO-ed, I was burnt out and decided to ease myself gradually into retirement.

Academic Journals and the Web

It was a total failure. I met Vicky Reich, the wife of the late Mark Weiser, CTO of Xerox PARC. She was a librarian at Stanford, and had been part of the team which, nearly a fifth of a century ago, started Stanford's HighWire Press and pioneered the transition of academic journals from paper to the Web.

In the paper world, librarians saw themselves as having two responsibilities, to provide current scholars with the materials they needed, and to preserve their accessibility for future scholars. They did this through a massively replicated, loosely coupled, fault-tolerant, tamper-evident system of mutually untrusting but cooperating peers that had evolved over centuries. Libraries purchased copies of journals, monographs and books. The more popular the work, the more replicas were stored in the system. The storage of each replica was not very reliable; libraries put them in the stacks and let people take them away. Most times the replicas came back, sometimes they had coffee spilled on them, and sometimes they vanished. Damage could be repaired via inter-library loan and copy. There was a market for replicas; as the number of replicas of a work decreased, the value of a replica in this market increased, encouraging librarians who had a replica to take more care of it, by moving it to more secure storage. The system resisted attempts at censorship or re-writing of history precisely because it was a loosely coupled peer-to-peer system; although it was easy to find a replica, it was hard to find all the replicas, or even to know exactly how many there were. And although it was easy to destroy a replica, it was fairly hard to modify one undetectably.

The transition of academic journals from paper to the Web destroyed two of the pillars of this system, ownership of copies, and massive replication. In the excitement of seeing how much more useful content on the Web was to scholars, librarians did not think through the fundamental implications of the transition. The system that arose meant that they no longer purchased a copy of the journal, they rented access to the publisher's copy. Renting satisfied their responsibility to current scholars, but it couldn't satisfy their responsibility to future scholars.

Librarians' concerns reached the Mellon Foundation, who funded exploratory work at Stanford and five other major research libraries. In what can only be described as a serious failure of systems analysis, the other five libraries each proposed essentially the same system, in which they would take custody of the journals. Other libraries would subscribe to this third-party archive service. If they could not get access from the original publisher and they had a current subscription to the third-party archive they could access the content from the archive. None of these efforts led to a viable system because they shared many fundamental problems including:
  • Libraries such as Harvard were reluctant to outsource a critical function to a competing library such as Yale. On the other hand, funders were reluctant to pay for more than one archive.
  • Publishers were reluctant to deliver their content to a library in order that the library might make money by re-publishing the content to others. This made the contract negotiations necessary to obtain content from the publishers time-consuming and expensive.
  • The concept of a subscription archive was not a solution to the problem of post-cancellation access; it was merely a second instance of exactly the same problem.
One of the problems I had been interested in at Sun and then again at Vitria was fault-tolerance. To a computer scientist, it was a solved problem. Byzantine Fault Tolerance (BFT) could prove that 3f+1 replicas could survive f simultaneous faults. To an engineer, it was not a solved problem. Two obvious questions were:
  • What is the probability that my system will encounter f simultaneous faults?
  • How could my system recover if it did?
There's a very good reason why suspension bridges use stranded cables. A solid rod would be cheaper, but the bridge would then have the same unfortunate property as BFT. It would work properly up to the point of failure, which would be sudden, catastrophic and from which recovery would be impossible.

I have long thought that the fundamental challenge facing system architects is to build systems that fail gradually, progressively, and slowly enough for remedial action to be effective, all the while emitting alarming noises to attract attention to impending collapse. In a post-Snowden world it is perhaps superfluous to say that these properties are especially important for failures caused by external attack or internal subversion.

The LOCKSS System

As Vicky explained the paper library system to me, I came to see two things:
  • It was a system in the physical world that had a very attractive set of fault-tolerance properties.
  • An analog of the paper system in the Web world could be built that retained those properties.
With a small grant from Michael Lesk, then at the NSF, I built a prototype system called LOCKSS (Lots Of Copies Keep Stuff Safe), modelled on the paper library system. By analogy with the stacks, libraries would run what you can think of as a persistent Web cache with a Web crawler which would pre-load the cache with the content to which the library subscribed. The contents of each cache would never be flushed, and would be monitored by a peer-to-peer anti-entropy protocol. Any damage detected would be repaired by the Web analog of inter-library copy. Because the system was an exact analog of the existing paper system, the copyright legalities were very simple.

The Mellon Foundation, and then Sun and the NSF funded the work to throw my prototype away and build a production-ready system. The interesting part of this started when we discovered that, as usual with my prototypes, the anti-entropy protocol had gaping security holes. I worked with Mary Baker and some of her students in CS, Petros Maniatis, Mema Roussopoulos and TJ Giuli, to build a real P2P anti-entropy protocol, for which we won Best Paper at SOSP a tenth of a century ago.

The interest in this paper is that it shows a system, albeit in a restricted area of application, that has a high probability of failing slowly and gradually, and of generating alarms in the case of external attack, even from a very powerful adversary. It is a true P2P system with no central control, because that would provide a focus for attack. It uses three major defensive techniques:
  • Effort-balancing, to ensure that the computational cost of requesting a service from a peer exceeds the computational cost of satisfying the request. If this condition isn't true in a P2P network, the bad guy can wear the good guys down.
  • Rate-limiting, to ensure that the rate at which the bad guy can make bad things happen can't make the system fail quickly.
  • Lots of copies, so that the anti-entropy protocol can work with samples of the population of copies. Randomly sampling the peers makes it hard for the bad guy to know which peers are involved in which operations.
Recent DDoS attacks, such as the 400Gbps NTP Reflection attack on CloudFlare, have made clear the importance of rate-limiting to services such as DNS and NTP.
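The "lots of copies" idea in the list above can be illustrated with a toy poll. This is only a sketch and not the LOCKSS protocol: it leaves out effort-balancing and rate-limiting entirely, and the names (`poll_and_repair`, the `library-N` peers, the sample size of 5) are invented for illustration. Each peer polls a random sample of the others, and a peer that is outvoted replaces its copy with one from the majority.

```python
import hashlib
import random

def digest(content: bytes) -> str:
    """Hash a replica so that peers can compare copies cheaply."""
    return hashlib.sha256(content).hexdigest()

def poll_and_repair(peers, caller, sample_size, rng):
    """`caller` polls a random sample of other peers and repairs itself if outvoted.

    Hypothetical helper: returns True if the caller's replica was replaced.
    """
    others = [name for name in peers if name != caller]
    voters = rng.sample(others, sample_size)
    mine = digest(peers[caller])
    disagreeing = [v for v in voters if digest(peers[v]) != mine]
    if len(disagreeing) > sample_size / 2:
        # Outvoted by the sample: fetch a copy from one of the disagreeing peers.
        peers[caller] = peers[disagreeing[0]]
        return True
    return False

rng = random.Random(2014)
peers = {f"library-{i}": b"journal article, volume 1" for i in range(10)}
peers["library-3"] = b"journal article, vol\x00me 1"  # simulated damage

repaired = [name for name in peers if poll_and_repair(peers, name, 5, rng)]
print(repaired)
```

Because each poll uses a random sample, an observer cannot easily tell in advance which peers will take part in which poll; with one damaged copy among ten, every sample still contains a majority of good copies.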

Now, our free, open source, peer-to-peer digital preservation system is in use at around 150 libraries worldwide. The program has been economically self-supporting for nearly 7 years using the "RedHat" model of free software and paid support. In addition to our SOSP paper, the program has published research into many aspects of digital preservation.

The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at current prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk. The median data item in the Global LOCKSS network has copies at a couple of dozen peers.

I doubt that copyright is high on your list of design problems. You may be wrong about that, but I'm not going to argue with you. So, the rest of this talk will not be about the LOCKSS system as such, but about the lessons we've learned in the last 15 years that are applicable to everyone who is trying to store digital information for the long term. The title of this talk is the question that you have to keep asking yourself over and over again as you work on digital preservation, "what could possibly go wrong?" Unfortunately, once I started writing this talk, it rapidly grew far too long for lunch. Don't expect a comprehensive list, you're only getting edited low-lights.

Stuff is going to get lost

Let's start by examining the problem in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
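The half-life figure can be checked with a few lines of arithmetic. The sketch below assumes that a Petabyte is 8 × 10^15 bits, that bits flip independently, and that the universe is about 13.8 billion years old; those constants and the function name are assumptions for illustration, not part of the talk.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15   # assumption: 10^15 bytes of 8 bits each
AGE_OF_UNIVERSE_YEARS = 13.8e9   # assumption: commonly quoted estimate

def required_bit_half_life(bits, years, p_all_survive=0.5):
    """Half-life each bit needs so that all `bits` survive `years` with probability p."""
    # One bit survives `years` with probability 0.5 ** (years / half_life).
    # All bits survive with probability 0.5 ** (bits * years / half_life),
    # which is set equal to p_all_survive and solved for half_life.
    return bits * years * math.log(0.5) / math.log(p_all_survive)

half_life = required_bit_half_life(BITS_PER_PETABYTE, 100)
ratio = half_life / AGE_OF_UNIVERSE_YEARS
print(f"bit half-life: {half_life:.1e} years")
print(f"multiples of the age of the universe: {ratio:.1e}")
```

Under these assumptions the ratio comes out a little under 60 million, in line with the figure quoted above.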

At scale, storing realistic amounts of data for human timescales is an unsolvable problem. Some stuff is going to get lost. This shouldn't be a surprise, even in the days of paper stuff got lost. But the essential information needed to keep society running, to keep science progressing, to keep the populace entertained was stored very robustly, with many copies on durable, somewhat tamper-evident media in a fault-tolerant, peer-to-peer, geographically and administratively diverse system.

This is no longer true. The Internet has, in the interest of reducing costs and speeding communication, removed the redundancy, the durability and the tamper-evidence from the system that stores society's critical data. It's now all on spinning rust, with hopefully at least one backup on tape covered in rust.

Two weeks ago, researchers at Berkeley co-authored a paper in which they reported that:
a rapid succession of coronal mass ejections ... sent a pulse of magnetized plasma barreling into space and through Earth’s orbit. Had the eruption come nine days earlier, when the ignition spot on the solar surface was aimed at Earth, it would have hit the planet, potentially wreaking havoc with the electrical grid, disabling satellites and GPS, and disrupting our increasingly electronic lives. ... A study last year estimated that the cost of a solar storm like [this] could reach $2.6 trillion worldwide.

Most of the information needed to recover from such an event exists only in digital form on magnetic media. These days, most of it probably exists only in "the cloud", which is this happy place immune from the electromagnetic effects of coronal mass ejections and very easy to access after the power grid goes down.

How many of you have read the science fiction classic The Mote In God's Eye by Larry Niven and Jerry Pournelle? It describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles, the best they can do is to try to ensure that each collapse starts from a higher level than the one before by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.

Larry and Jerry were writing a third of a century ago, but in the light of this week's IPCC report, they are starting to look uncomfortably prophetic. The problem we face is that, with no collective memory of a societal collapse, no-one is willing to pay either to fend it off or to build the museums to pass knowledge to the successor society.

Why is stuff going to get lost?

One way to express the "what could possibly go wrong?" question is to ask "against what threats are you trying to preserve data?" The threat model of a digital preservation system is a very important aspect of the design which is, alas, only rarely documented. In 2005 we did document the LOCKSS threat model. Unfortunately, we didn't consider coronal mass ejections or societal collapse from global warming.

We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure 
How much stuff is going to get lost?

The more we spend per byte, the safer the bytes are going to be. Unfortunately, this is subject to the Law of Diminishing Returns; each successive nine of reliability is exponentially more expensive than the last. We don't have an unlimited budget, so we're going to have to trade off cost against the probability of data loss. To do this we need models to predict the cost of storing data using a given technology, and models to predict the probability of that technology losing data. I've worked on both kinds of model and can report that they're both extremely difficult.

Models of Data Loss

There's quite a bit of research, from among others Google, C-MU and BackBlaze, showing that failure rates of storage media in service are much higher than the rates claimed by the manufacturers' specifications. Why is this? For example, the Blu-Ray disks Facebook is experimenting with for cold storage claim a 50-year data life. No-one has seen a 50-year-old DVD disk, so how do they know?

The claims are based on a model of the failure mechanisms and data from accelerated life testing, in which batches of media are subjected to unrealistically high temperature and humidity. The model is used to extrapolate from these unrealistic conditions to the conditions to be encountered in service. There are two problems, the conditions in service typically don't match those assumed by the models, and the models only capture some of the failure mechanisms.

These problems are much worse when we try to model not just failures of individual media, but of the entire storage system. Research has shown that media failures account for less than half the failures encountered in service; other components of the system such as buses, controllers, power supplies and so on contribute the other half. But even models that include these components exclude many of the threats we identified, from operator errors to coronal mass ejections.

Even more of a problem is that the threats, especially the low-probability ones, are highly correlated. Operators are highly likely to make errors when they are stressed coping with, say, an external attack. The probability of economic failure is greatly increased by, say, insider abuse. Modelling these correlations is a nightmare.

It turns out that economics are by far the largest cause of data failing to reach future readers. A month ago I gave a seminar in the I-school entitled The Half-Empty Archive, in which I pulled together the various attempts to measure how much of the data that should be archived is being collected by archives, and assessed that it was much less than half. No-one believes that archiving budgets are going to double, so we can be confident that the loss rate from being unable to afford to collect is at least 50%. This dwarfs all other causes of data loss.

Let's Keep Everything For Ever!

Digital preservation has three cost areas; ingest, preservation and dissemination. In the seminar I looked at the prospects for radical cost decreases in  all three, but I assume that the one you are interested in is storage, which is the main cost of preservation. Everyone knows that, if not actually free, storage is so cheap that we can afford to store everything for ever. For example, Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."

Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely.

Kryder's Law

They believe this because they lived through a third of a century of Kryder's Law, the analog of Moore's Law for disks. Kryder's Law predicted that the bit density on the platters of disk drives would more than double every 18 months, leading to a consistent 30-40%/yr drop in cost per byte. Thus, long-term storage was effectively free. If you could afford to store something for a few years, you could afford to store it for ever. The cost would have become negligible.

As Randall Munroe points out, in the real world exponential growth can't continue for ever. It is always the first part of an S-curve. One of the things that most impressed me about Krste Asanović's keynote on the ASPIRE Project at this year's FAST conference was that their architecture took for granted that Moore's Law was in the past. Kryder's Law is also flattening out.

Here's a graph, from Preeti Gupta at UCSC, showing that in 2010, even before the floods in Thailand doubled $/GB overnight, the Kryder curve was flattening. Currently, disk is about 7 times as expensive as it would have been had the pre-2010 Kryder's Law continued. Industry projections are for 10-20%/yr going forward - the red lines on the graph show that in 2020 disk is now expected to be 100-300 times more expensive than pre-2010 expectations.

Industry projections have a history of optimism, but if we believe that data grows at IDC's 60%/yr, disk density grows at IHS iSuppli's 20%/yr, and IT budgets are essentially flat, the annual cost of storing a decade's accumulated data is 20 times the first year's cost. If at the start of the decade storage is 5% of your budget, at the end it is more than 100% of your budget. So the Big Data zealot has an affordability problem.

Why Is Kryder's Law Slowing?

It is easy to, and we often do, conflate Kryder's Law, which describes the increase in the areal density of bits on disk platters, with the cost of disk storage in $/GB. We wave our hands and say that it roughly mapped one-for-one into a decrease in the cost of disk drives. We are not alone in using this approximation, Mark Kryder himself does (PDF):
Density is viewed as the most important factor ... because it relates directly to cost/GB and in the HDD marketplace, cost/GB has always been substantially more important than other performance parameters. To compare cost/GB, the approach used here was to assume that, to first order, cost/GB would scale in proportion to (density)^-1

My co-author Daniel Rosenthal (no relation) has investigated the relationship between bits/in^2 and $/GB over the last couple of decades. Over that time, it appears that about 3/4 of the decrease in $/GB can be attributed to the increase in bits/in^2. Where did the rest of the decrease come from? I can think of three possible causes:
  • Economies of scale. For most of the last two decades the unit shipments of drives have been increasing, resulting in lower fixed costs per drive. Unfortunately, unit shipments are currently declining, so this effect has gone into reverse. In 2005 Mark Kryder was quoted as predicting "In a few years the average U.S. consumer will own 10 to 20 disk drives in devices that he uses regularly," but what is in those devices now is flash. The remaining market for disks is the cloud; they are no longer a consumer technology.
  • Manufacturing technology. The technology to build drives has improved greatly over the last couple of decades, resulting in lower variable costs per drive. Unfortunately HAMR, the next generation of disk drive technology has proven to be extraordinarily hard to manufacture, so this effect has gone into reverse.
  • Vendor margins. Over the last couple of decades disk drive manufacturing was a very competitive business, with numerous competing vendors. This gradually drove margins down and caused the industry to consolidate. Before the Thai floods, there were only two major manufacturers left, with margins in the low single digits. Unfortunately, the lack of competition and the floods have led to a major increase in margins, so this effect has gone into reverse.
But these factors only account for 1/4 of the missing cost decrease. Where did the other 3/4 go? Here is a 2008 graph from Dave Anderson of Seagate showing how what looks like a smooth Kryder's Law curve is actually the superimposition of a series of S-curves, one for each successive technology. Note how Dave's graph shows Perpendicular Magnetic Recording (PMR) being replaced by Heat Assisted Magnetic Recording (HAMR) starting in 2009. No-one has yet shipped HAMR drives. Instead, the industry has resorted to stretching PMR by shingling (which increases the density) and helium (which increases the number of platters).

Each technology generation has to stay in the market long enough to earn a return on the cost of the transition from its predecessor. There are two problems:
  • The return it needs to earn is, in effect, the margins the vendors enjoy. The higher the margins, the longer the technology needs to be in the market. Margins have increased.
  • As technology advances, the easier problems get solved first. So each technology transition involves solving harder and harder problems, so it costs more. The transition from PMR to HAMR has turned out to be vastly more expensive than the industry expected. Getting the laser and the magnetics in the head assembly to cooperate is very hard, the transition involves a huge increase in the production of the lasers, and so on.
According to Dave's 6-year-old graph, we should now be almost done with HAMR and starting the transition to Bit Patterned Media (BPM). It is already clear that the HAMR-BPM transition will be even more expensive and thus even more delayed than the PMR-HAMR transition. So the projected 20%/yr Kryder rate is unlikely to be realized. The one good thing, if you can call it that, about the slowing of the Kryder rate for disk is that it puts off the day when the technology hits the superparamagnetic limit. This is when the shrinking magnetic domains become unstable at the temperatures encountered inside an operating disk, which are quite warm.

We'll Just Use Tape Instead of Disk

About 70% of all bytes of storage produced each year is disk, the rest being tape and solid state. Tape has been the traditional medium for long-term storage. Its recording technology lags about 8 years behind disk; it is unlikely to run into the problems plaguing disk for some years. We can expect its relative cost per byte advantage over disk to grow in the medium term. But tape is losing ground in the market. Why is this?

In the past, the access patterns to archived data were stable. It was rarely accessed, and accesses other than integrity checks were sparse. But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until last week it was 5.5 times.

An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the considerable costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year. Recently, Twitter issued a call for a "small number of proposals to receive free datasets", but even Twitter can't support 400.

Thus future archives will need to keep at least one copy of their content on low-latency, high-bandwidth storage, not tape.

We'll Just Use Flash Instead

Flash memory's advantages, including low power, physical robustness and low access latency have overcome its higher cost per byte in many markets, such as tablets and servers. But there is no possibility of flash replacing disk in the bulk storage market; that would involve trebling the number of flash fabs. Even if we ignore the lead time to build the new fabs, the investment to do so would not pay dividends. Everyone understands that shrinking flash cells much further will impair their ability to store data. Increasing levels, stacking cells in 3D and increasingly desperate signal processing in the flash controller will keep density going for a little while, but not long enough to pay back the investment in the fabs.

We'll Just Use Non-volatile RAM Instead

There are many technologies vying to be the successor to flash, and most can definitely keep scaling beyond the end of flash provided the semiconductor industry keeps on its road-map.  They all have significant advantages over flash, in particular they are byte- rather than block-addressable. But analysis by Mark Kryder and Chang Soo Kim (PDF) at Carnegie-Mellon is not encouraging about the prospects for either flash or the competing solid state technologies beyond the end of the decade.

We'll Just Use Metal Tape, Stone DVDs, Holographic DVDs or DNA Instead

Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
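The compounding behind these figures is easy to reproduce. The sketch below treats the Kryder rate as a constant annual fractional drop in the space needed per byte, which is a simplifying assumption; the function names are invented for illustration.

```python
def space_fraction(kryder_rate, years):
    """Fraction of the original space needed for the same data after `years`."""
    return (1 - kryder_rate) ** years

def capacity_multiple(kryder_rate, years):
    """How many times more data fits in the same space after `years`."""
    return 1 / space_fraction(kryder_rate, years)

for rate in (0.10, 0.30):
    print(f"{rate:.0%}/yr: same data in {space_fraction(rate, 10):.2f} of the space, "
          f"or {capacity_multiple(rate, 10):.0f}x the data in the same space")
```

At 10%/yr the same data fits in a little over one third of the space after a decade. At 30%/yr simple compounding gives a multiple of about 35, the same order as the round figure of 30 used above.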

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than Kryder's or Moore's Laws. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the cold storage problem. But DNA access will always be slow enough that it can't store the only copy.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ...

Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.

The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).

Moral of the story: design for failure and buy the cheapest components you can. :-)

Note that this analysis assumes that the drives fail under warranty. One thing the drive vendors did to improve their margins after the floods was to reduce the length of warranties.
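The arithmetic in the BackBlaze quote can be laid out explicitly. All the numbers below come from the quote itself, including the rough $5,000 salary estimate; only the variable and function names are invented for this sketch.

```python
DRIVES = 30_000
MINUTES_PER_REPLACEMENT = 15
FLEET_COST_DOLLARS = 4_000_000
SAVED_SALARY_DOLLARS = 5_000      # the quote's own rough estimate

def replacement_hours(drives, failure_percent):
    """Hours of work needed to replace the drives that fail."""
    failed = drives * failure_percent / 100
    return failed * MINUTES_PER_REPLACEMENT / 60

hours_at_2_percent = replacement_hours(DRIVES, 2)
hours_at_1_percent = replacement_hours(DRIVES, 1)
hours_saved = hours_at_2_percent - hours_at_1_percent

# What halving the failure rate is worth, as a fraction of the purchase price.
worthwhile_premium = SAVED_SALARY_DOLLARS / FLEET_COST_DOLLARS

print(f"hours at 2% failures: {hours_at_2_percent:.0f}")
print(f"hours saved at 1% failures: {hours_saved:.0f}")
print(f"premium worth paying: {worthwhile_premium:.3%}")
```

The 75 hours saved is roughly two working weeks, and the premium comes out at about an eighth of one percent, which the quote rounds to a tenth.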

Does Kryder's Law Slowing Matter?

Figures from SDSC suggest that media cost is about 1/3 of the lifecycle cost of storage, although figures from BackBlaze suggest a much higher proportion. As a rule of thumb, the research into digital preservation costs suggests that ingesting the content costs about 1/2 the total lifecycle costs, preserving it costs about 1/3 and disseminating it costs about 1/6. So why are we worrying about a slowing of the decrease in 1/9 of the total cost?

Different technologies with different media service lives involve spending different amounts of money at different times during the life of the data. To make apples-to-apples comparisons we need to use the equivalent of Discounted Cash Flow to compute the endowment needed for the data. This is the capital sum which, deposited with the data and invested at prevailing interest rates, would be sufficient to cover all the expenditures needed to store the data for its life.
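A minimal sketch of the endowment idea follows. It is not the economic model described in this talk: it is deterministic, counts only media purchases, and assumes a fixed interest rate and a fixed Kryder rate for the whole period. The function name, the 2% interest rate and the 4-year service life are assumptions chosen for illustration.

```python
def endowment(initial_media_cost, kryder_rate, interest_rate,
              service_life_years, horizon_years):
    """Present value of buying media now and at each replacement up to the horizon."""
    total = 0.0
    for year in range(0, horizon_years, service_life_years):
        # Media bought `year` years from now is cheaper by the Kryder rate...
        price_then = initial_media_cost * (1 - kryder_rate) ** year
        # ...and money set aside today has grown by the interest rate.
        growth = (1 + interest_rate) ** year
        total += price_then / growth
    return total

for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    multiple = endowment(1.0, rate, 0.02, 4, 100)
    print(f"Kryder rate {rate:.0%}/yr: endowment = {multiple:.2f} x initial media cost")
```

Even in this simplified form the endowment barely moves between 40%/yr and 30%/yr but rises quickly as the rate drops towards 10%/yr and below.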

We built an economic model of the cost of long-term storage. Here it is from 15 months ago plotting the endowment needed for 3 replicas of a 117TB dataset to have a 98% chance of not running out of money over 100 years, against the Kryder rate, using costs from Backblaze. Each line represents a policy of keeping the drives for 1, 2 ... 5 years before replacing them.

In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
  • The endowment needed increases sharply.
  • The endowment needed becomes harder to predict, because it depends strongly on the precise Kryder rate.
The reason to worry is that the cost of storing data for the long term depends strongly on the Kryder rate if it falls much below 20%, which it has. Everyone's storage expectations, and budgets, are based on their pre-2010 experience, and on a belief that the effect of the floods was a one-off glitch; the industry will quickly get back to historic Kryder rates. It wasn't, and they won't.

Does Losing Stuff Matter?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
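The decade figure can be reproduced with a short simulation. The sketch assumes that each system ingests a fixed amount per year and that the lossy system loses 1% of everything it holds at the end of each year, including that year's ingest; this ordering is an assumption, as are the function and variable names. It covers only the ten-year comparison.

```python
def preserved(units_per_year, annual_loss, years):
    """Content still held after `years` of steady ingest and steady loss."""
    stored = 0.0
    for _ in range(years):
        stored += units_per_year      # ingest this year's content
        stored *= 1 - annual_loss     # then lose a fraction of everything held
    return stored

reliable = preserved(1.0, 0.00, 10)   # zero loss, one unit per year
cheap = preserved(2.0, 0.01, 10)      # half the cost per byte, so two units per year
print(f"cheaper system preserves {cheap / reliable:.2f} times as much")
```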

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2752nd.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.

Can We Do Better?

In the short term, the inertia of manufacturing investment means that things aren't going to change much. Bulk data is going to stay on disk; it can't compete with other uses for the higher-value space on flash. But looking out to the end of the decade and beyond, we're going to be living in a world of much lower Kryder rates. What does this mean for storage system architectures?

The five-year service life of disks isn't an accident of technology. Disks are engineered to have a five-year service life because, with a 40%/yr Kryder rate, it is uneconomic to keep the data on the drive for longer than 5 years. After 5 years the data will take up only about 8% of the capacity of the drive's replacement.
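The 8% figure is just compounding. As a minimal sketch, assume a Kryder rate of r means the cost per byte falls by r each year, so an equally priced replacement drive bought n years later holds 1/(1-r)^n times as much. The function name is made up for illustration.

```python
def fraction_of_replacement(kryder_rate: float, years: int) -> float:
    """Share of an equally priced replacement drive that a full old drive's
    data would occupy after `years`, assuming cost per byte falls by
    `kryder_rate` every year."""
    return (1.0 - kryder_rate) ** years


if __name__ == "__main__":
    for rate in (0.40, 0.20, 0.10):
        share = fraction_of_replacement(rate, 5)
        print(f"Kryder rate {rate:.0%}: data fills {share:.1%} of the replacement after 5 years")
```

At 40%/yr the old data fills a little under 8% of the new drive; at 20%/yr it still fills about a third, so the economic case for replacing the drive after five years is much weaker.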

At lower Kryder rates the media, whatever they are, will be in service longer. That means that running cost will be a larger proportion of the total cost. It will be worthwhile to spend more on purchasing the media in order to spend less on running them. Three years ago Ian Adams, Ethan Miller and I were inspired by the FAWN paper from Carnegie Mellon to do an analysis we called DAWN: Durable Array of Wimpy Nodes. In it we showed that, despite the much higher capital cost, a storage fabric consisting of a very large number of very small nodes each with a very low-power system-on-chip and a small amount of flash memory would be competitive with disk.

The reason was that DAWN's running cost would be so much lower, and its economic media life so much longer, that it would repay the higher initial investment. The more the Kryder rate slows, the better our analysis looks. DAWN's better performance was a bonus. To the extent that successors to flash behave like RAM, and especially if they can be integrated with the system-on-chip, they strengthen the case further with lower costs and an even bigger performance edge.


Summing Up

Expectations for future storage technologies and costs were built up during three decades of extremely rapid cost-per-byte decrease. We are now 4 years into a period of much slower cost decrease, but expectations remain unchanged. Some haven't noticed the change; others believe it is temporary and that the industry will return to the good old days of 40%/yr Kryder rates.

Industry insiders are projecting no more than 20%/yr rates for the rest of the decade. Technological and market forces make it likely that, as usual, they are being optimistic. Lower Kryder rates greatly increase both the cost of long-term storage, and the uncertainty in estimating it.

Lower Kryder rates mean that the economic service life of media will be longer, placing more emphasis on lower running cost than on lower purchase cost. This is particularly true since bulk storage media are no longer a consumer product; businesses are better placed to make this trade-off. But they may not do so (see the work of Andrew Haldane and Richard Davies at the Bank of England, and Doyne Farmer of the Santa Fe Institute and John Geanakoplos of Yale).

The idea that archived data can live on long-latency, low-bandwidth media no longer holds. Future archival storage architectures must deliver adequate performance to sustain data-mining as well as low cost. Bundling computation into the storage medium is the way to do this.

Discussion

As usual, I was too busy answering questions to remember most of them. Here are the ones I remember, rephrased, with apologies to the questioners whose contributions slipped my memory:
  • Won't the evolution of flash technology drive its price down more quickly than disk? The problem is that the manufacturing capacity doesn't, and won't, exist for flash to displace disk in the bulk storage space. Flash is a better technology than disk for many applications, so it is likely always to command a premium over disk.
  • Isn't DNA a really noisy technology to build long-term memory from? At the raw media level, all storage technologies are unpleasantly noisy. The signal processing that goes on inside your disk or flash controller is amazing. DNA has the advantage that the signal processing has a vast number of replicas to work with.
  • Doesn't experience with flash suggest that it isn't capable of storing data reliably for the long term? The way current flash controllers use the raw medium optimizes things other than data retention, such as performance (for SSDs) and low cost (for SD cards, see Bunnie Huang and xobs' talk at the Chaos Communication Congress). That doesn't mean it isn't possible, with alternate flash controller technology, to optimize for data retention.
