planet code4lib

Planet Code4Lib - http://planet.code4lib.org
Updated: 1 week 1 day ago

ALA Equitable Access to Electronic Content: 404 Day: Stopping excessive internet filtration

Wed, 2014-04-02 21:58

Every day, libraries across the country block far more content than is necessary under the law in order to comply with the Children’s Internet Protection Act (CIPA), the law that requires public libraries and K-12 schools to employ internet filtering software in exchange for certain federal funding. This week, patrons and students will get the chance to call attention to banned websites and excessive Internet filtration in libraries.

Library advocates will have the opportunity to participate in a free educational event on internet filtering. Join the Electronic Frontier Foundation, the MIT Center for Civic Media and the National Coalition Against Censorship on Friday, April 4, 2014, at 3:00pm EST, when they host a digital teach-in featuring discussions with top researchers and librarians working to push back against the use of Internet filters on library computers. The digital teach-in will be archived.

Digital Teach-in
When: Friday, April 4, 2014
Time: 3:00pm EST
Watch event live

Speakers:

  • Moderator: April Glaser, activist at the Electronic Frontier Foundation
  • Deborah Caldwell-Stone, Deputy Director of the American Library Association’s Office for Intellectual Freedom. She has written extensively about CIPA and blocked websites in libraries.
  • Chris Peterson, a research affiliate at the Center for Civic Media at the MIT Media Lab. He is currently working on the Mapping Information Access Project.
  • Sarah Houghton, Director for the San Rafael Public Library in Northern California. She has also blogged as the Librarian in Black for over a decade.

The post 404 Day: Stopping excessive internet filtration appeared first on District Dispatch.

Ng, Cynthia: BCLA 2014: Libraries, Architecture, and the Urban Context

Wed, 2014-04-02 19:19
Michael Heeney, Bing Thom Architects. Urban life has changed. We used to spend very few years in education, but now, with roughly double the lifespan, we spend more time in it. Example: Fort Worth, cascading courtyards in building the college campus connecting downtown to waterfront. Sunset Community Centre – a modest project, integrating the community into the buildings. […]

Ng, Cynthia: BCLA 2014: Linked Open Data and Libraries. When? Or NOW!

Wed, 2014-04-02 17:59
Panel of 3 on linked open data. Catelynne Sahadath – A Change Would Do You Good. Back of house needs to work together with front of house. The link is very important in order to properly serve users. Communication is key. We stop ourselves from communicating effectively. We need an open line of communication. Current […]

LITA: Jobs in Information Technology: April 2

Wed, 2014-04-02 17:42

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Associate Director for Library Information Technology, Stony Brook University, Stony Brook, NY

Librarian Supervisor I, AskUsNow Coordinator, Enoch Pratt Free Library, Baltimore, MD

Librarian Supervisor II, Maryland Department, Enoch Pratt Free Library, Baltimore, MD

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

OCLC Dev Network: Planned Downtime for April Release

Wed, 2014-04-02 15:00

On Sunday, April 6, 2014, there will be a scheduled service downtime as the WorldShare Platform receives this quarter’s planned enhancements.

Morgan, Eric Lease: Tiny Text Mining Tools

Wed, 2014-04-02 14:57

I have posted to GitHub the very beginnings of a Perl library used to support simple and introductory text mining analysis — tiny text mining tools.

Presently the library is implemented in a set of subroutines stored in a single file supporting:

  • simple in-memory indexing and single-term searching
  • relevancy ranking through term-frequency inverse document frequency (TFIDF) for searching and classification
  • cosine similarity for clustering and “finding more items like this one”

I use these subroutines and the associated Perl scripts to do quick & dirty analysis against corpuses of journal articles, books, and websites.
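
For readers who want the gist without reading the Perl, here is a minimal illustrative sketch of the two core calculations, TFIDF weighting and cosine similarity. It is written in Python rather than Morgan's Perl and makes no claim to match his implementation; the corpus and tokenization are toy assumptions.

    # An illustrative sketch, not Morgan's Perl code: TFIDF weighting and
    # cosine similarity over a tiny in-memory corpus of three "documents".
    import math
    from collections import Counter

    docs = {
        "a": "open source software for libraries",
        "b": "text mining of library journal articles",
        "c": "mining open data with simple software",
    }

    def tfidf(corpus):
        """Return {doc_id: {term: weight}} using term frequency * log(N / document frequency)."""
        tokenized = {d: text.split() for d, text in corpus.items()}
        n = len(tokenized)
        df = Counter(term for tokens in tokenized.values() for term in set(tokens))
        vectors = {}
        for d, tokens in tokenized.items():
            tf = Counter(tokens)
            vectors[d] = {term: (count / len(tokens)) * math.log(n / df[term])
                          for term, count in tf.items()}
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse term-weight vectors."""
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    vectors = tfidf(docs)
    print(cosine(vectors["a"], vectors["c"]))   # "find more items like this one"

Documents that share heavily weighted terms score closer to 1, while unrelated documents score near 0; that is the basis of both the relevancy ranking and the "more like this" clustering listed above.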

I know, I know. It would be better to implement these things as a set of Perl modules, but I’m practicing what I preach. “Give it away even if it is not ready.” The ultimate idea is to package these things into a single distribution, and enable researchers to have them at their fingertips as opposed to a Web-based application.

Open Knowledge Foundation: The Open Knowledge Foundation Newsletter, April 2014

Wed, 2014-04-02 14:22
Hi!

After last month’s launch-fest, March has been a thoughtful month, with reflective and planning pieces taking centre-stage on our blog. Of course OKFestival has been ramping up since its launch, giving more detail on topics and running sessions to help with submitting proposals; however we’ve also had more from the Community Survey results, as well as guest posts dealing with ‘open washing’ and exploring what open data means to different people.

Keep checking in on the Community Stories Tumblr for the latest news on what people are doing around the world to push the agenda for Open Knowledge. This month’s updates come from India, Tanzania, Greece, Malta, Russia and Germany, and from OpenMENA (Middle East and North Africa) – the new group focusing on Open Knowledge in the Arab world.

Also, congratulations to our very own Rufus Pollock, named a Tech Hero for Good by NESTA :-)

OKFestival 2014

Plans have been moving at pace over the last month.

So many proposals came in, and so many people wanted more time to submit, that we extended the deadline for proposals to March 30th. We’ll have to wait until May to learn if our proposals have been accepted, and later in May for the programme announcement, but many thanks to all who have proposed sessions – and good luck to you!

It’s not long to go now, so don’t forget to buy your ticket.

If you need distraction from the wait, check out this flash-back to last year: the 2013 Open Reader, a collection of stories and articles inspired by Open Knowledge Conference 2013.

Stop Secret Contracts

Last month we launched our campaign for a stop to secret contracts, asking various organisations to partner with us and asking you who care about openness to sign up to show your support.

Spread the word to your colleagues, friends and family to show that we will not stand for corruption, fraud, unaccountability or backdoor deals.

Signatures not enough? To get more involved please contact us and help us stop secret contracts.

#SecretContracts

Coming Up

The School of Data heads to Perugia! The School of Data Journalism, Europe’s biggest data journalism event, from the European Journalism Centre, Open Knowledge and the International Journalism Festival, takes place from 30th April to 4th May. The event has an impressive programme with free entry to panels and workshops, so check it out and register to save your place.

OGP grows to 62 countries. The Open Government Partnership (OGP) will welcome 8 new countries during April: ‘Cohort 4’ consists of Australia, Ireland, Malawi, Mongolia, New Zealand, Sierra Leone, Serbia, and Trinidad and Tobago.

And… Time-zone changes! They mess with schedules and deadlines, but add to the fun of this time of year.

All the best from Open Knowledge!

Schmidt, Aaron: Earning Trust

Wed, 2014-04-02 14:00

Earning the trust of your library members is crucial to delivering a great user experience. Without trust, it is impossible to connect to library members in a meaningful way.

Libraries benefit in all sorts of ways when they’re trusted institutions. Trust breeds loyalty, and loyal library users are more likely to take advantage of the library. What’s more, loyal patrons will also be more apt to sing the praises of the library to neighbors and colleagues. For libraries, thinking about trust highlights the importance of recognizing members as individuals. Thinking of users not as a homogenous group but rather as persons will allow your library staff to develop more empathy and build stronger relationships.

There are many ways to earn—and lose—people’s trust in a library. Let’s take a look at a few:

FACE-TO-FACE CUSTOMER SERVICE

As we are social creatures, the human interactions that happen inside our buildings are often a make-or-break aspect of building trust. In fact, customer service is so tightly linked to trust and the overall user experience (UX), it is often the only aspect of UX that librarians consider. Genuinely friendly and helpful interactions help people accomplish their goals, demonstrate respect, and tell people, “Yes, we really do care about you as a person.” Poor customer service usually diminishes even the most desirable services.

Customer service also involves follow-through. Libraries must do what they say they’re going to do. This applies both to small- and large-scale claims. On the granular level, it is important that librarians carry out the tasks they promise members: reserving an item or phoning them with the answer to a reference question, for instance. On a broader level, libraries need to back up the big claims often found in mission statements. Things like “improve the quality of life for all citizens” and “provide access to the world of social and cultural ideas” can only be demonstrated through action. Simply pasting some nice words onto a web page won’t cut it. Show, don’t tell.

SHOWING YOUR PERSONALITY

It is easier to relate to a group of people than it is to a building. I’ve worked with a lot of libraries’ staffs over the years, and I don’t think I’ve met a single group that didn’t have at least a strong contingent of enthusiastic and fun employees. Letting librarians’ personalities show makes it easier for individuals to relate to—and therefore trust—the library.

There are plenty of opportunities for this: displays, events, contributions in newsletters, emails, and on the web, among others. Have some fun, be yourself, and ensure that your library’s brand makes it apparent that it is an organization filled with people. Remember, being fun and engaging folks doesn’t necessarily mean you’re dumbing the library down. Only people who take themselves too seriously think that way!

MAKING PEOPLE SUCCESSFUL

A great way to earn loyalty is to help patrons to be successful. When it is apparent that the library has their best interests at heart, people are likely to use the library more—and advocate for it. Remember, people’s actions in a library don’t exist in a vacuum. When they check out a DVD, they’re hoping to be entertained. When they ask a reference question, they probably have a goal they’re hoping to attain. Even if that goal is a barroom bet (maybe especially so!), helping people to reach their ends is an important way to earn their trust. Would your library be a different place if you started thinking of it as an organization that works with members to accomplish goals?

WEBSITES

The content on your website, what people can accomplish using it, and its visual design all impact the level of trust people place both in the site and in your institution as a whole. A website with outdated information or poor legibility raises a red flag and leads people to believe the site is sloppy or ineffectual.

HONESTY

In “The Transparent Library: Living Out Loud” (LJ 6/1/07, p. 34), Michael Stephens and Michael Casey illustrate how transparent libraries set themselves up to build long-lasting relationships.

“Transparency and arrogance are like oil and water—the two simply don’t mix. This is a very good reason for encouraging transparency in any organization. It’s very difficult for a transparent library to lie and shy away from the truth….”

If a library isn’t honest with its members, it is unlikely that a trusting relationship will form. Making it known why your organization makes the decisions it does and being forthright when it makes mistakes are effective ways to humanize your library. Engaging patrons with participatory design methods and involving them in the planning process take this idea further. The more deliberate the transparency, the better the result.

Miedema, John: Google Translate for Emoji announced on April Fools Day. But something like this is possible with sentiment analysis.

Wed, 2014-04-02 13:38

“Can a word smile? Can it roll its eyes? Or say sorry, not sorry? … We’re excited to announce Google Translate support for Emoji. With a click of a button our translation algorithm interprets the content and tone of words, and distills them down into clear, articulate and meaningful symbols.” Google announced it yesterday. They had me going for a minute, till I realized it was April Fool’s Day. “The Chrome team is really excited about being able to translate the entire internet into Emoji. URLs, emojify. E-commmerce, emojify. Medical journals, emojify. … Some of our engineers have even started to code in emoji.”

Okay, Google got me. However, they are not entirely kidding, which is probably why the joke worked. Sentiment analysis is the use of natural language processing to extract subjective meaning from text. Emoji translation is a perfectly achievable concept, not to replace text completely, but to enrich the readability of text in some contexts.
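
As a toy illustration of the idea (emphatically not Google's system), a tiny hand-rolled sentiment lexicon is enough to map a sentence's polarity onto an emoji; a production system would swap in a trained sentiment model, but the pipeline is the same: text, then sentiment score, then symbol.

    # A toy illustration, not Google's system: score a sentence with a tiny
    # hand-rolled sentiment lexicon and map the polarity to an emoji.
    POSITIVE = {"love", "great", "happy", "excited", "wonderful"}
    NEGATIVE = {"hate", "awful", "sad", "sorry", "terrible"}

    def emojify(text):
        words = [w.strip(".,!?'\"") for w in text.lower().split()]
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "😀" if score > 0 else "😞" if score < 0 else "😐"

    print(emojify("We're excited to announce this wonderful feature!"))  # 😀
    print(emojify("Sorry, not sorry."))                                  # 😞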

More:
Chrome translates mobile Web to emoji for April Fools’
How to Use Emojis on Your Android Device

Rosenthal, David: EverCloud workshop

Wed, 2014-04-02 09:00
I was invited to a workshop sponsored by ISAT/DARPA entitled The EverCloud: Anticipating and Countering Cloud-Rot that arose from Yale's EverCloud project. I gave a brief statement on an initial panel; an edited text with links to the sources is below the fold.

I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries. For the last 15 years I've been working on the problem of keeping data safe for the long term. I'm going to run through some of the key lessons about storage we have learned in that time.

First, at scale this is an insoluble problem, in the sense that you are never going to know that you have solved it. Consider the simplest possible model of long-term storage, a black box into which you put a Petabyte and out of which 100 years later you take a Petabyte. You want to have a 50% chance that every bit you take out is the same as when it went in. Think of each bit like a radioactive atom that randomly decays. You have just specified a half-life for the bits; it is about 60M times the age of the universe. There's no feasible experiment you can do that would prove no process with a half-life less than 60M times the age of the universe was going on inside the box. Another way of looking at a Petabyte for a Century is that it needs 18 nines of reliability; 7 nines more reliable than S3's design goal.
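
A quick sanity check of that half-life figure, assuming each bit decays independently with an exponential failure law:

    # If all N bits of a petabyte must survive t years with probability 0.5,
    # then exp(-N * t * ln2 / half_life) = 0.5, so the per-bit half-life is N * t.
    N = 8e15                    # bits in a petabyte (10^15 bytes)
    t = 100                     # years
    age_of_universe = 1.38e10   # years
    print(N * t / age_of_universe)   # ≈ 6e7, i.e. about 60M times the age of the universe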

We are going to lose stuff. How much stuff we lose depends on how much we spend storing it; the more we spend the safer the bits. Unfortunately, this is subject to the Law of Diminishing Returns. Each successive 9 of reliability is exponentially more expensive.

We need to trade off loss rate and cost, so we need a model of the cost of long-term storage. This matters because the main threat to stored data is economic. If data grows at IDC's 60%/yr, disk density grows at IHS iSuppli's 20%/yr, and IT budgets are essentially flat, the annual cost of storing a decade's accumulated data is 20 times the first year's cost.
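
One back-of-the-envelope reading of that figure, treating the data stock as growing 60%/yr and the cost per byte as falling in line with 20%/yr density growth:

    # The annual bill grows by a factor of 1.6/1.2 each year; after a decade
    # that compounds to roughly 18x, i.e. "about 20 times" the first year's cost.
    growth, kryder = 1.60, 1.20
    print((growth / kryder) ** 10)   # ≈ 17.8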

Different technologies with different media service lives involve spending different amounts of money at different times during the life of the data. To make apples-to-apples comparisons we need to use the equivalent of Discounted Cash Flow to compute the endowment needed for the data. This is the capital sum which, deposited with the data and invested at prevailing interest rates, would be sufficient to cover all the expenditures needed to store the data for its life.
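
As a concrete illustration, here is a deliberately simplified endowment calculation in Python. It is a sketch, not the LOCKSS team's actual model: it prices only the periodic purchase of replacement media, with prices falling at an assumed Kryder rate and the fund earning a fixed real interest rate, and every parameter value is an illustrative assumption (a real model would also cover replicas, power, space, labour and uncertainty).

    # Present value of buying and periodically replacing media over the life
    # of the data. All parameter values are illustrative assumptions.
    def endowment(tb, price_per_tb=30.0, kryder_rate=0.15,
                  interest_rate=0.02, drive_life=4, years=100):
        total = 0.0
        for t in range(0, years, drive_life):                        # replace media every drive_life years
            cost_then = tb * price_per_tb * (1 - kryder_rate) ** t   # media get cheaper over time
            total += cost_then / (1 + interest_rate) ** t            # discount back to today
        return total

    for k in (0.05, 0.15, 0.30):
        print(k, round(endowment(117, kryder_rate=k)))

Even this toy version shows the behaviour discussed below: the lower the Kryder rate, the larger the endowment needed and the more sensitive it is to the exact rate assumed.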

Until recently, this has not been a concern. Kryder's Law, the exponential increase in bit density on disk platters, meant the cost per byte of storage dropped about 40%/yr. If you could afford to store the data for a few years you could afford to store it "forever"; the cost rapidly became negligible.

We built an economic model of the cost of long-term storage. Here it is from 15 months ago, plotting the endowment needed for 3 replicas of a 117TB dataset to have a 98% chance of not running out of money over 100 years, against the Kryder rate, using costs from Backblaze. Each line represents a policy of keeping the drives for 1, 2, ... 5 years before replacing them.

In the past, with Kryder rates in the 30-40% range, we were in the flatter part of the graph where the precise Kryder rate wasn't that important in predicting the long-term cost. As Kryder rates decrease, we move into the steep part of the graph, which has two effects:
  • The endowment needed increases sharply.
  • The endowment needed becomes harder to predict, because it depends strongly on the precise Kryder rate.
As Randall Munroe points out, in the real world exponential growth can't continue for ever. It is always the first part of an S-curve. Just as with Moore's Law, Kryder's Law is flattening out.

Here's a graph, from Preeti Gupta at UCSC, showing that in 2010, even before the floods in Thailand doubled $/GB overnight, the curve was flattening. Currently, disk is about 7 times as expensive as it would have been had the pre-2010 Kryder's Law continued. Industry projections are for 10-20%/yr going forward - the red lines on the graph show that in 2020 disk is now expected to be 100-300 times more expensive than pre-2010 expectations.

Here, I've added S3 to Preeti's disk data, assuming S3 is equivalent to amortizing 3 disks over 3 years:
  • Initially, S3 was somewhat more expensive than raw disk. Fair enough, it includes infrastructure, running costs and Amazon's margins.
  • S3's Kryder rate is much lower than disk's, so over time it becomes a lot more expensive.
Here is our economic model comparing 3 local replicas with a 3-year drive life to S3 and Glacier. Glacier and local disk look about the same. But this isn't an apples-to-apples comparison. Disk's Kryder rate is projected as 10-20%/yr. Glacier's 1c/GB/mo rate is unlikely to change for a long time; its near-term Kryder rate will be zero. So even Glacier is significantly more expensive than doing it yourself.

The justification for using the cloud is to save money. For peak-load usage, such as intermittent computations or temporary working storage, it certainly does. Amazon can aggregate a lot of spiky demands so they pay base-load costs and charge less than peak-load prices. Customers avoid the cost of over-provisioning to cope with the peaks.

For base-load computational and, in particular, long-term storage tasks, below a certain scale the cloud probably does save money. The scale at which it stops saving money varies, but it is surprisingly low. The reason is two-fold:
  • There are no spikes for Amazon to aggregate.
  • To be cheaper than Amazon, your costs don't have to be cheaper than Amazon's, they only have to be cheaper than Amazon's costs plus Amazon's margins.
Amazon is notoriously a low-margin company, so why is this margin stacking a problem? The reason is that, so far, cloud has not been a competitive market. According to Gartner, Amazon has:

    more than five times the compute capacity in use than the aggregate total of the other fourteen providers

Amazon's margins on S3 may have been minimal at introduction; now they are extortionate, as Google has noticed:

    cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."

As the graph earlier showed, and as Glacier shows, Amazon introduces products at very competitive pricing, which it reduces much more slowly than its costs. As with books:

    Amazon was simply following in the tradition of any large company that gains control of a market. “You lower your prices until the competition is out of the picture, and then you raise your prices and get your money back,” [Steven Blake Mettee] said.

Google's recent dramatic price cuts suggest that they are determined to make the cloud a two-vendor market. But, as we see with the disk drive business itself, a two-vendor market is not a guarantee of low margins. Amazon cut prices in response to Google, but didn't go the whole way to match them. Won't they lose all their customers? No, bandwidth charges act to lock in existing customers. Getting 1PB out of S3 into a competitor in 2 months costs about 2 months of storage. If the competitor is 10% cheaper, the decision to switch doesn't start paying off for a year. Who knows if the competitor will still be 10% cheaper in a year? A competitor has to be a lot cheaper to motivate customers to switch.
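
To make the lock-in arithmetic explicit, here is the calculation implied by the figures above, as a minimal sketch: if the migration costs roughly two months of storage spend and the competitor is 10% cheaper per month, the move takes well over a year of savings to recoup.

    # Lock-in arithmetic using the figures above.
    migration_cost_in_months_of_storage = 2.0
    monthly_saving_fraction = 0.10
    print(migration_cost_in_months_of_storage / monthly_saving_fraction)   # 20 months to recoup the move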

Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. In 2009 Seagate did a study of the market for disks with an archival service life, which they could easily make, and discovered that no-one would pay the extra for them.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.

The reason that the idea of long-lived media is so attractive is that it suggests that you can design a system ignoring the possibility of media failures. You can't, because long-lived does not mean more reliable, it means that their reliability degrades more slowly with time. Even if you could it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
    Double the reliability is only worth 1/10th of 1 percent cost increase. I posted this in a different forum: Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
    The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
    Moral of the story: design for failure and buy the cheapest components you can. :-)

Update: After the talk, I updated the graph that included S3 pricing to reflect Amazon's response to Google's price cut. It shows the dramatic size of the price cut, but also that the discount you get for renting large amounts of space has been greatly reduced. Before the price cut, the first GB cost $0.095/GB/mo and half a petabyte $0.065/GB/mo, or 68% of the first GB. After the price cut, the first GB costs $0.03/GB/mo and half a petabyte costs $0.0285/GB/mo, or 95% of the first GB.
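
The volume-discount figures in the update check out as simple ratios:

    # Checking the volume-discount figures in the update.
    print(0.065 / 0.095)    # ≈ 0.68: before the cut, half a petabyte cost 68% of the first-GB rate
    print(0.0285 / 0.03)    # = 0.95: after the cut, the volume discount has almost vanished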

Ng, Cynthia: BCLA 2014: Marian Bantjes Keynote

Wed, 2014-04-02 01:33
Designer, typographer, writer, illustrator. Libraries are great (you can get free books)! Introduction to some design (commercial work), using typography and art together to promote something, e.g. Toulouse-Lautrec, Mondrian. Very visually oriented design where it was more about the artwork (not necessarily even having to do with what it’s selling). Most revered, Paul Rand encapsulated […]
