Feed aggregator

David Rosenthal: Steve Hetzler's "Touch Rate" Metric

planet code4lib - Fri, 2014-11-21 16:41
Steve Hetzler of IBM gave a talk at the recent Storage Valley Supper Club on a new, scale-free metric for evaluating storage performance that he calls "Touch Rate". He defines this as the proportion of the store's total content that can be accessed per unit time. This leads to some very illuminating graphs that I discuss below the fold.

Steve's basic graph is a log-log plot with performance increasing up and to the right. Response time for accessing an object (think latency) decreases to the right on the X-axis and the touch rate, the proportion of the total capacity that can be accessed by random reads in a year (think bandwidth) increases on the Y-axis. For example, a touch rate of 100/yr means that random reads could access the entire contents 100 times a year. He divides the graph into regions suited to different applications, with minimum requirements for response time and touch rate. So, for example, transaction processing requires response times below 10ms and touch rates above 100 (the average object is accessed about once every 3 days).

The touch rate depends on the size of the objects being accessed. If you take a specific storage medium, you can use its specifications to draw a curve on the graph as the size varies. Here Steve uses "capacity disk" (i.e. commodity 3.5" SATA drives) to show the typical curve, which varies from being bandwidth limited (for large objects on the left, horizontal side) to being response limited (for small objects on the right, vertical side).
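To make the shape of that curve concrete, here is a minimal Python sketch. It assumes a simple model that the post does not spell out: each random read costs a fixed access time plus the transfer time for the object, and reads are issued back to back for a whole year. The function names and the drive figures (4 TB, 150 MB/s, 15 ms) are illustrative assumptions, not Steve's numbers.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def response_time_s(object_bytes, access_time_s, bandwidth_bytes_per_s):
    """Time to fetch one object: fixed access latency plus transfer time."""
    return access_time_s + object_bytes / bandwidth_bytes_per_s


def touch_rate_per_year(object_bytes, capacity_bytes, access_time_s,
                        bandwidth_bytes_per_s):
    """Fraction of total capacity that back-to-back random reads cover in a year."""
    per_object = response_time_s(object_bytes, access_time_s,
                                 bandwidth_bytes_per_s)
    objects_per_year = SECONDS_PER_YEAR / per_object
    return objects_per_year * object_bytes / capacity_bytes


# Hypothetical drive parameters, chosen only for illustration.
CAPACITY = 4e12      # bytes
BANDWIDTH = 150e6    # bytes per second
ACCESS_TIME = 0.015  # seconds (seek plus rotational latency)

# Sweep object size from 4 KB to 1 TB to trace the curve.
for size in (4e3, 1e6, 1e9, 1e12):
    rt = response_time_s(size, ACCESS_TIME, BANDWIDTH)
    tr = touch_rate_per_year(size, CAPACITY, ACCESS_TIME, BANDWIDTH)
    print(f"object {size:>8.0e} B  response {rt:12.4f} s  touch rate {tr:10.2f}/yr")
```

Under this model, large objects push the touch rate toward a ceiling of bandwidth times seconds per year divided by capacity (the horizontal part of the curve), while small objects push the response time toward the fixed access time (the vertical part).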

As an example of the use of these graphs, Steve analyzed the idea of MAID (Massive Array of Idle Drives). He used HGST MegaScale DC 4000.B SATA drives, and assumed that at any time 10% of them would be spun-up and the rest would be in standby. With random accesses to data objects, 9 out of 10 of them will encounter a 15sec spin-up delay, which sets the response time limit. Fully powering-down the drives as Facebook's cold storage does would save more power but increase the spin-up time to 20s. The system provides only (actually somewhat less than) 10% of the bandwidth per unit content, which sets the touch rate limit.

Then Steve looked at the fine print of the drive specifications. He found two significant restrictions:
  • The drives have a life-time limit of 50K start/stop cycles.
  • For reasons that are totally opaque, the drives are limited to a total transfer of 180TB/yr.
Applying these gives this modified graph. The 180TB/yr limit is the horizontal line, reducing the touch rate for large objects. If the drives have a 4-year life, we would need 8M start/stop cycles to achieve a 15sec response time. But we only have 50K. To stay within this limit, the response time has to increase by a factor of 8M/50K, or 160, which is the vertical line. So in fact a traditional MAID system is effective only in the region below the horizontal line and left of the vertical line, much smaller than expected.
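The arithmetic behind the vertical line can be checked in a few lines of Python. This sketch assumes the 8M figure comes from allowing one start/stop cycle per 15sec response window over the 4-year life; that derivation is an inference rather than something stated in the talk, and the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

drive_life_years = 4
spin_up_response_s = 15   # response time set by the spin-up delay
rated_cycles = 50_000     # lifetime start/stop limit from the drive spec

# Assumption: one start/stop cycle per response window, for the whole drive life.
cycles_needed = drive_life_years * SECONDS_PER_YEAR / spin_up_response_s

# How far the response time must stretch to stay within the rated cycles.
stretch_factor = cycles_needed / rated_cycles
limited_response_s = spin_up_response_s * stretch_factor

print(f"cycles needed:    {cycles_needed:,.0f}")
print(f"stretch factor:   {stretch_factor:.1f}")
print(f"response time:    {limited_response_s / 60:.1f} minutes")
```

Unrounded, the cycle count is about 8.4M and the factor comes out nearer 168; the 160 above follows from rounding the count down to 8M. Either way, the response time limit moves from 15 seconds to roughly 40 minutes.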

This analysis suggests that traditional MAID is not significantly better than tapes in a robot. Here, for example, Steve examines configurations varying from one tape drive for 1600 LTO6 tapes, or 4PB per drive, to a quite unrealistically expensive 1 drive per 10 tapes, or 60TB per drive. Tape drives have a 120K lifetime load/unload cycle limit, and the tapes can withstand at most 260 full-file passes, so tape has a similar pair of horizontal and vertical lines.

The reason that Facebook's disk-based cold storage doesn't suffer from the same limits as traditional MAID is that it isn't doing random I/O. Facebook's system schedules I/Os so that it uses the full bandwidth of the disk array, raising the touch rate limit to that of the drives, and reducing the number of start-stop cycles. Admittedly, the response time for a random data object is now a worst-case 7 times the time for which a group of drives is active, but this is not a critical parameter for Facebook's application.

Steve's metric seems to be a major contribution to the analysis of storage systems.

Jenny Rose Halperin: Townhall, not Shopping Mall! Community, making, and the future of the Internet

planet code4lib - Fri, 2014-11-21 15:59

I presented a version of this talk at the 2014 Futurebook Conference in London, England. They also kindly featured me in the program. Thank you to The Bookseller for a wonderful conference filled with innovation and intelligent people!

A few days ago, I was in the Bodleian Library at Oxford University, often considered the most beautiful library in the world. My enthusiastic guide told the following story:

After the Reformation when all the books in Oxford were burned, Sir Thomas Bodley decided to create a place where people could go and access all the world’s information at their fingertips, for free.

“What does that sound like?” she asked. “…the Internet?”

While this is a lovely conceit, the part of the story that resonated with me for this talk is the other big change that Bodley made, which was to work with publishers, who were largely a monopoly at that point, to fill his library for free by turning the library into a copyright library. While this seemed antithetical to the ways that publishers worked, in giving a copy of their very expensive books away, they left an indelible and permanent mark on the face of human knowledge. It was not only preservation, but self-preservation.

Bodley was what people nowadays would probably call “an innovator” and maybe even in the parlance of my field, a “community manager.”

By thinking outside of the scheme of how publishing works, he joined together with a group of skeptics and created one of the greatest knowledge repositories in the world, one that still exists more than 400 years later. This speaks to a few issues:

Sharing economies, community, and publishing should and do go hand in hand and have since the birth of libraries. By stepping outside of traditional models, you are creating a world filled with limitless knowledge and crafting it in new and unexpected ways.

The bound manuscript is one of the most enduring technologies. This story remains relevant because books are still books and people are still reading them.

At the same time, things are definitely changing. For the most part, books and manuscripts were pretty much identifiable as books and manuscripts for the past 1000 years.

But what if I were to give Google Maps to a 16th Century Map Maker? Or what if I were to show Joseph Pulitzer Medium? Or what if I were to hand Gutenberg a Kindle? Or Project Gutenberg for that matter? What if I were to explain to Thomas Bodley how I shared the new Lena Dunham book with a friend by sending her the file instead of actually handing her the physical book? What if I were to try to explain Lena Dunham?

These innovations have all taken place within the last twenty years, and I would argue that we haven’t even scratched the surface in terms of the innovations that are to come.

We need to accept that the future of the printed word may vary from words on paper to an ereader or computer in 500 years, but I want to emphasize that in the 500 years to come, it will more likely vary from the ereader to a giant question mark.

International literacy rates have risen rapidly over the past 100 years and companies are scrambling to be the first to reach what they call “developing markets” in terms of connectivity. In the vein of Mark Surman’s talk at the Mozilla Festival this year, I will instead call these economies post-colonial economies.

Because we (as people of the book) are fundamentally idealists who believe that the printed word can change lives, we need to be engaged with rethinking the printed word in a way that recognizes power structures and does not settle for the limited choices that the corporate Internet provides (think Facebook vs WhatsApp). This is not a panacea to fix the world’s ills.

In the Atlantic last year, Phil Nichols wrote an excellent piece that paralleled Web literacy and early 20th century literacy movements. The dualities between “connected” and “non-connected,” he writes, impose the same kinds of binaries and blind cure-all for social ills that the “literacy” movement imposed in the early 20th century. In equating “connectedness” with opportunity, we are “hiding an ideology that is rooted in social control.”

Surman, who is director of the Mozilla Foundation, claims that the Web, which had so much potential to become a free and open virtual meeting place for communities, has started to resemble a shopping mall. While I can go there and meet with my friends, it’s still controlled by cameras that are watching my every move and its sole motive is to get me to buy things.

85 percent of North America is connected to the Internet and 40 percent of the world is connected. Connectivity increased at a rate of 676% in the past 13 years. Studies show that literacy and connectivity go hand in hand.

How do you envision a fully connected world? How do you envision a fully literate world? How can we empower a new generation of connected communities to become learners rather than consumers?

I’m not one of these technology nuts who’s going to argue that books are going to somehow leave their containers and become networked floating apparatuses, and I’m not going to argue that the ereader is a significantly different vessel than the physical book.

I’m also not going to argue that we’re going to have a world of people who are only Web literate and not reading books in twenty years. To make any kind of future prediction would be a false prophecy, elitist, and perhaps dangerous.

Although I don’t know what the printed word will look like in the next 500 years,

I want to take a moment to think outside the book,

to think outside traditional publishing models, and to embrace the instantaneousness, randomness, and spontaneity of the Internet as it could be, not as it is now.

One way I want you to embrace the wonderful wide Web is to try to at least partially decouple your social media followers from your community.

Twitter and other forms of social media are certainly a delightful and fun way for communities to communicate and get involved, but your viral campaign, if you have it, is not your community.

True communities of practice are groups of people who come together to think beyond traditional models and innovate within a domain. For a touchstone, a community of practice is something like the Penguin Labs internal innovation center that Tom Weldon spoke about this morning and not like Penguin’s 600,000 followers on Twitter. How can we bring people together to allow for innovation, communication, and creation?

The Internet provides new and unlimited opportunities for community and innovation, but we have to start managing communities and embracing the people we touch as makers rather than simply followers or consumers.

The maker economy is here: participatory content creation has become the norm rather than the exception. You have the potential to reach and mobilize 2.1 billion people and let them tell you what they want, but you have to identify leaders and early adopters and you have to empower them.

How do you recognize the people who create content for you? I don’t mean authors, but instead the ambassadors who want to get involved and stay involved with your brand.

I want to ask you, in the spirit of innovation from the edges:

What is your next platform for radical participation? How are you enabling your community to bring you to the next level? How can you differentiate your brand and make every single person you touch psyched to read your content, together? How can you create a community of practice?

Community is conversation. Your users are not your community.

Ask yourself the question Rachel Fershleiser asked when building a community on Tumblr: Are you reaching out to the people who want to hear from you and encouraging them or are you just letting your community be unplanned and organic?

There comes a point where we reach the limit of unplanned organic growth. Know when you reach this limit.

Target, plan, be upbeat, and encourage people to talk to one another without your help and stretch the creativity of your work to the upper limit.

Does this model look different from when you started working in publishing? Good.

As the story of the Bodleian Library illustrated, sometimes a totally crazy idea can be the beginning of an enduring institution.

To repeat, the book is one of the most durable technologies and publishing is one of the most durable industries in history. Its durability has been put to the test more than once, and it will surely be put to the test again. Think of your current concerns as a minor stumbling block in a history filled with success, a history that has documented and shaped the world.

Don’t be afraid of the person who calls you up and says, “I have this crazy idea that may just change the way you work…” While the industry may shift, the printed word will always prevail.

Publishing has been around in some shape or form for 1000 years. Here’s hoping that it’s around for another 1000 more.

District Dispatch: ALA Washington Office copyright event “too good to be true”

planet code4lib - Fri, 2014-11-21 15:59

(Left to right) ALA Washington Office Executive Director Emily Sheketoff, Jonathan Band, Brandon Butler and Mary Rasenberger.

On Tuesday, November 18th, the American Library Association (ALA) held a panel discussion on recent judicial interpretations of the doctrine of fair use. The discussion, entitled “Too Good to be True: Are the Courts Revolutionizing Fair Use for Education, Research and Libraries?” is the first in a series of information policy discussions to help us chart the way forward as the ongoing digital revolution fundamentally changes the way we access, process and disseminate information.

These events are part of the ALA Office for Information Technology Policy’s broader Policy Revolution! initiative—an ongoing effort to establish and maintain a national public policy agenda that will amplify the voice of the library community in the policymaking process and position libraries to best serve their patrons in the years ahead.

Tuesday’s event convened three copyright experts to discuss and debate recent developments in digital fair use. The experts (ALA legislative counsel Jonathan Band; American University practitioner-in-residence Brandon Butler; and Authors Guild executive director Mary Rasenberger) engaged in a lively discussion that highlighted some points of agreement and disagreement between librarians and authors.

The library community is a strong proponent of fair use, a flexible copyright exception that enables use of copyrighted works without prior authorization from the rights holder. Fair use can be determined by the consideration of four factors. A number of court decisions issued over the last three years have affirmed the use of copyrighted works by libraries as fair, including the mass digitization of books housed in some research libraries at issue in Authors Guild v. HathiTrust.

Band and Butler disagreed with Rasenberger on several points concerning recent judicial fair use interpretations. Band and Butler described judicial rulings on fair use in disputes like the Google Books case and the HathiTrust case as on-point, and rejected arguments that the reproductions of content at issue in these cases could result in economic injury to authors. Rasenberger, on the other hand, argued that repositories like HathiTrust and Google Books can in fact lead to negative market impacts for authors, and therefore do not represent a fair use.

Rasenberger believes that licensing arrangements should be made between authors and members of the library, academic and research communities who want to reproduce content to which the authors hold rights. She takes specific issue with judicial interpretations of market harm that require authors to demonstrate proof of a loss of profits, suggesting that such harm can be established by showing that future injury is likely to befall an author as a result of the reproduction of his or her work.

Despite their differences of opinion, the panelists provided those in attendance at Tuesday’s event with some meaningful food for thought, and offered a thorough overview of the ongoing judicial debates over fair use. We were pleased that the Washington Internet Daily published an article “Georgia State Case Highlights Fair Use Disagreement Among Copyright Experts,” on November 20, 2014, about our session. ALA continues to fight for public access to information as these debates play out.

Stay tuned for the next event, planned for early 2015!

Cherry Hill Company: Deployment and Development workflows at Cherry Hill

planet code4lib - Fri, 2014-11-21 15:58

Last year, we reached a milestone at Cherry Hill when we moved all of our projects into a managed deployment system. We have talked about Jenkins, one of the tools that we use to manage our workflow, and there has been continued interest in what our "recipe" consists of. Because we use open source tools, and we think of ourselves as part of the (larger than Drupal) open source community, I want to share a bit more of what we use and how it is stitched together. Our hope is that this helps to spark a larger discussion of the tools others are using, so we can all learn from each other.

Git is a distributed code revision control system. While we could use any revision control system, such as CVS or Subversion (and even though this is a given with most agencies, we strongly suggest you use *some* system over nothing at all), git is fairly easy to use, has great...

Shelley Gullikson: Weekly user tests: Finding subject guides

planet code4lib - Fri, 2014-11-21 14:54

This week we did a guerrilla-style test to see how (or if) people find our subject guides, particularly if they are not in our main listing. We asked “Pretend that someone has told you there is a really great subject guide on the library website about [subject]. What would you do to find it?” We cycled through three different subjects not listed on our main subject guide page: Canadian History, Ottawa, and Homelessness.

Some Context

Our subject guides use a template created in-house (not LibGuides) and we use Drupal Views and Taxonomy to create our lists. The main subject guide page has an A-Z list, an autocomplete search box, a list of broad subjects (e.g. Arts and Social Sciences) and a list of narrower subjects (e.g. Sociology). The list of every subject guide is on another page. Subject specialists were not sure if users would find guides that didn’t correspond to the narrower subjects (e.g. Sociology of Sport).

Results

The 21 students we saw did all kinds of things to find subject guides. We purposely used the same vocabulary as what is on the site because it wasn’t supposed to be a test about the label “subject guide.” However, less than 30% clicked on the Subject Guides link; the majority used some sort of search.

Here you can see the places people went to on our home page most (highlighted in red), next most frequently (in orange) and just once (in yellow).

When people used our site search, they had little problem finding the guide (although a typo stymied one person). However, a lot of participants used our Summon search. I think there are a couple of reasons for this:

  • Students didn’t know what a subject guide was and so looked for guides the way they look for articles, books, etc.
  • Students think the Summon search box is for everything

Of the 6 students who did click on the Subject Guides link:

  • 2 used broad subjects (and neither was successful with this strategy)
  • 2 used narrow subjects (both were successful)
  • 1 used the A-Z list (with success)
  • 1 used the autocomplete search (with success)

One person thought that she couldn’t possibly find the Ottawa guide under “Subject Guides” because she thought those were only for courses. I found this very interesting because a number of our subject guides do not map directly to courses.

The poor performance of the broad subjects on the subject guide page is an issue and Web Committee will look at how we might address that. Making our site search more forgiving of typos is also going to move up the to-do list. But I think the biggest takeaway is that we really have to figure out how to get our guides indexed in Summon.


Hathi Trust Large Scale Search: Practical Relevance Ranking for 11 Million Books, Part 3: Document Length Normalization.

planet code4lib - Thu, 2014-11-20 23:21

In Part 2 we argued that most relevance ranking algorithms used for ranking text documents are based on three fundamental features:

District Dispatch: ALA welcomes Simon & Schuster change to Buy It Now program

planet code4lib - Thu, 2014-11-20 21:34

ALA President Courtney Young

Today, the American Library Association (ALA) and its Digital Content Working Group (DCWG) welcomed Simon & Schuster’s announcement that it will allow libraries to opt into the “Buy It Now” program. The publisher began offering all of its ebook titles for library lending nationwide in June 2014, with required participation in the “Buy It Now” merchandising program, which enables library users to directly purchase a title rather than check it out from the library. Simon & Schuster ebooks are available for lending for one year from the date of purchase.

In an ALA statement, ALA President Courtney Young applauded the move:

From the beginning, the ALA has advocated for the broadest and most affordable library access to e-titles, as well as licensing terms that give libraries flexibility to best meet their community needs.

We appreciate that Simon & Schuster is modifying its library ebook program to provide libraries a choice in whether or not to participate in Buy It Now. Providing options like these allow libraries to enable digital access while also respecting local norms or policies. This change also speaks to the importance of sustaining conversations among librarians, publishers, distributors and authors to continue advancing our shared goals of connecting writers and readers.

DCWG Co-Chairs Carolyn Anthony and Erika Linke also commented on the Simon & Schuster announcement:

“We are still in the early days of this digital publishing revolution, and we hope we can co-create solutions that expand access, increase readership and improve exposure for diverse and emerging voices,” said the co-chairs. “Many challenges remain including high prices, privacy concerns, and other terms under which ebooks are offered to libraries. We are continuing our discussions with publishers.”

For more library ebook lending news, visit the American Libraries magazine E-Content blog.

In the Library, With the Lead Pipe: Introducing Library Pipeline

planet code4lib - Thu, 2014-11-20 21:30

South Coast Pipe by Colm Walsh (CC-BY)

In Brief: We’re creating a nonprofit, Library Pipeline, that will operate independently from In the Library with the Lead Pipe, but will have similar and complementary aims: increasing and diversifying professional development; improving strategies and collaboration; fostering more innovation and start-ups, and encouraging LIS-related publishing and publications. In the Library with the Lead Pipe is a platform for ideas; Library Pipeline is a platform for projects.

At In the Library with the Lead Pipe, our goal has been to change libraries, and the world, for the better. It’s on our About page: We improve libraries, professional organizations, and their communities of practice by exploring new ideas, starting conversations, documenting our concerns, and arguing for solutions. Those ideas, conversations, concerns, and solutions are meant to extend beyond libraries and into the societies that libraries serve.

What we want to see is innovation–new ideas and new projects and collaborations. Innovative libraries create better educated citizens and communities with stronger social ties.

Unfortunately, libraries’ current funding structures and the limited professional development options available to librarians make it difficult to introduce innovation at scale. As we started talking about a couple of years ago, in our reader survey and in a subsequent editorial marking our fourth anniversary, we need to extend into other areas, besides publication, in order to achieve our goals. So we’re creating a nonprofit, Library Pipeline, that will operate independently from In the Library with the Lead Pipe, but will have similar and complementary aims.

Library Pipeline is dedicated to supporting structural changes by providing opportunities, funding, and services that improve the library as an institution and librarianship as a profession. In the Library with the Lead Pipe, the journal we started in 2008, is a platform for ideas; Library Pipeline is a platform for projects. Although our mission is provisional until our founding advisory board completes its planning process, we have identified four areas in which modest funding, paired with guidance and collaboration, should lead to significant improvements.

Professional Development

A few initiatives, notably the American Library Association’s Emerging Leaders and Spectrum Scholars programs, increase diversity and provide development opportunities for younger librarians. We intend to expand on these programs by offering scholarships, fellowships, and travel assistance that enable librarians to participate in projects that shift the trajectory of their careers and the libraries where they work.

Collaboration

Organized, diverse groups can solve problems that appear intractable if participants have insufficient time, resources, perspective, or influence. We would support collaborations that last a day, following the hack or camp model, or a year or two, like task forces or working groups.

Start-ups

We are inspired by incubators and accelerators, primarily YCombinator and SXSW’s Accelerator. The library and information market, though mostly dormant, could support several dozen for-profit and nonprofit start-ups. The catalyst will be mitigating founders’ downside risk by funding six months of development, getting them quick feedback from representative users, and helping them gain customers or donors.

Publishing

Librarianship will be stronger when its practitioners have as much interest in documenting and serving our own field as we have in supporting the other disciplines and communities we serve. For that to happen, our professional literature must become more compelling, substantive, and easier to access. We would support existing open access journals as well as restricted journals that wish to become open access, and help promising writers and editors create new publications.

These four areas overlap by design. For example, we envision an incubator for for-profit and nonprofit companies that want to serve libraries. In this example, we would provide funding for a diverse group of library students, professionals, and their partners who want to incorporate, and bring this cohort to a site where they can meet with seasoned librarians and entrepreneurs. After a period of time, perhaps six months, the start-ups would reconvene for a demo day attended by potential investors, partners, donors, and customers.

Founding Advisory Board

We were inspired by the Constellation Model for our formation process, as adapted by the Digital Public Library of America and the National Digital Preservation Alliance (see: “Using Emergence to Take Social Innovation to Scale”). Our first step was identifying a founding advisory board, whose members have agreed to serve a two-year term (July 2014-June 2016), at the end of which the board will be dissolved and replaced with a permanent governing board. During this period, the advisory board will formalize and ratify Library Pipeline’s governance and structure, establish its culture and business model, promote its mission, and define the organizational units that will succeed the advisory board, such as a permanent board of trustees and paid staff.

The members of our founding advisory board are:

The board will coordinate activity among, and serve as liaisons to, the volunteers on what we anticipate will eventually be six subcommittees (similar to DPLA’s workstreams). This is going to be a shared effort; the job is too big for ten people. Those six subcommittees and their provisional charges are:

  • Professional Development within LIS (corresponding to our “Professional Development” area). Provide professional development funding, in the form of scholarships, fellowships, or travel assistance, for librarians or others who are working on behalf of libraries or library organizations, with an emphasis on participation in cross-disciplinary projects or conferences that extend the field of librarianship in new directions and contribute to increased diversity among practitioners and the population we serve.
  • Strategies for LIS (corresponding to “Collaboration”). Bring together librarians and others who are committed to supporting libraries or library-focused organizations. These gatherings could be in-person or online, could last a day or could take a year, and could be as basic as brainstorming solutions to a timely, significant issue or as directed as developing solutions to a specific problem.
  • Innovation within LIS (corresponding to “Start-Ups”). Fund and advise library-related for-profit or nonprofit startups that have the potential to help libraries better serve their communities and constituents. We believe this area will be our primary focus, at least initially.
  • LIS Publications (corresponding to “Publishing”). Fund and advise LIS publications, including In the Library with the Lead Pipe. We could support existing open access journals or restricted journals that wish to become open access, and help promising writers and editors create new publications.
  • Governance. This may not need to be a permanent subcommittee, though in our formative stages it would be useful to work with people who understand how to create governance structures that provide a foundation that promotes stability and growth.
  • Sustainability. This would include fundraising, but it also seems to be the logical committee for creating the assessment metrics we need to have in place to ensure that we are fulfilling our commitment to libraries and the people who depend on them.
How Can You Help?

We’re looking for ideas, volunteers, and partners. Contact Brett or Lauren if you want to get involved, or want to share a great idea with us.

District Dispatch: ALA and E-rate in the press

planet code4lib - Thu, 2014-11-20 21:21

For nearly a year-and-a-half, the FCC has been engaged in an ongoing effort to update the E-rate program for the digital age. The American Library Association (ALA) has been actively engaged in this effort, submitting comments and writing letters to the FCC and holding meetings with FCC staff and other key E-rate stakeholders.

Our work on the E-rate modernization has drawn the attention of several media outlets over the past week, as the FCC prepares to consider an order that we expect to help libraries from the most populated cities to the most rural areas meet their needs related to broadband capacity and Wi-Fi:

The FCC Plans to Increase Your Phone Bill to Build Better Internet in Schools (ALA quoted)
E-Rate Funding Would Get Major Boost Under FCC Chair’s Plan
FCC’s Wheeler Draws Fans With E-Rate Cap Hike
Is expanding Wi-Fi to 10 million more students worth a cup of coffee?

ALA was also mentioned in articles from CQ Roll Call and PoliticoPro on Monday.

The new E-rate order is the second in the E-rate modernization proceeding. The FCC approved a first order on July 11th, which focuses on Wi-Fi and internal connections. ALA applauds the FCC for listening to our recommendations throughout the proceeding. Its work reflects an appreciation for all that libraries do to serve community needs related to Education, Employment, Entrepreneurship, Empowerment, and Engagement—the E’s of Libraries.

The post ALA and E-rate in the press appeared first on District Dispatch.

Open Library: Open Library Scheduled Hardware Maintenance

planet code4lib - Thu, 2014-11-20 17:10

Image from Automobile maintenance by Ray F. Kuns

Open Library will be down from 4:30PM to approximately 5:00PM (PST, UTC/GMT -7 hours) on Thursday November 20, 2014 due to scheduled hardware maintenance. We’ll post updates here and on @openlibrary twitter. Thank you for your cooperation.

FOSS4Lib Recent Releases: Torus - 2.30

planet code4lib - Thu, 2014-11-20 16:42
Package: Torus
Release Date: Thursday, November 20, 2014

Last updated November 20, 2014. Created by Peter Murray on November 20, 2014.

2.30 Thu 20 Nov 2014 11:34:12 CET

- MKT-168: fix parent's 'created' lost during update

- MKT-170: bootstrap 'originDate' for non-inherit records

OCLC Dev Network: Learning Linked Data: SPARQL

planet code4lib - Thu, 2014-11-20 16:15

One thing you realize pretty quickly is that it is very hard to work with Linked Data and just confine one’s explorations to a single site or data set. The links inevitably lead you on a pilgrimage from one data set to another and another. In the case of the WorldCat Discovery API, my pilgrimage led me from WorldCat to id.loc.gov, FAST and VIAF and from VIAF on to dbpedia. Dbpedia is an amazingly fun data set to play with. Using it to provide additional richness and context to the discovery experience has been enlightening.

HangingTogether: Libraries & Research: Changes in libraries

planet code4lib - Thu, 2014-11-20 14:16

[This is the fourth in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first, second, and third posts and also refer to the event webpage that contains links to slides, videos, photos, and a Storify summary.]

And now, onward to the final session of the meeting, which focused appropriately enough on changes in libraries, including new roles and preparing to support future service demands. Libraries are engaging in new alliances and are restructuring themselves to prepare for change in accordance with their strategic plans.

[Paul-Jervis Heath, Lynn Silipigni Connaway, and Jim Michalko]

Lynn Silipigni Connaway (Senior Research Scientist, OCLC Research) [link to video] shared the results of several studies that identify the importance of user-centered assessment and evaluation. Lynn has been working actively in this area since 2003, looking at not only researchers but also future researchers (students!). In interviews on virtual reference, focusing on prospective users, Lynn and her team found that students use Google and Wikipedia but also rely on human resources: other students, advisers, graduate students and faculty. In looking through years of data, interviewees tend to use generic terms like “database” and refer to specific tools and sources only when they are further along in their career; this doesn’t mean they don’t use them, rather, they get used to using more sophisticated terminology as they go along. No surprise, convenience trumps everything; researchers at all levels are eager to optimize their time so many “satisfice” if the assignment or task doesn’t warrant extra time spent. From my perspective, one of the most interesting findings from Lynn’s studies relates to students’ somewhat furtive use of Wikipedia, which she calls the Learning Black Market (students look up something in Google, find sources in Wikipedia, copy and paste the citation into their paper!). Others use Facebook to get help. Some interesting demographic differences: more established researchers use Twitter, and use of Wikipedia declines as researchers get more experience. With regard to the library, engagement around new issues (like data management) causes researchers to think anew about ways the library might be useful. Although researchers of all stripes will reach out to humans for help, librarians rank low on that list. Given all of these challenges, there are opportunities for librarians and library services: be engaging and be where researchers are, both physically and virtually.
We should always assess what we are doing: keep doing what’s working, cut or reinvent what is not. Lynn’s presentation provides plenty of links and references for you to check out.

Paul-Jervis Heath (Head of Innovation & Chief Designer, University of Cambridge) [link to video] spoke from the  perspective of a designer, not a librarian (he has worked on smart homes, for example). He shared findings from recent work with the Cambridge University libraries. Because of disruption, libraries face a perfect storm of change in teaching, funding, and scholarly communications. User expectations are formed by consumer technology. While we look for teachable moments, Google and tech companies do not — they try to create intuitive experiences. Despite all the changes, libraries don’t need to sit on the sidelines, they can be engaged players. Design research is important and distinguished from market research in that it doesn’t measure how people think but how they act. From observation studies, we can see that students want to study together in groups, even if they are doing their own thing. The library needs to be optimized for that. Another technique employed, asking students to use diaries to document their days. Many students prefer the convenience of studying in their room but what propels them to the library is the desire to be with others in order to focus. At Cambridge, students have a unique geographic triangle defined by where they live, the department where they go to class, and the market they prefer to shop in. Perceptions about how far something (like the library) is outside of the triangle are relative. Depending on how far your triangle points are, life can be easy or hard. Students are not necessarily up on technology so don’t make assumptions. It turns out that books (the regular, paper kind) are great for studying! But students use ebooks to augment their paper texts, or will use when all paper books are gone. Shadowing (with permission) is another technique which allows you to immerse yourself in a researcher’s life and understand their mental models. 
Academics wear a lot of different hats, play different roles within the university and are too pressed for time to learn new systems. It’s up to the library to create efficiencies and make life easier for researchers. Paul closed by emphasizing six strategic themes: transition from physical to digital; library spaces; sustainable classic library services; supporting research and scholarly communications; making special collections more available; and creating touchpoints that will bring people back to the library seamlessly.

Jim Michalko (Vice President, OCLC Research Library Partnership) [link to video] talked about his recent work looking at library organizational structures and restructuring. (Jim will be blogging about this work soon, so I won’t give more than a few highlights.) For years, libraries have been making choices about what to do and how to do it, and libraries have been reorganizing themselves to get this (new) work done. Jim gathered feedback from 65 institutions in the OCLC Research Library Partnership and conducted interviews with a subset of those, in order to find out if structure indeed follows strategy. Do new structures represent markets or adjacent strategies (in business speak)? We see libraries developing capacities in customer relationship management and we see this reflected in user-focused activities. Almost all institutions interviewed were undertaking restructuring based on changes external to the library, such as new constituencies and expectations. Organizations are orienting themselves to be more user centered, and to align themselves with a new direction taken by the university. We see many libraries bringing in skill sets beyond those normally found in the library package. Many institutions charged a senior position with helping to run a portion of a regional or national service. Other similarities: all had a lot of communication about restructuring. Almost all restructurings were also related to a space plan.

This session was followed by a discussion session and I invite you to watch it, and also to watch this lovely summary of our meeting delivered by colleague Titia van der Werf (less than 7 minutes long and worth watching!):

If you attended the meeting or were part of the remote viewing audience for all or part of it, or if you watched any of the videos, I hope you will leave some comments with your reactions. Thanks for reading!

About Merrilee Proffitt


Library of Congress: The Signal: All the News That’s Fit to Archive

planet code4lib - Thu, 2014-11-20 14:03

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The Library has had a web archiving program since the early 2000s.  As with other national libraries, the Library of Congress web archiving program started out harvesting the web sites of its national election campaigns, followed by some collections to harvest sites for a period of time connected with events (for example, an Iraq War web archive and a papal transition 2005 web archive), along with collecting the sites of the U.S. House and Senate and the legislative branch of government more broadly.

An American of the 1930s getting his news by reading a newspaper. These days he’d likely be looking at a computer screen. Photo courtesy of the Library of Congress Prints and Photographs division.

The question for the Library of Congress of “what else” to harvest beyond these collections is harder to answer than one might think because of the relatively small web archiving capacity of the Library of Congress (which is influenced by our permissions approach) compared to the vast immenseness of the Internet.  About six years ago we started a collection now known as the Public Policy Topics, for which we would acquire sites with content reflecting different viewpoints and research on a broad selection of public policy questions, including the sites of national political parties, selected advocacy organizations and think tanks and other organizations with a national voice in America’s policy discussions that could be of interest to future researchers.  We are adding more sites to Public Policy Topics continuously.

Eventually I decided to include some news web sites that contained significant discussion of policy issues from particular points of view – sites ranging from DailyKos.com to Townhall.com, from TruthDig.com to Redstate.com.  We started crawling these sites on a weekly basis to try to assure complete capture over time and to build a representation of how the site looked as different news events came and went in the public consciousness (and on these web sites).  We have been able to assess the small number of such sites that we have crawled and have decided that the results are acceptable.  But this was obviously not a very large-scale effort compared to the increasing number of sites presenting general news on the Internet - for many people, their current equivalent of a newspaper.

Newspapers – they are a critical source for historical research and the Library of Congress has a long history of collecting and providing access to U.S. (and other countries’) newspapers.  Having started to collect a small number of “newspaper-like” U.S. news sites for the Public Policy Topics collection, I began a conversation with three reference librarian colleagues from the Newspaper & Current Periodical Reading Room – Amber Paranick, Roslyn Pachoca and Gary Johnson – about expanding this effort to a new collection, a “General News on the Internet” web archive.  They explained to me:

Our newspaper collections are invaluable to researchers.  Newspapers provide a first-hand draft of history.  They provide supplemental information that cannot be found anywhere else.  They ‘fill in the gaps,’ so to speak. The way people access news has been changing and evolving ever since newspapers were first being published. We recognized the need to capture news published in another format.  It is reasonable to expect us to continue to connect these kinds of resources to our current and future patrons. Websites tend to be ephemeral and may disappear completely.  Without a designated archive, critical news content may be lost.

In short, my colleagues shared my interest, concern and enthusiasm for starting a larger collection of Internet-only general news sites as a web archiving collection.  I’ll let them explain their thinking further:

When we first got started on the project, we weren’t sure how to proceed.  Once we established clear boundaries on what to include, what types of news sites would be within scope for this collection, our selection process became easier. We asked for help in finding websites from our colleagues. 

We felt it was important to include sites that focus on general news with significant national presence where there are articles that have an author’s voice, such as with HuffingtonPost.com or BuzzFeed.com (even as some of these sites also contain articles that are meant to attract visitors, so-called “click bait”).  We wanted to include a variety of sites that represent more cutting edge ways of presenting general news, such as Vox.com and TheVerge, and we felt sites that focus on parody such as TheOnion.com were also important to have represented.  Of course, these sites are not the only sources from which people obtain their news, but we tried to choose a variety that included more trendy or popular sources as well as the conventional or traditional types.  Again, the idea is to assure future users have access to a significant representation of how Americans accessed news at this time using the Internet.

The Library of Congress has an internal process for proposing new web archiving collections.  I worked with Amber, Roslyn and Gary and they submitted a “General News on the Internet” project proposal and it was approved.  Yay!  Then the work began – Amber, Roslyn and Gary describe some of the hurdles:

We understand that archiving video content is a problem. We thought websites like NowThisNews.com could be great candidates but in effect, because they contained so much video and a kind of Tumblr-like portal entry point for news, we had to reject them.  Since we do not do “one hop out” crawling, the linked-to content that is the substantive content (i.e., the news) would be entirely missed.   Also, websites like Vice.com change their content so frequently, it might be impossible to capture all of its content.

In addition, it was decided that sites chosen would not include general news sites associated primarily with other delivery vehicles, such as CNN.com or NYTimes.com.  Many of these types also have paywalls and therefore obviously would create limitations when trying to archive.

We also encountered another type of challenge with Drudgereport.com.  Since it is primarily a news-aggregator with most of the site consisting of links to news on other sites it would be tough to include the many links with the limitations in crawling (again, the “one hop” limitation – we don’t harvest links that are on a different URL).  In the end we decided to proceed in archiving The Drudge Report site since it is well known for the content that is original to that site.

The harvesting for this collection has now been underway for several months; we are examining the results.  We look forward to making an archived version of today’s news as brought to you by the Internet available to Library of Congress patrons for many tomorrows.

What news sites do you think we should collect?

David Rosenthal: Talk "Costs: Why Do We Care?"

planet code4lib - Thu, 2014-11-20 09:22
Investing in Opportunity: Policy Practice and Planning for a Sustainable Digital Future sponsored by the 4C project and the Digital Preservation Coalition featured a keynote talk each day. The first, by Fran Berman, is here.

Mine was the second, entitled Costs: Why Do We Care? It was an update and revision of The Half-Empty Archive, stressing the importance of collecting, curating and analyzing cost data. Below the fold, an edited text with links to the sources.
Introduction
I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries, and I have two reasons for being especially happy to be here today. First, I'm a Londoner. Second, under the auspices of JISC the UK has been a very active participant in the LOCKSS program since 2006. As with all my talks, you don't need to take notes or ask for the slides. The text of the talk, with links to the sources, will go up on my blog shortly.

Why do I think I am qualified to stand here and pontificate about preservation costs? The LOCKSS Program develops, and supports users of, the LOCKSS digital preservation technology. This is a peer-to-peer system designed to let libraries collect and preserve copyright content published on the Web, such as e-journals and e-books. LOCKSS users participate in a number of networks customized for these and other forms of content including government documents, social science datasets, library special collections, and so on. One of these networks, the CLOCKSS archive, a community-managed dark archive of e-journals and e-books, was recently certified to the Trusted Repository Audit Criteria, equalling the previous highest score and gaining the first-ever perfect score for technology. The LOCKSS software is free open source; the LOCKSS team charges for support and services. On that basis, with no grant funding, for more than 7 years we have covered our costs and accumulated some reserves.

Because understanding and controlling our costs is very important for us, and because the LOCKSS system's Lots Of Copies trades using more disk space for using less of other resources (especially lawyers), I have been researching the costs of storage for some years.

Like all of you, the LOCKSS team has to plan and justify our budget each year. It is clear that economic failure is one of the most significant threats to the content we preserve, as it is even for the content national libraries preserve. For each of us individually the answer to "Costs: Why Do We Care?" is obvious. But I want to talk about why the work we are discussing over these two days, of collecting, curating, normalizing, analyzing and disseminating cost information about digital curation and preservation, is important not just at an individual level but for the big picture of preservation. What follows is in three sections:
  • The current situation.
  • Cost trends.
  • What can be done?
The Current Situation
How well are we doing at the task of preservation? Attempts have been made to measure the probability that content is preserved in some areas; e-journals, e-theses and the surface Web:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Luis Faria and co-authors (PDF) compare information extracted from journal publishers' web sites with the Keepers Registry and conclude: "We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added."
  • The Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL.
    Their results are somewhat difficult to interpret, but for their two more random samples they report: URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
So, are we preserving half the stuff that should be preserved? Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.
An Optimistic Assessment
First, the assessment isn't risk-adjusted:
  • As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation, so what have those resources achieved for future readers?
  • As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
Second, the assessment isn't adjusted for difficulty:
  • A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
  • The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, where negotiating once with a small publisher gains a small amount of high-risk content.
  • Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking:
  • As regards scholarly communication it looks only at the traditional forms, books and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly both at much higher risk of loss than the traditional forms that are being preserved, because they lack well-established and robust business models, and much more difficult to preserve, since the legal framework is unclear and the content is either much larger, or much more dynamic, or in some cases both.
  • As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.
Fourth, the assessment is likely to suffer measurement bias:
  • The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
  • As regards Web content, Ainsworth et al describe various forms of bias in their paper.
As Cliff Lynch pointed out in his summing-up of the 2014 IDCC conference, the scholarly literature and the surface Web are genres of content for which the denominator of the fraction being preserved (the total amount of genre content) is fairly well known, even if it is difficult to measure the numerator (the amount being preserved). For many other important genres, even the denominator is becoming hard to estimate as the Web enables a variety of distribution channels:
  • Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
  • YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
  • Much music now happens on YouTube (e.g. Pomplamoose).
  • Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories much still resides on grad students' laptops.
Of course, "what we should be preserving" is a judgement call, but clearly even purists who wish to preserve only stuff to which future scholars will undoubtedly require access would be hard pressed to claim that half that stuff is preserved.
Preserving the Rest
Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?
  • We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be that we can't afford to preserve it.
  • We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels.
  • We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.
Such a radical re-think requires understanding where the costs go in our current preservation methodology, and how they can be funded. As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the research into past costs is that ingest takes half the lifetime cost, preservation takes one third, and access takes one sixth.

On this basis, one would think that the most important thing to do would be to reduce the cost of ingest. It is important, but not as important as you might think. The reason is that ingest is a one-time, up-front cost. As such, it is relatively easy to fund. In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways that ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, for "ever". Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will have a much bigger effect on sustainability than equivalent reductions in ingest costs.
Cost Trends
We've been able to ignore this problem for a long time, for two reasons. The first is that from at least 1980 to 2010 storage costs followed Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.
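
The arithmetic behind the first of these reasons can be sketched as a geometric series. This is a minimal illustration, assuming a constant annual price drop and ignoring interest, running costs and hardware replacement; the function name lifetime_cost_multiple is hypothetical and does not come from the talk.

```python
def lifetime_cost_multiple(kryder_rate: float) -> float:
    """Cost of storing data for ever, as a multiple of the first year's cost.

    Assumption (illustrative only): each year's cost is the previous year's
    reduced by `kryder_rate`, so the yearly costs form the geometric series
    1 + (1-k) + (1-k)^2 + ... which sums to 1/k.
    """
    if not 0.0 < kryder_rate <= 1.0:
        raise ValueError("kryder_rate must be in (0, 1]")
    return 1.0 / kryder_rate


# Compare the historical 30-40%/yr range with slower rates.
for rate in (0.40, 0.30, 0.20, 0.10):
    multiple = lifetime_cost_multiple(rate)
    print(f"{rate:.0%}/yr drop: storing for ever costs {multiple:.1f}x year one")
```

Under these assumptions a 30-40%/yr drop makes "for ever" cost only about 2.5 to 3.3 times the first year's bill, which is why the tail could be ignored; at 10%/yr the multiple is 10.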

Can we continue to ignore these problems?
Preservation
Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.

This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.

These are big numbers, but do they matter? After all, preservation is only about one-third of the total, and only about one-third of that is media costs.

Our models of the economics of long-term storage compute the endowment, the amount of money that, deposited with the data and invested at interest, would fund its preservation "for ever". This graph, from my initial rather crude prototype model, is based on hardware cost data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years, against the Kryder rate (the annual percentage drop in $/GB). The different curves represent policies of keeping the drives for 1,2,3,4,5 years. Up to 2010, we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everyone believed that long-term storage was effectively free. But suppose the Kryder rate were to drop below about 20%/yr. We would be in the steep part of the graph, where the endowment needed is both much higher and also strongly dependent on the exact Kryder rate.
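
The shape of that curve can be illustrated with a much simpler calculation than the model described above. The sketch below is an assumption-laden stand-in, not the prototype model: it is deterministic (no 95% probability), ignores drive replacement policies and running costs, and uses an arbitrary 2% interest rate. The function name endowment and its parameters are hypothetical.

```python
def endowment(first_year_cost: float, kryder_rate: float,
              interest_rate: float, years: int = 100) -> float:
    """Money needed up front to pay `years` of storage bills.

    Assumptions (illustrative only): each year's bill is the previous
    year's reduced by `kryder_rate`, and each future bill is discounted
    by the interest the endowment earns until it is paid.
    """
    total = 0.0
    for year in range(years):
        bill = first_year_cost * (1.0 - kryder_rate) ** year
        total += bill / (1.0 + interest_rate) ** year
    return total


# Endowment as a multiple of the first year's cost, at 2% interest.
for rate in (0.40, 0.30, 0.20, 0.10, 0.05):
    needed = endowment(1.0, rate, 0.02)
    print(f"Kryder rate {rate:.0%}: endowment = {needed:.2f}x first-year cost")
```

Even in this toy version, the endowment changes little between 40% and 30% but grows quickly once the rate falls below about 20%, which is the flat-then-steep behaviour the graph describes.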

We don't need to suppose. Preeti's graph and industry projections show that now and for the foreseeable future we are in the steep part of the graph. What happened to slow Kryder's Law? There are a lot of factors; we outlined many of them in a paper for UNESCO's Memory of the World conference (PDF). Briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive market into one with much better margins. Each successive technology generation requires a much bigger investment in manufacturing, so requires bigger margins, so drives consolidation. And the technology needs to stay in the market longer to earn back the investment, reducing the rate of technological progress.

Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken into account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.

You may think that the cloud is a competitive market; in fact it is dominated by Amazon.
Jillian Mirandi, senior analyst at Technology Business Research Group (TBRI), estimated that AWS will generate about $4.7 billion in revenue this year, while comparable estimated IaaS revenue for Microsoft and Google will be $156 million and $66 million, respectively.
When Google recently started to get serious about competing, they pointed out that although Amazon's margins may have been minimal at introduction, by then they were extortionate:
cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."
Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.

In fact commercial cloud storage is a trap. It is free to put data into a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years. If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.

Unrealistic expectations that we can collect and store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at great risk of economic failure. Here are three numbers that illustrate the looming crisis in long-term storage, its cost:
Here's a graph that projects these three numbers out for the next 10 years. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at computereconomics.com's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC, all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget.
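The arithmetic behind such a projection can be sketched in a few lines. The accumulation model here is my assumption, not a statement of how the graph was produced: the amount of new data added each year grows at 60%, everything accumulated so far is stored at the current year's price per byte, and that price falls at 20%/yr. The function name and the 5% starting share are illustrative.

```python
def project_storage_cost(years=10, kryder_rate=0.20, data_growth=0.60,
                         budget_growth=0.02, initial_share=0.05):
    """Return (year, cost relative to year 0, share of IT budget) tuples.

    Assumed model: new data added per year grows by data_growth, all
    accumulated data is stored at the current year's unit price, and
    the unit price falls by kryder_rate per year.
    """
    rows = []
    accumulated = 0.0
    for year in range(years + 1):
        accumulated += (1 + data_growth) ** year   # data added this year
        unit_cost = (1 - kryder_rate) ** year       # price per byte this year
        cost = accumulated * unit_cost              # equals 1.0 in year 0
        budget = (1 + budget_growth) ** year
        rows.append((year, cost, initial_share * cost / budget))
    return rows


if __name__ == "__main__":
    for year, cost, share in project_storage_cost():
        print(f"year {year:2d}: cost x{cost:5.1f}, share of budget {share:.0%}")
```

Under these assumptions the relative cost in year 10 is well over 20 and a 5% starting share grows past the whole budget, consistent with the figures quoted above.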
Dissemination

The storage part of preservation isn't the only on-going cost that will be much higher than people expect; access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this can be a real difficulty; the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.

But the advent of "Big Data" techniques means that, going forward, scholars increasingly want not to access a few individual items in a collection, but to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library could afford to do with the feed is to write two copies to tape. There's no way they could afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.
Ingest

Almost everyone agrees that ingest is the big cost element. Where does the money go? The two main cost drivers appear to be the real world, and metadata.

In the real world it is natural that the cost per unit content increases through time, for two reasons. The content that's easy to ingest gets ingested first, so over time the difficulty of ingestion increases. And digital technology evolves rapidly, mostly by adding complexity. For example, the early Web was a collection of linked static documents. Its language was HTML. It was reasonably easy to collect and preserve. The language of today's Web is Javascript, and much of the content you see is dynamic. This is much harder to ingest. In order to find the links much of the collected content now needs to be executed as well as simply being parsed. This is already significantly increasing the cost of Web harvesting, both because executing the content is computationally much more expensive, and because elaborate defenses are required to protect the crawler against the possibility that the content might be malign.

It is worth noting, however, that the very first US web site in 1991 featured dynamic content, a front-end to a database!

The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.

Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

Metadata in the real world is widely known to be of poor quality, both format and bibliographic kinds. Efforts to improve the quality are expensive, because they are mostly manual and, inevitably, reducing entropy after it has been generated is a lot more expensive than not generating it in the first place.
What can be done?

We are preserving less than half of the content that needs preservation. The cost per unit content of each stage of our current processes is predicted to rise. Our budgets are not predicted to rise enough to cover the increased cost, let alone more than doubling to preserve the other more than half. We need to change our processes to greatly reduce the cost per unit content.
Preservation

It is often assumed that, because it is possible to store and copy data perfectly, only perfect data preservation is acceptable. There are two problems with this expectation.

To illustrate the first problem, let's examine the technical problem of storing data in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
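The half-life figure follows from treating each bit as an independent exponential decay. A minimal sketch of the arithmetic, assuming a Petabyte is 10^15 bytes and taking the age of the universe as roughly 13.8 billion years (both constants are my assumptions, as are the names):

```python
import math

PETABYTE_BITS = 8e15            # assumed: 10**15 bytes of 8 bits each
CENTURY_YEARS = 100
AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumed round figure


def required_bit_half_life(n_bits, years, survival_probability=0.5):
    """Half-life each bit needs so that all n_bits survive `years`
    with the given probability, assuming independent exponential decay.

    P(no bit flips) = exp(-n_bits * decay_rate * years)
    """
    decay_rate = -math.log(survival_probability) / (n_bits * years)
    return math.log(2) / decay_rate


if __name__ == "__main__":
    half_life = required_bit_half_life(PETABYTE_BITS, CENTURY_YEARS)
    print(f"required bit half-life: {half_life:.1e} years")
    print(f"in ages of the universe: {half_life / AGE_OF_UNIVERSE_YEARS:.1e}")
```

With a 50% survival target the required half-life is simply the number of bits times the number of years, which is where the tens of millions of universe lifetimes come from.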

Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10^-11. If the average object is 10K bytes, the bit half-life is about a million years, way too short to meet the requirement but still really hard to measure.

Note that the 10^-11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.

Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure
To illustrate the second problem, consider that building systems to defend against all these threats combined is expensive, and can't ever be perfectly effective. So we have to resign ourselves to the fact that stuff will get lost. This has always been true; it should not be a surprise. And it is subject to the law of diminishing returns. Coming back to the economics, how much should we spend reducing the probability of loss?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
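The decade figure can be checked with a short loop. This is a sketch under one particular reading of the scenario, which is my assumption: each system adds a fixed amount per year (the cheaper one twice as much), and the loss rate is applied to everything held at the end of each year, including that year's additions. The function name is illustrative.

```python
def preserved_after(years, units_added_per_year, annual_loss_rate):
    """Content still held after `years`, assuming the loss rate is
    applied to the accumulated content at the end of every year."""
    stored = 0.0
    for _ in range(years):
        stored = (stored + units_added_per_year) * (1 - annual_loss_rate)
    return stored


if __name__ == "__main__":
    perfect = preserved_after(10, units_added_per_year=1, annual_loss_rate=0.0)
    cheap = preserved_after(10, units_added_per_year=2, annual_loss_rate=0.01)
    print(f"cheaper system preserves {cheap / perfect:.2f}x as much after a decade")
```

Applying the loss before rather than after each year's additions shifts the result slightly, so the exact ratio depends on that modelling choice.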

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 150th most visited site, whereas loc.gov is the 1519th. For UK users archive.org is currently the 131st most visited site, whereas bl.uk is the 2744th.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.

Unrealistic expectations for how well data can be preserved make the best be the enemy of the good. We spend money reducing even further the small probability of even the smallest loss of data that could instead preserve vast amounts of additional data, albeit with a slightly higher risk of loss.

Within the next decade all current popular storage media, disk, tape and flash, will be up against very hard technological barriers. A disruption of the storage market is inevitable. We should work to ensure that the needs of long-term data storage will influence the result. We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity, and custom hardware based on mostly spun-down disks and DVDs to achieve major cost savings for cold data at scale. 

Every few months there is another press release announcing that some new, quasi-immortal medium such as fused silica glass or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
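These are compound-rate calculations, and a minimal sketch makes them easy to rerun with other rates. It assumes that the space needed per byte falls in step with the Kryder rate, which is a simplification; the function names are illustrative.

```python
def relative_space(kryder_rate, years):
    """Fraction of the original space needed for the same data,
    assuming space per byte falls by kryder_rate every year."""
    return (1 - kryder_rate) ** years


def relative_capacity(kryder_rate, years):
    """How many times more data fits in the same space after `years`."""
    return 1 / relative_space(kryder_rate, years)


if __name__ == "__main__":
    print(f"10%/yr for 10 years: {relative_space(0.10, 10):.2f} of the space")
    print(f"30%/yr for 10 years: {relative_capacity(0.30, 10):.0f}x the data")
```

At 10%/yr the result is close to one third of the space; at 30%/yr a decade gives a multiple in the thirties, the same order as the figure quoted above.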

There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than magnetic or solid state storage. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the archive problem. But access will always be slow enough that the data would have to be really cold before being committed to DNA.

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:

  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)

Dissemination

The real problem here is that scholars are used to having free access to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data at no charge, charging scholars accessing the data for the computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requests would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud, they aren't locked-in.
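The charge-back arithmetic is simple enough to sketch, using only the monthly figure and request count given above; the function name is illustrative and the even split among requesters is an assumption.

```python
def annual_charge_per_requester(monthly_storage_cost, requesters):
    """Split a year's storage cost evenly among the requesters."""
    annual_cost = monthly_storage_cost * 12
    return annual_cost, annual_cost / requesters


if __name__ == "__main__":
    annual, per_requester = annual_charge_per_requester(4100, 400)
    print(f"annual storage cost: ${annual:,}")
    print(f"per requester: ${per_requester:.0f}")
```

The same function shows how quickly the per-scholar charge falls as the number of requesters grows, which is what makes the approach plausible for popular collections.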

In the near term, separating the access and preservation copies in this way is a promising way not so much to reduce the cost of access, but to fund it more realistically by transferring it from the archive to the user. In the longer term, architectural changes to preservation systems that closely integrate limited amounts of computation into the storage fabric have the potential for significant cost reductions to both preservation and dissemination. There are encouraging early signs that the storage industry is moving in that direction.
Ingest

There are two parts to the ingest process, the content and the metadata.

The evolution of the Web that poses problems for preservation also poses problems for search engines such as Google. Where they used to parse the HTML of a page into its Document Object Model (DOM) in order to find the links to follow and the text to index, they now have to construct the CSS object model (CSSOM), including executing the Javascript, and combine the DOM and CSSOM into the render tree to find the words in context. Preservation crawlers such as Heritrix used to construct the DOM to find the links, and then preserve the HTML. Now they also have to construct the CSSOM and execute the Javascript. It might be worth investigating whether preserving a representation of the render tree rather than the HTML, CSS, Javascript, and all the other components of the page as separate files would reduce costs.

It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In these cases where we simply can't ingest it, preserving it in place may be the best we can do; creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.

Metadata is commonly regarded as essential for preservation. For example, there are 52 criteria for ISO 16363 Section 4. Of these, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
  • Manually creating metadata is impractical at scale.
  • Extracting metadata from the content scales better, but it is still expensive since:
  • In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
We need less metadata so we can have more data. Two questions need to be asked:
  • When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
  • How important is the metadata to the task of preservation? Generating metadata because it is possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it is a waste of resources. If the reason to validate the formats of incoming content using error-prone tools is to reject allegedly non-conforming content, it is counter-productive. The majority of content in formats such as HTML and PDF fails validation but renders legibly.
The LOCKSS and CLOCKSS systems take a very parsimonious approach to format metadata. Nevertheless, the requirements of ISO 16363 forced us to expend resources implementing and using FITS, whose output does not in fact contribute to our preservation strategy, and whose binaries are so large that we have to maintain two separate versions of the LOCKSS daemon, one with FITS for internal use and one without for actual preservation. Further, the demands we face for bibliographic metadata mean that metadata extraction is a major part of ingest costs for both systems. These demands come from requirements for:
  • Access via bibliographic (as opposed to full-text) search, for example OpenURL resolution.
  • Meta-preservation services such as the Keepers Registry.
  • Competitive marketing.
Bibliographic search, preservation tracking and bragging about exactly how many articles and books your system preserves are all important, but whether they justify the considerable cost involved is open to question. Because they are cleaning up after the milk has been spilt, digital preservation systems are poorly placed to improve metadata quality.

Resources should be devoted to avoiding spilling milk rather than cleanup. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of publications, it is an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on.

The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint, and stop wasting resources on it. One of the reasons full-text search dominates bibliographic search is that it handles the messiness of the real world better.
Conclusion

Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions:
  • Will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process current projections of cost per unit content are higher than they were a few years ago. Projections for future preservation budgets are at best no higher. So clearly the answer is no.
  • If not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content. I have set out some ideas, others will have different ideas. But the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes.
Unfortunately, any way of making preservation cheaper can be spun as "doing worse preservation". Jeff Rothenberg's Future Perfect 2012 keynote is an excellent example of this spin in action. Even if we make large cost reductions, institutions have to decide to use them, and "no-one ever got fired for choosing IBM".

We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS underwent to obtain this check-off item was well in excess of 10% of its annual budget.

Making the tradeoff of preserving more stuff using "worse preservation" would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.

To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data.

SearchHub: Stump The Chump: D.C. Winners

planet code4lib - Wed, 2014-11-19 22:55

Last week was another great Stump the Chump session at Lucene/Solr Revolution in DC. Today, I’m happy to announce the winners:

  • First Prize: Jeff Wartes ($100 Amazon gift certificate)
  • Second Prize: Fudong Li ($50 Amazon gift certificate)
  • Third Prize: Venkata Marrapu ($25 Amazon gift certificate)

Keep an eye on the Lucidworks YouTube page to watch the video as soon as it is available and see the winning questions.

I want to thank everyone who participated — either by sending in your questions, or by being there in person to heckle me. But I would especially like to thank the judges, and our moderator Cassandra Targett, who had to do all the hard work preparing the questions.

See you next year!

The post Stump The Chump: D.C. Winners appeared first on Lucidworks.

Nicole Engard: Bookmarks for November 19, 2014

planet code4lib - Wed, 2014-11-19 20:30

Today I found the following resources and bookmarked them on

Digest powered by RSS Digest

The post Bookmarks for November 19, 2014 appeared first on What I Learned Today....

Related posts:

  1. Add weather warnings to your calendar
  2. Amazon Shopping List
  3. Get a wiki for your family

SearchHub: Lucidworks Fusion v1.1 Now Available

planet code4lib - Wed, 2014-11-19 19:53
Hot on the heels of the v1 release of Lucidworks Fusion, we’re back with a whole new set of features and improvements to help you design, build, and deploy search apps with lightning speed. Here’s what’s new in Fusion v1.1:

Windows Support

We know some of you were a little miffed that Fusion didn’t support Windows out of the gate. With the release of v1.1, Fusion now supports Windows 7, Windows 8.1, Windows 2008 Server, and Windows 2012 Server.

Enhanced Signal Processing Framework

Fusion’s signal processing framework has added several improvements to allow more complex interactions between signal types to give your users higher relevancy and insights, including co-occurrence aggregations, extensive new math options, and alternative integration options.

UI Updates

A new streamlined interface lets you edit and configure schemas all right in the browser – without using the command line or editing a config file. This lets non-technical users access the power and flexibility of Fusion.

Quick Start

Getting started with Fusion is easier than ever with our new Quick Start which walks you through creating your first collection, indexing data, and getting your first app up and running.

Relevancy Workbench

Our new relevancy workbench provides a more intuitive interface to make it easier than ever to increase relevancy and fine-tune results – even for non-technical users.

Connector Bonanza

Fusion v1.0 shipped with over 25 connectors so you can index your data no matter where it lives. Fusion v1.1 now ships with connectors for Sharepoint 2010 and 2013, Subversion 1.8 and greater, Google Drive, Couchbase and Jive.

Grab it now!

Lucidworks Fusion v1.1 is now available for download.

The post Lucidworks Fusion v1.1 Now Available appeared first on Lucidworks.

DPLA: New Uses for Old Advertising

planet code4lib - Wed, 2014-11-19 18:00

3 Feeds One Cent, International Stock Food Company, Minneapolis, Minnesota, ca.1905. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

Digitization efforts in the US have, to date, been overwhelmingly dominated by academic libraries, but public libraries are increasingly finding a niche by looking to their local collections as sources for original content. The Hennepin County Library has partnered with the Minnesota Digital Library (MDL)—and now the Digital Public Library of America—to bring thousands of items to the digital realm from its extensive holdings in the James K. Hosmer Special Collections Department. These items include maps, atlases, programs, annual reports, photographs, diaries, advertisements, and trade catalogs.

Our partnership with MDL has not only provided far greater access to these hidden parts of our collections, it has also made patrons much more aware of the significance of our collections and the large number of materials that we could be digitizing. The link to DPLA has further increased our awareness of the potential reach of our collections: DPLA is already the second largest source of referrals to our digital content on MDL. All this has motivated us to increase our digitization activities and place greater emphasis on the role of digital content in our services.

Recently, we have been contributing hundreds of items related to local businesses in the form of large advertising posters, trade catalogs, and over 300 business trade cards from Minneapolis companies. These vividly illustrated materials provide a fascinating view of advertising techniques, local businesses, consumer and industrial goods, social mores and popular culture from the late 19th and early 20th centuries.

Hennepin County Library is committed to serving as Hennepin County’s partner in lifelong learning with programs for babies to seniors, new immigrants, small business owners and students of all ages. It comprises 41 libraries, and has holdings of more than five million books, CDs, and DVDs in 40 world languages. It manages around 1,750 public computers, has 11 library board members, and is one great system serving 1.1 million residents of Hennepin County.

Featured image credit: Detail of 1893 Minneapolis Industrial Exposition Catalog, Minneapolis, Minnesota. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.
