Journal of Web Librarianship: Tutorials on Google Analytics: How to Craft a Web Analytics Report for a Library Web Site
Journal of Web Librarianship: Exploring Library Discovery Positions: Are They Emerging or Converging?
Nadine P. Ellero
Journal of Web Librarianship: Assessment of Digitized Library and Archives Materials: A Literature Review
Elizabeth Joan Kelly
Journal of Web Librarianship: A Review of “More Technology for the Rest of Us: A Second Primer on Computing for the Non-IT Librarian”
Dena L. Luce
Journal of Web Librarianship: A Review of “Information Services and Digital Literacy: In Search of the Boundaries of Knowing”
Journal of Web Librarianship: A Review of “Handbook of Indexing Techniques: A Guide for Beginning Indexers”
Bradford Lee Eden
Journal of Web Librarianship: A Review of “The Transformed Library: E-Books, Expertise, and Evolution”
Robert J. Vander Hart
Journal of Web Librarianship: A Review of “Optimizing Academic Library Services in the Digital Milieu: Digital Devices and Their Emerging Trends”
I had the opportunity, at work (and a bit outside of work), to learn the GitHub API, as wrapped by Python’s github3 module. I found the documentation really hard to follow, maybe because I don’t have a lot of experience reading API docs, or because it wasn’t organized in the way I think about things, or maybe just because my work on this API was part of a larger, much more harrowing project, and I was already discouraged* … who knows?
I made a thing! Maybe it’s helpful!
Ultimately, I ended up documenting the parts of it I needed to understand in an IPython notebook; if you’d like to play with the GitHub API, then, please, feel free to download and run it, after filling in your own username, your chosen repository name, and your API token for GitHub (which you will want to keep secret, of course).
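For anyone wondering why return types matter so much when mocking, here is a minimal sketch using only the standard library's unittest.mock. It is not taken from the notebook. The helper names (open_issue_titles, fake_client) are hypothetical, and the repository and issues method names are assumptions about what a GitHub-like client exposes, so check them against the version of the wrapper you have installed.

```python
from unittest import mock


def open_issue_titles(client, owner, repo_name):
    """Return the titles of open issues from a GitHub-like client object."""
    # Assumed call chain: client.repository(...) -> repo, repo.issues(...) -> iterable
    repo = client.repository(owner, repo_name)
    if repo is None:  # assumption: a missing repository comes back as None
        return []
    return [issue.title for issue in repo.issues(state="open")]


def fake_client(titles):
    """Build a stand-in client whose return values mimic that call chain."""
    issues = [mock.Mock(title=title) for title in titles]
    repo = mock.Mock()
    repo.issues.return_value = issues
    client = mock.Mock()
    client.repository.return_value = repo
    return client


if __name__ == "__main__":
    client = fake_client(["Fix typo", "Add tests"])
    print(open_issue_titles(client, "someone", "some-repo"))
```

The point of the sketch is that each mock has to hand back the same kind of thing the real call would (an object with a title, an iterable of those, and so on), which is why a record of return types is so handy.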
I’m not sure the format I used is going to be helpful to others, but I kept referring back to this notebook over and over as I worked, so, at the very least, I’ve found a format that’s helpful for me! Since I was trying to mock GitHub objects, I was very interested in return types. I hope it’s useful for others who want to understand github3…
I made a funnier thing! Maybe you’ll want to play with it?
Now, because I had a funny idea (and I wanted to remind myself that I like programming), I also spent most of a Sunday building a tool to help make GitHub feel less like a game I have to win**. In short, it makes one commit per day on my GitHub account—actually, to the repository in which it is housed, because I was feeling puckish—so that I maintain a perfect streak. See? Perfection, since the day I wrote the script!
The code itself is pretty short; look at do_the_thing.py. You can tell I had fun writing it, though. :)
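If you would rather see the general shape of such a script before opening the repository, here is a minimal sketch. It is not the contents of do_the_thing.py. The file name streak.log, the commit message, and the function names are all made up for illustration, and it assumes git is installed and the repository already has a remote to push to.

```python
import datetime
import subprocess
from pathlib import Path


def touch_log(repo_dir, today):
    """Append today's date to a log file so there is something new to commit."""
    log_path = Path(repo_dir) / "streak.log"  # hypothetical file name
    with log_path.open("a") as handle:
        handle.write(today.isoformat() + "\n")
    return log_path


def git_commands(log_path, today):
    """Return the git invocations for one daily commit, in order."""
    return [
        ["git", "add", log_path.name],
        ["git", "commit", "-m", f"Daily commit for {today.isoformat()}"],
        ["git", "push"],
    ]


def run_daily_commit(repo_dir, today=None, runner=subprocess.run):
    """Change the log file, then add, commit, and push from inside repo_dir."""
    today = today or datetime.date.today()
    log_path = touch_log(repo_dir, today)
    for command in git_commands(log_path, today):
        runner(command, cwd=repo_dir, check=True)


if __name__ == "__main__":
    run_daily_commit(Path(__file__).resolve().parent)
```

Passing the runner in as a parameter is only there so the function can be exercised without touching a real repository.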
The trickier (read: way more time-consuming) part ended up being the “make it run daily” bit. I had to learn about launchd, which Mac seems to prefer to cron. That’s too bad, since I already understood cron, but launchd has some nice features, like running when the machine wakes up if it was sleeping at the appointed time.
Anyway, once I gave in, took the advice of the launchd tutorial linked above, and installed LaunchControl, it went better for me.
I’ve included my plist file in the repository, so nobody else has to write their own; I also gave the directory to put it in and listed the one necessary change, in the README. It shouldn’t be too hard to get running if you’ve got a Mac on hand.
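For the curious, a launchd job definition is just a property list, so one way to see what goes into it is to build one with Python's plistlib. This is a sketch and not the plist from the repository. The label, the script path, the interpreter path, and the time of day are placeholders. Label, ProgramArguments, and StartCalendarInterval are standard launchd keys, and per-user jobs conventionally live in ~/Library/LaunchAgents.

```python
import plistlib


def build_launch_agent(label, script_path, hour, minute,
                       interpreter="/usr/bin/python3"):
    """Return the XML bytes of a launchd job that runs a script once a day."""
    job = {
        "Label": label,                                # unique name for the job
        "ProgramArguments": [interpreter, script_path],
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    # Placeholder values; substitute your own label and paths.
    xml_bytes = build_launch_agent(
        "com.example.dailycommit", "/Users/you/streak/do_the_thing.py", 9, 30
    )
    print(xml_bytes.decode("utf-8"))
```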
I’d like to deploy this on Heroku at some point, rather than having to keep my laptop on all the time; maybe if I find that doable I’ll write a follow-up post, or just edit this one. :)
*We have this big code base, which is built on Flask (which I don’t know) and MongoDB (which I don’t know) and which has a RESTful API (which I haven’t learned yet; that’s my current project, happily!), which requires some internal routing to build and resolve URLs (which I don’t really understand, though I can trace through it with PyCharm). My job was to use mock (which I didn’t know at the time and still have trouble keeping my head wrapped around) to write tests (which I have only minimal experience with) on some internal API routes and permissions stuff for the GitHub addon (which was written in a fairly complicated way, in part by necessity, and which I didn’t understand at all at the outset of this project). This was supposed to be a good learning project, and I don’t doubt the intentions of anyone involved with assigning it, but … it was a pretty terrible, demoralizing experience, made worse because I was instructed not to ask the more-senior devs any questions, and the time estimate I was given was not realistic. (I did finish, pretty much. I have to redo my pull request in the morning.)
** “GitHub displays a lot of useless stats about how many followers you have, and some completely psychologically manipulative stats about how often you commit and how many days it is since you had a day off” – Why GitHub is not your CV, by James Coglan
On Saturday, 3 January 2015, OCLC has scheduled a technology upgrade to support system performance and reliability. During this upgrade, all OCLC services will be unavailable on 3 January 2015, from 12:01 am to 3:00 pm, U.S. Eastern Standard Time (approximately 15 hours).
You’re sitting down, right? This Thursday marks the culmination of the Federal Communications Commission’s (FCC) E-rate modernization proceeding—18 months in the making. The Commissioners will vote on a landmark E-rate order that addresses the broadband capacity gap facing many public libraries and the long-term funding shortage of the E-rate program. For the American Library Association (ALA) this is a very big deal as we have spent countless hours in meetings, on calls, late-night drafting and revising, cajoling our members for more cost data, and a few times engaging in down-the-hall-tirades during the tensest moments.
For libraries, this vote is a very, very big deal. In July the Commission voted on its first E-rate Order, which focused on increasing the Wi-Fi capacity for libraries and schools (among a number of other program changes). At that time, FCC Chairman Tom Wheeler made a commitment to taking up the outstanding issues, making it clear that the modernization process would be multi-phased. The outstanding issue that matters most for libraries is the lack of high-capacity broadband to the door of the library, either because it is simply not available or because, where it is available, the monthly cost is far too high. A second issue left open in July was the long-term funding needs of the program.
ALA fought hard to have these issues addressed and on Thursday, the Chairman is living up to his commitment by bringing a second order before the Commission that squarely takes on both, making strategic rule changes and adding (sitting down still?) $1.5 billion to the fund, permanently. We are very pleased.
And how do you celebrate 18 months of work that involved all of our allies (in states spread across the country)? Step one is to join the meeting on Thursday virtually. While all E-rate meetings are interesting, this one will be especially so. The Chairman has invited librarians, teachers, and students to meet with him before the public open meeting so he can hear directly from the beneficiaries of the program on the difference having a library (or school) connected to high-capacity broadband makes.
On behalf of libraries, the Chairman will be joined by Andrea Berstler, executive director, Wicomico Public Library; Rose Dawson, director of Libraries, Alexandria Library; Nicholas Kerelchuck, manager, Digital Commons, Martin Luther King Jr Memorial Library, DC; and Richard Reyes-Gavilan, executive director, DC Public Library. In addition to meeting with the Chairman, Richard will also present during the Commission meeting. We were pleased to be asked to provide a library perspective and are thrilled to have representatives from a variety of libraries who can give color to why what the Commission is doing will have such a positive impact on libraries and the communities they serve.
Join in the fun
A little lighthearted fun at an E-rate meeting? Of course. Play E-rate Bingo online:
When one of the Commissioners or the Chairman says one of the words on your card, mark it out. Use the Twitter hashtag #libraryerate to let everyone know when you get Bingo! Since the meeting starts at 10:30 Eastern time, we can’t encourage adult beverages, so use chocolate when you hear one of your words. Of course you should also tweet throughout the meeting. Tell everyone what more broadband will mean for your library.
After the meeting, we will still have to wait to see the actual order until the Commission releases it publicly. We are planning a number of outreach activities to help navigate the changes to the E-rate program and to help libraries take advantage of them. The first will be a webinar in collaboration with the Public Library Association. This will be January 8 at 2:00 Eastern. Also, look for a summary of the order once we’ve had a chance to read it!
We’re looking forward to Thursday and the work ahead. So while we’re taking at least a day off to reflect and high-five a little, stay tuned. More is on the way.
Mita Williams: From DIY to Working Together: Using a Hackerspace to Build Community : keynote from Scholars Portal Day 2014
You can’t tell how many apples are in a seed.
In May of 2010, Art Rhyno, Nicole Noel, the late and sorely missed Jean Foster, and I hosted an unconference at the central branch of the Windsor Public Library.
Unconferences are seemingly no longer in vogue, so just in case you don’t know, an unconference is a conference where the topics of discussion are determined by those in the room who gather and disperse in conversation as their interests dictate.
The unconference was called WEChangeCamp and it was one of several ChangeCamp unconferences that occurred across the country at that time.
At this particular unconference, 40 people from the community came together to answer this question: “How can we re-imagine Windsor-Essex as a stronger and more vibrant community?”
And on that day the topic of a Windsor Hackerspace was suggested by a young man who I later learned was working on his doctorate in electrical engineering. What I remember of that conversation four years ago was Aaron explaining the problem at hand: he and his friends needed regular infusions of money to rent a place to build a hackerspace so they needed a group of people who would pay monthly membership fees. But they couldn’t get paying members until they could attract them with a space.
Shortly thereafter, Aaron - like so many other young people in Windsor - left the city for work elsewhere. It’s a bit of an epidemic here. We have the second highest unemployment rate in Canada and it’s been said that the youth unemployment rate in Windsor is at a staggering 20%.
In Aaron’s case, he moved to Palo Alto, California to do robotics work in an automotive R&D lab.
In the meantime back in Windsor, in May 2012, I helped host code4lib North at the University of Windsor. We had the pleasure of hosting librarians from many OCUL libraries over those two days as well as staff from the Windsor Public Library. Also in the audience was Doug Satori. Doug had helped in the development of the WPL’s CanGuru mobile library application. He came to code4lib North because he was curious about the first-generation Raspberry Pi that John Fink of McMaster had brought with him. You have to remember that in 2012 the Raspberry Pi - the $40 computer card - was still very new in the world.
A year later, in May 2013, Windsor got its first Hackerspace when Hackforge was officially opened. The Windsor Public Library graciously lent Hackforge the empty space in the front of their Central Branch that was previously a Woodcarver’s Museum.
When Hackforge launched, Doug Satori was president and I was on the board of directors.
In the 20 months of our existence, I’m proud to say that Hackforge has accomplished quite a lot for itself and for our community.
We’ve co-hosted three hackathons along with the local technology accelerator WETech Alliance.
The first hackathon was called HackWE - and it lasted a weekend, was hosted at the University of Windsor and was based on the City of Windsor’s Open Data Catalogue.
HackWE 2.0 was a 24-hour hackathon based on residential energy data collected by Smart Meters and was part of a larger Ontario Apps for Energy Challenge.
And the third, HackWE 3.0 - which happened just this past October - had events stretched over a week and was based on open scientific data in celebration of Science and Technology Week.
We’ve hosted our own independent hackathons as well. Last year Hackforge held a two-week Summer Games event for people who wanted to try their hand at video game design. Everyone who completed a game won a trophy. My own video game won the prize for being the Most Endearing.
But in general, our members are more engaged in the regular activities of Hackforge.
We have monthly Maptime events in the space. Maptime is an open learning environment for all levels related to digital map making, but there is a definite emphasis on support for the beginner.
This photo is from our first Windsor Maptime event, which was dedicated to OpenStreetMap. There are Maptime chapters all around the world, and the next Maptime Toronto meeting is December 11th, if you are curious and if you are in or near the GTA.
The Hackforge Software Guild meets weekly to work on personal projects as well as to practice pair programming on coding challenges called katas. For example, one of the first kata challenges was to write a program that would correctly write out the lyrics of 99 Bottles of Beer on the Wall, and one of the more recent was to code bowling scores.
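To give a flavour of what a kata looks like, here is one possible Python take on both challenges. These are sketches and not the Guild's solutions. The function names are made up, the verse wording follows one common version of the song, and the bowling scorer assumes it is handed a complete, valid ten-frame game as a flat list of pins per roll.

```python
def bottles_verse(n):
    """Return one verse of the song for n bottles, where n is at least 1."""
    def count(k):
        if k == 0:
            return "no more bottles"
        return f"{k} bottle{'s' if k != 1 else ''}"

    return (
        f"{count(n)} of beer on the wall, {count(n)} of beer.\n"
        f"Take one down and pass it around, {count(n - 1)} of beer on the wall."
    )


def bowling_score(rolls):
    """Score a complete ten-frame game from the pins knocked down on each roll."""
    total = 0
    i = 0  # index of the first roll of the current frame
    for _ in range(10):
        if rolls[i] == 10:                    # strike: 10 plus the next two rolls
            total += 10 + rolls[i + 1] + rolls[i + 2]
            i += 1
        elif rolls[i] + rolls[i + 1] == 10:   # spare: 10 plus the next roll
            total += 10 + rolls[i + 2]
            i += 2
        else:                                 # open frame: just the pins
            total += rolls[i] + rolls[i + 1]
            i += 2
    return total


if __name__ == "__main__":
    print(bottles_verse(99))
    print(bowling_score([10] * 12))  # twelve strikes in a row
```

Part of the fun of the bowling kata is noticing that bonus rolls are counted twice, once as a bonus and once in their own frame, which is why the scorer walks frames and not rolls.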
We also have an Open Data Interest group and we are going to launch our own Open Data portal for Windsor’s non-profit community in 2015. We’re able to do this because this year we have received Trillium funding to hire a part-time coordinator and to pay small stipends to people to help with this work.
Our first dataset is likely going to be a community asset map that was compiled by the Ford City Renewal group. Ford City is one of several neighbourhoods in Windsor in which more than 30% of the population has an income at the poverty level. The average income in the City of Windsor as a whole isn’t actually that much less than the average for all of Canada - it’s just that we’re the most economically polarized urban area in the country. That’s one of the reasons why, in January, Hackforge is going to be working with Ford City Renewal to host a build-your-computer event for young people in the neighbourhood.
As well, our three-year Trillium grant also funds another part-time coordinator who matches individuals seeking technology experience with non-profits, such as the Windsor Homeless Coalition, who need technology work and support.
Hackforge has also collaborated with the Windsor Public Library to put on co-hosted events such as the Robot Sumo contest.
And we’ve worked with the City of Windsor to produce persistence of vision bicycle wheels for their WAVES light and sound art festival. I know it’s difficult to see, but in the photo on the screen is a bicycle wheel with a narrow set of lights that are strapped to three spokes on the wheel. When the wheel spins, the lights animate and give the impression that there’s an image in the wheel - it only works with the human eye, because of our persistence of vision - and it’s something that really doesn’t come across in a photo very well.
[here's a video!]
Also, the City of Windsor commissioned us to build a Librarybox for their event which I thought was really cool!
And like most other hackerspaces, we have 3D printers. We have robotic kits. We have soldering irons, and we have lots and lots of spare electronic and computer parts. But unlike most other hackerspaces, which charge their members $30 to $50 a month to join and make use of their space, our hackerspace is currently free to members, who pay for their membership with volunteer work.
This brings us to today in the last days of 2014.
2014 is also the year that Aaron came back to us from California. He’s now my fellow board member at Hackforge. And, incidentally, so is Art Rhyno, who - if you don’t know - is a fellow librarian from the University of Windsor.
I was asked by Scholars Portal if I could share some of my experiences with Hackforge in light of today’s theme of building community. And that is what my talk will be about today: how to use a hackerspace to build community. And I will do so by expanding on five themes.
But as you now know, we are only 2 years old, and so this talk is really about just the beginning steps we’ve been taking and those steps that we are still trying to take. We admittedly have a long way to go.
Helping out with Hackforge has been a very rich and rewarding experience and I’ve learned much from it. And it’s also been hard work and sometimes it has been very time consuming.
All those decisions we made as we started our hackerspace were the first ones we’ve ever had to make for our new organization. This process was exhilarating but it also was occasionally exhausting. Which brings us to our first theme:
Institutions reduce the choices available to their members
The reason why starting up an organization is so exhausting can be found in Ronald Coase’s work. Coase is famous for introducing the concept of transaction costs to explain the nature and limits of firms, and that earned him the Nobel Prize in Economics in 1991. Now, I haven’t read his Nobel Prize-winning work myself. I was first introduced to Coase when I read a book last year called The Org: The Underlying Logic of the Office by Ray Fisman and Tim Sullivan.
I also read Coase being referenced in a blog post by media critic Clay Shirky that was about the differences between new media and old media. It’s Shirky’s words on the screen right now:
These frozen choices are what gives institutions their vitality — they are in fact what make them institutions. Freed of the twin dangers of navel-gazing and random walks, an institution can concentrate its efforts on some persistent, medium-sized, and tractable problem, working at a scale and longevity unavailable to its individual participants.
Further on in his post Shirky explains what he means by this through an example of what happens at a daily newspaper:
The editors meet every afternoon to discuss the front page. They have to decide whether to put the Mayor’s gaffe there or in Metro, whether to run the picture of the accused murderer or the kids running in the fountain, whether to put the Biker Grandma story above or below the fold. Here are some choices they don’t have to make at that meeting: Whether to have headlines. Whether to be a tabloid or a broadsheet. Whether to replace the entire front page with a single ad. Whether to drop the whole news-coverage thing and start selling ice cream. Every such meeting, in other words, involves a thousand choices, but not a billion, because most of the big choices have already been made.
When you are starting a new organization, or any new venture, really, every small decision can sometimes seem to bog you down. There are navel-gazing and random walks.
We got bogged down at the beginning of Hackforge. We actually received the keys to the space in the Windsor Public Library in October of 2012. Why the delay? We had decided that we would launch the opening of our space with a homemade keypass locking system for the doors because we thought it wouldn’t take much time at all.
And if we were considering how long it would take one talented person to build such a system by themselves, then maybe we would have been right. But instead, we were very wrong. And looking back at it now, it seems obvious why this was the case:
We had a set of people who had never worked together before, who didn’t necessarily even speak the same programming languages, working without an authority structure, in a scarcely born organization with no promise that we would succeed or survive, nor any sure promise of reward.
Now it’s very important for me to say this, so that I’m absolutely clear: I am not complaining about our volunteers!
Hackforge would not have succeeded if it weren’t for those very first volunteers who made Hackforge happen in those early days when we were starting with nothing.
And the same holds to this day. When we say that Hackforge is made of volunteers, what we are really saying is that Hackforge = volunteers.
Our volunteers are especially remarkable because - like all volunteers - they give up their own time that’s left over after their pre-existing commitments to work, school, family and friends. In volunteer work, every interaction is a gift. But, that being said, not every promise in a volunteer organization is one that is fulfilled. Sometimes you learn the hard way that first thing on Tuesday means 3pm.
But the delay wasn’t just from the building of the system. Once it was built, we then had to make sure that the keypass system was okay with the library and that it was okay with the fire marshal. And we had to figure out who was going to make the key cards, how they were going to be distributed, and how we would decide who would get a key card to the space and who would not. Ultimately, it took us 8 months to figure all of this out.
I wanted to explicitly mention this observation because I’ve noticed that within our own institution of libraries that sometimes when a new group or committee is started up, there is the occasional individual who interprets the slow goings and long initial discussions of the first meetings as, at best, extreme inefficiency, and at worst, a sign of imminent failure.
When in fact, we should recognize that slow starts are normal.
Culture is to an organization as community is to a city
New organizations and new ventures happen slowly and, furthermore, they should happen slowly, because each decision made is one that further defines the “how” of “what an organization is”. Are we, as an organization, formal or informal? Who takes the minutes at meetings? Do we need to give a notice of motion? Do we do our own books or do we hire an accountant? Do we provide food at our events? Do we sell swag or do we give it away? How should we fundraise? How do we deal with bad actors? Every decision further defines the work that we do.
It’s very important to take the time to take these steps slowly in order to make sure that the way you do things matches up with the why you do things. As I think we can appreciate in libraryland, once institutions reduce the choices of their members, it is very difficult - although not impossible - to open them up again for rethinking and refactoring.
One of the reasons why Hackforge has been very successful in its brief existence is that it was formed with clearly articulated reasons and clear guiding principles that continue to help us shape the form of our work. And I know this because the vision of what Hackforge should be was told to me when I was invited to serve on the board when Hackforge began, and I can attest to the fact that it is the same as the one we have now.
Now, there are many different types of hacker and makerspaces: some are dedicated to artists, others to entrepreneurs, while others are dedicated to the hobbyist. Hackforge - in less than 140 characters - has been described as this: Hackforge supports capacity building in the community and supports a culture of mentorship and inclusivity.
More specifically, we exist to help with youth retention in Windsor. We aim to be a place where individuals who work or want to work in technology can find support from each other.
I know it might sound strange to you that we believe that our local IT industry needs support, especially when we read about the excesses of Silicon Valley on a regular basis.
But in Windsor, there are not many options for those with a technology background to find work and so, despite the impression we give to those pursuing a career in STEM, tech jobs in Windsor can be poorly paid and the working conditions can be very problematic.
Many of the provisions in the labour law - the ones that entitle employees to set working hours, to breaks between and within shifts, to overtime and even time to eat - have exemptions for those who work in IT. I’ve been told that the only way to get a raise while working in IT in this town is to find a better paying job.
The IT industry sometimes treats people as if they were machines themselves.
Hackforge was built as a response to this environment. It was built in hopes that it could help grow something better. At Hackforge we know our strength does not come from the machines that we have in our space, but from our amazing members and the time and work that they give to others.
I mean, we love 3D printers because they are a honeypot that brings curious folks into our space, but the secret is we are not really about 3D printers.
And yet if you look at all of the media coverage we receive, you would think we’re just another makerspace that loves 3D printers and robots.
This is why it is SO important to be visible with your values, which is our second theme.
Show your work
One of the challenges that we have at Hackforge is that we don’t have very many women in our ranks. Women make up half of our board of directors but our larger membership is not representative of the Windsor community and it’s likely not representative in the other aspects of identity, for that matter, either.
We knew that if we wanted to change this situation, it would require sustained work on our part. And so when we had our official launch of Hackforge last year, we, as part of the event, hosted a Women in Technology Panel that featured four women who work in IT, including the very successful Girl Develop IT from Detroit, all of whom both shared their experiences and offered strategies to make the field of technology a more inclusive environment and a better place for everyone.
In the audience for that panel discussion was a representative of WEST. WEST, which stands for Women’s Enterprise Skills Training, is a local non-profit group. Starting next year, with the support of another Ontario Trillium grant, Hackforge and WEST are going to be launching a project that will offer free computer skills training workshops for women, as well as trying to create a community of support and continuing to advocate for women in the IT field.
So I can’t stress this enough. You have to do your work in public if you want your future collaborators to find you.
I also have another Women in Technology story to start our third theme.
So remember I told you about unconferences? Well, the Hackforge members who run the Software Guild do something similar. Sometimes, instead of coding, the folks do something like this: they write down all the things they want to talk about, vote for the topics, and then talk about the most-voted topics within strict time limits. But they don’t call it an unconference:
They call it LEAN COFFEE.
I love it. It’s so adorable.
Anyway, at one of these Lean Coffee sessions, our staff coordinator suggested the topic Women in Technology. And the response she received was this: We know there’s a problem because Hackforge doesn’t have enough women. But we are not sure how to fix this.
I found this statement very encouraging.
It’s sad, but in these times, it is actually very refreshing when people can admit that there’s a problem without any deflection or allocation of blame.
I mean, within librarianship, we have some organizations that consistently organize speaking events made up of mostly men. Whenever I raise this matter, I am usually told that if the speaking topic is not about gender, then it’s not about gender. In other words, they tell me that there is no problem.
But sometimes there is a problem.
Look at this photo: from this you would never guess that it was taken in a city that is over 80% African American. This photo is from the first meeting of Maptime Detroit, which I attended last month. One of the first things that was said during the evening’s introduction was a simple statement by the organizer: “I want to acknowledge who isn’t in this room.” And what followed was a plan to hold the next Maptime meetings not in the mid-town Tech Incubator, but within the various neighbourhoods in the city and alongside partner organizations already working with Detroiters where they live.
So before we can be more inclusive, we need to recognize when we are not.
We can start by acknowledging who isn’t in the room. It isn’t hard to do.
Quinn Norton wrote a lovely essay about this called Count. Speaking of counting, we are now at theme four.
A mailing list is not a community
What you might find surprising is that, for a gathering of people who generally love love love the Internet, Hackforge really doesn’t even have a strong online space for folks to hang out in, with the exception of our IRC channel. We used to have forum software, but it was so overwhelmed with spam on a daily basis that it was almost immediately rendered unusable.
Also, Hackforge doesn’t even have a listserv mailing list.
And I would go as far as to say that one of the reasons why Hackforge has been as successful as we have been is, in part, that we *don’t* have a mailing list.
There’s a website called Running a Hackerspace that is a collection of animated gifs that metaphorically capture the essence of running a hackerspace. I think it’s particularly telling that there are many recurrent topics that arise on this Tumblr, like the complaints that folks don’t clean up after themselves.
(And this is when I confess that when I drop by Hackforge, I am also sometimes made sad).
But the most prevalent theme in the blog is mailing list rage.
You would think this would have been a solved problem by now: how do you support project work that is done asynchronously and dispersed over geography? Many open source communities are finding that the traditional tools of mailing lists, forum software, and IRC channels are not doing enough to help their communities do good work together. More often than not, these technologies seem to be better at boosting the noise than the signal.
Distributed companies like WordPress are moving from IRC to software platforms such as Slack. As I’ve mentioned before, I’m involved with a largely self-organized group called Maptime, and we also make use of Slack, which is essentially user-friendly IRC, chat, and messaging along with images, file sharing, archiving and social media capture.
At Hackforge, we’ve recently decided to use the Jira issue tracker to manage the hacking work that we need to do in the space, and we will be switching to Nation Builder software to manage our members and member communications. When activists, non-profits, and political parties are using software like Nation Builder to manage the contact info, the interests, and the fundraising of tens of thousands of people, it makes me wonder when libraries are going to start using similar software to manage the relationships they have with their communities.
And at a time when my neighbours who rent the skating rink for collective use rely on volunteer management software to figure out whose turn it is to bring the hot chocolate, I would like to suggest that libraries perhaps could start using similar software to - at least - manage our internal work and communications as well. Good tools make great communication possible within organizations and our communities. They are worth the investment.
Invest in but do not outsource community management
Before I end my presentation with this last theme, I do want to offer a caveat to everything I’ve said. If you asked all of the people who have been involved in Hackforge - those who have come by our events, spent time in the space, or even volunteered some mentoring at an event - if you asked them if they felt they were part of a community, I think most people asked would probably say, no. I think we have a wonderful group of people who have contributed to Hackforge and I think we have a group of people who have even found friends at Hackforge, but I think we still can’t call the whole of what we do "a community" - at least not yet.
Hackforge is approaching its 2nd birthday and this talk has been a wonderful excuse to reflect on what we do well and what we still need to work on.
What works for us are regular events, contests and hackathons. We are well aware of the limitations of hackathons and how they produce imperfect work but, for us, it seems that pre-defined limits and deadlines produce more work and generate more interest and excitement than unstructured free time does.
Unlike many hackerspaces, we don’t tend to have many group projects. The door project - as you have learned - was one of the few group projects, and that one took longer than expected. In our early months, we also had an LED sign project that was never completed and actually resulted in some people leaving Hackforge in frustration.
We are a volunteer organization and as such, by the process of evolution, we are a place for the patient and the forgiving. Sometimes we have gotten our first impressions wrong.
One of the largest challenges I think we have as an organization is to be more accessible to beginners. In fact, that’s the feedback that we’ve been getting.
Aaron recently had a tech talk about tech talks and the message he received was that Hackforge should provide more sessions for beginners. And this is a particular challenge that we haven’t really addressed yet. We’re lucky that Hackforge has people who are both generous with their time and not afraid of public speaking, and who give tech talks. But many of our speakers don’t preface their talks with an introduction that a newbie could understand. They are so excited to have fellow experts in the crowd that they jump right into the code or electrical specs or what have you.
Likewise, it’s amazing and wonderful that we have regular supportive events like our members’ coding katas in which those who work with software can practice and share their coding practice with others. But at the moment, we don’t really have anything for those who want to learn how to code. And you might not be shocked to hear this, but Hackforge’s machines - like our 3D printers - lack even the most basic documentation on how to use them.
Without expanding the work of communicating, documenting, explaining, and teaching, Hackforge won’t be able to attract new members.
Hackforge started as a top-down organization. Our job as a board has been to build the systems that will allow more of the day-to-day work of Hackforge to move from the board to our community and program managers. We were able to hire our managers in the middle of this year and already, they have made wonderful contributions to Hackforge. Our next challenge will be how to move more of the operational work of the managers to the members themselves.
In other words, the challenge for Hackforge is to ensure that the work that needs to be done - all of that communicating, documenting, explaining, teaching - is embraced by all of its members as a community of practice. And through this practice, it’s hoped we can build a community.
So, those are my five themes for building community with a hackerspace:
Institutions reduce the choices available to their members (so choose carefully)
Show your work (so future collaborators can find you).
Acknowledge who isn’t in the room (Count is only the start).
A mailing list is not a community (Invest in tools that do better).
Invest but do not outsource community management.
The work of figuring out how to get a bunch of people to come together and face a shared challenge isn’t just the way to build a community. This is also how political movements begin. It’s also how a game begins. I would like to thank Scholars Portal for giving me the opportunity to begin Scholars Portal Day with you all.
Today, the Department of Labor will host a webinar that will outline and discuss suggested activities states should undertake to comply with the Workforce Innovation and Opportunity Act (WIOA) beginning immediately through July 1, 2015. This session has limited space so please register quickly.
During the webinar “WIOA Technical Assistance Webinar- Jump-Starting Your Implementation,” speakers will highlight areas where states have existing authority to take action to comply with WIOA, as well as provide technical assistance, based on statutory requirements, on additional areas in which states are encouraged to move forward in the transition from Workforce Investment Act to WIOA.
Presenters:
- Adele Gagliardi, Administrator, Office of Policy Development and Research, U.S. Department of Labor, Employment and Training Administration
- Christine Quinn, Special Assistant, U.S. Department of Labor, Employment and Training Administration
- Lori H. Collins, Director, Division of Workforce & Employment Services, Kentucky Career Center
- Mike Riley, ETPL lead, Division of Workforce & Employment Services, Kentucky Career Center
- Lisa Salazar, Director, City of Los Angeles OneSource, Youth Opportunity System
- Thomas Colombo, Assistant Director, Division of Employment & Rehabilitation Services, Interim Deputy, State of Arizona
- Alice Sweeney, Director, Department of Career Services, State of Massachusetts
- Scott C. Fennell, Chief Operation and Financial Officer, Career Source Florida
- Moderator: Maggie Ewell, Policy and Reporting Team Lead, Office of Grants Management, U.S. Department of Labor, Employment and Training Administration
Date: December 9, 2014
Time: 2:00pm ET (1:00pm/Central, 12:00pm/Mountain, 11:00am/Pacific)
Join the webinar
The ALA Washington Office does not know if the webinar will be archived. Contact the Department of Labor with questions about the webinar.
I spent much of last week in New York City as part of the American Library Association (ALA) advocacy effort regarding ebooks. These meetings with publishing executives are described in my post on the American Libraries magazine’s E-content blog. However, I also engaged in some other activities during this trip.
I had the pleasure of participating in Jim Neal’s retirement celebration, held at Casa Italiana, Columbia University, which saw hundreds in attendance to pay tribute to him. As many of you know, Jim is a long-time, strategic, and highly-respected contributor to the library community at the national level. Among his many contributions, he has served on ALA’s Executive Board for three separate tenures, including one presently, and is a former treasurer of the Association.
Lee Bollinger, president of Columbia University, kicked off the formal program to recognize Jim’s contributions as vice president of information services and university librarian. Several other Columbia University officials participated in the praise, including our close collaborator Bob Wolven, an associate university librarian and former co-chair of the ALA Digital Content Working Group. Bob commented on Jim’s extraordinary energy and initiative, noting that if Jim received some lemons, he would not make lemonade—instead he would develop a plan for a multi-division business and demand more lemons.
In 2015, Jim will become university librarian emeritus, an honor bestowed only two times previously in the university’s history. Jim will remain active in the field and so ALA and the national library community will continue to benefit from Jim’s strategic guidance in copyright policy and other areas for some time to come. As he is a member of the Policy Revolution! Initiative’s Library Advisory Committee, I am relieved to learn that we will continue to benefit from his counsel in our efforts to reshape national public policy for libraries. While on campus, I also had separate meetings with Jim and Bob to discuss broad issues in digital content and information policy.
The award ceremony for the I Love My Librarian award took place last week and so I was able to attend it, held at the offices of The New York Times. Of course, I expected to hear about the exemplary library work of the awardees, but the intense level of emotion exhibited at the event was a bit unexpected.
For these librarians, their work extends far beyond a job, becoming a calling—and their patrons see it that way as well. This award is extremely competitive, with over 1000 applications received that ultimately yielded 10 awardees.
Finally, I made it to Connecticut for an evening. My first stop was the Darien Public Library to meet with Amanda Goodman, a staff librarian and member of the Office for Information Technology Policy (OITP) advisory committee.
I got a looksee at the library’s four 3D printers and its fine children’s library. I then met up with Dr. Roger Levien, author of our policy brief Confronting the Future. Roger is working on a new book on the future of public libraries and we discussed varied aspects of his developing analysis in the context of related work in the field.
Trips are great, but then they end and you end up back in the office trying to catch up. I don’t ever seem to catch up—I’m just trying to keep the backlog under embarrassing levels!
The DPLA Board of Directors will hold an open call on Monday, December 15, at 3:00 PM Eastern. Agenda and dial-in information are included below.
Agenda
- What’s coming in early 2015
- Update from Executive Director
- Questions and comments from the public
- Voice approval of strategic plan
- Review of draft tax return
- Update on nominating subcommittee
- Update on current grant activities
The Global Open Data Index 2014 team is thrilled to announce that the Global Open Data Index 2014 is now live!
We would not have arrived here without the incredible support from our network and the wider open knowledge community in making sure that so many countries/places are represented in the Index and that the agenda for open data moves forward. We’re already seeing this tool being used for advocacy around the world, and hope that the full and published version will allow you to do the same!
How you can help us spread the news
You can embed a map for your country on your blog or website by following these instructions.
Press materials are available in 6 languages so far (English, German, Spanish, Portuguese, Japanese and French), with more expected. If you want to share where you are, please share a link to our press page. If you see any coverage of the Global Open Data Index, please submit it to us via this form so we can track coverage.
We are really grateful for everyone’s help in this great community effort!
Here are some of the results of the Global Open Data Index 2014
The Global Open Data Index ranks countries based on the availability and accessibility of information in ten key areas, including government spending, election results, transport timetables, and pollution levels.
The UK tops the 2014 Index, retaining its pole position with an overall score of 96%, closely followed by Denmark and then France at number 3, up from 12th last year. Finland comes in 4th while Australia and New Zealand share 5th place. Impressive results were seen from India at #10 (up from #27) and Latin American countries like Colombia and Uruguay, which came in joint 12th.
Sierra Leone, Mali, Haiti and Guinea rank lowest of the countries assessed, but there are many countries where the governments are less open but that were not assessed because of lack of openness or a sufficiently engaged civil society.
Overall, whilst there is meaningful improvement in the number of open datasets (from 87 to 105), the percentage of open datasets across all the surveyed countries remained low at only 11%.
Even amongst the leaders on open government data there is still room for improvement: the US and Germany, for example, do not provide a consolidated, open register of corporations. There was also a disappointing degree of openness around the details of government spending with most countries either failing to provide information at all or limiting the information available – only two countries out of 97 (the UK and Greece) got full marks here. This is noteworthy as in a period of sluggish growth and continuing austerity in many countries, giving citizens and businesses free and open access to this sort of data would seem to be an effective means of saving money and improving government efficiency.
Explore the Global Open Data Index 2014 for yourself!
Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" vastly exceeds that from all other causes, such as bit rot and format obsolescence. Will persisting with current preservation technologies improve the odds of preservation? If not, what changes are needed to improve them?
It covered much of the same material as Costs: Why Do We Care, with some differences in emphasis. Below the fold, the text with links to the sources.
Introduction
I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries. As with all my talks, you don't need to take notes or ask for the slides. The text of the talk, with links to the sources, will go up on my blog shortly.
One of the preservation networks that the LOCKSS Program operates is the CLOCKSS archive, a large dark archive of e-journals and e-books. We operate it under contract to a not-for-profit organization jointly run by publishers and libraries. Earlier this year we completed a more than year-long process that resulted in the CLOCKSS Archive being certified to the Trusted Repository Audit Criteria (TRAC) by CRL. We equalled the previous highest score and gained the first-ever perfect score for Technology. At documents.clockss.org you will find all the non-confidential material upon which the auditors based their assessment. And on my blog you will find posts announcing the certification, describing the process we went through, discussing the lessons learned, and describing how you can run the demos we put on for the auditors.
Although CRL's certification was to TRAC, the documents include a finding aid structured according to ISO16363, the official ISO standard that is superseding TRAC. If you look at the finding aid or at the ISO16363 documents you will see that many of the criteria are concerned with economic sustainability. Among the confidential materials the auditors requested were "Budgets for last three years and projections for next two showing revenue and expenses".
We actually gave them five-year projections. This is an area where we had a good story to tell. The LOCKSS Program got started with grant funds from the NSF, the Andrew W. Mellon Foundation, and Sun Microsystems. But grant funding isn't a sustainable basis for long-term preservation. In 2005, the Mellon Foundation gave us a 2-year grant which we had to match, and after which we had to be off grant funding. For more than 7 years we have been in the black without grant funding. The LOCKSS software is free and open source; the LOCKSS team charges for support and services.
Achieving this economic sustainability has required a consistent focus on minimizing the cost of every aspect of our operations. Because the LOCKSS system's Lots Of Copies trades using more disk space for using less of other resources (especially lawyers), I have been researching in particular the costs of storage for some years. In what follows I want to look at the big picture of digital preservation costs and their implications. It is in three sections:
- The current situation.
- Cost trends.
- What can be done?
- In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
- Luis Faria and co-authors (PDF) compare information extracted from journal publisher's web sites with the Keepers Registry and conclude:We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
- The Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
- Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
- tried to access the URL from the live Web.
- used Memento to ask the major Web archives whether they had at least one copy of that URL.
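The "under 40%" figure in the Keepers Registry item above is simple arithmetic, and a minimal sketch makes the assumption behind it visible. It treats every title the registry reports as one of the median library's 80K serials, which is the most generous reading, so the result is an upper bound. The variable and function names are illustrative, not from the talk.

```python
# Figures quoted in the Keepers Registry item above, in thousands of titles.
median_library_serials = 80.0   # ARL 2010, median research library
preserved = 21.0                # Keepers Registry, "preserved"
in_progress = 10.5              # Keepers Registry, "in progress"


def coverage_fraction(covered, total):
    """Fraction of `total` titles that are covered."""
    return covered / total


# Assumption: every registry title is also held by the median library,
# so these fractions are upper bounds.
any_stage = coverage_fraction(preserved + in_progress, median_library_serials)
preserved_only = coverage_fraction(preserved, median_library_serials)

print(f"At any stage of preservation: {any_stage:.1%}")
print(f"Reported as preserved:        {preserved_only:.1%}")
```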
An Optimistic Assessment
First, the assessment isn't risk-adjusted:
- As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation*, so what have those resources achieved for future readers?
- As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
- A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
- The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, where negotiating once with a small publisher gains a small amount of high-risk content.
- Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Third, the assessment is backward-looking:
- As regards scholarly communication it looks only at the traditional forms, books, theses and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly both at much higher risk of loss than the traditional forms that are being preserved, because they lack well-established and robust business models, and much more difficult to preserve, since the legal framework is unclear and the content is either much larger, or much more dynamic, or in some cases both.
- As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.
- The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
- As regards Web content, Ainsworth et al describe various forms of bias in their paper.
- Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
- YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
- Much music now happens on YouTube (e.g. Pomplamoose)
- Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories much still resides on grad students' laptops.
Preserving the Rest
Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?
- We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be "can't afford to preserve".
- We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels.
- We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.
On this basis, one would think that the most important thing to do would be to reduce the cost of ingest. It is important, but not as important as you might think. The reason is that ingest is a one-time, up-front cost. As such, it is relatively easy to fund. In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways that ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, for "ever". Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will have a much bigger effect on sustainability than equivalent reductions in ingest costs.
Cost Trends
We've been able to ignore this problem for a long time, for two reasons. From at least 1980 to 2010 storage costs followed Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.
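The first reason can be made concrete with a geometric series. The sketch below is an illustration under simplifying assumptions that are mine rather than the talk's: a fixed dataset, a yearly storage bill that shrinks by a constant Kryder rate, and no interest, media replacement or running costs. Under those assumptions the bill for "the rest of time" is the first-year cost divided by the Kryder rate, and the first few years account for most of it.

```python
def cost_forever(first_year_cost, kryder_rate):
    """Total cost of storing a fixed dataset forever when the yearly
    bill falls by `kryder_rate` (e.g. 0.35 for 35%/yr).

    Sum of the series c + c(1-k) + c(1-k)^2 + ... = c / k.
    """
    if not 0 < kryder_rate < 1:
        raise ValueError("kryder_rate must be between 0 and 1")
    return first_year_cost / kryder_rate


def cost_first_n_years(first_year_cost, kryder_rate, years):
    """Cost of only the first `years` terms of the same series."""
    keep = 1 - kryder_rate
    return first_year_cost * (1 - keep ** years) / kryder_rate


for k in (0.30, 0.40):
    total = cost_forever(1.0, k)
    early = cost_first_n_years(1.0, k, 5)
    print(f"Kryder rate {k:.0%}: forever costs {total:.2f}x the first year; "
          f"the first 5 years are {early / total:.0%} of that")
```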
But can we continue to ignore these problems?
Preservation
Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.
This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.
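The "about 7 times" comparison can be roughly checked by hand. The sketch below is not the calculation behind the graph. It assumes the expected Kryder rate was 40%/yr (the upper end of the 30-40% range quoted earlier) and that the net price change from 2010 to 2014 was zero, since prices "finally got back to 2010 levels". The function name and parameters are illustrative.

```python
def price_ratio(expected_rate, actual_rates):
    """How many times more expensive storage is than was expected.

    expected_rate: projected annual price drop (e.g. 0.40 for 40%/yr).
    actual_rates:  the annual price drop actually seen, one entry per year
                   (0.0 means the price did not change that year).
    """
    expected_price = (1 - expected_rate) ** len(actual_rates)
    actual_price = 1.0
    for rate in actual_rates:
        actual_price *= 1 - rate
    return actual_price / expected_price


# Assumption: four years (2010 to 2014) with no net change in price.
ratio_2014 = price_ratio(0.40, [0.0, 0.0, 0.0, 0.0])
print(f"2014 price is about {ratio_2014:.1f}x the 2010 expectation")
```

Passing a longer list of rates projects further ahead, but the result is sensitive to the assumed rates, so it should not be read as reproducing the graph's figures.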
These are big numbers, but do they matter? After all, preservation is only about one-third of the total, and only about one-third of that is media costs.
Our models of the economics of long-term storage compute the endowment, the amount of money that, deposited with the data and invested at interest, would fund its preservation "for ever". This graph, from my initial rather crude prototype model, is based on hardware cost data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years, against the Kryder rate (the annual percentage drop in $/GB). The different curves represent policies of keeping the drives for 1,2,3,4,5 years. Up to 2010, we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everyone believed that long-term storage was effectively free.
But suppose the Kryder rate were to drop below about 20%/yr. We would be in the steep part of the graph, where the endowment needed is both much higher and also strongly dependent on the exact Kryder rate.
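The shape of that curve can be seen even in a toy version of the calculation. The sketch below is not the prototype model described above, which uses real hardware and running-cost data and computes a 95% probability of not running out of money. It is a deterministic present-value sum, and the 2% interest rate is an assumption chosen purely for illustration.

```python
def endowment(annual_cost, kryder_rate, interest_rate, years=100):
    """Money needed up front to pay for `years` of storage.

    Each year's bill shrinks by `kryder_rate`; money not yet spent earns
    `interest_rate`. This is a plain present-value sum with no randomness.
    """
    total = 0.0
    for t in range(years):
        bill = annual_cost * (1 - kryder_rate) ** t
        total += bill / (1 + interest_rate) ** t
    return total


# Sweep the Kryder rate downwards; the 2%/yr interest rate is hypothetical.
for k in (0.40, 0.30, 0.20, 0.10, 0.05):
    e = endowment(1.0, k, interest_rate=0.02)
    print(f"Kryder rate {k:.0%}: endowment = {e:.1f}x the first-year cost")
```

Even in this simplified form, moving from 40% to 30% barely changes the endowment, while moving from 20% to 10% changes it a great deal, which is the flat-then-steep behaviour the text describes.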
We don't need to suppose. Preeti's graph and industry projections show that now and for the foreseeable future we are in the steep part of the graph. What happened to slow Kryder's Law? There are a lot of factors; we outlined many of them in a paper for UNESCO's Memory of the World conference (PDF). Briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive market into one with much better margins. Each successive technology generation requires a much bigger investment in manufacturing, so requires bigger margins, so drives consolidation. And the technology needs to stay in the market longer to earn back the investment, reducing the rate of technological progress.
Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken into account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.
You may think that the cloud is a competitive market; in fact it is dominated by Amazon.
Jillian Mirandi, senior analyst at Technology Business Research Group (TBRI), estimated that AWS will generate about $4.7 billion in revenue this year, while comparable estimated IaaS revenue for Microsoft and Google will be $156 million and $66 million, respectively. When Google recently started to get serious about competing, they pointed out that while Amazon's margins may have been minimal at introduction, by then they were extortionate:
cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."
Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.
In fact commercial cloud storage is a trap. It is free to put data in to a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years. If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.
Unrealistic expectations that we can collect and store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at great risk of economic failure.
Here's a graph that illustrates the looming crisis in long-term storage, its cost. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at computereconomics.com's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC,** all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget.
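The three lines on that graph are easy to recompute. The sketch below is one reading of the description, not the code behind the graph: it assumes the amount of new data added each year grows by 60%, that everything added since year 0 is kept, that cost per byte falls 20%/yr and that the budget grows 2%/yr. The function names are illustrative.

```python
def storage_cost_ratio(years, data_growth=0.60, kryder_rate=0.20):
    """Annual storage cost in year `years`, relative to year 0.

    Assumes the data added each year grows by `data_growth` and that
    everything accumulated since year 0 is still being stored.
    """
    accumulated = sum((1 + data_growth) ** i for i in range(years + 1))
    cost_per_byte = (1 - kryder_rate) ** years
    return accumulated * cost_per_byte


def budget_share(years, initial_share=0.05, budget_growth=0.02):
    """Fraction of the IT budget consumed by storage in year `years`."""
    budget = (1 + budget_growth) ** years
    return initial_share * storage_cost_ratio(years) / budget


print(f"Cost in year 10: {storage_cost_ratio(10):.1f}x year 0")
print(f"Share of IT budget in year 10: {budget_share(10):.0%}")
```

Under these assumptions the year-10 cost comes out above the "over 20 times" threshold and the budget share above 100%, consistent with the claims in the text.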
Dissemination
The storage part of preservation isn't the only on-going cost that will be much higher than people expect; access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this can be a real difficulty; the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.
In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.
But the advent of "Big Data" techniques means that, going forward, scholars increasingly don't want to access a few individual items in a collection; they want to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library could afford to do with the feed is to write two copies to tape. There's no way they could afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.
Ingest
Almost everyone agrees that ingest is the big cost element. Where does the money go? The two main cost drivers appear to be the real world, and metadata.
It is worth noting, however, that the very first US web site in 1991 featured dynamic content, a front-end to a database!
The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.
Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.
Metadata in the real world is widely known to be of poor quality, both format and bibliographic kinds. Efforts to improve the quality are expensive, because they are mostly manual and, inevitably, reducing entropy after it has been generated is a lot more expensive than not generating it in the first place.
What can be done?
We are preserving less than half of the content that needs preservation. The cost per unit content of each stage of our current processes is predicted to rise. Our budgets are not predicted to rise enough to cover the increased cost, let alone more than doubling to preserve the other more than half. We need to change our processes to greatly reduce the cost per unit content.
Preservation
It is often assumed that, because it is possible to store and copy data perfectly, only perfect data preservation is acceptable. There are two problems with this expectation.
To illustrate the first problem, let's examine the technical problem of storing data in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.
Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.
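The half-life figure can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: a decimal Petabyte (10^15 bytes), bits that flip independently at a constant rate, and an age of the universe of roughly 13.8 billion years. The function name is made up for illustration.

```python
import math

AGE_OF_UNIVERSE_YEARS = 1.38e10  # assumed value, roughly 13.8 billion years


def required_bit_half_life_years(num_bits, years, survival_probability=0.5):
    """Half-life each bit needs so that all bits survive with the given probability.

    Assumes bits flip independently at a constant rate, like radioactive decay.
    """
    # P(all bits survive) = exp(-rate * num_bits * years)
    rate = -math.log(survival_probability) / (num_bits * years)
    # Half-life of a single bit under exponential decay
    return math.log(2) / rate


petabyte_bits = 8 * 10**15  # assumed: 1 PB = 10^15 bytes
half_life = required_bit_half_life_years(petabyte_bits, years=100)
multiples = half_life / AGE_OF_UNIVERSE_YEARS
print(f"{half_life:.2e} years, {multiples:.2e} times the age of the universe")
```

With a 50% survival target the required half-life is simply the number of bits times the number of years, which is where the tens of millions of universe-lifetimes come from.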
Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10^-11. If the average object is 10K bytes, the bit half-life is about a million years, way too short to meet the requirement but still really hard to measure.
Note that the 10^-11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.
Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
- Media failure
- Hardware failure
- Software failure
- Network failure
- Natural Disaster
- Operator error
- External Attack
- Insider Attack
- Economic Failure
- Organizational Failure
To illustrate the second problem, consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.
However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
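A minimal sketch of the decade comparison, assuming each system adds a fixed amount of content per year on the same budget and that the 1% loss is applied at the end of each year to everything stored so far, including that year's additions. The function and variable names are illustrative.

```python
def preserved_after(years, units_per_year, annual_loss_rate):
    """Content remaining after `years`, adding a fixed amount each year.

    Assumption: the loss is applied at the end of each year to everything
    stored so far, including that year's additions.
    """
    stored = 0.0
    for _ in range(years):
        stored += units_per_year
        stored *= 1.0 - annual_loss_rate
    return stored


# Same budget: the cheaper system stores twice as much per year but loses 1%.
perfect = preserved_after(10, units_per_year=1.0, annual_loss_rate=0.0)
cheap = preserved_after(10, units_per_year=2.0, annual_loss_rate=0.01)
ratio = cheap / perfect
print(f"cheaper system preserves {ratio:.2f}x as much after a decade")
```

Under these assumptions the ratio after ten years comes out at about 1.89. Other ways of modeling the loss or the budget would shift the number somewhat.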
Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 150th most visited site, whereas loc.gov is the 1519th. For UK users archive.org is currently the 131st most visited site, whereas bl.uk is the 2744th.
Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.
Unrealistic expectations for how well data can be preserved make the best be the enemy of the good. We spend money reducing even further the small probability of even the smallest loss of data that could instead preserve vast amounts of additional data, albeit with a slightly higher risk of loss.
Within the next decade all current popular storage media, disk, tape and flash, will be up against very hard technological barriers. A disruption of the storage market is inevitable. We should work to ensure that the needs of long-term data storage will influence the result. We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity, and custom hardware based on mostly spun-down disks and DVDs to achieve major cost savings for cold data at scale.
Every few months there is another press release announcing that some new, quasi-immortal medium such as fused silica glass or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.
The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
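To make the Kryder-rate arithmetic concrete, here is a small sketch. It assumes the rate means the space (or cost) per byte shrinks by that fraction every year; the function name is hypothetical.

```python
def space_fraction(kryder_rate, years):
    """Fraction of the original space needed for the same data after `years`.

    Assumption: the space (or cost) per byte shrinks by `kryder_rate` each year.
    """
    return (1.0 - kryder_rate) ** years


slow = space_fraction(0.10, 10)
fast = space_fraction(0.30, 10)
print(f"10%/yr: same data fits in {slow:.2f} of the space")
print(f"30%/yr: same space holds {1 / fast:.0f}x the data")
```

At 10%/yr the result is close to one third of the space. At 30%/yr the factor lands in the thirties, the same order as the figure of 30 above.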
The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
- Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
- Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
Dissemination
The real problem here is that scholars are used to having free access to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.
For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data at no charge, charging scholars who access the data for the computation rather than charging the owner of the data for storage.
Even for large and non-public collections it may be possible to use Amazon. Suppose that in addition to keeping the two archive copies of the Twitter feed on tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requests would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud, they aren't locked in.
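The charge-back arithmetic is simple enough to check directly. This sketch uses only the monthly cost and request count quoted above; the variable names are illustrative.

```python
monthly_storage_cost = 4100   # dollars per month, from the text
initial_requests = 400        # researchers asking for access, from the text

# One year of storage, split evenly across the initial requesters
annual_storage_cost = monthly_storage_cost * 12
cost_per_requester = annual_storage_cost / initial_requests
print(f"annual: ${annual_storage_cost:,}, per requester: ${cost_per_requester:.0f}")
```

The exact monthly figure gives slightly under $50K a year, so the per-requester share comes out a little below the rounded $125.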
In the near term, separating the access and preservation copies in this way is a promising way not so much to reduce the cost of access, but to fund it more realistically by transferring it from the archive to the user. In the longer term, architectural changes to preservation systems that closely integrate limited amounts of computation into the storage fabric have the potential for significant cost reductions to both preservation and dissemination. There are encouraging early signs that the storage industry is moving in that direction.
Ingest
There are two parts to the ingest process, the content and the metadata.
It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In these cases where we simply can't ingest it, preserving it in place may be the best we can do; creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.
Metadata is commonly regarded as essential for preservation. For example, there are 52 criteria for ISO 16363 Section 4. Of these, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
- Manually creating metadata is impractical at scale.
- Extracting metadata from the content scales better, but it is still expensive since:
  - Considerable per-site work is needed to extract bibliographic metadata.
  - Generating format metadata is computationally expensive.
  - In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
- When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
- How important is the metadata to the task of preservation? Generating metadata because it is possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it is a waste of resources. If the reason to validate the formats of incoming content using error-prone tools is to reject allegedly non-conforming content, it is counter-productive. The majority of content in formats such as HTML and PDF fails validation but renders legibly.
- Access via bibliographic (as opposed to full-text) search, for example OpenURL resolution.
- Meta-preservation services such as the Keepers Registry.
- Competitive marketing.
Resources should be devoted to avoiding spilling milk rather than cleanup. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of publications, it is an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on.
The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint, and stop wasting resources on it. One of the reasons full-text search dominates bibliographic search is that it handles the messiness of the real world better.
Conclusion
Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions:
- Will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process current projections of cost per unit content are higher than they were a few years ago. Projections for future preservation budgets are at best no higher. So clearly the answer is no.
- If not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content. I have set out some ideas; others will have different ideas. But the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes.
We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica, is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS underwent to obtain this check-off item was well in excess of 10% of its annual budget.
Making the tradeoff of preserving more stuff using "worse preservation" would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.
To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data. I'm on the advisory board for the EU's 4C project, the Collaboration to Clarify the Costs of Curation. They are addressing the need for more and better cost data by setting up the Curation Cost Exchange. Please go there and submit whatever cost data you can come up with for your own curation operations.
* But notice the current stand-off between Dutch libraries and Elsevier.
** Bill Arms intervened to point out that IDC's 60% growth rate is ridiculous, and thus the graph is ridiculous. He is of course correct, but the point is that unless your archive is growing less than the Kryder rate, your annual storage cost is increasing. The Kryder rate may well be as low as 10%/yr, and very few digital preservation systems are growing at less than 10%/yr.
Today, the American Library Association’s (ALA) Washington Office is pleased to launch the new and reinvigorated District Dispatch! We’ve made it easier for you to find content on the site, search articles, share with friends, subscribe to the blog and learn more about library policy issues. Most importantly, the new and improved site includes features that make it easier for library advocates to find the critical federal policy information they need to take action for libraries.
Here’s what the new District Dispatch can do for you:
- Want to know more about library policy issues? Find all the information you need in one place.
- Thinking about registering for National Library Legislative Day? Learn more here
- Are you looking for free webinars developed for library staff?
- Want to attend events or receive e-newsletters? Sign up for ALA Washington Office articles and notifications
- Like what you’re seeing? We’d love to hear from you!
We hope our new site will be a great resource for you.