District Dispatch: E-rate order at the FCC? B-i-n-g-o!

planet code4lib - Tue, 2014-12-09 20:24

You’re sitting down, right? This Thursday marks the culmination of the Federal Communications Commission’s (FCC) E-rate modernization proceeding—18 months in the making. The Commissioners will vote on a landmark E-rate order that addresses the broadband capacity gap facing many public libraries and the long-term funding shortage of the E-rate program. For the American Library Association (ALA) this is a very big deal, as we have spent countless hours in meetings, on calls, drafting and revising late into the night, cajoling our members for more cost data, and, a few times, engaging in down-the-hall tirades during the tensest moments.

Photo by Sarae via Flickr

For libraries, this vote is a very, very big deal. In July the Commission voted on its first E-rate Order, which focused on increasing the Wi-Fi capacity of libraries and schools (among a number of other program changes). At that time, FCC Chairman Tom Wheeler made a commitment to taking up the outstanding issues, making it clear that the modernization process would be multi-phased. The outstanding issue that matters most for libraries is the lack of high-capacity broadband to the door of the library—because it’s actually just not available or, if it is, the monthly cost is much too high. A second issue left open in July was the long-term funding needs of the program.

ALA fought hard to have these issues addressed and on Thursday, the Chairman is living up to his commitment by bringing a second order before the Commission that squarely takes on both, making strategic rule changes and adding (sitting down still?) $1.5 billion to the fund, permanently. We are very pleased.

And how do you celebrate 18 months of work that involved all of our allies (in states spread across the country)? Step one is to join the meeting on Thursday virtually. While all E-rate meetings are interesting, this one will be especially so. The Chairman has invited librarians, teachers, and students to meet with him before the public open meeting so he can hear directly from the beneficiaries of the program on the difference having a library (or school) connected to high-capacity broadband makes.

On behalf of libraries, the Chairman will be joined by Andrea Berstler, executive director, Wicomico Public Library; Rose Dawson, director of Libraries, Alexandria Library; Nicholas Kerelchuck, manager, Digital Commons, Martin Luther King Jr. Memorial Library, DC; and Richard Reyes-Gavilan, executive director, DC Public Library. In addition to meeting with the Chairman, Richard will also present during the Commission meeting. We were pleased to be asked to provide a library perspective and are thrilled to have representatives from a variety of libraries who can give color to why the Commission’s actions will have such a positive impact on libraries and the communities they serve.

Join in the fun

A little lighthearted fun at an E-rate meeting? Of course. Play E-rate Bingo online:

Bingo Card A
Bingo Card B
Bingo Card C
Bingo Card D

When one of the Commissioners or the Chairman says one of the words on your card, mark it off. Use the Twitter hashtag #libraryerate to let everyone know when you get Bingo! Since the meeting starts at 10:30 Eastern time, we can’t encourage adult beverages, so use chocolate when you hear one of your words. Of course you should also tweet throughout the meeting. Tell everyone what more broadband will mean for your library.

After the meeting, we will still have to wait to see the actual order until the Commission releases it publicly. We are planning a number of outreach activities to help navigate the changes to the E-rate program and to help libraries take advantage of them. The first will be a webinar in collaboration with the Public Library Association. This will be January 8 at 2:00 Eastern. Also, look for a summary of the order once we’ve had a chance to read it!
We’re looking forward to Thursday and the work ahead. So while we’re taking at least a day off to reflect and high-five a little, stay tuned. More is on the way.

The post E-rate order at the FCC? B-i-n-g-o! appeared first on District Dispatch.

Mita Williams: From DIY to Working Together: Using a Hackerspace to Build Community : keynote from Scholars Portal Day 2014

planet code4lib - Tue, 2014-12-09 18:55
On December 3rd, I gave a keynote for Scholars Portal Day.  The slide deck was made using BIG and is available online.  Thank you to Scholars Portal for inviting me to be with one of my favourite communities.

You can’t tell how many apples are in a seed.

In May of 2010, Art Rhyno, Nicole Noel, the late and sorely missed Jean Foster, and I hosted an unconference at the central branch of the Windsor Public Library.

Unconferences are seemingly no longer in vogue, so just in case you don’t know, an unconference is a conference where the topics of discussion are determined by those in the room who gather and disperse in conversation as their interests dictate. 

The unconference was called WEChangeCamp, and it was one of several ChangeCamp unconferences that occurred across the country at that time.

At this particular unconference, 40 people from the community came together to answer this question: “How can we re-imagine Windsor-Essex as a stronger and more vibrant community?”

And on that day the topic of a Windsor Hackerspace was suggested by a young man who I later learned was working on his doctorate in electrical engineering.  What I remember of that conversation four years ago was Aaron explaining the problem at hand: he and his friends needed regular infusions of money to rent a place to build a hackerspace so they needed a group of people who would pay monthly membership fees. But they couldn’t get paying members until they could attract them with a space.

Shortly thereafter, Aaron - like so many other young people in Windsor - left the city for work elsewhere. It’s a bit of an epidemic here. We have the second highest unemployment rate in Canada, and it’s been said that the youth unemployment rate in Windsor is at a staggering 20%.

In Aaron’s case, he moved to Palo Alto, California to do robotics work in an automotive R&D lab.

In the meantime, back in Windsor, in May 2012 I helped host code4lib North at the University of Windsor. We had the pleasure of hosting librarians from many OCUL libraries over those two days, as well as staff from the Windsor Public Library. Also in the audience was Doug Satori. Doug had helped in the development of the WPL’s CanGuru mobile library application. He came to code4lib North because he was curious about the first-generation Raspberry Pi that John Fink of McMaster had brought with him. You have to remember that in 2012 the Raspberry Pi - the $40 computer card - was still very new in the world.

A year later, in May 2013, Windsor got its first Hackerspace when Hackforge was officially opened. The Windsor Public Library graciously lent Hackforge the empty space in the front of their Central Branch that was previously a Woodcarver’s Museum.

When Hackforge launched, Doug Satori was president and I was on the board of directors.

In the 20 months of our existence, I’m proud to say that Hackforge has accomplished quite a lot for itself and for our community.

We’ve co-hosted three hackathons along with the local technology accelerator WETech Alliance.

The first hackathon was called HackWE - and it lasted a weekend, was hosted at the University of Windsor and was based on the City of Windsor’s Open Data Catalogue.

HackWE 2.0 was a 24-hour hackathon based on residential energy data collected by Smart Meters and was part of a larger Ontario Apps for Energy Challenge.

And the third, HackWE 3.0 - which happened just this past October - had events stretched over a week and was based on open scientific data, in celebration of Science and Technology Week.

We’ve hosted our own independent hackathons as well. Last year Hackforge held a two-week Summer Games event for people who wanted to try their hand at video game design. Everyone who completed a game won a trophy. My own video game won the prize for being the Most Endearing.

But in general, our members are more engaged in the regular activities of Hackforge.

They include our bi-weekly Tech Talks that our members give to each other and the wider public, on such topics as Amazon Web Services, slide rules, writing Interactive fiction with JavaScript, and using technology in BioArt.

We have monthly Maptime events in the space. Maptime is an open learning environment for all levels of digital map making, but there is a definite emphasis on support for the beginner.

This photo is from our first Windsor Maptime event, which was dedicated to OpenStreetMap. There are Maptime chapters all around the world, and the next Maptime Toronto meeting is December 11th, if you are curious and are near or in the GTA.

The Hackforge Software Guild meets weekly to work on personal projects as well as to practice pair programming on coding challenges called katas. For example, one of the first kata challenges was to write a program that would correctly write out the lyrics of “99 Bottles of Beer on the Wall,” and one of the more recent ones is how to score a bowling game.
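To give a sense of what a kata looks like, here is a minimal sketch of the 99 Bottles exercise in Python. This is my own illustration, not the Guild’s actual solution:

```python
def verse(n: int) -> str:
    """Return one verse of '99 Bottles of Beer on the Wall' for n bottles."""
    def bottles(k: int) -> str:
        # Handle the grammar: "2 bottles", "1 bottle", "no more bottles".
        if k == 0:
            return "no more bottles of beer"
        return f"{k} bottle{'s' if k != 1 else ''} of beer"

    if n == 0:
        # The final verse sends you back to the store.
        return ("No more bottles of beer on the wall, no more bottles of beer.\n"
                "Go to the store and buy some more, 99 bottles of beer on the wall.")
    return (f"{bottles(n).capitalize()} on the wall, {bottles(n)}.\n"
            f"Take one down and pass it around, {bottles(n - 1)} on the wall.")

def song() -> str:
    """All 100 verses, from 99 bottles down to the trip to the store."""
    return "\n\n".join(verse(n) for n in range(99, -1, -1))
```

The appeal of the kata is exactly the edge cases hidden in the chorus: the singular “1 bottle,” the “no more bottles” wording, and the last verse that breaks the pattern.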

We also have an Open Data interest group, and we are going to launch our own open data portal for Windsor’s non-profit community in 2015. We’re able to do this because this year we received Trillium funding to hire a part-time coordinator and to pay small stipends to people who help with this work.

Our first dataset is likely going to be a community asset map that was compiled by the Ford City Renewal group. Ford City is one of several neighbourhoods in Windsor in which more than 30% of the population has an income at the poverty level. The average income in the City of Windsor as a whole isn’t actually that much less than the average for all of Canada - it’s just that we’re the most economically polarized urban area in the country. That’s one of the reasons why, in January, Hackforge is going to be working with Ford City Renewal to host a build-your-own-computer event for young people in the neighbourhood.

As well, our 3-year Trillium grant also funds another part-time coordinator who matches individuals seeking technology experience with non-profits, such as the Windsor Homeless Coalition, that need technology work and support.

Hackforge has also collaborated with the Windsor Public Library to put on co-hosted events such as the Robot Sumo contest.

And we’ve worked with the City of Windsor to produce persistence of vision bicycle wheels for their WAVES light and sound art festival. I know it’s difficult to see, but in the photo on the screen is a bicycle wheel with a narrow set of lights strapped to three spokes. When the wheel spins, the lights animate and give the impression that there’s an image in the wheel. It only works with the human eye - because of our persistence of vision - so it’s something that doesn’t really come across in a photo very well.

[here's a video!]
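The logic behind those persistence-of-vision wheels can be sketched in a few lines. This is my own illustration of the idea, not Hackforge’s actual firmware; the LED count and angular resolution are made-up parameters:

```python
# A persistence-of-vision display stores an image as angular "columns".
# As the wheel spins, the strip of LEDs flashes one column at a time;
# the eye blends the rapid flashes into a single steady image.

NUM_LEDS = 16       # LEDs along one radial strip (hypothetical)
NUM_COLUMNS = 120   # angular slices per full revolution (hypothetical)

def column_for_angle(angle_degrees: float) -> int:
    """Map the strip's current angle (in degrees) to an image column index."""
    return int((angle_degrees % 360.0) / 360.0 * NUM_COLUMNS) % NUM_COLUMNS

def leds_to_light(image: list, angle_degrees: float) -> list:
    """`image` is NUM_COLUMNS columns, each a list of NUM_LEDS on/off booleans.

    Returns the on/off pattern the strip should show at this wheel angle.
    """
    return image[column_for_angle(angle_degrees)]
```

In a real build, a sensor (often a Hall-effect sensor passing a magnet on the fork) tells the firmware when a revolution starts, and the angle is estimated from the time since that pulse.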

Also, the City of Windsor commissioned us to build a Librarybox for their event which I thought was really cool!

And like most other hackerspaces, we have 3D printers. We have robotic kits. We have soldering irons, and we have lots and lots of spare electronic and computer parts. But unlike most other hackerspaces, which charge their members $30 to $50 a month to join and make use of their space, our hackerspace is currently free to members, who pay for their membership with volunteer work.

This brings us to today in the last days of 2014.

2014 is also the year that Aaron came back to us from California. He’s now my fellow board member at Hackforge.  And, incidentally, so is Art Rhyno, who  - if you don’t know - is a fellow librarian from the University of Windsor.

I was asked by Scholars Portal if I could share some of my experiences with Hackforge in light of today’s theme of building community.  And that is what my talk will be about today: how to use a hackerspace to build community. And I will do so by expanding on five themes. 

But as you know - we are only 2 years old, and so this talk is really about just the beginning steps we’ve been taking and those steps that we are still trying to take. We admittedly have a long way to go.

Helping out with Hackforge has been a very rich and rewarding experience and I’ve learned much from it. And it’s also been hard work and sometimes it has been very time consuming.

All those decisions we made as we started our hackerspace were the first ones we’ve ever had to make for our new organization. This process was exhilarating but it also was occasionally exhausting.  Which brings us to our first theme:

Institutions reduce the choices available to their members

The reason why starting up an organization is so exhausting can be found in Ronald Coase’s work. Coase is famous for introducing the concept of transaction costs to explain the nature and limits of firms, work that earned him the Nobel Prize in Economics in 1991. Now, I haven’t read his Nobel prize-winning work myself. I was first introduced to Coase when I read a book last year called The Org: The Underlying Logic of the Office by Ray Fisman and Tim Sullivan.

I also saw Coase referenced in a blog post by media critic Clay Shirky about the differences between new media and old media. It’s Shirky’s words on the screen right now:

These frozen choices are what gives institutions their vitality — they are in fact what make them institutions. Freed of the twin dangers of navel-gazing and random walks, an institution can concentrate its efforts on some persistent, medium-sized, and tractable problem, working at a scale and longevity unavailable to its individual participants.
Further on in his post Shirky explains what he means by this through an example of what happens at a daily newspaper:

The editors meet every afternoon to discuss the front page. They have to decide whether to put the Mayor’s gaffe there or in Metro, whether to run the picture of the accused murderer or the kids running in the fountain, whether to put the Biker Grandma story above or below the fold. Here are some choices they don’t have to make at that meeting: Whether to have headlines. Whether to be a tabloid or a broadsheet. Whether to replace the entire front page with a single ad. Whether to drop the whole news-coverage thing and start selling ice cream. Every such meeting, in other words, involves a thousand choices, but not a billion, because most of the big choices have already been made.
When you are starting a new organization - or any new venture, really - every small decision can sometimes seem to bog you down. There is navel-gazing and there are random walks.

We got bogged down at the beginning of Hackforge. We actually received the keys to the space in the Windsor Public Library in October of 2012.  Why the delay? We had decided that we would launch the opening of our space with a homemade keypass locking system for the doors because we thought it wouldn’t take much time at all. 

And if we were considering how long it would take one talented person to build such a system by themselves, then maybe we would have been right. But instead, we were very wrong. And looking back at it now, it seems obvious why this was the case:

We had a set of people who had never worked together before, who didn’t necessarily even speak the same programming languages, working without an authority structure, in a scarcely born organization with no promise that we would succeed or survive, nor any sure promise of reward.

Now it’s very important for me to say this so I’m absolutely clear - I am not complaining about our volunteers!

Hackforge would not have succeeded if it weren’t for those very first volunteers who made Hackforge happen in those early days when we were starting with nothing.

And the same holds to this day. When we say that Hackforge is made of volunteers, what we are really saying is that Hackforge = volunteers. 

Our volunteers are especially remarkable because - like all volunteers - they give up the time that’s left over after their pre-existing commitments to work, school, family, and friends. In volunteer work, every interaction is a gift. But, that being said, not every promise in a volunteer organization is one that is fulfilled. Sometimes you learn the hard way that “first thing on Tuesday” means 3pm.

But the delay wasn’t just from the building of the system. Once it was built, we then had to make sure that the keypass system was okay with the library and okay with the fire marshal. And we had to figure out who was going to make the key cards, how they were going to be distributed, and how we would decide who would get a keycard to the space and who would not. Ultimately, it took us 8 months to figure all of this out.

I wanted to explicitly mention this observation because I’ve noticed, within our own institution of libraries, that sometimes when a new group or committee is started up, there is the occasional individual who interprets the slow goings and long initial discussions of the first meetings as, at best, extreme inefficiency and, at worst, a sign of imminent failure.

When in fact, we should recognize that slow starts are normal.

Culture is to an organization as community is to a city

New organizations and new ventures happen slowly, and furthermore, they should happen slowly, because each decision made is one that further defines the “how” of “what an organization is.” Are we, as an organization, formal or informal? Who takes the minutes at meetings? Do we need to give notice of motion? Do we do our own books or do we hire an accountant? Do we provide food at our events? Do we sell swag or do we give it away? How should we fundraise? How do we deal with bad actors? Every decision further defines the work that we do.

It’s very important to take these steps slowly in order to make sure that the way you do things matches up with why you do things. As I think we can appreciate in libraryland, once institutions reduce the choices of their members, it is very difficult - although not impossible - to open them up again for rethinking and refactoring.

One of the reasons why Hackforge has been very successful in its brief existence is that it was formed with clearly articulated reasons and clear guiding principles that continue to help us shape the form of our work. And I know this because the vision of what Hackforge should be was told to me when I was invited to serve on the board when Hackforge began, and I can attest to the fact that it is the same as the one we have now.

Now, there are many different types of hacker- and makerspaces: some are dedicated to artists, others to entrepreneurs, while others are dedicated to the hobbyist. Hackforge - in less than 140 characters - has been described like this: Hackforge supports capacity building in the community and a culture of mentorship and inclusivity.

More specifically, we exist to help with youth retention in Windsor. We aim to be a place where individuals who work or want to work in technology can find support from each other.

I know it might sound strange to you that we believe that our local IT industry needs support, especially when we read about the excesses of Silicon Valley on a regular basis.

But in Windsor, there are not many options for those with a technology background to find work, and so, despite the impression we give to those pursuing a career in STEM, tech jobs in Windsor can be poorly paid and the working conditions can be very problematic.

Many of the provisions in labour law - the ones that entitle employees to set working hours, to breaks between and within shifts, to overtime, and even to time to eat - have exemptions for those who work in IT. I’ve been told that the only way to get a raise while working in IT in this town is to find a better paying job.

The IT industry sometimes treats people as if they were machines themselves.

Hackforge was built as a response to this environment. It was built in the hope that it could help grow something better. At Hackforge we know our strength comes not from the machines we have in our space, but from our amazing members and the time and work that they give to others.

I mean, we love 3D printers because they are a honeypot that brings curious folks into our space, but the secret is we are not really about 3D printers. 

And yet, if you look at the media coverage we receive, you would think we’re just another makerspace that loves 3D printers and robots.

This is why it is SO important to be visible with your values, which is our second theme.

Show your work

One of the challenges that we have at Hackforge is that we don’t have very many women in our ranks. Women make up half of our board of directors, but our larger membership is not representative of the Windsor community - and it’s likely not representative in other aspects of identity, for that matter, either.

We know that if we want to change this situation, it will require sustained work on our part. And so when we had our official launch of Hackforge last year, we hosted, as part of the event, a Women in Technology panel that featured four women who work in IT, including a member of the very successful Girl Develop It from Detroit, all of whom shared their experiences and offered strategies to make the field of technology a more inclusive environment and a better place for everyone.

In the audience for that panel discussion was a representative of WEST, a local non-profit group whose name stands for Women’s Enterprise Skills Training. Starting next year, with the support of another Ontario Trillium grant, Hackforge and WEST are going to launch a project that will offer free computer skills training workshops for women, try to create a community of support, and continue to advocate for women in the IT field.

So I can’t stress this enough. You have to do your work in public if you want your future collaborators to find you.

I have also another Women in Technology story to start our third theme.

So remember I told you about unconferences? Well, the Hackforge members who run the Software Guild do something similar. Sometimes, instead of coding, the folks do something like this: they write down all the things they want to talk about, vote for the topics, and then discuss the most-voted topics within strict time limits. But they don’t call it an unconference:

They call it LEAN COFFEE.

I love it. It’s so adorable.

Anyway, at one of these Lean Coffee sessions, our staff coordinator suggested the topic Women in Technology.  And the response she received was this: We know there’s a problem because Hackforge doesn’t have enough women. But we are not sure how to fix this. 

I found this statement very encouraging.

It’s sad, but in these times, when people can admit that there’s a problem without any deflection or allocation of blame, it is actually very refreshing.

I mean, within librarianship we have some organizations who consistently organize speaking events made up mostly of men. Whenever I raise this matter, I am usually told that if the speaking topic is not about gender, then it’s not about gender. In other words, they tell me that there is no problem.

But sometimes there is a problem.

Look at this photo: from it you would never guess that it was taken in a city that is over 80% African American. This photo is from the first meeting of Maptime Detroit, which I attended last month. One of the first things said during the evening’s introduction was a simple statement by the organizer: “I want to acknowledge who isn’t in this room.” And what followed was a plan to hold the next Maptime meetings not in the mid-town tech incubator, but within the various neighbourhoods of the city, alongside partner organizations already working with Detroiters where they live.

So before we can be more inclusive, we need to recognize when we are not.

We can start by acknowledging who isn’t in the room. It isn’t hard to do.

Quinn Norton wrote a lovely essay about this called Count. Speaking of counting, we are now at theme four.

A mailing list is not a community

What you might find surprising is that, for Hackforge being a gathering of people who generally love love love the Internet, we really don’t have a strong online space for folks to hang out in, with the exception of our IRC channel. We used to have forum software, but it was so overwhelmed with spam on a daily basis that it was almost immediately rendered unusable.

Also, Hackforge doesn’t even have a listserv mailing list. 

And I would go so far as to say that one of the reasons why Hackforge has been as successful as we have been is, in part, because we *don’t* have a mailing list.

There’s a website called Running a Hackerspace that is a collection of animated gifs that metaphorically capture the essence of running a hackerspace. I think it’s particularly telling that there are many recurrent topics on this Tumblr: like the complaints that folks don’t clean up after themselves.

(And this is when I confess that when I drop by Hackforge, I am also sometimes made sad).

But the most prevalent theme in the blog is mailing list rage.

You would think this would be a solved problem by now: how do you support project work that is done asynchronously and dispersed over geography? Many open source communities are finding that the traditional tools of mailing lists, forum software, and IRC channels are not doing enough to help their communities do good work together. More often than not, these technologies seem to be better at boosting the noise than the signal.

Distributed companies like WordPress are moving from IRC to software platforms such as Slack. As I’ve mentioned before, I’m involved with a largely self-organized group called Maptime, and we also make use of Slack, which is essentially user-friendly IRC: chat and messaging along with images, file sharing, archiving, and social media capture.

At Hackforge, we’ve recently decided to use the Jira issue tracker to manage the hacking work that we need to do in the space, and we will be switching to NationBuilder software to manage our members and member communications. When activists, non-profits, and political parties are using software like NationBuilder to manage the contact info, the interests, and the fundraising of tens of thousands of people, it makes me wonder when libraries are going to start using similar software to manage the relationships they have with their communities.

And at a time when my neighbours who rent the skating rink for collective use are using volunteer management software to figure out whose turn it is to bring the hot chocolate, I would like to suggest that libraries could perhaps start using similar software to - at least - manage our internal work and communications as well. Good tools make great communication possible within organizations and our communities. They are worth the investment.

Invest in but do not outsource community management

Before I end my presentation with this last theme, I do want to offer a caveat to everything I’ve said. If you asked all of the people who have been involved in Hackforge - those who have come by our events, spent time in the space, or even volunteered some mentoring at an event - whether they felt they were part of a community, I think most would probably say no. I think we have a wonderful group of people who have contributed to Hackforge, and I think we have a group of people who have even found friends at Hackforge, but I think we still can’t call the whole of what we do “a community” - at least not yet.

Hackforge is approaching its 2nd birthday and this talk has been a wonderful excuse to reflect on what we do well and what we still need to work on.

What works for us are regular events, contests, and hackathons. We are well aware of the limitations of hackathons and how they produce imperfect work, but, for us, it seems that pre-defined limits and deadlines produce more work and generate more interest and excitement than unstructured free time does.

Unlike many hackerspaces, we don’t tend to have many group projects. The door project - as you have learned - was one of the few group projects, and that one took longer than expected. In our early months, we also had an LED sign project that was never completed and actually resulted in some people leaving Hackforge in frustration.

We are a volunteer organization and as such, by the process of evolution, we are a place for the patient and the forgiving. Sometimes we have gotten our first impressions wrong.

One of the largest challenges I think we have as an organization is to be more accessible to beginners. In fact, that’s the feedback that we’ve been getting.

Aaron recently gave a tech talk about tech talks, and the message he received was that Hackforge should provide more sessions for beginners. And this is a particular challenge that we haven’t really addressed yet. We’re lucky that Hackforge has people who are both generous with their time and not afraid of public speaking, and who give tech talks. But many of our speakers don’t preface their talks with an introduction that a newbie could understand. They are so excited to have fellow experts in the crowd that they jump right into the code or electrical specs or what have you.

Likewise, it’s amazing and wonderful that we have regular supportive events like our members’ coding katas, in which those who work with software can practice and share their coding practice with others. But at the moment, we don’t really have anything for those who want to learn how to code. And you might not be shocked to hear this, but Hackforge’s machines - like our 3D printers - lack even the most basic documentation on how to use them.

Without expanding the work of communicating, documenting, explaining, and teaching, Hackforge won’t be able to attract new members. 

Hackforge started as a top-down organization. Our job as a board has been to build the systems that will allow more of the day-to-day work of Hackforge to move from the board to our community and program managers. We were able to hire our managers in the middle of this year, and already they have made wonderful contributions to Hackforge. Our next challenge will be how to move more of the operational work of the managers to the members themselves.

In other words, the challenge for Hackforge is to ensure that the work that needs to be done - all of that communicating, documenting, explaining, teaching - is embraced by all of its members as a community of practice. And through this practice, it’s hoped we can build a community.

So, those are my five themes for building community with a hackerspace:

Institutions reduce the choices available to their members (so choose carefully)
Show your work (so future collaborators can find you).
Acknowledge who isn’t in the room (Count is only the start).
A mailing list is not a community (Invest in tools that do better).
Invest but do not outsource community management.

The work of figuring out how to get a bunch of people to come together and face a shared challenge isn’t just the way to build a community. It’s also how political movements begin. It’s also how a game begins. I would like to thank Scholars Portal for giving me the opportunity to begin Scholars Portal Day with you all.

District Dispatch: Free financial literacy webinar for librarians

planet code4lib - Tue, 2014-12-09 18:20

Today, the Department of Labor will host a webinar that will outline and discuss suggested activities states should undertake to comply with the Workforce Innovation and Opportunity Act (WIOA) beginning immediately through July 1, 2015. This session has limited space so please register quickly.

US Department of Labor

During the webinar “WIOA Technical Assistance Webinar- Jump-Starting Your Implementation,” speakers will highlight areas where states have existing authority to take action to comply with WIOA, as well as provide technical assistance, based on statutory requirements, on additional areas in which states are encouraged to move forward in the transition from the Workforce Investment Act to WIOA.

  • Adele Gagliardi, Administrator, Office of Policy Development and Research, U.S. Department of Labor, Employment and Training Administration
  • Christine Quinn, Special Assistant, U.S. Department of Labor, Employment and Training Administration
  • Lori H. Collins, Director, Division of Workforce & Employment Services, Kentucky Career Center
  • Mike Riley, ETPL lead, Division of Workforce & Employment Services, Kentucky Career Center
  • Lisa Salazar, Director, City of Los Angeles OneSource, Youth Opportunity System
  • Thomas Colombo, Assistant Director, Division of Employment & Rehabilitation Services, Interim Deputy, State of Arizona
  • Alice Sweeney, Director, Department of Career Services, State of Massachusetts
  • Scott C. Fennell, Chief Operating and Financial Officer, CareerSource Florida
  • Moderator: Maggie Ewell, Policy and Reporting Team Lead, Office of Grants Management, U.S. Department of Labor, Employment and Training Administration

Webinar Details
Date: December 9, 2014
Time: 2:00pm ET (1:00pm/Central, 12:00pm/Mountain, 11:00am/Pacific)
Join the webinar

The ALA Washington Office does not know if the webinar will be archived. Contact the Department of Labor with questions about the webinar.

The post Free financial literacy webinar for librarians appeared first on District Dispatch.

District Dispatch: Alan’s NYC Adventure

planet code4lib - Tue, 2014-12-09 18:03

I spent much of last week in New York City as part of the American Library Association (ALA) advocacy effort regarding ebooks. These meetings with publishing executives are described in my post on the American Libraries magazine’s E-content blog. However, I also engaged in some other activities during this trip.

I had the pleasure of participating in Jim Neal’s retirement celebration, held at Casa Italiana, Columbia University, which saw hundreds in attendance to pay tribute to him. As many of you know, Jim is a long-time, strategic, and highly-respected contributor to the library community at the national level. Among his many contributions, he has served on ALA’s Executive Board for three separate tenures, including one presently, and is a former treasurer of the Association.

Jim Neal speaking at his retirement party.

Lee Bollinger, president of Columbia University, kicked off the formal program to recognize Jim’s contributions as vice president of information services and university librarian. Several other Columbia University officials participated in the praise, including our close collaborator Bob Wolven, an associate university librarian and former co-chair of the ALA Digital Content Working Group. Bob commented on Jim’s extraordinary energy and initiative, noting that if Jim received some lemons, he would not make lemonade—instead he would develop a plan for a multi-division business and demand more lemons.

In 2015, Jim will become university librarian emeritus, an honor bestowed only two times previously in the university’s history. Jim will remain active in the field and so ALA and the national library community will continue to benefit from Jim’s strategic guidance in copyright policy and other areas for some time to come. As he is a member of the Policy Revolution! Initiative’s Library Advisory Committee, I am relieved to learn that we will continue to benefit from his counsel in our efforts to reshape national public policy for libraries. While on campus, I also had separate meetings with Jim and Bob to discuss broad issues in digital content and information policy.

I Love My Librarian event.

The award ceremony for the I Love My Librarian award, held at the offices of The New York Times, took place last week, and I was able to attend. Of course, I expected to hear about the exemplary library work of the awardees, but the intense level of emotion exhibited at the event was a bit unexpected.

For these librarians, their work extends far beyond a job, becoming a calling—and their patrons see it that way as well. This award is extremely competitive, with over 1,000 applications ultimately yielding just 10 awardees.

Finally, I made it to Connecticut for an evening. My first stop was the Darien Public Library to meet with Amanda Goodman, a staff librarian and member of the Office for Information Technology Policy (OITP) advisory committee.

Darien Public Library December 2014

I got a looksee at the library’s four 3D printers and its fine children’s library. I then met up with Dr. Roger Levien, author of our policy brief Confronting the Future. Roger is working on a new book on the future of public libraries and we discussed varied aspects of his developing analysis in the context of related work in the field.

Trips are great, but then they end and you end up back in the office trying to catch up. I don’t ever seem to catch up—I’m just trying to keep the backlog under embarrassing levels!

The post Alan’s NYC Adventure appeared first on District Dispatch.

DPLA: December 15, 2014: Open Board Call

planet code4lib - Tue, 2014-12-09 17:32

The DPLA Board of Directors will hold an open call on Monday, December 15, at 3:00 PM Eastern. The agenda and dial-in information are included below.


Public Session

  • What’s coming in early 2015
  • Update from Executive Director
  • Questions and comments from the public

Executive Session

  • Voice approval of strategic plan
  • Review of draft tax return
  • Update on nominating subcommittee
  • Update on current grant activities
Public session dial-in
Dial-in: 800-501-8979
Access Code: 2739336

Open Knowledge Foundation: The Global Open Data Index 2014 is now live!

planet code4lib - Tue, 2014-12-09 16:45

The Global Open Data Index 2014 team is thrilled to announce that the Global Open Data Index 2014 is now live!

We would not have arrived here without the incredible support from our network and the wider open knowledge community in making sure that so many countries/places are represented in the Index and that the agenda for open data moves forward. We’re already seeing this tool being used for advocacy around the world, and hope that the full and published version will allow you to do the same!

How you can help us spread the news

You can embed a map for your country on your blog or website by following these instructions.

Press materials are available in 6 languages so far (English, German, Spanish, Portuguese, Japanese and French), with more expected. If you want to share the news, please share a link to our press page. If you see any coverage of the Global Open Data Index, please submit it to us via this form so we can track coverage.

We are really grateful for everyone’s help in this great community effort!

Here are some of the results of the Global Open Data Index 2014

The Global Open Data Index ranks countries based on the availability and accessibility of information in ten key areas, including government spending, election results, transport timetables, and pollution levels.

The UK tops the 2014 Index, retaining its pole position with an overall score of 96%, closely followed by Denmark and then France at number 3, up from 12th last year. Finland comes in 4th, while Australia and New Zealand share 5th place. Impressive results were seen from India at #10 (up from #27) and from Latin American countries like Colombia and Uruguay, which came in joint 12th.

Sierra Leone, Mali, Haiti and Guinea rank lowest of the countries assessed, but many countries with less open governments were not assessed at all, whether for lack of openness or of a sufficiently engaged civil society.

Overall, whilst there is meaningful improvement in the number of open datasets (from 87 to 105), the percentage of open datasets across all the surveyed countries remained low at only 11%.

Even amongst the leaders on open government data there is still room for improvement: the US and Germany, for example, do not provide a consolidated, open register of corporations. There was also a disappointing degree of openness around the details of government spending with most countries either failing to provide information at all or limiting the information available – only two countries out of 97 (the UK and Greece) got full marks here. This is noteworthy as in a period of sluggish growth and continuing austerity in many countries, giving citizens and businesses free and open access to this sort of data would seem to be an effective means of saving money and improving government efficiency.

Explore the Global Open Data Index 2014 for yourself!

David Rosenthal: Talk at Fall CNI

planet code4lib - Tue, 2014-12-09 15:00
I gave a talk at the Fall CNI meeting entitled Improving the Odds of Preservation, with the following abstract:
Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" vastly exceeds that from all other causes, such as bit rot and format obsolescence. Will persisting with current preservation technologies improve the odds of preservation? If not, what changes are needed to improve them?It covered much of the same material as Costs: Why Do We Care, with some differences in emphasis. Below the fold, the text with links to the sources.
Introduction

I'm David Rosenthal from the LOCKSS (Lots Of Copies Keep Stuff Safe) Program at the Stanford University Libraries. As with all my talks, you don't need to take notes or ask for the slides. The text of the talk, with links to the sources, will go up on my blog shortly.

One of the preservation networks that the LOCKSS Program operates is the CLOCKSS archive, a large dark archive of e-journals and e-books. We operate it under contract to a not-for-profit organization jointly run by publishers and libraries. Earlier this year we completed a more than year-long process that resulted in the CLOCKSS Archive being certified to the Trusted Repository Audit Criteria (TRAC) by CRL. We equalled the previous highest score and gained the first-ever perfect score for Technology. At you will find all the non-confidential material upon which the auditors based their assessment. And on my blog you will find posts announcing the certification, describing the process we went through, discussing the lessons learned, and describing how you can run the demos we put on for the auditors.

Although CRL's certification was to TRAC, the documents include a finding aid structured according to ISO16363, the official ISO standard that is superseding TRAC. If you look at the finding aid or at the ISO16363 documents you will see that many of the criteria are concerned with economic sustainability. Among the confidential materials the auditors requested were "Budgets for last three years and projections for next two showing revenue and expenses".

We actually gave them five-year projections. This is an area where we had a good story to tell. The LOCKSS Program got started with grant funds from the NSF, the Andrew W. Mellon Foundation, and Sun Microsystems. But grant funding isn't a sustainable basis for long-term preservation. In 2005, the Mellon Foundation gave us a 2-year grant which we had to match, and after which we had to be off grant funding. For more than 7 years we have been in the black without grant funding. The LOCKSS software is free open source, the LOCKSS team charges for support and services.

Achieving this economic sustainability has required a consistent focus on minimizing the cost of every aspect of our operations. Because the LOCKSS system's Lots Of Copies trades using more disk space for using less of other resources (especially lawyers), I have been researching in particular the costs of storage for some years. In what follows I want to look at the big picture of digital preservation costs and their implications. It is in three sections:
  • The current situation.
  • Cost trends.
  • What can be done?
The Current Situation

How well are we doing at the task of preservation? Attempts have been made to measure the probability that content is preserved in some areas: e-journals, e-theses and the surface Web:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Luis Faria and co-authors (PDF) compare information extracted from journal publishers' web sites with the Keepers Registry and conclude: "We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added."
  • The Hiberlink project studied the links in 46,000 US theses and determined that about 50% of the linked-to content was preserved in at least one Web archive.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques including sending random words to search engines and random strings to the URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL.
    Their results are somewhat difficult to interpret, but for their two more random samples they report: URIs from search engine sampling have about 2/3 chance of being archived [at least once] and URIs just under 1/3.
So, are we preserving half the stuff that should be preserved? Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic.
An Optimistic Assessment

First, the assessment isn't risk-adjusted:
  • As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category, to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from content at high risk of loss. Preserving Elsevier's content makes it look like a huge part of the record is safe because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation*, so what have those resources achieved for future readers?
  • As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
Second, the assessment isn't adjusted for difficulty:
  • A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
  • The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, where negotiating once with a small publisher gains a small amount of high-risk content.
  • Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Harvesting the low-hanging fruit directs resources away from the content at risk of loss.

Third, the assessment is backward-looking:
  • As regards scholarly communication it looks only at the traditional forms, books, theses and papers. It ignores not merely published data, but also all the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly both at much higher risk of loss than the traditional forms that are being preserved, because they lack well-established and robust business models, and much more difficult to preserve, since the legal framework is unclear and the content is either much larger, or much more dynamic, or in some cases both.
  • As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.
Fourth, the assessment is likely to suffer measurement bias:
  • The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
  • As regards Web content, Ainsworth et al describe various forms of bias in their paper.
As Cliff Lynch pointed out in his summing-up of the 2014 IDCC conference, the scholarly literature and the surface Web are genres of content for which the denominator of the fraction being preserved (the total amount of genre content) is fairly well known, even if it is difficult to measure the numerator (the amount being preserved). For many other important genres, even the denominator is becoming hard to estimate as the Web enables a variety of distribution channels:
  • Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
  • YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
  • Much music now happens on YouTube (e.g. Pomplamoose)
  • Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories, much still resides on grad students' laptops.
Of course, "what we should be preserving" is a judgement call, but clearly even purists who wish to preserve only stuff to which future scholars will undoubtedly require access would be hard pressed to claim that half that stuff is preserved.
Preserving the Rest

Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?
  • We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more, because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be that we couldn't afford to preserve it.
  • We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels.
  • We can more than halve the cost per unit content. Doing so requires a radical re-think of our preservation processes and technology.
Such a radical re-think requires understanding where the costs go in our current preservation methodology, and how they can be funded. As an engineer, I'm used to using rules of thumb. The one I use to summarize most of the research into past costs is that ingest takes half the lifetime cost, preservation takes one third, and access takes one sixth.

On this basis, one would think that the most important thing to do would be to reduce the cost of ingest. It is important, but not as important as you might think. The reason is that ingest is a one-time, up-front cost. As such, it is relatively easy to fund. In principle, research grants, author page charges, submission fees and other techniques can transfer the cost of ingest to the originator of the content, and thereby motivate them to explore the many ways that ingest costs can be reduced. But preservation and dissemination costs continue for the life of the data, for "ever". Funding a stream of unpredictable payments stretching into the indefinite future is hard. Reductions in preservation and dissemination costs will have a much bigger effect on sustainability than equivalent reductions in ingest costs.
Cost Trends

We've been able to ignore this problem for a long time, for two reasons. The first is that from at least 1980 to 2010 storage costs followed Kryder's Law, the disk analog of Moore's Law, dropping 30-40%/yr. This meant that, if you could afford to store the data for a few years, the cost of storing it for the rest of time could be ignored, because of course Kryder's Law would continue forever. The second is that as the data got older, access to it was expected to become less frequent. Thus the cost of access in the long term could be ignored.

But can we continue to ignore these problems?
Preservation

Kryder's Law held for three decades, an astonishing feat for exponential growth. Something that goes on that long gets built into people's model of the world, but as Randall Munroe points out, in the real world exponential curves cannot continue for ever. They are always the first part of an S-curve.

This graph, from Preeti Gupta of UC Santa Cruz, plots the cost per GB of disk drives against time. In 2010 Kryder's Law abruptly stopped. In 2011 the floods in Thailand destroyed 40% of the world's capacity to build disks, and prices doubled. Earlier this year they finally got back to 2010 levels. Industry projections are for no more than 10-20% per year going forward (the red lines on the graph). This means that disk is now about 7 times as expensive as was expected in 2010 (the green line), and that in 2020 it will be between 100 and 300 times as expensive as 2010 projections.

These are big numbers, but do they matter? After all, preservation is only about one-third of the total, and only about one-third of that is media costs.

Our models of the economics of long-term storage compute the endowment, the amount of money that, deposited with the data and invested at interest, would fund its preservation "for ever". This graph, from my initial rather crude prototype model, is based on hardware cost data from Backblaze and running cost data from the San Diego Supercomputer Center (much higher than Backblaze's) and Google. It plots the endowment needed for three copies of a 117TB dataset to have a 95% probability of not running out of money in 100 years, against the Kryder rate (the annual percentage drop in $/GB). The different curves represent policies of keeping the drives for 1,2,3,4,5 years. Up to 2010, we were in the flat part of the graph, where the endowment is low and doesn't depend much on the exact Kryder rate. This is the environment in which everyone believed that long-term storage was effectively free.

But suppose the Kryder rate were to drop below about 20%/yr. We would be in the steep part of the graph, where the endowment needed is both much higher and also strongly dependent on the exact Kryder rate.
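The shape of that graph is easy to reproduce with a toy version of the model. The sketch below is not the actual prototype model (which is Monte Carlo and includes running costs); it is a minimal deterministic stand-in, with an assumed 4-year drive replacement cycle and 2% interest, that shows the same steep rise as the Kryder rate falls:

```python
# Hypothetical, simplified endowment calculation: the present value of a
# century of media purchases, with $/GB falling at the Kryder rate and
# the endowment earning interest. All parameters are illustrative only.
def endowment(initial_cost, kryder_rate, interest_rate,
              years=100, replace_every=4):
    """Money needed today to fund media replacements for `years` years."""
    total = 0.0
    for t in range(0, years, replace_every):
        media_cost = initial_cost * (1 - kryder_rate) ** t   # cheaper media
        total += media_cost / (1 + interest_rate) ** t       # discount to today
    return total

# Below roughly 20%/yr the required endowment rises steeply.
for k in (0.40, 0.30, 0.20, 0.10, 0.05):
    print(f"Kryder rate {k:.0%}: endowment {endowment(1000, k, 0.02):,.0f}")
```

With these illustrative numbers, moving from a 40%/yr to a 5%/yr Kryder rate more than triples the required endowment, which is the "steep part of the graph" in miniature.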

We don't need to suppose. Preeti's graph and industry projections show that now and for the foreseeable future we are in the steep part of the graph. What happened to slow Kryder's Law? There are a lot of factors, we outlined many of them in a paper for UNESCO's Memory of the World conference (PDF). Briefly, both the disk and tape markets have consolidated to a couple of vendors, turning what used to be a low-margin, competitive market into one with much better margins. Each successive technology generation requires a much bigger investment in manufacturing, so requires bigger margins, so drives consolidation. And the technology needs to stay in the market longer to earn back the investment, reducing the rate of technological progress.

Thanks to aggressive marketing, it is commonly believed that "the cloud" solves this problem. Unfortunately, cloud storage is actually made of the same kind of disks as local storage, and is subject to the same slowing of the rate at which it was getting cheaper. In fact, when all costs are taken into account, cloud storage is not cheaper for long-term preservation than doing it yourself once you get to a reasonable scale. Cloud storage really is cheaper if your demand is spiky, but digital preservation is the canonical base-load application.

You may think that the cloud is a competitive market; in fact it is dominated by Amazon.
Jillian Mirandi, senior analyst at Technology Business Research Group (TBRI), estimated that AWS will generate about $4.7 billion in revenue this year, while comparable estimated IaaS revenue for Microsoft and Google will be $156 million and $66 million, respectively.

When Google recently started to get serious about competing, they pointed out that while Amazon's margins may have been minimal at introduction, by then they were extortionate:
cloud prices across the industry were falling by about 6 per cent each year, whereas hardware costs were falling by 20 per cent. And Google didn't think that was fair. ... "The price curve of virtual hardware should follow the price curve of real hardware."

Notice that the major price drop triggered by Google was a one-time event; it was a signal to Amazon that they couldn't have the market to themselves, and to smaller players that they would no longer be able to compete.

In fact commercial cloud storage is a trap. It is free to put data in to a cloud service such as Amazon's S3, but it costs to get it out. For example, getting your data out of Amazon's Glacier without paying an arm and a leg takes 2 years. If you commit to the cloud as long-term storage, you have two choices. Either keep a copy of everything outside the cloud (in other words, don't commit to the cloud), or stay with your original choice of provider no matter how much they raise the rent.
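The "2 years" figure can be reconstructed from Glacier's original pricing model, which (as I understand it; the exact allowance is an assumption of this sketch) let you retrieve about 5% of your stored data per month without retrieval charges:

```python
# Rough arithmetic behind "takes 2 years": under Glacier's original
# pricing, roughly 5% of stored data per month could be retrieved without
# retrieval charges (the 5% allowance here is an assumption).
free_fraction_per_month = 0.05
months_to_drain = 1 / free_fraction_per_month   # 20 months
print(f"{months_to_drain:.0f} months, ~{months_to_drain / 12:.1f} years")
```

Retrieving any faster than that allowance incurred peak-rate retrieval charges, the "arm and a leg" above.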

Unrealistic expectations that we can collect and store the vastly increased amounts of data projected by consultants such as IDC within current budgets place currently preserved content at great risk of economic failure.

Here's a graph that illustrates the looming crisis in long-term storage, its cost. The red line is Kryder's Law, at IHS iSuppli's 20%/yr. The blue line is the IT budget, at's 2%/yr. The green line is the annual cost of storing the data accumulated since year 0 at the 60% growth rate projected by IDC,** all relative to the value in the first year. 10 years from now, storing all the accumulated data would cost over 20 times as much as it does this year. If storage is 5% of your IT budget this year, in 10 years it will be more than 100% of your budget. If you're in the digital preservation business, storage is already way more than 5% of your IT budget.
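The graph's arithmetic can be checked in a few lines. Treating "the data accumulated since year 0" as the sum of annual increments growing at 60%/yr is my reading of the construction, so the exact multiple is illustrative:

```python
# Back-of-envelope reproduction of the graph: data grows 60%/yr, $/GB
# falls 20%/yr (Kryder), the IT budget grows 2%/yr. Rates from the text.
growth, kryder, budget_growth = 0.60, 0.20, 0.02

def relative_storage_cost(year):
    """Cost of storing everything accumulated through `year`, vs year 0."""
    accumulated = sum((1 + growth) ** y for y in range(year + 1))
    return accumulated * (1 - kryder) ** year

cost10 = relative_storage_cost(10)      # roughly 30x the year-0 cost
budget10 = (1 + budget_growth) ** 10    # budget up only ~22%
share = 0.05 * cost10 / budget10        # starting from 5% of the IT budget
print(f"{cost10:.0f}x year-0 cost; {share:.0%} of the year-10 budget")
```

Storage cost growing 30-fold while the budget grows 22% is how a 5% line item swallows the whole budget.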
Dissemination

The storage part of preservation isn't the only on-going cost that will be much higher than people expect, access will be too. In 2010 the Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. With research data this can be a real difficulty; the value of the data may not be evident for a long time. Shang dynasty astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

In most cases so far the cost of an access to an individual item has been small enough that archives have not charged the reader. Research into past access patterns to archived data showed that access was rare, sparse, and mostly for integrity checking.

But the advent of "Big Data" techniques mean that, going forward, scholars increasingly don't want to access a few individual items in a collection, they want to ask questions of the collection as a whole. For example, the Library of Congress announced that it was collecting the entire Twitter feed, and almost immediately had 400-odd requests for access to the collection. The scholars weren't interested in a few individual tweets, but in mining information from the entire history of tweets. Unfortunately, the most the Library could afford to do with the feed is to write two copies to tape. There's no way they could afford the compute infrastructure to data-mine from it. We can get some idea of how expensive this is by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 2.5 times as expensive; until recently it was 5.5 times.
Ingest

Almost everyone agrees that ingest is the big cost element. Where does the money go? The two main cost drivers appear to be the real world, and metadata.

In the real world it is natural that the cost per unit content increases through time, for two reasons. The content that's easy to ingest gets ingested first, so over time the difficulty of ingestion increases. And digital technology evolves rapidly, mostly by adding complexity. For example, the early Web was a collection of linked static documents. Its language was HTML. It was reasonably easy to collect and preserve. The language of today's Web is Javascript, and much of the content you see is dynamic. This is much harder to ingest. In order to find the links much of the collected content now needs to be executed as well as simply being parsed. This is already significantly increasing the cost of Web harvesting, both because executing the content is computationally much more expensive, and because elaborate defenses are required to protect the crawler against the possibility that the content might be malign.

It is worth noting, however, that the very first US web site in 1991 featured dynamic content, a front-end to a database!

The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.

Worse, the W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

Metadata in the real world is widely known to be of poor quality, both format and bibliographic kinds. Efforts to improve the quality are expensive, because they are mostly manual and, inevitably, reducing entropy after it has been generated is a lot more expensive than not generating it in the first place.
What can be done?

We are preserving less than half of the content that needs preservation. The cost per unit content of each stage of our current processes is predicted to rise. Our budgets are not predicted to rise enough to cover the increased cost, let alone more than doubling to preserve the other more than half. We need to change our processes to greatly reduce the cost per unit content.
Preservation

It is often assumed that, because it is possible to store and copy data perfectly, only perfect data preservation is acceptable. There are two problems with this expectation.

To illustrate the first problem, let's examine the technical problem of storing data in its most abstract form. Since 2007 I've been using the example of "A Petabyte for a Century". Think about a black box into which you put a Petabyte, and out of which a century later you take a Petabyte. Inside the box there can be as much redundancy as you want, on whatever media you choose, managed by whatever anti-entropy protocols you want. You want to have a 50% chance that every bit in the Petabyte is the same when it comes out as when it went in.

Now consider every bit in that Petabyte as being like a radioactive atom, subject to a random process that flips it with a very low probability per unit time. You have just specified a half-life for the bits. That half-life is about 60 million times the age of the universe. Think for a moment how you would go about benchmarking a system to show that no process with a half-life less than 60 million times the age of the universe was operating in it. It simply isn't feasible. Since at scale you are never going to know that your system is reliable enough, Murphy's law will guarantee that it isn't.

Here's some back-of-the-envelope hand-waving. Amazon's S3 is a state-of-the-art storage system. Its design goal is an annual probability of loss of a data object of 10^-11. If the average object is 10K bytes, the bit half-life is about a million years, way too short to meet the requirement but still really hard to measure.

Note that the 10^-11 is a design goal, not the measured performance of the system. There's a lot of research into the actual performance of storage systems at scale, and it all shows them under-performing expectations based on the specifications of the media. Why is this? Real storage systems are large, complex systems subject to correlated failures that are very hard to model.

Worse, the threats against which they have to defend their contents are diverse and almost impossible to model. Nine years ago we documented the threat model we use for the LOCKSS system. We observed that most discussion of digital preservation focused on these threats:
  • Media failure
  • Hardware failure
  • Software failure
  • Network failure
  • Obsolescence
  • Natural Disaster
but that the experience of operators of large data storage facilities was that the significant causes of data loss were quite different:
  • Operator error
  • External Attack
  • Insider Attack
  • Economic Failure
  • Organizational Failure
To illustrate the second problem, consider that building systems to defend against all these threats combined is expensive, and can't ever be perfectly effective. So we have to resign ourselves to the fact that stuff will get lost. This has always been true; it should not be a surprise. And it is subject to the law of diminishing returns. Coming back to the economics, how much should we spend reducing the probability of loss?

Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.

However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has preserved more than 5 times as much at the same cost.
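A short simulation reproduces the 1.89 figure (assuming, as the text implies, that the 1% loss applies to the accumulated content, including each year's additions):

```python
# System A: lossless, stores S units of content per year.
# System B: half the cost per byte, so 2S per year, but loses 1% of
# its accumulated content (including that year's additions) annually.
S = 1.0
a = b = 0.0
for year in range(10):
    a += S
    b = (b + 2 * S) * 0.99

print(round(b / a, 2))  # 1.89
```

After the decade the cheaper, lossy system holds about 1.89 times as much content for the same total budget.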

Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users is currently the 150th most visited site, whereas is the 1519th. For UK users is currently the 131st most visited site, whereas is the 2744th.

Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more really is better.
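For a statistical question such as estimating a proportion, the sampling error shrinks like 1/sqrt(n), so a sample twice as large with 1% randomly lost still gives a more accurate answer than a smaller, pristine one. A sketch (the proportion p and sample size N are invented for illustration):

```python
import math

p, N = 0.3, 100_000  # hypothetical true proportion and sample size

def standard_error(n):
    """Standard error of an estimated proportion from a sample of n."""
    return math.sqrt(p * (1 - p) / n)

# Clean sample of N vs. a sample of 2N with 1% randomly lost:
print(standard_error(N))                 # error of the small clean sample
print(standard_error(int(2 * N * 0.99)))  # smaller: the big noisy sample wins
```

Doubling the sample cuts the error by roughly sqrt(2); losing a random 1% barely moves it.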

Unrealistic expectations for how well data can be preserved make the best be the enemy of the good. We spend money reducing even further the small probability of even the smallest loss of data that could instead preserve vast amounts of additional data, albeit with a slightly higher risk of loss.

Within the next decade all current popular storage media, disk, tape and flash, will be up against very hard technological barriers. A disruption of the storage market is inevitable. We should work to ensure that the needs of long-term data storage will influence the result. We should pay particular attention to the work underway at Facebook and elsewhere that uses techniques such as erasure coding, geographic diversity, and custom hardware based on mostly spun-down disks and DVDs to achieve major cost savings for cold data at scale.

Every few months there is another press release announcing that some new, quasi-immortal medium such as fused silica glass or stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.

The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in 1/3 the space. Since space in the data center or even at Iron Mountain isn't free, this is a powerful incentive to move old media out. If you believe that Kryder rates will get back to 30%/yr, after a decade you could store 30 times as much data in the same space.
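The arithmetic behind these figures (a sketch; the Kryder rate is treated as a constant annual decline in cost per byte, equivalently a constant annual increase in density):

```python
def space_factor(kryder_rate, years):
    """Fraction of the original space the same data needs after `years`,
    if density improves by `kryder_rate` per year."""
    return (1 - kryder_rate) ** years

print(space_factor(0.10, 10))      # ~0.35: about 1/3 the space at 10%/yr
print(1 / space_factor(0.30, 10))  # ~35: roughly 30x the data at 30%/yr
```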

The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
  • Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
  • Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we have seen, current media are many orders of magnitude too unreliable for the task ahead.
Even if you could ignore failures, it wouldn't make economic sense. As Brian Wilson, CTO of BackBlaze, points out, in their long-term storage environment:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)

Dissemination

The real problem here is that scholars are used to having free access to library collections and research data, but what scholars now want to do with archived data is so expensive that they must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars. A charging mechanism is needed to pay for the infrastructure. Fortunately, because the scholar's access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.

For smaller collections, Amazon provides Free Public Datasets: Amazon stores a copy of the data at no charge, charging scholars who access the data for their computation rather than charging the owner of the data for storage.

Even for large and non-public collections it may be possible to use Amazon. Suppose that, in addition to keeping the two archive copies of the Twitter feed on tape, the Library of Congress kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. For this year, it would have averaged about $4100/mo, or about $50K. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would be no bandwidth charges. The storage charges could be borne by the library or charged back to the researchers. If they were charged back, the 400 initial requesters would each need to pay about $125 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach. Because the Library's preservation copy isn't in the cloud, they aren't locked in.
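The chargeback arithmetic is easy to verify, using only the figures quoted above:

```python
# Back-of-the-envelope for the hypothetical Twitter-feed access copy
monthly_storage = 4100    # $/month in S3 Reduced Redundancy (from the text)
annual_storage = 12 * monthly_storage
initial_requesters = 400

print(annual_storage)                       # 49200: "about $50K" per year
print(annual_storage / initial_requesters)  # 123.0: "about $125" each
```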

In the near term, separating the access and preservation copies in this way is a promising way not so much to reduce the cost of access, but to fund it more realistically by transferring it from the archive to the user. In the longer term, architectural changes to preservation systems that closely integrate limited amounts of computation into the storage fabric have the potential for significant cost reductions to both preservation and dissemination. There are encouraging early signs that the storage industry is moving in that direction.
Ingest

There are two parts to the ingest process: the content and the metadata.

The evolution of the Web that poses problems for preservation also poses problems for search engines such as Google. Where they used to parse the HTML of a page into its Document Object Model (DOM) in order to find the links to follow and the text to index, they now have to construct the CSS object model (CSSOM), including executing the Javascript, and combine the DOM and CSSOM into the render tree to find the words in context. Preservation crawlers such as Heritrix used to construct the DOM to find the links, and then preserve the HTML. Now they also have to construct the CSSOM and execute the Javascript. It might be worth investigating whether preserving a representation of the render tree rather than the HTML, CSS, Javascript, and all the other components of the page as separate files would reduce costs.
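The shift can be seen in miniature with a static-HTML link extractor of the kind early crawlers relied on (a toy example using Python's standard library; the page content is invented). Parsing the markup finds the static links, but any links that Javascript would inject at render time are invisible without building the CSSOM and executing the script:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Old-style crawling: parse static HTML, collect href attributes."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

page = '<a href="/about">About</a><script>injectMoreLinks()</script>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about']; script-injected links never appear
```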

It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In these cases where we simply can't ingest it, preserving it in place may be the best we can do; creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.

Metadata is commonly regarded as essential for preservation. For example, there are 52 criteria for ISO 16363 Section 4. Of these, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
  • Manually creating metadata is impractical at scale.
  • Extracting metadata from the content scales better, but it is still expensive.
  • In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
We need less metadata so we can have more data. Two questions need to be asked:
  • When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available.
  • How important is the metadata to the task of preservation? Generating metadata because it is possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it is a waste of resources. If the reason to validate the formats of incoming content using error-prone tools is to reject allegedly non-conforming content, it is counter-productive. The majority of content in formats such as HTML and PDF fails validation but renders legibly.
The LOCKSS and CLOCKSS systems take a very parsimonious approach to format metadata. Nevertheless, the requirements of ISO 16363 pretty much forced us to expend resources implementing and using FITS, whose output does not in fact contribute to our preservation strategy, and whose binaries are so large that we have to maintain two separate versions of the LOCKSS daemon, one with FITS for internal use and one without for actual preservation. Further, the demands we face for bibliographic metadata mean that metadata extraction is a major part of ingest costs for both systems. These demands come from requirements for:
  • Access via bibliographic (as opposed to full-text) search, for example OpenURL resolution.
  • Meta-preservation services such as the Keepers Registry.
  • Competitive marketing.
Bibliographic search, preservation tracking and bragging about exactly how many articles and books your system preserves are all important, but whether they justify the considerable cost involved is open to question. Because they are cleaning up after the milk has been spilt, digital preservation systems are poorly placed to improve metadata quality.

Resources should be devoted to avoiding spilling milk rather than to cleanup. For example, given how much the academic community spends on the services publishers allegedly provide in the way of improving the quality of publications, it is an outrage that even major publishers cannot spell their own names consistently, cannot format DOIs correctly, get authors' names wrong, and so on.

The alternative is to accept that metadata correct enough to rely on is impossible, downgrade its importance to that of a hint, and stop wasting resources on it. One of the reasons full-text search dominates bibliographic search is that it handles the messiness of the real world better.
Conclusion

Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" will vastly exceed that from all other causes, such as bit rot and format obsolescence. This raises two questions:
  • Will persisting with current preservation technologies improve the odds of preservation? At each stage of the preservation process current projections of cost per unit content are higher than they were a few years ago. Projections for future preservation budgets are at best no higher. So clearly the answer is no.
  • If not, what changes are needed to improve the odds? At each stage of the preservation process we need to at least halve the cost per unit content. I have set out some ideas, others will have different ideas. But the need for major cost reductions needs to be the focus of discussion and development of digital preservation technology and processes.
Unfortunately, any way of making preservation cheaper can be spun as "doing worse preservation". Jeff Rothenberg's Future Perfect 2012 keynote is an excellent example of this spin in action. Even if we make large cost reductions, institutions have to decide to use them, and "no-one ever got fired for choosing IBM".

We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS underwent to obtain this check-off item was well in excess of 10% of its annual budget.

Making the tradeoff of preserving more stuff using "worse preservation" would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.

To sum up, we need to talk about major cost reductions. The basis for this conversation must be more and better cost data. I'm on the advisory board for the EU's 4C project, the Collaboration to Clarify the Costs of Curation. They are addressing the need for more and better cost data by setting up the Curation Cost Exchange. Please go there and submit whatever cost data you can come up with for your own curation operations.

* But notice the current stand-off between Dutch libraries and Elsevier.
** Bill Arms intervened to point out that IDC's 60% growth rate is ridiculous, and thus the graph is ridiculous. He is of course correct, but the point is that unless your archive is growing less than the Kryder rate, your annual storage cost is increasing. The Kryder rate may well be as low as 10%/yr, and very few digital preservation systems are growing at less than 10%/yr.

District Dispatch: What can the new District Dispatch do for you?

planet code4lib - Tue, 2014-12-09 06:31

Today, the American Library Association’s (ALA) Washington Office is pleased to launch the new and reinvigorated District Dispatch! We’ve made it easier for you to find content on the site, search articles, share with friends, subscribe to the blog and learn more about library policy issues. Most importantly, the new and improved site includes features that make it easier for library advocates to find the critical federal policy information they need to take action for libraries.

Here’s what the new District Dispatch can do for you:

We hope our new site will be a great resource for you.

The post What can the new District Dispatch do for you? appeared first on District Dispatch.

Islandora: Announcing the First Annual Islandora Conference

planet code4lib - Mon, 2014-12-08 21:07

It's like an Islandora Camp, only more.

The Islandora community has been working since 2006, via email, listservs, irc, and nearly a dozen Islandora Camps. We have grown, matured, tackled major changes in the structure of the project, and now we are ready to come together for a full conference to talk about where we've been and where we're going.

August 3 - 7, 2015, we invite Islandorians from the world over to join us in the birthplace of Islandora (Charlottetown, PEI) for a week of great food, (hopefully) beautiful weather, and all the Islandora you can handle. This full week event will contain sessions from the Islandora Foundation, interest groups, community presentations, two full days of hands-on Islandora training, and will end with an Islandora Hackfest where we invite you to make your mark in the Islandora code and work together with your fellow Islandorians to complete a project selected by the community.

Full details will be available as they develop on our Conference Website, including registration, calls for proposals, scheduling, and recommendations for accommodations and travel. We look forward to seeing you in PEI next year!

Photo: Nicolas Raymond

District Dispatch: Reminder: Free webinar on Ebola for librarians

planet code4lib - Mon, 2014-12-08 21:04

Reminder: On Friday, December 12, 2014, library leaders from the U.S. National Library of Medicine will host the free webinar “Ebola and Other Infectious Diseases: The Latest Information from the National Library of Medicine.” As a follow-up to the webinar they presented in October, librarians from the U.S. National Library of Medicine will be discussing how to provide effective services in this environment, as well as providing an update on information sources that can be of assistance to librarians.

Speakers
  • Siobhan Champ-Blackwell is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center. Champ-Blackwell selects material to be added to the NLM disaster medicine grey literature database and is responsible for the Center’s social media efforts. Champ-Blackwell has over 10 years of experience in providing training on NLM products and resources.
  • Elizabeth Norton is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center where she has been working to improve online access to disaster health information for the disaster medicine and public health workforce. Norton has presented on this topic at national and international association meetings and has provided training on disaster health information resources to first responders, educators, and librarians working with the disaster response and public health preparedness communities.

Date: December 12, 2014
Time: 2:00 PM–3:00 PM Eastern
Register for the free event

If you cannot attend this live session, a recorded archive will be available to view at your convenience. To view past webinars also done in collaboration with iPAC, please visit

The post Reminder: Free webinar on Ebola for librarians appeared first on District Dispatch.

Nicole Engard: Bookmarks for December 8, 2014

planet code4lib - Mon, 2014-12-08 20:30

Today I found the following resources and bookmarked them:

  • Code.org: Launched in 2013, Code.org® is a non-profit dedicated to expanding participation in computer science by making it available in more schools, and increasing participation by women and underrepresented students of color. Our vision is that every student in every school should have the opportunity to learn computer science.
  • Hour of Code: Join the largest learning event in history, Dec 8-14, 2014. The Hour of Code is a global movement reaching tens of millions of students in 180+ countries. Anyone, anywhere can organize an Hour of Code event. One-hour tutorials are available in over 30 languages. No experience needed.
  • HipChat: Private group chat, video chat, instant messaging for teams
  • Slack: Slack is a platform for team communication: everything in one place, instantly searchable, available wherever you go.

Digest powered by RSS Digest

The post Bookmarks for December 8, 2014 appeared first on What I Learned Today....

Related posts:

  1. Conversants :-) A Participatory Conversation
  2. Koha in Library School
  3. Registered

SearchHub: Noob* Notes: Fusion First Look

planet code4lib - Mon, 2014-12-08 19:27
This is a record of my coming up to speed on Fusion, starting from zero. I’ve just joined the Lucidworks team to write documentation and develop demos. I’d like to dedicate this, my first post, to developers who, like me, know enough about search and Lucene and/or Solr to be dangerous (read: employable), but who haven’t used Fusion — yet.

Getting Started

I like to click first, read the docs later, so the first thing I do is find the Fusion download page. I download Fusion (version 1.1.1). It’s a gzipped tarball with a README.txt file that points to the online documentation. Looks like I have to read the docs sooner rather than later. The installation instructions are straightforward. My Mac is running the Java 7 JDK (build 1.7.0_71-b14) but I don’t have an existing Solr installation, so I need to start Fusion with the embedded Solr instance. I run the bin/fusion start command, point the Chrome web browser at http://localhost:8764, and log in. The Fusion UI shows 5 icons: Admin, Quick Start, Relevancy Workbench, Search, Banana Dashboards. I click through each in turn. The Banana Dashboard is especially impressive. This looks very different from the Solr Admin UI, that’s for sure. The instructions in the Getting Started page start with the Admin app. Following the steps in the First 5 minutes with Fusion, I create a Collection named getStarted and a web Datasource named lucidworks. The concept of a Collection is familiar from Solr; it’s a logical index. Datasources are used to pull data into an index. Indexing the Lucidworks web pages starting from the URL retrieves 1180 documents. On a slow cable internet connection, this took 5 minutes. At this point I’ve spent about 3 minutes staring at and clicking through the Admin UI, and 5 minutes reading the Lucidworks docs. It’s always prudent to multiply a time estimate by 2 (or 3), so if I can carry out a few searches in under 2 minutes, my first 5 minutes with Fusion will have taken 10 minutes of my time, plus 5 minutes indexing.
I run a series of searches: “Lucidworks” returns 1175 documents, “Lucidworks AND Fusion” returns 1174 documents, “Java AND Python” returns 15 documents, “unicorn” returns 0 documents. That took no time at all. I’ve got a Collection and the search results look sensible. By following the instructions and ignoring everything I don’t understand, my first 5 minutes with Fusion have been a total success.

A Real Problem

So far I’ve kicked the tires and taken a drive around the parking lot. Time to hit the road and index some fresh content. My go-to test case is the content available from the National Library of Medicine. The NLM maintains databases of drugs, chemicals, diseases, genes, proteins, enzymes, as well as MEDLINE/PubMed, a collection of more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. NLM leases MEDLINE/PubMed to U.S. and non-U.S. individuals or organizations, distributed as a set of XML files whose top-level element is <MedlineCitationSet>. Each citation set contains one or more <MedlineCitation> elements. Every year, NLM releases a new version of MEDLINE, a revised DTD, and a sample data set. Can I index the MEDLINE/PubMed 2015 sample data as easily as I indexed the Lucidworks site? The answer is yes, I can index the data, but it takes a little more work because a specialized document set requires a specialized index. I demonstrate this by failure. Working through the Fusion Admin UI, I create a new collection called Medsamp2015. As before, I create a web datasource called medsamp2015xml and point it at the MEDLINE/PubMed 2015 sample data file. Fusion processes this URL into a single document. Since there’s only one document in the index, I use the wildcard search “*” to examine it. The content field contains the text of all elements in the XML file. Definitely not the indexing strategy I had in mind. The MEDLINE 2015 sample data file has one top-level element <MedlineCitationSet> and 165 <MedlineCitation> elements.
What I want is to index each <MedlineCitation> element as its own document.

A Real Solution

A Fusion datasource is coupled with an Index Pipeline. Pipelines are powerful, but the documentation is incomplete — that’s why I’ve been hired. At this point, with a little (OK, a lot) of help from the Fusion developers, I was able to create an indexing pipeline for the Medline data. Soon the documentation will catch up to Fusion’s awesome capabilities. In the interim, here’s a report of what I did: what worked and what didn’t. Pipelines are composed of a sequence of stages. The conn_solr pipeline is a general-purpose document parsing pipeline composed of the following stages: an Apache Tika Parser index stage, a Field Mapper index stage, a Multi-value Resolver stage, and a Solr Indexer stage. The Tika Parser interface provides a single mechanism for extracting both metadata and data from many different sorts of documents, including HTML, XML, and XHTML. The Field Mapper index stage maps common document elements to defined fields in the default Solr schema. The Multi-value Resolver stage resolves conflicts that would otherwise arise when a document contains multiple values for a Solr field which is not multi-valued. Finally, the Solr Indexer stage sends documents to Solr for indexing. Because there’s a close connection between a datasource and the processing applied to that data, when possible, the Fusion Admin UI provides a default index pipeline ID. For a web datasource, the default index pipeline is the conn_solr pipeline, which provides field mappings for common elements found on HTML pages. In the Getting Started example above, there was a one-to-one correspondence between web pages and documents in the Solr index. For a Medline XML file, additional processing is required to transform each citation into a fielded document for Solr. The indexing pipeline required looks like this:
  • Apache Tika Parser
  • XML Transform
  • Field Mapper
  • Solr Indexer
This pipeline looks superficially similar to the conn_solr index pipeline, but both the Tika Parser and Field Mapper stages are configured quite differently, and an XML Transform stage is used to map specific elements of the Medline XML to custom fields in the Solr document. A Multi-value Resolver stage isn’t necessary because I’ve set up the mapping so that multi-valued elements are mapped to multi-valued fields. The configuration of the Solr Indexer remains the same. The new Fusion Admin UI Pipelines control panel can be used to define both index and query pipelines. It’s also possible to define pipelines through the Fusion REST API. As a noob, I’m sticking to the Admin UI. After clicking through to the Pipelines control panel, Index Pipelines tab, I create a new Index Pipeline named medline_xml, then add each stage in turn. When a new stage is added, the Pipeline panel displays the configuration choices needed for that stage.

Apache Tika Parser Configuration

To process the MEDLINE XML, I need to configure Tika so that it doesn’t try to extract the text contents but instead passes the XML to the next stage of the indexing pipeline. I’ve captured the config that I need in the following screenshot and circled, in red, the settings that I had to change from the current defaults: the control addOriginalContent is set to false, and both controls “Return parsed content as XML or HTML” and “Return original XML and HTML instead of Tika XML” are set to true. The latter two controls seem redundant, but they’re not, and you’ll need both set to true for this to work. Trust me.

XML Transform Configuration

The XML Transform stage does the mapping from nested XML elements into a fielded doc for Solr. After adding an XML Transform stage to my pipeline and naming it, I get down to specifying that mapping.
The following screenshot shows the key configurations: Because we want to index each MedlineCitation element as its own document, the Root XPath element is set to the full XPath “/MedlineCitationSet/MedlineCitation”.   XPathMappings pick out the elements that map to fields in that document.  For my document fields, I use the Solr dynamic field naming conventions.  Each MedlineCitation is assigned a unique integer identifier called a PMID (PubMed ID).   In this example, flattening the MEDLINE XML into a Solr doc is straightforward.  The XPathMappings used are:
  • “/MedlineCitationSet/MedlineCitation/Article/ArticleTitle” maps to “article-title_txt”, Multi Value false
  • “/MedlineCitationSet/MedlineCitation/Article/Abstract/AbstractText” maps to “article-abstract_txt”, Multi Value true
  • “/MedlineCitationSet/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName” maps to “mesh-heading_txt”, Multi Value true.
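What this flattening amounts to can be mimicked outside Fusion with Python's standard library (a toy sketch, not Fusion's implementation; the miniature XML and the dict layout are invented, and the XPaths are written in ElementTree's relative syntax):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a MEDLINE citation set
xml = """<MedlineCitationSet>
  <MedlineCitation><PMID>123</PMID>
    <Article><ArticleTitle>Gene study</ArticleTitle>
      <Abstract><AbstractText>About a gene.</AbstractText></Abstract>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Genes</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</MedlineCitationSet>"""

# Split on the root XPath, then apply each child XPath per citation
docs = []
for cit in ET.fromstring(xml).findall("MedlineCitation"):
    docs.append({
        "id": cit.findtext("PMID"),
        "article-title_txt": cit.findtext("Article/ArticleTitle"),
        "article-abstract_txt": [e.text for e in cit.findall("Article/Abstract/AbstractText")],
        "mesh-heading_txt": [e.text for e in cit.findall("MeshHeadingList/MeshHeading/DescriptorName")],
    })
print(docs[0]["article-title_txt"])
```

Each <MedlineCitation> becomes one fielded document, with the multi-valued elements collected into lists.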
There’s a lot more information to be extracted from the XML, but this is enough for now.

Field Mapper Configuration

It’s complicated, but because I’m using Tika and an XML Transform, I need a Field Mapper stage to remove some of the document fields created by Tika before sending the document to Solr for indexing. On the advice of my local wizard, I create mappings for the “_raw_content_”, “parsing”, “parsing-time”, “Content-Type”, and “Content-Length” fields and set the mode to “delete”.

Solr Indexer Configuration

I set the “Buffer documents” for Solr option to true. This isn’t strictly necessary; it just seems like a good thing to do.

Checking Configuration with the Pipeline Result Preview Tool

This is a lot to configure, and I couldn’t have done it without the “Pipeline Result Preview” tool, located on the right-hand side of the Pipeline UI panel. The Preview tool takes as input a list of documents coded up as JSON objects and runs them through the indexing pipeline. A document object has two members: id and fields. Here, our input has exactly one field, named “body”, whose value is a JSON-encoded string of the raw XML (or a subset thereof). The JSON input can’t split strings across lines, which means that the JSON-encoded XML is pretty much unreadable. After several tries, I get a well-formed MedlineCitationSet example consisting of three MedlineCitation elements, properly escaped for JSON and all jammed together on one line. The “view results” tab shows the result of running this input through the medline_xml indexing pipeline: I’ve scrolled down to display the input to the Solr Indexer, which consists of three documents, named doc1#0 through doc1#2.

Indexing

From the Fusion Admin UI, I return to the Collections panel for the collection named medsamp2015. As before, I create a web datasource called medsamp2015xml_v2 and point it at the MEDLINE 2015 sample data file, taking care to specify the medline_xml pipeline in the “Pipeline ID” input box.
One input was processed and the index now contains 165 documents. I have managed to index fresh content!

Search and Results

As a first test, I do a keyword search on the word “gene”. This search returns no documents. I do a keyword search on “*”. This search returns 165 documents, the very first of which contains the word “gene” in both the article title and the article abstract. Again, the problem lies with the pipeline I’m using. The default query pipeline doesn’t search the fields “article-title_txt”, “article-abstract_txt”, or “mesh-heading_txt”. Tuning the search query parameters is done with the Query Pipeline control panel. After changing the set of search fields and return fields in “medsamp2015-default” to include these fields, I run a few more test queries. Now a search on “gene” returns 11 results and returns only the relevant fields. In conclusion, I’ve managed to use the Fusion Admin UI to index and search my data. I didn’t get the Enterprise up to warp speed. Maybe next week. In the meantime, I’ve learned a lot and I hope that you have too.

*dare to learn new things, dare to be a noob
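Behind that control-panel change, the request Fusion ultimately sends to Solr carries the lists of query fields and return fields. A hedged sketch of the equivalent Solr parameters (the field names are the ones from the post; the exact request Fusion issues may differ):

```python
from urllib.parse import urlencode

params = {
    "q": "gene",
    "defType": "edismax",
    # Query fields: without these, a search on "gene" finds nothing.
    "qf": "article-title_txt article-abstract_txt mesh-heading_txt",
    # Return fields: limit results to the relevant fields.
    "fl": "article-title_txt,article-abstract_txt,mesh-heading_txt",
}
query_string = urlencode(params)
print(query_string)
```

The same qf/fl idea applies whether the parameters are set in a query pipeline stage or passed directly to Solr.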

The post Noob* Notes: Fusion First Look appeared first on Lucidworks.

Roy Tennant: One Format to Rule Them All

planet code4lib - Mon, 2014-12-08 18:46

One Format to rule them all, One Format to find them; One Format to bring them all and in the darkness bind them. – with apologies to J.R.R. Tolkien

It is now over 12 years since I wrote “MARC Must Die” in Library Journal. At the time that I wrote it, I think that I imagined a much redesigned metadata format expressed in XML. But it didn’t take long for me to realize the error of my ways. Not that we didn’t need to do something, but I was wrong to think that it required replacing. What it really required, I soon realized, was for us to not rely upon it solely. And that is a point that I feel has become lost in our discussions about our bibliographic future.

Here is how I put it in a follow-up piece in Library Hi Tech titled “A Bibliographic Infrastructure for the Twenty-First Century”:

What I am suggesting [in this article] is different in scope and structure than is implied by my “MARC Must Die” column in Library Journal, although I alluded to it in the follow-up “MARC Exit Strategies” column. What must die is not MARC and AACR2 specifically, despite their clear problems, but our exclusive reliance upon those components as the only requirements for library metadata. If for no other reason than easy migration, we must create an infrastructure that can deal with MARC (although the MARC elements may be encoded in XML rather than MARC codes) with equal facility as it deals with many other metadata standards. We must, in other words, assimilate MARC into a broader, richer, more diverse set of tools, standards, and protocols. The purpose of this article is to advance the discussion of such a possibility.

I went on to explain a number of characteristics that I felt our bibliographic infrastructure should support, as well as a fairly specific proposal for implementation. However, despite being named the best article to appear in Library Hi Tech that year, that salvo basically fell on deaf ears.

And now we are here.

“Here”, being, of course, that the Library of Congress is developing a new format.

I parse a lot of data. I even fancy myself to be a Data Geek. After all of the data processing I’ve done, I’ve come to realize that there are really only three things I care about in terms of metadata: parseability, granularity, and consistency. Pretty much everything else can be dealt with. You call your author field “creator”? Fine and dandy. You record dates as MM/DD/YYYY? I can deal with that. So long as your metadata is:

  • Parseable. Separate fields must be delimited in some way. It doesn’t need to be XML, it can be a JSON array or a pipe symbol (“|”) or even, in many cases, a tab (I process a lot of tab-delimited text files that are saved out of Excel, for example). But there must be some way of determining via software what has been kept separate.
  • Granular. If I need first names separate from family names, I want them in separate fields. Trying to break apart elements that need to be separate can be difficult, especially if the data is inconsistent. Oh, and by the way, punctuation (even ISBD punctuation) doesn’t count.
  • Consistent. When processing data, inconsistency can cause a lot of problems. Even if a mistake is made, it’s best to make it consistently so the person processing it can treat all records the same. What is difficult is having to accommodate a wide variety of edge cases.
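To make those three properties concrete, here is a tiny sketch with invented records: pipe-delimited (parseable), with family and first names in separate fields (granular), and the same field order and date format in every record (consistent):

```python
# One record per line, fields in a fixed order:
# family name | first name | title | date (MM/DD/YYYY)
records = [
    "Doe|Jane|An Example Title|03/15/2001",
    "Roe|Richard|Another Title|11/02/1998",
]

parsed = []
for record in records:
    family, first, title, date = record.split("|")  # parseable and granular
    month, day, year = date.split("/")              # MM/DD/YYYY? Can deal.
    parsed.append({"family": family, "first": first,
                   "title": title, "year": year})

print(parsed[0])
```

Because every record follows the same layout, one loop handles them all; it is the edge cases and inconsistencies that would make this code balloon.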

That’s really all I care about, since it is very unlikely that every library will create their own format. No, we are herd animals, so we will gather around a very small number of formats, and perhaps only one. After all, that is all we have ever known.


Photo by David Fulmer, Creative Commons License Attribution 2.0 Generic.

Library of Congress: The Signal: Preserving Carnegie Hall’s Born-Digital Assets: An NDSR Project Update

planet code4lib - Mon, 2014-12-08 14:02

The following is a guest post by Shira Peltzman, National Digital Stewardship Resident at Carnegie Hall in New York City.

The author inside the Isaac Stern Auditorium. Photo by Gino Francesconi.

As the National Digital Stewardship Resident placed at Carnegie Hall, I have been tasked with creating and implementing policies, procedures and best practices for the preservation of our born-digital assets. Carnegie Hall produces a staggering quantity of born-digital content every year: live concert webcasts; videos of professional development and educator workshops; artist interviews; promotional videos for festivals and performances taking place at the Hall; workshops and masterclasses; and all print media produced for the Hall, including infographics, programs, and annual reports, to cite just a few examples. A sampling of this material can be found on Carnegie Hall’s blog, which averages 400 posts per year.

The first phase of my project–which I’m on track to wrap up by mid-December–has been largely focused on developing a detailed understanding of how the organization’s born-digital assets are created, stored and used. To do this, I have spent the past couple months conducting interviews with staff across a wide range of departments at Carnegie Hall that use or produce digital content.

The interview process is fundamentally important to my project because it lays the groundwork for all of my NDSR project deliverables. These include: establishing selection and acquisition criteria for the preservation of born-digital assets; developing and streamlining production workflows; and writing a digital preservation and sustainability policy document. Beyond helping me evaluate the current workflows and digital usage within the organization, I’ve found that conducting these interviews has also helped me settle into my new work environment.  Having the opportunity to talk to so many different people within the organization has allowed me to meet many more of my coworkers than I might otherwise cross paths with in the course of a normal work week, and it’s also helped me to better understand how each department fits into the ‘bigger picture’ at Carnegie Hall.

Each interview takes place in two halves: during the first half of the interview I ask questions that are designed to help me understand precisely how digital assets are created and used within each department. This usually involves asking the people I’m interviewing to walk me through the production process of the digital content they are responsible for creating. I do this so that I can take note of things like the hardware and software they use; whether or not a final version is likely to have many associated versions or drafts; and how likely it is that the audio, video or print media they create will eventually be re-purposed or re-used in the future. This information helps paint a detailed picture of how every department operates. It also allows me to recognize what assets matter most to each department, which in turn will help me establish selection and acquisition criteria further down the road.

During the interview I make it a point to not only understand the current workflows involved in creating and using digital assets, but also to gather information about how these workflows might be improved. This is important because as part of the Digital Archives Project, the Carnegie Hall Archives is in the process of configuring and implementing a new Digital Asset Management System, and the feedback I receive during the interview process will help us streamline the process of ingesting material into the DAMS.

The second half of the interview typically takes place after I’ve had a chance to write up a summary of the initial discussion, and is much more ‘hands-on.’ An important aspect of my project is to create a detailed inventory of Carnegie Hall’s born-digital assets, and so the purpose of this half of the interview is to gather all extant hard drives, thumb drives, optical media, etc. that contain digital assets and review their contents. This involves reviewing the contents of digital storage media from both internal departments and, occasionally, from external contractors of Carnegie Hall as well. My goal is to uncover unknown and overlooked digital assets that should ultimately end up in the DAMS.

A screenshot of Carnegie Hall’s website featuring content created for the UBUNTU: Music and Arts of South Africa festival, which took place from Oct. 8 to Nov. 5 and celebrated twenty years of freedom in South Africa. The videos showcased on this website are exemplary of the born-digital content that Carnegie Hall creates.

Right now, the largest task I’m faced with completing before the end of the year is the inventory of born-digital assets. This will be a complex process, because not only will I have to account for the assets stored on extant media that I track down throughout the office, but I will also need to create a comprehensive inventory of the assets stored within every departmental file directory. So far, DROID (Digital Record Object Identification), the UK National Archives’ file format profiling tool, has been helpful for this task.
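DROID does signature-based format identification, which a few lines of script cannot replace, but the basic shape of a directory inventory (file counts by extension plus total size) can be sketched like this, purely for illustration:

```python
import os
from collections import Counter

def inventory(root):
    """Walk a directory tree; return per-extension counts and total bytes."""
    counts, total_bytes = Counter(), 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return counts, total_bytes

counts, total = inventory(".")
for ext, n in counts.most_common():
    print(f"{ext}\t{n} files")
print(f"total: {total} bytes")
```

A real inventory would add checksums and DROID's PRONOM-based format identifiers, but even this crude tally is a useful first pass over a departmental file share.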

During my downtime between interviews I work on any number of smaller projects that are also part of my NDSR project deliverables. These include creating a document that provides Carnegie Hall with recommendations on how to improve file naming, writing a disaster preparedness plan and revising the Archives’ mission statement so that it includes a remit specifically for digital preservation.

So far one of my biggest takeaways from the project has been the importance of engaging both media creators and Carnegie Hall staff at large in the preservation process. Having their input has been essential because not only do they have a much more intimate knowledge of the different ways that material within the organization is created, used and stored, but they also collectively possess a substantial institutional knowledge that has helped guide my project throughout. Another benefit of this collaboration has been that it has bolstered buy-in for the DAMS, and has helped create a greater level of awareness about preservation among staff.

The first several months of my project feel like they’ve flown by. There are days when I reflect on what I’ve accomplished in just under three months’ time and feel proud of my progress, and then there are other days when I’m humbled by how much there is still left to do. But overall, the project has been one of the greatest learning experiences I could have hoped for–and there’s still six months left to go.

ACRL TechConnect: This is How I Work (Lauren Magnuson)

planet code4lib - Mon, 2014-12-08 14:00

Editor’s Note: This post is part of ACRL TechConnect’s series by our regular and guest authors about The Setup of our work.

Lauren Magnuson, @lpmagnuson

Location: Los Angeles, CA

Current Gig:

Systems & Emerging Technologies Librarian, California State University Northridge (full-time)

Development Coordinator, Private Academic Library Network of Indiana (PALNI) Consortium (part-time, ~10/hrs week)

Current Mobile Device: iPhone 4.  I recently had a chance to upgrade from an old, slightly broken iPhone 4, so I got… another iPhone 4.  I pretty much only use my phone for email and texting (and, rarely, phone calls), so even an old iPhone is kind of overkill for me.

Current Computer:

  • Work:  work-supplied HP Z200 Desktop, Windows 7, dual monitors
  • Home: (for my part-time gig): Macbook Air 11”

Current Tablet: iPad 2, work-issued, never used

One word that best describes how you work: relentlessly

What apps/software/tools can’t you live without?

  • Klok – This is time-tracking software that allows you to ‘clock in’ when working on a project.  I use it primarily to track time spent working my part-time gig.  My part-time gig is hourly, so I need to track all the time I spend working that job.  Because I love the work I do for that job, I also need to make sure I work enough hours at my full-time job.  Klok allows me to track hours for both and generate Excel timesheets for billing.  I use the free version, but the pro version looks pretty cool as well.
  • Trello – I use this for the same reasons everyone else does – it’s wonderfully simple but does exactly what I need to do.  People often drop by my office to describe a problem to me, and unless I make a Trello card for it, the details of what needs to be done can get lost.  I also publish my CSUN Trello board publicly and link it from my email signature.
  • Google Calendar – I stopped using Outlook for my primary job and throw everything into Google Calendar now.  I also dig Google Calendar’s new feature that integrates with Gmail so that hotel reservations and flights are automatically added to your Google Calendar.
  • MAMP/XAMPP – I used to only do development work on my Macbook Air with MAMP and Terminal, which meant I carted it around everywhere – resulting in a lot of wear and tear.  I’ve stopped doing that and invested some time in setting up a development environment with XAMPP and code libraries on my Windows desktop.  Obviously I then push everything to remote git repositories so that I can pull code from either machine to work on it, whether I’m at home or at work.
  • Git (especially Git Shell, which comes with Git for Windows) – I was initially intimidated about learning git – it definitely takes some trial and error to get used to the commands and how fetching/pulling/forking/merging all work together.  But I’m really glad I took the time to get comfortable with it.  I use both GitHub (for code that actually works and is shared publicly) and BitBucket (for hacky stuff that doesn’t work yet and needs to be in a private repo).
  • Oxygen XML Editor – I don’t always work with XML/XSLT, but when I have to, Oxygen makes it (almost) enjoyable.
  • YouMail – This is a mobile app that, in the free version, sends you an email every time you have a voicemail or missed call on your phone.  At work, my phone is usually buried in the nether-regions of my bag, and I usually keep it on silent, so I probably won’t be answering my mobile at work.  YouMail allows me to not worry where my phone is or if I’m missing any calls.  (There is a Pro version that transcribes your voicemail that I do not pay for, but it seems like it might be cool if you need that kind of thing.)
  • Infinite Storm – It rarely rains in southern California.  Sometimes you just need some weather to get through the day.  This mobile app makes rain and thunder sounds.


  • Post It notes (though I’m trying to break this habit)
  • Basic Logitech headset for webinars / Google hangouts.  I definitely welcome suggestions for a headset that is more comfortable – the one I have weirdly crushes my ears.
  • A white board I use to track information literacy sessions that I teach

What’s your workspace like?

I’m on the fourth floor of the Oviatt Library at CSUN, which is a pretty awesome building.  Fun fact: the library building was the shooting location for Starfleet Academy scenes in J.J. Abrams’ 2009 Star Trek movie (but I guess it got destroyed by Romulans, because they have a different Academy in Into Darkness):

My office has one of the very few windows available in the building, which I’m ambivalent about.  I truly prefer working in a cave-like environment with only the warm glow of my computer screen illuminating the space, but I also do enjoy the sunshine.

I have nothing on my walls and keep few personal effects in my office – I try to keep things as minimal as possible.  One thing I do have though is my TARDIS fridge, which I keep well-stocked with caffeinated beverages (yes, it does make the whoosh-whoosh sound, and I think it is actually bigger on the inside).

I am a fan of productivity desktop wallpapers – I’m using these right now, which help me peripherally see how much time has elapsed when I’m really in the zone.

When I work from home, I mostly work from my living room couch.

What’s your best time-saving trick?  When I find something I don’t know how to do (like when I recently had to wrap my head around Fedora Commons content models, or learning Ruby on Rails for Hydra), I assign myself some ‘homework’ to read about it later rather than trying to learn the new thing during working hours.  This helps me avoid getting lost in a black hole of Stack Overflow for several hours a day.

What’s your favorite to-do list manager?  Trello

Besides your phone and computer, what gadget can’t you live without?

Mr. Coffee programmable coffee maker

What everyday thing are you better at than everyone else? Troubleshooting

What are you currently reading?  I listen to audiobooks I download from LAPL (Thanks, LAPL!), and I particularly like British mystery series.  To be honest, I kind of tune them out when I listen to them at work, but they keep the part of my brain that likes to be distracted occupied.

In print, I’m currently reading:

What do you listen to while at work?  Mostly EDM now, which is pretty motivating and helps me zone in on whatever I’m working on.  My favorite Spotify station is mostly Deadmau5.

Are you more of an introvert or an extrovert? Introvert

What’s your sleep routine like?  I love sleep.  It is my hobby.  Usually I sleep from around 11 PM to 7 AM; but my ideal would be sleeping between like 9 PM and 9 AM.  Obviously that would be impractical.

Fill in the blank: I’d love to see _________ answer these same questions.  David Walker @ the CSU Chancellor’s Office

What’s the best advice you’ve ever received? 

Applies equally to using the Force and programming.

Eric Hellman: Stop Making Web Surveillance Bugs by Mistake!

planet code4lib - Mon, 2014-12-08 03:59
Since I've been writing about library websites that leak privacy, I figured it would be a good idea to do an audit of to make sure it wasn't leaking privacy in ways I wasn't aware of. I knew that some pages leak some privacy via referer headers to Google, to Twitter, and to Facebook, but we force HTTPS and make sure that user accounts can be pseudonyms. We try not to use any services that push ids for advertising networks. (Facebook "Like" button, I'm looking at you!)
I've worried about using static assets loaded from third party sites. For example, we load jQuery from (it's likely to be cached, and should load faster) and Font Awesome from (ditto). I've verified that these services don't set any cookies and allow caching, which makes it unlikely that they could be used for surveillance of users.

It turned out that my worst privacy leakage was to Creative Commons! I'd been using the button images for the various licenses served from I was surprised to see that id cookies were being sent in the request for these images.
In theory, the folks at Creative Commons could track the usage for any CC-licensed resource that loaded button images from Creative Commons! And it could have been worse. If I had used the HTTP version of the images, anyone in the network between me and Creative Commons would be able to track what I was reading!

Now, to be clear, Creative Commons is NOT tracking anyone. The reason my browser is sending id cookies along with button image requests is that the Creative Commons website uses Google Analytics, and Google Analytics sets a domain-wide id cookie. Google Analytics doesn't see any of this traffic – it doesn't have access to server logs. But without anyone intending it, the combination of Creative Commons, Google Analytics, and websites like mine that want to promote the use of Creative Commons has conspired to build a network of web surveillance bugs BY MISTAKE.

When I inquired about this to Creative Commons, I found out they were way ahead of the issue. They've put in redirects to the HTTPS versions of their button images. This doesn't plug the privacy leakage, but it discourages people from using the privacy-spewing HTTP versions. In addition, they'd already started the process of moving static assets like button images to a special-purpose domain. The use of this domain,, will ensure that id cookies aren't sent and nobody could use them for surveillance.
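The mechanics come down to cookie domain matching: a cookie set for a whole domain is sent with every request to any host under that domain, static images included, while a separate asset-only domain never matches. Python's standard library exposes the RFC 2965 matching rule, which makes this easy to demonstrate (example.org and example-cdn.net are hypothetical stand-ins for the actual domains):

```python
from http.cookiejar import domain_match

# An analytics id cookie set with Domain=.example.org rides along with
# every request to any host under example.org, static images included.
assert domain_match("i.example.org", ".example.org")

# The same buttons hosted on a separate, cookie-free domain: the
# site-wide cookie never matches, so no id accompanies image requests.
assert not domain_match("buttons.example-cdn.net", ".example.org")

print("domain-wide cookies follow the domain, not the asset")
```

This is exactly why moving static assets to a special-purpose domain stops the accidental leakage: the browser has no cookies scoped to that domain to send.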

If you care about user privacy and you have a website, here's what you should do:
  1. Avoid loading images and other assets from 3rd party sites. Consider self-hosting these.
  2. When you use 3rd party hosted assets, use HTTPS references only!
  3. Avoid loading static assets from domains that use Google Analytics and set id domain cookies.
For Creative Commons license buttons, use the buttons from. If you use the Creative Commons license chooser, replace "" in the code it makes for you with "". This will help the web respect user privacy. The buttons will also load faster, because the "" requests will get redirected there anyway.

Library of Congress: The Signal: Personal Digital Archiving 2015 in NYC — “Call for Papers” Deadline Approaching

planet code4lib - Fri, 2014-12-05 14:17

New-york-city by irot2 on

The Personal Digital Archiving Conference 2015 will take place in New York City for the first time. The conference will be hosted by our NDIIPP and NDSA partners at New York University’s Moving Image Archiving and Preservation program April 24-26, 2015. Presentation submissions for Personal Digital Archiving are due Monday, December 8th, 2014 by 11:59 pm EST.

This year’s conference will differ slightly from the Personal Digital Archiving Conferences of previous years (see listings below). There will be two full days of presentations focused on a set of themes; a third day will be set aside for workshops covering useful digital tools.

The conference program committee seeks proposals for:
– ten- to twenty-minute presentations
– five-minute lightning talks
– posters (including demos)
– workshops, particularly those emphasizing software tools (to take place on the third day).

The program committee will try to cluster shorter presentations into panels and encourage discussion among the panelists. For the day of workshops, they are seeking hands-on learning focused on useful digital tools. They anticipate four half-day workshops, with two in the morning and two in the afternoon.

Personal Digital Archiving 2015 invites proposals on the full range of topics relevant to personal digital archiving, with particular interest in papers and presentations around community groups, activist groups and their use of digital media, as well as personal/familial collections and home-brewed digital solutions.

Presentations might address challenges, such as:
– Ubiquitous recording devices, such as cell phones, for videos and photos
– Action cameras (such as GoPro)
– Cloud storage
– Social media (Vine, Instagram, Twitter, Facebook, blogs etc.)
– Email
– Open-source, low-cost digital tools
– Tracking and sharing personal health data
– Community outreach and economic models from an organizational perspective
– Security and issues of access, encryption, reliability and safety
– Archival and library issues associated with collection, appraisal, ingest and description
– Migration of content from obsolete or outdated storage media.

Submissions should include:
– The title of the presentation;
– For 10- to 20-minute presentations, a 300-word abstract;
– For lightning talks and posters, a 150- to 300-word abstract;
– For workshop proposals, a 150- to 300-word curriculum overview, including the approximate number of hours needed, what tools will be taught, and computing infrastructure requirements;
– For panel proposals, a 150- to 300-word overview of the topic and suggestions for additional presenters;
– A brief biographical sketch or CV (no more than 2 pages).

Submit your conference proposals to

For more information on previous PDA conferences, please visit:

Registration, program, housing and other information will be posted in early 2015. For further information, email personaldigitalarchiving [at]

Personal Digital Archiving 2015 is co-sponsored by NYU’s Moving Image Archiving & Preservation program, the NYU Libraries and the Coalition for Networked Information.

LITA: Building a Small Web Portfolio

planet code4lib - Fri, 2014-12-05 14:00

Since my undergraduate commencement in May, I have been itching to create my own personal portfolio website. I wanted to curate my own space devoted to my curriculum vitae, projects, and thoughts moving through my education and career. This was for my own organization, but also for colleagues to view my work in an environment I envisioned.

I began looking at sites belonging to mentors, students, and other professionals, noticing that each site fit the person and their accomplishments. Then, I began to wonder, which design fits me? Which platform fits me? If I’m terrible at any sort of creative design, how will I design my own website?

I found clarity when a fellow library student shared some advice: it is never right the first time. Get it functional, get that first iteration out of the way, and improve from there.

Choosing a Platform

With web design becoming increasingly accessible, many services have popped up offering to help users create a website. By no means is this list exhaustive, but here are a few that I discovered and considered, ranging from least coding required to most:

  • Wix – Wix is a free platform that offers customizable website templates built on HTML5. Once you choose a template, you can click the boxes to add text and click/drag text boxes and images around to change the layout. This platform was useful as a first step in seeing my content laid out on a webpage without having to write code.
  • WordPress and Squarespace – These two platforms triple as portfolios, blogs, and content management systems. Both allow customizable templates, hosting space, and a unique domain name. Since they are content management systems, you must learn to use their interface, and coding may be required if you want to customize beyond the readily customizable features.
  • Jekyll (using Git) and Bootstrap – Jekyll and Bootstrap are frameworks for developing your own HTML- and CSS-based websites. Instead of placing your content into a text box, you actually dive into the HTML files to write your content. These platforms come with templates to get you started, but do require outside coding and system knowledge, such as Git and Ruby for Jekyll. For two great examples, visit the pages of Ryan Randall, an ILS graduate assistant and all-around culture scholar at IU, and Cecily Walker, a self-titled rabble-rousing librarian residing in Vancouver, BC.
  • Adobe Dreamweaver and Adobe Muse (using HTML5) – These final two require the most HTML coding. You can find some starter templates, but it is up to you to write and design the majority of the content and website. These platforms offer the highest range of customization, but also the steepest learning curve. For a dynamic example of a website built with Adobe Dreamweaver, check out Samantha Sickels‘ media and design portfolio page.

Since I have a programming background, I decided that using Wix felt like neglecting my tech skills. Since I have limited experience with HTML5 and CSS, I wasn’t ready to take on an entire portfolio website from scratch. Therefore, I went with WordPress because I could choose a designed template, but customize as needed.

Using WordPress

I used my Information Architecture skills and decided on the exact content I wanted to feature. Since then, I have spent countless hours arguing with WordPress, consistently asking my computer screen, “What do you want from me?”

Image courtesy of

WordPress turned out to be less intuitive than I imagined (I shrugged off tutorials thinking it couldn’t be that difficult). It took a few tries to understand the interface with pages and menus. I also didn’t realize that different themes come with different customizable features and that they don’t include different page layouts you can choose from a simple drop-down menu. I typically found a perfect combination of clean, minimal, and functional, but with one unsatisfactory aspect. So close.

Image courtesy of

Finally, I chose a theme that worked! With some minor setbacks with text fonts, I discovered how to use plug-ins. My social media buttons and accordion-style Projects page were the result of Google, a willingness to explore, and my conceptual coding knowledge. Brianna Marshall helped me figure out how to set a menu item as a link. And I breathed a sigh of relief.


If you are creating a personal portfolio or even a quick WordPress site for a library project or service, I have three tips for you:

  1. Choose a random theme. Insert your content. Then, decide on a more permanent theme from there. Sometimes, seeing your own name in Comic Sans will put that theme on the “absolutely not” list.
  2. Read this article shared with me by the same student who gave me the advice above. It is geared toward IU students, with widely applicable ideas.
  3. Persist. You have a vision of your future site’s look and function, keep learning, Googling and exploring until you find out how to bring it to life. I can’t wait to play with more of the coding-heavy platforms in the future!

Find my portfolio website here, and please comment if you have any questions about web hosting and domain names, important aspects of website creation I didn’t touch on.

Then, respond! I would love to hear your thoughts about using WordPress or other platforms mentioned for different functions! What were your struggles and triumphs? Do you prefer a platform I didn’t mention?

OCLC Dev Network: Code and Camaraderie

planet code4lib - Fri, 2014-12-05 02:00

Show-and-tell time for the 11 developers who attended Developer House this week.  Look for a video of the presentations.  These are worth watching!  The teams accomplished a lot of great work.

OCLC Dev Network: Developer House Gets Inside View of Hot Projects at OCLC

planet code4lib - Fri, 2014-12-05 02:00

We spent most of today working on our projects, so we thought we'd share more about the presentations we had yesterday, including OCLC's Linked Data strategy and a series of "lightning" talks by OCLC staff responsible for some of our current internal projects.


Subscribe to code4lib aggregator