Feed aggregator

Ariadne Magazine: Research data management:  A case study

planet code4lib - Mon, 2015-10-12 08:03

Gary Brewerton explains how Loughborough University have tackled the requirements from funding bodies for research data to be made available by partnering with not one, but two cloud service providers.

In April 2014 Loughborough University launched an innovative cloud-based platform [1] to deliver long-term archiving and discovery for its research data.

Gary Brewerton


DuraSpace News: CALL for Proposals: Code4Lib 2016

planet code4lib - Mon, 2015-10-12 00:00

From the Code4Lib 2016 Conference Program Committee

Code4Lib 2016 is a loosely-structured conference that provides people working at the intersection of libraries/archives/museums/cultural heritage and technology with a chance to share ideas, be inspired, and forge collaborations. For more information about the Code4Lib community, please visit

Tara Robertson: discovering default settings

planet code4lib - Sun, 2015-10-11 02:26
Old Light Switches by Paul Cross

I had a fantastic conversation with Dana Ayotte about some of the work she does as an interaction designer at OCAD’s Inclusive Research Design Centre. One of the projects she worked on involved working with people to figure out their settings preferences on a computer and then codifying or summarizing those preferences so that they are portable. It struck me that this is a little thing that can be really important in terms of access, but also in terms of letting people customize things to suit them. It allows them to decide for themselves what works best. I love the idea of people sharing their preference sets, because sometimes you don’t know that there are other options than the default you’re presented with.

This reminded me of a couple of other conversations and experiences over the past year.

At CSUN one of the best presentations I attended was by Jamie Knight, a senior designer at the BBC who is slightly autistic. His talk was about cognitive accessibility. At the beginning he described how the sensory environment has a big impact on him. He said that if noises were too distracting he might put on his ear defenders. He also mentioned that lighting can be really draining for him. A couple of minutes later one of his coworkers interrupted from the audience and suggested that he could turn off some or all of the horrible fluorescent conference centre lights. Jamie paused and said “I had never considered that. I didn’t even know that was an option to turn off the lights in this room. Yes please.” This was one of my favourite moments at CSUN as it clearly demonstrated empathy.

I sometimes join the folks from the Student Services department for lunch. They are smart, kind and hilarious and I really enjoy hanging out with them. One of them was describing how she had been to the dentist that morning and how she really hates the fluoride. She says she hates the taste and that it makes her feel nauseated all day long. When she mentioned this to her dad, he said “I just tell them that I don’t want fluoride.” This was a huge realization for her. She didn’t know that you could say no to the fluoride part of a dental visit. Honestly, I didn’t either. For her, saying no to this made going for her yearly cleaning so much easier. When I went to the dentist last month I also said no. My pulse jumped a bit, as it was a new thing for me to opt out of part of the dental cleaning. The hygienist had no reaction, so while I was a bit nervous about saying no, it wasn’t a big deal at all.

I’ve also been thinking more about consent. Some of the people who are dear to me are survivors of sexual abuse. The stories that they’ve shared about when they learned that they have agency around their own bodies are revolutionary and pivotal moments. One person said “I didn’t know that saying no was an option because when I was younger it didn’t matter what I said.” Realizing that I had some difficulty opting out of a fluoride treatment, I have a better understanding of, and a bit more compassion for, folks who are new to setting boundaries around their own bodies.

All of these stories seem related in my mind, but I’m not quite able to explain how. I think all of them relate to agency, boundary setting, and realizing that something is just a default setting and that there are other options to choose.

SearchHub: LinkedIn’s Galene Search Architecture Built on Apache Lucene

planet code4lib - Fri, 2015-10-09 17:55

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting LinkedIn engineers Diego Buthay and Sriram Sankar’s session on how they run their search architecture for the massive social network.

LinkedIn’s corpus is a richly structured professional graph comprised of 300M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Members perform billions of searches, and each of those searches is highly personalized based on the searcher’s identity and relationships with other professional entities in LinkedIn’s economic graph. And all this data is in constant flux as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of whom are outside the United States). As a result, we’ve built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique systems challenges we’ve faced as we deliver highly personalized search over semi-structured data at massive scale.

Diego (“Mono”) Buthay is a staff engineer at LinkedIn, where he works on the back-end infrastructure for all of LinkedIn’s search products. Before that, he built the search-as-a-service platform at IndexTank, which LinkedIn acquired in 2011. He has BS and MS degrees in computer software engineering from the University of Buenos Aires. Sriram Sankar is a principal staff engineer at LinkedIn, where he leads the development of its next-generation search architecture. Before that, he led Facebook’s search quality efforts for Graph Search, and was a key contributor to Unicorn. He previously worked at Google on search quality and ads infrastructure. He is also the author of JavaCC, a leading parser generator for Java. Sriram has a PhD from Stanford University and a BS from IIT Kanpur.

Galene – LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram Sankar, LinkedIn from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


David Rosenthal: The Cavalry Shows Up in the IoT War Zone

planet code4lib - Fri, 2015-10-09 15:47
Back in May I posted Time For Another IoT Rant. Since then I've added 28 comments about the developments over the last 132 days, or more than one new disaster every 5 days. Those are just the ones I noticed. So it's time for another dispatch from the front lines of the IoT war zone on which I can hang reports of the disasters to come.  Below the fold, I cover yesterday's happenings on two sectors of the front line.

Let's start with the obvious fact that good wars have two sides, the guys with the black hats (Boo!) and the guys with the white hats (Yay!). So far, the white hats have been pretty much missing in action. But now, riding over the hill in the home router sector of the front lines, comes the white-hat cavalry!

Is the opposite of malware benware? If so, Symantec has found "highly virulent" benware called "Ifwatch" infecting "more than 10,000 Linux-based routers, mostly in China and Brazil":
Ifwatch software is a mysterious piece of “malware” that infects routers through Telnet ports, which are often weakly secured with default security credentials that could be open to malicious attack. Instead, Ifwatch takes that opportunity to set up shop, close the door behind it, and then prompts users to change their Telnet passwords, if they are actually going to use the port.

According to Symantec’s research, it also has code dedicated to removing software that has entered the device with less altruistic intentions. Ifwatch finds and removes “well-known families of malware targeting embedded devices.” How awesome is it that the titanic struggle between good and evil is taking place inside your home router, so you have a ringside seat?

Meanwhile, in the enterprise router sector, the black hats advanced. Dan Goodin at Ars Technica reports that a backdoor infecting Cisco VPNs steals customers’ network passwords:
Attackers are infecting a widely used virtual private network product sold by Cisco Systems to install backdoors that collect user names and passwords used to log in to corporate networks, security researchers said. ... The attacks appear to be carried out by multiple parties using at least two separate entry points. Once the backdoor is in place, it may operate unnoticed for months as it collects credentials that employees enter as they log in to company networks.

That's the news from the war zone yesterday. Stay tuned for more in the comments.

DPLA: Archival Description Working Group Members

planet code4lib - Fri, 2015-10-09 15:00

We are pleased to announce the membership of the Archival Description Working Group:

  • Jodi Allison-Bunnel, OrbisCascade Alliance
  • Mark Custer, Yale University
  • Bradley Daigle, University of Virginia
  • Jackie Dean, University of North Carolina at Chapel Hill
  • Max Eckard, University of Michigan
  • Ben Goldman, The Pennsylvania State University
  • Kris Keisling, University of Minnesota
  • Leigh Grinstead, LYRASIS
  • Adrian Turner, California Digital Library

In addition, we have appointed an Advisory Board that will help the workgroup by reviewing drafts before public release and providing feedback on workplans and tools. Advisory Board members include:

  • Shawn Averkamp, New York Public Library
  • Erin Hawkins, World Digital Library, Library of Congress
  • Sheila McAlister, Digital Library of Georgia
  • Sandra McIntyre, Mountain West Digital Library
  • Anne Van Camp, Smithsonian Institution

We were excited that so many people volunteered to help us with the group, and we regret that we can’t include everyone. We will share the group’s progress through social media, and those who filled out the volunteer form will be asked to help review and comment on the draft of the whitepaper and any other deliverables once the working group and advisory board have developed a first draft.

Library of Congress: The Signal: DPOE Plants Seed for Statewide Digital Preservation Effort in California

planet code4lib - Fri, 2015-10-09 13:07

The following is a guest post by Barrie Howard, IT project manager at the Library of Congress.

The Digital Preservation Outreach and Education (DPOE) program is pleased to announce the successful completion of another train-the-trainer workshop in 2015. The most recent workshop took place in Sacramento, California, from September 22nd–25th. This domestic training event follows closely behind two workshops held in Australia in late spring.

The Library of Congress partnered with the State Library of California to host the three-and-a-half day workshop to increase the knowledge and skills of working professionals, who are charged with providing long-term access to digital content. Planning and events management support were provided by the California Preservation Program (CPP), which provides consultation, information resources, and preservation services to archives, historical societies, libraries, and museums across the state.

Trainers and trainees at the DPOE workshop in Sacramento. Photo by Darla Gunning.

This cohort of trainers was highly energized at the completion of the workshop, and left the event buzzing with plans to band together to establish a statewide effort to guarantee long-term, enduring access to California’s collective cultural heritage captured in digital formats. The workshop’s train-the-trainer model inspired the participants to think about how they could work across jurisdictional and organizational boundaries to meet the needs of all California cultural heritage institutions, especially small organizations with very few staff.

CPP Steering Committee Chair Barclay Ogden set the stage by stating, “I’m looking forward to the DPOE workshop to position a cohort of California librarians, archivists, and history museum curators to educate and advocate for statewide digital preservation services. California’s smaller memory institutions need help with digital preservation.”

Left to right: Jacob Nadal, DPOE Anchor Instructor; George Coulbourne, DPOE Program Director; Stacey Wiens, DPOE Topical Trainer. Photo by Darla Gunning

DPOE Program Director George Coulbourne led the workshop. Veteran anchor instructors Mary Molinaro (University of Kentucky Libraries) and Jacob Nadal (The Research Collections and Preservation Consortium) returned, and I joined the instructor team for the first time. We provided presentations throughout the week and facilitated hands-on activities.

The enthusiasm and vision captured at the workshop are a legacy, rather than merely an outcome, that participants carry with them as they join a vibrant network of practitioners in the broader digital preservation community. DPOE continues to nurture the network by providing an email distribution list so practitioners can share information about digital preservation best practices, services, and tools, and to surface stories about their experiences in advancing digital preservation. DPOE also maintains a training calendar as a public service to help working professionals discover professional development opportunities in the practice of digital preservation. The calendar is updated regularly, and includes training events hosted by DPOE trainers, as well as others.

SearchHub: Know When To Hold ’em … Know When To Run – Time Is Running Out To Stump The Chump

planet code4lib - Fri, 2015-10-09 04:37

Are you a Gambler? Even if you aren’t, what are you waiting for?

There’s no ante and no buy-in needed to “go all in” for a nice pot of prize money in this year’s Stump The Chump contest at Lucene/Solr Revolution 2015 in Austin, Texas. But time is running out! There are only a few days left for you to submit your most challenging questions.

Even if you can’t make it to Austin to attend the conference, you can still participate. Check out the session information page for details on how to submit your questions.

To keep up with all the “Chump” related info, you can subscribe to this blog (or just the “Chump” tag).


Ed Summers: White Dudes Giving Speeches

planet code4lib - Fri, 2015-10-09 04:00

Thank you for inviting me here today to be with you all at MARAC. I’ll admit that I’m more than a bit nervous to be up here. I normally apologize for being a software developer right about now. But I’m not going to do that today…although I guess I just did. I’m not paying software developers any compliments by using them as a scapegoat for my public presentation skills. And the truth is that I’ve seen plenty of software developers give moving and inspiring talks.

So the reason why I’m a bit more nervous today than usual is because you are archivists. I don’t need to #askanarchivist to know that you think differently about things, in subtle and profound ways. To paraphrase Orwell: You work in the present to shape what we know about the past, in order to help create our future. You are a bunch of time travelers. How do you decide what to hold on to, and what to let go of? How do you care for this material? How do you let people know about it? You do this for organizations, communities and collectively for entire cultures. I want to pause a moment to thank you for your work. You should applaud yourselves. Seriously, you deserve it. Thunderous applause.

My Twitter profile used to say I was a “hacker for libraries”. I changed it a few years ago to “pragmatist, archivist and humanist”. But the reality is that these are aspirational…these are the things I want to be. I have major imposter syndrome about claiming to be an archivist…and that’s why I’m nervous.

Can you believe that I went through a Masters in Library & Information Science program without learning a lick about archival theory? Maybe I just picked the wrong classes, but this was truly a missed opportunity, both for me and the school. After graduating I spent some time working with metadata in libraries, then as a software developer at a startup, then at a publisher, and then in government. It was in this last role helping bits move around at the Library of Congress (yes, some of the bits did still move, kinda, sorta) that I realized how much I had missed about the theory of archives in school.

I found that the literature of archives and archival theory spoke directly to what I was doing as a software developer in the area of digital preservation. With guidance from friends and colleagues I read about archives from people like Hugh Taylor, Helen Samuels, “the Terrys” (Cook and Eastwood), Verne Harris, Ernst Posner, Heather MacNeil, Sue McKemmish, Randall Jimerson, Tom Nesmith and more. I started following some of you on Twitter to read the tea leaves of the profession. I became a member of SAA. I put some of the ideas into practice in my work. I felt like I was getting on a well traveled but not widely known road. I guess it was more like a path among many paths. It definitely wasn’t an information superhighway. I have a lot more still to learn.

So why am I up here talking to you? This isn’t about me right? It’s about we. So I would like to talk about this thing we do, namely create things like this:

Don’t worry I’m not really going to be talking about the creation of finding aids. I think that they are something we all roughly understand. We use them to manage physical and intellectual access to our collections right? Finding aids are also used by researchers to discover what collections we have. Hey, it happens. Instead what I would like to focus on in this talk is the nature of this particular collection. What are the records being described here?

Yes, they are tweets that the Cuban Heritage Collection at the University of Miami collected after the announcement by President Obama on December 17, 2014 that the United States was going to begin normalizing relations with Cuba. You can see information about what format the data is in, when the data was collected, how it was collected, how much data there is, and the rights associated with the data.

Why would you want to do this? What can 25 million tweets tell us about the public reaction to Obama’s announcement? What will they tell us in 10, 25 or 50 years? Natalie Baur (the archivist who organized this collection) is thinking they could tell us a lot, and I think she is right. What I like about Natalie’s work is that she has managed to fold this data collection work in with the traditional work of the archive. I know there were some technical hoops to jump through regarding data collection, but the social engineering required to get people working together as a team so that data collection leads to processing and then to product in a timely manner is what I thought was truly impressive. Natalie got in touch with Bergis Jules and me to help with some of the technical pieces since she knew that we had done some similar work in this area before. I thought I would tell you about how that work came to be. But if you take nothing else from my talk today, take this example of Natalie’s work.

About a year ago I was at SAA in Washington, DC on a panel that Hillel Arnold set up to talk about Agency, Ethics and Information. Here’s a quote from the panel description:

From the Internet activism of Aaron Swartz to Wikileaks’ release of confidential U.S. diplomatic cables, numerous events in recent years have challenged the scope and implications of privacy, confidentiality, and access for archives and archivists. With them comes the opportunity, and perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems. How are we to engage with these questions as archivists and citizens, and what are their implications for user access?

My contribution to the panel was to talk about the Facebook Emotional Contagion study, and to try to get people thinking about Facebook as an archive. In the question and answer period someone (I wish I could remember his name) asked what archivists were doing to collect what was happening in social media and on the Web regarding the protests in Ferguson. The panel was on August 14th, just 5 days after Mike Brown was killed by police officer Darren Wilson in Ferguson, Missouri. It was starting to turn into a national story, but only after a large amount of protest, discussion and on-the-ground coverage, much of it happening on Twitter. Someone helpfully pointed out that just a few hours earlier ArchiveIt (the Internet Archive’s subscription service) had announced that it was seeking nominations of web pages to archive related to the events in Ferguson. We can see today that close to 981 pages were collected. 236 of those were submitted using the form that Internet Archive made available.

But what about the conversation that was happening in Twitter? That’s what you’ve been watching a little bit of for the past few minutes up on the screen here. Right after the panel a group of people made their way to the hotel bar to continue the discussion. At some point I remember talking to Bergis Jules, who impressed on me the importance of trying to do what we could to collect the torrent of conversation about Ferguson going on in Twitter. I had done some work collecting data from the Twitter API before and offered to lend a hand. Little did I know what would happen.

When we stopped this initial round of data collection we had collected 13,480,000 tweets that mentioned the word “ferguson” between August 10, 2014 and August 27, 2014.

You can see from this graph of tweets per day that there were definite cycles in the Twitter traffic. In fact the volume was so high at times, and we had started data collection 6 days late, that you can see there are periods where we weren’t able to get the tweets. You might be wondering what this data collection looks like. Before looking closer at the data let me try to demystify it a little bit for you.

Here is a page from the online documentation for Twitter’s API. If you haven’t heard the term API before it stands for Application Programming Interface, and that’s just a fancy name for a website that delivers up data (such as XML or JSON) instead of human readable web pages. If you have a Twitter app on your phone it most likely uses Twitter’s API to access the tweets of people you follow. Twitter isn’t the only place making APIs available: they are everywhere on the Web: Facebook, Google, YouTube, Wikipedia, OCLC, even the Library of Congress has APIs. In some ways if you make your EAD XML available on the Web it is a kind of API. I really hope I didn’t just mansplain what APIs are, that’s not what I was trying to do.

A single API can support multiple “calls” or questions that you can ask. Among its many calls, Twitter’s API has one that allows you to do a search and get back 100 tweets that match your query, plus a token to go and get the next 100. They let you ask this question 180 times every 15 minutes. If you do the math you can see that you can fetch 72,000 tweets an hour, or 1.7 million tweets per day. Unfortunately the API only lets you search the last 9 days of tweets, after which you can pay Twitter for data.
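To make that arithmetic concrete, here is a quick check of the numbers in Python; the constants are just the rate limits described above:

```python
# Rate limits described above: 100 tweets per request,
# 180 requests per 15-minute window on the search API.
tweets_per_window = 100 * 180            # 18,000 every 15 minutes
tweets_per_hour = tweets_per_window * 4  # 72,000
tweets_per_day = tweets_per_hour * 24    # 1,728,000 -- the "1.7 million" above

print(f"{tweets_per_hour:,} tweets/hour, {tweets_per_day:,} tweets/day")
```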

So what Bergis and I did was use a small Python program I had written previously called twarc to try to collect as many of the tweets as we could that had the word “ferguson” in them. twarc is just one tool for collecting data from the Twitter API.
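For a rough idea of what that looks like in code, here is a minimal sketch using twarc’s Python interface. twarc’s interface has changed across versions, so treat this as illustrative; the credential values and the output filename are placeholders:

```python
import json

from twarc import Twarc

# Placeholder credentials -- you would register a Twitter app to get real ones.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Page backwards through the search API for tweets mentioning "ferguson",
# writing each tweet's original JSON on its own line.
with open("ferguson-tweets.jsonl", "w") as out:
    for tweet in t.search("ferguson"):
        out.write(json.dumps(tweet) + "\n")
```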

Another tool you can use from the comfort of your Web browser (no command line fu required) is the popular Twitter Archiving Google Sheet (TAGS). TAGS lets you collect data from the search API, which it puts directly into a spreadsheet for analysis. This is super handy if you don’t want to parse the original JSON data returned by the Twitter API. TAGS is great for individual use.

And another option is the Social Feed Manager (SFM) project. SFM is a project started by George Washington University with support from IMLS and the National Historical Publications and Records Commission. I think SFM is doubly important to bring up today since the theme for the conference is Ingenuity and Innovation in Archives. NHPRC’s support for the SFM project has been instrumental in getting it to where it is today. SFM is an open source Web application that you download and set up at your institution, and which users then log into using their Twitter credentials to set up data collection jobs. GW named it Social Feed Manager because they are in the process of adding other content sources such as Flickr and Tumblr. They are hoping that extending it in this way will provide an architecture that will allow interested people to add other social media sites, and contribute them back to the project. The other nice thing that both SFM and twarc do (but that TAGS does not) is collect the original JSON data from the Twitter API. In a world where original order matters I think this is an important factor to keep in mind.

JSON is an acronym for JavaScript Object Notation. There are lots of other formats for sending data around on the Web, but JSON has emerged as the de facto standard for APIs. This has largely been the result of its versatility, and the fact that support for it is cooked into every Web browser that can run JavaScript.

So what’s in the JSON data for a tweet? Twitter is famous for its 140 character message limit. But the text of a tweet only accounts for about 2% of the JSON data that is made available by the Twitter API. Some people might call this metadata, but I’m just going to call it data for now, since this is the original data that Twitter collected and pushed out to any clients that are listening for it.

Also included in the JSON data are things like: the time that the tweet was sent, any hashtags present, geo coordinates for the user (if they have geo-location turned on in their preferences), urls mentioned, places recognized, embedded media such as images or videos, retweet information, reply to information, lots of information about the user sending the message, handles for other users mentioned, and the current follower count of the sender. And of course you can use the tweet ID to go back to the Twitter API to get all sorts of information such as who has retweeted or liked a tweet.

Here’s what the JSON looks like for a tweet. I’m not going to go into this in detail, but I thought I would at least show it to you. I suspect catalogers or EAD finding aid creators out there might not find this too scary to look at. JSON is much more expressive than the rows and columns of a spreadsheet because you can have lists of things, key/value pairs and hierarchical structures that don’t fit comfortably into a spreadsheet.
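As a rough sketch of how those fields can be pulled out of one tweet with Python, the snippet below reads the first tweet from a line-oriented JSON file (the filename is a placeholder, matching the earlier sketch) and prints a few of the fields just mentioned, using the field names from Twitter’s v1.1 tweet format:

```python
import json

# One tweet per line, as written by tools like twarc.
with open("ferguson-tweets.jsonl") as f:
    tweet = json.loads(next(f))

print(tweet["created_at"])                                     # when it was sent
print(tweet["user"]["screen_name"],
      tweet["user"]["followers_count"])                        # sender and follower count
print([h["text"] for h in tweet["entities"]["hashtags"]])      # hashtags
print([u["expanded_url"] for u in tweet["entities"]["urls"]])  # links
print(tweet["coordinates"])                                    # geo point, or None
print("retweet" if "retweeted_status" in tweet else "original tweet")
```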

Ok, I can imagine some eyes glazing over at these mundane details so let’s get back to the Ferguson tweets. Do you remember that form that ArchiveIt put together to let people submit URLs to archive? You may remember that 236 URLs were submitted. Bergis and I were curious, so we extracted all the URLs mentioned in the 13 million tweets, unshortened them, and then ranked them by the number of times they were mentioned. You can see a list of the top 100 shared links in that time period. Note that at the time, we also checked to see whether the Internet Archive had archived each page.
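A minimal sketch of that kind of tally, assuming the same line-oriented JSON file as above; unshortening is shown for a single URL with the requests library, since following redirects for hundreds of thousands of links takes more care (timeouts, retries, politeness) than fits here:

```python
import json
from collections import Counter

import requests

counts = Counter()
with open("ferguson-tweets.jsonl") as f:
    for line in f:
        tweet = json.loads(line)
        for u in tweet["entities"]["urls"]:
            counts[u["expanded_url"]] += 1

# The 25 most shared links in the collection.
for url, n in counts.most_common(25):
    print(n, url)

# Unshorten one link by following redirects and keeping the final address.
top_url = counts.most_common(1)[0][0]
final = requests.head(top_url, allow_redirects=True, timeout=10)
print(final.url)
```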

We then took a look just within the first day of tweets that we had, to see if they looked any different. Look at number 2 there, Racial Profiling Data/2013. It’s a government document from the Missouri Attorney General’s Office with statistics from the Ferguson Police Department. Let’s take a moment to digest those stats along with the 1,538 Twitter users who did that day.

Now what’s interesting is that the URL that was tweeted so many times then is already broken. And look, Internet Archive has it, but it was collected for the first time on August 12, 2014. Just as the conversation was erupting on Twitter. Perhaps this URL was submitted by an archivist to the form ArchiveIt put together. Or perhaps someone recognized the importance of archiving it and submitted it directly to the Internet Archive using their Save Now form.

The thing I didn’t mention earlier is that we found 417,972 unique, unshortened URLs. Among them were 21,457 YouTube videos. Here’s the fourth most shared YouTube video, that ended up being seen over half a million times.

As Bergis said in July of this year as he prepared for a class about archiving social media at Archival Education and Research Initiative (AERI):

Every time I hear we shouldn’t build social media archives like #Ferguson, I think abt events in black history for which we have no records.

— Bergis Jules, July 18, 2015

Bergis was thinking specifically about events like the Watts Riots in Los Angeles where 34 people were killed and 3,438 arrested.

Of course the story does not end there. As I mentioned I work at the Maryland Institute for Technology in the Humanities at the University of Maryland. We aren’t an archive or a library, we’re a digital humanities lab that is closely affiliated with the University library. Neil Fraistat, the Director of MITH, immediately recognized the value of doing this work. He not only supported me in spending time on it with Bergis, but also talked about the work with his colleagues at the University of Maryland.

When there was a Town Hall meeting on December 3, 2014 we were invited to speak along with other faculty, students and the University Police Commissioner. The slides of popular tweets that you saw earlier were originally created for the Town Hall. I spoke very briefly about the data we collected and invited students and faculty who were interested in working with the data to please get in touch. The meeting was attended by hundreds of students, and ended up lasting some 4 hours, with most of the time being taken up by students sharing stories from their experience on campus of harassment by police, concern about military grade weapons being deployed in the community, and insight into the forms of institutionalized racism that we all still live with today. It was an incredibly moving experience, and our images from the “archive” were playing the whole time as a backdrop.

After the Town Hall meeting Neil and a group of faculty on campus organized the BlackLivesMatter at UMD group, and a set of teach-ins at UMD where the regularly scheduled syllabus was set aside to discuss the events in Ferguson and Black Lives Matter more generally. Porter Olsen (who taught the BitCurator workshop yesterday) helped organize a series of sessions we call digital humanities incubators to build a community of practice around digital methods in the humanities. These sessions focused on tools for data collection, data analysis, and rights and ethical issues. We had Laura Wrubel visit from George Washington University to talk about Social Feed Manager. Trevor Munoz, Katie Shilton and Ricky Punzalan spoke about the rights issues associated with working with social media data. Josh Westgaard from the library spoke about working with JSON data. And Cody Buntain, Nick Diakopoulos and Ben Shneiderman helped us use tools like Python Notebooks and NodeXL for data analysis.

And of course, we didn’t know it at the time, but Ferguson was just the beginning. Or rather it was the beginning of a growing awareness of police injustice towards African Americans and people of color in the United States that began to be known as the BlackLivesMatter movement. BlackLivesMatter was actually started by Alicia Garza, Patrisse Cullors, and Opal Tometi after the acquittal of George Zimmerman in the Florida shooting death of Trayvon Martin two years earlier. But the protests on the ground in Ferguson, elsewhere in the US, and in social media brought international attention to the issue. Names like Aiyana Jones, Rekia Boyd, Jordan Davis, Renisha McBride, Dontre Hamilton, Eric Garner, John Crawford, led up to Michael Brown, and were followed by Tamir Rice, Antonio Martin, Walter Scott, Freddie Gray, Sandra Bland and Samuel Dubose.

Bergis and I did our best to collect what we could from these sad, terrifying and enraging events. The protests in Baltimore were of particular interest to us at the University of Maryland since it was right in our backyard. Our data collection efforts got the attention of Rashawn Ray who is an Assistant Professor of Sociology at the University of Maryland. He and his PhD student Melissa Brown were interested in studying how the discussion of Ferguson changed in four Twitter datasets we had collected: the killing of Michael Brown, the non-indictment of Darren Wilson, the Justice Department Report, and then the one year anniversary. They have been exploring what the hashtags, images and text tell us about the shaping of narratives, sub-narratives and counter-narratives around the Black experience in the United States.

And we haven’t even accessioned any of the data. It’s sitting in MITH’s Amazon cloud storage. This really isn’t anybody’s fault but my own. I haven’t made it a priority to figure out how to get it into the University’s Fedora repository. In theory it should be doable. This is why I’m such a fan of Natalie’s work at the University of Miami that I mentioned at the beginning. Not only did she get the data into the archive, but she described it with a finding aid that is now on the Web, waiting to be discovered by a researcher like Rashawn.

So what value do you think social media has as a tool for guiding appraisal in Web archives? Would it be useful if you could easily participate in conversation going on in your community and collect the Web documents that were important to them? Let me read you the first paragraph of a grant proposal Bergis wrote recently:

The dramatic rise in the public’s use of social media services to document events of historical significance presents archivists and others who build primary source research collections with a unique opportunity to transform appraisal, collecting, preservation and discovery of this new type of research data. The personal nature of documenting participation in historical events also presents researchers with new opportunities to engage with the data generated by individual users of services such as Twitter, which has emerged as one of the most important tools used in social activism to build support, share information and remain engaged. Twitter users document activities or events through the text, images, videos and audio embedded in or linked from their tweets. This means vast amounts of digital content is being shared and re-shared using Twitter as a platform for other social media applications like YouTube, Instagram, Flickr and the Web at large. While such digital content adds a new layer of documentary evidence that is immensely valuable to those interested in researching and understanding contemporary events, it also presents significant data management, rights management, access and visualization challenges.

As with all good ideas, we’re not alone in seeing the usefulness of social media in archival work. Ed Fox and his team just down the road at Virginia Tech have been working solidly on this problem for a few years and recently received an NSF grant to further develop their Integrated Digital Event Archiving and Library (IDEAL). Here’s a paragraph from their grant proposal:

The Integrated Digital Event Archive and Library (IDEAL) system addresses the need for combining the best of digital library and archive technologies in support of stakeholders who are remembering and/or studying important events. It extends the work at Virginia Tech on the Crisis, Tragedy, and Recovery network (see to handle government and community events, in addition to a range of significant natural or manmade disasters. It addresses needs of those interested in emergency preparedness/response, digital government, and the social sciences. It proves the effectiveness of the 5S (Societies, Scenarios, Spaces, Structures, Streams) approach to intelligent information systems by crawling and archiving events of broad interest. It leverages and extends the capabilities of the Internet Archive to develop spontaneous event collections that can be permanently archived as well as searched and accessed, and of the LucidWorks Big Data software that supports scalable indexing, analyzing, and accessing of very large collections.

Maybe you should have another Ed up here speaking! Or an archivist like Bergis. Here’s another project called iCrawl from the University of Hannover who are doing something very similar to Virginia Tech.

Seriously though, this has been fun. Before I leave you here are a few places you could go to get involved in and learn about this work.

  1. If you are an SAA member please join the conversation at the SAA Web Archiving discussion list. One of the cool things that happened on this discussion list last year was drafting a letter to Facebook that was sent by SAA President Kathleen Roe.
  2. If you’re not an SAA member there’s a new discussion list called Web Archives. It’s just getting started, so it’s a perfect time to join.
  3. Bergis, Christie Peterson, Bert Lyons, Allison Jai O’Dell, Ryan Baumann and I have been writing short pieces about this kind of work on Medium in the On Archivy publication. If you have ideas, thought experiments, actual work, or commentary, please write it on Medium and send us a request to include it.

And as Hillel Arnold pointed out recently:

Your semi-regular reminder that direct, local action is what makes change happen, not white dudes giving speeches. #saa15 #s610

— Hillel Arnold, August 22, 2015

Let’s get to work.

Ed Summers: Seminar Week 6

planet code4lib - Fri, 2015-10-09 04:00

This week we dove into some readings about information retrieval. The literature on the topic is pretty vast, so luckily we had Doug Oard on hand to walk us through it. The readings on deck were Liu (2009), Chapelle, Joachims, Radlinski, & Yue (2012) and Sanderson & Croft (2012). The first two of these had some pretty technical, mathematical components that were kind of intimidating and over my head. But the basic gist of both of them was understandable, especially after the context that Oard provided.

Oard’s presentation in class was Socratic: he posed questions for us to answer, which he helped answer as well, and which led on to other questions. We started with what information, retrieval and research are. I must admit to being a bit frustrated about returning to this definitional game of information. It feels so slippery, but we basically agreed that it is a social construct and moved on to lower level questions such as: what is data, what is a database, and what are the feature sets of information retrieval. The feature sets discussion was interesting because we have basically worked with three different feature sets: descriptions of things (think of catalog records), content (e.g. the contents of books) and user behavior.

We then embarked on a pretty detailed discussion of user behavior, and how the technique of interleaving result lists lets computer systems adaptively tweak the many parameters that tune information retrieval algorithms based on user behavior. Companies like Google and Facebook have the user attention to be able to deploy these adaptive techniques to evolve their systems. I thought it was interesting to reflect on how academic researchers are then almost required to work with these large corporations in order to deploy their research ideas. I also thought it might be interesting how having a large set of users who expect to use your product in a particular way might become a straitjacket of sorts, and perhaps over time lead to a calcification of ideas and techniques. This wasn’t a fully formed thought, but it seemed that this purely statistical and algorithmic approach to design lacked some creative energy that is fundamentally human – even though their technique had human behavior at its center as well.
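As a concrete illustration of how interleaving works, here is a minimal sketch of team-draft interleaving, one of the variants analyzed in the Chapelle et al. reading: two rankers take turns contributing their best not-yet-shown result to a single list, and clicks are credited back to whichever ranker contributed the clicked result. The function names and the click-scoring helper are my own, not taken from the paper:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Blend two ranked lists; remember which ranker ("A" or "B") supplied each result."""
    interleaved, credit, seen = [], {}, set()
    picks = {"A": 0, "B": 0}
    rankings = {"A": list(ranking_a), "B": list(ranking_b)}

    def next_unseen(team):
        for doc in rankings[team]:
            if doc not in seen:
                return doc
        return None

    while len(interleaved) < k:
        # The ranker with fewer contributions picks next; ties are a coin flip.
        order = sorted(picks, key=lambda team: (picks[team], random.random()))
        doc = None
        for team in order:
            doc = next_unseen(team)
            if doc is not None:
                interleaved.append(doc)
                credit[doc] = team
                seen.add(doc)
                picks[team] += 1
                break
        if doc is None:  # both rankings exhausted
            break
    return interleaved, credit

def score_clicks(credit, clicked):
    """Credit each click to the ranker that contributed the clicked result."""
    wins = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in credit:
            wins[credit[doc]] += 1
    return wins

# Example: a click on d3 is credited to ranker B, which contributed it to the list.
shown, credit = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d5"], k=4)
print(shown, score_clicks(credit, clicked=["d3"]))
```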

I guess it’s nice to think that the voice of the individual matters, and we’re not just dumbing all our designs down to the lowest common denominator between us. I think this class successfully steered me away from the competitive space of information retrieval even though my interest in appraisal and web archives moves in that direction, with respect to focused crawling. Luckily a lot of the information retrieval research in this area has been done already, but what is perhaps lacking are system designs that incorporate the decisions of the curator/archivist more. If not, I guess I can fall back on my other research area, the history of standards on the Web.


Chapelle, O., Joachims, T., Radlinski, F., & Yue, Y. (2012). Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1), 6.

Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.

Sanderson, M., & Croft, W. B. (2012). The history of information retrieval research. Proceedings of the IEEE, 100(Special Centennial Issue), 1444–1451.

Ed Summers: Seminar Week 5

planet code4lib - Fri, 2015-10-09 04:00

In this week's class we took a closer look at design methods and prototyping, with readings from Druin (1999), Zimmerman, Forlizzi, & Evenson (2007), and a paper that fellow student Joohee picked out, Buchenau & Suri (2000). In addition to a discussion of the readings, Brenna McNally from the iSchool visited us to demonstrate the Cooperative Inquiry method discussed in the Druin paper.

In a nutshell, Cooperative Inquiry is a research methodology that Druin specifically designed to enable adults and children to work collaboratively as equals on design problems. The methodology grew out of work at the University of Maryland Kid Design Team. In the paper Druin specifically discusses two projects at UMD, KidPad and PETS, and how cooperative inquiry drew upon the traditions of contextual inquiry and participatory design.

Brenna’s demonstration of the technique was both fun and instructive. It was really interesting to be a participant and then be asked to reflect on it in a meta way afterwards. Basically we were asked to generate some ideas about things we would like to do with digital cameras: being able to take a photo while driving, being able to take a picture quickly (like when a young child smiles), and taking pictures in your dreams (that was my suggestion). Then Brenna brought out some prototyping materials (colored paper, string, pipe cleaners, various sticky things) and asked us to prototype some solutions.

I wish I had taken some pictures. Jonathan and Diane’s digital dream catcher that you wore like a shower cap was memorable. Joohee and I created a little device that could sit on top of your car and take pictures on voice command and beam them through the “dream cloud” to a picture frame device in your house. I found it difficult to make the leap into using the materials at hand to prototype, but Brenna gave us examples of the types of prototyping she was looking for. Also, while we worked she was busily writing down different features that she noticed in our designs.

When we were done we each presented our ideas … and applauded each other (of course). Afterwards, Brenna went over some of the design elements she noticed and highlighted ones she thought were interesting. We discussed some of them and decided on ones that would be worth digging into further. At this point a new cycle of prototyping would begin. I thought this demonstration really clearly showed the iterative nature of observation, ideation and prototyping that make up the method.


Buchenau, M., & Suri, J. F. (2000). Experience prototyping. In Proceedings of the 3rd conference on designing interactive systems: Processes, practices, methods, and techniques (pp. 424–433). Association for Computing Machinery.

Druin, A. (1999). Cooperative inquiry: Developing new technologies for children with children. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 592–599). Association for Computing Machinery.

Zimmerman, J., Forlizzi, J., & Evenson, S. (2007). Research through design as a method for interaction design research in HCI. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 493–502). Association for Computing Machinery.

LITA: Digital Privacy Toolkit for Librarians, a LITA webinar

planet code4lib - Thu, 2015-10-08 20:38

Attend this important new LITA webinar:

Digital Privacy Toolkit for Librarians

Tuesday October 20, 2015
1:30 pm – 3:00 pm Central Time
Register Online, page arranged by session date (login required)

This 90 minute webinar will include a discussion and demonstration of practical tools for online privacy that can be implemented in library PC environments or taught to patrons in classes/one-on-one tech sessions, including browsers for privacy and anonymity, tools for secure deletion of cookies, cache, and internet history, tools to prevent online tracking, and encryption for online communications.

Attendees will:

Alison’s work for the Library Freedom Project and classes for patrons including tips on teaching patron privacy classes can be found at:

Alison Macrina

Alison Macrina is a librarian, privacy rights activist, and the founder and director of the Library Freedom Project, an initiative which aims to make real the promise of intellectual freedom in libraries by teaching librarians and their local communities about surveillance threats, privacy rights and law, and privacy-protecting technology tools to help safeguard digital freedoms. Alison is passionate about connecting surveillance issues to larger global struggles for justice, demystifying privacy and security technologies for ordinary users, and resisting an internet controlled by a handful of intelligence agencies and giant multinational corporations. When she’s not doing any of that, she’s reading.

Register for the Webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.


  • LITA Member: $45
  • Non-Member: $105
  • Group: $196

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

SearchHub: Implementing Apache Solr at Target

planet code4lib - Thu, 2015-10-08 20:27

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Target engineer Raja Ramachandran’s session on implementing Solr at one of the world’s largest retail companies.

Sending Solr into action on a high volume, high profile website within a large corporation presents several challenges — and not all of them are technical. This will be an open discussion and overview of the journey at Target to date. We’ll cover some of the wins, losses and ties that we’ve had while implementing Solr at Target as a replacement for a legacy enterprise search platform. In some cases the solutions were basic, while others required a little more creativity. We’ll cover both to paint the whole picture.

Raja Ramachandran is an experienced Solr architect with a passion for improving relevancy and acquiring data signals to improve search’s contextual understanding of its user.

Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…


District Dispatch: Life-term for nation’s librarian running out?

planet code4lib - Thu, 2015-10-08 16:12

Congress is considering a term-limit for the nation’s Librarian.

Hard on the heels of the recent surprise announcement that the current Librarian of Congress, Dr. James Billington, would accelerate his retirement from year’s end (as announced in June) to September 30, the Senate last night approved legislation to limit the service of all future Librarians. Co-authored by all five members of the Senate’s Joint Committee on the Library, and passed without debate by unanimous consent on the day of its introduction, the “Librarian of Congress Succession Modernization Act of 2015” (S.2162) would establish a ten-year term for the post, renewable by the President upon Senate reconfirmation. Since the position was established in 1800, it has been held by just 13 Librarians of Congress appointed to life terms. Comparable House legislation is expected, but the timing of its introduction and consideration is uncertain.

The Senate’s action last night comes as the President is preparing to nominate Dr. Billington’s successor against the backdrop of two scathing reports by the Government Accountability Office detailing serious and pervasive inefficiencies and deficiencies of both the Library of Congress’ and the U.S. Copyright Office’s information systems and (particularly in the case of the Library itself) management. Deputy Librarian David Mao is currently serving as Acting Librarian.

While no timetable for the President’s nomination of Dr. Billington’s successor has been announced, action by the White House (if not Senate confirmation) is expected before the end of this calendar year. In a letter to President Obama last June, ALA President Courtney Young strongly urged him to appoint a professional librarian to the post, a position since echoed by 30 other state-based library organizations. This summer, in an OpEd published in Roll Call, ALA Office for Information Technology Policy Director Alan Inouye also emphasized the need for the next Librarian of Congress to possess a skill set tailored to the challenge of leading the institution into the 21st Century.

The post Life-term for nation’s librarian running out? appeared first on District Dispatch.

District Dispatch: E-rate, broadband @ ARSL in Little Rock

planet code4lib - Thu, 2015-10-08 15:05

Official Tshirt of the 2015 ARSL Conference at Little Rock, Arkansas

Waiting for my connecting flight on the way back to D.C. from the 2015 conference of the Association for Rural & Small Libraries (ARSL) in Little Rock, I had plenty of time to reflect on the whirlwind of experiences packed into my day and a half at the conference. While the impetus for attending was the E-rate modernization proceeding that has been the focus of much of our telecom work over the last two years, an equally important outcome was being immersed in the culture of librarians dedicated to their rural communities. Learning from the librarians at the conference will be critically important as our office investigates potential rural-focused advocacy work. And I would be remiss if I didn’t mention how much fun we had along the way.

I started the conference providing context for our policy presentation, during which my colleague, Alan Inouye, gave an overview of the challenging and often murky work we do on behalf of libraries with decision makers at the national level. It may be counterintuitive to those of you who know E-rate to think of it as providing enlightenment on anything. However, as a case study for how a small association does policy—as compared to advocacy organizations that have separate budget lines for paying people to wait in lines for congressional hearings (go on, Google it)—E-rate makes a pretty good story. I am not known for brevity, and our E-rate work lends itself to many intertwined and complex twists and turns between the countless phone calls, in-person meetings, and official filings with the Federal Communications Commission (FCC); collaborating (or not) with our coalitions; standing firm on behalf of libraries amidst the strong school focus; swaying the press; coordinating our library partners; and keeping ALA members informed. It was challenging to pick it all apart in my allotted 12 minutes (or in an acceptable blog post length). Interest piqued? Read more here.

Day 2 was E-rate and broadband day

I was privileged to attend sessions by my ALA E-rate Task Force colleagues. Amber Gregory, E-rate coordinator for Arkansas at the state library, presented a comprehensive yet digestible version of E-rate in “E-rate: Get your Share” and Emily Almond, IT Director for Georgia Public Library Service, put the why bother with E-rate into a bigger perspective with “Broadband 101.” My takeaways from these sessions were:

  1. E-rate is an important tool to make sure your library has the internet connection that allows your patrons to do what they need to do online.
  2. It’s time to think beyond the basics and E-rate can help you plan for your library’s future broadband needs.
  3. It’s ok to ask questions even if you don’t know exactly how to ask: We’re librarians and we love information and we love to share!
  4. There are people who can help.

Questions from the participants yielded more discussion about the challenges often felt by rural libraries that lag far behind the broadband speeds we know are necessary for many library services. The discussion also gave me more ideas for where we might focus our E-rate and broadband advocacy efforts in the near term, which I will take back to the E-rate Task Force and colleagues in D.C.

Making personal connections

The last event for me in Little Rock was perhaps the highlight. As we discussed during our policy presentation, having the impact stories of libraries working in their local communities is essential to the work we do with national decision makers. We need to be able to show how libraries support national priorities in education, employment and economic development, and healthcare, to name a few. Examples provide the color to the message we try to convey. Alan and I spent the evening listening to (and grilling?) a table full of librarians who shared with us the challenges and strengths of their rural libraries. They also touched on aspirations for additional services they might provide their communities. This was all to our benefit; we came away with many notes and are thankful for time well spent. Spending time with these librarians and at the conference is a good reminder of how important it is to get out of D.C. regularly to gather input and anecdotes that make our work that much richer and more impactful.

Tipsy Librarian–a special concoction popular at the ARSL conference.

Walking through the hotel lobby after dinner, I was reminded that while the topics we talked about during the session are critically important for rural communities and the long-term impact of libraries that serve them, it’s also important to connect with colleagues at conferences like ARSL’s. This was reinforced to me during conversations with librarians who are often dealing with few resources, on their own and without significant support. The ARSL tradition of “dine-arounds” and, I believe, a new cocktail tradition created by the hotel are a fun way to create bonds that last beyond the conference. Another tidbit I tucked away for later use.

The post E-rate, broadband @ ARSL in Little Rock appeared first on District Dispatch.

DPLA: DPLA Receives $250,000 from Anonymous Donor to Expand Technical Capabilities

planet code4lib - Thu, 2015-10-08 14:55

The Digital Public Library of America is thrilled to announce that an anonymous donor has committed to provide substantial support towards DPLA’s mission in the form of a $250,000 grant to strengthen DPLA’s technical capabilities. This grant will allow DPLA to expand its technology team to handle additional content ingestion and to implement important new features based around its platform and website.

Today’s grant represents the third investment in DPLA’s mission by this anonymous donor. In 2013 they contributed to the rapid scaling-up of DPLA’s Hub network, and in 2015 they provided support for DPLAfest 2015 in Indianapolis.

“It’s wonderful to have this incredible, ongoing support from someone who concurs with the Digital Public Library of America about the importance of democratizing access to our shared cultural heritage,” said Dan Cohen, DPLA’s Executive Director. “Increasing our technical capacity in this way will advance that mission immediately and substantially.”

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated more than 11 million items from 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.
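For readers curious what that aggregation looks like from the outside, the sketch below shows one way to query DPLA’s public items API. It is an illustration only, not part of the announcement: the endpoint, the placeholder API key, and the response field names (“count,” “docs,” “sourceResource”) are assumptions based on DPLA’s published API documentation.

```python
import json
import urllib.request

# Illustrative sketch only: pull a few records from DPLA's public items API (v2).
# "YOUR_API_KEY" is a placeholder -- DPLA issues free keys on request -- and the
# response field names ("count", "docs", "sourceResource") are assumptions based
# on the public API documentation, not on this post.
API_KEY = "YOUR_API_KEY"
url = (
    "https://api.dp.la/v2/items"
    "?q=newspapers&page_size=5"
    f"&api_key={API_KEY}"
)

with urllib.request.urlopen(url) as response:
    data = json.load(response)

print("matching items:", data.get("count"))
for doc in data.get("docs", []):
    # Each doc wraps the contributing hub's original metadata.
    source = doc.get("sourceResource", {})
    print("-", source.get("title"))
```

Run with a valid key, this prints a count of matching records and the titles of the first five.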

District Dispatch: Federal libraries and the national policy agenda

planet code4lib - Thu, 2015-10-08 14:00

Library of Congress

On Tuesday, I had the pleasure of meeting at the Library of Congress with the FEDLINK Advisory Board. My brief was a presentation and discussion of the National Policy Agenda for Libraries in the context of federal libraries and related institutions.

Federal libraries are both a particular segment of the library community and an extensive, far-reaching one, including service to the general public. Those with the highest visibility and name recognition include the Library of Congress and the National Library of Medicine, but in fact there are numerous libraries across the federal sector, including several hundred in the armed forces. Among the latter is the Navy General Library Program, now 96 years old, which recorded over a million sailor visits in the last fiscal year.

Following is the FEDLINK mission statement:

The Federal Library and Information Network (FEDLINK) is an organization of federal agencies working together to achieve optimum use of the resources and facilities of federal libraries and information centers by promoting common services, coordinating and sharing available resources, and providing continuing professional education for federal library and information staff. FEDLINK serves as a forum for discussion of the policies, programs, procedures and technologies that affect federal libraries and the information services they provide to their agencies, to the Congress, the federal courts and the American people.

FEDLINK celebrates 50 years of service this year. I’m pleased to have ALA’s Jessica McGilvray serving as our liaison to FEDLINK.

The policy and advocacy challenges for federal libraries have substantial commonalities with those of other library segments. The problem of higher-ups not understanding the true contributions of libraries resonated, and it suggested the need for all library managers to also be marketers and salespeople (and the consequent need for such education in master’s programs).

Many thanks to Blane Dessy of the Library of Congress for the invitation. ALA looks forward to continued and closer collaboration on these issues, and I am particularly committed to working cooperatively on the many federal library issues that intersect with ALA’s national policy work.

The post Federal libraries and the national policy agenda appeared first on District Dispatch.

FOSS4Lib Recent Releases: SobekCM Digital Repository Software - 4.9.0

planet code4lib - Thu, 2015-10-08 00:29

Last updated October 7, 2015. Created by Peter Murray on October 7, 2015.

Package: SobekCM Digital Repository Software
Release Date: Sunday, October 4, 2015

DuraSpace News: Subscribe to the DuraSpace Quickbyte Video Channel

planet code4lib - Thu, 2015-10-08 00:00

Winchester, MA: Information seekers can now find videos and broadcasts tailored to almost any interest or topic online. Reading long emails went the way of the dinosaurs as rich media viewed on our phones became the way many of us access news and information. DuraSpace has joined the fray by establishing the DuraSpace Quickbyte video series on YouTube.

Library of Congress: The Signal: Extra Extra! Chronicling America Posts its 10 Millionth Historic Newspaper Page

planet code4lib - Wed, 2015-10-07 19:09

Talk about newsworthy! Chronicling America, an online searchable database of historic U.S. newspapers, has posted its 10 millionth page today. Way back in 2013, Chronicling America boasted 6 million pages available for access online.

The San Francisco call., October 12, 1902, Image 15

The site makes digitized newspapers published between 1836 and 1922 available through the National Digital Newspaper Program (NDNP). It also includes a separate searchable directory of U.S. newspaper records, describing more than 150,000 titles published from 1690 to the present and listing the libraries that hold physical copies on microfilm or in original print. The site now comprises more than 74 terabytes of data from more than 1,900 newspapers in 38 states and territories and the District of Columbia.

For the past eight years, the site has steadily grown in content and offered increasingly rich access. The NDNP data is in the public domain and available on the web for anyone to use. In addition, the web application supporting the Chronicling America web site is published as open-source software that others can implement and customize for their own digitized newspaper collections.

The technical aspects of the program are built on sustainable digital preservation practices, including open and standardized file formats and metadata structures, technical validation, and use of the digital collection and inventory management tools developed at the Library.
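As a small illustration of what “available on the web for anyone to use” can mean in practice, here is a minimal sketch that queries the site’s public JSON search endpoint. It is not drawn from the post itself: the URL pattern, the query parameters, and the response fields (“totalItems,” “items,” “title,” “date”) are assumptions based on the publicly documented Chronicling America API.

```python
import json
import urllib.request

# Minimal sketch, assuming the publicly documented Chronicling America
# search endpoint and its JSON output; the field names below are assumptions.
base = "https://chroniclingamerica.loc.gov/search/pages/results/"
query = "?andtext=librarian&rows=5&format=json"

with urllib.request.urlopen(base + query) as response:
    results = json.load(response)

print("total pages matched:", results.get("totalItems"))
for item in results.get("items", []):
    # Each item describes one digitized newspaper page.
    print(item.get("date"), "-", item.get("title"))
```

Run as-is, this would print the dates and titles of the first five matching pages.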

New-York tribune., November 25, 1906, Image 17

“It’s very exciting to have created such a large collection of newspapers from so many places around the country covering a wide breadth of time,” said Deb Thomas, who manages the program for the Library of Congress. “We can see how individual communities understood the world around them in those decades.”

The goal for Chronicling America, Thomas said, is to have all 50 states plus the U.S. territories represented in the archive, something she estimates may take about 10 more years. “The newspapers are the first draft of history,” she said. “That’s it – it has something for everyone in it. It’s not a specialized resource. It’s a record of community history and cultural history. That’s where we put it all.”

Chronicling America provides free and open access to more than 10 million pages of historic American newspapers selected by memory institutions in 38 states and territories so far. These states participate in the National Digital Newspaper Program, a joint program of the National Endowment for the Humanities and the Library of Congress. Read more about the program and follow us on Twitter @librarycongress #ChronAm #10million!

Read other Library of Congress blog posts recognizing this milestone:

