planet code4lib

DuraSpace News: Announcing the Release of the 2015 National Agenda For Digital Stewardship

Fri, 2014-10-03 00:00

Washington, DC  The 2015 National Agenda for Digital Stewardship has been released!

You can download a copy of the Executive Summary and Full Report here: 

DuraSpace News: Webinar Recording Available

Fri, 2014-10-03 00:00

Winchester, MA  The 8th DuraSpace Hot Topics Community Webinar Series, “Doing It: How Non-ARL Institutions are Managing Digital Collections” began October 2, 2014.  The first webinar in this series curated by Liz Bishoff, “Research Results on Non-ARL Academic Libraries Managing Digital Collections,” provided an overview of the methodology and key questions and findings of the Managing Digital Collections Survey of non-ARL academic libraries.  Participants also had the opportunity to share how

PeerLibrary: Weekly PeerLibrary meeting finalizing our Knight News Challenge...

Thu, 2014-10-02 23:24

Weekly PeerLibrary meeting finalizing our Knight News Challenge submission.

Cynthia Ng: Access 2014: Closing Keynote Productivity and Collaboration in the Age of Digital Distraction

Thu, 2014-10-02 18:35
The closing keynote for Access 2014. He spoke really fast, so apologies if I missed a couple of points. Presented by Jesse Brown Digital Media Expert, Futurist, Broadcast Journalist Background Bitstrips: To make fun cartoons. Co-founder. CBC show Podcast: Canadaland, broader view of media, in a global sense to what’s happening to society and culture. Technology Changing […]

Evergreen ILS: Evergreen 2.7.0 has been released!

Thu, 2014-10-02 17:42

Evergreen 2.7.0 has been released!

Small delay in announcing, but here we go…

Cheers and many thanks to everyone who helped to make Evergreen 2.7.0 a reality, our first official release of the 2.7 series! After six months of hard work with development, bug reports, testing, and documentation efforts, the 2.7.0 files are available on the Evergreen website’s downloads page:

So what’s new in Evergreen 2.7? You can see the full release notes here: To briefly summarize though, there were contributions made for both code and documentation by numerous individuals. A special welcome and acknowledgement to all our first-time contributors, thanks for your contributions to Evergreen!

Some caveats now… currently Evergreen 2.7.0 requires the use of the latest OpenSRF 2.4 series, which is still in its alpha release (beta coming soon). As folks help to test the OpenSRF release, this will no doubt help to make the Evergreen 2.7 series better. Also, for localization/i18n efforts, there was some last-minute bug finding and we plan to release updated translation files in the next maintenance release, 2.7.1, for the 2.7 series.

Evergreen 2.7.0 includes a preview of the new web-based staff client code. The instructions for setting this up are being finalized by the community and should be expected for release during the next maintenance version 2.7.1 later in October.

See here for some direct links to the various files so far:

Once again, a huge thanks to everyone in the community who has participated this cycle to contribute new code, test and sign-off on features, and work on new documentation and other ongoing development efforts.


— Ben

Cynthia Ng: Access 2014: Day 3 Notes

Thu, 2014-10-02 17:09
Final half day of Access 2014. The last stretch. ## RDF and Discovery in the Real World(cat) Karen Coombs, Senior Product Analyst, WorldShare Platform The Web of Data: Things, Not Strings Way for search engine to be the most relevant e.g. May 2012: Google Knowledge Graph provides more knowledge in search results. Traditionally in bibliographic description […]

Jenny Rose Halperin: A new look for our Community Newsletter

Thu, 2014-10-02 16:18

This post was featured on the Mozilla Community Blog


If you’ve been wondering why you haven’t received the best in Mozilla’s community news in some weeks, it’s because we’ve been busy redesigning our newsletter in order to bring you even more great content.

Non-profit marketing is no easy feat. Even with our team of experts here at Mozilla, we don’t always hit the bar when it comes to open rates, click-through rates, and other metrics that measure marketing success. For our community newsletter, I watched our metrics steadily decrease over the six-month period since we re-launched the newsletter and started publishing on a regular basis.

It was definitely time for a makeover.

Our community newsletter is a study in pathways and retention: How do we help people who have already expressed interest in contributing get involved and stay involved? What are some easy ways for people to join our community? How can communities come together to write inspiring content for the Web?

At Mozilla, we put out three main newsletters: Firefox and You (currently on a brief hiatus), the Firefox Student Ambassadors newsletter, and our Mozilla Communities Newsletter (formerly called about:Mozilla).

It was important to me to have the newsletter feel authentically like the voice of the community, to help people find their Mozillian way, and to point people in the direction of others who share their interests, opening up participation to a wider audience.

A peer assist with Andrea Wood and Kelli Klein at the Mozilla Foundation helped me articulate what we needed and stay on-target with the newsletter’s goal to “provide the best in contribution opportunities at Mozilla.” Andrea demonstrated to me how the current newsletter was structured for consumption, not action, and directed me toward new features that would engage people with the newsletter’s content and eventually help them join us.

I also took a class with Aspiration Tech on how to write emails that captivate as well as read a lot about non-profit email marketing. While some of it seemed obvious, my research also gave me an overview of the field, which allowed me to redesign the newsletter according to best practices.

Here’s what I learned:

1. According to M & R, who publishes the best (and most hilarious) study of non-profit email campaigns, our metrics were right on track with industry averages. Non-profit marketing emails have a mean open rate of 13% with a 2.5% deviance in either direction. This means that at between 25% and 15% open rate we were actually doing better than other non-profit emails. What worried me was that our open rate rapidly and steadily decreased, signalling a disengagement with the content.

I came up with similar findings for our click-through rates: on par with the industry, but steadily decreasing. (From almost 5% on our first newsletter to less than 1.5% on our last, eek!)

2. While I thought that our 70,000 subscribers put us safely in the “large email list” category, I learned that we are actually a small/medium newsletter according to industry averages! In terms of how we gain subscribers, I’m hoping that an increased social media presence as well as experiments with viral marketing (i.e. “forward this to a friend!”) will bring in new voices and new people to engage with our community.

3. “The Five Second Rule” is perhaps the best rule I learned about email marketing. Have you captured the reader in three seconds? Can you open an email and know what it’s trying to ask you in five seconds? If not, you should redesign.

4. Stories that asked people to take action were always the most clicked on stories in our last iteration. This is unsurprising, but “learn more” and “read more” don’t seem to move our readers. “Sign this petition” and “Sign up” were always well-received.

5. There is no statistically “best time” to send an email newsletter. The best time to send an email newsletter is “when it’s ready.” While every two weeks is a good goal for the newsletter, sending it slightly less frequently will not take away from its impact.

6. As M & R writes, “For everything, (churn churn churn) there is a season (churn, churn, churn)…” Our churn rate on the newsletter was pretty high (we lost and gained subscribers at a high rate). I’m hoping that our new regular features about teaching and learning as well as privacy will highlight what’s great about our community and how to take action.

And now to the redesign!

The first thing you’ll notice is that our newsletter is now called “Mozilla Communities.” We voted on the new name a few weeks ago after the Grow Mozilla call. Thanks to everyone who gave feedback.

An overview of the newsletter’s new look.

While the overall feel remains the same and is in line with other Mozilla-branded newsletters, the new look incorporates a few “evergreen” opportunities and actions you can take before the fold as well as features a contributor in their own words. (For the draft of the new design, that contributor is me!) The easy actions on the left hand side will rotate out as needed and increase in commitment level as you read down the page. Also, take a look at the awesome logo from Christie Koehler!


The next section presents rotating features on our privacy and educational initiatives. Privacy and education span a variety of functional areas, so this section could be populated by a variety of community endeavors. At the bottom of these sections, there’s a Facebook post and Tweet that you can post to easily take action, promote our communities, and get social to protect the Internet.


The next section features a story that engages the reader to take action! (In this case it invites readers into our awesome new gear store…) This story about Mozilla communities will rotate out according to the content that you submit. It will also be action-oriented, easy, and fun.

This last story is optional and will be rotated in and out according to testing during the first few issues. (Early feedback feared that there were too many stories.) In the draft design, we’re announcing a new contribution area. This will be a place for new community contribution areas, pathways, and opportunities to connect. The new photo section, “Mozillian Moments,” replaces our “Photo of the Week” section from the last iteration.


Finally, the footer reminds the reader that this newsletter is community-created and community-supported. It also invites readers to join us on social media. In the upcoming issues, the newsletter will also link to the new “Guides” forum that will help contributors find mentorship opportunities and connect with their fellow Mozillians.


What we need from you:

1. We need writers, coders, social media gurus, copy editors, and designers who are interested in consistently testing and improving the newsletter. The opportunity newsletter is a new contribution area on the October 15th relaunch of the Get Involved page (under the “Writing –> Journalism” drop down choice) and I’m hoping that will engage new contributors as well.

2. A newsletter can’t run without content, and we experimented with lots of ways to collect that content in the last few months. Do you have content for the newsletter? Do you want to be a featured contributor? Reach out to mozilla-communities at mozilla dot com.

3. Feedback requested! I put together an Etherpad that asks specific questions about improving the design. Please put your feedback here or leave it in the comments.

The newsletter is a place for us to showcase our work and connect with each other. We can only continue improving, incorporating best practices, and connecting more deeply and authentically through our platforms. Thank you to everyone who helped in the Mozilla Communities redesign and to all of you who support Mozilla communities every day.

Ed Summers: why @congressedits?

Thu, 2014-10-02 16:07

Note: as with all the content on this blog, this post reflects my own thoughts about a personal project, and not the opinions or activities of my employer.

Two days ago a retweet from my friend Ian Davis scrolled past in my Twitter stream:

This Twitter bot will show whenever someone edits Wikipedia from within the British Parliament. It was set up by @tomscott using @ifttt.

— Parliament WikiEdits (@parliamentedits)

July 8, 2014

The simplicity of combining Wikipedia and Twitter in this way immediately struck me as a potentially useful transparency tool. So using my experience on a previous side project I quickly put together a short program that listens to all major language Wikipedias for anonymous edits from Congressional IP address ranges (thanks Josh) … and tweets them.
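The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the actual @congressedits code: the range table, the shape of the `edit` dictionary, and the function names are assumptions, and the networks shown are reserved documentation ranges rather than real Congressional ones. It relies on the fact that Wikipedia records an anonymous edit under the editor's IP address.

```python
import ipaddress

# Hypothetical watch list. These are reserved documentation ranges
# (RFC 5737 and RFC 3849), NOT real Congressional address ranges.
WATCHED_RANGES = {
    "Example House": [ipaddress.ip_network("192.0.2.0/24")],
    "Example Senate": [ipaddress.ip_network("2001:db8::/32")],
}


def match_anonymous_edit(user):
    """Return the name of the watched range containing `user`, or None.

    A user name that does not parse as an IP address belongs to a
    logged-in account, so it is skipped.
    """
    try:
        address = ipaddress.ip_address(user)
    except ValueError:
        return None
    for name, networks in WATCHED_RANGES.items():
        for network in networks:
            # Membership is simply False when IP versions differ.
            if address in network:
                return name
    return None


def status_for(edit):
    """Build the text of a tweet for a matching edit, or None.

    `edit` is an assumed dictionary with "user", "page" and "url" keys.
    """
    name = match_anonymous_edit(edit["user"])
    if name is None:
        return None
    return f'{edit["page"]} Wikipedia article edited anonymously from {name} {edit["url"]}'
```

A real bot would feed `status_for` from a live stream of edits and post the non-empty results to Twitter; both of those pieces are left out here.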

In less than 48 hours the @congressedits Twitter account had more than 3,000 followers. My friend Nick set up @gccaedits for Canada using the same software … and @wikiAssemblee (France) and @RiksdagWikiEdit (Sweden) were quick to follow.

Watching the followers rise, and the flood of tweets from them brought home something that I believed intellectually, but hadn’t felt quite so viscerally before. There is an incredible yearning in this country and around the world for using technology to provide more transparency about our democracies.

Sure, there were tweets and media stories that belittled the few edits that have been found so far. But by and large people on Twitter have been encouraging, supportive and above all interested in what their elected representatives are doing. Despite historically low approval ratings for Congress, people still care deeply about our democracies, our principles and dreams of a government of the people, by the people and for the people.

We desperately want to be part of a more informed citizenry, that engages with our local communities, sees the world as our stage, and the World Wide Web as our medium.

Consider this thought experiment. Imagine if our elected representatives and their staffers logged in to Wikipedia, identified much like Dominic McDevitt-Parks (a federal employee at the National Archives) and used their knowledge of the issues and local history to help make Wikipedia better? Perhaps in the process they enter into conversation in an article’s talk page, with a constituent, or political opponent and learn something from them, or perhaps compromise? The version history becomes a history of the debate and discussion around a topic. Certainly there are issues of conflict of interest to consider, but we always edit topics we are interested in and knowledgeable about, don’t we?

I think there is often fear that increased transparency can lead to increased criticism of our elected officials. It’s not surprising given the way our political party system and media operate: always looking for scandal, and the salacious story that will push public opinion a point in one direction, to someone’s advantage. This fear encourages us to clamp down, to decrease or obfuscate the transparency we have. We all kinda lose, irrespective of our political leanings, because we are ultimately less informed.

I wrote this post to make it clear that my hope for @congressedits wasn’t to expose inanity, or belittle our elected officials. The truth is, @congressedits has only announced a handful of edits, and some of them are pretty banal. But can’t a staffer or politician make a grammatical change, or update an article about a movie? Is it really news that they are human, just like the rest of us?

I created @congressedits because I hoped it could engender more, better ideas and tools like it. More thought experiments. More care for our communities and peoples. More understanding, and willingness to talk to each other. More humor. More human.

@Congressedits is why we invented the Internet

— zarkinfrood (@zarkinfrood)

July 11, 2014

I’m pretty sure zarkinfrood meant @congressedits figuratively, not literally. As if perhaps @congressedits was emblematic, in its very small way, of something a lot bigger and more important. Let’s not forget that when we see the inevitable mockery and bickering in the media. Don’t forget the big picture. We need transparency in our government more than ever, so we can have healthy debates about the issues that matter. We need to protect and enrich our Internet, and our Web … and to do that we need to positively engage in debate, not tear each other down.

Educate and inform the whole mass of the people. Enable them to see that it is their interest to preserve peace and order, and they will preserve them. And it requires no very high degree of education to convince them of this. They are the only sure reliance for the preservation of our liberty. — Thomas Jefferson

Who knew TJ was a Wikipedian…

Ed Summers: Social Machines and the Archive

Thu, 2014-10-02 16:06

Yesterday MIT announced that Twitter made a 5 million dollar investment to help them create a Laboratory for Social Machines (LSM) as part of the MIT Media Lab proper:

MIT launches Laboratory for Social Machines with major Twitter investment @MITLSM @dkroy

— MIT Media Lab (@medialab) October 1, 2014

It seems like an important move for MIT to formally recognize that social media is a new medium that deserves its own research focus, and investment in infrastructure. The language on the homepage gives a nice flavor for the type of work they plan to be doing. I was particularly struck by their frank assessment of how our governance systems are failing us, and social media’s potential role in understanding and helping solve the problems we face:

In a time of growing political polarization and institutional distrust, social networks have the potential to remake the public sphere as a realm where institutions and individuals can come together to understand, debate and act on societal problems. To date, large-scale, decentralized digital networks have been better at disrupting old hierarchies than constructing new, sustainable systems to replace them. Existing tools and practices for understanding and harnessing this emerging media ecosystem are being outstripped by its rapid evolution and complexity.

Their notion of “social machines” as “networked human-machine collaboratives” reminds me a lot of my somewhat stumbling work on @congressedits and archiving Ferguson Twitter data. As Nick Diakopoulos has pointed out we really need a theoretical framework for thinking about what sorts of interactions these automated social media agents can participate in, formulating their objectives, and for measuring their effects. Full disclosure: I work with Nick at the University of Maryland, but he wrote that post mentioning me before we met here, which was kind of awesome to discover after the fact.

Some of the news stories about the Twitter/MIT announcement have included this quote from Deb Roy from MIT who will lead the LSM:

The Laboratory for Social Machines will experiment in areas of public communication and social organization where humans and machines collaborate on problems that can’t be solved manually or through automation alone.

What a lovely encapsulation of the situation we find ourselves in today, where the problems we face are localized and yet global. Where algorithms and automation are indispensable for analysis and data gathering, but people and collaborative processes are all the more important. The ethical dimensions to algorithms and our understanding of them is also of growing importance, as the stories we read are mediated more and more by automated agents. It is super that Twitter has decided to help build this space at MIT where people can answer these questions, and have the infrastructure to support asking them.

When I read the quote I was immediately reminded of the problem that some of us were discussing at the last Society of American Archivists meeting in DC: how do we document the protests going on in Ferguson?

Much of the primary source material was being distributed through Twitter. Internet Archive were looking for nominations of URLs to use in their web crawl. But weren’t all the people tweeting about Ferguson including URLs for stories, audio and video that were of value? If people are talking about something can we infer its value in an archive? Or rather, is it a valuable place to start inferring from?

I ended up archiving 13 million of the tweets that mention “ferguson” for the 2 week period after the killing of Michael Brown. I then went through the URLs in these tweets, and unshortened them and came up with a list of 417,972 unshortened URLs. You can see the top 50 of them here, and the top 50 for August 10th (the day after Michael Brown was killed) here.

I did a lot of this work in prototyping mode, writing quick one off scripts to do this and that. One nice unintended side effect was unshrtn which is a microservice for unshortening URLs, which John Kunze gave me the idea for years ago. It gets a bit harder when you are unshortening millions of URLs.
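The counting step lends itself to a short sketch. The following is a minimal illustration in Python, not the scripts actually used: the tweet shape (a dictionary with a `"urls"` list), the `resolve` callable, and the `short.example` addresses are all assumptions. In practice `resolve` would follow HTTP redirects or call out to a service such as unshrtn. Caching matters because the same short URL tends to appear in many tweets, and each lookup is a network round trip.

```python
from collections import Counter


def top_urls(tweets, resolve, n=50):
    """Rank unshortened URLs by how often they are tweeted.

    `tweets` is an iterable of dicts with a "urls" list (assumed shape).
    `resolve` maps a short URL to its final destination.
    """
    cache = {}  # short URL -> resolved URL, so each is looked up once
    counts = Counter()
    for tweet in tweets:
        for short in tweet.get("urls", []):
            if short not in cache:
                cache[short] = resolve(short)
            counts[cache[short]] += 1
    return counts.most_common(n)


# Stand-in for a real resolver: a fixed table of redirects.
redirects = {
    "http://short.example/a": "http://example.com/story",
    "http://short.example/b": "http://example.com/story",
    "http://short.example/c": "http://example.com/video",
}
tweets = [
    {"urls": ["http://short.example/a"]},
    {"urls": ["http://short.example/b", "http://short.example/c"]},
    {"urls": ["http://short.example/a"]},
    {"urls": []},
]
ranked = top_urls(tweets, redirects.get, n=2)
```

Two different short links that land on the same story are counted together, which is the point of unshortening before ranking.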

But what would a tool look like that let us analyze events in social media, and helped us (archivists) collect information that needs to be preserved for future use? These tools are no doubt being created by those in positions of power, but we need them for the archive as well. We also desperately need to explore what it means to explore these archives: how do we provide access to them, and share them? It feels like there could be a project here along the lines of what George Washington University are doing with their Social Feed Manager. Full disclosure again: I’ve done some contracting work with the fine folks at GW on a new interface to their library catalog.

The 5 million dollars aside, an important contribution that Twitter is making here (that’s probably worth a whole lot more) is firehose access to the Tweets that are happening now, as well as the historic data. I suspect Deb Roy’s role at MIT as a professor and as Chief Media Scientist at Twitter helped make that happen. Since MIT has such a strong history of supporting open research, it will be interesting to see how the LSM chooses to share data that supports its research.

Library of Congress: The Signal: Residency Program Success Stories, Part One

Thu, 2014-10-02 13:34

The following is a guest post by Julio Díaz Laabes, HACU intern and Program Management Assistant at the Library of Congress.

Coming off the heels of a successful beginning for the Boston and New York set of cohorts, the National Digital Stewardship Residency Program is becoming a model for digital stewardship residencies on a national scale. This residency program, funded by the Institute of Museum and Library Services, offers recent master’s and doctoral program graduates in specialized fields (library science, information science, museum studies, archival studies and others) the opportunity to gain professional experience in the field of digital preservation.

Clockwise from top left: Lee Nilsson, Maureen McCormick Harlow, Erica Titkemeyer and Heidi Elaine Dowding.

The inaugural year of the NDSR program was completed in May of 2014. During this year, ten residents were placed in various organizations in the Washington, DC area. Since completing the program, all ten residents are now working in positions related to the field of digital preservation! Here are some accounts of how the program has impacted the residents’ lives and where they are now in their careers.

Lee Nilsson is employed in a contract position as a junior analyst at the Department of State, Bureau of International Information Programs. Specifically, he is working in the analytics office on foreign audience research. On how the residency helped him, Lee said, “The residency got me to D.C. and introduced me to some great people. Without NDSR I would not have made it this far.” Furthermore, Lee commented that the most interesting aspect of his job is “the opportunity to work with some very talented people on some truly global campaigns.”

Following the residency, Maureen McCormick Harlow accepted a permanent position as the new Digital Librarian at PBS (Public Broadcasting Service). She works in the Media Library and her tasks include consulting on the development of the metadata schema for an enterprise-wide digital asset management system, fulfilling archival requests for legacy materials and working with copyright holders to facilitate the next phase of a digitization project (which builds on the NDSR project of Lauren Work). Maureen stated that NDSR helped her “to foster and cultivate a network of digital preservationists and practitioners in the DC area over the nine months that I participated in it.” An interesting aspect of her job is working with the history of PBS and learning about PBS programming to see how it has changed over the years.

On an international scale, Heidi Elaine Dowding is currently in a three-year PhD Research Fellow position at the Royal Dutch Academy of Arts and Sciences Huygens ING Institute. This position is funded through the European Commission. “My research involves the long-term publication and dissemination of digital scholarly editions, so aspects of digital preservation will be key,” said Heidi. On the best part of her position, Heidi said, “I am lucky enough to be fully funded, which allows me to focus on my studies. This gives me that opportunity to research things that I am interested in every day.”

Erica Titkemeyer is currently employed at the University of North Carolina at Chapel Hill as the Project Director and AV Conservator for the Southern Folklife Collection. This position was created as part of a grant-funded initiative to research and analyze workflows for the mass reformatting and preservation of legacy audiovisual materials. “NDSR allotted me the opportunity to participate in research and projects related to the implementation of digital preservation standards. It provided me access to a number of networking events and meetings related to digital stewardship.” In her position, she hopes to help see improved access to the collections, while also having the opportunity to learn more about the rich cultural content they contain.

Given these success stories, the National Digital Stewardship Residency has proven to be an invaluable program, providing opportunity for real-world practical experience in the field of digital preservation. Also, the diversity of host institutions and location areas across major U.S. cities gives residents the opportunity to build up an extensive network of colleagues, practitioners and potential employers in diverse fields. Stay tuned for part two of this blog post, which will showcase the remaining residents of the 2013-2014 Washington, D.C. cohort.

Peter Murray: Thursday Threads: Mobile Device Encryption, Getty Images for Free

Thu, 2014-10-02 10:42

Just a brief pair of threads this week. First is a look at what is happening with mobile device encryption as consumer electronics companies deal with data privacy in the post-Snowden era. There is also the predictable backlash from law enforcement organizations, and perhaps I just telegraphed how I feel on the matter. The second thread looks at how Getty Images is trying to get into distributing its content for free to get it in front of eyeballs that will end up paying for some of it.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted to Pinboard are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.

Apple and Android Device Data Encryption

In an open letter posted on Apple’s website last night, CEO Tim Cook said that the company’s redesigned its mobile operating system to make it impossible for Apple to unlock a user’s iPhone data. Starting with iOS8, only the user who locked their phone can unlock it.

This is huge. What it means is that even if a foreign government or a US police officer with a warrant tries to legally compel Apple to snoop on someone, they won’t. Because they can’t. It’s a digital Ulysses pact.

- Apple Will No Longer Let The Cops Into Your Phone, By PJ Vogt, TL;DR blog, 18-Sep-2014

The next generation of Google’s Android operating system, due for release next month, will encrypt data by default for the first time, the company said Thursday, raising yet another barrier to police gaining access to the troves of personal data typically kept on smartphones.

- Newest Androids will join iPhones in offering default encryption, blocking police, by Craig Timberg, The Washington Post, 18-Sep-2014

Predictably, the US government and police officials are in the midst of a misleading PR offensive to try to scare Americans into believing encrypted cellphones are somehow a bad thing, rather than a huge victory for everyone’s privacy and security in a post-Snowden era. Leading the charge is FBI director James Comey, who spoke to reporters late last week about the supposed “dangers” of giving iPhone and Android users more control over their phones. But as usual, it’s sometimes difficult to find the truth inside government statements unless you parse their language extremely carefully. So let’s look at Comey’s statements, line-by-line.

- Your iPhone is now encrypted. The FBI says it’ll help kidnappers. Who do you believe? by Trevor Timm, Comment is free, 30-Sep-2014

I think it is fair to say that Apple snuck this one in on us. To the best of my knowledge, the new encrypted-by-default wasn’t something talked about in the iOS8 previews. And it looks like poor Google had to play catch-up by announcing on the same day that they were planning to do the same thing with the next version of the Android operating system. (If Apple and Google conspired to make this announcement at the same time, I haven’t heard that either.)

As you can probably tell by the quote I pulled from the third article, I think this is a good thing. I believe the pendulum has swung too far in the direction of government control over communications, and Apple/Google are right to put new user protections in place. This places the process of accessing personal information firmly back in the hands of the judiciary through court orders to compel people and companies to turn over information after probable cause has been shown. There is nothing in this change that prevents Apple/Google from turning over information stored on cloud servers to law enforcement organizations. It does end the practice of law enforcement officers randomly seizing devices and reading data off them.

As an aside, there is an on-going discussion about the use of so-called “stingray” equipment that impersonates mobile phone towers to capture mobile network data. The once-predominant 2G protocol that the stingray devices rely on was woefully insecure, and the newer 3G and 4G mobile carrier protocols are much more secure. In fact, stingray devices are known to jam 3G/4G signals to force mobile devices to use the insecure 2G protocol. Mobile carriers are planning to turn off 2G protocols in the coming years, though, which will make the current generation of stingray equipment obsolete.

Getty Offers Royalty-Free Photos

The story of the photography business over the past 20 years has been marked by two shifts: The number of photographs in circulation climbs toward infinity, and the price that each one fetches falls toward zero. As a result, Getty Images, which is in the business of selling licensing rights, is increasingly willing to distribute images in exchange for nothing more than information about the public’s photo-viewing habits.

Now Getty has just introduced a mobile app, Stream, targeted at nonprofessionals to run on Apple’s new operating system. The app lets people browse through Getty’s images, with special focus on curated collections. It’s sort of like a version of Instagram (FB) featuring only professional photographers—and without an upload option.

- Getty’s New App Is Part of Its Plan to Turn a Profit From Free Photos, by Joshua Brustein, Businessweek, 19-Sep-2014

Commercial photography is another content industry — like mass-market and trade presses, journal publishers, newspapers, and many others — that is facing fundamental shifts in its business models. In this case, Getty is going the no-cost, embed-in-a-web-page route to getting their content to more eyeballs. They announced the Getty Images Embed program a year ago, and have now followed it up with this iOS app for browsing the collection of royalty-free images.


State Library of Denmark: What is high cardinality anyway?

Thu, 2014-10-02 09:45

An attempt to explain sparse faceting and when to use it in not-too-technical terms. Sparse faceting in Solr is all about speeding up faceting on high-cardinality fields for small result sets. That’s a clear statement, right? Of course not. What is high, what is small and what does cardinality mean? Dmitry Kan has spent a lot of time testing sparse faceting with his high-cardinality field, without getting the promised performance increase. Besides unearthing a couple of bugs with sparse faceting, his work made it clear that there is a need for better documentation. Independent testing for the win!

What is faceting?

When we say faceting in this context, it means performing a search and getting a list of terms for a given field. The terms are ordered by their count, which is the number of times they are referenced by the documents in the search result. A classic example is a list of authors:

The search for "fairy tale" gave 15 results «hits».
Author «field»
  - H.C. Andersen «term» (12 «count»)
  - Brothers Grimm «term» (5 «count»)
  - Lewis Carroll «term» (3 «count»)

Note how the counts sum to more than the number of documents: A document can have more than one reference to terms in the facet field. It can also have 0 references, all depending on the concrete index. In this case, the shown counts add up to 20 for 15 hits, so some of the documents must have more than one author; there may also be more terms than are shown. There are other forms of faceting, but they will not be discussed here.

Under the hood

At the abstract level, faceting in Solr is quite simple:

  1. A list of counters is initialized. It has one counter for each unique term in the facet field in the full corpus.
  2. All documents in the result set are iterated. For each document, a list of its references to terms is fetched.
    1. The references are iterated and for each one, the counter corresponding to its term is increased by 1.
  3. The counters are iterated and the Top-X terms are located.
  4. The actual Strings for the located terms are resolved from the index.
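
The four steps above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not Solr's actual code: the function name `facet_standard`, the in-memory `terms` list and the `doc_refs` mapping from document id to term ordinals are all assumptions made for the example. The toy data reproduces the "fairy tale" example with 15 hits and 20 references.

```python
import heapq
from typing import Dict, List, Tuple


def facet_standard(
    terms: List[str],
    doc_refs: Dict[int, List[int]],
    result_docs: List[int],
    top_x: int,
) -> List[Tuple[str, int]]:
    """Hypothetical sketch of the four abstract faceting steps."""
    # Step 1: allocate one counter per unique term in the full corpus.
    counters = [0] * len(terms)
    # Step 2: iterate the result set and increment the counter of each referenced term.
    for doc_id in result_docs:
        for ordinal in doc_refs.get(doc_id, []):
            counters[ordinal] += 1
    # Step 3: scan every counter to locate the Top-X terms.
    top = heapq.nlargest(
        top_x,
        ((count, ordinal) for ordinal, count in enumerate(counters) if count > 0),
    )
    # Step 4: resolve the actual Strings for the located terms.
    return [(terms[ordinal], count) for count, ordinal in top]


# Toy index: 15 documents, some of them with two authors.
terms = ["Brothers Grimm", "H.C. Andersen", "Lewis Carroll"]
doc_refs = {doc_id: [1] for doc_id in range(10)}        # Andersen only
doc_refs.update({10: [1, 0], 11: [1, 0]})                # Andersen and Grimm
doc_refs.update({12: [0, 2], 13: [0, 2], 14: [0, 2]})    # Grimm and Carroll
facets = facet_standard(terms, doc_refs, list(range(15)), top_x=3)
```

Note that steps 1 and 3 touch every counter regardless of how small the result set is, which is the overhead sparse faceting targets.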

Sparse faceting improves on standard Solr in two ways:

  • Standard Solr allocates a new list of counters in step 1 for each call, while sparse re-uses old lists.
  • Standard Solr iterates all the counters in step 3, while sparse only iterates the ones that were updated in step 2.

Distributed search is different

Distributed faceting in Solr adds a few steps:

  • All shards are issued the same request by a coordinating Solr. They perform steps 1-4 above and return the results to the coordinator.
  • The coordinator merges the shard responses into one structure and extracts the Top-X terms from that.
  • For each Top-X term, its exact count is requested from the shards that did not deliver it as part of the first step.

Standard Solr handles each exact-count separately by performing a mini-search for the term in the field. Sparse reuses the filled counters from step 2 (or repeats steps 1-2 if the counter has been flushed from the cache) and simply locates the counters corresponding to the terms. Depending on the number of terms, sparse is much faster (think 5-10x) than standard Solr for this task. See Ten times faster for details.
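
A rough sketch of the two-phase distributed flow, under simplifying assumptions: each shard is modelled as a plain dict of already filled counters (term to count), and the names `shard_top_x` and `distributed_facet` are made up for this illustration. Real Solr has further details that are left out here. The exact-count lookup is done the sparse way, by reading the shard's existing counter instead of running a mini-search.

```python
from collections import Counter
from typing import Dict, List, Tuple


def shard_top_x(shard_counts: Dict[str, int], top_x: int) -> Dict[str, int]:
    """Phase 1 on a single shard: return only the local Top-X terms."""
    return dict(Counter(shard_counts).most_common(top_x))


def distributed_facet(
    shards: List[Dict[str, int]], top_x: int
) -> List[Tuple[str, int]]:
    # Every shard gets the same request and answers with its local Top-X.
    responses = [shard_top_x(counts, top_x) for counts in shards]
    # The coordinator merges the responses and picks the Top-X candidates.
    merged: Counter = Counter()
    for response in responses:
        merged.update(response)
    candidates = [term for term, _ in merged.most_common(top_x)]
    # Phase 2: ask for exact counts from shards that did not deliver the term.
    exact = {}
    for term in candidates:
        total = 0
        for counts, response in zip(shards, responses):
            if term in response:
                total += response[term]
            else:
                # Sparse-style lookup in the filled counters (assumption:
                # standard Solr would issue a mini-search here instead).
                total += counts.get(term, 0)
        exact[term] = total
    return sorted(exact.items(), key=lambda item: (-item[1], item[0]))


shards = [{"a": 5, "b": 4, "c": 1}, {"c": 6, "a": 2, "b": 1}]
result = distributed_facet(shards, top_x=2)
```

In this toy run the merged first-phase count for "c" is 6, because the first shard did not include it in its local Top-2; the second phase corrects it to 7.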

What is cardinality?

Down to earth, cardinality just means how many there are of something. But what thing? The possibilities for faceting are many: Documents, fields, references and terms. To make matters worse, references and terms can be counted for the full corpus as well as just the search result.

  • Performance of standard Solr faceting is linear in the number of unique terms in the corpus in steps 1 & 3 and linear in the number of references in the search result in step 2.
  • Performance of sparse faceting is (nearly) independent of the number of unique terms in the corpus and linear in the number of references in the search result in steps 2 & 3.

Both standard Solr and sparse treat each field individually, so both scale linearly with the number of fields. The documents returned as part of the base search are themselves represented in a sparse structure (independent of sparse faceting), which scales with result set size. While it does take time to iterate over these documents, this is normally dwarfed by the other processing steps. Ignoring the devils in the details: Standard Solr facet performance scales with the full corpus size as well as the result size, while sparse faceting scales just with the result size.
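
To make the scaling difference concrete, here is a minimal sketch of a counter structure that remembers which counters were touched. It is an illustration of the idea only; the class name `SparseCounter` and its methods are assumptions, and details of the real implementation, such as falling back to standard counting when most counters are touched, are left out.

```python
import heapq
from typing import List, Tuple


class SparseCounter:
    """Hypothetical counter list that tracks which counters were updated."""

    def __init__(self, num_terms: int) -> None:
        self.counts = [0] * num_terms   # allocated once, reused between calls
        self.touched: List[int] = []    # ordinals whose counter is non-zero

    def increment(self, ordinal: int) -> None:
        if self.counts[ordinal] == 0:
            self.touched.append(ordinal)  # first update: remember the ordinal
        self.counts[ordinal] += 1

    def top(self, top_x: int) -> List[Tuple[int, int]]:
        # Only the touched counters are inspected, not all num_terms of them.
        return heapq.nlargest(
            top_x, ((self.counts[o], o) for o in self.touched)
        )

    def clear(self) -> None:
        # Reset only what was used, so the structure can be reused cheaply.
        for ordinal in self.touched:
            self.counts[ordinal] = 0
        self.touched.clear()


counter = SparseCounter(1_000_000)
for ordinal in [7, 7, 42, 999_999, 7, 42]:
    counter.increment(ordinal)
top_two = counter.top(2)  # (count, ordinal) pairs
```

With a million terms and six references, extraction and clearing each look at three counters instead of a million. When nearly every counter is touched, the tracking is pure overhead.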

Examples please!

  • For faceting on URL in the Danish Web Archive, cardinality is very high for documents (5bn), references (5bn) and terms (5bn) in the corpus. The overhead of performing a standard Solr faceting call is huge (hundreds of milliseconds), due to the high number of terms in the corpus. As the typical search results are quite a lot smaller than the full corpus, sparse faceting is very fast.
  • For faceting on host in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. However, the number of terms (1.3m) is more modest. The overhead of performing a standard Solr faceting call is quite small (a few milliseconds), due to the modest number of terms; the time used in step 2, which is linear in the number of references, is often much higher than the overhead. Sparse faceting is still faster in most cases, but only by a few milliseconds. Not much if the total response time is hundreds of milliseconds.
  • For faceting on content_type_norm in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. It is extremely small for the number of unique terms, which is 10. The overhead of performing a standard Solr faceting call is practically zero; the time used in step 2, which is linear in the number of references, is often much higher than the overhead. Sparse faceting is never faster than Solr for this and as a consequence falls back to standard counting, making it perform at the same speed.
  • For faceting on author at the library index at Statsbiblioteket, cardinality is high for documents (15m), references (40m) and terms (9m) in the corpus. The overhead of performing a standard Solr faceting call is noticeable (tens of milliseconds), due to the 9m terms in the corpus. The typical search result is well below 8% of the full corpus, and sparse faceting is markedly faster than standard Solr. See Small Scale Sparse Faceting for details.

DPLA: DPLA Brings National Attention to the Blue Earth County Historical Society

Thu, 2014-10-02 05:00

William and Jane Jones farm, Blue Earth County, Minnesota, ca.1888. Courtesy of the Blue Earth County Historical Society via the Minnesota Digital Library.

The Blue Earth County Historical Society (BECHS), founded in 1901, is located in Mankato, Minnesota. The content submitted by BECHS to the Minnesota Digital Library (MDL) and the Digital Public Library of America (DPLA) is unique to this county of Minnesota. Our collection chronicles the people, places, and events that shaped Blue Earth County from our agricultural roots to the rise of our cities.  The images tell a variety of stories and showcase all walks of life across decades of time.  All of the images have been donated to BECHS by people who wanted to preserve the past for future generations.

BECHS was honored when the DPLA selected one of our images to represent MDL when it came on as a partner in April 2013. The image was of Dr. G. A. Dahl posing for a photograph at a local photography studio. The interesting aspect of this image was that there was a photographer in the image as well as Dr. Dahl. It seems Dahl was fascinated with photography as he had interior photographs taken of his home and office, in a time when photographs of living spaces were rare. These images are also a part of our collection.

Hubbard House from Broad Street with four women, Mankato, Minnesota, ca.1900. Courtesy of the Blue Earth County Historical Society via the Minnesota Digital Library.

Our involvement in MDL helps people across Minnesota locate our images and have access to our collection. BECHS’ involvement with the DPLA has amplified that reach. People from across the country, and the world, are able to locate our images, which gives the user a glimpse into our collection and history. Based on the analytics from our MDL webpage, the DPLA is the top source of referrals to our webpage. Because the DPLA is a national resource, people can search this one site to find images from different locations and be directed back to the image’s home location. It is an excellent resource, especially for genealogists.

The Blue Earth County Historical Society is located in Mankato, Minnesota. BECHS was founded in 1901 in preparation for the semi-centennial of Mankato and Blue Earth County. In 1938, BECHS purchased the Hubbard family home and opened the first public history museum. BECHS operated from this location for 50 years before moving into our current location. The Hubbard House was placed on the National Register of Historic Places in 1978 and is still operated by BECHS as a living history museum. As we enter into our 113th year, BECHS has upcoming expansion plans in our current facility and will continue to collect, preserve and present the history of Blue Earth County for present and future generations.

Featured image credit: Detail from Portrait of Dr. G. A. Dahl, Mankato, Minnesota, ca.1900. Keene, George E. Courtesy of the Blue Earth County Historical Society via the Minnesota Digital Library.

 All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

William Denton: Michael Collins on The Great Eastern

Thu, 2014-10-02 04:22

I’ve written about The Great Eastern a couple of times: once in April, when I had just started to listen to all the episodes for the fifth or sixth time, and then briefly in August with a quote about libraries. I finished listening to it all last month. It’s still a masterpiece, one of the finest radio comedies and one of the richest and deepest works of radio fiction ever.

Michael Collins has been writing long pieces about it on his web site and they are mandatory reading if you know the show at all:

Mack Furlong, who played host Paul Moth, won a John Drainie award recently. He was impeccable on the show.

Hydra Project: Hydra Connect #2 – reports

Thu, 2014-10-02 00:24

170 or so people are gathered together in Cleveland, Ohio, for Hydra Connect #2 – the second annual Hydra get-together. If you weren’t able to come (and even if you were) you’ll find increasing numbers of presentations and meeting notes hanging off the program page at


HangingTogether: A Year of Living Dangerously For Archives (and you)

Wed, 2014-10-01 22:51

[Female acrobats on trapezes at circus | Library of Congress ]

[This post is in honor of American Archives Month, which starts today!]

This year at the Society of American Archivists annual meeting, incoming SAA president Kathleen Roe kicked off her initiative “A Year of Living Dangerously for Archives.” You can read about the initiative on the SAA Website, but I can also boil this down for you. Those of us who work in cultural heritage institutions get it: archives are important. We spend a lot of time telling one another about our wonderful collections, and about the good work we do. However, despite our passion and conviction, we don’t spend nearly enough time making the case outside the building for how important archives are.

I like this formulation: Archives change lives. Tell people about it.

I’m eager to hear all the stories that come out of this Year of Living Dangerously (YOLDA, as I’m dubbing it, which goes nicely with YOLO, don’t you think?). I urge you to participate in YOLDA by sharing your stories on the SAA website but also by pointing us to your work in the comments. Let’s use this year to inspire one another. I think it’s more dangerous to not take action than to find ways to advocate for ourselves, but if it makes you happy to think of yourself as an action hero, then go for it!

About Merrilee Proffitt


Karen Coyle: This is what sexism looks like

Wed, 2014-10-01 22:18
[Note to readers: sick and tired of it all, I am going to report these "incidents" publicly because I just can't hack it anymore.]

I was in a meeting yesterday about RDF and application profiles, in which I made some comments, and was told by the co-chair: "we don't have time for that now", and the meeting went on.

Today, a man who was not in the meeting but who listened to the audio sent an email that said:
"I agree with Karen, if I correctly understood her point, that this is "dangerous territory".  On the call, that discussion was postponed for a later date, but I look forward to having that discussion as soon as possible because I think it is fundamental."

And he went on to talk about the issue, how important it is, and at one point referred to it as "The requirement is that a constraint language not replace (or "hijack") the original semantics of properties used in the data."

The co-chair (I am the other co-chair, although reconsidering, as you may imagine) replied:
"The requirement of not hijacking existing formal specification languages for expressing constraints that rely on different semantics has not been raised yet."

"Has not been raised?!" The email quoting me stated that I had raised it the very day before. But an important issue is "not raised" until a man brings it up. This in spite of the fact that the email quoting me made it clear that my statement during the meeting had indeed raised this issue.

Later, this co-chair posted a link to a W3C document in an email to me (on list) and stated:
"I'm going on holidays so won't have time to explain you, but I could, in theory (I've been trained to understand that formal stuff, a while ago)"

That is so f*cking condescending. This happened after I quoted from W3C documents to support my argument, and I believe I had a good point.

So, in case you haven't experienced it, or haven't recognized it happening around you, this is what sexism looks like. It looks like dismissing what women say, but taking the same argument seriously if a man says it, and it looks like purposely demeaning a woman by suggesting that she can't understand things without the help of a man.

I can't tell you how many times I have been subjected to this kind of behavior, and I'm sure that some of you know how weary I am of not being treated as an equal no matter how equal I really am.

Quiet no more, friends. Quiet no more.

Cynthia Ng: Access 2014 Day 2: Afternoon Notes

Wed, 2014-10-01 21:49
We continue with the afternoon of day 2 of Access 2014. On the program is linked data and some lightning talks. Linked Data is People: Using Linked Data to Reshape the Library Staff Directory Jason A. Clark, Head, Library Informatics & Computing, Montana State University Scott W.H. Young, Digital Initiatives Librarian, Montana State University Linkded […]

CrossRef: CrossRef Newsletter - October 2014 Edition

Wed, 2014-10-01 21:33

The latest edition of the CrossRef Newsletter has been posted.

The October 2014 edition contains news and updates on CrossRef Text and Data Mining, CrossMark, FundRef and more. The Tech Corner has updates on new technical developments such as the Notification Callback Service. We will be attending various meetings this Fall including exhibiting at the Frankfurt Book Fair next week. The CrossRef Annual meeting in London is coming up in November and there's a story on that. There are also important updates on the Board of Directors election, billing and more.

Open Knowledge Foundation: Connect and Help Build the Global Open Data Index

Wed, 2014-10-01 21:11

Earlier this week we announced that October is the Global Open Data Index. Already people have added details about open data in Argentina, Colombia, and Chile! You can see all the collaborative work here in our change tracker. Each of you can make a difference to hold governments accountable for open data commitments plus create an easy way for civic technologies to analyze the state of open data around the world, hopefully with some shiny new data viz. Our goal at Open Knowledge is to help you shape the story of Open Data. We are hosting a number of community activities this month to help you learn and connect with each other. Most of all, it is our hope that you can help spread the word in your local language.

Choose your own adventure for the Global Open Data Index

We’ve added a number of ways that you can get involved to the OKFN Wiki. But, here are some more ways to learn and share:

Community Sessions – Let’s Learn Together

Join the Open Knowledge Team and Open Data Index Mentors for a session all about the Global Open Data Index. It is our goal to show open data around the world. We need your help to add data from your region and reach new people to add details about their country.

We will share some best practices on finding and adding open dataset content to the Open Data Index. And, we’ll answer questions about the use of the Index. There are timeslots to help people connect globally.

These will be recorded. But, we encourage you to join us on G+/YouTube and bring your ideas/questions. Stay tuned as we may add more online sessions.

Community Office Hours

Searching for datasets and using the Global Open Data Index tool is all the better with a little help from mentors and fellow community members. If you are a mentor, it would be great if you could join us on a Community Session or host some local office hours. Simply add your name and schedule here.

Mailing Lists and Twitter

The Open Data Index mailing list is the main communication channel for folks who have questions or want to get in touch: For twitter, keep an eye on updates via #openindex2014

Translation Help

What better way to help others get involved than to share in your own language? We could use your help. We have some folks translating content into Spanish. Other priority languages are Yours!, Arabic, Portuguese, French and Swahili. Here are some ways to help translate:

Learn on your own

We know that you have limited time to contribute. We’ve created some FAQs and tips to help you add datasets on your own time. I personally like to think of it as a data expedition to check the quality of open data in many countries. Happy hunting and gathering! Last year I had fun reviewing data from around the world. But, what matters is that you have local context to review the language and data for your country. Here’s a quick screenshot of how to contribute:

Thanks again for making Open Data Matter in your part of the world!

(Photo by Marieke Guy, cc by license (cropped))