Feed aggregator

Eric Hellman: The Perfect Bookstore Loses to Amazon

planet code4lib - Fri, 2014-10-03 13:59
My book industry friends are always going on and on about "the book discovery problem". Last month, a bunch of us, convened by Chris Kubica, sat in a room in Manhattan's Meatpacking district and plotted out how to make the perfect online bookstore. "The discovery problem" occupied a big part of the discussion. Last year, Perseus Books gathered a smattering of New York's nerdiest at "the first Publishing Hackathon". The theme of the event, the "killer problem": "book discovery". Not to be outdone, HarperCollins sponsored a "BookSmash Challenge" to find "new ways of reading and discovering books".  

Here's the typical framing of "the book discovery problem". "When I go to a bookstore, I frequently leave with all sorts of books I never meant to get. I see a book on the table, pick it up and start reading, and I end up buying it. But that sort of serendipitous discovery doesn't happen at Amazon. How do we recreate that experience?" Or "There are so many books published, how do we match readers to the books they'd like best?"

This "problem" has always seemed a bit bogus to me. First of all, when I'm on the internet, I'm constantly running across interesting sounding books. There are usually links pointing me at Amazon, and occasionally I'll buy the book.

As a consumer, I don't find I have a problem with book discovery. I'm not compulsive about finding new books; I'm compulsive about finishing the book I've read half of. When I finish a book, it's usually two in the morning and I really want to get to sleep. I have big stacks both real and virtual of books on my to-read list.

Finally, the "discovery problem" is a tractable one from a tech point of view. Throw a lot of data and some machine learning at the problem and a good-enough solution should emerge. (I should note here that book "discovery" on the website I run is terrible at the moment, but pretty soon it will be much better.)

So why on earth does Amazon, which throws huge amounts of money into capital investment, do such a middling job of book discovery?

Recently the obvious answer hit me in the face, as such answers are wont to do. The answer is that mediocre discovery creates a powerful business advantage for Amazon!

Consider the two most important discovery tools used on the Amazon website:
  1. People who bought X also bought Y.
  2. Top seller lists.
Both of these methods have the property that the way to make these work for your book is for your book to sell a lot on Amazon. That means that any author or publisher that wants to sell a lot of books on Amazon will try to steer as many fans as possible to Amazon. More sales means more recommendations, which means more sales, and so on. Amazon is such a dominant bookseller that a bookstore could have the dreamiest features and pay the publisher a larger share of the retail selling price and still have the publisher try to push people to Amazon.
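The loop described above (more sales, more recommendations, more sales) can be illustrated with a toy simulation. This is only a minimal sketch, not a model of Amazon's actual recommender: it assumes each new buyer picks a book with probability proportional to that book's sales so far, plus one so that unsold books still have a chance. The name `simulate_sales` and its parameters are made up for illustration.

```python
import random


def simulate_sales(num_books=5, num_buyers=10_000, seed=0):
    """Toy rich-get-richer model in which recommendations follow past sales.

    Assumption (not taken from any real store): a buyer chooses book i
    with probability proportional to sales[i] + 1.
    """
    rng = random.Random(seed)
    sales = [0] * num_books
    for _ in range(num_buyers):
        # A book's weight grows with its own past sales: the feedback loop.
        weights = [s + 1 for s in sales]
        winner = rng.choices(range(num_books), weights=weights, k=1)[0]
        sales[winner] += 1
    return sales


if __name__ == "__main__":
    print(simulate_sales())
```

All the books start out identical, yet in a model like this early random leads tend to persist, so the final counts usually come out uneven. Changing `seed` changes which book ends up ahead.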

What happens in this sort of positive feedback system is pretty obvious to an electrical engineer like me, but Wikipedia's example of a cattle stampede makes a better picture.

The number of cattle running is proportional to the overall level of panic, which is proportional to...the number of cattle running! "Stampede loop" by Trevithj. CC BY-SA.

Result: Stampede! Yeah, OK, these are sheep. But you get the point. "Herdwick Stampede" by Andy Docker. CC BY.

Imagine what would happen if Amazon shifted from sales-based recommendations to some sort of magic that matched a customer with the perfect book. Then instead of focusing effort on steering readers to Amazon, authors and publishers would concentrate on creating the perfect book. The rich would stop getting richer, and instead, reward would find the deserving.

Ain't never gonna happen. Stampedes sell more books.

Islandora: Islandora Camp CO Has a Logo!

planet code4lib - Fri, 2014-10-03 13:47

Islandora Camp is going to Denver, CO in just 10 days. When we get there, we will be very happy to welcome our Camp attendees with their official t-shirt, designed by UPEI's Donald Moses:

Don's iCampCO logo will join the ranks of our historical Camp Logos. Keep an eye out for our next Camp and yours could be next!

Open Knowledge Foundation: Streamlining the Local Groups network structure

planet code4lib - Fri, 2014-10-03 11:21

We are now a little over a year into the Local Groups scheme that was launched in early 2013. Since then we have been receiving hundreds of applications from great community members wanting to start Local Groups in their countries and become Ambassadors and community leaders. From this great body of amazing talent, Local Groups in over 50 countries have been established and frankly we’ve been overwhelmed with the interest that this program has received!

Over the course of this time we have learned a lot. We have seen that open knowledge first and foremost develops locally, and that global peer support is a great driver for making a change in local environments. We’re humbled and proud to be able to help facilitate the great work that is being done in all these countries.

We have also learned, however, of things in the application process and the general network structure that can be improved. After collecting feedback from the community earlier in the year, we learned that the structure of the network and the different labels (Local Group, Ambassador, Initiative and Chapter) were hard to comprehend, and that the waiting time for applicants wanting to become Ambassadors and start Local Groups was frustrating. People applying are eager to get started, and they were having to wait weeks or even longer because of the number of applications that came in.

Presenting a more streamlined structure and way of getting involved

We have now thoroughly discussed the feedback with our great Local Groups community and as a result we are excited to present a more streamlined structure and a much easier way of getting involved. The updated structure is written up entirely on the Open Knowledge wiki, and includes the following major headlines:

1. Ambassador and Initiative level merge into “Local Groups”

As mentioned, applying to become an Ambassador and applying to set up an Initiative were the two kinds of entry-level ways to engage; “Ambassador” implying that the applicant was – to begin with – just one person, and “Initiative” being the way for an existing group to join the network. These were then jointly labelled “Local Groups”, which was – admittedly – a lot of labels to describe pretty much the same thing: People wanting to start a Local Group and collaborate. Therefore we are removing the Initiative label altogether, and from now on everyone will simply apply through one channel to start a Local Group. If you are just one person doing that (even though more people will join later) you are granted the opportunity to take the title of Ambassador. If you are a group applying collectively to start a Local Group, then everyone in that group can choose to take the title of Local Group Lead, which is a more shared way to lead a new group (as compared to an Ambassador). Applying still happens through a webform, which has been revamped to reflect these changes.

2. Local Group applications will be processed twice per year instead of on a rolling basis

All the hundreds of applications that have come in over the last year have been peer-reviewed by a volunteer committee of existing community members (and they have been doing a stellar job!). One of the other major things we’ve learned is that the work pressure that the sheer number of applications put on this hard-working group simply wasn’t sustainable in the long term. That is why, as of now, we will replace the rolling processing and review of applications with two annual sprints in October and April. This may appear as if waiting time for applicants becomes even longer, but that is not the case! In fact, we are implementing a measure that ensures no waiting at all! Keep reading.

3. Introducing a new easy “get-started-right-away” entry level: “Local Organiser”

This is the new thing we are most excited to introduce! Seeing how setting up a formal Local Group takes time (regardless of how many applications come in), it was clear that we needed a way for people to get involved in the network right away, without having to wait for weeks and weeks on formalities and practicalities. This has led to the new concept of “Local Organiser”:

Anyone can pick up this title immediately and start to organise Open Knowledge activities locally in their own name, but by calling themselves Local Organiser. This can include organising meetups, contributing on discussion lists, advocating the use of open knowledge, building community and gathering more people to join – or any other relevant activity aligned with the values of Open Knowledge.

Local Organisers need to register by setting up a profile page on the Open Knowledge wiki as well as filling in this short form. Shortly thereafter the Local Organiser will be greeted officially into the community with an email from the Open Knowledge Local Group Team containing a link to the Local Organiser Code of Conduct that the person automatically agrees to adhere to when he/she picks up the title.

Local Organisers use existing, public tools such as Tumblr, Twitter etc. – but can also request Open Knowledge to set up a public discussion list for their country (if needed – otherwise they can also use other existing public discussion lists). Additionally, they can use the Open Knowledge wiki as a place to put information and organize as needed. Local Organisers are encouraged to publicly document their activities on their Open Knowledge wiki profile in order to become eligible to apply to start an official Open Knowledge Local Group later down the road.

A rapidly growing global network

What about Chapters, you might wonder? Their status remains unchanged, and they continue to be the expert level entity that Local Groups can apply to become when reaching a certain level of prowess.

All in all it’s fantastic to see how Open Knowledge folks are organising locally in all corners of the world. We look forward to continuing to support you all!

If you have any questions, ideas or comments, feel free to get in touch!

Mita Williams: The Knight Foundation News Challenge Entries That I Have Applauded

planet code4lib - Fri, 2014-10-03 02:55
The Knight News Challenge has been issued and it's about libraries:
How might we leverage libraries as a platform to build more knowledgeable communities? 

I'm reviewing these entries because I think some of them might prove useful in a paper I'm currently writing. There are some recurring themes to the entries that I think are quite telling.

Of the 680 entries, there are some wonderful ideas that need to be shared. Here are some of the proposals that I've applauded:

    For the purposes of my paper, I'm interested in the intersections of Open Data and Libraries. Here are the entries that touch on these two topics:

    And I would be remiss if I didn't tell you that I am also collaborating on this entry:

    OVER UNDER AROUND THROUGH: a national library-library game to build civic engagement skills: OVER UNDER AROUND THROUGH is kinda like a dance-off challenge: libraries challenge each other – but instead of “show us your moves” the challenge is “show us how you would take on” actual community challenges such as economic disparity and racial tensions

    In many ways, this Knight News Challenge is just such a dance-off.

    CrossRef: The Public Knowledge Project and CrossRef Collaborate to Improve Services for Publishers using Open Journal Systems

    planet code4lib - Fri, 2014-10-03 02:46

    2 October 2014, Lynnfield, MA, USA and Vancouver, BC, Canada---CrossRef and the Public Knowledge Project (PKP) are collaborating to help publishers and journals using the Open Journal Systems (OJS) platform take better advantage of CrossRef services.

    The collaboration involves an institutional arrangement between CrossRef and PKP, and new software features. Features include an improved CrossRef plugin for OJS that will automate Digital Object Identifier (DOI) deposits, as well as plans to create new tools for extracting references from submissions. To facilitate CrossRef membership, PKP has also become a CrossRef Sponsoring Entity, which will allow OJS-based publishers to join CrossRef through PKP.

    The latest release of OJS version 2.4.5 includes a new CrossRef plugin with improved support for CrossRef deposits, the process by which CrossRef member publishers can assign DOIs (persistent, actionable identifiers) to their content. A CrossRef deposit includes the bibliographic metadata about an article or other scholarly document, the current URL of the item, and the DOI. Publishers need only update the URL at CrossRef if the content's web address changes. The cited DOI will automatically direct readers to the current URL.
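The indirection described above, where a persistent DOI points at whatever the current URL is, can be sketched as a simple lookup table. This is an illustrative toy, not CrossRef's deposit API or the OJS plugin: the `DoiRegistry` class, its method names, the placeholder DOI, and the example.org URLs are all assumptions made for the example.

```python
class DoiRegistry:
    """Toy stand-in for a DOI resolver: persistent DOI -> current URL."""

    def __init__(self):
        self._records = {}

    def deposit(self, doi, url, metadata):
        # A deposit carries bibliographic metadata, the current URL, and the DOI.
        self._records[doi] = {"url": url, "metadata": dict(metadata)}

    def update_url(self, doi, new_url):
        # Only the URL changes when content moves; the DOI stays the same.
        self._records[doi]["url"] = new_url

    def resolve(self, doi):
        # Readers following a cited DOI land on the current URL.
        return self._records[doi]["url"]


if __name__ == "__main__":
    registry = DoiRegistry()
    registry.deposit(
        "10.5555/example.1",  # placeholder DOI
        "https://journal.example.org/article/1",
        {"title": "An Example Article"},
    )
    registry.update_url("10.5555/example.1", "https://new-host.example.org/article/1")
    print(registry.resolve("10.5555/example.1"))
```

Citations keep pointing at the DOI, so moving the article only requires one update in the registry rather than a fix in every citing document.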

    OJS 2.4.5 includes several general improvements that benefit CrossRef members directly and indirectly. First, OJS now allows for automatic deposits to the CrossRef service - manually uploading data via CrossRef's web interface is no longer necessary. Second, users of the plugin will be able to deposit Open Researcher and Contributor Identifiers (ORCIDs), which OJS can now accept during the author registration and article submission processes.

    Additionally, this release allows OJS publishers to more easily participate in the LOCKSS archiving service of their choice (including the forthcoming PKP PLN Service).

    Finally, this new release will serve as the foundation for further integration of other CrossRef features and services, such as the deposit of FundRef funding data, and the CrossMark publication record service.

    "The release of OJS 2.4.5 signals a new strategic direction for PKP in the provision of enhanced publishing services, such as the new CrossRef plugin," said Brian Owen, Managing Director of PKP. "Our collaboration with CrossRef has enabled us to move up the development of features that our publishers have been asking for. The partnership doesn't end here, either. We're looking forward to supporting publishers more directly now that we are a Sponsoring Entity and to jointly develop tools that will make it easier for publishers to comply with CrossRef's outbound reference linking requirements."

    CrossRef Executive Director Ed Pentz noted, "The profile of CrossRef's member publishers has changed significantly over the years. We are growing by hundreds of members each year. Many of these publishers are small institution-based journals from around the world. And many are hosted by the open source OJS software. It has been challenging for some of these organizations to meet our membership obligations like outbound reference linking and arranging for long-term archiving. And many have not been able to participate in newer services, because they require the ability to deposit additional metadata. We want all of our publishers to have a level playing field, regardless of their size. Our cooperation with PKP will help make that happen."

    Journals and publishers that use OJS and that already have established a direct relationship with CrossRef, or those that have an interest in becoming members through PKP, may take advantage of the enhanced features in the new CrossRef plugin by upgrading to OJS 2.4.5. And starting now, eligible journals can apply for a PKP-sponsored CrossRef membership for free DOI support. See PKP's CrossRef page for more information.

    About PKP
    The Public Knowledge Project was established in 1998 at the University of British Columbia. Since that time PKP has expanded and evolved into an international and virtual operation with two institutional anchors at Stanford University and Simon Fraser University Library. OJS is open source software made freely available to journals worldwide for the purpose of making open access publishing a viable option for more journals, as open access can increase a journal's readership as well as its contribution to the public good on a global scale. More information about PKP and its software and services is available at

    About CrossRef

    CrossRef serves as a digital hub for the scholarly communications community. A global not-for-profit membership organization of scholarly publishers, CrossRef's innovations shape the future of scholarly communications by fostering collaboration among multiple stakeholders. CrossRef provides a wide spectrum of services for identifying, locating, linking to, and assessing the reliability and provenance of scholarly content.

    James MacGregor, PKP

    Carol Anne Meyer, CrossRef
    Phone +1 781-295-0072 x23

    DuraSpace News: Announcing the Release of the 2015 National Agenda For Digital Stewardship

    planet code4lib - Fri, 2014-10-03 00:00

    Washington, DC  The 2015 National Agenda for Digital Stewardship has been released!

    You can download a copy of the Executive Summary and Full Report here: 

    DuraSpace News: Webinar Recording Available

    planet code4lib - Fri, 2014-10-03 00:00

    Winchester, MA  The 8th DuraSpace Hot Topics Community Webinar Series, “Doing It: How Non-ARL Institutions are Managing Digital Collections” began October 2, 2014.  The first webinar in this series curated by Liz Bishoff, “Research Results on Non-ARL Academic Libraries Managing Digital Collections,” provided an overview of the methodology and key questions and findings of the Managing Digital Collections Survey of non-ARL academic libraries.  Participants also had the opportunity to share how

    PeerLibrary: Weekly PeerLibrary meeting finalizing our Knight News Challenge...

    planet code4lib - Thu, 2014-10-02 23:24

    Weekly PeerLibrary meeting finalizing our Knight News Challenge submission.

    Cynthia Ng: Access 2014: Closing Keynote Productivity and Collaboration in the Age of Digital Distraction

    planet code4lib - Thu, 2014-10-02 18:35
    The closing keynote for Access 2014. He spoke really fast, so apologies if I missed a couple of points. Presented by Jesse Brown Digital Media Expert, Futurist, Broadcast Journalist Background Bitstrips: To make fun cartoons. Co-founder. CBC show Podcast: Canadaland, broader view of media, in a global sense to what’s happening to society and culture. Technology Changing […]

    Evergreen ILS: Evergreen 2.7.0 has been released!

    planet code4lib - Thu, 2014-10-02 17:42

    Evergreen 2.7.0 has been released!

    Small delay in announcing, but here we go…

    Cheers and many thanks to everyone who helped to make Evergreen 2.7.0 a reality, our first official release of the 2.7 series! After six months of hard work with development, bug reports, testing, and documentation efforts, the 2.7.0 files are available on the Evergreen website’s downloads page:

    So what’s new in Evergreen 2.7? You can see the full release notes here: To briefly summarize though, there were contributions made for both code and documentation by numerous individuals. A special welcome and acknowledgement to all our first-time contributors, thanks for your contributions to Evergreen!

    Some caveats now… currently Evergreen 2.7.0 requires the use of the latest OpenSRF 2.4 series, which is still in its alpha release (beta coming soon). As folks help to test the OpenSRF release, this will no doubt help to make the Evergreen 2.7 series better. Also, for localization/i18n efforts, there was some last-minute bug finding and we plan to release updated translation files in the next maintenance release, 2.7.1, for the 2.7 series.

    Evergreen 2.7.0 includes a preview of the new web-based staff client code. The instructions for setting this up are being finalized by the community and should be expected for release during the next maintenance version 2.7.1 later in October.

    See here for some direct links to the various files so far:

    Once again, a huge thanks to everyone in the community who has participated this cycle to contribute new code, test and sign-off on features, and work on new documentation and other ongoing development efforts.


    — Ben

    Cynthia Ng: Access 2014: Day 3 Notes

    planet code4lib - Thu, 2014-10-02 17:09
    Final half day of Access 2014. The last stretch. ## RDF and Discovery in the Real World(cat) Karen Coombs, Senior Product Analyst, WorldShare Platform The Web of Data: Things, Not Strings Way for search engine to be the most relevant e.g. May 2012: Google Knowledge Graph provides more knowledge in search results. Traditionally in bibliographic description […]

    Jenny Rose Halperin: A new look for our Community Newsletter

    planet code4lib - Thu, 2014-10-02 16:18

    This post was featured on the Mozilla Community Blog


    If you’ve been wondering why you haven’t received the best in Mozilla’s community news in some weeks, it’s because we’ve been busy redesigning our newsletter in order to bring you even more great content.

    Non-profit marketing is no easy feat. Even with our team of experts here at Mozilla, we don’t always hit the bar when it comes to open rates, click-through rates, and other metrics that measure marketing success. For our community newsletter, I watched our metrics steadily decrease over the six-month period since we re-launched the newsletter and started publishing on a regular basis.

    It was definitely time for a makeover.

    Our community newsletter is a study in pathways and retention: How do we help people who have already expressed interest in contributing get involved and stay involved? What are some easy ways for people to join our community? How can communities come together to write inspiring content for the Web?

    At Mozilla, we put out three main newsletters: Firefox and You (currently on a brief hiatus), the Firefox Student Ambassadors newsletter, and our Mozilla Communities Newsletter (formerly called about:Mozilla).

    It was important to me to have the newsletter feel authentically like the voice of the community, to help people find their Mozillian way, and to point people in the direction of others who share their interests, opening up participation to a wider audience.

    A peer assist with Andrea Wood and Kelli Klein at the Mozilla Foundation helped me articulate what we needed and stay on-target with the newsletter’s goal to “provide the best in contribution opportunities at Mozilla.” Andrea demonstrated to me how the current newsletter was structured for consumption, not action, and directed me toward new features that would engage people with the newsletter’s content and eventually help them join us.

    I also took a class with Aspiration Tech on how to write emails that captivate as well as read a lot about non-profit email marketing. While some of it seemed obvious, my research also gave me an overview of the field, which allowed me to redesign the newsletter according to best practices.

    Here’s what I learned:

    1. According to M & R, which publishes the best (and most hilarious) study of non-profit email campaigns, our metrics were right on track with industry averages. Non-profit marketing emails have a mean open rate of 13%, with a deviation of 2.5% in either direction. This means that with an open rate between 15% and 25% we were actually doing better than other non-profit emails. What worried me was that our open rate decreased rapidly and steadily, signalling disengagement with the content.

    I came up with similar findings for our click-through rates: on par with the industry, but steadily decreasing. (From almost 5% on our first newsletter to less than 1.5% on our last, eek!)

    2. While I thought that our 70,000 subscribers put us safely in the “large email list” category, I learned that we are actually a small/medium newsletter according to industry averages! In terms of how we gain subscribers, I’m hoping that an increased social media presence as well as experiments with viral marketing (i.e. “forward this to a friend!”) will bring in new voices and new people to engage with our community.

    3. “The Five Second Rule” is perhaps the best rule I learned about email marketing. Have you captured the reader in three seconds? Can you open an email and know what it’s trying to ask you in five seconds? If not, you should redesign.

    4. Stories that asked people to take action were always the most clicked on stories in our last iteration. This is unsurprising, but “learn more” and “read more” don’t seem to move our readers. “Sign this petition” and “Sign up” were always well-received.

    5. There is no statistically “best time” to send an email newsletter. The best time to send an email newsletter is “when it’s ready.” While every two weeks is a good goal for the newsletter, sending it slightly less frequently will not take away from its impact.

    6. As M & R writes, “For everything, (churn churn churn) there is a season (churn, churn, churn)…” Our churn rate on the newsletter was pretty high (we lost and gained subscribers at a high rate). I’m hoping that our new regular features about teaching and learning as well as privacy will highlight what’s great about our community and how to take action.
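The open-rate comparison in the first point is simple arithmetic, sketched below. A mean of 13% with 2.5% either way gives a typical band of 10.5% to 15.5%. The benchmark numbers and the 70,000 list size come from the text; the function names and the open counts in the example are hypothetical, and the list size stands in for delivered emails.

```python
def rate_percent(events, delivered):
    """Return events (opens or clicks) as a percentage of delivered emails."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    return 100.0 * events / delivered


def compare_to_benchmark(rate, mean=13.0, spread=2.5):
    """Classify a rate against the band mean - spread .. mean + spread."""
    low, high = mean - spread, mean + spread
    if rate < low:
        return "below"
    if rate > high:
        return "above"
    return "within"


if __name__ == "__main__":
    # Hypothetical open counts for a 70,000-subscriber list.
    for opens in (17_500, 10_500):
        rate = rate_percent(opens, 70_000)
        print(f"{rate:.1f}% open rate is {compare_to_benchmark(rate)} the typical band")
```

Under these assumptions a 25% open rate sits well above the band, while 15% falls at its upper end, which matches the worry that the trend mattered more than any single number.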

    And now to the redesign!

    The first thing you’ll notice is that our newsletter is now called “Mozilla Communities.” We voted on the new name a few weeks ago after the Grow Mozilla call. Thanks to everyone who gave feedback.

    An overview of the newsletter’s new look.

    While the overall feel remains the same and is in line with other Mozilla-branded newsletters, the new look incorporates a few “evergreen” opportunities and actions you can take above the fold and features a contributor in their own words. (For the draft of the new design, that contributor is me!) The easy actions on the left hand side will rotate out as needed and increase in commitment level as you read down the page. Also, take a look at the awesome logo from Christie Koehler!


    The next section presents rotating features on our privacy and educational initiatives. Privacy and education span a variety of functional areas, so this section could be populated by a variety of community endeavors. At the bottom of these sections, there’s a Facebook post and Tweet that you can post to easily take action, promote our communities, and get social to protect the Internet.


    The next section features a story that engages the reader to take action! (In this case it invites readers into our awesome new gear store…) This story about Mozilla communities will rotate out according to the content that you submit. It will also be action-oriented, easy, and fun.

    This last story is optional and will be rotated in and out according to testing during the first few issues. (Early feedback feared that there were too many stories.) In the draft design, we’re announcing a new contribution area. This will be a place for new community contribution areas, pathways, and opportunities to connect. The new photo section, “Mozillian Moments,” replaces our “Photo of the Week” section from the last iteration.


    Finally, the footer reminds the reader that this newsletter is community-created and community-supported. It also invites readers to join us on social media. In the upcoming issues, the newsletter will also link to the new “Guides” forum that will help contributors find mentorship opportunities and connect with their fellow Mozillians.


    What we need from you:

    1. We need writers, coders, social media gurus, copy editors, and designers who are interested in consistently testing and improving the newsletter. The opportunity newsletter is a new contribution area on the October 15th relaunch of the Get Involved page (under the “Writing –> Journalism” drop down choice) and I’m hoping that will engage new contributors as well.

    2. A newsletter can’t run without content, and we experimented with lots of ways to collect that content in the last few months. Do you have content for the newsletter? Do you want to be a featured contributor? Reach out to mozilla-communities at mozilla dot com.

    3. Feedback requested! I put together an Etherpad that asks specific questions about improving the design. Please put your feedback here or leave it in the comments.

    The newsletter is a place for us to showcase our work and connect with each other. We can only continue improving, incorporating best practices, and connecting more deeply and authentically through our platforms. Thank you to everyone who helped in the Mozilla Communities redesign and to all of you who support Mozilla communities every day.

    Ed Summers: why @congressedits?

    planet code4lib - Thu, 2014-10-02 16:07

    Note: as with all the content on this blog, this post reflects my own thoughts about a personal project, and not the opinions or activities of my employer.

    Two days ago a retweet from my friend Ian Davis scrolled past in my Twitter stream:

    This Twitter bot will show whenever someone edits Wikipedia from within the British Parliament. It was set up by @tomscott using @ifttt.

    — Parliament WikiEdits (@parliamentedits)

    July 8, 2014

    The simplicity of combining Wikipedia and Twitter in this way immediately struck me as a potentially useful transparency tool. So using my experience on a previous side project I quickly put together a short program that listens to all major language Wikipedias for anonymous edits from Congressional IP address ranges (thanks Josh) … and tweets them.
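The core filtering step described above can be sketched with Python's standard `ipaddress` module. This is not the code behind @congressedits, only a minimal illustration: the shape of the `edit` dictionary is hypothetical, and the watched ranges are reserved documentation networks standing in for real Congressional ranges.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real Congressional ones.
WATCHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_watched_anonymous_edit(edit, ranges=WATCHED_RANGES):
    """Return True if an anonymous edit came from one of the watched ranges.

    Assumes `edit` is a dict with an "anonymous" flag and a "user" field
    that holds the editor's IP address when the edit is anonymous.
    """
    if not edit.get("anonymous"):
        return False
    try:
        address = ipaddress.ip_address(edit.get("user", ""))
    except ValueError:
        # The user field was not an IP address after all.
        return False
    return any(address in network for network in ranges)


if __name__ == "__main__":
    edit = {"anonymous": True, "user": "192.0.2.17", "title": "Example article"}
    if is_watched_anonymous_edit(edit):
        print(f"{edit['title']} was edited anonymously from a watched range")
```

A real bot would also need a live feed of edits and a way to post the result; those parts are left out here because they depend on services not described in the post.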

    In less than 48 hours the @congressedits Twitter account had more than 3,000 followers. My friend Nick set up gccaedits for Canada using the same software … and @wikiAssemblee (France) and @RiksdagWikiEdit (Sweden) were quick to follow.

    Watching the followers rise, and the flood of tweets from them, brought home something that I believed intellectually, but hadn’t felt quite so viscerally before. There is an incredible yearning in this country and around the world for using technology to provide more transparency about our democracies.

    Sure, there were tweets and media stories that belittled the few edits that have been found so far. But by and large people on Twitter have been encouraging, supportive and above all interested in what their elected representatives are doing. Despite historically low approval ratings for Congress, people still care deeply about our democracies, our principles and dreams of a government of the people, by the people and for the people.

    We desperately want to be part of a more informed citizenry, that engages with our local communities, sees the world as our stage, and the World Wide Web as our medium.

    Consider this thought experiment. Imagine if our elected representatives and their staffers logged in to Wikipedia, identified much like Dominic McDevitt-Parks (a federal employee at the National Archives) and used their knowledge of the issues and local history to help make Wikipedia better? Perhaps in the process they enter into conversation in an article’s talk page, with a constituent, or political opponent and learn something from them, or perhaps compromise? The version history becomes a history of the debate and discussion around a topic. Certainly there are issues of conflict of interest to consider, but we always edit topics we are interested in and knowledgeable about, don’t we?

    I think there is often fear that increased transparency can lead to increased criticism of our elected officials. It’s not surprising given the way our political party system and media operate: always looking for scandal, and the salacious story that will push public opinion a point in one direction, to someone’s advantage. This fear encourages us to clamp down, to decrease or obfuscate the transparency we have. We all kinda lose, irrespective of our political leanings, because we are ultimately less informed.

    I wrote this post to make it clear that my hope for @congressedits wasn’t to expose inanity, or belittle our elected officials. The truth is, @congressedits has only announced a handful of edits, and some of them are pretty banal. But can’t a staffer or politician make a grammatical change, or update an article about a movie? Is it really news that they are human, just like the rest of us?

    I created @congressedits because I hoped it could engender more, better ideas and tools like it. More thought experiments. More care for our communities and peoples. More understanding, and willingness to talk to each other. More humor. More human.

    @Congressedits is why we invented the Internet

    — zarkinfrood (@zarkinfrood)

    July 11, 2014

    I’m pretty sure zarkinfrood meant @congressedits figuratively, not literally. As if perhaps @congressedits was emblematic, in its very small way, of something a lot bigger and more important. Let’s not forget that when we see the inevitable mockery and bickering in the media. Don’t forget the big picture. We need transparency in our government more than ever, so we can have healthy debates about the issues that matter. We need to protect and enrich our Internet, and our Web … and to do that we need to positively engage in debate, not tear each other down.

    Educate and inform the whole mass of the people. Enable them to see that it is their interest to preserve peace and order, and they will preserve them. And it requires no very high degree of education to convince them of this. They are the only sure reliance for the preservation of our liberty. — Thomas Jefferson

    Who knew TJ was a Wikipedian…

    Ed Summers: Social Machines and the Archive

    planet code4lib - Thu, 2014-10-02 16:06

    Yesterday MIT announced that Twitter made a 5 million dollar investment to help them create a Laboratory for Social Machines (LSM) as part of the MIT Media Lab proper:

    MIT launches Laboratory for Social Machines with major Twitter investment @MITLSM @dkroy

    — MIT Media Lab (@medialab) October 1, 2014

    It seems like an important move for MIT to formally recognize that social media is a new medium that deserves its own research focus, and investment in infrastructure. The language on the homepage gives a nice flavor for the type of work they plan to be doing. I was particularly struck by their frank assessment of how our governance systems are failing us, and social media’s potential role in understanding and helping solve the problems we face:

    In a time of growing political polarization and institutional distrust, social networks have the potential to remake the public sphere as a realm where institutions and individuals can come together to understand, debate and act on societal problems. To date, large-scale, decentralized digital networks have been better at disrupting old hierarchies than constructing new, sustainable systems to replace them. Existing tools and practices for understanding and harnessing this emerging media ecosystem are being outstripped by its rapid evolution and complexity.

    Their notion of “social machines” as “networked human-machine collaboratives” reminds me a lot of my somewhat stumbling work on @congressedits and archiving Ferguson Twitter data. As Nick Diakopoulos has pointed out, we really need a theoretical framework for thinking about what sorts of interactions these automated social media agents can participate in, formulating their objectives, and measuring their effects. Full disclosure: I work with Nick at the University of Maryland, but he wrote that post mentioning me before we met here, which was kind of awesome to discover after the fact.

    Some of the news stories about the Twitter/MIT announcement have included this quote from Deb Roy from MIT who will lead the LSM:

    The Laboratory for Social Machines will experiment in areas of public communication and social organization where humans and machines collaborate on problems that can’t be solved manually or through automation alone.

    What a lovely encapsulation of the situation we find ourselves in today, where the problems we face are localized and yet global. Where algorithms and automation are indispensable for analysis and data gathering, but people and collaborative processes are all the more important. The ethical dimensions of algorithms and our understanding of them are also of growing importance, as the stories we read are mediated more and more by automated agents. It is super that Twitter has decided to help build this space at MIT where people can answer these questions, and have the infrastructure to support asking them.

    When I read the quote I was immediately reminded of the problem that some of us were discussing at the last Society of American Archivists meeting in DC: how do we document the protests going on in Ferguson?

    Much of the primary source material was being distributed through Twitter. Internet Archive were looking for nominations of URLs to use in their web crawl. But weren’t all the people tweeting about Ferguson including URLs for stories, audio and video that were of value? If people are talking about something can we infer its value in an archive? Or rather, is it a valuable place to start inferring from?

    I ended up archiving 13 million of the tweets that mention “ferguson” for the 2 week period after the killing of Michael Brown. I then went through the URLs in these tweets, and unshortened them and came up with a list of 417,972 unshortened URLs. You can see the top 50 of them here, and the top 50 for August 10th (the day after Michael Brown was killed) here.

    I did a lot of this work in prototyping mode, writing quick one-off scripts to do this and that. One nice unintended side effect was unshrtn, a microservice for unshortening URLs, which John Kunze gave me the idea for years ago. It gets a bit harder when you are unshortening millions of URLs.
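    The unshorten-then-rank step can be sketched roughly as below. This is only an illustration and not how unshrtn itself is implemented: `unshorten`, `top_urls` and the `next_hop` callable are hypothetical names, and `next_hop` stands in for whatever actually asks the network where a URL redirects to. Caching each short URL matters once the same links show up in millions of tweets.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def unshorten(url: str,
              next_hop: Callable[[str], Optional[str]],
              max_hops: int = 10) -> str:
    """Follow redirects from `url` until no further hop is reported.

    `next_hop` is a hypothetical callable that returns the redirect
    target of a URL, or None when the URL does not redirect.
    """
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None or target in seen:  # final URL, or a redirect loop
            break
        seen.add(target)
        url = target
    return url


def top_urls(urls: Iterable[str],
             next_hop: Callable[[str], Optional[str]],
             n: int = 50):
    """Unshorten every URL (resolving each distinct one once) and rank them."""
    cache = {}
    counts = Counter()
    for short in urls:
        if short not in cache:
            cache[short] = unshorten(short, next_hop)
        counts[cache[short]] += 1
    return counts.most_common(n)
```

    In a quick test, `next_hop` can simply be the `get` method of a dictionary that maps short URLs to their targets, so no network access is needed.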

    But what would a tool look like that let us analyze events in social media, and helped us (archivists) collect information that needs to be preserved for future use? These tools are no doubt being created by those in positions of power, but we need them for the archive as well. We also desperately need to explore what it means to explore these archives: how do we provide access to them, and share them? It feels like there could be a project here along the lines of what George Washington University are doing with their Social Feed Manager. Full disclosure again: I’ve done some contracting work with the fine folks at GW on a new interface to their library catalog.

    The 5 million dollars aside, an important contribution that Twitter is making here (that’s probably worth a whole lot more) is firehose access to the Tweets that are happening now, as well as the historic data. I suspect Deb Roy’s role at MIT as a professor and as Chief Media Scientist at Twitter helped make that happen. Since MIT has such a strong history of supporting open research, it will be interesting to see how the LSM chooses to share data that supports its research.

    Library of Congress: The Signal: Residency Program Success Stories, Part One

    planet code4lib - Thu, 2014-10-02 13:34

    The following is a guest post by Julio Díaz Laabes, HACU intern and Program Management Assistant at the Library of Congress.

    Coming on the heels of a successful beginning for the Boston and New York set of cohorts, the National Digital Stewardship Residency Program is becoming a model for digital stewardship residencies on a national scale. This residency program, funded by the Institute of Museum and Library Services, offers recent master’s and doctoral program graduates in specialized fields (library science, information science, museum studies, archival studies and others) the opportunity to gain professional experience in the field of digital preservation.

    Clockwise from top left: Lee Nilsson, Maureen McCormick Harlow, Erica Titkemeyer and Heidi Elaine Dowding.

    The inaugural year of the NDSR program was completed in May of 2014. During this year, ten residents were placed in various organizations in the Washington, DC area. Since completing the program, all ten residents are now working in positions related to the field of digital preservation! Here are some accounts of how the program has impacted each resident’s life and where they are now in their careers.

    Lee Nilsson is employed in a contract position as a junior analyst at the Department of State, Bureau of International Information Programs. Specifically, he is working in the analytics office on foreign audience research. On how the residency helped him, Lee said, “The residency got me to D.C and introduced me to some great people. Without NDSR I would not have made it this far.” Furthermore, Lee commented that the most interesting aspect of his job is “the opportunity to work with some very talented people on some truly global campaigns.”

    Following the residency, Maureen McCormick Harlow accepted a permanent position as the new Digital Librarian at PBS (Public Broadcasting Service). She works in the Media Library and her tasks include consulting on the development of the metadata schema for an enterprise-wide digital asset management system, fulfilling archival requests for legacy materials and working with copyright holders to facilitate the next phase of a digitization project (which builds on the NDSR project of Lauren Work). Maureen stated that NDSR helped her “to foster and cultivate a network of digital preservationists and practitioners in the DC area over the nine months that I participated in it.” An interesting aspect of her job is working with the history of PBS and learning about PBS programming to see how it has changed over the years.

    On an international scale, Heidi Elaine Dowding is currently in a three-year PhD Research Fellow position at the Royal Dutch Academy of Arts and Sciences Huygens ING Institute. This position is funded through the European Commission. “My research involves the long-term publication and dissemination of digital scholarly editions, so aspects of digital preservation will be key,” said Heidi. On the best part of her position, Heidi said, “I am lucky enough to be fully funded, which allows me to focus on my studies. This gives me that opportunity to research things that I am interested in every day.”

    Erica Titkemeyer is currently employed at the University of North Carolina at Chapel Hill as the Project Director and AV Conservator for the Southern Folklife Collection. This position was created as part of a grant-funded initiative to research and analyze workflows for the mass reformatting and preservation of legacy audiovisual materials. “NDSR allotted me the opportunity to participate in research and projects related to the implementation of digital preservation standards. It provided me access to a number of networking events and meetings related to digital stewardship.” In her position, she hopes to help see improved access to the collections, while also having the opportunity to learn more about the rich cultural content they contain.

    Given these success stories, the National Digital Stewardship Residency has proven to be an invaluable program, providing opportunity for real-world practical experience in the field of digital preservation. Also, the diversity of host institutions and location areas across major U.S. cities gives residents the opportunity to build up an extensive network of colleagues, practitioners and potential employers in diverse fields. Stay tuned for part two of this blog post, which will showcase the remaining residents of the 2013-2014 Washington, D.C. cohort.

    Peter Murray: Thursday Threads: Mobile Device Encryption, Getty Images for Free

    planet code4lib - Thu, 2014-10-02 10:42

    Just a brief pair of threads this week. First is a look at what is happening with mobile device encryption as consumer electronics companies deal with data privacy in the post-Snowden era. There is also the predictable backlash from law enforcement organizations, and perhaps I just telegraphed how I feel on the matter. The second thread looks at how Getty Images is trying to get into distributing its content for free to get it in front of eyeballs that will end up paying for some of it.

    Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted to are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.

    Apple and Android Device Data Encryption

    In an open letter posted on Apple’s website last night, CEO Tim Cook said that the company’s redesigned its mobile operating system to make it impossible for Apple to unlock a user’s iPhone data. Starting with iOS8, only the user who locked their phone can unlock it.

    This is huge. What it means is that even if a foreign government or a US police officer with a warrant tries to legally compel Apple to snoop on someone, they won’t. Because they can’t. It’s a digital Ulysses pact.

    - Apple Will No Longer Let The Cops Into Your Phone, By PJ Vogt, TL;DR blog, 18-Sep-2014

    The next generation of Google’s Android operating system, due for release next month, will encrypt data by default for the first time, the company said Thursday, raising yet another barrier to police gaining access to the troves of personal data typically kept on smartphones.

    - Newest Androids will join iPhones in offering default encryption, blocking police, by Craig Timberg, The Washington Post, 18-Sep-2014

    Predictably, the US government and police officials are in the midst of a misleading PR offensive to try to scare Americans into believing encrypted cellphones are somehow a bad thing, rather than a huge victory for everyone’s privacy and security in a post-Snowden era. Leading the charge is FBI director James Comey, who spoke to reporters late last week about the supposed “dangers” of giving iPhone and Android users more control over their phones. But as usual, it’s sometimes difficult to find the truth inside government statements unless you parse their language extremely carefully. So let’s look at Comey’s statements, line-by-line.

    - Your iPhone is now encrypted. The FBI says it'll help kidnappers. Who do you believe? by Trevor Timm, Comment is free on, 30-Sep-2014

    I think it is fair to say that Apple snuck this one in on us. To the best of my knowledge, the new encrypted-by-default wasn’t something talked about in the iOS8 previews. And it looks like poor Google had to play catch-up by announcing on the same day that they were planning to do the same thing with the next version of the Android operating system. (If Apple and Google conspired to make this announcement at the same time, I haven’t heard that either.)

    As you can probably tell by the quote I pulled from the third article, I think this is a good thing. I believe the pendulum has swung too far in the direction of government control over communications, and Apple/Google are right to put new user protections in place. This places the process of accessing personal information firmly back in the hands of the judiciary through court orders to compel people and companies to turn over information after probable cause has been shown. There is nothing in this change that prevents Apple/Google from turning over information stored on cloud servers to law enforcement organizations. It does end the practice of law enforcement officers randomly seizing devices and reading data off them.

    As an aside, there is an on-going discussion about the use of so-called “stingray” equipment that impersonates mobile phone towers to capture mobile network data. The once-predominant 2G protocol that the stingray devices rely on was woefully insecure, and the newer 3G and 4G mobile carrier protocols are much more secure. In fact, stingray devices are known to jam 3G/4G signals to force mobile devices to use the insecure 2G protocol. Mobile carriers are planning to turn off 2G protocols in the coming years, though, which will make the current generation of stingray equipment obsolete.

    Getty Offers Royalty-Free Photos

    The story of the photography business over the past 20 years has been marked by two shifts: The number of photographs in circulation climbs toward infinity, and the price that each one fetches falls toward zero. As a result, Getty Images, which is in the business of selling licensing rights, is increasingly willing to distribute images in exchange for nothing more than information about the public’s photo-viewing habits.

    Now Getty has just introduced a mobile app, Stream, targeted at nonprofessionals to run on Apple’s new operating system. The app lets people browse through Getty’s images, with special focus on curated collections. It’s sort of like a version of Instagram (FB) featuring only professional photographers—and without an upload option.

    - Getty's New App Is Part of Its Plan to Turn a Profit From Free Photos, by Joshua Brustein, Businessweek, 19-Sep-2014

    Commercial photography is another content industry — like mass-market and trade presses, journal publishers, newspapers, and many others — that is facing fundamental shifts in its business models. In this case, Getty is going the no-cost, embed-in-a-web-page route to getting their content to more eyeballs. They announced the Getty Images Embed program a year ago, and have now followed it up with this iOS app for browsing the collection of royalty-free images.

    State Library of Denmark: What is high cardinality anyway?

    planet code4lib - Thu, 2014-10-02 09:45

    An attempt to explain sparse faceting and when to use it in not-too-technical terms. Sparse faceting in Solr is all about speeding up faceting on high-cardinality fields for small result sets. That’s a clear statement, right? Of course not. What is high, what is small and what does cardinality mean? Dmitry Kan has spent a lot of time testing sparse faceting with his high-cardinality field, without getting the promised performance increase. Besides unearthing a couple of bugs with sparse faceting, his work made it clear that there is a need for better documentation. Independent testing for the win!

    What is faceting?

    When we say faceting in this context, it means performing a search and getting a list of terms for a given field. The terms are ordered by their count, which is the number of times they are referenced by the documents in the search result. A classic example is a list of authors:

    The search for "fairy tale" gave 15 results «hits».
    Author «field»
    - H.C. Andersen «term» (12 «count»)
    - Brothers Grimm «term» (5 «count»)
    - Lewis Carroll «term» (3 «count»)

    Note how the counts sum up to more than the number of documents: A document can have more than one reference to terms in the facet field. It can also have 0 references, all depending on the concrete index. In this case, there are either more terms than are shown or some of the documents have more than one author. There are other forms of faceting, but they will not be discussed here.

    Under the hood

    At the abstract level, faceting in Solr is quite simple:

    1. A list of counters is initialized. It has one counter for each unique term in the facet field in the full corpus.
    2. All documents in the result set are iterated. For each document, a list of its references to terms is fetched.
      1. The references are iterated and for each one, the counter corresponding to its term is increased by 1.
    3. The counters are iterated and the Top-X terms are located.
    4. The actual Strings for the located terms are resolved from the index.
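
    The four steps above can be sketched in Python as follows. This is a minimal illustration of the counting scheme, not Solr's actual code: `facet_counts`, `term_list` (term id to string) and `doc_refs` (document id to the term ids it references) are hypothetical names chosen for the example.

```python
import heapq


def facet_counts(term_list, doc_refs, result_docs, top_x=3):
    """Count term references for the documents in a search result.

    term_list:   list of unique terms in the facet field (index = term id)
    doc_refs:    dict mapping document id -> list of term ids it references
    result_docs: iterable of document ids in the search result
    """
    # Step 1: one counter for each unique term in the full corpus.
    counters = [0] * len(term_list)

    # Step 2: iterate the result set and count every reference.
    for doc in result_docs:
        for term_id in doc_refs.get(doc, ()):
            counters[term_id] += 1

    # Step 3: iterate all counters and locate the Top-X term ids.
    top_ids = heapq.nlargest(top_x, range(len(counters)),
                             key=counters.__getitem__)

    # Step 4: resolve the actual strings for the located terms.
    return [(term_list[t], counters[t]) for t in top_ids if counters[t] > 0]
```

    Note that steps 1 and 3 touch every term in the corpus regardless of how small the result set is; that fixed overhead is what the rest of the post is about.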

    Sparse faceting improves on standard Solr in two ways:

    • Standard Solr allocates a new list of counters in step 1 for each call, while sparse re-uses old lists.
    • Standard Solr iterates all the counters in step 3, while sparse only iterates the ones that were updated in step 2.
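
    One way to get both properties is to remember which counters were touched, as in the sketch below. The post does not describe the internals of the real implementation, so the tracking list and the `SparseCounters` class are assumptions made purely for illustration.

```python
class SparseCounters:
    """Reusable counter list that remembers which entries were updated."""

    def __init__(self, num_terms):
        self.counts = [0] * num_terms  # allocated once, reused across calls
        self.touched = []              # term ids updated during this call

    def increment(self, term_id):
        if self.counts[term_id] == 0:
            self.touched.append(term_id)  # first update: start tracking it
        self.counts[term_id] += 1

    def top(self, top_x):
        # Only the touched counters are inspected, not the full list.
        pairs = [(t, self.counts[t]) for t in self.touched]
        pairs.sort(key=lambda p: (-p[1], p[0]))
        return pairs[:top_x]

    def clear(self):
        # Resetting costs time proportional to the touched entries only,
        # which is what makes reusing the list cheap.
        for t in self.touched:
            self.counts[t] = 0
        self.touched.clear()
```

    With a million terms and a result set that references three of them, `top` and `clear` each look at three counters instead of a million.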

    Distributed search is different

    Distributed faceting in Solr adds a few steps:

    • All shards are issued the same request by a coordinating Solr. They perform steps 1-4 above and return the results to the coordinator.
    • The coordinator merges the shard-responses into one structure and extracts the Top-X terms from that.
    • For each Top-X term, its exact count is requested from the shards that did not deliver it as part of the first step.

    Standard Solr handles each exact-count separately by performing a mini-search for the term in the field. Sparse reuses the filled counters from step 2 (or repeats steps 1-2 if the counter has been flushed from the cache) and simply locates the counters corresponding to the terms. Depending on the number of terms, sparse is much faster (think 5-10x) than standard Solr for this task. See Ten times faster for details.
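
    The coordinator's part can be sketched as below. It is a simplified model under stated assumptions rather than Solr's protocol: each shard is represented by a plain dictionary of its filled counters (term to count for the current result set), and `distributed_facet` is a hypothetical name. The exact-count phase is done here as a lookup in those counters, which is the sparse approach; standard Solr would run a mini-search per term instead.

```python
from collections import Counter


def distributed_facet(shards, top_x):
    """Merge per-shard facet results and fill in missing exact counts.

    shards: list of dicts mapping term -> count within that shard's part
            of the result set (the shard's filled counters).
    """
    # Phase 1: every shard answers the same request with its local Top-X.
    responses = [dict(Counter(shard).most_common(top_x)) for shard in shards]

    # Phase 2: the coordinator merges the responses and extracts Top-X.
    merged = Counter()
    for response in responses:
        merged.update(response)
    candidates = [term for term, _ in merged.most_common(top_x)]

    # Phase 3: ask for exact counts from the shards that did not deliver
    # the term in phase 1 (here: a lookup in the shard's counters).
    exact = {}
    for term in candidates:
        total = 0
        for shard, response in zip(shards, responses):
            total += response[term] if term in response else shard.get(term, 0)
        exact[term] = total
    return sorted(exact.items(), key=lambda p: (-p[1], p[0]))
```

    The third phase is needed because a term that is globally popular may have fallen just outside the local Top-X on some shards, so the merged count from the second phase can be too low.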

    What is cardinality?

    Down to earth, cardinality just means how many there are of something. But what thing? The possibilities for faceting are many: Documents, fields, references and terms. To make matters worse, references and terms can be counted for the full corpus as well as just the search result.

    • Performance of standard Solr faceting is linear in the number of unique terms in the corpus in steps 1 and 3 and linear in the number of references in the search result in step 2.
    • Performance of sparse faceting is (nearly) independent of the number of unique terms in the corpus and linear in the number of references in the search result in steps 2 and 3.

    Both standard Solr and sparse treat each field individually, so both scale linearly with the number of fields. The documents returned as part of the base search are themselves represented in a sparse structure (independent of sparse faceting) that scales with result set size. While it does take time to iterate over these documents, this is normally dwarfed by the other processing steps. Ignoring the devils in the details: Standard Solr facet performance scales with the full corpus size as well as the result size, while sparse faceting scales just with the result size.

    Examples please!
    • For faceting on URL in the Danish Web Archive, cardinality is very high for documents (5bn), references (5bn) and terms (5bn) in the corpus. The overhead of performing a standard Solr faceting call is huge (hundreds of milliseconds), due to the high number of terms in the corpus. As the typical search results are quite a lot smaller than the full corpus, sparse faceting is very fast.
    • For faceting on host in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. However, the number of terms (1.3m) is more modest. The overhead of performing a standard Solr faceting call is quite small (a few milliseconds), due to the modest number of terms; the time used in step 2, which is linear in the number of references, is often much higher than the overhead. Sparse faceting is still faster in most cases, but only by a few milliseconds. Not much if the total response time is hundreds of milliseconds.
    • For faceting on content_type_norm in the Danish Web Archive, cardinality is very high for documents (5bn) and references (5bn) in the corpus. It is extremely small for the number of unique terms, which is 10. The overhead of performing a standard Solr faceting call is practically zero; the time used in step 2, which is linear in the number of references, is often much higher than the overhead. Sparse faceting is never faster than Solr for this and as a consequence falls back to standard counting, making it perform at the same speed.
    • For faceting on author at the library index at Statsbiblioteket, cardinality is high for documents (15m), references (40m) and terms (9m) in the corpus. The overhead of performing a standard Solr faceting call is noticeable (tens of milliseconds), due to the 9m terms in the corpus. The typical search result is well below 8% of the full corpus, and sparse faceting is markedly faster than standard Solr. See Small Scale Sparse Faceting for details.

    DPLA: DPLA Brings National Attention to the Blue Earth County Historical Society

    planet code4lib - Thu, 2014-10-02 05:00

    William and Jane Jones farm, Blue Earth County, Minnesota, ca.1888. Courtesy of the Blue Earth County Historical Society via the Minnesota Digital Library.

    The Blue Earth County Historical Society (BECHS), founded in 1901, is located in Mankato, Minnesota. The content submitted by BECHS to the Minnesota Digital Library (MDL) and the Digital Public Library of America (DPLA) is unique to this county of Minnesota. Our collection chronicles the people, places, and events that shaped Blue Earth County from our agricultural roots to the rise of our cities.  The images tell a variety of stories and showcase all walks of life across decades of time.  All of the images have been donated to BECHS by people who wanted to preserve the past for future generations.

    BECHS was honored when the DPLA selected one of our images to represent MDL when it came on as a partner in April 2013. The image was of Dr. G. A. Dahl posing for a photograph at a local photography studio. The interesting aspect of this image was that there was a photographer in the image as well as Dr. Dahl. It seems Dahl was fascinated with photography as he had interior photographs taken of his home and office, in a time when photographs of living spaces were rare. These images are also a part of our collection.

    Hubbard House from Broad Street with four women, Mankato, Minnesota, ca.1900. Courtesy of the Blue Earth County Historical Society via the Minnesota Digital Library.

    Our involvement in MDL helps people across Minnesota locate our images and have access to our collection. BECHS’ involvement with the DPLA has amplified that reach. People from across the country, and the world, are able to locate our images, which gives the user a glimpse into our collection and history. Based on the analytics from our MDL webpage, the DPLA is the top referrer of visitors to our webpage. Because the DPLA is a national resource, people can search this one site to find images from different locations and be directed back to the image’s home location. It is an excellent resource, especially for genealogists.

    The Blue Earth County Historical Society is located in Mankato, Minnesota. BECHS was founded in 1901 in preparation for the semi-centennial of Mankato and Blue Earth County. In 1938, BECHS purchased the Hubbard family home and opened the first public history museum. BECHS operated from this location for 50 years before moving into our current location. The Hubbard House was placed on the National Register of Historic Places in 1978 and is still operated by BECHS as a living history museum. As we enter into our 113th year, BECHS has upcoming expansion plans in our current facility and will continue to collect, preserve and present the history of Blue Earth County for present and future generations.

    Featured image credit: Detail from Portrait of Dr. G. A. Dahl, Mankato, Minnesota, ca.1900. Keene, George E. Courtesy of the Blue Earth County Historical Society via the Minnesota Digital Library.

     All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

    William Denton: Michael Collins on The Great Eastern

    planet code4lib - Thu, 2014-10-02 04:22

    I’ve written about The Great Eastern a couple of times: once in April, when I had just started to listen to all the episodes for the fifth or sixth time, and then briefly in August with a quote about libraries. I finished listening to it all last month. It’s still a masterpiece, one of the finest radio comedies and one of the richest and deepest works of radio fiction ever.

    Michael Collins has been writing long pieces about it on his web site and they are mandatory reading if you know the show at all:

    Mack Furlong, who played host Paul Moth, won a John Drainie award recently. He was impeccable on the show.

    Hydra Project: Hydra Connect #2 – reports

    planet code4lib - Thu, 2014-10-02 00:24

    170 or so people are gathered together in Cleveland, Ohio, for Hydra Connect #2 – the second annual Hydra get-together. If you weren’t able to come (and even if you were) you’ll find increasing numbers of presentations and meeting notes hanging off the program page at


