
District Dispatch: Upping people’s digital IQ

planet code4lib - Fri, 2016-06-03 17:46

OITP’s Larra Clark participates in roundtable discussion, “What’s Your Digital IQ?”

It was my pleasure last week to join the Council of Better Business Bureaus (BBB), Nielsen and the Multicultural Media, Telecom and Internet Council (MMTC) in a roundtable discussion on the importance of digital empowerment. “What’s Your Digital IQ?” was opened by Congressman Gus Bilirakis (R-FL) and former Federal Trade Commissioner Julie Brill talking about the importance of the $1 trillion digital economy and the need for tools to help people be smart online and protect themselves against hackers and scams.

Brill referenced recent analysis from the National Telecommunications and Information Administration (NTIA) citing that a lack of trust in Internet privacy and security may deter online activities. Forty-five percent of online households reported that privacy and security concerns stopped them from conducting financial transactions, buying goods or services, posting on social networks or expressing opinions on controversial or political issues via the Internet. (This last finding reminded me of past research by the Pew Research Center on Social Media and the Spiral of Silence.)

“Consumers need help,” Brill said. “Digital literacy and consumer education are necessary” to address privacy concerns and keep online economic activities humming.

The BBB has begun to take up this charge with its Digital IQ initiative, and Genie Barton, vice president and director of the BBB’s Online Interest-Based Advertising Accountability Program and Mobile Marketing Initiatives, discussed its commitment to building a trusted marketplace. Nicol Turner-Lee, vice president and chief research & policy officer for MMTC, affirmed the need for increasing people’s digital savvy—particularly among communities of color.

Interestingly, when I took the Digital IQ “challenge,” it reminded me a lot of the digital and information literacy that happens in libraries. One question asked if all the useful results of a web search are found on the first two pages. Others asked about http vs. https, giving personal information for loyalty programs, and offered food for thought regarding online advertising.

Libraries have long been champions of the right to privacy and teachers/guides for improving the digital skills of our community members. Librarians in all types of libraries help youth and new Internet users better understand and protect their digital “footprint” and be smarter online. According to the Digital Inclusion Survey, 57 percent of public libraries report they offer training on safe online practices. And PLA’s online hub for digital literacy support and training provides modules on Internet privacy and online scams. (Thanks also to librarians who emailed me before the panel to share some of your new or favorite resources, like the San Jose Public Library Virtual Privacy Lab and 10 Tips for Protecting Your Digital Privacy.)

Pew researcher John Horrigan specifically calls out libraries as part of the solution for increasing digital readiness. “Libraries, who are already the primary curator on programs to encourage digital readiness in many communities, should embrace and expand that role.”

I think the BBB and libraries could do great things together in this space. Are any of you out there already working with your local BBB? Let me know at

The post Upping people’s digital IQ appeared first on District Dispatch.

LITA: Let’s look at gender in Library IT

planet code4lib - Fri, 2016-06-03 17:42

So. Let’s talk about library technology organizations and gender.

I attended LITA Forum 2015 last year, and like many good attendees, I tweeted thoughts as I went. Far more popular in the Twitterverse than anything original I sent out was a simple summary of a slide in a presentation by Angi Faiks, “Girls in Tech: A gateway to diversifying the library workforce.”

The tweet in question was:

That this struck a chord is shocking, presumably, to no one.

The slide that prompted my tweet (a) references a 2009 article by Melissa Lamont that you should read, and (b) briefly presents (among other interesting data) numbers from the 2014-2015 ARL Annual Salary Survey (paywalled).

What is the problem symptom?

Given the popularity of the tweet, I thought I’d dig a little deeper and see what I could find out about Library IT and gender, with the expectation that it would be pretty disappointing.

Spoiler alert: it is.

Before you start thinking, “But…I work in a library, where it’s all mutual respect and a near-perfect meritocracy as far as the eye can see,” well, think again. The overall message I received during conversations on the edges of the conference was that women — especially young women — are often ignored, and their talents squandered, in the higher-tech side of the library world. And when you move away from anecdotes and start looking at the data, well, the numbers are striking and no less upsetting.

  • At the beginning of 2016, Bobbi Newman published a great examination of the LITA Top Tech Trends panelists by sex. Roughly 2/3 of the seats between 2003 and 2016 went to men, and 3/4 of repeat panelists were men.

  • The Lamont article mentioned before — and please, go read it — does some great original research enumerating what is likely a leading indicator: percentage of women authoring papers in library technology journals vs. more generic library journals (with the latter used as a control). First authorship in the higher tech journals goes to women about 34% of the time (JASIS&T is a low outlier with only 28%), while 65% of articles in the control journals have female first authors, mirroring pretty closely the percentage of women librarians in ARL libraries overall.

What are the data?

The numbers in my tweet suffer a bit from an apples-and-oranges comparison, with the ALA gender/race information coming from (wait for it…) the ALA, while the Library IT Heads numbers come from the 2014-2015 ARL statistics (Table 18).

Much (most?) of the IT work in libraries is, of course, done by “off-label” librarians — those hired to do a specific non-IT job, who are then pressed into service to do some programming or sysadmin work or whatnot. However, we don’t have numbers for those, so I’m going to focus on the US ARL statistics for self-identified library IT departments, partially because I work in an ARL library, and partially because large academic libraries often have an internal, labeled IT department, which makes counting easy.

Obviously, I’ve made a decision to give up generality in order to be able to make stronger assertions (e.g., LITA membership breakdown, were it available, might be more appropriate). I’d be very interested in looking at other data (or other slices of these data) if people have any available.

Categorizing Library IT positions

The ARL stats have a number of position categories, four of which obviously relate to Library IT and on which I’m going to focus here.

The leadership position I’ll treat as its own thing.

  • Department Head, Library Technology

The other three non-head IT positions I’ll treat as a group, giving this collection the whimsical name Library IT, non-head.

  • Library IT, Library Systems
  • Library IT, Web Developer
  • Library IT, Programmer

There are obviously other jobs that might or might not fit into library IT, depending on how a particular institution is structured. For example, at Michigan we have people who do markup for TEI documents and digitization specialists, neither set of which would obviously fall into one of the above categories. All those folks are part of Library IT on the organization chart at Michigan (and might not be at other places).

Let’s start with the non-head librarians and then look at department heads.

Library IT, non-head positions

61% of all US ARL Librarians are women, but only 29% of US ARL Librarians working in Library IT are women.

Overall, women outnumber men in ARL libraries by a substantial margin. The ARL report notes that, “the largest percentage of men employed in ARL libraries was 38.2% in 1980–81; since then men have consistently represented about 35% of the professional staff in ARL libraries,” (p. 15). That number is closer to 40% when looking at ARL institutions just within the US, as stated above.

So, we’ll call it 40% male librarians overall. How about in library IT?

In Library IT, men outnumber women by 526 to 212, giving us the 29% quoted above. That means there are about two and a half times as many men as women in library IT.
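For the record, the arithmetic behind those two figures is simply:

```python
men, women = 526, 212  # US ARL Library IT headcounts from Table 18

# Women's share of all Library IT librarians, as a percentage.
print(round(women / (men + women) * 100))  # 29

# Ratio of men to women.
print(round(men / women, 1))  # 2.5
```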

IT in general has been a male-dominated profession for a few decades now. A fairly recent article reports 2013 numbers that show women holding about 26% of jobs in computing, with many Big Name Tech Companies (Google, Facebook, Twitter, etc.) doing significantly worse.

We also don’t know about non-librarians working in library IT (I would be considered one). Given the overall IT statistics, it’s hard to believe that including non-librarians would move the needle toward having more women employees.

So on the one hand, we’re probably doing a very-slightly-less-awful job of bringing in women than the IT world in general. On the other, well, it’s only very slightly less awful, and this in a profession that is majority-female.

Library IT Heads

63% of Department Heads for departments other than IT in US ARL libraries are women. About 30% of Library IT Heads are women.

Given the numbers we’re about to look at, it’s worthwhile to note that the majority doesn’t always hold the power, a message driven home by this tweet from Amy Buckland:

The library writ large, then, is female-majority, but not necessarily female-dominated. Library IT, of course, is neither female-dominated nor female-majority.

First, a broader look. Leadership positions in the wider, non-library IT world go overwhelmingly to men. Women hold CIO-level positions in only about 17% of the Fortune 500. So, the baseline is terrible.

The ARL Stats for 2014-15 (table 30) show 91 US libraries that have a head of Library IT, 27 (30%) of whom are women. That’s about the same as the rank & file IT workers, but far different than the nearly two-thirds of other department heads that are women.

Many people presume this is indicative of what has been called the pipeline problem: the idea that it’s hard to hire women leaders because there aren’t many women coming up, and that the lack of women in leadership roles makes it harder to recruit women at the lower levels. This is a truth, but certainly not a complete truth.

Sex and salary in Library IT

The good news, such as it is, is that there is (basically) salary parity between men and women at both the IT rank & file and IT head positions.

The bad news is that this is one place where Library IT does better than the library on average. Across the whole library, men make an average of 5% more than women, an inequality that is true at every level of experience (ACRL table 38).

What does it mean?

The numbers give us what, of course, not why. For that explanation, many people initially grab onto the pipeline problem.

“Oh, woe is us white guys trying to do the right thing,” we lament. “We want to hire women and minorities, but none ever apply. There’s a pipeline problem.”

So let’s revisit our friend the pipeline problem. The problem is not just that the pipeline is small. The pipeline leaks.

Rachel Thomas’s article, “If you think women in tech is just a pipeline problem, you haven’t been paying attention” notes right up front that:

According to the Harvard Business Review, 41% of women working in tech eventually end up leaving the field (compared to just 17% of men)

Women leave IT at a much higher rate than other positions. IT can be, in ways large and small, antagonistic to women. Odds are your organization is, too. For those of you who think otherwise, I challenge you to find a shortish young woman where you work and ask her if she ever feels ignored or undervalued precisely because she’s a shortish young woman. Ask her if she always gets credit for her ideas. Ask her if her initiatives are given the same consideration as those of her male colleagues.

What can we do?

Admit that there’s a problem. And then talk about it.

That’s easy to say and crazy-hard to do. The more privilege one has — and I’m a white, well-educated, middle-aged male, so I’ve got privilege up the wazoo — the easier it is to dismiss bias as small, irrelevant, or “elsewhere.”

As to how to get a conversation started, my colleague Meghan Musolff enlisted me to help her with an ingenious plan:

  • Send out an invitation to talk about a diversity-related reading

  • Show up

Our first monthly meeting of what we’re calling the “Tech Diversity Reading Group” ended up drawing about 2/3 of the department (including the boss, who bought pizza) and revolved around the Rachel Thomas pipeline article from above. And yes, the conversation was dominated by men, and yes, there were some nods to, “But this isn’t Silicon Valley so it doesn’t apply to us” or “That doesn’t happen around here, does it?” and, yes, many of the women didn’t feel comfortable speaking out.

We got a lot of feedback, in both directions, but none of it was of the “this isn’t a problem” variety. It wasn’t perfect (or maybe even “good”), but we were there, giving it a shot.

And you can, too.

District Dispatch: Narrowing the Justice Gap

planet code4lib - Fri, 2016-06-03 16:51


While criminal justice issues have been increasingly in the public eye, lack of access to civil legal information and resources is a less well-known challenge that results in people appearing in court without lawyers in critical life matters such as eviction, foreclosures, child custody and child support proceedings, and debt collection cases. According to the Legal Services Corporation (LSC), more than 64 million Americans are eligible for civil legal aid, yet civil legal aid providers routinely turn away 80% of those who need help because of a lack of resources.

LSC is considering how it might increase access and awareness of civil legal information and resources through public libraries. A planning grant is supporting research and input from a diverse advisory committee to inform the development of a training curriculum for public librarians. I was pleased to join Public Library Association President-Elect (and Cleveland Public Library Director) Felton Thomas at the advisory committee meeting with others from state law libraries, university law libraries, legal aid providers already partnering with public libraries, and OCLC to learn more about the justice gap and how libraries may play a role in helping people find the legal information they need to narrow the justice gap.

Fortunately, a significant body of work already exists, including a series of webinars for librarians developed by LSC; a Public Library Toolkit developed by the American Association of Law Libraries; the Law4AZ initiative developed and delivered by the State Library of Arizona; and collaboration among the Hawaii State Judiciary, the Hawaii State Public Library System, and the Legal Aid Society of Hawaii to expand court forms available online and support librarian training and public seminars.

I’d be glad to hear from readers about their own experiences in this area and/or what you’d like to see in any future training or resources that may be developed. Shoot me a line at

The post Narrowing the Justice Gap appeared first on District Dispatch.

Open Knowledge Foundation: Addressing Challenges in Opening Land Data – Resources Are Now Live

planet code4lib - Fri, 2016-06-03 14:31

Earlier this year, Open Knowledge International announced a joint-initiative with Cadasta Foundation to explore open data in property rights with the ultimate goal of defining the land ownership dataset for the Global Open Data Index. Now, we are excited to share some initial, ground-breaking resources that showcase the complexity of working at the intersection of open data advocacy and the property rights space.

Land ownership information, including the Land Registry and Cadastre, is traditionally held in closed datasets within a pay-for-access system. In these situations, the instinct within the open data community is to default to open. While we believe more openness is vital to our aims of using data to secure property tenure, who can use this open data and for what purpose must also be taken into account. Further, property rights administration systems are highly complex and vary greatly from context to context. The implications of open data in a country where the frequent clash of government, community and private sector interests fosters mutual mistrust are very different from those in countries with established land administration systems where most of the population’s property rights are formally documented. Our acknowledgement of these nuances is reflected within our research thus far and has been the foundation of our process to define open data in land rights.



Our initial resources include a comprehensive Overview of Property Rights Data and a Risk Assessment. These two guides are intended to explain what land ownership data is and where it can be found, as well as to outline the process that OKI and Cadasta conducted to determine which of this data should be open. All current and forthcoming resources, as well as additional background on this project, can be found on Cadasta’s Open Data page.

These guides also exemplify the results of a partnership between the open data community and actors with sector-specific expertise. We foresee these resources and the lessons learned providing a framework for cross-sector data explorations as well as specific guidance for the international open data community involved with the Global Open Data Index.

We are actively seeking feedback to inform our research going forward and to ensure that this work becomes a core resource within the open data and land rights communities alike. Please reply with your comments and questions on the Discussion Forum or by reaching out to our researcher, Lindsay Ferris, directly at We look forward to hearing from you.

FOSS4Lib Recent Releases: veraPDF - 0.16.2

planet code4lib - Fri, 2016-06-03 13:52

Last updated June 3, 2016. Created by Peter Murray on June 3, 2016.

Package: veraPDF. Release date: Friday, June 3, 2016

OCLC Dev Network: Server-Side Linked Data Consumption with Ruby

planet code4lib - Fri, 2016-06-03 13:30

Learn about how to use Ruby to consume linked data from a specific graph URL.

Hydra Project: OR2017 will be in Brisbane

planet code4lib - Fri, 2016-06-03 11:48

Likely of interest to many Hydranauts:

The Open Repositories (OR) Steering Committee in conjunction with the University of Queensland (UQ), Queensland University of Technology (QUT) and Griffith University are delighted to inform you that Brisbane will host the annual Open Repositories 2017 Conference.

The University of Queensland (UQ), Queensland University of Technology (QUT) and Griffith University welcomed today’s announcement that Brisbane will host the International Open Repositories Conference 26-30 June 2017 at the Hilton Brisbane.

The annual Open Repositories Conference brings together users and developers of open digital repository platforms from higher education, government, galleries, libraries, archives and museums. The Conference provides an interactive forum for delegates from around the world to come together and explore the global challenges and opportunities facing libraries and the broader scholarly information landscape.

Eric Lease Morgan: Achieving perfection

planet code4lib - Fri, 2016-06-03 09:48

Through the use of the Levenshtein algorithm, I am achieving perfection when it comes to searching VIAF. Well, almost.

I am making significant progress with VIAF Finder [0], and I have now exploited the Levenshtein algorithm. In fact, I believe I can now programmatically choose VIAF identifiers for more than 50 or 60 percent of the authority records.

The Levenshtein algorithm measures the “distance” between two strings. [1] This distance is really the number of keystrokes necessary to change one string into another. For example, the distance between “eric” and “erik” is 1. Similarly the distance between “Stefano B” and “Stefano B.” is still 1. Along with a colleague (Stefano Bargioni), I took a long, hard look at the source code of an OpenRefine reconciliation service which uses VIAF as the backend database. [2] The code included the calculation of a ratio to denote the relative distance of two strings. This ratio is the quotient of the longest string minus the Levenshtein distance divided by the length of the longest string. From the first example, the distance is 1 and the length of the string “eric” is 4, thus the ratio is (4 – 1) / 4, which equals 0.75. In other words, 75% of the characters are correct. In the second example, “Stefano B.” is 10 characters long, and thus the ratio is (10 – 1) / 10, which equals 0.9. In other words, the second example is more correct than the first example.
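The ratio described above can be sketched in a few lines of Python. This is an illustration, not the exact code from the reconciliation service; `levenshtein` is the textbook dynamic-programming implementation of edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming (Wagner-Fischer) edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def similarity(query: str, candidate: str) -> float:
    """(longest - distance) / longest, as described in the text."""
    longest = max(len(query), len(candidate))
    return (longest - levenshtein(query, candidate)) / longest

print(similarity("eric", "erik"))             # 0.75
print(similarity("Stefano B", "Stefano B."))  # 0.9
```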

Using the value of MARC 1xx$a of an authority file, I can then query VIAF. The SRU interface returns 0 or more hits. I can then compare my search string with the search results to create a ranked list of choices. Based on this ranking, I am able to more intelligently choose VIAF identifiers. For example, from my debugging output, if I get 0 hits, then I do nothing:

query: Lucariello, Donato
hits: 0

If I get too many hits, then I still do nothing:

query: Lucas Lucas, Ramón
hits: 18
warning: search results out of bounds; consider increasing MAX

If I get 1 hit, then I automatically save the result, which seems to be correct/accurate most of the time, even though the Levenshtein distance may be large:

query: Lucaites, John Louis
hits: 1
  score: 0.250 John Lucaites (57801579)
action: perfection achieved (updated name and id)

If I get many hits, and one of them exactly matches my query, then I “achieved perfection” and I save the identifier:

query: Lucas, John Randolph
hits: 3
  score: 1.000 Lucas, John Randolph (248129560)
  score: 0.650 Lucas, John R. 1929- (98019197)
  score: 0.500 Lucas, J. R. 1929- (2610145857009722920913)
action: perfection achieved (updated name and id)

If I get many hits, and many of them are exact matches, then I simply use the first one (even though it might not be the “best” one):

query: Lucifer Calaritanus
hits: 5
  score: 1.000 Lucifer Calaritanus (189238587)
  score: 1.000 Lucifer Calaritanus (187743694)
  score: 0.633 Luciferus Calaritanus -ca. 370 (1570145857019022921123)
  score: 0.514 Lucifer Calaritanus gest. 370 n. Chr. (798145857991023021603)
  score: 0.417 Lucifer, Bp. of Cagliari, d. ca. 370 (64799542)
action: perfection achieved (updated name and id)

If I get many hits, and none of them are perfect, but the ratio is above a configured threshold (0.949), then that is good enough for me (even if the selected record is not the “best” one):

query: Palanque, Jean-Remy
hits: 5
  score: 0.950 Palanque, Jean-Rémy (106963448)
  score: 0.692 Palanque, Jean-Rémy, 1898- (46765569)
  score: 0.667 Palanque, Jean Rémy, 1898- (165029580)
  score: 0.514 Palanque, J. R. (Jean-Rémy), n. 1898 (316408095)
  score: 0.190 Marrou-Davenson, Henri-Irénée, 1904-1977 (2473942)
action: perfection achieved (updated name and id)
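Taken together, the decision rules illustrated above can be summarized in a short sketch. The 0.949 threshold comes from the text; the `Hit` structure and `MAX_HITS` value are hypothetical stand-ins for the script's actual data structures:

```python
from typing import List, NamedTuple, Optional

THRESHOLD = 0.949  # configured ratio threshold from the text
MAX_HITS = 10      # hypothetical upper bound (the "MAX" in the warning above)

class Hit(NamedTuple):
    name: str
    viaf_id: str
    score: float  # Levenshtein ratio of this result against the query

def choose_viaf_id(hits: List[Hit]) -> Optional[str]:
    """Pick a VIAF identifier according to the rules above, or None."""
    if not hits or len(hits) > MAX_HITS:
        return None                      # 0 hits, or too many: do nothing
    if len(hits) == 1:
        return hits[0].viaf_id           # a single hit is accepted as-is
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    if ranked[0].score >= THRESHOLD:     # an exact match scores 1.000
        return ranked[0].viaf_id         # first (best) match wins
    return None                          # nothing good enough
```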

By exploiting the Levenshtein algorithm, and by learning from the good work of others, I have been able to programmatically select VIAF identifiers for more than half of my authority records. When one has as many as 120,000 records to process, this is a good thing. Moreover, this use of the Levenshtein algorithm seems to produce more complete results when compared to the VIAF AutoSuggest API. AutoSuggest identified approximately 20 percent of my VIAF identifiers, while my Levenshtein algorithm/logic identifies more than 40 or 50 percent. AutoSuggest is much faster though. Much.

Fun with the intelligent use of computers, and think of the possibilities.

[0] VIAF Finder –

[1] Levenshtein –

[2] reconciliation service –

ZBW German National Library of Economics: Content recommendation by means of EEXCESS

planet code4lib - Fri, 2016-06-03 07:59

Authors: Timo Borst, Nils Witt

Since their beginnings, libraries and related cultural institutions were confident in the fact that users had to visit them in order to search, find and access their content. With the emergence and massive use of the World Wide Web and associated tools and technologies, this situation has drastically changed: if those institutions still want their content to be found and used, they must adapt themselves to those environments in which users expect digital content to be available. Against this background, the general approach of the EEXCESS project is to ‘inject’ digital content (both metadata and object files) into users' daily environments like browsers, authoring environments like content management systems or Google Docs, or e-learning environments. Content is not just provided, but recommended by means of an organizational and technical framework of distributed partner recommenders and user profiles. Once a content partner has connected to this framework by establishing an Application Program Interface (API) for constantly responding to the EEXCESS queries, the results will be listed and merged with the results of the other partners. Depending on the software component installed either on a user’s local machine or on an application server, the list of recommendations is displayed in different ways: from a classical, text-oriented list, to a visualization of metadata records.

The Recommender

The EEXCESS architecture comprises three major components: a privacy-preserving proxy; multiple client-side tools for the Chrome browser, WordPress, Google Docs and more; and the central server-side component responsible for generating recommendations, called the recommender. Covering all of these components in detail is beyond the scope of this blog post. Instead, we want to focus on one component, the federated recommender, as it is the heart of the EEXCESS infrastructure.

The recommender’s task is to generate a list of objects such as text documents, images and videos (hereafter called documents, for brevity’s sake) in response to a given query. The list is supposed to contain only documents relevant to the user, ordered by descending relevance. To generate such a list, the recommender can pick documents from the content providers that participate in the EEXCESS infrastructure. Technically speaking, but somewhat oversimplified: the recommender receives a query and forwards it to all content provider systems (such as Econbiz, Europeana, Mendeley and others). After receiving results from each content provider, the recommender decides in which order the documents will be recommended and returns the merged list to the user who submitted the query.

This raises some questions. How can we find relevant documents? The result lists from the content providers are already sorted by relevance; how can we merge them? Can we deal with ambiguity and duplicates? Can we respond within reasonable time? Can we handle the technical disparities of the different content provider systems? How can we integrate the different document types? In the following, we will describe how we tackled some of these questions, by giving a more detailed explanation on how the recommender compiles the recommendation lists.

Recommendation process

If the user wishes to obtain personalized recommendations, she can create a local profile (i.e. one stored only on the user’s device), specifying her education, age, field of interest and location. But to be clear here: this is optional. If the profile is used, the Privacy Proxy [4] takes care of anonymizing the personal information. The overall process of creating personalized recommendations is depicted in the figure and will be described in the following.

After the user has sent a query as well as her user profile, a process called Source Selection is triggered. Based on the user’s preferences, the Source Selection decides which partner systems will be queried. The reason for this is that most content providers cover only a specific discipline (see figure). For instance, queries from a user who is only interested in biology and chemistry will never receive Econbiz recommendations, whereas a query from a user merely interested in politics and money will get Econbiz recommendations (at present; this may change when other content providers participate). Thereby, Source Selection lowers the network traffic and the latency of the overall process and increases the precision of the results, at the expense of missing results and reduced diversity. Optionally, the user can also select the sources manually.
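A minimal sketch of the source-selection idea, assuming each partner system advertises the disciplines it covers (the partner names appear in the text, but the mapping and its values are purely illustrative):

```python
# Map each partner system to the disciplines it covers (illustrative values,
# not the actual EEXCESS configuration).
PARTNER_DISCIPLINES = {
    "Econbiz":   {"economics", "business"},
    "Europeana": {"cultural heritage", "history"},
    "Mendeley":  {"biology", "chemistry", "economics"},
}

def select_sources(user_interests: set) -> list:
    """Query only partners whose coverage overlaps the user's interests."""
    return [partner for partner, topics in PARTNER_DISCIPLINES.items()
            if topics & user_interests]

print(select_sources({"biology", "chemistry"}))  # ['Mendeley']
```

A user interested only in biology and chemistry would thus never trigger an Econbiz query, exactly as described above.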

The subsequent Query Processing step alters the query:

  • Short queries are expanded using Wikipedia knowledge
  • Long queries are split into smaller queries, which are then handled separately (See [1] for more details).

 The queries from the Query Processing step are then used to query the content providers selected during the Source Selection step. With the results from the content providers, two post processing steps are carried out to generate the personalized recommendations:

  • Result Processing: The purpose of the Result Processing is to detect duplicates. A technique called fuzzy hashing is used for this purpose: the words that make up a result list’s entry are sorted, counted and then hashed with the MD5 algorithm [2], which allows convenient comparison.
  • Result Ranking: After the duplicates have been removed, the results are re-ranked. To do so, a slightly modified version of the round robin method is used. Where vanilla round robin would just concatenate slices of the result lists (i.e. first two documents from list A + first two documents from list B + …), Weighted Round Robin modifies this behavior by taking the overlap of the query and the result’s metadata into account. That is, before merging the lists, each individual list is modified: documents whose metadata exhibits a high accordance with the query are promoted.
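The duplicate-detection step might be sketched roughly like this. Note this is an assumption-laden simplification of the EEXCESS implementation: real fuzzy hashing is more involved than sorting and hashing words, and the punctuation stripping is our own addition for illustration:

```python
import hashlib

def fuzzy_key(title: str, n_words: int = 8) -> str:
    """Sort the (lowercased, punctuation-stripped) words of an entry, keep
    the first n, and MD5-hash them, so near-identical records collapse to
    the same key."""
    words = sorted(w.strip(".,;:") for w in title.lower().split())[:n_words]
    return hashlib.md5(" ".join(words).encode("utf-8")).hexdigest()

def deduplicate(results: list) -> list:
    """Keep only the first result seen for each fuzzy key."""
    seen, unique = set(), []
    for record in results:
        key = fuzzy_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

docs = ["The Wealth of Nations", "Wealth of Nations, The", "Das Kapital"]
print(deduplicate(docs))  # ['The Wealth of Nations', 'Das Kapital']
```

Because the words are sorted before hashing, reordered variants of the same title produce the same key and are treated as duplicates.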

Partner Wizard

As the quality of the recommended documents increases with the number and diversity of the content providers that participate, a component called Partner Wizard was implemented. Its goal is to simplify the integration of new content providers to a level at which non-experts can manage the process without any support from the EEXCESS consortium. This is achieved by a semi-automatic process triggered from a web frontend that is provided by the EEXCESS consortium. Given a search API, it is relatively easy to obtain search results, but the main point is to obtain results that are meaningful and relevant to the user. Since every search service behaves differently, there is no point in treating all services equally; some sort of customization is needed. That’s where the Partner Wizard comes into play. It allows an employee of the new content provider to specify the search API. Afterwards, the wizard submits pre-assembled pairs of search queries to the new service. Each pair is similar but not identical, for example:

  • Query 1: <TERM1> OR <TERM2>
  • Query 2: <TERM1> AND <TERM2>.

The result lists generated in this way are presented to the user, who has to decide which list contains the more relevant results and suits the query better (see figure). Finally, based on the previous steps, a configuration file is generated that configures the federated recommender, which then mimics the behavior demonstrated during the wizard. The wizard can be completed within a few minutes, and it only requires a publicly available search API.

The project started with five initial content providers. Now, thanks in part to the Partner Wizard, there are more than ten content providers, and negotiations with further candidates are ongoing. Since there are almost no technical issues anymore, legal issues dominate the consultations. As all programs developed within the EEXCESS project are published under open source conditions, the Partner Wizard can be found at [3].


The EEXCESS project is about injecting distributed content from different cultural and scientific domains into everyday user environments, so that this content becomes more visible and more accessible. To achieve this goal and to establish a network of distributed content providers, software has to be specified and engineered alongside the various organizational, conceptual and legal work; this is not a one-time effort, as the technical components must also be maintained. One of the main goals of the project is to establish a community of networked information systems, with a lightweight approach to joining this network by easily setting up a local partner recommender. Future work will focus on this growing network and the increasing requirements of integrating heterogeneous content via central processing of recommendations.


ZBW German National Library of Economics: In a nutshell: EconBiz Beta Services

planet code4lib - Fri, 2016-06-03 07:56

Author: Arne Martin Klemenz

EconBiz – the search portal for Business Studies and Economics – was launched in 2002 as the Virtual Library for Economics and Business Studies. The project was initially funded by the German Research Foundation (DFG) and is developed by the German National Library of Economics (ZBW) with the support of the EconBiz Advisory Board and cooperation partners. The search portal aims to support research in and teaching of Business Studies and Economics with a central entry point for all kinds of subject-specific information and direct access to full texts [1].

In addition to the main EconBiz service, we provide several beta services as part of the EconBiz Beta sandbox. These developments cover the outcomes of research from large-scale projects such as EU projects as well as small-scale projects, e.g. in cooperation with students from Kiel University. The beta sandbox thus serves two purposes: it provides a platform for testing new features before they might be integrated into the main service (proof-of-concept development), and it provides a showcase for relevant output from related projects.

Details about a selection of these beta services are provided below.

Current Beta Services

Online Call Organizer

Based on the winning idea of an EconBiz ideas competition, the Online Call Organizer (OCO) was developed in cooperation with students from Kiel University.

The OCO is a service based on the EconBiz Calendar of Events, which contains events from all over the world, such as conferences and workshops relevant to economics, business studies and the social sciences [2]. At the moment, the calendar contains more than 10,000 events in total, including more than 500 future events. The main idea of the OCO is to handle this large amount of event-related information in a better way, with the objective of "never missing a deadline" such as the registration or submission deadline of a relevant event in a user's personal area of interest. The main EconBiz Calendar of Events service provides keyword-based faceted search and detailed information for each event. In addition, the OCO provides a filter mechanism based on the user's research profile, combined with a notification service based on email and Twitter alerts. The OCO is published as an open beta and can be accessed here:

Technologically, the OCO is implemented in PHP and JavaScript following the client-server model. The backend server processes, aggregates and transforms information about events retrieved from the EconBiz API, which provides access to the EconBiz dataset through a RESTful design. Besides that, the OCO server handles signup and authentication requests as well as other user-account-related actions such as changes to a user's research profile or notification settings. The server functionality is encapsulated behind an internal API, as outlined in Figure 1.

Figure 1: Online Call Organizer - Client Server Architecture Overview

Silex – a micro web framework based on Symfony – provides the basis for the OCO server API. Communication between the OCO client and the OCO service API uses the common HTTP methods GET, PUT/POST and DELETE with the JSON format, which allows a comfortable abstraction of the detailed application logic. Likewise, the server implements an additional abstraction layer based on the Doctrine ORM framework – an object-relational mapper (ORM) for PHP – which abstracts the database layer.
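The request flow can be illustrated with a minimal dispatch sketch. Note that the actual server is written in PHP with Silex; the Python below, including all routes and payload fields, is invented purely to show the pattern of mapping JSON requests to handlers.

```python
import json

# Minimal dispatch table mimicking the style of a JSON server API:
# each (HTTP method, path) pair maps to a handler returning a dict.
# All routes and fields are hypothetical illustrations.

PROFILES = {}

def put_profile(user_id, body):
    """Store the user's research-profile keywords."""
    PROFILES[user_id] = body["keywords"]
    return {"status": "ok"}

def get_profile(user_id, body=None):
    """Return the stored keywords, or an empty list."""
    return {"keywords": PROFILES.get(user_id, [])}

ROUTES = {
    ("PUT", "/profile"): put_profile,
    ("GET", "/profile"): get_profile,
}

def handle(method, path, user_id, payload="{}"):
    """Decode the JSON request body, dispatch, and encode the JSON reply."""
    handler = ROUTES[(method, path)]
    return json.dumps(handler(user_id, json.loads(payload)))

handle("PUT", "/profile", "u1", '{"keywords": ["labor"]}')
reply = handle("GET", "/profile", "u1")
```

In the real service, a framework like Silex does this routing and JSON handling for you; the sketch only shows why such an abstraction keeps the application logic compact.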

The OCO server module handles its library dependencies with Composer – a tool for dependency management in PHP. Further functionality relies on the following libraries and frameworks:

The core functionality – the alert service itself – is based on a daily cronjob which searches for events matching a user's research profile by retrieving up-to-date data from the EconBiz API. Users can specify whether they want to be notified by email and/or Twitter alert about events matching their research profile. Alerts are sent 'X' days (depending on the account settings) before the submission deadline ends, registration closes, or the event actually takes place.
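The daily check can be sketched as follows. The field names and the simple keyword-matching rule are assumptions for illustration, not the actual OCO implementation.

```python
from datetime import date, timedelta

# Hypothetical sketch of the daily alert cronjob: find events matching
# a research profile whose deadline is exactly `lead_days` away.

def events_to_alert(events, profile_keywords, lead_days, today):
    """Return profile-matching events due for an alert today."""
    due = today + timedelta(days=lead_days)
    hits = []
    for event in events:
        text = (event["title"] + " " + event["description"]).lower()
        if any(kw.lower() in text for kw in profile_keywords) \
                and event["submission_deadline"] == due:
            hits.append(event)
    return hits

events = [
    {"title": "Workshop on Labor Economics",
     "description": "Submissions on wages welcome",
     "submission_deadline": date(2016, 6, 10)},
]
alerts = events_to_alert(events, ["labor"], lead_days=7,
                         today=date(2016, 6, 3))
```

A production version would instead pull current event data from the EconBiz API each night and fan the hits out to the email and Twitter notifiers according to each user's settings.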

The corresponding multilingual (English/German) web client is based on PHP and JavaScript. It is kept quite simple, as it mostly provides the possibility to edit basic account settings (email address, password, Twitter profile for alerts) and research profiles in a form-based manner. In addition, it provides a calendar overview of upcoming events based on the FullCalendar jQuery plugin. Communication with the OCO backend server is based on asynchronous requests (AJAX) in JSON format sent to the OCO server API. In addition to the main jQuery and FullCalendar libraries, the PHP-based client skeleton utilizes the following common JavaScript libraries to implement its features:

Some parts of the OCO implementation have already been reused in the EconBiz portal. The full OCO service may be integrated with EconBiz as part of a scheduled reimplementation of the EconBiz Calendar of Events.

Other Beta Services

As part of the EconBiz Beta sandbox we provide several other beta services. On the one hand, we would like to ease reuse of the information provided in EconBiz. The EconBiz API provides full access to the EconBiz dataset, but for those who are not comfortable implementing their own services on top of the API, we decided to provide some basic widgets that can easily be integrated into any website. The widgets come with a configuration and widget-generation dialogue to make the integration as comfortable as possible. Currently, three kinds of widgets are available: a search widget, an event search widget and a bookmark widget. These widgets were initially meant as a comfortable way for, e.g., EconBiz Partners (see EconBiz Partner Network) to embed data from EconBiz into their websites, but they are now also used in other ZBW services, e.g. to embed lists of expert-selected literature based on the bookmark widget.
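Conceptually, the widget-generation dialogue turns a few configuration choices into an HTML snippet the partner can paste into their site. The sketch below illustrates that idea; the attribute names and markup are invented, not the actual EconBiz widget code.

```python
# Hedged sketch of what a widget-generation dialogue might emit.
# The data attributes and markup are illustrative assumptions only.

def generate_widget_snippet(widget_type, query="", lang="en"):
    """Return an HTML snippet for one of the three widget kinds."""
    allowed = {"search", "event-search", "bookmark"}
    if widget_type not in allowed:
        raise ValueError(f"unknown widget type: {widget_type}")
    return (
        f'<div class="econbiz-widget" data-type="{widget_type}" '
        f'data-query="{query}" data-lang="{lang}"></div>'
    )

snippet = generate_widget_snippet("search", query="monetary policy")
```

A loader script on the host page would then find such placeholder elements and fill them with live data fetched from the EconBiz API.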

On the other hand, Visual Search Interfaces and Visualization Widgets as prototypes from the EEXCESS project are also presented in the EconBiz Beta sandbox. The EEXCESS vision is to unfold the treasure of cultural, educational and scientific long-tail content for the benefit of all users and to facilitate access to this high quality information [16]. One aspect of the project, which is reflected by the EEXCESS prototypes in the EconBiz Beta sandbox, is the visualization of search processes.

EEXCESS Visual Search Interfaces and Visualization Widgets:


With EconBiz Beta we provide a sandbox for services developed in the context of EconBiz. This blog post gave an overview of some selected EconBiz Beta services. Several more services are available here:

If you would like to disseminate a service you created based on the EconBiz API, we would like to publish your development in the EconBiz Beta sandbox. Please get in touch via and provide some information about the application.


Max Planck Digital Library: Citation Trails in Primo Central Index (PCI)

planet code4lib - Thu, 2016-06-02 17:39

The May 2016 release brought an interesting functionality to the Primo Central Index (PCI): The new "Citation Trail" capability enables PCI users to discover relevant materials by providing cited and citing publications for selected article records.

At this time the only data source for the citation trail feature is CrossRef, thus the number of citing articles will be below the "Cited by" counts in other sources like Scopus and Web of Science.

Further information:

District Dispatch: Federal experts to walk libraries through government red tape

planet code4lib - Thu, 2016-06-02 13:12

Ever felt frustrated by the prospect of another unfunded mandate from the federal, local or state government? Get a better understanding of the red tape. Empower yourself, your library and your community by learning to navigate major e-government resources and websites by attending “E-Government Services At your Library: Conquering An Unfunded Mandate,” a conference session that takes place at the 2016 American Library Association (ALA) Annual Conference in Orlando, Fla. During the session, participants will learn how to navigate federal funding regulations.

Learn about taxes, housing, aid to dependent families, social security, healthcare, services to veterans, legal issues facing librarians in e-government and more. The session takes place on Thursday, June 23, 2016, from 1:00-4:00 p.m. in the Orange County Convention Center, room W103A.

Session speakers include Jayme Bosio, government research services librarian at the Palm Beach Library System in West Palm Beach, Fla.; Ryan Dooley, director of the Miami Passport Agency at the U.S. Department of State; and Chris Janklow, community engagement coordinator of the Palm Beach Library System in West Palm Beach, Fla.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Federal experts to walk libraries through government red tape appeared first on District Dispatch.

LibUX: Progress Continues Toward HTML5 DRM

planet code4lib - Thu, 2016-06-02 07:40

It is Thursday, June 2nd. You’re listening to W3 Radio (MP3), your news weekly about the world wide web in under ten minutes.

RSS | Google Play | iTunes


So, hey there! Thanks for giving my new podcast — W3 Radio — a spin. You can help it find its footing by leaving a nice review, telling your friends, and subscribing. Let’s be friends on twitter.

W3 Radio is now available in Google Play and in iTunes. Of course, you can always subscribe to the direct feed

The post Progress Continues Toward HTML5 DRM appeared first on LibUX.

Karen Coyle: This is what sexism looks like, # 3

planet code4lib - Thu, 2016-06-02 05:32
I spend a lot of time in technical meetings. This is no one's fault but my own since these activities are purely voluntary. At the end of many meetings, though, I vow to never attend one again. This story is about one.

There was no ill-preparedness or bad faith on the part of either the organizers or the participants at this meeting. There is, however, reality, and no amount of good will changes that.

This took place at a working meeting that was not a library meeting but at which some librarians were present. At lunch one day, three librarians, myself and two others, all female, were sitting together. I can say that we are all well-known and well seasoned in library systems and standards. You would recognize our names. As lunch was winding down, the person across from us opened a conversation with this (all below paraphrased):

P: Libraries should get involved with the Open Access movement; they are in a position to have an effect.

us: Libraries *are* heavily involved in the OA movement, and have been for at least a decade.

P: (Going on.) If you'd join together you could fight for OA against the big publishers.

us: Libraries *have* joined together and are fighting for OA. (Beginning to get annoyed at this point.)

P: What you need to do is... [various iterations here]

us: (Visibly annoyed now) We have done that. In some cases, we have started an effort that is going forward. We have organizations dedicated to that, we hold whole conferences on these topics. You are preaching to the choir here - these aren't new ideas for us, we know all of this. You don't need to tell us.

P: (Going on, no response to what we have said.) You should set a deadline, like 2017, after which you should drop all journals that are not OA.

us: [various statements about a) setting up university-wide rules for depositing articles; b) the difference in how publishing matters in different disciplines: c) the role of tenure, etc.]

P: (Insisting) If libraries would support OA, publishers like Elsevier could not survive.

us: [oof!]

me: You are sitting here with three professionals with a combined experience in this field of well over 50 years, but you won't listen to us or believe what we say. Why not?

P: (Ignoring the question.) I'm convinced that if libraries would join in, we could win this one. You should...

At this point, I lost it. I literally head-desked and groaned out "Please stop with the mansplaining!" That was a mistake, but it wasn't wrong. This was a classic case of mansplaining. P hopped up and stalked out of the room. Twenty minutes later I am told that I have violated the "civility code" of the conference. I have become the perpetrator of abuse because I "accused him" of being sexist.

I don't know what else we could have done to stop what was going on. In spite of a good ten minutes of us replying that libraries are "on it" not one of our statements was acknowledged. Not one of P's statements was in response to what we said. At no point did P acknowledge that we know more about what libraries are doing than he does, and perhaps he could learn by listening to us or asking us questions. And we actually told him, in so many words, he wasn't listening, and that we are knowledgeable. He still didn't get it.

This, too, is a classic: Catch-22. A person who is clueless will not get the "hints" but you cannot clue them or you are in the wrong.

Thanks to the men's rights movement, standing up against sexism has become abuse of men, who are then the victims of what is almost always characterized as "false accusations". Not only did this person tell me, in the "chat" we had at his request, "I know I am not sexist" he also said, "You know that false accusations destroy men's lives." It never occurred to him that deciding true or false wasn't de facto his decision. He didn't react when I said that all three of us had experienced the encounter in the same way. The various explanations P gave were ones most women have heard before: "If I didn't listen, that's just how I am with everybody." "Did I say I wasn't listening because you are women? so how could it be sexist?" And "I have listened to you in our meetings, so how can you say I am sexist?" (Again, his experience, his decision.) During all of this I was spoken to, but no interest was shown in my experience, and I said almost nothing. I didn't even try to explain it. I was drubbed.

The only positive thing that I can say about this is that in spite of heavy pressure over 20 minutes, one on one, I did not agree to deny my experience. He wanted me to tell him that he hadn't been sexist. I just couldn't do that. I said that we would have to agree to disagree, but apologized for my outburst.

When I look around meeting rooms, I often think that I shouldn't be there. I often vow that the next time I walk into a meeting room and it isn't at least 50% female, I'm walking out. Unfortunately, that meeting room does not exist in the projects that I find myself in.

Not all of the experience at the meeting was bad. Much of it was quite good. But the good doesn't remove the damage of the bad. I think about the fact that in Pakistan today men are arguing that it is their right to physically abuse the women in their home and I am utterly speechless. I don't face anything like that. But the wounds from these experiences take a long time to heal. Days afterward, I'm still anxious and depressed. I know that the next time I walk into a meeting room I will feel fear; fear of further damage. I really do seriously think about hanging it all up, never going to another meeting where I try to advocate for libraries.

I'm now off to join friends and hopefully put this behind me. I wish I could know that it would never happen again. But I get that gut punch just thinking about my next meeting.

Meredith Farkas: Generous hearts and social media shaming

planet code4lib - Wed, 2016-06-01 19:28

When I was young and bold and strong,
Oh, right was right, and wrong was wrong!
My plume on high, my flag unfurled,
I rode away to right the world.
“Come out, you dogs, and fight!” said I,
And wept there was but once to die.

But I am old; and good and bad
Are woven in a crazy plaid.

-From “The Veteran,” by Dorothy Parker


As someone who has been active on social media for the entirety of my professional life and who wrote a book on social media for libraries, things have to be pretty bad for me to be considering taking a hiatus from social media. But I feel like the vitriol, nastiness, and lack of compassion is getting worse and worse and I want no part of it.

My Facebook feed right now is full of (armchair primatologist and parenting expert) friends expressing outrage over the mother of the child who climbed into the gorilla enclosure in Cincinnati and the staffers who decided to kill the gorilla. The glee with which people I think of as compassionate are going after the parents and looking into their lives and background is disturbing. I feel like they must have access to much more information than I do, because I don’t know that the mother was negligent, and having experienced my nephew who was “a runner” when he was little, I know how kids can get away from you in the blink of an eye.

When bad things happen, society always seems to look for someone to blame. Someone to look down on. Someone to judge. Why do we do it? Because it makes us feel better about ourselves. I would never _____. Those people are less [careful, caring, moral, human, etc.] than me. Therefore, nothing bad like that could ever happen to [me/people I love]. And, instead of looking at how this thing can be prevented from happening again, we just want punishment. We want to see someone burned at the stake. That short-term vitriol never seems to lead to long-term improvements that would keep the same thing from happening again.

Were I just seeing it in the larger society, I could maybe just write it off, but I also see it happening in my profession, a profession full of brilliant critical thinkers who sometimes engage in mob mentality online. I have witnessed so many social media take-downs of people in our profession — some small, some quite large and public. More often than not, we are not privy to all the facts, but still blame and shame in ways that cause real damage to people.

I also saw this vitriol come out recently against American Libraries when some content had been changed in an article (without the author’s permission) that was favorable to a partner vendor. Admittedly, this was a bad situation that was not handled well by ALA Publishing, but it spurred on the usual “blame and shame” Twitter cycle that, as usual, did not lead to any meaningful change. What happened as a result? Did their policies or practices change? Is there a committee looking at this? Does anyone know?

What makes me craziest about the “blame and shame” Twitter cycle is that it never seems to lead to meaningful change. The people who are expressing outrage only seem to care for a short time, and not enough to ensure that positive and constructive change comes from all of it. The people who’ve been social media shamed either do everything in their power to disappear from the world or otherwise write off the people on social media as lunatics who can’t be taken seriously. Either way, again, no meaningful change or learning comes from it.

I used to see the world more in black and white and get riled up over things that now seem inconsequential. I used to be more judgmental. I am ashamed of some of my blog posts from long ago. As you see people in your life struggle, and as you struggle, you realize that nothing is quite that simple. An action you may have judged in the past rarely happens in isolation, and there may be some very good reasons why what happened did that you are unaware of.

Now, instead of rushing to judge, I try to understand. I try to put myself in their shoes. I remind myself that I’m not perfect; that I’m not immune from making mistakes or from bad things happening. In his recent post on ACRLog, Ian McCullough talks about having a “generous heart” when it comes to other people’s failings. I really like that. At one point, my friend, Josh Neff, called it “charitable reading,” but it applies to our physical lives as much as our digital. When we jump on the Twitter rage train against someone, we are forgetting that they are fully-formed human beings with complicated lives and emotions and desires who are not all bad or all good. And when we do that to people in our profession (which sometimes seems very small), it feels particularly egregious and short-sighted.

I’ve made mistakes. I’ve made bad decisions. I’ve been the “bad guy.” I’ve done things in my life that I never thought I’d do, and a big part of that is because I’d never anticipated being in the situations I was in. Life is unpredictable. People who go through terrible things are usually blindsided by them. And you don’t really know how you will respond until you’re in the situation. There isn’t a roadmap. To me, the key is learning and growing from the experience. And I think it’s hard to learn or reflect when you are in fight-or-flight mode because you’re being excoriated by people around you who probably don’t know the whole story.

But when you fall down, you see who your real friends are. You see who judges you, who stands back and holds you at arm’s length, and who is there for you. You find out who sees you as the sum of your parts instead of just one thing you did or said. I’ve been through several difficult chapters in my life that have made it clear to me who I can count on, and it is a powerful lesson. I only hope I can be there for those people in the same way when they need me.

Embracing nuance is hard sometimes. It’s so much easier to say “vendors are evil” than to admit that the people working at those companies are human beings, some of whom actually want to do (or are doing) good things for libraries. It’s so much easier to destroy an editor at ALA who made a poor decision than to work with ALA to make sure that never happens again. It’s so much easier to jump on the rage train when someone on Twitter is getting flogged for a comment they’d thought innocuous than to try and be kind and constructive.

Letting go of all that piss and vinegar and moral superiority feels good, at least for me. It’s freeing to recognize our own humanity and the humanity of those around us. We’re all flawed human beings trying to make our way through the world with (mostly) good intentions. We don’t all value the same things. We all sometimes feel schadenfreude; it’s an inescapable part of our reality TV-loving society. But assuming the best in people and helping them when they fall down feels a lot better than tearing them down… well, at least in the long term.

Image source

LITA: Jobs in Information Technology: June 1, 2016

planet code4lib - Wed, 2016-06-01 18:35

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Queensborough Community College (CUNY), Assistant Professor (Librarian) – Head of Reference Library, Bayside, NY

California State University, Dominguez Hills, Liaison-Systems Librarian, Carson, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

DPLA: Open, Free, and Secure to All: DPLA Launches Full Support for HTTPS

planet code4lib - Wed, 2016-06-01 17:12

DPLA is pleased to announce that the entirety of our website, including our portal, exhibitions, Primary Source Sets, and our API, are now accessible using HTTPS by default. DPLA takes user privacy seriously, and the infrastructural changes that we have made to support HTTPS allows us to extend this dedication further and become signatories of the Library Digital Privacy Pledge of 2015-2016, developed by our colleagues at the Library Freedom Project. The changes we’ve made include the following:

  • Providing HTTPS versions of all web services that our organization directly controls (including everything under the domain), for both human and machine consumption,
  • Automatic redirection for all HTTP requests to HTTPS, and
  • A caching thumbnail proxy for items provided by the DPLA API and frontend, which serves the images over HTTPS instead of providing them insecurely.
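The thumbnail-proxy idea can be sketched as a URL-rewriting step: insecure `http://` image URLs from providers are routed through an HTTPS caching proxy so item pages produce no mixed-content warnings. The proxy path and function name below are hypothetical, not DPLA's actual endpoint.

```python
from urllib.parse import quote, urlparse

# Illustrative sketch of the thumbnail-proxy rewrite; the proxy base
# URL is an assumption for this example.

def proxied_thumbnail_url(image_url,
                          proxy_base="https://example.org/thumb/"):
    """Rewrite insecure image URLs to go through an HTTPS proxy."""
    if urlparse(image_url).scheme == "https":
        return image_url  # already secure; no proxy needed
    return proxy_base + quote(image_url, safe="")

url = proxied_thumbnail_url("http://provider.example/img/item.jpg")
```

The real proxy would also fetch and cache the upstream image; the sketch only shows the rewriting decision that keeps every page element on HTTPS.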

After soft-launching HTTPS support at DPLAFest 2016, DPLA staff has done thorough testing, and we are fairly confident that all pages and resources should load over HTTPS with no issues. If you do encounter any problems, such as mixed content warnings or web resources not loading properly, please contact us with the subject “Report a problem with the website” and describe the problem, including links to the pages on which you see the problem.

These changes are just the start, however. To ensure better privacy, DPLA encourages both its Hubs and their partners to provide access to all of their resources and web services over HTTPS, and to join DPLA in becoming a signatory of the Library Digital Privacy Pledge. By working together, we can achieve a national network of cultural heritage resources that are open, free, and secure to all. Please join me in thanking Mark Breedlove, Scott Williams, and the rest of the DPLA Technology Team for making this possible.

Featured image: Patent for Padlock by Augustus Richards, Sr., 1889, UNT Libraries Government Documents Department, Portal to Texas History,

LITA: Managing iPads – The Configurator

planet code4lib - Wed, 2016-06-01 15:00

We’ve talked in the past about having iPads in the library and how to buy multiple copies of an app at the same time. This is a long-delayed post about the tool we use at my library to manage the devices.

At Waukesha Public Library we use Apple’s Configurator 2. This is a great solution if you have up to forty or fifty devices to manage. Beyond that it gets unwieldy (although I’ll talk about a way you could use the Configurator for more devices). We have two dozen or so iPads we manage this way so it works perfectly for us.

You can see in the photo above all our iPads connected to the Configurator. It gives you an idea of what the desktop of each iPad looks like; you can even upload your own image to be loaded on each device if you want to brand them for your library. When you connect the iPads you get a great status overview: the OS version, the specific device model, its capacity, and whether it has software updates, among other things.

Across the top are several choices of how to interact with the iPads: you can prepare, update, back up, or tag. Prepare is used when the iPad is first configured for use in the Configurator. This gives you the option of supervising the devices so that you can control how they get updated and what networks they have access to. If you’re going to circulate iPads you don’t want to supervise them because it will set limits on how the public can use them. If you’re using iPads only in the library—as we are—then you should supervise them so that you can guarantee that they work in your network. We mostly use update which gives supervised iPads the option of doing an OS update, an app update, or both an OS and app update (depending on what the devices need).

OS updates go very quickly. Usually it takes about a half hour to update 22 iPads. You might need to interact with each device after an OS update—to set a passcode (you can reset a device’s passcode through the Configurator which is great when someone changes it or forgets what they set it as), enable location services, etc.—so just budget that into the time you need to get the devices ready for use.

App updates have been a different beast for us. We were on a monthly update cycle, which is perhaps not often enough. We found that if we updated all the apps on a single device, or if we updated a single app on all devices, the process went quickly. If we tried to update all apps on all devices it tended to get hung up and time out. We’re doing updates more frequently now, so we’re not running into that problem any longer.

The best thing you can do with the Configurator is create profiles. There are a lot of settings to which the Configurator gives you access. This includes blocking in-app purchases, setting the WiFi network, enabling content filters, setting up AirPlay or AirPrint, and more. Basically anything you can control under an iPad’s settings outside of downloaded apps you can set using the Configurator and put into a profile.

This way if there are forty iPads for children’s programming, twenty iPads for teens, and thirty iPads the public checks out, each one could have its own profile and its own settings. In this way you can manage a lot more than forty or fifty devices. You would manage each profile as an individual group.

If you want to be able to push out updates to devices wirelessly, you can consider Apple’s Mobile Device Management. You can host your MDM services locally—which requires a server—or host them in the cloud. For us it made sense to use the Configurator and update devices by connecting to them since they are kept in a single cart. Our local school district, as I’ve mentioned before, provides an iPad to all students K-12 so they use JAMF’s Casper Suite (a customized solution) to manage their approximately 15,000 devices.


DPLA: Primary Source Sets for Education Now Accessible in PBS LearningMedia

planet code4lib - Wed, 2016-06-01 14:50

We are thrilled to announce that all of our 100 Primary Source Sets for education are now accessible in PBS LearningMedia, a leading destination for high-quality, trusted digital education resources that inspire students and transform learning.

This announcement comes as the result of the partnership between DPLA and PBS LearningMedia announced at last year’s DPLAfest and the work of our Education team and Advisory Committee to create a collection of one hundred primary source sets over the last ten months. We are excited to have the opportunity to bring together PBS LearningMedia’s unparalleled media resources and connections to the world of education and lifelong learning with DPLA’s vast and growing storehouse of openly available education materials, cultural heritage collections, and community of librarians, archivists, and curators.

Together with PBS LearningMedia, we hope that by providing access to the DPLA Primary Source Sets to PBS LearningMedia’s broad audience of educators, we can make our rich new resources more accessible and discoverable for all, while introducing educators across the country to DPLA and the cultural heritage collections contributed by our growing network of hubs and contributing institutions. 

Within PBS LearningMedia, educators will be able to access, save, and combine DPLA’s education resources with more than 100,000 videos, images, interactives, lesson plans and articles drawn from critically acclaimed PBS programs such as Frontline and American Experience and from expert content contributors like NASA.  Teachers also have the option of navigating within the DPLA resources; from our collection page,  educators can explore by core subject areas, such as US history, literature, arts, and science and technology, as well as themes like migration and labor history and groups including African Americans and women.

Mark E. Phillips: Comparing Web Archives: EOT2008 and EOT2012 – When

planet code4lib - Wed, 2016-06-01 14:12

In 2008 a group of institutions comprising the Internet Archive, Library of Congress, California Digital Library, University of North Texas, and Government Publishing Office worked together to collect the web presence of the federal government in a project that has come to be known as the End of Term Presidential Harvest 2008.

Working together, this group established the scope of the project, developed a tool to collect nominations of URLs the community considered important to harvest, and carried out harvests of the federal web presence before the election, after the election, and after the inauguration of President Obama. The collection was harvested by the Internet Archive, Library of Congress, California Digital Library, and the UNT Libraries. At the end of the EOT project the harvested data was shared between the partners, with several institutions acquiring a copy of the complete EOT dataset for their local collections.

Moving forward four years, the same group got together to organize the harvesting of the federal domain in 2012. While originally scoped as a way of capturing the transition of the executive branch, this EOT project also served as a way to systematically capture a large portion of the federal web on a four-year calendar. In addition to the 2008 partners, Harvard joined the project for 2012.

Again the team worked to identify in-scope content to collect; this time, however, the content included URLs from the social web, including Twitter and Facebook accounts for agencies, offices, and individuals in the federal government. Because the 2012 election did not result in a change of administration, there was just a single set of crawls during the fall of 2012 and the winter of 2013. Again this content was shared among the project partners interested in acquiring the archives for their own collections.

The End of Term group is a loosely organized group that comes together every four years to harvest the federal web presence. As we approach the end of the Obama administration, the group has started planning the EOT 2016 project, with a goal of starting crawling in September 2016. This time there will be a new president, so the crawling will probably follow the format of the 2008 crawls, with pre-election, post-election, and post-inauguration sets of crawls.

So far there hasn't been much analysis comparing the EOT2008 and EOT2012 web archives. A number of questions have come up over the years that remain unanswered about the two collections. This series of posts will take a stab at answering some of those questions and hopefully provide better insight into the makeup of the two collections. Finally, there may be a few lessons from the different approaches used in creating these archives that will be helpful as we begin the EOT 2016 crawling.

Working with the EOT Data

The dataset I am working with for these posts consists of the CDX files created for the EOT2008 and EOT2012 archives. Each CDX file acts as an index to the raw archived content and contains a number of fields that are useful for analysis. All of the archived content is referenced in the CDX files.

If you haven't looked at a CDX file before, here are a few example records.

gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:martinelli,%20giovanni&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005312 text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - 533010532
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:schumann-heink,%20ernestine&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005219 text/html 200 EL5OT5NAXGGV6VADBLNP2CBZSZ5MH6OT - - - 531160983
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:scotti,%20antonio&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005255 text/html 200 SEFDA5UNFREPA35QNNLI7DPNU3P4WDCO - - - 804325022
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:viafora,%20gina&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005309 text/html 200 EV6N3TMKIVWAHEHF54M2EMWVM5DP7REJ - - - 532966964
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:homer,%20louise&fq[1]=take_composer_name:campana,%20f.%20&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125070122 text/html 200 FW2IGVNKIQGBUQILQGZFLXNEHL634OI6 - - - 661008391

The CDX format is a space-delimited file with the following fields:

  • SURT formatted URI
  • Capture Time
  • Original URI
  • MIME Type
  • Response Code
  • Content Hash (SHA1)
  • Redirect URL
  • Meta tags (not populated)
  • Compressed length (sometimes populated)
  • Offset in WARC file
  • WARC File Name
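Reading records out of a CDX file is mostly a matter of splitting each line on whitespace and mapping the columns onto the field names above. Here is a minimal sketch; the `EOT_FIELDS` layout is my assumption based on the list above, and since CDX variants differ in which columns they carry, the layout is passed in by the caller:

```python
# Field layout matching the list above (an assumption; real CDX
# variants may carry fewer or more columns).
EOT_FIELDS = [
    "surt", "timestamp", "original_uri", "mime_type", "status",
    "digest", "redirect", "meta_tags", "length", "offset", "filename",
]

def parse_cdx_line(line, fields=EOT_FIELDS):
    """Split a whitespace-delimited CDX record and map the values
    onto the given field names. Returns None for header lines or
    records that do not match the expected layout."""
    parts = line.split()
    if len(parts) != len(fields):
        return None
    return dict(zip(fields, parts))
```

With records parsed into dicts like this, extracting any single field for sorting and counting becomes a one-liner.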

The tools I'm using to analyze the EOT datasets are Python scripts that either extract specific data from the CDX files, so it can be further sorted and counted, or operate on those sorted and counted derivative files.

I'm posting code and derived datasets in a GitHub repository called eot-cdx-analysis if you are interested in taking a look. There is also a link to the original CDX datasets there.

How Much?

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs. Unfortunately the CDX files we are working with don't have consistent size information to use for analysis, but the rough sizes are 16TB for EOT2008 and just over 41.6TB for EOT2012.


The first dimension I wanted to look at was when the content was harvested for each of the EOT rounds. In both cases we all remember starting the harvesting "sometime in September" and ending the crawls "sometime in March" of the following year. How close were we to our memory?

For this I extracted the Capture Time field from the CDX file, converted it into a yyyy-mm-dd date (a decent bucket to group into), and then sorted and counted each instance of a date.
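The extract, bucket, and count step can be sketched roughly like this. It is a simplified stand-in for the actual scripts in the eot-cdx-analysis repository, and it assumes the capture timestamp is the second whitespace-delimited column in yyyymmddhhmmss form:

```python
from collections import Counter

def harvest_date_counts(cdx_path):
    """Bucket CDX captures by day: read the 14-digit capture time
    (yyyymmddhhmmss) from the second column, truncate it to a
    yyyy-mm-dd date, and count occurrences of each date."""
    counts = Counter()
    with open(cdx_path) as f:
        for line in f:
            parts = line.split()
            # Skip header lines and malformed records
            if len(parts) < 2 or not parts[1][:8].isdigit():
                continue
            ts = parts[1]
            counts["%s-%s-%s" % (ts[0:4], ts[4:6], ts[6:8])] += 1
    return sorted(counts.items())
```

Sorting the resulting (date, count) pairs chronologically gives the series plotted in the harvest-date charts.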

EOT2008 Harvest Dates

This first chart shows the harvest dates contained in the EOT2008 CDX files. Things got kicked off in September 2008 and apparently concluded all the way in October 2009. There is another blip of activity in May 2009. This is probably something to go back and look at, to help remember exactly what these two sets of crawls after March 2009 were, since we all seem to remember crawling stopping then.

EOT2012 Harvest Dates

The EOT2012 crawling started in mid-September and this time finished up in the first part of March 2013. There is a more consistent shape to the crawling for this EOT, with a fairly steady run of crawls between mid-October and the end of January.

EOT2008 and EOT2012 Harvest Dates Compared

When you overlay the two charts you can see how they compare. Obviously the EOT2008 data continues quite a bit further than the EOT2012 data, but where they overlap you can see different patterns in the collecting.


This is the first of a few posts related to web archiving, and specifically to comparing the EOT2008 and EOT2012 archives. We are approaching the time to start the EOT2016 crawls, and it would be helpful to have more information about what we crawled in the two previous cycles.

In addition to just needing to do this work, there will be a presentation on some of these findings, as well as other types of analysis, at the 2016 Web Archiving and Digital Libraries (WADL) workshop happening at the end of JCDL 2016 this year in Newark, NJ.

If you have questions about the EOT2008 or EOT2012 archives, please get in contact with me and we can see if we can answer them.

If you have questions or comments about this post, please let me know via Twitter.
