
ZBW German National Library of Economics: Content recommendation by means of EEXCESS

planet code4lib - Fri, 2016-06-03 07:59

Authors: Timo Borst, Nils Witt

Since their beginnings, libraries and related cultural institutions could rely on the fact that users had to visit them in order to search, find and access their content. With the emergence and massive use of the World Wide Web and associated tools and technologies, this situation has drastically changed: if those institutions still want their content to be found and used, they must adapt to the environments in which users expect digital content to be available. Against this background, the general approach of the EEXCESS project is to ‘inject’ digital content (both metadata and object files) into users' daily environments such as browsers, authoring environments like content management systems or Google Docs, and e-learning environments. Content is not just provided, but recommended by means of an organizational and technical framework of distributed partner recommenders and user profiles. Once a content partner has connected to this framework by establishing an Application Programming Interface (API) for responding to EEXCESS queries, its results are listed and merged with the results of the other partners. Depending on the software component installed, either on a user’s local machine or on an application server, the list of recommendations is displayed in different ways: from a classical, text-oriented list to a visualization of metadata records.

The Recommender

The EEXCESS architecture comprises three major components: a privacy-preserving proxy; multiple client-side tools for the Chrome browser, WordPress, Google Docs and more; and the central server-side component responsible for generating recommendations, called the recommender. Covering all of these components in detail is beyond the scope of this blog post. Instead, we want to focus on one component: the federated recommender, as it is the heart of the EEXCESS infrastructure.

The recommender’s task is to generate a list of objects such as text documents, images and videos (hereafter called documents, for brevity’s sake) in response to a given query. The list is supposed to contain only documents relevant to the user. Moreover, the list should be ordered by descending relevance. To generate such a list, the recommender can pick documents from the content providers that participate in the EEXCESS infrastructure. Technically speaking, but somewhat oversimplified: the recommender receives a query and forwards it to all content provider systems (such as Econbiz, Europeana, Mendeley and others). After receiving results from each content provider, the recommender decides in which order documents will be recommended and returns the merged list to the user who submitted the query.
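The fan-out and merge just described can be sketched in a few lines. This is an illustrative Python sketch, not the project's actual code (the real recommender talks to partner HTTP APIs); the provider functions and their relevance scores are invented for the example.

```python
import concurrent.futures

# Hypothetical stand-ins for partner search APIs; in EEXCESS these are
# remote HTTP services returning metadata records with relevance scores.
def search_econbiz(query):
    return [{"title": "Monetary policy", "score": 0.9, "source": "Econbiz"}]

def search_europeana(query):
    return [{"title": "Historic coins", "score": 0.7, "source": "Europeana"}]

PROVIDERS = [search_econbiz, search_europeana]

def federated_search(query):
    """Forward the query to all providers in parallel and collect results."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(provider, query) for provider in PROVIDERS]
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.extend(future.result())
    # The real recommender de-duplicates and re-ranks at this point;
    # here we simply sort by the providers' own relevance scores.
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

Querying the providers concurrently rather than sequentially keeps the overall latency close to that of the slowest single partner.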

This raises some questions. How can we find relevant documents? The result lists from the content providers are already sorted by relevance; how can we merge them? Can we deal with ambiguity and duplicates? Can we respond within reasonable time? Can we handle the technical disparities of the different content provider systems? How can we integrate the different document types? In the following, we will describe how we tackled some of these questions, by giving a more detailed explanation on how the recommender compiles the recommendation lists.

Recommendation process

If the user wishes to obtain personalized recommendations, she can create a local profile (i.e. one stored only on her device), specifying her education, age, field of interest and location. But to be clear: this is optional. If a profile is used, the Privacy Proxy [4] takes care of anonymizing the personal information. The overall process of creating personalized recommendations is depicted in the figure and will be described in the following.

After the user has sent a query along with her user profile, a process called Source Selection is triggered. Based on the user’s preferences, Source Selection decides which partner systems will be queried. The reason is that most content providers cover only a specific discipline (see figure). For instance, queries from a user who is interested only in biology and chemistry will never receive Econbiz recommendations, whereas a query from a user interested in politics and money will get Econbiz recommendations (at present; this may change as other content providers join). Source Selection thereby lowers network traffic and the latency of the overall process and increases the precision of the results, at the expense of possibly missing results and reduced diversity. Optionally, the user can also select the sources manually.
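The selection step amounts to intersecting the user's declared interests with each provider's coverage. The following is a minimal sketch; the provider-to-discipline mapping is invented for illustration and does not reflect the partners' actual coverage.

```python
# Hypothetical mapping of providers to the disciplines they cover.
PROVIDER_DISCIPLINES = {
    "Econbiz":   {"economics", "business"},
    "Europeana": {"culture", "history"},
    "Mendeley":  {"biology", "chemistry", "economics"},
}

def select_sources(user_interests):
    """Return providers whose coverage overlaps the user's interests.

    Falls back to all providers when the profile is empty, since an
    anonymous query gives no basis for narrowing the selection.
    """
    interests = set(user_interests)
    if not interests:
        return sorted(PROVIDER_DISCIPLINES)
    return sorted(provider
                  for provider, topics in PROVIDER_DISCIPLINES.items()
                  if topics & interests)
```

With this mapping, a profile listing only biology would be routed past Econbiz entirely, exactly the behavior described above.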

The subsequent Query Processing step alters the query:

  • Short queries are expanded using Wikipedia knowledge
  • Long queries are split into smaller queries, which are then handled separately (See [1] for more details).
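The two rules above can be sketched as a single function. This is a toy illustration: the Wikipedia-based expansion is stubbed out as a lookup table, and the four-term threshold is an invented parameter.

```python
def process_query(query, max_terms=4):
    """Split long queries into chunks; expand short queries with
    related terms (Wikipedia lookup stubbed as a plain dictionary)."""
    related = {"keynes": ["macroeconomics"]}  # stand-in for Wikipedia
    terms = query.split()
    if len(terms) > max_terms:
        # Long query: break into smaller queries handled separately.
        return [" ".join(terms[i:i + max_terms])
                for i in range(0, len(terms), max_terms)]
    # Short query: append any known related terms.
    expansion = [t for term in terms for t in related.get(term.lower(), [])]
    return [" ".join(terms + expansion)]
```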

The queries from the Query Processing step are then used to query the content providers selected during the Source Selection step. With the results from the content providers, two post-processing steps are carried out to generate the personalized recommendations:

  • Result Processing: The purpose of Result Processing is to detect duplicates, using a technique called fuzzy hashing: the words that make up a result list’s entry are sorted and counted, and the result is hashed with the MD5 algorithm [2], which allows convenient comparison.
  • Result Ranking: After the duplicates have been removed, the results are re-ranked using a slightly modified round-robin method. Where vanilla round robin would simply concatenate slices of the result lists (i.e. first two documents from list A + first two documents from list B + …), Weighted Round Robin takes the overlap between the query and a result’s metadata into account: before the lists are merged, each individual list is re-ordered so that documents whose metadata closely match the query are promoted.
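Both post-processing steps can be sketched compactly. The fingerprint below is a rough analogue of the fuzzy-hashing step (it sorts the words before hashing so that reordered titles collide), and the overlap score is a deliberately simple stand-in for the metadata matching the project actually uses.

```python
import hashlib
from itertools import zip_longest

def fingerprint(title):
    """Order-insensitive fingerprint: sort the words, then MD5-hash them."""
    normalized = " ".join(sorted(title.lower().split()))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def deduplicate(results):
    """Keep only the first result carrying each fingerprint."""
    seen, unique = set(), []
    for result in results:
        fp = fingerprint(result["title"])
        if fp not in seen:
            seen.add(fp)
            unique.append(result)
    return unique

def overlap(query, title):
    """Fraction of query terms that also occur in the document title."""
    q = set(query.lower().split())
    return len(q & set(title.lower().split())) / len(q) if q else 0.0

def weighted_round_robin(query, result_lists):
    """Promote high-overlap documents within each list, then interleave."""
    reranked = [sorted(lst, key=lambda r: overlap(query, r["title"]),
                       reverse=True)
                for lst in result_lists]
    merged = []
    for group in zip_longest(*reranked):
        merged.extend(r for r in group if r is not None)
    return deduplicate(merged)
```

Note how "tax policy" and "Policy Tax" would hash to the same fingerprint, so only the first of the two survives the merge.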

Partner Wizard

As the quality of the recommended documents increases with the number and diversity of the participating content providers, a component called the Partner Wizard was implemented. Its goal is to simplify the integration of new content providers to the point that non-experts can manage the process without any support from the EEXCESS consortium. This is achieved by a semi-automatic process triggered from a web frontend provided by the EEXCESS consortium. Given a search API, it is relatively easy to obtain search results; the main point is to obtain results that are meaningful and relevant to the user. Since every search service behaves differently, there is no point in treating all services equally: some sort of customization is needed. That’s where the Partner Wizard comes into play. It allows an employee of the new content provider to specify the search API. Afterwards, the wizard submits pre-assembled pairs of search queries to the new service. Each pair is similar but not identical, for example:

  • Query 1: <TERM1> OR <TERM2>
  • Query 2: <TERM1> AND <TERM2>.
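Generating such probing pairs from a term pool is straightforward; the helper below is a hypothetical sketch of that step, not the wizard's actual code.

```python
from itertools import combinations

def query_pairs(terms):
    """For each pair of terms, build one OR and one AND variant, as the
    wizard does when probing a newly registered search API."""
    return [(f"{t1} OR {t2}", f"{t1} AND {t2}")
            for t1, t2 in combinations(terms, 2)]
```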

The result lists generated this way are presented to the user, who has to decide which list contains the more relevant results and suits the query better (see figure). Finally, based on the previous steps, a configuration file is generated for the federated recommender, which then mimics the behavior selected in the wizard. The wizard can be completed within a few minutes and only requires a publicly available search API.

The project started with five initial content providers. Now, thanks to the Partner Wizard, there are more than ten content providers, and negotiations with further candidates are ongoing. Since there are almost no technical issues anymore, legal issues dominate the consultations. As all programs developed within the EEXCESS project are published under open source licenses, the Partner Wizard can be found at [3].


The EEXCESS project is about injecting distributed content from different cultural and scientific domains into everyday user environments, so this content becomes more visible and more accessible. To achieve this goal and to establish a network of distributed content providers, apart from the various organizational, conceptual and legal aspects, some specification and engineering of software must be done – not only once, but also with respect to maintaining the technical components. One of the main goals of the project is to establish a community of networked information systems, with a lightweight approach towards joining this network by easily setting up a local partner recommender. Future work will focus on this growing network and the increasing requirements of integrating heterogeneous content via central processing of recommendations.


ZBW German National Library of Economics: In a nutshell: EconBiz Beta Services

planet code4lib - Fri, 2016-06-03 07:56

Author: Arne Martin Klemenz

EconBiz – the search portal for Business Studies and Economics – was launched in 2002 as the Virtual Library for Economics and Business Studies. The project was initially funded by the German Research Foundation (DFG) and is developed by the German National Library of Economics (ZBW) with the support of the EconBiz Advisory Board and cooperation partners. The search portal aims to support research in and teaching of Business Studies and Economics with a central entry point for all kinds of subject-specific information and direct access to full texts [1].

As an addition to the main EconBiz service we provide several beta services as part of the EconBiz Beta sandbox. These developments cover the outcome of research projects ranging from large-scale EU projects to small-scale projects, e.g. in cooperation with students from Kiel University. The beta service sandbox thus aims, on the one hand, to provide a platform for testing new features before they might be integrated into the main service (proof-of-concept development), and on the other hand to provide a showcase for relevant output from related projects.

Details about some exemplary selected beta services are provided in the following.

Current Beta Services Online Call Organizer

Based on the winning idea of an EconBiz ideas competition, the Online Call Organizer (OCO) was developed in cooperation with students from Kiel University.

The OCO is a service based on the EconBiz Calendar of Events, which contains events from all over the world – such as conferences and workshops – that are relevant for economics, business studies and social sciences [2]. At the moment, the calendar of events contains more than 10,000 events in total, including more than 500 future events. The main idea of the OCO is to handle this large amount of event-related information in a better way, with the objective to “never miss a deadline” – such as the registration or submission deadline of a relevant event in a user’s personal area of interest. The main EconBiz Calendar of Events service provides a keyword-based faceted search and detailed information for each event. In addition, the OCO provides a filter mechanism based on the user’s research profile, combined with a notification service based on email and Twitter alerts. The OCO is published as an open beta and can be accessed here:

Technologically, the OCO is implemented based on PHP and JavaScript following the client-server model. The backend server processes, aggregates and transforms information about events retrieved from the EconBiz API which provides access to the EconBiz dataset following a RESTful API design. Besides that, the OCO server handles signup and authentication requests as well as any other user account related actions like changes regarding a user’s research profile or the notification settings. The server functionality is encapsulated by providing an internal API as outlined in Figure 1.

Figure 1: Online Call Organizer - Client Server Architecture Overview

Silex – a micro web framework based on Symfony – provides the basis for the OCO Server API. The communication between OCO Client and the OCO Service API is based on common HTTP methods GET, PUT/POST and DELETE utilizing the JSON format. This allows a comfortable abstraction of the detailed application logic. Likewise, the server implements an additional abstraction layer based on the Doctrine ORM framework – an object-relational mapper (ORM) for PHP – that provides the capability to abstract the database layer.

The OCO server module handles its library dependencies with Composer – a tool for dependency management in PHP. Further functionality is based on the utilization of the following libraries and frameworks:

The core functionality – the alert service itself – is based on a daily cronjob which searches for events matching a user’s research profile by retrieving up-to-date data from the EconBiz API. Users can specify whether they want to be notified by email and/or Twitter alert about events matching their research profile. Alerts are sent ‘X’ days (depending on the account settings) before the submission deadline ends, registration closes or the event actually takes place.
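The daily matching step boils down to a keyword intersection plus a date-window check. The OCO itself is written in PHP; the sketch below is a Python illustration with an invented event format and a simple title-based keyword match.

```python
from datetime import date, timedelta

def due_alerts(events, profile, today, lead_days=14):
    """Return events that match the profile's keywords and whose
    deadline falls within the next `lead_days` days (the user-set 'X')."""
    keywords = {k.lower() for k in profile["keywords"]}
    horizon = today + timedelta(days=lead_days)
    hits = []
    for event in events:
        title_words = set(event["title"].lower().split())
        if title_words & keywords and today <= event["deadline"] <= horizon:
            hits.append(event)
    return hits
```

A cronjob would run this once a day against fresh data from the EconBiz API and hand the hits to the email/Twitter notifiers.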

The corresponding multilingual (English/German) web client is based on PHP and JavaScript. It is kept quite simple, as it mostly provides the possibility to edit basic account settings (email address, password, Twitter profile for alerts) and research profiles in a form-based manner. In addition, it provides a calendar overview of upcoming events based on the FullCalendar jQuery plugin. The communication with the OCO backend server is based on asynchronous requests (AJAX) in JSON format sent to the OCO server API. In addition to the main jQuery and FullCalendar libraries, the PHP-based client skeleton utilizes the following common JavaScript libraries to implement its features:

Some parts of the OCO implementation have already been reused in the EconBiz portal. The full OCO service may be integrated with EconBiz as part of a scheduled reimplementation of the EconBiz Calendar of Events.

Other Beta Services

As part of the EconBiz Beta sandbox we provide several other beta services. On the one hand, we would like to ease the reuse of information provided in EconBiz. The EconBiz API provides full access to the EconBiz dataset, but for those who are not comfortable implementing their own services on top of the API, we provide some basic widgets that can easily be integrated into any website. The widgets come with a configuration and widget-generation dialogue to make the integration as comfortable as possible. Currently, three kinds of widgets are available: a search widget, an event search widget and a bookmark widget. These widgets were initially meant as a comfortable way for EconBiz Partners (see EconBiz Partner Network) to embed data from EconBiz into their websites, but they are now also used in other ZBW services, e.g. to embed lists of expert-selected literature based on the bookmark widget.

On the other hand, Visual Search Interfaces and Visualization Widgets as prototypes from the EEXCESS project are also presented in the EconBiz Beta sandbox. The EEXCESS vision is to unfold the treasure of cultural, educational and scientific long-tail content for the benefit of all users and to facilitate access to this high quality information [16]. One aspect of the project, which is reflected by the EEXCESS prototypes in the EconBiz Beta sandbox, is the visualization of search processes.

EEXCESS Visual Search Interfaces and Visualization Widgets:


With EconBiz Beta we provide a sandbox for services developed in the context of EconBiz. This blog post gave an overview of some selected EconBiz Beta services. Several more services are available here:

If you would like to disseminate a service you created based on the EconBiz API, we would like to publish your development in the EconBiz Beta sandbox. Please get in touch via and provide some information about the application.


Max Planck Digital Library: Citation Trails in Primo Central Index (PCI)

planet code4lib - Thu, 2016-06-02 17:39

The May 2016 release brought an interesting functionality to the Primo Central Index (PCI): The new "Citation Trail" capability enables PCI users to discover relevant materials by providing cited and citing publications for selected article records.

At this time the only data source for the citation trail feature is CrossRef, thus the number of citing articles will be below the "Cited by" counts in other sources like Scopus and Web of Science.

Further information:

District Dispatch: Federal experts to walk libraries through government red tape

planet code4lib - Thu, 2016-06-02 13:12

Ever felt frustrated by the prospect of another unfunded mandate from the federal, local or state government? Get a better understanding of the red tape. Empower yourself, your library and your community by learning to navigate major e-government resources and websites by attending “E-Government Services At your Library: Conquering An Unfunded Mandate,” a conference session that takes place at the 2016 American Library Association (ALA) Annual Conference in Orlando, Fla. During the session, participants will learn how to navigate federal funding regulations.

Learn about taxes, housing, aid to dependent families, social security, healthcare, services to veterans, legal issues facing librarians in e-government and more. The session takes place on Thursday, June 23, 2016, from 1:00-4:00 p.m. in the Orange County Convention Center, room W103A.

Session speakers include Jayme Bosio, government research services librarian at the Palm Beach Library System in West Palm Beach, Fla.; Ryan Dooley, director of the Miami Passport Agency at the U.S. Department of State; and Chris Janklow, community engagement coordinator of the Palm Beach Library System in West Palm Beach, Fla.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Federal experts to walk libraries through government red tape appeared first on District Dispatch.

LibUX: Progress Continues Toward HTML5 DRM

planet code4lib - Thu, 2016-06-02 07:40

It is Thursday, June 2nd. You’re listening to W3 Radio (MP3), your news weekly about the world wide web in under ten minutes.

RSS | Google Play | iTunes


So, hey there! Thanks for giving my new podcast — W3 Radio — a spin. You can help it find its footing by leaving a nice review, telling your friends, and subscribing. Let’s be friends on twitter.

W3 Radio is now available in Google Play and in iTunes. Of course, you can always subscribe to the direct feed

The post Progress Continues Toward HTML5 DRM appeared first on LibUX.

Karen Coyle: This is what sexism looks like, # 3

planet code4lib - Thu, 2016-06-02 05:32
I spend a lot of time in technical meetings. This is no one's fault but my own since these activities are purely voluntary. At the end of many meetings, though, I vow to never attend one again. This story is about one.

There was no ill-preparedness or bad faith on the part of either the organizers or the participants at this meeting. There is, however, reality, and no amount of good will changes that.

This took place at a working meeting that was not a library meeting but at which some librarians were present. At lunch one day, three librarians, myself and two others, all female, were sitting together. I can say that we are all well-known and well seasoned in library systems and standards. You would recognize our names. As lunch was winding down, the person across from us opened a conversation with this (all below paraphrased):

P: Libraries should get involved with the Open Access movement; they are in a position to have an effect.

us: Libraries *are* heavily involved in the OA movement, and have been for at least a decade.

P: (Going on.) If you'd join together you could fight for OA against the big publishers.

us: Libraries *have* joined together and are fighting for OA. (Beginning to get annoyed at this point.)

P: What you need to do is... [various iterations here]

us: (Visibly annoyed now) We have done that. In some cases, we have started an effort that is going forward. We have organizations dedicated to that, we hold whole conferences on these topics. You are preaching to the choir here - these aren't new ideas for us, we know all of this. You don't need to tell us.

P: (Going on, no response to what we have said.) You should set a deadline, like 2017, after which you should drop all journals that are not OA.

us: [various statements about a) setting up university-wide rules for depositing articles; b) the difference in how publishing matters in different disciplines: c) the role of tenure, etc.]

P: (Insisting) If libraries would support OA, publishers like Elsevier could not survive.

us: [oof!]

me: You are sitting here with three professionals with a combined experience in this field of well over 50 years, but you won't listen to us or believe what we say. Why not?

P: (Ignoring the question.) I'm convinced that if libraries would join in, we could win this one. You should...

At this point, I lost it. I literally head-desked and groaned out "Please stop with the mansplaining!" That was a mistake, but it wasn't wrong. This was a classic case of mansplaining. P hopped up and stalked out of the room. Twenty minutes later I am told that I have violated the "civility code" of the conference. I have become the perpetrator of abuse because I "accused him" of being sexist.

I don't know what else we could have done to stop what was going on. In spite of a good ten minutes of us replying that libraries are "on it" not one of our statements was acknowledged. Not one of P's statements was in response to what we said. At no point did P acknowledge that we know more about what libraries are doing than he does, and perhaps he could learn by listening to us or asking us questions. And we actually told him, in so many words, he wasn't listening, and that we are knowledgeable. He still didn't get it.

This, too, is a classic: Catch-22. A person who is clueless will not get the "hints" but you cannot clue them or you are in the wrong.

Thanks to the men's rights movement, standing up against sexism has become abuse of men, who are then the victims of what is almost always characterized as "false accusations". Not only did this person tell me, in the "chat" we had at his request, "I know I am not sexist" he also said, "You know that false accusations destroy men's lives." It never occurred to him that deciding true or false wasn't de facto his decision. He didn't react when I said that all three of us had experienced the encounter in the same way. The various explanations P gave were ones most women have heard before: "If I didn't listen, that's just how I am with everybody." "Did I say I wasn't listening because you are women? so how could it be sexist?" And "I have listened to you in our meetings, so how can you say I am sexist?" (Again, his experience, his decision.) During all of this I was spoken to, but no interest was shown in my experience, and I said almost nothing. I didn't even try to explain it. I was drubbed.

The only positive thing that I can say about this is that in spite of heavy pressure over 20 minutes, one on one, I did not agree to deny my experience. He wanted me to tell him that he hadn't been sexist. I just couldn't do that. I said that we would have to agree to disagree, but apologized for my outburst.

When I look around meeting rooms, I often think that I shouldn't be there. I often vow that the next time I walk into a meeting room and it isn't at least 50% female, I'm walking out. Unfortunately, that meeting room does not exist in the projects that I find myself in.

Not all of the experience at the meeting was bad. Much of it was quite good. But the good doesn't remove the damage of the bad. I think about the fact that in Pakistan today men are arguing that it is their right to physically abuse the women in their home and I am utterly speechless. I don't face anything like that. But the wounds from these experiences take a long time to heal. Days afterward, I'm still anxious and depressed. I know that the next time I walk into a meeting room I will feel fear; fear of further damage. I really do seriously think about hanging it all up, never going to another meeting where I try to advocate for libraries.

I'm now off to join friends and hopefully put this behind me. I wish I could know that it would never happen again. But I get that gut punch just thinking about my next meeting.

Meredith Farkas: Generous hearts and social media shaming

planet code4lib - Wed, 2016-06-01 19:28

When I was young and bold and strong,
Oh, right was right, and wrong was wrong!
My plume on high, my flag unfurled,
I rode away to right the world.
“Come out, you dogs, and fight!” said I,
And wept there was but once to die.

But I am old; and good and bad
Are woven in a crazy plaid.

-From “The Veteran,” by Dorothy Parker


As someone who has been active on social media for the entirety of my professional life and who wrote a book on social media for libraries, things have to be pretty bad for me to be considering taking a hiatus from social media. But I feel like the vitriol, nastiness, and lack of compassion are getting worse and worse and I want no part of it.

My Facebook feed right now is full of (armchair primatologist and parenting expert) friends expressing outrage over the mother of the child who climbed into the gorilla enclosure in Cincinnati and the staffers who decided to kill the gorilla. The glee with which people I think of as compassionate are going after the parents and looking into their lives and background is disturbing. I feel like they must have access to much more information than I do, because I don’t know that the mother was negligent, and having experienced my nephew who was “a runner” when he was little, I know how kids can get away from you in the blink of an eye.

When bad things happen, society always seems to look for someone to blame. Someone to look down on. Someone to judge. Why do we do it? Because it makes us feel better about ourselves. I would never _____. Those people are less [careful, caring, moral, human, etc.] than me. Therefore, nothing bad like that could ever happen to [me/people I love]. And, instead of looking at how this thing can be prevented from happening again, we just want punishment. We want to see someone burned at the stake. That short-term vitriol never seems to lead to long-term improvements that would keep the same thing from happening again.

Were I just seeing it in the larger society, I could maybe just write it off, but I also see it happening in my profession, a profession full of brilliant critical thinkers who sometimes engage in mob mentality online. I have witnessed so many social media take-downs of people in our profession — some small, some quite large and public. More often than not, we are not privy to all the facts, but still blame and shame in ways that cause real damage to people.

I also saw this vitriol come out recently against American Libraries when some content had been changed in an article (without the author’s permission) that was favorable to a partner vendor. Admittedly, this was a bad situation that was not handled well by ALA Publishing, but it spurred on the usual “blame and shame” Twitter cycle that, as usual, did not lead to any meaningful change. What happened as a result? Did their policies or practices change? Is there a committee looking at this? Does anyone know?

What makes me craziest about the “blame and shame” Twitter cycle is that it never seems to lead to meaningful change. The people who are expressing outrage only seem to care for a short time, and not enough to ensure that positive and constructive change comes from all of it. The people who’ve been social media shamed either do everything in their power to disappear from the world or otherwise write off the people on social media as lunatics who can’t be taken seriously. Either way, again, no meaningful change or learning comes from it.

I used to see the world more in black and white and get riled up over things that now seem inconsequential. I used to be more judgmental. I am ashamed of some of my blog posts from long ago. As you see people in your life struggle, and as you struggle, you realize that nothing is quite that simple. An action you may have judged in the past rarely happens in isolation, and there may be some very good reasons why what happened did that you are unaware of.

Now, instead of rushing to judge, I try to understand. I try to put myself in their shoes. I remind myself that I’m not perfect; that I’m not immune from making mistakes or from bad things happening. In his recent post on ACRLog, Ian McCullough talks about having a “generous heart” when it comes to other people’s failings. I really like that. At one point, my friend, Josh Neff, called it “charitable reading,” but it applies to our physical lives as much as our digital. When we jump on the Twitter rage train against someone, we are forgetting that they are fully-formed human beings with complicated lives and emotions and desires who are not all bad or all good. And when we do that to people in our profession (which sometimes seems very small), it feels particularly egregious and short-sighted.

I’ve made mistakes. I’ve made bad decisions. I’ve been the “bad guy.” I’ve done things in my life that I never thought I’d do, and a big part of that is because I’d never anticipated being in the situations I was in. Life is unpredictable. People who go through terrible things are usually blindsided by them. And you don’t really know how you will respond until you’re in the situation. There isn’t a roadmap. To me, the key is learning and growing from the experience. And I think it’s hard to learn or reflect when you are in fight-or-flight mode because you’re being excoriated by people around you who probably don’t know the whole story.

But when you fall down, you see who your real friends are. You see who judges you, who stands back and holds you at arm’s length, and who is there for you. You find out who sees you as the sum of your parts instead of just one thing you did or said. I’ve been through several difficult chapters in my life that have made it clear to me who I can count on, and it is a powerful lesson. I only hope I can be there for those people in the same way when they need me.

Embracing nuance is hard sometimes. It’s so much easier to say “vendors are evil” than to admit that the people working at those companies are human beings, some of whom actually want to do (or are doing) good things for libraries. It’s so much easier to destroy an editor at ALA who made a poor decision than to work with ALA to make sure that never happens again. It’s so much easier to jump on the rage train when someone on Twitter is getting flogged for a comment they’d thought innocuous than to try and be kind and constructive.

Letting go of all that piss and vinegar and moral superiority feels good, at least for me. It’s freeing to recognize our own humanity and the humanity of those around us. We’re all flawed human beings trying to make our way through the world with (mostly) good intentions. We don’t all value the same things. We all sometimes feel schadenfreude; it’s an inescapable part of our reality TV-loving society. But assuming the best in people and helping them when they fall down feels a lot better than tearing them down… well, at least in the long term.


LITA: Jobs in Information Technology: June 1, 2016

planet code4lib - Wed, 2016-06-01 18:35

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Queensborough Community College (CUNY), Assistant Professor (Librarian) – Head of Reference Library, Bayside, NY

California State University, Dominguez Hills, Liaison-Systems Librarian, Carson, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

DPLA: Open, Free, and Secure to All: DPLA Launches Full Support for HTTPS

planet code4lib - Wed, 2016-06-01 17:12

DPLA is pleased to announce that the entirety of our website, including our portal, exhibitions, Primary Source Sets, and our API, is now accessible using HTTPS by default. DPLA takes user privacy seriously, and the infrastructural changes that we have made to support HTTPS allow us to extend this dedication further and become signatories of the Library Digital Privacy Pledge of 2015-2016, developed by our colleagues at the Library Freedom Project. The changes we’ve made include the following:

  • Providing HTTPS versions of all web services that our organization directly controls (including everything under the domain), for both human and machine consumption,
  • Automatic redirection for all HTTP requests to HTTPS, and
  • A caching thumbnail proxy for items provided by the DPLA API and frontend, which serves the images over HTTPS instead of providing them insecurely.

After soft-launching HTTPS support at DPLAFest 2016, DPLA staff has done thorough testing, and we are fairly confident that all pages and resources should load over HTTPS with no issues. If you do encounter any problems, such as mixed content warnings or web resources not loading properly, please contact us with the subject “Report a problem with the website” and describe the problem, including links to the pages on which you see the problem.
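One quick way to spot-check for the mixed-content problems mentioned above is to scan a page's HTML for sub-resources still loaded over plain http://. Here is a minimal Python sketch; the page and URLs below are made up for illustration, and real browsers apply richer mixed-content rules than this:

```python
from html.parser import HTMLParser

class MixedContentScanner(HTMLParser):
    """Collect http:// sub-resource URLs that would trigger mixed-content warnings."""

    # Attributes that load sub-resources (stylesheets, scripts, images, etc.)
    RESOURCE_ATTRS = {"src", "href", "data"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.RESOURCE_ATTRS and value and value.startswith("http://"):
                # A plain anchor is navigation, not a sub-resource; skip it.
                if tag == "a" and name == "href":
                    continue
                self.insecure.append((tag, value))

# Made-up page: one insecure stylesheet, one secure script, one insecure image.
page = """
<html><head>
<link rel="stylesheet" href="http://cdn.example.org/site.css">
<script src="https://example.org/app.js"></script>
</head><body>
<img src="http://example.org/thumb.jpg">
<a href="http://example.org/ok-to-link">a link</a>
</body></html>
"""

scanner = MixedContentScanner()
scanner.feed(page)
for tag, url in scanner.insecure:
    print(f"insecure {tag}: {url}")
# prints:
# insecure link: http://cdn.example.org/site.css
# insecure img: http://example.org/thumb.jpg
```

Running this against the sample flags the stylesheet and the image but not the anchor, since links a user navigates to do not count as mixed content.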

These changes are just the start, however. To ensure better privacy, DPLA encourages both its Hubs and their partners to provide access to all of their resources and web services over HTTPS, and to join DPLA in becoming a signatory of the Library Digital Privacy Pledge. By working together, we can achieve a national network of cultural heritage resources that are open, free, and secure to all. Please join me in thanking Mark Breedlove, Scott Williams, and the rest of the DPLA Technology Team for making this possible.

Featured image: Patent for Padlock by Augustus Richards, Sr., 1889, UNT Libraries Government Documents Department, Portal to Texas History.

LITA: Managing iPads – The Configurator

planet code4lib - Wed, 2016-06-01 15:00

We’ve talked in the past about having iPads in the library and how to buy multiple copies of an app at the same time. This is a long-delayed post about the tool we use at my library to manage the devices.

At Waukesha Public Library we use Apple’s Configurator 2. This is a great solution if you have up to forty or fifty devices to manage. Beyond that it gets unwieldy (although I’ll talk about a way you could use the Configurator for more devices). We have two dozen or so iPads we manage this way so it works perfectly for us.

You can see in the photo above all our iPads connected to the Configurator. It even gives you an idea of what each iPad’s desktop looks like; you can upload your own image to be loaded on each device if you want to brand them for your library. When you connect the iPads you get a great status overview: the OS version, the specific device model, its capacity, and whether it has software updates, among other things.

Across the top are several choices of how to interact with the iPads: you can prepare, update, back up, or tag. Prepare is used when the iPad is first configured for use in the Configurator. This gives you the option of supervising the devices so that you can control how they get updated and what networks they have access to. If you’re going to circulate iPads you don’t want to supervise them because it will set limits on how the public can use them. If you’re using iPads only in the library—as we are—then you should supervise them so that you can guarantee that they work in your network. We mostly use update which gives supervised iPads the option of doing an OS update, an app update, or both an OS and app update (depending on what the devices need).

OS updates go very quickly. Usually it takes about a half hour to update 22 iPads. You might need to interact with each device after an OS update—to set a passcode (you can reset a device’s passcode through the Configurator which is great when someone changes it or forgets what they set it as), enable location services, etc.—so just budget that into the time you need to get the devices ready for use.

App updates have been a different beast for us. We were on a monthly update cycle, which is perhaps not often enough. We found that if we updated all the apps on a single device, or if we updated a single app on all devices, the process went quickly. If we tried to update all apps on all devices, it tended to hang and time out. We’re doing updates more frequently now so we’re not running into that problem any longer.

The best thing you can do with the Configurator is create profiles. There are a lot of settings to which the Configurator gives you access. This includes blocking in-app purchases, setting the WiFi network, enabling content filters, setting up AirPlay or AirPrint, and more. Basically anything you can control under an iPad’s settings outside of downloaded apps you can set using the Configurator and put into a profile.

This way if there are forty iPads for children’s programming, twenty iPads for teens, and thirty iPads the public checks out, each one could have its own profile and its own settings. In this way you can manage a lot more than forty or fifty devices. You would manage each profile as an individual group.

If you want to be able to push out updates to devices wirelessly, you can consider Apple’s Mobile Device Management. You can host your MDM services locally—which requires a server—or host them in the cloud. For us it made sense to use the Configurator and update devices by connecting to them since they are kept in a single cart. Our local school district, as I’ve mentioned before, provides an iPad to all students K-12 so they use JAMF’s Casper Suite (a customized solution) to manage their approximately 15,000 devices.


DPLA: Primary Source Sets for Education Now Accessible in PBS LearningMedia

planet code4lib - Wed, 2016-06-01 14:50

We are thrilled to announce that all of our 100 Primary Source Sets for education are now accessible in PBS LearningMedia, a leading destination for high-quality, trusted digital education resources that inspire students and transform learning.

This announcement comes as the result of the partnership between DPLA and PBS LearningMedia announced at last year’s DPLAfest and the work of our Education team and Advisory Committee to create a collection of one hundred primary source sets over the last ten months. We are excited to have the opportunity to bring together PBS LearningMedia’s unparalleled media resources and connections to the world of education and lifelong learning with DPLA’s vast and growing storehouse of openly available education materials, cultural heritage collections, and community of librarians, archivists, and curators.

Together with PBS LearningMedia, we hope that by providing access to the DPLA Primary Source Sets to PBS LearningMedia’s broad audience of educators, we can make our rich new resources more accessible and discoverable for all, while introducing educators across the country to DPLA and the cultural heritage collections contributed by our growing network of hubs and contributing institutions. 

Within PBS LearningMedia, educators will be able to access, save, and combine DPLA’s education resources with more than 100,000 videos, images, interactives, lesson plans and articles drawn from critically acclaimed PBS programs such as Frontline and American Experience and from expert content contributors like NASA.  Teachers also have the option of navigating within the DPLA resources; from our collection page,  educators can explore by core subject areas, such as US history, literature, arts, and science and technology, as well as themes like migration and labor history and groups including African Americans and women.

Mark E. Phillips: Comparing Web Archives: EOT2008 and EOT2012 – When

planet code4lib - Wed, 2016-06-01 14:12

In 2008 a group of institutions made up of the Internet Archive, Library of Congress, California Digital Library, University of North Texas, and Government Publishing Office worked together to collect the web presence of the federal government in a project that has come to be known as the End of Term Presidential Harvest 2008.

Working together this group established the scope of the project, developed a tool to collect nominations of URLs important to the community for harvesting, carried out a harvest of the federal web presence before the election, after the election, and after the inauguration of President Obama. This collection was harvested by the Internet Archive, Library of Congress, California Digital Library, and the UNT Libraries.  At the end of the EOT project the data harvested was shared between the partners with several institutions acquiring a copy of the complete EOT dataset for their local collections.

Moving forward four years, the same group got together to organize the harvesting of the federal domain in 2012.  While originally scoped as a way of capturing the transition of the executive branch, this EOT project also served as a way to systematically capture a large portion of the federal web on a four-year calendar.  In addition to the 2008 partners, Harvard joined the project for 2012.

Again the team worked to identify in-scope content to collect; this time, however, the content also included URLs from the social web, including Twitter and Facebook, for agencies, offices, and individuals in the federal government.  Because the 2012 election did not result in a change of administration, there was just a single set of crawls during the fall of 2012 and the winter of 2013.  Again this content was shared between the project partners interested in acquiring the archives for their own collections.

The End of Term group is a loosely organized group that comes together every four years to conduct the harvesting of the federal web presence. As we ramp up for the end of the Obama administration, the group has started to plan the EOT 2016 project, with a goal to start crawling in September of 2016.  This time there will be a new president, so the crawling will probably follow the format of the 2008 crawls, with pre-election, post-election, and post-inauguration sets of crawls.

So far there hasn’t been much in the way of analysis to compare the EOT2008 and EOT2012 web archives.  There are a number of questions that have come up over the years that remain unanswered about the two collections.  This series of posts will hopefully take a stab at answering some of those questions and maybe provide better insight into the makeup of these two collections.  Finally there are hopefully a few things that can be learned from the different approaches used during the creation of these archives that might be helpful as we begin the EOT 2016 crawling.

Working with the EOT Data

The dataset that I am working with for these posts consists of the CDX files created for the EOT2008 and EOT2012 archive.  Each of the CDX files acts as an index to the raw archived content and contains a number of fields that can be useful for analysis.  All of the archive content is referenced in the CDX file.

If you haven’t looked at a CDX file in the past, here is an example.

gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:martinelli,%20giovanni&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005312 text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - 533010532
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:schumann-heink,%20ernestine&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005219 text/html 200 EL5OT5NAXGGV6VADBLNP2CBZSZ5MH6OT - - - 531160983
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:scotti,%20antonio&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005255 text/html 200 SEFDA5UNFREPA35QNNLI7DPNU3P4WDCO - - - 804325022
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:farrar,%20geraldine&fq[1]=take_vocal_id:viafora,%20gina&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125005309 text/html 200 EV6N3TMKIVWAHEHF54M2EMWVM5DP7REJ - - - 532966964
gov,loc)/jukebox/search/results?count=20&fq[0]=take_vocal_id:homer,%20louise&fq[1]=take_composer_name:campana,%20f.%20&page=1&q=geraldine%20farrar&referrer=/jukebox/ 20121125070122 text/html 200 FW2IGVNKIQGBUQILQGZFLXNEHL634OI6 - - - 661008391

The CDX format is a space-delimited file with the following fields:

  • SURT formatted URI
  • Capture Time
  • Original URI
  • MIME Type
  • Response Code
  • Content Hash (SHA1)
  • Redirect URL
  • Meta tags (not populated)
  • Compressed length (sometimes populated)
  • Offset in WARC file
  • WARC File Name
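To make those fields concrete, here is a minimal Python sketch of parsing one record into a dict. The record below is invented for illustration, and the field order is taken from the list above as an assumption; real CDX files typically declare their layout in a header line:

```python
# Field order taken from the list above; real CDX files usually declare
# their layout in a header line, so treat this ordering as an assumption.
FIELDS = ["surt", "timestamp", "original", "mimetype", "status",
          "digest", "redirect", "meta", "length", "offset", "warc"]

def parse_cdx_line(line, fields=FIELDS):
    """Split one CDX record into a dict, mapping '-' placeholders to None."""
    tokens = line.split()
    # Some CDX variants omit trailing fields; zip pairs up whatever is present.
    record = dict(zip(fields, tokens))
    return {k: (None if v == "-" else v) for k, v in record.items()}

# An invented record for illustration (not from the EOT archives).
line = ("gov,loc)/jukebox/ 20121125005312 http://www.loc.gov/jukebox/ "
        "text/html 200 LFN2AKE4D46XEZNOP3OLXG2WAPLEKZKO - - - "
        "533010532 EOT2012-001.warc.gz")
rec = parse_cdx_line(line)
print(rec["timestamp"], rec["mimetype"], rec["status"])
# prints: 20121125005312 text/html 200
```

Because '-' placeholders become None, downstream code can simply test fields like `rec["redirect"]` for truthiness.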

The tools I’m working with to analyze the EOT datasets will consist of Python scripts that either extract specific data from the CDX files where it can be further sorted and counted, or they will be scripts that work on these sorted and counted versions of files.

I’m trying to post code and derived datasets in a GitHub repository called eot-cdx-analysis if you are interested in taking a look.  There is also a link to the original CDX datasets there as well.

How much

The EOT2008 dataset consists of 160,212,141 URIs and the EOT2012 dataset comes in at 194,066,940 URIs.  Unfortunately the CDX files that we are working with don’t have consistent size information that we can use for analysis, but the rough sizes are 16TB for EOT2008 and just over 41.6TB for EOT2012.


The first dimension I wanted to look at was when the content was harvested for each of the EOT rounds.  In both cases we all remember starting the harvesting “sometime in September” and then ending the crawls “sometime in March” of the following year.  How close were we to our memory?

For this I extracted the Capture Time field from the CDX file, converted it into a date (yyyy-mm-dd seemed a decent bucket to group by), and then sorted and counted each instance of a date.
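That extraction step can be sketched roughly as follows. The sample records are made up, and I'm assuming the capture time (yyyymmddhhmmss) is the second whitespace-delimited field, per the CDX layout described above:

```python
from collections import Counter

def harvest_dates(cdx_lines):
    """Count captures per yyyy-mm-dd day, using the CDX capture time field."""
    counts = Counter()
    for line in cdx_lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue
        ts = tokens[1]  # capture time (yyyymmddhhmmss) is the second field
        counts[f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"] += 1
    return counts

# Made-up records, trimmed to the fields that matter here.
sample = [
    "gov,loc)/a 20121125005312 text/html 200 HASH - - - 1",
    "gov,loc)/b 20121125070122 text/html 200 HASH - - - 2",
    "gov,loc)/c 20130301120000 text/html 200 HASH - - - 3",
]
counts = harvest_dates(sample)
for date, n in sorted(counts.items()):
    print(date, n)
# prints:
# 2012-11-25 2
# 2013-03-01 1
```

The sorted date/count pairs are exactly what feeds the harvest-date charts below.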

EOT2008 Harvest Dates

This first chart shows the harvest dates contained in the EOT2008 CDX files.  Things kicked off in September 2008 and apparently concluded all the way in October 2009, with another blip of activity in May 2009.  These two sets of crawling after March 2009, when we all seem to remember crawling stopping, are probably something to go back and look at.

EOT2012 Harvest Dates

The EOT2012 crawling started off in mid-September and this time finished up in the first part of March 2013.  There is a more consistent shape to the crawling for this EOT with a pretty consistent set of crawling happening between mid-October and the end of January.

EOT2008 and EOT2012 Harvest Dates Compared

When you overlay the two charts you can see how the two compare.  Obviously the EOT2008 data continues quite a bit further than the EOT2012 but where they overlap you can see that there were different patterns to the collecting.


This is the first of a few posts related to web archiving and specifically to comparing the EOT2008 and EOT2012 archives.  We are approaching the time to start the EOT2016 crawls and it would be helpful to have more information about what we crawled in the two previous cycles.

In addition to just needing to do this work there will be a presentation on some of these findings as well as other types of analysis at the 2016 Web Archiving and Digital Libraries (WADL) workshop that is happening at the end of JCDL2016 this year in Newark, NJ.

If there are questions you have about the EOT2008 or EOT2012 archives please get in contact with me and we can see if we can answer them.

If you have questions or comments about this post,  please let me know via Twitter.

DuraSpace News: NEW Video Update from DSpace User Interface Initiative

planet code4lib - Wed, 2016-06-01 00:00

From Tim Donohue, DSpace Tech Lead

Austin, TX  Here's the latest video update on the Angular 2 UI prototype, demoing what we've accomplished in the last two weeks:

LITA: Transmission #5

planet code4lib - Tue, 2016-05-31 20:11

In this action-packed fifth installment, Begin Transmission is joined by the inimitable Leanne Mobley. She’s a LITA blogger, Scholarly Technologies Librarian at Indiana University, and a makerspace proponent.

Stay tuned for another Transmission, Monday, June 13th!

NYPL Labs: National Photo Month at the Digital Imaging Unit

planet code4lib - Tue, 2016-05-31 19:22

The Digital Imaging Unit at The New York Public Library is an extraordinary place filled with talented artists and photographers who are dedicated to providing the public with images from the library’s special collections. I’m particularly fond of the notion that the work we do helps to release information from the page and put it at the fingertips of a new kind of internet-connected public library patron. This flow of information and its impact also has a reverse component, for we often and unexpectedly find ourselves transformed in the process.

Interacting with the special collections materials in the way we do, carefully and expertly handling the rarest and most fragile artifacts of our shared cultural heritage, and putting these objects in front of the highest-resolution cameras available reveals details and moments that inevitably stop us in our tracks. We see a person in a window looking back at the camera, an erasure, inky fingerprints on the back of a manuscript, the otherworldly skill and precision required to accomplish a particular drawing or print, or we pause in front of the overwhelming beauty of an object. We find ourselves seeing the objects, photography, the world, and ultimately ourselves differently after these encounters. As professional photographers, nothing brings us more pleasure than to be faced with the prints of photographic luminaries and to be able to attend to their translation into the networked landscape.

Here are a few highlights from our most beloved encounters with the library’s photo collections that we’ve seen along the way. —Eric Shows, Digitization Services Manager

Sharecroppers' children on Sunday, near Little Rock, Arkansas. Image ID: 5251626
Arkansas sharecropper. Image ID: 5338462
Waiting for relief agent, Scott's Run, Monongalia County, West Virginia. Image ID: 5326678

I was very fortunate to handle most of the Library’s collection of Ben Shahn’s Depression-era FSA photographs. Shahn was primarily a painter and illustrator, which I think made him uninhibited behind the camera, but also very observant. He photographed his subjects in such a thoughtful way that they do not come across as victims from a bygone era, but as real and relatable people. —Martin Parsekian, Collections Photographer

George Avakian and Anahid Ajemian during the Ajemian sisters' first European tour. Image ID: 5665559
George Avakian recording Sidney Bechet. Image ID: 5649287
Earliest photo of George Avakian, with his parents in Tiflis. Image ID: 5649237

I selected these images because I find it fascinating that we can appreciate and witness through them the life, work, and legacy of American music producer/writer George Avakian. I think it is great that not only was he recognized for playing a major role in the development of jazz, but also for impacting the lives of many great artists through his work as a music producer. —Jenny Jordan, Collections Photographer

Joan Mitchell. Image ID: 5154611
Helen Frankenthaler. Image ID: 5154371
Willem DeKooning. Image ID: 5154297

I selected photographs of Willem DeKooning, Helen Frankenthaler, and Joan Mitchell by Walter Silver. These images capture the people behind iconic New York School paintings. The casual studio shots help me to imagine living and working with the abstract expressionists in the 1950s New York. —Rebecca Baldwin, Collections Photographer

From forest to mill. Image ID: 110149
View of Log-raft. Columbia River. Image ID: G91F306_021F
Sawing timber. Image ID: 110132

I am always amazed by how we do what we do, for better, for worse, all the while. —Steven Crossot, Assistant Manager, Digitization Services

Housetop life, Hopi. Image ID: 433127
“Mobile anti-aircraft searchlights…” Image ID: 5111981
Storm. Image ID: 5147219

These three images represent the peculiarity of the library’s photographic collection. The romance and exoticism of Edward Curtis’ images of Native Americans from the early 20th century, official U.S. government press photos of military from WWI, and an annotated work print from the Walter Silver collection, with unintentionally ironic subtext. —Adam Golfer, Collections Photographer

Blossom Restaurant, 103 Bowery, Manhattan. Image ID: 482799, New York Public Library
Testing meats at the Department of Agriculture. Beltsville, Maryland. Image ID: 3999921
Scott's Run mining camps near Morgantown, West Virginia. Domestic interior. Shack at Osage. Image ID: 5233691

These are a few of the photographs that have stuck with me over the years, by Berenice Abbott, Carl Mydans, and Walker Evans. Whether capturing the graphic signage and intensity of expression on people’s faces, the oddity of a testing scene or the subtle beauty and pride portrayed through a domestic scene, they all resonate with me in different ways. —Pete Riesett, Head Photographer

Chicken Market, 55 Hester Street, Manhattan. Image ID: 482844
Bread Store, 259 Bleecker Street, Manhattan. Image ID: 482591
Pingpank Barber Shop, 413 Bleecker Street, Manhattan. Image ID: 482595

I love exploring the NYPL's photography collections because of the historical and pictorial relevance of the works they hold, including Berenice Abbott's Changing New York—a series of stunning, iconic black and white photographs of the "old" city. Abbott was an extraordinarily skilled architectural photographer, but I especially enjoy her methodical documentation of storefronts as an integral part of the city, featuring visually glorious layers of texture and content. —Allie Smith, Collections Photographer

Tiger Man: Animal Graffiti, 14 St. Image ID: 5038710
Pretty Long Haired Woman Looking Up: Pretty Man in White Polo Looking Down. Image ID: 5038738
Woman Through Graffiti Window is Caught Unaware By Camera: Crowded Group in Car, Woman in Fur Coat and Wool Hat Looks at Camera. Image ID: 5038756

These are just a few images in a wonderful series by photographer Alen MacWeeney that were taken in 1977 in the NYC subway. At first glance I love these photographs because of how cool and stylized they look, depicting the graffiti-covered New York of the 1970s. Then you peer in closer and you see how MacWeeney added his own twist by pairing two separate images to create diptychs, which at first sight might appear to be one image. That creates an interesting narrative between the cast of characters. An added plus with these images is that no one is typing away on their phone, and the subway doesn’t appear as crowded and jam-packed as it is today. But there are things in the photos that never change on the subway and are timeless. —Marietta Davis, Collections Photographer

Jonathan Rochkind: GREAT presentation on open source development

planet code4lib - Tue, 2016-05-31 18:03

I highly recommend Schneem’s presentation on “Saving Sprockets”, which he has also turned into a written narrative. Not so much for what it says about Sprockets, but for what it says about open source development.

I won’t say I agree with 100% of it, but probably 85%+, and some of the stuff I agree with is really important and useful, and Schneem’s analyzes what’s going on very well and figures out how to say it very well.

Some of my favorite points:

“To them, I ask: what are the problems? Do you know what they are? Because we can’t fix what we can’t define, and if we want to attempt a re-write, then a re-write would assume that we know better. We still have the same need to do things with assets, so we don’t really know better.”

A long term maintainer is really important, coders aren’t just inter-changeable widgets:

“While I’m working on Sprockets, there’s so many times that I say “this is absolutely batshit insane. This makes no sense. I’m going to rip this all out. I’m going to completely redo all of this.” And then, six hours later, I say “wow, that was genius,” and I didn’t have the right context for looking at the code. Maintainers are really historians, and these maintainers, they help bring context. We try to focus on good commit messages and good pull requests. Changelog entries. Please keep a changelog, btw. But none of that compares to having someone who’s actually there. A story is worth 1000 commit messages. For example, you can’t exactly ask a commit message a question, like, “hey, did you consider trying to uh…” and the commit message is like, “uh, I’m a commit message.” It doesn’t store the context about the conversations around that”

“These are all different people with very different needs who need different documentation. Don’t make them hunt down the documentation that they need. When I started working on Sprockets, somebody would ask, “is this expected?” and I would say honestly, “I don’t know, you tell me. Was it happening before?” And through doing that research, I put together some guides, and eventually we could definitively say what was expected behavior. The only way that I could make those guides make sense is if I split them out, and so, we have a guide for “building an asset processing framework”, if you’re building the next Rails asset pipeline, or “end user asset generation”, if you are a Rails user, or “extending Sprockets” if you want to make one of those plugins. It’s all right there, it’s kind of right at your fingertips, and you only need to look at the documentation that fits your use case, when you need it.

We made it easier for developers to find what they need. Also, it was a super useful exercise for me as well. One thing I love about these guides is that they live in the source and not in a wiki, because documentation is really only valid for one point in time.”

I also really like the concept that figuring out how to support or fix someone else’s code (which is really all ‘legacy’ means) is an exercise in a sort of code archeology.  I’ve been doing that a lot lately.  The same goes for figuring out how to use someone else’s code that isn’t documented sufficiently.  It’s sort of fun sometimes, but better to have better docs.

Filed under: General

District Dispatch: OITP welcomes summer intern

planet code4lib - Tue, 2016-05-31 15:34

Brian Clark

On June 6, Brian M. Clark will begin an internship with ALA’s Office for Information Technology Policy (OITP) for the summer. Brian just completed his junior year at Elon University in North Carolina where he is majoring in media analytics and minoring in business administration. At Elon, Brian has completed coursework in web and mobile communications, creating multimedia content, applied media/data analytics, media law and ethics, statistics, economics, finance, marketing, management, and accounting.

Not surprisingly, Brian’s projects this summer will focus on social media and the web generally and how ALA can better leverage communications technologies to achieve more effective policy advocacy. In addition, Brian will participate in selected D.C. activities and ALA meetings to develop some appreciation of public policy advocacy and lobbying. Activities include attending hearings of the Congress and/or federal regulatory agencies and attending events of think tanks and advocacy groups. Such participation might be in conjunction with ALA’s Google Policy Fellow Nick Gross, who will also be in residence this summer.

Please join me in welcoming Brian to ALA, the Washington Office, the realm of information policy, and libraryland.

The post OITP welcomes summer intern appeared first on District Dispatch.

District Dispatch: Bring federal job training funding home to your library

planet code4lib - Tue, 2016-05-31 14:28

Jobseekers at Redding Library job fair.

Do you know how to secure funding for job training services and programs in your library? Learn how to secure workforce support funding for your library at this year’s 2016 American Library Association (ALA) Annual Conference in Orlando, Fla. at the Washington Update session “Concrete Tips to Take Advantage of Workforce Funding.” During the session, a panel of library and workforce leaders will discuss best practices for supporting jobseekers. The session takes place on Saturday, June 25, 2016, from 8:30-10:00 a.m. in the Orange County Convention Center, room W303.

Session participants will learn about effective job training from two different panel discussions and discuss activities, classes and programs they can offer in their own libraries. Conference session attendees will also discuss new workforce support opportunities as the federal government rolls out the new Workforce Innovation and Opportunity Act (WIOA). The U.S. Department of Labor expects to release Workforce Innovation and Opportunity Act regulations on June 30, 2016.

The program will include a number of dynamic speakers, including Mimi Coenen, chief operating officer of CareerSource Central Florida; Tonya Garcia, director of the Long Branch Public Library in New Jersey; Stephen Parker, legislative director of the National Governors Association; Alta Porterfield, Delaware Statewide Coordinator of the Delaware Libraries Inspiration Space; and Renae Rountree, director of the Washington County Public Library in Florida.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Bring federal job training funding home to your library appeared first on District Dispatch.

Islandora: Guest Blog: So you want to get started working on CLAW?

planet code4lib - Tue, 2016-05-31 13:54

We have a very exciting entry for this week's Islandora update: a guest blog from Islandora CLAW sprinter Ben Rosner (Barnard College). Ben has been following along with the project for a while now and started his first sprint last month. He has been an awesome addition to the team and he was kind enough to share his experiences as a 'newbie' to CLAW, and explain how you can join in too:

So you want to get started working on CLAW?

Lessons learned from a first-time contributor, hopefully to ease your transition into the land of CLAW.

Foremost, before any listing of resources, pages, and things to understand: please know that IRC is the place to be. Even if the channel is quiet, someone is ALWAYS lurking and will help you through any issue. Without the #islandora IRC channel, I'm not sure this beginner would have had the fortitude to stick with what can sometimes be challenging, but very rewarding, work. Also, the weekly CLAW calls are great if you have the time, even if just to listen in. I didn't understand a flippin' thing the first time I joined, but I stuck with it as time allowed and it has paid off. Lastly, check the Google Groups (both islandora and islandora-dev) just so you can get a 'feel' for what's going on in the community. 

The stuff to bookmark now, and remember for later!

If you're going to work on or with the microservices, you need this guide: READ THIS GUIDE, LEARN THIS GUIDE, LIVE THIS GUIDE. curl so many times it hurts, then some more. While you're doing all this 'curling', watch and explore http://localhost:8080/fcrepo/rest/, and query Blazegraph with a simple SELECT * WHERE { ?s ?p ?o }. What's happening? Why? HOW!? Is that a camel? No way, get out, there's a camel? Yup. We've got that.
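For the curious, the kind of 'curling' described above might look something like the sketch below. This is an assumption-laden illustration, not official CLAW documentation: the Fedora path matches what the post mentions, but the Blazegraph endpoint path and the example container name (my-test-object) are guesses that may differ in your install.

```shell
# Browse the Fedora 4 REST endpoint (the Vagrant exposes it on :8080):
curl -i http://localhost:8080/fcrepo/rest/

# Create a container, then fetch it back as Turtle:
curl -X PUT http://localhost:8080/fcrepo/rest/my-test-object
curl -H 'Accept: text/turtle' http://localhost:8080/fcrepo/rest/my-test-object

# Ask Blazegraph for every triple it knows about (the endpoint path is
# an assumption; check your own Blazegraph config):
curl -G http://localhost:8080/bigdata/sparql \
     --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o }' \
     -H 'Accept: application/sparql-results+json'
```

Watching how the triples in Blazegraph change after each PUT is a good way to see the camel (the messaging layer) doing its work.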

Here are some dummy objects I've created to quickly populate my repo that might be handy when you're curling. Note that the instructions in my repo may be outdated as we continue work on the Islandora microservices.

If you've never used Vagrant before, look inside the install folder in the main CLAW repo and follow the README. Note to Windows users: run the command prompt/git shell/whatever and VirtualBox as an ADMIN before typing vagrant up. You've been warned!

Have an idea of the PSR coding standards: when you're not in Drupal land, you'll be using PSR-2 while working on CLAW. Just like in college when you had to write in APA (or any of the lesser styles, muhahaha), PSR-2 is a style guide. Here's the guide, and here's something to fix your code for you (also see below about picking your editor of choice).

There's more?! Diego Pino (@DiegoPino) has been amazing and hosted a series of live learning tutorials that are recorded and available on YouTube. There are links in the main CLAW repository, and here are the slides that were presented (or at least some of 'em).

Finding a comfortable editing space.

Pick your editor of choice. Personally, I'm a fan of Sublime Text 3 with a few plugins, mainly phpfmt/SublimeLinter (for PSR-2 compliance and auto-formatting), BracketHighlighter (distinguishes which block of code you're working on/in), and Markdown Preview (sometimes you want it to look nice before you push that commit). There are a few linters out there, some of them MUCH better than phpfmt (like SublimeLinter), and they work great, but RTM before you get started using them.

If you prefer a full-fledged IDE, PhpStorm is my recommendation; educational users should be able to swing a free license, though do check with them, as I'm not a lawyer... It does your typical IDE stuff (formatting, code completion, IntelliSense, etc.).

If you love yourself some VIM, look through @nruest's repos for his configs and I'm told you'll be all set!

Finally, if you're new to GitHub, don't panic, but learning it is a practical exercise regardless of how many guides you read. Here is one such guide, but go ahead and Google 'github guide' and you'll be inundated with everyone's "best guide to learn GitHub." Depending on your learning style, all of it is rubbish. GitHub is a PRACTICAL application. Just mess around with it and when you don't know something, speak up in IRC! Also, don't let your IDE do your git stuff; the commands are more powerful than an IDE exposes.

Okay enough ranting. Here's my "CLAW story."

A nice way to start contributing is through sprints, which are held during the first (or last, I forget) two weeks of each month. There are a number of open issues/tickets that you can peruse. There is a sprint kick-off call where folks discuss what they want to tackle, and if you're not sure, you can be aimed towards the items that will interest you. I've learned this: EVERY LITTLE BIT HELPS. No task is too small. (See the last CLAW tutorial/lesson for more about sprints and getting involved.)

So I started in Sprint 06, which ended in April 2016. Here is my "journal entry" from then:

Thus far my #CLAW experience has been using and helping update the Vagrant (for Windows compatibility, sigh) and working on the PHP microservices (it's an iterative process; still workin' my first PR).

I'm working primarily on #150 and having some success. Lots of testing using error.log and watching what's happening in Apache/FCRepo/Blazegraph. I think commit 64df023 is gonna crack this nut.

The Vagrant refuses to symlink on Windows machines, and so our Composer architecture, which allows live editing, is failing to work. Since I work in a mixed coding environment (which includes a Windows machine), this is fairly important. It turns out we may need to run VMWare with elevated privs for proper linking to occur. shrug

And my work continued, even outside of any true sprint: small things like creating a .gitconfig to handle line endings properly, resolving other issues with the Vagrant, and small documentation tasks. Again, little things that I could do as I could.
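For reference, a .gitconfig line-ending fix usually amounts to a setting like the one below. This is an assumption about the typical approach for mixed Windows/Unix work, not necessarily the exact configuration used in the sprint:

```ini
[core]
    # On Windows: convert LF to CRLF on checkout, back to LF on commit,
    # so the repository itself stays LF-only.
    autocrlf = true
```

(Projects often prefer pinning this per-repository with a .gitattributes file instead, so contributors don't each have to set it.)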

Then Sprint 007 (I added the extra 0, because who doesn't love a Bond reference?) came along, and I missed the kickoff call due to other obligations. But I asked for a ticket ~ and got one that was like "woot, this is amazing." However! Due to my own overzealousness and over-obligation, I did not get to work on my issue as much as I would have liked :(. And guess what?! No one judged me, everyone understood, and it was just great to be part of the community and realize we really are in it together. I plan to continue my work with CLAW, and to take on as much as I possibly can! It's a great project, with great people (Nick, Diego, Jared - the main committers - are so supportive and collaborative), and something I am happy to be part of.

Ben out!

Wait Ben! Help me get started!

Have you curled yet? Get on IRC! Chat with us! Ping nruest a lot; he loves helping folks get started. Check the issues on GitHub; no one will stop you from trying to tackle one! Just fork and hack away :).

Eric Lease Morgan: Catholic Pamphlets and the Catholic Portal: An evolution in librarianship

planet code4lib - Tue, 2016-05-31 12:34

This blog posting outlines, describes, and demonstrates how a set of Catholic pamphlets were digitized, indexed, and made accessible through the Catholic Portal. In the end it advocates an evolution in librarianship.

A few years ago, a fledgling Catholic pamphlets digitization process was embarked upon. [1] In summary, a number of different library departments were brought together, a workflow was discussed, timelines were constructed, and in the end approximately one third of the collection was digitized. The MARC records pointing to the physical manifestations of the pamphlets were enhanced with URLs pointing to their digital surrogates and made accessible through the library catalog. [2] These records were also denoted as being destined for the Catholic Portal by adding a value of CRRA to a local note. Consequently, each of the Catholic pamphlet records also made its way to the Portal. [3]

Because the pamphlets have been digitized, and because the digitized versions of the pamphlets can be transformed into plain text files using optical character recognition, it is possible to provide enhanced services against this collection, namely, text mining services. Text mining is a digital humanities application rooted in the counting and tabulation of words. By counting and tabulating the words (and phrases) in one or more texts, it is possible to “read” the texts and gain a quick & dirty understanding of their content. Probably the oldest form of text mining is the concordance, and each of the digitized pamphlets in the Portal is associated with a concordance interface.

For example, the reader can search the Portal for something like “is the pope always right”, and the result ought to return a pointer to a pamphlet named Is the Pope always right? of papal infallibility. [4] Upon closer examination, the reader can download a PDF version of the pamphlet as well as use a concordance against it. [5, 6] Through the use of the concordance the reader can see that the words church, bill, charlie, father, and catholic are the most frequently used, and by searching the concordance for the phrase “pope is”, the reader gets a single sentence fragment in the result, “…ctrine does not declare that the Pope is the subject of divine inspiration by wh…” And upon further investigation, the reader can see this phrase is used about 80% of the way through the pamphlet.
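As a rough illustration of what such text mining does under the hood, here is a tiny shell sketch of word-frequency tabulation and a crude keyword-in-context concordance. The sample text and phrase below are made up for demonstration; the Portal's actual concordance interface is, of course, more sophisticated:

```shell
# Create a tiny sample text (a stand-in for an OCR'd pamphlet).
printf 'The doctrine does not declare that the Pope is inspired. The Pope speaks.\n' > pamphlet.txt

# Quick & dirty word frequencies: split on non-letters, lowercase,
# count, and show the most frequent words first.
tr -cs '[:alpha:]' '\n' < pamphlet.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -5

# A crude concordance: show the phrase "pope is" with up to 30
# characters of context on either side.
grep -o '.\{0,30\}[Pp]ope is.\{0,30\}' pamphlet.txt
```

The same counting-and-context idea, scaled up and given an interface, is what lets a reader "read" a pamphlet's vocabulary and locate a phrase within it.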

The process of digitizing library materials is very much like the workflow of a medieval scriptorium, and the process is well understood. Description of and access to digital versions of original materials are well accommodated by the exploitation of MARC records. The next step for the profession is to move beyond find & get and towards use & understand. Many people can find many things with relative ease. The next step for librarianship is to provide services against the things readers find so they can more easily learn & comprehend. Save the time of the reader. The integration of the University of Notre Dame's Hesburgh Libraries' Catholic Pamphlets Collection into the Catholic Portal is one possible example of how this evolutionary process can be implemented.


[1] digitization process –

[2] library catalog –

[3] Catholic Portal –

[4] “Of Papal Infallibility” –

[5] PDF version –

[6] concordance interface –

