planet code4lib


FOSS4Lib Recent Releases: Pazpar2 - 1.12.2

Mon, 2015-08-31 12:43

Last updated August 31, 2015. Created by Peter Murray on August 31, 2015.

Package: Pazpar2
Release Date: Monday, August 31, 2015

SearchHub: Mining Events for Recommendations

Mon, 2015-08-31 10:33
Summary: The “EventMiner” feature in Lucidworks Fusion can be used to mine event logs to power recommendations. We describe how the system uses graph navigation to generate diverse and high-quality recommendations.

User Events

The log files that most web services generate are a rich source of data for learning about user behavior and for modifying system behavior accordingly. For example, most search engines will automatically log details on user queries and the resulting clicked documents (URLs). We can define a (user, query, click, time) tuple that records a unique “event” that occurred at a specific time in the system. Other examples of event data include e-commerce transactions (e.g. “add to cart”, “purchase”), call data records, financial transactions, etc. By analyzing a large volume of these events we can “surface” implicit structures in the data (e.g. relationships between users, queries, and documents), and use this information to make recommendations, improve search result quality, and power analytics for business owners. In this article we describe the steps we take to support this functionality.

1. Grouping Events into Sessions

Event logs can be considered a form of “time series” data, where the logged events are in temporal order. We can then make use of the observation that events close together in time will be more closely related than events further apart. To do this we need to group the event data into sessions. A session is a time window for all events generated by a given source (such as a unique user ID). If two or more queries (e.g. “climate change” and “sea level rise”) frequently occur together in a search session, then we may decide that those two queries are related. The same applies to documents that are frequently clicked on together. A “session reconstruction” operation identifies users’ sessions by processing raw event logs and grouping them by user ID, using the time intervals between successive events.
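The session reconstruction step described above can be sketched roughly as follows. This is a minimal illustration, not Fusion's actual implementation; the event tuple shape and the 30-minute gap threshold are assumptions chosen for the example.

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumed threshold: 30 minutes, in seconds

def sessionize(events):
    """Group (user_id, action, timestamp) events into per-user sessions.

    An event starts a new session when its gap from the previous event
    by the same user exceeds SESSION_GAP.
    """
    by_user = defaultdict(list)
    for user_id, action, ts in events:
        by_user[user_id].append((ts, action))

    sessions = []
    for user_id, user_events in by_user.items():
        user_events.sort()  # put each user's events in temporal order
        current, last_ts = [], None
        for ts, action in user_events:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append((user_id, current))  # gap too large: close session
                current = []
            current.append(action)
            last_ts = ts
        if current:
            sessions.append((user_id, current))
    return sessions
```

Two queries separated by hours end up in different sessions even though they share a user ID, which is exactly the behavior described below.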
If two events triggered by the same user occur too far apart in time, they will be treated as coming from two different sessions. For this to be possible we need some kind of unique ID in the raw event data that allows us to tell that two or more events are related because they were initiated by the same user within a given time period. However, from a privacy point of view, we do not need an ID that identifies an actual real person with all their associated personal information. All we need is an (opaque) unique ID that allows us to track an “actor” in the system.

2. Generating a Co-Occurrence Matrix from the Session Data

We are interested in entities that frequently co-occur, as we might then infer some kind of interdependence between those entities. For example, a click event can be described using a click(user, query, document) tuple, and we associate each of those entities with each other and with other similar events within a session. A key point here is that we generate co-occurrence relations not just between the same field types, e.g. (query, query) pairs, but also “cross-field” relations, e.g. (query, document) and (document, user) pairs. This gives us an N x N co-occurrence matrix, where N is the number of unique instances of the field types for which we want to calculate co-occurrence relations. Figure 1 below shows a co-occurrence matrix that encodes how many times different characters co-occur (appear together in the text) in the novel “Les Miserables”. Each colored cell represents two characters that appeared in the same chapter; darker cells indicate characters that co-occurred more frequently. The diagonal line going from the top left to the bottom right shows that each character co-occurs with itself. You can also see that the character named “Valjean”, the protagonist of the novel, appears with nearly every other character in the book.

Figure 1. “Les Miserables” Co-occurrence Matrix by Mike Bostock.

In Fusion we generate a similar type of matrix, where each of the items is one of the types specified when configuring the system. The value in each cell will then be the frequency of co-occurrence for any two given items e.g. a (query, document) pair, a (query, query) pair, a (user, query) pair etc.

For example, if the query “Les Mis” and a click on the web page for the musical appear together in the same user session then they will be treated as having co-occurred. The frequency of co-occurrence is then the number of times this has happened in the raw event logs being processed.
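Counting co-occurrence frequencies over sessionized data can be sketched as below. This is an illustrative simplification, assuming each session is just a list of item labels (mixed field types), so cross-field pairs like a query and a clicked document are counted the same way as same-field pairs.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions):
    """Count how often each unordered pair of items appears in the same session.

    Items may be of mixed types (queries, documents, users), so cross-field
    pairs such as (query, document) are counted alongside (query, query) pairs.
    """
    counts = Counter()
    for items in sessions:
        # sorted(set(...)) gives each unordered pair a single canonical key
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts
```

The resulting counts are exactly the cell values of the co-occurrence matrix: the number of sessions in which the two items appeared together.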

3. Generating a Graph from the Matrix

The co-occurrence matrix from the previous step can also be treated as an “adjacency matrix”, which encodes whether two vertices (nodes) in a graph are “adjacent” to each other, i.e. have a link or “co-occur”. This matrix can then be used to generate a graph, as shown in Figure 2:

Figure 2. Generating a Graph from a Matrix.

Here the values in the matrix are the frequency of co-occurrence for those two vertices. We can see that in the graph representation these are stored as “weights” on the edge (link) between the nodes e.g. nodes V2 and V3 co-occurred 5 times together.
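The matrix-to-graph conversion can be sketched as follows; this is a generic adjacency-list construction, not Fusion's internal representation. Each non-zero off-diagonal cell becomes a weighted edge, matching the V2–V3 example above.

```python
def matrix_to_graph(matrix, labels):
    """Convert a symmetric co-occurrence matrix into a weighted adjacency list.

    matrix[i][j] is the co-occurrence frequency of labels[i] and labels[j];
    non-zero off-diagonal entries become edges weighted by that frequency.
    """
    graph = {label: {} for label in labels}
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if i != j and weight > 0:
                graph[labels[i]][labels[j]] = weight
    return graph
```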

We encode the graph structure in a collection in Solr using a simple JSON record for each node. Each record contains fields that list the IDs of other nodes that point “in” at this record, or which this node points “out” to.

Fusion provides an abstraction layer which hides the details of constructing queries to Solr to navigate the graph. Because we know the IDs of the records we are interested in we can generate a single boolean query where the individual IDs we are looking for are separated by OR operators e.g. (id:3677 OR id:9762 OR id:1459). This means we only make a single request to Solr to get the details we need.
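A helper that builds such a query might look like this sketch (the function name is illustrative; the query string format follows the example in the text):

```python
def neighbor_query(ids):
    """Build a single Solr boolean query fetching all node records by ID,
    e.g. (id:3677 OR id:9762 OR id:1459)."""
    return "(" + " OR ".join(f"id:{i}" for i in ids) + ")"
```

Issuing one OR-query per hop keeps the round trips to Solr to a minimum while navigating the graph.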

In addition, the fact that we are only interested in the neighborhood graph around a start point means the system does not have to store the entire graph (which is potentially very large) in memory.

4. Powering Recommendations from the Graph

At query/recommendation time we can use the graph to make suggestions on which other items in that graph are most related to the input item, using the following approach:

  1. Navigate the co-occurrence graph out from the seed item to harvest additional entities (documents, users, queries).
  2. Merge the lists of entities harvested from different nodes in the graph, so that the more lists an entity appears in, the more weight it receives and the higher it rises in the final output list.
  3. Weights are based on the reciprocal rank of the overall rank of the entity. The overall rank is calculated as the sum of the rank of the result the entity came from and the rank of the entity within its own list.
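The merging and weighting steps above can be sketched as a reciprocal-rank fusion. This follows the description in step 3 but is an illustrative reconstruction; Fusion's exact weighting may differ. Both ranks are taken as 1-based here.

```python
from collections import defaultdict

def merge_ranked_lists(lists):
    """Merge ranked entity lists using reciprocal-rank weighting.

    An entity's weight from each list is 1 / (list_rank + entity_rank),
    i.e. the reciprocal of its overall rank: the rank of the source list
    plus the entity's rank within that list. Appearing in several lists
    accumulates weight, pushing the entity up the final output.
    """
    scores = defaultdict(float)
    for list_rank, entities in enumerate(lists, start=1):
        for entity_rank, entity in enumerate(entities, start=1):
            scores[entity] += 1.0 / (list_rank + entity_rank)
    return sorted(scores, key=scores.get, reverse=True)
```

An entity ranked second in the first list and first in the second list (total weight 1/3 + 1/3) beats an entity ranked first in a single list (weight 1/2), which is how multi-list agreement adds diversity to the output.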

The following image shows the graph surrounding the document “Midnight Club: Los Angeles” from a sample data set:

Figure 3. An Example Neighborhood Graph.

Here the relative size of the nodes shows how frequently they occurred in the raw event data, and the size of the arrows is a visual indicator of the weight or frequency of co-occurrence between two elements.

For example, we can see that the query “midnight club” (blue node on bottom RHS) most frequently resulted in a click on the “Midnight Club: Los Angeles Complete Edition Platinum Hits” product (as opposed to the original version above it). This is the type of information that would be useful to a business analyst trying to understand user behavior on a site.

Diversity in Recommendations

For a given item, we may only have a small number of items that co-occur with it (based on the co-occurrence matrix). By adding in the data from navigating the graph (which comes from the matrix), we increase the diversity of suggestions. Items that appear in multiple source lists then rise to the top. We believe this helps improve the quality of the recommendations and reduce bias. For example, in Figure 4 we show some sample recommendations for the query “Call of Duty”, where the recommendations come from a “popularity-based” recommender, i.e. one that gives a large weight to items with the most clicks. We can see that the suggestions are all from the “Call of Duty” video game franchise:

Figure 4. Recommendations from a “popularity-based” recommender system.

In contrast, in Figure 5 we show the recommendations from EventMiner for the same query:

Figure 5. Recommendations from navigating the graph.

Here we can see that the suggestions are now more diverse, with the first two being games from the same genre (“First Person Shooter” games) as the original query.

In the case of an e-commerce site, diversity in recommendations can be an important factor in suggesting items to a user that are related to their original query, but which they may not be aware of. This in turn can help increase the overall CTR (Click-Through Rate) and conversion rate on the site, which would have a direct positive impact on revenue and customer retention.

Evaluating Recommendation Quality

To evaluate the quality of the recommendations produced by this approach, we used CrowdFlower to get user judgements on the relevance of the suggestions produced by EventMiner. Figure 6 shows an example of how a sample recommendation was presented to a human judge:

Figure 6. Example relevance judgment screen (CrowdFlower).

Here the original user query (“resident evil”) is shown, along with an example recommendation (another video game called “Dead Island”). We can see that the judge is asked to select one of four options, which is used to give the item a numeric relevance score:

  1. Off Topic
  2. Acceptable
  3. Good
  4. Excellent
In this example the user might judge the relevance of this suggestion as “good”, as the game being recommended is in the same genre (“survival horror”) as the original query. Note that the product title contains no terms in common with the query, i.e. the recommendations are based purely on the graph navigation and do not rely on an overlap between the query and the document being suggested. In Table 1 we summarize the results of this evaluation:

Table 1. Evaluation results.

  Items   Judgements   Users   Avg. Relevance (1-4)
  1000    2319         30      3.27

Here we can see that the average relevance score across all judgements was 3.27 i.e. “good” to “excellent”.

Conclusion

If you want an “out-of-the-box” recommender system that generates high-quality recommendations from your data, please consider downloading and trying out Lucidworks Fusion.

The post Mining Events for Recommendations appeared first on Lucidworks.

Hydra Project: Michigan becomes the latest Hydra Partner

Mon, 2015-08-31 08:49

We are delighted to announce that the University of Michigan has become the latest formal Hydra Partner.  Maurice York, their Associate University Librarian for Library Information Technology, writes:

“The strength, vibrancy and richness of the Hydra community is compelling to us.  We are motivated by partnership and collaboration with this community, more than simply use of the technology and tools. The interest in and commitment to the community is organization-wide; last fall we sent over twenty participants to Hydra Connect from across five technology and service divisions; our showing this year will be equally strong, our enthusiasm tempered only by the registration limits.”

Welcome Michigan!  We look forward to a long collaboration with you.

Eric Hellman: Update on the Library Privacy Pledge

Mon, 2015-08-31 02:39
The Library Privacy Pledge of 2015, which I wrote about previously, has been finalized. We got a lot of good feedback, and the big changes have focused on the schedule.

Now, any library, organization, or company that signs the pledge will have 6 months from the effective date of its signature to implement HTTPS. This should give everyone plenty of margin to do a good job on the implementation.

We pushed back our launch date to the first week of November. That's when we'll announce the list of "charter signatories". If you want your library, company or organization to be included in the charter signatory list, please send an e-mail to

The Let's Encrypt project will be launching soon. They are just one certificate authority that can help with HTTPS implementation.

I think this is a very important step for the library information community to take, together. Let's make it happen.

Here's the finalized pledge:

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. It’s just a first step: HTTPS is a privacy prerequisite, not a privacy solution. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.
We focus on HTTPS as a first step because of its timeliness. The Let's Encrypt initiative of the Electronic Frontier Foundation will soon launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.
The 3rd article of the American Library Association Code of Ethics sets a broad objective:
We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.

It's not always clear how to interpret this broad mandate, especially when everything is done on the internet. However, one principle of implementation should be clear and uncontroversial: Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.
The current best practice dictated by this principle is as follows: Libraries, and vendors that serve libraries and library patrons, should require HTTPS for all services and resources delivered via the web.
The Pledge for Libraries:
1. We will make every effort to ensure that web services and information resources under direct control of our library will use HTTPS within six months. [ dated______ ]
2. Starting in 2016, our library will assure that any new or renewed contracts for web services or information resources will require support for HTTPS by the end of 2016.
The Pledge for Service Providers (Publishers and Vendors):
1. We will make every effort to ensure that all web services that we (the signatories) offer to libraries will enable HTTPS within six months. [ dated______ ]
2. All web services that we (the signatories) offer to libraries will default to HTTPS by the end of 2016.
The Pledge for Membership Organizations:
1. We will make every effort to ensure that all web services that our organization directly controls will use HTTPS within six months. [ dated______ ]
2. We encourage our members to support and sign the appropriate version of the pledge.
There's a FAQ available, too. All this will soon be posted on the Library Freedom Project website.

Harvard Library Innovation Lab: Link roundup August 30, 2015

Sun, 2015-08-30 17:41

This is the good stuff.

Rethinking Work

When employees negotiate, they negotiate for improved compensation, since nothing else is on the table.

Putting Elon Musk and Steve Jobs on a Pedestal Misrepresents How Innovation Happens

“Rather than placing tech leaders on a pedestal, we should put their successes”

Lamp Shows | HAIKU SALUT

Synced lamps as part of a band’s performance

Lawn Order | 99% Invisible

Jail time for a brown lawn? A wonderfully weird dive into the moral implications of lawncare

Sky-high glass swimming pool created to connect south London apartment complex

Swim through the air

DuraSpace News: Cineca DSpace Service Provider Update

Sun, 2015-08-30 00:00

From Andrea Bollini, Cineca

It has been a hot and productive summer here in Cineca. We have carried out several DSpace activities, together with the go-live of the National ORCID Hub to support the adoption of ORCID in Italy [1][2].

Ed Summers: iSchool

Sat, 2015-08-29 20:05

As you can see, I’ve recently changed things around here. Yeah, it’s looking quite spartan at the moment, although I’m hoping that will change in the coming year. I really wanted to optimize this space for writing in my favorite editor, and to make it easy to publish and preserve the content. Wordpress has served me well over the last 10 years, and until now I’ve resisted the urge to switch over to a static site. But yesterday I converted the 394 posts, archived the Wordpress site and database, and am now using Jekyll. I haven’t been using Ruby as much in the past few years, but the tooling around Jekyll feels very solid, especially given GitHub’s investment in it.

Honestly, there was something that pushed me over the edge to do the switch. Next week I’m starting in the University of Maryland iSchool, where I will be pursuing a doctoral degree. I’m specifically hoping to examine some of the ideas I dredged up while preparing for my talk at NDF in New Zealand a couple years ago. I was given almost a year to think about what I wanted to talk about – so it was a great opportunity for me to reflect on my professional career so far, and examine where I wanted to go.

After I got back I happened across a paper by Steven Jackson called Rethinking Repair, which introduced me to what felt like a very new and exciting approach to information technology design and innovation that he calls Broken World Thinking. In hindsight I can see that both of these things conspired to make returning to school at 46 years of age look like a logical thing to do. If all goes as planned I’m going to be doing this part-time while also working at the Maryland Institute for Technology in the Humanities, so it’s going to take a while. But I’m in a good spot, and am not in any rush … so it’s all good as far as I’m concerned.

I’m planning to use this space for notes about what I’m reading, papers, reflections etc. I thought about putting my citations, notes into Evernote, Zotero, Mendeley etc, and I may still do that. But I’m going to try to keep it relatively simple and use this space as best I can to start. My blog has always had a navel gazy kind of feel to it, so I doubt it’s going to matter much.

To get things started I thought I’d share the personal statement I wrote for admission to the iSchool. I’m already feeling more focus than when I wrote it almost a year ago, so it will be interesting to return to it periodically. The thing that has become clearer to me in the intervening year is that I’m increasingly interested in examining the role that broken world thinking has played in both the design and evolution of the Web.

So here’s the personal statement. Hopefully it’s not too personal :-)

For close to twenty years I have been working as a software developer in the field of libraries and archives. As I was completing my Masters degree in the mid-1990s, the Web was going through a period of rapid growth and evolution. The computer labs at Rutgers University provided me with what felt like a front row seat to the development of this new medium of the World Wide Web. My classes on hypermedia and information seeking behavior gave me a critical foundation for engaging with the emerging Web. When I graduated I was well positioned to build a career around the development of software applications for making library and archival material available on the Web. Now, after working in the field, I would like to pursue a PhD in the UMD iSchool to better understand the role that the Web plays as an information platform in our society, with a particular focus on how archival theory and practice can inform it. I am specifically interested in archives of born digital Web content, but also in what it means to create a website that gets called an archive. As the use of the Web continues to accelerate and proliferate it is more and more important to have a better understanding of its archival properties.

My interest in how computing (specifically the World Wide Web) can be informed by archival theory developed while working in the Repository Development Center under Babak Hamidzadeh at the Library of Congress. During my eight years at LC I designed and built both internally focused digital curation tools as well as access systems intended for researchers and the public. For example, I designed a Web based quality assurance tool that was used by curators to approve millions of images that were delivered as part of our various digital conversion projects. I also designed the National Digital Newspaper Program’s delivery application, Chronicling America, that provides thousands of researchers access to over 8 million pages of historic American newspapers every day. In addition, I implemented the data management application that transfers and inventories 500 million tweets a day to the Library of Congress. I prototyped the Library of Congress Linked Data Service which makes millions of authority records available using Linked Data technologies.

These projects gave me hands on, practical experience using the Web to manage and deliver Library of Congress data assets. Since I like to use agile methodologies to develop software, this work necessarily brought me into direct contact with the people who needed the tools built, namely archivists. It was through these interactions over the years that I began to recognize that my Masters work at Rutgers University was in fact quite biased towards libraries, and lacked depth when it came to the theory and praxis of archives. I remedied this by spending about two years of personal study focused on reading about archival theory and practice with a focus on appraisal, provenance, ethics, preservation and access. I also became a participating member of the Society of American Archivists.

During this period of study I became particularly interested in the More Product Less Process (MPLP) approach to archival work. I found that MPLP had a positive impact on the design of archival processing software since it oriented the work around making content available, rather than on often time consuming preservation activities. The importance of access to digital material is particularly evident since copies are easy to make, but rendering can often prove challenging. In this regard I observed that requirements for digital preservation metadata and file formats can paradoxically hamper preservation efforts. I found that making content available sooner rather than later can serve as an excellent test of whether digital preservation processing has been sufficient. While working with Trevor Owens on the processing of the Carl Sagan collection we developed an experimental system for processing born digital content using lightweight preservation standards such as BagIt in combination with automated topic model driven description tools that could be used by archivists. This work also leveraged the Web and the browser for access by automatically converting formats such as WordPerfect to HTML, so they could be viewable and indexable, while keeping the original file for preservation.

Another strand of archival theory that captured my interest was the work of Terry Cook, Verne Harris, Frank Upward and Sue McKemmish on post-custodial thinking and the archival enterprise. It was specifically my work with the Web archiving team at the Library of Congress that highlighted how important it is for record management practices to be pushed outwards onto the Web. I gained experience in seeing what makes a particular web page or website easier to harvest, and how impractical it is to collect the entire Web. I gained an appreciation for how innovation in the area of Web archiving was driven by real problems such as dynamic content and social media. For example I worked with the Internet Archive to archive Web content related to the killing of Michael Brown in Ferguson, Missouri by creating an archive of 13 million tweets, which I used as an appraisal tool, to help the Internet Archive identify Web content that needed archiving. In general I also saw how traditional, monolithic approaches to system building needed to be replaced with distributed processing architectures and the application of cloud computing technologies to easily and efficiently build up and tear down such systems on demand.

Around this time I also began to see parallels between the work of Matthew Kirschenbaum on the forensic and formal materiality of disk based media and my interests in the Web as a medium. Archivists usually think of the Web content as volatile and unstable, where turning off a web server can result in links breaking, and content disappearing forever. However it is also the case that Web content is easily copied, and the Internet itself was designed to route around damage. I began to notice how technologies such as distributed revision control systems, Web caches, and peer-to-peer distribution technologies like BitTorrent can make Web content extremely resilient. It was this emerging interest in the materiality of the Web that drew me to a position in the Maryland Institute for Technology in the Humanities where Kirschenbaum is the Assistant Director.

There are several iSchool faculty that I would potentially like to work with in developing my research. I am interested in the ethical dimensions to Web archiving and how technical architectures embody social values, which is one of Katie Shilton’s areas of research. Brian Butler’s work studying online community development and open data is also highly relevant to the study of collaborative and cooperative models for Web archiving. Ricky Punzalan’s work on virtual reunification in Web archives is also of interest because of its parallels with post-custodial archival theory, and the role of access in preservation. And Richard Marciano’s work on digital curation, in particular his recent work with the NSF on Brown Dog, would be an opportunity for me to further my experience building tools for digital preservation.

If admitted to the program I would focus my research on how Web archives are constructed and made accessible. This would include a historical analysis of the development of Web archiving technologies and organizations. I plan to look specifically at the evolution and deployment of Web standards and their relationship to notions of impermanence, and change over time. I will systematically examine current technical architectures for harvesting and providing access to Web archives. Based on user behavior studies I would also like to reimagine what some of the tools for building and providing access to Web archives might look like. I expect that I would spend a portion of my time prototyping and using my skills as a software developer to build, test and evaluate these ideas. Of course, I would expect to adapt much of this plan based on the things I learn during my course of study in the iSchool, and the opportunities presented by working with faculty.

Upon completion of the PhD program I plan to continue working on digital humanities and preservation projects at MITH. I think the PhD program could also qualify me to help build the iSchool’s new Digital Curation Lab at UMD, or similar centers at other institutions. My hope is that my academic work will not only theoretically ground my work at MITH, but will also be a source of fruitful collaboration with the iSchool, the Library and larger community at the University of Maryland. I look forward to helping educate a new generation of archivists in the theory and practice of Web archiving.

Cherry Hill Company: Learn About Islandora at the Amigos Online Conference

Fri, 2015-08-28 19:42

On September 17, 2015, I'll be giving the presentation "Bring your Local, Unique Content to the Web Using Islandora" at the Amigos Open Source Software and Tools for the Library and Archive online conference. Amigos is bringing together practitioners from around the library field who have used open source in projects at their library. My talk will be about the Islandora digital asset management system, the fundamental building block of the Cherry Hill LibraryDAMS service.

Every library has content that is unique to itself and its community. Islandora is open source software that enables libraries to store, present, and preserve that unique content to their communities and to the world. Built atop the popular Drupal content management system and the Fedora digital object repository, Islandora powers many digital projects on the...

Read more »

SearchHub: How Shutterstock Searches 35 Million Images by Color Using Apache Solr

Fri, 2015-08-28 18:00
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Shutterstock engineer Chris Becker’s session on how they use Apache Solr to search 35 million images by color.

This talk covers some of the methods they’ve used for building color search applications at Shutterstock using Solr to search 40 million images. A couple of these applications can be found in Shutterstock Labs – notably Spectrum and Palette. We’ll go over the steps for extracting color data from images and indexing them into Solr, as well as looking at some ways to query color data in your Solr index. We’ll cover issues such as what relevance means when you’re searching for colors rather than text, and how you can achieve various effects by ranking on different visual attributes.

At the time of this presentation, Chris was the Principal Engineer of Search at Shutterstock – a stock photography marketplace selling over 35 million images – where he’s worked on image search since 2008. In that time he’s worked on all the pieces of Shutterstock’s search technology ecosystem, from the core platform to relevance algorithms, search analytics, image processing, similarity search, internationalization, and user experience. He started using Solr in 2011 and has used it to build various image search and analytics applications.

Searching Images by Color: Presented by Chris Becker, Shutterstock from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Shutterstock Searches 35 Million Images by Color Using Apache Solr appeared first on Lucidworks.

DPLA: DPLA Welcomes Four New Service Hubs to Our Growing Network

Fri, 2015-08-28 16:50

The Digital Public Library of America is pleased to announce the addition of four Service Hubs that will be joining our Hub network. The Hubs represent Illinois, Michigan, Pennsylvania and Wisconsin.  The addition of these Hubs continues our efforts to help build local community and capacity, and further efforts to build an on-ramp to DPLA participation for every cultural heritage institution in the United States and its territories.

These Hubs were selected from the second round of our application process for new DPLA Hubs.  Each Hub has a strong commitment to bring together the cultural heritage content in their state to be a part of DPLA, and to build community and data quality among the participants.

In Illinois, the Service Hub responsibilities will be shared by the Illinois State Library, the Chicago Public Library, the Consortium of Academic and Research Libraries of Illinois (CARLI), and the University of Illinois at Urbana-Champaign. More information about the Illinois planning process can be found here. Illinois plans to make available collections documenting coal mining in the state, World War II photographs taken by an Illinois veteran and photographer, and collections documenting rural healthcare in the state.

In Michigan, the Service Hub responsibilities will be shared by the University of Michigan, Michigan State University, Wayne State University, Western Michigan University, the Midwest Collaborative for Library Services and the Library of Michigan.  Collections to be shared with the DPLA cover topics including the history of the Motor City, historically significant American cookbooks, and Civil War diaries from the Midwest.

In Pennsylvania, the Service Hub will be led by Temple University, Penn State University, University of Pennsylvania and Free Library of Philadelphia in partnership with the Philadelphia Consortium of Special Collections Libraries (PACSCL) and the Pennsylvania Academic Library Consortium (PALCI), among other key institutions throughout the state.  More information about the Service Hub planning process in Pennsylvania can be found here.  Collections to be shared with DPLA cover topics including the Civil Rights Movement in Pennsylvania, Early American History, and the Pittsburgh Iron and Steel Industry.

The final Service Hub, representing Wisconsin, will be led by Wisconsin Library Services (WiLS) in partnership with the University of Wisconsin-Madison, Milwaukee Public Library, University of Wisconsin-Milwaukee, Wisconsin Department of Public Instruction and Wisconsin Historical Society.  The Wisconsin Service Hub will build on the Recollection Wisconsin statewide initiative.  Materials to be made available document the American Civil Rights Movement’s Freedom Summer and the diversity of Wisconsin, including collections documenting the lives of Native Americans in the state.

“We are excited to welcome these four new Service Hubs to the DPLA Network,” said Emily Gore, DPLA Director for Content. “These four states have each led robust, collaborative planning efforts and will undoubtedly be strong contributors to the DPLA Hubs Network.  We look forward to making their materials available in the coming months.”

DPLA: The March on Washington: Hear the Call

Thu, 2015-08-27 19:00

Fifty-two years ago this week, more than 200,000 Americans came together in the nation’s capital to rally in support of the ongoing Civil Rights movement. It was at that march that Martin Luther King Jr.’s iconic “I Have A Dream” speech was delivered. And it was at that march that the course of American history was forever changed, in an event that resonates with protests, marches, and movements for change around the country decades later.

Get a new perspective on the historic March on Washington with this incredible collection from WGBH via Digital Commonwealth. This collection of audio pieces, 15 hours in total, offers uninterrupted coverage of the March on Washington, recorded by WGBH and the Educational Radio Network (a small radio distribution network that later became part of National Public Radio). This type of coverage was unprecedented in 1963, and offers a wholly unique view on one of the nation’s most crucial historic moments.

In this audio series, you can hear Martin Luther King Jr.’s historic speech, along with the words of many other prominent civil rights leaders–John Lewis, Bayard Rustin, Jackie Robinson, Roy Wilkins,  Rosa Parks, and Fred Shuttlesworth. There are interviews with Hollywood elite like Marlon Brando and Arthur Miller, alongside the complex views of the “everyman” Washington resident. There’s also the folk music of the movement, recorded live here, of Joan Baez, Bob Dylan, and Peter, Paul, and Mary. There are the stories of some of the thousands of Americans who came to Washington D.C. that August–teachers, social workers, activists, and even a man who roller-skated to the march all the way from Chicago.

Hear speeches made about the global nonviolence movement, the labor movement, and powerful words from Holocaust survivor Joachim Prinz. Another notable moment in the collection is an announcement of the death of W.E.B. Du Bois, one of the founders of the NAACP and an early voice for civil rights issues.

These historic speeches are just part of the coverage, however. There are fascinating, if more mundane, announcements, too, about the amount of traffic in Washington and issues with both marchers’ and commuters’ travel (though they reported that “north of K Street appears just as it would on a Sunday in Washington”). Another big, though less notable, issue of the day, according to WGBH reports, was food poisoning from the chicken in boxed lunches served to participants at the march. There is also information about the preparation for the press, which a member of the march’s press committee says included more than 300 “out-of-town correspondents.” This was in addition to the core Washington reporters, radio stations like WGBH, TV networks, and international stations from Canada, Japan, France, Germany and the United Kingdom. These types of minute details and logistics offer a new window into a complex historic event, bringing together thousands of Americans in the nation’s capital (though, as WGBH reported, not without its transportation hurdles!).

At the end of the demonstration, you can hear for yourself a powerful pledge, recited from the crowd, to further the mission of the march. It ends poignantly: “I pledge my heart and my mind and my body unequivocally and without regard to personal sacrifice, to the achievement of social peace through social justice.”

Hear the pledge, alongside the rest of the march as it was broadcast live, in this inspiring and insightful collection, courtesy of WGBH via Digital Commonwealth.

Banner image courtesy of the National Archives and Records Administration.

A view of the March on Washington, showing the Reflecting Pool and the Washington Monument. Courtesy of the National Archives and Records Administration.

Jonathan Rochkind: Am I a “librarian”?

Thu, 2015-08-27 18:42

I got an MLIS degree, received a bit over 9 years ago, because I wanted to be a librarian, although I wasn’t sure what kind. I love libraries for their 100+ year tradition of investigation and application of information organization and retrieval (a fascinating domain, increasingly central to our social organization); I love libraries for being one of the few information organizations in our increasingly information-centric society that (often) aren’t trying to make a profit off our users so can align organizational interests with user interests and act with no motive but our user’s benefit; and I love libraries for their mountains of books too (I love books).

Originally I didn’t plan on continuing as a software engineer, I wanted to be ‘a librarian’.  But through becoming familiar with the library environment, including but not limited to job prospects, I eventually realized that IT systems are integral to nearly every task staff and users perform at or with a library — and I could have a job using less-than-great tech, knowing that I could make it better but having no opportunity to do so — or I could have a job making it better.  The rest is history.

I still consider myself a librarian. I think what I do — design, build, and maintain internal and purchased systems by which our patrons interact with the library and our services over the web —  is part of being a librarian in the 21st century.

I’m not sure if all my colleagues consider me a ‘real librarian’ (and my position does not require an MLIS degree).  I’m also never sure, when strangers or acquaintances ask me what I do for work, whether to say ‘librarian’, since they assume a librarian does something different than what I spend my time doing.

But David Lee King in a blog post What’s the Most Visited Part of your Library? (thanks Bill Dueber for the pointer), reminds us, I think from a public library perspective:

Do you adequately staff the busiest parts of your library? For example, if you have a busy reference desk, you probably make sure there are staff to meet demand….

Here’s what I mean. Take a peek at some annual stats from my library:

  • Door count: 797,478 people
  • Meeting room use: 137,882 people
  • Library program attendance: 76,043 attendees
  • Art Gallery visitors: 25,231 visitors
  • Reference questions: 271,315 questions asked

How about website visits? We had 1,113,146 total visits to the website in 2014. The only larger number is our circulation count (2,300,865 items)….

…So I’ll ask my question again: Do you adequately staff the busiest parts of your library?

I don’t have numbers in front of me from our academic library, but I’m confident that our ‘website’ — by which I mean to include our catalog, ILL system, link resolver, etc, all of the places users get library services over the web, the things me and my colleagues work on — is one of the most, if not the most, used ‘service points’ at our library.

I’m confident that the online services I work on reach more patrons, and are cumulatively used for more patron-hours, than our reference or circulation desks.

I’m confident the same is true at your library, and almost every library.

What would it mean for an organization to take account of this?  “Adequate staffing”, as King says, absolutely. Where are staff positions allocated?  But also in general, how are non-staff resources allocated?  How is respect allocated? Who is considered a ‘real librarian’? (And I don’t really think it’s about the MLIS degree either, even though I led with that). Are IT professionals (and their departments and managers) considered technicians who maintain ‘infrastructure’ as precisely specified by ‘real librarians’, or are they considered important professional partners collaborating in serving our users?  Who is consulted for important decisions? Is online service downtime taken as seriously as (or more seriously than) an unexpected closure of the physical building, and are resources allocated correspondingly? Is user experience (UX) research done in a genuinely serious way into how your online services are meeting user needs — are resources (including but not limited to staff positions) provided for such?

What would it look like for a library to take seriously that its online services are, by far, the most used service point in a library?  Does your library look like that?

In the 21st century, libraries are Information Technology organizations. Do those running them realize that? Are they run as if they were? What would it look like for them to be?

It would be nice to start with just some respect.

Although I realize that in many of our libraries respect may not be correlated with MLIS-holders or who’s considered a “real librarian” either.  There may be some perception that ‘real librarians’ are outdated. It’s time to update our notion of what librarians are in the 21st century, and to start running our libraries recognizing how central our IT systems, and the development of such in professional ways, are to our ability to serve users as they deserve.

Filed under: General

SearchHub: Indexing Arabic Content in Apache Solr

Thu, 2015-08-27 18:27
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Ramzi Alqrainy‘s session on using Solr to index and search documents and files in Arabic. The Arabic language poses several challenges for natural language processing (NLP), largely because Arabic, unlike European languages, has a very rich and sophisticated morphological system. This talk will cover some of the challenges and how to solve them with Solr, and will also present the challenges that were handled by OpenSooq as a real-world case from the Middle East. Ramzi Alqrainy is one of the most recognized experts in the artificial intelligence and information retrieval fields in the Middle East. He is an active researcher and technology blogger, with a focus on information retrieval. Arabic Content with Apache Solr: Presented by Ramzi Alqrainy, OpenSooq from Lucidworks Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…
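The talk’s own configuration isn’t reproduced in this summary, but Lucene does ship Arabic-specific analysis components (normalization of alef variants and diacritics, plus a light stemmer), and Solr’s Schema API accepts field-type definitions as JSON. A sketch of defining such a field type might look like this; the field-type name and the commented-out Solr URL are assumptions, while the filter factory classes are real Lucene/Solr components:

```python
import json

# Field type using Lucene's Arabic analysis chain: lowercasing and
# stopword removal, then normalization, then light stemming.
arabic_field_type = {
    "add-field-type": {
        "name": "text_ar",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {"class": "solr.LowerCaseFilterFactory"},
                {"class": "solr.StopFilterFactory",
                 "ignoreCase": True, "words": "lang/stopwords_ar.txt"},
                {"class": "solr.ArabicNormalizationFilterFactory"},
                {"class": "solr.ArabicStemFilterFactory"},
            ],
        },
    }
}

payload = json.dumps(arabic_field_type, ensure_ascii=False)
# To apply it against a running Solr core (URL is an assumption):
# urllib.request.urlopen(urllib.request.Request(
#     "http://localhost:8983/solr/mycore/schema", payload.encode("utf-8"),
#     {"Content-Type": "application/json"}))
```

Fields typed as `text_ar` would then normalize and stem Arabic text at both index and query time, which is the usual first step in coping with Arabic’s rich morphology.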

The post Indexing Arabic Content in Apache Solr appeared first on Lucidworks.

LITA: August Library Tech Roundup

Thu, 2015-08-27 13:00
image courtesy of Flickr user cdevers (CC BY NC ND)

Each month, the LITA bloggers will share selected library tech links, resources, and ideas that resonated with us. Enjoy – and don’t hesitate to tell us what piqued your interest recently in the comments section!

Brianna M.

Here are some of the things that caught my eye this month, mostly related to digital scholarship.

John K.

Jacob S.

  • I’m thankful for Shawn Averkamp’s Python library for interacting with ContentDM (CDM), including a Python class for editing CDM metadata via their Catcher, making it much less of a pain batch editing CDM metadata records.
  • I recently watched an ALA webinar where Allison Jai O’Dell presented on TemaTres, a platform for publishing linked data controlled vocabularies.

Nimisha B.

There have been a lot of great publications and discussions in the realm of Critlib lately concerning cataloging and library discovery. Here are some, and a few other things of note:

Michael R.

  • Adobe Flash’s days seem numbered as Google Chrome will stop displaying Flash adverts by default, following Firefox’s lead. With any luck, Java will soon follow Flash into the dustbin of history.
  • NPR picked up the story of DIY tractor repairs running afoul of the DMCA. The U.S. Copyright Office is considering a DMCA exemption for vehicle repair; a decision is scheduled for October.
  • Media autoplay violates user control and choice. Video of a fatal, tragic Virginia shooting has been playing automatically in people’s feeds. Ads on autoplay are annoying, but this…!

Cinthya I.

These are a bit all over the map, but interesting nonetheless!

Bill D.

I’m all about using data in libraries, and a few things really caught my eye this month.

David K.

Whitni W.

Marlon H.

  • Ever since I read an ACRL piece about library adventures with Raspberry Pi, I’ve wanted to build my own as a terminal for catalog searches and as a self-checkout machine. Adafruit user Ruizbrothers‘ example of how to Build an All-In-One Desktop using the latest version of Raspberry Pi might be just what I need to finally get that project rolling.
  • With summer session over (and with it my MSIS, yay!) I am finally getting around to planning my upgrade from Windows 8.1 to 10. Lifehacker’s Alan Henry provides quite a few good reasons to opt for a Clean Install over the standard upgrade option. With more and more of my programs conveniently located just a quick download away and a wide array of cloud solutions safeguarding my data, I think I found my weekend project.

Share the most interesting library tech resource you found this August in the comments!

William Denton: MIME type

Thu, 2015-08-27 03:03

Screenshot of an email notification I received from Bell, as viewed in Alpine:

Inside: Content-Type: text/plain; charset="ISO-8859-1"

In the future, email billing will be mandatory and email bills will be unreadable.
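For what it’s worth, headers like the one above can be inspected programmatically with Python’s standard library `email` module; the sender address and message body here are invented for illustration:

```python
from email import message_from_string

raw = (
    "From: billing@example.com\n"
    "Subject: Your bill\n"
    'Content-Type: text/plain; charset="ISO-8859-1"\n'
    "\n"
    "Votre facture est pr\xeate.\n"
)
msg = message_from_string(raw)
print(msg.get_content_type())     # text/plain
print(msg.get_content_charset())  # iso-8859-1
```

A mail client that honors the declared charset would render the body correctly; the garbled screenshot above suggests somewhere along the line the declaration and the actual bytes disagreed.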

District Dispatch: Tech industry association releases copyright report

Wed, 2015-08-26 20:37


The Computer and Communications Industry Association (CCIA) released a white paper “Copyright Reform for a Digital Economy” yesterday that includes many ideas also supported by libraries. The American Library Association shares the same philosophy that the purpose of the copyright law is to advance learning and benefit the public. We both believe that U.S. copyright law is a utilitarian system that rewards authors through a statutorily created but limited monopoly in order to serve the public. Any revision of the copyright law needs to reflect that viewpoint and recognize that today, copyright impacts everyone, not just media companies. The white paper does get a little wonky in parts, but check out at least the executive summary (or watch the webinar) to learn why we argue for fair use, licensing transparency, statutory damages reform and a respect for innovation and new creativity.

The post Tech industry association releases copyright report appeared first on District Dispatch.

Cynthia Ng: Making the NNELS Site Responsive

Wed, 2015-08-26 19:09
Honestly, making a site responsive is nothing new, not even for me. Nevertheless, I wanted to document the process (no surprise there). Since as of the date of publishing this post, the responsive version of the theme hasn’t gone live yet, you get a sneak peek. I was a little worried because we have a … Continue reading Making the NNELS Site Responsive

Jonathan Rochkind: Virtual Shelf Browse is a hit?

Wed, 2015-08-26 18:55

With the semester starting back up here, we’re getting lots of positive feedback about the new Virtual Shelf Browse feature.

I don’t have usage statistics or anything at the moment, but it seems to be a hit, allowing people to do something like a physical browse of the shelves, from their device screen.

Positive feedback has come from underclassmen as well as professors. I am still assuming it is disciplinarily specific (some disciplines/departments simply don’t use monographs much), but appreciation and use does seem to cut across academic status/level.

Here’s an example of our Virtual Shelf Browse.

Here’s a blog post from last month where I discuss the feature in more detail.

Filed under: General