Feed aggregator

SearchHub: Query Autofiltering IV: – A Novel Approach to Natural Language Processing

planet code4lib - Thu, 2015-11-19 14:46

This is my fourth blog post on a technique that I call Query Autofiltering. The basic idea is that we can use meta information stored within the Solr/Lucene index itself (in the form of string or non-tokenized text fields) to generate a knowledge base from which we can parse user queries and map phrases within the query to metadata fields in the index. This enables us to re-write the user’s query to achieve better precision in the response.

Recent versions of Query Autofiltering, which uses the Lucene FieldCache as a knowledge store, are able to do this job rather well but still leave some unresolved ambiguities. This can happen when a given metadata value occurs in more than one field (some examples of this below), so the query autofilter will create a complex boolean query to handle all of the possible permutations. With multiple fields involved, some of the cross-field combinations don’t exist in the index (the autofilter can’t know that), and an additional filtering step happens serendipitously when the query is run. This often gives us exactly the right result, but there is an element of luck involved, which means that there are bound to be situations where our luck runs out.

As I was developing demos for this approach using a music ontology I am working on, I discovered some of these use cases. As usual, once you see a problem and understand the root cause, you can then find other examples of it. I will discuss a biomedical / personal health use case below that I had long thought was difficult or impossible to solve with conventional search methods (not that query autofiltering is “conventional”). But I am getting ahead of myself. The problem crops up when users add verbs, adjectives or prepositions to their query to constrain the results, and these terms do not occur as field values in the index. Rather, they map to fields in the index. The user is telling us that they want to look for a key phrase in a certain metadata context, not all of the contexts in which the phrase can occur. It’s a Natural Language 101 problem! – Subject-Verb-Object stuff. We get the subject and object noun phrases from query autofiltering. We now need a way to capture the other key terms (often verbs) to do a better job of parsing these queries – to give the user the accuracy that they are asking for.

I think that a real world example is needed here to illustrate what I am talking about. In the Music ontology, I have entities like songs, the composers/songwriters/lyricists that wrote them and the artists that performed or recorded them. There is also the concept of a “group” or “band” which consists of group members who can be songwriters, performers or both.

One of my favorite artists (and I am sure that some, but maybe not all, of my readers would agree) is Bob Dylan. Dylan wrote and recorded many songs, and many of his songs were covered by other artists. One of the interesting verbs in this context is “covered”. A cover, in my definition, is a recording by an artist who is not one of the song’s composers. The verb form “to cover” is the act of recording or performing another artist’s composition. Dylan, like other artists, recorded both his own songs and songs of other musicians, but a cover can be a signature too. So for example, Elvis Presley covered many more songs than he wrote, but we still think of “Jailhouse Rock” as an Elvis Presley song even though he didn’t write it (Jerry Leiber and Mike Stoller did).

So if I search for “Bob Dylan Songs” – I mean songs that Dylan either wrote or recorded (i.e. both). However, if I search for “Songs Bob Dylan covered”, I mean songs that Bob Dylan recorded but didn’t write, and “covers of Bob Dylan songs” would mean recordings by other artists of songs that Dylan wrote – Jimi Hendrix’s amazing cover of “All Along The Watchtower” immediately comes to mind here. (There is another linguistic phenomenon besides verbs going on here that I will talk about in a bit.)

So how do we resolve these things? Well, we know that the phrase “Bob Dylan” can occur in many places in our ontology/dataset. It is a value in the “composer” field, the “performer” field and in the title field of our record for Bob Dylan himself. It is also the value of an album entity, since his first album was titled “Bob Dylan”. So given the query “Bob Dylan” we should get all of these things – and we do – the ambiguity of the query matches the ambiguities discovered by the autofilter, so we are good. “Bob Dylan Songs” gives us songs that he wrote or recorded – the query is more specific but still somewhat ambiguous, and still good because we have value matches for the whole query. However, if we say “Songs Bob Dylan recorded” vs “Songs Bob Dylan wrote” we are asking for different subsets of “song” things. Without help, the autofilter misses this subtlety because there are no matching field values for the terms “recorded” or “wrote”, so it treats them as filler words.
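
To make the distinction concrete, here is a rough sketch of the kind of rewritten queries involved (illustrative only – the field names follow the examples in this post, and the real autofilter output will differ in detail):

    Bob Dylan songs           =>  (composer_ss:"Bob Dylan" OR performer_ss:"Bob Dylan") AND recording_type_ss:Song
    Songs Bob Dylan recorded  =>  performer_ss:"Bob Dylan" AND recording_type_ss:Song   (what the user wants)
    Songs Bob Dylan wrote     =>  composer_ss:"Bob Dylan" AND recording_type_ss:Song    (what the user wants)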

To make the query autofilter a bit “smarter” we can give it some rules. One such rule states that if a term like “recorded” or “performed” is near an entity (detected by the standard query autofilter parsing step) like “Bob Dylan” that maps to the field “performer_ss”, then use that field by itself and don’t fan the phrase out to the other fields it also maps to. We configure this like so:

    performed, recorded,sang => performer_ss

and for songs composed or written:

    composed,wrote,written by => composer_ss

The list of synonymous verb or adjective phrases goes on the left, and the field or fields that these should map to go on the right. Now these queries work as expected! Nice.
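
To show how a configuration like this might be consumed, here is a minimal, self-contained Java sketch – not the actual plugin code (that is linked at the end of this post) – of the verb-to-field rule idea. It parses rule lines of the form above and, when a trigger term appears in the query, narrows an ambiguous entity’s candidate fields to the fields named by the rule. The field title_s in the example is hypothetical; performer_ss and composer_ss come from the configuration above.

    import java.util.*;

    // Sketch of the verb-to-field rule idea: parse lines like
    //     performed,recorded,sang => performer_ss
    // and, when a trigger term shows up among the query terms, keep only the
    // fields named by that rule. Multi-word triggers such as "written by"
    // would need phrase matching, which is omitted here for brevity.
    public class VerbFieldRules {

        private final Map<String, List<String>> triggerToFields = new HashMap<>();

        public VerbFieldRules(List<String> ruleLines) {
            for (String line : ruleLines) {
                String[] parts = line.split("=>");
                if (parts.length != 2) continue;
                List<String> fields = Arrays.asList(parts[1].trim().split("\\s*,\\s*"));
                for (String trigger : parts[0].trim().split("\\s*,\\s*")) {
                    triggerToFields.put(trigger.toLowerCase(Locale.ROOT), fields);
                }
            }
        }

        // candidateFields: the fields the entity phrase maps to in the index
        // queryTerms: the remaining (non-entity) terms of the user's query
        public List<String> restrict(List<String> candidateFields, List<String> queryTerms) {
            for (String term : queryTerms) {
                List<String> ruleFields = triggerToFields.get(term.toLowerCase(Locale.ROOT));
                if (ruleFields != null) {
                    List<String> kept = new ArrayList<>(candidateFields);
                    kept.retainAll(ruleFields);          // keep only the rule's fields
                    return kept.isEmpty() ? ruleFields : kept;
                }
            }
            return candidateFields;                      // no trigger term: fan out as before
        }

        public static void main(String[] args) {
            VerbFieldRules rules = new VerbFieldRules(Arrays.asList(
                    "performed,recorded,sang => performer_ss",
                    "composed,wrote,written by => composer_ss"));
            List<String> dylanFields = Arrays.asList("composer_ss", "performer_ss", "title_s");
            System.out.println(rules.restrict(dylanFields, Arrays.asList("songs", "recorded"))); // [performer_ss]
            System.out.println(rules.restrict(dylanFields, Arrays.asList("songs")));             // all three fields
        }
    }

In the real plugin the narrowed field list would feed into how the final boolean query gets built, but the core idea – trigger term seen, field list restricted – is the same.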

Another example is if we want to be able to answer questions about the bands that an artist was in or the members of a group. For questions like “Who’s in The Who?” or “Who were the members of Procol Harum?” we would map the verb or prepositional phrases “who’s in” and “members of” to the group_members_ss and member_of_group_ss fields in the index:

    who’s in,was in,were in,member,members => group_members_ss,member_of_group_ss
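
Roughly speaking (again, a sketch rather than the literal autofilter output), the rewritten query is narrowed to those two fields:

    who's in the who  =>  group_members_ss:"The Who" OR member_of_group_ss:"The Who"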

Now, searching for “who’s in the who” brings back just Messrs. Daltrey, Entwistle, Moon and Townshend – cool!!!

Going Deeper – handling covers with noun-noun phrase ambiguities

The earlier example that I gave, “songs Bob Dylan covered” vs. “covers of Bob Dylan songs”, contains additional complexities that the simple verb-to-field mapping doesn’t solve yet. Looking at this problem from a language perspective (rather than from a software hacking point of view) I was able to find an explanation and, from that, a solution. A side note here: when pre-processing the ontology to detect when a recording was a cover, I also captured the opposite relation – the case where the performer of a song is also one of its composers. Index records of this type get tagged with an “original_performer_s” field and a “version_s:Original” to distinguish them from covers at query time (covers are tagged “version_s:Cover”).

Getting back to the language thing, it turns out that in the phrase “Bob Dylan songs covered”, the subject noun phrase is “Bob Dylan songs”! That is, the noun entity is the plural form of “song”, and the noun phrase “Bob Dylan” qualifies that noun to specify songs by him – it’s what is known in linguistics as a “noun-noun phrase”, meaning that one noun, “Bob Dylan”, serves as an adjective to another one, “song” in this case. Remember – language is tricky! However, in the phrase “Songs Bob Dylan covered”, now “Songs” is the object noun, “Bob Dylan” is the subject noun and “covered” is the verb. To get this one right, I devised an additional rule which I call a pattern rule: if an original_performer entity precedes a composition_type song entity, use that pattern for query autofiltering. This is expressed in the configuration like so:

covered,covers:performer_ss => version_s:Cover | original_performer_s:_ENTITY_,recording_type_ss:Song=>original_performer_s:_ENTITY_

To break this down, the first part does the mapping of ‘covered’ and ‘covers’ to the field performer_ss. The second part sets a static query parameter version_s:Cover. The third part – original_performer_s:_ENTITY_,recording_type_ss:Song=>original_performer_s:_ENTITY_ – translates to: if an original performer entity is followed by a recording type of “song”, use original_performer_s as the field name.

We also want this pattern to be applied in a context-sensitive manner – it is needed to disambiguate the bi-directional verb “cover”, so we only use it in this situation. That is, this pattern rule is only triggered if the verb “cover” is encountered in the query. Again, these rules are use-case dependent and we can grow or refine them as needed. Rule-based approaches like this require curation and analysis of query logs but can be a very effective way to handle edge cases like this. Fortunately, the “just plug it in and forget it” part of the query autofiltering setup handles a large number of use cases without any help. That’s a good balance.

With this rule in place, I was able to get queries like “Beatles Songs covered by Joe Cocker” and “Smokey Robinson songs covered by the Beatles” to work as expected. (The answer to the second one is that great R&B classic “You’ve Really Got A Hold On Me”).
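
For illustration, those two queries might rewrite to something roughly like the following (a sketch only; the exact form depends on the entity values stored in the index):

    Beatles songs covered by Joe Cocker           =>  original_performer_s:"The Beatles" AND performer_ss:"Joe Cocker" AND version_s:Cover
    Smokey Robinson songs covered by the Beatles  =>  original_performer_s:"Smokey Robinson" AND performer_ss:"The Beatles" AND version_s:Cover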

Healthcare concerns

Let’s examine another domain to see the generality of these techniques. In healthcare, there is a rich ontology relating diseases, symptoms, treatments and root biomedical causes. There are also healthcare providers of various specialties and pharmaceutical manufacturers in the picture, among others. In this case, the ontologies are already out there (like MeSH), courtesy of the National Library of Medicine and other affiliated agencies. So, imagine that we have a consumer healthcare site with pages that discuss these entities and provide ways to navigate between them. The pages would also have metadata that we can both facet on and perform query autofiltering on.

Let’s take a concrete example. Suppose that you are suffering from abdominal pain (sorry about that). This is an example of a condition or symptom that may be benign (you ate or drank too much last night) or a sign of something more serious. Symptoms can be caused by diseases like appendicitis or gastroenteritis, can be treated with drugs, or may even be caused by a drug side effect or adverse reaction. So if you are on this site, you may be asking questions like “what drugs can treat abdominal pain?” and maybe also “what drugs can cause abdominal pain?”. This is a hard problem for traditional search methods, and the query autofilter, without the type of assistance I am discussing here, would not get it right either. For drugs, the metadata fields for the page would be “indication” for positive relationships (an indication is what the drug has been approved for by the FDA) and “side_effect” or “adverse_reaction” for the dark side of pharmaceuticals (don’t those disclaimers on TV ads just seem to go on and on and on?).

With our new query autofilter trick, we can now configure these verb and preposition phrases to map to the right fields:

    treat,for,indicated => indication_ss

    cause,produce => side_effect_ss,adverse_reaction_ss
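
Under these mappings, the two questions above might rewrite to something roughly like this (again a sketch; a real index would presumably also constrain on a page- or drug-type field):

    what drugs can treat abdominal pain?  =>  indication_ss:"abdominal pain"
    what drugs can cause abdominal pain?  =>  side_effect_ss:"abdominal pain" OR adverse_reaction_ss:"abdominal pain"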

Now these queries should work correctly: our search application is that much smarter – and our users will be much happier with us – because as we know, users asking questions like this are highly motivated to get good, usable answers and don’t have the time/patience to wade through noise hits (i.e. they may already be in pain).

You may be wondering at this point how many of these rules we will need. One thing to keep in mind – and the reason for my using examples from two different domains – is the domain-specific nature of these problems. For general web search applications like Google, this list of rules might be very large (but then again, so is Google). For domain-specific applications, as occur in enterprise search or eCommerce, the list can be much more manageable and use-case driven. That is, we will probably discover these fixes as we examine our query logs, but now we have another tool in our arsenal to tackle language problems like this.

Using Natural Language Processing techniques to detect and respond to User Intent

The general technique that I am illustrating here is something that I have been calling “Query Introspection”. A more plain-English way to say this is inferring user intent. That is, using techniques like this we can do a better job of figuring out what the user is looking for and then modifying the query to go get it if we can. It’s a natural language processing, or NLP, problem. There are other approaches that have been successful here, notably using part-of-speech (POS) analysis on the query to get at the nouns, verbs and prepositions that I have been talking about. This can be based on machine learning or algorithmic (rule-based) approaches and can be a good way of parsing the query into its linguistic component parts. IBM’s famous Watson program needed a pretty good one to parse Jeopardy questions. Machine learning approaches can also be applied directly to Q&A problems. A good discussion of this is in Ingersoll’s great book Taming Text.

The user-intent detection step, which the classical NLP techniques discussed above – and now the query autofilter – can perform, represents phase one of the process. Translating this into an appropriately accurate query is the second phase. For POS-tagged approaches, this usually involves a knowledge base that enables part-of-speech phrases to be mapped to query fields. The query autofilter does this natively, since it gets the information from the “horse’s mouth”, so to speak. The POS / knowledge base approach may be more appropriate when there is less metadata structure in the index itself, as the KB can be the output of data mining operations. There were some excellent talks on this at the recent Lucene/Solr Revolution in Austin (see my blog post on this). However, if you have already tagged your data, manually or automagically, give query autofiltering a shot.

Source code is available

The Java source code for this is available on GitHub for both Solr 4.x and Solr 5. Technical details about the code and how it works are available there. Download and use this code if you want to incorporate this feature into your search applications now. There is also a Solr JIRA submission (SOLR-7539).

The post Query Autofiltering IV: – A Novel Approach to Natural Language Processing appeared first on

DuraSpace News: INVITATION to Join the New Fedora Community Book Club

planet code4lib - Thu, 2015-11-19 00:00

From Andrew Woods, Fedora Tech Lead

Winchester, MA – A book club or reading group offers readers an opportunity to have focused discussions about what they are reading. Currently, more than 5 million adults are thought to participate in a wide variety of reading groups.*

Chris Prom: Why I voted yes

planet code4lib - Wed, 2015-11-18 22:21

As most Society of American Archivists (SAA) members know, we have just been invited to vote for a dues increase, to be phased in over three years.

It is no exaggeration to say that the proposal is controversial. Not only is the US economy hobbling along, but member salaries seem flat.  Many of us struggle to make ends meet, working in positions that pay but a fraction of the value we provide to society.  And student members worry about the future, understandably so.

So why increase dues now?  And why vote yes?

Here is why I did vote yes, and with enthusiasm!

First, whatever your income level, SAA membership is an incredible value.

Personally, I have benefitted many times over from my membership dues. Entering the profession in 1999, I found an instant home: a place to discuss substantive issues, a place to shape professional discourse, a place to grow in my knowledge and understanding, and a place to commiserate with my peers. In Democracy in America, Alexis de Tocqueville argued that “[t]he health of a democratic society may be measured by the quality of functions performed by private citizens.” And in America, the way that we private citizens best express ourselves is through associations like SAA. Associations give us a place to talk to each other, to systematize our knowledge, and to advocate for change.

Without SAA, archivists would lack the means to speak with a common (but not necessarily unitary) voice on issues of vital importance to the future of our country and world.

Second, SAA provides us these opportunities on a shoestring budget.  It may sound trite to say that members are the organization, but it is true!   SAA’s talented and hardworking staff play a critical enabling role.  But without adequate income from dues, meeting attendance, publication sales, and education, the members simply cannot exist as a positive force; that is to say, as a national association. Dues are a critical pillar in the foundation that supports SAA, and they provide us the opportunity to extend our effectiveness far beyond what we can achieve alone.  Even more to the point, they heighten the value our work provides in our individual institutions, our communities, and our other professional associations.  Each of these groups has a complementary role, but they can’t duplicate or replace that of SAA.

Third, SAA is a responsive organization. In short, SAA works! Members objected (in quite productive and collegial terms) to the initial dues increase proposal. Council carefully considered the objections, reshaped the proposal, and provided a very rational and effective response, one which holds the line on dues at the lower income levels, while introducing modest increases elsewhere and moving us toward a more progressive structure. This is a process and result that put SAA’s best attributes on vivid display!

And finally, the proposed dues increase represents the minimum amount that is necessary to support core functions of the Society–functions like education, publishing, and technology.  Each of these functions is critical to the long-term health of our profession, not just our professional association, even while they enable related activities that are critically important to the health of SAA — like information exchange, best practice development, and standards creation/maintenance.

At different times in my career, I’ve been a student, a member of the lowest membership tier, the middle tiers, and now the highest one. But whatever my status, I have received much, much more from my SAA membership than whatever monetary cost I incurred each year.

I’ve learned to be a better archivist.  I’ve been mentored by people who are much wiser than I’ll ever be. I’ve expanded my knowledge through top-notch meetings, books, workshops, and courses.  And most of all, I’ve made and continue to make many friends.

And that is the type of value we can all take to the bank!

[updated 8:26 PM CST, Wed Nov. 18, 2015 and 1:20 CST Friday Nov 20]

District Dispatch: Copyright looks different from Jamaican eyes

planet code4lib - Wed, 2015-11-18 22:02

Street in Montego Bay, Jamaica

This week, the Re:create Coalition, of which ALA is a founding member, held two events: one highlighting academic research in copyright, held at Howard University School of Law; and a half day of panel discussions on “Modernizing Copyright Law for Today’s Reality.”

The program on academic research hosted by the Institute for Intellectual Property and Social Justice at Howard University School of Law was a treat because not one sound bite was uttered. One gets so much of the sound bite thing working and living in DC that it gets tiresome. You can imagine. At the IP social justice event, discussion centered on progressive ways to think about copyright in today’s global, digital information society. One speaker was Larisa Kingston Mann, who discussed her research in the music and dance world of Jamaicans. Jamaicans living in poverty are creators who operate in a totally unregulated environment, not guided by western copyright law. This is not a big surprise when you reflect on the colonial history of Jamaica. Why would Jamaicans follow laws established by their oppressors, meant to assimilate them in western ways?

Jamaicans produce mass street dances where individuals can become celebrities by dancing or singing with no expectation of compensation. Their version of royalties is being mentioned in a song. They openly use copyrighted music and record their own versions over instrumental tracks. Creating an original work as we conceive of it in the dominant copyright paradigm is meaningless because harkening back to works that have already been created and that link to their culture is the value they embrace. Mann’s presentation was fascinating and pointed out that (once again) official policy is far removed from behavior on the ground.

This reminded me of academic authors who want to share their research and do not expect monetary compensation. Instead it is the opinion of their peers that matters. Like the Jamaican creators, their royalties consist of being mentioned in another person’s work in the form of citations. Official copyright policy tends to think of authors only in the commercial context, again far removed from behavior on the ground.

Of course, many academic authors are paid for a living – a circumstance that commercial authors are quick to note. But they are missing the point. “One size fits all” copyright policy is just like the clothing. One size rarely fits all. Copyright law does not recognize the majority of the creators in the world. It favors a specific kind of author, unduly focusing on the incentive aspect of copyright rather than on its true purpose—to advance learning.

The post Copyright looks different from Jamaican eyes appeared first on District Dispatch.

LITA: Jobs in Information Technology: November 18, 2015

planet code4lib - Wed, 2015-11-18 20:45

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Programmer, University of Colorado Denver- Auraria Library, Denver, CO

Head of Digital Library Services, J. Willard Marriott Library, University of Utah, Salt Lake City, UT

Discovery Services Librarian, Middle Tennessee State University, Murfreesboro, TN

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

SearchHub: Lucidworks Announces $21 Million in Series D Funding

planet code4lib - Wed, 2015-11-18 19:00

It’s a great Wednesday:

Lucidworks, the chosen search solution for leading brands and organizations around the world, today announced $21 million in new financing. Allegis Capital led the round with participation from existing investors Shasta Ventures and Granite Ventures. Lucidworks will use the funds to accelerate its product-focused mission enabling companies to translate massive amounts of data into actionable business intelligence.

“Organizations demand powerful data applications with the ease of use of a mobile app. Our platform provides universal data access to end users both inside and outside the enterprise,” said Will Hayes, CEO, Lucidworks. “With this investment, Lucidworks will expand our efforts to be the premier platform for building search and data-driven applications.”

“Lucidworks has proven itself, not only by providing the software and solutions that businesses need to benefit from Lucene/Solr search, but also by expanding its vision with new products like Fusion that give companies the ability to fully harness search technology suiting their particular customers,” says Spencer Tall, Managing Director, Allegis Capital. “We fully support Lucidworks, not only for what it has achieved to date — disruptive search solutions that offer real, immediate benefits to businesses — but for the promising future of its product technology.”

Full details

The post Lucidworks Announces $21 Million in Series D Funding appeared first on

pinboard: Presentations @ Code4Lib

planet code4lib - Wed, 2015-11-18 16:41
RT @ronallo: Vote for #code4lib talk Building Desktop Applications using Web Technologies w/ #electronjs Will be useful and fun!

Roy Tennant: Only Good Enough to be Dangerous

planet code4lib - Wed, 2015-11-18 15:54

I feel like I didn’t really reach adulthood until I became a commercial river guide. There are a few reasons for this opinion. One is that you probably don’t really become an adult until you are responsible for the health and well-being of someone else. As a guide and more importantly as a trip leader, that responsibility was mine at 21.

Another reason is that you don’t really become an adult until you realize how little you understand. As a teenager who has recently come into your own – making decisions sans parental input and generally making your way in the world – it can all go to your head. Then add to that the accomplishment of apparent mastery of a sport that is mildly dangerous and you have a cocktail for disaster. Believe me, I know. I’ve seen it in me and I’ve seen it in others. See the picture of me perched on a rock in the middle of a river if you don’t believe me.

I call it “knowing just enough to be dangerous.”

Unfortunately, that also describes my knowledge of programming. Although I’ve been writing software since the mid-80s (seriously, talk to me about BASIC sometime), I’ve never been all that good. I only seem to learn just enough to do what I need and sometimes not even that. Since I’ve never actually been employed as a programmer this level of skill has sufficed, which only enables the problem.

Meanwhile, I’ve had continuous access to computing resources my entire adult life, and recently these resources have been truly astonishing. A 51-node parallel computing cluster, with a terabyte and a half of RAM, for example. And still my skills at bringing machines to their knees through loading huge arrays into RAM, running loops within loops, and other shenanigans continues unabated. You could even call it a talent.

So…yeah. Only good enough to be dangerous, I am. And perhaps you too. Naw, just kidding. You’re just fine.

DPLA: Digital Public Library of America and Pop Up Archive partner to make audiovisual collections across the U.S. searchable

planet code4lib - Wed, 2015-11-18 15:50

Oakland, CA & Boston, MA — Libraries across the United States house tens of millions of audio and video recordings, a rich and vibrant body of cultural history and content for the public, scholars, and researchers — but the recordings are virtually impossible to search. The Digital Public Library of America is partnering with Pop Up Archive to offer discounted services to the DPLA network. DPLA Hubs and their partners will be able to take advantage of this discounted rate to make it possible for anyone to search and pinpoint exact search terms and phrases within audiovisual collections.

DPLA already provides a catalog of over eleven million records from libraries across the U.S., including many audiovisual records. Through new service offerings available exclusively to the DPLA’s 1,600+ partner organizations, Pop Up Archive will automatically transcribe, timestamp, and generate keywords for the audio collections.

“We’re creating so much more digital media with every day that passes. If 300 hours of video are uploaded to YouTube every minute, for libraries to keep up with the pace of audiovisual content creation, they need practices that can radically scale to meet the pace of creation,” said Anne Wootton, CEO of Pop Up Archive.

“Our goal is to connect the widest audience with the greatest amount of openly available materials in our nation’s cultural heritage institutions, and audiovisual material has been both critical to our growing collection and less searchable than other forms,” said Dan Cohen, DPLA’s Executive Director. “We’re delighted that we can work with Pop Up Archive to provide this valuable additional service to our constantly expanding network of libraries, archives, and museums.”

Interested DPLA partners can learn more at

Since it was founded in 2012, Pop Up Archive has partnered with dozens of libraries, archives, and public media organizations to transcribe, tag, and index over 1,000,000 minutes of recorded sound, including over 10,000 audio items preserved at the Internet Archive. Pop Up Archive was built to create digital access to audiovisual collections through simple tools for searching and sharing sound. Most recently, Pop Up Archive has embarked on projects to combine machine intelligence with crowdsourced improvements for collections at the New York Public Library and The Moth as well as the American Archive of Public Broadcasting, a collaboration between WGBH and the Library of Congress to identify, preserve, and make accessible a digital archive of 40,000 hours of public media dating back to the late 1940s.

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated over 11 million items from over 1,600 institutions. The DPLA hubs model is establishing a national network by building off of state/regional digital libraries and myriad large digital libraries in the US, bringing together digitized and born-digital content from across the country into a single access point for end users, and an open platform for developers. The model supports or establishes local collaborations, professional networks, metadata globalization, and long-term sustainability. It ensures that even the smallest institutions have an on-ramp to participation in DPLA.

Pop Up Archive is funded by the John S. and James L. Knight Foundation, the National Endowment for the Humanities, and the Institute for Museum and Library Services. Pop Up Archive’s partners include This American Life, the American Archive of Public Broadcasting, the Studs Terkel Radio Archive from the WFMT Radio Network, and tens of thousands of hours of audio from across the United States collected by numerous public media, storytelling, and oral history organizations. Learn more at

The Digital Public Library of America is generously supported by a number of foundations and government agencies, including the Alfred P. Sloan Foundation, the Andrew W. Mellon Foundation, an anonymous donor, the Arcadia Fund, the Bill & Melinda Gates Foundation, the Institute of Museum and Library Services, the John S. and James L. Knight Foundation, the Whiting Foundation, and the National Endowment for the Humanities. To find out more about DPLA, visit

In the Library, With the Lead Pipe: Renovation as a Catalyst for Change

planet code4lib - Wed, 2015-11-18 14:00

Throwback Thursday for November 18, 2015: Take a look at Renovation as a Catalyst for Change by Erin Dorney and Eric Frierson. We may be coming to the end of 2015, but it’s never too late to think about switching things up at your library!

Photo by Flickr user drp (CC BY-NC-ND 2.0)


This Lead Pipe post is about two libraries attempting to reinvent services, collections, and spaces as the walls of their buildings come crashing down. Rather than embarking on phased construction projects, the library buildings at both St. Edward’s University and Millersville University will be completely shut down for a period of one and two years, respectively. Co-authors Eric Frierson, Library Digital Services Manager at St. Edward’s, and Erin Dorney, Outreach Librarian at Millersville, discuss renovations as catalysts for change, experimentation and flexibility, and distributed/embedded librarianship. These facets contribute to the identity crisis librarianship has struggled with since the Information Age began – only exacerbated by unique circumstances. The conversation below is one example of the kinds of real questions being proffered to librarians at both institutions:

“I don’t mean to sound disrespectful,” began the biology professor, “but if we can do without a library for a whole year, what does that say about the library?” An awkward silence settled over the science faculty meeting before the librarian was able to pull together a response.

“You’re right. The library as it exists now – the print collections, the reference desk – these may not be required elements of a thriving university library. This renovation project gives us the opportunity to re-examine what a library does on campus, what things we don’t need to do, and what things we could start doing that we haven’t done before.”

This post will not cover the new, technologically-situated, collaborative learning spaces which will exist following the renovations, but rather discuss how renovations can bring organizational change that has the potential to shape the library of the future. It is our belief that the pace of change our libraries have adopted should become the norm at all libraries.

St. Edward’s University

St. Edward’s University is a private, Catholic institution in Austin, Texas. It is home to over 5,000 students and is situated on a hill overlooking the lower Colorado River, boasting gorgeous views of the Austin skyline. Enrollment has nearly doubled in the past ten years, and the campus master plan has made the grounds and buildings cohesive, beautiful, and a delight to explore. The library, however, has been a 30-year anomaly with its white stucco, rounded-edge shell.

(Current St. Edward’s University Library Building)

During the summer of 2011, the university received a gift of $13 million from Austin-area philanthropists Pat and Bill Munday for the creation of a new library and learning commons. The only catch is that construction must be complete within a period of two years. This aggressive timeline demanded the selection of an architect almost immediately, and the library, along with its partners in the new commons, needed to have the plan for the new space completed within a few short months.

Because the project involves renovating existing square-footage and building a new addition, almost all physical resources – including collections – will need to be removed from the building for one year. The print collection of 170,000 books will need to be aggressively weeded and stored off-campus, inaccessible during the project. Only a few hundred high-circulation items and the media collection will remain on campus. Seventeen staff members will find a new home in Mang House, a three-bedroom residence with a kitchen and a laundry room. The 100 computers and public-use furniture from the old library will be dispersed throughout existing campus locations.

(Mang House – The temporary location for St. Edward’s University’s library.)

The librarians are not sure what Mang House will be like. For so long, they have identified public services with the desk that sits near the front door of the old building. There is no space for a robust reference desk in the temporary location; instead, staff will have a smallish living room with a fireplace. For an entire year, the library will exist without a reference desk, a print collection, or dedicated computing and study spaces. “If we don’t have those things… who are we, exactly?” asks Frierson and countless others.

Millersville University

Millersville University is a regional comprehensive Pennsylvania state school with a 2010 FTE of approximately 6,970 undergrads and 583 graduate students. As a state institution, campus buildings are only eligible for renovations on a strict schedule. Originally allocated $7 million from the state for basic infrastructure updates, the library and university administrators have successfully increased that amount to $25 million based on additional state allocations, university support, and private donations.

(Millersville University Library – Under construction for two years)

This intense renovation project will take 2 years to complete, gutting the interior of the 11-story building to replace all major systems (including heating, cooling, lighting, fire protection, vapor barriers and elevators). The library had to be emptied of all people, books, microfilms, computers, shelving, and furniture, down to the last piece of signage and window shades in order to allow construction to move at a quicker pace and ensure the safety of staff, visitors, and physical materials. Over 300,000 print items have been placed in storage off-site, where, similarly to St. Edward’s, the books will be inaccessible to students and faculty members.

(Millersville University – The temporary library @ Gerhart Hall)

For the next two years, the campus will rely on a temporary library in Gerhart Hall containing approximately 10,000 items and less than a quarter of the study and computing space that the old library provided. There is no traditional reference desk and most of the librarians are distributed across campus, embedded in offices within academic buildings that align with their liaison areas. Similarly to the situation at St. Edward’s, this period of massive change calls into question everything an academic library has traditionally been known to provide and represent.

Renovation as a Catalyst

From a librarian’s point of view, temporarily disconnecting from the building provides an opportunity for a clean slate. Many legacy processes are tied to institutional history and specific circumstances. To put it another way, buildings come with baggage. Libraries make exceptions, create lengthy policies, even determine resources and services based on prior experiences. Concern has been voiced by librarians (particularly those new to the profession) over the “way we’ve always done it” mantra that sometimes infiltrates institutions, marking this steadfastness as resistance to change that will leave libraries irrelevant to their constituencies. Ross and Sennyey (2008) describe some library services as holdovers from an era that has disappeared, “making our professional assumptions seem as foreign as a medieval manuscript in chains” (146). Included in these assumptions are services that are tethered to user needs that no longer exist.

The situations at St. Edward’s and Millersville are unique in that the renovations are not incremental. At both institutions, the scale of construction will shut down the entire space – not just one floor at a time. There are no branch or specialized libraries to absorb collections, services, or personnel. Business simply cannot proceed as usual – the status quo has become impossible to maintain. The libraries at St. Edward’s and Millersville have an opportunity to let go of legacies in order to better meet the needs of their respective campus communities.

Warehouse for Books

One assumption under interrogation is the idea of a library as a warehouse for print books. Neither institution is a research library attempting to collect and preserve all of the world’s knowledge. Millersville has a collection development policy stating that theirs is a “teaching collection” which directly supports the university curriculum. With limited physical space and budget, items not used are transitioned out of the collection and replaced by more accessible materials relevant to institutional learning goals. The renovation at Millersville has prompted the library to increase its number of electronic books and databases in order to support campus research needs.

At St. Edward’s, the massive renovation project has provided the library with an “excuse” to look holistically at the print collection. One year ago, the library owned 170,000 volumes. Through the first weeding project in the library’s long history, staff managed to reduce that number to 130,000. In the new building, space allocated for stacks can house approximately 90,000 books, meaning staff have some ways to go before boxing up the collection. Because librarians can’t guarantee that the library will hold the same number of print volumes in the future, the space needs to have a flexible infrastructure in order to be used differently.

It is possible that after two years of adjusting to primarily electronic scholarship, faculty and students may shed some of the traditional stereotypes held about libraries as warehouses for books. Although collection assessment and strategic reallocation initiatives at both St. Edward’s and Millersville were primarily designed to help students and faculty survive the lengthy renovation periods, this may in fact become the de facto standard for content development for the foreseeable future. Preliminary findings of ebrary®’s 2011 Global Student E-book Survey revealed that while e-book usage and awareness have not increased significantly in 2011 compared to 2008, the vast majority of students would choose electronic over print if it were available and if better tools along with fewer restrictions were offered. Reflecting global trends like this, libraries are moving towards an increase in electronic holdings and are reorganizing space within their buildings to emphasize engagement with content, not simply storage.

Rethinking Reference

In addition to addressing changes in content and collections, the renovations at St. Edward’s and Millersville provide opportunities to experiment with (or without) certain longstanding library services. At Millersville, the two years without a building have been internally referred to as “a big experiment” in order to test out new ideas and determine which existing or new services are brought back into the new library.

Traditional reference is one service currently being investigated for transformation. Staff at Millersville decided not to install a reference desk inside of the temporary library in Gerhart Hall. In fact, there are no librarians located within Gerhart Hall, only staff and student employees. For just-in-case research questions, the library has developed a stand-up, self-help kiosk where users can walk up to a dedicated computer and instantly chat/IM/email a librarian or pick up the phone and call. To assist, student employees working at the circulation desk are being trained on a referral system where they can lead students to the kiosk or direct them to a specific subject librarian.

Staff at Millersville have expanded their suite of virtual research help options for just-in-time questions. Librarians take shifts providing assistance through phone, text, email and chat/IM (11-8 Monday through Thursday, 11-4 Fridays, and 2-8 Sundays). Another facet has been initiating at least three consistent office hours during which subject librarians will be available in their office for research consultations or appointments.

Inspired by Austin’s Coolhaus Ice Cream Truck’s use of Twitter to notify customers of its current location, St. Edward’s is considering heavier use of social media to inform students and faculty where reference assistance can be found. While still in the planning stages, the general idea is for librarians to check in using Foursquare or Gowalla at various campus locations with a note about how long they will be there. This check-in will automatically propagate to the library’s Facebook and Twitter accounts and show up on the website in a rolling feed of library news and updates. In this scenario, even users who do not connect with the library through social media services will still benefit from the check-in.

Librarians who station themselves around campus will be equipped with a netbook or a tablet computer with a keyboard and have the ability to print to any campus printer. The hope is that fully mobile librarians with high-end technology and the ability to help wherever the student may be will begin to shape expectations of students.

Traditional reference desks are often immobile and, in some cases, emphasize the power disparity between the knowledge seeker and the knowledge holder (either purposely or inadvertently). In these situations, it may be difficult for libraries to experiment with new methods of interacting with users, either face-to-face or digitally. It is easy to fall back on what is known, what is safe. The removal of these structures for renovation purposes is described by Dorney as an “almost cathartic experience,” providing a sense of freedom to test user and librarian reaction to innovative avenues of service.

Professional Identity & Relevance

While the impact of these two renovations on their respective campus communities is an area ripe for discussion, the projects have also released the internal floodgates. Both institutions are witnessing discussions relating to professional identity and the library’s relevance/value within higher education. Often, anxiety accompanies these conversations, a natural reaction for any passionate professional.

At Millersville, staff is distributed on and off campus. There are librarians in academic buildings, staff in Gerhart Hall, librarians and staff at the off-site storage facility, and student employees everywhere in between. The way library work is accomplished is changing dramatically. Employees are beginning to rely more and more on technology to assist in everyday activities. Where resistance to change may have existed before for initiatives like video conferencing or using a wiki to share documentation, individuals have been forced out of their comfort zones to grow as a high-functioning team of professionals.

In the case of St. Edward’s, questions abound about how group dynamics may change when seventeen staff members are forced to exist within a cozy, three-bedroom house for one year. Without personal offices, librarians there may have a completely different experience in terms of collaboration and it is inevitable that all interactions will reach new levels of intensity, for better or worse.

Though the St. Edward’s library website already provides a great many services and resources, it will become even more apparent that it is the primary means of interacting with the library. David Lee King writes that “the library’s website IS the library,” and the absence of a robust physical presence will solidify that perception. It is time for as much effort – if not more – to be placed on our digital assets as on our physical spaces.

On Failure & Flexibility

It would not be apropos to conclude this article without mentioning the importance of flexibility and freedom to fail. Both authors have found that it is often the best laid plans that have disintegrated while spur-of-the-moment ideas have taken off like wildfire. There is no ultimate road map to ensure success.

At Millersville, for example, the old library was the tallest, most heavily-trafficked building on campus. Assuming that the next largest building for student gathering was the newly-renovated Student Memorial Center, librarians set up a “Research Blast” table in a high-visibility area. The plan was to have multiple librarians available in shifts with computers and informational handouts to help students with their research questions. Staff promoted the one-week event heavily, using Facebook, QR codes, emails, posters, word-of-mouth. Librarians wore bright green tee shirts saying “Ask me about the library” and were proactive, making eye contact and greeting students as they passed.

The librarians barely received one research question the entire week. It turned out to be a great opportunity to answer questions about the library – what’s in the temporary library, where can I go to print papers, what is the new library going to look like, when is the project going to be done? But librarians certainly weren’t helping students locate or evaluate peer-reviewed articles, analyze sources, or brainstorm search strategies. It was a failure in one aspect and a success in another. The freedom to fail and the flexibility to adapt accordingly are paramount to initiating change.

St. Edward’s has the benefit of learning from Millersville’s two-year experiment before knocking down their old building. If students are not using roaming librarians to ask research questions, then where are they asking those kinds of questions?  Studies of student research behavior suggest that faculty members, teaching assistants, the writing center, and course readings and websites are frequently sources students turn to for help (Foster & Gibbons, 2007; Head & Eisenberg, 2009). Though liaison librarians continue to inform faculty and teaching assistants about the services that will be available during construction, reaching students through course websites is another avenue worth exploring.

Currently, all Blackboard course websites at St. Edward’s University have a link labeled “Ask a Librarian,” which links students to the general reference assistance page. However, most students do not understand how librarians can help. To improve this Blackboard presence, librarians have written a short JavaScript widget that will link students to course- or subject-specific pages designed to be an in-context landing page for library resources and services. In other words, if a student clicks on “Library Resources” from a course in the school of business, he or she will be directed to the research guide for business students, not the generic library homepage.

Exploring new options takes staff time and creative thinking; some projects will fail, but the spark of innovation provided by challenging circumstances may result in new and improved practices that last well beyond the transition period into these new buildings.


As economist Paul Romer once said, “A crisis is a terrible thing to waste” (Rosenthal, 2009). In the cases of St. Edward’s and Millersville, the crisis of being without the library as one cohesive place provides librarians with an opportunity to initiate change. Without the baggage of the past, libraries can look holistically at their portfolio of services, determining which to continue investing time and resources into. Others may have simply run their course, having been poorly designed from the outset or too dated to serve a new generation of scholars.

Measuring the success of these experiments is often difficult. Due to the magnitude of change (moving from one centralized building to many distributed/embedded locations), neither St. Edward’s nor Millersville can simply compare usage statistics to those of the old library. Because these libraries are focusing on interacting with users in new ways, measures have to be more comprehensive, taking both qualitative and quantitative aspects into account. In some cases, this will be longitudinal data. Both authors are hopeful that what is learned during these experiments outside of the library will be brought back into the new libraries in order to support the university community at a higher level, showcasing our professional growth and relevancy.

For each traditional library aspect that is re-envisioned, time and resources are made available to investigate new and innovative ways to interact with information. While keeping the history and mission of the academic library close to heart, librarians need to initiate honest, open, and difficult conversations and take immediate action towards readying academic librarianship for a new era.

In her fall 2010 convocation address to the university community, Millersville University President Francine McNairy stated: “…Indeed, the Ganser building will close, but the University library will not. You might think that the library is at the intersection of Frederick and George Streets, but it is actually at the intersection of scholarship, innovation, creativity and collaboration. And that’s the road to our future.” It is possible that upon moving back into each of these new libraries, the resources, services and spaces provided to users may look completely different. When individuals inquire about the risk of becoming irrelevant after a year or two without a building, perhaps that is the opportunity for librarians to inform their communities that the library is much more than bricks and mortar, and we are in the midst of fundamental shifts regarding our impact on students and learning.

Embarking on extensive renovations like those discussed here brings unique opportunities to initiate change within libraries, but renovations are not the only way to prepare for the future. The authors are issuing a call to action: how would you change your library if you had a year without the historical baggage of a building? Take those plans and run with them – there is no reason why you have to wait for the bulldozers.

Many thanks to Melissa Gold for her feedback on this piece. Thanks also to Lead Pipers Hilary Davis, Leigh Anne Vrabel, Ellie Collier, and Emily Ford for edits, comments, and thought provoking questions.

References and Further Readings

Association of College and Research Libraries (n.d.). Value of Academic Libraries Report. Retrieved November 8, 2011, from

booth, c. (2010). Librarians as __________: Shapeshifting at the periphery. Retrieved November 5, 2011, from

ebrary® (2011). ebrary Surveys Suggest Students’ Research Needs Unmet, Results to be Presented at Charleston. Retrieved November 8, 2011, from

Foster, N. F., & Gibbons, S. (Eds.). (2007). Studying students: The undergraduate research project at the University of Rochester. Chicago: Association of College and Research Libraries. Retrieved November 8, 2011, from

Frierson, E. (2011, July 8). Course-specific library links in Blackboard, Moodle, or any LMS you can name [blog post]. Retrieved November 8, 2011 from

Head, A. J., & Eisenberg, M. B. (2009). Lessons learned: How college students seek information in the digital age. Project Information Literacy First Year Report with Student Survey Findings. University of Washington Information School. Retrieved November 8, 2011, from

Millersville University Library (n.d.). Millersville Library Renovation Information. Retrieved November 6, 2011, from

King, D. L. (2005, September 22). Website as destination [blog post]. Retrieved November 8, 2011 from

Rosenthal, J. (2009). On Language – A Terrible Thing to Waste. The New York Times Magazine. Retrieved November 8, 2011, from

Ross, L., & Sennyey, P. (2008). The library is dead, long live the library! The practice of academic librarianship and the digital revolution. Journal of Academic Librarianship, 34(2), 145-152.

St. Edward’s University (2011). St. Edward’s University Receives $13 Million from Pat and Bill Munday. Retrieved November 5, 2011, from

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 United States License. Copyright remains with the author/s.

Library Tech Talk (U of Michigan): Setting up our Library IT Workload Study

planet code4lib - Tue, 2015-11-17 00:00

For the past year, the University of Michigan Library IT (LIT) division has been devoting time & resources to conducting a thorough self-assessment. These efforts have included discussions focused on our application infrastructure, our division communication patterns, and our service workflows. And while these discussions were instrumental in helping identify challenges and potential solutions, we wanted to begin to collect hard data to help better understand the complexity of our work.

District Dispatch: Job Corps vendor outreach conference Nov 18

planet code4lib - Mon, 2015-11-16 21:05

The U.S. Department of Labor’s Employment and Training Administration invites interested organizations to register for its Nov 18 Job Corps Vendor Conference.

Consider being a contractor or subcontractor in operating a Job Corps center

The U.S. Department of Labor’s (DOL) Employment and Training Administration, Office of Contracts Management has announced that it will be holding a Job Corps vendor outreach conference on November 18, 2015 at the Woodland Job Corps Center.

The Agency would like to enlist new organizations to support the Agency’s Job Corps Program as contractors or subcontractors in the operation of a Job Corps center. The Agency is hoping to attract new organizations, such as libraries, that may have a fresh perspective on supporting Job Corps’ mission to provide vocational and academic training to eligible young people aged 16 to 24.

The Woodland Job Corps Center is located in Laurel, Maryland, conveniently situated between the cities of Baltimore and Washington, D.C. To attend, you must REGISTER!

When:    Wednesday, November 18; the conference will run from 10:00 a.m. until 4:00 p.m.
Where:   Woodland Job Corps Center, 3300 Fort Meade Road, Laurel, MD 20724


Contact:   Leslie Adams, Contract Specialist (Contractor),
Phone:     (202) 693-2920

The post Job Corps vendor outreach conference Nov 18 appeared first on District Dispatch.

Nicole Engard: Bookmarks for November 16, 2015

planet code4lib - Mon, 2015-11-16 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Smore: Smore makes it easy to design beautiful and effective online flyers and newsletters.
  • Ninite: Install and Update All Your Programs at Once

Digest powered by RSS Digest

The post Bookmarks for November 16, 2015 appeared first on What I Learned Today....

Related posts:

  1. No more email newsletters for me
  2. RSS Submit
  3. Collaborate in Real Time

Eric Lease Morgan: A historical treasure trove: Social justice tradition runs through Catholic archives

planet code4lib - Mon, 2015-11-16 19:55

by Michael Sean Winters  |  Nov. 10, 2015, National Catholic Reporter

Catholic schools face many challenges. In recent decades, the steady supply of free labor from religious men and women has dried up. Demographic changes resulted in most inner-city Catholic schools serving poor, non-Catholic populations. Stagnant wages put the cost of a Catholic school education out of reach for most middle-class Catholic families. And the rising cost of education at all levels, from kindergarten through college, has affected profoundly the crowning glory of U.S. Catholicism, our vibrant educational system.

For all those problems, there are many interesting developments in Catholic education, one of which was the focus of a conference titled “Catholic Archives in the Digital Age: A Conference for Archivists and Teachers” held Oct. 8-9 at The Catholic University of America in Washington. The event brought together Catholic educators with Catholic archivists to explore ways that archival material, especially digitized material, can be used in classrooms.

We all find reasons to bemoan canon law, but one of its benefits is that it requires a lot of record-keeping, and those records, deposited in Catholic archives, are a treasure trove of information for teaching young people.

The conference began with a panel of archivists highlighting their holdings that could be useful in the classroom. Malachy McCarthy oversees the Claretian archives in Chicago. He noted that religious communities like the Claretians respond to the needs of the times, and the archives reflect those responses. For example, the Claretian archives have material on the “down-and-dirty social history” of the mostly working-class people the Claretians served. Continue reading …


Eric Lease Morgan: DPLA Announces Knight Foundation Grant to Research Potential Integration of Newspaper Content

planet code4lib - Mon, 2015-11-16 19:45

Posted by DPLA on November 9, 2015 in DPLA Updates, News & Blog, Projects and tagged announcements, newspapers.

The Digital Public Library of America has been awarded $150,000 from the John S. and James L. Knight Foundation to research the potential integration of newspaper content into the DPLA platform.

Over the course of the next year, DPLA will investigate the current state of newspaper digitization in the US. Thanks in large part to the National Endowment for the Humanities and the Library of Congress’s joint National Digital Newspaper Program (NDNP) showcased online as Chronicling America, many states in the US have digitized their historic newspapers and made them available online. A number of states, however, have made newspapers available outside of or in addition to this important program, and DPLA plans to investigate what resources it would take to potentially provide seamless discovery of the newspapers of all states and US territories, including the over 10 million pages already currently available in Chronicling America. Continue reading.

District Dispatch: 5 things you shouldn’t forget this holiday season

planet code4lib - Mon, 2015-11-16 18:39

5.  Thaw the turkey!
4.  Accidentally don’t remember to buy glitter.
3.  Stock up on elf repellent
2.  Open the fireplace flue all the way (whether you’re expecting Santa or not).
1.  Bake an “Advocake”!

“What the ^%$&#@?!@ is an Advocake?” we hear you saying. Glad you asked! We happen to have the recipe right here. There’s no better way to maximize holiday satisfaction than by wishing Members of Congress a happy holiday and inviting them for a quick visit to their local library while they are back in town. Here’s how to take advantage of the holiday recess and use it for library advocacy!

The post 5 things you shouldn’t forget this holiday season appeared first on District Dispatch.

Harvard Library Innovation Lab: Link roundup November 16, 2015

planet code4lib - Mon, 2015-11-16 18:03

This is the good stuff.


Make amazing shirts from web image searches. Love this. Fog is a winner. Grass too.

Rebellious Group Splices Fruit-Bearing Branches Onto Urban Trees | Mental Floss

Guerrilla Grafters splice fruit-bearing branches onto urban trees

Idea Sex: How New Yorker Cartoonists Generate 500 Ideas a Week – 99u

“One idea is never enough”

Google Cardboard’s New York Times Experiment Just Hooked a Generation on VR

The new (Cardboard) made of the old (cardboard) bundled with the old (printed newspaper).

Amazon is opening its first physical bookstore today | The Verge

Amazon opens a store

Islandora: Islandora Fundraising

planet code4lib - Mon, 2015-11-16 15:00

The Islandora Foundation is growing up. As a member-supported nonprofit, we have been very fortunate to have the support of more than a dozen wonderful universities, private companies, and like-minded projects - enough support that within our first year of operation, we were solvent. As of 2015, we now have a small buffer in our budget, which is a comfortable place to be.

But comfortable isn't enough. Not when our mission is to steward the Islandora project and ensure that it is the best software that it can be. With the launch of Fedora 4 last December, we started work on a version of Islandora that would work with this new major upgrade to the storage layer of our sites, recognizing that our community is going to want and need to move on to Fedora 4 someday, and we had better be ready with a front-end for them when the time comes. Islandora 7.x-2.x was developed to the prototype stage with special funding from some of our supporters, and development continues by way of volunteer sprints. Meanwhile, Islandora 7.x-1.x (which works with Fedora 3) continues to be supported and improved - also by volunteers.

It's a lot to coordinate, and we have determined through consultation with our interest groups, committees, and the community in general that in order to do this right, we need to have someone with the right skill set dedicated to coordinating these projects. We need a Tech Lead.

Right now, the Islandora Foundation has a single employee (*waves*). I am the Project & Community Manager, which means I work to support community groups and initiatives, organize Islandora events, handle communications (both public and private) and promotions, and just generally do everything I can to help our many wonderful volunteers to do the work that keeps this project thriving. We've been getting by with that because many of the duties that would belong to a Tech Lead have been fulfilled by members of the community on a volunteer basis, but we are swiftly outgrowing that model. The Fedora 4 project that inspired us to take on a new major version of Islandora has had great success with a two-person team of employees (plus many groups for guidance): Product Manager David Wilcox (more or less my counterpart) and Tech Lead Andrew Woods.

Now to the point: we need money. We have a confirmed membership revenue of $86,000 per year*, which is plenty for one employee plus some travel and general expenses, but not enough to hire this second position that we need to get the project to the next level. About a month ago I contacted many of the institutions in our community to see if they could consider becoming members of the Islandora Foundation, and we had a gratifying number of hopeful responses (thank you to those folks!), but we're still short of where we need to be. 

And so, the Funding Lobster (or Lobstometre). In the interest of transparency, and perhaps as motivation, this little guy is showing you exactly where things stand with our Tech Lead goal. If we get $160,000 in memberships we can do it (but we'll be operating without a net), $180,000 and we're solid, and if we hit $200,000 or above that's just unmitigated awesome (and would get turned into special projects, events, and other things to support the community). He's the Happy Lobster, and not the Sad Lobster, because we do believe we'll get there with your help, and soon.

How can you help? Become a member. While it would be great if we could frame this as a funding drive and take one-time donations, since the goal is to hire a real live human being who will want to know that they can pay their rent and eat beyond their first year of employment, we need to look for renewable commitments. Our membership levels are as follows:

Institutional Membership:

  • Member - $2000
  • Collaborator - $4000
  • Partner - $10,000

Individual Membership:

  • $10 - $250+ (at your discretion)

There are many benefits to membership, including things like representation on governing committees and discounts at events. Check out the member page or drop me an email if you want to know more.

Many thanks,

- Melissa

* some of our members were able to allocate more funding to support 7.x-2.x development than their typical membership dues. It is currently unknown how many will be able to maintain that funding level at renewal, but yearly membership revenue could be as high as $122,000. I went with the number we can be sure of.

Mark E. Phillips: Finding figures and images in Electronic Theses and Dissertations (ETD)

planet code4lib - Mon, 2015-11-16 14:17

One of the things that we are working on at UNT is a redesign of The Portal to Texas History’s interface.  In doing so I’ve been looking around quite a bit at other digital libraries to get ideas of features that we could incorporate into our new user experience.

One feature I found that looked pretty nifty was the “peek” interface for the Carolina Digital Repository. They make the code for this interface available to others via the UNC Libraries GitHub in the peek repository. I think this is an interesting interface, but I was still left with the question of “how did you decide which images to choose?” I came across the peek-data repository, which suggested that choosing the images was a manual process, and I also found a PowerPoint presentation titled “A Peek Inside the Carolina Digital Repository” by Michael Daines that confirmed this is the case. These slides are a few years old, so I don’t know if the process is still manual.

I really like this idea and would love to try and implement something similar for some of our collections but the thought of manually choosing images doesn’t sound like fun at all.  I looked around a bit to see if I could borrow from some prior work that others have done.  I know that the Internet Archive and the British Library have released some large image datasets that appear to be the “interesting” images from books in their collections.

Less and More interesting images

I ran across a blog post called “Extracting images from scanned book pages” by Chris Adams, who works on the World Digital Library at the Library of Congress. It seemed close to what I wanted to do, but wasn’t exactly it either.

I remembered a Code4Lib lightning talk from a few years back by Eric Larson called “Finding image in book page images”, and the companion GitHub repository picturepages that contains the code he used. In reviewing the slides and looking at the code, I think I found what I was looking for, or at least a starting point.


What Eric proposed for finding interesting images was that you would take an image, convert it to grayscale, increase the contrast dramatically, convert this new image into a single-pixel-wide image that is 1500 pixels tall, and sharpen it. That resulting image would be inverted, have a threshold applied to convert everything to black or white pixels, and then be inverted again. Finally, the resulting black or white pixel values are analyzed to see if there are areas of the image 200 or more pixels long that are solid black.

convert #{file} -colorspace Gray -contrast -contrast -contrast -contrast -contrast -contrast -contrast -contrast -resize 1X1500! -sharpen 0x5 miff:- | convert - -negate -threshold 0 -negate TXT:#{filename}.txt

The script above uses ImageMagick to convert an input image to greyscale, calls contrast eight times, resizes the image, and then sharpens the result. It pipes this file into convert again, flips the colors, applies a threshold, and flips the colors back. The output is saved as a text file instead of an image, with one line per pixel. The output looks like this:

# ImageMagick pixel enumeration: 1,1500,255,srgb
...
0,228: (255,255,255) #FFFFFF white
0,229: (255,255,255) #FFFFFF white
0,230: (255,255,255) #FFFFFF white
0,231: (255,255,255) #FFFFFF white
0,232: (0,0,0) #000000 black
0,233: (0,0,0) #000000 black
0,234: (0,0,0) #000000 black
0,235: (255,255,255) #FFFFFF white
0,236: (255,255,255) #FFFFFF white
0,237: (0,0,0) #000000 black
0,238: (0,0,0) #000000 black
0,239: (0,0,0) #000000 black
0,240: (0,0,0) #000000 black
0,241: (0,0,0) #000000 black
...

The next step was to loop through each of the lines in the file to see if there was a sequence of 200 black pixels.
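Eric's original loop isn't reproduced here, but a minimal Python sketch of that check (the 200-pixel threshold comes from the description above; the function name and structure are mine) might look something like this:

def has_long_black_run(txt_path, run_length=200):
    """Return True if ImageMagick's TXT output contains run_length consecutive black pixels."""
    current_run = 0
    with open(txt_path) as f:
        for line in f:
            if line.startswith('#'):
                # skip the "ImageMagick pixel enumeration" header line
                continue
            if '(0,0,0)' in line:
                # black pixel: extend the current run
                current_run += 1
                if current_run >= run_length:
                    return True
            else:
                # white pixel: reset the run counter
                current_run = 0
    return False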

I pulled a set of images from an ETD that we have in the UNT Digital Library and tried a Python port of Eric's code that I hacked together. Things worked pretty well for me; it was able to identify the images that I would have manually pulled as “interesting” pages on my own.

But there was a problem that I ran into: the process was pretty slow.

I pulled a few more sets of page images from ETDs and found that for those images it would take the ImageMagick convert process up to 23 seconds per image to create the text files that I needed to work with. This made me ask whether I could implement this same sort of processing workflow with just Python.

I need a Pillow

I have worked with the Python Imaging Library (PIL) a few times over the years and had a feeling it could do what I was interested in doing. I ended up using Pillow, which is a “friendly fork” of the original PIL library. My thought was to apply the same processing workflow as was carried out in Eric's script and see if doing it all in Python would be reasonable.

I ended up with an image processing workflow that looks like this:

from PIL import Image, ImageEnhance, ImageFilter, ImageOps

# Open image file (filename is the path to a single page image)
im = Image.open(filename)

# Convert image to grayscale image
g_im = ImageOps.grayscale(im)

# Create enhanced version of image using aggressive Contrast
e_im = ImageEnhance.Contrast(g_im).enhance(100)

# resize image into a tiny 1x1500 pixel image
# ANTIALIAS, BILINEAR, and BICUBIC work, NEAREST doesn't
t_im = e_im.resize((1, 1500), resample=Image.BICUBIC)

# Sharpen skinny image file
st_im = t_im.filter(ImageFilter.SHARPEN)

# Invert the colors
it_im = ImageOps.invert(st_im)

# If a pixel isn't black (0), make it white (255)
fixed_it_im = it_im.point(lambda x: 0 if x < 1 else 255, 'L')

# Invert the colors again
final = ImageOps.invert(fixed_it_im)

I was then able to iterate through the pixels in the final image with the getdata() method and apply the same logic of identifying images that have sequences of black pixels that were over 200 pixels long.
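That final check isn't shown above; a minimal sketch of it, reusing the same 200-pixel threshold and operating on the final image from the code above, could be:

# Look for a run of 200 or more consecutive black pixels in the 1x1500 image.
# After the double inversion above, pixel values are either 0 (black) or 255 (white).
current_run = 0
interesting = False
for pixel in final.getdata():
    if pixel == 0:
        current_run += 1
        if current_run >= 200:
            interesting = True
            break
    else:
        current_run = 0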

Here are some examples of thumbnails from three ETDs,  first all images and then just the images identified by the above algorithm as “interesting”.

Example One

Thumbnails for ark:/67531/metadc699990/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699990/ with just visually interesting pages shown.



Example Two

Thumbnails for ark:/67531/metadc699999/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699999/ with just visually interesting pages shown.

Example Three

Thumbnails for ark:/67531/metadc699991/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699991/ with just visually interesting pages.

So in the end I was able to implement the code in Python with Pillow and a fancy little lambda function.  The speed was much improved as well.  For those same images that were taking up to 23 seconds to process with the ImageMagick version of the workflow,  I was able to process them in a tiny bit over a second with this Python version.

The full script I was using for these tests is below. You will need to download and install Pillow in order to get it to work.
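A rough sketch of such a script, assembled from the pieces above (the command-line handling and names here are illustrative, not necessarily what the original script looked like):

import sys

from PIL import Image, ImageEnhance, ImageFilter, ImageOps

RUN_LENGTH = 200  # consecutive black pixels needed to call a page "interesting"


def skinny_image(filename):
    """Apply the grayscale, contrast, resize, sharpen, and threshold steps to one page image."""
    im = Image.open(filename)
    g_im = ImageOps.grayscale(im)
    e_im = ImageEnhance.Contrast(g_im).enhance(100)
    t_im = e_im.resize((1, 1500), resample=Image.BICUBIC)
    st_im = t_im.filter(ImageFilter.SHARPEN)
    it_im = ImageOps.invert(st_im)
    fixed_it_im = it_im.point(lambda x: 0 if x < 1 else 255, 'L')
    return ImageOps.invert(fixed_it_im)


def is_interesting(image, run_length=RUN_LENGTH):
    """Return True if the one-pixel-wide image has run_length consecutive black pixels."""
    current_run = 0
    for pixel in image.getdata():
        if pixel == 0:
            current_run += 1
            if current_run >= run_length:
                return True
        else:
            current_run = 0
    return False


if __name__ == '__main__':
    # Pass page image filenames on the command line; prints the "interesting" ones.
    for filename in sys.argv[1:]:
        if is_interesting(skinny_image(filename)):
            print(filename)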

I would love to hear other ideas or methods to do this kind of work, if you have thoughts, suggestions, or if I missed something in my thoughts, please let me know via Twitter.


D-Lib: Reminiscing About 15 Years of Interoperability Efforts

planet code4lib - Mon, 2015-11-16 14:14

Opinion by Herbert Van de Sompel, Los Alamos National Laboratory and Michael L. Nelson, Old Dominion University

