
NYPL Labs: Nomadic Classification: Classmark History and New Browsing Tool

planet code4lib - Wed, 2016-01-27 19:58

In the past few months, NYPL Labs has embarked upon a series of investigations into how legacy classification systems at the library can be used to generate new data and power additional forms of discovery. What follows is some background on the project and some of the institutional context that has prompted these examinations. One of the tools we’re introducing here is “BILLI: Bibliographic Identifiers for Library Location Information”; read on for more background, and be sure to try the tool out for yourself.

Then there is a completely other type of distribution or hierarchy which must be called nomadic, a nomad nomos, without property, enclosure or measure. Here, there is no longer a division of that which is distributed but rather a division among those who distribute themselves in an open space — a space which is unlimited, or at least without precise limits… Even when it concerns the serious business of life, it is more like a space of play, or a rule of play… To fill a space, to be distributed within it, is very different from distributing the space. —Gilles Deleuze, Difference & Repetition

Classification, the basic process of categorization, is simple in theory but becomes complex in practice. Examples of classification can be seen all around us, from the practical use of organizing the food found in your local grocery store into aisles, to the very specialized taxonomy system that separates the hundreds of different species of the micro-animal Tardigrada. At their core these various systems of categorization are simply based on good faith judgments. Whoever organized your local grocery thought: “Cookies seem pretty similar to crackers, I will put them together in the same aisle.”

A similar, but more evidence-based, process produced the system that categorizes hundreds of thousands of biological species. Classification systems are usually logical but inherently arbitrary. Uniform application is what makes a system useful, yet uniformity is difficult to maintain over long periods of time. Institutional focus shifts, the meanings of words drift, and even our culture itself changes when time is measured in decades. Faced with these challenges, in the age of barcodes, databases, and networks, the role of traditional classification systems is not diminished; rather, they stand to benefit from practical thinking about how to leverage this new environment.

**Nerd Alert! If 19th century classification history is not your thing you might want to skip to A Linked Space.**

Problem Space

Libraries are founded on the principle of classification, the most common and well known form being the call number. This is the code that appears on the spine of a book keeping track of where it should be stored and usually the subject of its content. The most well known form of call number is, of course, the iconic Dewey Decimal System. But there are many other systems employed by libraries due to the nature of the resources being organized and the strengths and weaknesses of a specific classification system. Just as there is no single tool for every job there is no universal system for classification.

The New York Public Library is a good example of that realization as seen in the adoption of multiple call number systems over its 120-year history. The very first system used at the library was developed in 1899 by NYPL’s first president, John Shaw Billings. He wanted to develop a system that could efficiently organize the materials being stored in the library’s soon-to-open main branch, the Stephen A. Schwarzman building. In fact, Billings also contributed to how the new main building should be physically designed:

John Shaw Billings, sketch of the main building layout. Image ID: 465480

The classification scheme he came up with, known as the Billings system, could be thought of as a reflection of the physical layout of the main branch circa 1911. Billings was very practical in describing his creation, writing:

Upon this classification it may be remarked that it is not a copy of any classification used elsewhere; that it is not especially original; that it is not logical so far as the succession of different departments in relation to the operations of the human mind is concerned; that it is not recommended for any other library, and that no librarian of another library would approve of it.

The system groups materials together by assigning each area of research a letter, A-Z (minus the letter J, more on that later). This letter, the first part of the call number, is known as a classmark.

For example, books cataloged under this system that are Biographies would have “A” as the first letter of their classmark. History is “B”, Literature is “N”, and so on through “Z”, which is Religion. More letters can be added to denote more specific areas within that subject. For example, “ZH” covers Ritual and Liturgy, and “ZHW” more specifically covers Ancient and Medieval Ritual and Liturgy.

While this system was used to classify the materials held in the main branch’s stacks, there were also materials held in the reading rooms and special collections around the building. To organize these materials, Billings reused the same system but added an asterisk in front of the letter to make what he called the star groups: a classmark starting with “K” is about Geography, but “*K” marks a resource kept in the Rare Books division, though not necessarily about Geography. With these star groups, the Billings system became a conflation of subject-, location-, and material-based systems. This overview document gives a good idea of the large range of classification that the Billings system covered:

Top level Billings classmarks

In the 1950s, an uptick in the acquisition of materials made the Billings system too inefficient for quickly cataloging materials. While parts of the Billings system continued to be used, even to this day, a general shift to a new fixed order system was made in 1956, and the system was refined in 1970.

The idea behind the fixed order systems is to group materials together by size to most efficiently store them. The library decided that discovery of materials could be achieved not by the classmark but by the resource’s subject headings. Subject headings are added to the record while it is being cataloged and provide a vector of discovery if the same subjects are uniformly applied.

In the United States the most common vocabulary of subject headings is the Library of Congress Subject Headings. Enabling resource  discovery through subject headings obviates the need for call numbers to organize materials. The call number can just be an identifier to physically locate the resource. The first fixed order system grouped items only by size:

Old fixed order system classmarks

The call numbers would look something like “B-10 200”, meaning it was the 200th 17cm monograph cataloged. This system was refined in 1970 to include a bit more contextual information about what the materials were about:

Revised fixed order system classmarks

Since the letter J was previously unused in the Billings system, it was used here as a prefix to the size system to add more context to the fixed order classification. For example, “JFB 00-200” means the item is still a 17cm monograph but is generally about the Humanities & Social Sciences.

While this fixed order system is used for the majority of the monographs acquired by the library there are special collections in the research divisions that use their own call number systems. Resources like archival collections or prints and photographs have records in the catalog that help locate them in their own division. For example, archival collections often have a call number that starts with “MssCol” while rare books at the Schomburg Center start with “Sc Rare”. This final diverse category of classification at NYPL drives home the obvious problem: the sheer number of classification systems at work reduces the call number to an esoteric identifier—especially for obsolete and legacy systems—useful to only the most veteran staff member. These identifiers have great potential to develop new avenues of discovery if they can be leveraged.

A Linked Space

An ambitious 19th-century librarian faced with this problem might come up with a simple solution: Let’s invent a new classification system that incorporates all these various types of call numbers into one centralized system. With the emergence of linked data in library metadata practice, however, when relationships between resources are gaining increased importance, a 21st-century librarian has an even better idea: Let’s link everything together!

In most existing library metadata systems, the call number is a property of the record; it is a field in the metadata that helps to describe the resource. The linked data alternative is to make the classmark its own entity and give it an identifier (URI), which allows us to describe the classmark and start making statements about it. This is nothing new in the library linked data world; the Library of Congress, for example, started doing this for some of its vocabularies (including some LCC classmarks) years ago. But accomplishing this task at NYPL took a combination of technical work, institutional-knowledge sleuthing, and a lot of data cleanup.
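To make the classmark-as-entity idea concrete, here is a toy Python sketch that mints a URI for a classmark and emits a few SKOS statements about it in Turtle. The namespace URI is invented for illustration; it is not NYPL’s actual identifier scheme:

```python
def skos_turtle(classmark, label, broader=None,
                base="http://example.org/classmark/"):
    """Emit SKOS statements about one classmark as Turtle (toy sketch)."""
    slug = classmark.replace("*", "%2A")  # '*' isn't URI-safe, so encode it
    stmts = [f"<{base}{slug}> a skos:Concept",
             f'skos:prefLabel "{label}"']
    if broader:
        stmts.append(f"skos:broader <{base}{broader}>")
    return ("@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"
            + " ;\n    ".join(stmts) + " .")

print(skos_turtle("ZHW", "Ancient and Medieval Ritual and Liturgy",
                  broader="ZH"))
```

Once a classmark has a stable URI like this, any further statement (usage counts, mappings to other schemes) can hang off the same identifier.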

The first step is to simply get a handle on what classmarks are in use at the library. By aggregating over 24 million catalog records and identifying the unique classmarks, we are able to create a dataset that contains all possible classmarks at the library. This new bottom-up approach to our call number data enables the next step of organization and description.
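That aggregation step can be sketched in a few lines, assuming the call numbers have been exported from the catalog records; the classmark pattern here is a simplification of the real formats:

```python
from collections import Counter
import re

def classmark_counts(call_numbers):
    """Tally the leading classmark of each call number.

    A classmark is taken to be the leading run of capital letters,
    with an optional '*' star-group prefix, e.g. '*KF 1911' -> '*KF'.
    """
    counts = Counter()
    for cn in call_numbers:
        m = re.match(r"\*?[A-Z]+", cn.strip())
        if m:
            counts[m.group(0)] += 1
    return counts

counts = classmark_counts(["ZHW 55-221", "*KF 1911", "ZHW 03-11", "B-10 200"])
```

Run over 24 million records, a tally like this yields both the set of classmarks actually in use and how heavily each one is used.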

Institutional knowledge is hard to retain over 120 years of history. It takes the form of old documents and outdated technical memos. The Billings classification was first documented in a schedule (a monograph book) that lists each classmark and its meaning. Some of these original bound resources are still around the library and have gone through their own data migration journey.

Page from the bound Billings schedule

While the bound books are still in use today, most with copious marginalia, over the years this resource was converted to a typed document and then brought into the digital realm as an MS Word document. This data was stewarded by our colleagues in BookOps, who are the authority for cataloging at NYPL. We took this data, converted it into an RDF SKOS vocabulary, and reconciled it with the raw classmark data aggregated from the catalog. This new dataset is comprehensive because it aligns what classmarks are supposed to be in use, from the documentation, with what is actually in use, from the data aggregation. Each classmark now has its own persistent URI identifier, which we can begin making statements about:

Example triple statements for *H classmark - Libraries

This confusing-looking jumble of text is a bunch of RDF triples in Turtle syntax describing the *H (Libraries) classmark: the name of the classmark, the type of classmark it is, which narrower classmarks are related to it, how many resources use it in the catalog, and mappings to other classification systems, among other data. Now that we have all the classmarks in a highly structured semantic data model, we can publish all these statements and start doing even more interesting things with them.

BILLI Home page

All of these statements about our classmarks are hosted on a system we are calling BILLI: Bibliographic Identifiers for Library Location Information, an homage to NYPL’s mustachioed first president, John Shaw Billings.

The BILLI system allows you to explore the classmarks in use at the library, traverse the hierarchies, and link out to resources in the catalog. But the real power of having our classmark information in this linked data form is the ability to start building relationships by creating new data statements.

A logged-in staff member can add notes or change the description of a classmark but, more importantly, can start linking to other linked data resources. Right now, staff members can connect our classmarks to Wikidata and DBpedia, two linked data sources connected to Wikipedia:

Staff interface for mapping classmarks

The system auto-suggests some possible connections, here for the *H Library classmark, and the staff member can select or search for a more appropriate entity. Once connected, we pull in some basic information to enrich the classmark page:

Classmarks page displaying external data

As well as the mapping relations:

New mapping relationships created

Now that a link has been established we can use the metadata found in those two sources in our own discovery system and even apply some of it to the resources in our catalog that use this classmark. While the system is currently only able to build these connections with Wikidata/DBpedia, we can explore other resources that it would make sense to map to and expand the network to library and non-library based linked data sources.

While these classmark pages are here for us humans to interact with, computers can also “read” them automatically, as they are machine readable through content negotiation. To a computer requesting the *H data as N-Triples, the page would appear like this: *H/nt
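Content negotiation boils down to inspecting the request’s Accept header and choosing a serialization. A toy dispatcher, with invented handler names (the media types are the standard RDF ones):

```python
# Map media types to a serialization of the classmark data.
# The media types are standard; the handler names are made up.
FORMATS = {
    "text/html": "html_page",
    "text/turtle": "turtle",
    "application/n-triples": "ntriples",
    "application/rdf+xml": "rdfxml",
}

def negotiate(accept_header, default="html_page"):
    """Pick a serialization from a (simplified) Accept header."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in FORMATS:
            return FORMATS[media_type]
    return default
```

A real server would also honor the q-values in the header; this sketch just takes the first recognized type.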

A Space of Play

If classification is arbitrary, then there are any number of other systems under which the library’s materials could be organized. The only limitation is that applying a new system to millions of physical resources would be cost-prohibitive. But in this virtual space, existing and even newly invented systems can easily be overlaid and applied to our materials.

The first classmarks listed on BILLI are grouped under “LCC Range.” This classmark system is based on the existing Library of Congress Classification. LCC is traditionally used at most research libraries, but because NYPL used the Billings and then fixed order systems, it was never adopted here. Thanks to new linked data services, however, we were able to retroactively reclassify our entire catalog with LCC classmarks.

Using OCLC’s Classify, a service that returns aggregate data about a resource from all institutions in the OCLC consortium, we are able to find a resource’s LCC classmark from other libraries that hold the same title. While we were not able to match 100% of our resources to a Classify result, we were able to apply a significant number of LCC classmarks to our materials.

For the materials for which we were unable to obtain an LCC classmark, we can use some simple statistics to draw concordances between Billings and LCC. For example, if enough resources share the same LCC and Billings classmarks, we can assume the two are equivalent and apply that LCC to all the materials with that Billings classmark. One caveat is that LCC numbers can be very specific in their classification, much more so than Billings. To map Billings to LCC, then, we need to create more generalized, or coarser, LCC classmarks.
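The concordance idea can be sketched as a majority vote over co-occurring classmark pairs; the thresholds below are arbitrary choices, not NYPL’s actual ones:

```python
from collections import Counter, defaultdict

def billings_to_lcc(pairs, min_count=5, min_share=0.8):
    """Map each Billings classmark to its dominant LCC classmark.

    `pairs` is an iterable of (billings, lcc) tuples drawn from records
    that carry both classmarks. A mapping is only emitted when one LCC
    value clearly dominates (the thresholds are illustrative).
    """
    by_billings = defaultdict(Counter)
    for billings, lcc in pairs:
        by_billings[billings][lcc] += 1
    concordance = {}
    for billings, counts in by_billings.items():
        lcc, n = counts.most_common(1)[0]
        if n >= min_count and n / sum(counts.values()) >= min_share:
            concordance[billings] = lcc
    return concordance
```

Classmarks with too few co-occurrences, or with no dominant LCC value, simply get no mapping rather than a dubious one.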

To create these coarser classmarks, we parsed the freely available LCC outline PDFs found online into a new dataset, with each classmark representing a large range of LCC numbers. Using this invented, yet still related, LCC range classmark, we can map our Billings classmarks to their coarse LCC equivalents. Now a researcher who is more familiar with LCC can easily navigate to corresponding Billings resources.
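Once the outline has been parsed into ranges, finding the coarse range for a specific LCC classmark is a comparison over (class letters, number) keys. A sketch, with a couple of illustrative ranges rather than entries from the real outline:

```python
def lcc_key(classmark):
    """Split an LCC classmark like 'BV185' into ('BV', 185.0)."""
    letters = classmark.rstrip("0123456789.")
    digits = classmark[len(letters):]
    return (letters, float(digits) if digits else 0.0)

def coarse_range(classmark, ranges):
    """Find the parsed outline range containing `classmark`.

    `ranges` is a list of (start, end, label) tuples, such as might be
    parsed from the LCC outline PDFs; the entries below are made up.
    """
    key = lcc_key(classmark)
    for start, end, label in ranges:
        if lcc_key(start) <= key <= lcc_key(end):
            return label
    return None

RANGES = [
    ("BV1", "BV530", "Christianity: Worship"),
    ("BX1", "BX9999", "Christian denominations"),
]
```

Each Billings classmark mapped by the concordance can then be rolled up to the label of the range its LCC equivalent falls in.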

The LCC range classmark example represents an exciting opportunity to organize our materials in new ways within a virtual space. In this networked environment, classification shifts from rigid hierarchy to fluid, interconnected mappings: the difference between dividing space and filling it.

LITA: Jobs in Information Technology: January 27, 2016

planet code4lib - Wed, 2016-01-27 19:42

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

SIU Edwardsville, Electronic Resources Librarian, Asst or Assoc Professor, Edwardsville, IL

Olin College of Engineering, Community and Digital Services Librarian, Needham, MA

Great Neck Library, Information Technology Director, Great Neck, NY

Art Institute of Chicago, Senior Application Developer for Collections, Chicago, IL

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

DPLA: Apply to our 4th Class of Community Reps

planet code4lib - Wed, 2016-01-27 16:00

We’re thrilled to announce today our fourth call for applications for the DPLA Community Reps program! The application for this fourth class of Reps will close on Friday, February 19, 2016.

What is the DPLA Community Reps program? In brief, we’re looking for enthusiastic volunteers who are willing to help us bring DPLA to their local communities through outreach activities or support DPLA by working on special projects. A local community could be a school, a library, an academic campus, a professional network in your space, or another group of folks who you think might be interested in DPLA and what it has to offer. Reps give a small commitment of time to community engagement, collaboration with fellow Reps and DPLA staff, and check-ins with DPLA staff. We have three terrific classes of reps from diverse places and professions.

With the fourth class, we are hoping to strengthen and expand our group geographically and professionally. The single most important factor in selection is the applicant’s ability to clearly identify communities they can serve and plan relevant outreach activities or DPLA-related projects for them. We are looking for enthusiastic, motivated people from the US and the world with great ideas above all else!

To answer general inquiries about what type of work reps normally engage in and to provide information about the program in general, we’re offering an open information and Q&A session with key DPLA staff members and current community reps.

Reps Info Session: Tuesday, February 9, 5:00 PM Eastern

If you would like to join this webinar, please register.

For more information about the DPLA Community Reps program, please contact

Apply to the Community Reps program

LITA: Intro to Youth Coding Programs, a LITA webinar

planet code4lib - Wed, 2016-01-27 14:00

Attend this informative and fast paced new LITA webinar:

How Your Public Library Can Inspire the Next Tech Billionaire: an Intro to Youth Coding Programs

Thursday March 3, 2016
noon – 1:00 pm Central Time
Register Online, page arranged by session date
(login required)

Kids, tweens, teens and their parents are increasingly interested in computer programming education, and they are looking to public and school libraries as a host for the informal learning process that is most effective for learning to code. This webinar will share lessons learned through youth coding programs at libraries all over the U.S. We will discuss tools and technologies, strategies for promoting and running the program, and recommendations for additional resources. An excellent webinar for youth and teen services librarians, staff, volunteers and general public with an interest in tween/teen/adult services.


  • Inspire attendees about kids and coding, and convince them that the library is key to the effort.
  • Provide the tools, resources and information necessary for attendees to launch a computer coding program at their library.
  • Cultivate a community of coding program facilitators that can share ideas and experiences in order to improve over time.


Kelly Smith spent hundreds of hours volunteering at the local public library before realizing that kids beyond Mesa, Arizona could benefit from an intro to computer programming. With a fellow volunteer, he founded Prenda – a learning technology company with the vision of millions of kids learning to code at libraries all over the country. By day, he designs products for a California technology company. Kelly has been hooked on computer programming since his days as a graduate student at MIT.

Crystle Martin is a postdoctoral research scholar at the Digital Media and Learning Research Hub at the University of California, Irvine. She explores youth and connected learning in online and library settings and is currently researching implementation of Scratch in underserved community libraries, to explore new pathways to STEM interests for youth. Her 2014 book, titled “Voyage Across a Constellation of Information: Information Literacy in Interest-Driven Learning Communities,” reveals new models for understanding information literacy in the 21st century through a study of information practices among dedicated players of World of Warcraft. Crystle holds a PhD in Curriculum & Instruction specializing in Digital Media, with a minor in Library and Information Studies from the University of Wisconsin–Madison; serves on the Board of Directors for the Young Adult Library Services Association; and holds an MLIS from Wayne State University in Detroit, MI.

Justin Hoenke is a human being who has worked in youth services all over the United States and is currently the Executive Director of the Benson Memorial Library in Titusville, Pennsylvania. Before that, he was Coordinator of Teen Services at the Chattanooga Public Library in Chattanooga, TN, where he transformed The 2nd Floor, a 14,000-square-foot space for ages 0-18, into a destination that brings together learning, fun, the act of creating and making, and library service. Justin is a member of the 2010 American Library Association Emerging Leaders class and was named a Library Journal Mover and Shaker in 2013. His professional interests include public libraries as community centers; working with kids, tweens, and teens; library management; video games; and creative spaces. Follow Justin on Twitter at @justinlibrarian and read his blog at

Register for the Webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.


  • LITA Member: $45
  • Non-Member: $105
  • Group: $196

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

SearchHub: example/files – a Concrete Useful Domain-Specific Example of bin/post and /browse

planet code4lib - Wed, 2016-01-27 03:31
The Series

This is the third in a three-part series demonstrating how it’s possible to build a real application using just a few simple commands. The three parts are:

In the previous /browse article, we walked you through to the point of visualizing your search results from an aesthetically friendlier perspective using the VelocityResponseWriter. Let’s take it one step further.

example/files – your own personal Solr-powered file-search engine

The new example/files offers a Solr-powered search engine tuned specially for rich document files. Within seconds you can download and start Solr, create a collection, post your documents to it, and enjoy the ease of querying your collection. The /browse experience of the example/files configuration has been tailored for indexing and navigating a bunch of “just files”, like Word documents, PDF files, HTML, and many other formats.

Above and beyond the default data driven and generic /browse interface, example/files features the following:

  • Distilled, simple, document type navigation
  • Multi-lingual, localizable interface 
  • Language detection and faceting
  • Phrase/shingle indexing and “tag cloud” faceting
  • E-mail address and URL index-time extraction
  • “instant search” (as you type results)
Getting started with example/files

Start up Solr and create a collection called “files”:

bin/solr start
bin/solr create -c files -d example/files

Using the -d flag when creating a Solr collection specifies the configuration from which the collection will be built, including indexing configuration and scripting and UI templates.

Then index a directory full of files:

bin/post -c files ~/Documents

Depending on how large your “Documents” folder is, this could take some time. Sit back and wait for a message similar to the following:

23731 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/files/update… Time spent: 0:11:32.323

And then open /browse on the files collection:

open http://localhost:8983/solr/files/browse

The UI is the App

With example/files we wanted to make the interface specific to the domain of file search. With that in mind, we implemented a file-domain-specific ability to facet and filter by high-level “types”, such as Presentation, Spreadsheet, and PDF. Taking a UI/UX-first approach, we also wanted “instant search” and a localizable interface.

The rest of this article explains, from the outside-in, the design and implementation from UI and URL aesthetics down to the powerful Solr features that make it possible.

URLs are UI too!

“…if you think about how you design them” – Cool URIs

Besides the HTML/JavaScript/CSS “app” of example/files, care was taken with the aesthetics and cleanliness of the other user interface: the URL. The URLs start with /browse, describing the user’s primary activity in this interface, browsing a collection of documents.

Browsing by document type

Results can be filtered by document “type” using the links at the top.


As you click on each type, you can see the “type” parameter changing in the URL request.

For the aesthetics of the URL, we decided filtering by document type should look like this: /browse?type=pdf (or type=html, type=spreadsheet, etc).  The interface also supports two special types: “all” to select all types and “unknown” to select documents with no document type.

At index time, the type of a document is identified. An update processor chain (files-update-processor) is defined to run a script for each document. A series of regular expressions determines the high-level type of the document, based on the inherent “content_type” (MIME type) field set for each rich document indexed. The current types are doc, html, image, spreadsheet, pdf, and text. If a high-level type is recognized, a doc_type field is set to that value.
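The heart of that script is a first-match table of regular expressions over the MIME type. Here is a Python rendition of the idea; the shipped script is JavaScript, and these patterns are illustrative rather than the actual ones:

```python
import re

# First matching pattern wins; the MIME patterns are illustrative.
DOC_TYPE_RULES = [
    (r"application/pdf", "pdf"),
    (r"text/html", "html"),
    (r"image/", "image"),
    (r"(vnd\.ms-excel|spreadsheet)", "spreadsheet"),
    (r"(msword|wordprocessing|vnd\.oasis\.opendocument\.text)", "doc"),
    (r"text/plain", "text"),
]

def doc_type(content_type):
    """Return the high-level type for a MIME type, or None if unknown."""
    for pattern, high_level in DOC_TYPE_RULES:
        if re.search(pattern, content_type):
            return high_level
    return None
```

Returning None for an unmatched MIME type corresponds to leaving the doc_type field off the document entirely.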

No doc_type field is added if the content_type does not have an appropriate higher-level mapping, an important detail for the filtering technique. The /browse handler definition was enhanced with the following parameters to enable doc_type faceting and filtering, using our own “type=…” URL parameter to filter by any of the types, including “all” or “unknown”:

  • facet.field={!ex=type}doc_type
  • facet.query={!ex=type key=all_types}*:*
  • fq={!switch v=$type tag=type case='*:*' case.all='*:*' case.unknown='-doc_type:[* TO *]' default=$type_fq}

There are some details of how these parameters are set worth mentioning here.  Two parameters, facet.field and facet.query, are specified in params.json utilizing the “paramset” feature of Solr.  And the fq parameter is appended in the /browse definition in solrconfig.xml (because paramsets don’t currently allow appending, only setting, parameters). 

The faceting parameters exclude the “type” filter (defined on the appended fq), so that the counts of the types shown aren’t affected by type filtering (narrowing to “image” types still shows “pdf” type counts rather than 0). There’s a special “all_types” facet query that provides the count for all documents within the set constrained by the query and other filters. And then there’s the tricky fq parameter, leveraging the “switch” query parser to control how type filtering works from the custom “type” parameter. When no type parameter is provided, or type=all, the type filter is set to “all docs” (via *:*), effectively not filtering by type. When type=unknown, the special filter -doc_type:[* TO *] is used (note the dash/minus sign to negate), matching all documents that do not have a doc_type field. And finally, when a “type” parameter other than all or unknown is provided, the filter used is defined by the “type_fq” parameter, which is defined in params.json as type_fq={!field f=doc_type v=$type}. That type_fq parameter specifies a field value query (effectively the same as fq=doc_type:pdf when type=pdf) using the field query parser (which will end up being a basic Lucene TermQuery in this case).
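The behavior of that {!switch} parameter can be restated in a few lines of ordinary code, which may be easier to follow than the parser syntax (the returned strings are the effective filter queries):

```python
def type_filter(type_param):
    """Mimic the {!switch} logic: map the ?type= value to an fq."""
    if type_param is None or type_param == "all":
        return "*:*"                 # match everything: no type filter
    if type_param == "unknown":
        return "-doc_type:[* TO *]"  # documents with no doc_type field
    return f"doc_type:{type_param}"  # the $type_fq field-value query
```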

That’s a lot of Solr mojo just to be able to say type=image from the URL, but it’s all about the URL/user experience so it was worth the effort to implement and hide the complexity.

Localizing the interface

The example/files interface has been localized in multiple languages. Notice the blue globe icon in the top right-hand corner of the /browse UI. Hover over the globe icon and select a language in which to view your collection.

Each text string displayed is defined in standard Java resource bundles (see the files under example/files/browse-resources).  For example, the text (“Find” in English) that appears just before the search input box is specified in each of the language-specific resource files as:

English: find=Find
French: find=Recherche
German: find=Durchsuchen

The VelocityResponseWriter’s $resource tool picks up on a locale setting.  In the browse.vm (example/files/conf/velocity/browse.vm) template, the “find” string is specified generically like this:

$resource.find: <input name="q"…/>

From the outside, we wanted the parameter used to select the locale to be clean and hide any implementation details, like /browse?locale=de_DE.  

The underlying parameter needed to control the VelocityResponseWriter $resource tool’s locale is v.locale, so we use another Solr technique (parameter substitution) to map from the outside locale parameter to the internal v.locale parameter.

This parameter substitution is different from the “local param substitution” used with the “type” parameter settings above, which only applies within the {!… syntax} as a dollar-signed, non-curly-bracketed reference, {!… v=$foo}, where the parameter foo (&foo=…) is substituted in. The dollar-sign-with-curly-brackets syntax, by contrast, can be used as an in-place text substitution, and also allows a default value, like ${param:default}.

To get the URLs to support a locale=de_DE parameter, it is simply substituted as-is into the actual v.locale parameter used to set the locale within the Velocity template context for UI localization. In params.json we’ve specified v.locale=${locale}

Language detection and faceting

It can be handy to filter a set of documents by its language.  Handily, Solr sports two(!) different language detection implementations so we wired one of them up into our update processor chain like this:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">content</str>
    <str name="langid.langField">language</str>
  </lst>
</processor>

With the language field indexed in this manner, the UI simply renders its facets (facet.field=language, in params.json), allowing filtering too.

Phrase/shingle indexing and “tag cloud” faceting

Seeing common phrases can be used to get the gist of a set of documents at a glance.  You’ll notice the top phrases change as a result of the “q” parameter changing (or filtering by document type or language).  The top phrases reflect phrases that appear most frequently in the subset of results returned for a particular query and applied filters. Click on a phrase to display the documents in your results set that contain the phrase. The size of the phrase corresponds to the number of documents containing that phrase.
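Shingling itself is just sliding a window over the token stream. As a conceptual stand-in for what a ShingleFilter produces (this is not the Solr configuration, just the windowing idea):

```python
def shingles(tokens, min_size=2, max_size=3):
    """Produce word n-gram 'shingles' from a token list."""
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out
```

Faceting on a field full of such shingles is what yields the phrase counts behind the tag cloud.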

Phrase extraction of the “content” field text occurs by copying it to a text_shingles field, whose analyzer creates phrases using a ShingleFilter.  This feature is still a work in progress and needs improvement in extracting higher-quality phrases; the current rough implementation isn’t worth a code snippet here that would invite copy/paste emulation, but here’s a pointer to the current configuration – 

E-mail address and URL index-time extraction

One, currently unexposed, feature added for fun is the index-time extraction of e-mail addresses and URLs from document content.  With phrase extraction as described above, the use is to allow for faceting and filtering, but when looking at an individual document we didn’t need the phrases stored and available. In other words, text_shingles did not need to be a stored field, and thus we could leverage the copyField/fieldType technique.  But for extracted e-mail addresses and URLs, it’s useful to have these as stored (multi-valued), not just indexed terms… which means our indexing pipeline needs to provide these independently stored values.  The copyField/fieldType-extraction technique won’t suffice here.  However, we can use a field type definition to help, and take advantage of its facilities within an update script.  Update processors, like the script one used here, allow for full manipulation of an incoming document, including adding additional fields, and thus their value can be “stored”.  Here are the configuration pieces that extract e-mail addresses and URLs from text:
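A hedged sketch of such a field type follows; the factory class names are real Solr analysis components and the field type name and types file come from the description below, but the exact attributes are assumptions rather than the verbatim example/files configuration:

```xml
<!-- Field type for index-time extraction of e-mail addresses and URLs -->
<fieldType name="text_email_url" class="solr.TextField">
  <analyzer>
    <!-- UAX#29 segmentation, keeping e-mail addresses and URLs together as single tokens -->
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <!-- Pass through only tokens whose type is listed in email_url_types.txt,
         which contains two lines: <URL> and <EMAIL> -->
    <filter class="solr.TypeTokenFilterFactory" types="email_url_types.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>
```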

The Solr admin UI analysis tool is useful for seeing how this field type works. The first step, through the UAX29URLEmailTokenizer, tokenizes the text in accordance with the Unicode UAX29 segmentation specification, with the special addition of recognizing and keeping together e-mail addresses and URLs. During analysis, the tokens produced also carry along a “type”. The following screenshot depicts the Solr admin analysis tool results of analyzing an “” string with the text_email_url field type. The tokenizer tags e-mail addresses with a type of, literally, “<EMAIL>” (angle brackets included), and URLs as “<URL>”. The tokenizer emits other token types as well, but for this purpose we want to keep only e-mail addresses and URLs. Enter TypeTokenFilter, allowing only a strictly specified set of token type values to pass through. In the screenshot you’ll notice the text “at” was identified as type “<ALPHANUM>”, and did not pass through the type filter. An external text file (email_url_types.txt) contains the types to pass through, and simply contains two lines with the values “<URL>” and “<EMAIL>”.

So now we have a field type that can do the recognition and extraction of e-mail addresses and URLs. Let’s now use it from within the update chain, conveniently possible in update-script.js. With some scary-looking JavaScript/Java/Lucene API voodoo, the script runs the analysis itself.  That code is essentially how indexed fields get their terms; we’re just having to do it ourselves to make the values *stored*.
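A sketch of what that voodoo can look like in update-script.js; this is not the verbatim example/files code, and the stored field names email_ss and url_ss are illustrative:

```javascript
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var content = doc.getFieldValue("content");
  if (content == null) return;

  // Borrow the analyzer from the text_email_url field type and run it by hand,
  // so the surviving tokens can be added as *stored* field values.
  var analyzer = cmd.req.getCore().getLatestSchema()
      .getFieldTypeByName("text_email_url").getIndexAnalyzer();
  var stream = analyzer.tokenStream("content", new java.io.StringReader(content));
  var termAtt = stream.addAttribute(
      Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute.class);
  var typeAtt = stream.addAttribute(
      Packages.org.apache.lucene.analysis.tokenattributes.TypeAttribute.class);

  stream.reset();
  while (stream.incrementToken()) {
    // Only <EMAIL> and <URL> tokens survive the TypeTokenFilter.
    if (typeAtt.type() == "<EMAIL>") doc.addField("email_ss", termAtt.toString());
    else if (typeAtt.type() == "<URL>") doc.addField("url_ss", termAtt.toString());
  }
  stream.end();
  stream.close();
}
```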

This technique was originally described in the “Analysis in ScriptUpdateProcessor” section of this presentation:

example/files demonstration video

Thanks go to Esther Quansah who developed much of the example/files configuration and produced the demonstration video during her internship at Lucidworks.

What’s next for example/files?

An umbrella Solr JIRA issue has been created to note these desirable fixes and improvements – – including the following items:

  • Fix e-mail and URL field names (<email>_ss and <url>_ss, with angle brackets in field names), also add display of these fields in /browse results rendering
  • Harden update-script: it currently errors if documents do not have a “content” field
  • Improve quality of extracted phrases
  • Extract, facet, and display acronyms
  • Add sorting controls, possibly all or some of these: last modified date, created date, relevancy, and title
  • Add grouping by doc_type perhaps
  • Fix debug mode: it currently does not update the parsed query debug output (this is probably a bug in data-driven /browse as well)
  • Filter out bogus extracted e-mail addresses

The first two items were fixed and patch submitted during the writing of this post.


Using example/files is a great way of exploring the built-in capabilities of Solr specific to rich text files. 

A lot of Solr configuration and parameter trickery makes /browse?locale=de_DE&type=html a much cleaner way to do this: /select?v.locale=de_DE&fq={!field%20f=doc_type%20v=html}&wt=velocity&v.template=browse&v.layout=layout&q=*:*&facet.query={!ex=type%20key=all_types}*:*&facet=on… (and more default params) 

Mission to “build a real application using just a few simple commands” accomplished!   It’s so succinct and clean that you can even tweet it:

$ bin/solr start; bin/solr create -c files -d example/files; bin/post -c files ~/Documents #solr





The post example/files – a Concrete Useful Domain-Specific Example of bin/post and /browse appeared first on

Karen G. Schneider: Holding infinity

planet code4lib - Wed, 2016-01-27 03:21
To see a World in a Grain of Sand
And a Heaven in a Wild Flower
Hold Infinity in the palm of your hand
And Eternity in an hour

This weekend Sandy and I had a scare which you have heard about if you follow me on Facebook. I won’t repeat all of it here, but our furnace was leaking carbon monoxide, the alarms went off, firefighters came, then left, then came back later to greet us as we sat on our stoop in our robes and pajamas, agreeing the second time that it wasn’t bad monitor batteries as they walked slowly through our home, waving their magic CO meter; they stayed a very long time and aired out rooms and closets and… well. I could see that big crow walking over our graves, its eyes shining, before its wingspan unfurled and it rose into the night, disgruntled to have lost us back to the living. After a chilly (but not unbearable) weekend in an unheated house, our landlord, who is a doll, immediately and graciously replaced the 26-year-old furnace with a spiffy new model that is quiet and efficient and not likely to kill us anytime soon. Meanwhile, we both had colds (every major crisis in my life seems to be accompanied by head colds), and I was trying valiantly to edit my dissertation proposal for issues major and minor that my committee had shared with me. Actually, at first it was a struggle, but then it became a refuge. Had I known I would be grappling with the CO issue later on Saturday, I would not have found so many errands to run that morning, my favorite method of procrastination. But by the next morning, editing my proposal seemed like a really, really great thing to be doing, me with my fully-alive body. I had a huge batch of posole cooking on the range, and the cat snored and Sandy sneezed and when I got tired of working on the dissertation I gave myself a break to work on tenure and promotion letters or to contemplate statewide resource-sharing scenarios (because I am such a fun gal). 
I really liked my Public Editor idea for American Libraries and would like to see something happen in that vein, but after ALA I see that it is an idea that needs more than me as its champion, at least through this calendar year. There’s mild to moderate interest, but not enough to warrant dropping anything I’m currently involved in to make it happen. It’s not forgotten, it’s just on a list of things I would like to make happen. That said, this ALA in Boston–ok, stand back, my 46th, if you count every annual and midwinter–was marvelous for its personal connections. Oh yes, I learned more about scholarly communications and open access and other Things. But the best ideas I garnered came from talking with colleagues, and the best moments did too. Plus two delightful librarians introduced me to Uber and the Flour Bakery in the same madcap hour. I was a little disappointed they weren’t more embarrassed when I told the driver it was my first Uber ride. I am still remembering that roast lamb sandwich. And late-night conversations with George. And early-evening cocktails with Grace. And a proper pub pint with Lisa. And the usual gang for our usual dinner. And a fabulous GLBTRT social. And breakfast with Brett. And how wonderful it was to stay in a hotel where so many people I know were there. And the hotel clerk who said YOU ARE HALF A BLOCK FROM THE BEST WALGREENS IN THE WORLD and he was right. It’s hard to explain… unless you remember the truly grand Woolworth stores of yesteryear, such as the store at Powell and Market that had a massive candy counter, a fabric and notions section, every possible inexpensive wristwatch one could want for, and a million other fascinating geegaws. Sometimes these days I get anxious that I need to get such-and-such done in the window of calm. It’s true, it’s better to be an ant than a grasshopper. 
I would not have spent Saturday morning tootling from store to store in search of cilantro and pork shoulder had I known I would have spent Saturday afternoon and evening looking up “four beeps on a CO monitor” and frantically stuffing two days’ worth of clothes into a library tote bag (please don’t ask why I didn’t use the suitcase sitting right there) as we prepared to evacuate our home. But I truly don’t have that much control over my life. I want it, but I don’t have it. Yes, it’s good to plan ahead. We did our estate planning (hello, crow!) and made notebooks to share with one another (hi crow, again!) and try to be mindful that things happen on a dime. But if I truly believed life was that uncertain, I couldn’t function. On some level I have to trust that the sounds I hear tonight–Sandy whisking eggs for an omelette, cars passing by our house on a wet road, the cat padding from room to room, our dear ginger watchman–will be the sounds I hear tomorrow and tomorrow. Even if I know–if nothing else, from the wide shadow of wings passing over me–that will not always be the case. Onward into another spring semester. There aren’t many students in the library just yet. They aren’t frantically stuffing any tote bags, not for their lives, not for their graduations, not for even this semester. They’ll get there. It will be good practice. Bookmark to:

Mashcat: Upcoming webinars in early 2016

planet code4lib - Tue, 2016-01-26 17:19

We’re pleased to announce that several free webinars are scheduled for the first three months of 2016. Mark your calendars!

  • 26 January 2016 (14:00-17:00 UTC / 09:00-12:00 EST), Owen Stephens: Installing OpenRefine. This webinar will be an opportunity for folks to see how OpenRefine can be installed and to get help doing so, and serves as preparation for the webinar in March.  There will also be folks at hand in the Mashcat Slack channel to assist. Recording / Slides (pptx)

  • 19 February 2016 (18:00-19:00 UTC / 13:00-14:00 EST), Terry Reese: Evolving MarcEdit: Leveraging Semantic Data in MarcEdit. Library metadata is currently evolving — and whether you believe this evolution will lead to a fundamental change in how Libraries manage their data (as envisioned via BibFrame) or more of an incremental change (like RDA), one thing that is clear is the merging of traditional library data and semantic data.  Over the next hour, I’d like to talk about how this process is impacting how MarcEdit is being developed, and look at some of the ways that Libraries can not just begin to embed semantic data into their bibliographic records right now, but also begin to build new services around semantic data sources to improve local workflows and processes.

  • 14 March 2016 (16:00-17:30 UTC / 11:00-12:30 EST), Owen Stephens: (Meta)data tools: Working with OpenRefine. OpenRefine is a powerful tool for analyzing, fixing, improving and enhancing data. In this session the basic functionality of OpenRefine will be introduced, demonstrating how it can be used to explore and fix data, with particular reference to the use of OpenRefine in the context of library data and metadata.

The registration link for each webinar will be communicated in advance. Many thanks to Alison Hitchens and the University of Waterloo for offering up their Adobe Connect instance to host the webinars.

David Rosenthal: Emulating Digital Art Works

planet code4lib - Tue, 2016-01-26 16:00
Back in November a team at Cornell led by Oya Rieger and Tim Murray produced a white paper for the National Endowment for the Humanities entitled Preserving and Emulating Digital Art Objects. It was the result of two years of research into how continuing access could be provided to the optical disk holdings of the Rose Goldsen Archive of New Media Art at Cornell. Below the fold, some comments on the white paper.

Early in the project their advisory board strongly encouraged them to focus on emulation as a strategy, advice that they followed. Their work thus parallels to a considerable extent the German National Library's (DNB's) use of Freiburg's Emulation As A Service (EAAS) to provide access to their collection of CD-ROMs. The Cornell team's contribution includes surveys of artists, curators and researchers to identify their concerns about emulation because, as they write:
emulation is not always an ideal access strategy: emulation platforms can introduce rendering problems of their own, and emulation usually means that users will experience technologically out-of-date artworks with up-to-date hardware. This made it all the more important for the team to survey media art researchers, curators, and artists, in order to gain a better sense of the relative importance of the artworks' most important characteristics for different kinds of media archives patrons. The major concern they reported was experiential fidelity:
Emulation was controversial for many, in large part for its propensity to mask the material historical contexts (for example, the hardware environments) in which and for which digital artworks had been created. This part of the artwork's history was seen as an element of its authenticity, which the archiving institution must preserve to the best of its ability, or lose credibility in the eyes of patrons. We determined that cultural authenticity, as distinct from forensic or archival authenticity, derived from a number of factors in the eyes of the museum or archive visitor. Among our survey respondents, a few key factors stood out: acknowledgement of the work's own historical contexts, preservation of the work's most significant properties, and fidelity to the artist's intentions, which is perhaps better understood as respect for the artist's authority to define the work's most significant properties. As my report pointed out (Section 2.4.3), hardware evolution can significantly impair the experiential fidelity of legacy artefacts, and (Section 3.2.2) the current migration from PCs to smartphones as the access device of choice will make the problem much worse. Except in carefully controlled "reading room" conditions the Cornell team significantly underestimate the problem:
Accessing historical software with current hardware can subtly alter aspects of the work's rendering. For example, a mouse with a scroll wheel may permit forms of user interactivity that were not technologically possible when a software-based artwork was created. Changes in display monitor hardware (for example, the industry shift from CRT to LED display) brings about color shifts that are difficult to calibrate or compensate for. The extreme disparity between the speed of current and historical processors can lead to problems with rendering speed, a problem that is unfortunately not trivial to solve. They overestimate a different part of the problem when they write:
emulators, too, are condemned to eventual obsolescence; as new operating systems emerge, the distance between "current" and "historical" operating systems must be recalculated, and new emulators created to bridge this distance anew. We attempted to establish archival practices that would mitigate these instabilities. For example, we collected preservation metadata specific to emulators that included documentation of versions used, rights information about firmware, date and source of download, and all steps taken in compiling them, including information about the compiling environment. We were also careful to keep metadata for artworks emulator-agnostic, in order to avoid future anachronism in our records. If the current environment they use to access a digital artwork is preserved, including the operating system and the emulator the digital artwork currently needs, future systems will be able to emulate the current environment. Their description of the risk of emulator obsolescence assumes we are restricted to a single layer of emulation. We aren't. Multi-layer emulations have a long history, for example in the IBM world, and in the Internet Archive's software collection.

Ilya Kreymer's shows that another concern the Cornell team raise is also overblown:
The objective of a 2013 study by the New York Art Resources Consortium (NYARC) was to identify the organizational, economic, and technological challenges posed by the rapidly increasing number of web-based resources that document art history and the art market. One of the conclusions of the study was that regardless of the progress made, "it often feels that the more we learn about the ever-evolving nature of web publishing, the larger the questions and obstacles loom." Although there are relevant standards and technologies, web archiving solutions remain costly, and harvesting technologies as yet lack the maturity to completely capture the more complex cases. The study concluded that there need to be organized efforts to collect and provide access to art resources published on the web. The ability to view old web sites with contemporary browsers provided by should allay these fears.

Ultimately, as do others in the field, the Cornell team takes a pragmatic view of the potential for experiential fidelity, refusing to make the best be the enemy of the good.
The trick is finding ways to capture the experience - or a modest proxy of it - so that future generations will get a glimpse of how early digital artworks were created, experienced, and interpreted. So much of new media works' cultural meaning derives from users' spontaneous and contextual interactions with the art objects. Espenschied, et al. point out that digital artworks relay digital culture and "history is comprehended as the understanding of how and in which contexts a certain artifact was created and manipulated and how it affected its users and surrounding objects."

Terry Reese: MarcEdit update Posted

planet code4lib - Tue, 2016-01-26 06:04

I’ve posted an update for all versions – changes noted here:

The significant change was a shift in how the linked data processing works.  I’ve shifted from hard code to a rules file.  You can read about that here:

If you need to download the file, you can get it from the automated update tool or from:


Library Tech Talk (U of Michigan): The Next Mirlyn: More Than Just a Fresh Coat of Paint

planet code4lib - Tue, 2016-01-26 00:00

The next version of Mirlyn ( is going to take some time to create, but let's take a peek under the hood and see how the next generation of search will work.

Library Tech Talk (U of Michigan): Designing for the Library Website

planet code4lib - Tue, 2016-01-26 00:00

This post is a brief overview of the process in designing for large web-based systems. This includes understanding what makes up an interface and how to start fresh to create a good foundation that won't be regrettable later.

DuraSpace News: CALL for Proposals: OR2016 Fedora Interest Group

planet code4lib - Tue, 2016-01-26 00:00

From the OR2016 Fedora Interest Group program committee

Join us for the Fedora Interest Group sessions at Open Repositories 2016 [1] in Dublin to meet other Fedora users and developers and share your experiences with one another. If you are new to Fedora you will find many opportunities to learn more about the repository software from your peers in the open source community.

This year’s Fedora Interest Group track will showcase presentations, panels, demonstrations, and project updates. Some of the central themes include:

DuraSpace News: CALL for Proposals: OR2016 DSpace Interest Group

planet code4lib - Tue, 2016-01-26 00:00

From the OR2016 DSpace Interest Group program committee

London, UK  The DSpace community will meet again at the 11th Open Repositories Conference, 13–16 June, in Dublin, Ireland.  The DSpace Interest Group program committee invites your contributions to share, describe and report on use of the DSpace platform, outlining novel experiences or developments in the construction and use of DSpace repositories. Whether you’re a developer, researcher, repository manager, administrator or practitioner, we would like to hear from you.  

LITA: There’s a Reason There’s a Specialized Degree

planet code4lib - Mon, 2016-01-25 20:33

I think it can be easy to look around a library — especially a smooth-running one — and forget that the work that gets done there ranges from the merely difficult to the incredibly complex. This isn’t the sort of stuff just anyone can do, no matter how well-meaning and interested they might be, which is why there are specialized degree programs designed to turn out inventive and effective experts.

I’m talking, of course, about the accountants. And computer programmers. And instructional designers. And usability experts.

And, oh, yeah, the librarians.

A double standard?

There’s a temptation among librarians (and programmers too, of course, and an awful lot of professors) to think that the world consists of two types of work:

  1. Stuff only we can do, and
  2. Everything else

If I were to head off to a library school for a semester and take a single course on cataloging, my colleagues would be understandably worried about dropping me next to the ILS with a stack of new books. A single group project looking broadly at research methodologies doesn’t qualify me for … well, for anything, inside the library or not.

But I often see librarians with only half a semester of programming, or a survey course on usability testing (never mind actual UX), or experience in a group project where they got stuck with the title Project Manager take on (or, often, be thrust into) actual professional roles to do those things.

The unspoken, de facto standard seems to be, “We can teach a librarian to do anything, but we can’t or won’t teach anyone else to do Real Librarian work.”

Subject-matter expertise is not overall expertise

I’m lucky enough to work in a ginormous academic library, where we’re not afraid to hire specialists when warranted. And yet, even here, there persists the curious belief that librarians can and often should do just about everything.

This leads me to what I believe is a Truth That Must Be Spoken:

A committee of four interested and well-meaning librarians is not equivalent to a trained expert with actual education and experience.

There’s a reason most disciplines separate out the “subject-matter expert” (SME) from the other work. Instructional Designers are trained to do analysis, study users and measure outcomes, and work with a SME to incorporate their knowledge into a useful instructional product. The world at large differentiates between web design, content management, and quality assurance. And the first time you work with a real project manager, you’ll come to the stark realization that you’ve never before worked with a real project manager, because the experience is transformative.

Knowing the content and culture makes you a necessary part of a complete intervention. It doesn’t make you the only necessary part.

A question of value

“But Bill,” you’re saying after doing a quick check to see what my name is, “we don’t have the money to hire experts in everything, and besides, we’re dedicated to growing those sorts of expertise within the library profession.”

I’m not against that — who could be against that? But I do worry that it exemplifies an attitude that the value the library really offers is essentially embodied in the sorts of things librarians have been doing for a century or more — things that only librarians can do — and everything else that happens in a library adds notable but ultimately marginal value to the patrons.

That’s not true. The website, the instructional and outreach activities, increasingly complicated management, and (the big one these days) contract negotiation with vendors are all hugely important to the library, and arguably have a much bigger impact on the patrons as a group than, say, face-to-face reference work, or original cataloging. I know our digital environment is used orders of magnitude more than our physical plant, up to and including the actual librarians. Not all users are (or should be) valued equally, but when the zeros start stacking up like that, you should at least take a hard look at where your resources are being spent compared to where your patrons are deriving most of the value.

It’s great if you can get a librarian with the skills needed to excel at these “other” things. But when you put a near-novice in charge of something, you’re implicitly saying two things:

  1. This isn’t all that important to do well or quickly, which you can tell because we put you, a novice, in charge of it, and
  2. The work you were doing before isn’t that important, because we’re willing to pay you to try to learn all this stuff on-the-job instead of whatever you were doing before.

If there’s an eyes-wide-open assessment of the needs of the institution and they decide in favor of internal training, then that’s great. What I’m railing against is starting a project/program/whatever with the implicit attitude that the “library part” is specialized and hard, and that we don’t really care if everything else is done well, agilely, and quickly, because it’s essentially window dressing.

What to do?

Unfortunately, librarianship is, as a discipline, constantly under attack by people looking for a simple way to cut costs. I worry this has the unfortunate side effect of causing librarians as a culture to close ranks. One way this manifests itself is by many institutions requiring an MLS for just about any job in the library. I don’t think that’s in anyone’s interest.

Are you better off hiring another librarian, or a programmer? Should you move someone off their duties to do system administration (almost certainly badly), or should you cut something else and outsource it? Do you have any idea at all if your instructional interventions have lasting impact? If not, maybe it’s time to hire someone to help you find out.

The days when the quality of a library’s services depended almost exclusively on the librarians and the collection are behind us. It takes a complex, heterogenous set of knowledge and expertise to provide the best service you can for as many patrons as you can. And maybe, just maybe, the best way to gather those skills is to hire some non-librarians and take advantage of what they know.

Librarians deserve to be valued for their expertise, education, and experience. So does everyone else.

LITA: Which Test for Which Data, a new LITA web course

planet code4lib - Mon, 2016-01-25 16:25

Here’s the first web course in the LITA spring 2016 offerings:
Which Test for Which Data: Statistics at the Reference Desk

Instructor: Rachel Williams, PhD student in the School of Library and Information Studies at UW-Madison

Offered: February 29 – March 31, 2016
A Moodle based web course with asynchronous weekly content lessons, tutorials, assignments, and group discussion.

Register Online, page arranged by session date (login required)

This web course is designed to help librarians faced with statistical questions at the reference desk. Whether assisting a student reading through papers or guiding them when they brightly ask “Can I run a t-test on this?”, librarians will feel more confident facing statistical questions. This course will be ideal for library professionals who are looking to expand their knowledge of statistical methods in order to provide assistance to students who may use basic statistics in their courses or research. Students taking the course should have a general understanding of mean, median, and mode.


  • Develop knowledge related to statistical concepts, including basic information on what the goals of statistical tests are and which kinds of data scales are associated with each, with a focus on t-tests, correlations, and chi-square tests.
  • Explore different kinds of statistical tests and increase ability to discern between the utility of different types of statistical tests and why one may be more appropriate than another.
  • Increase literacy in evaluating and describing statistical research that uses t-tests, correlations, and chi-square tests.
  • Improve confidence in answering questions about statistical tests in a reference setting, including explaining tests and results, assisting users in determining which statistical tests are appropriate for a dataset, and helping others analyze graphical representations of statistics.

Here’s the Course Page

Rachel Williams is a PhD student in the School of Library and Information Studies at UW-Madison. Rachel has several years of experience in public and academic libraries and is passionate about research design and methods. She has also taught courses at SLIS on database design, metadata, and social media in information agencies. Rachel’s research explores the constraints and collaborations public libraries operate within to facilitate access to health information and services for the homeless.


February 29 – March 31, 2016


  • LITA Member: $135
  • ALA Member: $195
  • Non-member: $260

Technical Requirements:

Moodle login info will be sent to registrants the week prior to the start date. The Moodle-developed course site will include weekly new content lessons and is composed of self-paced modules with facilitated interaction led by the instructor. Students regularly use the forum and chat room functions to facilitate their class participation. The course web site will be open for 1 week prior to the start date for students to have access to Moodle instructions and set their browser correctly. The course site will remain open for 90 days after the end date for students to refer back to course material.

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

Library of Congress: The Signal: Digital Preservation Planning: An NDSR Boston Project Update

planet code4lib - Mon, 2016-01-25 16:08

The following is a guest post by Jeffrey Erickson, National Digital Stewardship Resident at the University Archives and Special Collections at UMass Boston. He participates in the NDSR-Boston cohort.

Jeffrey Erickson, 2015 NDSR Resident at UMass Boston.

I am a recent graduate of Simmons College’s School of Library and Information Science as well as a current participant in this year’s Boston cohort of the National Digital Stewardship Residency (NDSR) program. At Simmons, I focused my studies on archiving and cultural heritage informatics. In the NDSR program, I am excited to continue learning as I work on a digital preservation planning project at the University Archives and Special Collections (UASC) at UMass Boston.


My project involves developing a digital preservation plan for the digital objects of the UASC project called the “Mass. Memories Road Show (MMRS)”. Because the UASC operates with limited IT support, hosted technical systems and services are used wherever possible. Therefore, this project is testing the use of ArchivesDirect, a hosted digital preservation solution that combines the Archivematica digital preservation workflow tool and the DuraCloud storage service.

The project is divided into three phases:

  1. Research and Practice
  • Research digital preservation concepts, good practices, tools and services
  • Assess the digital stewardship landscape at UMass Boston
  • Examine digitization practices and the digital asset management systems employed by the University Archives and Special Collections
  2. Review and Testing

During the Review and Testing phase I will be collaborating with Archives staff, Library leadership and University stakeholders to:

  • Further develop/refine workflows which prepare UASC for continuing digitization projects
  • Develop policies and procedures to implement long-term preservation of holdings using cloud-based storage services
  • Review and test new policies and procedures
  3. Implementation and Final Reporting
  • Apply the new digital preservation policies and procedures to the MMRS digital objects
  • Perform digital preservation tasks
    • Assign technical and preservation metadata
    • Generate and verify fixity information to ensure data integrity
    • Create Archival Information Packages (AIPs), upload them to DuraCloud service
    • Other tasks as necessary
  • Prepare a final report documenting the project, the procedures and my recommendations


The Mass. Memories Road Show is an ongoing community-based digital humanities project operated by UMass Boston’s Archives and Special Collections. The project captures stories and photographs about Massachusetts communities, contributed by their residents. The goal of the project is to build communities and create a collection of images and videos, gathered at public events (“Road shows”), for educational purposes. Members of the community participate by contributing their personal photographs, which are digitized by UMass Boston Archives staff and volunteers. Additionally, participants may be photographed or video recorded telling their stories. The collected images and videos are then processed and uploaded to the UASC’s CONTENTdm repository. The digital objects in this collection require digital preservation because they are original materials that cannot be replaced if they were to become lost or damaged. To date the collection consists of more than 7,500 photographs and 600 video interviews, with several hundred added each year.


The initial Research Phase of the project has been completed. I have gathered a lot of information about digital preservation concepts and good practices. I have evaluated many digital preservation tools, including Archivematica and DuraCloud. I have familiarized myself with the Mass. Memories Road Show; documenting the digitization workflows and studying the digital asset management system in use at UMass Boston. A Road Show event was held in October on Martha’s Vineyard so I was able to witness first-hand how the digital objects are captured and digitized.

The primary deliverables for the Research Phase of the project were a Digital Content Review (DCR) and a GAP Analysis. The DCR defines the scope of the digital objects in the collection and assesses the collection’s future growth and digital preservation needs. The GAP Analysis considers the current digitization practices and the digital asset management system and compares them to the digital preservation requirements outlined in the OAIS Reference Model. The GAP Analysis and the parameters of the project dictate that digital preservation efforts be concentrated on preparing digital objects for ingest and implementing archival storage.


Moving forward, I will be working with Archives staff and ArchivesDirect consultants to address the following issues.

  1. Identifying tools and developing procedures for generating and reviewing fixity information to ensure the authenticity and data integrity of the collection.
  2. Refining the workflows to determine how best to integrate the new systems, Archivematica and DuraCloud, with the current digital asset management system centered on the CONTENTdm repository.
  3. Developing local storage options to ensure that multiple copies of the collection are preserved in archival storage.
  4. Determining procedures and solutions for incorporating descriptive, technical and preservation metadata requirements into the digital preservation workflow.
  5. Creating an exit strategy to ensure digital preservation can continue in the event that any of the hosted services become unavailable for any reason.

I am looking forward to moving from the research phase to the practice phases of the project and applying the information I have gathered to the tasks involved in the preservation process. I anticipate that the most challenging elements standing between me and a successful completion of the project will be figuring out how Archivematica, CONTENTdm and DuraCloud can best be configured to work together and how to manage the metadata requirements. Integrating the three systems to work together will require me to gain an in-depth understanding of how they work and how they are configured. As a “systems” guy, I look forward to taking a look under the hood. Similarly, I am looking forward to gaining a stronger understanding of how to manage and work with technical and preservation metadata.

I feel like I have learned so much through the first half of my NDSR project. But I realize that there is still a lot to do and even more to learn. Thank you for your interest in my project.

Islandora: Taking Islandora to Twitter: YUDLbot & friends

planet code4lib - Mon, 2016-01-25 14:05

We're taking this week's blog spot to highlight a nifty little tool for Islandora that anyone can adopt: YUDLbot. Short for York University Digital Library bot, the YUDLbot was written by Nick Ruest to take objects from York's Islandora repository and tweet them hourly, using the object description as the body of the tweet and linking back to the object in the repo. Randomly trawling through such an extensive repository turns up some pretty fun things, such as:

Image of a cat with a cast on its leg, laying in a basket.

— YUDLbot (@YUDLbot) January 18, 2016

Or my personal favourite:

Blurry image of closeup of surface.

— YUDLbot (@YUDLbot) January 14, 2016

The parameters the bot uses to select objects can also be tweaked further, spawning new bots like YUDLcat and YUDLdog. The code behind all of this is available on GitHub, along with some quick instructions on how to customize it for your own repo. The University of Toronto Scarborough has its own YUDLbot-based Twitterbot and there is an unofficial bot tweeting from the collection of the Toronto Public Library. Why not take the code for a spin and start tweeting your repo?
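The actual YUDLbot source is on GitHub; as a hypothetical sketch of the core step it performs — composing a tweet from an object's description plus a link back to the repository — the trickiest part is staying under the character limit while leaving room for the URL (Twitter wraps all links to 23 characters via t.co). The function name and example URL below are illustrative, and actually posting would go through a Twitter client library rather than this code:

```python
def compose_tweet(description, object_url, limit=280):
    """Build a tweet body: object description + a link back to the repo object.

    Twitter counts any URL as 23 characters (t.co wrapping), so reserve
    23 characters for the link plus 1 for the separating space.
    """
    link_len = 23
    room = limit - link_len - 1
    if len(description) > room:
        # Truncate and mark the cut with an ellipsis.
        description = description[: room - 1].rstrip() + "…"
    return f"{description} {object_url}"
```

Fetching the description itself would typically come from the repository's Solr index or an OAI-PMH feed, then a scheduler (e.g. cron) fires this once an hour.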

LibUX: 032 – What can libraries learn from the FANG playbook?

planet code4lib - Mon, 2016-01-25 13:06

Jim Cramer, who coined the “FANG” acronym as a descriptor for the high-flying Facebook, Amazon, Netflix, and Google group of tech stocks that have dramatically outperformed the market …. In fact, though, Cramer was more right than he apparently knows: the performance of the FANG group is entirely justified because of the underlying companies, or, to be more precise, because the underlying companies are following the exact same playbook. Ben Thompson

We read The FANG Playbook on Stratechery by Ben Thompson, who explains how controlling users’ entry points into a market category enables these companies to exert control over that user experience and subsequently control — like a dam in a river — what happens in that market.

There is a clear pattern for all four companies: each controls, to varying degrees, the entry point for customers to the category in which they compete. This control of the customer entry point, by extension, gives each company power over the companies actually supplying what each company “sells”, whether that be content, goods, video, or life insurance. Ben Thompson

Libraries aren’t so dissimilar. I wrote in The Library Interface how looking at the library as an intermediary touchpoint between what patrons want and that content / product / service shows the value of being and designing to an access point:

These are the core features of the library interface. Libraries absorb the community-wide cost to access information curated by knowledge-experts that help sift through the Googleable cruft. They provide access to a repository of physical items users want and don’t want to buy (books, tools, looms, 3d printers, machines). A library is, too, where community is accessed. In the provision of this access anywhere on the open web and through human proxies, the library creates delight. Michael Schofield

If you like, you can download the MP3.

As usual, you support us by helping us get the word out: share a link and take a moment to leave a nice review. Thanks!

You can subscribe to LibUX on Stitcher, iTunes, or plug our feed right into your podcatcher of choice. Help us out and say something nice. You can find every podcast on

The post 032 – What can libraries learn from the FANG playbook? appeared first on LibUX.

DuraSpace News: Digital Preservation Planning: An NDSR Boston Project Update featuring DuraCloud and ArchivesDirect

planet code4lib - Mon, 2016-01-25 00:00

From The Signal: Digital Preservation Blog from Library of Congress


Subscribe to code4lib aggregator