
Feed aggregator

Alf Eaton, Alf: Quantifying journals

planet code4lib - Thu, 2015-09-10 14:16
Impact factor

  • Average (mean) number of citations received in the last 12 months by articles published in the previous 2 years

impact = c(year 0) / (n(year –1) + n(year –2))

e.g. 6 = 600 / 100

  • Publish more articles = lower impact factor => selectivity
  • Doesn’t matter which articles get the citations (could be just one)
  • If one article out of 100 published has 600 citations, and the rest have none, impact factor = 6

Pros: publishing more low-cited articles reduces the impact factor

Cons: Distribution can be highly skewed, affected by a few highly cited papers


h-index = largest n such that n articles each have at least n citations

  • Can publish 10000 articles with no citations, h will stay the same
  • If 60 articles out of 60000 published have 60 citations, and the rest have none, h-index = 60

Pros: less affected by a few highly cited papers (n represents how many of those there are)

Cons: can publish many low cited articles without reducing the score
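
A minimal Python sketch of the first two formulas above, reusing the 600-citations example from the impact factor notes:

def impact_factor(citations_this_year, articles_prev_year_1, articles_prev_year_2):
    """Mean number of citations this year to articles from the previous two years."""
    return citations_this_year / (articles_prev_year_1 + articles_prev_year_2)

def h_index(citation_counts):
    """Largest n such that n articles each have at least n citations."""
    ranked = sorted(citation_counts, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)

# One article with 600 citations out of 100 published still gives an impact factor of 6...
print(impact_factor(600, 50, 50))        # 6.0
# ...but an h-index of only 1, since only one article has at least one citation.
print(h_index([600] + [0] * 99))         # 1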


eigenfactor = n citations, weighted by citing journal rank

  • Can publish 10000 articles with no citations, e will stay the same
  • Doesn’t matter which articles get the citations (could be just one)
  • If 1 article out of 60000 published has 60 citations from journals with highly-cited articles, and the rest have none, e = 60

LITA: Jobs in Information Technology: September 9, 2015

planet code4lib - Thu, 2015-09-10 01:36

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Emerging Technologies Librarian, Marquette University Libraries, Milwaukee, WI

Head of the Physics Library, University of Arkansas, Fayetteville, AR

Director of Digital Strategies, Multnomah County Library, Portland, OR

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.


SearchHub: Min/Max On Multi-Valued Field For Functions & Sorting

planet code4lib - Thu, 2015-09-10 00:07

One of the new features added in Solr 5.3 is the ability to specify that you want Solr to use either the min or max value of a multi-valued numeric field — either to use directly (perhaps as a sort), or to incorporate into a larger, more complex function.

For example: Suppose you were periodically collecting temperature readings from a variety of devices and indexing them into Solr. Some devices may only have a single temperature sensor, and return a single reading; but other devices might have 2, or 3, or 50 different sensors, each returning a different temperature reading. If all of the devices were identical — or at least had identical sensors — you might choose to create a distinct field for each sensor, but if the devices and their sensors are not homogeneous, and you generally just care about the “max” temperature returned, you might just find it easier to dump all the sensor readings at a given time into a single multi-valued “all_sensor_temps” field, and also index the “max” value into its own “max_sensor_temp” field. (You might even use the MaxFieldValueUpdateProcessorFactory so Solr would handle this for you automatically every time you add a document.)
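
As a rough sketch of that setup (the collection name “sensors” is a placeholder, and this posts plain JSON to the standard update handler rather than relying on the update processor just mentioned), indexing one set of readings from Python might look like this:

import requests

SOLR_UPDATE = "http://localhost:8983/solr/sensors/update"  # "sensors" is a placeholder collection

doc = {
    "id": "device-42-2015-09-10T00:00:00Z",
    "device_id": "device-42",
    "all_sensor_temps": [71.2, 74.9, 102.3],  # multi-valued: one reading per sensor
    "max_sensor_temp": 102.3,                 # precomputed max, stored single-valued
}

# Post the document as JSON and commit so it is immediately searchable.
response = requests.post(SOLR_UPDATE, params={"commit": "true"}, json=[doc])
response.raise_for_status()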

With this type of setup, you can use different types of Solr options to answer a lot of interesting questions, such as:

  • For a given time range, you can use sort=max_sensor_temp desc to see at a glance which devices had sensors reporting the hottest temperatures during that time frame.
  • Use fq={!frange l=100}max_sensor_temp to restrict your results to situations where at least one sensor on a device reported that it was “overheating”

…etc. But what if you one day decide you’d really like to know which devices have sensors reporting the lowest temperature? Or readings that have ever been below some threshold which is so low it must be a device error?

Since you didn’t create a min_sensor_temp field in your index before adding all of those documents, there was no easy way in the past to answer those types of questions w/o either reindexing completely or using a cursor to fetch all matching documents and determining the “min” value of the all_sensor_temps field yourself.

This is all a lot easier now in Solr 5.3, using some underlying DocValues support. Using the field(...) function, you can now specify the name of a multi-valued numeric field (configured with docValues="true") along with either min or max to indicate which value should be selected.

For Example:

  • Sorting: sort=field(all_sensor_temps,min) asc
  • Filtering: fq={!frange u=0}field(all_sensor_temps,min)
  • As a pseudo-field: fl=device_id,min_temp:field(all_sensor_temps,min),max_temp:field(all_sensor_temps,max)
  • In complex functions: facet.query={!frange key=num_docs_with_large_sensor_range u=50}sub(field(all_sensor_temps,max),field(all_sensor_temps,min))
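
A minimal sketch of issuing the min-based filter and sort from Python, reusing the hypothetical “sensors” collection and the field names above; only the request parameters matter here, the rest is illustrative:

import requests

SOLR_SELECT = "http://localhost:8983/solr/sensors/select"  # same placeholder collection

params = {
    "q": "*:*",
    # keep only documents where the coldest sensor reading was at or below zero
    "fq": "{!frange u=0}field(all_sensor_temps,min)",
    # return per-document min and max as pseudo-fields alongside the device id
    "fl": "device_id,min_temp:field(all_sensor_temps,min),max_temp:field(all_sensor_temps,max)",
    "sort": "field(all_sensor_temps,min) asc",
    "wt": "json",
}

docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
for d in docs:
    print(d["device_id"], d["min_temp"], d["max_temp"])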

One of the first things I wondered about when I started adding the code to Solr to take advantage of the underlying DocValues functionality that makes this all possible is: “How slow is it going to be to find the min/max of each doc at query time?” The answer, surprisingly, is: “Not very slow at all.”

The reason why there isn’t a major performance hit in finding these min/max values at query time comes from the fact that the DocValues implementation sorts the multiple values at index time when writing them to disk. At query time they are accessible via SortedSetDocValues. Finding the “min” is as simple as accessing the first value in the set, while finding the “max” is only slightly harder: one method call to ask for the “count” of how many values are in the (sorted) set, and then ask for the “last one” (ie: count-1).

The theory was sound, but I wanted to do some quick benchmarking to prove it to myself. So I whipped up a few scripts to generate some test data and run a lot of queries with various sort options and compare the mean response time of different equivalent sorts. The basic idea behind these scripts is:

  • Generate 10 million random documents containing a random number of numeric “long” values in a multi_l field
    • Each doc had a 10% chance of having no values at all
    • The rest of the docs have at least 1 and at most 13 random values
  • Index these documents using a solrconfig.xml that combines CloneFieldUpdateProcessorFactory with MinFieldValueUpdateProcessorFactory and MaxFieldValueUpdateProcessorFactory to ensure that single valued min_l and max_l fields are also populated accordingly with the correct values.
  • Generate 500 random range queries against the uniqueKey field, such that there is exactly one query matching each multiple of 200 documents up to 100,000 documents, and such that the order of the queries is fixed but randomized
  • For each sort options of interest:
    • Start Solr
    • Loop over and execute all of the queries using that particular sort option
    • Repeat the loop over all of the queries a total of 10 times to try and eliminate noise and find a mean/stddev response time
    • Shutdown Solr
  • Plot the response time for each set of comparable sort options relative to number of matching documents that were sorted in that request
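
The original benchmarking scripts are not reproduced here; the following is a minimal Python sketch of just the data-generation step described above (the field names multi_l, min_l, and max_l and the 10% / 1-to-13 value rules come from the bullets; the value range and output format are illustrative):

import json
import random

def generate_docs(num_docs):
    """Yield test docs: ~10% with no multi_l values, the rest with 1 to 13 random longs."""
    for i in range(num_docs):
        doc = {"id": str(i)}
        if random.random() >= 0.10:
            values = [random.randint(-(2**31), 2**31) for _ in range(random.randint(1, 13))]
            doc["multi_l"] = values
            # min_l and max_l would normally be filled in by the update processors,
            # but they are cheap to precompute here as well.
            doc["min_l"] = min(values)
            doc["max_l"] = max(values)
        yield doc

# The benchmark used 10,000,000 docs; a small run keeps this sketch quick to try.
with open("testdata.json", "w") as out:
    json.dump(list(generate_docs(1000)), out)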

Before looking at the results, I’d like to remind everyone of a few important caveats:

  • I ran these tests on my laptop, while other applications were running and consuming CPU
  • There was only the one client providing query load to Solr during the test
  • The data in these tests was random and very synthetic; it doesn’t represent any sort of typical distribution
  • The queries themselves are very synthetic, and designed to try and minimize the total time Solr spends processing a request other than sorting the various number of results

Even with those caveats however, the tests — and the resulting graphs — should still be useful for doing an “apples to apples” comparison of the performance of sorting on a single valued field, vs sorting on the new field(multivaluedfield,minmax) function. For example, let’s start by looking at a comparison of the relative time needed to sort documents using sort=max_l asc vs field(multi_l,max) asc …

We can see that both sort options had fairly consistent, and fairly flat graphs, of the mean request time relative to the number of documents being sorted. Even if we “zoom in” to only look at the noisy left edge of the graph (requests that match at most 10,000 documents) we see that while the graphs aren’t as flat, they are still fairly consistent in terms of the relative response time…

This consistency is (ahem) consistent in all of the comparisons tested — you can use the links in the table below to review any of the graphs you are interested in.

single sort | multi sort | direction | graphs
max_l | field(multi_l,max) | asc | results, zoomed
max_l | field(multi_l,max) | desc | results, zoomed
min_l | field(multi_l,min) | asc | results, zoomed
min_l | field(multi_l,min) | desc | results, zoomed
sum(min_l,max_l) | sum(def(field(multi_l,min),0),def(field(multi_l,max),0)) | asc | results, zoomed
sum(min_l,max_l) | sum(def(field(multi_l,min),0),def(field(multi_l,max),0)) | desc | results, zoomed

The post Min/Max On Multi-Valued Field For Functions & Sorting appeared first on Lucidworks.

DuraSpace News: SLIDE PREVIEW: Paris Fedora 4 Workshop and User Group Meeting

planet code4lib - Thu, 2015-09-10 00:00

Winchester, MA  The upcoming Fedora 4 Workshop and User Group Meeting will be held in Paris at the Conservatoire national des arts et métiers on Sept. 25. The event will include both a Fedora 4 workshop and regional Fedora user group presentations along with a wrap-up discussion. To find out more about the topics that will be offered as part of the Workshop you may preview slides that will be presented by Fedora Product Manager David Wilcox:

LITA: Interacting with Patrons Through Their Mobile Devices :: Image Scanning

planet code4lib - Wed, 2015-09-09 05:00

QR codes are not a new technology. Their recent adoption and widespread usage have made the technology pervasive, mostly due to their misuse. However, I want to address QR codes in this series because I believe the technology is brilliant. I enjoy the potential of the concept, and what has recently developed from the technology in the form of Augmented Reality codes.

Originally developed in 1994 by Denso Wave Incorporated, the Quick Response (QR) Code was devised to increase the scan-ability and data storage capacity of the standard linear barcode.

Today, they are most often seen in advertising. Their modern pervasiveness is understandable: they are an inexpensive, easily produced, versatile method of transmitting information. However, their effectiveness as a mode of relaying information is reliant on their method of use. The QR code needs to provide a direct extension of the information in its proximity, and not be an ambiguous entity.

One method of accomplishing the direct association is to brand your QR codes, and take advantage of their ability to error correct.

By taking the extra effort to develop branded QR codes, a higher level of interest may correspond with usage. But never rely on assumptions; one of the greatest benefits of using QR codes is their ability to be easily tracked with Google Analytics. If you are going to put forth the effort to develop them, you might as well be able to quantify their effectiveness.
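
As a rough illustration of both points, here is a minimal sketch using the third-party Python qrcode package (the URL, UTM parameter values, and file name are placeholders): it generates a code at the highest error-correction level, which tolerates roughly 30% damage and so leaves room for a small logo overlay, and points it at an analytics-tagged URL.

import qrcode  # third-party package: pip install qrcode[pil]

# The UTM parameters are placeholders; they let Google Analytics attribute scans to this code.
url = "http://example.org/library-help?utm_source=qr&utm_medium=print&utm_campaign=helpline"

qr = qrcode.QRCode(
    error_correction=qrcode.constants.ERROR_CORRECT_H,  # highest level of error correction
    box_size=10,
    border=4,
)
qr.add_data(url)
qr.make(fit=True)
qr.make_image(fill_color="black", back_color="white").save("helpline-qr.png")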

Their purpose is a direct call to action, a bridge to transport users to supplemental information immediately. This functionality should be used advantageously.

  • Provide users with a helpline
  • Establish a vCalendar for an event or send them to a Facebook event page
  • Send them to a Dropbox where they can obtain important documents
  • Supply them with contact information or a direct email address
  • Provide one to your Digital Library App
  • Make one for each of your Social Media accounts

There are several ways to use QR codes in an effective manner. Using them appropriately as a relay, or extension, of anticipated information should reassure your users to continue to use them.

The future of image scanning

Although many have been dismissive of the QR code as a technology, it has initiated the field of image scanning codes. Its evolution from QR codes to image triggers has removed the stigma of ugliness associated with image codes. Any picture can now be a trigger to interact with, and the prospects of this technology are extraordinary.

A recent example of image scanning technology is exhibited in the publication of Modern Polaxis, an interactive comic book. The creators utilize image triggers throughout the comic to access Flash media, through their published application, to incorporate animation and sound functionality. The ability to overlay these two media functions creates a level of interaction that is exciting to discover. It has even led to the new media form of AR Tattoos.

This same technology has been made more readily available for users to develop on using the app Aurasma, and has already been brought into the classroom.

Aurasma allows users to link real-world environments with correlated digital content to, hopefully, develop an improved experience. Because the technology utilizes Flash, it allows developers to overlay menu options, audio, video, or just additional imagery or animation. It is a technology I hope to see more of in the future.

Ed Summers: Seminar Week 2

planet code4lib - Wed, 2015-09-09 04:00

Here are some more random thoughts about the seminar readings this week.

Buckland, M. K. (1991). Information as thing. JASIS, 42(5):351–360.

This is another classic in the field where Buckland uses the notion of information-as-thing as a fulcrum for exploring what information is. I noticed on re-reading this paper that he seems to feel that information science theorists have dismissed the study of information-as-thing. So in many ways this article is a defense of the study of information as an object. He uses this focus on the materiality of documents to explore and delimit other aspects of information systems, such as how events can be viewed as information and the situational aspects of information. Early on in the paper is one of his most interesting findings, a matrix for characterizing information along two axes:

             Intangible           Tangible
Entity       Knowledge            Data
Process      Becoming informed    Information processing

His analysis keeps returning to the centrality of information-as-thing, in an attempt to avoid this logical dead end:

If anything is, or might be, informative, then everything is, or might well be, information. In which case calling something “information” does little or nothing to define it. If everything is information, then being information is nothing special. (Buckland, 1991, p. 356)

It seems to me that the tension here is one of economy: you can’t put everything in the archive, things must be appraised, some things are left out of the information system. Buckland does note that not everything needs to be relocated into the information system to become information:

Some informative objects, such as people and historic buildings, simply do not lend themselves to being collected, stored, and retrieved. But physical relocation into a collection is not always necessary for continued access. Reference to objects in their existing locations creates, in effect, a “virtual collection.” One might also create some description or representation of them: a film, a photograph, some measurements, a directory, or a written description. What one then collects is a document describing or representing the person, building, or other object. (Buckland, 1991, p. 354)

But even in this case a reference or a representation of the thing must be created and it must be made part of the information system. This takes some effort by someone or an action by something. Even in the world of big data and the Internet of Things that we live in now, the information system is not as big as the universe. We make decisions to create and deploy devices to monitor our thermostats. We build systems to aggregate, analyze the data to inform more decisions. Can these systems be thought of as operating outside of human experience? I guess there are people like Stephen Wolfram who think that the universe itself (which includes us) is an information system, or really a computational system. I wonder what Wolfram and Buckland would have to say to each other…

Some of the paper seems to be defending information-as-thing a bit too strenuously, to the point that it seems like the only viable way of looking at information. So I liked that Buckland closes with this:

It is not asserted that sorting areas of information science with respect to their relationship to information-as-thing would produce clearly distinct populations. Nor is any hierarchy of scholarly respectability intended.

Information certainly can be considered as material, and Buckland demonstrates it’s a useful lever for learning more about what information is. But considering it only as material, absent information-as-process and other situational aspects, leads to some pretty deep philosophical problems. Somewhat relatedly, Dorothea Salo and I recently wrote a paper that examines Linked Data using the work of Buckland and Suzanne Briet (Summers & Salo, 2013).

Buckland, M. K. (1997). What is a ”document”? JASIS, 48(9):804–809.

Again as in (Buckland, 1991) Buckland is attempting to defend ground that he feels many find untenable: defining the scope and limits of the word “document”. Reading between the lines a bit he sees the explosion of printed information as giving rise to attempts to control it, and since printed information exceeds our ability to organize it, it seems only natural to limit the scope of documentation, so the whole enterprise doesn’t seem like folly.

A document is evidence in support of a fact. (Briet, quoted in Buckland, 1997, p. 806)

Buckland quotes Briet to focus the discussion on the value of evidence: a star is not a document, but a photo of a star is. This reminds me a lot of his discussion of situational information, where the circumstances have a great deal to say about whether something is information or a document.

traces of human activity, and other objects not intended as communication (Buckland, 1997, p. 807)

Reminds me of Geiger’s work on trace ethnography, e.g. looking at the behavior of Wikipedia bots (Geiger & Ribes, 2011). The quote of Barthes makes me think of pragmatic philosophy:

… the object effectively serves some purpose, but it also serves to communicate information (Buckland, 1997).

And what of the purpose? Can something communicate information while effectively not serving a purpose?

Information systems can also be used to find new evidence, so documents are not limited to things having evidential value now. Electronic documents push the boundaries even more, because information is less and less like a distinct thing, since everything in a computer is represented ultimately as logic gates, binary ones and zeros.

In both this article and (Buckland, 1991) Buckland seems to resist the notion that information could be anything:

If anything is, or might be, informative, then everything is, or might well be, information. In which case calling something “information” does little or nothing to define it. If everything is information, then being information is nothing special. (Buckland, 1991, p. 356)

if the term ‘‘document’’ were used in a specialized meaning as the technical term to denote the objects to which the techniques of documentation could be applied, how far could the scope of documentation be extended. What could ( or could not ) be a document? The question was, however, rarely formulated in these terms. (Buckland, 1997, p. 805)

Why is it a problem for anything to potentially be information? Is it only a problem because he wants to be able to identify information as an object? If he accepts that information always exists as part of a process, and that these processes are extended in time, does that help relieve this tension about what can be a document?

Saracevic, T. (1999). Information science. Journal of the American Society for Information Science, 50(12):1051–1063.

Definitions of information science are best understood by considering the problems that practitioners are focused on. Saracevic sees information science as having three primary characteristics, which are not unique to information science:

  • interdisciplinary
  • connected to technology
  • with a social/human dimension that shapes society

He also sees there having been three powerful ideas:

  • information retrieval (formal logic for processing information)
  • relevance: a model for examining information retrieval systems
  • interaction: models for feedback between people and information systems

Saracevic’s emphasis on problems seems like it could provide a useful avenue and citation trail for me to explore with respect to the Broken World Thinking outlined by (Jackson, 2014). Is it possible to see Saracevic’s problems as Jackson’s sites for repair? Saracevic claims to have divided information science research into two camps: systems research and human-centered research. Is Saracevic’s wanting to merge the two poles of information science an attempt to repair something he sees as broken?


Buckland, M. K. (1991). Information as thing. JASIS, 42(5), 351–360.

Buckland, M. K. (1997). What is a “document”? JASIS, 48(9), 804–809.

Geiger, R. S., & Ribes, D. (2011). Trace ethnography: Following coordination through documentary practices. In System sciences (hICSS), 2011 44th hawaii international conference on (pp. 1–10). IEEE.

Jackson, S. J. (2014). Media technologies: Essays on communication, materiality and society (P. Boczkowski & K. Foot, Eds.). MIT Press.

Summers, E., & Salo, D. (2013). Linking things on the web: A pragmatic examination of linked data for libraries, archives and museums. ArXiv Preprint ArXiv:1302.4591.

Christina Harlow: Moving Beyond Authorities AccessYYZ Speaker Notes

planet code4lib - Wed, 2015-09-09 00:00

Here is a post that details some thoughts and experiences that led to my short slides and other speaking notes for the Navigating Linked Data panel given at the Access YYZ conference. This was originally envisioned as a sort of interactive workshop, since I’ve learned a lot of this by tinkering, so forgive me if it flows a bit awkwardly in places. It was wrapped into a panel to give a better range of projects and approaches, which I’m excited about. All of the following notes were built off of experimentation for particular use cases from a data munger’s viewpoint, not a semantic web developer’s viewpoint, so any corrections, updates, or additions for future reference or edification are very much welcome.

If you have questions, please let me know - @cm_harlow or

Exploding Library Data Authorities: The Theoretical Context

So this post/presentation notes/etc is going to work off the assumption that you know about RDF, that you’re down with the idea that library data should move towards an RDF model, and that you understand the benefits of exposing this data on the web.

One area that I’m particularly interested in experimenting with is the concept of Library Authorities in a Linked Open Data world. Not just the concept of Library Authorities, however; but how can we leverage years, nay, decades of structured (to varying quality levels) data that describes concepts we use regularly in library data creation? How do we best do this, without recreating the more complicated parts of Library Authorities in RDF? I would imagine we want to create RDF data and drop many of the structures around Libraries Authorities that make it not sustainable as it exists today. Why then call it authority here? Here is a quote I particularly agree with from Kevin Ford, when he was still working at the Library of Congress, and Ted Fons of OCLC:

[Authority the term] expresses, subtly or not so subtly, the opportunities libraries and other cultural organizations have in re-asserting credibility on the Web along with providing new means for connecting and contextualizing users to content. The word “Authority” (along with managing “authoritative” information on People, Places, Organizations, etc.) is more valuable and accurate in a larger Web of interconnected data. Nevertheless, because a BIBFRAME Authority is not conceptually identical to the notion of a traditional library authority, the name - Authority - may be confusing and distracting to traditional librarians and their developers. – “On BIBFRAME”, section 3.6,

Okay, please don’t run away now because there is a mention of BIBFRAME. This post is not about BIBFRAME, and I just really like the approach to discussing the term ‘Authority’ here.

I continue to use the term Libraries Authority (but always with some sort of air quotes context), though, because I’m uncertain what else to call this process I’m about to unravel, to be frank. And because Libraries Authorities carries with it a whole question of infrastructure as well as data creation/use, in my mind. I’d like to see Libraries Authorities in LOD evolve into how we interact with datastores that don’t directly describe some sort of binary or physical object. It would explain concepts we want to unite across platforms, systems, projects, and others. It would be curating data not just about physical/digital objects, but concepts, as expressed in the above quote.

In line with RDF best practices for ontology development, LOD Authorities should try to reuse concepts where possible, and where not possible, build relationships from the vocabularies and ontologies that exist to what we’re trying to describe locally. I want those relationships to be explicit and documented so we can then use Libraries Authorities data more readily to enhance our records. Though a necessary first step in Libraries Authorities in LOD (or any other, honestly) usage is just getting all our possible controlled access points - WHEREVER THEY MIGHT APPEAR IN LIBRARY DATA - connected to metadata describing concepts, i.e. LOD Authorities, through URI capture of some sort. Hence my really strong interest and a lot of tinkering in the area of library data reconciliation.

I should also mention that a lot of thinking on this was born of the fact that the process for updating many commonly used Library Authorities is difficult if not impossible for many to access. Adding Library of Congress Name Authority Files in particular requires the institution to have the resources to support their employees in dealing with a huge training bureaucracy, whether we are talking about PCC (Program for Cooperative Cataloging) institutional-level membership or NACO funnel cataloging. Training often takes months; then there is a review period that can last years. This has caused Libraries Authorities work, which is very much a public service, to be not really equitable nor sustainable, as many institutions find that navigating that bureaucracy is prohibitive.

As I critique the system, this does not mean that I, in any way, devalue the individual effort that goes into Libraries Authority work. Many people donate time and energy to accurately describe concepts that don’t relate directly to their work or institution. They realize that good metadata is a public service, and I appreciate that because it is. I’m critiquing here the larger system, not the individuals working therein.

Regardless, back on the agenda, the following describes one way I am attempting to begin expanding the concept of Library Authorities for a particular use case at the University of Tennessee Knoxville (UTK) Libraries. I’m hoping that this will then grow into a new way for us to handle Library Authorities, exploding authorities so that they eventually become a store of RDF statements for metadata we use to describe concepts, or to negotiate our concepts with external data sources’ descriptions of them.

Exploding Library Data Authorities: The Use Case

At UTK, there is a special unit in the Special Collections department: the Great Smoky Mountains Regional Collection. This deals with representing all kinds of materials and collections that focus on the Great Smoky Mountains in southern Appalachia, southeastern United States. They also try to represent concepts that are important to this region and culture, but weren’t represented adequately in, or open to their edits in, the usual Library Authorities and vocabularies. This became the Database of the Smokies (DOTS) terms, a very simple list of terms, sort of in taxonomy form, that was primarily used for sorting citations of works and indexing works on the region and culture that go into the Database of the Smokies.

DOTS is currently just a Drupal plugin working off a database of citations/works with appropriate DOTS index terms applied. Some of these terms were haphazardly applied to non-MARC descriptive metadata records (largely applied when the DOTS project managers were creating the metadata themselves), and these digital objects with DOTS terms assigned do not show up in the DOTS database of works/citations/objects. These terms were not applied to MARC records describing resources that go into the larger Great Smoky Mountains Regional Collection either. The DOTS terms did not have identifiers, nor a hierarchical structure that was machine-parseable. Finally, the DOTS terms often loosely mirrored the preferred access point text string construction to a point, making inconsistent facets for subject terms in the digital collections platform (as LC authorities and DOTS are both used).

What we wanted to do with DOTS was a number of things, including addressing the above:

  1. Get the hierarchy formalized and machine-readable
  2. Assign terms unique identifiers instead of using text-string matching as a kind of identifier.
  3. Build out reconciliation services for using the updated DOTS in MARC and non-MARC metadata, replete with capturing the valueURI or $0 fields as well.
  4. Pull in subject terms used for DOTS resources outside of DOTS (in the digital collections platform)
  5. Allow the content specialists to be able to add information about these terms - such as local/alternative names, relationships to other terms, other such information describing the concept
  6. Allow external datasets, in particular Geonames and LoC, to enhance the DOTS terms through explicitly declaring relationships between DOTS terms and those datasets.
  7. Look to eventually build this taxonomy into a LOD vocabulary, then enhance the LOD vocabulary into a full-fledged LOD ontology (the difference between a vocabulary and an ontology being that an ontology makes fuller use of formal statements so that accurate reasoning and inference based off DOTS can be done; a vocabulary may have some formal statements, but the reasoning cannot entirely be trusted to return accurate results).
  8. Find ways to then pull the updates to the term description where there are relationships to LoC vocabulary terms for seeding updated Authority records.
  9. Use this work and experimentation to support further exploding of the concept of Library Authorities.

We are focusing on relating DOTS to LoC (primarily LCSH and LCNAF) and Geonames at the moment because they are the vocabularies that have something to offer to DOTS - for LCSH and LCNAF, they offer a broader context to place the DOTS vocabulary within; for Geonames, it offers hierarchical geographic information and coordinates in a consistent encoding as part of the record. Inconsistency in use, coordinates encoding, and where the relationships are declared in the record are the reasons why we did not just rely on the LoC authorities to link out to Geonames, since there has been some matching of the two vocabularies done recently.

Building a Vocabulary: The Setup

The DOTS taxonomy currently lives as a list of text terms. We first wanted to link those terms to LC Authorities and Geonames. This was done by pulling the terms into LODRefine and using existing reconciliation services for the vocabularies. For the LCNAF terms, a different approach was needed as there is no LODRefine reconciliation service currently - this was recently solved by using a Linked Data Fragments server and HDT files to create an endpoint through which reconciliation scripts could work. We pulled in the URI and the label from the external vocabularies for the matching term.

We’ve then taken these terms and the matching URIs and created simple SKOS RDF N-Triples records with basic information included. In short:

  1. DOTS was declared as a skos:ConceptScheme and given some simple SKOS properties for name, contact.
  2. all terms were declared as skos:Concepts and skos:inScheme of DOTS.
  3. all terms were given a URI to be made into a URL by the platform below.
  4. the external URIs were applied with skos:closeMatch* then reviewed by content specialists for ones that could become skos:exactMatch.
  5. for all labels that end up with an skos:exactMatch to external vocabularies, the external vocabularies’ labels were brought in as skos:altLabel.

A snippet of one example of the SKOS RDF N-triples created:

<> <> <> .
<> <> "Tellico"@en .
<> <> <> .
<> <> "Talequo"@en .
<> <> <> .
<> <> <> .
<> <> <> .
...
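
The URIs in that snippet did not survive in this copy, so here is a hedged sketch of how equivalent SKOS triples could be produced with Python’s rdflib; every URI below is a hypothetical placeholder rather than an actual DOTS, LCNAF, or Geonames identifier.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

# Hypothetical namespaces; the real DOTS and LC identifiers would go here.
DOTS = Namespace("http://example.org/dots/")
LCNAF = Namespace("http://id.loc.gov/authorities/names/")

g = Graph()
scheme = DOTS["scheme"]
tellico = DOTS["tellico"]

g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((tellico, RDF.type, SKOS.Concept))
g.add((tellico, SKOS.inScheme, scheme))
g.add((tellico, SKOS.prefLabel, Literal("Tellico", lang="en")))
g.add((tellico, SKOS.altLabel, Literal("Talequo", lang="en")))
# closeMatch first; content specialists may later promote a link to exactMatch.
g.add((tellico, SKOS.closeMatch, LCNAF["n00000000"]))  # placeholder LCNAF URI

g.serialize(destination="dots.nt", format="nt")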

This SKOS RDF N-triples document was then passed through Skosify for improving and ‘validating’ the SKOS document. Next, it was loaded into a Jena Fuseki SPARQL server and triple store, to then be accessed and used in Skosmos, “a web-based tool providing services for accessing controlled vocabularies, which are used by indexers describing documents and searchers looking for suitable keywords. Vocabularies are accessed via SPARQL endpoints containing SKOS vocabularies.” Skosmos, developed by the National Library of Finland, is open source and built originally for Finto, the Linked Open Data Vocabulary service used by government agencies in Finland. It is meant to help support interoperability of SKOS vocabularies, as well as allow editing.

We’ve got a local instance of Skosmos with the basic DOTS SKOS vocabulary in it, used primarily as a proof of concept for the content specialists. Our DOTS Skosmos test instance supports browsing and using the vocabulary, but not editing currently. I’m hoping we can use a simple form to connect to the SPARQL server (as many existing RDF vocabulary and ontology editors are too complicated for this use), but this has been a lower priority than working on a general MODS editor first. The ability to visualize relationships in Skosmos supports the content specialists in really understanding how SKOS structure can help better define their work and discovery.

With the SKOS document, both MARC and non-MARC data with subject terms can now be reconciled and the URI captured either through OpenRefine reconciliation services, or some reconciliation with scripts. This has already helped clean up so much metadata related to this collection. We hope to start using the SPARQL endpoint directly for this reconciliation work.
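
As a rough sketch of what script-based reconciliation against that SPARQL endpoint could look like, the following uses the SPARQLWrapper package against a hypothetical local Fuseki endpoint and matches an incoming heading string against DOTS prefLabels and altLabels; the endpoint URL and the query shape are assumptions, not the production setup.

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical local Fuseki endpoint holding the DOTS SKOS vocabulary.
sparql = SPARQLWrapper("http://localhost:3030/dots/sparql")

def reconcile(heading):
    """Return DOTS concept URIs whose prefLabel or altLabel matches the heading."""
    sparql.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?concept WHERE {
          ?concept a skos:Concept ;
                   skos:prefLabel|skos:altLabel ?label .
          FILTER (lcase(str(?label)) = lcase("%s"))
        }
    """ % heading.replace('"', '\\"'))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["concept"]["value"] for b in results["results"]["bindings"]]

print(reconcile("Tellico"))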

DOTS SKOS Feedback

This work has inspired the DOTS librarians to want to expand a lot of the kind of ‘Library Authority’ information captured, and the inclusion of other schema/systems for additional classes and properties to support other types of information. This included everything from hiking trail lengths to the cemetery where a person is buried. In the above N-triples snippet example, a particularly strong use case is put forward: extending the Libraries Authorities record, so to speak, to better cover Cherokee concepts. ‘Tellico’ was a Cherokee town that has been partially replaced by the current U.S. town of Tellico Plains, as well as the site of many Tellico archaeological digs. The LCNAF has authority records for the latter two concepts, but not the first - not the Cherokee town. What would happen with automated reconciliation is that Tellico was often linked to/overwritten by Tellico Plains (or other Tellico towns currently existing in the U.S.). We are building out DOTS and, we hope, other negotiation layers for Libraries Authorities being migrated to RDF then extended in a way that will not erase concepts like Tellico that don’t exist in the authority file. This is also an important motivation for extending many researcher identity management systems to work between the metadata that wants to link to a particular person and the authority file that may or may not have a record for that person. In moving, in the case of the LoC vocabularies, from relying on unique text-string matching to identifiers, we have moved from a sort of but not entirely closed world assumption to an open world assumption. So identifiers are not just pulled in as preparation for RDF.

Additionally, having this local vocabulary better connected in our local data systems to the external authorities has started many discussions about how we can create a new way of updating or expanding external authorities. UTK is not a PCC institution, but we do have 1 cataloger able to create NACO records for the Tennessee NACO funnel. This work still needs to be reviewed by a number of parties and follow the RDA standard for creating MARC Authorities, and it is limited by the amount of work we need to do beyond NACO work. So there is not much time at the present for this cataloger to spend on updating and/or creating records that relate in some way to DOTS terms. This should not mean that 1. the terms of regional interest to UTK will continue to not be adequately described in Library Authorities, and 2. we continue to keep the content specialists from updating this metadata (though in a negotiated or moderated way, i.e. some kind of ingest form that can handle the data encoding and formation in the back end).

Another question that this project has brought up is where to keep this RDF metadata that we are using currently to negotiate with and extend external Library Authorities. The idea of keeping it as a separate store from, say, descriptive metadata attached to objects has been mostly accepted as a default, but this doesn’t mean that storing concepts, say, in a Fedora instance as objects without binaries, then using Fedora to build the relationships, is not worth investigating. And, additionally, do we want to pull in the full external datasets as well? It is definitely possible with many LOD Library Authorities available as data dumps. I think at this moment, I would like to see this vocabulary and other vocabularies continue to expand in Fuseki and Skosmos, with an eye to making this work in some ways like VIVO has done for negotiating multiple datasources in describing researchers.

DOTS SKOS Next Steps

In going forward, we would like to:

  • Expand the editing capabilities of DOTS SKOS so that the content specialists can more readily and directly do this work.
  • Enhance the hierarchical relationships that we can now support with SKOS. This will mostly involve a lot of manual review that can be done once we’ve got an actual editor in place.
  • Review options beyond SKOS for properties that can support extending the descriptions.
  • Discuss pulling in full external datasets for better relationship building and querying locally, which is somewhat described above.
  • See if this really is part of an evolution to store Library Authorities and further concept descriptions not directly related to a physical/digital object in a local data ecosystem, and how.

Resources + References

Here are some links to tools, projects, or resources that we found helpful in working on this project. They are in alphabetical order:


Library Tech Talk (U of Michigan): The Next Mirlyn: More Than Just a Fresh Coat of Paint

planet code4lib - Wed, 2015-09-09 00:00

The next version of Mirlyn ( is going to take some time to create, but let's take a peek under the hood and see how the next generation of search will work.

David Rosenthal: Infrastructure for Emulation

planet code4lib - Tue, 2015-09-08 23:00
I've been writing a report about emulation as a preservation strategy. Below the fold, a discussion of one of the ideas that I've been thinking about as I write, the unique position national libraries are in to assist with building the infrastructure emulation needs to succeed.

Less and less of the digital content that forms our cultural heritage consists of static documents, more and more is dynamic. Static digital documents have traditionally been preserved by migration. Dynamic content is generally not amenable to migration and must be preserved by emulation.

Successful emulation requires the entire software stack be preserved. Not just the bits the content creator generated and over which the creator presumably has rights allowing preservation, but also the operating system, libraries, databases and services upon which the execution of the bits depends. The creator presumably has no preservation rights over this software, necessary for the realization of their work. A creator wishing to ensure that future audiences can access their work has no legal way to do so. In fact, creators cannot even legally sell their work in any durably accessible form. They do not own an instance of the infrastructure upon which it depends, they merely have a (probably non-transferable) license to use an instance of it.

Thus a key to future scholars' ability to access the cultural heritage of the present is that in the present all these software components be collected, preserved, and made accessible. One way to do this would be for some international organization to establish and operate a global archive of software. In an initiative called PERSIST, UNESCO is considering setting up such a Global Repository of software. The technical problems of doing so are manageable, but the legal and economic difficulties are formidable.

The intellectual property frameworks, primarily copyright and the contract law underlying the End User License Agreements (EULAs), under which software is published differ from country to country. At least in the US, where much software originates, these frameworks make collecting, preserving and providing access to collections of software impossible except with the specific permission of every copyright holder. The situation in other countries is similar. International trade negotiations such as the TPP are being used by copyright interests to make these restrictions even more onerous.

For the hypothetical operator of the global software archive to identify the current holder of the copyright on every software component that should be archived, and negotiate permission with each of them for every country involved, would be enormously expensive. Research has shown that the resources devoted to current digital preservation efforts, such as those for e-journals, e-books and the Web, suffice to collect and preserve less than half of the material in their scope. Absent major additional funding, diverting resources from these existing efforts to fund the global software archive would be robbing Peter to pay Paul.

Worse, the fact that the global software archive would need to obtain permission before ingesting each publisher's software means that there would be significant delays before the collection would be formed, let alone be effective in supporting scholars' access.

An alternative approach worth considering would separate the issues of permission to collect from the issues of permission to provide access. Software is copyright. In the paper world, many countries had copyright deposit legislation allowing their national library to acquire, preserve and provide access (generally restricted to readers physically at the library) to copyright material. Many countries, including most of the major software producing countries, have passed legislation extending their national library's rights to the digital domain.

The result is that most of the relevant national libraries already have the right to acquire and preserve digital works, although not the right to provide unrestricted access to them. Many national libraries have collected digital works in physical form. For example, the German National Library's CD-ROM collection includes half a million items. Many national libraries are crawling the Web to ingest Web pages relevant to their collections.

It does not appear that national libraries are consistently exercising their right to acquire and preserve the software components needed to support future emulations, such as operating systems, libraries and databases. A simple change of policy by major national libraries could be effective immediately in ensuring that these components were archived. Each national library's collection could be accessed by emulations on-site. No time-consuming negotiations with publishers would be needed.

An initial step would be for national libraries to assess the set of software components that would be needed to provide the basis for emulating the digital artefacts already in their collections, which of them were already to hand, and what could be done to acquire the missing pieces. The German National Library is working on a project of this kind with the bwFLA team at the University of Freiburg, which will be presented at iPRES2015.

The technical infrastructure needed to make these diverse national software collections accessible as a single homogeneous global software archive is already in place. Existing emulation frameworks access their software components via the Web, and the Memento protocol aggregates disparate collections into a single resource.

Of course, absent publisher agreements it would not be legal for national libraries to make their software collections accessible in this way. But negotiations about the terms of access could proceed in parallel with the growth of the collections. Global agreement would not be needed; national libraries could strike individual, country-specific agreements which would be enforced by their access control systems.

Incremental partial agreements would be valuable. For example, agreements allowing scholars at one national library to access preserved software components at another would reduce duplication of effort and storage without posing additional risk to publisher business models.

By breaking the link that makes building collections dependent on permission to provide access, by basing collections on the existing copyright deposit legislation, and by making success depend on the accumulation of partial, local agreements instead of a few comprehensive global agreements, this approach could cut the Gordian knot that has so far prevented the necessary infrastructure for emulation being established.

Bohyun Kim: From Programmable Biology to Robots and Bitcoin – New Technology Frontier

planet code4lib - Tue, 2015-09-08 20:50

A while ago, I gave a webinar on the topic of the new technology frontier for libraries. This webinar was given for the South Central Regional Library Council Webinar Series.  I don’t get asked to pick technologies that I think are exciting for libraries and library patrons too often. So I went wild! These are the six technology trends that I picked.

  • Maker Programs
  • Programmable Biology (or Synthetic Biology)
  • Robots
  • Drones
  • Bitcoin (Virtual currency)
  • Gamification (or Digital engagement)

OK, actually the maker programs, drones, and gamification are not too wild, I admit. But programmable biology, robots, and bitcoin were really fun to talk about.

I did not necessarily pick the technologies that I thought would be widely adopted by libraries, as you can guess pretty well from bitcoin. Instead, I tried to pick the technologies that are tackling interesting problems, solutions of which are likely to have a great impact on our future and our library patrons’ lives. It is important to note not only what a new technology is and how it works but also how it can influence our lives, and therefore library patrons and libraries ultimately.

Below are my slides. And if you must, you can watch the webinar recording on Youtube as well. Would you pick one of these technologies if you get to pick your own? If not, what else would that be?

Back to the Future Part III: Libraries and the New Technology Frontier

Eric Hellman: Hey, Google! Move Blogspot to HTTPS now!

planet code4lib - Tue, 2015-09-08 15:35
Since I've been supporting a Library Privacy Pledge to implement HTTPS, I've made an inventory of the services I use myself, to make sure that all the services I use will be HTTPS by the end of 2016. The main outlier: THIS BLOG!

This is odd, because Google, the owner of Blogger and Blogspot, has made noise about moving its services to HTTPS, marking HTTP pages as non-secure, and is even giving extra search engine weight to webpages that use HTTPS.

I'd like to nudge Google, now that it's remade its logo and everything, to get their act together on providing secure service for Blogger. So I set the "description" of my blog to "Move Blogspot to HTTPS NOW." If you have a blog on Blogspot, you can do the same. Go to your control panel and click settings. "description" is the second setting at the top. Depending on the design of your page, it will look like this:

So Google, if you want to avoid a devastating loss of traffic when I move Go-To-Hellman to another platform on January 1, 2017, you better get cracking. Consider yourself warned.

Library of Congress: The Signal: The National Digital Platform for Libraries: An Interview with Trevor Owens and Emily Reynolds from IMLS

planet code4lib - Tue, 2015-09-08 14:01

I had the chance to ask Trevor Owens and Emily Reynolds at the Institute of Museum and Library Services (IMLS) about the national digital platform priority and current IMLS grant opportunities.  I was interested to hear how these opportunities could support ongoing activities and research in the digital preservation and stewardship communities.

Erin: Could you give us a quick overview of the Institute of Museum and Library Services national digital platform? In what way is it similar or different from how IMLS has previously funded research and development for digital tools and services?

Trevor Owens, IMLS senior program officer.

Trevor: The national digital platform has to do with the digital capability and capacity of libraries across the U.S. It is the combination of software applications, social and technical infrastructure, and staff expertise that provide library content and services to all users in the US. The idea for the platform has been developed in dialog with a range of stakeholders through annual convenings. For more information on those, you can see the notes (PDF) and videos from our 2014 and 2015 IMLS Focus convenings.

As libraries increasingly use digital infrastructure to provide access to digital content and resources, there are more opportunities for collaboration around the tools and services used to meet their users’ needs. It is possible for every library in the country to leverage and benefit from the work of other libraries in shared digital services, systems, and infrastructure. We need to bridge gaps between disparate pieces of the existing digital infrastructure for increased efficiencies, cost savings, access, and services.

IMLS is focusing on the national digital platform as an area of priority in the National Leadership Grants to Libraries and the Laura Bush 21st Century Librarian grant programs. Both of these programs have October 1st deadlines for two-page preliminary proposals and will have another deadline for proposals in February. It is also relevant to the Sparks! Ignition Grants for Libraries program.

Erin: One of the priorities identified in the 2015 NDSA National Agenda for Digital Stewardship (PDF) centers around enhancing staffing and training, and the report on the recent national digital platform convening (PDF) stresses issues in supporting professional development and training.  There’s obvious overlap here; how do you see the current education and training opportunities in the preservation community contributing to the platform?  How would you like to see them expanded?

Emily Reynolds, IMLS program specialist and 2014 Future Steward NDSA Innovation Awardee.

Emily: We know that there are many excellent efforts that support digital skill development for librarians and archivists. Since so much of this groundwork has been done, with projects like POWRR, DigCCurr, and the Digital Preservation Management Workshops, we’d love to see collaborative approaches that build on existing curricula and can serve as stepping stones or models for future efforts. That is to say, we don’t need to keep reinventing the wheel! Increasing collaboration also broadens the opportunities for updating training as time passes and desirable skills change.

The impact that the education and training component has on the national digital platform as a whole is tremendous. Even for projects without a specific focus on professional development or training, we’re emphasizing things like documentation and outreach to professional staff. After all, what good is all of this infrastructure if the vast majority of librarians can’t use it? We need to make sure that the tools and systems being used nationally are available and usable to professionals at all types of organizations, even those with fewer resources, and training is a big part of making that happen.

Erin:  Another priority identified in the Agenda is supporting content selection at scale.  For example, there are huge challenges in collecting and preserving large amounts of digital content that libraries and archives that may be interested in for their users, patrons, or researchers.  One of those challenges is knowing what’s been created or being collected or available for access.  Do you see the national digital platform supporting any activities or research around digital content selection?

Trevor: Yes, content selection at scale fits squarely in a broader need for using computational methods to scale up library practices in many different areas. One of the panels at the national digital platform convening this year focused directly on scaling up practice in libraries and archives. Broadly, this included discussions of crowdsourcing, linked data, machine learning, natural language processing and data mining. All of these have considerable potential to move further away from doing things one at a time and duplicating effort.

As an example that directly addresses the issue of content selection at scale, in the first set of grants awarded through the national digital platform, one focuses directly on this issue for web archives. In Combining Social Media Storytelling with Web Archives (LG-71-15-0077) (PDF), Old Dominion University and the Internet Archive are working to develop tools and techniques for integrating “storytelling” social media and web archiving. The partners will use information retrieval techniques to (semi-)automatically generate stories summarizing a collection and mine existing public stories as a basis for librarians, archivists, and curators to create collections about breaking events.

Erin: Supporting interoperability seems to be a strong and necessary component of the platform.  Could you discuss broadly and specifically what role interoperable tools or services could fill for the platform? For example, IMLS recently funded the Hydra-in-a-Box project, an open source digital repository, so it would be interesting to hear how you see the digital preservation community’s existing and developing tools and services working together to benefit the platform.

“Defining and Funding the National Digital Platform” panel (James G. Neal, Amy Garmer, Brett Bobley, Trevor Owens). Courtesy of IMLS.

Trevor: First off, I’d stress that the platform already exists, it’s just not well connected and there are lots of gaps where it needs work. The Platform is the aggregate of the tools and services that libraries, archives and museums build, use and maintain. It also includes the skills and expertise required to put those tools and services into use for users across the country. Through the platform, we are asking the national community to look at what exists and think about how they can fill in gaps in that ecosystem. From that perspective, interoperability is a huge component here. What we need are tools and services that easily fit together so that libraries can benefit from the work of others.

The Hydra-in-a-box project is a great example of how folks in the library and archives community are thinking. The full name of that project, Fostering a New National Library Network through a Community-­Based, Connected Repository System (LG-70-15-0006) (PDF), gets into more of the logic going on behind it. What I think reviewers found compelling about this project is how it brought together a series of related problems and initiatives, and is working to bridge different, but related, library communities.

On the one hand, the Digital Public Library of America is integrating with a lot of different legacy systems, from which it's challenging to share collection data. On the other, the Fedora Hydra open source software community has been growing significantly across academic libraries, but there is a barrier to entry: at this point, it is mostly large academic libraries, which often have several developers working on their projects, that are able to use and benefit from Hydra. By working together, these partners can create and promulgate a solution that makes it easier for more organizations to use Hydra. When more organizations can use Hydra, more organizations can then become content hubs for the DPLA. The partnership with DuraSpace brings their experience in sustaining digital projects, and the possibility of establishing hosted solutions for a system that could provide Hydra to smaller institutions.

“The State of Distributed National Capacity” panel (James Shulman, Sibyl Schaefer, Evelyn McLellan, Dan Cohen, Tom Scheinfeldt) Courtesy of IMLS.

Erin: IMLS hosted Focus Convenings on the national digital platform in April 2014 and April 2015.  Engaging communities and end users at the local level seemed to be a recurring theme at both meetings, but also how to encourage involvement and share resources at the national level.  What are some of the opportunities the digital preservation community could address related to engagement activities to support this theme?

Emily: I think this is a question we’re still actively trying to figure out, and we are interested in seeing ideas from libraries and librarians on how we can help in these areas. We know that there are communities whose records and voices aren’t equally represented in a range of national efforts, and we know that in many cases there are unique issues around cultural sensitivity. Addressing those issues requires direct and sustained contact with, and understanding of, the groups involved.  For example, one of the reasons Mukurtu CMS has been so successful with Native communities is because of how embedded in the project those communities’ concerns are. Those relationships have allowed Mukurtu to create a national network of collections while still encouraging individual repositories to maintain local perspectives and relationships.

Engaging communities to participate in national digital platform activities is another way to address concerns about local involvement. We’ve seen great success with the Crowd Consortium, for example, and the tools and relationships that are being developed around crowdsourcing. Various institutions have also done a great deal of work in this area through use of HistoryPin and similar tools. Crowdsourcing and other opportunities for community engagement in digital collections have the unique capacity to solicit and incorporate the viewpoints and input of a huge range of participants.

Erin: Do you have any thoughts on what would make a proposal compelling? Either a theme or project-related topic that fits with the national digital platform priority?

Participants at IMLS Focus: The National Digital Platform. Courtesy of IMLS.

Trevor: The criteria for evaluating proposals for any of our programs are spelled out in the relevant IMLS Notice of Funding Opportunity. The good news is that there aren’t any secrets to this. The proposals likely to be the most compelling are going to be the ones that best respond to the criteria for any individual program. Across all of the programs, applicants need to make the case that there is a significant need for the work they are going to engage in. Things like the report from the national digital platform convening are a great way to establish the case for the need for the work an applicant wants to do.

I’m also happy to offer thoughts on some points in proposals that aren’t quite as competitive. For the National Leadership Grants, I can’t stress enough the words National and Leadership. This is a very competitive program and the things that rise to the top are generally going to be the things that have a clear, straightforward path to making a national impact. So spend a lot of time thinking about what that national impact would be and how you would measure the change a project could make.

Emily: The Laura Bush 21st Century Librarian Program focuses on building human capital capacity in libraries and archives, through continuing education, as well as through formal LIS master’s and doctoral programs. Naturally, when we talk about “21st century skills” in this program, a lot of capabilities related to technology and the national digital platform surface. Projects in this program are most successful when they show awareness of work that has come before, and explain how they are building upon that previous work. Similarly, and as with all of our programs, reviewers are looking to see how the results of the project will be shared with the field.

For example, the National Digital Stewardship Residency (NDSR) has been very successful with Laura Bush peer reviewers. The original Library of Congress NDSR built on the Library’s existing DPOE curriculum. Subsequently, the New York and Boston NDSR programs adapted the Library of Congress’s model based on resident feedback and other findings. Now we’re seeing a new distributed version of the model being piloted by WGBH. This is a great example of a project that is replicable and iterative. Each organization modified it based on their specific situation, contributing to an overall vision of the program and increasing the impact of IMLS funding.

The Sparks! grants are a little different than the grants of other programs because the funding cap for this program is much lower, at $25,000, and has no cost share requirement. Sparks! is intended to fund projects that are innovative and potentially somewhat risky. It’s a great opportunity for prototyping new tools, exploring new collaborations, and testing new services. As a special funding opportunity within the IMLS National Leadership Grants for Libraries program, Sparks! guidelines also call for potential for broad impact and innovative approaches. Funded projects are required to submit a final report in the form of a white paper that is published on the IMLS website, in order to ensure that these new approaches are shared with the community.

Maura Marx, Acting Director of IMLS, wrapping up at IMLS Focus. Courtesy of IMLS.

Erin: I’m sure many of our readers have applied for IMLS grants in previous cycles. Could you talk a bit about the current proposal process?  Is there any other info you’d like to share with our readers about it?

Emily: The traditional application process, and the one currently used in the Sparks! program, is that applicants submit a full proposal at the time of the application deadline. This includes a narrative, a complete budget and budget justification, staff resumes, and a great deal of other documentation. With Sparks!, these applications are sent directly to peer reviewers in the field, and funding decisions are made based on their scores.

We’ve made some significant changes to the National Leadership Grants and Laura Bush 21st Century Librarian program. For FY16, both programs will require the submission of only a two-page preliminary proposal, along with a couple of standard forms. The preliminary proposals will be sent to peer reviewers, and IMLS will hold a panel meeting with the reviewers to select the most promising proposals. That subset of applicants is then invited to submit full proposals, with a deadline six to eight weeks later. The full proposals go through another round of panel review before funding decisions are made. We’re also adding a second annual application deadline for each program, currently slated for February 2016.

This process was piloted with the National Leadership Grants this past year, and we've seen a number of substantial benefits for applicants. Of course, the workload of creating a two-page preliminary proposal is much less than for the full proposal. Applicants who are invited to submit a full proposal also gain the peer reviewers' comments, which help them strengthen their applications. And for unsuccessful applicants, the second deadline makes it possible to revise and resubmit their proposal. We've found that the resulting full proposals are much more competitive, and reviewers are still able to provide substantial feedback for unsuccessful applicants.

Erin: Now for the quintessential interview question: where do you see the platform in five years?

Trevor: I think we can make a lot of progress in five years. I can see a series of interconnected national networks and projects where different libraries, archives, museums and related non-profits are taking the lead on aspects directly connected to the core of their missions, but benefiting from the work of all the other institutions, too. The idea that there is one big library with branches all over the world is something that I think can increasingly become a reality. In sharing that digital infrastructure, we can build on the emerging value proposition of libraries identified in the Aspen Institute’s report on public libraries (PDF).  By pooling those efforts, and establishing and building on radical collaborations, we can turn the corner on the digital era. We can stop playing catch up and have a seat at the table. We can make sure that our increasingly digital future is shaped by values at the core of libraries and archives around access, equity, openness, privacy, preservation and the integrity of information.

Islandora: Islandora Contributor Licence Agreements

planet code4lib - Tue, 2015-09-08 12:59

We are now making a concerted effort to collect Contributor License Agreements (CLAs) from all project contributors. The CLAs are based on Apache's agreements; they give the Islandora Foundation non-exclusive, royalty free copyright and patent licenses for contributions. They do not transfer intellectual property ownership to the project from the contributor, nor do they otherwise limit what the creator can do with their contributions. This license is for your protection as a contributor as well as the protection of the Foundation and its users; it does not change your rights to use your own contributions for any other purpose.

The CLAs are here:

Current CLAs on file are here.

We are seeking corporate CLAs (cCLA) from all institutions that employ Islandora contributors. We are also seeking individual CLAs (iCLAs) from all individual contributors, in addition to the cCLA. (In most cases the cCLA is probably sufficient, but getting iCLAs in addition helps the project avoid worrying about whether certain contributions were "work for hire", and also helps provide continuity in case a developer continues to contribute even after changing employment.)

All Foundation members and individual contributors will soon be receiving a direct email request to sign the CLAs, along with instructions on how to submit them. At a certain point later this year, we will no longer accept code contributions that are not covered by a CLA and will look to excise any legacy code that isn't covered by an agreement.

If you have any questions, please don't hesitate to ask on the Islandora list, or to send an email to


SearchHub: Search-Time Parallelism at Etsy: An Experiment With Apache Lucene

planet code4lib - Tue, 2015-09-08 08:52
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we're highlighting talks and sessions from past conferences. Today, we're highlighting Shikhar Bhushan's talk on Etsy's experiments with search-time parallelism.

Is it possible to gain the parallelism benefit of sharding your data into multiple indexes, without actually sharding? Isn't your Lucene index already composed of shards, i.e. segments? This talk presents an experiment in parallelizing Lucene's guts: the collection protocol. An express goal was to try to do this in a lock-free manner using divide-and-conquer. Changes to the Collector API were necessary, such as orienting it to work at the level of child "leaf" collectors so that segment-level state could be accumulated in parallel. The talk covers technical details learned along the way, such as how Lucene's TopDocs collectors are implemented using priority queues and custom comparators. It then turns to the parallelizability of collectors: how some, like hit counting, are embarrassingly parallelizable; how some, like DocSet collection, were a delightful challenge; and how others need more consideration of the space-time tradeoffs. Performance testing results, which currently span from worse to exciting, are also discussed.

Shikhar works on Search Infrastructure at Etsy, the global handmade and vintage marketplace. He has contributed patches to Solr/Lucene, and maintains several open-source projects such as a Java SSH library and a discovery plugin for elasticsearch. He previously worked at Bloomberg, where he delivered talks introducing developers to Python and internal Python tooling. He has a special interest in JVM technology and distributed systems.

Search-time Parallelism: Presented by Shikhar Bhushan, Etsy from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…
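To make the per-segment idea concrete, here is a minimal, hypothetical sketch of a hit-counting collector written against the Lucene 5.x Collector/LeafCollector API. This is not Etsy's code from the talk; the class name HitCountCollector and the choice of LongAdder are assumptions made for illustration. Because each index segment gets its own LeafCollector, per-leaf counts accumulate independently and only need to be combined at the end, which is why hit counting is so easy to parallelize.

import java.io.IOException;
import java.util.concurrent.atomic.LongAdder;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorer;

// Minimal hit-counting Collector sketch (illustrative, not Etsy's implementation).
// Each segment ("leaf") gets its own LeafCollector, so counts accumulate per segment;
// LongAdder (an assumption for this sketch) keeps the total correct even if leaves
// were collected on different threads. Stock Collector-based search in Lucene
// normally visits leaves sequentially.
public class HitCountCollector implements Collector {

  private final LongAdder totalHits = new LongAdder();

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    return new LeafCollector() {
      @Override
      public void setScorer(Scorer scorer) throws IOException {
        // Scores are not needed just to count hits.
      }

      @Override
      public void collect(int doc) throws IOException {
        totalHits.increment();
      }
    };
  }

  @Override
  public boolean needsScores() {
    return false; // counting hits does not require scoring
  }

  public long getTotalHits() {
    return totalHits.sum();
  }
}

A parallel variant along the lines described in the talk would hand each leaf's LeafCollector to a separate worker and combine the results afterwards; nothing in this sketch is shared across leaves except the final tally, so no locking is needed.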

The post Search-Time Parallelism at Etsy: An Experiment With Apache Lucene appeared first on Lucidworks.

Terry Reese: MarcEdit 6.1 (Windows/Linux)/MarcEdit Mac (1.1.25) Update

planet code4lib - Tue, 2015-09-08 02:23

So, this update is a bit of a biggie.  If you are a Mac user, the program officially moves out of Preview and into release, and this version brings the following changes:

** 1.1.25 ChangeLog

  • Bug Fix: MarcEditor — changes may not be retained after save if you make manual edits following a global updated.
  • Enhancement: Delimited Text Translator completed.
  • Enhancement: Export Tab Delimited complete
  • Enhancement: Validate Headings Tool complete
  • Enhancement: Build New Field Tool Complete
  • Enhancement: Build New Field Tool added to the Task Manager
  • Update: Linked Data Tool — Added Embed OCLC Work option
  • Update: Linked Data Tool — Enhance pattern matching
  • Update: RDA Helper — Updated for parity with the Windows Version of MarcEdit
  • Update: MarcValidator — Enhancements to support better checking when looking at the mnemonic format.

If you are on the Windows/Linux version – you’ll see the following changes:

* 6.1.60 ChangeLog

  • Update: Validate Headings — Updated patterns to improve the process for handling heading validation.
  • Enhancement: Build New Field — Added a new global editing tool that provides a pattern-based approach to building new field data.
  • Update: Added the Build New Field function to the Task Management tool.
  • UI Updates: Specific to support Windows 10.

The Windows update is a significant one.  A lot of work went into the Validate Headings function, which impacts the Linked Data tools and the underlying linked data engine.  Additionally, the Build New Field tool provides a new global editing function that should simplify complex edits.  If I can find the time, I'll try to put together a YouTube video demoing the process.

You can get the updates from the MarcEdit downloads page: or if you have MarcEdit configured to check for updates automatically, the tool will notify you of the update and provide a method for you to download it.

If you have questions – let me know.


DuraSpace News: Telling DSpace Stories at University of Texas Libraries with Colleen Lyon

planet code4lib - Tue, 2015-09-08 00:00

“Telling DSpace Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing details about DSpace implementations for the community and beyond. The following interview includes personal observations that may not represent the opinions and views of the University of Texas or the DSpace Project.

William Denton: OLITA lending library report on BKON beacon

planet code4lib - Mon, 2015-09-07 23:19

In June I borrowed a BKON A-1 from the OLITA technology lending library. It’s a little black plastic box with a low energy Bluetooth transmitter inside, and you can configure it to broadcast a URL that can be detected by smartphones. I was curious to see what it was like, though I have no use case for it. If you borrow something from the library you’re supposed to write it up, so here’s my brief review.

  1. I took it out of its box and put two batteries in.
  2. I installed a beacon detector on my phone and scanned for it.
  3. I saw it:
  4. I followed the instructions on the BKON Quick Start Guide.
  5. I set up an account.
  6. I couldn’t log in. I tried two browsers but for whatever unknown reason it just wouldn’t work.
  7. I took out the two batteries and put it back in its box.

I’ll give it back to Dan Scott, who said he’s going to ship it back to the manufacturer so they can install the new firmware. I wish better luck to the next borrower.

Access Conference: Watch out for the Livestream!

planet code4lib - Mon, 2015-09-07 18:09

Cast your FOMO feelings aside: a livestream of the conference will be on the website from Wednesday to Friday HERE. An archived copy will be available on YouTube after the conference as well!

Terry Reese: MarcEdit Mac–Release Version 1 Notes

planet code4lib - Mon, 2015-09-07 02:45

This has been a long time coming – taking countless hours and the generosity of a great number of people to test and provide feedback (not to mention the folks that crowdsourced the purchase of a Mac) – but MarcEdit's Mac version is coming out of Preview and will be made available for download on Labor Day.  I'll be putting together a second post officially announcing the new versions (all versions of MarcEdit are getting an update over Labor Day), so if this interests you – keep an eye out.

So exactly what is different from the Preview versions?  Well, at this point, I’ve completed all the functions identified for the first set of development tasks – and then some.  New to this version will be the new Validate Headings tool just added to the Windows version of MarcEdit, the new Build New Field utility (and inclusion into the Task Automation tool), updates to the Editor for performance, updates to the Linking tool due to the validator, inclusion of the Delimited Text Translator and the Export Tab Delimited Text Translator – and a whole lot more.

At this point, the build is made and the tests have been run – so keep an eye out tomorrow – I'll definitely be making it available before the Ohio State/Virginia Tech football game (because everything is going to stop here once that comes on).

To everyone that has helped along the way, providing feedback and prodding – thanks for the help.  I’m hoping that the final result will be worth the wait and be a nice addition to the MarcEdit family.  And of course, this doesn’t end the development on the Mac – I have 3 additional sprints planned as I work towards functional parity with the Windows version of MarcEdit.


William Denton: A Catalogue of Cuts

planet code4lib - Mon, 2015-09-07 00:36

I wrote a short piece for the newsletter of the York University Faculty Association: York University Libraries: A Catalogue of Cuts. We've had year after year of budget cuts at York and York University Libraries, but we in the library don't talk about them in public much. We should.

(Librarians at York University are members of YUFA and have academic status. I am in the final year of my second term as a steward for the Libraries chapter of YUFA. Patti Ryan is my fellow steward.)

