Planet Code4Lib
FOSS4Lib Recent Releases: Koha - Security and maintenance releases - v 3.20.2, 3.18.9, 3.16.13

Tue, 2015-08-11 12:09
Package: Koha. Release Date: Thursday, July 30, 2015

Last updated August 11, 2015. Created by David Nind on August 11, 2015.

Monthly security and maintenance releases for Koha.

See the release announcements for the details:

SearchHub: Basics of Storing Signals in Solr with Fusion for Data Engineers

Tue, 2015-08-11 11:36

In April we featured a guest post Mixed Signals: Using Lucidworks Fusion’s Signals API, which is a great introduction to the Fusion Signals API. In this post I work through a real-world e-commerce dataset to show how quickly the Fusion platform lets you leverage signals derived from search query logs to rapidly and dramatically improve search results over a products catalog.

Signals, What’re They Good For?

In general, signals are useful any time information about outside activity, such as user behavior, can be used to improve the quality of search results. Signals are particularly useful in e-commerce applications, where they can be used to make recommendations as well as to improve search. Signal data comes from server logs and transaction databases which record items that users search for, view, click on, like, or purchase. For example, clickstream data which records a user’s search query together with the item which was ultimately clicked on is treated as one “click” signal and can be used to:

  • enrich the results set for that search query, i.e., improve the items returned for that query
  • enrich the information about the item clicked on, i.e., improve the queries for that item
  • uncover similarities between items, i.e., cluster items that are clicked on for the same queries
  • make recommendations of the form:
    • “other customers who entered this query clicked on that”
    • “customers who bought this also bought that”
Signals Key Concepts
  • A signal is a piece of information, event, or action, e.g., user queries, clicks, and other recorded actions that can be related back to a document or documents which are stored in a Fusion collection, referred to as the “primary collection”.
    • A signal has a type, an id, and a timestamp. For example, signals from clickstream information are of type “click” and signals derived from query logs are of type “query”.
    • Signals are stored in an auxiliary collection, and a naming convention links the two: the name of the signals collection is the name of the primary collection plus the suffix “_signals”.
  • An aggregation is the result of processing a stream of signals into a set of summaries that can be used to improve the search experience. Aggregation is necessary because in the usual case there is a high volume of signals flowing into the system but each signal contains only a small amount of information in and of itself.
    • Aggregations are stored in an auxiliary collection, and a naming convention links the two: the name of the aggregations collection is the name of the primary collection plus the suffix “_signals_aggr”.
    • Query pipelines use aggregated signals to boost search results.
    • Fusion provides an extensive library of aggregation functions allowing for complex models of user behavior. In particular, date-time functions provide a temporal decay function so that over time, older signals are automatically downweighted.
  • Fusion’s job scheduler provides the mechanism for processing signals and aggregations collections in near real-time.
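Fusion’s date-time aggregation functions handle temporal decay for you; purely as an illustration of the idea (the half-life value and function shape here are my own assumptions, not Fusion’s defaults), an exponential downweighting of older signals might be sketched like this:

```python
from datetime import datetime, timezone

def decayed_weight(signal_time, now, half_life_days=30.0):
    """Downweight a signal exponentially by age: a signal that is
    half_life_days old counts half as much as a brand-new one."""
    age_days = (now - signal_time).total_seconds() / 86400.0
    return 0.5 ** (age_days / half_life_days)

now = datetime(2015, 8, 11, tzinfo=timezone.utc)
fresh = decayed_weight(datetime(2015, 8, 10, tzinfo=timezone.utc), now)
month_old = decayed_weight(datetime(2015, 7, 12, tzinfo=timezone.utc), now)
```

A day-old signal keeps almost all of its weight, while a month-old signal counts only half as much, so fresh user behavior dominates the aggregates.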
Some Assembly Required

In a canonical e-commerce application, your primary Fusion collection is the collection over your products, services, customers, and similar. Event information from transaction databases and server logs would be indexed into an auxiliary collection of raw signal data and subsequently processed into an aggregated signals collection. Information from the aggregated signals collection would be used to improve search over the primary collection and make product recommendations to users.

In the absence of a fully operational e-commerce website, the Fusion distribution includes an example of signals and a script that processes this signal data into an aggregated signals collection using the Fusion Signals REST-API. The script and data files are in the directory $FUSION/examples/signals (where $FUSION is the top-level directory of the Fusion distribution). This directory contains:

  • signals.json – a sample data set of 20,000 signal events. These are ‘click’ events.
  • – a script that loads signals, runs one aggregation job, and gets recommendations from the aggregated signals.
  • aggregations_definition.json – examples of how to write custom aggregation functions. These examples demonstrate several different advanced features of aggregation scripting, all of which are outside of the scope of this introduction.

The example signals data comes from a synthetic dataset over Best Buy query logs from 2011. Each record contains the user search query, the categories searched, and the item ultimately clicked on. In the next sections I create the product catalog, the raw signals, and the aggregated signals collections.

Product Data: the primary collection ‘bb_catalog’

In order to put the use of signals in context, first I recreate a subset of the Best Buy product catalog. Lucidworks cannot distribute the Best Buy product catalog data that is referenced by the example signals data, but that data is available from the Best Buy Developer API, which is a great resource both for data and example apps. I have a copy of previously downloaded product data which has been processed into a single file containing a list of products. Each product is a separate JSON object with many attribute-value pairs. To create your own Best Buy product catalog dataset, you must register as a developer via the above URL. Then you can use the Best Buy Developer API query tool to select product records or you can download a set of JSON files over the complete product archives.

I create a data collection called “bb_catalog” using the Fusion 2.0 UI. By default, this creates collections for the signals and aggregated signals as well.

Although the collections panel only lists collection “bb_catalog”, collections “bb_catalog_signals” and “bb_catalog_signals_aggr” have been created as well. Note that when I’m viewing collection “bb_catalog”, the URL displayed in the browser is: “localhost:8764/panels/bb_catalog”:

By changing the collection name to “bb_catalog_signals” or “bb_catalog_signals_aggr”, I can view the (empty) contents of the auxiliary collections:

Next I index the Best Buy product catalog data into collection “bb_catalog”. If you choose to get the data in JSON format, you can ingest it into Fusion using the “JSON” indexing pipeline. See blog post Preliminary Data Analysis in Fusion 2 for more details on configuring and running datasources in Fusion 2.

After loading the product catalog dataset, I check to see that collection “bb_catalog” contains the products referenced by the signals data. The first entry in the example signals file “signals.json” is a search query with query text: “Televisiones Panasonic 50 pulgadas” and docId: “2125233”. I do a quick search to find a product with this id in collection “bb_catalog”, and the results are as expected:

Raw Signal Data: the auxiliary collection ‘bb_catalog_signals’

The raw signals data in the file “signals.json” comes from the synthetic Best Buy dataset. I’ve modified the timestamps on the search logs to make them look like fresh log data. This is the first signal (timestamp updated):

{ "timestamp": "2015-06-01T23:44:52.533Z", "params": { "query": "Televisiones Panasonic 50 pulgadas", "docId": "2125233", "filterQueries": [ "cat00000", "abcat0100000", "abcat0101000", "abcat0101001" ] }, "type": "click" },

The top-level attributes of this object are:

  • type – As stated above, all signals must have a “type”, and as noted in the earlier post “Mixed Signals”, section “Sending Signals”, the value should be applied consistently to ensure accurate aggregation. In the example dataset, all signals are of type “click”.
  • timestamp – This data has timestamp information. If not present in the raw signal, it will be generated by the system.
  • id – These signals don’t have distinct ids; they will be generated automatically by the system.
  • params – This attribute contains a set of key-value pairs, using a set of pre-defined keys which are appropriate for search-query event information. In this dataset, the information captured includes the free-text search query entered by the user, the document id of the item clicked on, and the set of Best Buy site categories that the search was restricted to. These are codes for categories and sub-categories such as “Electronics” or “Televisions”.

In summary, this dataset is an unremarkable snapshot of user behaviors between the middle of August and the end of October, 2011 (updated to May through June 2015).
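Putting those attributes together, a raw click signal can be assembled programmatically before being POSTed. This sketch just mirrors the example record shown above; the array wrapper reflects the shape of the example file, and omitting “id” relies on the system-generated ids noted earlier:

```python
import json

# A click signal shaped like the example record above; "id" is omitted
# so the system can generate one automatically.
signal = {
    "type": "click",
    "timestamp": "2015-06-01T23:44:52.533Z",
    "params": {
        "query": "Televisiones Panasonic 50 pulgadas",
        "docId": "2125233",
        "filterQueries": ["cat00000", "abcat0100000",
                          "abcat0101000", "abcat0101001"],
    },
}
payload = json.dumps([signal])  # signals.json holds a JSON array of these
```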

The example script “” loads the raw signal via a POST request to the Fusion REST-API endpoint: /api/apollo/signals/<collectionName> where <collectionName> is the name of the primary collection itself. Thus, to load raw signal data into the Fusion collection “bb_catalog_signals”, I send a POST request to the endpoint: /api/apollo/signals/bb_catalog.

Like all indexing processes, an indexing pipeline is used to process the raw signal data into a set of Solr documents. The pipeline used here is the default signals indexing pipeline named “_signals_ingest”. This pipeline consists of three stages, the first of which is a Signal Formatter stage, followed by a Field Mapper stage, and finally a Solr Indexer stage.

(Note that in a production system, instead of doing a one-time upload of some server log data, raw signal data could be streamed into a signals collection on an ongoing basis by using a Logstash or JDBC connector together with a signals indexing pipeline. For details on using a Logstash connector, see the blog post on Fusion with Logstash.)

Here is the curl command I used, running Fusion locally in single server mode on the default port:

curl -u admin:password123 -X POST -H 'Content-type:application/json' http://localhost:8764/api/apollo/signals/bb_catalog?commit=true --data-binary @new_signals.json

This command succeeds silently. To check my work, I use the Fusion 2 UI to view the signals collection, by explicitly specifying the URL “localhost:8764/panels/bb_catalog_signals”. This shows that all 20K signals have been indexed:

Further exploration of the data can be done using Fusion dashboards. To configure a Fusion dashboard using Banana 3, I specify the URL “localhost:8764/banana”. (For details and instructions on Banana 3 dashboards, see this post on log analytics). I configure a signals dashboard and view the results:

The top row of this dashboard shows that there are 20,000 clicks in the collection bb_catalog_signals, all recorded in the last 90 days. The middle row contains a bar chart showing when the clicks came in and a pie chart of the top 200 documents clicked on. The bottom row is a table over all of the signals; each signal records a single click.

The pie chart allows us to visualize a simple aggregation of clicks per document. The most popular document got 232 clicks, roughly 1% of the total. The 200th most popular document got 12 clicks, and the vast majority of documents got only one click each. To use this click information, we need to make it available in a form that Solr can use. In other words, we need to create a collection of aggregated signals.

Aggregated Signals Data: the auxiliary collection ‘bb_catalog_signals_aggr’

Aggregation is the “processing” part of signals processing. Fusion runs queries over the documents in the raw signals collection in order to synthesize new documents for the aggregated signals collection. Synthesis ranges from counts to sophisticated statistical functions. The nature of the signals collected determines the kinds of aggregations performed. For click signals from query logs, the processing is straightforward: an aggregated signal record contains a search query, a count of the raw signals that contained that query, and information gathered from all of those raw signals: timestamps, ids of the documents clicked on, and search query settings (in this case, the product catalog categories over which the search was carried out).
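Conceptually (Fusion’s actual aggregation jobs are far more capable, and this sketch is only my illustration of the idea), grouping raw click signals by query string looks something like this:

```python
from collections import defaultdict

def aggregate_clicks(signals):
    """Group raw click signals by query string into one summary record
    per query: a count plus the docIds and timestamps seen for it."""
    groups = defaultdict(lambda: {"count": 0, "docIds": [], "timestamps": []})
    for s in signals:
        if s.get("type") != "click":
            continue
        rec = groups[s["params"]["query"]]
        rec["count"] += 1
        rec["docIds"].append(s["params"]["docId"])
        rec["timestamps"].append(s["timestamp"])
    return dict(groups)

raw = [
    {"type": "click", "timestamp": "2015-06-01T00:00:00Z",
     "params": {"query": "laptop", "docId": "2969477"}},
    {"type": "click", "timestamp": "2015-06-02T00:00:00Z",
     "params": {"query": "laptop", "docId": "9755322"}},
]
aggr = aggregate_clicks(raw)
```

Two raw “laptop” clicks collapse into one aggregated record with a count of 2, which explains why 20,000 raw signals can reduce to roughly 15,000 aggregated ones.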

To aggregate the raw signals in collection “bb_catalog_signals” from the Fusion 2 UI, I choose the “Aggregations” control listed in the “Index” section of the “bb_catalog_signals” home panel:

I create a new aggregation called “bb_aggregation” and define the following:

  • Signal Types = “click”
  • Time Range = “[* TO NOW]” (all signals)
  • Output Collection = “bb_catalog_signals_aggr”

The following screenshot shows the configured aggregation. The circled fields are the fields which I specified explicitly; all other fields were left at their default values.

Once configured, the aggregation is run via controls on the aggregations panel. This aggregation only takes a few seconds to run. When it has finished, the number of raw signals processed and aggregated signals created are displayed below the Start/Stop controls. This screenshot shows that the 20,000 raw signals have been synthesized into 15,651 aggregated signals.

To check my work, I use the Fusion 2 UI to view the aggregated signals collection, by explicitly specifying the URL “localhost:8764/panels/bb_catalog_signals_aggr”. Aggregated click signals have a “count” field. To see the more popular search queries, I sort the results in the search panel on field “count”:

The most popular searches over the Best Buy catalog are for major electronic consumer goods: TVs and computers, at least according to this particular dataset.

Fusion REST-API Recommendations Service

The final part of the example signals script “” calls the Fusion REST-API’s Recommendation service endpoints “itemsForQuery”, “queriesForItem”, and “itemsForItems”. The first endpoint, “itemsForQuery”, returns the list of items that were clicked on for a query phrase. In the “” example, the query string is “laptop”. When I do a search on the query string “laptop” over collection “bb_catalog”, using the default search pipeline, the results don’t actually include any laptops:

With properly specified fields, filters, and boosts, the results could probably be improved. With aggregated signals, we see improvements right away. I can get recommendations using the “itemsForQuery” endpoint via a curl command:

curl -u admin:password123 http://localhost:8764/api/apollo/recommend/bb_catalog/itemsForQuery?q=laptop

This returns the following list of ids: [ 2969477, 9755322, 3558127, 3590335, 9420361, 2925714, 1853531, 3179912, 2738738, 3047444 ], most of which are popular laptops:
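One plausible way to put those recommended ids to work (this is my own illustration, not the post’s method; the field name “id” and the boost weight are hypothetical) is to fold them into a Solr boost query so that clicked-on products rank higher:

```python
ids = ["2969477", "9755322", "3558127"]

def boost_query(doc_ids, field="id", weight=10):
    """Build a Solr bq parameter value that boosts the recommended
    documents without excluding anything else from the result set."""
    return " ".join("%s:%s^%d" % (field, d, weight) for d in doc_ids)

bq = boost_query(ids)
```

The resulting string could be passed as a bq parameter in a query pipeline stage, which is exactly the kind of signal-driven boosting the next post in the series covers.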

When not to use signals

If the textual content of the documents in your collection provides enough information such that for a given query, the documents returned are the most relevant documents available, then you don’t need Fusion signals. (If it ain’t broke, don’t fix it.) If the only information about your documents is the documents themselves, you can’t use signals. (Don’t use a hammer when you don’t have any nails.)


Fusion provides the tools to create, manage, and maintain signals and aggregations. It’s possible to build extremely sophisticated aggregation functions, and to use aggregated signals in many different ways. It’s also possible to use signals in a simple way, as I’ve done in this post, with quick and impressive results.

In future posts in this series, we will show you:

  • How to write query pipelines to harness this power for better search over your data, your way.
  • How to harness the power of Apache Spark for highly scalable, near-real-time signal processing.

The post Basics of Storing Signals in Solr with Fusion for Data Engineers appeared first on Lucidworks.

William Denton: Jesus, to his credit

Tue, 2015-08-11 01:30

A quote from a sermon given by an Anglican minister a couple of weeks ago: “Jesus, to his credit, was a lot more honourable than some of us would have been.”

DuraSpace News: 2016 DLF eResearch Network

Tue, 2015-08-11 00:00

From Rita Van Duinen, Council on Library and Information Resources (CLIR)/Digital Library Federation (DLF)

Interested in joining next year's Digital Library Federation’s (DLF) eResearch Network?

Karen Coyle: Google becomes Alphabet

Mon, 2015-08-10 22:24
I thought it was a joke, especially when the article said that they have two investment companies, Ventures and Capital. But it's all true, so I have this to say:

G is for Google, H is for cHutzpah. In addition to our investment companies Ventures and Capital, we are instituting a think tank, Brain, and a company focused on carbon-based life-based forms, Body. Servicing these will be three key enterprises: Food, Water, and Air. Support will be provided by Planet, a subsidiary of Universe. Of course, we'll also need to provide Light. Let there be. Singularity. G is for God. 

Code4Lib: Code4Lib 2016

Mon, 2015-08-10 20:37

The Code4Lib 2016 Philadelphia Committee is pleased to announce that we have finalized the dates and location of the 2016 conference.

The 2016 conference will be held from March 7 through March 10 in the Old City District of Philadelphia. This location puts conference attendees within easy walking distance of many of Philadelphia’s historical treasures, including Independence Hall, the Liberty Bell, the Constitution Center, and the house where Thomas Jefferson drafted the Declaration of Independence. Attendees will also be a very short distance from the Delaware River waterfront and will be a short walk from numerous excellent restaurants.

As we’ll be reserving almost all of the space within the hotel for our conference (both rooms and conference spaces), Code4Lib 2016 will have the tight-knit community feel we know is important.

More details to come soon; in the meantime, the Keynote Committee is gearing up to open submissions for the conference keynote speaker, so be sure to contact them at for more information, or you can nominate a keynote speaker at

Also, our Sponsorship Committee is actively looking for sponsors for 2016, so please contact the committee via Shaun Ellis to learn about all the ways your organization can help sponsor our 2016 conference.

It’s shaping up to be a great conference this year, and there will be lots more opportunities to volunteer. Our team is looking forward to seeing you on March 7!

The Code4Lib 2016 Philadelphia Committee


Islandora: Call for 7.x-1.6 Release Team volunteers -- We want you on our team!!!

Mon, 2015-08-10 18:30

The 7.x-1.6 Release Team will be working on the next release very soon, and you could be our very next release team member! Want some more motivation? Release team members get really awesome shirts!

We are looking for members for all four release team roles. We added a new auditor role this time around in order to break up some of the responsibilities between testers and documentors.

Release team roles

  • Documentors: Documentation will need to be updated for the next release. Any new components will also need to be documented. If you are interested in working on the documentation for a given component, please add your name to any component listed here.
  • Testers: All components with JIRA issues set to 'Ready for Test' will need to be tested and verified. Additionally, testers test the overall functionality of a given component. If you are interested in being a tester for a given component, please add your name to any component listed here. Testers will be provided with a release candidate virtual machine to do testing on.
  • Auditors: Each release we audit our README and LICENSE files. Auditors will be responsible for auditing a given component. If you are interested in being an auditor for a given component, please add your name to any component listed here.
  • Component managers: Are responsible for the code base of their components. If you are interested in being a component manager, please add your name to any component listed here.

Time lines

  • Code Freeze: Tuesday, September 1, 2015
  • Release Candidate: Tuesday, September 15, 2015
  • Release: Friday October 30, 2015

If you have any questions about being a member of the release team, feel free to ask here.

DPLA: New DPLA Job Opportunity: Ebook Project Manager

Mon, 2015-08-10 17:00

Come work with us! We’re pleased to share an exciting new DPLA job opportunity: Ebook Project Manager. The deadline to apply is August 31. We encourage you to share this posting far and wide!

Ebook Project Manager

The Digital Public Library of America seeks a full-time Ebook Project Manager to assist DPLA with its new ebook initiatives. The Ebook Project Manager should be a knowledgeable, creative community leader who can move our early stage ebook work from conversation to action. We are seeking a creative individual who demonstrates strong organizational and project management skills, with a broad knowledge of the ebook landscape. The Ebook Project Manager will work closely with the Business Development Director to develop DPLA’s ebook strategy and services, and will coordinate DPLA’s National Ebook Working Group, organize future meetings, and administer discrete pilots targeting key areas of our framework for ebooks.

Responsibilities of the Ebook Project Manager:

  • Serves as DPLA’s primary point person for service development, community engagement, and other aspects of DPLA’s developing ebook program;
  • Leads community convenings; facilitates stakeholder conversations; and synthesizes issues, decisions, and system/service requirements;
  • Organizes and directs the DPLA ebook curation group;
  • Coordinates external communications to the broader DPLA community;
  • Works with DPLA network partners to identify and curate open content for use by content distribution partners.

Requirements for the position:

  • Strong knowledge of current ebook landscape, with a preference given to candidates who demonstrate deep understanding of the public library marketplace, publisher distribution/acquisition processes, and library collection development/acquisition workflow;
  • Understanding of the technology behind ebooks, including EPUB and EPUB conversion processes, web- and app-based display of ebooks;
  • Experience with project management, especially as it relates to large-scale digital projects;
  • MLS or equivalent experience with books, cataloguing, and metadata;
  • Demonstrated commitment to DPLA’s mission to maximize access to our shared culture.

This position is full-time, ideally situated either in DPLA’s Boston headquarters, or remotely in New York, Washington, or another location in the northeast corridor, but other locations will also be considered.

Like its collection, DPLA is strongly committed to diversity in all of its forms. We provide a full set of benefits, including health care, life and disability insurance, and a retirement plan. Starting salary is commensurate with experience.

Please send a letter of interest, a resume/cv, and contact information for three references to by August 31, 2015. Please put “Ebook Project Manager” in the subject line. Questions about the position may also be directed to

About DPLA

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated 11 million items from 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

Islandora: Goodbye, Islandoracon. You were awesome.

Mon, 2015-08-10 14:43

Last week marked a huge milestone for the Islandora Community as we came together for our first full-length conference, in the birthplace of Islandora at the University of Prince Edward Island in Charlottetown, PEI. With a final headcount of 80 attendees, a lineup of 28 sessions and 16 workshops, and a day-long Hackfest to finish things off, there is a lot to reflect on.

Mark Leggott opened the week with a Keynote that looked back over the history of the Islandora project through the lens of evolution - from its single-celled days as an idea at UPEI to the "Futurzoic" era ahead of us. We spent the rest of Day One talking about repository strategies, how Islandora works as a community, and how Islandora can work for communities of end users. The day ended with a BBQ on the lawn of the Robertson Library and a first exposure to a variety of Canadian potato chip flavours (roast chicken flavour, anyone?)

Day Two split the conference into two tracks, which meant some tough choices between some really great sessions on Islandora tools, sites, migration strategies, working with the arts and digital humanities, and the future with Fedora 4. You can find the slides from many sessions linked in the conference schedule. We ended with beer, snacks, and brutally hard bar trivia.

Day Three launched two days of 90-minute workshops in two tracks, delving into the details of Islandora with some hands-on training from Islandora experts. We covered everything from the basics of setting up the Drupal side of an Islandora Site, to a detailed look at the Tuque API or mass-migrating content with Drush scripts. The social events continued as well, with our big conference dinner at Fishbones, complete with live music and an oyster bar on Wednesday night, and a seafood tasting hosted by conference sponsor discoverygarden, Inc, where this view from the deck was augmented with an actual, literal, rainbow:

Not pictured: Rainbow

After the workshops finished up on Thursday, we held the first Islandora Foundation AGM, where new Chairman Mark Jordan (Simon Fraser University) received the ceremonial screaming monkey and former Chairman Mark Leggott took a new place as the Foundation's Treasurer. There was also lively debate around the subject of individual membership in the IF (more on that in days to come). 

Finally, we had the Hackfest, which went off better than we could have hoped. In addition to some bug fixes and improvements, the teams of the Hackfest produced a whopping four new tools, one of which is ready enough for use that it has been proposed for adoption in the next release. The Hackfest tools are:

With apologies to anyone whose name I've left out. It was a big crowd and everyone did great work.

From the conference planning team and the Islandora Foundation, thank you very much to our attendees for making our first conference a big success. We hope you enjoyed yourself and learned a ton. And we hope you'll join us again at our next conference!

Next up, for those who can't wait for the second Islandoracon: Islandora Camp CT in Hartford, Connecticut, October 20 - 23

Shelley Gullikson: Adventures in Information Architecture Part 2: Thinking Big, Thinking Small

Mon, 2015-08-10 12:45

When we last saw them in Part 1, our Web Committee heroes were stuck with a tough decision: do we shoehorn the Ottawa Room content into an information architecture that doesn’t really fit it, or do we try to revamp the whole IA?

There was much hand-wringing and arm-waving. (Okay, I did a lot of the hand-wringing and arm-waving.) Our testing showed that users were either using Summon or asking someone to get information, and that when they needed to use the navigation they were stymied. Almost no one looked at the menus. What are our menus for if no one is using them? Are they just background noise? If so, should we just try to make the background noise more pleasant? What if the IA isn’t there primarily to organize and categorize our content, but to tell our users something about our library? Maybe our menus are grouping all the rocks in one spot and all the trees in another spot and all the sky bits somewhere else and what we really need to do is build a beautiful path that leads them…

Oh, hey, (said our lovely and fabulous Web Committee heroes) why don’t you slow down there for a second? What is the problem we need to solve? We’ve already tossed around some ideas that might help, why don’t we look at those to see if they solve our problem? Yes, those are interesting questions you have, and that thing about the beautiful path sounds swell, but… maybe it can wait.

And they kindly took me by the hand — their capes waving in the breeze — and led me out of the weeds. And we realized that we had already come up with a couple of solutions. We could use our existing category of “Research” (which up to now only had course guides and subject guides in it) to include other things like the resources in the Ottawa Room and all our Scholarly Communications / Open Access stuff. We could create a new category called “In the Library” (or maybe “In the Building” is better?) and add information about the physical space that people are searching our site for because it doesn’t fit anywhere in our current IA.

The more we talked about small, concrete ideas like this we realized they might also help with some of the issues left back in the weeds. The top-level headings on the main page (and in the header menu) would read: “Find Research Services In the Building.” Which is not unpleasant background noise for a library.

DuraSpace News: NOW AVAILABLE: Fedora 4.3.0—Towards Meeting Key Objectives

Mon, 2015-08-10 00:00

Winchester, MA. On July 24, 2015, Fedora 4.3.0 was released by the Fedora team. Full release notes are included in this message and are also available on the wiki. This new version furthers several major objectives, including:

  • Moving Fedora towards a clear set of standards-based services

  • Moving Fedora towards runtime configurability

Terry Reese: MarcEdit 6 Wireframes — Validating Headings

Sun, 2015-08-09 14:44

Over the last year, I’ve spent a good deal of time looking for ways to integrate many of the growing linked data services into MarcEdit.  These services, mainly revolving around vocabularies, provide some interesting opportunities for augmenting our existing MARC data, or enhancing local systems that make use of these particular vocabularies.  Examples like those at the Bentley ( are real-world demonstrations of how computers can take advantage of these endpoints when they are available. 

In MarcEdit, I’ve been creating and testing linking tools for close to a year now, and one of the areas I’ve been waiting to explore is whether libraries could utilize linking services to build their own authorities workflows. Conceptually, it should be possible – the necessary information exists…it’s really just a matter of putting it together. So, that’s what I’ve been working on. Utilizing the linked data libraries found within MarcEdit, I’ve been working to create a service that will help users identify invalid headings and the records where those headings reside.

Working Wireframes

Over the last week, I’ve prototyped this service. The way that it works is pretty straightforward. The tool extracts the data from the 1xx, 6xx, and 7xx fields, and if they are tagged as being LC controlled, I query the service to see what information I can learn about the heading. Additionally, since this tool is designed for work in batch, there is a high likelihood that headings will repeat – so MarcEdit generates a local cache of headings as well – this way it can check against the local cache rather than the remote service when possible. The local cache will grow constantly, with entries set to expire after a month. I’m still toying with what to do with the local cache, expirations, and what the best way to keep it in sync might be. I’d originally considered pulling down the entire LC names and subjects headings – but for a desktop application, this didn’t make sense. Together, these files, uncompressed, consume GBs of data. Within an indexed database, this would continue to be true. And again, these files would need to be updated regularly. So, I’m looking for an approach that will give some local caching, without making the user download and manage huge data files.
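A minimal sketch of the kind of local heading cache described above – an SQLite table keyed on the heading, with entries expiring after roughly a month. All names here are illustrative, not MarcEdit’s actual implementation, and the remote lookup is passed in as a function rather than hard-coded against LC’s service:

```python
import sqlite3
import time

MONTH_SECONDS = 30 * 24 * 3600  # cache entries expire after roughly a month


def open_cache(path="headings_cache.db"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS headings (
                      term TEXT PRIMARY KEY,
                      preferred TEXT,   -- the preferred label, or NULL if unresolved
                      fetched REAL)""")
    return db


def lookup_heading(db, term, fetch_remote):
    """Return the preferred label for term, consulting the local cache first."""
    row = db.execute("SELECT preferred, fetched FROM headings WHERE term = ?",
                     (term,)).fetchone()
    if row and time.time() - row[1] < MONTH_SECONDS:
        return row[0]                   # fresh cache hit: no network round trip
    preferred = fetch_remote(term)      # slow path: ask the remote service
    db.execute("INSERT OR REPLACE INTO headings VALUES (?, ?, ?)",
               (term, preferred, time.time()))
    db.commit()
    return preferred
```

Because batch files repeat headings heavily, the second and later occurrences of a heading hit the cache and never touch the network – which is where most of the savings in a batch run would come from.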

Anyway – the function is being implemented as a Report.  Within the Reports menu in the MarcEditor, you will eventually find a new item titled Validate Headings.

When you run the Validate Headings tool, you will see the following window:

You’ll notice that there is a Source file. If you come from the MarcEditor, this will be prepopulated. If you come from outside the MarcEditor, you will need to define the file that is being processed. Next, you select the elements to authorize, then click Process. The Extract button will initially be disabled until after the data run. Once completed, users can extract the records with invalid headings.

When completed, you will receive the following report:

This includes the total processing time, the average response time from LC’s service, the total number of records, and information about how the data validated. Below that, the report will give you information about headings that validated but were variants. For example:

Record #846
Term in Record: Arnim, Bettina Brentano von, 1785-1859
LC Preferred Term: Arnim, Bettina von, 1785-1859

This would be marked as an invalid heading, because the data in the record is incorrect. But the reporting tool will provide the preferred LC label, so the user can see how the data should currently be structured. Actually, now that I’m thinking about it – I’ll likely include one more value: the URI to the dataset, so you can go directly to the authority file page from this report.
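The validation itself boils down to comparing the record’s heading against the preferred label that comes back from the service. A simplified sketch of that comparison (the real normalization rules are certainly more involved than this):

```python
def normalize(heading):
    # Collapse case and whitespace, and drop periods, before comparing.
    return " ".join(heading.lower().replace(".", "").split())


def classify(record_term, lc_preferred):
    """Return 'valid', 'variant' (resolves to a different preferred form), or 'unknown'."""
    if lc_preferred is None:
        return "unknown"    # the service had no match for this heading at all
    if normalize(record_term) == normalize(lc_preferred):
        return "valid"
    return "variant"        # heading resolved, but not to the form in the record
```

For the example above, classify("Arnim, Bettina Brentano von, 1785-1859", "Arnim, Bettina von, 1785-1859") comes back as a variant, which is exactly the case the report flags.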

This report can be copied or printed – and as I noted, when this process is finished, the Extract button is enabled so the user can extract the data from the source records for processing. 

Couple of Notes

So, this process takes time to run – there just isn’t any way around it. For this set, there were 7702 unique items queried. Each request from LC averaged 0.28 seconds. In my testing, depending on the time of day, I’ve found that response times can run from 0.20 seconds per request to 1.2 seconds per request. None of those times are that bad individually, but taken in aggregate against 7700 queries, it adds up. If you do the math, 7702*0.2 = 1540 seconds just to ask for the data. Divide that by 60 and you get roughly 25.7 minutes. Against the total processing time, that means there are about 11 minutes of “other” things happening here. My guess is that those other 11 minutes are being eaten up by local lookups, character conversions (since LC requires UTF-8 and my data was in MARC8), and data normalization. Since there isn’t anything I can do about the latency between the user and the LC site, I’ll be working over the next week to try to remove as much local processing time from the equation as possible.
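The back-of-the-envelope arithmetic above is easy to check (figures from the post; 0.2 seconds is the low end of the observed per-request latency):

```python
requests = 7702      # unique headings queried in the batch
per_request = 0.2    # seconds; low end of observed per-request response times

network = requests * per_request    # time spent waiting on the remote service alone
minutes = network / 60

print(round(network), "seconds, or about", round(minutes, 1), "minutes, on the wire")
```

Anything beyond that in the total processing time is local work – lookups, character conversion, normalization – which is the part that can actually be optimized.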

Questions – let me know.


Manage Metadata (Diane Hillmann and Jon Phipps): Five Star Vocabulary Use

Fri, 2015-08-07 18:50

Most of us in the library and cultural heritage communities interested in metadata are well aware of Tim Berners-Lee’s five star ratings for linked open data (in fact, some of us actually have the mug).

The five star rating for LOD, intended to encourage us to follow five basic rules for linked data, is useful, but, as we’ve discussed it over the years, a basic question arises: what good is linked data without (property) vocabularies? Vocabulary manager types like me and my peeps are always thinking like this, and recently we came across solid evidence that we are not alone in the universe.

Check out “Five Stars of Linked Data Vocabulary Use”, published last year in the Semantic Web Journal. The five authors posit that TBL’s five star linked data is just the precondition to what we really need: vocabularies. They point out that the original 5 star rating says nothing about vocabularies, but that Linked Data without vocabularies is not useful at all:

“Just converting a CSV file to a set of RDF triples and linking them to another set of triples does not necessarily make the data more (re)usable to humans or machines.”

Needless to say, we share this viewpoint!

I’m not going to steal their thunder and list here all five star categories–you really should read the article (it’s short), but only note that the lowest level is a zero star rating that covers LD with no vocabularies. The five star rating is reserved for vocabularies that are linked to other vocabularies, which is pretty cool, and not easy to accomplish by the original publisher as a soloist.

These five star ratings are a terrific start to good practices documentation for vocabularies used in LOD, which we’ve had in our minds for some time. Stay tuned.

Patrick Hochstenbach: Penguin in Africa II

Fri, 2015-08-07 18:01
Filed under: Comics Tagged: africa, cartoon, comic, comics cartoons, inking, kinshasa, Penguin

District Dispatch: Envisioning copyright education

Fri, 2015-08-07 16:52

I have been an ALA employee for a while now, working primarily on copyright policy and education. During that time, I have worked with several librarian groups, taught a number of copyright workshops, and appreciate that more librarians have a better understanding of what copyright is than was true several years ago. Nonetheless, on a regular basis, librarians across the country, primarily academic but also school librarians, find themselves tasked with the assignment to be the “copyright person” for their library or educational institution. These new job responsibilities are usually unwanted, because the victims recognize that they don’t know anything about copyright. The fortunate among them make connections with more knowledgeable colleagues, or perhaps have the funding to attend a copyright workshop here or there that may be, but often is not, reliable. In short, their graduate degree in library and information science, accredited or not, has not prepared them for the assignment. Information policy coursework in library school is limited to a discussion of censorship and Banned Books Week.

Sounds a bit harsh, doesn’t it?

I don’t expect or recommend that graduate students become fluent in the details of every aspect of the copyright law. What they do need to know is the purpose of the copyright law, why information professionals in particular have a responsibility for upholding balanced copyright law by representing the information rights of their communities, why information policy understanding must go hand in hand with librarianship, and of course, what is fair use? They need to understand copyright law as a concept, not a set of dos and don’ts.

Recently, this void in library and information science education has begun to be investigated. I know several librarians who are conducting research on MLIS programs, the need for copyright education, how copyright is taught, and the requirements of those teaching information policy courses. More broadly, the University of Maryland Information School published Re-envisioning the MLS: Findings, Issues and Considerations, the first-year report of a three-year study on the future of the Master of Library Science degree and how we prepare information professionals for their careers. If you already have your master’s degree, don’t feel left out. Look forward to new learning, knowing that not all of the old learning is for naught. The values of librarianship have survived and will continue to be at the heart of what we need to know and do.

The post Envisioning copyright education appeared first on District Dispatch.

Open Knowledge Foundation: Onwards to AbreLatAm 2015: what we learned last year

Fri, 2015-08-07 14:44

This post was co-written by Mor Rubinstein and Neal Bastek. It is cross-posted and available in Spanish at the AbreLatAm blog.

AbreLatAm, for us “gringos”, is magical. Even in an age when everyone is glued to a screen, face-to-face connection is still the strongest connection humans can have; it fosters the trust that can lead to new collaborations and innovations. In the case of Latin America, however, it also creates a family. This feeling creates both a sense of solidarity and security that lets people share and consult about their open data and transparency issues with greater passion and awareness of the challenges and conditions we face daily in our own communities. It is unique, and difficult to replicate. You may not realise it, but in our experience, this feeling is not so common in other parts of the world, where the culture of work is more strict and, with all due respect for our differences, less personal. AbreLatAm is therefore a gift to the movement itself, and not just to those of us lucky enough to attend.

For open data practitioners from outside of America Latina like us, AbreLatAm is a place to learn how communities evolve and how they work together. It is a place for us to listen, deeply. Our command of the Spanish language is not so great (pero es mejor que ayer!), but we don’t need Spanish to feel the atmosphere, see the sparks, and contribute, in English, with hand gestures to amplify the event. We try hard to understand the context and the words (and are grateful for the support we have from patient translators!) and to understand the unique problems in the region: for example, the high levels of corruption, the low levels of trust in government, and the highest rates of inequality in the world. Other problems, however, are universal, and we should all examine how to solve them together. The question is how?

The Open Knowledge Network has gained tremendous inspiration from AbreLatAm. What appeared early on as a good opportunity to promote the Global Open Data Index and build connections with the Latin American community has become so much more — a fertile ground for sharing and feedback. Some of the processes that we are doing now in this year’s Index, such as our methodology consultation and datasets selections, were the direct result of our participation in AbreLatAm last year.

Neal and Mor promoting the Index at last year’s AbreLatAm

We are very excited to see what we will learn this year. As AbreLatAm matures, it also receives more attention and attracts more participants. AbreLatAm was, and still is, a pioneering community participatory event. The challenges now are about scaling, and it is a mirror to similar challenges around the globe. How can we harness the energy of an un-conference with such a vast amount of participants? How can we go from talking and sharing to coordinated global action?

The movement’s ability to scale will only be a success if it’s rooted in community-based, citizen-driven needs and not handed down from on high by way of intellectual and academic arguments rooted in a Eurocentric experience. AbreLatAm is an ideal setting for discovering this demand in the Latin American context and matching and adapting it to global practices and experiences that have succeeded elsewhere – be it in the North or the South! Likewise, the LATAM community has much to share in terms of their own experiences and successes, and at Open Knowledge we’re keenly interested in bringing those back to our global network for reflection and consideration.

District Dispatch: The future of the MLS: New report from the University of Maryland

Fri, 2015-08-07 13:17


Last summer, the iSchool at the University of Maryland launched the Re-Envisioning the MLS initiative. The premise is that future professionals in library and library-related fields will likely need fundamentally different educational preparation than what is provided by current curricula. Based on an extensive body of research, outreach, and analysis, yesterday the iSchool released its report Re-Envisioning the MLS: Findings, Issues, and Considerations.

The Maryland initiative is important to our work in public policy—particularly through ALA’s Policy Revolution initiative and ALA’s Libraries Transform campaign—as the field needs more professionals with an outward orientation. Fundamentally, the focus of library work is evolving from internal optimization of information resources and systems within a library to collaborative efforts across libraries and with non-library entities. Thus, the role of “policy advocate” becomes a greater part of a librarian’s job, whether that advocacy occurs at the community/local level, regional level, state level, or with a national focus. The Maryland initiative is important enough to me that I’ve served on the iSchool’s MLS Advisory Board during the past year to provide input into the process and this report.

As summarized in the report release:

The findings have a number of implications for LIS education and MLS programs, including:

• Attributes of Successful Information Professionals. Successful information professionals are not those who wish to seek a quiet refuge out of the public’s view. They need to be collaborative, problem solvers, creative, socially innovative, flexible and adaptable, and have a strong desire to work with the public.
• Ensure a Balance of Competencies and Abilities. MLS programs need to ensure that students have a range of competencies, but that aptitude needs to be balanced with a progressive attitude (“can do,” “change agent,” “public service”).
  • Re-Thinking the MLS Begins with Recruitment. Neither a love of books nor a love of libraries is enough for the next generation of information professionals. Instead they must thrive on change, embrace public service, and seek challenges that require creative solutions. Attracting students with a strong desire to serve the public is critical.
• Be Disruptive, Savvy, and Fearless. Through creativity, collaboration, innovation, and entrepreneurship, information professionals have the opportunity to disrupt current approaches and practices to existing social challenges. The future belongs to those who are socially innovative, entrepreneurial, and change agents who are bold, fearless, willing to take risks, go “big,” and go against convention.

The report is far from the end point of the initiative, as the next stage focuses on redesign of the curriculum with continued stakeholder engagement and ultimately implementation. And, of course, there is much more in the report than described here; I urge you to take a look. Background materials and other research used to produce the report are available at Feel free to provide comments, either to the University of Maryland folks or to me. I look forward to my continuing collaboration on this excellent initiative.

The post The future of the MLS: New report from the University of Maryland appeared first on District Dispatch.

LITA: If You Build It They Might Not Come

Fri, 2015-08-07 13:00

I’ve felt lately that I am trying to row upstream in getting faculty and students to use our research guides. They have great content, we discuss them in instruction sessions, and we prominently feature them on our webpage. In spite of this, though, they are not used nearly as much as I think they should be.

Licensed under a CC BY-SA 2.0 by Side Wages

This summer, I spent time brainstorming ways to market the guides to increase usage and it hit me that maybe I’m going about the process all wrong. I’m trying to promote a resource to students that is outside the typical resources they use. Our students use the university’s learning management system, Moodle, extensively. It is the way they access courses and communicate with their professors and fellow classmates.

We have integrated links in Moodle directly to the library, but based on our Google Analytics data, students go directly from the library homepage to the databases. They don’t often visit other parts of the website. So instead of rowing upstream, what if we start using Moodle? I’m still brainstorming what this could look like, but here are a few ideas:

  • Enroll students in a library course (I’ve seen this done, but I’m not sure it is the best fit for my institution)
  • Create lessons and pages in Moodle that faculty can import into their own courses
  • Work more closely with the instructional design team to include library resources in the courses

How do you use the LMS to encourage student use of the library?



Shelley Gullikson: Adventures in Information Architecture Part 1: Test what we have

Thu, 2015-08-06 22:51

For a while now, Web Committee has been discussing revamping the information architecture on our library website. There are some good reasons:

  • more than half of our visitors are arriving at the site through a web search and so only have the menu — not the home page — to orient them to what our site is and does
  • the current architecture does not have an obvious home for our growing scholarly communications content
  • the current architecture is rather weak on the connection with the library building, which is a problem because:
    • people are searching the site for content about the building
    • there are more visits to the building than visits to the website

However, we also know that changing the IA is hard. Our students have already told us that they don’t like it when the website changes, so we really want to make sure that any change is a positive one. But that takes time.

And we have a pressing need to do something soon. The Library will be opening a new Ottawa Resource Room in the fall that has related web content, and we can’t decide where that content fits. So: user testing! Maybe our users can see something we don’t in our current IA. (Spoiler: they can’t)

We did guerrilla-style testing with a tablet, asking people to show us how they would find:

  • information relating to Ottawa (we asked what program they were in to try to make it relevant; for example we asked the Child Studies major about finding information related to child welfare in specific Ottawa neighbourhoods)
  • information about the Ottawa Room
  • (for another issue) how they would get help with downloading ebooks

As an aside: We’re not so naive as to think that students use the library website for all of their information needs. We made a point of asking them where on the library website they would go because we needed to put the information somewhere on the website. For the ebooks question, we also asked what they would really do if they had problems with ebooks. 6/8 people said they would ask someone at the library. Yup. They’d talk to a real person. Anyway, back to IA…

We talked to 8 different students. For information relating to Ottawa, the majority would do a Summon search. Makes sense. For information about the Ottawa Room itself, the answers were all over the place and nothing was repeated more than twice. So our users weren’t any better than we were at finding a place in our current IA for this information. (Hey, it was worth a try!)

So… we either need to shove the Ottawa Room somewhere, anywhere, in the structure we have or we need to tweak the IA sooner rather than later. So on to Web Committee for discussion and (I hope!) decisions.