Planet Code4Lib

FOSS4Lib Upcoming Events: Hack-A-Way 2015

Wed, 2015-08-05 13:09
Date: Wednesday, November 4, 2015 - 09:00 to Friday, November 6, 2015 - 17:00
Supports: Evergreen

Last updated August 5, 2015. Created by Peter Murray on August 5, 2015.

From the announcement:

After much deliberation, forecasting, and haruspicy, the dates for the 2015 Hack-A-Way in Danvers, MA (just north of Boston) have been selected: November 4th – November 6th.

Library of Congress: The Signal: Mapping the Digital Galaxy: The Keepers Registry Expands its Tool Kit

Wed, 2015-08-05 12:40

This past month, The Keepers Registry released a new version of its website with a suite of significant new features to help its members monitor the archival status of e-journal content. The Library of Congress has been one of the archiving institutions of The Keepers Registry and we thought this was a good time to learn a little bit more about this important initiative. Ted Westervelt, manager of the eDeposit Program for Library Services, interviewed the Keepers Registry team.

The Keepers Registry is a Jisc service at EDINA. It is operated in partnership with the ISSN International Centre in Paris. The Keepers Registry team includes Peter Burnhill, Mary Elder, Adam Rusbridge, Tim Stickland, Lisa Otty and Steven Davies from EDINA and Gaelle Bequet, Pierre Godefroy and their colleagues from ISSN IC.

Ted: On its website, the Keepers Registry has a clear statement of purpose. Can you elaborate on the background of the Registry and how this was developed? What is the significance of this project for the digital preservation community?

Keepers team: The purpose of the Keepers Registry is to act as a lens onto the activities of archiving agencies around the world. It enables librarians and information managers to discover which journals and continuing resources are being looked after, by whom and how. Now funded as a Jisc service, the Registry was developed and is operated jointly by EDINA (University of Edinburgh) and the ISSN International Centre. This followed a project called PEPRS (Piloting an E-journals Preservation Registry Service), in which CLOCKSS, LOCKSS and Portico were active partners alongside the Koninklijke Bibliotheek (KB, Netherlands) and the British Library (UK).

The importance of digital preservation for e-journal content has been widely recognized in the US and the UK, but until the provision of the Keepers Registry there was no systematic and easily accessible information (PDF) about the arrangements in place for individual journals. The Keepers Registry also records information on issues and volumes, not just serial titles.

Old College, University of Edinburgh, UK by user Sue Hongjia via Wikimedia Commons

The first version of the online Registry was launched in 2011. It was included in the portfolio of Jisc services at EDINA in 2013-14 shortly after the Library of Congress joined as a ‘Keeper.’

We continue to develop the Keepers Registry with a special injection of funding from Jisc in the Keepers Extra project. Now halfway through this two-year project we will host an open conference in Edinburgh later this year on September 7th, Taking the Long View. This will provide a showcase for each Keeper, including the Library of Congress, and bring them together to address the international dimensions of the challenge of increasing preservation coverage.

Ted: One of the goals of the Keepers Registry is to highlight e-journals that are at risk of loss. What specific actions on its part does the Keepers Registry think are most useful for accomplishing this?

Keepers team: We have just launched a new Title List Comparison feature that makes the process of discovering what is at risk of loss much quicker and easier. This enables a librarian (or a publisher) to upload lists of titles with ISSNs and produce a report on which titles are being ‘ingested with archival intent’ and which are not known to be in safe hands. Archiving agencies can also use the Title List Comparison facility to find gaps in their collections and then take action to fill them, as well as to inform local collection management decisions.

As noted, the Keepers Registry records information on issues and volumes, not just serial titles. This is important when archiving digital content from serials and other continuing resources. Recording the parts of serials is a demanding aspect for the keepers and the Registry.

We hope that this will encourage cooperation between the Keepers organizations as they assess how to ensure complete coverage. We are very interested in how we can help the library community understand their collections better, and how to empower the community to take action.

Ted: The list of archiving agencies that are part of the Keepers Registry is a substantial one. Could you tell us how you went about choosing them? To what extent are you soliciting the participation of other archiving agencies?

Keepers team: Thus far we do not have a formal definition of scope for who should and who should not be recognized as a ‘Keeper’ within the Keepers Registry. We know that is needed at some point. Instead, we have the yardstick that the Keepers Registry should report on the activities of organizations and initiatives that have ‘archival intent.’ This helps us decide that some operations geared towards providing current online access but without ‘archival intent’ are not on our target list.

We are actively seeking participation from two types of archiving organizations. The first are those which have a national mission to keep content over the long term. Often this means national libraries with active programs for archiving digital content published as serials and other online continuing resources. Countries vary on how this mission is allocated. The second are consortia of research libraries, again in different countries, who work together to create archival arrangements. The motive is to have such memory organizations declare what each is looking after and how. Researchers in any one country are dependent upon what is published, and archived, in other countries.

As part of the Keepers Extra project, we are also building on an idea once mooted by KB Netherlands, developing a set of recommendations for a safe places network. This is thought of as a global grouping that can advocate for increased preservation coverage of online continuing resources. Bringing new Keepers on board will be crucial to making progress. We would encourage any archives, libraries or repositories that would be interested in participating to get in touch with us.

Ted: Most of the participating agencies are Western European or North American, which makes sense given the origins of the Keepers Registry. How actively are you looking at adding members from other parts of the world?

Cyberspace by user jstll via Flickr

Keepers team: We have become well-traveled in our quest! The initial focus was on the UK, Europe and the USA, as it was much easier for us to encourage participation from agencies in Europe and the US where we had existing contacts and relationships with many of the original agencies.

However, we are very conscious of the need for more international participation, as mentioned earlier. There is now engagement with China and Canada, and active outreach to India and Brazil, as well as more countries across Europe.

National and regional activity is crucial for effective digital preservation of the scholarly and cultural record. Archival agencies rightly have collection policies, with a mission to collect what is important to their communities. Space and resources are finite, so archiving involves making choices about the value of material and assigning priority of effort. But what is regarded as less valuable in one country may be recognized as valuable to another, which is why we are interested in including international archiving agencies as Keepers. Researchers in any one country are dependent upon what is published, and archived for the long term, in other countries.

Ted: You are just now releasing a new version of the Keepers Registry, with some interesting new functionalities. One of these is the Title List Comparison, which you mentioned above. Who do you hope will use this and what do you hope this will do for them and for the mission of the Keepers Registry in general?

Keepers team: We anticipate that the Title List Comparison facility will prove very popular. It complements the simple search facility on the home page, enabling library staff to obtain archival information for a list of titles identified by ISSN. The feature was in test mode for a while, during which we found the right balance in reporting archival status. The Title List Comparison facility should give libraries insight into the archival status of their collections in order to assist informed decision making about subscriptions, cancellations and print rationalization.

We hope that the tool will also improve communication between the library community and the Keeper organizations themselves, as libraries make known their priorities for the serial titles that they discover are not being kept safe.  The Title List Comparison service is part of our Members Services; access to these requires membership, which is free of charge.

Ted: Another major functionality in the new release is the Machine to Machine Interfaces. Who do you expect will use this? What outcome would you like to see from launching this?

Keepers team: Librarians interact on a daily basis with a wide range of services and tools for serials. We want the information on archiving that we bring together in the Keepers Registry to be available and useful at the point of need – when there is need for a quick reference to make a measured decision. Those machine-to-machine interfaces are there to support linking tools from those other services, such as union catalogs, and even OPACs, as well as vendor platforms. In general, those ‘APIs’ are there so that others can do unimaginable things with our data – so please get in touch!
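
To make the point-of-need lookup concrete, the sketch below shows the general shape of such a machine-to-machine call; the host, path, and parameters are invented placeholders rather than the Registry’s actual API, whose documentation is available to members:

# Purely illustrative: ask which keepers report archival holdings for a given ISSN
$ curl 'https://keepers.example.org/api/archival-status?issn=0028-0836'
# A hypothetical JSON response would list the archiving agencies ("keepers")
# holding the title and the volumes/issues each has ingested.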

DPLA: Building a Stronger Cultural Heritage Network at the DLF Forum 2015

Tue, 2015-08-04 18:00

Headed to the DLF Forum in October 2015? Join us for a half-day workshop learning about DPLA data and how you can use it to inform your practice and make extreme sharing part of your workflow!

The workshop will consist of three sessions, with ample time to network in between.

Collections Data as Research Data: What we know in aggregate

Defining the problem of metadata quality and the opportunities around investing in better data quality, demonstrated by research done by community members.

Discussion aspect to touch on research agenda for/by DPLA, and the opportunity of using DPLA data for research.

Data analysis performed on metadata available through the DPLA API will be a topic of discussion. The data analyzed will include rights statements, subjects, etc.
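
As a rough sketch of the kind of metadata such analysis draws on, a single request to the DPLA API returns JSON item records whose sourceResource blocks carry the subject and rights values mentioned above (the api_key below is a placeholder for your own key, and the query parameters are only examples):

$ curl 'http://api.dp.la/v2/items?q=quilts&page_size=2&api_key=YOUR_API_KEY' | python -m json.tool
# Each record in "docs" includes a "sourceResource" section with fields such as
# "subject" and "rights" -- the elements examined in metadata-quality analyses.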

DPLA Rights work: how better rights statements mean better (re)use.

An overview of the work of the international committee on the development of standardized rights statements for use in our community will be provided. We will discuss the proposed statements and their technical implementation. A project timeline and a demonstration of how the statements will function in the DPLA network and beyond will be provided.

Break out work: Workshop participants will be encouraged to provide feedback on the implementation of rights statements and to identify potential adopters/implementers. Workshop participants will also be encouraged to discuss needed education/tools for implementation.

DPLA Metadata Enrichment work: Better together

We will have a discussion around processes of metadata enrichment, with a specific focus on enrichment processes undertaken by DPLA and its Hubs. This will include a brief review of the DPLA metadata and mapping enrichment processes, and may include brief presentations from others within the DPLA network. We will also have an open discussion of known gaps and current needs, such as the inability to round trip enrichment back to Hubs and partners.

Break out work: Participants will break into groups to discuss existing tools and needs; groups will be encouraged to develop concrete enrichment use cases to inform future work on tools, including the aggregation and enrichment portions of the Hydra in a Box project.

Registration is required and free.


When: Wednesday, October 28, 2015, 1–5pm
Where: Pinnacle Hotel Vancouver Harbourfront – room block information

Harvard Library Innovation Lab: Link roundup August 4, 2015

Tue, 2015-08-04 16:22

This is the good stuff.

The new importance of ‘social listening’ tools – Columbia Journalism Review

Newsrooms’ growing use of social media listening tools to uncover and break stories

How screens make us feel – Columbia Journalism Review

Reading on a screen prompted equal emotional engagement and recollection of details as reading on paper

The History (and Many Looks) of the Penguin Books Logo | Mental Floss

Penguin Books logo through the years

Can Superman Get Sued for Trashing Metropolis? | WIRED

Not even Superman is above the law

ACRL TechConnect: Collaborative UX Testing: Cardigans Included

Tue, 2015-08-04 14:00

Understanding and responding to user needs has always been at the heart of librarianship, although in recent years this has taken a more intentional approach through the development of library user experience positions and departments. Such positions are a mere fantasy, though, for many smaller libraries, where librarian teams of three or four run the entire show. For the twenty-three member libraries of the Private Academic Library Network of Indiana (PALNI) consortium this is regularly the case, with each school having an average of four librarians on staff. However, by leveraging existing collaborative relationships, utilizing recent changes in library systems and consortium staffing, and (of course) picking up a few new cardigans, PALNI has begun studying library user experience at scale with a collaborative usability testing model.

With four library testing locations spread over 200 miles in Indiana, multiple facilitators conducted testing of the consortial discovery product, OCLC’s WorldCat Discovery. Using WebEx to record participants’ screens and project the sessions into a library staff observation room, we had 30 participants complete three general tasks with multiple parts, helping us assess user needs and participant behavior.

There were clear advantages to collaborative testing over the traditional, siloed approach, shown most obviously in the amount and type of data we received. The most important opportunity was the ability to test different setups of the same product. This type of comparative data led to conclusive setup recommendations and showed which problems were unique to individual institutions versus general user problems. The chance to test at multiple schools also provided a lot more data, which reduced the likelihood of testing only outliers.

The second major advantage of collaborative testing was the ability to work as a team. From a physical standpoint, working as a team allowed us to spread the testing out, keeping it fresh in our minds and giving us enough time in between to fix scripts and materials. This also allowed us to test before and after technical upgrades. From a relational perspective, sharing the work and offering continual support reduced burnout during the testing. When it came to analyzing the data, different people brought different skill sets. Our particular team consisted of a graphic/interface designer, a sympathetic ear, and a master editor, all of whom played important roles when it came to analyzing and writing the report. Simply put, it was an enjoyable experience which resulted in valuable, comparative data – one that could not have happened if the libraries had taken a siloed approach.

When we were designing our test, we met with Arnold Arcolio, a User Researcher in OCLC’s User Experience and Information Architecture Group. He gave us many great pieces of advice. Some of them we found to work well in our testing, while others we rejected. The most valuable piece of advice he gave us was to start with the end in mind. Make sure you have clear objectives for what data you are trying to obtain. If you leave your objectives open ended, you will spend the rest of your life reviewing the data and learning interesting things about your users every time.

He recommended, and we decided:

  • Test at least two users of the same type, to help avoid outliers. For us, that meant testing at least two first-year students and two seniors.
  • Test users on their own devices. We found this to be impractical for our purposes, as all devices used for testing had to have web conferencing software that allowed us to record users’ screens.
  • Have the participants read the tasks out loud. A technique that we used and recommend as well.
  • Use low-tech solutions for our testing, rather than expensive software and eye tracking software. This was a huge relief to PALNI’s executive director, who manages our budget.
  • Test participants where they would normally do their research, in dorm rooms, faculty offices, etc. We did not take this recommendation due to time and privacy concerns.
  • He was very concerned about our use of multiple facilitators. We standardized our testing as much as possible. First, we chose uniforms for our facilitators. Being librarians, the obvious choice was cardigans. We ordered matching, logoed cardigans from Lands’ End and wore those to conduct our testing. This allowed us to look as similar as possible and avoid skewing participants’ impressions. We chose cardigans in blue because color theory suggests that blue persuades the participants to trust the facilitator while feeling calm and confident. We also worked together to create a very detailed script that was used by each facilitator for each test.

Our next round of usability testing will incorporate many of the same recommendations provided by our usability expert, discussed above, with a few additions and changes. This fall, we will be including a mobile device portion using a camera mount (Mr. Tappy) to record the screen, testing different tasks, and working with different libraries. Our libraries’ staff also recommended making the report more action-oriented, with best setup practices and highlighted instructional needs. We are also developing a list of common solutions for participant problems, such as when to redirect or correct misspellings. Finally, as much as we love the cardigans, we will be wearing matching logoed polos underneath for those test rooms that mirror the climate of the Sahara Desert.

We have enjoyed our usability experiences immensely; it is a great chance to visit with library staff, faculty, and students from other institutions in our consortium. Working collaboratively proved to be a success in our consortium, where smaller libraries, short staffing, and minimal resources made it otherwise impossible to conduct large-scale usability testing. Plus, we welcome having another cardigan in our wardrobe.

More detailed information about our Spring 2015 study can be found in our report, “PALNI WorldCat Discovery Usability Report.”

About our guest authors:

Eric Bradley is Head of Instruction and Reference at Goshen College and an Information Fluency Coordinator for PALNI.  He has been at Goshen since 2013.  He does not moonlight as a Mixed Martial Arts fighter or Los Angeles studio singer.

Ruth Szpunar is an Instruction and Reference Librarian at DePauw University and an Information Fluency Coordinator for PALNI. She has been at DePauw since 2005. In her spare time she can be found munching on chocolate or raiding the aisles at the Container Store.

Megan West has been the Digital Communications Manager at PALNI since 2011. She specializes in graphic design, user experience, project management and has a strange addiction to colored pencils.

SearchHub: Solr 5’s new ‘bin/post’ utility

Tue, 2015-08-04 00:11
Series Introduction

This is the first in a three part series demonstrating how it’s possible to build a real application using just a few simple commands.  The three parts to this are:

  • Getting data into Solr using bin/post
  • Visualizing search results: /browse and beyond
  • Putting it together realistically: example/files – a concrete useful domain-specific example of bin/post and /browse
Introducing bin/post: a built-in Solr 5 data indexing tool

In the beginning was the command-line… As part of the ease of use improvements in Solr 5, the bin/post tool was created to allow you to more easily index data and documents. This article illustrates and explains how to use this tool.

For those (pre-5.0) Solr veterans who have most likely run Solr’s “example”, you’ll be familiar with post.jar, under example/exampledocs. You may have only used it when firing up Solr for the first time, indexing example tech products or book data. Even if you haven’t been using post.jar, give this new interface a try, even if just for the occasional sending of administrative commands to your Solr instances. See below for some interesting simple tricks that can be done using this tool.

Let’s get started by firing up Solr and creating a collection:

$ bin/solr start
$ bin/solr create -c solr_docs

The bin/post tool can index a directory tree of files, and the Solr distribution has a handy docs/ directory to demonstrate this capability:

$ bin/post -c solr_docs docs/
java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar -Dauto=yes -Dc=solr_docs -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool /Users/erikhatcher/solr-5.3.0/docs/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/solr_docs/update...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, max depth=999, delay=0s
Indexing directory /Users/erikhatcher/solr-5.3.0/docs (3 files, depth=0)
. . .
3575 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/solr_docs/update...
Time spent: 0:00:30.705

30 seconds later we have Solr’s docs/ directory indexed and available for searching. Foreshadowing the next post in this series, check out http://localhost:8983/solr/solr_docs/browse?q=faceting to see what you’ve got.

Is there anything bin/post can do that clever curling can’t do? Not a thing, though you’d have to iterate over a directory tree of files or do web crawling and parsing out links to follow for entirely comparable capabilities. bin/post is meant to simplify the (command-line) interface for many common Solr ingestion and command needs.

Usage

The tool provides solid -h help, with the abbreviated usage specification being:

$ bin/post -h
Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
    or post -help
  collection name defaults to DEFAULT_SOLR_COLLECTION if not specified
...

See the full bin/post -h output for more details on parameters and example usages. A collection, or URL, must always be specified with -c (or by DEFAULT_SOLR_COLLECTION set in the environment) or -url. There are parameters to control the base Solr URL using -host, -port, or the full -url. Note that when using -url it must be the full URL, including the core name all the way through to the /update handler, such as -url http://staging_server:8888/solr/core_name/update.

Indexing “rich documents” from the file system or web crawl

File system indexing was demonstrated above, indexing Solr’s docs/ directory which includes a lot of HTML files. Another fun example is to index your own documents folder like this:

$ bin/solr create -c my_docs
$ bin/post -c my_docs ~/Documents

There’s a constrained list of file types (by file extension) that bin/post will pass on to Solr, skipping the others. bin/post -h provides the default list used.

To index a .png file, for example, set the -filetypes parameter: bin/post -c test -filetypes png image.png. To not skip any files, use “*” for the filetypes setting: bin/post -c test -filetypes "*" docs/ (note the double-quotes around the asterisk, otherwise your shell may expand that to a list of files and not operate as intended). Browse and search your own documents at http://localhost:8983/solr/my_docs/browse

Rudimentary web crawl

Careful now: crawling web sites is no trivial task to do well. The web crawling available from bin/post is very basic, single-threaded, and not intended for serious business. But it sure is fun to be able to fairly quickly index a basic web site and get a feel for the types of content processing and querying issues to face as a production scale crawler or other content acquisition means are in the works:

$ bin/solr create -c site
$ bin/post -c site -recursive 2 -delay 1 # (this will take some minutes)

Web crawling adheres to the same content/file type filtering as the file crawling mentioned above; use -filetypes as needed. Again, check out /browse; for this example try http://localhost:8983/solr/site/browse?q=revolution

Indexing CSV (column/delimited) files

Indexing CSV files couldn’t be easier! It’s just this, where data.csv is a standard CSV file:

$ bin/post -c collection_name data.csv

CSV files are handed off to the /update handler with the content type of “text/csv”. It detects it is a CSV file by the .csv file extension. Because the file extension is used to pick the content type and it currently only has a fixed “.csv” mapping to text/csv, you will need to explicitly set the content type like this if the file has a different extension:

$ bin/post -c collection_name -type text/csv data.file

If the delimited file does not have a first line of column names, some columns need excluding or name mapping, the file is tab rather than comma delimited, or you need to specify any of the various options to the CSV handler, the -params option can be used. For example, to index a tab-delimited file, set the separator parameter like this:

$ bin/post -c collection_name data.tsv -type text/csv -params "separator=%09"

The key=value pairs specified in -params must be URL encoded and ampersand separated (tab is URL encoded as %09). If the first line of a CSV file is data rather than column names, or you need to override the column names, you can provide the fieldnames parameter, setting header=true if the first line should be ignored:

$ bin/post -c collection_name data.csv -params "fieldnames=id,foo&header=true"

Here’s a neat trick you can do with CSV data: add a “data source”, or some type of field to identify which file or data set each document came from. Add a literal.<field_name>= parameter like this:

$ bin/post -c collection_name data.csv -params "literal.data_source=temp"

Provided your schema allows for a data_source field to appear on documents, each file or set of files you load gets tagged to some scheme of your choosing, making it easy to filter, delete, and operate on that data subset. Another literal field name could be the filename itself; just be sure that the file being loaded matches the value of the field (it’s easy to up-arrow and change one part of the command-line but not another that should be kept in sync).

Indexing JSON

If your data is in Solr JSON format, it’s just bin/post -c collection_name data.json. Arbitrary, non-Solr, JSON can be mapped as well.
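
For reference, the exam-grade JSON that the next command splits apart nests an array of exams inside each student record. The field names below are taken from the mapping parameters in that command; the values themselves are only illustrative:

$ cat grades.json
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {"subject": "Maths", "test": "term1", "marks": 90},
    {"subject": "Biology", "test": "term1", "marks": 86}
  ]
}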

Using the exam grade data and example from here, the splitting and mapping parameters can be specified like this:

$ bin/post -c collection_name grades.json -params "split=/exams&f=first:/first&f=last:/last&f=grade:/grade&f=subject:/exams/subject&f=test:/exams/test&f=marks:/exams/marks&json.command=false"

Note that json.command=false had to be specified so the JSON is interpreted as data, not as potential Solr commands.

Indexing Solr XML

Good ol’ Solr XML, easy peasy: bin/post -c collection_name example/exampledocs/*.xml. If you don’t know what Solr XML is, have a look at Solr’s example/exampledocs/*.xml files. Alas, there are currently no splitting and mapping capabilities for arbitrary XML using bin/post; use Data Import Handler with the XPathEntityProcessor to accomplish this for now. See SOLR-6559 for more information on this future enhancement.

Sending commands to Solr

Besides indexing documents, bin/post can also be used to issue commands to Solr. Here are some examples:
  • Commit: bin/post -c collection_name -out yes -type application/json -d '{commit:{}}' Note: For a simple commit, no data/command string is actually needed.  An empty, trailing -d suffices to force a commit, like this – bin/post -c collection_name -d
  • Delete a document by id: bin/post -c collection_name -type application/json -out yes -d '{delete: {id: 1}}'
  • Delete documents by query: bin/post -c test -type application/json -out yes -d '{delete: {query: "data_source:temp"}}'

The -out yes echoes the HTTP response body from the Solr request, which generally isn’t any more helpful with indexing errors, but is nice to see with commands like commit and delete, even on success. Commands, or even documents, can be piped through bin/post when -d dangles at the end of the command-line:

# Pipe a commit command
$ echo '{commit: {}}' | bin/post -c collection_name -type application/json -out yes -d

# Pipe and index a CSV file
$ cat data.csv | bin/post -c collection_name -type text/csv -d

Inner workings of bin/post

The bin/post tool is a straightforward Unix shell script that processes and validates command-line arguments and launches a Java program to do the work of posting the file(s) to the appropriate update handler end-point. Currently, SimplePostTool is the Java class used to do the work (the core of the infamous post.jar of yore). Actually post.jar still exists and is used under bin/post, but this is an implementation detail that bin/post is meant to hide.

SimplePostTool (not the bin/post wrapper script) uses the file extensions to determine the Solr end-point to use for each POST. There are three special types of files that POST to Solr’s /update end-point: .json, .csv, and .xml. All other file extensions will get posted to the URL+/extract end-point, richly parsing a wide variety of file types. If you’re indexing CSV, XML, or JSON data and the file extension doesn’t match or isn’t actually a file (if you’re using the -d option) be sure to explicitly set the -type to text/csv, application/xml, or application/json.

Stupid bin/post tricks

Introspect rich document parsing and extraction

Want to see how Solr’s rich document parsing sees your files? Not a new feature, but a neat one that can be exploited through bin/post by sending a document to the extract handler in a debug mode returning an XHTML view of the document, metadata and all. Here’s an example, setting -params with some extra settings explained below:

$ bin/post -c test -params "extractOnly=true&wt=ruby&indent=yes" -out yes docs/SYSTEM_REQUIREMENTS.html
java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=yes -Dout=yes -Dc=test -Ddata=files org.apache.solr.util.SimplePostTool /Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file SYSTEM_REQUIREMENTS.html (text/html) to [base]/extract
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>3},
  ''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="">
<head>
<meta name="stream_size" content="1100"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.html.HtmlParser"/>
<meta name="stream_content_type" content="text/html"/>
<meta name="dc:title" content="System Requirements"/>
<meta name="Content-Encoding" content="UTF-8"/>
<meta name="resourceName" content="/Users/erikhatcher/solr-5.2.0/docs/SYSTEM_REQUIREMENTS.html"/>
<meta name="Content-Type" content="text/html; charset=UTF-8"/>
<title>System Requirements</title>
</head>
<body>
<h1>System Requirements</h1>
...
</body>
</html>
',
  'null_metadata'=>[
    'stream_size',['1100'],
    'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
      'org.apache.tika.parser.html.HtmlParser'],
    'stream_content_type',['text/html'],
    'dc:title',['System Requirements'],
    'Content-Encoding',['UTF-8'],
    'resourceName',['/Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html'],
    'title',['System Requirements'],
    'Content-Type',['text/html; charset=UTF-8']]}
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes...
Time spent: 0:00:00.027

Setting extractOnly=true instructs the extract handler to return the structured parsed information rather than actually index the document. Setting wt=ruby (ah yes! go ahead, try it in json or xml :) and indent=yes allows the output (be sure to specify -out yes!) to render readably in a console.

Prototyping, troubleshooting, tinkering, demonstrating

It’s really handy to be able to test and demonstrate a feature of Solr by “doing the simplest possible thing that will work” and bin/post makes this a real joy. Here are some examples –

Does it match? This technique allows you to easily index data and quickly see how queries work against it. Create a “playground” index and post a single document with fields id, description, and value:

$ bin/solr create -c playground
$ bin/post -c playground -type text/csv -out yes -d $'id,description,value\n1,are we there yet?,0.42'

Unix Note: that dollar-sign before the single-quoted CSV string is crucial for the new-line escaping to pass through properly.  Or one could post the same data but putting the field names into a separate parameter using bin/post -c playground -type text/csv -out yes -params "fieldnames=id,description,value" -d '1,are we there yet?,0.42' avoiding the need for a new-line and the associated issue.

Does it match a fuzzy query?  their~, in the /select request below, is literally a FuzzyQuery, and ends up matching the document indexed (based on string edit distance fuzziness), rows=0 so we just see the numFound and debug=query output:

$ curl 'http://localhost:8983/solr/playground/select?q=their~&wt=ruby&indent=on&rows=0&debug=query'
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>0,
    'params'=>{
      'q'=>'their~',
      'debug'=>'query',
      'indent'=>'on',
      'rows'=>'0',
      'wt'=>'ruby'}},
  'response'=>{'numFound'=>1,'start'=>0,'docs'=>[]
  },
  'debug'=>{
    'rawquerystring'=>'their~',
    'querystring'=>'their~',
    'parsedquery'=>'_text_:their~2',
    'parsedquery_toString'=>'_text_:their~2',
    'QParser'=>'LuceneQParser'}}

Have fun with your own troublesome text, simply using an id field and any/all fields involved in your test queries, and quickly get some insight into how documents are indexed, text analyzed, and queries match.  You can use this CSV trick for testing out a variety of scenarios, including complex faceting, grouping, highlighting, etc often with just a small bit of representative CSV data.
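
For instance, a quick highlighting check against that same playground document might look like the following (a sketch assuming the id/description/value fields from the CSV example above and Solr 5's default data-driven schema):

# Ask for the matching document with the hit in the description field highlighted
$ curl 'http://localhost:8983/solr/playground/select?q=description:there&hl=true&hl.fl=description&wt=json&indent=on'
# The highlighting section of the response wraps the matched term in <em> tags.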

Windows

I’m sorry. But don’t despair. bin/post is a Unix shell script. There’s no comparable Windows command file, like there is for bin/solr. The developer of bin/post is a greybeard Unix curmudgeon and scoffs “patches welcome” when asked where the Windows version is. But don’t despair: before there was bin/post there was post.jar. And there still is post.jar. See the Windows support section of the Reference Guide for details on how to run the equivalent of everything bin/post can do.

Future

What more could you want out of a tool to post content to Solr? Turns out a lot! Here are a few ideas for improvements:
  • For starters, SolrCloud support is needed. Right now the exact HTTP end-point is needed, whereas SolrCloud indexing is best done with ZooKeeper awareness. Perhaps this fits under SOLR-7268.
  • SOLR-7057: Better content-type detection and handling (.tsv files could be considered delimited with separator=%09 for example)
  • SOLR-6994: Add a comparable Windows command file
  • SOLR-7042: Improve bin/post’s arbitrary JSON handling
  • SOLR-7188: And maybe, just maybe, this tool could also be the front-end to client-side Data Import Handler

And no doubt there are numerous other improvements to streamline the command-line syntax and hardiness of this handy little tool.

Conclusion

$ bin/post -c your_collection your_data/

No, bin/post is not necessarily the “right” way to get data into your system considering streaming Spark jobs, database content, heavy web crawling, or other Solr integrating connectors. But pragmatically maybe it’s just the ticket for sending commit/delete commands to any of your Solr servers, or doing some quick tests. And, say, if you’ve got a nightly process that produces new data as CSV files, a cron job to bin/post the new data would be as pragmatic and “production-savvy” as anything else.

Next up…

With bin/post, you’ve gotten your data into Solr in one simple, easy-to-use command. That’s an important step, though only half of the equation. We index content in order to be able to query it, analyze it, and visualize it. The next article in this series delves into Solr’s templated response writing capabilities, providing a typical (and extensible) search results user interface.

The post Solr 5’s new ‘bin/post’ utility appeared first on Lucidworks.

FOSS4Lib Recent Releases: Hydra - 9.2.2

Mon, 2015-08-03 23:44

Last updated August 3, 2015. Created by Peter Murray on August 3, 2015.

Package: Hydra
Release Date: Monday, August 3, 2015

Nicole Engard: Bookmarks for August 3, 2015

Mon, 2015-08-03 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Pydio: The mature open source alternative to Dropbox

Digest powered by RSS Digest

The post Bookmarks for August 3, 2015 appeared first on What I Learned Today....

Related posts:

  1. Students get extra Dropbox space
  2. Project Gutenberg to Dropbox
  3. Open Source Options for Education

FOSS4Lib Upcoming Events: VuFind Summit 2015

Mon, 2015-08-03 19:12
Date: Monday, October 12, 2015 - 08:00 to Tuesday, October 13, 2015 - 17:00
Supports: VuFind

Last updated August 3, 2015. Created by Peter Murray on August 3, 2015.

From the announcement:

Registration is now open for the 2015 VuFind Summit held Monday October 12 and Tuesday October 13, 2015 at Villanova University (in Villanova, PA). Registration will be $45 for two days of events, with breakfast/lunch included. You can register here:

As usual, the event will be a combination of structured talks, planning sessions and free-form hacking.

Evergreen ILS: Hack-A-Way 2015

Mon, 2015-08-03 18:40

After much deliberation, forecasting, and haruspicy, the dates for the 2015 Hack-A-Way in Danvers, MA (just north of Boston) have been selected: November 4th – November 6th.

Lukas Koster: Maps, dictionaries and guidebooks

Mon, 2015-08-03 14:51

Interoperability in heterogeneous library data landscapes

Libraries have to deal with a highly opaque landscape of heterogeneous data sources, data types, data formats, data flows, data transformations and data redundancies, which I have earlier characterized as a “data maze”. The level and magnitude of this opacity and heterogeneity vary with the number of content types and the number of services that the library is responsible for. Academic and national libraries are possibly dealing with more extensive mazes than small public or company libraries.

In general, libraries curate collections of things and also provide discovery and delivery services for these collections to the public. In order to successfully carry out these tasks, they manage a lot of data. Data can be regarded as the signals between collections and services.

These collections and services are administered using dedicated systems with dedicated datastores. The data formats in these dedicated datastores are tailored to perform the dedicated services that these dedicated systems are designed for. In order to use the data for delivering services they were not designed for, it is common practice to deploy dedicated transformation procedures, either manual ones or as automated utilities. These transformation procedures function as translators of the signals in the form of data.

© Ron Zack

Here lies the origin of the data maze: an inextricably entangled mishmash of systems with explicit and implicit data redundancies, using a number of different data formats, some of which talk to each other in some way. This is confusing not only for end users but also for library system staff. End users are unsure which user interfaces to use, and miss relevant results from other sources and possible related information. Libraries need licenses and expertise for ongoing administration, conversion and migration of multiple systems, and suffer unforeseen consequences of adjustments elsewhere.

To take the linguistic analogy further, systems make use of a specific language (data format) to code their signals in. This is all fine as long as they are only talking to themselves. But as soon as they want to talk to other systems that use a different language, translations are needed, as mentioned. Sometimes two systems use the same language (like MARC, DC, EAD), but this does not necessarily mean they can understand each other. There may be dialects (DANMARC, UNIMARC), local colloquialisms, differences in vocabularies and even alphabets (local fields, local codes, etc.). Some languages are only used by one system (like PNX for Primo). All languages describe things in their own vocabulary. In the systems and data universe there are not many loanwords or other mechanisms to make it clear that systems are talking about the same thing (no relations or linked data). And then there is syntax and grammar (such as subfields and cataloguing rules) that allow for lots of variations in formulations and formats.

Translation does not only require applying a dictionary, but also interpretation of the context, syntax, local variations and transcriptions. Consequently much is lost in translation.

The transformation utilities functioning as translators of the data signals suffer from a number of limitations. They translate between two specific languages or dialects only. And usually they are employed by only one system (proprietary utilities). So even if two systems speak the same language, they probably both need their own translator from a common source language. In many cases even two separate translators are needed if source and target system do not speak each other’s language or dialect. The source signals are translated to some common language which in turn is translated into the target language. This export-import scenario, which entails data redundancy across systems, is referred to as ETL (Extract Transform Load). Moreover, most translators only know a subset of the source and target language dependent on the data signals needed by the provided services. In some cases “data mappings” are used as conversion guides. This term does not really cover what is actually needed, as I have tried to demonstrate. It is not enough to show the paths between source and target signals. It is essential to add the selections and transformations needed as well. In order to make sense of the data maze you need a map, a dictionary and a guidebook.

To make things even more complicated, sometimes reading data signals is only possible with a passport or visa (authentication for access to closed data). Or even worse, when systems’ borders are completely closed and no access whatsoever is possible, not even with a passport. Usually, this last situation is referred to with the term “data silos”, but that is not the complete picture. If systems are fully open, but their data signals are coded by means of untranslatable languages or syntaxes, we are also dealing with silos.

Anyway, a lot of attention and maintenance is required to keep this Tower of Babel functioning. This practice is extremely resource-intensive, costly and vulnerable. Are there any solutions available to diminish maintenance, costs and vulnerability? Yes there are.

First of all, it is absolutely crucial to get acquainted with the maze. You need a map (or even an atlas) to be able to see which roads are there, which ones are inaccessible, what traffic is allowed, what shortcuts are possible, which systems can be pulled down and where new roads can be built. This role can be fulfilled by a Dataflow Repository, which presents an up-to-date overview of locations and flows of all content types and data elements in the landscape.

Secondly it is vital to be able to understand the signals. You need a dictionary to be able to interpret all signals, languages, syntaxes, vocabularies, etc. A Data Dictionary describing data elements, datastores, dataflows and data formats is the designated tool for this.

And finally it is essential to know which transformations are taking place en route. A guidebook should be incorporated in the repository, describing selections and transformations for every data flow.

You could leave it there and be satisfied with these guiding tools to help you get around the existing data maze more efficiently, with all its ETL utilities and data redundancies. But there are other solutions that focus on actually tackling or even eliminating the translation problem. Basically we are looking at some type of Service Oriented Architecture (SOA) implementation. SOA is a rather broad concept, but it refers to an environment where individual components (“systems”) communicate with each other in a technology and vendor agnostic way using interoperable building blocks (“services”). In this definition “services” refer to reusable dataflows between systems, rather than to useful results for end users. I would prefer a definition of SOA to mean “a data and utilities architecture focused on delivering optimal end user services no matter what”.

Broadly speaking there are four main routes to establish a SOA-like condition, all of which can theoretically be implemented on a global, intermediate or local level.

  1. Single Store/Single Format: A single universal integrated datastore using a universal data format. No need for dataflows and translations. This would imply some sort of linked (open) data landscape with RDF as universal language and serving all systems and services. A solution like this would require all providers of relevant systems and databases to commit to a single universal storage format. Unrealistic in the short term indeed, but definitely something to aim for, starting at the local level.
  2. Multiple Stores/Shared Format: A heterogeneous system and datastore landscape with a universal communication language (a lingua franca, like English) for dataflows. No need for countless translators between individual systems. This universal format could be RDF in any serialization. A solution like this would require all providers of relevant systems and databases to commit to a universal exchange format. Already a bit less unrealistic.
  3. Shared Store/Shared Format: A heterogeneous system and datastore landscape with a central shared intermediate integrated datastore in a single shared format. Translations from different source formats to only one shared format. Dataflows run to and from the shared store only. For instance with RDF functioning as Esperanto, the artificial language which is actually sometimes used as “Interlingua” in machine translation. A solution like this does not require a universal exchange format, only a translator that understands and speaks all formats, which is the basis of all ETL tools. This is much more realistic, because system and vendor dependencies are minimized, except for variations in syntax and vocabularies. The platform itself can be completely independent.
  4. Multiple Stores/Single Translation Pool: or what is known as an Enterprise Service Bus (ESB). No translations are stored, no data is integrated. Simultaneous point to point translations between systems happen on the fly. Looks very much like the existing data maze, but with all translators sitting together in one cubicle. This solution is not a source of much relief, or as one large IT vendor puts it: “Using an ESB can become problematic if large volumes of data need to be sent via the bus as a large number of individual messages. ESBs should never replace traditional data integration like ETL tools. Data replication from one database to another can be resolved more efficiently using data integration, as it would only burden the ESB unnecessarily.”.

Surveying the possible routes out of the data maze, it seems that the first step should be employing the map, dictionary and guidebook concept of the dataflow repository, data dictionary and transformation descriptions. After that, the only feasible road in the short term is the intermediate integrated Shared Store/Shared Format solution.

Library of Congress: The Signal: The Personal Digital Archiving 2015 Conference

Mon, 2015-08-03 12:13

“Washington Square Park” by Jean-Christophe Benoist. On Wikimedia.

The annual Personal Digital Archiving conference is about preserving any digital collection that falls outside the purview of large cultural institutions. Considering the expanding range of interests at each subsequent PDA conference, the meaning of the word “personal” has become thinly stretched to cover topics such as family history, community history, genealogy and digital humanities.

New York University hosted Personal Digital Archiving 2015 this past April, during a chilly snap over an otherwise perfect Manhattan spring weekend. The event attracted about 150 people, including more students than in the past.

Each year, depending on the conference’s location and the latest events or trends, the PDA audience and topics vary. But the presenters and attendees always share the same core interest: taking action about digital preservation.

PDA conferences glimpse at projects that are often created by citizen archivists, people who have taken on the altruistic task of preserving a digital collection simply because they recognized the importance of the content and felt that someone should step up and save it. PDA conferences are a chance for trained archivists, amateur archivists and accidental archivists to share information about their projects, about their challenges, about what worked and what didn’t, and about lessons learned.

Videos from Day 1 and Day 2 are online at the Internet Archive.

Howard Besser. Photo by Jasmyn Castro.

Howard Besser, professor of Cinema Studies at NYU and founding director of the NYU Moving Image Archiving and Preservation Program, set the tone in his welcome speech by talking about the importance of digital video as evidence and as cultural records, especially regarding eyewitness video of news events.

Keynote speaker Don Perry, a documentary film producer, talked about “What Becomes of the Family Album in the Digital Age?” (PDF) and his work with “Digital Diaspora Family Reunion.” He talked about preserving digital photos and the cultural impact of sharing photos with friends and family. He stressed that this work applies to every individual, family and community, and that everyone should consider the cultural importance of their digital photos. “The value of the artifact – and what we keep trying to tell young people – is that they are the authors of a history in the making,” said Perry. “And that they need to consider the archives that they’re creating as exactly the same kinds of images that filmmakers like us use to make a documentary. People in the future will be looking through their images to try to understand who we are today.”

Panel: Personal Tools and Methods for PDA. Photo by Jasmyn Castro.

Preserving digital photos is always popular at PDA conferences. It is a common interest that binds us together as stakeholders, especially since the advent of mobile phone digital cameras. Presentations related to digital photos included:

The digital preservation of art (material and digital) is quickly emerging as an area of concern and archivist activism. PDA 2015 had these art-related presentations:

There was a noticeable absence of commercial products and digital scrapbooks at PDA 2015. Instead, presentations, workshops and posters shared practical information about projects that used open-source tools:

Another emerging trend at PDA conferences is toward Digital Humanities and Social Sciences research. Some researchers analyzed and pondered human behavior and digital collections, while others compiled data into presentations of historical events. Presentations included:

I wrote in the beginning of this post that the meaning of the word “personal” is getting stretched at PDA conferences; it’s more like the concept of “personal” is expanding. Personal photos mingle with family personal photos to become a larger archive, a family archive. Facebook has spawned a “local history” phenomenon, where members of a community post their personal photos and comments, and the individual personal contributions congeal organically into a community history site. PDA 2015 had several community-related presentations:

Increasingly we hear from colleges and universities, usually — though not exclusively — from their librarians, expressing concern that students and faculty may not be aware of the need to preserve their digital stuff. PDA 2015 hosted a panel, titled “Reflections on Personal Digital Archiving Day on the College Campus” (PDF), comprising representatives from five colleges who spoke about their on-campus outreach experiences:

  • Rachel Appel, Bryn Mawr College
  • Amy Bocko, Wheaton College
  • Joanna DiPasquale, Vassar College
  • Sarah Walden, Amherst College
  • Kevin Powell, Brown University.

We featured a follow-up post on their work for The Signal titled, “Digital Archiving Programming at Four Liberal Arts Colleges.”

A visiting scholar from China, Xiangjun Feng, was scheduled to deliver a presentation on a similar subject — personal digital archiving and students and scholars at her university — but she had to cancel her trip. We put her presentation online, “The Behavior and Perception of Personal Digital Archiving of Chinese University Students.” (PDF)

Howard Besser gave the keynote address on Day 2 of the conference, along with his fellow video-preservation pioneer, Rick Prelinger. It was more like a jam session between two off-beat scholars. Each showed a video clip; Besser showed “Why Archive Video?” and Prelinger showed the infamous “Duck and Cover,” a 1951 public service film aimed at school children that advises them to take shelter under their desks during a nuclear attack.

Other presentations during the conference also touched on video preservation:

The third day of the conference was set aside for hands-on workshops:

  • Courtney Mumma, “Archivematica and AtoM: End-to-End Digital Curation for Diverse Collections”
  • Peter Chan, “Appraise, Process, Discover & Deliver Email”
  • Cal Lee and Kam Woods, “Curating Personal Digital Archives Using BitCurator and BitCurator Access Tools”
  • Yvonne Ng, Marie Lascu and Maggie Schreiner, “Do-It-Yourself Personal Digital Archiving.”

Ng, who is with the human-rights organization Witness, also gave a presentation during the conference titled “Evaluating the Effectiveness of a PDA Resource.” (PDF)

Perhaps the conference is also expanding past the “preservation” part of its name into usage; after all, preservation and access are two sides of the same coin. It’s a pleasure every year to see the new ways that people address access and usability.

We still have yet to hear much from the genealogy community, from community historians and public librarians about preserving family history and community history. The same for the healthcare, medical and personal-health communities, though it’s just a matter of time before they join the conversation.

Cliff Lynch, director of the Coalition for Networked Information, wrote in his essay titled “The Future of Personal Digital Archiving: Defining the Research Agendas,” (published in the book, Personal Archiving), “In the near future, medical records will commonly include genotyping or gene sequencing data, detailed machine-readable medical history records, perhaps prescription or insurance claim information, tests, and imaging. Whether the individual is dead or alive, this is prime material for data mining on a large scale…We could imagine a very desirable — though perhaps currently impossible – future option where an individual could choose to place his or her medical records (before and after death) in a genuinely public research commons, perhaps somewhat like signing up to become an organ donor.”

He also wrote, “…personal collections, and now personal digital archives, are the signature elements that distinguish many of the genuinely great research collections housed in libraries and archives…We need policy discussions about…what organizations should take responsibility for collecting them. This conversation has connection to the evolving mission and strategies not just of national and research libraries, but of local historical societies, public libraries, and similar groups.”

The conversation will continue at Personal Digital Archiving 2016, hosted by the University of Michigan in Ann Arbor.

Terry Reese: MarcEdit Mac Preview Update

Sun, 2015-08-02 23:42

MarcEdit Mac users, a new preview update has been made available. This is getting pretty close to the first “official” Mac release. And for those who may have forgotten, the preview designation will be removed on Sept. 1, 2015.

So what’s been done since the last update? I’ve pretty much completed the last of the work scheduled for the first official release; at this point, all the planned work on the MARC Tools and the MarcEditor functions is done. This release includes the following:

** 1.0.9 ChangeLog

  • Bug Fix: Opening Files — previously, only files with a .mrc extension could be selected; the open dialog can now open multiple file types.
  • Bug Fix: MarcEditor — when resizing the form, the filename in the status can disappear.
  • Bug Fix: MarcEditor — when resizing, the # of records per page moves off the screen.
  • Enhancement: Linked Data Records — Tool provides the ability to embed URI endpoints at the end of 1xx, 6xx, and 7xx fields (a rough sketch of the general idea follows this changelog list).
  • Enhancement: Linked Data Records — Tool has been added to the Task Manager.
  • Enhancement: Generate Control Numbers — globally generates control numbers.
  • Enhancement: Generate Call Numbers/Fast Headings — globally generates call numbers/fast headings for selected records.
  • Enhancement: Edit Shortcuts — added back the tool that enables Record Marking via a comment.
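
Not MarcEdit's own code, but a minimal pymarc sketch of what the Linked Data Records idea amounts to: appending a $0 URI subfield to the end of 1xx, 6xx, and 7xx fields. The authority URI and file names below are placeholders, and a real tool would look each heading up against an authority service before writing anything.

    # Sketch only (not MarcEdit's implementation): append a $0 authority URI to
    # 1xx/6xx/7xx fields that do not already carry one.
    from pymarc import MARCReader, MARCWriter

    PLACEHOLDER_URI = "http://id.loc.gov/authorities/names/nXXXXXXXX"  # placeholder value

    with open("records.mrc", "rb") as marc_in, open("records-linked.mrc", "wb") as marc_out:
        writer = MARCWriter(marc_out)
        for record in MARCReader(marc_in):
            for field in record.get_fields():
                if field.is_control_field():
                    continue  # 00X fields have no subfields
                if field.tag[0] in ("1", "6", "7") and not field.get_subfields("0"):
                    field.add_subfield("0", PLACEHOLDER_URI)
            writer.write(record)
        writer.close()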

Over the next month, I’ll be working to complete four other components prior to the first “official” release on Sept. 1. This means I’m anticipating at least one, maybe two, more large preview releases before Sept. 1, 2015. The four items I’ll be targeting for completion are:

  1. Export Tab Delimited Records Feature — this feature allows users to take MARC data and create delimited files (often for reporting or for loading into a tool like Excel); a rough sketch of the general idea appears after this list.
  2. Delimited Text Translator — this feature allows users to generate MARC records from a delimited file.  The Mac version will not, at least initially, be able to work with Excel or Access data.  The tool will be limited to working with delimited data.
  3. Update Preferences windows to expose MarcEditor preferences
  4. OCLC Metadata Framework integration…specifically, I’d like to re-integrate the holdings work and the batch record download.
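
To make the tab-delimited export concrete, here is a rough sketch, assuming pymarc and Python's csv module, of pulling a few fields out of a MARC file into a .tsv file. The column list and file names are made up for illustration; this is not MarcEdit's implementation.

    # Sketch only: dump a handful of MARC fields to a tab-delimited file.
    # Column choices and file names are illustrative.
    import csv
    from pymarc import MARCReader

    # (tag, subfield) pairs to export; a subfield of None means "whole control field".
    COLUMNS = [("001", None), ("245", "a"), ("100", "a"), ("260", "c")]

    with open("records.mrc", "rb") as marc_in, \
            open("records.tsv", "w", newline="", encoding="utf-8") as tsv_out:
        writer = csv.writer(tsv_out, delimiter="\t")
        writer.writerow([tag if sub is None else f"{tag}${sub}" for tag, sub in COLUMNS])
        for record in MARCReader(marc_in):
            row = []
            for tag, sub in COLUMNS:
                fields = record.get_fields(tag)
                if not fields:
                    row.append("")
                elif sub is None:
                    row.append(fields[0].data)             # control field value
                else:
                    values = fields[0].get_subfields(sub)  # first field, first subfield
                    row.append(values[0] if values else "")
            writer.writerow(row)

Run in the opposite direction (csv.reader feeding new pymarc Record objects), the same two libraries would form the skeleton of a delimited-text-to-MARC translator like the one described in item 2.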

How do you get the preview? If you have the current preview installed, just open the program; as long as you have notifications turned on, the program will notify you that an update is available. Download the update and install the new version. If you don’t have the preview installed, just go to the MarcEdit downloads page and select the Mac app download.

If you have any questions, let me know.


Roy Tennant: The Oldest Internet Publication You’ve Never Heard Of

Fri, 2015-07-31 18:48

Twenty-five years ago I started a library current awareness service called Current Cites. The idea was to have a team of volunteers monitor the library and information technology literature and cite only the best of it in a monthly publication (see the first page of the inaugural issue pictured). Here is the latest issue. TidBITS is, I think, the only Internet publication that is older, and they beat us only by a few months.

Originally, the one-paragraph description accompanying the bibliographic details was intended to summarize the contents. However, we soon allowed each reviewer latitude in using humor and personal insights to provide context and an individual voice.

Although we began publication in print only, for an intended audience of UC Berkeley Library staff, we quickly realized that the audience could be global and that the technologies were emerging to make it freely available to such a worldwide audience. If you’re curious, you can read more about how Current Cites came to be as well as its early history.

Ever since, we have published every month without fail. It has weathered my paternity leave (twins, with one now graduated from college and the other soon to be), the turnover of many reviewers, and several changes of sponsoring organization. We have had only three editors in all that time: David F.W. Robison, Teri Rinne, and myself.

On our 20th anniversary I wrote some of my thoughts about longevity and what contributes to it, which still applies. But then I’ve always been hard to dump, as Library Journal can attest. I’ve been writing for them since 1997.

So please bear with me as I mark this milestone. With only about 3,300 subscribers to the mailing list distribution (we also have an RSS feed and I tweet a link to each issue), we are probably the longest-lived Internet publication you’ve never heard of. Until now.

Here for your edification is the current number of subscribers by country:

United States 2,476; Canada 210; Australia 134; United Kingdom 69; Netherlands 40; New Zealand 33; Spain 32; Germany 28; Italy 26; Taiwan 20;
Sweden 18; Israel 17; Brazil 16; Norway 15; Japan 14; France 13; Belgium 11; ??? 11; India 10; Ireland 10;
South Africa 8; Finland 7; Denmark 6; Portugal 6; Hungary 5; Singapore 5; Switzerland 5; Mexico 4; Peru 4;
Austria 3; Croatia 3; Greece 3; Lebanon 3; Republic of Korea 3; Saudi Arabia 3; United Arab Emirates 3;
Argentina 2; Chile 2; China 2; Colombia 2; Federated States of Micronesia 2; Kazakhstan 2; Lithuania 2; Philippines 2; Poland 2; Slovakia 2; Trinidad and Tobago 2; Turkey 2;
Botswana 1; Czech Republic 1; Estonia 1; Hong Kong 1; Iceland 1; Islamic Republic of Iran 1; Jamaica 1; Malaysia 1; Morocco 1; Namibia 1; Pakistan 1; Qatar 1; Uruguay 1

Harvard Library Innovation Lab: Link roundup July 31, 2015

Fri, 2015-07-31 15:21

This is the good stuff.

The Factory of Ideas: Working at Bell Labs

Technology is cyclical. Timesharing is cloud computing.

The UK National Videogame Arcade is the inspirational mecca that gaming needs | Ars Technica

UK’s National Videogame Arcade is a sort of interactive art installation allowing visitors to tweak and play games

I Can Haz Memento

include the hash tag “#icanhazmemento” in a tweet with a link and a service replies with an archive
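
A hedged guess at what such a service does behind the scenes: query a web archive for the closest saved copy of the tweeted link. The sketch below uses the Internet Archive's Wayback availability API and the requests library; it is not the bot's actual code.

    # Rough sketch of the kind of lookup an "#icanhazmemento"-style bot performs:
    # ask the Wayback Machine for the closest archived copy of a URL.
    import requests

    def closest_snapshot(url):
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url},
            timeout=10,
        )
        resp.raise_for_status()
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest else None

    print(closest_snapshot("http://example.com/"))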

A Graphical Taxonomy of Roller Derby Skate Names

Dewey Decimator or Dewey Decimauler? Hmmm, maybe Scewy Decimal.

The White House’s Alpha Geeks — Backchannel — Medium

Making tech happen inside gov

HangingTogether: Current Cites – the amazing 25th anniversary

Fri, 2015-07-31 15:00

I suspect that a large part of the audience for this blog also subscribes to Current Cites, the “annotated bibliography of selected articles, books, and digital documents on information technology” as the masthead describes it. Those of us who subscribe would describe it as “essential”. Those of us who publish newsletters describe the fact that as of August 2015 it will have been published continuously for twenty-five years as “amazing”. Those of us who know the editor, our pal and colleague, Roy Tennant, describe the feat he has performed as “stunning” and him as “indefatigable”.

And if you are not a subscriber to this essential, amazing, and stunning newsletter, you should be clicking right here. And then you should congratulate Roy in a comment below. Do that right now.



About Jim Michalko

Jim coordinates the OCLC Research office in San Mateo, CA, and focuses on relationships with research libraries and on work that renovates the library value proposition in the current information environment.


Open Knowledge Foundation: Launch of timber tracking dashboard for Global Witness

Fri, 2015-07-31 14:53

Open Knowledge has produced an interactive trade dashboard for anti-corruption NGO Global Witness to supplement their exposé on EU and US companies importing illegal timber from the Democratic Republic of Congo (DRC).

The DRC Timber Trade Tracker consumes open trade data to visualise where in the world Congolese timber is going. The dashboard makes it easy to identify countries that are importing large volumes of potentially illegal timber, and to see where timber shipped by companies accused of systematic illegal logging and of social and environmental abuses ends up.

Global Witness has long campaigned for greater oversight of the logging industry in the DRC, which is home to two thirds of the world’s second-largest rainforest. The industry is mired in corruption, with two of the DRC’s biggest loggers allegedly complicit in the beating and rape of local populations. Alexandra Pardal, campaign leader at Global Witness, said:

We knew that DRC logging companies were breaking the law, but the extent of illegality is truly shocking. The EU and US are failing in their legal obligations to keep timber linked to illegal logging, violence and intimidation off our shop floors. Traders are cashing in on a multi-million dollar business that is pushing the world’s vanishing rainforests to extinction.

The dashboard is part of a long term collaboration between Open Knowledge and Global Witness through which they have jointly created a series of interactives and data-driven investigations around corruption and conflict in the extractives industries.

To read the full report and see the dashboard, go here.

If you work for an organisation that wants to make its data come alive on the web, get in touch with our team.

LibUX: “The User Experience” in Public Libraries Magazine

Fri, 2015-07-31 14:17

Toby Greenwalt asked Amanda and me (um, Michael) to guest-write about the user experience for his Public Libraries Magazine column, The Wired Library. Our writeup was just published online after appearing in print a couple of months ago.

“The Wired Library” in Public Libraries Magazine, vol. 54, no. 3

We were pretty stoked to have an opportunity to jabber outside our usual #libux echo chamber to evangelize a little and rejigger woo-woo ideas about the user experience for real-world use — it’s catching on.

Such user experience is holistic, negatively or positively impacted at every interaction point your patron has with your library. The brand spanking new building loses its glamour when the bathrooms are filthy; the breadth of the collection loses its meaning when the item you drove to the library for isn’t on the shelf; an awesome digital collection just doesn’t matter if it’s hard to access; the library that literally pumps joy through its vents nets a negative user experience when the hump of the doorframe makes it hard to enter with a wheelchair.

The rest of the post has to do with simple suggestions for improving the website, but the big idea stuff is right up top. Knowing what we know about how folks read on the web, we still get to flashbake some neurons even if this is a topic readers don’t care about.

Read “The User Experience” over at Public Libraries Online.

I'm just going to go ahead and take a little credit for the way they referred to THE user experience here.

— Michael Schofield (@schoeyfield) July 22, 2015

I write the Web for Libraries each week — a newsletter chock-full of data-informed commentary about user experience design, including the bleeding-edge trends and web news I think user-oriented thinkers should know.


The post “The User Experience” in Public Libraries Magazine appeared first on LibUX.

Islandora: Meet Your Developer: Will Panting

Fri, 2015-07-31 13:50

A Meet Your Developer double feature this week, as we introduce another instructor for the upcoming Islandora Conference: Will Panting. A Programmer/Analyst at discoverygarden, Inc., Will is a key member of the Committers Group and one of the most stalwart defenders of best practices and backwards compatibility in Islandora. If you adopt a brand new module and it doesn't break anything, you may well have Will to thank.

Please tell us a little about yourself. What do you do when you’re not at work?

I went to UPEI and have a major in Comp Sci and a minor in Business. Before DGI, I had a short stint at the University. As well as all the normal things like friends and family, I spend my spare time developing some personal projects and brewing beer. I've been trying to get my brown recipe right for years now.

How long have you been working with Islandora? How did you get started?

More than four years, as long as I've been with DGI. I had heard about the company through UPEI. I find working on Islandora very rewarding; I think this space is of very real value.

Sum up your area of expertise in three words:

Complete Islandora Stack

What are you working on right now?

A complex migration from a custom application. It's a good one, using most of the techniques we've had to use in the past.

What contribution to Islandora are you most proud of?

I've been in just about every corner of the code base and written tons of peripheral modules and customizations. I think the thing I'm most proud of isn't a thing, but a consistent push for sustainable practice.

What new feature or improvement would you most like to see?

I'm torn between a viewer framework, an XSLT management component, and generic graph traversal hooks: all basic technology that would create greater consistency and speed up development.

What’s the one tool/software/resource you cannot live without?

Box provisioning; absolutely crucial to our rate of development.

If you could leave the community with one message from reading this interview, what would it be?

Commit. Dive deep into the code, let it cut you up, then stitch the wounds and do it again. It's great to see new committers.

FOSS4Lib Recent Releases: Fedora Repository - 4.3.0

Fri, 2015-07-31 12:55

Last updated July 31, 2015. Created by Peter Murray on July 31, 2015.

Package: Fedora Repository
Release Date: Friday, July 24, 2015