Feed aggregator

Code4Lib Journal: SierraDNA – Demonstrating the Usefulness of Direct ILS Database Access

planet code4lib - Tue, 2015-10-20 13:18
Innovative Interfaces’ Sierra(™) Integrated Library System (ILS) brings with it a Database Navigator Application (SierraDNA) - in layman's terms, SierraDNA gives Sierra sites read access to their ILS database. Unlike the closed use cases produced by vendor-supplied APIs, which restrict Libraries to limited development opportunities, SierraDNA enables sites to publish their own APIs and scripts based upon custom SQL code to meet their own needs and those of their users and processes. In this article we give examples showing how SierraDNA can be utilized to improve Library services. We highlight three example use cases which have benefited our users, enhanced online security and improved our back office processes. In the first use case we employ user access data from our electronic resources proxy server (WAM) to detect hacked user accounts. Three scripts are used in conjunction to flag user accounts which are being hijacked to systematically steal content from our electronic resource providers’ websites. In the second we utilize the reading histories of our users to augment our search experience with an Amazon style “People who borrowed this book also borrowed…these books” feature. Two scripts are used together to determine which other items were borrowed by borrowers of the item currently of interest. And lastly, we use item holds data to improve our acquisitions workflow through an automated demand based ordering process. Our explanation and SQL code should be of direct use for adoption or as examples for other Sierra customers willing to exploit their ILS data in similar ways, but the principles may also be useful to non-Sierra sites that wish to enhance security, augment user services or improve back office processes.

Code4Lib Journal: Streamlining Book Requests with Chrome

planet code4lib - Tue, 2015-10-20 13:18
This article starts by examining why a Chrome Extension was desired and how we saw it making the workflow for requesting new items both easier and more accurate. We then go on to outline how we constructed our extension, looking at the folder structure, third party scripts and services that combine to make this achievable. Finally, the article looks at how the extension is regulated and plans for future development.

Code4Lib Journal: Generating Standardized Audio Technical Metadata: AES57

planet code4lib - Tue, 2015-10-20 13:18
Long-term access to digitized audio may be heavily dependent on the quality of technical metadata captured during digitization. The AES57-2011 standard offers a standardized method of documenting fairly comprehensive technical information, but its complexity may be confusing. In an effort to lower the barrier to use, we have developed software that generates valid AES57 files for digitized audio, using output from FITS (File Information Tool Set) and a few fields of information from a tab-delimited spreadsheet. This article will describe the logic used, the fields required, the basic process, applications, and options for further development.

Code4Lib Journal: Topic Space: Rapid Prototyping a Mobile Augmented Reality Recommendation App

planet code4lib - Tue, 2015-10-20 13:18
With funding from an Institute of Museum and Library Services (IMLS) Sparks! Ignition Grant, researchers from the University of Illinois Library designed and tested a mobile recommender app with augmented reality features. By embedding open source optical character recognition software into a “Topic Space” module, the augmented reality app can recognize call numbers on a book in the library and suggest relevant items that are not shelved nearby. Topic Space can also show users items that are normally shelved in the starting location but that are currently checked out. Using formative UX methods, grant staff shaped app interface and functionality through early user testing. This paper reports results of UX testing, presents a redesigned mobile interface, and provides considerations on the future development of personalized recommendation functionality.

Code4Lib Journal: Integration of Library Services with Internet of Things Technologies

planet code4lib - Tue, 2015-10-20 13:18
The SELIDA framework is an integration layer of standardized services that takes an Internet-of-Things approach for item traceability in the library setting. The aim of the framework is to provide tracing of RFID tagged physical items among or within various libraries. Using SELIDA we are able to integrate typical library services—such as checking in or out items at different libraries with different Integrated Library Systems—without requiring substantial changes, code-wise, in their structural parts. To do so, we employ the Object Naming Service mechanism that allows us to retrieve and process information from the Electronic Product Code of an item and its associated services through the use of distributed mapping servers. We present two use case scenarios involving the Koha open source ILS and we briefly discuss the potential of this framework in supporting bibliographic Linked Data.

DuraSpace News: NOW AVAILABLE: Fedora 4.4.0—Progress Towards Key Objectives

planet code4lib - Tue, 2015-10-20 00:00

Winchester, MA  On October 12, 2015 Fedora 4.4.0  was released by the Fedora team. Full release notes are included in this message and are also available on the wiki: This new version furthers several major objectives including:

SearchHub: Stump The Chump: Austin Winners

planet code4lib - Mon, 2015-10-19 21:31

Last week was another great Stump the Chump session at Lucene/Solr Revolution in Austin. After a nice weekend of playing tourist and eating great BBQ, today I’m back at my computer and happy to announce last week’s winners:

I want to thank everyone who participated — either by sending in your questions, or by being there in person to heckle me. But I would especially like to thank the judges and our moderator Cassandra Targett, who had to do all the hard work preparing the questions.

Keep an eye on the Lucidworks YouTube page to see the video once it’s available. And if you can make it to Cambridge, MA next week, make sure to sign up for the October 28th Boston Lucene/Solr MeetUp and hear all about the winning questions, and how I think they stacked up over the past 5 years.

The post Stump The Chump: Austin Winners appeared first on

Nicole Engard: Bookmarks for October 19, 2015

planet code4lib - Mon, 2015-10-19 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Discourse Discourse is the 100% open source discussion platform built for the next decade of the Internet. It works as a mailing list, a discussion forum, and a long-form chat room

Digest powered by RSS Digest

The post Bookmarks for October 19, 2015 appeared first on What I Learned Today....

Related posts:

  1. Library Association Rant
  2. Open Access Day in October
  3. MarkMail: Mailing List Search

SearchHub: Focusing on Search Quality at Lucene/Solr Revolution 2015

planet code4lib - Mon, 2015-10-19 20:03

I just got back from Lucene/Solr Revolution 2015 in Austin on a big high. There were a lot of exciting talks at the conference this year, but one thing that was particularly exciting to me was the focus that I saw on search quality (accuracy and relevance), on the problem of inferring user intent from the queries, and of tracking user behavior and using that to improve relevancy and so on. There were also plenty of great talks on technology issues this week that attack the other ‘Q’ problem – we keep pushing the envelope of what is possible with SolrCloud at scale and under load, are indexing data faster and faster with streaming technologies such as Spark and are deploying Solr to more and more interesting domains. Big data integrations with SolrCloud continue to be a hot topic – as they should since search is probably the most (only?) effective answer to dealing with the explosion of digital information. But without quality results, all the technology improvements in speed, scalability, reliability and the like will be of little real value. Quantity and quality are two sides of the same coin. Quantity is more of a technology or engineering problem (authors like myself that tend to “eschew brevity” being a possible exception) and quality is a language and user experience problem. Both are critical to success where “success” is defined by happy users. What was really cool to me was the different ways people are using to solve the same basic problem – what does the user want to find? And, how do we measure how well we are doing?

Our Lucidworks CTO Grant Ingersoll started the ball rolling in his opening keynote address by reminding us of the way that we typically test search applications by using a small set of what he called “pet peeve queries” that attack the quality problem in piecemeal fashion but don’t come near to solving it. We pat ourselves on the back when we go to production and are feeling pretty smug about it until real users start to interact with our system and the tweets and/or tech support calls start pouring in – and not with the sentiments we were expecting. We need better ways of developing and measuring search quality. Yes, the business unit is footing the bill and has certain standards (which tend to be their pet peeve queries as Grant pointed out) so we give them knobs and dials that they can twist to calm their nerves and to get them off our backs, but when the business rules become so pervasive that they start to take over from what the search engine is designed to do, we have another problem. To be clear, there are some situations where we know that the search engine is not going to get it right so we have to do a manual override. We can either go straight to a destination (using a technique that we call “Landing Pages”) or force what we know to be the best answer to the top – so-called “Best Bets”, which is implemented in Solr using the QueryElevationComponent. However, this is clearly a case where moderation is needed! We should use these tools to tweak our results – i.e. to fix the intractable edge cases, not to fix the core problems.
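To make the “Best Bets” idea a little more concrete, here is a minimal sketch of the request parameters one might send to a handler that has the QueryElevationComponent configured. The parameter names (enableElevation, forceElevation, elevateIds) are the component's request parameters; the query text and the document id are purely hypothetical, and whether elevation is active depends on how your own request handler is set up.

```python
from urllib.parse import urlencode


def best_bet_params(user_query, elevate_ids):
    """Build Solr request parameters that pin known-good documents to the top.

    Assumes the target request handler includes the QueryElevationComponent.
    """
    return {
        "q": user_query,
        "enableElevation": "true",             # turn the component on for this request
        "forceElevation": "true",              # keep elevated docs on top even when sorting
        "elevateIds": ",".join(elevate_ids),   # ids chosen by a human editor
    }


# Hypothetical example: one hand-picked document for one troublesome query.
params = best_bet_params("holiday hours", ["doc-holiday-hours"])
query_string = urlencode(params)
print(query_string)
```

Keeping the list of elevated ids short is the code-level version of the moderation argued for above: it is an override for edge cases, not a ranking strategy.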

This ad-hoc or subjective way of measuring search quality that Grant was talking about is pervasive. The reason is that quality – unlike quantity – is hard to measure. What do you mean by “best”? And we know from our own experience and from our armchair data science-esque cogitations on this, that what is best for one user may not be best for another and this can in fact change over time for a given user. So quality, relevance is “fuzzy”. But what can we do? We’re engineers not psychics dammit! Paul Nelson, the Chief Scientist at Search Technologies, then proceeded to show us what we can do to measure search quality (precision and recall) in an objective (i.e. scientific!) way. Paul gave a fascinating talk showing the types of graphs that you typically see in a nuts-and-bolts talk that tracked the gradual improvement in accuracy over time during the course of search application development. The magic behind all of this is query logs and predictive analytics. So given that you have this data (even if from your previous search engine app) and want to know if you are making “improvements” or not, Paul and his team at Search Technologies have developed a way to use this information to essentially regression test for search quality – pretty cool, huh? Check out Paul’s talk if you didn’t get a chance to see it.
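For readers who want to see what “regression testing for search quality” could look like at its simplest, here is a small sketch. It is not the Search Technologies method (the talk is the place to learn that); it only shows the textbook definitions of precision and recall applied to a set of judged queries. The judgments dictionary and the run_search callable are hypothetical stand-ins for data mined from query logs and for a call into your search engine.

```python
def precision_recall(returned_ids, relevant_ids):
    """Return (precision, recall) for one query."""
    returned = set(returned_ids)
    relevant = set(relevant_ids)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def regression_report(judgments, run_search):
    """Score every judged query against the current search configuration.

    judgments: dict mapping a query string to the ids considered relevant.
    run_search: callable that takes a query and returns a list of result ids.
    """
    return {
        query: precision_recall(run_search(query), relevant)
        for query, relevant in judgments.items()
    }


# Hypothetical data: four results returned, three documents judged relevant.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(p, r)
```

Running the same report before and after a configuration change gives a number to compare, which is the point of moving away from pet peeve queries.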

But look, let’s face it, getting computers to understand language is a hard problem. But rather than throwing up our hands, in my humble opinion, we are really starting to dig into solving this one! The rubber is hitting the road folks. One of the more gnarly problems in this domain is name recognition. Chris Mack of Basis Technologies gave a very good presentation of how Basis is using their suite of language technologies to help solve this. Name matching is hard because there are many ambiguities and alternate ways of representing names and there are many people that share the same name, etc. etc. etc. Chris’s family name is an example of this problem – is it a truck, a cheeseburger (spelled Mac) or a last name? For those of you out there that are migrating from Fast ESP to Solr (a shoutout here to that company in Redmond, Washington for sunsetting enterprise support for Fast ESP – especially on Linux – thanks for all of the sales leads guys! Much appreciated!) – you should know that Basis Technologies (and Search Technologies as well I believe) have a solution for Lemmatization that you can plug into Solr (a more comprehensive way to do stemming). I was actually over at the Basis Tech booth to see about getting a dev copy of their lemmatizer for myself so that we could demonstrate this to potential Fast ESP customers when I met Chris. Besides name recognition, Basis Tech has a lot of other cool things. Their flagship product is Rosette – a world class ontology / rules-based classification engine among other things. Check it out.

Next up on my list was Trey Grainger of CareerBuilder. Trey is leading a team there that is doing some truly outstanding work on user intent recognition and using that to craft more precise queries. When I first saw the title of Trey’s talk “Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine”, I thought that he and his team had scooped me since my own title is very similar – great minds think alike I guess (certainly true in Trey’s case, a little self-aggrandizement on my part here but hey, it’s my blog post so cut me some slack!). What they are basically doing is using classification approaches such as machine learning to build a Knowledge Graph in Solr and then using that at query time to determine what the user is asking for and then to craft a query that brings back those things and other closely related things. The “related to” thing is very important especially in the buzz-word salad that characterizes most of our resumes these days. The query rewrite that you can do if you get this right can slice through noise hits like a hot knife through butter.

Trey is also the co-author of Solr in Action with our own Tim Potter – I am already on record about this wonderful book – but it was cool what Trey did – he offered a free signed copy to the person who had the best tweet about his talk. Nifty idea – wish I had thought of it but, oh yeah, I’d have to write a book first – whoever won, don’t just put this book on your shelf when you get home – read it!

Not to be outdone, Simon Hughes of, Trey’s competitor in the job search sector gave a very interesting talk about how they are using machine learning techniques such as Latent Semantic Analysis (LSA) and Google’s Word2Vec software to do similar things. They are using Lucene payloads in very interesting ways and building Lucene Similarity implementations to re-rank queries – heavy duty stuff that the nuts-and-bolts guys would appreciate too (the code that Simon talked about is open sourced). The title of the talk was “Implementing Conceptual Search in Solr using LSA and Word2Vec”. The keyword here is “implementing” – as I said earlier in this post, we are implementing this stuff now, not just talking about it as we have been doing for too long in my opinion. Simon also stressed the importance of phrase recognition and I was excited to realize that the techniques that Dice is using can feed into some of my own work, specifically to build autophrasing dictionaries that can then be ingested by the AutoPhraseTokenFilter. In the audience with me were Chris Morley of and Koorosh Vakhshoori of who have made some improvements to my autophrasing code that we hope to submit to Solr and github soon.

Nitin Sharma and Li Ding of BloomReach introduced us to a tool that they are working on called NLP4L – a natural language processing tool for Lucene. In the talk, they emphasized important things like precision and recall and how to use NLP techniques in the context of a Lucene search. It was a very good talk but I was standing too near the door – because getting a seat was hard – and some noisy people in the hallway were making it difficult to hear well. That’s a good problem to have as this talk, like the others, was very well attended. I’ll follow up with Nitin and Li because what they are doing is very important and I want to understand it better. Domo Arigato!

Another fascinating talk was by Rama Yannam and Viju Kothuvatiparambil (“Viju”) of Bank of America. I had met Viju earlier in the week as he attended our Solr and Big Data course ably taught by my friend and colleague Scott Shearer. I had been tapped to be a Teaching Assistant for Scott. Cool, a TA, hadn’t done that since Grad School, made me feel younger … Anyway, Rama and Viju gave a really great talk on how they are using open-source natural language processing tools such as UIMA, Open NLP, Jena/SPARQL and others to solve the Q&A problem for users coming to the BofA web site. They are also building/using an Ontology (that’s where Jena and SPARQL come in) which as you may know is a subject near and dear to my heart, as well as NLP techniques like Parts Of Speech (POS) detection.

They have done some interesting customizations on Solr but unfortunately these are proprietary. They were also not allowed to publish this talk by having their slides shared online or the talk recorded. People were taking pictures of the slides with their cell phones (not me, I promise) but were asked not to upload them to Facebook, LinkedIn, Instagram or such. There was also a disclaimer bullet on one of their slides like you see on DVDs – the opinions expressed are the authors’ own and not necessarily shared by BofA – ta da ta dum – lawyerese drivel for we are not liable for ANYTHING these guys say but they’ll be sorry if they don’t stick to the approved script! So you will have to take my word for it, it was a great talk, but I have to be careful here – I may be on thin ice already with BofA legal and at the end of the day, Bank Of America already has all of my money! That said, I was grateful for this work because it will benefit me personally as a BofA customer even if I can’t see the source code. Their smart search knows the difference between when I need to “check my balance” vs when I need to “order checks”. As they would say in Boston – “Wicked Awesome”! One interesting side note here, Rama and Viju mentioned that the POS tagger that they are using works really well for full sentences (on which the models were trained) but less well on sentence fragments (noun phrases) – still not too bad though – about 80%. More on this in a bit. But hey Banks – gotta love it – don’t get me started on ATM fees.

Last but not least (hopefully?) – as my boss Grant Ingersoll is fond of saying – was my own talk where I tried to stay competitive with all of this cool stuff. I had to be careful not to call it a Ted talk because this is a patented trademark and I didn’t want to get caught by the “Ted Police”. Notice that I didn’t use all caps to spell my own name here – they registered that so it probably would have been flagged by the Ted autobots.  But enough about me. First I introduced my own pet peeve – why we should think of precision and recall before we worry about relevance tuning because technically speaking that is exactly what the Lucene engine does. If we don’t get precision and recall right we have created a garbage in – garbage out problem for the ranking engine. I then talked about autophrasing a bit, bringing out my New York – Big Apple demo yet again. I admitted that this is a toy problem but it does show that you can absolutely nail the phrase recognition and synonym problem which brings precision and recall to 100%. Although this is not a real world problem, I have gotten feedback that autophrasing is currently solving production problems, which is why Chris and Koorosh (mentioned above) needed to improve the code over my initial hack, for their respective dot-coms.

The focus of my talk then shifted to the work I have been doing on Query Autofiltering where you get the noun phrases from the Lucene index itself courtesy of the Field Cache (and yes Hoss, uh Chump, it works great, is less filling than some other NLP techniques – and there is a JIRA: SOLR-7539, take a look). This is more useful in a structured data situation where you have string fields with noun phrases in them. Autophrasing is appropriate for Solr text fields (i.e. tokenized / analyzed fields) so the techniques are entirely complementary. I’m not going to bore you with the details here since I have already written three blog posts on this but I will tell you that the improvements I have made recently will impel me to write a fourth installment – (hey, maybe I can get a movie deal like the guy who wrote The Martian which started out as a blog … naaaah, his was techy but mine is way too techy and it doesn’t have any NASA tie ins … )

Anyway, what I am doing now is adding verb/adjective resolution to the mix. The Query Autofiltering stuff is starting to resemble real NLP now so I am calling it NLP-Lite. “Pseudo NLP”, “Quasi-NLP” and “query time NLP” are also contenders. I tried to do a demo on this (which was partially successful) using a Music Ontology I am developing where I could get the questions “Who’s in The Who” and “Beatles songs covered by Joe Cocker” right, but Murphy was heavily on my case so I had to move on because the “time’s up” enforcers were looming and I had a plane to catch. I should say that the techniques that I was talking about do not replace classical NLP – rather we (collectively speaking) are using classic NLP to build knowledge bases that we can use on the query side with techniques such as query autofiltering. That’s very important and I have said this repeatedly – the more tools we have, the better chance we have of finding the right one for a given situation. POS tagging works well on full sentences and less well on sentence fragments, where the Query Autofilter excels. So it’s “front-end NLP” – you use classic NLP techniques to mine the data at index time and to build your knowledge base, and you use this type of technique to harvest the gold at query time. Again, the “knowledge base” as Trey’s talk and my own stressed can be the Solr/Lucene index itself!

Finally, I talked about some soon-to-be-published work I am doing on auto suggest. I was looking for a way to generate more precise typeahead queries that span multiple fields which the Query Autofilter could then process. I discovered a way to use Solr facets, especially pivot facets to generate multi-field phrases and regular facets to pull context so that I could build a dedicated suggester collection derived from a content collection. (whew!!) The pivot facets allow me to turn a pattern like “genre,musician_type” into “Jazz Drummers”, “Hard Rock Guitarists”, “Classical Pianists”, “Country Singers” and so on. The facets enable me to then grab related information to the subject so if I do a pivot pattern like “name,composition_type” to generate suggestions like “Bob Dylan Songs”, I can pull back other related things to Bob Dylan such as “The Band” and “Folk Rock” that I can then use to create user context for the suggester. Now, if you are searching for Bob Dylan songs, the suggester can start to boost them so that song titles that would normally be down the list will come to the top.
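As a rough illustration of the pivot facet step, the sketch below walks the facet_pivot section of a Solr JSON response and joins each outer value with each nested value to form a suggestion phrase. The field names come from the pattern mentioned above; the sample response and its counts are invented for the example, and the real suggester collection surely carries more context than this.

```python
def phrases_from_pivot(pivot_entries):
    """Turn two-level pivot facet entries into (phrase, count) pairs."""
    phrases = []
    for outer in pivot_entries:
        # Each outer entry may carry a nested "pivot" list for the second field.
        for inner in outer.get("pivot", []):
            phrase = f"{outer['value']} {inner['value']}"
            phrases.append((phrase, inner["count"]))
    return phrases


# Invented sample shaped like a response to facet.pivot=genre,musician_type
sample_response = {
    "facet_counts": {
        "facet_pivot": {
            "genre,musician_type": [
                {"field": "genre", "value": "Jazz", "count": 12, "pivot": [
                    {"field": "musician_type", "value": "Drummers", "count": 5},
                    {"field": "musician_type", "value": "Pianists", "count": 7},
                ]},
                {"field": "genre", "value": "Classical", "count": 3, "pivot": [
                    {"field": "musician_type", "value": "Pianists", "count": 3},
                ]},
            ]
        }
    }
}

pivot = sample_response["facet_counts"]["facet_pivot"]["genre,musician_type"]
suggestions = phrases_from_pivot(pivot)
print(suggestions)
```

Each phrase could then be indexed as a document in a dedicated suggester collection, with the count available as a simple popularity signal.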

This matches a spooky thing that Google was doing while I was building the music ontology – after a while, it would start to suggest long song titles with just two words entered if my “agenda” for that moment was consistent. So if I am searching for Beatles songs for example, after a few searches, typing “ba” brings back (in the typeahead) “Baby’s In Black” and “Baby I’m a Rich Man” above the myriad of songs that start with Baby as well as everything else in their typeahead dictionary starting with “ba”. WOW – that’s cool – and we should be able to do that too! (i.e., be more “Google-esque” as one of my clients put it in their Business Requirements Document) I call it “On-The-Fly Predictive Analytics” – as we say in the search quality biz – it’s ALL about context!

I say “last but not least” above, because for me, that was the last session that I attended due to my impending flight reservation. There were a few talks that I missed for various other reasons (there was a scheduling conflict, my company made me do some pre-sales work, I was wool gathering or schmoozing/networking, etc) where the authors seem to be on the same quest for search quality. Talks like “Nice Docs Finish First” by Fiona Condon at Etsy, “Where Search Meets Machine Learning” by folks at Verizon, “When You Have To Be Relevant” by Tom Burgmans of Wolters-Kluwer and “Learning to Rank” by those awesome Solr guys at Bloomberg – who have got both ‘Qs’ working big time!

Since I wasn’t able to attend these talks and don’t want to write about them from a position of ignorance, I invite the authors (or someone who feels inspired to talk about it) to add comments to this post so we can get a post-meeting discussion going here. Also, any author that I did mention who feels that I botched my reporting of their work should feel free to correct me. And finally, anybody who submitted on the “Tweet about Trey’s Talk and Win an Autographed Book” contest is encouraged to re-tweet – uh post, your gems here.

So, thanks for all the great work on this very important search topic. Maybe next year we can get Watson to give a talk so we can see what the computers think about all of this. After all, Watson has read all of Bob Dylan’s song lyrics so he (she?) must be a pretty cool dude/gal by now. I wonder what it thinks about “Stuck Inside of Mobile with the Memphis Blues Again”? To paraphrase the song, yes Mama, this is really the end. So, until we meet again at next year’s Revolution, Happy searching!

The post Focusing on Search Quality at Lucene/Solr Revolution 2015 appeared first on

Stuart Yeates: Thoughts on the NDFNZ wikipedia panel

planet code4lib - Mon, 2015-10-19 18:24

Last week I was on an NDFNZ wikipedia panel with Courtney Johnston, Sara Barham and Mike Dickison. Having reflected a little and watched the youtube at I've got some comments to make (or to repeat, as the case may be).

Many people, apparently including Courtney, seemed to get the most enjoyment out of writing the ‘body text’ of articles. This is fine, because the body text (the core textual content of the article) is the core of what the encyclopaedia is about. If you can’t be bothered with wikiprojects, categories, infoboxes, common names and wikidata, you’re not alone and there’s no reason you need to delve into them to any extent. If you start an article with body text and references that’s fine; other people will to a greater or lesser extent do that work for you over time. If you’re starting a non-trivial number of similar articles, get yourself a prototype which does most of the stuff for you (I still use which I wrote for doing New Zealand women academics). If you need a prototype like this, feel free to ask me.

If you have a list of things (people, public art works, exhibitions) in some machine readable format (Excel, CSV, etc) it’s pretty straightforward to turn them into a table like or Send me your data and what kind of direction you want to take it.

If you have a random thing that you think needs a Wikipedia article, add to  if you have a hundred things that you think need articles, start a subpage, a la and both completed projects of mine.

Sara mentioned that they were thinking of getting subject matter experts to contribute to relevant wikipedia articles. In theory this is a great idea and some famous subject matter experts contributed to Britannica, so this is well-established ground. However, there have been some recent wikipedia failures particularly in the sciences. People used to ground-breaking writing may have difficulty switching to a genre where no original ideas are permitted and everything needs to be balanced and referenced.

Preparing for the event, I created a list of things the awesome Dowse team could do as follow-ups to their craft artists work, but we never got to that in the session, so I've listed them here:
  1. [[List of public art in Lower Hutt]] Since public art is out of copyright, someone could spend a couple of weeks taking photos of all the public art and creating a table with clickable thumbnail, name, artist, date, notes and GPS coordinates. Could probably steal some logic from somewhere to make the table convertible to a set of points inside a GPS for a tour.
  2. Publish from their archives a complete list of every exhibition ever held at the Dowse since founding. Each exhibition is a shout-out to the artists involved and the list can be used to check for potentially missing wikipedia articles.
  3. Digitise and release photos taken at exhibition openings, capturing the people, fashion and feeling of those eras. The hard part of this, of course, is labelling the people.
  4. Reach out to their broader community to use the Dowse blog to publish community-written obituaries and similar content (i.e. encourage the generation of quality secondary sources).
  5. Engage with your local artists and politicians by taking pictures at Dowse events, uploading them to commons and adding them to the subjects’ wikipedia articles. Make attending a Dowse exhibition opening the easiest way for locals to get a new wikipedia image.
I've not listed the 'digitise the collections' option, since at the end of the day, the value of this (to wikipedia) declines over time (because there are more and more alternative sources) and the price of putting them online declines. I'd much rather people tried new innovative things when they had the agility and leadership that lets them do it, because that's how the community as a whole moves forward.

Mark E. Phillips: Date values in the UNT Libraries’ Digital Collections

planet code4lib - Mon, 2015-10-19 15:00

This past week I was clearing out a bunch of software feature request tickets to prepare for a feature push for our digital library system.  We are getting ready to do a redesign of The Portal to Texas History and the UNT Digital Library interfaces.

Buried deep in our ticketing system were some tickets made during the past five years that included notes about future implementations that we could create for the system.  One of these notes caught my eye because it had the phrase “since date data is so poor in the system”.  At first I had dismissed this phrase and ticket altogether because our ideas related to the feature request had changed, but later that phrase stuck with me a bit.

I began to wonder,  “what is the quality of our date data in our digital library” and more specifically “what does the date resolution look like across the UNT Libraries’ Digital Collections”.

Getting the Data

The first thing to do was to grab all of the date data for each record in the system.  At the time of writing there were 1,310,415 items in the UNT Libraries Digital Collections.  I decided the easiest way to grab the date information for these records was to pull it from our Solr index.

I constructed a solr query that would return the value of our dc_date field, the ark identifier we use to uniquely identify each item in the repository, and finally which of the systems (Portal, Digital Library, or Gateway) a record belongs to.

I pulled these as JSON files with 10,000 records per request,  did 132 requests and I was in business.
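A minimal sketch of how that paging could be set up is below. It only builds the request URLs and does not contact a server. The select URL and the field names other than dc_date are assumptions made for illustration; q, fl, wt, rows and start are standard Solr query parameters.

```python
import math
from urllib.parse import urlencode

TOTAL_ITEMS = 1_310_415      # item count quoted in the post
ROWS_PER_REQUEST = 10_000    # records per request quoted in the post
# Hypothetical endpoint; substitute the select handler of your own index.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection/select"


def page_urls(total, rows, fields):
    """Return one select URL per page needed to cover `total` records."""
    pages = math.ceil(total / rows)
    urls = []
    for page in range(pages):
        params = {
            "q": "*:*",               # match every record
            "fl": ",".join(fields),   # only return the fields we need
            "wt": "json",
            "rows": rows,
            "start": page * rows,     # offset of the first record on this page
        }
        urls.append(SOLR_SELECT_URL + "?" + urlencode(params))
    return urls


# "ark" and "system" are placeholder field names; only dc_date is named in the post.
urls = page_urls(TOTAL_ITEMS, ROWS_PER_REQUEST, ["ark", "dc_date", "system"])
print(len(urls))
```

The page count is simply the item total divided by the page size, rounded up, which is where the 132 requests come from.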

I wrote a short Python script that takes those Solr responses and converts them into a tab separated format that looks like this:

ark:/67531/metapth2355    1844-01-01    PTH
ark:/67531/metapth2356    1845-01-01    PTH
ark:/67531/metapth2357    1845-01-01    PTH
ark:/67531/metapth2358    1844-01-01    PTH
ark:/67531/metapth2359    1844-01-01    PTH
ark:/67531/metapth2360    1844          PTH
ark:/67531/metapth2361    1845-01-01    PTH
ark:/67531/metapth2362    1883-01-01    PTH
ark:/67531/metapth2363    1844          PTH
ark:/67531/metapth2365    1845          PTH
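The conversion step might look something like the sketch below. This is not the original script; it assumes the standard Solr JSON layout (a "response" object holding a "docs" list) and uses "ark" and "system" as placeholder field names alongside the real dc_date field.

```python
import json


def docs_to_tsv(solr_response_text):
    """Convert one Solr JSON response into tab separated lines."""
    response = json.loads(solr_response_text)
    lines = []
    for doc in response["response"]["docs"]:
        date = doc.get("dc_date", "")
        # Multi-valued fields come back as lists; keep the first value.
        if isinstance(date, list):
            date = date[0] if date else ""
        lines.append("\t".join([doc["ark"], date, doc["system"]]))
    return "\n".join(lines)


# Two records taken from the sample output above, wrapped in a Solr-style response.
sample = json.dumps({
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"ark": "ark:/67531/metapth2355", "dc_date": "1844-01-01", "system": "PTH"},
            {"ark": "ark:/67531/metapth2360", "dc_date": ["1844"], "system": "PTH"},
        ],
    }
})

tsv = docs_to_tsv(sample)
print(tsv)
```

Records with no dc_date value end up with an empty middle column, which a later step can treat as a missing date.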

Next I wrote another Python script that classifies a date into the following categories:

  • Day
  • Month
  • Year
  • Other-EDTF
  • Unknown
  • None

Day, Month, and Year are the three units that I’m really curious about. I identified these with simple regular expressions for yyyy-mm-dd, yyyy-mm, and yyyy respectively.  For records with date strings that weren’t a day, month, or year, I checked whether the string was a valid Extended Date Time Format (EDTF) string.  If it was valid EDTF I marked it as Other-EDTF; if it wasn’t valid EDTF and wasn’t a day, month, or year I marked it as Unknown.  Finally, if there wasn’t a date present for a metadata record at all, I marked it as “None”.
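A minimal sketch of that classification, under stated assumptions: the regular expressions only check the shape of the string (they do not reject impossible dates such as month 13), and EDTF validation is passed in as a callable because the post does not say which validator was used. The default validator here accepts nothing, so without a real one everything that isn't a day, month, or year falls through to Unknown.

```python
import re

# Shape-only patterns for the three resolutions of interest.
DAY_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
MONTH_RE = re.compile(r"\d{4}-\d{2}")
YEAR_RE = re.compile(r"\d{4}")


def classify(date_string, is_valid_edtf=lambda s: False):
    """Return Day, Month, Year, Other-EDTF, Unknown, or None for a date string.

    is_valid_edtf is a placeholder hook: plug in a real EDTF validator.
    """
    if date_string is None or not date_string.strip():
        return "None"
    value = date_string.strip()
    if DAY_RE.fullmatch(value):
        return "Day"
    if MONTH_RE.fullmatch(value):
        return "Month"
    if YEAR_RE.fullmatch(value):
        return "Year"
    if is_valid_edtf(value):
        return "Other-EDTF"
    return "Unknown"


for example in ["1844-01-01", "1845-01", "1844", "circa 1844", ""]:
    print(repr(example), "->", classify(example))
```

Using `fullmatch` rather than a prefix match keeps a string like "1844-01-01/1845" out of the Day bucket, so it is left for the EDTF check instead.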

One thing to note about the way I’m doing the categories: I am probably missing quite a few values that have a day, month, or year somewhere in the string, because I don’t parse the EDTF and Unknown strings more liberally.  That is true, but for what I’m trying to accomplish here I think we can let that slide.

What does the data look like?

The first thing for me to do was to see how many of the records had date strings compared to the number of records that do not have date strings present.

Date values vs none

The numbers show that 1,222,750 (93%) of records have date strings and 87,665 (7%) are missing them.  Just with those numbers I think we can negate the statement that “date data is poor in the system”.  But maybe the mere presence of dates isn’t what the ticket author meant, so we investigate further.

The next thing I did was to see how many of the dates overall could be classified as a day, month, or year.  The reason for looking at these values is that you can imagine building user interfaces that use date values to let users refine their searches or browse a collection by date.

Identified Resolution vs Not

This chart shows that the overwhelming majority of objects in our digital library, 1,202,625 (92%), had date values that were a day, month, or year, and only 107,790 (8%) fell into the remaining “Other” group (Other-EDTF, Unknown, or None). Now this, I think, does blow the statement about poor date data quality away.

The last thing I think there is to look at is how the categories stack up against each other.  Once again, a pie chart.

UNT Digital Libraries Date Resolution Distribution

Here is a table view of the same data.

Date Classification   Instances   Percentage
Day                     967,257        73.8%
Month                    43,952         3.4%
Year                    191,416        14.6%
Other-EDTF               15,866         1.2%
Unknown                   4,259         0.3%
None                     87,665         6.7%

So looking at this data it is clear that the majority of our digital objects have resolution at the “day” level, with 967,257 records or 73.8% of all records being in the format yyyy-mm-dd.  Year resolution is the second most common with 191,416 records or 14.6%.  Month resolution came in third with 43,952 records or 3.4%.  There were 15,866 records that had other valid EDTF values, 4,259 with unrecognized date values, and finally the 87,665 records that did not contain a date at all.
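For anyone wanting to check the arithmetic, the percentages can be recomputed from the counts reported above. This is only a sketch using the published figures; the variable names are made up for the example.

```python
# Counts as reported in the table above.
counts = {
    "Day": 967_257,
    "Month": 43_952,
    "Year": 191_416,
    "Other-EDTF": 15_866,
    "Unknown": 4_259,
    "None": 87_665,
}

total = sum(counts.values())

# Print each category with its count and share of all records.
for label, n in counts.items():
    print(f"{label:<12}{n:>10,}{n / total:>8.1%}")

# Records whose date is a plain day, month, or year.
identified = counts["Day"] + counts["Month"] + counts["Year"]
print(f"{'Identified':<12}{identified:>10,}{identified / total:>8.1%}")
```

The six categories add up to the 1,310,415 items mentioned at the start, and the day, month, and year buckets together give the 1,202,625 records used in the earlier chart.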


I think that I can safely say that we do in fact have a large amount of date data in our digital libraries.  This date data can be parsed easily into day, month and year buckets for use in discovery interfaces, and by doing very basic work with the date strings we are able to account for 92% of all records in the system.

I’d be interested to see how other digital libraries stand on date data to see if we are similar or different as far as this goes.  I might hit up my colleagues at the University of Florida because their University of Florida Digital Collections is of similar scale with similar content. If you would like to work to compare your digital libraries’ date data let me know.

Hope you enjoyed my musings here. If you have thoughts or suggestions, or if I missed something, please let me know via Twitter.

Ariadne Magazine: Editorial: Ariadne: the neverending story.

planet code4lib - Mon, 2015-10-19 07:49

Jon Knight, the latest in the long line of Ariadne editors, explains some of the changes that the journal has undergone this year, and introduces the articles in issue 74.

Welcome to issue 74 of Ariadne! This is the first issue of the magazine that we have hosted here at Loughborough University, with an editorial team spread over a number of institutions, after we took over the reins (and the software and database) from Bath University back in April. You might have noticed a few changes since the move that I’ll hopefully explain in this editorial. Read more about Editorial: Ariadne: the neverending story.

Issue number: 74
Date published: Mon, 10/19/2015

LITA: Interacting with Patrons Through Their Mobile Devices :: NFC Tags

planet code4lib - Mon, 2015-10-19 05:00

Wireless: this term evokes an array of feelings in technologists today. Even though the definition of the term is relatively simple, numerous protocols, standards, and methods have been developed to perform wireless interactions. For example, by now many of you have heard of mobile applications such as Apple Pay or Google Wallet. Similarly, you might have a transit pass or a badge for your gym or work. With a wave of your device or pass, a scanner processes a “contactless transaction”. The tap-and-go experience of these technologies often utilizes Near Field Communication, or NFC.

NFC is a set of standards that allows devices to establish radio communication with each other by touching them together or bringing them into close proximity, with an effective distance of 4 cm.  It is a direct transmission of specific information, separate from open-ended Wi-Fi access and the seemingly limitless information resources that provides.

NFC tags are used to send a resource, or a specific set of data, directly to a patron’s mobile device to improve their information-seeking experience. By using this technology, libraries can exchange data with patrons’ mobile devices without scanning a QR code or pairing devices (as Bluetooth requires), which makes for a less complex experience.

There are many useful tasks you can program these tags to perform. One example would be to set a tag to update a patron’s mobile calendar with an event your library is having. These tags can be programmed with date, time, location, and alarm information to remind the patron of the event, which is substantially more effective than a QR code’s ability to connect a patron with a destination. Another use would be to program a set of NFC keychains for library staff to keep on hand that grant Wi-Fi access: no more password requests or questions about access, just a simple tap of the NFC keychain. The ability to execute preset instructions, beyond just a URL for the mobile device, differentiates NFC tags from QR codes. Many NFC tag users also find them more appealing visually, because they can be placed into posters or other advertising materials without altering the design.

The use of this technology has been anticipated in libraries for several years now. However, there is one minor issue with implementing NFC tags: Apple only supports the use of this technology for Apple Pay. Apple devices do not currently support NFC for any other transaction, even though the technology is available on them. Hopefully Apple will open up NFC on its devices in the future, and the technology will become more widely used.

Ed Summers: Seminar Week 7

planet code4lib - Mon, 2015-10-19 04:00

This week we focused on information visualization with Niklas Elmqvist from the UMD iSchool. Niklas studies information visualization and human computer interaction. He joined UMD in the last year, after arriving from Purdue University.

For this week we also read Elzen & Wijk (2014), Heer, Bostock, & Ogievetsky (2010) and Heer & Shneiderman (2012). I enjoyed the two Heer articles because of their accessibility (they were written for the more general readership of ACM Queue), but also for their breadth. The 2012 paper in particular does a really nice job of summarizing a large number of visualization techniques by breaking them down into a taxonomy of data/view specification, view manipulation, and analysis process / provenance.

The surprise for me (since I’ve just been a dabbler in dataviz) is that the iterative feedback loop of the analysis/provenance piece is deemed an important part of the visualization itself. Niklas stressed this as he described Visual Analytics, which studies not only how to visualize data, but also how the interaction between data processing, data visualization, computer interfaces and the human can enable new forms of reasoning that were previously impossible, or at least very difficult.

The 2010 article was also very interesting to me because I recognized the name of Mike Bostock, who is a legend in the developer community for having played a part in the creation of Data Driven Documents (D3). D3 is a Web standards compliant data visualization toolkit. I have also used Bostock’s Protovis library, but learned from Niklas that Heer (his PhD advisor) also played a role in the creation of both Protovis and D3, as well as the Flare and Prefuse visualization libraries. It seems like there is a lesson here about persistence, or at least not staying still. Bostock was at the NYTimes until recently, helping bootstrap their data visualization capabilities.

We did spend a little bit of time talking about how essential it is to be able to share visualizations. We talked briefly about Bostock’s D3 publishing framework at, which allows GitHub repositories containing data and D3 visualizations to be easily published on the Web. I’ve heard from friends at the NYTimes that Bostock created a very similar in-house system for reporters and editors there.

I left this meeting more excited than I thought I was going to be about the prospects of learning more about data visualization. I hadn’t considered before how much of an HCI and data visualization problem there is lying in the web archiving domain. My immediate interest centers on the appraisal process itself: how do curators and archivists sift through social media to identify salient Web documents to preserve? But the very act of exploring Web archives is also quite under-developed. The Wayback experience of diving into the archive with a known URL and then wandering around in links is the de facto standard for Web archives. But how would search be presented: what are the new and useful ways to search through time as well as text? It feels like there is a big piece of work that could be done in this area. At any rate, I definitely would like to take Niklas’ class when it is next available.


Elzen, S. van den, & Wijk, J. J. van. (2014). Multivariate network exploration and presentation: From detail to overview via selections and aggregations. Visualization and Computer Graphics, IEEE Transactions on, 20(12), 2310–2319. Retrieved from

Heer, J., & Shneiderman, B. (2012). Interactive dynamics for visual analysis. Queue, 10(2), 30. Retrieved from

Heer, J., Bostock, M., & Ogievetsky, V. (2010). A tour through the visualization zoo. Commun. ACM, 53(6), 59–67. Retrieved from

