
Galen Charlton: Tips and tricks for leaking patron information

planet code4lib - Sat, 2014-10-18 00:39

Here is a partial list of various ways I can think of to expose information about library patrons and their search and reading history by use (and misuse) of software used or recommended by libraries.

  • Send a patron’s ebook reading history to a commercial website…
    • … in the clear, for anybody to intercept.
  • Send patron information to a third party…
    • … that does not have an adequate privacy policy.
    • … that has an adequate privacy policy but does not implement it well.
    • … that is sufficiently remote that libraries lack any leverage to punish it for egregious mishandling of patron data.
  • Use an unencrypted protocol to enable a third-party service provider to authenticate patrons or look them up…
    • … such as SIP2.
    • … such as SIP2, with the patron information response message configured to include full contact information for the patron.
    • … or many configurations of NCIP.
    • … or web services accessible over HTTP (as opposed to HTTPS).
  • Store patron PINs and passwords without encryption…
    • … or using weak hashing.
  • Store the patron’s Social Security Number in the ILS patron record.
  • Don’t require HTTPS for a patron to access her account with the library…
    • … or if you do, don’t keep up to date with the various SSL and TLS flaws announced over the years.
  • Make session cookies used by your ILS or discovery layer easy to snoop.
  • Use HTTP at all in your ILS or discovery layer – as oddly enough, many patrons will borrow the items that they search for.
  • Send an unencrypted email…
    • … containing a patron’s checkouts today (i.e., an email checkout receipt).
    • … reminding a patron of his overdue books – and listing them.
    • … listing the titles of the patron’s available hold requests.
  • Don’t encrypt connections between an ILS client program and its application server.
  • Don’t encrypt connections between an ILS application server and its database server.
  • Don’t notice that a rootkit has been running on your ILS server for the past six months.
  • Don’t notice that a keylogger has been running on one of your circulation PCs for the past three months.
  • Fail to keep up with installing operating system security patches.
  • Use the same password for the circulator account used by twenty circulation staff (and 50 former circulation staff) – and never change it.
  • Don’t encrypt your backups.
  • Don’t use the feature in your ILS to enable severing the link between the record of a past loan and the specific patron who took the item out…
    • … sever the links, but retain database backups for months or years.
  • Don’t give your patrons the ability to opt out of keeping track of their past loans.
  • Don’t give your patrons the ability to opt in to keeping track of their past loans.
  • Don’t give the patron any control or ability to completely sever the link between her record and her past circulation history whenever she chooses to.
  • When a patron calls up asking “what books do I have checked out?” … answer the question without verifying that the patron is actually who she says she is.
  • When a parent calls up asking “what books does my teenager have checked out?”… answer the question.
  • Set up your ILS to print out hold slips… that include the full name of the patron. For bonus points, do this while maintaining an open holds shelf.
  • Don’t shred any circulation receipts that patrons leave behind.
  • Don’t train your non-MLS staff on the importance of keeping patron information confidential.
  • Don’t give your MLS staff refreshers on professional ethics.
  • Don’t shut down library staff gossiping about a patron’s reading preferences.
  • Don’t immediately sack a library staff member caught misusing confidential patron information.
  • Have your ILS or discovery interface hosted by a service provider that makes one or more of the mistakes listed above.
  • Join a committee writing a technical standard for library software… and don’t insist that it take patron privacy into account.

Do you have any additions to the list? Please let me know!

Of course, I am not actually advocating disclosing confidential information. Stay tuned for a follow-up post.

Harvard Library Innovation Lab: Link roundup October 17, 2014

planet code4lib - Fri, 2014-10-17 21:23

This is the good stuff.

UNIX: Making Computers Easier To Use — AT&T Archives film from 1982, Bell Laboratories

Love the idea that UNIX and computing should be social. Building things, together.

Digital Public Library of America » GIF IT UP

The @DPLA @digitalnz GIF IT UP competition is the funnest thing in libraries right now. Love it.

physical-web/ at master · google/physical-web

URLs emitted from physical world devices. This is the right way to think about phone/physical world interfaces.

Forty Portraits in Forty Years –

Gotta love the Brown sisters. Photos from our archives are neat. Stitching together a time lapse would be amazing.

Peter Thiel Thinks We All Can Do Better | On Point with Tom Ashbrook


Roy Tennant: A Tale of Two Records

planet code4lib - Fri, 2014-10-17 20:46

Image courtesy Wikipedia, public domain.

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the epoch of belief,
it was the epoch of incredulity,
it was the season of BIBFRAME,
it was the season of RDA,
it was the spring of hope,
it was the winter of despair,
we had everything before us,
we had nothing before us,
we were all going direct to metadata Heaven,
we were all going direct the other way–
in short, the period was so far like the present period, that some of
its noisiest authorities insisted on its being received, for good or for
evil, in the superlative degree of comparison only.

There were a MARC with a large set of tags and an RDA with a plain face, on the throne of library metadata; there were a with a large following and a JSON-LD with a fair serialization, on the throne of all else. In both camps it was clearer than crystal to the lords of the Library preserves of monographs and serials, that things in general were settled for ever.


But they weren’t. Oh were they not. It mayhaps would have been pleasant, back in 2014, to have settled everything for all time, but such things were not to be.

The library guilds united behind the RDA wall, where they frantically ran MARC records through the furnace to forge fresh new records of RDA, employed to make the wall ever thicker and higher.

The Parliamentary Library assaulted their ramparts with the BIBFRAME, but the stones flung by that apparatus were insufficient to breach the wall of RDA.

Meanwhile, the vast populace in neither camp employed to garner the attention of the monster crawlers and therefore their many minions, ignoring the internecine squabbles over arcane formats.

Eventually warfare settled down to a desultory, almost emotionless flinging of insults and the previous years of struggle were rendered meaningless.

So now we, the occupants of mid-century modernism, are left to contemplate the apparent fact that formats never really mattered at all. No, dear reader, they never did. What mattered was the data, and the parsing of it, and its ability to be passed from hand to hand without losing meaning or value.

One wonders what those dead on the Plain of Standards would say if they could have lived to see this day.


My humble and abject apologies to Mr. Charles Dickens, for having been so bold as to damage his fine work with my petty scribblings.

District Dispatch: Free webinar: Giving legal advice to patrons

planet code4lib - Fri, 2014-10-17 20:05

Reference librarian assisting readers. Photo by the Library of Congress.

Every day, public library staff are asked to answer legal questions. Since these questions are often complicated and confusing, and because there are frequent warnings about not offering legal advice, reference staff may be uncomfortable addressing legal reference questions. To help reference staff build confidence in responding to legal inquiries, the American Library Association (ALA) and iPAC will host the free webinar “Connecting Patrons with Legal Information” on Wednesday, November 12, 2014, from 2:00–3:00 p.m. EDT.

The session will offer information on laws, legal resources and legal reference practices. Participants will learn how to handle a law reference interview, including where to draw the line between information and advice, key legal vocabulary and citation formats. During the webinar, leaders will offer tips on how to assess and choose legal resources for patrons. Register now as space is limited.

Catherine McGuire, head of Reference and Outreach at the Maryland State Law Library, will lead the free webinar. McGuire currently plans and presents educational programs to Judiciary staff, local attorneys, public library staff and members of the public on subjects related to legal research and reference. She currently serves as Vice Chair of the Conference of Maryland Court Law Library Directors and the co-chair of the Education Committee of the Legal Information Services to the Public Special Interest Section (LISP-SIS) of the American Association of Law Libraries (AALL).

Webinar: Connecting Patrons with Legal Information
Date: Wednesday, November 12, 2014
Time: 2:00–3:00 p.m. EDT

The archived webinar will be emailed to District Dispatch subscribers.

The post Free webinar: Giving legal advice to patrons appeared first on District Dispatch.

OCLC Dev Network: A Close Look at the WorldCat::Discovery Ruby Gem

planet code4lib - Fri, 2014-10-17 18:15

This is the third installment in our deep dive series on the WorldCat Discovery API. This week we will be taking a close look at some of the demo code we have written ourselves to exercise the API throughout its development process. We have decided to share our work through our OCLC Developer Network account.

pinboard: NCSA Brown Dog

planet code4lib - Fri, 2014-10-17 16:53
@todrobbins saw you were asking about browndog in #code4lib did you find this already?

OCLC Dev Network: Systems Maintenance on October 19

planet code4lib - Fri, 2014-10-17 16:30

Web services that require user level authentication will be down for Identity Management system (IDM) updates beginning Sunday, October 19th, at 3:00 am local data center time.


HangingTogether: Evolving Scholarly Record workshop (Part 2)

planet code4lib - Fri, 2014-10-17 16:00

This is the second of three posts about the workshop.

Part 1 introduced the Evolving Scholarly Record framework.  This part summarizes the two plenary discussions.

Research Records and Artifact Ecologies
Natasa Milić-Frayling, Principal Researcher, Microsoft Research Cambridge

Natasa illustrated the diversity and complexity of digital research information by comparing it to a rainbow and asking: how do we preserve a rainbow? She began with the question, how can we support the reuse of scientific data, tools, and resources to facilitate new scientific discoveries? We need to take a sociological point of view because scientific discovery is a social enterprise within communities of practice, and the information takes a complex journey from the lab to the paper, evolving en route. When teams consist of distributed scientists, notions of ownership and sharing are challenged. We need to be attuned to the interplay between technology and collaborative practices as it affects the information artifacts.

Natasa encouraged a shift in thinking from the record to the ecology, as she shared her study of the artifacts ecology of a particular nanotechnology endeavor.  Their ecosystem has electronic lab books, includes tools, ingests sensor data, and incorporates analysis and interpretation.  This ecosystem provides context for understanding the data and other artifacts, but scientists want help linking these artifacts and overcoming limitations of physical interaction.  They want content extraction and format transformation services.  They want to create project maps and overviews to support their work in order to convey meaning to guide third party reuse of the artifacts.  Preservation is not just persistence; it requires a connection with the contemporary ecosystem.  A file and an application can persist and be completely unusable.  They need to be processed and displayed to be experienced and this requires preserving them in their original state and virtualising the old environments on future platforms.  She acknowledged the challenges in supporting research, but implored libraries to persevere.

A Perspective on Archiving the Evolving Scholarly Record
Herbert Van de Sompel, Scientist, Los Alamos National Laboratory

Herbert took a web-focused view, saying that not only is nearly everything digital, it is nearly all networked, which must be taken into account when we talk about archiving.  His presentation reflected thinking in progress with Andrew Treloar, of the Australian National Data Service.  Herbert highlighted the “collect” and “fix” roles and how the materials will be obtained by archives.  He used Roosendaal and Geurts’s functions of scholarly communication to structure his talk: Registration (the claim, with its related objects), Certification (peer review and other validation), Awareness (alerts and discovery of new claims), and Archiving (preserving over time), emphasizing that there is no scholarly record without archiving.  The four functions had been integrated in print journal publishing, but now the functions are disaggregated and distributed among many entities.

Herbert then characterized the future environment as the Web of Objects.  Scholarly communication is becoming more visible, continuous, informal, instant, and content-driven.  As a result, research objects are more varied, compound, diverse, networked, and open.  He discussed several challenges this presents to libraries.  Archiving must take into account that objects are often hosted on common web platforms (e.g., GitHub, SlideShare, WordPress), which are not necessarily dedicated to scholarship.  We archive only 50% of journal articles and they tend to be the easy, low-risk titles.  “Web at Large” resources are seldom archived.   Today’s approach to archiving focuses on atomic objects and loses context.  We need to move toward archiving compound objects in various states of flux, as resources on the web rather than as files in file systems.  He distinguished between recording (short-term, no guarantees, many copies, and tied to the scholarly process) and archiving (longer-term, guarantees, one copy, and part of the scholarly record).  Curatorial decisions need to be made to transfer materials from the recording infrastructures to an archival infrastructure through collaborations, interoperability, and web-scale processes.

Part 3 will summarize the breakout discussions.

About Ricky Erway

Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.


Jason Ronallo: HTML and PDF Slideshows Written in Markdown with DZSlides, Pandoc, Guard, Capybara Webkit, and a little Ruby

planet code4lib - Fri, 2014-10-17 13:57

Update: This post is still an alright overview of how to simply create HTML slide decks using these tools. See the more recent version of the code I created to jump start slide deck creation that has added features including synchronized audience notes.

I’ve used different HTML slideshow tools in the past, but was never satisfied with them. I didn’t like to have to run a server just for a slideshow. I don’t like when a slideshow requires external dependencies that make it difficult to share the slides. I don’t want to actually have to write a lot of HTML.

I want to write my slides in a single Markdown file. As a backup I always like to have my slides available as a PDF.

For my latest presentations I came up with a workflow that I’m satisfied with. Once all the little pieces were stitched together, it worked really well for me. I’ll show you how I did it.

I had looked at DZSlides before but had always passed it by after seeing what a default slide deck looked like. It wasn’t as flashy as others and doesn’t immediately have all the same features readily available. I looked at it again because I liked the idea that it is a single file template. I also saw that Pandoc will convert Markdown into a DZSlides slideshow.

To convert my Markdown to DZSlides it was as easy as:

pandoc -w dzslides > presentation.html

What is even better is that Pandoc has settings to embed images and any external files as data URIs within the HTML. So this allows me to maintain a single Markdown file and then share my presentation as a single HTML file including images and all–no external dependencies.

pandoc -w dzslides --standalone --self-contained > presentation.html

The DZSlides default template is rather plain, so you’ll likely want to make some stylistic changes to the CSS. You may also want to add some more JavaScript as part of your presentation or to add features to the slides. For instance I wanted to add a simple way to toggle my speaker notes from showing. In previous HTML slides I’ve wanted to control HTML5 video playback by binding JavaScript to a key. The way I do this is to add in any external styles or scripts directly before the closing body tag after Pandoc does its processing. Here’s the simple script I wrote to do this:

#! /usr/bin/env ruby
# markdown_to_slides.rb
# Converts a markdown file into a DZSlides presentation. Pandoc must be installed.

# Build the presentation, then read in the given CSS and JavaScript files
# and insert them just before the close of the body tag.
css ='styles.css')
script ='scripts.js')
`pandoc -w dzslides --standalone --self-contained > presentation.html`
presentation ='presentation.html')
style = "<style>#{css}</style>"
scripts = "<script>#{script}</script>"
presentation.sub!('</body>', "#{style}#{scripts}</body>")'presentation.html', 'w') do |fh|
  fh.puts presentation
end

Just follow these naming conventions:

  • Presentation Markdown should be named
  • Output presentation HTML will be named presentation.html
  • Create a stylesheet in styles.css
  • Create any JavaScript in a file named scripts.js
  • You can put images wherever you want, but I usually place them in an images directory.
Automate the build

Now what I wanted was for this script to run any time the Markdown file changed. I used Guard to watch the files and set off the script to convert the Markdown to slides. While I was at it I could also reload the slides in my browser. One trick with guard-livereload is to allow your browser to watch local files so that you do not have to have the page behind a server. Here’s my Guardfile:

guard 'livereload' do
  watch("presentation.html")
end

guard :shell do
  # If any of these change, run the script to build presentation.html
  watch('') {`./markdown_to_slides.rb`}
  watch('styles.css') {`./markdown_to_slides.rb`}
  watch('scripts.js') {`./markdown_to_slides.rb`}
  watch('markdown_to_slides.rb') {`./markdown_to_slides.rb`}
end

Add the following to a Gemfile and bundle install:

source ''
gem 'guard-livereload'
gem 'guard-shell'

Now I have a nice automated way to build my slides, continue to work in Markdown, and have a single file as a result. Just run this:

bundle exec guard

Now when any of the files change your HTML presentation will be rebuilt. Whenever the resulting presentation.html is changed, it will trigger livereload and a browser refresh.

Slides to PDF

The last piece I needed was a way to convert the slideshow into a PDF as a backup. I never know what kind of equipment will be set up or whether the browser will be recent enough to work well with the HTML slides. I like being prepared. It makes me feel more comfortable knowing I can fall back to the PDF if needs be. Also some slide deck services will accept a PDF but won’t take an HTML file.

In order to create the PDF I wrote a simple ruby script using capybara-webkit to drive a headless browser. If you aren’t able to install the dependencies for capybara-webkit you might try some of the other capybara drivers. I did not have luck with the resulting images from selenium. I then used the DZSlides JavaScript API to advance the slides. I do a simple count of how many times to advance based on the number of sections. If you have incremental slides this script would need to be adjusted to work for you.

The Webkit driver is used to take a snapshot of each slide, save it to a screenshots directory, and then ImageMagick’s convert is used to turn the PNGs into a PDF. You could just as well use other tools to stitch the PNGs together into a PDF. The quality of the resulting PDF isn’t great, but it is good enough. Also the capybara-webkit browser does not evaluate @font-face so the fonts will be plain. I’d be very interested if anyone gets better quality using a different browser driver for screenshots.

#! /usr/bin/env ruby
# dzslides2pdf.rb
# Usage: dzslides2pdf.rb http://localhost/presentation_root presentation.html
require 'capybara/dsl'
require 'capybara-webkit'
# require 'capybara/poltergeist'
require 'fileutils'
include Capybara::DSL

base_url = ARGV[0] || exit
presentation_name = ARGV[1] || 'presentation.html'

# temporary directory for screenshots
FileUtils.mkdir('./screenshots') unless File.exist?('./screenshots')

Capybara.configure do |config|
  config.run_server = false
  config.current_driver = :webkit # :poltergeist = 'fake app name'
  config.app_host = base_url
end

visit "/#{presentation_name}" # visit the first page

# change the size of the window
page.driver.resize_window(1024, 768) if Capybara.current_driver == :webkit

sleep 3 # allow the page to render correctly
# take a screenshot of the first slide
page.save_screenshot('./screenshots/screenshot_000.png', width: 1024, height: 768)

# calculate the number of slides in the deck
slide_count = page.body.scan(%r{slide level1}).size
puts slide_count

(slide_count - 1).times do |time|
  slide_number = time + 1
  keypress_script = "Dz.forward();" # DZSlides API call to go to the next slide
  page.execute_script(keypress_script)
  sleep 3 # wait for the slide to fully transition
  page.save_screenshot("./screenshots/screenshot_#{slide_number.to_s.rjust(3, '0')}.png",
                       width: 1024, height: 768)
  print "#{slide_number}. "
end
puts

`convert screenshots/*.png presentation.pdf`
FileUtils.rm_r('screenshots')

At this point I did have to set this up to be behind a web server. On my local machine I just made a symlink from the root of my Apache htdocs to my working directory for my slideshow. The script can be called with the following.

./dzslides2pdf.rb http://localhost/presentation/root/directory presentation.html

Speaker notes

One addition that I’ve made is to add some JavaScript for speaker notes. I don’t want to have to embed my slides into another HTML document to get the nice speaker view that DZslides provides. I prefer to just have a section at the bottom of the slides that pops up with my notes. I’m alright with the audience seeing my notes if I should ever need them. So far I haven’t had to use the notes.

I start with adding the following markup to the presentation Markdown file.

<div role="note" class="note">
  Hi. I'm Jason Ronallo the Associate Head of Digital Library Initiatives at NCSU Libraries.
</div>

Add some CSS to hide the notes by default but allow for them to display at the bottom of the slide.

div[role=note] {
  display: none;
  position: absolute;
  bottom: 0;
  color: white;
  background-color: gray;
  opacity: 0.85;
  padding: 20px;
  font-size: 12px;
  width: 100%;
}

Then a bit of JavaScript to show/hide the notes when pressing the “n” key.

window.onkeypress = presentation_keypress_check;

function presentation_keypress_check(aEvent) {
  if (aEvent.keyCode == 110) {
    aEvent.preventDefault();
    var notes = document.getElementsByClassName('note');
    for (var i = 0; i < notes.length; i++) {
      notes[i].style.display = (notes[i].style.display == 'none' || !notes[i].style.display) ? 'block' : 'none';
    }
  }
}

Outline

Finally, I like to have an outline I can see of my presentation as I’m writing it. Since the Markdown just uses h1 elements to separate slides, I just use the following simple script to output the outline for my slides.

#!/usr/bin/env ruby
# outline_markdown.rb
file ='')
index = 0
file.each_line do |line|
  if /^#\s/.match line
    index += 1
    title = line.sub('#', index.to_s)
    puts title
  end
end

Full Example

You can see the repo for my latest HTML slide deck created this way for the 2013 DLF Forum where I talked about Embedded Semantic Markup,, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives.


I like doing slides where I can write very quickly in Markdown and then have the ability to handcraft the deck or particular slides. I’d be interested to hear if you do something similar.

Jason Ronallo: Styling HTML5 Video with CSS

planet code4lib - Fri, 2014-10-17 13:41

If you add an image to an HTML document you can style it with CSS. You can add borders, change its opacity, use CSS animations, and lots more. HTML5 video is just as easy to add to your pages and you can style video too. Lots of tutorials will show you how to style video controls, but I haven’t seen anything that will show you how to style the video itself. Read on for an extreme example of styling video just to show what’s possible.

Here’s a simple example of a video with a single source wrapped in a div:

<div id="styled_video_container">
  <video src="/video/wind.mp4" type="video/mp4" controls poster="/video/wind.png"
         id="styled_video" muted preload="metadata" loop>
  </video>
</div>

Add some buttons under the video to style and play the video and then to stop the madness.

<button type="button" id="style_it">Style It!</button>
<button type="button" id="stop_style_it">Stop It!</button>

We’ll use this JavaScript just to add a class to the containing element of the video and play/pause the video.

jQuery(document).ready(function($) {
  $('#style_it').on('click', function() {
    $('#styled_video')[0].play();
    $('#styled_video_container').addClass('style_it');
  });
  $('#stop_style_it').on('click', function() {
    $('#styled_video_container').removeClass('style_it');
    $('#styled_video')[0].pause();
  });
});

Using the class that gets added we can then style and animate the video element with CSS. This is a simplified version without vendor flags.

#styled_video_container.style_it {
  background: linear-gradient(to bottom, #ff670f 0%, #e20d0d 100%);
}
#styled_video_container.style_it video {
  border: 10px solid green !important;
  opacity: 0.6;
  transition: all 8s ease-in-out;
  transform: rotate(300deg);
  box-shadow: 12px 9px 13px rgba(255, 0, 255, 0.75);
}


OK, maybe there aren’t a lot of practical uses for styling video with CSS, but it is still fun to know that we can. Do you have a practical use for styling video with CSS that you can share?

Terry Reese: MarcEdit LibHub Plug-in

planet code4lib - Fri, 2014-10-17 03:19

As libraries begin to join and participate in systems to test Bibframe principles, my hope is that, when possible, I can provide support through MarcEdit to give these communities a conduit that simplifies publishing information into those systems.  The first of these test systems is the Libhub Initiative, and working with Eric Miller and the really smart folks at Zepheira, I have created a plug-in specifically for libraries and partners working with the LibHub initiative.  The plug-in provides a mechanism to publish a variety of metadata formats into the system – MARC, MARCXML, EAD, and MODS data – and the process will hopefully help users contribute content and spur discussion around the data model Zepheira is employing with this initiative.

For the time being, the plug-in is private, and available to any library currently participating in the LibHub project.  However, my understanding is that as they continue to ramp up the system, the plugin will be made available to the general community at large.

For now, I’ve published a video talking about the plug-in and demonstrating how it works.  If you are interested, you can view the video on YouTube.




Terry Reese: Automated Language Translation using Microsoft’s Translation Services

planet code4lib - Fri, 2014-10-17 01:13

We hear the refrain over and over – we live in a global community.  Socially, politically, economically – the ubiquity of the internet and free/cheap communications has definitely changed the world that we live in.  For software developers, this shift has definitely been felt as well.  My primary domain tends to focus around software built for the library community, but I’ve participated in a number of open source efforts in other domains as well, and while it is easier than ever to make one’s project/source available to the masses, efforts to localize said projects are still largely overlooked.  And why?  Well, doing internationalization work is hard and oftentimes requires large numbers of volunteers proficient in multiple languages to provide quality translations of content in a wide range of languages.  It also tends to slow down the development process and requires developers to create interfaces and inputs that support language sets that they themselves may not be able to test or validate.


If your project team doesn’t have the language expertise to provide quality internationalization support, you have a variety of options available to you (with the best ones reserved for those with significant funding).  These range from tools available to open source projects, like TranslateWiki, which provides a platform for volunteers to participate in crowd-sourced translation services, to very good subscription services like Transifex, which works as both a platform and a match-making service between projects and translators.  Additionally, Amazon’s Mechanical Turk can be utilized to provide one-off translation services at a fairly low cost.  The main point, though, is that services do exist covering a wide spectrum of cost and quality.  The challenge, of course, is that many of the services above require a significant amount of match-making, either on the part of the service or the individuals involved with the project, and oftentimes money.  All of this ultimately takes time, sometimes a significant amount of time, making it a difficult cost/benefit analysis to determine which languages one should invest the time and resources to support.

Automated Translation

This is a problem that I’ve been running into a lot lately.  I work on a number of projects where the primary user community hails largely from North America; or, at least, the community that I interact with most often is fairly English-language centric.  But that’s changing: I’ve seen a rapidly growing international community and increasing calls for localized versions of software and utilities that have traditionally had very niche audiences.

I’ll use MarcEdit as an example.  Over the past 5 years, I’ve seen the number of users working with the program steadily increase, with much of that growth coming from the international user community.  Today, between a third and a half of each month’s total application usage comes from outside North America, a number I would never have expected when I first started working on the program in 1999.  But things have changed, and finding ways to support these changing demographics is challenging.

In thinking about ways to provide better localization support, one idea I found particularly interesting was marrying automated translation with human intervention.  A localized interface could be generated automatically, using an automated translation tool, to provide a “good enough” translation that also serves as a template for human volunteers to correct and improve.  This would enable support for a wide range of languages where English really is a barrier but no human volunteer has been secured to provide a localized translation, while giving established communities a “good enough” template to use as a jumping-off point, speeding up the process of human-enhanced translation.  Additionally, as interfaces change and are updated, or new services are added, automated processes could generate the initial localization until a local expert is available to provide a high-quality translation of the new content, avoiding a slowdown in the development and release process.
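The workflow can be sketched roughly as follows. This is an illustration only, not MarcEdit's actual code: `machine_translate()` stands in for whatever translation service is used, and the template structure (a key/value map with a review flag) is an assumption.

```python
def machine_translate(text, target_lang):
    """Placeholder for a call to an automated translation service."""
    return f"[{target_lang}] {text}"

def build_template(strings, target_lang):
    """Pair each UI string with a machine translation and a review flag.

    Human volunteers can later replace any entry still marked
    'machine' without having to start from a blank file.
    """
    template = {}
    for key, english in strings.items():
        template[key] = {
            "source": english,
            "translation": machine_translate(english, target_lang),
            "status": "machine",   # volunteers flip this to "reviewed"
        }
    return template

strings = {"menu.file": "File", "menu.help": "Help"}
template = build_template(strings, "fr")
```

The point of the `status` flag is that the "good enough" machine output ships immediately, while reviewed entries gradually replace it as volunteers become available.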

This is an idea that I’ve been pursuing for a number of months now and, over the past week, have been putting into practice.  Utilizing Microsoft’s Translation Services, I’ve been working on a process to extract all of the text strings from a C# application and generate localized language files from that content.  Once the files are generated, I’ve had them evaluated by native speakers, who comment on quality and usability…and for the most part, the results have been surprising.  While I had no expectation that translations generated by an automated service would be comparable to human-mediated translation, I was pleasantly surprised to hear that the automated output is very often good enough.  That isn’t to say it’s without problems; there definitely are problems.  The bigger question has been whether those problems impede the use of the application or utility.  In most cases, the most glaring issue with the automated translations has been context.  For example, take the word “Score.”  Within the context of MarcEdit and library bibliographic description, we know “score” applies to musical scores, not points scored in a game…context.  The problem is that many languages make these distinctions with distinct words, and if the translation service cannot determine the context, it tends to default to the most common usage of a term, which in the case of library bibliographic description is oftentimes incorrect.  It has made for some interesting conversations with the volunteers evaluating the automated translations, which can range from very good to downright comical.  But by a large margin, evaluators have said that while the translations were at times very awkward, they would be “good enough” until someone could provide a better translation of the content.  And what is more, the service gets enough of the content right that it could be used as a template to speed the translation process.
And for me, this is kind of what I wanted to hear.
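One common way to mitigate the context problem described above is to post-edit the machine output with a small domain glossary: when the source string is a known term of art, its established translation overrides whatever the engine produced. The sketch below is illustrative only; the glossary entries and the French renderings are assumptions, not part of MarcEdit.

```python
# Domain glossary: terms whose bibliographic sense differs from the
# everyday sense a generic engine would pick.
DOMAIN_GLOSSARY = {
    "fr": {"Score": "Partition"},  # musical score, not game score
}

def apply_glossary(source, machine_output, target_lang):
    """Override the machine translation when the source string is a
    known domain term with an established translation."""
    glossary = DOMAIN_GLOSSARY.get(target_lang, {})
    return glossary.get(source, machine_output)

# A generic engine might render "Score" in its game sense; the
# glossary forces the bibliographic sense instead.
fixed = apply_glossary("Score", "Pointage", "fr")
```

Strings not in the glossary pass through untouched, so the approach layers cleanly on top of any automated service.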

Microsoft’s Translation Services

There really aren’t a lot of options for good, free, automated translation services, and I guess that’s for good reason.  It’s hard, and it requires both resources and adequate content to learn how to read and output natural language.  I looked hard at the two services folks would be most familiar with: Google’s Translation API and Microsoft’s translation services.  When I started this project, my intention was to work with Google’s Translation API – I’d used it in the past with some success, but at some point in the past few years, Google shut down its free translation API and replaced it with a more traditional subscription service model.  While the costs of that subscription (which tend to be based on the number of characters processed) are certainly quite reasonable, my usage will always be fairly low and a little scattershot, making the monthly subscription costs hard to justify.  Microsoft’s translation service is also subscription based, but it provides a free tier that supports 2 million characters of throughput a month.  Since that more than meets my needs, I decided to start here.

The service provides access to a wide range of languages, including Klingon (Qo’noS marcedit qaStaHvIS tlhIngan! nuq laH ‘oH Dunmo’?), which made working with the service kind of fun.  Likewise, the APIs are well documented, though they can be slightly confusing due to a shift, sometime in the past year or two, to an OAuth token-based authentication process.  While documentation on the new process can be found, most code samples found online still reference the now-defunct key/secret key process.
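For orientation, the token-based flow looks roughly like this. The endpoint URLs and field names below reflect the Azure Datamarket client-credentials flow as commonly documented around this time and should be treated as assumptions; check Microsoft's current documentation before relying on them. The sketch only builds the requests rather than sending them.

```python
from urllib.parse import urlencode

# Assumed token endpoint for the OAuth client-credentials grant.
TOKEN_URL = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13"

def build_token_request(client_id, client_secret):
    """Return the form body for a client-credentials token request."""
    return urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "http://api.microsofttranslator.com",
    })

def build_translate_url(text, source, target):
    """Return the GET URL for a single translation call (assumed shape)."""
    query = urlencode({"text": text, "from": source, "to": target})
    return "http://api.microsofttranslator.com/v2/Http.svc/Translate?" + query

body = build_token_request("my-client-id", "my-secret")
url = build_translate_url("Score", "en", "fr")
```

The returned access token is then sent as a `Bearer` header on each translate call, which is the step most of the older key/secret samples omit.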

So how does it work?  Performance-wise, not bad.  Generating 15 language files took around 5-8 minutes per file, with each file requiring close to 1,600 calls against the server.  As noted above, accuracy varies, especially when translating one-word commands that could have multiple meanings depending on context.  It was actually suggested that some of these context problems might be overcome by using a language other than English as the source, which is a really interesting idea and one worth investigating in the future.
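The roughly 1,600 calls per file come from translating one string per request. Many translation APIs accept multiple strings per call, so chunking the extracted strings into batches cuts the request count dramatically; whether a given tier permits batching is something to verify against the service's documentation. The batch size of 25 below is an arbitrary illustration.

```python
def chunk(strings, batch_size=25):
    """Split a list of strings into batches for bulk translation."""
    return [strings[i:i + batch_size]
            for i in range(0, len(strings), batch_size)]

strings = [f"string {n}" for n in range(1600)]
batches = chunk(strings)
# 1,600 single-string requests collapse into 64 batched requests.
```

Fewer round trips would also shave a good portion of the 5-8 minutes per file, since most of that time is network latency rather than translation.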

Seeing how it works

If you are interested in seeing how this works, you can download a sample program, which pulls together code copied or cribbed from the Microsoft documentation (and then cleaned for brevity) as well as code showing how to use the service, from:–Language-Translator.  I’m kicking around the idea of converting the C# code into a Ruby gem (which is actually pretty straightforward), so if there is any interest, let me know.


HangingTogether: Evolving Scholarly Record workshop (Part 1)

planet code4lib - Thu, 2014-10-16 21:28

This is the first of three posts about The Evolving Scholarly Record and the Evolving Stewardship Ecosystem workshop held on 10 June 2014 in Amsterdam.

OCLC Research staff observed that while there are a lot of discussions about changes in the scholarly record, the discussions are fragmented. They set out to provide a high-level framework to facilitate future discussion. That work is represented in our Evolving Scholarly Record report and formed the basis for an international workshop.

The workshop explored the boundaries of the scholarly record and the curation roles of various stakeholders. Participants from nine countries included OCLC Research Partners and Data Archiving and Networked Services (DANS) community members with a mission for collecting, making available and preserving the scholarly record. They gathered to explore the responsibilities of research libraries, data archives, and other stewards of research output in creating a reliable ecosystem for preserving the scholarly record and making it accessible. Presentation slides, photos, and videos from the workshop are available.

There is a vast amount of digital research information in need of curation. Currently, libraries are reconceiving their roles regarding stewardship and curation, but it is obvious that libraries and archives are not the only stakeholders in the emerging ecosystem. Scholarly practices and the landscape of information services around them are undergoing significant change. Scholars embrace digital and networked technologies, inventing and experimenting with new forms of scholarship, and perceptions are changing about the long-term value of various forms of scholarly information. Libraries and other stewardship organizations are redefining their tasks as guides to and guardians of research information. Open access policies, funder requirements, and new venues for scholarly communication are blurring the roles of the various stakeholders, including commercial publishers, governmental entities, and universities. Digital information is being curated in different ways and at different places, but some of it is not curated at all. There is a real danger of losing the integrity of the scholarly record. The impact of changes in digital scholarship requires a collective effort among the variety of stakeholders.

The workshop discussion began with an overview of the OCLC Research report, The Evolving Scholarly Record. Ricky Erway (Senior Program Officer, OCLC Research) outlined the framework that OCLC created to facilitate discussions of our evolving stewardship roles in the broader ecosystem. She said that the boundaries of the scholarly record are always evolving, but a confluence of trends is accelerating the evolutionary process. Ricky emphasized that the framework does not attempt to describe scholarly processes nor encompass scholarly communication. The framework focuses on the “stuff” or the units of communication that become part of the scholarly record — and, for the purposes of the workshop, how that stuff will be stewarded going forward.

At the center of the framework is what has traditionally been the payload, research outcomes, but the framework captures a deeper and more complete record of scholarly inquiry, with greater emphasis on context (process and aftermath).

Evolving Scholarly Record Framework – OCLC Research

Process has three parts:

  • Method – lab notebooks, computer models, protocols
  • Evidence – datasets, primary source documents, survey results
  • Discussion – proposal reviews, preprints, conference presentations

Outcomes include traditional articles and monographs, but also simulations, performances, and a growing variety of other “end products.”

Aftermath has three parts:

  • Discussion – this time after the fact: reviews, commentary, online exchanges
  • Revision – can include the provision of additional findings, corrections, and clarifications
  • Reuse – might involve summaries, conference presentations, and popular media versions

Nothing is fixed. For example, in some fields, a conference presentation may be the outcome, in others it is used to inform the outcome, and in others it may amplify the outcome to reach new audiences. And those viewing the scholarly record will see the portions pertinent to their purpose. The framework document addresses traditional stakeholder roles (create, fix, collect, and use) and how they are being combined in new ways. Workshop attendees were encouraged to use the framework as they discussed the changing scholarly record and the increasingly distributed ecosystem of custodial responsibility.

Part 2 will feature views from Nataša Milić-Frayling, Principal Researcher at Microsoft Research Cambridge, UK, and Herbert Van de Sompel, Scientist, Los Alamos National Laboratory.

About Ricky Erway

Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.

Mail | Web | Twitter | LinkedIn | More Posts (36)

District Dispatch: Webinar archive: Fighting Ebola with information

planet code4lib - Thu, 2014-10-16 17:52

Photo by Phil Moyer

Archived video from the American Library Association (ALA) webinar “Fighting Ebola and Infectious Diseases with Information: Resources and Search Skills Can Arm Librarians,” is now available. The free webinar teaches participants how to find and share reliable health information on the infectious disease. Librarians from the U.S. National Library of Medicine hosted the interactive webinar. Watch the webinar or download copies of the slides (pdf).

Speakers include:

Siobhan Champ-Blackwell
Siobhan Champ-Blackwell is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center. She selects material to be added to the NLM disaster medicine grey literature database and is responsible for the Center’s social media efforts. She has over 10 years of experience in providing training on NLM products and resources.

Elizabeth Norton
Elizabeth Norton is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center where she has been working to improve online access to disaster health information for the disaster medicine and public health workforce. She has presented on this topic at national and international association meetings and has provided training on disaster health information resources to first responders, educators, and librarians working with the disaster response and public health preparedness communities.

To view past webinars also hosted collaboratively with iPAC, please visit Lib2Gov.

The post Webinar archive: Fighting Ebola with information appeared first on District Dispatch.

LibraryThing (Thingology): NEW: Annotations for Book Display Widgets

planet code4lib - Thu, 2014-10-16 14:21

Our Book Display Widgets is getting adopted by more and more libraries, and we’re busy making it better and better. Last week we introduced Easy Share. This week we’re rolling out another improvement—Annotations!

Book Display Widgets is the ultimate tool for libraries to create automatic or hand-picked virtual book displays for their home page, blog, Facebook or elsewhere. Annotations allows libraries to add explanations for their picks.

Some Ways to Use Annotations

1. Explain Staff Picks right on your homepage.
2. Let students know if a book is reserved for a particular class.
3. Add context for special collections displays.
How it Works

Check out the LibraryThing for Libraries Wiki for instructions on how to add Annotations to your Book Display Widgets. It’s pretty easy.


Watch a quick screencast explaining Book Display Widgets and how you can use them.

Find out more about LibraryThing for Libraries and Book Display Widgets. And sign up for a free trial of either by contacting

Library of Congress: The Signal: Five Questions for Will Elsbury, Project Leader for the Election 2014 Web Archive

planet code4lib - Thu, 2014-10-16 14:13

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The 2008 Barack Obama presidential campaign web site a week before the election.

Since the U.S. national elections of 2000, the Library of Congress has been harvesting the web sites of candidates in elections for Congress, state governorships, and the presidency. These collections require considerable manual effort: first to identify the sites correctly, then to populate our in-house tool that controls the web harvesting activity, which continues on a weekly basis over roughly a six-month period during each election-year cycle.  (The length of the crawling depends on the timing of each jurisdiction’s primaries and the availability of information about the candidates.)

Many national libraries started their web archiving activities by harvesting the web sites of political campaigns – by their very nature and function, they typically have a short lifespan and following the election will disappear, and during the course of the election campaign the contents of such a web site may change dramatically.  A weekly “capture” of the web site made available through a web archive for the election provides a snapshot of the sites and how they evolved during the campaign.

With Election Day in the U.S. approaching, it’s a great opportunity to talk with project leader Will Elsbury on the identification and nomination of the 2014 campaign sites and his other work on this effort as part of our Content Matters interview series.

Michael: Will, please describe your position at the Library of Congress and how you spend most of your time.

Will: I came to the Library in 2002.  I am the military history specialist and a reference librarian for the Humanities and Social Sciences Division. I divide most of my time between Main Reading Room reference desk duty, answering researchers’ inquiries via Ask a Librarian and through email, doing collection development work in my subject area, participating in relevant Library committees, and in addition, managing a number of Web archiving projects.  Currently, a good part of my time is devoted to coordinating and conducting work on the United States Election 2014 Web Archive. Several other Web archiving collections are currently ongoing for a determined period of time to encompass important historical anniversaries.

Michael: Tell us about this project and your involvement with it over the time you have been working on it.

Will: I have been involved with Web archiving in the Library for the last ten years or so. The projects have been a variety of thematic collections ranging from historical anniversaries such as the 150th commemoration of the Civil War and the centennial of World War I, to public policy topics and political elections. The majority of the projects I have worked on have been collecting the political campaign Web sites of candidates for the regular and special elections of Congress, the Presidency and state governorships. In most of these projects, I have served as the project coordinator. This involves gathering a work team, creating training documents and conducting training, assigning tasks, reviewing work, troubleshooting, corresponding with candidates and election officials and liaising with the Office of Strategic Initiatives staff who handle the very important technical processing of these projects. Their cooperation in these projects has been vital. They have shaped the tools used to build each Web archive, evolving them from a Microsoft Access-created entry form to today’s Digiboard (PDF) and its Candidates Module, which is a tool that helps the team manage campaign data and website URLs.

Michael: What are the challenges?  Have they changed over time?

Will: One of the most prominent challenges with election Web archiving is keeping abreast of the many differences found among the election practices of 50 states and the various territories. This is even more pronounced in the first election after state redistricting or possible reapportionment of Congressional seats. Our Web archive projects only archive the Web sites of those candidates who win their party’s primary and those who are listed as official candidates on the election ballot, regardless of party affiliation. Because the laws and regulations vary in each state and territory, I have to be certain that I or an assigned team member have identified a given state’s official list of candidates.

Some states are great about putting this information out. Others are more challenging and a few don’t provide a list until Election Day. That usually causes a last minute sprint of intense work both on my team’s part and that of the OSI staff. Another issue is locating contact information for candidates. We need this so an archiving and display notification message can be sent to a candidate. Some candidates very prominently display their contact information, but others present more of a challenge and it can take a number of search strategies and sleuthing tricks developed over the years to locate the necessary data. Sometimes we have to directly contact a candidate by telephone, and I can recall more than once having to listen to some very unique and interesting political theories and opinions.

2002 web site for the campaign of then-Speaker Denny Hastert of Illinois.

Michael: You must end up looking at many archived websites of political campaigns – what changes have you seen?  Do any stand out, or do they all run together?

Will: I have looked at thousands of political campaign web sites over the years. They run the gamut from slick and professional, to functional, to extremely basic and even clunky. There is still that variety out there, but I have noticed that many more candidates now use companies dedicated to the business of creating political candidacy web sites. Some are politically affiliated and others will build a site for any candidate. The biggest challenge here has to be identifying the campaign web sites and contact information of minor-party and independent candidates. Oftentimes these candidates work on a shoestring budget, if any, and cannot afford the cost of a campaign site. These candidates will usually run their online campaign using free or low-cost social media such as a blog, Facebook, or Twitter.

Michael: How do you imagine users 10 or 20 years from now will make use of the results of this work?

Will: Researchers have already been accessing these Web archives for various purposes. I hope that future researchers will use these collections to enhance and expand their research into the historical aspects of U.S. elections, among other purposes. There are many incidents and events that have taken place which influence elections. Scandals, economic ups and downs, divisive social issues, military deployments, and natural disasters are prominent in how political campaigns are shaped and which may ultimately help win or lose an election for a candidate. Because so much of candidates’ campaigns is now found online, it is doubly important that these campaign Web sites are archived. Researchers will likely find many ways to use the Library of Congress Web archives we may not anticipate now. I look forward to helping continue the Library’s effort in this important preservation work.

FOSS4Lib Updated Packages: Retailer

planet code4lib - Thu, 2014-10-16 13:18

Last updated October 16, 2014. Created by Conal Tuohy on October 16, 2014.
Log in to edit this page.

Retailer is a platform for hosting simple web applications. Retailer apps are written in pure XSLT. Retailer itself is written in Java, and runs in a Java Servlet container such as Apache Tomcat.

Retailer currently includes two XSLT applications which implement OAI-PMH providers of full text of historic newspaper articles. These apps are implemented on top of the web API of the National Library of Australia's Trove newspaper archive, and the National Library of New Zealand's "Papers Past" newspaper archive, via the "Digital NZ" web API.

However, Retailer is simply a platform for hosting XSLT code and could be used for many purposes other than OAI-PMH.  It is a kind of XML-transforming web proxy, able to present one RESTful API as another.

Retailer works by receiving an HTTP request, converting the request into an XML document, passing the document to the XSLT, and returning the result of the XSLT back to the HTTP client. The XML document representing a request is described here:
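The request-to-XML step can be illustrated with a short sketch. The element and attribute names below are hypothetical, chosen only to show the pattern; Retailer's actual request schema is the one documented by the project itself.

```python
import xml.etree.ElementTree as ET

def request_to_xml(method, path, params):
    """Render an HTTP request as an XML document (hypothetical schema)
    that an XSLT application could then transform into a response."""
    root = ET.Element("request", {"method": method, "path": path})
    for name, value in params.items():
        param = ET.SubElement(root, "parameter", {"name": name})
        param.text = value
    return ET.tostring(root, encoding="unicode")

# An OAI-PMH Identify request, as one of Retailer's apps might see it.
xml_doc = request_to_xml("GET", "/oai", {"verb": "Identify"})
```

Once the request is plain XML, the hosted app needs nothing beyond XSLT: it matches on the request document and emits the HTTP response body directly.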

Package Type: Discovery Interface
License: Apache 2.0
Development Status: Production/Stable
Operating System: Linux, Mac, Windows
Technologies Used: OAI, Tomcat, XSLT
Programming Language: Java
