Feed aggregator

Jason Ronallo: HTML and PDF Slideshows Written in Markdown with DZSlides, Pandoc, Guard, Capybara Webkit, and a little Ruby

planet code4lib - Fri, 2014-10-17 13:57

Update: This post is still an alright overview of how to simply create HTML slide decks using these tools. See the more recent version of the code I created to jump start slide deck creation that has added features including synchronized audience notes.

I’ve used different HTML slideshow tools in the past, but was never satisfied with them. I didn’t like to have to run a server just for a slideshow. I don’t like when a slideshow requires external dependencies that make it difficult to share the slides. I don’t want to actually have to write a lot of HTML.

I want to write my slides in a single Markdown file. As a backup I always like to have my slides available as a PDF.

For my latest presentations I came up with a workflow that I’m satisfied with. Once all the little pieces were stitched together it worked really well for me. I’ll show you how I did it.

I had looked at DZSlides before but had always passed it by after seeing what a default slide deck looked like. It wasn’t as flashy as others and doesn’t immediately have all the same features readily available. I looked at it again because I liked the idea that it is a single file template. I also saw that Pandoc will convert Markdown into a DZSlides slideshow.

To convert my Markdown to DZSlides it was as easy as:

pandoc -w dzslides > presentation.html

What is even better is that Pandoc has settings to embed images and any external files as data URIs within the HTML. So this allows me to maintain a single Markdown file and then share my presentation as a single HTML file including images and all–no external dependencies.

pandoc -w dzslides --standalone --self-contained > presentation.html
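To make the data URI idea concrete, here is a minimal Ruby sketch of how bytes become a data URI: the content is Base64-encoded and prefixed with its MIME type. The `data_uri` helper is a hypothetical name used only for illustration; it is not part of Pandoc and may differ from how Pandoc does this internally.

```ruby
# Build a data URI from raw bytes. data_uri is a hypothetical helper for
# illustration only; it is not a Pandoc API.
def data_uri(bytes, mime_type)
  # pack('m0') Base64-encodes without inserting line breaks
  encoded = [bytes].pack('m0')
  "data:#{mime_type};base64,#{encoded}"
end

puts data_uri("hi", "text/plain")
# prints data:text/plain;base64,aGk=
```

An image embedded this way travels inside the HTML file itself, which is why the resulting presentation has no external dependencies.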

The DZSlides default template is rather plain, so you’ll likely want to make some stylistic changes to the CSS. You may also want to add some more JavaScript as part of your presentation or to add features to the slides. For instance I wanted a simple way to toggle whether my speaker notes are showing. In previous HTML slides I’ve wanted to control HTML5 video playback by binding JavaScript to a key. The way I do this is to add any external styles or scripts directly before the closing body tag after Pandoc does its processing. Here’s the simple script I wrote to do this:

#! /usr/bin/env ruby
# markdown_to_slides.rb
# Converts a markdown file into a DZslides presentation. Pandoc must be installed.

# Read in the given CSS file and insert it between style tags just before the close of the body tag.
css = File.read('styles.css')
script = File.read('scripts.js')

`pandoc -w dzslides --standalone --self-contained > presentation.html`

presentation = File.read('presentation.html')
style = "<style>#{css}</style>"
scripts = "<script>#{script}</script>"
presentation.sub!('</body>', "#{style}#{scripts}</body>")

File.open('presentation.html', 'w') do |fh|
  fh.puts presentation
end
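One thing to watch with the injection step: when `sub` is given a replacement string, Ruby interprets sequences such as `\0` or `\2` in that string as backreferences, so CSS or JavaScript containing backslashes can be altered on the way in. The sketch below is a possible variation rather than the script above; it uses the block form of `sub`, which inserts the text literally. `inject_assets` and the sample strings are hypothetical.

```ruby
# inject_assets is a hypothetical helper; it mirrors the sub! call in the
# build script but works on plain strings so it can be tried without Pandoc.
def inject_assets(html, css, js)
  addition = "<style>#{css}</style><script>#{js}</script>"
  # Block form: the block's return value is inserted as-is, so backslashes
  # in the CSS or JavaScript are not treated as backreferences.
  html.sub('</body>') { "#{addition}</body>" }
end

html = '<html><body><section>Slide</section></body></html>'
css  = 'q:before { content: "\201C"; }' # single quotes keep the backslash
js   = 'console.log("ready");'

puts inject_assets(html, css, js)
```

With the string form, the `\2` in that CSS would be read as a reference to a capture group that does not exist and silently dropped.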

Just follow these naming conventions:

  • Presentation Markdown should be named
  • Output presentation HTML will be named presentation.html
  • Create a stylesheet in styles.css
  • Create any JavaScript in a file named scripts.js
  • You can put images wherever you want, but I usually place them in an images directory.

Automate the build

Now what I wanted was for this script to run any time the Markdown file changed. I used Guard to watch the files and set off the script to convert the Markdown to slides. While I was at it I could also reload the slides in my browser. One trick with guard-livereload is to allow your browser to watch local files so that you do not have to have the page behind a server. Here’s my Guardfile:

guard 'livereload' do
  watch("presentation.html")
end

guard :shell do
  # If any of these change run the script to build presentation.html
  watch('') {`./markdown_to_slides.rb`}
  watch('styles.css') {`./markdown_to_slides.rb`}
  watch('scripts.js') {`./markdown_to_slides.rb`}
  watch('markdown_to_slides.rb') {`./markdown_to_slides.rb`}
end

Add the following to a Gemfile and bundle install:

source ''
gem 'guard-livereload'
gem 'guard-shell'

Now I have a nice automated way to build my slides, continue to work in Markdown, and have a single file as a result. Just run this:

bundle exec guard

Now when any of the files change your HTML presentation will be rebuilt. Whenever the resulting presentation.html is changed, it will trigger livereload and a browser refresh.

Slides to PDF

The last piece I needed was a way to convert the slideshow into a PDF as a backup. I never know what kind of equipment will be set up or whether the browser will be recent enough to work well with the HTML slides. I like being prepared. It makes me feel more comfortable knowing I can fall back to the PDF if need be. Also, some slide deck services will accept a PDF but won’t take an HTML file.

In order to create the PDF I wrote a simple ruby script using capybara-webkit to drive a headless browser. If you aren’t able to install the dependencies for capybara-webkit you might try some of the other capybara drivers. I did not have luck with the resulting images from selenium. I then used the DZSlides JavaScript API to advance the slides. I do a simple count of how many times to advance based on the number of sections. If you have incremental slides this script would need to be adjusted to work for you.

The Webkit driver is used to take a snapshot of each slide, save it to a screenshots directory, and then ImageMagick’s convert is used to turn the PNGs into a PDF. You could just as well use other tools to stitch the PNGs together into a PDF. The quality of the resulting PDF isn’t great, but it is good enough. Also the capybara-webkit browser does not evaluate @font-face so the fonts will be plain. I’d be very interested if anyone gets better quality using a different browser driver for screenshots.

#! /usr/bin/env ruby
# dzslides2pdf.rb
# dzslides2pdf.rb http://localhost/presentation_root presentation.html

require 'capybara/dsl'
require 'capybara-webkit'
# require 'capybara/poltergeist'
require 'fileutils'
include Capybara::DSL

base_url = ARGV[0] || exit
presentation_name = ARGV[1] || 'presentation.html'

# temporary directory for screenshots
FileUtils.mkdir('./screenshots') unless File.exist?('./screenshots')

Capybara.configure do |config|
  config.run_server = false
  config.default_driver
  config.current_driver = :webkit # :poltergeist
  config.app = "fake app name"
  config.app_host = base_url
end

visit "/#{presentation_name}" # visit the first page

# change the size of the window
if Capybara.current_driver == :webkit
  page.driver.resize_window(1024,768)
end

sleep 3 # Allow the page to render correctly
page.save_screenshot("./screenshots/screenshot_000.png", width: 1024, height: 768) # take screenshot of first page

# calculate the number of slides in the deck
slide_count = page.body.scan(%r{slide level1}).size
puts slide_count

(slide_count - 1).times do |time|
  slide_number = time + 1
  keypress_script = "Dz.forward();" # dzslides script for going to next slide
  page.execute_script(keypress_script) # run the script to transition to next slide
  sleep 3 # wait for the slide to fully transition
  # screenshot_and_save_page # take a screenshot
  page.save_screenshot("./screenshots/screenshot_#{slide_number.to_s.rjust(3,'0')}.png", width: 1024, height: 768)
  print "#{slide_number}. "
end

puts `convert screenshots/*png presentation.pdf`
FileUtils.rm_r('screenshots')
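The counting and file naming in the script can be tried on their own without a browser. This is a small sketch under assumptions: the sample markup is made up to resemble Pandoc's DZSlides output, where each top-level section carries the classes `slide level1`.

```ruby
# Made-up markup standing in for page.body; real Pandoc output may carry
# additional classes or attributes.
html = <<~HTML
  <section id="intro" class="slide level1"><h1>Intro</h1></section>
  <section id="demo" class="slide level1"><h1>Demo</h1></section>
  <section id="end" class="slide level1"><h1>End</h1></section>
HTML

# Same counting approach as the script: one match per top-level slide
slide_count = html.scan(/slide level1/).size

# Zero-padded names keep the shell glob (screenshots/*png) in slide order;
# unpadded, screenshot_10.png would sort before screenshot_2.png.
names = (0...slide_count).map { |n| format('screenshot_%03d.png', n) }

puts slide_count
puts names
```

The padding matters because ImageMagick receives the files in the order the shell expands the glob, which is alphabetical rather than numeric.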

At this point I did have to set this up to be behind a web server. On my local machine I just made a symlink from the root of my Apache htdocs to my working directory for my slideshow. The script can be called with the following.

./dzslides2pdf.rb http://localhost/presentation/root/directory presentation.html

Speaker notes

One addition that I’ve made is to add some JavaScript for speaker notes. I don’t want to have to embed my slides into another HTML document to get the nice speaker view that DZslides provides. I prefer to just have a section at the bottom of the slides that pops up with my notes. I’m alright with the audience seeing my notes if I should ever need them. So far I haven’t had to use the notes.

I start with adding the following markup to the presentation Markdown file.

<div role="note" class="note">
  Hi. I'm Jason Ronallo the Associate Head of Digital Library Initiatives at NCSU Libraries.
</div>

Add some CSS to hide the notes by default but allow for them to display at the bottom of the slide.

div[role=note] {
  display: none;
  position: absolute;
  bottom: 0;
  color: white;
  background-color: gray;
  opacity: 0.85;
  padding: 20px;
  font-size: 12px;
  width: 100%;
}

Then a bit of JavaScript to show/hide the notes when pressing the “n” key.

window.onkeypress = presentation_keypress_check;

function presentation_keypress_check(aEvent){
  if ( aEvent.keyCode == 110) {
    aEvent.preventDefault();
    var notes = document.getElementsByClassName('note');
    for (var i=0; i < notes.length; i++){
      notes[i].style.display = (notes[i].style.display == 'none' || !notes[i].style.display) ? 'block' : 'none';
    }
  }
}

Outline

Finally, I like to have an outline I can see of my presentation as I’m writing it. Since the Markdown just uses h1 elements to separate slides, I just use the following simple script to output the outline for my slides.

#!/usr/bin/env ruby
# outline_markdown.rb

file = File.read('')
index = 0
file.each_line do |line|
  if /^#\s/.match line
    index += 1
    title = line.sub('#', index.to_s)
    puts title
  end
end

Full Example

You can see the repo for my latest HTML slide deck created this way for the 2013 DLF Forum where I talked about Embedded Semantic Markup,, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives.


I like doing slides where I can write very quickly in Markdown and then have the ability to handcraft the deck or particular slides. I’d be interested to hear if you do something similar.

Jason Ronallo: Styling HTML5 Video with CSS

planet code4lib - Fri, 2014-10-17 13:41

If you add an image to an HTML document you can style it with CSS. You can add borders, change its opacity, use CSS animations, and lots more. HTML5 video is just as easy to add to your pages and you can style video too. Lots of tutorials will show you how to style video controls, but I haven’t seen anything that will show you how to style the video itself. Read on for an extreme example of styling video just to show what’s possible.

Here’s a simple example of a video with a single source wrapped in a div:

<div id="styled_video_container">
  <video src="/video/wind.mp4" type="video/mp4" controls poster="/video/wind.png" id="styled_video" muted preload="metadata" loop></video>
</div>

Add some buttons under the video to style and play the video and then to stop the madness.

<button type="button" id="style_it">Style It!</button>
<button type="button" id="stop_style_it">Stop It!</button>

We’ll use this JavaScript just to add a class to the containing element of the video and play/pause the video.

jQuery(document).ready(function($) {
  $('#style_it').on('click', function(){
    $('#styled_video')[0].play();
    $('#styled_video_container').addClass('style_it');
  });
  $('#stop_style_it').on('click', function(){
    $('#styled_video_container').removeClass('style_it');
    $('#styled_video')[0].pause();
  });
});

Using the class that gets added we can then style and animate the video element with CSS. This is a simplified version without vendor flags.

#styled_video_container.style_it {
  background: linear-gradient(to bottom, #ff670f 0%,#e20d0d 100%);
}

#styled_video_container.style_it video {
  border: 10px solid green !important;
  opacity: 0.6;
  transition: all 8s ease-in-out;
  transform: rotate(300deg);
  box-shadow: 12px 9px 13px rgba(255, 0, 255, 0.75);
}

Stupid Video Styling Tricks


OK, maybe there aren’t a lot of practical uses for styling video with CSS, but it is still fun to know that we can. Do you have a practical use for styling video with CSS that you can share?

Terry Reese: MarcEdit LibHub Plug-in

planet code4lib - Fri, 2014-10-17 03:19

As libraries begin to join and participate in systems that test Bibframe principles, my hope is that, where possible, I can use MarcEdit to give these communities a conduit that simplifies publishing information into those systems. The first of these test systems is the LibHub Initiative. Working with Eric Miller and the really smart folks at Zepheira, I have created a plug-in specifically for libraries and partners working with the LibHub initiative. The plug-in provides a mechanism to publish a variety of metadata formats into the system, including MARC, MARCXML, EAD, and MODS data. The process will hopefully help users contribute content and help spur discussion around the data model Zepheira is employing with this initiative.

For the time being, the plug-in is private and available to any library currently participating in the LibHub project. However, my understanding is that as they continue to ramp up the system, the plug-in will be made available to the community at large.

For now, I’ve published a video talking about the plug-in and demonstrating how it works.  If you are interested, you can view the video on YouTube.



FOSS4Lib Upcoming Events: Receive replica cartier watches

planet code4lib - Fri, 2014-10-17 01:25
Date: Thursday, October 16, 2014 - 21:15 to Thursday, October 30, 2014 - 21:15
Supports: Ceridwen Library Self Issue Software

Last updated October 16, 2014. Created by cocolove on October 16, 2014.

Tuck the residual wire in to the bottom in the coil firmly with the chain nose pliers. There are already some variations off lately and today the gold jewelry that you receive replica cartier watches, incorporates enamels studded to it. Like the title warns, Murphy’s Law is in full force tonight. You will be able to go out and meet new people who may even become lifelong friends. Charms happen to be kept inside the garments and are actually used just like a kind of identification on the list of other person.

FOSS4Lib Upcoming Events: Cartier's Santos watch was the timepiece that drew men away from pocket

planet code4lib - Fri, 2014-10-17 01:23
Date: Thursday, October 16, 2014 - 21:15
Supports: Bibwiki, Koha Stow Extras

Last updated October 16, 2014. Created by cartierlover on October 16, 2014.

Nowadays Cartier has more than 200 stores in 125 countries worldwide along with their product range goes from watches to accessories and from leather good to perfumes. So nothing will put me off more quickly than you rambling on about yourself. The website will supply you with the complete information about How to hemp patterns as well as the Hemp knots in making different kinds of jewelry.

Terry Reese: Automated Language Translation using Microsoft’s Translation Services

planet code4lib - Fri, 2014-10-17 01:13

We hear the refrain over and over: we live in a global community. Socially, politically, economically, the ubiquity of the internet and free/cheap communications has definitely changed the world that we live in. For software developers, this shift has definitely been felt as well. My primary domain tends to focus on software built for the library community, but I’ve participated in a number of open source efforts in other domains as well, and while it is easier than ever to make one’s project/source available to the masses, efforts to localize said projects are still largely overlooked. And why? Well, doing internationalization work is hard and often requires large numbers of volunteers proficient in multiple languages to provide quality translations of content in a wide range of languages. It also tends to slow down the development process and requires developers to create interfaces and inputs that support language sets that they themselves may not be able to test or validate.


If your project team doesn’t have the language expertise to provide quality internationalization support, you have a variety of options available to you (with the best ones reserved for those with significant funding). These range from tools available to open source projects, like TranslateWiki, which provides a platform for volunteers to participate in crowd-sourced translation, to some very good subscription services like Transifex, which works as both a platform and a match-making service between projects and translators. Additionally, Amazon’s Mechanical Turk can be used to provide one-off translation services at a fairly low cost. The main point, though, is that services do exist that cover a wide spectrum in terms of cost and quality. The challenge, of course, is that many of the services above require a significant amount of match-making, either on the part of the service or the individuals involved with the project, and oftentimes money. All of this ultimately takes time, sometimes a significant amount of it, which makes it difficult to weigh costs against benefits when deciding which languages are worth the time and resources to support.

Automated Translation

This is a problem that I’ve been running into a lot lately. I work on a number of projects where the primary user community hails largely from North America; or, well, the community that I interact with most often is fairly English-language centric. But that’s changing: I’ve seen a rapidly growing international community and increasing calls for localized versions of software or utilities that have traditionally had very niche audiences.

I’ll use MarcEdit as an example. Over the past 5 years, I’ve seen the number of users working with the program steadily increase, with much of that increase coming from a growing international user community. Today, 1/3-1/2 of each month’s total application usage comes from outside of North America, a number that I would have never expected when I first started working on the program in 1999. But things have changed, and finding ways to support these changing demographics is challenging.

In thinking about ways to provide better support for localization, one area that I found particularly interesting was the idea of marrying automated language translation with human intervention. The idea is that a localized interface could be generated automatically using an automated translation tool to provide a “good enough” translation, which could also serve as the template for human volunteers to correct and improve. This would enable support for a wide range of languages where English really is a barrier but no human volunteer has been secured to provide a localized translation, and it would give established communities a “good enough” template to use as a jumping-off point to improve and speed up the process of human-enhanced translation. Additionally, as interfaces change and are updated, or new services are added, automated processes could generate the initial localization until a local expert was available to provide a high-quality translation of the new content, to avoid slowing down the development and release process.
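As a rough illustration of this "machine first, human corrections on top" idea, here is a minimal Ruby sketch. It is not taken from MarcEdit (which is a C# application); the keys, strings, and variable names are all hypothetical, and the machine output is stubbed with a literal hash rather than a call to any translation service.

```ruby
# Hypothetical interface strings keyed by resource name. The machine hash
# stands in for whatever an automated service returned.
machine = {
  'file.open'  => 'Abrir',
  'file.save'  => 'Salvar',
  'help.about' => 'Acerca'
}

# Corrections supplied by a volunteer; only the strings they changed
human = { 'file.save' => 'Guardar' }

# Human corrections win; every other key keeps the machine translation
localized = machine.merge(human)

# Keys no human has reviewed yet, i.e. what a volunteer would look at next
pending = machine.keys - human.keys

puts localized.inspect
puts pending.inspect
```

When new interface strings are added, they would show up in `machine` first and appear in `pending` until someone reviews them, so a release need not wait on the volunteers.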

This is an idea that I’ve been pursuing for a number of months now, and over the past week I have been putting it into practice. Utilizing Microsoft’s Translation Services, I’ve been working on a process to extract all text strings from a C# application and generate localized language files for the content. Once the files have been generated, I’ve been having them evaluated by native speakers to comment on quality and usability…and for the most part, the results have been surprising. While I had no expectation that the translations generated through any automated service would be comparable to human-mediated translation, I was pleasantly surprised to hear that the automated data is very often good enough. That isn’t to say that it’s without its problems; there are definitely problems. The bigger question has been whether these problems impede the use of the application or utility. In most cases, the most glaring issue with the automated translation services has been context. For example, take the word Score. Within the context of MarcEdit and library bibliographic description, we know score applies to musical scores, not points scored in a game…context. The problem is that many languages make these distinctions with distinct words, and if the translation service cannot determine the context, it tends to default to the most common usage of a term, which in the case of library bibliographic description would often be incorrect. It’s made for some interesting conversations with volunteers evaluating the automated translations, which can range from very good to downright comical. But by a large margin, evaluators have said that while the translations were at times very awkward, they would be “good enough” until someone could provide a better translation of the content. And what is more, the service gets enough of the content right that it could be used as a template to speed the translation process.
And for me, this is kind of what I wanted to hear.

Microsoft’s Translation Services

There really aren’t a lot of options available for good free automated translation services, and I guess that’s for good reason. It’s hard, and requires both resources and adequate content to learn how to read and output natural language. I looked hard at the two services that folks would be most familiar with: Google’s Translation API and Microsoft’s translation services. When I started this project, my intention was to work with Google’s Translation API, which I’d used in the past with some success, but at some point in the past few years Google seems to have shut down its free API translation services and replaced them with a more traditional subscription service model. Now, the costs for that subscription (which tend to be based on the number of characters processed) are certainly quite reasonable, but my usage will always be fairly low and a little scattershot, making the monthly subscription costs hard to justify. Microsoft’s translation service is also a subscription-based service, but it provides a free tier that supports 2 million characters of throughput a month. Since that more than meets my needs, I decided to start here.

The service provides access to a wide range of languages, including Klingon (Qo’noS marcedit qaStaHvIS tlhIngan! nuq laH ‘oH Dunmo’?), which made working with the service kind of fun.  Likewise, the APIs are well-documented, though can be slightly confusing due to shifts in authentication practice to an OAuth Token-based process sometime in the past year or two.  While documentation on the new process can be found, most code samples found online still reference the now defunct key/secret key process.

So how does it work? Performance-wise, not bad. In generating 15 language files, it took around 5-8 minutes per file, with each file requiring close to 1600 calls against the server. As noted above, accuracy varies, especially when translating one-word commands that could have multiple meanings depending on context. It was actually suggested that some of these context problems might be overcome by using a language other than English as the source, which is a really interesting idea and one that might be worth investigating in the future.
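To put those figures in perspective, here is a back-of-the-envelope calculation in Ruby using only the numbers quoted in this post. The last step rests on two assumptions that the post does not confirm: that the free tier counts input characters, and that all 15 files are generated within the same month.

```ruby
# Figures from the post: 15 language files, about 1600 calls per file,
# and 5 to 8 minutes per file.
files            = 15
calls_per_file   = 1600
minutes_per_file = [5, 8]

total_calls   = files * calls_per_file                  # 24000
total_minutes = minutes_per_file.map { |m| m * files }  # [75, 120] if run one file at a time

# Average time per call in milliseconds (integer division rounds down)
ms_per_call = minutes_per_file.map { |m| m * 60_000 / calls_per_file }

# Assumption: the 2 million character free tier counts input characters
# and the whole run happens in a single month.
free_tier_chars   = 2_000_000
avg_chars_allowed = free_tier_chars / total_calls

puts "total calls: #{total_calls}"
puts "total minutes: #{total_minutes.inspect}"
puts "ms per call: #{ms_per_call.inspect}"
puts "average characters per call within the free tier: #{avg_chars_allowed}"
```

Under those assumptions each call averages a few hundred milliseconds, and a full run stays inside the free tier as long as the strings average no more than about 83 characters each.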

Seeing how it works

If you are interested in seeing how this works, you can download a sample program which pulls together code copied or cribbed from the Microsoft documentation (and then cleaned for brevity) as well as code on how to use the service from:–Language-Translator.  I’m kicking around the idea of converting the C# code into a ruby gem (which is actually pretty straightforward), so if there is any interest, let me know.


HangingTogether: Evolving Scholarly Record workshop (Part 1)

planet code4lib - Thu, 2014-10-16 21:28

This is the first of three posts about The Evolving Scholarly Record and the Evolving Stewardship Ecosystem workshop held on 10 June 2014 in Amsterdam.

OCLC Research staff observed that while there are a lot of discussions about changes in the scholarly record, the discussions are fragmented. They set out to provide a high-level framework to facilitate future discussion. That work is represented in our Evolving Scholarly Record report and formed the basis for an international workshop.

The workshop explored the boundaries of the scholarly record and the curation roles of various stakeholders. Participants from nine countries included OCLC Research Partners and Data Archiving and Networked Services (DANS) community members with a mission for collecting, making available and preserving the scholarly record. They gathered to explore the responsibilities of research libraries, data archives, and other stewards of research output in creating a reliable ecosystem for preserving the scholarly record and making it accessible. Presentation slides, photos, and videos from the workshop are available.

There is a vast amount of digital research information in need of curation. Currently, libraries are reconceiving their roles regarding stewardship and curation, but it is obvious that libraries and archives are not the only stakeholders in the emerging ecosystem. Scholarly practices and the landscape of information services around them are undergoing significant change. Scholars embrace digital and networked technologies, inventing and experimenting with new forms of scholarship, and perceptions are changing about the long-term value of various forms of scholarly information. Libraries and other stewardship organizations are redefining their tasks as guides to and guardians of research information. Open access policies, funder requirements, and new venues for scholarly communication are blurring the roles of the various stakeholders, including commercial publishers, governmental entities, and universities. Digital information is being curated in different ways and at different places, but some of it is not curated at all. There is a real danger of losing the integrity of the scholarly record. The impact of changes in digital scholarship requires a collective effort among the variety of stakeholders.

The workshop discussion began with an overview of the OCLC Research report, The Evolving Scholarly Record. Ricky Erway (Senior Program Officer, OCLC Research) outlined the framework that OCLC created to facilitate discussions of our evolving stewardship roles in the broader ecosystem. She said that the boundaries of the scholarly record are always evolving, but a confluence of trends is accelerating the evolutionary process. Ricky emphasized that the framework does not attempt to describe scholarly processes nor encompass scholarly communication. The framework focuses on the “stuff” or the units of communication that become part of the scholarly record — and, for the purposes of the workshop, how that stuff will be stewarded going forward.

The framework has at its center what has traditionally been the payload, research outcomes, but it is a deeper and more complete record of scholarly inquiry with greater emphasis on context (process & aftermath).

Evolving Scholarly Record Framework – OCLC Research

Process has three parts:

  • Method – lab notebooks, computer models, protocols
  • Evidence – datasets, primary source documents, survey results
  • Discussion – proposal reviews, preprints, conference presentations

Outcomes include traditional articles and monographs, but also simulations, performances, and a growing variety of other “end products”.

Aftermath has three parts:

  • Discussion – this time after the fact: reviews, commentary, online exchanges
  • Revision – can include the provision of additional findings, corrections, and clarifications
  • Reuse – might involve summaries, conference presentations, and popular media versions

Nothing is fixed. For example, in some fields, a conference presentation may be the outcome, in others it is used to inform the outcome, and in others it may amplify the outcome to reach new audiences. And those viewing the scholarly record will see the portions pertinent to their purpose. The framework document addresses traditional stakeholder roles (create, fix, collect, and use) and how they are being combined in new ways. Workshop attendees were encouraged to use the framework as they discussed the changing scholarly record and the increasingly distributed ecosystem of custodial responsibility.

Part 2 will feature views from Natasa Miliç-Frayling, Principal Researcher at Microsoft Research Cambridge, UK and Herbert Van de Sompel, Scientist, Los Alamos National Laboratory.

About Ricky Erway

Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.


District Dispatch: Webinar archive: Fighting Ebola with information

planet code4lib - Thu, 2014-10-16 17:52

Photo by Phil Moyer

Archived video from the American Library Association (ALA) webinar “Fighting Ebola and Infectious Diseases with Information: Resources and Search Skills Can Arm Librarians,” is now available. The free webinar teaches participants how to find and share reliable health information on the infectious disease. Librarians from the U.S. National Library of Medicine hosted the interactive webinar. Watch the webinar or download copies of the slides (pdf).

Speakers include:

Siobhan Champ-Blackwell
Siobhan Champ-Blackwell is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center. She selects material to be added to the NLM disaster medicine grey literature database and is responsible for the Center’s social media efforts. She has over 10 years of experience in providing training on NLM products and resources.

Elizabeth Norton
Elizabeth Norton is a librarian with the U.S. National Library of Medicine Disaster Information Management Research Center where she has been working to improve online access to disaster health information for the disaster medicine and public health workforce. She has presented on this topic at national and international association meetings and has provided training on disaster health information resources to first responders, educators, and librarians working with the disaster response and public health preparedness communities.

To view past webinars also hosted collaboratively with iPAC, please visit Lib2Gov.

The post Webinar archive: Fighting Ebola with information appeared first on District Dispatch.

LibraryThing (Thingology): NEW: Annotations for Book Display Widgets

planet code4lib - Thu, 2014-10-16 14:21

Our Book Display Widgets is getting adopted by more and more libraries, and we’re busy making it better and better. Last week we introduced Easy Share. This week we’re rolling out another improvement—Annotations!

Book Display Widgets is the ultimate tool for libraries to create automatic or hand-picked virtual book displays for their home page, blog, Facebook or elsewhere. Annotations allows libraries to add explanations for their picks.

Some Ways to Use Annotations

1. Explain Staff Picks right on your homepage.
2. Let students know if a book is reserved for a particular class.
3. Add context for special collections displays.

How it Works

Check out the LibraryThing for Libraries Wiki for instructions on how to add Annotations to your Book Display Widgets. It’s pretty easy.


Watch a quick screencast explaining Book Display Widgets and how you can use them.

Find out more about LibraryThing for Libraries and Book Display Widgets. And sign up for a free trial of either by contacting

Library of Congress: The Signal: Five Questions for Will Elsbury, Project Leader for the Election 2014 Web Archive

planet code4lib - Thu, 2014-10-16 14:13

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

The 2008 Barack Obama presidential campaign web site a week before the election.

Since the U.S. national elections of 2000, the Library of Congress has been harvesting the web sites of candidates for elections for Congress, state governorships and the presidency. These collections require considerable manual effort, first to identify the sites correctly and then to populate our in-house tool that controls the web harvesting, which runs weekly for about six months of the election year cycle. (The length of the crawling depends on the timing of each jurisdiction’s primaries and availability of the information about the candidates.)

Many national libraries started their web archiving activities by harvesting the web sites of political campaigns – by their very nature and function, they typically have a short lifespan and following the election will disappear, and during the course of the election campaign the contents of such a web site may change dramatically.  A weekly “capture” of the web site made available through a web archive for the election provides a snapshot of the sites and how they evolved during the campaign.

With Election Day in the U.S. approaching, it’s a great opportunity to talk with project leader Will Elsbury on the identification and nomination of the 2014 campaign sites and his other work on this effort as part of our Content Matters interview series.

Michael: Will, please describe your position at the Library of Congress and how you spend most of your time.

Will: I came to the Library in 2002.  I am the military history specialist and a reference librarian for the Humanities and Social Sciences Division. I divide most of my time between Main Reading Room reference desk duty, answering researchers’ inquiries via Ask a Librarian and through email, doing collection development work in my subject area, participating in relevant Library committees, and in addition, managing a number of Web archiving projects.  Currently, a good part of my time is devoted to coordinating and conducting work on the United States Election 2014 Web Archive. Several other Web archiving collections are currently ongoing for a determined period of time to encompass important historical anniversaries.

Michael: Tell us about this project and your involvement with it over the time you have been working on it.

Will: I have been involved with Web archiving in the Library for the last ten years or so. The projects have been a variety of thematic collections ranging from historical anniversaries such as the 150th commemoration of the Civil War and the centennial of World War I, to public policy topics and political elections. The majority of the projects I have worked on have been collecting the political campaign Web sites of candidates for the regular and special elections of Congress, the Presidency and state governorships. In most of these projects, I have served as the project coordinator. This involves gathering a work team, creating training documents and conducting training, assigning tasks, reviewing work, troubleshooting, corresponding with candidates and election officials and liaising with the Office of Strategic Initiatives staff who handle the very important technical processing of these projects. Their cooperation in these projects has been vital. They have shaped the tools used to build each Web archive, evolving them from a Microsoft Access-created entry form to today’s Digiboard (PDF) and its Candidates Module, which is a tool that helps the team manage campaign data and website URLs.

Michael: What are the challenges?  Have they changed over time?

Will: One of the most prominent challenges with election Web archiving is keeping abreast of the many differences found among the election practices of 50 states and the various territories. This is even more pronounced in the first election after state redistricting or possible reapportionment of Congressional seats. Our Web archive projects only archive the Web sites of those candidates who win their party’s primary and those who are listed as official candidates on the election ballot, regardless of party affiliation. Because the laws and regulations vary in each state and territory, I have to be certain that I or an assigned team member have identified a given state’s official list of candidates.

Some states are great about putting this information out. Others are more challenging and a few don’t provide a list until Election Day. That usually causes a last minute sprint of intense work both on my team’s part and that of the OSI staff. Another issue is locating contact information for candidates. We need this so an archiving and display notification message can be sent to a candidate. Some candidates very prominently display their contact information, but others present more of a challenge and it can take a number of search strategies and sleuthing tricks developed over the years to locate the necessary data. Sometimes we have to directly contact a candidate by telephone, and I can recall more than once having to listen to some very unique and interesting political theories and opinions.

2002 web site for the campaign of then-Speaker Denny Hastert of Illinois.

Michael: You must end up looking at many archived websites of political campaigns – what changes have you seen?  Do any stand out, or do they all run together?

Will: I have looked at thousands of political campaign web sites over the years. They run the gamut of slick and professional, to functional, to extremely basic and even clunky. There is still that variety out there, but I have noticed that many more candidates now use companies dedicated to the business of creating political candidacy web sites. Some are politically affiliated and others will build a site for any candidate. The biggest challenge here has to be identifying the campaign web site and contact information of minor party and independent candidates. Often times these candidates work on a shoestring budget if at all and cannot afford the cost of a campaign site. These candidates will usually run their online campaign using free or low-cost social media such as a blog or Facebook and Twitter.

Michael: How do you imagine users 10 or 20 years from now will make use of the results of this work?

Will: Researchers have already been accessing these Web archives for various purposes. I hope that future researchers will use these collections to enhance and expand their research into the historical aspects of U.S. elections, among other purposes. There are many incidents and events that have taken place which influence elections. Scandals, economic ups and downs, divisive social issues, military deployments, and natural disasters are prominent in how political campaigns are shaped and which may ultimately help win or lose an election for a candidate. Because so much of candidates’ campaigns is now found online, it is doubly important that these campaign Web sites are archived. Researchers will likely find many ways to use the Library of Congress Web archives we may not anticipate now. I look forward to helping continue the Library’s effort in this important preservation work.

FOSS4Lib Updated Packages: Retailer

planet code4lib - Thu, 2014-10-16 13:18

Last updated October 16, 2014. Created by Conal Tuohy on October 16, 2014.

Retailer is a platform for hosting simple web applications. Retailer apps are written in pure XSLT. Retailer itself is written in Java, and runs in a Java Servlet container such as Apache Tomcat.

Retailer currently includes two XSLT applications which implement OAI-PMH providers of full text of historic newspaper articles. These apps are implemented on top of the web API of the National Library of Australia's Trove newspaper archive, and the National Library of New Zealand's "Papers Past" newspaper archive, via the "Digital NZ" web API.

However, Retailer is simply a platform for hosting XSLT code, and could be used for many other purposes than OAI-PMH. It is a kind of XML transforming web proxy, able to present a RESTful API as another API.

Retailer works by receiving an HTTP request, converting the request into an XML document, passing the document to the XSLT, and returning the result of the XSLT back to the HTTP client. The XML document representing a request is described here:
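The first step of that pipeline, turning an HTTP request into an XML document, can be sketched in a few lines of Python. This is only an illustration of the idea: the element and attribute names (`request`, `parameter`, `header`) and the example URL are hypothetical and are not Retailer's actual request schema, and the XSLT step itself is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qsl


def request_to_xml(method, url, headers):
    """Serialize an HTTP request as an XML document (hypothetical schema)."""
    parts = urlsplit(url)
    root = ET.Element("request", {"method": method, "path": parts.path})
    # One <parameter> element per query-string pair, in order of appearance.
    for name, value in parse_qsl(parts.query):
        ET.SubElement(root, "parameter", {"name": name}).text = value
    # One <header> element per request header.
    for name, value in headers.items():
        ET.SubElement(root, "header", {"name": name}).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical OAI-PMH style request; the host and path are made up.
doc = request_to_xml(
    "GET",
    "http://localhost:8080/retailer/oai?verb=ListRecords&metadataPrefix=oai_dc",
    {"Accept": "text/xml"},
)
print(doc)
```

A stylesheet would then match on elements like these to decide which upstream API call to make and how to reshape the response.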

Package Type: Discovery Interface
License: Apache 2.0
Development Status: Production/Stable
Operating System: Linux, Mac, Windows
Technologies Used: OAI, Tomcat, XSLT
Programming Language: Java

Open Knowledge Foundation: Joint Submission to UN Data Revolution Group

planet code4lib - Thu, 2014-10-16 11:12

The following is the joint Submission to the UN Secretary General’s Independent Expert Advisory Group on a Data Revolution from the World Wide Web Foundation, Open Knowledge, Fundar and the Open Institute, October 15, 2014. It derives from and builds on the Global Open Data Initiative’s Declaration on Open Data.

To the UN Secretary General’s Independent Expert Advisory Group on a Data Revolution

Societies cannot develop in a fair, just and sustainable manner unless citizens are able to hold governments and other powerful actors to account, and participate in the decisions fundamentally affecting their well-being. Accountability and participation, in turn, are meaningless unless citizens know what their government is doing, and can freely access government data and information, share that information with other citizens, and act on it when necessary.

A true “revolution” through data will be one that enables all of us to hold our governments accountable for fulfilling their obligations, and to play an informed and active role in decisions fundamentally affecting our well-being.

We believe such a revolution requires ambitious commitments to make data open; to invest in the ability of all stakeholders to use data effectively; and to protect the rights to information, free expression, free association and privacy, without which data-driven accountability will wither on the vine.

In addition, opening up government data creates new opportunities for SMEs and entrepreneurs, drives improved efficiency and service delivery innovation within government, and advances scientific progress. The initial costs (including any lost revenue from licenses and access charges) will be repaid many times over by the growth of knowledge and innovative data-driven businesses and services that create jobs, deliver social value and boost GDP.

The Sustainable Development Goals should include measurable, time-bound steps to:

1. Make data open by default

Government data should be open by default, and this principle should ultimately be entrenched in law. Open means that data should be freely available for use, reuse and redistribution by anyone for any purpose and should be provided in a machine-readable form (specifically it should be open data as defined by the Open Definition and in line with the 10 Open Data Principles).

  • Government information management (including procurement requirements and research funding, IT management, and the design of new laws, policies and procedures) should be reformed as necessary to ensure that such systems have built-in features ensuring that open data can be released without additional effort.
  • Non-compliance, or poor data quality, should not be used as an excuse for non-publication of existing data.
  • Governments should adopt flexible intellectual property and copyright policies that encourage unrestricted public reuse and analysis of government data.
2. Put accountability at the core of the data revolution

A data revolution requires more than selective release of the datasets that are easiest or most comfortable for governments to open. It should empower citizens to hold government accountable for the performance of its core functions and obligations. However, research by the Web Foundation and Open Knowledge shows that critical accountability data such as company registers, land records, and government contracts are least likely to be freely available to the public.

At a minimum, governments endorsing the SDGs should commit to the open release by 2018 of all datasets that are fundamental to citizen-state accountability. This should include:

  • data on public revenues, budgets and expenditure;
  • who owns and benefits from companies, charities and trusts;
  • who exercises what rights over key natural resources (land records, mineral licenses, forest concessions etc) and on what terms;
  • public procurement records and government contracts;
  • office holders, elected and un-elected and their declared financial interests and details of campaign contributions;
  • public services, especially health and education: who is in charge, responsible, how they are funded, and data that can be used to assess their performance;
  • constitution, laws, and records of debates by elected representatives;
  • crime data, especially those related to human rights violations such as forced disappearance and human trafficking;
  • census data;
  • the national map and other essential geodata.

  • Governments should create comprehensive indices of existing government data sets, whether published or not, as a foundation for new transparency policies, to empower public scrutiny of information management, and to enable policymakers to identify gaps in existing data creation and collection.
3. Provide no-cost access to government data

One of the greatest barriers to access to ostensibly publicly-available information is the cost imposed on the public for access–even when the cost is minimal. Most government information is collected for governmental purposes, and the existence of user fees has little to no effect on whether the government gathers the data in the first place.

  • Governments should remove fees for access, which skew the pool of who is willing (or able) to access information and preclude transformative uses of the data that in turn generate business growth and tax revenues.

  • Governments should also minimise the indirect cost of using and re-using data by adopting commonly owned, non-proprietary (or “open”) formats that allow potential users to access the data without the need to pay for a proprietary software license.

  • Such open formats and standards should be commonly adopted across departments and agencies to harmonise the way information is published, reducing the transaction costs of accessing, using and combining data.

4. Put the users first

Experience shows that open data flounders without a strong user community, and the best way to build such a community is by involving users from the very start in designing and developing open data systems.

  • Within government: The different branches of government (including the legislature and judiciary, as well as different agencies and line ministries within the executive) stand to gain important benefits from sharing and combining their data. Successful open data initiatives create buy-in and cultural change within government by establishing cross-departmental working groups or other structures that allow officials the space they need to create reliable, permanent, ambitious open data policies.

  • Beyond government: Civil society groups and businesses should be considered equal stakeholders alongside internal government actors. Agencies leading on open data should involve and consult these stakeholders – including technologists, journalists, NGOs, legislators, other governments, academics and researchers, private industry, and independent members of the public – at every stage in the process.

  • Stakeholders both inside and outside government should be fully involved in identifying priority datasets and designing related initiatives that can help to address key social or economic problems, foster entrepreneurship and create jobs. Government should support and facilitate the critical role of both private sector and public service intermediaries in making data useful.

5. Invest in capacity

Governments should start with initiatives and requirements that are appropriate to their own current capacity to create and release credible data, and that complement the current capacity of key stakeholders to analyze and reuse it. At the same time, in order to unlock the full social, political and economic benefits of open data, all stakeholders should invest in rapidly broadening and deepening capacity.

  • Governments and their development partners need to invest in making data simple to navigate and understand, available in all national languages, and accessible through appropriate channels such as mobile phone platforms where appropriate.

  • Governments and their development partners should support training for officials, SMEs and CSOs to tackle lack of data and web skills, and should make complementary investments in improving the quality and timeliness of government statistics.

6. Improve the quality of official data

Poor quality, coverage and timeliness of government information – including administrative and sectoral data, geospatial data, and survey data – is a major barrier to unlocking the full value of open data.

  • Governments should develop plans to implement the Paris21 2011 Busan Action Plan, which calls for increased resources for statistical and information systems, tackling important gaps and weaknesses (including the lack of gender disaggregation in key datasets), and fully integrating statistics into decision-making.

  • Governments should bring their statistical efforts into line with international data standards and schemas, to facilitate reuse and analysis across various jurisdictions.

  • Private firms and NGOs that collect data which could be used alongside government statistics to solve public problems in areas such as disease control, disaster relief, urban planning, etc. should enter into partnerships to make this data available to government agencies and the public without charge, in fully anonymized form and subject to robust privacy protections.

7. Foster more accountable, transparent and participatory governance

A data revolution cannot succeed in an environment of secrecy, fear and repression of dissent.

  • The SDGs should include robust commitments to uphold fundamental rights to freedom of expression, information and association; foster independent and diverse media; and implement robust safeguards for personal privacy, as outlined in the UN Covenant on Civil and Political Rights.

  • In addition, in line with their commitments in the UN Millennium Declaration (2000) and the Declaration of the Open Government Partnership (2011), the SDGs should include concrete steps to tackle gaps in participation, inclusion, integrity and transparency in governance, creating momentum and legitimacy for reform through public dialogue and consensus.


This submission derives and follows on from the Global Open Data Initiative’s Global Open Data Declaration which was jointly created by Fundar, Open Institute, Open Knowledge and World Wide Web Foundation and the Sunlight Foundation with input from civil society organizations around the world.

The full text of the Declaration can be found here:

Eric Hellman: Adobe, Privacy and the Big Yellow Taxi

planet code4lib - Thu, 2014-10-16 03:04
Here's the most important thing to understand about privacy on the Internet: Google doesn't know your password. The FBI can't march into Sergey Brin's office and threaten to put him in jail unless he tells them your password (if it thinks you're making WMDs). Because it wouldn't do them any good. If Google could produce your password, it would be a sign either of gross incompetence or of the ill-considered choice of your cat's name, "mittens", as your password.

Because Google's engineers are at least moderately competent, they don't store your password anywhere.  Instead, they salt it and hash it. The next time they ask you for your password, they salt it and hash it again and see if the result is the same as the hash they've saved. It would be easier for Jimmy Dean to make a pig from a sausage than it would be to get the password from its hash. And that's how the privacy of your password is constructed.
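Here's a minimal sketch of that salt-and-hash check in Python, using only the standard library. It is not a description of what Google actually runs: the choice of PBKDF2 with SHA-256, the 16-byte salt, and the iteration count are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


def hash_password(password, salt=None):
    """Return (salt, digest); only these two values would be stored."""
    if salt is None:
        # A fresh random salt per password (16 bytes is an assumed size).
        salt = secrets.token_bytes(16)
    # PBKDF2 is a stand-in here; the iteration count is illustrative.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_password(password, salt, stored_digest):
    """Salt and hash the attempt again, then compare with the saved hash."""
    _, candidate = hash_password(password, salt)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, stored_digest)


salt, stored = hash_password("mittens")
print(verify_password("mittens", salt, stored))
print(verify_password("Mittens", salt, stored))
```

The server keeps `salt` and `stored` but never the password itself, so there is nothing to hand over beyond a hash that can't practically be reversed.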

Using similar techniques, Apple is able to build strong privacy into the latest version of iOS, and despite short-sighted espio-nostalgia from the likes of James Comey, strong privacy is both essential and achievable for many types of data. I would include reading data in that category. Comey's arguments could easily apply to ebook reading data. After all, libraries have books on explosives, radical ideologies, and civil disobedience. But that doesn't mean that our reading lists should be available to the FBI and the NSA.

Here's the real tragedy: "we take your privacy seriously" has become a punch line. Companies that take care to construct privacy using the tools of modern software engineering and strong encryption aren't taken seriously. The language of privacy has been perverted by lawyers who "take privacy seriously" by crafting privacy policies that allow their clients to do pretty much anything with your data.

CC BY bevgoodin

Which brings me to the second most important thing to understand about privacy on the Internet. Don't it always seem to go that you don't know what you've got till it's gone? (I call this the Big Yellow Taxi principle.)

Think about it. The only way you know if a website is being careless with your password is if it gets stolen, or they send it to you in an email. If any website sends you your password by email, make sure that website has no sensitive information of yours because it's being run by incompetents. Then make sure you're not using that password anywhere else and if you are, change it.

Failing gross incompetence, it's very difficult for us to know if a website or a piece of software has carefully constructed privacy, or whether it's piping everything you do to a server in Kansas. Last week's revelations about Adobe Digital Editions (ADE4) were an example of such gross incompetence, and yes, ADE4 tries to send a message to a server in Kansas every time you turn an ebook page. Much outrage has been directed at Adobe over the fact that the messages were being sent in the clear. Somehow people are less upset at the real outrage: the complete absence of privacy engineering in the messages being sent.

The response of Adobe's PR flacks to the brouhaha is so profoundly sad. They're promising to release a software patch that will make their spying more secret.

Now I'm going to confuse you. By all accounts, Adobe's DRM infrastructure (called ACS) is actually very well engineered to protect a user's privacy. It provides for features such as anonymous activation and delegated authentication so that, for example, you can borrow an ACS-protected library ebook through Overdrive without Adobe having any possibility of knowing who you are. Because the privacy has been engineered into the system, when you borrow a library ebook, you don't have to trust that Adobe is benevolently concerned for your privacy.

Yesterday, I talked with Micah Bowers, CEO of Bluefire, a small company doing a nice (and important) niche business in the Adobe rights management ecosystem. They make the Bluefire Reader App, which they license to other companies who rebrand it and use it for their own bookstores. He is confident that the Adobe ACS infrastructure they use is not implicated at all by the recently revealed privacy breaches. I had reached out to Bowers because I wanted to confirm that ebook sync systems could be built without giving away user privacy. I had speculated that the reason Adobe Digital Editions was phoning home with user reading data was part of an unfinished ebook sync system. "Unfinished" because ADE4 doesn't do any syncing. It's also possible that reading data is being sent to enable business models similar to Amazon's "Kindle Unlimited", which pays authors when a reader has read a defined fraction of the book.

For Bluefire (and the "white label" apps based on Bluefire), ebook syncing is a feature that works NOW. If you read through chapter 5 of a book on your iPhone, the Bluefire Reader on your iPad will know. Bluefire users have to opt in to this syncing and can turn it off with a single button push, even after they've opted in. But even if they've opted in, Bluefire doesn't know what books they're reading. If the FBI wants a list of people reading a particular book, Bluefire probably doesn't have the ability to say who's reading the books. Of course, the sync data is encrypted when transmitted and stored. They've engineered their system to preserve privacy, the same way Google doesn't know your password, and Apple can't decrypt your iPhone data. Maybe the FBI and the NSA can get past their engineering, but maybe they can't, and maybe it would be too much trouble.

To some extent, you have to trust what Bluefire says, but I asked Bowers some pointed questions about ways to evade their privacy cloaking, and it was clear to me from his answers that his team had considered these attacks.  Bluefire doesn't send or receive any reading data to or from Adobe.

For now, Bluefire and other ebook reading apps that use Adobe's ACS (including Aldiko, Nook, apps from Overdrive and 3M) are not affected by the ADE privacy breach. I'm convinced from talking to Bowers that the Bluefire sync system is engineered to keep reading private. But the Big Yellow Taxi principle applies to all of these. It's very hard for consumers to tell a well engineered system from a shoddy hack until there's been a breach and then it's too late.

Perhaps this is where the library community needs to forcefully step in. Privacy audits and 3rd party code review should be required for any application or website that purports to "Take privacy seriously" when library records privacy laws are in play.

Or we could pave over the libraries and put up some parking lots.

DuraSpace News: WATCH NOW: Fedora 4 Training

planet code4lib - Thu, 2014-10-16 00:00
Winchester, MA  Have you been wondering about how Fedora 4 features will work for your organization? Two Fedora 4 training videos are now available on YouTube [1][2] to watch at your convenience that will provide you with answers and how-tos.  

DuraSpace News: Recording Available: “Fedora 4.0 in Action at The Art Institute of Chicago and UCSD”

planet code4lib - Thu, 2014-10-16 00:00

On October 15th DuraSpace presented “Fedora 4.0 in Action at The Art Institute of Chicago and UCSD.”  This webinar was the first in the ninth Hot Topics Community Webinar series, “Early Advantage: Introducing New Fedora 4.0 Repositories.”

Jason Ronallo: HTML Slide Decks With Synchronized and Interactive Audience Notes Using WebSockets

planet code4lib - Wed, 2014-10-15 21:10

One question I got asked after giving my Code4Lib presentation on WebSockets was how I created my slides. I’ve written about how I create HTML slides before, but this time I added some new features like an audience interface that synchronizes automatically with the slides and allows for audience participation.

TL;DR I’ve open sourced starterdeck-node for creating synchronized and interactive HTML slide decks.

Not every time that I give a presentation am I able to use the technologies that I am talking about within the presentation itself, so I like to do it when I can. I write my slide decks as Markdown and convert them with Pandoc to HTML slides which use DZSlides for slide sizing and animations. I use a browser to present the slides. Working this way with HTML has allowed me to do things like embed HTML5 video into a presentation on HTML5 video and show examples of the JavaScript API and how videos can be styled with CSS.

For a presentation on WebSockets I gave at Code4Lib 2014, I wanted to provide another example from within the presentation itself of what you can do with WebSockets. If you have the slides and the audience notes handout page open at the same time, you will see how they are synchronized. (Beware slowness as it is a large self-contained HTML download using data URIs.) When you change to certain slides in the presenter view, new content is revealed in the audience view. Because the slides are just an HTML page, it is possible to make the slides more interactive. WebSockets are used to allow the slides to send messages to each audience members’ browser and reveal notes. I am never able to say everything that I would want to in one short 20 minute talk, so this provided me a way to give the audience some supplementary material.
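The synchronization boils down to a broadcast: the presenter's page announces a slide change and a server relays it to every connected audience browser. Here is a rough in-memory simulation of that pattern in Python. It does not open real WebSocket connections and it is not how starterdeck-node is written; `SlideHub`, the message shape, and the `asyncio.Queue` objects standing in for client connections are all made up for illustration.

```python
import asyncio
import json


class SlideHub:
    """In-memory stand-in for a WebSocket server that relays slide changes."""

    def __init__(self):
        # One queue per audience member, standing in for an open connection.
        self.audience = set()

    def join(self):
        queue = asyncio.Queue()
        self.audience.add(queue)
        return queue

    async def broadcast(self, slide_id):
        # The presenter view sends one message; everyone connected receives it.
        message = json.dumps({"type": "slide", "id": slide_id})
        for queue in self.audience:
            await queue.put(message)


async def demo():
    hub = SlideHub()
    alice, bob = hub.join(), hub.join()
    await hub.broadcast("intro")
    # Each audience page would reveal the notes that match the slide id.
    return [json.loads(await q.get()) for q in (alice, bob)]


received = asyncio.run(demo())
print(received)
```

With a real WebSocket server the queues become open sockets, and the audience page's message handler shows or hides notes based on the slide id it receives.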

Within the slides I even included a simplistic chat application that allowed the audience to send messages directly to the presenter slides. (Every talk on WebSockets needs a gratuitous chat application.) At the end of the talk I also accepted questions from the audience via an input field. The questions were then delivered to the slides via WebSockets and displayed right within a slide using a little JavaScript. What I like most about this is that even someone who did not feel confident enough to step up to a microphone would have the opportunity to ask an anonymous question. And I even got a few legitimate questions amongst the requests for me to dance.

Another nice side benefit of getting the audience to the notes page before the presentation starts is that you can include your contact information and Twitter handle on the page.

I have wrapped up all this functionality for creating interactive slide decks into a project called starterdeck-node. It includes the WebSocket server and a simple starting point for creating your own slides. It strings together a bunch of different tools to make creating and deploying slide decks like this simpler so you’ll need to look at the requirements. This is still definitely just a tool for hackers, but having this scaffolding in place ought to make the next slide deck easier to create.

Here’s a video where I show starterdeck-node at work. Slides on the left; audience notes on the right.

Other Features

While the new exciting feature added in this version of the project is synchronization between presenter slides and audience notes, there are also lots of other great features if you want to create HTML slide decks. Even if you aren’t going to use the synchronization feature, there are still lots of reasons why you might want to create your HTML slides with starterdeck-node.

Self-contained HTML. Pandoc uses data URIs so that the HTML version of your slides has no external dependencies. Everything, including images, video, JavaScript, CSS, and fonts, is embedded within a single HTML document. That means that even if there’s no internet connection from the podium you’ll still be able to deliver your presentation.

Onstage view. Part of what gets built is a DZSlides onstage view where the presenter can see the current slide, next slide, speaker notes, and current time.

Single page view. This view is a self-contained, single-page layout version of the slides and speaker notes. This is a much nicer way to read a presentation than just flipping through the slides on various slide sharing sites. If you put a lot of work into your talk and are writing speaker notes, this is a great way to reuse them.

PDF backup. A script is included to create a PDF backup of your presentation. Sometimes you have to use the computer at the podium and it has an old version of IE on it. PDF backup to the rescue. While you won’t get all the features of the HTML presentation you’re still in business. The included Node.js app provides a server so that a headless browser can take screenshots of each slide. These screenshots are then compiled into the PDF.
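To make the self-contained HTML feature above a bit more concrete: a data URI is just the file's bytes, base64-encoded and prefixed with a MIME type, placed where a URL would normally go. The short Python sketch below shows the idea by hand. Pandoc does this for you, so the `to_data_uri` helper and the tiny fake image are purely illustrative.

```python
import base64
import mimetypes


def to_data_uri(data, filename):
    """Build a base64 data URI for some bytes, guessing the type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Just the six-byte GIF signature, not a complete image; enough to show the shape.
uri = to_data_uri(b"GIF89a", "pixel.gif")
img_tag = f'<img src="{uri}" alt="embedded image">'
print(img_tag)
```

Base64 inflates each file by about a third, which is why a deck with many images or video becomes a large single download.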


I’d love to hear from anyone who tries to use it. I’ll list any examples I hear about below.

Here are some examples of slide decks that have used starterdeck-node or starterdeck.

Jenny Rose Halperin: New /contribute page

planet code4lib - Wed, 2014-10-15 19:58

In an uncharacteristically short post, I want to let folks know that we just launched our new /contribute page.

I am so proud of our team! Thank you to Jess, Ben, Larissa, Jen, Rebecca, Mike, Pascal, Flod, Holly, Sean, David, Maryellen, Craig, PMac, Matej, and everyone else who had a hand. You all are the absolute most wonderful people to work with and I look forward to seeing what comes next!

I’ll be posting intermittently about new features and challenges on the site, but I first want to give a big virtual hug to all of you who made it happen and all of you who contribute to Mozilla in the future.

LITA: Jobs in Information Technology: October 15

planet code4lib - Wed, 2014-10-15 17:55

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Assistant Coordinator, Stacks and Circulation, Colorado State University, Fort Collins, CO

Digital Archivist, University of Georgia Libraries, Athens, GA

Metadata Systems Specialist, NYU, Division of Libraries, New York City, NY

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Roy Tennant: The Great Plateau

planet code4lib - Wed, 2014-10-15 15:37

I had what you might call an unusual early adulthood. Whereas most young adults march off to college and garner the degree that will define their life, I dropped out of high school at the 8th grade, attended an alternative high school (read dope-smoking, although I passed at the time) for two years, then dropped out entirely. The story is long, but I helped to build two dome homes in Indiana, built and slept in a treehouse through an Indiana winter, and returned to California where I had been mostly raised, two weeks after I turned 18, with not much more than bus fare and a duffle bag.

From there I built my own life, on my own terms, which meant (oddly enough, although there are reasons if you cared to ask) a job at the local community college library in the foothills of the Sierra Nevada Mountains and a life in the outdoors, which had always beckoned.

This is all background for the point I want to make. In the end, I paused before seriously attending college for about seven years. I dabbled in courses, I learned to run rivers and many other things. And that made all of the difference.

In the end, what made the difference was the timing. Had I entered college when I should have (in 1975), that would have been too early for the computer revolution. As it was, I entered college exactly with the computer revolution. I remember writing my first software program just as I was getting serious about pursuing my college education in the early 1980s, on a Commodore PET computer. My fate was sealed, and I didn’t even realize it.

Later, at Humboldt State University where I majored in Geography and minored in Computer Science, I wrote programs in FORTRAN to process rainfall data for my Geography professor. From there, I jumped on every single computer and network opportunity there was to be had.

I was an early and enthusiastic adopter (and proselytizer in the various organizations where I found work) for the Macintosh computer. I still was, when I joined OCLC seven years ago and broke the Microsoft stranglehold that still existed.

I was an operator of an early automated circulation system (CLSI) at Humboldt State. And not long after that, I co-wrote the first book about the Internet aimed at librarians.

So I am here to tell you that after a career of being on the cutting edge, the cutting edge doesn’t seem so cutting anymore. We seem to have reached, in libraries and I would argue in society more generally, a technical plateau. We might see innovation around the edges, but there is nothing I can point to that is truly transformative like the Internet was.

This is not necessarily a problem. In fact, systemic, major change can be downright painful. Believe me, I lived it in trying to make others understand how transformative it would be when few actually wanted to hear it. But for someone like me who counted his salad days as finding and pursuing the next truly transformative technology, this feels like a desert. Well, call it a plateau.

A long straight stretch without much struggle, or altitude gain, or major benefit. It is what it is. But you will have to forgive me if I regret the days when massive change was obvious, and surprising, and massively enabling.

Photo of the Tonto Plateau, Grand Canyon National Park, by Roy Tennant.

David Rosenthal: The Internet of Things

planet code4lib - Wed, 2014-10-15 15:00
In 1996, my friend Steven McGeady gave a fascinating and rather prophetic keynote address to the Harvard Conference on the Internet and Society. In his introduction, Steven said:
I was worried about speaking here, but I'm even more worried about some of the pronouncements that I have heard over the last few days, ... about the future of the Internet. I am worried about pronouncements of the sort: "In the future, we will do electronic banking at virtual ATMs!," "In the future, my car will have an IP address!," "In the future, I'll be able to get all the old I Love Lucy reruns - over the Internet!" or "In the future, everyone will be a Java programmer!"

This is bunk. I'm worried that our imagination about the way that the 'Net changes our lives, our work and our society is limited to taking current institutions and dialling them forward - the "more, better" school of vision for the future.

I have the same worries that Steven did about discussions of the Internet of Things that looms so large in our future. They focus on the incidental effects, not on the fundamental changes. Barry Ritholtz points me to a post by Jon Evans at TechCrunch entitled The Internet of Someone Else's Things that is an exception. Jon points out that the idea that you own the Smart Things you buy is obsolete:
They say “possession is nine-tenths of the law,” but even if you physically and legally own a Smart Thing, you won’t actually control it. Ownership will become a three-legged stool: who physically owns a thing; who legally owns it; …and who has the ultimate power to command it. Who, in short, has root.

What does this have to do with digital preservation? Follow me below the fold.

On a smaller scale than the Internet of Things (IoT), we already have at least two precursors that demonstrate some of the problems of connecting to the Internet huge numbers of devices over which consumers don't have "root" (administrative control). The first is mobile phones. As Jon says:
Your phone probably has three separate computers in it (processor, baseband processor, and SIM card) and you almost certainly don’t have root on any of them, which is why some people refer to phones as “tracking devices which make phone calls.”

The second is home broadband routers. My friend Jim Gettys points me to a short piece by Vint Cerf entitled Bufferbloat and Other Internet Challenges that takes off from Jim's work on these routers. Vint concludes:
I hope it’s apparent that these disparate topics are linked by the need to find a path toward adapting Internet-based devices to change, and improved safety. Internet users will benefit from the discovery or invention of such a path, and it’s thus worthy of further serious research.

Jim got sucked into working on these problems when, back in 2010, he got fed up with persistent network performance problems on his home's broadband internet service, and did some serious diagnostic work. You can follow the whole story, which continues, on his blog. But the short, vastly over-simplified version is that he discovered that Moore's Law had converted a potential problem with TCP first described in 1985 into a nightmare.

Back then, the idea of a packet switch with effectively infinite buffer storage was purely theoretical. A quarter of a century later, RAM was so cheap that even home broadband routers had packet buffers so large as to be almost infinite. TCP depends on dropping packets to signal that a link is congested. Very large buffers mean packets don't get dropped, so the sender never finds out that some link is congested, so it never slows down. Jim called this phenomenon "bufferbloat", and started a crusade to eliminate it. In less than two years, Kathleen Nichols and Van Jacobson working with Jim and others had a software fix to the TCP/IP stack, called CoDel.
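
The mechanism described above can be sketched with a toy model in Python. This is not real TCP and not CoDel; it is a deliberately simplified simulation in which a sender raises its rate by one packet per tick until a drop occurs and then halves it. The function name `simulate` and all parameter values are assumptions chosen for illustration.

```python
def simulate(buffer_size, capacity=10, ticks=200):
    """Toy model of one sender feeding a bottleneck queue.

    Each tick the sender offers `rate` packets and the link drains
    `capacity` packets. A drop (queue overflow) is the only congestion
    signal: the sender halves its rate on a drop and otherwise keeps
    increasing it. Returns (worst queueing delay in ticks, packets dropped).
    """
    rate = 1
    queue = 0
    max_delay = 0.0
    drops = 0
    for _ in range(ticks):
        queue += rate
        if queue > buffer_size:
            # Buffer overflow: excess packets are lost and the sender backs off.
            drops += queue - buffer_size
            queue = buffer_size
            rate = max(1, rate // 2)
        else:
            # No loss seen, so the sender assumes there is spare capacity.
            rate += 1
        queue = max(0, queue - capacity)
        # A packet arriving now waits behind `queue` packets.
        max_delay = max(max_delay, queue / capacity)
    return max_delay, drops


small_delay, small_drops = simulate(buffer_size=20)
big_delay, big_drops = simulate(buffer_size=100_000)
print("small buffer:", small_delay, "ticks worst delay,", small_drops, "drops")
print("huge buffer: ", big_delay, "ticks worst delay,", big_drops, "drops")
```

In this model the small buffer drops packets early, so the sender backs off and the queueing delay stays bounded. The huge buffer never drops anything, so the sender never slows down and the delay keeps growing for as long as the simulation runs.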

CoDel isn't a complete solution; further work has produced even more fixes, but it makes a huge difference. Problem solved, right? All we needed to do was to deploy CoDel everywhere in the Internet that managed a packet buffer, which is every piece of hardware connected to it. This meant convincing every vendor of an internet-connected device that they needed to adopt and deploy CoDel not just in new products they were going to ship, but in all the products that they had ever shipped that were still connected to the Internet.

For major vendors such as Cisco this was hard, but for vendors of consumer devices, including even Cisco's Linksys division, it was simply impossible. There is no way for Linksys to push updates of the software to their installed base. Worse, many networking chips implement on-chip packet buffering; their buffer management algorithms are probably both unknowable and unalterable. So even though there is a pretty good fix for bufferbloat that, if deployed, would be a major improvement to Internet performance, we will have to wait for much of the hardware at the edge of the Internet to be replaced before we can get the benefit.

We know that the Smart Things the IoT is made of are full of software. That's what makes them smart. Software has bugs and performance problems like the ones Jim found. More importantly it has vulnerabilities that allow the bad guys to compromise the systems running it. Botnets assembled from hundreds of thousands of compromised home routers have been around from at least 2009 to the present. Other current examples include the Brazilian banking malware that hijacks home routers' DNS settings, and the Moon worm that is scanning the Internet for vulnerable Linksys routers (who do you think would want to do that?). It isn't just routers that are affected. For example, network storage boxes have been hijacked to mine $620K worth of Dogecoin, and (PDF):
HP Security Research reviewed 10 of the most popular devices in some of the most common IoT niches revealing an alarmingly high average number of vulnerabilities per device. Vulnerabilities ranged from Heartbleed to Denial of Service to weak passwords to cross-site scripting.

Just as with bufferbloat, it's essentially impossible to eliminate the vulnerabilities that enable these bad guys. It hasn't been economic for low-cost consumer product vendors to provide the kind of automatic or user-approved updates that PC and smartphone systems now routinely provide; the costs of the bad guys' attacks are borne by the consumer. It is only fair to mention that there are some exceptions. The Nest smoke detector can be updated remotely; Google did this when it was discovered that it might disable itself instead of reporting a fire. Not, as Vint points out, that the remote update systems have proven adequately trustworthy:
Digital signatures and certificates authenticating software’s origin have proven only partly successful owing to the potential for fabricating false but apparently valid certificates by compromising certificate authorities one way or another.

See, for an early example, the Flame malware. Further, as Jon points out:
When you buy a Smart Thing, you get locked into its software ecosystem, which is controlled by its manufacturer, whether you like it or not.

Even valid updates are in the vendor's interest, which may not be yours.

This will be the case for the Smart Things in the IoT too. The IoT will be a swamp of malware. In Charles Stross' 2011 novel Rule 34 many of the deaths Detective Inspector Liz Cavanaugh investigates are caused by malware-infested home appliances; you can't say you weren't warned of the risks of the IoT. Jim has a recent blog post about this problem, with links to pieces he inspired by Bruce Schneier and Dan Geer. All three are must-reads.

This whole problem is another example of a topic I've often blogged about, the short-term thinking that pervades society and makes investing, or even planning, now to reap benefits or avoid disasters in the future so hard. In this case, the disaster is already starting to happen.

Finally, why is this relevant to digital preservation? I've written frequently about the really encouraging progress being made in delivering emulation in browsers and as a cloud service in ways that make running really old software transparent. This solves a major problem in digital preservation that has been evident since Jeff Rothenberg's seminal 1995 article.

Unfortunately, the really old software that will be really easy for everyone to run will have all the same bugs and vulnerabilities it had when it was new. Because old vulnerabilities, especially in consumer products, don't go away with time, attempts to exploit really old vulnerabilities don't go away either. And we can't fix the really old software to make the bugs and vulnerabilities go away, because the whole point of emulation is to run the really old software exactly the way it used to be. So the emulated system will be really, really vulnerable and it will be attacked. How are we going to limit the damage from these vulnerabilities?

