Feed aggregator

FOSS4Lib Upcoming Events: DC Area Fedora User Group Meeting

planet code4lib - Wed, 2015-01-21 16:40
Date: Tuesday, March 31, 2015 - 08:00 to Wednesday, April 1, 2015 - 17:00
Supports: Fedora Repository, Islandora, Hydra

Last updated January 21, 2015. Created by Peter Murray on January 21, 2015.

The next DC Area Fedora User Group Meeting will be held on Tuesday, March 31 and Wednesday, April 1 at the USDA National Agriculture Library. Please register in advance (registration is free) by completing this brief form:

David Rosenthal: New Yorker on Web Archiving

planet code4lib - Wed, 2015-01-21 16:00
Do not hesitate, do not pass Go, right now please read Jill Lepore's really excellent New Yorker article Cobweb: can the Web be archived?

FOSS4Lib Recent Releases: DSpace - 5.0

planet code4lib - Wed, 2015-01-21 15:54
Package: DSpace
Release Date: Tuesday, January 20, 2015

Last updated January 21, 2015. Created by Peter Murray on January 21, 2015.

From the release announcement:

With a new, modern look and feel for every device, the ability to auto-upgrade from older versions of DSpace, to batch import content and more, the release of DSpace 5 offers its far-flung global community of developers and stakeholders an even easier-to-use and more efficient institutional repository solution.

LITA: Amazon Echo

planet code4lib - Wed, 2015-01-21 13:00

Have you read about Amazon Echo? It is a new consumer product from Amazon: users can ask it questions and receive answers, tell it to play music, ask it to add items to a shopping or to-do list, and so on.

I first saw a video about it in October and quickly signed up to receive an invitation to purchase the product. I received my invitation this month and Echo should arrive at my house in May.

I’m pretty excited about it for a few reasons. First, Amazon is letting people develop for it. I’m already brainstorming ways the product can be used in both my home and office.

Second, I can’t wait to be able to talk to a device without having to push a button. The reviews for the voice recognition aren’t perfect, but they seem really good for a first launch.

Finally, I’m also really interested in it as an information retrieval tool. I don’t claim to be able to predict the future, but I think devices like Echo will be a new way that people access information. It seems like a logical next step.

This only emphasizes how important it is for people to understand their information needs, to understand the biases associated with information retrieval tools (to answer questions, Echo consults Wikipedia, a few other databases, and a Bing search), and the enormous role that algorithms are going to play in the future. Algorithms already play a big role in how people retrieve information. With tools that simply tell people the answer to a question, users won't even see other options. They will be told one answer.

Image Courtesy of Flickr user jm_escalante CC BY-NC-SA 2.0

I’d love to chat about Echo. Do you have ideas for how to use it?

District Dispatch: What does the new Congress mean for libraries?

planet code4lib - Wed, 2015-01-21 06:59

A panel of experts from the ranks of politics, academia and the press will explore the implications of the November mid-term Congressional elections for America, libraries and library advocacy at the 2015 American Library Association (ALA) Midwinter Meeting in Chicago. ALA invited U.S. Senator and Democratic Majority Whip Richard Durbin to keynote the conference session.

The session, titled “Whither Washington?: The 2014 Election and What it Means for Libraries,” takes place from 8:30–10:00a.m. on Saturday, January 31, 2015, in the McCormick Convention Center, room W183A. With critical bills to reauthorize federal library funding, efforts to reform key privacy and surveillance statutes, and changes to copyright law all likely to be on legislators’ plates, libraries will engage heavily with the newly-elected 114th Congress.

Speakers include J. Mark Hansen, professor for the Department of Political Science at the University of Chicago and Thomas Susman, director of government affairs for the American Bar Association.

View other ALA Washington Office Midwinter Meeting conference sessions

The post What does the new Congress mean for libraries? appeared first on District Dispatch.

John Miedema: The author slip selects and boosts words for questioning unread content

planet code4lib - Wed, 2015-01-21 02:58

Lila uses author slips to “question” a collection of unread articles and books, suggesting “answers” or responses that extend the author’s material. The term, question, is appropriate because Lila uses natural language processing to enhance search. The application of natural language is shown here in three ways.

1. Distill the focus of the author slip

Perhaps the most important step is to decide which words are the most meaningful for questioning unread content. The design of the slip provides the necessary structures for making this decision, as shown in the figure:

An algorithm could use these design features to group keywords and calculate relative weights for use in searching, as shown in the table:

  • Figure 4, Content: weight of 1 for each uncommon word, increased by 1 for each occurrence, so static and dynamic add up to 3 each. Words: static (3), dynamic (3), quality (1), scientific (1), knowledge (1), cave (1), political (1), institutions (1), centuries (1), king (1), constitution (1), destroying (1), government (1).
  • Figure 3, Subject Line: weight of 2, twice that of Content; stop words removed. Words: pencil (2), mightier (2), pen (2).
  • Figure 2, Tags: weight of 4, twice that of Subject Line. Words: staticVsDynamic (4).
  • Figure 1, Categories: weight of 8, twice that of Tags. “Quality” appears in both Content and Categories, so the weight for this word could be their sum. Words: quality (8+1=9).

The words can be used as keywords in a natural language query. The weights would be included as boost factors, ranking search results higher if they contain those words.
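The weighting scheme above can be sketched in a few lines. This is only an illustration, not Lila's actual code; the field names and helper function are assumptions:

```python
from collections import Counter

# Relative weights by slip placement, per the table above
FIELD_WEIGHTS = {"content": 1, "subject": 2, "tags": 4, "categories": 8}

def weighted_keywords(slip):
    """slip maps field name -> list of already-filtered words."""
    weights = Counter()
    for field, words in slip.items():
        base = FIELD_WEIGHTS[field]
        if field == "content":
            # weight 1 per occurrence, so repeated words accumulate
            for word in words:
                weights[word] += base
        else:
            for word in set(words):
                weights[word] += base
    return dict(weights)

slip = {
    "content": ["static", "static", "static", "quality"],
    "subject": ["pencil", "mightier", "pen"],
    "tags": ["staticVsDynamic"],
    "categories": ["quality"],
}
print(weighted_keywords(slip))  # "quality" sums to 1 + 8 = 9
```

Run on the example slip, this reproduces the combined weight of 9 for "quality" shown in the table.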

2. Apply other natural language analysis, such as word frequency

In the above table, not all words were selected. In the Subject Line, stop words (e.g., “the”) are removed. This is a common practice in query construction, since stop words are too common to add value. Similarly, in Content, only uncommon words are kept. In this case, word frequency could be calculated using a statistical measure. Words falling below a threshold could be skipped. Word frequency and other linguistic features, such as repetition and word concreteness, will be discussed in detail later on. These steps use knowledge of language to improve search relevance.
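As a small illustration of these filters (the stop-word list, document frequencies, and thresholds are all invented for the example):

```python
# Drop stop words, then keep only words whose document frequency falls
# inside an acceptable band; the values here are made up for illustration.
STOP_WORDS = {"the", "a", "is", "than", "of"}

def filter_words(words, doc_freq, min_df=1, max_df=1000):
    kept = []
    for word in words:
        w = word.lower()
        if w in STOP_WORDS:
            continue
        if min_df <= doc_freq.get(w, 0) <= max_df:
            kept.append(w)
    return kept

doc_freq = {"pen": 40, "mightier": 3, "pencil": 25, "the": 9000}
print(filter_words(["The", "pen", "is", "mightier", "than", "the", "pencil"], doc_freq))
# → ['pen', 'mightier', 'pencil']
```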

3. Take advantage of natural language index configurations

Unread content will be crawled and organized in a natural language index, such as Apache Solr’s Lucene index. An index of this sort can be configured to apply other natural language processing, e.g., synonym matching between queries and documents.
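In a real Lucene/Solr index, synonym matching happens inside the analyzer chain; the effect can be imitated at query time with a simple expansion table. The synonym pairs below are made up, and this is only a toy sketch of the idea:

```python
# Toy query-side synonym expansion; real indexes handle this internally.
SYNONYMS = {"static": ["fixed"], "dynamic": ["changing"]}

def expand(terms):
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))  # add any known synonyms
    return expanded

print(expand(["static", "quality"]))  # → ['static', 'fixed', 'quality']
```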

John Miedema: Eliza, Turing, and Whatson vs Lila. Enlist the cooperation of the human rather than design around a fight.

planet code4lib - Sun, 2015-01-18 16:09

Remember Eliza, the psychotherapist program? Eliza is a computer program written by Joseph Weizenbaum in the mid-sixties and circulated widely in the early days of personal computing. Eliza is modeled on non-directive Rogerian therapy, programmed with a few prompts and a simulation of human understanding by feeding back content from the user. It is an early example of natural language processing. Eliza appears smart as long as the user plays along, but it is not hard to confuse the program. And it has bad grammar. People delight in teasing Eliza.

I am looking forward to seeing the new movie about Alan Turing, The Imitation Game, with Benedict Cumberbatch. Turing is regarded as the father of the computer, and he introduced the Turing Test, a natural language test of machine intelligence. In short, a human asks questions to determine if the hidden respondent is a machine or not. The human is trying to mess with the machine, focused on tripping it up.

‘Whatson’ was my first run at designing a cognitive system. It was designed to be a Question-Answer system for literature. I sensed that a big challenge would be the same one as Eliza or any program faced with the Turing Test. People like Question and Answer systems because it makes life easy. Ask a question, get an answer. It does the work for them. Do a little more than everyone expects and everyone expects a little more. The expectations increase. The questions get trickier. Even if the questioner was trying to help, the clues for finding the answer would often be missing. I would have to design a dialog mechanism for collecting more information. But often the questioner would be deliberately trying to test the intelligence and limits of the system. It’s what we humans do, push systems with the Turing Test. I needed a way to enlist the cooperation of the human user, so that I would not design around a fight.

In 2012 I patented a search technology, “Silent Tagging” (US 8,250,066). The technology solves a problem with social tagging. In the heyday of Web 2.0 people actively tagged content on the open web, as an aid to findability. It works on the open web, but in smaller closed contexts like company intranets, workers are much less likely to tag content. In a small population there are fewer adopters of emerging technology and workers are focused on immediate tasks. Was there a way to benefit from tagging without interrupting an employee’s workflow? I introduced the idea of Silent Tagging. The method associates two things in an employee’s normal workflow: keyword searches and clicks on search results. Keywords are like tags, intelligently selected by a searcher for findability. Clicks on search results follow a small cognitive act, deciding that one search result is better than another. The keyword-click association can be silently captured to adjust rankings of content and benefit other users. The key point here is that human cooperation can be implicitly enlisted in the design.
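A rough sketch of the keyword-click association (names and data structures here are mine, not from the patent):

```python
# Hypothetical sketch of "Silent Tagging": silently associate the keywords
# a user searched with the result they clicked, and use the accumulated
# counts to re-rank later searches for other users.
from collections import defaultdict

click_counts = defaultdict(int)  # (keyword, doc_id) -> number of clicks

def record_click(query, doc_id):
    for keyword in query.lower().split():
        click_counts[(keyword, doc_id)] += 1

def rerank(query, results):
    """Order candidate doc_ids by accumulated keyword-click evidence."""
    terms = query.lower().split()
    def score(doc_id):
        return sum(click_counts[(t, doc_id)] for t in terms)
    return sorted(results, key=score, reverse=True)

record_click("expense report form", "doc42")
record_click("expense report", "doc42")
record_click("expense report", "doc7")
print(rerank("expense report", ["doc7", "doc42", "doc99"]))
# → ['doc42', 'doc7', 'doc99']
```

The point of the sketch is that no explicit tagging step is asked of the worker: the search-and-click behaviour they perform anyway supplies the signal.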

In January of this year I switched gears from Whatson to Lila. Lila is also a cognitive system but its design implicitly enlists human cooperation in natural language processing tasks. Lila is a cognitive writing system, designed to extend human reading, thinking and writing capabilities. The human user is involved in a writing project. In Lila, the author creates content in short units of text called slips. As I have described lately, in Lila, author slips are questions asked of unread content, just like questions in a Question-Answer system. The difference in Lila is that the author’s intent and work is implicitly intelligent, generating slips or questions with high signal and little noise. In one way, Lila is like Eliza, in that both depend on the intelligence of the user. The difference is that in Lila the purpose of the user and the system are implicitly (silently) aligned. No design work is required to convince or negotiate with a user.

Patrick Hochstenbach: Homework assignment #2 Sketchbookskool

planet code4lib - Sun, 2015-01-18 14:28
Filed under: Doodles Tagged: brugge, doodles, urbansketching

William Denton: Setting up Sonic Pi on Ubuntu, with Emacs

planet code4lib - Sun, 2015-01-18 00:39

It’s no trouble to get Sonic Pi going on a Raspberry Pi (Raspbian comes with it), and as I wrote about last week I had great fun with that. But my Raspberry Pi is slow, so it would often choke, and the interface is meant for kids so they can learn to make music and program, not for middle-aged librarians who love Emacs, so I wanted to get it running on my Ubuntu laptop. Here’s how I did it.

I wanted to get away from this and into Emacs.

There’s nothing really new here, but it might save someone some time, because it involved getting JACK working, which is one of those things where you begin carefully documenting everything you do and an hour later you have thirty browser tabs open, three of them to the same mailing list archive showing a message from 2005, and you’ve edited some core system files but you’re sure you’ve forgotten one and don’t have a backup, and then everything works and you don’t want to touch it in case it breaks.

Linux and Unix users should go to the GitHub Sonic Pi repo and follow the generic Linux installation instructions, which is what I did. I run Ubuntu; I had some of the requirements already installed, but not all. Then:

cd /usr/local/src/
git clone
cd sonic-pi/app/server/bin
./compile-extensions
cd ../../gui/qt
./rp-build-app
./rp-app-bin

Interesting lichens.

The application compiled without any trouble, but it didn’t run because jackd wasn’t running. I had to get JACK going. The JACK FAQ helped.

sudo apt-get install qjackctl

qjackctl is a little GUI front end to control JACK. I installed it and ran it and got an error:

JACK is running in realtime mode, but you are not allowed to use realtime scheduling.
Please check your /etc/security/limits.conf for the following line and correct/add it if necessary:

  @audio - rtprio 99

After applying these changes, please re-login in order for them to take effect.
You don't appear to have a sane system configuration. It is very likely that you encounter xruns. Please apply all the above mentioned changes and start jack again!

Editing that file isn’t the right way to do it, though. This is:

sudo apt-get install jackd2
sudo dpkg-reconfigure -p high jackd2

This made /etc/security/limits.d/audio.conf look so:

# Provided by the jackd package.
#
# Changes to this file will be preserved.
#
# If you want to enable/disable realtime permissions, run
#
#    dpkg-reconfigure -p high jackd
@audio - rtprio 95
@audio - memlock unlimited
#@audio - nice -19

Then qjackctl gave me this error:

JACK is running in realtime mode, but you are not allowed to use realtime scheduling.
Your system has an audio group, but you are not a member of it.
Please add yourself to the audio group by executing (as root):

  usermod -a -G audio (null)

After applying these changes, please re-login in order for them to take effect.

Replace “(null)” with your username. I ran:

usermod -a -G audio wtd

Logged out and back in and ran qjackctl again and got:

JACK compiled with System V SHM support.
cannot lock down memory for jackd (Cannot allocate memory)
loading driver ..
apparent rate = 44100
creating alsa driver ... hw:0|hw:0|1024|2|44100|0|0|nomon|swmeter|-|32bit
ALSA: Cannot open PCM device alsa_pcm for playback. Falling back to capture-only mode
cannot load driver module alsa

Here I searched online, looked at all kinds of questions and answers, made a cup of tea, tried again, gave up, tried again, then installed something that may not be necessary, but it was part of what I did so I’ll include it:

sudo apt-get install pulseaudio-module-jack

My tuner, with knobs and buttons that are easy to frob.

Then, thanks to some helpful answer somewhere, I got onto the real problem, which is about where the audio was going. I grew up in a world where home audio signals (not including the wireless) were transmitted on audio cables with RCA jacks. (Pondering all the cables I’ve used in my life, I think the RCA jack is the closest to perfection. It’s easy to identify, it has a pleasing symmetry and design, and there is no way to plug it in wrong.) Your cassette deck and turntable would each have one coming out and you’d plug them into your tuner and then everything just worked, because when you needed you’d turn a knob that meant “get the audio from here.” I have only the haziest idea of how audio on Linux really works, but at heart there seems to be something similar going on, because what made it work was telling JACK which audio thingie I wanted.

I had to change the interface setting

You can pull up that window by clicking on Settings in qjackctl. The Interface line said “Default,” but I changed it to “hw:PCH (HDA Intel PCH, hw:1)”, whatever that means, and it worked. What’s in the screenshot is different, and it works too. I don’t know why. Don’t ask me. Just fiddle those options and maybe it will work for you too.

I hit Start and got JACK going, then back in the Sonic Pi source tree I ran ./rp-app-bin and it worked! Sound came out of my speakers! I plugged in my headphones and they worked. Huzzah!


That was all well and good, but nothing is fully working until it can be run from Emacs. A thousand thanks go to sonic-pi.el!

I used the package manager (M-x list-packages) to install sonic-pi; I didn’t need to install dash and osc because I already had them for some reason. Then I added this to init.el:

;; Sonic Pi
(require 'sonic-pi)
(add-hook 'sonic-pi-mode-hook
          (lambda ()
            ;; This setq can go here instead if you wish
            (setq sonic-pi-path "/usr/local/src/sonic-pi/")
            (define-key ruby-mode-map "\C-c\C-c" 'sonic-pi-send-buffer)))

That last line is a customization of my own: I wanted C-c C-c to do the right thing the way it does in Org mode: here, I want it to play the current buffer. A good key combination like that is good to reuse.

Then I could open up test.rb and try whatever I wanted. After a lot of fooling around I wrote this:

define :throb do |note, seconds|
  use_synth :square
  with_fx :reverb, phase: 2 do
    s = play note, attack: seconds/2, release: seconds/2, note_slide: 0.25
    (seconds*4).times do
      control s, note: note
      sleep 0.25
      control s, note: note-2
      sleep 0.25
    end
  end
end

throb(40, 32)

To get it connected, I ran M-x sonic-pi-mode then M-x sonic-pi-connect (it was already running; otherwise M-x sonic-pi-jack-in would do, and sometimes M-x sonic-pi-restart is needed), then I hit C-c C-c … and a low uneasy throbbing sound came out.

Emacs with sonic-pi-mode running

Amazing. Everything feels better when you can do it in Emacs. Especially coding music.

John Miedema: Lila Slip Factory I: “Question” rather than “query” for natural language processing

planet code4lib - Sat, 2015-01-17 20:35

The Lila cognitive writing system extends your reading abilities by converting unread content into slips, units of text for later visualization and analysis. How does Lila convert content into slips? The Lila Slip Factory has two processes. The first process, represented here, involves converting slips written manually by the author into questions to be asked of the unread content. I use the word “question” rather than “query” because I am using natural language processing in addition to more structured query methods. I want to create the association between author slips and natural language questions, such as one might find in a Question-Answer system.

  1. The first process begins with a stack of slips generated manually by the author. Each slip is processed.
  2. Natural language processing is applied to convert the slip into tokens and analyze parts of speech.
  3. The keyword analysis is an algorithm that converts the outputs of step two into keywords. The selection of keywords will depend on their placement in the author slip, i.e., in the subject line, content, categories and tags. It will depend on other text analytics such as word frequency and word concreteness. Weighting factors may be applied to the keywords. This algorithm will be explained more later.
  4. Once the keywords have been selected, a structured question can be formed.
  5. Each question is added to a collection that will be used in the second process, to be represented in a following post.
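The five steps above can be sketched end to end. The tokenizer, stop list, weights and query syntax below are naive stand-ins for the real NLP components, chosen only to show the shape of the pipeline:

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "than"}

def tokenize(text):                       # step 2: tokens (no real POS tagging here)
    return re.findall(r"[a-z]+", text.lower())

def select_keywords(slip):                # step 3: weights by placement in the slip
    weights = {}
    for field, boost in (("content", 1), ("subject", 2), ("tags", 4), ("categories", 8)):
        for token in tokenize(slip.get(field, "")):
            if token not in STOP_WORDS:
                weights[token] = weights.get(token, 0) + boost
    return weights

def form_question(weights):               # step 4: a boosted, Lucene-style query string
    return " OR ".join(f"{word}^{w}" for word, w in sorted(weights.items()))

questions = []                            # step 5: collect for the second process
for slip in [{"subject": "The pen is mightier", "tags": "penVsPencil"}]:  # step 1
    questions.append(form_question(select_keywords(slip)))
print(questions)
```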

Harvard Library Innovation Lab: Link roundup January 17, 2015

planet code4lib - Sat, 2015-01-17 17:37

Legos. JavaScript. Photos. And request logs. Smart teams. What a range.

Why Some Teams Are Smarter Than Others

How to compose a smart team: 1. Equal talk time. 2. Good at reading facial expressions. 3. Not all dudes.

Issues to Readers

The living library. A video of a log of realtime book requests at the British Library.

Wonderful head shots of hand models

The faces attached to the hands that get photographed. I’d drop this in the Awesome Box.

How Lego Became The Apple Of Toys | Fast Company | Business + Innovation

Lego innovates with a walled garden Future Lab that relies on extensive user research


A ghost in the machine. A human ghost, typing to us. Or, maybe just a cool JavaScript library.

District Dispatch: Free webinar: Understanding credit reports and scores

planet code4lib - Fri, 2015-01-16 21:36

On January 29th, the Consumer Financial Protection Bureau and the Institute for Museum and Library Services will offer a free webinar on financial literacy. This session has limited space so please register quickly.

Tune in to the Consumer Financial Protection Bureau’s monthly webinar series intended to instruct library staff on how to discuss financial education topics with their patrons. As part of the series, the Bureau invites experts from other government agencies and nonprofit organizations to speak about key topics of interest.

What’s the difference between your credit report and your credit score? How are scores used, and what makes them go up or down? Find out during this webinar when financial literacy experts discuss ways that past credit habits can affect your ability to get loans and lower interest rates in the future. If you would like to be notified of future webinars, or ask about in-person trainings for large groups of librarians, email; Subject: Library financial education training. All webinars will be recorded and archived for later viewing.

Webinar Details
January 29, 2015
2:30–3:30 p.m. EDT
To join the webinar, please click on the following link at the time of the webinar: Join the webinar

  • Conference number: PW1172198
  • Audience passcode: LIBRARY

If you are participating only by phone, please dial the following number:

  • Phone: 1-888-947-8930
  • Participant passcode: LIBRARY

The post Free webinar: Understanding credit reports and scores appeared first on District Dispatch.

M. Ryan Hess: Digital Author Services

planet code4lib - Fri, 2015-01-16 20:39

The producers of information at our academic institutions are brilliant at what they do, but they need help from experts in sharing their work online. Libraries are uniquely suited for the task.

There are three important areas where we can help our authors:

  1. Copyright and Author Rights Issues
  2. Developing Readership and Recognition
  3. Helping authors overcome technical hurdles to publishing online

Several libraries are now promoting copyright and author rights information services. These services provide resources (often LibGuides) to scholars who may be sold on the benefits of publishing online, but are unclear what their publishers allow. In fact, in my experience, this is one of the most common problems. Like I said, academics are busy people and focused on their area of specialization, which rarely includes reading the legalese of their publisher agreements, let alone keeping a paper trail handy. This is particularly true for authors that began their careers before the digital revolution.

At any rate, providing online information followed up with face-to-face Q&A is an invaluable service for scholars.

Lucretia McCulley of the University of Richmond and Jonathan Bull of the University of Valparaiso have put together a very concise presentation on the matter, detailing how they’ve solved these issues at their institutions.

Another service, which I’m actually developing at my institution presently, is providing copyright clearance as a service for scholars. In our case, I hope to begin archiving all faculty works in our institutional repository. The problem has been that faculty are busy and relying on individual authors to find the time to do the due diligence of checking their agreements just ain’t gonna happen. In fact, this uncertainty about their rights as authors often stops them cold.

In the service model I’m developing, we would request faculty activity reports or query some other resource on faculty output, and then run the checks ourselves (using student labor) on services like SherpaRomeo. When items check out, we publish them. When they don’t, we post the metadata and link to the appropriate online resource (likely in an online journal).

Developing Readership & Recognition

Another area where libraries can provide critical support is assisting authors in growing their reputations and readership. Skills commonly found in libraries, from search engine optimization (SEO) to cataloging, play a role in this service offering.

At my institution, we use Digital Commons for our repository, which we selected partly because it has powerful SEO built into it. I’ve seen this at work: a faculty member posts something to the repository, and within weeks (and even days) that content rises to the top of Google search results, beating out even Facebook and LinkedIn for searches on the author’s name.

And of course, while we don’t normally mark up the content with metadata for the authors, we do provide training on using the repository and understanding the implications for adding good keywords and disciplines (subject headings) which also help with SEO.

The final bit is the reporting. With Digital Commons, reports go out every month via email to the authors, letting them know what their top downloads were and how many they had. This is great, and I find the reports help spur word-of-mouth marketing of the repository and enthusiasm for it by authors. This is built into Digital Commons, but no matter what platform you use, I think this is a basic requirement that helps win authors’ hearts, drives growth and is a vital assessment tool.

Walking The Last Mile

MacKenzie Smith of MIT has described the Last Mile Problem (Bringing Research Data into the Library, 2009), which is essentially where technical difficulties, uncertainty about how to get started and basic time constraints keep authors from ever publishing online.

As I touched on above, I’m currently developing a program to help faculty walk the last mile, starting with gathering their CVs and then doing the copyright checks for them. The next step would be uploading the content, adding useful metadata and publishing it for them. A key step before all of this, of course, is setting up policies for how the collection will be structured. This is particularly true for non-textual objects like images, spreadsheets, data files, etc.

So, when we talk about walking the last mile with authors, there’s some significant preparatory work involved. Creating a place for authors to understand your digital publishing services is a good place to start. Some good examples of this include:

Once your policies are in place, you can provide a platform for accepting content. In our case (with Digital Commons), we get stellar customer service from Bepress, which includes training users in how to use their tools. At institutions where such service is not available, two things will be critical:

  1. Provide a drop-dead easy way to deposit content, which includes simple but logical web forms that guide authors in giving you the metadata and properly-formatted files you require.
  2. Provide personal assistance. If you’re not providing services for adding content, you must have staffing for handling questions. Sorry, an FAQ page is not enough.
Bottom Line

Digital publishing is just such a huge area of potential growth. In fact, as more and more academic content is born digital, preserving it for the future in sustainable and systematic ways is more important than ever.

The Library can be the go-to place on your campus for making this happen. Our buildings are brimming with archivists, metadata experts, subject specialists and web technologists, making us uniquely qualified to help authors of research overcome the challenges they face in getting their stuff out there.

OCLC Dev Network: WMS Web Services Install January 18

planet code4lib - Fri, 2015-01-16 17:30

All Web services that require user-level authentication will be unavailable during the installation window, between 2:00 and 8:00 AM Eastern USA time on Sunday, January 18th.

Open Knowledge Foundation: BudgetApps: The First All-Russia Contest on Open Finance Data

planet code4lib - Fri, 2015-01-16 17:26

This is a guest post by Ivan Begtin, Ambassador for Open Knowledge in Russia and co-founder of the Russian Local Group.

Dear friends, the end of 2014 and the beginning of 2015 have been marked by an event, which is terrific for all those who are interested in working with open data, participating in challenges for apps developers and generally for all people who are into the Open Data Movement. I’m also sure, by the way, that people who are fond of history will find it particularly fascinating to be involved in this event.

On 23 December 2014, the Russian Ministry of Finance, together with the NGO Infoculture, launched BudgetApps, an apps developers’ challenge based on the open data the Ministry of Finance has published over the past several years. There are a number of datasets, including budget data, audit organisation registries, public debt, national reserves and many other kinds of data.

Now, it happened so that I have joined the jury. So I won’t be able to participate, but let me provide some details regarding this initiative.

All the published data can be found at the Ministry website. Lots of budget datasets are also available at The Single Web Portal of the Russian Federation Budget System. That includes the budget structure in CSV format, the data itself, reference books and many other instructive details. Data regarding all official institutions are placed here. This resource is particularly interesting, because it contains indicators, budgets, statutes and numerous other characteristics regarding each state organisation or municipal institution in Russia. Such data would be invaluable for anyone who considers creating a regional data-based project.

One of the challenge requirements is that submitted projects should be based on the data published by the Ministry of Finance. However, that does not mean participants cannot use data from other sources alongside the Ministry data. It is actually expected that apps developers will combine several data sources in their projects.

To my mind, one should not even be restricted to machine-readable data, because there are also human-readable data that participants can convert to open data formats.

Many potential participants know how to write parsers on their own. For those who have never done so, there are great reference resources, e.g. ScraperWiki, which can be helpful for scraping web pages. There are also various libraries for analysing Excel files or extracting spreadsheets from PDF documents (for instance, PDFtables, Abbyy FineReader software or other Abbyy services).

Moreover, at other web resources of the Ministry of Finance there is a lot of interesting information that can be converted to data, including news items that recently have become especially relevant for the Russian audience.

Historical budgets

There is a huge and powerful direction in the general process of opening data which has long been missing in Russia: publishing open historical data that are kept in archives as large paper volumes of reference books containing myriads of tables. These are indispensable when we turn to history, referring to facts and creating projects devoted to particular events.

The time has come at last. Any day now the first scanned budgets of the Russian Empire and the Soviet Union will be openly published. A bit later, but also in the near future, the rest of the existing budgets of the Russian Empire, the Soviet Union, and the Russian Soviet Federated Socialist Republic will be published as well.

These scanned copies are being gradually converted to machine-readable formats, such as Excel and CSV data reconstructed from these reference books – both as raw data and as initially processed and ordered data. We created these ordered normalised versions to make it easier for developers to use them in further visualisations and projects. A number of such datasets have already been openly published. It is also worth mentioning that a considerable number of scanned copies of budget reference books (from both the Russian Empire and USSR) have already been published online by Historical Materials, a Russian-language grass-roots project launched by a group of statisticians, historians and other enthusiasts.

Here are the historical machine-readable datasets published so far:

I find this part of the challenge particularly inspiring. If I were not part of the jury, I would create my own project based on historical budgets data. Actually, I may well do something like that after the challenge is over (unless somebody does it earlier).

More data?

There is a greater stock of data sources that might be used alongside the Ministry data. Here are some of them:

These are just a few examples of numerous available data sources. I know that many people also use data from Wikipedia and DBPedia.

What can be done?

First and foremost, there are great opportunities for creating projects that make public finance easier to understand. For instance, these could be visual demos of how the budget (or the public debt, or some particular area of finance) is structured.

Second, lots of projects could be launched based on the data on official institutions. For instance, it could be a comparative registry of all hospitals in Russia. Or a project comparing all state universities. Or a map of available public services. Or a visualisation of the budgets of Moscow State University (or any other Russian state university, for that matter).

As for the historical data, for starters there could be a simple visualisation comparing the current situation to the past. This might be a challenging and fascinating problem to solve.

Why is this important?

BudgetApps is a great way of promoting open data among apps developers, as well as data journalists. There are good reasons to participate. First off, there are many sources of data that give talented and creative developers a good opportunity to implement their ambitious ideas. Second, the winners will receive considerable cash prizes. And last, but not least, the most interesting and promising projects will be featured on the Ministry of Finance website, which is good promotion for any worthy project. Considerable amounts of data have become available; it's now time for a wider audience to become aware of what they are good for.

Karen Coyle: Real World Objects

planet code4lib - Fri, 2015-01-16 16:01
I was asked a question about the meaning and import of the RDF concept of "Real World Object" (RWO) and didn't give a very good answer off the cuff. I'll try to make up for that here.

The concept of RWO comes out of the artificial intelligence (AI) community. Imagine that you are developing robots and other machines that must operate within the same world that you and I occupy. You have to find a way to "explain," in a machine-operational way, everything in our world: stairs and ramps, chairs and tables, the effect of gravity on a cup when you miss placing it on the table, the stars, love and loyalty (concepts are also objects in this view). The AI folks have actually set a goal to create such descriptions, which they call ontologies, for everything in the world; for every RWO.

You might consider this a conceit, or a folly, but that's the task they have set for themselves.

The original Scientific American article that described the semantic web used as its example intelligent 'bots that would manage your daily calendar and make appointments for you. This was far short of the AI "ontology of everything" but the result that matters to us now is that there have been AI principles baked into the development of RDF, including the concept of the RWO.

RWO isn't as mysterious as it may seem, and I can provide a simple example from our world. The MARC record for a book has the book as its RWO, and most of its data elements "speak" about the book. At the same time, we can say things about the MARC record, such as who originally created it, and who edited it last, and when. The book and the record are different things, different RWO's in an RDF view. That's not controversial, I would assume.

Our difficulties arise because in the past we didn't have a machine-actionable way to distinguish between those two "things": the book and the record. Each MARC record got an identifier, which identified the record. We've never had identifiers for the thing the record describes (although the ISBN sometimes works this way). It has always been safe to assume that the record was about the book, and what identified the book was the information in the record. So we obviously have a real world object, but we didn't give it its own identifier - because humans could read the text of the record and understand what it meant (most of the time or some of the time).

I'm not fully convinced that everything can be reduced to RWO/not-RWO, and so I'm not buying that this is the only way to talk about our world and our data. It should be relatively easy, though, without getting into grand philosophical debates, to determine the difference between our metadata and the thing it describes. That "thing it describes" can be fuzzy in terms of the real world, such as when the spirit of Edgar Cayce speaks through a medium and writes a book. I don't want to have to discuss whether the spirit of Edgar Cayce is real or not. We can just say that "whoever authors the book is as real as it gets." So if we forget RWO in the RDF sense and just look sensibly at our data, I'm sure we can come to a practical agreement that allows both the metadata and the real world object to exist.

That doesn't resolve the problem of identifiers, however, and for machine-processing purposes we do need separate identifiers for our descriptions and what we are describing.* That's the problem we need to solve, and while we may go back and forth a bit on the best solution, the problem is tractable without resorting to philosophical confabulations.

* I think that the multi-level bibliographic descriptions like FRBR and BIBFRAME make this a bit more complex, but I haven't finished thinking about that, so will return if I have a clearer idea.

Library of Congress: The Signal: Digital Audio Preservation at MIT: an NDSR Project Update

planet code4lib - Fri, 2015-01-16 14:25

The following is a guest post by Tricia Patterson, National Digital Stewardship Resident at MIT Libraries


This month marks the mid-way point of my National Digital Stewardship Residency at MIT Libraries, a temporal vantage point that allows me to reflect triumphantly on what has been achieved so far and peer fearlessly ahead at all that must be accomplished before I am finished.

As mentioned in our previous introductory group post, I was primarily tasked with completing a gap analysis of the digital preservation workflows currently in place, developing lower-level diagrammatic and narrative workflows, and calling out a digital audio use case from the Lewis Music Library materials we are using to build the workflows. My work is part of a larger preservation planning effort underway at MIT, and it has enabled me to make higher-level, organizational contributions while also familiarizing me with the nitty-gritty procedural details across the departments. This project really has relied on strong, interdepartmental collaboration with input from: Peter Munstedt and Cate Gallivan from the Lewis Music Library; Tom Rosko, Mikki Macdonald, Liz Andrews and Kari Smith from the Institute Archives and Special Collections; Ann Marie Willer from Curation and Preservation Services; Helen Bailey from IT; and finally my host supervisor, Nancy McGovern, who heads Digital Curation and Preservation. Others have consulted throughout the project so far as well.

I will shamefully admit that during graduate school I really hadn't devoted much consideration to workflow documentation. Aside from the OAIS reference model, my thinking about digital preservation was relegated to isolated, technical steps such as format migration or appropriate preservation metadata. Since beginning this project, however, I've realized that workflow documentation is receiving increased acknowledgement and appreciation. Without a tested, repeatable road map, it is difficult to process larger projects with efficiency and security. A detailed, documented workflow elucidates processes across departments, giving us insight into redundancies and deficiencies. It allows for transparency, clarification of roles, and accountability within the chain of custody.

Above is the high-level content management workflow that the digital audio workflow subgroup developed prior to my arrival. My work so far has been on the second (digitization) and third (managing digital content) sequences of the workflow, fleshing out optimized, lower-level documentation for the steps within each bubble (or stage). Below is an example of the lower-level workflow diagram that I designed for stage A2: Define Digitization Requirements based on the information gathered from archives and preservation staff. Not pictured is the accompanying narrative documentation for the stage. I actually just wrapped up drafting the six stages of the “Transform Physical/Analog to Digital” sequence at the end of December, and while I am drafting the documentation for the next sequence – “Managing Digital” – we are simultaneously moving through the review process for the initial set.

Other benefits that have emerged so far include getting a better idea of what digitization project documentation is generated and what of that documentation needs to be preserved as well. It has also helped us to identify steps that would benefit from automation. For example, as the physical materials are handed off on their way to a vendor to be digitized, we must maintain a chain of custody for the content, so our metadata archivist created a database to more accurately track the items in real-time as they transition through the workflow. We have also gained better perspective on which tools we need that will have the biggest impact on streamlining the work using this workflow. It is becoming clear how much easier it will be to initiate digitization projects, now knowing exactly which avenues need to be traveled and what documentation is necessary. And developing a strong, tested infrastructure can be leveraged for increased funding for projects and acquisitions.

Beyond the workflow development, I am contributing to other projects such as evaluating a streaming audio access platform for the Lewis Music Library and compiling a PREMIS profile for MIT Libraries that can be used for digital audio. The evaluation, an activity of our new Digital Sustainability Lab, has been especially fascinating, as our team is a combined effort between the technological and organizational wings of the libraries, working together to define requirements and measure options against them.

We began by itemizing 50-60 delivery requirements, including relevant TRAC requirements (PDF), covering display and interface, search and discovery, accessibility, ingest and export, metadata, content management, permissions, documentation and other considerations. From there, our group prioritized the requirements on a scale from zero to four: “might be nice” to “showstopper/must-have.” We also kept in mind that while we are only focusing on audio streaming currently, the system should be extensible to audiovisual materials. Next, we will be measuring the platform options against our prioritized requirements to determine which one will be best suited to meet the needs of the Libraries now. For me, this has been one of the most important parts of the position, to facilitate meaningful access to these audio treasures.

The residencies expand beyond the work at our institutions, however. All of the residents have been organizing tours, demonstrations and classes for one another. In December, I arranged for some of the NDSR-Boston crew to go on a behind-the-scenes tour of the John F. Kennedy Library and Museum, home to a renowned digitization program. This spring, another resident (Joey Heinen) and I are partnering to host a digital audio panel with speakers from some of the host institutions that will hopefully be beneficial to external audiences in the area grappling with common preservation concerns.

The residency will be over before I know it. In the upcoming months, I will wrap up the workflow documentation on the digital sequence, continue work on peripheral extant projects that are ongoing, and attend a couple of conferences to talk about our work. Building these models has been fun and intensely educational – and sharing it with the community will be truly rewarding.

DuraSpace News: SHARE Webinar Recording Available

planet code4lib - Fri, 2015-01-16 00:00

Winchester, MA: On January 14, 2015, Judy Ruttenberg, Program Director, Association of Research Libraries, presented “Roadmap to the Future of SHARE.” Judy highlighted SHARE’s project plans and long-term vision, which include taking a life-cycle approach to research services in an effort to build a robust repository ecosystem. This was the third and final webinar in the DuraSpace Hot Topics Community Webinar Series, “All About the SHared Access Research Ecosystem (SHARE),” curated by Greg Tananbaum, Product Lead, SHARE.

Harvard Library Innovation Lab: Awesome Box top 110 of all time

planet code4lib - Thu, 2015-01-15 21:11

It’s been almost two years since Somerville Public Library helped us launch the Awesome Box to public libraries and beyond.

There are now 364 Awesome libraries around the world.

Over 41,000 items have been dropped in an Awesome Box in those libraries.

See the items just Awesomed on the Awesome Box page.

Now that the year-end lists have come and gone, we’d like to present the top 110 Awesome items* from the past two years.


  1. Diary of a Wimpy Kid 131
  2. The fault in our stars 110
  3. Divergent 71
  4. Wonder 64
  5. The Hunger Games 59
  6. Gone girl 55
  7. The invention of wings 50
  8. Naruto 50
  9. Unbroken 49
  10. The book thief 45
  11. Orphan train 45
  12. Eleanor & Park 44
  13. Bone 42
  14. The heroes of Olympus 41
  15. Smile 40
  16. The goldfinch 38
  17. Allegiant 38
  18. Star wars 35
  19. All the light we cannot see 33
  20. The maze runner 33
  21. The giver 32
  22. Ready player one 32
  23. Insurgent 32
  24. Big Nate 32
  25. Where’d you go, Bernadette 30
  26. Life after life 30
  27. Fangirl 30
  28. Maximum Ride 29
  29. Dork diaries 29
  30. The boys in the boat 28
  31. Doctor Who 28
  32. Me before you 28
  33. The signature of all things 28
  34. Babymouse 28
  35. The light between oceans 27
  36. Mr. Penumbra’s 24-hour bookstore 27
  37. The storied life of A.J. Fikry 27
  38. Sisters 27
  39. And the mountains echoed 27
  40. Squish 27
  41. Cinder 27
  42. The night circus 27
  43. The Lego movie 27
  44. The help 26
  45. Wild 26
  46. The walking dead 26
  47. Harry Potter and the sorcerer’s stone 26
  48. Junie B. Jones loves handsome Warren 26
  49. Drama 25
  50. Percy Jackson & the Olympians 24
  51. Ender’s game 24
  52. The ocean at the end of the lane 24
  53. Animal Ark Labrador on the Lawn 24
  54. Harry Potter and the Order of the Phoenix 24
  55. The Rosie project 24
  56. Sycamore row 24
  57. Frozen 24
  58. Harry Potter and the Half-Blood Prince 23
  59. I am Malala 23
  60. Amulet 23
  61. Geronimo Stilton 23
  62. Big little lies 23
  63. Every day 22
  64. Harry Potter and the chamber of secrets 22
  65. The husband’s secret 22
  66. Legend 22
  67. The lightning thief 22
  68. Maze runner trilogy 22
  69. Heroes of Olympus 22
  70. Leaving time 22
  71. The invention of Hugo Cabret 21
  72. Out of my mind 21
  73. The sea of monsters 21
  74. Escape from Mr. Lemoncello’s library 21
  75. The perks of being a wallflower 21
  76. Hyperbole and a half 21
  77. Fruits basket 21
  78. Delicious 21
  79. Wonderstruck 20
  80. Black butler 20
  81. Pete the cat 20
  82. Downton Abbey 20
  83. The one and only Ivan 20
  84. Harry Potter and the prisoner of Azkaban 20
  85. The Selection 20
  86. The monuments men 20
  87. Mr. Mercedes 20
  88. Mean streak 20
  89. Room 19
  90. Batman 19
  91. The golem and the jinni 19
  92. The unlikely pilgrimage of Harold Fry 19
  93. Harry Potter and the goblet of fire 19
  94. Matched 19
  95. Game of thrones 19
  96. Paper towns 19
  97. Written in my own heart’s blood 19
  98. The silkworm 19
  99. The immortal life of Henrietta Lacks 18
  100. Graceling 18
  101. One summer 18
  102. The great Gatsby 18
  103. The Cuckoo’s Calling 18
  104. The lowland 18
  105. Steelheart 18
  106. The strange case of Origami Yoda 18
  107. Philomena 18
  108. We were liars 18
  109. Edge of eternity 18
  110. The blood of Olympus 18

*Some series items are clumped together. I kind of like it that way.

Jonathan Rochkind: Ruby threads, gotcha with local vars and shared state

planet code4lib - Thu, 2015-01-15 18:25

I end up doing a fair amount of work with multi-threading in ruby. (There is some multi-threaded concurrency in Umlaut, bento_search, and traject).  Contrary to some belief, multi-threaded concurrency can be useful even in MRI ruby (which can’t do true parallelism due to the GIL), for tasks that spend a lot of time waiting on I/O, which is the purpose in Umlaut and bento_search (in both cases waiting on external HTTP apis). Traject uses multi-threaded concurrency for true parallelism in jruby (or soon rbx) for high performance.

There’s a gotcha with ruby threads that I haven’t seen covered much. What do you think this code will output from the ‘puts’?

value = 'original'

t = do
  sleep 1
  puts value
end

value = 'changed'

t.join

It outputs “changed”. The local var `value` is shared between both threads; changes made in the primary thread affect the value of `value` in the created thread too. This is an issue not unique to threads, but a result of how closures work in ruby: the local variables used in a closure don't capture their values at the time of closure creation, they are pointers to the original local variables. (I'm not entirely sure if this is traditional for closures, or if some other languages do it differently, or what the correct CS terminology for talking about this stuff is.) It confuses people in other contexts too, but can especially lead to problems with threads.

Consider a loop which in each iteration prepares some work to be done, then dispatches to a thread to actually do the work.  We’ll do a very simple fake version of that, watch:

threads = []
i = 0

10.times do
  # pretend to prepare a 'work order', which ends up in local
  # var i
  i += 1

  # now do some stuff with 'i' in the thread
  threads << do
    sleep 1 # pretend this is a time consuming computation
    # now we do something else with our work order...
    puts i
  end
end

threads.each {|t| t.join}

Do you think you’ll get “1”, “2”, … “10” printed out? You won’t. You’ll get ten 10’s. (With newlines in random places because of the interleaving of ‘puts’, but that’s not what we’re talking about here.) You thought you dispatched 10 threads, each with a different value for ‘i’, but the threads are actually all sharing the same ‘i’; when it changes, it changes for all of them.


Ruby stdlib has a mechanism to deal with this, although like much in ruby stdlib (and much about multi-threaded concurrency in ruby), it's under-documented. But you can pass args to, and they will be passed on to the block, letting you avoid this local var linkage:

require 'thread'

value = 'original'

t = do |t_value|
  sleep 1
  puts t_value
end

value = 'changed'

t.join

Now that prints out “original”. That’s the point of passing one or more args to

You might think you could get away with this instead:

require 'thread'

value = 'original'

t = do
  # nope, not a safe way to capture the value, there's
  # still a race condition
  t_value = value
  sleep 1
  puts t_value
end

value = 'changed'

t.join

While that will seem to work for this particular example, there's still a race condition there: the value could change before the first line of the thread block is executed. Part of dealing with concurrency is giving up any expectations about what gets executed when, until you wait on a `join`.

So, yeah, use the arguments to, a pattern which other libraries involving threading sometimes propagate. With a concurrent-ruby ThreadPoolExecutor:

work = 'original'

pool =

pool.post(work) do |t_work|
  sleep 1
  puts t_work # t_work is safe
end

work = 'new'

pool.shutdown
pool.wait_for_termination

And it can even be a problem with Futures from ruby-concurrent. Futures seem so simple and idiot-proof, right? Oops.

value = 100

future = Concurrent::Future.execute do
  sleep 1
  # DANGER will robinson!
  value + 1
end

value = 200

puts future.value # you get 201, not 101!

I’m honestly not even sure how you get around this problem with Concurrent::Future, unlike Concurrent::ThreadPoolExecutor it does not seem to copy stdlib in it’s method of being able to pass block arguments. There might be something I’m missing (or a way to use Futures that avoids this problem?), or maybe the authors of ruby-concurrent haven’t considered it yet either? I’ve asked the question of them.  (PS: The ruby-concurrent package is super awesome, it’s still building to 1.0 but usable now; I am hoping that it’s existence will do great things for practical use of multi-threaded concurrency in the ruby community).

This is, for me, one of the biggest, most dangerous, most confusing gotchas with ruby concurrency. It can easily lead to hard-to-notice, hard-to-reproduce, and hard-to-debug race condition bugs.

Filed under: General

