planet code4lib

Alf Eaton, Alf: The trouble with scientific software

Wed, 2014-12-31 12:44

Via Nautilus’ excellent Three Sentence Science, I was interested to read Nature’s list of “10 scientists who mattered this year”.

One of them, Sjors Scheres, has written software - RELION - that creates three-dimensional images of protein structures from cryo-electron microscopy images.

I was interested in finding out more about this software: how it had been created, and how the developer(s) had been able to make such a significant improvement in protein imaging.

The Scheres lab has a website. There’s no software section, but in the “Impact” tab is a section about RELION:

“The empirical Bayesian approach to single-particle analysis has been implemented in RELION (for REgularised LIkelihood OptimisatioN). RELION may be downloaded for free from the RELION Wiki. The Wiki also contains its documentation and a comprehensive tutorial.”

I was hoping for a link to GitHub, but at least the source code is available (though the “for free” is worrying, signifying that the default is “not for free”).

On the RELION Wiki, the introduction states that RELION “is developed in the group of Sjors Scheres” (slightly problematic, as this implies that outsiders are excluded, and that development of the software is not an open process).

There’s a link to “Download & install the 1.3 release”. On that page is a link to “Download RELION for free from here”, which leads to a form, asking for name, organisation and email address (which aren’t validated, so can be anything - the aim is to allow the owners to email users if a critical bug is found, but this shouldn’t really be a requirement before being allowed to download the software).

Finally, you get the software: relion-1.3.tar.bz2, containing files that were last modified in February and June this year.

The file is downloaded over HTTP, with no hash provided that would allow verification of the authenticity or correctness of the downloaded file.
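As a rough illustration of what a published hash would make possible, the sketch below computes the SHA-256 digest of a downloaded archive and compares it with a digest supplied by the publisher. The function names are made up for this example, and the published digest is hypothetical: as noted above, the RELION download provides no such value.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare the local digest with a (hypothetical) published hex digest."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

A matching digest only shows that the file is the one the publisher hashed; the digest itself still has to be obtained over a channel you trust, such as HTTPS.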

The COPYING file contains the GPLv2 license - good!

There’s an AUTHORS file, but it doesn’t really list the contributors in a way that would be useful for citation. Instead, it’s mostly about third-party code:

This program is developed in the group of Sjors H.W. Scheres at the MRC Laboratory of Molecular Biology.
However, it does also contain pieces of code from the following packages:
XMIPP: http:/
BSOFT:
HEALPIX:
Original disclaimers in the code of these external packages have been maintained as much as possible. Please contact Sjors Scheres ( if you feel this has not been done correctly.

This is problematic, because the licenses of these pieces of software aren’t known. They are difficult to find: trying to download XMIPP hits another registration form, and BSOFT has no visible license. At least HEALPIX is hosted on SourceForge and has a clear GPLv2 license.

The CHANGELOG and NEWS files are empty. Apart from the folder name, the only way to find out which version of the code is present is to look in the configure script, which contains PACKAGE_VERSION='1.3'. There’s no way to know what has changed from the previous version, as the previous versions are not available anywhere (this also means that it’s impossible to reproduce results generated using older versions of the software).

The README file contains information about how to credit the authors of RELION if it is used: by citing the article Scheres, JMB (2011) (DOI: 10.1016/j.jmb.2011.11.010) which describes how the software works (the version of the software that was available in 2011, at least). This article is Open Access and published under CC-BY v3 (thanks MRC!).

Suggested Improvements

The source code for RELION should be in a public version control system such as GitHub, with tagged releases.

The CHANGELOG should be maintained, so that users can see what has changed between releases.

There should be a CITATION file that includes full details of the authors who contributed to (and should be credited for) development of the software, the name and current version of the software, and any other appropriate citation details.

Each public release of the software should be archived in a repository such as figshare, and assigned a DOI.

There should be a way for users to submit visible reports of any issues that are found with the software.

The parts of the software derived from third-party code should be clearly identified, and removed if their license is not compatible with the GPL.

For more discussion of what is needed to publish citable, re-usable scientific software, see the issues list of Mozilla Science Lab's "Code as a Research Object" project.

Patrick Hochstenbach: Happy New Year!

Wed, 2014-12-31 08:42
Filed under: Doodles Tagged: cartoon, cat, comic, doodle, newyear

Jenny Rose Halperin: Bulbes: a soup zine. Call for Submissions!

Tue, 2014-12-30 21:04

Please forward widely!

It’s that time of year, when hat hair is a reality and wet boots have to be left at the door. Frozen fingers and toes are warmed with lots of tea and hot cocoa, and you have heard so many Christmas songs that music is temporarily ruined.

I came to the conclusion a few years ago that soup is magic (influenced heavily by a friend, a soup evangelist) and decided to start a zine about soup, called Bulbes.

It is currently filled mostly with recipes, but also some poems (written by myself and others) and essays and reflections and jokes about soup. Some of you have already submitted to the zine, which is why all this may sound familiar.

Unfortunately, I hit a wall at some point and never finished it, but this year is the year! I finally have both the funds and feelings to finish this project and I encourage all of you to send me:

* Recipes (hot and cold soups are welcome)
* Artwork about soup (particularly cover artwork!)
* Soup poems
* Soup essays
* Soup songs
* Soup jokes
* Anything else that may be worth including in a zine about soup

Submissions can be original or found, new or old.

Submission deadline is January 20 (after all the craziness of this time of year and early enough so that I can finish it and send it out before the end of winter!) If you need more time, please tell me and I will plan accordingly.
If you want to snail mail me your submission, get in touch for my address.

Otherwise email is fine!

Happy holidaze to all of you.



PS I got a big kick in the tuchus to actually finish this when I met Rachel Fershleiser, who kindly mailed me a copy of her much more punnily named “Stock Tips” last week.  It was pretty surreal to meet someone else who made a zine about soup!

check it out!

District Dispatch: CopyTalk webinar on Georgia State e-reserves case

Tue, 2014-12-30 15:15

Join us for our next installment of CopyTalk, January 8th at 2pm Eastern Time.

Laura Quilter will be our guest speaker for CopyTalk, a bimonthly webinar brought to you by the Office for Information Technology Policy’s subcommittee on Copyright Education. Our topic: an update on the lawsuit brought by three academic publishers against Georgia State University regarding fair use and e-reserves. If you have been keeping track, the GSU case is now in its fourth year of litigation. Most recently, in October 2014, the Eleventh Circuit Court of Appeals overturned the 2012 decision in favor of GSU, a decision questioned by both rights holders and supporters of Georgia State due to the court’s formulaic application of fair use. Is this a serious setback or possibly good news? If litigation continues, the GSU case is destined to be a major ruling on fair use of digital copies for educational purposes. Not to be missed!!

Laura Quilter is the copyright and information policy attorney/librarian at the University of Massachusetts Amherst.  She works with the UMass Amherst community on copyright and related matters, equipping faculty, students, and staff with the understanding they need to navigate copyright, fair use, open access, publishing, and related issues.   Laura maintains a teaching appointment at Simmons College School of Library & Information Science, and has previously taught at UC Berkeley School of Law with the Samuelson Law, Technology & Public Policy Clinic. She holds an MSLIS degree (1993, U. of Kentucky) and a JD (2003, UC Berkeley School of Law). She is a frequent speaker, who has taught and lectured to a wide variety of audiences. She previously maintained a consulting practice on intellectual property and technology law matters. Laura’s research interests are the intersection of copyright with intellectual freedom and access to knowledge, and more generally human rights concerns within information law and policy.

There is no need to pre-register for this free webinar! Just show up on January 8, 2015 at 2pm Eastern.

Note that the webinar is limited to 100 seats so watch with colleagues if possible. An archived copy will be available after the webinar.




The post CopyTalk webinar on Georgia State e-reserves case appeared first on District Dispatch.

John Miedema: “The reason Phaedrus used slips rather than full-sized sheets of paper is that a card-catalog tray full of slips provides a more random access.”

Tue, 2014-12-30 13:47

The reason Phaedrus used slips rather than full-sized sheets of paper is that a card-catalog tray full of slips provides a more random access. When information is organized in small chunks that can be accessed and sequenced at random it becomes much more valuable than when you have to take it in serial form. It’s better, for example, to run a post office where the patrons have numbered boxes and can come in to access these boxes any time they please. It’s worse to have them all come in at a certain time, stand in a queue and get their mail from Joe, who has to sort through everything alphabetically each time and who has rheumatism, is going to retire in a few years, and who doesn’t care whether they like waiting or not. When any distribution is locked into a rigid sequential format it develops Joes that dictate what new changes will be allowed and what will not, and that rigidity is deadly.

Some of the slips were actually about this topic: random access and Quality. The two are closely related. Random access is at the essence of organic growth, in which cells, like post-office boxes, are relatively independent. Cities are based on random access. Democracies are founded on it. The free market system, free speech, and the growth of science are all based on it. A library is one of civilization’s most powerful tools precisely because of its card-catalog trays. Without the Dewey Decimal System allowing the number of cards in the main catalog to grow or shrink at any point the whole library would soon grow stale and useless and die.

And so while those trays certainly didn’t have much glamour they nevertheless had the hidden strength of a card catalog. They ensured that by keeping his head empty and keeping sequential formatting to a minimum, no fresh new unexplored idea would be forgotten or shut out. There were no ideological Joes to kill an idea because it didn’t fit into what he was already thinking.

Pirsig, Robert M. (1991). Lila: An Inquiry into Morals. Pg. 23-24.

William Denton: CBC appearances (updated)

Tue, 2014-12-30 00:41

Sean Craig’s piece on Canadaland, “Amanda Lang took money from Manulife & Sun Life, gave them favourable CBC coverage,” got me looking at the cbcappearances script I wrote earlier this year.

It wasn’t getting any of the recent appearances. It looks like the CBC changed how they are storing the data that is presented: instead of pulling it in on the fly from Google spreadsheets, it’s all in the page in a hidden table (generated by their content management system, I guess) and shown as needed.

They should be making the data available in a reusable format, but they still aren’t. So we need to scrape it, but that’s easy, so I updated the script and regenerated appearances.csv, a nice reusable CSV file suitable for importing into your favourite data analysis tool. The last appearance listed was on 29 November 2014; I assume the December ones will show up soon in January.

The data shows 218 people have made 716 appearances since 24 April 2014. A quick histogram of appearances per person shows that most made only 1 or 2 appearances, and then it quickly tails off. Here’s how I did things in R:

> library(dplyr)
> library(ggplot2)
> cbc <- read.csv("appearances.csv", header = TRUE, stringsAsFactors = TRUE)
> cbc$date <- as.Date(cbc$date)
> totals <- cbc %>% group_by(name) %>% summarise(count = n()) %>% select(count)
> qplot(totals$count, binwidth = 1)

Histogram of appearance counts. Very skewed.

The median number of appearances is 2, the mean is about 3.3, and the third quartile is 4. Let’s label anyone at or above the third quartile as “busy,” pick out everyone who is busy, and then make a data frame of just the appearance information about busy people.

> summary(totals$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   3.284   4.000  33.000
> quantile(totals$count)
  0%  25%  50%  75% 100%
   1    1    2    4   33
> busy.number <- quantile(totals$count)[[4]]
> busy.number
[1] 4
> busy_people <- cbc %>% group_by(name) %>% summarise(count = n()) %>% filter(count >= busy.number) %>% select(name)
> head(busy_people)
Source: local data frame [6 x 1]

  name
1 Adrian Harewood
2 Alan Neal
3 Amanda Lang
4 Andrew Chang
5 Anne-Marie Mediwake
6 Brian Goldman
> busy <- cbc %>% filter(name %in% busy_people$name)
> head(busy)
  name        date        event                                            role       fee
1 Nora Young  2014-11-20  University of New Brunswick: Andrews initiative  Lecture    Paid
2 Carol Off   2014-11-14  War Museum                                       Interview  Expenses
3 Rex Murphy  2014-11-13  The Salvation Army                               Speech     Paid
4 Carol Off   2014-11-03  Giller Prize                                     Interview  Paid
5 Carol Off   2014-11-01  International Federation of Authors              Interview  Unpaid
6 Carol Off   2014-10-27  International Federation of Authors              Interview  Unpaid

Now busy is a data frame of information about who did what where, but only for people with 4 or more appearances. It’s easy to do a stacked bar chart that shows how many of each type of fee (Paid, Unpaid, Expenses) each person received. There aren’t many situations where someone did a gig for expenses (red). Most are unpaid (blue) and some are paid (green).

> ggplot(busy, aes(name, fill = fee)) + geom_bar() + coord_flip()

Number of appearances by remuneration types

Lawrence Wall is doing a lot of unpaid appearances, and has never done any for pay. Good for him. Rex Murphy is the only busy person who only does paid appearances. Tells you something, that.

Let’s pick out just the paid appearances of the busy people. No need to colour anything this time.

> ggplot(busy %>% filter(fee == "Paid"), aes(name)) + geom_bar() + coord_flip()

Number of paid appearances by busy people

Amanda Lang is way out in the lead, with Peter Mansbridge second and Heather Hiscox and Dianne Buckner tied to show. In R, with dplyr, it’s easy to poke around in the data and see what’s going on, for example looking at the paid appearances of Amanda Lang and—as someone I’d expect/hope to be a lot different—Nora Young:

> busy %>% filter(name == "Amanda Lang", fee == "Paid") %>% select(date, event)
   date        event
1  2014-11-27  Productivity Alberta Summit
2  2014-11-26  Association of Manitoba Municipalities 16th Annual Convention
3  2014-11-24  Portfolio Managers Association of Canada
4  2014-11-24  Sun Life Client Appreciation
5  2014-11-18  Vaughan Chamber of Commerce
6  2014-11-04  "PwC’s Western Canada Conference
7  2014-10-30  Chartered Institute of Management Accountants Conference on Innovation
8  2014-10-27  2014 ASA - CICBV Business Valuation Conference
9  2014-10-22  Simon Fraser University Public Square
10 2014-10-07  Colliers International Market Outlook Breakfast
11 2014-09-22  National Insurance Conference
12 2014-09-15  RIMS Canada Conference
13 2014-08-19  Association of Municipalities of Ontario Annual Conference
14 2014-08-07  Manulife Asset Management Seminar
15 2014-07-10  Manulife Asset Management Seminar
16 2014-06-26  Manulife Asset Management Seminar
17 2014-05-29  Manulife Asset Management Seminar
18 2014-05-13  GeoConvention Show Calgary
19 2014-05-09  Alberta Urban Development Institute
20 2014-05-08  Young Presidents Organization
21 2014-05-07  Canadian Restaurant Investment Summit
22 2014-05-06  Canadian Hotel Investment Conference
> busy %>% filter(name == "Nora Young", fee == "Paid") %>% select(date, event)
  date        event
1 2014-11-20  University of New Brunswick: Andrews initiative
2 2014-10-04  EdTech Team Ottawa: Bilingual Ottawa Summit feat. Google
3 2014-10-02  Humber College: President's Lecture Series
4 2014-10-01  Speech Ontario Professional Planners Institute: Healthy Communities and Planning in the Digital Age

Nora Young spoke about healthy communities and education to planners and colleges and universities … Amanda Lang spoke to developers and business groups and insurance companies. They are a lot different.

At this point, following up on any relation between Amanda Lang (or another host) and paid corporate gigs requires examination by hand. If the transcripts of The Exchange with Amanda Lang were available then it would be possible to write a script to look through them for mentions of these corporate hosts, which would provide clues for further examination. If the interviews were catalogued by a librarian with a controlled vocabulary then it would be even easier: you’d just do a query to find all occasions where (“Amanda Lang” wasPaidBy ?company) AND (“Amanda Lang” interviewed ?person) AND (?person isEmployeeOf ?company) and there you go, a list of interviews that require further investigation.
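To make the shape of that query concrete, here is a minimal sketch in Python of the same three-way join over plain lists of pairs. Everything in it is hypothetical: the relation names mirror the pseudo-query above, the function name is invented, and the sample data uses placeholder names that refer to no real person or company. No such catalogue exists for the CBC data.

```python
def interviews_to_review(paid_by, interviewed, employee_of):
    """Return (host, person, company) triples where the host was paid by a
    company and also interviewed someone employed by that same company."""
    # Index employers by person and payers by host for quick lookup.
    employers = {}
    for person, company in employee_of:
        employers.setdefault(person, set()).add(company)

    payers = {}
    for host, company in paid_by:
        payers.setdefault(host, set()).add(company)

    flagged = []
    for host, person in interviewed:
        # Companies that both paid the host and employ the interviewee.
        shared = payers.get(host, set()) & employers.get(person, set())
        for company in sorted(shared):
            flagged.append((host, person, company))
    return flagged


# Placeholder data only; none of these names refer to real people or firms.
paid_by = [("Host A", "Company X"), ("Host B", "Company Y")]
interviewed = [("Host A", "Guest 1"), ("Host A", "Guest 2"), ("Host B", "Guest 1")]
employee_of = [("Guest 1", "Company X"), ("Guest 2", "Company Z")]

print(interviews_to_review(paid_by, interviewed, employee_of))
# prints [('Host A', 'Guest 1', 'Company X')]
```

A match here would only be a pointer for further reading, not evidence of anything by itself.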

But it’s not all catalogued neatly, so journalists need to dig. This kind of initial data munging and visualization may, however, be helpful in pointing out who should be looked at first. Lang, Mansbridge and Murphy are the first three that Canadaland looked at, which does make me wonder what checking Hiscox and Buckner would show … are they different, and if so, how and why, and what does that say? I don’t know. This is as far as I’ll go with this cursory analysis.

In any case: hurrah to the CBC for making the data available, but boo for not making the raw data easy to use. Hurrah to Canadaland for investigating all this and forcing the issue.

Mark E. Phillips: A measure of metadata improvement in the UNT Libraries Digital Collections

Mon, 2014-12-29 23:46

The Portal to Texas History is working with the Digital Public Library of America as a Service Hub for the state of Texas.  As part of this work we are spending time working on a number of projects to improve the number and quality of metadata records in the Portal.

We have a number of student assistants within the UNT Libraries who are working to create metadata for items in our system that do not have complete metadata records.  In doing so we are able to make these items available to the general public.  I thought it might be interesting to write a bit about how we are measuring this work and showing that we are in fact making more content available.

What is a hidden record?

We have two kinds of records that get loaded into our digital library systems: records that are completely fleshed out and “live,” and records that are minimal in nature and serve as placeholders until the full record is created.  The minimal records almost always go into the system in a hidden state while the full records are most often loaded unhidden or published. There are situations where we load these full records into the system as hidden records but that is fairly rare.

How many hidden records?

When we started working on the Service Hubs project with DPLA, 39,851 of the 754,944 metadata records in the system were hidden.  That is about 5% of the records in the system.

Why so many?

There are a few different categories that we can sort our hidden records into.  The largest share is items that are missing full metadata records.  We also have records that belong to partner institutions around the state and most likely will never be completed because something on the partner’s end fell through before the metadata records were finished; we generally call these orphaned metadata records.  We have items that for one reason or another are marked as “duplicate” and are waiting to be purged from the access system.  Finally, we have items that are embargoed, either because the rights owner has placed an access embargo on them or because we haven’t been able to fully secure rights for them yet.  Together these make up all of the hidden items in our system.  Unfortunately, we currently don’t have a great way of differentiating between these different kinds of hidden record.

How are you measuring progress?

One of the metrics that we are using to establish that we are in fact reducing the number of hidden items in the system over time is to track the percentage of hidden records to total records in the system over time.  This gives us a way to show that we are making progress and continuing to reduce the ratio of hidden to unhidden records in the system.  The following table shows the current data we’ve been collecting for this since August 2014.

Date          Total        Hidden    Percent Hidden
2014-08-04      754,944    39,851    5.278669676
2014-09-02      816,446    43,238    5.295879948
2014-10-14      907,816    38,867    4.281374199
2014-11-05      937,470    44,286    4.723991168
2014-12-14    1,014,890    41,264    4.065859354

You can see that, even though there have been a few rises from month to month, the overall trend in the percentage of hidden records is downward.  The dataset, which is updated each month, is available as a Google Drive Spreadsheet.
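For anyone wanting to reproduce the metric, a minimal sketch of the calculation follows. The totals and hidden counts are the ones reported in this post; the variable and function names are only illustrative and do not reflect how the Portal itself computes the figure.

```python
# (date, total records, hidden records) as reported in this post
snapshots = [
    ("2014-08-04", 754944, 39851),
    ("2014-09-02", 816446, 43238),
    ("2014-10-14", 907816, 38867),
    ("2014-11-05", 937470, 44286),
    ("2014-12-14", 1014890, 41264),
]


def percent_hidden(total, hidden):
    """Hidden records as a percentage of all records."""
    return 100.0 * hidden / total


for date, total, hidden in snapshots:
    print(f"{date}  {percent_hidden(total, hidden):.2f}% hidden")

# Change in percentage points between the first and last snapshot.
first = percent_hidden(snapshots[0][1], snapshots[0][2])
last = percent_hidden(snapshots[-1][1], snapshots[-1][2])
print(f"change: {last - first:+.2f} percentage points")
```

Tracking the percentage rather than the raw count matters here because the hidden count in December is higher than in August even though the hidden share is lower.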

There are several projects that we have loaded in a hidden state over the past few months including over 7,000 Texas Patent records, 1,200 Texas State Auditors Reports and 3,000 photographs from a personal photograph collection.  These were all loaded in a hidden state which explains the large jumps in numbers.

Areas to improve.

One of the things that we have started to think about (but don’t have any solid answers) is a way of classifying the different states that a metadata record can have in our system so that we can have a better understanding of why items are hidden vs not hidden.  We recognize our simple hidden or unhidden designation is lacking.  I would be interested in knowing how others are approaching this sort of issue and if there is some sort of existing work to build upon.  If there is something out there do get in touch and let me know.

Library of Congress: The Signal: Managing Research Data at Tufts University: An NDSR Project Update

Mon, 2014-12-29 13:45

The following is a guest post by Samantha DeWitt, National Digital Stewardship Resident at Tufts University.

Hello readers and a happy winter solstice from Medford, Massachusetts. It’s hard to believe I am already in my third month of the National Digital Stewardship Residency. There’s a chill in the air and the vivid fall colors that decorated the Tufts University campus last month have given way to a palette of browns and grays.

Samantha, by Samantha.

During my residency here, I have been exploring ways in which the university can get a better handle on its faculty-produced research data. The project has been illuminating. The first thing I discovered is that Tufts is not alone in its uncertainty concerning the status of institutionally connected research data. Many institutions are taking a hard look at how they approach research data management and some of the results are noteworthy. Harvard, for instance, has developed the Dataverse Network, an “open-source application for sharing, citing, analyzing and preserving research data.” Purdue has recently developed an online research repository (PURR), which provides researchers with a collaborative space during their project and long-term data management assistance. (Published datasets remain online for a minimum of 10 years as a part of the Purdue Libraries’ collection.)

Initial data storage choices

At the beginning of a project, researchers can receive assistance with storage from the Tufts technology services department. Networked (cluster) storage is available for up to several terabytes. One drive is available for smaller amounts of collaborative storage and a second can be used for individual storage space (up to four GB). Lastly, cloud storage is available through Tufts Box. Of course, one can always elect to store data on a personal hard drive or select from an array of portable storage devices.

Unfortunately, hard drives may crash and portable devices may become lost or obsolete… As this is a blog about digital preservation, I realize I don’t need to elaborate on the problems that can befall neglected media. Further, the data remaining in networked storage will be erased when a researcher leaves. Even if this were not the case, attempts to retroactively find or make sense of the data would be prohibitively time-consuming.

Data must be properly managed to be accessible

Tufts is looking at ways to understand its data output with strategies to trace and catalog research data.

Data sharing can be seen as fundamental to the most basic tenets of the scientific method: it permits reproducibility, encourages collaboration among researchers and advocates for the re-use of valuable resources. These principles have been espoused by the National Institutes of Health (NIH) and the National Science Foundation (NSF) and they, along with provisions to increase financial transparency, have resulted in increasingly stringent data management mandates for grant-seekers.

These days, Washington isn’t the only player putting pressure on researchers to tend to their data. In 2011, The Bill and Melinda Gates Foundation began asking researchers to submit a data access plan (PDF) along with their grant proposal, stating that, “a data access plan should at a minimum address the nature and scope of data and information to be disseminated, the timing of disclosure, the manner in which the data and information is stored and disseminated, and who will have access to the data and under what conditions.” The Alfred P. Sloan Foundation, too, asks applicants to describe how their data and code will be “shared, annotated, cited, and archived.”

But just because data has been placed in an appropriate subject-based repository does not ensure that those at Tufts know where it is. (Researchers themselves may not even know or remember.) This creates a unique opportunity for the university to consider ways to catalog this data. By better understanding its research output, Tufts could more easily:

  • Comply with funders’ data access mandates.
  • Visualize institutional research output.
  • Encourage inter-departmental collaboration.
  • Avoid research duplication.
  • Increase institutional visibility by data sharing.

The journals “Science” and “Nature” both require authors to submit data relevant to their publication. Furthermore, in May of this year, the Nature Publishing Group launched an open-access, online-only journal called “Scientific Data,” where researchers can access descriptions of data outputs, file formats, sample identifiers and replication structure. What is worth noting is that the site does not store data but rather acts as a finding aid for data housed in other repositories. The idea of keeping records of data while depositing them elsewhere is intriguing. In fact, it might be possible to devise a similar sort of system here. Tufts already has a Fedora-based digital repository, so the digital object record would merely require adequate metadata and a URL to direct the user to the right repository. This type of system could give the university a better grasp of its research data output.
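To give a sense of how little such a pointer record would need to hold, here is a small sketch in Python. The class and field names are invented for illustration and are not a Fedora content model or any existing Tufts schema; the example values are placeholders.

```python
from dataclasses import dataclass, field, asdict
from urllib.parse import urlparse


@dataclass
class DatasetPointer:
    """Descriptive metadata plus a link to where the data actually lives."""
    title: str
    creators: list
    repository: str  # name of the external repository holding the data
    url: str         # landing page or persistent identifier for the dataset
    keywords: list = field(default_factory=list)

    def __post_init__(self):
        # Reject values that are clearly not web addresses.
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a web address: {self.url!r}")

    def to_record(self):
        """Return the pointer as a plain dict, ready for serialisation."""
        return asdict(self)


# Placeholder values; example.org is a reserved documentation domain.
pointer = DatasetPointer(
    title="Example survey data",
    creators=["A. Researcher"],
    repository="Example Subject Repository",
    url="https://example.org/datasets/123",
)
print(pointer.to_record())
```

The record stores no data at all, which is the point: the university keeps the description, and the subject repository keeps the files.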

Tufts has made definite progress in advocating for best practices in data management for its researchers; the library holds frequent workshops and offers assistance in drafting data management plans. It is likely, however, that both government and non-government funders – as well as scholarly journals – will continue to focus on the effective management of research data. Moreover, because universities such as Tufts should be able to appraise one of their most fundamental assets, research data access continues to require our attention.

John Miedema: “This ‘slip-world’ was quite a world and he’d almost lost it once because he hadn’t written any of it down.” Pirsig, Lila

Mon, 2014-12-29 02:37

He saw that her suitcase had shoved all his trays of slips over to one side of the pilot berth. They were for a book he was working on and one of the four long card-catalog-type trays was by an edge where it could fall off. That’s all he needed, he thought, about three thousand four-by-six slips of note pad paper all over the floor.

He got up and adjusted the sliding rest inside each tray so that it was tight against the slips and they couldn’t fall out. Then he carefully pushed the trays back into a safer place in the rear of the berth. Then he went back and sat down again.

It would actually be easier to lose the boat than it would be to lose those slips. There were about eleven thousand of them. They’d grown out of almost four years of organizing and reorganizing and reorganizing so many times he’d become dizzy trying to fit them all together. He’d just about given up.

Their overall subject he called a ‘Metaphysics of Quality,’ or sometimes a ‘Metaphysics of Value,’ or sometimes just ‘MOQ’ to save time.

The buildings out there on shore were in one world and these slips were in another. This ‘slip-world’ was quite a world and he’d almost lost it once because he hadn’t written any of it down and incidents came along that had destroyed his memory of it. Now he had reconstructed what seemed like most of it on these slips and he didn’t want to lose it again.

But maybe it was a good thing that he had lost it because now, in the reconstruction of it, all sorts of new material was flooding in – so much that his main task was to get it processed before it log-jammed his head into some kind of a block that he couldn’t get out of. Now the main purpose of the slips was not to help him remember anything. It was to help him to forget it. That sounded contradictory but the purpose was to keep his head empty, to put all his ideas of the past four years on that pilot berth where he didn’t have to think of them. That was what he wanted.

There’s an old analogy to a cup of tea. If you want to drink new tea you have to get rid of the old tea that’s in your cup, otherwise your cup just overflows and you get a wet mess. Your head is like that cup. It has a limited capacity and if you want to learn something about the world you should keep your head empty in order to learn it. It’s very easy to spend your whole life swishing old tea around in your cup thinking it’s great stuff because you’ve never really tried anything new, because you could never get it in, because the old stuff prevented its entry because you were so sure the old stuff was so good, because you never really tried anything new … on and on in an endless circular pattern.

Pirsig, Robert M. (1991). Lila: An Inquiry into Morals. Pg. 22-23.

Patrick Hochstenbach: Post-Christmas Bloat

Sun, 2014-12-28 16:55
Filed under: Comics, Doodles Tagged: birds, cartoon, cat, comic, fudenosuke

State Library of Denmark: Samsung 840 EVO degradation

Sun, 2014-12-28 02:53

Rumour has it that the 25 lovely 1TB Samsung 840 EVO drives in our Net Archive search machine do not perform well when data are left untouched for months. Rumour in this case being solid confirmation with a firmware fix from Samsung. In our setup, index shards are written once, then left untouched for months or even years: exactly the circumstances that trigger the performance degradation.

Measurements, please!

Our 25 shards were built over half a year, giving us a unique opportunity to measure drives in different states of decay. The first experiment was very simple: just read all the data from the drive sequentially by issuing cat index/* > /dev/null and plot the measured time spent, with the age of the files on the x-axis. That shows the impact on bulk read speed. The second experiment was to issue Solr searches to each shard in isolation, testing search speed one drive at a time. That shows the impact on small random reads.
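The bulk read experiment can also be scripted so that the timing is captured alongside the byte count. The following is a minimal Python sketch of the same idea, not the script used for the measurements; the function name time_bulk_read and the chunk size are assumptions made for illustration. Unlike the shell glob, it also reads hidden files in the directory.

```python
import time
from pathlib import Path


def time_bulk_read(directory, chunk_size=1024 * 1024):
    """Read every regular file directly under `directory` once, sequentially.

    Returns a tuple (total_bytes, elapsed_seconds).
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories and other non-regular entries
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)  # data is discarded, as with /dev/null
    return total_bytes, time.perf_counter() - start
```

Calling time_bulk_read("index") on each drive gives one point per drive for the plot. Files that were read recently may be served from the operating system's cache instead of the drive, so a run like this is only meaningful on data that is not already cached.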

For search as well as bulk read, lower numbers are better. The raw numbers for 7 drives follow:

Months  25% search  median search  75% search  95% search  mean search  Bulk read hours
   0.9          36             50          69         269          112             1.11
   2.0          99            144         196         486          203             3.51
   3.0         133            198         269         590          281             6.06
   4.0         141            234         324         670          330             8.70
   5.1         133            183         244         520          295             5.85
   6.0         106            158         211         433          204             5.23
   7.0         105            227         333         703          338            10.49

Inspecting the graph, it seems that search performance quickly gets worse until the data are about 4 months old. After that it stabilizes. Bulk reading, on the other hand, continues to worsen during all 7 months, but that has little relevance for search.

The Net Archive search uses SolrCloud for querying the 25 shards simultaneously and merging the result. We only had 24 shards at the previous measurement 6 weeks ago, but the results should still be comparable. Keep in mind that our goal is to have median response times below 2 seconds for all standard searches; searches matching the full corpus and similar are allowed to take longer.

The distinct hill is present for both the old and the new measurements: see “Even sparse faceting is limited” for details. But the hill has grown for the latest measurements; response times have nearly doubled for the slowest searches. How come it got that much worse during just 6 weeks?

Theory: In a distributed setup, the speed is dictated by the slowest shard. As the data get older on the unpatched Samsung drives, the chance of slow reads rises. Although the median response time for search on a shard with 3-month-old data is about the same as on one with 7-month-old data, the chance of very slow searches rises. As the whole collection of drives got 6 weeks older, the chance of getting through a SolrCloud search without poor performance from at least one drive fell.
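The theory can be made concrete with a little arithmetic. Assume, purely for illustration, that each shard independently has the same chance p of answering slowly; then the chance that at least one of n shards is slow is 1 - (1 - p)^n. The per-shard probabilities below are hypothetical, not measured values, and the independence assumption is a simplification.

```python
def chance_of_slow_query(per_shard_slow_probability, shard_count):
    """Probability that at least one of `shard_count` independent shards is slow."""
    return 1.0 - (1.0 - per_shard_slow_probability) ** shard_count


# Hypothetical per-shard probabilities, chosen only to show the trend.
for p in (0.01, 0.05, 0.10):
    print(
        f"per-shard slow chance {p:.0%}: "
        f"24 shards -> {chance_of_slow_query(p, 24):.1%}, "
        f"25 shards -> {chance_of_slow_query(p, 25):.1%}"
    )
```

Under these assumptions a modest rise in the per-shard chance of a slow read translates into a much larger rise for the merged search, which is consistent with the slowest searches degrading while the per-shard medians stay roughly the same.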

Note how our overall median response time actually got better with the latest measurement, although the mean (average) got markedly worse. This is due to the random distribution of result set sizes. The chart paints a much clearer picture.

Well, fix it then!

The good news is that there is a fix from Samsung. The bad news is that we cannot upgrade the drives using the controller on the server. Someone has to go through the process of removing them from the machine and performing the necessary steps on a workstation. We plan on doing this in January and, besides the hassle and the downtime, we foresee no problems with it.

However, as the drive bug only affects old data, rewriting all of the 25*900GB files should freshen the charges and temporarily bring the drives back to speed. Mads Villadsen suggested using dd if=somefile of=somefile conv=notrunc, so let’s try that. For science!
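For readers who prefer a script to dd, the same read-then-write-back idea can be sketched in Python. This is an illustrative analogue under stated assumptions, not the command that was run here: the function name rewrite_in_place and the chunk size are invented for the example. Whether a rewrite like this actually refreshes the stored data is up to the drive.

```python
import os


def rewrite_in_place(path, chunk_size=4 * 1024 * 1024):
    """Read each chunk of `path` and write the same bytes back at the same offset.

    The file is opened without truncation, so its size and content are unchanged.
    Returns the number of bytes rewritten.
    """
    rewritten = 0
    with open(path, "r+b") as handle:
        while True:
            offset = handle.tell()
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            handle.seek(offset)      # go back to where this chunk started
            handle.write(chunk)      # write the identical bytes over themselves
            rewritten += len(chunk)
        handle.flush()
        os.fsync(handle.fileno())    # ask the OS to push the writes to the device
    return rewritten
```

Because every chunk is written back unchanged, an interrupted run leaves the file content intact; the part not yet processed simply keeps its old on-disk copy.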


It took nearly 11 hours to process drive 1, which had the oldest data. That fits well with the old measurement of bulk speed for that drive, which was 10½ hours for 900GB. After that, a bulk read of the 900GB took 1 hour. Reviving the 24 other drives was done in parallel with a mean speed of 17MB/s, presumably limited by the controller. Bulk read time for the revived drives was 1 hour for 900GB, except for drive 3, which took 1 hour and 17 minutes. Let’s file that under pixies.
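To put the bulk read times on a more familiar scale, they can be converted to average throughput. A small sketch, assuming decimal units (1 GB = 1000 MB); the helper name throughput_mb_per_s is made up for the example.

```python
def throughput_mb_per_s(gigabytes, hours):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return gigabytes * 1000 / (hours * 3600)


print(round(throughput_mb_per_s(900, 10.5), 1))  # degraded drive 1 -> 23.8
print(round(throughput_mb_per_s(900, 1.0), 1))   # after the rewrite -> 250.0
```

Read this way, the rewrite took the oldest drive from roughly 24MB/s to 250MB/s on sequential reads.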

Repeating the individual shard search performance test from before, we get the following results:
Note that the x-axis is now drive number instead of data age. As can be seen, the drives are remarkably similar in performance. Compared to the old test, they are at the same speed as the drive with 1-month-old data, indicating that the degradation sets in after more than 1 month and not immediately. The raw numbers for the same 7 drives as listed in the first table are:

Months  25% search  median search  75% search  95% search  mean search  Bulk read hours
   0.3          34             52          69         329          106             1.00
   0.3          41             55          69         256          104             1.00
   0.3          50             63          85         330          131             1.00
   0.3          39             58          77         301          108             1.00
   0.3          40             57          74         314          106             1.01
   0.3          37             50          66         254           96             1.01
   0.3          24             33          51         344           98             1.00

Running the full distributed search test and plotting the results together with the 1½ month old measurements as well as the latest measurements with the degraded drives gives us the following.

Performance is back to the same level as 1½ months ago, but how come it is not better than that? A quick inspection of the machine revealed that 2 backup jobs had started and were running during the last test; it is unknown how heavily they load the drives, so the test will be re-run when the backups have finished.


The performance degradation of non-upgraded Samsung 840 EVO drives is very real and the degradation is serious after a couple of months. Should you own such drives, it is highly advisable to apply the fixes from Samsung.

Patrick Hochstenbach: Arthur joins Sketchbook Skool

Sat, 2014-12-27 09:45
Filed under: Doodles Tagged: art, arthur, cartoon, comic, fudenosuke, school

District Dispatch: Heritage Health Information survey extended

Fri, 2014-12-26 23:43

The deadline for the Heritage Health Information (HHI) 2014: A National Collections Care Survey has been extended to February 13, 2015! The HHI 2014 is a national survey on the condition of collections held by archives, libraries, historical societies, museums, scientific research collections, and archaeological repositories. It is the only comprehensive survey to collect data on the condition and preservation needs of our nation’s collections.

Invitations to participate were sent to institution directors in October. These invitations included personalized login information, which may be entered at the survey website. Questions about the survey may be directed to hhi2014survey [at] heritagepreservation [dot] org or 202-233-0824.

Heritage Health Information 2014 is sponsored by the Institute of Museum and Library Services and the National Endowments for the Humanities & Arts, and is conducted by Heritage Preservation. Please do all you can to ensure that your institution is represented in this important survey. Your responses are critical in garnering future support for collections care.

The post Heritage Health Information survey extended appeared first on District Dispatch.

District Dispatch: Apply for 2015 National Digital Stewardship Residency Program

Fri, 2014-12-26 23:33

The Library of Congress and the Institute of Museum and Library Services (IMLS) recently announced the official open call for applications for the 2015 National Digital Stewardship Residency, to be held in the Washington, D.C. area. Applications will close on January 30, 2015. To apply, go to the official USAJobs application website.

For the 2015–16 class, five residents will be chosen for a year-long residency at a prominent institution in the Washington, D.C. area. The residency will begin in June, 2015, with an intensive week-long digital stewardship workshop at the Library of Congress. Thereafter, each resident will move to his or her designated host institution to work on a significant digital stewardship project. These projects will allow them to acquire hands-on knowledge and skills involving the collection, selection, management, long-term preservation, and accessibility of digital assets.

The five institutions, and the projects they will offer to NDSR residents, are:

  • American Institute of Architects: Building Curation into Records Creation: Developing a Digital Repository Program at the American Institute of Architects
  • U.S. Senate Historical Office: Improving Digital Stewardship in the U.S. Senate
  • National Library of Medicine: NLM-Developed Software as Cultural Heritage
  • District of Columbia Public Library: Personal Digital Preservation Access and Education through the Public Library
  • Government Publishing Office: Preparation for Audit and Certification of GPO’s FDsys as a Trustworthy Digital Repository

The inaugural class of the NDSR was held in Washington, D.C. in 2013-14. Host institutions for that class included Association of Research Libraries, the Dumbarton Oaks Research Library, the Folger Shakespeare Library, the Library of Congress, the University of Maryland, the National Library of Medicine, the National Security Archive, the Public Broadcasting Service, the Smithsonian Institution Archives and the World Bank.

“We are excited to be collaborating with such dynamic host institutions for the second NDSR residency class in Washington, D.C.,” said Library of Congress Supervisory Program Specialist George Coulborne. “In collaboration with the hosts, we look forward to developing the most engaging experience possible for our residents. Last year’s residents all found employment in fields related to digital stewardship or went on to pursue higher degrees. We hope to replicate that outcome with this class of residents, as well as build bridges between the host institutions and the Library of Congress to advance digital stewardship.”

“At IMLS, we are delighted to continue our work on and funding support for the second round of the NDSR,” said Maura Marx, IMLS Deputy Director for Library Services. “We welcome the new hosts and look forward to welcoming the new residents to all the opportunities this program presents.”

To qualify, applicants must have a master’s degree or higher academic credential, graduating between spring 2013 and spring 2015, with a strong interest in digital stewardship. Currently enrolled doctoral students also are encouraged to apply. Applicants must submit a detailed resume and cover letter, their undergraduate and graduate transcripts, three letters of recommendation, and a creative video that explains an applicant’s interest in the program. Visit the NDSR application website.

The residents chosen for NDSR 2015 will be announced by early April 2015. For additional information and updates regarding the National Digital Stewardship Residency, please see the program website.

The Office of Strategic Initiatives, part of the Library of Congress, oversees the NDSR for the Library and directs the overall digital strategic planning for the Library and the national program for long-term preservation of digital cultural assets, leading a collaborative institution-wide effort to develop consolidated digital future plans, and integrating the delivery of information technology services.

The post Apply for 2015 National Digital Stewardship Residency Program appeared first on District Dispatch.