planet code4lib


John Miedema: I adopted Pirsig’s term “slip” as the unit of text for my cognitive system

Fri, 2015-01-02 03:23

I adopted Pirsig’s term “slip” as the unit of text for my cognitive system. Pirsig was referring to literal slips of paper, the size of index cards. I am working in a digital context, but I share Pirsig’s preference for the slip concept. Its content length is optimal for random access: better than a page, and easily re-sorted and re-categorized until the correct view on the content is decided.

I envision a slip to look like an index card, such as in Figure 1:

The slip has a subject line. It has a paragraph or two of content, just enough words to state an idea with context. The asterisk is used to suggest a category, “quality”. The hashtag is used to suggest a tag, “staticVsDynamic”. The processing of these features in a cognitive system will be detailed later.

A typical non-fiction work has about 100,000 words. Estimating 100 words per slip, a work would have 1000 slips. The count seems manageable for digital processing. Pirsig’s work had about 3000 slips, but then he was writing a metaphysics.

William Denton: Reading diary in Org

Fri, 2015-01-02 03:13

Last year I started using Org to keep track of my reading, instead of on paper, and it worked very well. I was able to see how much I was reading all through the year, which helped me read more.

I have two tables. The first is where I enter what I read. For each book I track the start date, title, author, number of pages, and type (F for fiction, N for nonfiction, A for article). If I don’t finish a book or just skim it I leave the pages field blank and put a - in T. Today I started a new book so there’s just one title, but to add a new one I would go to the dashed line at the bottom and hit S-M-<down> (that’s Alt-Shift-<down> in Emacs-talk) and it creates a new formatted line. I’m down at the bottom of the table so I tab through a few times to get to the # line, which forces a recalculation of the totals.

#+CAPTION: 2015 reading diary
#+ATTR_LATEX: :environment longtable :align l|p{8cm}|p{ccm}|l|l
#+NAME: reading_2015
|   | Date        | Title                              | Author            | Pages | T |
|   |             | <65>                               | <40>              |       |   |
|---+-------------+------------------------------------+-------------------+-------+---|
|   | 01 Jan 2015 | Stoicism and the Art of Happiness  | Donald Robertson  |   232 | N |
|---+-------------+------------------------------------+-------------------+-------+---|
| # |             | 1                                  |                   |   232 |   |
#+TBLFM: $3=@#-3::$5=vsum(@3..@-1)

(The #+CAPTION and #+ATTR_LATEX lines are for the LaTeX export I use if I want to print it.)

The second table is all generated from the first one. I tab through it to refresh all the fields. All of the formulas in the #+TBLFM field should be on one line, but I broke it out here to make it easier to read.

Books per week and predicted books per year are especially helpful in keeping me on track.

#+NAME: reading-analysis-2015
#+CAPTION: 2015 reading statistics
|---+----------------------+---------------|
|   | Statistic            |               |
|---+----------------------+---------------|
| # | Fiction books        |             0 |
| # | Nonfiction books     |             1 |
| # | Articles             |             0 |
| # | DNF or skimmed       |             0 |
| # | Total books read     |             1 |
| # | Total pages read     |           232 |
| # | Days in              |           001 |
| # | Weeks in             |            00 |
| # | Books per week       |             1 |
| # | Pages per day        |           232 |
| # | Predicted books/year |           365 |
|---+----------------------+---------------|
#+TBLFM: @2$3='(length(org-lookup-all "F" '(remote(reading_2015,@2$6..@>$6)) nil))::
         @3$3='(length(org-lookup-all "N" '(remote(reading_2015,@2$6..@>$6)) nil))::
         @4$3='(length(org-lookup-all "A" '(remote(reading_2015,@2$6..@>$6)) nil))::
         @5$3='(length(org-lookup-all "-" '(remote(reading_2015,@2$6..@>>$6)) nil));E::
         @6$3=@2$3+@3$3::@7$3=remote(reading_2015, @>$5)::
         @8$3='(format-time-string "%j")::
         @9$3='(format-time-string "%U")::
         @10$3=round(@6$3/@9$3, 1)::
         @11$3=round(@7$3/@8$3, 0)::
         @12$3=round(@6$3*365/@8$3, 0)
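For readers who do not use Org, the arithmetic behind the statistics table can be sketched in a few lines of Python. This is only an illustration of what the #+TBLFM formulas compute, not part of the Org setup: the `ENTRIES` list, the `reading_stats` function, and the guard against week 0 are all assumptions made for the example.

```python
from datetime import date

# Hypothetical rows mirroring the reading_2015 table: (title, pages, type).
# pages is None for a book that was skimmed or not finished (type "-").
ENTRIES = [
    ("Stoicism and the Art of Happiness", 232, "N"),
]


def reading_stats(entries, today):
    """Recompute the statistics table from the diary rows."""

    def count(kind):
        return sum(1 for _, _, t in entries if t == kind)

    fiction, nonfiction = count("F"), count("N")
    books = fiction + nonfiction
    pages = sum(p for _, p, _ in entries if p is not None)
    days_in = int(today.strftime("%j"))   # day of year, as "%j" in the TBLFM
    weeks_in = int(today.strftime("%U"))  # week of year, as "%U" in the TBLFM
    return {
        "fiction": fiction,
        "nonfiction": nonfiction,
        "articles": count("A"),
        "dnf_or_skimmed": count("-"),
        "total_books": books,
        "total_pages": pages,
        "days_in": days_in,
        "weeks_in": weeks_in,
        # Week 0 would divide by zero, so it is treated as week 1 (an assumption).
        "books_per_week": round(books / max(weeks_in, 1), 1),
        "pages_per_day": round(pages / days_in),
        "predicted_books_per_year": round(books * 365 / days_in),
    }


stats = reading_stats(ENTRIES, date(2015, 1, 1))
```

With one 232-page book entered on the first day of the year, the prediction is a book a day for the whole year, which is why the table above shows 365.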

(Tables and the spreadsheet in Org are very, very useful. I use them every day for a variety of things.)

With all that information in tables it’s easy to embed code to pull out other statistics and make charts. I’ll cover that when I tweak something to handle co-written books, but today, after getting some of that working for the first time, I was able to see my most read authors over the last four years are Anthony Trollope, Terry Pratchett and Anthony Powell. There are 39 authors who I’ve read at least twice. Some of them I’ll never read again, some I’ll read everything new they come out with (like Sarah Waters, who I only started reading this year) and some are overdue for rereading (like Georgette Heyer).

Patrick Hochstenbach: 2015 – Day One

Thu, 2015-01-01 19:41
Filed under: Doodles Tagged: cat, doodle, newyear

Eric Lease Morgan: Great Books Survey

Thu, 2015-01-01 15:55

I am happy to say that the Great Books Survey is still going strong. Since October of 2010 it has been answered 24,749 times by 2,108 people from all over the globe. To date, the top five “greatest” books are Athenian Constitution by Aristotle, Hamlet by Shakespeare, Don Quixote by Cervantes, Odyssey by Homer, and the Divine Comedy by Dante. The least “greatest” books are Rhesus by Euripides, On Fistulae by Hippocrates, On Fractures by Hippocrates, On Ulcers by Hippocrates, and On Hemorrhoids by Hippocrates. “Too bad Hippocrates”.

For more information about this Great Books of the Western World investigation, see the various blog postings.

Eric Hellman: The Year Amazon Failed Calculus

Wed, 2014-12-31 22:40
In August, Amazon sent me a remarkable email containing a treatise on ebook pricing. I quote from it:
... e-books are highly price elastic. This means that when the price goes down, customers buy much more. We've quantified the price elasticity of e-books from repeated measurements across many titles. For every copy an e-book would sell at $14.99, it would sell 1.74 copies if priced at $9.99. So, for example, if customers would buy 100,000 copies of a particular e-book at $14.99, then customers would buy 174,000 copies of that same e-book at $9.99. Total revenue at $14.99 would be $1,499,000. Total revenue at $9.99 is $1,738,000. The important thing to note here is that the lower price is good for all parties involved: the customer is paying 33% less and the author is getting a royalty check 16% larger and being read by an audience that’s 74% larger. The pie is simply bigger.

As you probably know, I'm an engineer, so when I read that paragraph, my reaction was not to write an angry letter to Hachette or to Amazon - my reaction was to start a graph. And I have a third data point to add to the graph. At, I've been working on a rather different price point, $0. Our "sales" rate is currently about 100,000 copies per year. Our total "sales" revenue for all these books adds up to zero dollars and zero cents. It's even less if you convert to bitcoin.

($0 is terrible for sales revenue, but it's a great price for ebooks that want to accomplish something other than generate sales revenue. Some books want more than anything to make the world a better place, and $0 can help them do that, which is why is trying so hard to support free ebooks.)

So here's my graph of the revenue curve combining "repeated and careful measurements" from Amazon and

I've added a fit to the simplest sensible algebraic equation possible that fits the data, Ax/(1+Bx^2), which suggests that the optimum price point is $8.25. Below $8.25, the increase in unit sales won't make up for the drop in price, and even if the price drops to zero, only twice as many books are sold as at $8.25 - the market for the book saturates.
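The fit can be reproduced approximately from the numbers in Amazon's email alone. The sketch below is an illustration rather than the method used for the graph, which the post does not describe: it assumes the curve passes exactly through Amazon's two price points and solves for A and B in closed form. All variable and function names are made up for the example.

```python
import math

# Figures quoted in Amazon's email.
P_HIGH, P_LOW = 14.99, 9.99
UNITS_HIGH, UNITS_LOW = 100_000, 174_000

revenue_high = P_HIGH * UNITS_HIGH  # about $1,499,000
revenue_low = P_LOW * UNITS_LOW     # about $1,738,000

# Model from the post: revenue R(x) = A*x / (1 + B*x**2),
# so unit sales are q(x) = R(x) / x = A / (1 + B*x**2).
# The sales ratio q(P_LOW) / q(P_HIGH) pins down B without needing A.
ratio = UNITS_LOW / UNITS_HIGH
B = (ratio - 1) / (P_HIGH**2 - ratio * P_LOW**2)
A = UNITS_HIGH * (1 + B * P_HIGH**2)  # unit sales the model implies at $0


def units(price):
    """Unit sales predicted by the fitted curve."""
    return A / (1 + B * price**2)


def revenue(price):
    """Revenue predicted by the fitted curve."""
    return price * units(price)


# dR/dx = A*(1 - B*x**2) / (1 + B*x**2)**2 vanishes at x = 1/sqrt(B).
optimum = 1 / math.sqrt(B)

# At the optimum 1 + B*x**2 == 2, so sales at $0 are exactly twice
# the sales at the optimum price.
saturation_ratio = units(0) / units(optimum)
```

Under these assumptions the optimum lands within a few cents of the $8.25 read off the graph, and `saturation_ratio` comes out as 2, matching the observation that a free book sells only twice as many copies as one at the optimum price.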

But Amazon seems to have quit calculus after the first semester, because the real problem has a lot more variables than the one Amazon has solved for. This is because they've ignored the effect of changing a book's price on sales of ALL THE OTHER BOOKS. For you math nerds out there, Amazon has measured a partial derivative when the quantity of interest is the total derivative of revenue. Sales are higher at $10 than at $15 mostly because consumers perceive $15 as expensive for an ebook when most other ebooks are $10. So maybe your pie is bigger, but everyone else is stuck with pop-tarts.

While any individual publisher will find it advantageous to price their books slightly below the prevailing price, the advantage will go away when every publisher lowers its price.

Some price-sensitive readers will read more ebooks if the price is lowered. These are the readers who spend the most on ebooks and are the same readers who patronize libraries. Amazon wants the loyalty of customers so much that they introduced the Kindle Unlimited service. Yes, Amazon is trying to help its customers spend less on their reading obsessions. And yes, Amazon is doing their best to win these customers away from those awful libraries.

But I'm pretty sure that Jeff Bezos passed calculus. He was an EECS major at Princeton (I was, too). So maybe the calculation he's doing is a different one. Maybe his price optimization for ebooks is not maximizing publisher revenue, but rather Amazon's TOTAL revenue. Supposing someone spends less to feed their book habit, doesn't that mean they'll just spend it on something else? And where are they going to spend it? Maybe the best price for Amazon is the price that keeps the customer glued to their Kindle tablet, away from the library and away from the bookstore. The best price for Amazon might be a terrible price for a publisher that wants to sell books.

Read Shatzkin on Publishers vs. Amazon. Then read Hoffelder on the impact of Kindle Unlimited. The last Amazon article you should read this year is Benedict Evans on profits.

It's too late to buy Champagne on Amazon - this New Year's at least.

Meredith Farkas: Peer learning in library instruction

Wed, 2014-12-31 18:30

Teaching is such a solitary thing. Sure, you’re up in front of a bunch of students, and maybe an instructor if you’re doing course-integrated instruction, but the act still feels solitary. We try to make it less so by seeking feedback from instructors and doing assessment, but we rarely get feedback from people who really understand what we do: our colleagues in the library.

But doing that can be terrifying for some. The idea of showing off your approach to teaching can be intimidating. Many of us assume that whatever our colleagues are doing in the classroom, it’s probably ten times more brilliant than what we are doing. I can guarantee that it’s probably different from what you do, but the fear of finding out they’re so much better at it than you is likely unfounded. You probably do some things they wish they did, and they probably do some things you wish you did. All that will come from discovering this is that you will learn more and become better, which seems worth a bit of anxiety.

As a former head of instruction at two institutions, I’ve guided colleagues through peer learning exercises around their instructional practice. What I’ve learned from doing this at two very different institutions is that there is no one-size-fits-all approach; you have to tailor the approach to the needs, anxieties, and culture of the group. But the value and importance of being able to talk about the good, the bad, and the ugly of teaching with your colleagues cannot be overstated. Not only does it improve your own reflective practice, but it creates a true community of practice, which will elevate everyone’s teaching. A rising tide raises all ships, right?

At one institution, we did peer observation of instruction followed by one-on-one meetings. Before the start of the term, each librarian chose two colleagues whose instruction they would observe (we made sure no one had to work with someone they reported to) and those people, in turn, also observed them. So everyone observed and was observed by two people. We let the observed librarian pick the class they felt most comfortable having observed, which I think is a good way to decrease anxiety. If I could go back, I’d have each librarian first choose two sessions they are comfortable having observed and then have people choose their observation partners based on scheduling availability, because I know at least one person couldn’t make it to the “ideal” session.

As each pair watched the instruction session, they took notes and wrote down questions they had about the approach the librarian was taking. This was for the one-on-one meeting each pair would have after the session. The idea was not to look for flaws, but to better understand their approach and brainstorm together better ways of meeting their instructional goals. But I think we all found that the act of watching two other people teach was actually far more enlightening than the conversations we had about our own teaching (though they were valuable too). We were able to lift the veil and see other approaches, and the ideas we got from this were incorporated into our future teaching. It worked out really well for all of us.

I will say that the group I did this with was tremendously comfortable with one another. We trusted and relied on each other, and I think that was what made it possible to do this successfully. At an institution where librarians are more anxious about their teaching or simply don’t trust each other enough, the approach might require tweaks. Maybe there is no meeting after the instruction sessions to discuss and reflect on them. Maybe instead, every librarian just observes two other librarians teaching. That, in and of itself, is so valuable. Or maybe your colleagues are just not comfortable letting other librarians watch them teach. These are not concerns to just brush off and ignore. It takes time to build a culture of trust, so if it’s not there yet, you need to find other ways to build a community of practice and ethic of peer learning that get people relying on each other for their learning and instructional improvement. It’s well worth the work.

One way to do this without peer observation is through reflective peer coaching, “a formative model that examines intentions prior to teaching and reflections afterwards” (Vidmar, 2006). We did this several times at one of my places of work and everyone who participated found it really helpful. We adopted the model promoted by our wonderful colleague at Southern Oregon University, Dale Vidmar. If you’re interested in improving teaching and reflective practice, he is a guy to know (or at least read his work!).

So, with reflective peer coaching, librarians pair up and meet once before the individual instruction sessions they want to reflect on are going to be taught. In that first meeting, each librarian talks about the session and their goals for it. They may also discuss concerns or fears that they have about the session, though not everyone will be comfortable with that. Their partner may ask questions to elicit further reflection on their goals and approach, but they are not there to make suggestions.

The pairs do not actually observe the instruction sessions they’ve heard about. Instead, they meet afterwards to discuss how it went. The act of doing this is what really creates a culture of reflective practice. Taking the time to really think about what went well, what didn’t, and how you might improve next time is so valuable. Having to articulate that to someone else, who may be asking probing questions that get you thinking more deeply about the session, leads to even greater learning. I provided each pair with suggested questions (most of which were borrowed from Dale’s work) that they could ask to elicit responses from their partner, but they were free to conduct these conversations however they chose so long as it wasn’t focused on making suggestions to their partner (which is harder to avoid than you might think! We naturally want to offer our help!). In Dale’s model, there is a third person involved, an observer, who makes sure that the pair is focused on reflection and questioning, not suggesting or advising and makes note of any really interesting comments from the person reflecting. Given how busy my colleagues were, we didn’t have observers.

Another way to build a culture of peer learning is through workshopping instruction sessions. This is where a single librarian talks about a session they’ve taught before or are teaching soon with the rest of the instruction librarians or community of practice. Maybe it’s one that is problematic for one reason or another — no computers for the students, big lecture class, instructor asking them to teach ALL THE THINGS, short time-frame, etc. — or it may just be one the librarian wasn’t satisfied with or is anxious about. So they come to their community of practice seeking feedback. How this plays out depends on the time constraints. It can range from simply offering suggestions to collaboratively redesigning the entire session in sub-groups to give the librarian seeking feedback a variety of different approaches to consider. Either way, the focus is on improving the teaching of a single session. While even this can be intimidating, it doesn’t really require laying yourself bare in the same way you would if your colleagues were actually watching you teach. We tried this a few times at our monthly instruction meetings at Portland State and it went pretty well.

Dale Vidmar (2006) writes that “two essential elements to meaningful collaboration and reflection are to create a trusting relationship and to promote thought and inquiry.” But what if you don’t have a community of practice at your place of work? What if there isn’t a culture of trust and the group dynamics are such that trying to create it would be fraught with peril? Well, you can create your own informal community of practice with even just one other colleague. In that case, it’s mainly about having a buddy you trust that you can bounce ideas off of. The value of this cannot be overstated. Even if you are part of a community of practice, I think having a buddy or two (or more) with whom you feel comfortable enough to share your fears and seek help from on a more frequent basis is critical in the workplace. At Portland State, I had my “pocket of wonderful”, a group with whom I was constantly talking about instruction sessions, and who did the same with me. When I created a new tutorial, they always got the first look before I sent things to the larger instruction list. I learned so much from them and feel like I’m a better instructor thanks to the informal conversations we had. At PCC, I was lucky in my first term to have a wonderful colleague who showed me his approaches to teaching certain classes (that I’d be teaching too), warned me about problematic instructors, and gave me valuable feedback. All of my colleagues are completely lovely and helpful, but his support of my instructional practice was invaluable. I hope in the future, I can be of help to him.

We don’t have to struggle alone. Whether you have a single trusted colleague or a large group that meets regularly, you can find ways to build a practice of reflection and peer learning around instruction.


Work Cited

Vidmar, Dale J. 2006. Reflective peer coaching: Crafting collaborative self-assessment in teaching.
Research Strategies 20: 135-148. (you can find a .doc file of this article on Dale’s website)

 Photo credit: Reflecting Critically to Improve Action, A Guide for Project M & E

John Miedema: “Instead of asking ‘Where does this metaphysics of the universe begin?’ – which was a virtually impossible question – all he had to do was just hold up two slips and ask, ‘Which comes first?’”

Wed, 2014-12-31 14:47

Because he didn’t pre-judge the fittingness of new ideas or try to put them in order but just let them flow in, these ideas sometimes came in so fast he couldn’t write them down quickly enough. The subject matter, a whole metaphysics, was so enormous the flow had turned into an avalanche. The slips kept expanding in every direction so that the more he saw the more he saw there was to see. It was like a Venturi effect which pulled ideas into it endlessly, on and on. He saw there were a million things to read, a million leads to follow … too much … too much … and not enough time in one life to get it all together. Snowed under.

There’d been times when an urge surfaced to take the slips, pile by pile, and file them into the door of the coal stove on top of the glowing charcoal briquets and then close the door and listen to the cricking of the metal as they turned into smoke. Then it would all be gone and he would be really free again.

Except that he wouldn’t be free. It would still be there in his mind to do.

So he spent most of his time submerged in chaos, knowing that the longer he put off setting into a fixed organization the more difficult it would become. But he felt sure that sooner or later some sort of a format would have to emerge and it would be a better one for his having waited.

Eventually this belief was justified. Periods started to appear when he just sat there for hours and no slips came in – and this, he saw, was at last the time for organizing. He was pleased to discover that the slips themselves made this organizing much easier. Instead of asking ‘Where does this metaphysics of the universe begin?’ – which was a virtually impossible question – all he had to do was just hold up two slips and ask, ‘Which comes first?’ This was easy and he always seemed to get an answer. Then he would take a third slip, compare it with the first one, and ask again, ‘Which comes first?’ If the new slip came after the first one he compared it with the second. Then he had a three-slip organization. He kept repeating the process with slip after slip.

Pirsig, Robert M. (1991). Lila: An Inquiry into Morals. Pg. 24.
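The procedure in the passage above (take a new slip, compare it with each slip already in order, and place it where it belongs) is what programmers call an insertion sort. A minimal sketch follows; the function names, the sample slips, and the numeric hint standing in for the human judgement of "which comes first?" are all hypothetical.

```python
def insert_slip(ordered, slip, comes_first):
    """Place one new slip by asking 'which comes first?' against each slip in turn."""
    for position, existing in enumerate(ordered):
        if comes_first(slip, existing):
            ordered.insert(position, slip)
            return ordered
    ordered.append(slip)  # the new slip came after every existing one
    return ordered


def organize(slips, comes_first):
    """Build the full sequence one slip at a time (an insertion sort)."""
    ordered = []
    for slip in slips:
        insert_slip(ordered, slip, comes_first)
    return ordered


# Hypothetical slips: (sequence hint, subject). The hint stands in for
# the judgement a person would make when holding up two slips.
slips = [(3, "static vs dynamic"), (1, "quality"), (2, "random access")]
ordered = organize(slips, lambda a, b: a[0] < b[0])
```

The point of the passage survives the translation: no global plan is needed, only a local answer to a two-slip question, asked repeatedly.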

Alf Eaton, Alf: The trouble with scientific software

Wed, 2014-12-31 12:44

Via Nautilus’ excellent Three Sentence Science, I was interested to read Nature’s list of “10 scientists who mattered this year”.

One of them, Sjors Scheres, has written software - RELION - that creates three-dimensional images of protein structures from cryo-electron microscopy images.

I was interested in finding out more about this software: how it had been created, and how the developer(s) had been able to make such a significant improvement in protein imaging.

The Scheres lab has a website. There’s no software section, but in the “Impact” tab is a section about RELION:

“The empirical Bayesian approach to single-particle analysis has been implemented in RELION (for REgularised LIkelihood OptimisatioN). RELION may be downloaded for free from the RELION Wiki. The Wiki also contains its documentation and a comprehensive tutorial.”

I was hoping for a link to GitHub, but at least the source code is available (though the “for free” is worrying, signifying that the default is “not for free”).

On the RELION Wiki, the introduction states that RELION “is developed in the group of Sjors Scheres” (slightly problematic, as this implies that outsiders are excluded, and that development of the software is not an open process).

There’s a link to “Download & install the 1.3 release”. On that page is a link to “Download RELION for free from here”, which leads to a form, asking for name, organisation and email address (which aren’t validated, so can be anything - the aim is to allow the owners to email users if a critical bug is found, but this shouldn’t really be a requirement before being allowed to download the software).

Finally, you get the software: relion-1.3.tar.bz2, containing files that were last modified in February and June this year.

The file is downloaded over HTTP, with no hash provided that would allow verification of the authenticity or correctness of the downloaded file.
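If a digest were published alongside the download, checking the file would take only a few lines. The sketch below assumes a SHA-256 hex digest obtained from a trusted channel; no such digest is actually provided for RELION, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, published_hex):
    """Compare a downloaded file against a digest published by the maintainers."""
    return sha256_of(path) == published_hex.strip().lower()
```

A matching digest shows the file was not corrupted or altered in transit, but only if the digest itself is fetched over a channel the user trusts, such as HTTPS.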

The COPYING file contains the GPLv2 license - good!

There’s an AUTHORS file, but it doesn’t really list the contributors in a way that would be useful for citation. Instead, it’s mostly about third-party code:

This program is developed in the group of Sjors H.W. Scheres at the MRC Laboratory of Molecular Biology.
However, it does also contain pieces of code from the following packages:
XMIPP: http:/
BSOFT:
HEALPIX:
Original disclaimers in the code of these external packages have been maintained as much as possible. Please contact Sjors Scheres ( if you feel this has not been done correctly.

This is problematic, because the licenses of these pieces of software aren’t known. They are difficult to find: trying to download XMIPP hits another registration form, and BSOFT has no visible license. At least HEALPIX is hosted on SourceForge and has a clear GPLv2 license.

The CHANGELOG and NEWS files are empty. Apart from the folder name, the only way to find out which version of the code is present is to look in the configure script, which contains PACKAGE_VERSION='1.3'. There’s no way to know what has changed from the previous version, as the previous versions are not available anywhere (this also means that it’s impossible to reproduce results generated using older versions of the software).

The README file contains information about how to credit the authors of RELION if it is used: by citing the article Scheres, JMB (2011) (DOI: 10.1016/j.jmb.2011.11.010) which describes how the software works (the version of the software that was available in 2011, at least). This article is Open Access and published under CC-BY v3 (thanks MRC!).

Suggested Improvements

The source code for RELION should be in a public version control system such as GitHub, with tagged releases.

The CHANGELOG should be maintained, so that users can see what has changed between releases.

There should be a CITATION file that includes full details of the authors who contributed to (and should be credited for) development of the software, the name and current version of the software, and any other appropriate citation details.

Each public release of the software should be archived in a repository such as figshare, and assigned a DOI.

There should be a way for users to submit visible reports of any issues that are found with the software.

The parts of the software derived from third-party code should be clearly identified, and removed if their license is not compatible with the GPL.

For more discussion of what is needed to publish citable, re-usable scientific software, see the issues list of Mozilla Science Lab's "Code as a Research Object" project.

Patrick Hochstenbach: Happy New Year!

Wed, 2014-12-31 08:42
Filed under: Doodles Tagged: cartoon, cat, comic, doodle, newyear

Jenny Rose Halperin: Bulbes: a soup zine. Call for Submissions!

Tue, 2014-12-30 21:04

Please forward widely!

It’s that time of year, when hat hair is a reality and wet boots have to be left at the door. Frozen fingers and toes are warmed with lots of tea and hot cocoa, and you have heard so many Christmas songs that music is temporarily ruined.

I came to the conclusion a few years ago that soup is magic (influenced heavily by a friend, a soup evangelist) and decided to start a zine about soup, called Bulbes.


It is currently filled mostly with recipes, but also some poems (written by myself and others) and essays and reflections and jokes about soup. Some of you have already submitted to the zine, which is why all this may sound familiar.

Unfortunately, I hit a wall at some point and never finished it, but this year is the year! I finally have both the funds and feelings to finish this project and I encourage all of you to send me

* Recipes (hot and cold soups are welcome)
* Artwork about soup (particularly cover artwork!)
* Soup poems
* Soup essays
* Soup songs
* Soup jokes
* Anything else that may be worth including in a zine about soup

Submissions can be original or found, new or old.

Submission deadline is January 20 (after all the craziness of this time of year and early enough so that I can finish it and send it out before the end of winter!) If you need more time, please tell me and I will plan accordingly.
If you want to snail mail me your submission, get in touch for my address.

Otherwise email is fine!

Happy holidaze to all of you.



PS I got a big kick in the tuchus to actually finish this when I met Rachel Fershleiser, who kindly mailed me a copy of her much more punnily named “Stock Tips” last week.  It was pretty surreal to meet someone else who made a zine about soup!

check it out!

District Dispatch: CopyTalk webinar on Georgia State e-reserves case

Tue, 2014-12-30 15:15

Join us for our next installment of CopyTalk, January 8th at 2pm Eastern Time.

Laura Quilter will be our guest speaker for CopyTalk, a bimonthly webinar brought to you by the Office for Information Technology Policy’s subcommittee on Copyright Education. Our topic: an update on the lawsuit brought by three academic publishers against Georgia State University regarding fair use and e-reserves. If you have been keeping track, the GSU case is now in its fourth year of litigation. Most recently, in October 2014, the Eleventh Circuit Court of Appeals overturned the 2012 decision in favor of GSU, a decision questioned by both rights holders and supporters of Georgia State due to the court’s formulaic application of fair use. Is this a serious setback or possibly good news? If litigation continues, the GSU case is destined to be a major ruling on fair use of digital copies for educational purposes. Not to be missed!!

Laura Quilter is the copyright and information policy attorney/librarian at the University of Massachusetts Amherst.  She works with the UMass Amherst community on copyright and related matters, equipping faculty, students, and staff with the understanding they need to navigate copyright, fair use, open access, publishing, and related issues.   Laura maintains a teaching appointment at Simmons College School of Library & Information Science, and has previously taught at UC Berkeley School of Law with the Samuelson Law, Technology & Public Policy Clinic. She holds an MSLIS degree (1993, U. of Kentucky) and a JD (2003, UC Berkeley School of Law). She is a frequent speaker, who has taught and lectured to a wide variety of audiences. She previously maintained a consulting practice on intellectual property and technology law matters. Laura’s research interests are the intersection of copyright with intellectual freedom and access to knowledge, and more generally human rights concerns within information law and policy.

There is no need to pre-register for this free webinar! Just show up on January 8, 2015 at 2pm Eastern.

Note that the webinar is limited to 100 seats so watch with colleagues if possible. An archived copy will be available after the webinar.




The post CopyTalk webinar on Georgia State e-reserves case appeared first on District Dispatch.

John Miedema: “The reason Phaedrus used slips rather than full-sized sheets of paper is that a card-catalog tray full of slips provides a more random access.”

Tue, 2014-12-30 13:47

The reason Phaedrus used slips rather than full-sized sheets of paper is that a card-catalog tray full of slips provides a more random access. When information is organized in small chunks that can be accessed and sequenced at random it becomes much more valuable than when you have to take it in serial form. It’s better, for example, to run a post office where the patrons have numbered boxes and can come in to access these boxes any time they please. It’s worse to have them all come in at a certain time, stand in a queue and get their mail from Joe, who has to sort through everything alphabetically each time and who has rheumatism, is going to retire in a few years, and who doesn’t care whether they like waiting or not. When any distribution is locked into a rigid sequential format it develops Joes that dictate what new changes will be allowed and what will not, and that rigidity is deadly.

Some of the slips were actually about this topic: random access and Quality. The two are closely related. Random access is at the essence of organic growth, in which cells, like post-office boxes, are relatively independent. Cities are based on random access. Democracies are founded on it. The free market system, free speech, and the growth of science are all based on it. A library is one of civilization’s most powerful tools precisely because of its card-catalog trays. Without the Dewey Decimal System allowing the number of cards in the main catalog to grow or shrink at any point the whole library would soon grow stale and useless and die.

And so while those trays certainly didn’t have much glamour they nevertheless had the hidden strength of a card catalog. They ensured that by keeping his head empty and keeping sequential formatting to a minimum, no fresh new unexplored idea would be forgotten or shut out. There were no ideological Joes to kill an idea because it didn’t fit into what he was already thinking.

Pirsig, Robert M. (1991). Lila: An Inquiry into Morals. Pg. 23-24.

William Denton: CBC appearances (updated)

Tue, 2014-12-30 00:41

Sean Craig’s “Amanda Lang took money from Manulife & Sun Life, gave them favourable CBC coverage” piece on Canadaland got me looking at the cbcappearances script I wrote earlier this year.

It wasn’t getting any of the recent appearances. It looks like the CBC changed how they store the data that is presented: instead of pulling it in on the fly from Google spreadsheets, it’s all in the page in a hidden table (generated by their content management system, I guess) and shown as needed.

They should be making the data available in a reusable format, but they still aren’t. So we need to scrape it, but that’s easy, so I updated the script and regenerated appearances.csv, a nice reusable CSV file suitable for importing into your favourite data analysis tool. The last appearance listed was on 29 November 2014; I assume the December ones will show up soon in January.

The data shows 218 people have made 716 appearances since 24 April 2014. A quick histogram of appearances per person shows that most made only 1 or 2 appearances, and then it quickly tails off. Here’s how I did things in R:

> library(dplyr)
> library(ggplot2)
> cbc <- read.csv("appearances.csv", header = TRUE, stringsAsFactors = TRUE)
> cbc$date <- as.Date(cbc$date)
> totals <- cbc %>% group_by(name) %>% summarise(count = n()) %>% select(count)
> qplot(totals$count, binwidth = 1)

Histogram of appearance counts. Very skewed.

The median number of appearances is 2, the mean is about 3.3, and the third quartile is 4. Let’s label anyone at or above the third quartile as “busy,” and pick out everyone who is busy, then make a data frame of just the appearance information about busy people.

> summary(totals$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   3.284   4.000  33.000
> quantile(totals$count)
  0%  25%  50%  75% 100%
   1    1    2    4   33
> busy.number <- quantile(totals$count)[[4]]
> busy.number
[1] 4
> busy_people <- cbc %>% group_by(name) %>% summarise(count = n()) %>% filter(count >= busy.number) %>% select(name)
> head(busy_people)
Source: local data frame [6 x 1]

  name
1 Adrian Harewood
2 Alan Neal
3 Amanda Lang
4 Andrew Chang
5 Anne-Marie Mediwake
6 Brian Goldman
> busy <- cbc %>% filter(name %in% busy_people$name)
> head(busy)
  name       date       event                                           role      fee
1 Nora Young 2014-11-20 University of New Brunswick: Andrews initiative Lecture   Paid
2 Carol Off  2014-11-14 War Museum                                      Interview Expenses
3 Rex Murphy 2014-11-13 The Salvation Army                              Speech    Paid
4 Carol Off  2014-11-03 Giller Prize                                    Interview Paid
5 Carol Off  2014-11-01 International Federation of Authors             Interview Unpaid
6 Carol Off  2014-10-27 International Federation of Authors             Interview Unpaid

Now busy is a data frame of information about who did what where, but only for people with 4 or more appearances. It’s easy to do a stacked bar chart that shows how many of each type of fee (Paid, Unpaid, Expenses) each person received. There aren’t many situations where someone did a gig for expenses (red). Most are unpaid (blue) and some are paid (green).

> ggplot(busy, aes(name, fill = fee)) + geom_bar() + coord_flip()

Number of appearances by remuneration types

Lawrence Wall is doing a lot of unpaid appearances, and has never done any for pay. Good for him. Rex Murphy is the only busy person who only does paid appearances. Tells you something, that.

Let’s pick out just the paid appearances of the busy people. No need to colour anything this time.

> ggplot(busy %>% filter(fee == "Paid"), aes(name)) + geom_bar() + coord_flip()

Number of paid appearances by busy people

Amanda Lang is way out in the lead, with Peter Mansbridge second and Heather Hiscox and Dianne Buckner tied to show. In R, with dplyr, it’s easy to poke around in the data and see what’s going on, for example looking at the paid appearances of Amanda Lang and—as someone I’d expect/hope to be a lot different—Nora Young:

> busy %>% filter(name == "Amanda Lang", fee == "Paid") %>% select(date, event)
   date       event
1  2014-11-27 Productivity Alberta Summit
2  2014-11-26 Association of Manitoba Municipalities 16th Annual Convention
3  2014-11-24 Portfolio Managers Association of Canada
4  2014-11-24 Sun Life Client Appreciation
5  2014-11-18 Vaughan Chamber of Commerce
6  2014-11-04 "PwC’s Western Canada Conference
7  2014-10-30 Chartered Institute of Management Accountants Conference on Innovation
8  2014-10-27 2014 ASA - CICBV Business Valuation Conference
9  2014-10-22 Simon Fraser University Public Square
10 2014-10-07 Colliers International Market Outlook Breakfast
11 2014-09-22 National Insurance Conference
12 2014-09-15 RIMS Canada Conference
13 2014-08-19 Association of Municipalities of Ontario Annual Conference
14 2014-08-07 Manulife Asset Management Seminar
15 2014-07-10 Manulife Asset Management Seminar
16 2014-06-26 Manulife Asset Management Seminar
17 2014-05-29 Manulife Asset Management Seminar
18 2014-05-13 GeoConvention Show Calgary
19 2014-05-09 Alberta Urban Development Institute
20 2014-05-08 Young Presidents Organization
21 2014-05-07 Canadian Restaurant Investment Summit
22 2014-05-06 Canadian Hotel Investment Conference
> busy %>% filter(name == "Nora Young", fee == "Paid") %>% select(date, event)
  date       event
1 2014-11-20 University of New Brunswick: Andrews initiative
2 2014-10-04 EdTech Team Ottawa: Bilingual Ottawa Summit feat. Google
3 2014-10-02 Humber College: President's Lecture Series
4 2014-10-01 Speech Ontario Professional Planners Institute: Healthy Communities and Planning in the Digital Age

Nora Young spoke about healthy communities and education to planners and colleges and universities … Amanda Lang spoke to developers and business groups and insurance companies. They are a lot different.

At this point, following up on any relation between Amanda Lang (or another host) and paid corporate gigs requires examination by hand. If the transcripts of The Exchange with Amanda Lang were available then it would be possible to write a script to look through them for mentions of these corporate hosts, which would provide clues for further examination. If the interviews were catalogued by a librarian with a controlled vocabulary then it would be even easier: you’d just do a query to find all occasions where (“Amanda Lang” wasPaidBy ?company) AND (“Amanda Lang” interviewed ?person) AND (?person isEmployeeOf ?company) and there you go, a list of interviews that require further investigation.
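As a rough sketch of what such a query could look like, assume the catalogue existed as three simple tables, one per relation. Everything below is hypothetical: the table names (paid_by, interviewed, employee_of), the column names and the placeholder hosts, people and companies are made up for illustration and are not taken from the CBC data. Two joins in base R stand in for the three-part query pattern:

```r
# Hypothetical catalogue data: placeholder names only, not real CBC data.
paid_by <- data.frame(
  host    = c("Host A", "Host A", "Host B"),
  company = c("Company X", "Company Y", "Company X"),
  stringsAsFactors = FALSE
)
interviewed <- data.frame(
  host   = c("Host A", "Host A", "Host B"),
  person = c("Person P", "Person Q", "Person P"),
  stringsAsFactors = FALSE
)
employee_of <- data.frame(
  person  = c("Person P", "Person Q"),
  company = c("Company X", "Company Z"),
  stringsAsFactors = FALSE
)

# (host wasPaidBy ?company) AND (host interviewed ?person):
# every pairing of a paying company with an interviewee, per host.
candidates <- merge(paid_by, interviewed, by = "host")

# ... AND (?person isEmployeeOf ?company):
# keep only pairings where the interviewee works for the paying company.
flagged <- merge(candidates, employee_of, by = c("person", "company"))
flagged <- flagged[order(flagged$host), c("host", "person", "company")]

print(flagged)
```

Each row of flagged would be an interview worth a closer look by hand; the joins only find the overlap, they say nothing about whether coverage was actually affected.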

But it’s not all catalogued neatly, so journalists need to dig. This kind of initial data munging and visualization may, however, be helpful in pointing out who should be looked at first. Lang, Mansbridge and Murphy are the first three that Canadaland looked at, which does make me wonder what checking Hiscox and Buckner would show … are they different, and if so, how and why, and what does that say? I don’t know. This is as far as I’ll go with this cursory analysis.

In any case: hurrah to the CBC for making the data available, but boo for not making the raw data easy to use. Hurrah to Canadaland for investigating all this and forcing the issue.

Mark E. Phillips: A measure of metadata improvement in the UNT Libraries Digital Collections

Mon, 2014-12-29 23:46

The Portal to Texas History is working with the Digital Public Library of America as a Service Hub for the state of Texas.  As part of this work we are spending time working on a number of projects to improve the number and quality of metadata records in the Portal.

We have a number of student assistants within the UNT Libraries who are working to create metadata for items in our system that do not have complete metadata records.  In doing so we are able to make these items available to the general public.  I thought it might be interesting to write a bit about how we are measuring this work and showing that we are in fact making more content available.

What is a hidden record?

We have two kinds of records that get loaded into our digital library systems: records that are completely fleshed out and “live,” and records that are minimal in nature and serve as placeholders until the full record is created.  The minimal records almost always go into the system in a hidden state while the full records are most often loaded unhidden or published. There are situations where we load these full records into the system as hidden records but that is fairly rare.

How many hidden records?

When we started working on the Service Hubs project with DPLA, 39,851 of the 754,944 metadata records in the system were hidden.  That is about 5% of the records in the system.

Why so many?

There are a few different categories that we can sort our hidden records into.  We have items that are missing full metadata records; these account for the largest percentage of hidden records.  We also have records that belong to partner institutions around the state which most likely will never be completed because something on the partner’s end fell through before the metadata records were completed; we generally call these orphaned metadata records.  We have items that for one reason or another are marked as “duplicate” and are waiting to be purged from the access system.  Finally we have items that are in an embargoed state in the system either because the rights owner for the item has an access embargo on the items, or because we haven’t been able to fully secure rights for the items yet.  Together these make up all of the hidden items in our system.  Unfortunately we currently don’t have a great way of differentiating between these different kinds of hidden records.

How are you measuring progress?

One of the metrics that we are using to establish that we are in fact reducing the number of hidden items in the system over time is to track the percentage of hidden records to total records in the system over time.  This gives us a way to show that we are making progress and continuing to reduce the ratio of hidden to unhidden records in the system.  The following table shows the current data we’ve been collecting for this since August 2014.

Date         Total       Hidden   Percent Hidden
2014-08-04   754,944     39,851   5.278669676
2014-09-02   816,446     43,238   5.295879948
2014-10-14   907,816     38,867   4.281374199
2014-11-05   937,470     44,286   4.723991168
2014-12-14   1,014,890   41,264   4.065859354

You can see that even though we’ve had a few rises between months, overall we’ve been moving in a downward trend in the percentage of records that are hidden.  The dataset that is updated each month is available as a Google Drive Spreadsheet.
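For anyone who wants to reproduce the percentages, here is a minimal sketch in R that recomputes them from the total and hidden counts in the table. The data frame and column names (records, percent_hidden, change) are just illustrative choices, not something our systems produce:

```r
# Counts copied from the table above; column names are illustrative.
records <- data.frame(
  date   = as.Date(c("2014-08-04", "2014-09-02", "2014-10-14",
                     "2014-11-05", "2014-12-14")),
  total  = c(754944, 816446, 907816, 937470, 1014890),
  hidden = c(39851, 43238, 38867, 44286, 41264)
)

# Percent hidden = hidden records / total records * 100.
records$percent_hidden <- 100 * records$hidden / records$total

# Change in percentage points since the previous measurement
# (NA for the first row, which has nothing to compare against).
records$change <- c(NA, diff(records$percent_hidden))

print(records, digits = 4)
```

The change column makes the rises easy to spot: a positive value means the hidden share went up since the previous measurement, even if the long-run direction is down.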

There are several projects that we have loaded in a hidden state over the past few months including over 7,000 Texas Patent records, 1,200 Texas State Auditors Reports and 3,000 photographs from a personal photograph collection.  These were all loaded in a hidden state which explains the large jumps in numbers.

Areas to improve.

One of the things that we have started to think about (but don’t have any solid answers) is a way of classifying the different states that a metadata record can have in our system so that we can have a better understanding of why items are hidden vs not hidden.  We recognize our simple hidden or unhidden designation is lacking.  I would be interested in knowing how others are approaching this sort of issue and if there is some sort of existing work to build upon.  If there is something out there do get in touch and let me know.

Library of Congress: The Signal: Managing Research Data at Tufts University: An NDSR Project Update

Mon, 2014-12-29 13:45

The following is a guest post by Samantha DeWitt, National Digital Stewardship Resident at Tufts University.

Hello readers and a happy winter solstice from Medford, Massachusetts. It’s hard to believe I am already in my third month of the National Digital Stewardship Residency. There’s a chill in the air and the vivid fall colors that decorated the Tufts University campus last month have given way to a palette of browns and grays.

Samantha, by Samantha.

During my residency here, I have been exploring ways in which the university can get a better handle on its faculty-produced research data. The project has been illuminating. The first thing I discovered is that Tufts is not alone in its uncertainty concerning the status of institutionally connected research data. Many institutions are taking a hard look at how they approach research data management and some of the results are noteworthy. Harvard, for instance, has developed the Dataverse Network, an “open-source application for sharing, citing, analyzing and preserving research data.” Purdue has recently developed an online research repository (PURR), which provides researchers with a collaborative space during their project and long-term data management assistance. (Published datasets remain online for a minimum of 10 years as a part of the Purdue Libraries’ collection.)

Initial data storage choices

At the beginning of a project, researchers can receive assistance with storage from the Tufts technology services department. Networked (cluster) storage is available for up to several terabytes. One drive is available for smaller amounts of collaborative storage and a second can be used for individual storage space (up to four GB). Lastly, cloud storage is available through Tufts Box. Of course, one can always elect to store data on a personal hard drive or select from an array of portable storage devices.

Unfortunately, hard drives may crash and portable devices may become lost or obsolete… As this is a blog about digital preservation, I realize I don’t need to elaborate on the problems that can befall neglected media. Further, the data remaining in networked storage will be erased when a researcher leaves. Even if this were not the case, attempts to retroactively find or make sense of the data would be prohibitively time-consuming.

Data must be properly managed to be accessible

Tufts is looking at ways to understand its data output with strategies to trace and catalog research data.

Data sharing can be seen as fundamental to the most basic tenets of the scientific method: it permits reproducibility, encourages collaboration among researchers and advocates for the re-use of valuable resources. These principles have been espoused by the National Institutes of Health (NIH) and the National Science Foundation (NSF) and they, along with provisions to increase financial transparency, have resulted in increasingly stringent data management mandates for grant-seekers.

These days, Washington isn’t the only player putting pressure on researchers to tend to their data. In 2011, The Bill and Melinda Gates Foundation began asking researchers to submit a data access plan (PDF) along with their grant proposal, stating that, “a data access plan should at a minimum address the nature and scope of data and information to be disseminated, the timing of disclosure, the manner in which the data and information is stored and disseminated, and who will have access to the data and under what conditions.” The Alfred P. Sloan Foundation, too, asks applicants to describe how their data and code will be “shared, annotated, cited, and archived.”

But just because data has been placed in an appropriate subject-based repository does not ensure that those at Tufts know where it is. (Researchers themselves may not even know or remember.) This creates a unique opportunity for the university to consider ways to catalog this data. By better understanding its research output, Tufts could more easily:

  • Comply with funders’ data access mandates.
  • Visualize institutional research output.
  • Encourage inter-departmental collaboration.
  • Avoid research duplication.
  • Increase institutional visibility by data sharing.

The journals “Science” and “Nature” both require authors to submit data relevant to their publication. Furthermore, in May of this year, the Nature Publishing Group launched an open-access, online-only journal called “Scientific Data,” where researchers can access descriptions of data outputs, file formats, sample identifiers and replication structure. What is worth noting is that the site does not store data but rather acts as a finding aid for data housed in other repositories. The idea of keeping records of data while depositing the data elsewhere is intriguing. In fact, it might be possible to devise a similar sort of system here. Tufts already has a Fedora-based digital repository, so the digital object record would merely require adequate metadata and a URL to direct the user to the right repository. This type of system could give the university a better grasp of its research data output.
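To make the idea concrete, here is a minimal sketch of what such a pointer record might hold, written as a plain R list. This is only an illustration under my own assumptions: the function name make_pointer_record, the field names and the example values are hypothetical, and nothing here reflects Fedora’s actual data model or any Tufts system.

```r
# Hypothetical pointer record: descriptive metadata plus a URL to the
# external repository that actually holds the data. Illustrative only.
make_pointer_record <- function(title, creator, repository, url) {
  # The record is only useful if the URL can send a user somewhere,
  # so reject anything that does not look like an http(s) address.
  stopifnot(grepl("^https?://", url))
  list(
    title      = title,
    creator    = creator,
    repository = repository,
    url        = url
  )
}

# Example with placeholder values (not a real dataset or repository).
rec <- make_pointer_record(
  title      = "Example dataset",
  creator    = "Example Researcher",
  repository = "Example Subject Repository",
  url        = "https://repository.example.org/dataset/123"
)

str(rec)
```

The point of the sketch is that the institutional record stays small: the data itself lives elsewhere, and the record only needs enough metadata to be found and a link to follow.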

Tufts has made definite progress in advocating for best practices in data management for its researchers; the library holds frequent workshops and offers assistance in drafting data management plans. It is likely, however, that both government and non-government funders – as well as scholarly journals – will continue to focus on the effective management of research data. Moreover, because universities such as Tufts should be able to appraise one of their most fundamental assets, research data access continues to require our attention.

John Miedema: “This ‘slip-world’ was quite a world and he’d almost lost it once because he hadn’t written any of it down.” Pirsig, Lila

Mon, 2014-12-29 02:37

He saw that her suitcase had shoved all his trays of slips over to one side of the pilot berth. They were for a book he was working on and one of the four long card-catalog-type trays was by an edge where it could fall off. That’s all he needed, he thought, about three thousand four-by-six slips of note pad paper all over the floor.

He got up and adjusted the sliding rest inside each tray so that it was tight against the slips and they couldn’t fall out. Then he carefully pushed the trays back into a safer place in the rear of the berth. Then he went back and sat down again.

It would actually be easier to lose the boat than it would be to lose those slips. There were about eleven thousand of them. They’d grown out of almost four years of organizing and reorganizing and reorganizing so many times he’d become dizzy trying to fit them all together. He’d just about given up.

Their overall subject he called a ‘Metaphysics of Quality,’ or sometimes a ‘Metaphysics of Value,’ or sometimes just ‘MOQ’ to save time.

The buildings out there on shore were in one world and these slips were in another. This ‘slip-world’ was quite a world and he’d almost lost it once because he hadn’t written any of it down and incidents came along that had destroyed his memory of it. Now he had reconstructed what seemed like most of it on these slips and he didn’t want to lose it again.

But maybe it was a good thing that he had lost it because now, in the reconstruction of it, all sorts of new material was flooding in – so much that his main task was to get it processed before it log-jammed his head into some kind of a block that he couldn’t get out of. Now the main purpose of the slips was not to help him remember anything. It was to help him to forget it. That sounded contradictory but the purpose was to keep his head empty, to put all his ideas of the past four years on that pilot berth where he didn’t have to think of them. That was what he wanted.

There’s an old analogy to a cup of tea. If you want to drink new tea you have to get rid of the old tea that’s in your cup, otherwise your cup just overflows and you get a wet mess. Your head is like that cup. It has a limited capacity and if you want to learn something about the world you should keep your head empty in order to learn it. It’s very easy to spend your whole life swishing old tea around in your cup thinking it’s great stuff because you’ve never really tried anything new, because you could never get it in, because the old stuff prevented its entry because you were so sure the old stuff was so good, because you never really tried anything new … on and on in an endless circular pattern.

Pirsig, Robert M. (1991). Lila: An Inquiry into Morals. Pg. 22-23.