planet code4lib

Hydra Project: Sufia 6.0.0 released

Tue, 2015-03-31 13:15

We are delighted to announce that Sufia 6.0.0 has been released. It’s been quite an undertaking:

* over 451 commits
* 402 files changed
* 17 contributors from 6 institutions across the U.S.A. and Canada

Some of the new features include:

* Fedora 4 support
* Some optimizations for large files including using ActiveFedora’s streaming API
* New metadata editing forms (inspired by Worthwhile and Curate)
* Store old featured researchers in database
* Support for the latest Hydra components, Blacklight, Rails, and Ruby versions
* Easier overriding of Sufia’s controllers
* Dozens and dozens of bugs squashed, and lots of UI/UX tweaks
* Lots of work on the README

You can read all about it in the release notes, including instructions for upgrading:

We’ve also created a sample application that outlines the steps of upgrading to version 6 and migrating the data from Fedora 3 to Fedora 4:

Penn State has tested this code against their Scholarsphere application where they will be using it to launch their new Fedora 4-based version and migrate its data in April. You will notice that Sufia 5.0, the final Fedora 3-based Sufia version, is still in release candidate status. We hope to have this rectified soon.

Many thanks to all the developers who have made this release possible, and to the leadership at each of their institutions for recognizing the value of this project.

Mark E. Phillips: Metadata Edit Events: Part 4 – Duration, buckets

Tue, 2015-03-31 12:29

This is the fourth in a series of posts related to metadata edit events collected by the UNT Libraries from its digital library system from January 1, 2014 until December 31, 2014.  The previous posts covered when, who, and what.

This post will start the discussion on the “how long” or duration of the dataset.

Libraries, archives, and museums have long discussed the cost of metadata creation and improvement projects. Depending on the size and complexity of the project and the experience of the metadata creators, the costs associated with metadata generation, manipulation, and improvement can vary drastically.

The amount of time that a person takes to create or edit a specific metadata record is often used in calculating what projects will cost to complete. At the UNT Libraries we have used $3.00 per descriptive record as our metadata cost for projects, and based on the level of metadata created, the workflows used, and the system we’ve developed for metadata creation, this number seems to do a good job of covering our metadata creation costs. It will be interesting to get a sense of how much time was spent editing metadata records over the past year, and also to break that down by collection, type, format, and partner. This will involve a bit of investigation of the dataset before we get to those numbers, though.

Here is a quick warning about the rest of the post: I’m stepping out into deeper water with the analysis I’m going to be doing on our 94,222 edit events. From what I can tell from my research, there are many ways to go about some of this, and I’m not at all claiming that I have the best or even a good approach. But it has been fun so far.


The reason we wanted to capture event data when we created our Metadata Edit Event Service was to get a better idea of how much time our users were spending on the task of editing metadata records.

This is accomplished by adding a log entry with a timestamp, identifier, and username when a record is opened. When the record is published back into the system, the original log time is subtracted from the publish time, which gives the number of seconds taken for the metadata edit event. (As a side note, this is also the basis for our record locking mechanism, so that two users don’t try to edit the same record at the same time.)
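As a rough illustration of the subtraction described above (not the actual service code), the calculation might look like this in Python. The dictionary keys, the sample identifier, and the timestamps are all made up for the example.

```python
from datetime import datetime


def edit_duration_seconds(opened_at: datetime, published_at: datetime) -> int:
    """Return the whole seconds between opening a record and publishing it."""
    if published_at < opened_at:
        raise ValueError("publish time precedes open time")
    return int((published_at - opened_at).total_seconds())


# Hypothetical log entry written when the record is opened.
open_event = {
    "timestamp": datetime(2014, 3, 5, 9, 0, 0),
    "identifier": "record-0001",
    "username": "editor1",
}

# Hypothetical time at which the record is published back into the system.
published_at = datetime(2014, 3, 5, 9, 4, 30)

duration = edit_duration_seconds(open_event["timestamp"], published_at)
print(duration)
```

Note that a record left open over lunch would simply produce a large value here, which is why the long durations need a closer look later on.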

There are of course a number of issues with this model that we noticed. First, what if a user opens a record, forgets about it, goes to lunch, and then comes back and publishes the record? What happens if they open a record and then close it: what happens to that previous log event, and is it used the next time? What happens if a user opens multiple records at once in different tabs? If they aren’t using the other tabs immediately, they are adding time without really “editing” the records. What if a user makes use of a browser automation tool like Selenium? Won’t that skew the data?

The answer to many of these questions is “yep, that happens,” and how we deal with them in the data is something that I’m still trying to figure out. I’ll walk you through what I’m doing so far to see if it makes sense.

Looking at the Data Hours

As a reminder, there are 94,222 edit events in the dataset. The first thing I wanted to take a look at is how they group into buckets based on hours. I took the durations in seconds and divided them by 3600 with floor division, so I should get buckets of 0, 1, 2, 3, 4, and so on.

Below is a table of these values.

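| Hours | Event Count |
| --- | --- |
| 0 | 93,378 |
| 1 | 592 |
| 2 | 124 |
| 3 | 41 |
| 4 | 20 |
| 5 | 5 |
| 6 | 8 |
| 7 | 7 |
| 8 | 1 |
| 9 | 4 |
| 10 | 6 |
| 11 | 2 |
| 12 | 1 |
| 14 | 3 |
| 16 | 5 |
| 17 | 3 |
| 18 | 2 |
| 19 | 1 |
| 20 | 1 |
| 21 | 2 |
| 22 | 2 |
| 23 | 2 |
| 24 | 3 |
| 25 | 1 |
| 26 | 1 |
| 29 | 1 |
| 32 | 2 |
| 37 | 1 |
| 40 | 2 |
| 119 | 1 |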
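For anyone who wants to try the same bucketing on their own data, here is a minimal Python sketch of the floor-division step. The sample durations are invented for illustration and are not taken from the dataset.

```python
from collections import Counter


def bucket_by_hour(durations_seconds):
    """Count events per whole-hour bucket using floor division by 3600."""
    return Counter(seconds // 3600 for seconds in durations_seconds)


# Invented durations in seconds, not real edit events.
sample_durations = [45, 270, 3599, 3600, 7300, 428400]

hour_buckets = bucket_by_hour(sample_durations)
for hour in sorted(hour_buckets):
    print(hour, hour_buckets[hour])
```

Anything under 3600 seconds lands in bucket 0, so the first row of a table like the one above is simply "everything that took less than an hour".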

And then a pretty graph of that same data.

Edit Event durations grouped by hour

What is very obvious from this table and graph is that the vast majority of the edit events, 93,378 (99%), took under one hour to finish. We already see some outliers, with 119 hours (almost five full days; that’s one tough record) at the top end of the event duration list.

While I’m not going to get into it in this post, it would be interesting to see if there are any patterns in the 844 edit events that took an hour or longer. What percentage of each user’s edits took over an hour? Do they come from similar collections, types, formats, or partners? Something for later, I guess.


Next I wanted to look at the edit events that took less than an hour to complete. Where do they sit if I put them in buckets of 60 seconds? Filtering out the events that took an hour or more to complete leaves me with 93,378 events. Below is the graph of these edit events.

Edit Event durations grouped by minute for events taking under one hour to complete.

You can see a dramatic curve for the edit events as the number of minutes goes up.

I was interested to see where the 80/20 split for this dataset would be and it appears to be right about six minutes.  There are 17,397 (19%) events occurring from 7-60 minutes and 75,981 (81%) events from 0-6 minutes in length.
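One way to locate a split like this is to walk the minute buckets in order and stop when the running share of events reaches 80%. The sketch below assumes durations in seconds and uses invented sample data; only the two totals in the final check come from the figures quoted above.

```python
from collections import Counter


def cutoff_minute(durations_seconds, share=0.8):
    """Return the smallest minute bucket where the cumulative share reaches `share`."""
    buckets = Counter(seconds // 60 for seconds in durations_seconds)
    total = sum(buckets.values())
    running = 0
    for minute in sorted(buckets):
        running += buckets[minute]
        if running / total >= share:
            return minute
    return None  # no events supplied


# Invented durations in seconds, all under one hour.
sample_durations = [10, 30, 50, 70, 130, 200, 250, 400, 900, 3000]
cutoff = cutoff_minute(sample_durations)
print(cutoff)

# Check of the percentages quoted in the post.
under_hour_total = 93378
share_0_to_6 = round(75981 / under_hour_total * 100)
share_7_to_60 = round(17397 / under_hour_total * 100)
print(share_0_to_6, share_7_to_60)
```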


Diving into the dataset one more time, I wanted to look at the 35,935 events that happened in less than a minute. Editing a record in under a minute can, for me, take a few different paths. First, you could be editing a simple field, like changing a language code or a resource type. Second, you could be just “looking” at a record and, instead of closing the record, you hit “publish” again. Third, you might be switching a record from the hidden state to the unhidden state (or vice versa). Finally, you might be using a browser automation tool to automate your edits. Let’s see if we can spot any of these actions when we look at the data.

Edit Event durations for events taking under one minute to complete.

By just looking at the data above it is hard to say which of the kinds of events mentioned above map to different parts of the curve.  I think when we start to look at individual users and collections some of this information might make a little more sense.

This is going to wrap up this post. In the next post I’m hoping to define the cutoff that will separate “outliers” from the data that we want to use for calculating average times for metadata creation, and then see how that looks for our various users in the system.

As always feel free to contact me via Twitter if you have questions or comments.

Meredith Farkas: True confessions

Tue, 2015-03-31 04:43

When my brain was completely full on Thursday at the ACRL Conference, Jad Abumrad’s keynote felt like a spa for my brain. For those who don’t know, he is the co-host of Radiolab, a very cool and innovative show on NPR, and the recipient of one of those fancy schmancy MacArthur genius grants. Good call ACRL planning committee! His keynote was brilliant and it was coming at a time when I’ve been reflecting on where I am in my career now that I feel like I’m not in survival mode anymore.

For those who missed Jad’s talk, here’s another one he gave two years ago that covered some similar territory:

Jad Abumrad: Why "Gut Churn" Is an Essential Part of the Creative Process from 99U on Vimeo.

I have such admiration for people who are confident. People who are poised. People who are strong advocates for themselves. People who are quick thinkers. People who are energized, not anxious, when in a crowd of people. People who can be politic. People who are brave. I have a lot of friends I wish I was more like. But I’ve also learned over the years that many of the people I thought were all those things were actually just as big a ball of neuroses as I am. That a lot of people I thought were so confident were actually overcompensating for major insecurities.

People you admire are probably more than meets the eye too.

There are people who say they admire me. I’ve always been uncomfortable with that because I don’t think I deserve it. I’m also uncomfortable because I worry that it creates this false expert vs. novice dichotomy that might make them think they can’t achieve what I have. Anyone can do what I’ve done.

I know a lot of people who are afraid to take risks in their work and/or are in difficult work situations that are killing their passion for their work. In the interest of encouraging other people who are struggling, and inspired by Jad’s talk (though not nearly as eloquent), I’m going to share a bit here.

I am a big ball of self-doubt.

Have you let doubt keep you from trying something or pursuing an idea? Well, screw that! I have never felt certain about anything I’ve done while I was doing it. The entire time we were working on Library DIY, there was a constant voice in the back of my mind telling me “this is crap. There’s a reason no one has done something like this and that’s because it makes no sense.” I’ve cringed when hitting publish on the vast majority of blog posts I’ve written because I think most times that the ideas I have are stupid.

Jad talked about how “gut churn” is an essential part of the creative process. That feeling of anxiety and doubt and panic when you’re trying to do something really creative and different is very normal and very necessary. I’ve always believed that talented, accomplished, and creative people feel really certain about their projects and path (à la Steve Jobs), but it was a relief to know that I’m not alone in feeling the “gut churn.”

So many of us are stopped in our tracks by fears that our ideas are not innovative or even good. Sometimes we’re right and sometimes we’re wrong. I’ve had projects fail and I’ve had projects succeed beyond my wildest dreams, but I’m always glad I went for it because I learned from every one of them.

I’m starting to realize that “gut churn” is better than certainty, because it leaves you more open to making changes and improvements based on what you hear from others (colleagues, patrons, etc.). The more stuck you get on the perfect rightness of your original vision, the less likely you’ll be to accept feedback and make improvements. I’ve learned to develop some amount of detachment from my projects, so that when my work is criticized, it doesn’t feel like a criticism of me. Becoming defensive isn’t productive, and I regret times when I was defensive about stuff in the past.

I’m more of a beginner now than I was before.

One of my favorite former colleagues sent me an article entitled “The importance of stupidity in scientific research.” What I initially thought was a joke actually was a fantastic editorial about seeking out opportunities to “feel stupid;” where you can’t easily find an answer and have to struggle, learn, and make your own discoveries.

Productive stupidity means being ignorant by choice. Focusing on important questions puts us in the awkward position of being ignorant. One of the beautiful things about science is that it allows us to bumble along, getting it wrong time after time, and feel perfectly fine as long as we learn something each time. No doubt, this can be difficult for students who are accustomed to getting the answers right. … The more comfortable we become with being stupid, the deeper we will wade into the unknown and the more likely we are to make big discoveries.

I was a high achiever in high school and, at the elite college I attended where I felt perpetually out of my depth, I avoided taking classes that scared and challenged me. What a waste. I’ve come to love the anxiety of doing something new that I’m not necessarily a natural at. Public speaking was something that used to terrify me, but over time, I became increasingly comfortable and found my voice as a speaker. Moving from a university to a community college put me back into the beginner role, and I’ve grown so much as an instructor over the past few months because of it. Feeling ignorant (as I did in my first term here) is not a comfortable thing, but it makes you struggle more and learn more to get beyond that beginner state.

I don’t consider myself an expert at anything. There are some things I’m better at than others, but in my teaching, my writing, my speaking, and everything else I do professionally, I am a work in progress; a perpetual beginner. Having that attitude leaves us open to learning and growth.

Haters gonna hate, but don’t let them define you.

I’m one of those people who just wants to be liked. I’m a people pleaser. I remember in my sophomore year of college, I lived in a house where most of my housemates were always fighting with each other. My buddy Dan Young and I were like Switzerland where everyone bitched to us about other people and we just tried to stay neutral and sympathetic.

I’ve always gotten along with people in the workplace, so when I had what I can only describe as a “nemesis” in one of my jobs, I had no idea how to handle it. This was someone who had been up for the management job I got. I tried to connect with her and be friendly, but she did everything in her power to undercut me in meetings and make me look bad to our superiors and colleagues. I constantly heard from colleagues about her saying bad things about me behind my back, as if I was some kind of horrible person, which made me wonder if I was. I hate that I let her get to me so much. But when she started alienating other people at work, I realized it wasn’t all about me.

The good thing that came out of this experience is that I’m now more ok with not being liked, especially when I’m pretty sure there was nothing I did to deserve it. Sometimes it’s not really about you, but about a situation or the fragile ego of the other person. Sometimes you’re walking into a context that dooms you from the start. It’s always worth starting from a place where you examine your own behavior to see if you somehow caused the problem, but you shouldn’t hang your whole sense of self-worth on whether or not your colleagues adore you.

Even my painful experiences have led to valuable learning.

I spent a big part of the past four years feeling like a failure. Every time I started to feel good about the work I was doing, something or someone came and smacked me down. Still, I’ve learned so much about myself and how to handle difficult work and political situations because of the experiences I had.

In the talk I shared above, Jad talks about reframing awful things that happen; using them as an arrow to point you toward the solution. That’s what led me to my current job, one that was not at all what I’d envisioned as my future when I was at Norwich five years ago. Yet it fits me like a glove. When I was feeling horrible about work, I thought a lot about what the right job would look like. And it looked quite a bit like what I’m doing now. Pain has a way of sharpening your focus and showing you the right path.

I deserve good things. So do you.

I’m not perfect. I’ve made mistakes and I’m sure I’ll make more in the future. I’m a work in progress, but I’m always striving to be better. I want to be a supportive colleague and be good at my job. I want to be a good wife and mother. I want to feel like I’m contributing to the profession beyond my library in useful ways. I’m working on getting used to the happiness I feel now that I’m in a job I love. I’m trying to be nicer to myself. I’m trying to feel like I deserve these good things that are happening for me.

We all deserve good things. We are all works in progress. Don’t let your own doubts or the stories you’ve got in your head (or that people tell you) about what you can and can’t do prevent you from taking risks and growing. Try. If the worst thing to fear is failure (and recognizing that you will learn from it either way), it doesn’t seem like such a huge risk to take.

Image credit: Gut churn, by Dreadful Daily Doodles

DuraSpace News: Call For Sixth Annual VIVO Conference Papers–Open Through April 24

Tue, 2015-03-31 00:00

Boston, MA  The Sixth Annual VIVO Conference will be held August 12-14, 2015 at the Hyatt Regency Cambridge, overlooking Boston. The VIVO Conference creates a unique opportunity for people from across the country and around the world to come together to explore ways to use semantic technologies and linked open data to promote scholarly collaboration and research discovery.

DPLA: Sharing Data for Better Discovery and Access

Mon, 2015-03-30 21:00

The Internet Archive and the Digital Public Library of America (DPLA) are pleased to announce a joint collaborative program to enhance sharing of collections from the Internet Archive in the Digital Public Library of America (DPLA). The Internet Archive will work with interested Libraries and content providers to help ensure their metadata meets DPLA’s standards and requirements. After their content is digitized, the metadata would then be ready for ingestion into the DPLA if the content provider has a current DPLA provider agreement.

The DPLA is excited to collaborate with the Internet Archive in this effort to improve metadata quality overall, by making it more consistent with DPLA requirements, including consistent rights statements. Better data means better access. In addition to providing DPLA compliant metadata services, the Internet Archive also offers a spectrum of digital collection services, such as digitization, storage and preservation. Libraries, Archives and Museums who choose Internet Archive as their service provider have the added benefit of having their content made globally available through Internet Archive’s award winning portals, and

“We are thrilled to be working with the DPLA”, states Robert Miller, Internet Archive’s  General Manager of Digital Libraries. “With their emphasis on providing not only a portal and a platform, but also their advocacy for public access of content, they are a perfect partner for us”.

Rachel Frick, DPLA’s Business Development Director says, “The Internet Archive’s mission of ‘Universal Access to All Knowledge’, coupled with their end-to-end digital library solutions complements our core values.”

Program details are available upon request.


FOSS4Lib Recent Releases: ArchivesSpace - 1.2.0

Mon, 2015-03-30 20:43

Last updated March 30, 2015. Created by Peter Murray on March 30, 2015.

Package: ArchivesSpace
Release Date: Monday, March 30, 2015

Nicole Engard: Bookmarks for March 30, 2015

Mon, 2015-03-30 20:30

Today I found the following resources and bookmarked them on Delicious.

The post Bookmarks for March 30, 2015 appeared first on What I Learned Today....

John Miedema: Tags are the evil sisters of Categories. Surprising views, sour fast. Lila offers a different approach.

Mon, 2015-03-30 20:26

I’m a classification nut, as I told you. In the last post I told you about the way I organize files and emails into folders. Scintillating stuff, I know. But let’s go a level deeper toward Lila by talking about tagging. Tags are the evil sisters of categories. Categories are top-down classification: someone on high has an idealized model of how everything fits into nice neat buckets. Tags are situational and bottom-up. In the heat of the moment, you decide that this file or that email is about some subject. Tags don’t conform to a model; you make them up on the fly. You add many tags, as many as you like. Mayhem! I’ve tried ‘em, I don’t like ‘em.

Tags do one thing very well, they let you create surprising views on your content. Categories suffer from the fact that they only provide one view, a hierarchical structured tree. Tags let you see the same content in many different ways. Oh! Look. There’s that short story I wrote tagged with “epic.” And there’s those awesome vacation pics tagged with the same. Hey, I could put those photos on that story and make it so much better. But the juice you get out of tags sours fast. The fact that they are situational and bottom-up causes their meaning to change. “Bad” and “sick” used to mean negative things. As soon as people get about a hundred tags they start refactoring them, merging and splitting them, using punctuation like underscores to give certain tags special meanings. Pretty soon they dump the whole lot of them and start over. Tags fail. What people really want is, yup, categories.

Lila is a new way to get the juice out of tags without going sour. Lila works collaboratively with the author to organize writing. Lila will let writers assign categories and tags, but treat them as mere suggestions. The human is smart, Lila knows, and needs his or her help, so it will use the author’s suggestions to come up with its own set of categories and tags. Lila’s technique will be based on natural language processing. Best part, the tags can also be regenerated at the click of a button, so that the tags never sour. You get the surprising views and the tags maintain their freshness. Sweet.

I’ve been pretty down on tags in this post, so I will say there is one more thing that tags do quite well. They connect people, like hashtags on Twitter. They form loose groupings of content so that disparate folks can find each other. It doesn’t apply so much to a solitary writing process, but it might fit a social writing process. I will think on that.

FOSS4Lib Recent Releases: Sufia - 6.0.0

Mon, 2015-03-30 20:03

Last updated March 30, 2015. Created by Peter Murray on March 30, 2015.

Package: Sufia
Release Date: Friday, March 27, 2015

Roy Tennant: Want To See More Women in Tech? Mentor Someone

Mon, 2015-03-30 16:50

I was not much more than a newly-minted librarian when my greatest professional mentor gave me a chance at something that would launch my career beyond the confines of my institution onto an international stage. It was in the early 90s, when the Internet was just beginning to take off at large research libraries around the United States. If you can, and I know it’s difficult, imagine libraries without the Internet. Imagine society without the Internet.

Anne Lipow was entrusted with developing and delivering technology and bibliographic classes to both staff and faculty at UC Berkeley, and she took the responsibility very seriously. She would prowl the halls of Doe Library looking for young turks like myself, to pull us in to developing and delivering courses on how to connect to the newly-online library catalog or how to use this new thing called Gopher. I almost started ducking into doorways when I spotted her coming down the hall. And now I’m really glad I was more stupid than cowardly. Because I could never have predicted what would come next.

Anne retired from Berkeley and started her own consultancy: Library Solutions Institute. She began planning her very first event — an all-day hands-on workshop on how to use the Internet timed to coincide with the ALA Annual Conference to be held in San Francisco in June 1992. She signed me up to help, as well as John Ober. Clifford Lynch agreed to be the ending keynote of one group and the beginning keynote of another, thereby allowing us to sign up two cohorts over two days.

We began work on a set of handouts that soon led to a binder to hold them all and the dawning realization that we had a book on our hands. Anne changed the name of her business to Library Solutions Institute and Press and we were off to the races. Crossing the Internet Threshold: An Instructional Handbook was published later that year and it took off, and my speaking career took off with it. Before long I was traveling to foreign countries such as Romania and Hungary, giving workshops based on that text. Between the royalties and speaking fees, my wife and I were able to financially weather the impact of twins born in February 1993. Without it, I shudder to think.

So you will not find a stronger advocate for mentorship than me. That’s why I have tried to focus on finding young female professionals interested in library technology to mentor, so as a profession we can increase the number of women in tech librarianship. I know that a diversity of perspectives, skills, and abilities is by its very nature a good thing. And the more of us out there increasing diversity of all kinds in tech librarianship, the better off the entire profession will be.

Anne, I miss you. But your example and inspiration are alive and well.


Islandora: Islandora Foundation: Meet the Partners

Mon, 2015-03-30 14:45

The Islandora Foundation has been very fortunate to welcome six new Partner-level members in the past few months, due in large part to enthusiasm over our ongoing Fedora 4/Islandora 7.x upgration project. I'd like to take some time in this week's blog to highlight those new members, and all of the Foundation supporters who have helped us to get where we are today. Which is a pretty good spot for a non-profit of less than two years: we are on the verge of our third community-led release, upgrading to support the latest in Fedora, holding Camps all over the world, and planning our first conference.

So let's take a look at the Partners who are helping the Islandora Foundation to thrive:

When we launched in July 2013, it was with the backing of two initial partners who have always been a part of Islandora's story: UPEI and discoverygarden, Inc. Islandora was born at UPEI under the guidance of University Librarian Mark Leggott, who continues on as the current Chairman of our Board of Directors.

Discoverygarden, in this context, is sort of like the Foundation's older sibling who went to work in the private sector. By providing services to install and customize an open source software platform and donating many of their developments right back for public use, it developed alongside Islandora while making huge contributions to the codebase, and developers at dgi continue to produce and refine a lot of the core functions that make Islandora work.

The next institution to step up to the plate as a Partner is LYRASIS. One of the Foundation's first Collaborator members, LYRASIS is a non-profit membership organization committed to the success of libraries and cultural heritage organizations. It partners with members to create, access, and manage information with an emphasis on digital content. In the Islandora community, LYRASIS has had an active presence on the Roadmap Committee, the Board of Directors, a number of Interest Groups, and most Islandora Camps (to the point where we feel a little bereft when there's no one from LYRASIS in attendance).

When they renewed their membership this year, LYRASIS decided to bump up to Partner to help support the upgration. This is a common theme with our new Partners. Fedora 4 is a great step forward and it is awesome how many in the community have committed to seeing it happen.

Their Assistant Director for Digital Technology Services, Peter Murray, was already a member of the Islandora Foundation Board at our request, and will be continuing in this role.

The University of Manitoba is a public university in Winnipeg, Manitoba. A founding Member in the Islandora Foundation, they also bumped up to Partner to help the Fedora 4 upgration. Their Web Application Developer, Jared Whiklo, has been an active participant on the front lines of the project, working with Nick and Danny to get the prototype off the ground. The library's Head of Discovery & Delivery Services, Lisa O'Hara, is joining the Foundation's Board of Directors.

McMaster University is another long-time Collaborator in the Foundation. Like LYRASIS and the University of Manitoba, their new Partnership helps to support the future of Islandora with Fedora 4. McMaster is already a big help to the community through its members' participation in Interest Groups and the Roadmap Committee, and we are looking forward to having it represented on our Board of Directors by Dale Askey.

York University makes the move to Partner from Collaborator through their very generous in-kind donation of a resource that has proven absolutely vital to the Fedora 4 upgration: Nick Ruest's time. They are also piloting a migration from Fedora 3 to 4 that will most likely serve as the framework on which the entire community can base such migrations in the future. 

Adam Taves, Acting Associate University Librarian for Collections & Research, will be joining the Board of Directors on York's behalf.


Simon Fraser University is one of two new Partners who are joining the Foundation for the first time this year - although it has long been an active contributor to the Islandora community through the efforts of members like Mark Jordan and Alex Garnett. Indeed, Mark Jordan was already a member of the Islandora Foundation Board of Directors and will be staying on with us.

We were very fortunate to be able to first announce SFU's Partnership right on their downtown campus at Islandora Camp BC last February.


Our very newest Partner is not quite our first European member (that distinction belongs to the digiBESS group in Italy), but it is our first European Partner. The University of Limerick joins us in part to support the Fedora 4 upgration and will be represented on the Board by Caleb Derven. And maybe, if we are very lucky, we'll get to invite you all to an Islandora Camp in their fair city. Because this:

Photo by: William Murphy

ACRL TechConnect: A Video on Browser Extensions

Mon, 2015-03-30 13:00

I thought we’d try something new on ACRL TechConnect, so I recorded a fifteen-minute video discussing general use cases for browser extensions and some specifics of Google Chrome extensions.

The video mentions my WikipeDPLA post on this blog and walks through some slides I presented at a Code4Lib Northern California event.

If you’re looking for another good extension example in libraryland, Stephen Schor of New York Public Library recently wrote extensions for Chrome and Firefox that improve the appearance and utility of the Library of Congress’ EAD documentation. The Chrome extension uses the same content script approach as my silly example in the video. It’s a good demonstration of how you can customize a site you don’t control using the power of browser add-ons.

Have you found a use case for browser add-ons at your library? Let us know in the comments!

Mark E. Phillips: Metadata Edit Events: Part 3 – What

Mon, 2015-03-30 02:16

This is the third post in a series related to metadata event data that we collected from January 1, 2014 to December 31, 2014 for the UNT Libraries Digital Collections.  We collected 94,222 metadata editing events during this time.

The first post was about the when of the events: when they occurred, on what day of the week, and at what time of day.

The second post touched on the who of the events: who the main metadata editors were, how edits were distributed among the different users, and how the number of users per month, day, and hour was distributed.

This post will look at the what of the events data.  What were the records that were touched,  what collections or partners did they belong to and so on.

Of the total 94,222 edit events there were 68,758 unique metadata records edited.

By using the helpful st program we can quickly get the statistics for these 68,758 unique metadata records.  By choosing the “complete” stats we get the following data.

N        min   q1   median   q3   max   sum      mean      stddev     stderr
68,758   1     1    1        1    45    94,222   1.37034   0.913541   0.0034839

With this we can see that there is a mean of 1.37 edits per record over the entire dataset with the maximum number of edits for a record being 45.

The total distribution of the number of edits per record is presented in the table below.

Number of Edits   Instances
1                 53,213
2                 9,937
3                 3,519
4                 1,089
5                 489
6                 257
7                 111
8                 60
9                 30
10                13
11                14
12                7
13                5
14                5
15                1
16                2
17                1
19                1
21                1
26                1
30                1
45                1

Of the 68,758 records edited, 53,213 (77%) were edited only once, 9,937 (14%) were edited twice, and 3,519 (5%) were edited three times. From there things level out very quickly: records with four edits make up under 2% of the total, and every higher edit count accounts for under 1% of the records.
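The summary figures can be cross-checked from the distribution table alone. The following is a minimal Python sketch, not the st program used above; the dictionary is copied from the table, and the function and variable names are illustrative only.

```python
# Edits-per-record distribution copied from the table above:
# {number of edits: number of records edited that many times}
distribution = {
    1: 53213, 2: 9937, 3: 3519, 4: 1089, 5: 489, 6: 257, 7: 111,
    8: 60, 9: 30, 10: 13, 11: 14, 12: 7, 13: 5, 14: 5, 15: 1,
    16: 2, 17: 1, 19: 1, 21: 1, 26: 1, 30: 1, 45: 1,
}


def summarize(dist):
    """Return record count, edit count, mean edits per record, and shares."""
    n = sum(dist.values())
    total = sum(edits * count for edits, count in dist.items())
    mean = total / n
    # Share of all records that received exactly `edits` edits.
    shares = {edits: count / n for edits, count in dist.items()}
    return n, total, mean, shares


n_records, total_edits, mean_edits, shares = summarize(distribution)
print(n_records, total_edits, round(mean_edits, 5))
for edits in (1, 2, 3, 4):
    print(edits, f"{shares[edits]:.1%}")
```

The record count, edit count, and mean should agree with the N, sum, and mean columns reported by st.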

When indexing these edit events in Solr I also merged the events with additional metadata from the records.  By doing so we have a few more facets to take a look at, specifically how the edit events are distributed over partner, collection, resource type and format.
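A facet request of this kind could be put together roughly as follows. This is only a sketch: the Solr URL, the core name, and the four field names are assumptions and not the actual UNT index schema. Only the standard Solr parameters (q, rows, wt, facet, facet.field) are taken as given, and the code builds the URL without sending it.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and field names, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/edit_events/select"
FACET_FIELDS = ["partner", "collection", "resource_type", "format"]


def build_facet_query(base_url, facet_fields):
    """Build a Solr select URL that returns facet counts and no documents."""
    params = [
        ("q", "*:*"),       # match every edit event
        ("rows", 0),        # only the facet counts are wanted
        ("wt", "json"),
        ("facet", "true"),
    ]
    # facet.field is repeated once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


url = build_facet_query(SOLR_SELECT_URL, FACET_FIELDS)
print(url)
```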


There are 167 partner institutions represented in the edit event dataset.

The top ten partners by number of edit events are presented in the table below.

Partner Code  Partner Name                                                    Edit Count  Unique Records Edited  Unique Collections
UNTGD         UNT Libraries Gov Docs Department                               21,932      14,096                 27
OKHS          Oklahoma Historical Society                                     10,377      8,801                  34
UNTA          UNT Libraries Special Collections                               9,481       6,027                  25
UNT           UNT Libraries                                                   7,102       5,274                  27
PCJB          Private Collection of Jim Bell                                  5,504       5,322                  1
HMRC          Houston Metropolitan Research Center at Houston Public Library  5,396       2,125                  5
HPUL          Howard Payne University Library                                 4,531       4,518                  4
UNTCVA        UNT College of Visual Arts and Design                           4,296       3,464                  5
HSUL          Hardin-Simmons University Library                               2,765       2,593                  6
HIGPL         Higgins Public Library                                          1,935       1,130                  3

In addition to the number of edit events,  I have added a column for the number of unique records for each of the institutions.  The same data is presented in the graph below.

Graph showing the edit event count and unique record count for each of the institutions with the most edit events

The larger the difference between the Edit Count and the Unique Records Edited, the more often that partner's records were edited repeatedly.

The final column in the table above shows the number of different collections that were edited that belong to each specific partner.  Taking UNTGD as an example, there are 27 different collections that held records that were edited during the year.  They are listed in the table below.

Collection Code  Collection Name                                         Edit Events  Records Edited
TLRA             Texas Laws and Resolutions Archive                      8,629        5,187
TXPT             Texas Patents                                           7,394        4,636
TXSAOR           Texas State Auditor’s Office: Reports                   2,724        1,223
USCMC            United States Census Map Collection                     1,779        1,695
USTOPO           USGS Topographic Map Collection                         490          458
TRAIL            Technical Report Archive and Image Library              287          279
CRSR             Congressional Research Service Reports                  271          270
FCCRD            Federal Communications Commission Record                211          208
NACA             National Advisory Committee for Aeronautics Collection  62           62
WWPC             World War Poster Collection                             49           49
WWI              World War One Collection                                41           41
USDAFB           USDA Farmers’ Bulletins                                 21           19
ATOZ             Government Documents A to Z Digitization Project        19           18
WWII             World War Two Collection                                19           19
ACIR             Advisory Commission on Intergovernmental Relations      14           13
NMAP             World War Two Newsmaps                                  12           12
TR               Texas Register                                          12           8
TXPUB            Texas State Publications                                12           12
GAORT            Government Accountability Office Reports                10           10
BRAC             Defense Base Closure and Realignment Commission         4            4
OTA              Office of Technology Assessment                         4            4
GDCC             CyberCemetery                                           2            2
FEDER            Federal Communications Commission Record                1            1
GSLTX            General and Special Laws of Texas                       1            1
TXHRJ            Texas House of Representatives Journals                 1            1
TXSS             Texas Soil Surveys                                      1            1
UNTGOV           Government Documents General Collection                 1            1

This set of data is a bit easier to see with a simple graph.  I’ve plotted the ratio of edit events to records for each collection as a simple line graph.

UNT Government Documents Edits to Record Ratios for each collection.

You can look at the graph above and quickly see which of the collections have a higher edit-to-record ratio.  The Texas State Auditor’s Office: Reports collection has the most edits per record, with a ratio of over 2.  Many of the other collections are much closer to 1, meaning one edit per record.
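The ratio plotted in the graph is simply edit events divided by records edited. Here is a small Python sketch of that calculation, using the five largest UNTGD collections from the table above; the function and variable names are illustrative only.

```python
# (edit events, records edited) for the five largest UNTGD collections,
# copied from the table above.
collections = {
    "TLRA": (8629, 5187),
    "TXPT": (7394, 4636),
    "TXSAOR": (2724, 1223),
    "USCMC": (1779, 1695),
    "USTOPO": (490, 458),
}


def edit_ratio(edit_events, records_edited):
    """Average number of edit events per edited record."""
    return edit_events / records_edited


ratios = {code: edit_ratio(events, records)
          for code, (events, records) in collections.items()}
highest = max(ratios, key=ratios.get)

for code, ratio in ratios.items():
    print(f"{code:8s}{ratio:.2f}")
print("highest ratio:", highest)
```

A ratio near 1 means most records in the collection were touched once; anything above 2 means the average edited record was revisited.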


The edit events occur in 266 different collections in the UNT Libraries’ Digital Collections.  As with the 167 partners above, that is too many to put into a table, so I’m going to list just the top ten in the table below.

Collection Code  Collection Name                                     Edit Events  Unique Records
TLRA             Texas Laws and Resolutions Archive                  8,629        5,187
ABCM             Abilene Library Consortium                          8,481        8,060
TDNP             Texas Digital Newspaper Program                     7,618        6,305
TXPT             Texas Patents                                       7,394        4,636
OKPCP            Oklahoma Publishing Company Photography Collection  5,799        4,729
JBPC             Jim Bell Texas Architecture Photograph Collection   5,504        5,322
TCO              Texas Cultures Online                               5,490        2,208
JJHP             John J. Herrera Papers                              5,194        1,996
UNTETD           UNT Theses and Dissertations                        4,981        3,704
UNTPC            University Photography Collection                   4,509        3,232

Again plotting the ratio of edit events to the number of unique records gives us the graph below.

Edit Events to Record Ratio grouped by Collection

You can quickly see the two collections that averaged over two edit events for each of the records that were edited during the last year,  meaning if a record was edited,  most likely it was edited at least two times.  Other collections like the Jim Bell Photography Collection or the Abilene Library Consortium Collection appear to have only been edited one time per record on average,  so when the edit was complete, it wasn’t revisited for additional editing.

Resource Type

The UNT Libraries makes use of a locally controlled vocabulary for its resource types.  You can view all of the available resource types here .

If you group the edit events and their associated records by resource type you get the following table.

Resource Type       Edit Events  Unique Records
image_photo         31,702       24,384
text_newspaper      11,598       10,176
text_leg            8,633        5,191
text_patent         7,480        4,667
physical-object     5,591        4,921
text_etd            4,986        3,709
text                4,311        2,511
text_letter         4,276        2,136
image_map           3,542        3,160
text_report         3,375        1,822
image_artwork       1,217        1,042
text_article        1,060        758
video               931          461
sound               719          694
text_legal          687          341
text_journal        549          288
text_book           476          422
image_presentation  430          313
image_postcard      429          180
image_poster        427          321
text_paper          423          312
text_pamphlet       303          199
text_clipping       275          149
text_yearbook       91           66
dataset             54           19
image_score         49           37
collection          41           34
image               34           20
website             22           20
text_chapter        17           14
text_review         13           11
text_poem           3            1
specimen            1            1

By calculating the edit-event-to-record ratio and plotting that you get the following graph.

Edit Events to Record Ratio grouped by Resource Type.

In the graph above I presented the data in the same order as it appears in the table just above the chart.  You can see that the highest ratio is for our text_poem record that was edited three different times.  Other notably high ratios are for postcards and datasets though there are several others that are at or close to 2 to 1 ratio of edits to records.


The final way we are going to look at the “what” data is by Format.  Again the UNT Libraries uses a controlled vocabulary for the format which you can look at here.  I’ve once again faceted on the format field and presented the total number of edit events and then unique records for each of the five format types that we have in the system.

Format   Edit Events  Unique Records
text     48,580       32,770
image    43,477       34,436
video    931          461
audio    720          695
website  22           20

Converting the ratio of events-to-records into a bar graph results in the graph below.

Edit Events to Record Ratio grouped by Format

It looks like we edit video files more times per record than any of the other types with text and then image coming in behind.
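The ordering can be reproduced directly from the format table. Here is a short Python sketch; the numbers are copied from the table above and the names are illustrative only.

```python
# (edit events, unique records) per format, copied from the table above.
formats = {
    "text": (48580, 32770),
    "image": (43477, 34436),
    "video": (931, 461),
    "audio": (720, 695),
    "website": (22, 20),
}


def events_per_record(name):
    """Edit events divided by unique records for one format."""
    events, records = formats[name]
    return events / records


# Formats ordered from most to fewest edits per record.
ranked = sorted(formats, key=events_per_record, reverse=True)

for name in ranked:
    print(f"{name:8s}{events_per_record(name):.2f}")
```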


There are almost endless combinations of collections, partners, resource types, and formats that can be put together, and the data deserves some further analysis to see whether it contains patterns that we should pay attention to.  But that’s more for another day.

This is the third in a series of posts related to metadata edit events in the UNT Libraries’ Digital Collections.  Check back for the next installment.

As always feel free to contact me via Twitter if you have questions or comments.

DuraSpace News: TOMORROW: Washington D.C. Fedora User Group Meeting, March 31 - April 1

Mon, 2015-03-30 00:00

Washington, DC  The Washington D.C. Fedora User Group Meeting will get underway tomorrow, Mar. 31 at the USDA National Agricultural Library. Day one presentations include updates on DuraSpace and Fedora 4, Fedora at the National Agricultural Library, Fedora at the University of Maryland Libraries, an Islandora Update and Specifying the Fedora API, and Short Presentations and a Project Roundtable. View the agenda here.

Mita Williams: The Setup

Sun, 2015-03-29 21:33
For this post, I’m going to pretend that the editors of the blog, The Setup (“a collection of nerdy interviews asking people from all walks of life what they use to get the job done”) asked me for a contribution. But in reality, I’m just following Bill Denton’s lead.

It feels a little self-indulgent to write about one’s technology purchases so before I describe my set up, let me explain why I’m sharing this information.

Some time back, in preparation for a session I was giving on Zotero for my university’s annual technology conference, I realized that before going into how to use Zotero, I had to address why. I recognized that I was asking students and faculty, who were likely already time-strapped and overburdened, to abandon long-standing practices that were already working for them if they were going to switch to Zotero for their research work.

Before my presentation, I asked on Twitter when and why faculty would change their research practices.  Most of the answers were on the cynical side but there were some that gave me some room to maneuver, namely this one: “when I start a new project.”  And there’s a certain logic to this approach. If you were starting graduate school and knew that you had to prepare for comps and generate a thesis at the end of the process, wouldn’t you want to conscientiously design your workflow at the start to capture what you learn in such a way that it’s searchable and reusable?

My own sabbatical is over and, oddly enough, it is now, at the end of my sabbatical, that I feel the most like I’m starting all over again in my professional work. So I’m using that New Project feeling to fuel some self-reflection on my own research process, bring some mindfulness to my online habits, and bring deliberate design to My Setup.

There’s another reason why I’m thinking about the deliberate design of research practice. As libraries start venturing into the space of research service consultation, I believe that librarians need to follow best practices for ourselves if we hope to develop expertise in this area.

As well, I think we need to be more conscious of how and when our practices are not in line with our values. It’s simply not possible to live completely without hypocrisy in this complicated world but that doesn’t mean we can’t strive for praxis. It’s difficult for me to take seriously accusations that hackerspaces are neoliberal when they are made by a person cradling a MacBook or iPhone. That being said, I greatly rely on products from Microsoft, Amazon, and Google so I'm in no position to cast stones.

I just want to care about the infrastructures we’re building….

And with that, here’s my setup!


There are three computers that I spend my time on: the family computer in the kitchen (a Dell desktop running Windows 7), my work computer (another Dell desktop running Windows 7), and my Thinkpad X1 Carbon laptop which I got earlier this year.  Grub turned my laptop into a dual boot machine that I can switch between Ubuntu and Windows 7. I feel I need a Windows environment so I can run any ESRI products and all those other Mac/Windows only products if need be.

I have a Nexus 4 Android phone made by LG and a Kindle DX as my ebook reader. I don’t own a tablet or an mp3 player.

World Backup Day is March 31st. I need to get myself an external drive for backups (Todo1).


After getting my laptop, the first thing I did was investigate password managers to find which one would work best for me. I ended up choosing LastPass and I felt the benefits immediately. Using a password manager has saved me so much pain and aggravation, and my passwords are now (almost) all unique. Next, I need to set up two-factor authentication for the services that I haven’t gotten around to yet (Todo2).

With work being done on three computers, it’s not surprising that I have a tendency to work online. My browser of choice is Mozilla but I will flip to Chrome from time to time. I use the sync functionality on both so my bookmarks are automatically updated and the same across devices. I use SublimeText as my text editor for code, GIMP as my graphics editor, and QGIS for my geospatial needs.

This draft, along with much of my other writing and presentations, is on Google Drive. I spend much of my time in Gmail and Google Calendar. While years ago I downloaded all my email using Mozilla Thunderbird, I have not set up a regular backup strategy for these documents (Todo3). I’ve toyed with using Dropbox to back up Drive but think I’m better off with an external drive. I have a Dropbox account because people occasionally share documents with me through it but at the moment, I only use it to back up my kids’ Minecraft games.

From 2007 to 2013, I used delicious to capture and share the things I read online. Then delicious tried to be the new Pinterest and made itself unusable (although it has since reverted to close to its original form) and so I switched to Evernote (somewhat reluctantly because I missed the public aspect of sharing bookmarks).  I’ve grown to be quite dependent on Evernote to save my outboard brain. I use IFTTT to post the links from my Twitter faves to delicious, which are then imported automatically into Evernote.  I also use IFTTT to automatically back up my Tumblr posts, my Foursquare check-ins (which also go to Google Calendar), and my Feedly saved posts to Evernote. Have I established a system to back up my Evernote notes on a regular basis? No, no I have not (Todo4).

The overarching idea that I have come up with is that the things I write are backed up on my Google Drive account and the library of things that I have read or saved for future reading (ha!) is saved on Evernote.  To this end, I use IFTTT to save my Tweets to a Google Spreadsheet and my Blogger and WordPress posts are automatically saved to Google Drive (still a work in progress. Todo 5). My ISP is Dreamhost but I am tempted to jump ship to Digital Ocean.

My goal is to have at least one backup for the things I’ve created. So I use IFTTT to save my Instagram posts to Flickr. My Flickr posts are just a small subset of all the photos that are automatically captured and saved on Google Photos.  No, I have not backed up these photos  (Todo 6) but I have, since 2005, printed the best of my photos on an annual basis into beautiful softcover books using QOOP and then later, through Blurb.  My Facebook photos and status updates from 2006 to 2013 have been printed in a lovely hardcover book using MySocialBook.  One day I would like to print a book of the best of my blogged writings using Blurb, if just as a personal artifact.

Speaking of books, because I’m one of the proud and the few to own a Kindle DX, I use it to read PDFs and most of my non-fiction reading. When I stumble upon a longread on the web, I use Readability’s Send to Kindle function so I can read it later without eyestrain. I’m inclined to buy the books that I use in my writing and research as Kindle ebooks because I can easily attach highlighted passages from these books to my Zotero account. My ebooks are backed up in my calibre library. I also use Goodreads to keep track of my reading because I love knowing what my friends are into.

I subscribe to Rdio and for those times that I actually spend money on owning music, I try to use Bandcamp. I’m an avid listener of podcasts and for this purpose use BeyondPod. Our Sonos system allows us to play music from all these services, as well as TuneIn, in the living room.  The music that I used to listen to on CD is now sitting on an unused computer running Windows XP and I know if I don’t get my act together and transfer those files to an external drive soon those files will be gone for good... if they haven’t already become inaccessible (*gulp*) (Todo 8).

For my “Todo list” I use Google Keep, which also captures my stray thoughts when I’m away from paper or my computer. Google Keep has an awesome feature that will trigger reminders based on your location.

So that’s My Setup. Let me know if you have any suggestions or can see some weaknesses in my workflow. Also, I’d love to learn from your Setup.

And please please please call me out if I don’t have a sequel to this post called The Backup by the time of next year's World Backup Day.

Nicole Engard: Bookmarks for March 29, 2015

Sun, 2015-03-29 20:30

Today I found the following resources and bookmarked them on Delicious.

The post Bookmarks for March 29, 2015 appeared first on What I Learned Today....

John Miedema: I’m a bit of a classification nut. It comes from my Dutch heritage. How do you organize files and emails into folders?

Sat, 2015-03-28 18:05

I’m a bit of a classification nut. It comes from my Dutch heritage — those Dutchies are always trying to be efficient with their tiny bits of land. It’s why I’m drawn to library science too. I think a lot about the way I organize computer files and emails into folders. It provides insight into the way all classification works, and of course ties into my Lila project. I’d really like to hear about your own practices. Here’s mine:

  1. Start with a root folder. When an activity starts, I put a bunch of files into a root folder (e.g., a Windows directory or a Gmail label).
  2. Sort files by subject or date. As the files start to pile up in a folder, I find stuff by sorting files by subject or date using application sorting functions (e.g., Windows Explorer).
  3. Group files into folders by subject. When there are a lot of files in a folder, I group files into different folders. The subject classification is low level, e.g., Activity 1, Activity 2. Activities that have expired are usually grouped together into an ‘archive’ folder.
  4. Develop a model. Over time the folder and file structure can get complex, making it hard to find stuff. I often resort to search tools. What helps is developing a model that reflects my work, e.g., Client 1, Client 2. Different levels correspond to my workflow, e.g., 1. Discovery, 2. Scoping, 3. Estimation, etc. The model is really a taxonomy, an information architecture. I can use the same pattern for each new activity.
  5. Classification always requires tinkering. I’ve been slowly improving the way I organize files into folders for as long as I’ve been working. Some patterns get reused over time, others get improved. Tinkering never ends.

(I will discuss the use of tagging later. Frankly, I find manual tagging hopeless.)

Mark E. Phillips: Metadata Edit Events: Part 2 – Who

Sat, 2015-03-28 15:53

In the previous post I started to explore the metadata edit events dataset generated from 94,222 edit events from 2014 for the UNT Libraries’ Digital Collections.  I focused on some of the information about when these edits were performed.

This post focuses on the “who” of the dataset.

All together we had 193 unique users edit metadata for one of the systems that comprise the UNT Libraries’ Digital Collections.  This includes The Portal to Texas History, UNT Digital Library, and the Gateway to Oklahoma History.

The top ten most frequent editors of metadata in the system are responsible for 57% of the overall edits.

Username     Edit Events
htarver      15,451
aseitsinger  10,105
twarner      4,655
mjohnston    4,143
atraxinger   3,905
cwilliams    3,490
sfisher      3,466
thuang       3,327
mphillips    2,669
sdillard     2,518
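The 57% figure follows from summing the ten counts above and dividing by the 94,222 total edit events. A minimal Python sketch of that arithmetic, with illustrative variable names:

```python
# Edit events for the ten most frequent editors, copied from the table above.
top_ten = {
    "htarver": 15451, "aseitsinger": 10105, "twarner": 4655,
    "mjohnston": 4143, "atraxinger": 3905, "cwilliams": 3490,
    "sfisher": 3466, "thuang": 3327, "mphillips": 2669, "sdillard": 2518,
}
TOTAL_EDIT_EVENTS = 94222  # total edit events in the 2014 dataset

top_ten_total = sum(top_ten.values())
share = top_ten_total / TOTAL_EDIT_EVENTS

print(f"{top_ten_total:,} of {TOTAL_EDIT_EVENTS:,} edit events ({share:.0%})")
```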

The overall distribution of edits per user looks like this.

Distribution of edits per user for the Edit Event Dataset

As you can see it shows the primary users of the system and then very quickly tapers down to the “long tail” of users who have a lower number of edit events.

Here is a quick look at the total number of unique users active on each day of the week across the entire dataset.

Sun  Mon  Tue  Wed  Thu  Fri  Sat
40   95   122  122  123  97   39

There is a swell for Tue, Wed, and Thu in the table above.  The pattern is pretty consistent: 39-40 unique users on Sat and Sun, 95-97 on Mon and Fri, and 122-123 on Tue, Wed, and Thu.

In looking at how unique users were spread across the year, grouped into months,  we got the following table and then graph.

Month      Unique Users
January    54
February   73
March      64
April      61
May        44
June       40
July       48
August     50
September  50
October    84
November   49
December   36

Unique Editors Per Month

There were some spikes throughout the year, most likely related to a metadata class in the UNT College of Information that uses the Edit system as part of its teaching.  These are the February and October spikes in the number of unique users.  Other than that, we are consistently at 40 or more unique users per month, with a small dip in December, when school is not in session.

In the previous post we had a heatmap with the number of edit events distributed over the hours of the day and the days of the week.  I’ve included that graph below.

94,222 edit events plotted to the time and day they were performed

I was curious to see how the unique number of editors mapped to this same type of graph,  so that is included below.

Unique editors distribution across day of the week and hour of the day.

User Status

Of the 193 unique metadata editors in the dataset, 135 (70%) of the users were classified as Non-UNT-Employee and  58 (30%) were classified as UNT-Employee. For the edit events themselves, 75,968 (81%) were completed by users classified with a status of UNT-Employee  and 18,254 (19%) by users classified with the status of Non-UNT-Employee.

User Rank

Rank       Edit Events  Percentage of Total Edits (n=94,222)  Unique Users  Percentage of Total Users (n=193)
Librarian  22,466       24%                                   16            8%
Staff      12,837       14%                                   13            7%
Student    41,800       44%                                   92            48%
Unknown    17,119       18%                                   72            37%

You can see that 44% of all of the edits in the dataset were completed by users who were students. Librarians and Staff members accounted for 38% of the edits.

This is the second in a series of posts related to metadata edit events in the UNT Libraries’ Digital Collections.  Check back for the next installment.

As always feel free to contact me via Twitter if you have questions or comments.

Ed Summers: The Adventure of Experiment

Sat, 2015-03-28 11:50

Love of certainty is a demand for guarantees in advance of action. Ignoring the fact that truth can be bought only by the adventure of experiment, dogmatism turns truth into an insurance company. Fixed ends upon one side and fixed “principles” — that is authoritative rules — on the other, are props for a feeling of safety, the refuge of the timid, and the means by which the bold prey upon the timid.

John Dewey in Human Nature and Conduct (p. 237)