Feed aggregator

FOSS4Lib Upcoming Events: Islandora information session at Carleton University

planet code4lib - Thu, 2016-02-18 15:42
Date: Wednesday, February 24, 2016 - 09:00 to 12:15
Supports: Islandora

Last updated February 18, 2016. Created by Peter Murray on February 18, 2016.

For more details, see the Meet-up page.

Andromeda Yelton: analyzing the Ruby Community Conduct Guideline

planet code4lib - Thu, 2016-02-18 14:26

tl;dr I read the Ruby Community Conduct Guideline. There are some appealing elements, but it is not actually workable as a governance document. I see three key problems: lack of recourse, assumption of symmetry, and non-handling of bad actors.

lack of recourse


The Ruby Community Conduct Guideline has an arresting blankness where I expected to see information on procedure. In particular, it doesn’t address any of the following:

  • How, and to whom, can conduct bugs be reported?
  • Who has the authority to mediate, or adjudicate, disputes under this guideline?
  • How are people selected for this role?
  • What sanctions may they impose? (What may they not impose?)
  • What procedures will they follow to:
    • Investigate situations
    • Reach decisions
    • Communicate those decisions to the aggrieved parties and the community at large
  • What enforcement mechanisms are (and are not) available after decisions are reached? Who is invested with the authority to carry out these enforcement mechanisms?

The absence of such procedures is obviously worrisome to people who identify with complainants and see themselves as being at risk of being harassed, because it indicates that there is, in fact, no mechanism for lodging a complaint, and no one responsible for handling it. But it should also be worrisome to people who see themselves as more likely to be (however unfairly) among the accused, because it means that if someone does attempt to lodge a complaint, the procedures for handling it will be invented on the fly, by people under stress, deadline pressure, and heavy criticism.

The history of such situations does not suggest this will go well.

assumption of symmetry


There are, again, some appealing statements of aspirational values in the Guideline. But the values are written as if they apply equally to all parties in all scenarios, and this has serious failure modes.

I expect, for instance, that the first guideline (“Participants will be tolerant of opposing views”) is meant to avoid folding an ideological litmus test into the Guideline. And I actually share the implied concern there; poorly drafted or discussed codes of conduct can indeed shade into this, and that’s not okay in large, international spaces. Insofar as this statement says “if I’m a Republican and you’re a Democrat, or I’m on Team Samoas and you’re on Team Tagalongs, or I’m a vi girl and you’re an emacs guy, we should be able to work together and deal with our disagreement”, I am all for it.

But what if my viewpoint is “someone should be allowed to check your genitals to see if you’re allowed to go to the bathroom”? Or “there aren’t many black software engineers because they’re just not as smart as white people”? (To be clear, not only do I not hold either viewpoint, I find them both loathsome. But you needn’t look far to find either.) Well. If I have any position of power in the community at all, my viewpoint has now become a barrier to your participation, if you are trans or black. You can’t go to a conference if you’re not sure that you’ll be able to pee when you’re there. And you can’t trust that any of your technical contributions will be reviewed fairly if people think your group membership limits your intelligence (unless you hide your race, which means, again, no conference attendance for you, and actually quite a lot of work to separate your workplace and social media identities from your open source contributions). Some people will laugh off that sort of outrageous prejudice and participate anyway; others will participate, but at a significant psychic cost (which is moreover, again, asymmetric — not a cost to which other community members are, or even can be, subject) — and others will go away and use their skills somewhere they don’t have to pay that kind of cost. In 2/3 of these cases, the participant loses; in 1/3, the open source community does as well.

And that brings me to the other asymmetry, which is power. Participants in open source (or, really, any) communities do not have equal power. They bring the inequalities of the larger world, of course, but there are also people with and without commit bits, people recognized on the conference circuit and those with no reputation, established participants and newcomers…

If, say, “Behaviour which can be reasonably considered harassment will not be tolerated.”, and low-status person A is harassing high-status person B, then even without any recourse procedures in the guideline, B has options. B can quietly ensure that A’s patches or talk proposals are rejected, that A isn’t welcome in after-hours bar conversations, that A doesn’t get dinner invitations. Or use blunter options that may even take advantage of official community resources (pipe all their messages to /dev/null before they get posted to the mailing list, say).

But if B is harassing A, A doesn’t have any of these options. A has…well, the procedures in a code of conduct, if there were any. And A has Twitter mobs. And A can leave the community. And that’s about it.

An assumption of symmetry is in fact an assumption that the transgressions of the powerful deserve more forbearance than the transgressions of the weak, and the suffering of the weak is less deserving of care than the suffering of the powerful.

bad actors

QA Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.

— Bill Sempf (@sempf) September 23, 2014

We write code in the hopes it will do the right thing, but we test it with the certainty that something will go wrong. We know that code isn’t good enough if it only handles expected inputs. The world will see your glass and fill it with sfdeljknesv.

When interpreting the words and actions of others, participants should always assume good intentions.

I absolutely love this philosophy right up until I don’t. Lots of people are decent, and the appropriate reaction to people with good intentions who have inadvertently transgressed some boundary isn’t the same as the appropriate reaction to a bad actor, and community policy needs to leave space for the former.

But some actors do not, in fact, have good intentions. The Ruby Guideline offers no next actions to victims, bystanders, or community leaders in the event of a bad actor. And it leaves no room for people to trust their own judgment when they have presumed good intentions at the outset, but later evidence has contradicted that hypothesis. If I have well-supported reasons to believe someone is not acting with good intentions, at what point am I allowed to abandon that assumption? Explicitly, never.

The Ruby Guideline — by addressing aspirations but not failure modes, by assuming symmetry in an asymmetric world, by stating values but not procedures — creates a gaping hole for the social equivalent of an injection attack. Trust all code to self-execute, and something terribly destructive will eventually run. And when it does, you’ll wish you had logging or sandboxes or rollback or instrumentation or at the very minimum a SIGTERM…but instead you’ll have a runaway process doing what it will to your system, or a messy SIGKILL.

The ironic thing is, in everyday life, I more or less try to live by this guideline. I generally do assume good faith until proven otherwise. I can find ideas of value in a wide range of philosophies, and in fact my everyday social media diet includes anarchists, libertarians, mainline Democrats, greens, socialists, fundamentalist Christians, liberal Christians, Muslims, Jews of various movements, people from at least five continents…quite a few people who would genuinely hate each other were they in the same room, and who discuss the same topic from such different points of view it’s almost hard to recognize that it’s the same topic. And that’s great! And I’m richer for it.

And the Guideline is still a bad piece of community governance.

FOSS4Lib Recent Releases: Open Library Environment (OLE) - 2.0

planet code4lib - Thu, 2016-02-18 13:44

Last updated February 18, 2016. Created by Peter Murray on February 18, 2016.

Package: Open Library Environment (OLE)
Release Date: Wednesday, February 17, 2016

Open Knowledge Foundation: New Network Guidelines – Tell Us What You Think!

planet code4lib - Thu, 2016-02-18 10:02

A month ago, we updated you about our plans for the Open Knowledge Network in 2016. One of the first steps this year will be to update our network guidelines. For us, guidelines are important because they help us to define our mutual causes and help us strive to achieve our goal. They also help us to set expectations. In Facebook terms, when the guidelines are not clear, the relationship status is “complicated”. When the guidelines are clear, then both sides can decide if they want to agree to the holy grail of statuses: “In a relationship”.

Today, we are happy to bring the Open Knowledge Network Guidelines to their first public consultation. We see the Network Guidelines as a continuous process of learning and feedback. It is a live document that will be reviewed constantly, and there is no better way to start than by consulting with you! In the past month, Open Knowledge International staff and Open Knowledge Chapters have reviewed and commented on the draft guidelines. Now it’s your turn!

What will you find in the guidelines?

For starters, we have defined the different entities in the network. Then, we explain what will happen to inactive groups and how they can be revived. We also define responsibilities and support for each entity of the network. Lastly, we explain how to join the network.

In our view, the responsibilities and support are a milestone for our network. They help us to commit to one another. In the new guidelines, most of the responsibilities of each entity relate to communication and how each entity updates the rest of the network about its work. We hope that with constant communication and updates, we can see more collaboration, build capacity and give more power to open knowledge activities around the world.

What’s next? Please take a look and comment on the guidelines in this Google Doc. We will review the guidelines in the next month and then upload them to the Open Knowledge International website. In addition, we will contact current members of the network and update them about the changes in the guidelines.

Let’s shape our network together.

Eric Hellman: The Impact of Bitcoin on Fried Chicken Recipe Archives

planet code4lib - Thu, 2016-02-18 05:08
Bitcoin is magic. Not the technology, but the hype machine behind it. You've probably heard that Bitcoin technology is going to change everything from banking to fried chicken recipes, from copyright to genome research. Like any good hype machine, Bitcoin's whips amazing facts together with plausible nonsense to make a perfect soufflé.

ChickenCoin (Comoros 25 francs 1982)
CC BY-NC-ND by edelweisscoins

The hype cycle is not Bitcoin's fault. Bitcoin is a masterful and probably successful attack on a problem that many thought was impossible. Bitcoin creates a decentralized, open, transparent and secure way to maintain a monetary transaction ledger. The reason this is so hard is because of the money. Money creates strong incentives for cheating, hacking, subverting the ledger, and it's really hard to create an open system that has no centralized authority yet is hard to attack. The genius of bitcoin is to cleverly exploit that incentive to power its distributed ledger engine, often referred to as "the blockchain". Anyone can create bitcoin for themselves and add it to the ledger, but to get others to accept your addition, you have to play by the rules and do work to add to the ledger. This process is called "mining".

If you're building a fried chicken recipe archive, (let's call it FriChiReciChive) there's good news and bad news. The bad news is that fried chicken is a terrible fuel for a blockchain ledger. No one mines for fried chicken. The good news is that very few nation-states care about your fried chicken recipes. Defending your recipe archive against cheating, hacking, attack and subversion will not require heroic bank-vault tactics.

That's not to say you can't learn from Bitcoin and its blockchain. Bitcoin is cleverly assembled from mature technologies that each seemed impossible not long ago. Your legacy recipe system was probably built in the days of archive horses and database buggies; if you're building a new one it probably would be a good idea to have a set of modern tools.

What are these tools? Here are a few of them:
  1. Big storage. It's easy to forget how much storage is available today. The current size of the bitcoin blockchain, the ledger of every bitcoin transaction ever made, is only 56 GB. That's about one iPhone of storage. The cheapest MacBook Pro comes with 128 GB, which is more than you can imagine. Amazon Web Services offers 500 GB of storage for $15 per month. Your job in making FriChiReciChive a reality is to imagine how to make use of all that storage. Suppose the average fried chicken recipe is a thousand words. That's about 10 thousand bytes. With 500 GB and a little math, you can store 50 million fried chicken recipes.

    Momofoku Fried Chicken
    CC BY-NC by gandhu

    Having trouble imagining 50 million chicken recipes? You could try a recipe a minute and it would take you 95 years to try them all. That would be a poor use of your time on earth, and it would be a poor use of 500 GB. So forget about deleting old recipes and start thinking about the FriChi info you could be keeping. How about recording every portion of fried chicken ever prepared, and who ate it? This is possible today. If you're working on an archive of books, you could record every time someone reads a book. Trust me, Amazon is trying to do that.

    Occasionally, you'll hear that you can store information directly in Bitcoin's blockchain. That's possible, but you probably don't want to do that because of cost. The current cost of adding a MB (about 1 block) to the bitcoin blockchain is 25 BTC. At current exchange rates, that's about $10 per kB. That cost is borne by the global Bitcoin system, and it pays for the power consumed by Bitcoin miners. For comparison, at $15 per month for 500 GB, AWS will charge you about 0.36 millionths of a dollar per year to store a kilobyte. The blockchain does more than S3, but not 30 million times more.

  2. Cryptographic hashes. Bitcoin makes pervasive use of cryptographic hashes to build and access its blockchain. Cryptographic hashes have some amazing properties. For example, you can use a hash to identify a digital document of any kind. Whether it's a password or a video of a cat, you can compute a hash and use the hash to identify the cat video. Flip a single bit in your fried chicken recipe, and you get a completely different hash. That's why bitcoin uses hashes to identify transactions. And you can't make the chicken from the hash. Yes, that's why it's called a hash.

    Moroccan Chicken Hash
    CC BY-NC-ND by mmm-yoso

    Once you have the hash of a digital object, you've made it tamper-evident. If someone makes a change in your recipe, or your cat video, or your software object, the hash of the thing will be completely different, and you'll be able to tell that it's been tampered with. So you never need to let anyone mess with Granny's fried chicken recipe.

  3. Hash chains. Once you've computed hashes for everything in FriChiReciChive, you probably think, "What good is it to hash a recipe? If someone can change the recipe, someone can change the hash, too." Bitcoin solves this problem by hashing the hashes! Each new data block contains the hash of the previous block. Which contains a hash of the block before that! And so on, all the way back to Satoshi's first block. Of course, this trick of chaining the block hashes was not invented by Bitcoin. And a chain isn't the only way to play this trick. A structure known as a Merkle tree (after its inventor) lays out the hash chains in a tree topology. So by using Merkle trees of fried chicken recipes, you can make the hash of a new recipe depend on every previous recipe. If someone wanted to mess with Granny, they'd have Nana to mess with too, not to mention the Colonel!

  4. Cryptographic signatures. If you're still paying attention, you might be thinking, "Hey! What's to stop Satoshi from rewriting the whole block chain?" And here's where cryptographic signatures come in. Every blockchain transaction gets signed by someone using public key cryptography. The signature works like a hash, and there are chains of signatures. Each signature can be verified using a public key, but without the owner's private key, any change to the block will cause the verification to fail. The result is that the block chain can't be altered without the cooperation of everyone who has signed blocks in the chain.

    Jingu-Galen Ginkgo Festival Fried Chicken
    CC BY-NC-ND by mawari

    Here's where FriChiReciChive is much easier to secure than Bitcoin. You don't need a lot of people participating to make the recipe ledger secure enough for the largest fried chicken attack you can imagine.

  5. Peer-to-peer. Perhaps the cleverest part of Bitcoin is the way that it arbitrates contention for the privilege of adding blocks. It uses puzzle solving to dole out this privilege to the "miners" (puzzle-solvers) who are best at solving the puzzle. Arbitration is needed because otherwise everyone could add blocks earning them Bitcoin. The puzzle solving turns out to be expensive because of the energy used to power the puzzle-solving computers. Peer-to-peer networks which share databases don't need this type of arbitration. While the contention for blocks in Bitcoin has been constantly rising, the contention for slots in distributed fried chicken data storage should drop into the foreseeable future.

  6. Zero-knowledge proofs. Once everyone's Fried Chicken meals are recorded in FriChiReciChive, you might suppose that fried chicken privacy would be a thing of the past. The truth is much more complicated. Bitcoin offers a non-intuitive mix of anonymity and transparency. All of its transactions are public, but identities can be private. This is possible because in today's world, knowledge can be one-directional and partitioned in bizarre ways. For example, you could build FriChiReciChive in such a way that Bob or Alice could ask what they ate on Christmas but Eve would never know, even though she possessed the entire fried chicken ledger.

    Charleston: Husk - Crispy Southern Fried Chicken Skins
    CC BY-NC-ND by wallyg

  7. One more thing. Bitcoin appears to have solved a really hard problem by deploying mature digital tools in clever ways that give participants incentive to make the system work. When you're figuring out how to build FriChiReciChive, or solving whatever problem you might have, chances are you'll have a different set of really hard problems with a different set of participants and incentives. By adding the set of tools I've discussed here, you may be able to devise a new solution to your hard problem.
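
The storage and cost figures quoted in item 1 can be rechecked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: every constant comes from the article's own numbers (10 thousand bytes per recipe, 500 GB for $15 per month, about $10 per kB on the blockchain), and the variable names are made up for illustration.

```python
# All inputs are the article's own round numbers, not current prices.
RECIPE_BYTES = 10_000               # "about 10 thousand bytes" per recipe
STORAGE_BYTES = 500 * 10**9         # 500 GB
AWS_DOLLARS_PER_MONTH = 15          # quoted price for those 500 GB
BLOCKCHAIN_DOLLARS_PER_KB = 10.0    # "about $10 per kB"

# How many recipes fit, and how long one recipe a minute would take.
recipes = STORAGE_BYTES // RECIPE_BYTES
years_at_one_per_minute = recipes / (60 * 24 * 365)

# Yearly AWS cost of one kilobyte: $180 per year spread over 500 million kB.
aws_dollars_per_kb_year = AWS_DOLLARS_PER_MONTH * 12 / (500 * 10**6)

# How many times more expensive the blockchain is per kilobyte.
cost_ratio = BLOCKCHAIN_DOLLARS_PER_KB / aws_dollars_per_kb_year

print(f"{recipes:,} recipes, {years_at_one_per_minute:.1f} years to try them all")
print(f"AWS: ${aws_dollars_per_kb_year:.2e} per kB per year")
print(f"Blockchain is roughly {cost_ratio / 1e6:.0f} million times more expensive")
```

The ratio comes out a little under 30 million, which matches the order of magnitude the article quotes.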
Bon appetit. Your soufflé will be délicieux!
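
For readers who want to see the "hashing the hashes" idea from items 2 and 3 in action, here is a minimal sketch of a recipe hash chain. It is not Bitcoin's block format (Bitcoin hashes a block header that includes a Merkle root, among other things); the function names, the dictionary layout and the all-zero starting value are assumptions made purely for illustration. It uses SHA-256 from Python's standard `hashlib` module.

```python
import hashlib

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first block


def digest(text):
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_chain(recipes):
    """Link recipes so each block's hash depends on every earlier recipe."""
    chain = []
    prev = GENESIS
    for recipe in recipes:
        block_hash = digest(prev + recipe)
        chain.append({"recipe": recipe, "prev": prev, "hash": block_hash})
        prev = block_hash
    return chain


def verify_chain(chain):
    """Recompute every hash; any edited recipe or broken link fails the check."""
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev:
            return False
        if block["hash"] != digest(prev + block["recipe"]):
            return False
        prev = block["hash"]
    return True


chain = build_chain(["Granny's recipe", "Nana's recipe", "The Colonel's recipe"])
print(verify_chain(chain))            # the untouched chain verifies

chain[0]["recipe"] = "Granny's recipe, now with ketchup"
print(verify_chain(chain))            # tampering with Granny is detected
```

Note that anyone who can rewrite the whole chain can also recompute all the hashes, which is the gap that the signatures in item 4 are meant to close.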

Peter Murray: Idea for an NPR Twitter bot — Tweet me about that story I just heard

planet code4lib - Thu, 2016-02-18 02:07

So I had an idea for a Twitter bot I would like to see. Occasionally I’ll be listening to a story on NPR and I’ll want to know more about it. Sometimes the host will say something like: “come to for more information and click on…” Other times it will be because I missed a crucial bit of the story and I’ll want to know more about it. So why not have a Twitter bot that I can call upon to say “Tell me more about that story”:

Hey @NPRnow — tell me more about that story I just heard.

— Peter Murray (@DataG) February 18, 2016

@DataG The story you just heard on NPR station WOSU was "More Died On This WWII Ship Than On The Titanic And …"

— nprbot (@NPRnow) February 18, 2016

The workflow for such a system doesn’t seem that hard. The bot would have to know my current location, and from that guess which NPR station(s) I’m listening to. (If I didn’t have geolocated tweets turned on, the bot could engage in a direct message conversation with me to ask which radio station I was listening to.) It would know the time of my tweet, so it would know which segment I was listening to. Sure, there is variation in local program listings, but for the most part it could probably rely on the national program segment lineups.
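
As a rough sketch of the "time of my tweet to segment" step, assuming a time-stamped lineup were available: the lineup below is entirely made up (placeholder titles and start times), `segment_at` is a hypothetical helper, and the tweet time is assumed to be already converted to the station's local time.

```python
from bisect import bisect_right
from datetime import datetime, time

# Hypothetical lineup: (segment start time, segment title), sorted by start.
# These times and titles are placeholders, not a real NPR rundown.
LINEUP = [
    (time(5, 0), "Segment A"),
    (time(5, 6), "Segment B"),
    (time(5, 12), "Segment C"),
]


def segment_at(lineup, heard_at):
    """Return the title of the segment airing at `heard_at`, or None."""
    starts = [start for start, _title in lineup]
    # Index of the first segment that starts strictly after the tweet time.
    index = bisect_right(starts, heard_at.time())
    if index == 0:
        return None  # the tweet came before the first listed segment
    return lineup[index - 1][1]


print(segment_at(LINEUP, datetime(2016, 2, 18, 5, 7)))
```

A real bot would also need an end time for the last segment and a way to handle local program variations, which this sketch ignores.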

The technology isn’t all that hard either. Amazon Kinesis could be used for tapping into Twitter’s streaming API. Kinesis would fire off AWS Lambda events in response to tweets to the bot, and the Lambda function could do the work of figuring out how to respond to the user. There is already some nice sample code from AWS for how to put this together.
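
A minimal sketch of what the Lambda side might look like, assuming each Kinesis record carries one tweet as JSON in Twitter's classic format (with `entities.user_mentions`, `user.screen_name` and `created_at` fields). Kinesis delivers record payloads to Lambda base64-encoded under `Records[].kinesis.data`. The handler name, the returned dictionary shape and the decision to simply return the matching requests are illustrative assumptions; this is not the AWS sample code mentioned above.

```python
import base64
import json

BOT_NAME = "NPRnow"  # the bot account used in the example tweets


def mentions_bot(tweet):
    """True if the tweet's user_mentions include the bot account."""
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    return any(
        m.get("screen_name", "").lower() == BOT_NAME.lower() for m in mentions
    )


def handler(event, context):
    """Decode Kinesis records and collect the tweets addressed to the bot."""
    requests = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        tweet = json.loads(payload)
        if mentions_bot(tweet):
            requests.append(
                {
                    "user": tweet["user"]["screen_name"],
                    "created_at": tweet["created_at"],
                    # May be None when the user has geolocation turned off.
                    "coordinates": tweet.get("coordinates"),
                }
            )
    return requests
```

From here, each collected request would feed the station and segment lookup, with a direct-message fallback when `coordinates` is missing.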

I’d do this, but there is one piece missing that I can’t find — the NPR segment lineup. A time-stamped listing of when the segments appear in the national audio stream. There is some nice semantic markup in the program listing page (today’s Morning Edition show, for example):

<article class="story clearfix"> <p class="segment-num"><b>1</b></p> <div class="storyinfo noimg"> <h2 class="slug"><a href="" title="Race : NPR">Race</a></h2> <h1><a href="" title="Supreme Court Short List Must Include Diverse Candidates, Author Says : NPR">Supreme Court Short List Must Include Diverse Candidates, Author Says</a></h1> <input type="hidden" id="title467036605" value="Supreme Court Short List Must Include Diverse Candidates, Author Says"/> <input type="hidden" id="modelShortUrl467036605" value=""/> <input type="hidden" id="modelFullUrl467036605" value=""/> </div>   <div class="audio-player "> <a href="#" data-audio="/npr/me/2016/02/20160217_me_qualified_diverse_candidates_must_be_considered_for_supreme_court_vacancy" data-audio-desktop="[467036605, 467036606, null, 1, 1, 1]"> <b class="icn-media"></b> <b class="call-to-action">Listen</b> <b class="loading">Loading…</b> <b class="audio-info"> </b><b class="time-elapsed"></b> <b class="time-total">5:30</b>   <b class="media-time-total"> </b><b class="media-time-current"></b> <b class="scrubber"></b>   </a>   <ul class="aux has-embed"> <li class="playlist"><a href="#" data-audio="/npr/me/2016/02/20160217_me_qualified_diverse_candidates_must_be_considered_for_supreme_court_vacancy" data-audio-desktop="[467036605, 467036606, null, 2, 1, 1]" title="Playlist">Playlist</a></li>   <li class="download"><a href="" title="Download">Download</a></li>   <li class="embed"> <a href="#" title="Embed">Embed</a> <div class="audio-embed-modal"> <label class="embed-label">Embed<input class="embed-url embed-url-no-touch" readonly="" value="<iframe src=&quot;; width=&quot;100%&quot; height=&quot;290&quot; frameborder=&quot;0&quot; scrolling=&quot;no&quot; title=&quot;NPR embedded audio player&quot;/>"></label> <button class="embed-close" aria-label="close embed overlay">Close embed overlay</button> <b class="embed-url embed-url-touch"><code><b class="punctuation">&lt;</b>iframe src="" width="100%" height="290" 
frameborder="0" scrolling="no" title="NPR embedded audio player"&gt;</code></b> </div> </li> </ul> </div> </article>

But it doesn’t have the time-stamped rundown of when the segments occur in the show. I’ve done some moderately intense Google searches, but I haven’t turned up anything. This kind of thing is probably on the dark web since it is intended just for station managers. Does anyone know how I might get ahold of it?
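
What the markup does expose is each segment's title and running time, and those can be pulled out with Python's standard `html.parser`. The sketch below assumes the structure shown above (an `<h1>` holding the title and a `<b class="time-total">` holding the duration inside each `<article>`); the class name `SegmentParser` is hypothetical, and the sample string is a trimmed-down version of the snippet rather than the full page.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect (title, duration in seconds) pairs from program listing markup."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._title = None
        self._in_h1 = False
        self._in_time = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._title = None          # a new story begins
        elif tag == "h1":
            self._in_h1 = True
            self._buf = []
        elif tag == "b" and attrs.get("class") == "time-total":
            self._in_time = True
            self._buf = []

    def handle_data(self, data):
        if self._in_h1 or self._in_time:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._title = "".join(self._buf).strip()
        elif tag == "b" and self._in_time:
            self._in_time = False
            text = "".join(self._buf).strip()
            if self._title and text:
                seconds = 0
                for part in text.split(":"):   # handles m:ss and h:mm:ss
                    seconds = seconds * 60 + int(part)
                self.segments.append((self._title, seconds))


SAMPLE = (
    '<article class="story clearfix"><div class="storyinfo noimg">'
    '<h1><a href="" title="">Supreme Court Short List Must Include '
    'Diverse Candidates, Author Says</a></h1></div>'
    '<div class="audio-player"><b class="time-total">5:30</b></div></article>'
)

parser = SegmentParser()
parser.feed(SAMPLE)
print(parser.segments)
```

Durations alone do not give air times, since newscasts, breaks and local inserts sit between segments, so this does not replace the missing rundown.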

DuraSpace News: GET "Expert Tips for Setting Up and Managing a DSpace Repository"

planet code4lib - Thu, 2016-02-18 00:00

Austin, TX  The January-February 2016 issue of the EIFL Newsletter features a report on a recent EIFL (Electronic Information for Libraries) and Institute of Development Studies (IDS) webinar entitled, "Expert Tips for Setting Up and Managing a DSpace Repository", attended by 70 people from 19 countries.

LITA: Jobs in Information Technology: February 17, 2016

planet code4lib - Wed, 2016-02-17 20:08

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Library Systems and Services, Educational Services Librarian, Redding, CA

Harford County Public Library, Vacancy #16-21, Specialist IV – ILS, Belcamp, MD

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

SearchHub: The Data That Lies Beneath: A Dark Data Deep Dive

planet code4lib - Wed, 2016-02-17 18:34

Our latest interactive infographic takes a look at the murky origins, and the potential, of dark data.


DuraSpace News: Registration Available: Hot Topics: The DuraSpace Community Webinar Series, “VIVO plus SHARE: Closing the Loop on Scholarly Activity”

planet code4lib - Wed, 2016-02-17 00:00

Austin, TX  We invite you to register for the DuraSpace Community Webinar Series: “VIVO plus SHARE: Closing the Loop on Scholarly Activity.”


Curated by: Rick Johnson, Program Co-Director, Digital Initiatives and Scholarship Head, Data Curation and Digital Library Solutions Hesburgh Libraries, University of Notre Dame; Visiting Program Officer for SHARE at the Association of Research Libraries

Mark Matienzo: My Jekyll todo list

planet code4lib - Tue, 2016-02-16 23:01

A running list of things I want to do or have done. A lot of this relates to adopting the IndieWeb ethos.

  • DONE Enable sending and receiving webmentions
  • DONE Minimal h-entry markup
  • New theme!
  • Enable incoming webmention displays from
  • Redo build process, perhaps running on Travis or my own server.
  • Enable automatic POSSE to Twitter, Medium, Slideshare, LinkedIn, and Facebook(?). Consider using Bridgy if this will lower friction.
  • Send automatic webmentions through on build.
  • Adopt Micropub or something comparable to potentially stage posts through pull requests. Longer term goal is to have a nice mobile client.
  • Refactor the publication and resume to be data driven.
  • Reuse and refactor existing codebases, like Aaron Gustafson’s Jekyll plugin for webmentions (GitHub repo) and Will Norris’ syndication plugin.
  • Implement Jekyll collections as a proxy for managing h-entry post-types.
  • Cache and eventually move commenting away from Disqus.

Suzanne Chapman: Confab presentation – Short, True, Meaningful (pick 2)

planet code4lib - Tue, 2016-02-16 21:08

I gave an Ignite-style presentation at the 2014 Confab Higher Ed conference. Not sure if I’ll ever do a timed, auto advancing style presentation again but it was super fun!

Short, True, Meaningful (pick 2) – Confab Higher Ed Lightning Talk

Description: Librarians are great at being exhaustively thorough – a trait that’s wonderful for all kinds of traditional library activities and services – but not so great for coming up with concise web content or link labels. After many lengthy conversations debating the value of being concise and user-friendly over being exhaustively thorough, I introduced a new strategy. Using the classic “cheap, fast, good” decision triangle, I created my own version using short, true, and meaningful to help visually demonstrate the tradeoffs.

ACRL TechConnect: Low Expectations Distributed: Yet Another Institutional Repository Collection Development Workflow

planet code4lib - Tue, 2016-02-16 20:38

Anyone who has worked on an institutional repository for even a short time knows  that collecting faculty scholarship is not a straightforward process, no matter how nice your workflow looks on paper or how dedicated you are. Keeping expectations for the process manageable (not necessarily low, as in my clickbaity title) and constant simplification and automation can make your process more manageable, however, and therefore work better. I’ve written before about some ways in which I’ve automated my process for faculty collection development, as well as how I’ve used lightweight project management tools to streamline processes. My newest technique for faculty scholarship collection development brings together pieces of all those to greatly improve our productivity.

Allocating Your Human and Machine Resources

First, here is the personnel situation we have for the institutional repository I manage. Your own circumstances will certainly vary, but I think institutions of all sizes will have some version of this distribution. I manage our repository as approximately half my position, and I have one graduate student assistant who works about 10-15 hours a week. From week to week we only average about 30-40 hours total to devote to all aspects of the repository, of which faculty collection development is only a part. We have 12 librarians who are liaisons with departments and do the majority of the outreach to faculty and promotion of the repository, but a limited amount of the collection development except for specific parts of the process. While they are certainly welcome to do more, in reality, they have so much else to do that it doesn’t make sense for them to spend their time on data entry unless they want to (and some of them do). The breakdown of work is roughly that the liaisons promote the repository to the faculty and answer basic questions; I answer more complex questions, develop procedures, train staff, make interpretations of publishing agreements, and verify metadata; and my GA does the simple research and data entry. From time to time we have additional graduate or undergraduate student help in the form of faculty research assistants, and we have a group of students available for digitization if needed.

Those are our human resources. The tools that we use for the day-to-day work include Digital Measures (our faculty activity system), Excel, OpenRefine, Box, and Asana. I’ll say a bit about what each of these are and how we use them below. By far the most important innovation for our faculty collection development workflow has been integration with the Faculty Activity System, which is how we refer to Digital Measures on our campus. Many colleges and universities have some type of faculty activity system or are in the process of implementing one. These generally are adopted for purposes of annual reports, retention, promotion, and tenure reviews. I have been at two different universities working on adopting such systems, and as you might imagine, it’s a slow process with varying levels of participation across departments. Faculty do not always like these systems for a variety of reasons, and so there may be hesitation to complete profiles even when required. Nevertheless, we felt in the library that this was a great source of faculty publication information that we could use for collection development for the repository and the collection in general.

We now have a required question about including the item in the repository on every item the faculty member enters in the Faculty Activity System. If a faculty member is saying they published an article, they also have to say whether it should be included in the repository. We started this in late 2014, and it revolutionized our ability to reach faculty and departments who never had participated in the repository before, as well as simplify the lives of faculty who were eager participants but now only had to enter data in one place. Of course, there are still a number of people whom we are missing, but this is part of keeping your expectations low–if you can’t reach everyone, focus your efforts on the people you can. And anyway, we are now so swamped with submissions that we can’t keep up with them, which is a good if unusual problem to have in this realm. Note that the process I describe below is basically the same as when we analyze a faculty member’s CV (which I described in my OpenRefine post), but we spend relatively little time doing that these days since it’s easier for most people to just enter their material in Digital Measures and select that they want to include it in the repository.

The ease of integration between your own institution’s faculty activity system (assuming it exists) and your repository certainly will vary, but in most cases it should be possible for the library to get access to the data. The integration is also a great selling point for your Office of Institutional Research, or whichever office administers the system, since it gives faculty a reason to keep their profiles up to date even when they are in between review cycles. If your institution does not yet have such a system, you might still discuss a partnership with that office, since your repository may hold extremely useful information for them about research activity of which they are not aware.

The Workflow

We get reports from the Faculty Activity System on roughly a quarterly basis. Faculty member data entry tends to bunch around certain dates, so we focus on the ends of semesters as the times to get the reports. The reports come by email as Excel files with information about the person, their department, contact information, and the like, as well as information about each publication. We do some initial processing in Excel to clean them up, remove duplicates from prior reports, and remove irrelevant information. It is amazing how many people see a field like “Journal Title” as a chance to ask a question rather than provide information. We focus our efforts on items that have actually been published, since the vast majority of people have no interest in posting pre-prints and those that do prefer to post them in arXiv or similar. The few people who do know about pre-prints and don’t have a subject archive generally submit their items directly. This is another way to lower expectations of what can be done through the process. I’ve already described how I use OpenRefine for creating reports from faculty CVs using the SHERPA/RoMEO API, and we follow a similar but much simplified process since we already have the data in the correct columns. Of course, following this process doesn’t tell us what we can do with every item. The journal title may be entered incorrectly so the API call didn’t pick it up, or the journal may not be in SHERPA/RoMEO. My graduate student assistant fills in what he is able to determine, and I work on the complex cases. As we are doing this, the Excel spreadsheet is saved in Box so we have the change history tracked and can easily add collaborators.
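For readers who would rather script the initial cleanup than do it by hand in Excel, here is a minimal sketch of the two filtering steps described above: dropping items already seen in a prior report and keeping only published items. It assumes each report has been read into a list of dictionaries; the column names ("Author", "Title", "Status") and the sample rows are hypothetical, not the actual Digital Measures export format.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical entries compare equal."""
    return " ".join(text.lower().split())


def new_published_items(current_rows, prior_rows):
    """Return rows that are published and did not appear in a prior report.

    Column names are assumptions made for this sketch.
    """
    seen = {(normalize(r["Author"]), normalize(r["Title"])) for r in prior_rows}
    kept = []
    for row in current_rows:
        key = (normalize(row["Author"]), normalize(row["Title"]))
        if key in seen:
            continue  # already handled in an earlier report (or repeated in this one)
        if normalize(row["Status"]) != "published":
            continue  # skip pre-prints, submissions, and works in progress
        seen.add(key)
        kept.append(row)
    return kept


# Made-up sample rows for illustration only.
prior_report = [
    {"Author": "Lee, A.", "Title": "Open Access Basics", "Status": "Published"},
]
current_report = [
    {"Author": "Lee, A.", "Title": "Open  access basics", "Status": "Published"},
    {"Author": "Cruz, B.", "Title": "Metadata in Practice", "Status": "Published"},
    {"Author": "Cruz, B.", "Title": "Work in Progress", "Status": "Submitted"},
]

to_review = new_published_items(current_report, prior_report)
print([row["Title"] for row in to_review])
```

Matching on a normalized author and title pair is a deliberately crude choice; it catches differences in case and spacing but not retyped or abbreviated titles, which would still need a human eye.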

A view of how we use Asana for managing faculty collection development workflows.

At this point, we are ready to move to Asana, which is a lightweight project management tool ideal for several people working on a group of related projects. Asana is far more fun and easy to work with than Excel spreadsheets, and this helps us work together better to manage workload and see where we are with all our ongoing projects. For each report (or faculty member CV), we create a new project in Asana with several sections. While it doesn’t always happen in practice, in theory each citation is a task that moves between sections as it is completed, and is finally checked off when it is either posted or moved off into some other fate not as glamorous as being archived as open access full text. The sections generally cover posting the publisher’s PDF, contacting publishers, reminders for followup, posting author’s manuscripts, or posting to SelectedWorks, which is our faculty profile service that is related to our repository but mainly holds citations rather than full text. Again, as part of the low expectations, we focus on posting final PDFs of articles or book chapters. We add books to a faculty book list, and don’t even attempt to include full text for these unless someone wants to make special arrangements with their publisher. This is rare, but again the people who really care make it happen. If we already know that the author’s manuscript is permitted, we don’t add these to Asana, but keep them in the spreadsheet until we are ready for them.

We contact publishers in batches, trying to group citations by journal and publisher to increase efficiency so we can send one letter to cover many articles or chapters. We note to follow up with a reminder in one month, and then again a month after that. Usually the second notice is enough to catch the attention of the publisher. As they respond, we move each citation to the section for posting the publisher’s PDF, to the section for posting the author’s manuscript, or, if posting is not permitted at all, to the section for posting to SelectedWorks. While we’ve tried several different procedures, I’ve determined it’s best for the liaison librarians to ask just for author’s accepted manuscripts for items after we’ve verified that no other version may be posted. And if we don’t ever get them, we don’t worry about it too much.
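The batching and reminder schedule described above can be sketched in a few lines of code. This is only an illustration, not how we actually do it (we track this in Asana): the field names and sample citations are hypothetical, and "one month" is approximated as 30 days.

```python
from collections import defaultdict
from datetime import date, timedelta


def batch_by_publisher(citations):
    """Group citation titles by publisher, then by journal, so one letter covers many items."""
    batches = defaultdict(lambda: defaultdict(list))
    for citation in citations:
        batches[citation["publisher"]][citation["journal"]].append(citation["title"])
    return batches


def reminder_dates(sent_on, interval_days=30, count=2):
    """Return follow-up dates: one interval after sending, then one more after that."""
    return [sent_on + timedelta(days=interval_days * n) for n in range(1, count + 1)]


# Made-up sample citations for illustration only.
citations = [
    {"publisher": "Example Press", "journal": "Journal A", "title": "First article"},
    {"publisher": "Example Press", "journal": "Journal A", "title": "Second article"},
    {"publisher": "Example Press", "journal": "Journal B", "title": "Third article"},
    {"publisher": "Other House", "journal": "Journal C", "title": "Fourth article"},
]

batches = batch_by_publisher(citations)
for publisher, journals in batches.items():
    total = sum(len(titles) for titles in journals.values())
    print(f"{publisher}: one letter covering {total} item(s)")

followups = reminder_dates(date(2016, 2, 1))
print(followups)
```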


I hope you’ve gotten some ideas from this post about your own procedures and new tools you might try. Even more, I hope you’ll think about which pieces of your procedures are really working for you, and discard those that aren’t working any more. Your own situation will dictate which those are, but let’s all stop beating ourselves up about not achieving perfection. Make sure to let your repository stakeholders know what works and what doesn’t, and if something that isn’t working is still important, work collaboratively to figure out a way around that obstacle. That type of collaboration is what led to our partnership with the Office of Institutional Research to use the Digital Measures platform for our collection development, and that in turn has led to other collaborative opportunities.


Suzanne Chapman: 2016 UX+Library Conferences

planet code4lib - Tue, 2016-02-16 17:35

There are so many good conferences for UX+Library folks this year!

Note: this isn’t intended to be an exhaustive list, just conferences I’ve attended and enjoyed or conferences I’ve heard good things about. But if you have suggestions, please comment below. 

(each section in chronological order)

Library conferences focused on web technology, design, user research, or assessment

Code4Lib

“An annual gathering of technologists from around the world, who largely work for and with libraries, archives, and museums and have a commitment to open technologies.”

  • Next occurrence: March 7-10, 2016
  • Location: Philadelphia, PA

IOLUG (Indiana Online Users Group)

Theme: “DIY UX: Innovate. Create. Design.”

From the call for proposals “What strategies and/or tools do you use to make library resources, webpages, spaces, marketing materials, etc. more user-friendly? What has proven successful for your organization? What problems surrounding user experience have you encountered, and what solutions have you devised? What best practices or recent research can you share about user experience? We encourage presentations that are practical, hands-on, and include take-awayable tools, techniques, and/or strategies that librarians can implement to improve their resources and services for students, patrons, faculty, etc.”

  • Next occurrence: May 20, 2016
  • Location: Indianapolis, IN

Computers in Libraries

“Libraries are changing,—building creative spaces with a focus on learning and creating; engaging audiences in different ways with community and digital managers; partnering with different community organizations in new and exciting ways. Computers in Libraries has always highlighted and showcased creative and innovative practices in all types of libraries, but this year with our theme, Library Labs: Research, Innovation & Imagination, we plan to feature truly transformative and cutting-edge research, services, and practices along with the strategies and technologies to support them.”

  • Next occurrence: March 8-10, 2016
  • Location: Washington, DC

Designing for Digital

“Designing for Digital is a two-day conference packed with intensive, hands-on workshops and informative sessions meant to bring together colleagues working on user experience, discovery, design and usability projects inside and outside of libraries, drawing expertise from the tech and education communities, as well as from peers. This exposure will allow information professionals to bring lessons home to their institutions and to think differently about designing our digital future.”

  • Next occurrence: April 6-7, 2016
  • Location: Austin, TX

The one and only library conference focused entirely on UX. Last year offered an interactive format with wonderful keynotes, hands-on ethnographic technique exercises, and a team challenge. This year will feature more individual sessions highlighting projects from around the world with the theme: nailed, failed, and derailed.

  • Next occurrence: June 23-24, 2016
  • Location: Manchester, England

Access

“Access is Canada’s annual library technology conference. It brings librarians, technicians, developers, programmers, and managers together to discuss cutting-edge library technologies. Access is a single stream conference featuring in-depth analyses, panel discussions, poster presentations, lightning talks, hackfest, and plenty of time for networking and social events.”

  • Next occurrence: October 4-7, 2016
  • Location: Fredericton, New Brunswick

LITA Forum

“The LITA Forum is the annual gathering of about 300 technology-minded information professionals. It is the conference where technology meets the practicality of daily information operations in archives, libraries, and other information services. The Forum is an ideal place to interact with fellow library technologists. Attendees are working at the cutting edge of library technology and are interested in making connections with technically-inclined colleagues and learn about new directions and projects in libraries.”

  • Next occurrence: TBD (likely November)
  • Location: TBD

Library Assessment

Theme is “Building Effective, Sustainable, Practical Assessment”.

“The conference goal is to build and further a vibrant library assessment community by bringing together interested practitioners and researchers who have responsibility or interest in the broad field of library assessment. The event provides a mix of invited speakers, contributed papers, short papers, posters, and pre- and post-conference workshops that stimulate discussion and provide workable ideas for effective, sustainable, and practical library assessment.”

  • Next occurrence: October 31–November 2, 2016
  • Location: Arlington, VA

DLF (Digital Library Federation)

“Strategy meets practice at the Digital Library Federation (DLF). Through its programs, working groups, and initiatives, DLF connects the vision and research agenda of its parent organization, the Council on Library and Information Resources (CLIR), to an active and exciting network of practitioners working in digital libraries, archives, labs, and museums. DLF is a place where ideas can be road-tested, and from which new strategic directions can emerge.”

  • Next occurrence: November 7-9, 2016
  • Location: Milwaukee, WI

Higher Ed conferences focused on web technology, design, or user research

Web Con

“An affordable two-day conference for web designers, developers, content managers, and other web professionals within higher ed and beyond.”

  • Next occurrence: April 27-28, 2016
  • Location: Urbana-Champaign, IL

HighEdWeb

“HighEdWeb is the annual conference of the Higher Education Web Professionals Association, created by and for all higher education Web professionals—from programmers to marketers to designers to all team members in-between—who want to explore the unique Web issues facing colleges and universities.”

  • Next occurrence: October 16-19, 2016
  • Location: Memphis, TN

edUi

“Focusing on the universal methods and tools of user interface and user experience design, as well as the unique challenges of producing websites and applications for large institutions, edUi is a perfect opportunity for web professionals at institutions of learning—including higher education, K-12 schools, libraries, museums, government, and local and regional businesses—to develop skills and share ideas.”

  • Next occurrence: TBD (likely November 2016)
  • Location: TBD

And a Bunch of Professional Industry Conferences & Events





Library of Congress: The Signal: Blurred Lines, Shapes, and Polygons, Part 2: An Interview with Frank Donnelly, Geospatial Data Librarian

planet code4lib - Tue, 2016-02-16 16:10

The following is a guest post by Genevieve Havemeyer-King, National Digital Stewardship Resident at the Wildlife Conservation Society Library & Archives. She participates in the NDSR-NYC cohort. This post is Part 2 on Genevieve’s exploration of stewardship issues for preserving geospatial data. Part 1 focuses on specific challenges of archiving geodata.

Frank Donnelly, GIS Librarian at Baruch College CUNY, was generous enough to let me pick his brain about some questions that came up while researching the selection and appraisal of geospatial data sets for my National Digital Stewardship Residency.

Baruch College’s NYC Geodatabase

Donnelly maintains the Newman Library’s geospatial data resources and repository, creates research guides for learning and exploring spatial data, and also teaches classes in open-source GIS software. In my meeting with him, we talked about approaches to GIS data curation in a library setting, limitations of traditional archival repositories, and how GIS data may be changing – all topics which have helped me think more flexibly about my work with these collections and my own implementation of standards and best practices for geospatial data stewardship.

Genevieve: How do you approach the selection of GIS materials?

Frank: As a librarian, much of my material selection is driven by the questions I receive from members of my college (students, faculty, and staff). In some cases these are direct questions (i.e. can we purchase or access a particular dataset), and in other cases it’s based on my general sense of what people’s interests are. I get a lot of questions from folks who are interested in local, neighborhood data in NYC for either business, social science, or public policy-based research, so I tend to focus on those areas. I also consider the sources of the questions – the particular departments or centers on campus that are most interested in data services – and try to anticipate what would interest them.

I try to obtain a mix of resources that would appeal to novice users for basic requests (canned products or click-able resources) as well as to advanced users (spatial databases that we construct so researchers using GIS can use it as a foundation for their work). Lastly, I look at what’s publicly accessible and readily usable, and what’s not. For example, it was challenging to find well-documented and public sources for geospatial datasets for NYC transit, so we decided to generate our own out of the raw data that’s provided.

Genevieve: On the limitations of the Shapefile, is the field growing out of this format? And do the limitations affect your ability to provide access?

Frank: People in the geospatial community have been grumbling about shapefiles for quite some time now, and have been predicting or calling for their demise. There are a number of limitations to the format in terms of maximum file size, limits on the number of attribute fields and on syntax used for field headers, lack of Unicode support, etc. It’s a rather clunky format as you have several individual pieces or files that have to travel together in order to function. Despite attempts to move on – ESRI has tried to de-emphasize them by moving towards various geodatabase formats, and various groups have promoted plain text formats like GML, WKT, and GeoJSON – the shapefile is still with us. It’s a long-established open format that can work in many systems, and has essentially become an interchange format that will work everywhere. If you want to download data from a spatial database or off of many web-based systems, those systems can usually transform and output the data to the shapefile format, so there isn’t a limitation in that sense. Compared to software for other types of digital data (music, spreadsheet files), GIS software seems to be better equipped for reading multiple types of file formats – just think about how many different raster formats there are. As other vector formats start growing in popularity and longevity – like GeoJSON or perhaps Spatialite – the shapefile may be eclipsed in the future, but its construction is simple enough that shapefiles should continue to be accessible.
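To make the point about pieces that "have to travel together" concrete, here is a minimal sketch of a completeness check a steward might run before ingesting a shapefile. It relies on the fact that a shapefile needs at least its .shp, .shx, and .dbf files sharing one base name; the function name and the demo files are hypothetical, and the check assumes lowercase extensions and does not validate file contents.

```python
import tempfile
from pathlib import Path

# The three mandatory components of a shapefile; optional sidecars such as
# .prj (projection) are not checked in this sketch.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the required extensions that are absent next to the given .shp path."""
    base = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS if not base.with_suffix(ext).exists()]


# Demo with empty placeholder files: the .shx index is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    layer = Path(tmp) / "subway_stations.shp"
    layer.touch()
    layer.with_suffix(".dbf").touch()
    missing = missing_shapefile_parts(layer)

print(missing)
```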

Genevieve: Do you think that a digital repository designed for traditional archives can or should incorporate complex data sets like those within GIS collections? Do you have any initial ideas or approaches to this?

Frank: This is something of an age-old debate within libraries: whether the library catalog should contain just books or whether it should also contain other formats like music, maps, datasets, etc. My own belief is that people who are looking for geospatial datasets are going to want to search through a catalog specifically for datasets; it doesn’t make sense to wade through a hodgepodge of other materials, and the interface and search mechanisms for datasets are fundamentally different than the interface that you would want or need when searching for documents. Typical digital archive systems tend to focus on individual files as objects – a document, a picture, etc. Datasets are more complex as they require extensive metadata (for searchability and usability), help documentation and codebooks, etc. If the data is stored in large relational or object-oriented databases, that data can’t be stored in a typical digital repository unless you export the data tables out into individual delimited text files. That might be fine for small datasets or generally good for ensuring preservation, but if you have enormous datasets – imagine if you had every data table from the US Census – it would be completely unfeasible.

For digital repositories I think it’s fine for small individual datasets, particularly if they are being attached to a journal article or research paper where analysis was done. But in most instances I think it’s better to have separate repositories for spatial and large tabular datasets. As a compromise you can always generate metadata records that can link you from the basic repository to the spatial one if you want to increase find-ability. Metadata is key for datasets – unlike documents (articles, reports, web pages) you have no text to search through, so keyword searching goes out the window. In order to find them you need to rely on searching metadata records or any help documents or codebooks associated with them.
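As an illustration of the export step Frank mentions, where database tables are written out as individual delimited text files, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for a real relational database; the table, its columns, and its rows are made up for the example, and a production export would also need to carry the metadata and codebooks along.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_tables(conn, out_dir):
    """Write every table in the SQLite database to its own CSV file; return the paths."""
    out_dir = Path(out_dir)
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    written = []
    for name in names:
        cursor = conn.execute(f'SELECT * FROM "{name}"')
        path = out_dir / f"{name}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow([column[0] for column in cursor.description])  # header row
            writer.writerows(cursor)
        written.append(path)
    return written


# Made-up table and rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE areas (area_id TEXT, population INTEGER)")
conn.executemany("INSERT INTO areas VALUES (?, ?)", [("A001", 120), ("A002", 345)])

with tempfile.TemporaryDirectory() as tmp:
    paths = export_tables(conn, tmp)
    exported_names = [path.name for path in paths]
    exported_lines = paths[0].read_text(encoding="utf-8").splitlines()

print(exported_names, exported_lines)
```

One file per table is easy to deposit in a file-oriented repository, which is also why the approach stops scaling once the number of tables gets very large.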

Genevieve: How do you see selection and preservation changing in the future, if/when you begin collecting GIS data created at Baruch?

Frank: For us, the big change will occur when we can build a more robust infrastructure for serving data. Right now we have a web server where we can drop files and people can click on links to download layers or tables one by one. But increasingly it’s not enough to just have your own data website floating out there; in order to make sure your data is accessible and findable you want to appear in other large repositories. Ideally we want to get a spatial database up and running (like PostGIS) where we can serve the data out in a number of ways – we can continue to serve it the old fashioned way but would also be able to publish out to larger repositories like the OpenGeoportal. A spatial database would allow us to grant users access to our data directly through a GIS interface, without having to download and unzip files one by one.

DPLA: RootsTech Wrap Up

planet code4lib - Tue, 2016-02-16 16:00

The DPLA booth ready for visitors at RootsTech 2016.

Earlier this month, DPLA’s Director for Content Emily Gore and Manager of Special Projects Kenny Whitebloom headed west to Salt Lake City to represent DPLA at RootsTech, the largest family history event in the world. DPLA had a booth in the Exhibit Hall and hosted two sessions, through which we were able to introduce our portal, collections, and resources to over a thousand genealogists and family researchers.

We love connecting with new audiences and were thrilled to have the opportunity to touch base with genealogists and family researchers to chat about their needs, interests, and questions about what DPLA has to offer. DPLA stands out in the field as a free public resource that allows researchers to search the collections of almost 1,800 libraries, archives, and museums around the country all at once. Nearly everyone who passed through our booth was excited about DPLA as a research resource — there was so much interest we ran out of brochures!

We also found that family researchers had great questions for us: Can you search family names? Does DPLA have things like newspapers, letters, or yearbooks? What about essential documents like birth and death records? Answer: Yes! We have collected content in each of these categories from our network of hubs, but what was perhaps most exciting to family researchers about DPLA was the opportunity to dig deeper and add context to the lives of our ancestors.

Our presentation sessions allowed us to go even further. We welcomed all levels of researchers, from beginners to pros, and Emily’s slides below demonstrate a few of the ways that DPLA collections hold vast potential for family historians. For example, family bibles, like that of the Whitehead family, can be an invaluable source of birth and death information, particularly for the years before official state documentation. Looking for Civil War-era ancestors? Try searching for regimental records, veterans’ association photos, and scrapbooks:


The App Library also holds some valuable tools for family researchers. Here, Emily shows how DPLA by County and State might be particularly helpful to zero in on a specific place or region that your family hails from. Or, try cross-searching DPLA and Europeana to connect to family resources in Europe!


Thanks to everyone who stopped by our sessions and booth at RootsTech, and welcome to the DPLA community!

Let’s stay connected about how DPLA can best serve genealogists and family researchers and we’ll hope to see you at RootsTech 2017!

