Feed aggregator

Mark E. Phillips: Writing the UNT Libraries Digital Collections to tape.

planet code4lib - Fri, 2015-10-02 13:30

When we created our digital library infrastructure a few years ago, one of the design goals of the system was that we would create write once digital objects for the Archival Information Packages (AIPs) that we store in our Coda repository.

Currently we store two copies of each of these AIPs, one locally in the Willis Library server room and another copy in the UNT System Data Center at the UNT Discovery Park research campus that is five miles north of the main campus.

Over the past year we have been working on a self-audit using the TRAC Criteria and Checklist as part of our goal of demonstrating that the UNT Libraries Digital Collections is a Trusted Digital Repository. In addition to this TRAC work we’ve also used the NDSA Levels of Preservation to help frame where we are with digital preservation infrastructure, and where we would like to be in the future.

One of the things that I was thinking about recently is what it would take for us to get to Level 3 of the NDSA Levels of Preservation for “Storage and Geographic Location”

“At least one copy in a geographic location with a different disaster threat”

In thinking about this I was curious what the lowest cost would be for me to get this third copy of my data created, and moved someplace that was outside of our local disaster threat area.

First some metrics

The UNT Libraries’ Digital Collections has grown considerably over the past five years that we’ve had our current infrastructure.

As of this post, we have 1,371,808 bags of data containing 157,952,829 files in our repository, taking up 290.4 TB of storage for each copy we keep.

Our growth curve changed a bit starting in 2014 and is a bit steeper than it had been previously. From what I can tell it is going to continue at this rate for a while.

So I need to figure out what it would cost to store 290TB of data in order to get my third copy.

Some options.

There are several options to choose from for where I could store my third copy of data. I could store my data with a service like Chronopolis, MetaArchive, DPN, or DuraSpace, to name a few. These all have different cost models and different services, and for what I’m interested in accomplishing with this post and my current musing, these solutions are overkill for what I want.

I could use either a cloud based service like Amazon Glacier, or even work with one of the large high performance computing facilities like TACC at the University of Texas to store a copy of all of my data.  This is another option but again not something I’m interested in musing about in this post.

So what is left? Well, I could spin up another rack of storage, put our Coda repository software on top of it, and start replicating my third copy, but the problem is getting it into a rack that is several hundred miles away. UNT doesn’t have any facilities outside of the DFW area, so that is out of the question.

So that leaves me thinking about tape infrastructure, and specifically about getting an LTO-6 setup to spool a copy of all of our data to, and then sending those tapes off to a storage facility, possibly something like the TSLAC Records Management Services for Government Agencies.

Spooling to Tape

So in this little experiment I was interested in finding out how many LTO-6 tapes it would take to store the UNT Libraries Digital Collections.  I pulled a set of data from Coda that contained the 1,371,808 bags of data and the size of each of those bags in bytes.

The uncompressed capacity of an LTO-6 tape is 2.5 TB, so some quick math says that it will take 116 tapes to write all of my data. This is probably low, because it assumes that I am able to completely fill each of the tapes with exactly 2.5 TB of data.

I figured that there were at least three ways for me to approach distributing digital objects to tape:

  • Write items in the order that they were accessioned
  • Write items in order from smallest to largest
  • Fill each tape to the highest capacity before moving to the next

I wrote three small Python scripts that simulated these three options to find the number of tapes needed as well as the overall storage efficiency of each method. I decided I would only fill a tape with 2.4 TB of data to give myself plenty of wiggle room. Here are the results:

Method                   Number of Tapes   Efficiency
Smallest to Largest      136               96.91%
In order of accession    136               96.91%
Fill a tape completely   132               99.85%
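
The scripts themselves aren’t reproduced here, but a minimal sketch of the “fill each tape completely” option, read as a first-fit packing over a dump of bag sizes, might look like the following. The input file name, its format, and the particular efficiency calculation are assumptions for illustration, not details of the actual Coda export or scripts.

# First-fit simulation: place each bag on the first tape that still has room.
# Assumes a whitespace-separated file of "bag_id size_in_bytes", one bag per line.

TAPE_CAPACITY = 2.4 * 10**12  # only fill each tape to 2.4 TB for wiggle room

def simulate_first_fit(path):
    tapes = []   # bytes already written to each tape
    total = 0    # total bytes across all bags
    with open(path) as f:
        for line in f:
            bag_id, size = line.split()
            size = int(size)
            total += size
            for i, used in enumerate(tapes):
                if used + size <= TAPE_CAPACITY:
                    tapes[i] = used + size
                    break
            else:
                tapes.append(size)  # no tape had room, start a new one
    # one possible efficiency figure: data written / capacity of the tapes used
    return len(tapes), total / (len(tapes) * TAPE_CAPACITY)

count, efficiency = simulate_first_fit("bag_sizes.txt")
print("{} tapes, {:.2%} efficient".format(count, efficiency))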

In my thinking, the simplest way of writing objects to tape would be to order the objects by their accession date and write them to a tape until it is full; when it is full, start writing to another tape.

If we assume that a tape costs $34, the overhead of this less efficient but simplest approach is only four extra tapes (about $136), which to me is completely worth it. This way, in the future I could just continue to write tapes as new content gets ingested by picking up where I left off.

So from what I can figure from poking around various tape retailers’ sites, I’m going to be out roughly $10,000 for my initial tape infrastructure, which would include a tape autoloader and a server to stage files onto from our Coda repository. I would have another cost of $4,352 to get my 136 LTO-6 tapes to accommodate the current 290 TB of data in Coda. If I assume a five year replacement rate for this technology (so that I can spread the initial costs out over five years), that works out to just about $50 per TB, or $10 per TB per year over the five year lifetime of the technology.

If you like GB prices better, I’m coming up with roughly $0.01 per GB, or $0.002 per GB per year.
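
Roughly, the arithmetic works out like this (a quick sketch; note that the per-GB figures only come out right if you count the tape media alone, not the autoloader and server):

# Quick cost arithmetic for the figures above.

infrastructure = 10000.00            # autoloader + staging server
tape_media = 4352.00                 # 136 LTO-6 tapes
data_tb = 290.4
years = 5

total = infrastructure + tape_media
print(total / data_tb)               # ~ $49 per TB
print(total / data_tb / years)       # ~ $10 per TB per year

per_gb_media = tape_media / (136 * 2500)   # 2.5 TB = 2,500 GB per tape
print(per_gb_media)                  # ~ $0.013 per GB of tape capacity
print(per_gb_media / years)          # ~ $0.0026 per GB per year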

If I was going to use Amazon Glacier (calculations are with an unofficial Amazon Glacier calculator and assume a whole bunch of things that I’ll gloss over related to data transfer), I come up with a cost of $35,283.33 per year instead of my roughly calculated $2,870.40 per year. (I realize that these cost comparisons aren’t for the same service and Glacier includes extra redundancy, but you get the point I think.)

There is going to be another cost associated with this, which is the off-site storage of 136 LTO-6 tapes. As of right now I don’t have any idea of those costs, but I assume it could be done anywhere from very cheaply (perhaps as part of an MOU with another academic library for little or no cost) to something more expensive like a contract with a commercial service. I’m interested to see if UNT would be able to take advantage of the services offered by TSLAC and their Records Management Services.

So what’s next?

I’ve had fun musing about this sort of thing for the past day or so. I have zero experience with tape infrastructure, and from what I can tell it can get as cool and feature-rich as you are willing to pay for. I like the idea of keeping it simple, so if I can work directly with a tape autoloader using command line tools like tar and mt, I think that is what I would prefer.
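
Something like the following sketch is the sort of thing I have in mind; the tape device path, the staging directory, and the bag name are just placeholders, and I haven’t tested any of this against real hardware.

# Sketch of spooling a staged bag to tape with tar and mt, driven from Python.
# /dev/nst0 (the non-rewinding tape device) and the staging path are placeholders.
import subprocess

TAPE = "/dev/nst0"
STAGING = "/coda/staging"

def write_bag_to_tape(bag_name):
    # position at the end of the recorded data, then append the bag as the next tape file
    subprocess.check_call(["mt", "-f", TAPE, "eom"])
    subprocess.check_call(["tar", "-cf", TAPE, "-C", STAGING, bag_name])
    # the tape driver writes a file mark when the device is closed after writing

write_bag_to_tape("example_bag")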

Hope you enjoyed my musings here, if you have thoughts, suggestions, or if I missed something in my thoughts,  please let me know via Twitter.

District Dispatch: Archived copy of CopyTalk webinar now available

planet code4lib - Thu, 2015-10-01 20:19

By trophygeek

If you missed the October 1st webinar “Trans-Pacific Partnership” with Krista Cox, Director of Public Policy Initiatives at the Association of Research Libraries (ARL), you can now access it in our CopyTalk archive. Krista discussed what we know about this trade agreement gleaned from leaked documents.

Mark your calendars! OITP’s Copyright Education Subcommittee sponsors CopyTalk on the first Thursday of every month at 2:00 pm (Eastern)/ 11 am (Pacific). On November 5th, 2015 Rebecca Tushnet, Professor of Law at Georgetown University will talk about fan fiction. Imagine the copyright issues!

If you want to suggest topics for CopyTalk webinars, let us know via email and use the subject heading “CopyTalk.”

Oh yes! The webinars are free, and we want to keep it that way. We have a 100-seat limit, but any additional seats are outrageously expensive! If possible, consider watching the webinar with colleagues or joining the webinar before start time. And remember, there is an archive.

The post Archived copy of CopyTalk webinar now available appeared first on District Dispatch.

pinboard: @lbjay/c4l16 keynote candidates on Twitter

planet code4lib - Thu, 2015-10-01 14:06
A twitter list of all the #c4l16 keynote cands. cuz we should all be judged on what we express in 140ch. #code4lib

Harvard Library Innovation Lab: Link roundup October 1, 2015

planet code4lib - Thu, 2015-10-01 13:23

Homogeneously contributed

MIT Student Builds Real-Time Transit Map for His Wall | Mental Floss

Real time light up MBTA map

Why Preserving Old Computer Games is Surprisingly Difficult | Mental Floss

Preserving old video games is not so easy

Get Peanutized | Turn Yourself into a Peanuts Character

Need a new GitHub profile pic? Peanutize yourself!

Cheeky Cans Reduce Litter by Asking People to Vote With Their Trash | Mental Floss

Kind of like Awesome Box, but with litter

This Camera Refuses to Take Pictures of Over-Photographed Locations | Mental Floss

A camera that won’t take pictures of popular locations. How about a scanner that won’t check out popular items?

Conal Tuohy: Names in the Museum

planet code4lib - Thu, 2015-10-01 04:56

My last blog post described an experimental Linked Open Data service I created, underpinned by Museum Victoria’s collection API. Mainly, I described the LOD service’s general framework, and explained how it worked in terms of data flow.

To recap briefly, the LOD service receives a request from a browser and in turn translates that request into one or more requests to the Museum Victoria API, interprets the result in terms of the CIDOC CRM, and returns the result to the browser. The LOD service does not have any data storage of its own; it’s purely an intermediary or proxy, like one of those real-time interpreters at the United Nations. I call this technique a “Linked Data proxy”.

I have a couple more blog posts to write about the experience. In this post, I’m going to write about how the Linked Data proxy deals with the issue of naming the various things which the Museum’s database contains.

Using Uniform Resource Identifiers (URIs) as names

Names are a central issue in any Linked Data system; anything of interest must be named with an HTTP URI; every piece of information which is recorded about a thing is attached to this name, and crucially, because these names are HTTP URIs, they can (in fact in a Linked Data system, they must) also serve as a means to obtain information about the thing.

In a nutshell there are three main tasks the Linked Data proxy has to be able to perform:

  1. When it receives an HTTP request, it has to recognise the HTTP URI as an identifier that identifies a particular individual belonging to some general type: an artefact; a species; a manufacturing technique; etc.
  2. Having recognised the URI as some sort of name, it has to be able to look up and retrieve information about the particular individual which it identifies.
  3. Having found some information about the named thing, it has to convert that information into RDF (the language of Linked Data), in the process converting any identifiers it has found into the kind of HTTP URIs it can recognise in future. A Linked Open Data client is going to want to use those identifiers to make further requests, so they have to match the kind of identifiers the LOD service can recognise (in step 1 above).

Recognising various HTTP URIs as identifiers for things in Museum Victoria’s collection

Let’s look at the task of recognising URIs as names first.

The Linked Data Proxy distinguishes between URIs that name different types of things by recognising different prefixes in the URIs. For instance, a URI beginning with the items prefix will identify a particular item in the collection, whereas a URI beginning with the technique prefix will identify some particular technique used in the manufacture of an item.

The four central entities of Museum Victoria’s API

The Museum Victoria API is organised around four main types of entity:

  • items
  • specimens
  • species
  • articles

The LOD service handles all four very similarly: since the MV API provides an identifier for every item, specimen, species, or article, the LOD service can generate a Linked Data identifier for each one just by sticking a prefix on the front. For example, the item which Museum Victoria identifies with the identifier items/1221868 can be given a Linked Data identifier just by sticking the LOD service’s base URI in front of it, and a document about that item can be identified by a corresponding document URI.
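
As a rough sketch of how little is involved (the base URI below is a placeholder, not the real address of the service), the mapping in both directions is just string concatenation:

# Sketch of the identifier mapping; BASE is a placeholder, not the real service URI.
BASE = "http://example.org/museum-victoria/"

def mv_id_to_lod_uri(mv_id):
    # e.g. "items/1221868" -> "http://example.org/museum-victoria/items/1221868"
    return BASE + mv_id

def lod_uri_to_mv_id(uri):
    # strip the prefix back off to recover the Museum Victoria identifier
    assert uri.startswith(BASE)
    return uri[len(BASE):]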

Secondary entities

So far so straightforward, but apart from these four main entity types, there are a number of things of interest which the Museum Victoria API deals with in a secondary way.

For example, the MV database records, for many of the artefacts in the collection, how they were manufactured, in a field called technique. For instance, many ceramic items (e.g. teacups) in the collection were created from a mould, and have the value moulded in their technique field. The tricky thing here is that the techniques are not “first-class” entities like items. Instead, a technique is just a textual attribute of an item. This is a common enough situation in legacy data systems: the focus of the system is on what it sees as the “core” entities (museum items in this case), which have their own identifiers and a bunch of properties hanging off them. Those properties are very much second-class citizens in the data model, and are often just textual labels. A number of items might share a common value for their technique field, but that common value is not stored anywhere except in the technique field of those items; it has no existence independent of those items.

In Linked Data systems, by contrast, such common values should be treated as first-class citizens, with their own identifiers, and with links that connect each technique to the items which were manufactured using that technique.

What is the LOD service to do? When expressing a technique as a URI, it can simply use the technique’s name itself (“moulded”) as the last component of the identifier, appended to a technique prefix.

Then when the LOD service is responding to a request for such a URI, it can pull that prefix off and have the original word “moulded” back.

At this point the LOD service needs to be able to provide some information about the moulded technique. Because the technique is not a first-class object in the underlying collection database, there’s not much that can be said about it, apart from its name, obviously, which is “moulded”. All that the LOD service really knows about a given technique is that a certain set of items were produced using that technique, and it can retrieve that list using the Museum Victoria search API. The search API allows for searching by a number of different fields, including technique, so the Linked Data service can take the last component of the request URI it has received (“moulded”) and pass it to the search API as the technique parameter.

The result of the search is a list of items produced with the given technique, which the LOD service simply reformats into an RDF representation. As part of that conversion, the identifiers of the various moulded items in the results list (e.g. items/1280928) are converted into HTTP URIs simply by sticking the LOD service’s base URI on the front of them.
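
To make that flow concrete, here is a minimal sketch of how such a technique endpoint could work. The search URL and its parameter name, the base URI, the JSON field names, and the use of rdflib are all illustrative assumptions, not details of the actual service:

# Sketch of a "technique" lookup: search the MV API, emit RDF links to the items.
import requests
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

SEARCH_URL = "http://collections.museumvictoria.com.au/api/search"  # assumed endpoint
LOD_BASE = "http://example.org/museum-victoria/"                    # placeholder base URI
CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")

def describe_technique(name):
    technique_uri = URIRef(LOD_BASE + "technique/" + name)
    graph = Graph()
    graph.add((technique_uri, RDFS.label, Literal(name)))
    # ask the MV search API for items made with this technique (parameter name assumed)
    for item in requests.get(SEARCH_URL, params={"technique": name}).json():
        item_uri = URIRef(LOD_BASE + item["id"])   # e.g. "items/1280928"
        # CIDOC CRM P32 "used general technique" links each item to the technique
        graph.add((item_uri, CRM.P32_used_general_technique, technique_uri))
    return graph

print(describe_technique("moulded").serialize(format="turtle"))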

External links

Tim Berners-Lee, the inventor of the World Wide Web, in an addendum to his “philosophical” post about Linked Data, suggested a “5-star” rating scheme for Linked Open Data, in which the fifth star requires that a dataset “link … to other people’s data to provide context”. Since the Museum Victoria data doesn’t include external links, it is tricky to earn this final star, but there is a way, based on MV’s use of the standard taxonomic naming system used in biology. Since many of MV’s items are biological specimens, we can use their taxonomic names to establish links to external sources which use the same names. For this example, I chose to link to biological data in Wikipedia, or rather in DBpedia, which (unknown to many people) is a large Linked Open Data dataset derived from Wikipedia pages and includes a lot of biological taxa. To establish a link to DBpedia, the LOD service takes the Museum’s taxonName field and inserts it into a SPARQL query, which it sends to DBpedia, essentially asking “do you have anything on file which has this binomial name?”

PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?species WHERE {
  ?species dbp:binomial "{MV's taxon name goes here}"@en
}

The result of the query is either a “no” or it’s a link to the species in Wikipedia’s database, which the LOD service can then republish.
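
A minimal sketch of that round trip, using the public DBpedia SPARQL endpoint (the query is the one above; the taxon name below is just an example value):

# Sketch of the DBpedia lookup: substitute the taxon name into the query and
# ask the public SPARQL endpoint for matching species resources.
import requests

QUERY = """
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?species WHERE {{
  ?species dbp:binomial "{taxon}"@en
}}
"""

def dbpedia_species_links(taxon_name):
    response = requests.get(
        "http://dbpedia.org/sparql",
        params={"query": QUERY.format(taxon=taxon_name),
                "format": "application/sparql-results+json"})
    bindings = response.json()["results"]["bindings"]
    return [b["species"]["value"] for b in bindings]

print(dbpedia_species_links("Carcharodon carcharias"))  # an example binomial name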

coming up…

My next post in the series will look at some issues of how the Museum’s data relates to the CIDOC CRM model; where it matches neatly, and where it’s either more, or less specific than the CRM.

District Dispatch: Spectrum, Wi-Fi and LTE-U: Where goes the neighborhood?

planet code4lib - Wed, 2015-09-30 21:39

Modern libraries are dynamic community keystones. People of all ages traverse their stacks, computer labs, makerspaces and common areas throughout the course of a day—creating, working and learning. Many elements combine to make them such valuable assets, but one in particular serves as lifeblood in the digital age: broadband internet connectivity. With the E-rate modernization proceeding at the FCC in the rear view mirror, internet policy wonks within the library community have shifted their attention to the bubbling network neutrality debate. But, there’s at least one more front these brave souls – and all library professionals, for that matter – should monitor: radio spectrum allocation.

The radio spectrum is a series of energy currents. In 1985, a regulatory change made it possible for people and businesses to harness the energy in two different bands of these currents (2.4 GHz and 5.8 GHz) without first obtaining a license or completing an application. The “permissionless,” or, more commonly, “unlicensed” spectrum may barely be 30 years old, but it has already given birth to technologies we use in our daily lives. For the library community, this includes RFID and, perhaps even more importantly, the nearly ubiquitous Wi-Fi that increases library public internet access capacity, mobility, and varied user experiences.

Credit: miniyo73, Flickr

Wi-Fi uses unlicensed spectrum to connect devices within a local area to the internet. Why is this important to note at this particular moment? Wi-Fi may soon have a neighbor. Several wireless industry players support running Long-Term Evolution (LTE) in the unlicensed spectrum – a scenario they call LTE-U. LTE is the technology your smart devices use to connect to the internet and transmit data. Proponents of LTE-U claim it will offer cellular subscribers a more seamless user experience, and will spur innovation in mobile internet technology by making spectrum use more efficient. But opponents flag concerns about interference that could degrade Wi-Fi performance, and note LTE-U was developed outside of longstanding Wi-Fi standards bodies. So, where goes the neighborhood? Does LTE-U promise problems or progress in the mobile connectivity arena?

This was a major question for debate at a panel discussion on the past and future of Wi-Fi hosted by the Congressional Internet Caucus last Friday. Participating on the panel was OITP’s own Larra Clark. She was joined by four other telecom policy experts – Paula Boyd of Microsoft; John Hunter of T-Mobile; David Young of Verizon; and Fred Campbell of the Center for Boundless Innovation in Technology. Although the panelists all celebrated unlicensed spectrum as a tool for innovative uses that add substantial economic value, they sounded varying notes on the virtues of LTE-U. Hunter and Young assured the audience that mobile carriers would work to ensure LTE-U and Wi-Fi fit together as part of a balanced approach to expanding mobile internet connectivity. Boyd was skeptical, pointing out that LTE-U has thus far bypassed international standards bodies. She urged greater vetting and scrutiny.

ALA is keenly aware of the need to ensure that LTE-U plays nicely with Wi-Fi. Clark spoke articulately to the importance of Wi-Fi for library services, for wireless subscribers that use WiFi to contain data plan costs, and for those who rely on free Wi-Fi as their primary online connection. However, she also reflected the conciliatory tone of the discussion when she said that the debate over what to allow into the unlicensed spectrum space should not be seen as “…an ‘either/or’ proposition. Rather,” she noted, “we must protect and advance innovation by ensuring adequate safeguards and continuing conversations among policymakers, innovators, public interest organizations, and industry players on how to maximize the benefit of unlicensed spectrum.”

The ALA Washington Office will continue to lend the library perspective in the ongoing debates and discussions over spectrum policy at the federal level.

The post Spectrum, Wi-Fi and LTE-U: Where goes the neighborhood? appeared first on District Dispatch.

Open Knowledge Foundation: 6 lessons from sharing humanitarian data

planet code4lib - Wed, 2015-09-30 18:56

Cross-posted from

This post is a write-up of the talk I gave at Strata London in May 2015 called “Sharing humanitarian data at the United Nations”. You can find the slides on that page.

The Humanitarian Data Exchange (HDX) is an unusual data hub. It’s made by the UN, and is successfully used by agencies, NGOs, companies, Governments and academics to share data.

They’re doing this during crises such as the Ebola epidemic and the Nepal earthquakes, and every day to build up information in between crises.

There are lots of data hubs which are used by one organisation to publish data, far fewer which are used by lots of organisations to share data. The HDX project did a bunch of things right. What were they?

Here are six lessons…

1) Do good design

HDX started with user needs research. This was expensive, and was immediately worth it because it stopped a large part of the project which wasn’t needed.

The user needs led to design work which has made the website seem simple and beautiful – particularly unusual for something from a large bureaucracy like the UN.


2) Build on existing software

When making a hub for sharing data, there’s no need to make something from scratch. Open Knowledge’s CKAN software is open source, this stuff is a commodity. HDX has developers who modify and improve it for the specific needs of humanitarian data.


3) Use experts

HDX is a great international team – the leader is in New York, most of the developers are in Romania, there’s a data lab in Nairobi. Crucially, they bring in specific outside expertise: frog design do the user research and design work; ScraperWiki, experts in data collaboration, provide operational management.


4) Measure the right things

HDX’s metrics are about both sides of its two sided network. Are users who visit the site actually finding and downloading data they want? Are new organisations joining to share data? They’re avoiding “vanity metrics”, taking inspiration from tech startup concepts like “pirate metrics”.


5) Add features specific to your community

There are endless features you can add to data hubs – most add no value, and end up a cost to maintain. HDX add specific things valuable to its community.

For example, much humanitarian data is in “shape files”, a standard for geographical information. HDX automatically renders a beautiful map of these – essential for users who don’t have ArcGIS, and a good check for those that do.


6) Trust in the data

The early user research showed that trust in the data was vital. For this reason, not just anyone can come along and add data. New organisations have to apply, proving either that they’re known in humanitarian circles or that they have quality data to share. Applications are checked by hand. It’s important to get this kind of balance right – being too ideologically open or closed doesn’t work.



The detail of how a data sharing project is run really matters. Most data in organisations gets lost, left in spreadsheets on dying file shares. We hope more businesses and Governments will build a good culture of sharing data in their industries, just as HDX is building one for humanitarian data.

SearchHub: Lasso Some Prizes by Stumping The Chump in Austin Texas

planet code4lib - Wed, 2015-09-30 18:53

Professional Rodeo riders typically only have a few seconds to prove themselves and win big prizes. But you’ve still got two whole weeks to prove you can Stump The Chump with your tough Lucene/Solr questions, and earn both bragging rights and one of these prizes…

  • 1st Prize: $100 Amazon gift certificate
  • 2nd Prize: $50 Amazon gift certificate
  • 3rd Prize: $25 Amazon gift certificate

You don’t have to know how to rope a steer to win, just check out the session information page for details on how to submit your questions. Even if you can’t make it to Austin to attend the conference, you can still participate — and do your part to humiliate me — by submitting your questions.

To keep up with all the “Chump” related info, you can subscribe to this blog (or just the “Chump” tag).

The post Lasso Some Prizes by Stumping The Chump in Austin Texas appeared first on Lucidworks.

HangingTogether: Services built on usage metrics

planet code4lib - Wed, 2015-09-30 17:39

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Corey Harper of New York University and Stephen Hearn of University of Minnesota. They had posited that in an environment more oriented toward search than toward browse indexing, new kinds of services will rely on non-bibliographic data, usage metrics and data analysis techniques. Metrics can be about usage data—such as how frequently items have been borrowed, cited, downloaded or requested—or about bibliographic data—such as where, how and how often search terms appear in the bibliographic record. Some kinds of use data are best collected on a larger scale than most catalogs provide.

These usage metrics could be used to build a wide range of library services and activities. Among the possible services noted: collection management, identifying materials for offsite storage, deciding which subscriptions to maintain, comparing citations for researchers’ publications with what the library is not purchasing; improving relevancy ranking, personalizing search results, offering recommendation services, measuring impact of library usage on research or student success. What if libraries emulated Amazon with “People who accessed <this title> also accessed <these titles>” or “People in the same course as you are accessing <these titles>”?

Harvard Innovation Lab’s StackLife aggregates such usage data of library titles as number of check-outs (broken down by faculty, undergraduates and graduate students, with faculty checkouts weighted differently), number of ILL requests, and frequency the title is placed in course reserves, and then assigns a “Stack Score” for each title. A search on a subject then displays a heat map graphic with the higher scores shown in darker hues, as shown in the accompanying graphic, and can serve as a type of recommender service. The StackLife example inspired other suggestions for possible services, such as aggregating the holdings and circulation data across multiple institutions—or even across countries—with Amazon sales data, and weighting scores if the author was affiliated with the institution. A recent Pew study found that personal recommendations dominated book recommendations. Could libraries capture and aggregate faculty and student recommendations mentioned in blogs and tweets?
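
As a purely hypothetical sketch of the weighting idea (the categories and weights below are invented for illustration and are not StackLife’s actual formula):

# Hypothetical weighted-usage score for a title; the weights are illustrative only.
WEIGHTS = {
    "faculty_checkouts": 3.0,    # faculty checkouts weighted differently (direction assumed)
    "grad_checkouts": 2.0,
    "undergrad_checkouts": 1.0,
    "ill_requests": 2.0,
    "course_reserves": 4.0,
}

def stack_score(usage_counts):
    # usage_counts: dict of counts keyed by the categories above
    return sum(weight * usage_counts.get(category, 0)
               for category, weight in WEIGHTS.items())

print(stack_score({"faculty_checkouts": 2, "undergrad_checkouts": 10, "course_reserves": 1}))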

The University of Minnesota conducted a study[i] to investigate the relationships between first-year undergraduate students’ use of the academic library, academic achievement, and retention. The results suggested a strong correlation between using academic library services and resources—particularly database logins, book loans, electronic journal logins, and library workstation logins— and higher grade point averages.

Some of the challenges raised in the focus group discussions included:

Difficulties in analyzing usage data: The different systems and databases libraries have present challenges in both gathering and analyzing the data. A number of focus group members are interested in visualizing usage data, and at least a couple are using Tableau to do so. Libraries have, with difficulty, harvested citations and measured which titles are available in their repositories, but it is even more difficult to demonstrate which resources would not have been available without the library. The variety of resources also means that the people who analyze the data are scattered across the campus in different functional areas. Interpreting Google analytics to determine patterns of usage over time and the effect of curricula changes is particularly difficult.

Aggregating usage data across campus: Tools that allow selectors to choose titles to send to remote storage by circulation data and classification range (to assess the impact on a particular area of stacks) can be hampered when storage facilities use a different integrated library system.

Anonymizing data to protect privacy: Aggregating metrics across institutions may help anonymize data but hinders analysis of performance at an individual institution. Anonymizing data may also prevent usage metrics by demographics (e.g., professors vs. grad students vs. undergraduates). Even when demographic data is captured as part of campus log-ins, libraries cannot know the demographics of people accessing their resources who are not affiliated with their institution.

Difficulties in correlating library use with academic performance or impact: Some focus group members questioned whether it was even possible to correlate library use with academic performance. (“Are we busting our heads to collect something that doesn’t tell us anything?”) On the other hand, we can at least start making some decisions based on the data we do have, and perhaps libraries’ concern with being “scientific” is not warranted.

Data outside the library control: Much usage data lies outside the library control (for example, Google Scholar and Elsevier). Only vendors have access to electronic database logs. Relevance ranking for electronic resources licensed from vendors is a “black box”.

Inconsistent metadata: Inconsistent metadata can dilute the reliability of usage statistics. Examples cited included: the same author represented in multiple ways; varying metadata due to changes in cataloging rules over time; different romanization schemes used for non-Latin script materials.  The biggest issue is that most libraries’ metadata comes from external sources and thus the library has no control over its quality. The low quality of metadata for e-resources from some vendors remains a common issue; misplaced identifiers for ebooks was cited as a serious problem. Focus group members have pointed vendors to the OCLC cross-industry white paper, Success Strategies for Electronic Content Discovery and Access without much success. Threats to cancel a subscription unless the metadata improves prove empty when their selectors object. Libraries do some bulk editing of the metadata, for example: reconciling name forms with the LC name authority file (or outsource this work); adding language and format codes in the fixed fields. The best sign of a “reliable vendor” is that they get their metadata from OCLC. It’s important for vendors to view metadata as a “community property.”

[i] Krista M. Soria, Jan Fransen, Shane Nackerud.  Stacks, Serials, Search Engines, and Students’ Success: First-Year Undergraduate Students’ Library Use, Academic Achievement, and Retention. Journal of Academic Librarianship, 40 (2014), 84-91. doi:10.1016/j.acalib.2013.12.002


About Karen Smith-Yoshimura

Karen Smith-Yoshimura, program officer, works on topics related to renovating descriptive and organizing practices with a focus on large research libraries and area studies requirements.


Jonathan Rochkind: Just curious: Do you think there is a market for additional Rails contractors for libraries?

planet code4lib - Wed, 2015-09-30 14:54

Fellow library tech people and other library people who read this blog, what do you think?

Are there libraries who would be interested in hiring a Rails contractor/consultant to do work for them, of any kind?

I know Data Curation Experts does a great job with what they do — do you think there is work for more than just them, whether on Blacklight/Hydra or other Rails?

Any sense of it, from where you work or what you’ve heard?

I’m just curious, thinking about some things.

Filed under: General

District Dispatch: ALA Urges Passage of Digital Learning Equity Act

planet code4lib - Wed, 2015-09-30 14:02

Digital section of the Martin Luther King, Jr.,
Memorial Library, Washington, D.C.

ALA is urging passage of The Digital Learning Equity Act of 2015, H.R. 3582, which was introduced in the U.S. House of Representatives last week by Congressman Peter Welch (D-VT) and co-sponsored by David McKinley (R-WV) with the support of ALA and the education community.

ALA President Sari Feldman issued a statement applauding Reps. Welch and McKinley for co-sponsoring H.R. 3582 and warning that: “Students in every classroom and every corner of the nation need Congress to close the homework gap. ALA urges Congress to quickly pass and send the Digital Learning Equity Act to the President.”

The legislation addresses the growing digital divide and learning gaps between students with and without access to the Internet at home. Increasingly, students find it necessary to complete homework utilizing the Internet. Many students gather at public libraries after school, gain access before school or at lunch, or simply go without access often resulting in these students falling behind. Allowing students access to laptops only addresses part of the digital divide—if the student cannot access the Internet, they cannot do their homework research.

H.R. 3582 recognizes that at-home access is critical to homework completion, authorizes an innovative grant program for schools to promote student access, prioritizes rural and high-density, low-income schools, and requires the FCC to study the growing homework gap. ALA signed a joint letter with several education representatives to support H.R. 3582 and urge its quick passage.

The legislation recognizes that libraries can provide access tools, on-line tools, as well as provide research and guidance for students complementing the work of teachers and school librarians. As noted in the September/October issue of American Libraries, libraries are quickly moving to provide tools for greater Internet access. Libraries across the country, including New York, Kansas City, San Mateo County (CA), Chicago and Washington County, Maine, are already allowing patrons to check-out mobile Wi-Fi hotspots.

Additional benefits of increased access include providing parents opportunities to complete their education, obtain certifications, and apply for jobs.

The Senate companion, S. 1606, was introduced in the Senate this past June by Senator Angus King (I-ME) and is being considered by the Senate Health, Education, Labor and Pensions Committee. H.R. 3582 was referred to the House Education and Workforce Committee.

ALA will continue to support efforts to broaden access to the Internet and calls on Congress to quickly pass the Digital Learning Equity Act.

The post ALA Urges Passage of Digital Learning Equity Act appeared first on District Dispatch.

LITA: Creative Commons Crash Course, a LITA webinar

planet code4lib - Wed, 2015-09-30 14:00

Attend this interesting and useful LITA webinar:

Creative Commons Crash Course

Wednesday, October 7, 2015
1:30 pm – 3:00 pm Central Time
Register Online, page arranged by session date (login required)

Since the first versions were released in 2002, Creative Commons licenses have become an important part of the copyright landscape, particularly for organizations that are interested in freely sharing information and materials. Participants in this 90 minute webinar will learn about the current Creative Commons licenses and how they relate to copyright law.

This webinar will follow up on Carli Spina’s highly popular Ignite Session at the 2015 ALA Mid Winter conference. Carli will explain how to find materials that are Creative Commons-licensed, how to appropriately use such items and how to apply Creative Commons licenses to newly created materials. It will also include demonstrations of some important tools that make use of Creative Commons-licensed media. This program will be useful for librarians interested in instituting a Creative Commons licensing policy at their institutions, as well as those who are interested in finding free media for use in library materials.

Carli Spina

Is the Emerging Technologies and Research Librarian at the Harvard Law School Library. There she is responsible for teaching research and technology classes, as well as working on technology projects and creating online learning objects. She has presented both online and in-person on copyright and technology topics. Carli also offers copyright training and assistance to patrons and staff and maintains a guide to finding and understanding Creative Commons and public domain materials. Prior to becoming a librarian, she worked as an attorney at an international law firm. You can find more information about her work, publications, and presentations at

Register for the Webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.


  • LITA Member: $45
  • Non-Member: $105
  • Group: $196

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

LITA: To tweet or not to tweet: scholarly engagement and Twitter

planet code4lib - Wed, 2015-09-30 14:00
by Colleen Simon, via Flickr

I’ve been thinking a lot about scholarly engagement on Twitter lately, especially after reading Bonnie Stewart‘s latest blog post, “The morning after we all became social media gurus.” Based on her research and writing for her thesis, she weighs exactly what we as academic librarians and LIS professionals are getting out of digital scholarly engagement and how we measure that influence in terms of metrics. I’d like to unpack this topic a bit and open it up to a wider reader discussion in the comments section, after the jump!

Debating the merits of networked scholarship via Twitter is a topic that has been bouncing around in journal articles and blog posts for the past 8 years or so. But as notions of scholarly publication and information dissemination change, it’s worth returning to this topic in order to assess how our presence as academic librarians and LIS professionals is changing as well. Addressing social media training in her blog post, Stewart poses the question, “Are the workshops helping…or just making people feel pressured to Do Another Thing in a profession currently swamped by exhortations to do, show, and justify?” I am both a lizard person Millennial and early-career librarian, so navigating through Twitter is easy for me, but not in the sense of establishing myself professionally. I feel that I’ve only just gotten the hang of professional information dissemination, and am learning more every day about how what we as information specialists tweet out reaches others in our community and what we get back from that.

But how do we understand and frame the practical benefits of digital and networked scholarship through Twitter specifically? The number of times a single tweet is cited? How many followers, retweets, or favorites a professional has?

The pros of using Twitter as a form of scholarly networking are very clear to me – being able to contribute to the conversation in one’s field, creating new professional connections, and having an open venue in which to speak on scholarly matters – to name a few.

But the more tangential aspects are where it gets a lot grayer for me. How do we view ownership of tweets and replies to tweets? Does the citation of a viral tweet hold as much weight as a citation to an article published in a scholarly journal? How do we weigh the importance of scholarly tweets when we sometimes have to parse them out between the pictures of our pets being adorable? (I mean personally I see them as being equally important.)

This is all to say that if and/or when Twitter and other social media venues become a default environment for digital scholarship, should there be more of an effort to incorporate social media and networked scholarship as the norm that all “successful” LIS professionals should be doing, or is this just another signifier of the DIY-style of skill-building that librarianship is experiencing today as Erin Leach has written in her blog post? Should academic institutions be providing more workshops to train and guide professionals to use Twitter as professional development? What is the mark of a truly connected scholar and information specialist? I have a lot of questions.

I’ll round out my post with a quote from Jesse Stommel from his presentation New-form scholarship and the public digital humanities: “It isn’t that a single tweet constitutes scholarship, although in rare cases one might, but rather that Twitter and participatory media more broadly disperses the locus of scholarship, making the work less about scholarly products and more about community presence and engagement.” Community presence and engagement are such important factors in how I see academic librarians, LIS professionals, and information specialists using Twitter and connecting in the field.

So to open this up to you the readers, how do you measure your digital identity as a scholar or professional? How much weight do you give to digital networked scholarship? 

Ed Summers: Seminar Week 4

planet code4lib - Wed, 2015-09-30 04:00

In this week’s seminar we left the discussion of information and began looking at the theory of design writ large, with a few focused readings, and a lecture from Professor Findlater. One of the key things I took from the lecture was the distinction between User Experience and Usability. The usability of an object speaks to its functionality, and how easy it is for people to use it. User Experience on the other hand is more of an affective measure of how users perceive or think about the device, which may not always be congruent with its usability. It’s an interesting distinction, which I wasn’t really conscious of before.

Speaking of distinctions, we spent a fair bit of time talking about the first chapter in Don Norman’s The Design of Everyday Things. Norman is well known for popularizing the term affordance in the HCI community. He borrowed the term from the psychologist James Gibson. We read from Norman’s revised edition from 2013–the original was published in 1988. In chapter 1 it seems that Norman has a bit of an axe to grind because of how affordance had been used in the literature to denote a relation between an object and an organism, as well as a perceived affordance or what he now calls a signifier. This might seem like splitting hairs a bit, and perhaps a bit quixotic after the term affordance has been out in the wild for 25 years–but for me it made some sense. I know everyone didn’t but I appreciated his sometimes acerbic tone, especially when he was berating General Electric for its continuously flawed refrigerator controls.

A good example came up in class of a student who recently moved and needed to buy a couch. She is tall, and often likes sleeping on a couch. So she was looking for a couch that would be easy to sleep on. Specifically she wanted a couch with arm rests that could also serve as pillows, because she is tall. For example compare these two couches:

The second couch affords being used for sleeping by a tall person. The affordance is a specific relation between the tall person and the couch, not a specific feature of the couch. The distinction that Norman is making here is that the affordance is different from the perception of the affordance, or signifier, which in this case is the type of arm rest.

Being able to distinguish between the relation between the object and the person, and the sign of the relationship seems important–particularly in cases where the affordance exists but is not known, or when it appears that the affordance exists, but it actually does not. I’m thinking of controls in a user interface that appear to do one thing, but do something else, or do that expected thing as well as something unexpected. I can see why HCI people would have reacted negatively to Norman telling them they were using the term wrong. But since I’m new to the field I don’t have that baggage I guess :-)

A few other things that came up in discussion that I will note down here to follow up on at some point are:

Writing Workshop

In the second part of the class we had a writing workshop where we discussed writing we liked, writing strategies (since we’ll be doing lots of that in the PhD program), as well as ways to get started writing when you are stuck.

We all brought in examples of writing we liked, and talked briefly about them. I brought in Jackson & Barbrow (2015) as an example of what I thought was a well written paper. I like it because it is strongly grounded in ethnographic research (admittedly something I want to learn more about), and discusses a topic area that I am interested in: ecology, measurement and standards. I thought the paper did a good job of studying behavior around standards at different scales: the individual team of researchers out in the field, and the large scale national collection of data. Jackson included numerous quotes from individuals he observed during the study, which added additional authentic voices to the paper. He also quoted Borges in a useful/artful way. I thought it was also very interesting how standards were presented as things that we should consider as a design factor in Human Computer Interaction. The very idea that seemingly invisible and dull things like standards (how they live, and what they omit) could be useful in design is a fascinating idea to me. I liked the illustration that while measuring the environment seems like a precise science, it has a human/social component to it that is extremely important. Finally, I’ll admit it: I brought the paper because it won an honorable mention best paper award at CHI – and I’m a fan of Jackson’s work.

Writing Strategies

After we all talked about our favorite papers we discussed writing strategies or techniques to be aware of:

  • illustrations are important, captions are important
  • formulas as visualization
  • subheadings to make things easier for the reader
  • start a section with the main point, so people can find their way back
  • template, protocol for experimentation: group study
  • goal is sometimes to help people replicate
  • need a clear method: this is a must
  • need to be able to justify the decision
  • supplementary information (datasets, interview transcripts, coding, etc)
  • talk about failure
  • made clear real world implications
  • in first paragraph that the research is important
  • reflection on findings
Ways to get started

We also talked about ways to get started writing when it is difficult to get going:

  • look for similar papers: content & venue
  • keep track of good examples of papers
  • may find some that work as templates
  • talk about the paper, get feedback all along the way ; talk to co-authors
  • talk to the people that you are citing
  • start with what’s easiest: sometimes lit review, or methods, depending
  • what is the story: what do you want people to remember, how do they get highlighted ; the turns
  • where do you like writing?
  • find your voice, style, etc.
  • give yourself time
  • the time of day is important
  • stretches of time are important (take breaks)
  • plan/outline
  • be clear and concise
  • use evidence

Here are my random notes I took while doing the readings for this week. They haven’t been cleaned up much. But I thought I’d keep them here just in case I need to remember something later.

When doing the readings we were asked to pay particular attention to these questions:

  • What problem or research question does it address? Is that an important problem/question?
  • What is the research contribution?
  • Is the overall research approach used (e.g., lit survey, interview study, experiment) appropriate?
  • Are there any threats to the validity of the research? For example, the sampling method was for the researcher to ask all their friends to participate (let’s hope we don’t see this!).
  • Is the research properly grounded in past work?
  • Are there any presentation issues?

Norman (2013), Chapter 1

  • Psychopathology: the scientific study of mental disorders, and their causes (potentially in the environment)
  • Is it possible to learn how a product is supposed to be used?
  • “All artificial things are designed.”
  • Types: Industrial, Interaction, Experience (last two seem to blend a bit)
  • Rules followed by a machine are only known by its designers – and sometimes not even then (Nick’s work on Algorithmic Accountability)
  • The importance of blaming machines for the problems, not the person. Human centered.
  • “If only people would read the instructions everything would be all right.” Haha: RTFM.
  • Wow, he studied the usability of a nuclear reactor and Three Mile Island. No pressure!
  • Need to design for the way people are not the way you (the designer) want them to be.
  • In Human Centered Design (HCD) focus on breakdown, not on when things are going right. Winograd & Flores (Understanding Computers and Cognition).
  • HCD cuts across the three types of design.
  • Affordances: relationship between a physical object and a person (or agent). An affordance is not a property of the object, it is a relationship.
    • J J Gibson came up with the word.
    • Some are perceivable, and some are invisible.
  • Norman introduces: affordances, signifiers, constraints, mappings, feedback, and conceptual model.
  • Signifiers are signals that draw attention to affordances, or make them visible.
    • They can be accidental or unintentional. Markers of use (a path through the woods)
    • Some signifiers are the perceived affordances, useful in cases where the affordance isn’t easily perceived.
    • They must be perceivable, or else they are failing to function.
    • Some perceived affordances (signifiers) can appear real, but are not. (used in games, and other places to illustrate constraints?)
    • If signifiers are signals for affordances, I guess affordances are the signified?
  • Mappings: a relationship between the control and the result.
    • some are natural, some feel natural (but in fact are not)
  • Feedback communicates the results of an action: there can be too little feedback, and too much feedback (backseat driver).
  • Conceptual models are explanations of how something works. Often they are simplified, and there are more than one. They are often inferred from using a device.
  • System Image: the bundle of all these things – akin to an actual brand?
  • Hardest part of design is getting people to abandon their disciplinary viewpoint and to think about the person who is using the device (antidisciplinarity)

Crystal & Ellington (2004)

  • This paper is a review of the literature on the topic of task analysis, with a view to the future, so not so much a formal research study really.
  • It seems like a pretty thorough review, but wondering if it is a regurgitation of Kuutti and Bannon (1991) that is mentioned in the conclusion. Although I guess Crystal reviews content past 1991, so maybe it’s an update of sorts?
  • It’s kind of amusing how they use their own task analysis breakdown to present the illustration of the field. I guess this would be a conceptual model (CTA)?
  • My impression was that more could’ve been done to define notion of usability which is thrown in at the end.
  • I liked how they wanted to compose and integrate the research on task-analysis, and not validate a particular viewpoint, but show the options that are available, and suggest cross-pollination between the models. Not sure if this is inter-disciplinary or not.

Oulasvirta, Kurvinen, & Kankainen (2003)

  • conclusions are inconclusive
  • hypotheses are introduced too late
  • good grounding in the literature
  • they seemed to have two different variables at play: documentation provided and environment
  • admittedly limited analysis of the design documents

This study is built upon a useful foundation of existing research on bodystorming, and seems to provide a useful introduction to the concept. It also usefully highlights through concrete examples how bodystorming is different from brainstorming. The goal of the study seems to be to explore two hypotheses, which are introduced at the end of the paper (I think they should’ve been included earlier):

  1. to speed up the design process
  2. to increase awareness of contextual factors in design

The authors mentioned a third hypothesis, which was to evaluate whether bodystorming on site provided immediate feedback about design ideas. But to me this seemed like a very minor variation on the speed of the design process.

They attempted to study these questions by analyzing the designs generated by 4 different case studies where bodystorming was used and a more traditional brainstorming case study. The setting of the bodystorming was varied: on site, similar site, office space, office space with acting. It also seemed like different types of documentation (user stories and design questions) were used in each scenario. This seemed to be changing more than one variable, and complicating the ability to draw conclusions. The authors mention that the results were somewhat complicated because designs from one of the bodystorming sessions seemed to inform other sessions. This was strange because they mention elsewhere that the case studies included different participants; but they couldn’t be all different if this sort of learning took place?

The findings themselves were inconclusive, and admittedly somewhat shallow. Although some of the anecdotes regarding site accessibility, level of exhaustion, inspiration and memorability seem like they would be fruitful to explore in a more controlled manner. It felt like this study was trying to do an experiment, but really did a much better job of presenting the ideas of bodystorming in the context of the literature, and providing a useful set of case studies to delineate how it could be used.


Crystal, A., & Ellington, B. (2004). Task analysis and human-computer interaction: Approaches, techniques, and levels of analysis. AMCIS 2004 Proceedings, 391.

Jackson, S. J., & Barbrow, S. (2015). Standards and/as innovation: Protocols, creativity, and interactive systems development in ecology. In Proceedings of CHI 2015. Association for Computing Machinery.

Norman, D. A. (2013). The design of everyday things: Revised and expanded edition. Basic Books.

Oulasvirta, A., Kurvinen, E., & Kankainen, T. (2003). Understanding contexts by being there: Case studies in bodystorming. Personal and Ubiquitous Computing, 7(2), 125–134.

DuraSpace News: Tell a DSpace Story to Introduce People, Ideas and Innovation

planet code4lib - Wed, 2015-09-30 00:00

The Telling DSpace Stories work group got underway this fall. The goal is to introduce DSpace community members and the work they are doing by sharing each other’s stories. The first five stories are now available to answer questions about how others have implemented DSpace at several types of institutions in different parts of the world:

LITA: It’s a Brave New Workplace

planet code4lib - Tue, 2015-09-29 20:08

LITA Blog Readers, I’ve got a new job. For the past month I’ve been getting my sea legs at the University of Houston’s M.D. Anderson Library. As CORC (Coordinator of Online Resources and Collections), my job is supporting data-driven collection decisions and processes. I know, it’s way cool.

M.D. Anderson – Ain’t she a beaut?

I have come to realize that the most challenging aspect of adapting to a new workplace may well be learning new technologies and  adjusting to familiar technologies used in slightly different ways. I’m text mining my own notes for clues and asking a ton of questions, but switching from Trello to Basecamp has been rough.

No, let’s be honest, the most challenging thing has been  navigating the throngs of undergrads on a crowded campus. Before working remotely for years, I worked at small nonprofits, graduated from a teeny, tiny liberal arts college, and grew up in a not-big Midwestern town. You may notice a theme.

No worries, I’m doing fine. The tech is with me.

In upcoming installments of Brave New Workplace I’ll share methods for organization, prioritization, acculturation, and technology adaptation in a new workplace. While I’ll focus on library technologies and applications, I’ll also be turning a tech-focused approach to workplace culture questions. Spoiler alert: I’m going to encourage you to build your own CRM for your coworkers and their technology habits. Be prepared.

And stay tuned! Brave New Workplace will return on October 16th.




SearchHub: How StubHub De-Dupes with Apache Solr

planet code4lib - Tue, 2015-09-29 19:07
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting StubHub engineer Neeraj Jain’s session on de-duping in Solr.

StubHub handles a large number of events and related documents. Use of Solr within StubHub has grown from search for events/tickets to content ingestion. One of the major challenges faced in content ingestion systems is detecting and removing duplicates without compromising on quality and performance. We present a solution that involves spatial searching, a custom update handler, a custom geodist function, etc., to solve the de-duplication problem. In this talk, we’ll present design and implementation details of the custom modules and APIs and discuss some of the challenges that we faced and how we overcame them. We’ll also present a comparison analysis between the old and the new systems used for de-duplication.

Neeraj Jain is an engineer working with StubHub Inc in San Francisco. He has a special interest in the search domain and has been working with Solr for over 4 years. He also has an interest in mobile app development; he works as a freelancer and has applications on the Google Play store and iTunes store that are built using Solr. Neeraj has a Masters in Technology degree from the Indian Institute of Technology, Kharagpur.

Deduplication Using Solr: Presented by Neeraj Jain, Stubhub from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How StubHub De-Dupes with Apache Solr appeared first on Lucidworks.

FOSS4Lib Recent Releases: ArchivesSpace - 1.4.0

planet code4lib - Tue, 2015-09-29 18:07

Last updated September 29, 2015. Created by Peter Murray on September 29, 2015.

Package: ArchivesSpace
Release Date: Monday, September 28, 2015

