You are here

Feed aggregator

Open Library Data Additions: Amazon Crawl: part dn

planet code4lib - Mon, 2016-03-28 20:54

Part dn of the Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

LITA: Universal Design for Libraries and Librarians, an important LITA web course

planet code4lib - Mon, 2016-03-28 19:01

Consider this important new LITA web course:
Universal Design for Libraries and Librarians

Instructors: Jessica Olin, Director of the Library, Robert H. Parker Library, Wesley College; and Holly Mabry, Digital Services Librarian, Gardner-Webb University

Offered: April 11 – May 27, 2016
A Moodle-based web course with asynchronous weekly content lessons, tutorials, assignments, and group discussions.

Register Online, page arranged by session date (login required)

Universal Design is the idea of designing products, places, and experiences to make them accessible to as broad a spectrum of people as possible, without requiring special modifications or adaptations. This course will present an overview of universal design as a historical movement, as a philosophy, and as an applicable set of tools. Students will learn about the diversity of experiences and capabilities that people have, including disabilities (e.g. physical, learning, cognitive, resulting from age and/or accident), cultural backgrounds, and other abilities. The class will also give students the opportunity to redesign specific products or environments to make them more universally accessible and usable.


By the end of this class, students will be able to…

  • Articulate the ethical, philosophical, and practical aspects of Universal Design as a method and movement – both in general and as it relates to their specific work and life circumstances
  • Demonstrate the specific pedagogical, ethical, and customer service benefits of using Universal Design principles to develop and recreate library spaces and services in order to make them more broadly accessible
  • Integrate the ideals and practicalities of Universal Design into library spaces and services via a continuous critique and evaluation cycle

Here’s the Course Page

Jessica Olin

Jessica Olin is the Director of the Library, Robert H. Parker Library, Wesley College. Ms. Olin received her MLIS from Simmons College in 2003 and an MAEd, with a concentration in Adult Education, from Touro University International. Her first position in higher education was at Landmark College, a college that is specifically geared to meeting the unique needs of people with learning differences. While at Landmark, Ms. Olin learned about the ethical, theoretical, and practical aspects of universal design. She has since taught an undergraduate course on the subject for both the education and entrepreneurship departments at Hiram College.

Holly Mabry

Holly Mabry received her MLIS from UNC-Greensboro in 2009. She is currently the Digital Services Librarian at Gardner-Webb University where she manages the university’s institutional repository, and teaches the library’s for-credit online research skills course. She also works for an international virtual reference service called Chatstaff. Since finishing her MLIS, she has done several presentations at local and national library conferences on implementing universal design in libraries with a focus on accessibility for patrons with disabilities.


February 29 – March 31, 2016


  • LITA Member: $135
  • ALA Member: $195
  • Non-member: $260

Technical Requirements:

Moodle login info will be sent to registrants the week prior to the start date. The Moodle-developed course site will include weekly new content lessons and is composed of self-paced modules with facilitated interaction led by the instructor. Students regularly use the forum and chat room functions to facilitate their class participation. The course web site will be open for 1 week prior to the start date for students to have access to Moodle instructions and set their browser correctly. The course site will remain open for 90 days after the end date for students to refer back to course material.

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty.

Mark E. Phillips: Beginning to look at the description field in the DPLA

planet code4lib - Mon, 2016-03-28 15:00

Last year I took a look at the subject field and the date fields in the Digital Public Library of America (DPLA).  This time around I wanted to begin looking at the description field and see what I could see.

Before diving into the analysis, I think it is important to take a look at a few things. First off, when you reference the DPLA Metadata Application Profile v4, you may notice that the description field is not a required field; in fact, the field doesn’t show up in APPENDIX B: REQUIRED, REQUIRED IF AVAILABLE, AND RECOMMENDED PROPERTIES. From that you can assume that this field is very optional. Also, the description field, when present, is often used to communicate a variety of information to the user. The DPLA data has examples that are clearly rights statements, notes, physical descriptions of the item, content descriptions of the item, and in some instances a place to store identifiers or names. Of all of the fields that one will come into contact with in the DPLA dataset, I would imagine that the description field is probably one of the ones with the highest variability of content. So with that giant caveat, let’s get started.

So on to the data.

The DPLA makes available a data dump of the metadata in their system. Last year I was analyzing just over 8 million records; this year the collection has grown to more than 11 million records (11,654,800 in the dataset I’m using).

The first thing that I had to accomplish was to pull out just the descriptions from the full JSON dataset that I downloaded. I was interested in three values for each record: the Provider or “Hub”, the DPLA identifier for the item, and finally the description fields. I finally took the time to look at jq, which made this pretty easy.

For those that are interested here is what I came up with to extract the data I wanted.

zcat all.json.gz | jq -nc --stream 'fromstream(1|truncate_stream(inputs)) | {provider: ._source.provider["@id"], id: ._id, descriptions: ._source.sourceResource.description?}'

This results in output that looks like this:

{"provider":"","id":"4fce5c56d60170c685f1dc4ae8fb04bf","descriptions":["Lang: Charles Aikin Collection"]}
{"provider":"","id":"bca3f20535ed74edb20df6c738184a84","descriptions":["Lang: Maire, graveur."]}
{"provider":"","id":"76ceb3f9105098f69809b47aacd4e4e0","descriptions":null}
{"provider":"","id":"88c69f6d29b5dd37e912f7f0660c67c6","descriptions":null}

From there my plan was to write some short python scripts that can read a line, convert it from json into a python object and then do programmy stuff with it.
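As a minimal sketch of that plan (handling of a single-string description value is my assumption, not something stated in the post), the per-line parsing might look like:

```python
import json

def parse_line(line):
    """Parse one line of jq output into (provider, id, description count)."""
    record = json.loads(line)
    descriptions = record.get("descriptions") or []
    # Some records may carry a single string rather than a list of descriptions.
    if isinstance(descriptions, str):
        descriptions = [descriptions]
    return record["provider"], record["id"], len(descriptions)

sample = '{"provider":"","id":"4fce5c56d60170c685f1dc4ae8fb04bf","descriptions":["Lang: Charles Aikin Collection"]}'
print(parse_line(sample))
```

From a parsed stream like this, tallying counts per provider is a simple dictionary accumulation.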

Who has what?

After parsing the data a bit I wanted to remind myself of the spread of the data in the DPLA collection.  There is a page on the DPLA’s site that shows you how many records have been contributed by which Hub in the network.  This is helpful but I wanted to draw a bar graph to give a visual representation of this data.

DPLA Partner Records

As has been the case since it was added, HathiTrust is the biggest provider of records to the DPLA, with over 2.4 million records. Pretty amazing!

There are three other Hubs/Providers that contribute over 1 million records each: the Smithsonian, New York Public Library, and the University of Southern California Libraries. Below those, there are three more that contribute over half a million records: Mountain West Digital Library, the National Archives and Records Administration (NARA), and The Portal to Texas History.

There were 11,410 records (coded as undefined_provider) that are not currently associated with a Hub/Provider, probably due to a data conversion error somewhere in the record ingest pipeline.

Which have descriptions?

After the reminder about the size and shape of the Hubs/Providers in the DPLA dataset, we can dive right into the data and quickly see how well represented the description field is.

We can start off with another graph.

Percent of Hubs/Providers with and without descriptions

You can see that some of the Hubs/Providers have very few records (< 2%) with descriptions (Kentucky Digital Library, NARA) while others had a very high percentage (> 95%) of records with description fields present (David Rumsey, Digital Commonwealth, Digital Library of Georgia, J. Paul Getty Trust, Government Publishing Office, The Portal to Texas History, Tennessee Digital Library, and the University of Illinois at Urbana-Champaign).

Below is a full breakdown for each Hub/Provider showing how many and what percentage of the records have zero descriptions, or one or more descriptions.

Provider                      Records     0 Descriptions  1+ Descriptions  0 Descriptions %  1+ Descriptions %
artstor                       107,665     40,851          66,814           37.94%            62.06%
bhl                           123,472     64,928          58,544           52.59%            47.41%
cdl                           312,573     80,450          232,123          25.74%            74.26%
david_rumsey                  65,244      168             65,076           0.26%             99.74%
digital-commonwealth          222,102     8,932           213,170          4.02%             95.98%
digitalnc                     281,087     70,583          210,504          25.11%            74.89%
esdn                          197,396     48,660          148,736          24.65%            75.35%
georgia                       373,083     9,344           363,739          2.50%             97.50%
getty                         95,908      229             95,679           0.24%             99.76%
gpo                           158,228     207             158,021          0.13%             99.87%
harvard                       14,112      3,106           11,006           22.01%            77.99%
hathitrust                    2,474,530   1,068,159       1,406,371        43.17%            56.83%
indiana                       62,695      18,819          43,876           30.02%            69.98%
internet_archive              212,902     40,877          172,025          19.20%            80.80%
kdl                           144,202     142,268         1,934            98.66%            1.34%
mdl                           483,086     44,989          438,097          9.31%             90.69%
missouri-hub                  144,424     17,808          126,616          12.33%            87.67%
mwdl                          932,808     57,899          874,909          6.21%             93.79%
nara                          700,948     692,759         8,189            98.83%            1.17%
nypl                          1,170,436   775,361         395,075          66.25%            33.75%
scdl                          159,092     33,036          126,056          20.77%            79.23%
smithsonian                   1,250,705   68,871          1,181,834        5.51%             94.49%
the_portal_to_texas_history   649,276     125             649,151          0.02%             99.98%
tn                            151,334     2,463           148,871          1.63%             98.37%
uiuc                          18,231      127             18,104           0.70%             99.30%
undefined_provider            11,422      11,410          12               99.89%            0.11%
usc                           1,065,641   852,076         213,565          79.96%            20.04%
virginia                      30,174      21,081          9,093            69.86%            30.14%
washington                    42,024      8,838           33,186           21.03%            78.97%

With so many of the Hub/Providers having a high percentage of records with descriptions, I was curious about the overall records in the DPLA.  Below is a pie chart that shows you what I found.

DPLA records with and without descriptions

Almost 2/3 of the records in the DPLA have at least one description field. This is more than I would have expected for an un-required, un-recommended field, but I think it is probably a good thing.

Descriptions per record

The final thing I wanted to look at in this post was the average number of description fields for each of the Hubs/Providers.  This time we will start off with the data table below.

Provider                      Records     min  median  max  mean  stddev
artstor                       107,665     0    1       5    0.82  0.84
bhl                           123,472     0    0       1    0.47  0.50
cdl                           312,573     0    1       10   1.55  1.46
david_rumsey                  65,244      0    3       4    2.55  0.80
digital-commonwealth          222,102     0    2       17   2.01  1.15
digitalnc                     281,087     0    1       19   0.86  0.67
esdn                          197,396     0    1       1    0.75  0.43
georgia                       373,083     0    2       98   2.32  1.56
getty                         95,908      0    2       25   2.75  2.59
gpo                           158,228     0    4       65   4.37  2.53
harvard                       14,112      0    1       11   1.46  1.24
hathitrust                    2,474,530   0    1       77   1.22  1.57
indiana                       62,695      0    1       98   0.91  1.21
internet_archive              212,902     0    2       35   2.27  2.29
kdl                           144,202     0    0       1    0.01  0.12
mdl                           483,086     0    1       1    0.91  0.29
missouri-hub                  144,424     0    1       16   1.05  0.70
mwdl                          932,808     0    1       15   1.22  0.86
nara                          700,948     0    0       1    0.01  0.11
nypl                          1,170,436   0    0       2    0.34  0.47
scdl                          159,092     0    1       16   0.80  0.41
smithsonian                   1,250,705   0    2       179  2.19  1.94
the_portal_to_texas_history   649,276     0    2       3    1.96  0.20
tn                            151,334     0    1       1    0.98  0.13
uiuc                          18,231      0    3       25   3.47  2.13
undefined_provider            11,422      0    0       4    0.00  0.08
usc                           1,065,641   0    0       6    0.21  0.43
virginia                      30,174      0    0       1    0.30  0.46
washington                    42,024      0    1       1    0.79  0.41
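As a rough sketch of how per-provider statistics like these could be computed (whether the original analysis used population or sample standard deviation is an assumption on my part), Python’s statistics module makes it straightforward:

```python
import statistics
from collections import defaultdict

def description_stats(records):
    """records: iterable of (provider, description_count) pairs.

    Returns per-provider min, median, max, mean, and population stddev.
    """
    counts = defaultdict(list)
    for provider, n in records:
        counts[provider].append(n)
    stats = {}
    for provider, values in counts.items():
        stats[provider] = {
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
            "mean": round(statistics.mean(values), 2),
            # Population standard deviation; sample stddev (stdev) is the other option.
            "stddev": round(statistics.pstdev(values), 2),
        }
    return stats
```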

This time with an image

Average number of descriptions per record

You can see that there are several Hubs/Providers that have multiple descriptions per record, with the Government Publishing Office coming in at 4.37 descriptions per record.

I found it interesting that when you exclude the two Hubs/Providers that don’t really do descriptions (KDL and NARA), you see two that have a very low standard deviation from their mean (average): Tennessee Digital Library at 0.13 and The Portal to Texas History at 0.20 don’t drift much from their almost one description per record for Tennessee and almost two descriptions per record for Texas. It makes me think that this is probably a set of records that each of those Hubs/Providers would like to have identified so they could go in and add a few descriptions.


Well, that wraps up this post, which I hope is the first in a series about the description field in the DPLA dataset. In subsequent posts we will move away from record-level analysis of description fields and get down to the field level to do some analysis of the descriptions themselves. I have a number of predictions, but I will hold onto those for now.

If you have questions or comments about this post,  please let me know via Twitter.

Open Knowledge Foundation: Open Data Day 2016 Malaysia Data Expedition – Measuring Provision of Public Services for Education

planet code4lib - Mon, 2016-03-28 10:06

This blog post was written by the members of the Sinar project in Malaysia 

In Malaysia, Sinar Project with the support of Open Knowledge International organised a one-day data expedition based on the guide from School of Data to search for data related to government provision of health and education services. This brought together a group of people with diverse skills to formulate questions of public interest. The data sourced would be used for analysis and visualisation in order to provide answers.

Data Expedition

A data expedition is a quest to explore uncharted areas of data and report on those findings. The participants with different skillsets gathered throughout the day at the Sinar Project office. Together they explored data relating to schools and clinics to see what data and analysis methods are available to gain insights on the public service provision for education and health.

We used the guides and outlines for the data expedition from School of Data website. The role playing guides worked as a great ice breaker. There was healthy competition on who could draw the best giraffes for those wanting to prove their mettle as a designer for the team.



Deciding what to explore, education or health?

The storyteller on the team, who was a professional journalist, started out with a few questions to explore.

  • Are there villages or towns which are far away from schools?
  • Are there villages or towns which are far away from clinics and hospitals?
  • What is the population density and provision of clinics and schools?

The scouts then went on a preliminary exploration for whether this data exists.

Looking for the Lost City of Open Data

The Scouts, with the aid of the rest of the team, looked for data that could answer the questions. They found a lot of usable data from the Malaysian government open data portal. This data included lists of all public schools and clinics with addresses, as well as numbers of teachers for each district.

It was decided by the team that given the time limitation, the focus would be to answer the questions on education data. Another priority was to find data relating to class sizes to see if schools are overcrowded or not. Below you can see the data that the team found. 

Education Open Data

Data in Reports

Not all schools are created equal; there are different types, and some are considered high-achieving schools, or Sekolah Berprestasi Tinggi.

Health Open Data

GIS


Other Data

CIDB Construction Projects contains relevant information such as construction of schools and clinics.
Script to import into Elastic Search


Sinar Project had some budgets available as open data, at state and federal levels, that could be used as an additional reference point. These were created as part of the Open Spending project.

Selangor State Government

Federal Government
Higher Education
Education

Methodology

The team opted to focus on the available datasets to answer questions about education provision, by first converting all school addresses into geographic coordinates, and then joining up data to find the relationship between enrollments, schools, and teacher ratios.

Joining up data

To join up the different data sets, such as teacher numbers and schools, the VLOOKUP function in Excel was used, joining by school code.
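The same join can be sketched in Python; the column names SchoolCode and Teachers here are hypothetical stand-ins for the actual headers in the expedition’s datasets:

```python
import csv
from io import StringIO

def join_by_school_code(schools_csv, teachers_csv):
    """Left-join teacher rows onto school rows by a shared school code column,
    the same lookup VLOOKUP performs in Excel."""
    teachers = {row["SchoolCode"]: row for row in csv.DictReader(StringIO(teachers_csv))}
    joined = []
    for school in csv.DictReader(StringIO(schools_csv)):
        match = teachers.get(school["SchoolCode"], {})
        # Schools with no matching teacher row get an empty value, like VLOOKUP's #N/A.
        joined.append({**school, "Teachers": match.get("Teachers", "")})
    return joined
```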

Converting Address to geolocation (latlong)

To convert street addresses to latitude/longitude coordinates, we used the dataset with the cleansed addresses along with a geocoding tool, csvgeocode:

./node_modules/.bin/csvgeocode ./input.csv ./output.csv --url "{{Alamat}}&key=" --verbose

Convert the completed CSV to GeoJSON points

Use the csv2geojson tool:

csv2geojson --lat "Lat" --lon "Lng" Selangor_Joined_Up_Moe.csv
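For reference, a minimal sketch of the conversion csv2geojson performs: each row becomes a GeoJSON Point feature. Note that GeoJSON orders coordinates longitude first:

```python
def rows_to_geojson(rows, lat_key="Lat", lon_key="Lng"):
    """Convert dict rows with Lat/Lng columns into a GeoJSON FeatureCollection."""
    features = []
    for row in rows:
        # Everything except the coordinate columns becomes feature properties.
        props = {k: v for k, v in row.items() if k not in (lat_key, lon_key)}
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON coordinates are [longitude, latitude].
                "coordinates": [float(row[lon_key]), float(row[lat_key])],
            },
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}
```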

To get population by PBT

Use the data from the state economic planning unit agency site for socio-economic data, specifically section Jadual 8.

To get all the schools separated by individual PBT (District)

Use the GeoJSON of the schools data and the PBT boundaries loaded into QGIS, and use Vector > Geo-processing > Intersect.

A post from Stack Exchange suggests it might be better to use the Vector > Spatial Query > Spatial Query option.
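The intersect step boils down to point-in-polygon tests for each school against each district boundary. As an illustrative sketch (not what the team actually ran in QGIS), the core test can be written in plain Python with the ray-casting algorithm:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside polygon (a list of (lon, lat) vertices)?"""
    inside = False
    n = len(polygon)
    j = n - 1
    for i in range(n):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Count crossings of a horizontal ray extending right from the point;
        # an odd number of crossings means the point is inside.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside
```

Real boundary data has holes and multipolygons, which is why a GIS tool like QGIS (or a library such as shapely) is the practical choice.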

Open Datasets Generated

The cleansed and joined-up datasets created during this expedition are made available on GitHub. While the focus was on education, due to the similarity in available data, the methods were also applied to clinics. See it on our repository.

Visualizations

All Primary and Secondary Schools on a Map with Google Fusion Tables

Teacher to Students per school ratios


  • Teachers vs enrollment did not provide data relating to class size or overcrowding
  • Demographic datasets to measure schools to eligible population
  • More school datasets required for teachers, specifically by subject and class ratios
  • Methods used for location of schools can also be applied to clinics & hospital data

It was discovered that additional data was needed to provide useful information on the quality of education. There was not enough demographic data found to check against the number of schools in a particular district. Teacher-to-student ratio was also not a good indicator of the problems reported in the news: the teacher-to-enrollment ratios were generally very low, with a mean of 13 and a median of 14. What was needed was a ratio by subject teachers, class size, or a comparison against the population of eligible children in each area, to provide better insights.

Automatically calculating the distance from points was also considered and matched up with whether there are school bus operators in the area. This was discussed because the distance from schools may not be relevant for rural areas, where there were not enough children to warrant a school within the distance policy. A tool to check distance from a point to the nearest school could be built with the data made available. This could be useful for civil society to use data as evidence to prove that distance was too far or transport not provided for some communities.

Demographic data was found for local councils; this could be used by researchers, together with local council boundary data, to assess whether there are enough schools for the population of each local council. Interestingly, in Malaysia education is under the Federal government, and despite having state and local education departments, the administrative boundaries do not match up with local council boundaries or electoral boundaries. This is a planning coordination challenge for policy makers. Administrative local council boundary data was made available as open data thanks to the efforts of another civil society group, Tindak Malaysia, which scanned and digitized the electoral and administrative boundaries manually.

Running future expeditions

This was a one day expedition so it was time limited. For running these brief expeditions we learned the following:

  • Focus and narrow down expedition to specific issue
  • Be better prepared, scout for available datasets beforehand and determine topic
  • Focus on central repository or wiki of available data

Thank you to all of the wonderful contributors to the data expedition:

  • Lim Hui Ying (Storyteller)
  • Haris Subandie (Engineer)
  • Jack Khor (Designer)
  • Chow Chee Leong (Analyst)
  • Donaldson Tan (Engineer)
  • Michael Leow (Engineer)
  • Sze Ming (Designer)
  • Swee Meng (Engineer)
  • Hazwany (Nany) Jamaluddin (Analyst)
  • Loo (Scout)

Terry Reese: MarcEdit Updates

planet code4lib - Mon, 2016-03-28 05:24

I spent some time this week working through a few updates based on feedback I’ve gotten over the past couple of weeks. Most of the updates at this point are focused on the Windows/Linux builds, but the Mac build has been updated as well, and all new functionality found in the linking libraries and RDA changes applies there too. I’ll be spending this week focusing on the Mac MarcEdit UI, continuing to work towards functional parity with the Windows version.

Windows/Linux Updates:

* 6.2.100
** Bug Fix: Build Links Tool — when processing a FAST heading without a control number, the search would fail.  This has been corrected.
** Bug Fix: MarcEditor — when using the convenience function that allows you to open .mrc files directly in the MarcEditor and save directly back to the .mrc file, this connection would be lost when running a task.  This has been corrected.
** Enhancement: ILS Integration — added code to enable the use of profiles.
** Enhancement: ILS Integration — added a new select option so users can select from existing Z39.50 servers.
** Enhancement: OAI Harvesting — Added a debug URL string so that users can see the URL MarcEdit will be using to query the users server.
** UI Change: OAI Harvesting — UI has been changed to have the data always expanded.
** Enhancement: MarcValidator — Rules file has been updated to include some missing fields.
** Enhancement: MarcValidator — Rules file includes a new parameter, subfield, which defines the valid subfields within a field.  If a subfield appears that is not in this list, it will mark the record as an error.
** Enhancement: Task Menu — Task menu items have been truncated according to Windows convention.  I’ve expanded those values so users can see approximately 45 characters of a task name.
** Cleanup: Validate Headings — did a little work on the validate headings to clean up some old code.  Finishing prep to start allowing indexes beyond LCSH based on the rules file developed for the build links tool.


Mac Updates:

* 1.4.43 ChangeLog
** Bug Fix: Build Links Tool: Generating FAST headings would work when an identifier was in the record, but the tool wasn’t correctly finding the data when performing a lookup.
** Enhancement: RDA Helper: Rules file has been updated and code now exists to allow users to define subfields that are valid.
** Bug Fix: RDA Helper: Updated library to correct a processing error when handling unicode replacement of characters in the 264.
** Enhancement: RDA Helper: Users can now define fields by subfield, i.e. =245$c, and abbreviation expansion will only occur over the defined subfields.

MarcValidator Changes:

One of the significant changes in the program this time around has been a change in how the Validator works.  The Validator currently looks at data present, and determines if that data has been used correctly.  I’ve added a new field in the validator rules file called subfield (Example block):

# Uncomment these lines and add validation routines like:
#valida    [^0-9x]    Valid Characters
#validz    [^0-9x]    Valid Characters
ind1    blank    Undefined
ind2    blank    Undefined
subfield    acqz68    Valid Subfields
a    NR    International Standard Book Number
c    NR    Terms of availability
q    R    Qualifier
z    R    Canceled/invalid ISBN
6    NR    Linkage
8    R    Field link and sequence number

The new block is the subfield item – here the tool defines all the subfields that are valid for this field.  If this element is defined and a subfield shows up that isn’t defined, you will receive an error message letting you know that the record has a field with an improper subfield in it.
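As a rough illustration of the check this rule implies (not MarcEdit’s actual implementation), using the acqz68 subfield list from the example block:

```python
def validate_subfields(field_subfields, rule="acqz68"):
    """Return the subfield codes present in a field that the rule does not allow.

    rule is the subfield parameter from the validator rules file: a string
    of all valid subfield codes for the field.
    """
    allowed = set(rule)
    # Any code outside the allowed set would trigger the new error message.
    return [code for code in field_subfields if code not in allowed]
```

For example, a 020 field carrying subfields a, q, and b would be flagged for the undefined subfield b, while one carrying only a, z, 6, and 8 would pass.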

RDA Helper

The other big change came in the RDA Helper.  Here I added the ability for abbreviation expansion to be defined at a finer granularity.  Up to this point, abbreviation definitions happened at the field or field group level.  Users can now define down to the subfield level.  For example, if the user wanted to target just the 245$c for abbreviations but leave all other 245 subfields alone, one would just define =245$c in the abbreviation field definition file.  If you want to define multiple subfields for processing, define each as its own unit.

You can get the download from the MarcEdit website or via the MarcEdit automatic download functionality.

Questions, let me know.


DuraSpace News: VIVO Updates for March 27–User Group Meeting Details, Membership Focus

planet code4lib - Mon, 2016-03-28 00:00

From Mike Conlon, VIVO Project Director

VIVO User Group Meeting.  Registration is open.  VIVO User Group Meeting #1 will be held May 5-6 at the Galter Health Science Library at Northwestern in Chicago!

Open Library Data Additions: Western Washington University MARC

planet code4lib - Sun, 2016-03-27 05:32

MARC records from Western Washington University. A late addition to marc_oregon_summit_records.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata

Ed Summers: Revisiting Archive Collections

planet code4lib - Sun, 2016-03-27 04:00

For my Qualitative Research Methods class this week I was asked to find a research article in my field of interest that uses focus groups, and to write up a short summary of the article and critique their use of focus groups as a research method.

After what seemed like a bit too much searching around in Google Scholar I eventually ran across an article written by Jon Newman of Lambeth Archive in 2012 about the experiences that a group of archives in the South East of England had with participatory cataloging of their collections (Newman, 2012). The archives Newman discusses were all participants in the Mandeville Legacy Project who were attempting to provide better access to their archival collections related to the topic of disabilities and rehabilitation. They chose to use a technique pioneered in the museum community called Revisiting Collections which was adapted specifically for the archives as Revisiting Archive Collections or RAC.

RAC is a technique that was designed by the Collections Trust in the UK to try to make archival descriptions more inclusive, accurate and complete by including the contributions of individuals outside of the archival profession. In the words of the RAC Toolkit:

A key strength of Revisiting Collections is that it provides a framework for embedding new understanding and perspectives on objects and records directly within the museum or archive’s collection knowledge management system, ensuring that it forms part of the story about the collections that is recorded and made accessible to all.

RAC’s framework includes community based focus groups which bring individuals into contact with archival materials and elicit knowledge sharing as well as new narrative and documentation. RAC is similar in spirit to other methods for achieving participatory archives, but is different because it uses actual focus groups rather than a Web based crowd sourcing approach. The framework includes detailed instructions for running a RAC focus group including:

  • how to select participants
  • how to select materials
  • consent
  • prompt questions
  • data collection
  • attribution
  • room setup
  • starting/ending the session
  • follow up after the session

The essential idea is that people who have direct experience of the subject material have much to offer in the description of the records. Newman connects RAC’s theoretical stance of involving more voices in the production of archives with the work of Terry Cook, Tom Nesmith, Verne Harris, Wendy Duff and Eric Ketelaar. This constellation of archival theory has been actively dismantling the Jenkinsonian notion of the archivist as a neutral, informed, anonymous and monolithic voice. It is not simply a stylistic choice, but a foundational point about recognizing the archivist’s and archive’s role in shaping the historical record. RAC is an example of connecting this theory to actual practice.

So Newman isn’t using focus groups as a research method in this study, but is instead reflecting on the use of focus groups as a technique for generating more complete and useful archival descriptions. To do this he provides case studies that reflect the implementation of RAC in 5 county records offices.

He found that in all these cases work still needed to be done to integrate the results of the focus group sessions into the archival descriptions themselves. Part of the problem lay in how well the archival standards and systems accommodate this new type of community- or user-centered information. Museums in the UK (at least in 2012) have SPECTRUM, a standard that includes guidance for adding user-generated information, and the standard is implemented in museum collection management systems. Newman found that guidance on how RAC fits with ISAD(G)-based archival description systems was not enough to get the newly acquired information into archival systems.

However Newman also found that the focus group sessions generated powerful, revealing and creative descriptions of the records which were highly valuable. The interactions between the archive and the external partners led to increased levels of engagement and trust that was deemed extremely useful by both parties. Using visual material from the archives was an effective way to generate discussion in the focus groups.

Newman noted that some archivists had uncertainty about how to add the emotive content of these contributions to the archival description. To my eye this seemed like perhaps some were still clinging to the notion that archival descriptions were unbiased and neutral. Indeed, I noticed that the RAC guidelines themselves recommended only adding acquired content if it was deemed neutral:

Information that is destined for the ISAD(G) catalogue may be used verbatim if you consider it to be neutral, factual and verifiable. It is more likely, however, to be a trigger for the archivist to revisit the catalogue, investigate or authenticate the new information that has been offered and rework the existing description. (p. 24)

Of course revisiting the catalog to revise is a bit of a luxury, especially when many archives have large backlogs of records that lack any description at all.

The RAC guidelines also require attribution when adding to the official archival description. This in turn requires obtaining consent from the focus group participants. But Newman observed that there was occasionally some uncertainty about how this consent and attribution worked in situations like students names where privacy came into play.

A big part of the work of conducting the focus groups is in the data analysis afterwards. RAC provides guidance on how to mark up the focus group transcripts using 5 categories:

  • ISAD(G) catalog
  • keywords
  • subject guide
  • free text

As any researcher will tell you, this markup process can itself be highly time consuming. I think it would've benefited Newman's article to examine how participating archives were able to perform this step: how much they did it, and what categories of information were acquired most often. Some basic statistics, such as the number of focus groups conducted by each institution, the number, ages, and backgrounds of participants, and the time spent, may have been difficult to acquire but would've helped give more of a sense of the scope of the work. In addition it would've been interesting to learn more about how focus group participants were selected.

Despite these shortcomings I enjoyed Newman’s analysis, and am sympathetic to the theoretical goals of the RAC project. It is a useful example of putting post-Foucauldian critiques of the archive into practice, without waving the crowdsourcing magic wand. I think a useful extension of this work would be to dive a bit deeper into how participating archives routed around their archival systems by adding content to websites and/or subject guides, and to contemplate how archival description could be linked to that larger body of documentation.


Newman, J. (2012). Revisiting archive collections: Developing models for participatory cataloguing. Journal of the Society of Archivists, 33(1), 57–73.

Open Library Data Additions: Miami University of Ohio MARC

planet code4lib - Sat, 2016-03-26 08:08

MARC records from the Miami University of Ohio..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata

Mita Williams: Knight News Challenge: Library Starter Deck: a 21st-century game engine and design studio for libraries

planet code4lib - Sat, 2016-03-26 02:38
The Library Starter Deck from FutureCoast on Vimeo.

Last week, Ken Eklund and I submitted our proposal for the 2016 Knight News Challenge, which asks, "How might libraries serve 21st century information needs?"

Our answer is this: The Library Starter Deck: a 21st-century game engine and design studio for libraries. We also have shared a brief on some of the inspirations behind our proposal (pdf).

Two years ago I reviewed the 680+ applications to the 2014 Knight News Challenge for Libraries and shared some of my favourites. It was, and still is, a very useful exercise, because there are not many opportunities to read grant applications (if you are not the one handing out the grant), and this particular set offers applications from both professionals and the public.

You can also review the entries as an act of finding signals of the future, as the IFTF might put it. That's what I've chosen to do for this year's review. This means I've chosen not to highlight the applications I think are the best or most deserving to win (that's up to these good people), but instead I made note of the applications that, for lack of a better word, surprised me:

I'd like to add there are many other deserving submissions that I have given a 'heart' to on the Knight News Challenge website and if you are able to, I'd encourage you to do the same.

FOSS4Lib Recent Releases: Koha - 3.22.5

planet code4lib - Fri, 2016-03-25 20:59
Package: KohaRelease Date: Wednesday, March 23, 2016

Last updated March 25, 2016. Created by David Nind on March 25, 2016.
Log in to edit this page.

Koha 3.22.5 is a security and maintenance release. It includes one security fix and 63 bug fixes (this includes enhancements, as well as fixes for problems).

As this is a security release, we strongly recommend that anyone running Koha 3.22.* upgrade as soon as possible.

See the release announcement for the details:

Open Library Data Additions: MIT Barton Catalog MODS

planet code4lib - Fri, 2016-03-25 20:53

Catalog records from MIT's Barton Catalog in MODS format. Downloaded from

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata

NYPL Labs: Introducing the Photographers’ Identities Catalog

planet code4lib - Fri, 2016-03-25 18:27

Today the New York Public Library is pleased to announce the launch of the Photographers' Identities Catalog (PIC), a collection of biographical data for over 115,000 photographers, studios, manufacturers, dealers, and others involved in the production of photographs. PIC is world-wide in scope and spans the entire history of photography. So if you're a historian, student, archivist, cataloger or genealogist, we hope you'll make it a first stop for your research. And if you're into data and maps, you're in luck, too: all of the data and code are free to take and use as you wish.

Each entry has a name, nationality, dates, relevant locations and the sources from which we’ve gotten the information—so you can double check our work, or perhaps find more information that we don’t include. Also, you might find genders, photo processes and formats they used, even collections known to have their work. It’s a lot of information for you to query or filter, delimit by dates, or zoom in and explore on the map. And you can share or export your results.

Blanche Bates. Image ID: 78659

How might PIC be useful for you? Well, here’s one simple way we make use of it in the Photography Collection: dating photographs. NYPL has a handful of cabinet card portraits of the actress Blanche Bates, but they are either undated or have a very wide range of dates given.

The photographer’s name and address are given: the Klein & Guttenstein studio at 164 Wisconsin Street, Milwaukee. Search by the studio name, and select them from the list. In the locations tab you’ll find them at that address for only one year before they moved down the street; so, our photos were taken in 1899. You could even get clever and see if you can find out the identities of the two partners in the studio (hint: try using the In Map Area option).
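The dating workflow above amounts to a simple query over activity dates and locations. As a rough sketch of that kind of query (the record shape and field names here are my own invention for illustration, not PIC's actual schema):

```python
# Illustrative only: the fields below are assumed, not PIC's real data model.
from dataclasses import dataclass, field

@dataclass
class PhotographerEntry:
    name: str
    nationality: str
    start_year: int                      # earliest known activity
    end_year: int                        # latest known activity
    locations: list = field(default_factory=list)
    sources: list = field(default_factory=list)

def active_between(entries, year_from, year_to, place=None):
    """Return entries whose activity range overlaps [year_from, year_to],
    optionally restricted to a given location."""
    hits = []
    for e in entries:
        overlaps = e.start_year <= year_to and e.end_year >= year_from
        if overlaps and (place is None or place in e.locations):
            hits.append(e)
    return hits

entries = [
    PhotographerEntry("Klein & Guttenstein", "American", 1898, 1901,
                      ["Milwaukee"], ["city directories"]),
    PhotographerEntry("Example Studio", "British", 1920, 1935,
                      ["London"], []),
]
print([e.name for e in active_between(entries, 1899, 1899, "Milwaukee")])
```

A date-range overlap plus a location match is all the cabinet-card dating trick requires; PIC's map and filter interface performs the same narrowing interactively.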

But there’s much more to explore with PIC: you can find female photographers with studios in particular countries, learn about the world’s earliest photographers, and find photographers in the most unlikely places…

Often PIC has a lot of information or can point you to sources that do, but there may be errors or missing information. If you have suggestions or corrections, let us know through the Feedback form. If you’re a museum, library, historical society or other public collection and would like to let us know what photographers you’ve got, talk to us. If you’re a scholar or historian with names and locations of photographers and studios—particularly in under-represented areas—we’d love to hear from you, too!


Open Library Data Additions: Amazon Crawl: part hf

planet code4lib - Fri, 2016-03-25 08:33

Part hf of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

FOSS4Lib Upcoming Events: DC Fedora Users Group

planet code4lib - Thu, 2016-03-24 19:07
Date: Wednesday, April 27, 2016 - 08:00 to Thursday, April 28, 2016 - 17:00Supports: Fedora RepositoryHydra

Last updated March 24, 2016. Created by Peter Murray on March 24, 2016.
Log in to edit this page.

Our next DC Fedora Users Group meeting will be held on April 27 + 28 at the National Library of Medicine.


Please register in advance (registration is free) by completing this brief form:

As indicated on the form,we are also looking for sponsors for snacks - this could be for one or both days.


DPLA: Announcing the 2016 DPLA+DLF “Cross-Pollinator” grant awardees

planet code4lib - Thu, 2016-03-24 17:05

We are pleased to announce the recipients of the 2016 DPLA + DLF Cross-Pollinator Travel Grants, three individuals from DLF member organizations who will be attending DPLAfest 2016 April 14-15 in Washington, D.C.

The DPLA + DLF Cross-Pollinator Travel Grants are part of a broader vision for partnership between the Digital Library Federation (DLF) and the Digital Public Library of America. It is our belief that robust community support is key to the sustainability of large-scale national efforts. Connecting the energetic and talented DLF community with the work of the DPLA is a positive way to increase serendipitous collaboration around this shared digital platform. The goal of this program is to bring “cross-pollinators” to DPLAfest— DLF community contributors who can provide unique personal perspectives, help to deepen connections between our organizations, and bring DLF community insight to exciting areas of growth and opportunity at DPLA.

Meet the 2016 DPLA + DLF Cross-Pollinators

Jasmine Burns
Image Technologies and Visual Literacy Librarian and Interim Head, Fine Arts Library
Indiana University Bloomington

Twitter: @jazz_with_jazz

Jasmine Burns’ primary duties are to manage and curate the libraries’ multimedia image collections for teaching and research in the fine arts, including studio, art history, apparel merchandising, and fashion design. She holds an MLIS from the University of Wisconsin-Milwaukee with a concentration in Archive Studies, and an MA in Art History from Binghamton University. She has worked previously as an assistant curator of a slide library, a museum educator, a junior fellow at the Library of Congress, and as a digitization assistant for a university archives.

Burns writes:

As a new emerging professional, one major limitation that I face is that I have yet to build a strong foundation in organizations outside of those few that guide my daily work… Attending DPLAfest would offer an alternate conference experience that would enhance my understanding of the field of digital cultural heritage, and introduce me to how individuals within this and other allied fields are approaching similar issues in collections building and support, teaching with a variety of visual materials, and the presentation and preservation of digital images. My participation in DPLAfest would give me broad ideas on how to expand the scope of my projects in a way that addresses a larger community, instead of limiting my sphere to art and art history. My ultimate professional goal is to explore, create, and enhance open access image collections through digital platforms. My DLF colleagues provide me with guidance for the technical and data management aspects of managing digital image collections, while DPLAfest would expose me to the nuances of managing and curating the content within such collections.

Nancy Moussa
Web Developer, University of Michigan Library
DPLA Community Rep

In her role at University of Michigan, Nancy Moussa has worked on various projects including Islamic Manuscripts, Omeka online exhibits, and with other open source platforms such as Drupal and WordPress. Her background is in information science, with a B.S. in Computer Science from American University in Cairo, an MMath in Computer Science from University of Waterloo, Canada, and an MSI in Human Computer Interaction from School of Information at University of Michigan, Ann Arbor. She is also a member of DPLA’s newest class of community reps.

Moussa writes:

In the past three years my focus has been on customizing and building plugins for Omeka… I would like to research and investigate the DPLA API to understand how to integrate open source platforms with DLPA resources and digital objects. My second interest is to understand how DPLA’s growing contents can benefit teachers in schools, librarians, researchers and students. I hope there is more collaboration between DPLA and DLF. It is a very important step. The collaboration will reveal more incredible digital works that are contributed by DLF members. I am envisioning that DLF members (institutions) will have more opportunities to access digital works provided by other members through the DPLA portal & DPLA API /Apps. Therefore, I am looking forward to attending DPLAfest to increase my understanding and to network with other DPLA representatives and DPLA community in general.
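For readers curious what the DPLA API integration Moussa describes looks like in practice, here is a minimal sketch of building an items query. The endpoint shape follows DPLA's public v2 API; the API key and search term are placeholders, and the response shown is sample data rather than a live result:

```python
# Sketch of querying the DPLA items API (v2). No request is sent here;
# the key is a placeholder and the response below is invented sample data.
from urllib.parse import urlencode

API_BASE = "https://api.dp.la/v2/items"

def build_items_query(search_term, api_key, page_size=10):
    """Assemble the query URL for a DPLA items search."""
    params = {"q": search_term, "page_size": page_size, "api_key": api_key}
    return API_BASE + "?" + urlencode(params)

url = build_items_query("cabinet card", "YOUR_API_KEY")
print(url)

# Responses are JSON with a total count and a list of item records ("docs");
# titles can be pulled out of the sourceResource metadata like so:
sample_response = {
    "count": 2,
    "docs": [
        {"sourceResource": {"title": "Portrait of Blanche Bates"}},
        {"sourceResource": {"title": "Milwaukee street scene"}},
    ],
}
titles = [d["sourceResource"]["title"] for d in sample_response["docs"]]
print(titles)
```

This is the kind of plumbing an Omeka plugin would wrap: build the query, fetch the JSON, and map the returned records into local exhibit items.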


T-Kay Sangwand
Librarian for Digital Collection Development
Digital Library Program, UCLA

Twitter: @tttkay

Prior to her current position at UCLA, T-Kay Sangwand served as the Human Rights Archivist and Librarian for Brazilian Studies at University of Texas at Austin. In 2015, she was named one of Library Journal’s “Movers and Shakers” in the Advocate category for her collaborative work with human rights organizations through the UT Libraries Human Rights Documentation Initiative. She is currently a Certified Archivist and completed the Archives Leadership Institute in 2013. Sangwand holds an MLIS and MA degree in Latin American Studies from UCLA with specializations in Archives, Spanish and Portuguese and a BA in Gender Studies and Latin American Studies from Scripps College.

Sangwand writes:

As an information professional that is committed to building a representative historical record that celebrates the existence and contributions of marginalized groups (i.e. people of color, women, queer folks), I am particularly excited about the possibility of attending DPLAfest and learning about how the DPLA platform can be leveraged in pursuit of this more representative historical record…. While UCLA is not yet a contributor to DPLA, this is something we are working towards and a process I am looking forward to being a part of as my current position focuses on digital access for a wide cross-section of materials from Chicano Studies, Gender Studies, UCLA Oral History Center and more. Since DLF explicitly describes itself as a “robust community of practice advancing research, learning, social justice & the public good [my emphasis],” I am hopeful that DLF community members, including UCLA, can form a critical mass around building out a representative and diverse historical record in support of the values espoused by DLF.


Congratulations to all — we look forward to meeting you at DPLAfest!

David Rosenthal: Long Tien Nguyen &amp; Alan Kay's "Cuneiform" System

planet code4lib - Thu, 2016-03-24 15:00
Jason Scott points me to Long Tien Nguyen and Alan Kay's paper from last October entitled The Cuneiform Tablets of 2015. It describes what is in effect a better implementation of Raymond Lorie's Universal Virtual Computer. They attribute the failure of the UVC to its complexity:
They tried to make the most general virtual machine they could think of, one that could easily emulate all known real computer architectures. The resulting design has a segmented memory model, bit-addressable memory, and an unlimited number of registers of unlimited bit length. This Universal Virtual Computer requires several dozen pages to be completely specified and explained, and requires far more than an afternoon (probably several weeks) to be completely implemented.

They are correct that the UVC was too complicated, but the reasons why it was a failure are far more fundamental and, alas, apply equally to Chifir, the much simpler virtual machine they describe. Below the fold, I set out these reasons.
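The appeal of a simple virtual machine is that its core is nothing more than a fetch-decode-execute loop. A toy machine in that spirit fits in a few dozen lines; note that the instruction set below is invented for illustration and is not Chifir's actual specification:

```python
# A toy VM in the spirit of Chifir/UVC: a plain fetch-decode-execute loop.
# The instruction set is invented for illustration, NOT the Chifir spec.
def run(program, memory_size=256):
    mem = [0] * memory_size          # word-addressed memory
    pc = 0                           # program counter
    while True:
        op, a, b, c = program[pc]
        pc += 1
        if op == "halt":
            return mem
        elif op == "set":            # mem[a] = b (load immediate)
            mem[a] = b
        elif op == "add":            # mem[a] = mem[b] + mem[c]
            mem[a] = mem[b] + mem[c]
        elif op == "sub":            # mem[a] = mem[b] - mem[c]
            mem[a] = mem[b] - mem[c]
        elif op == "jnz":            # if mem[a] != 0: jump to address b
            if mem[a] != 0:
                pc = b

# Sum the integers 1..5 into mem[0]; mem[1] is the counter, mem[2] holds 1.
program = [
    ("set", 1, 5, 0),    # counter = 5
    ("set", 2, 1, 0),    # constant 1
    ("add", 0, 0, 1),    # total += counter   (loop head, address 2)
    ("sub", 1, 1, 2),    # counter -= 1
    ("jnz", 1, 2, 0),    # if counter != 0 goto loop head
    ("halt", 0, 0, 0),
]
print(run(program)[0])   # → 15
```

Writing such an interpreter from a one-page spec really is an afternoon's work, which is the authors' point; the argument below is that implementability was never the binding constraint.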

The reasons are strongly related to the reason why the regular announcements of new quasi-immortal media have had almost no effect on practical digital preservation. And, in fact, the paper starts by assuming the availability of a quasi-immortal medium in the form of a Rosetta Disk. So we already know that each preserved artefact they create will be extremely expensive.

Investing money and effort now in things that only pay back in the far distant future is simply not going to happen on any scale because the economics don't work. So at best you can send an insignificant amount of stuff on its journey to the future. By far the most important reason digital artefacts, including software, fail to reach future scholars is that no-one could afford to preserve them. Suggesting an approach whose costs are large and totally front-loaded implicitly condemns a vastly larger amount of content of all forms to oblivion because the assumption of unlimited funds is untenable.

It's optimistic, to say the least, to think you can solve all the problems that will happen to stuff in the next, say, 1000 years in one fell swoop - you have no idea what the important problems are. The Cuneiform approach assumes that the problems are (a) long-lived media and (b) the ability to recreate an emulator from scratch. These are problems, but there are many other problems we can already see facing the survival of software over the next millennium. And it's very unlikely that we know all of them, or that our assessment of their relative importance is correct.

Preservation is a continuous process, not a one-off thing. Getting as much as you can to survive the next few decades is do-able - we have a pretty good idea what the problems are and how to solve them. At the end of that time, technology will (we expect) be better and cheaper, and we will understand the problems of the next few decades. The search for a one-time solution is a distraction from the real, continuing task of preserving our digital heritage.

And I have to say that analogizing a system designed for careful preservation of limited amounts of information for the very long term with cuneiform tablets is misguided. The tablets were not designed or used to send information to the far future. They were the equivalent of paper, a medium for recording information that was useful in the near future, such as for accounting and recounting stories. Although:
Between half a million and two million cuneiform tablets are estimated to have been excavated in modern times, of which only approximately 30,000 – 100,000 have been read or published.

The probability that an individual tablet would have survived to be excavated in the modern era is extremely low. A million or so survived; many millions more didn't. The authors surely didn't intend to propose a technique for getting information to the far future with such a low probability of success.

Open Knowledge Foundation: Open Data Day 2016 Birmingham, UK

planet code4lib - Thu, 2016-03-24 11:34

This blogpost was written by Pauline Roche, MD of voluntary sector infrastructure support agency, RnR Organisation, co-organiser Open Mercia, co-Chair West Midlands Open Data Forum, steering group member Open Data Institute (ODI) Birmingham node, founder Data in Brum

20 open data aficionados from across sectors as diverse as big business, small and medium enterprises, and higher education, including volunteers and freelancers gathered in Birmingham, UK on Friday, March 4th to share our enthusiasm for and knowledge of open data in our particular fields, to meet and network with each other and to plan for future activities around open data in the West Midlands. We met on the day before Open Data Day 2016 to accommodate most people’s schedules.

The event was organised by Open Mercia colleagues Pauline Roche and Andrew Mackenzie, and hosted at ODI Birmingham by Hugo Russell, Project Manager, Innovation Birmingham. The half-day event formally started with introductions, a brief introduction to the new ODI Birmingham node, and a livestream of the weekly ODI Friday lecture: 'Being a Data Magpie'. In the lecture, ODI Senior Consultant Leigh Dodds explained how to find small pieces of data that are shared - whether deliberately or accidentally - in our cities. Delegates were enthralled with Leigh's stories about data on bins, bridges, lamp posts and trains.

We then moved on to lightning talks about open data with reference to various subjects: highways (Teresa Jolley), transport (Stuart Harrison), small charities (Pauline Roche), mapping (Tom Forth), CartoDB (Stuart Lester), SharpCloud (Hugo Russell) and air quality (Andrew Mackenzie). These talks were interspersed with food and comfort breaks to encourage the informality which tends to generate the sharing and collaboration which we were aiming to achieve.

During the talks, more formal discussion focused on Birmingham's planned Big Data Corridor, incorporating real-time bus information from the regional transport authority Centro, and including community engagement through the East of Birmingham to validate pre/post contract completion, for example in road works and traffic management changes. Other discussion focused on asset condition data, Open Contracting, and visualisation for better decisions.

Teresa Jolley's talk (delivered via Skype from London) showed that 120 local authorities (LAs) in England alone are responsible for 98% of the road network but have only 20% of the budget; each LA gets £30m but actually needs £93m to bring the network back to full maintenance. The talk highlighted the need for more collaboration, improved procurement, new sources of income, and data on asset condition, which is available in a variety of places, including in people's heads! The available research data is not open, which is a barrier to collaboration. Delegates concluded from Teresa's talk that opening the contracts between private and public companies is the main challenge.

Stuart Harrison, ODI Software Superhero, talked about integrated data visualisation and decision making, showing us the London Underground: Train Data Demonstrator. He talked about visualisation for better decisions on train capacity and using station heat maps to identify density of use.

Pauline Roche, MD of the voluntary sector infrastructure support agency RnR Organisation, shared the Small Charities Coalition's definition of their unique feature (annual income less than £1m) and explained that, under this definition, 97% of the UK's 164,000 charities are small. In the West Midlands region alone, the latest figures show 20,000 local groups (not all are charities), 34,000 FTE paid staff, 480,000 volunteers and an annual £1.4bn turnover.

Small charities could leverage their impact through the use of open data to demonstrate transparency, better target their resources, carry out gap analysis (for example, Nepal NGOs found that opening and sharing their data reduced duplication amongst other NGOs in the country) and measure impact. One small charity which Pauline worked with on a project to open housing data produced a comprehensive open data "Wishlist", including data on health, crime and education. Small charities need active support from the council and other data holders to get the data out.

Tom Forth from the ODI Leeds node showed delegates how he uses open data for mapping, with lots of fun demonstrations. Pauline shared some of Tom's specific mapped data on ethnicity with two relevant charities and we look forward to examining that data more closely in the future. It was great to have a lighter, though no less important, view of what can often be seen as a very serious subject. Tom invited delegates to the upcoming Floodhack at ODI Leeds the following weekend. He also offered to run another mapping event the following week for some students present, with more assistance being proffered by another delegate, Mike Cummins.

Stuart Lester of Digital Birmingham, gave an introduction to CartoDB and reminded delegates of the Birmingham Data Factory where various datasets were available under an open license.

The second-last talk of the day was a demonstration of SharpCloud from Hugo Russell, who described using this and other visualisation tools, such as Kumu, to tell a story and spot issues and relationships.

Finally, Andrew Mackenzie presented on air quality and gave some pollution headlines, relating his presentation topically to the LEP, Centro and HS2. He said that some information, while public, is not yet published as data, but it can be converted. There were some questions about the position of the monitoring stations and a possible project: "What is the quality of the air near me/a location?". Andrew said it currently costs £72,000 to build an air quality monitoring station and gave some examples of work in the field, e.g. Smart Citizen. He also mentioned the local organisation Birmingham Friends of the Earth and a friendly data scientist, Dr Andy Pryke. One of the delegates tweeted a fascinating visualisation of air pollution data.


Our diverse audience represented many networks and organisations: two of the Open Data Institute nodes, Birmingham and Leeds, West Midlands Open Data Forum, Open Mercia, Open Data Camp, Birmingham Insight, Hacks and Hackers (Birmingham), Brum by Numbers and Data in Brum. Our primary themes were transport and social benefit, and we learned about useful visualisation tools like CartoDB, SharpCloud and Kumu. The potential markets we explored included: an open commercialisation model linked to the Job Centre, collaboration where a business could work with a transport authority and an ODI node to access Job Centres of applicable government departments on a revenue share, and an air quality project.

Future events information shared included the unconference Open Data Camp 3 in Bristol, 14-15 May (next ticket release 19 March), an Open Government Partnership meeting on 7 April at Impact Hub Birmingham, a mapping workshop with Tom Forth (date TBC), and offers of future events: CartoDB with Stuart Lester (½ day), OpenStreetMap with Andy Mabbett (½ day) and WikiData with Andy Mabbett (½ day). Pauline also compiled a post-event Storify:

LITA: Jobs in Information Technology: March 23, 2016

planet code4lib - Thu, 2016-03-24 00:57

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Yale University, Senior Systems Librarian / Technical Lead, ID 36160BR, New Haven, CT

Misericordia University, University Archivist and Special Collections Librarian, Dallas, PA

University of Arkansas, Accessioning and Processing Archivist, Fayetteville, AR

University of the Pacific, Information and Educational Technology Services (IETS) Director, Stockton, CA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Open Library Data Additions: University of Michigan PD Scan Records

planet code4lib - Wed, 2016-03-23 22:09

Records retrieved from the OAI interface to the University of Michigan's collection of scanned public domain books. Crawl done on 2007-01-11 at

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata


Subscribe to code4lib aggregator