Planet Code4Lib - http://planet.code4lib.org

DuraSpace News: Universidad de la Sabana: Colombia Evolves Its Institutional Repository

Wed, 2015-02-25 00:00

Chía, Colombia: Universidad de la Sabana's institutional repository provides services to the university's researchers, students, and the academic community of this prestigious Colombian university. Arvo Consultores has helped the institution upgrade its repository to DSpace 4.2. Improvements include an enhanced, adaptive and responsive interface – the first Mirage2 interface at a Colombian institution.

HangingTogether: New MARC Usage Data Available

Tue, 2015-02-24 21:40

I just finished updating my “MARC Usage in WorldCat” web site that summarizes and reports on how MARC elements have been used in the 333,518,928 MARC records in WorldCat as of 1 Jan 2015.

Not surprisingly, the totals for new fields such as 336, 337, and 338 have shot up, from some 9-10 million occurrences in January 2014 to 40-50 million occurrences in January 2015.

Also, it appears that well over 33 million records have come in through the Digital Collection Gateway.

As always, if you wish to see the summarized contents of any subfield just let me know. And don’t forget about the visualizations (pictured).

About Roy Tennant

Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.


FOSS4Lib Upcoming Events: CollectionSpace Walkthrough March 2015

Tue, 2015-02-24 20:53
Date: Friday, March 27, 2015 - 12:00 to 13:00
Supports: CollectionSpace

Last updated February 24, 2015. Created by Peter Murray on February 24, 2015.

From the announcement:

Curious about CollectionSpace?

Join Megan Forbes, Community Outreach Manager, for the first in a series of bi-monthly walkthroughs. More than just a demo of features and functionality, the walkthrough will:

FOSS4Lib Upcoming Events: CollectionSpace Open House Feb 2015

Tue, 2015-02-24 20:49
Date: Friday, February 27, 2015 - 12:00 to 13:00
Supports: CollectionSpace

Last updated February 24, 2015. Created by Peter Murray on February 24, 2015.

From the announcement:

Curious about CollectionSpace?
Join program staff; members of the leadership, functionality, and technical working groups; implementers; and special guests for a bi-monthly open house. Bring your questions (functional, technical, operational), your ideas, and your projects to this free-ranging conversation.

Mark E. Phillips: DPLA Metadata Analysis: Part 1 – Basic stats on subjects

Tue, 2015-02-24 18:01

On a recent long flight (from Dubai back to Dallas) I spent some time working with the metadata dataset that the Digital Public Library of America (DPLA) provides on its site.

I was interested in finding out the following pieces of information:

  1. What is the average number and standard deviation of subjects per record in the DPLA?
  2. How does this number compare across the partners?
  3. Is there any difference that we can notice between Service-Hubs and Content-Hubs in the DPLA in relation to subject field usage?
Building the Dataset

The DPLA makes the full dataset of their metadata available for download as a single file, and I grabbed a copy before I left the US because I knew it was going to be a long flight.

With a little work I was able to parse all of the metadata records and extract some information I was interested in working with, specifically the subjects for records.

So after parsing through the records to get a list of subjects per record, and the Service-Hub or Content-Hub that each record belongs to, I loaded this information into Solr to use for analysis. We are using Solr for another research project related to metadata analysis at the UNT Libraries (in addition to our normal use of Solr for a variety of search tasks), so I wanted to work on some code that I could use for a few different projects.
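
For illustration only (this is not the author's actual code), a minimal Python sketch of that extraction step might look like the following; it assumes the bulk export has one JSON record per line, with subjects under sourceResource/subject and the hub name under provider/name, which may not match the real dump layout.

import json
from collections import Counter

def subject_counts(path):
    # Yield (hub_name, number_of_subjects) for each record in the bulk export,
    # assuming one JSON record per line (adjust to the real dump layout).
    with open(path, encoding="utf-8") as dump:
        for line in dump:
            if not line.strip():
                continue
            record = json.loads(line)
            subjects = record.get("sourceResource", {}).get("subject", [])
            hub = record.get("provider", {}).get("name", "unknown")
            yield hub, len(subjects)

# Example: tally how many records per hub have no subjects at all.
no_subjects = Counter(hub for hub, n in subject_counts("dpla_export.jsonl") if n == 0)
print(no_subjects.most_common(5))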

Loading the records into the Solr index took quite a while (loading ~1,000 documents per second into Solr).

So after a few hours of processing I had my dataset, and I was able to answer my first question pretty easily using Solr's built-in StatsComponent functionality. For a description of this component, see the documentation on Solr's documentation site.
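
For reference, a StatsComponent request of the kind described might look like the sketch below; the core name and the field names (subject_count, hub_name) are assumptions for illustration, not the actual schema used.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "q": "*:*",
    "rows": 0,                       # no documents needed, just the stats
    "stats": "true",
    "stats.field": "subject_count",  # numeric field holding subjects-per-record
    "stats.facet": "hub_name",       # break the stats down by hub
    "wt": "json",
})
response = json.load(urlopen("http://localhost:8983/solr/dpla/select?" + params))
stats = response["stats"]["stats_fields"]["subject_count"]
print(stats["mean"], stats["stddev"], stats["min"], stats["max"])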

Answering the questions

The average number of subjects per record in the DPLA is 2.99, with a standard deviation of 3.90. There are records with 0 subjects (1,827,276 of them) and records with as many as 1,476 subjects (this record, btw).

Answering question number two involved a small script to create a table for us; you will find that table below.

Hub Name | min | max | count | sum | sumOfSquares | mean | stddev
ARTstor | 0 | 71 | 56,342 | 194,948 | 1351826 | 3.460083064 | 3.467168662
Biodiversity Heritage Library | 0 | 118 | 138,288 | 454,624 | 3100134 | 3.287515909 | 3.407385646
David Rumsey | 0 | 4 | 48,132 | 22,976 | 33822 | 0.477353943 | 0.689083212
Digital Commonwealth | 0 | 199 | 124,804 | 295,778 | 1767426 | 2.369940066 | 2.923194479
Digital Library of Georgia | 0 | 161 | 259,640 | 1,151,369 | 8621935 | 4.43448236 | 3.680038874
Harvard Library | 0 | 17 | 10,568 | 26,641 | 88155 | 2.520912188 | 1.409567895
HathiTrust | 0 | 92 | 1,915,159 | 2,614,199 | 6951217 | 1.365003637 | 1.329038361
Internet Archive | 0 | 68 | 208,953 | 385,732 | 1520200 | 1.84602279 | 1.966605872
J. Paul Getty Trust | 0 | 36 | 92,681 | 32,999 | 146491 | 0.356049244 | 1.20575216
Kentucky Digital Library | 0 | 13 | 127,755 | 26,009 | 82269 | 0.203584987 | 0.776219692
Minnesota Digital Library | 1 | 78 | 40,533 | 202,484 | 1298712 | 4.995534503 | 2.661891328
Missouri Hub | 0 | 139 | 41,557 | 97,115 | 606761 | 2.336910749 | 3.023203782
Mountain West Digital Library | 0 | 129 | 867,538 | 2,641,065 | 17734515 | 3.044321978 | 3.34282307
National Archives and Records Administration | 0 | 103 | 700,952 | 231,513 | 1143343 | 0.330283671 | 1.233711342
North Carolina Digital Heritage Center | 0 | 1,476 | 260,709 | 869,203 | 8394791 | 3.333996908 | 4.591774892
Smithsonian Institution | 0 | 548 | 897,196 | 5,763,459 | 56446687 | 6.423857217 | 4.652809633
South Carolina Digital Library | 0 | 40 | 76,001 | 231,270 | 1125030 | 3.042986277 | 2.354387181
The New York Public Library | 0 | 31 | 1,169,576 | 1,996,483 | 6585169 | 1.707014337 | 1.648179106
The Portal to Texas History | 0 | 1,035 | 477,639 | 5,257,702 | 69662410 | 11.00768991 | 4.96771802
United States Government Printing Office (GPO) | 0 | 30 | 148,715 | 457,097 | 1860297 | 3.073644219 | 1.749820977
University of Illinois at Urbana-Champaign | 0 | 22 | 18,103 | 67,955 | 404383 | 3.753797713 | 2.871821391
University of Southern California. Libraries | 0 | 119 | 301,325 | 863,535 | 4626989 | 2.865792749 | 2.672589058
University of Virginia Library | 0 | 15 | 30,188 | 95,328 | 465286 | 3.157811051 | 2.332671249

The columns are as follows. min is the minimum number of subjects per record for a given Hub; Minnesota Digital Library stands out here as the only Hub that has at least one subject for each of its 40,533 items. max shows the highest number of subjects on a single record; two Hubs, The Portal to Texas History and the North Carolina Digital Heritage Center, have at least one record with over 1,000 subject headings. count is the number of records that each Hub had when the analysis was performed. sum is the total number of subject values for a given Hub; note this is not the number of unique subjects, as that information is not present in this dataset. mean shows the average number of subjects per record for each Hub, and stddev is the standard deviation around that mean. The Portal to Texas History is at the top end of the averages with 11.01 subjects per record, and the Kentucky Digital Library is at the low end with 0.20 subjects per record.
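
As a quick sanity check, mean and stddev in the table follow directly from count, sum, and sumOfSquares; the snippet below reproduces the ARTstor row using the sample-variance form that, to the best of my knowledge, Solr's StatsComponent reports.

from math import sqrt

count, total, sum_sq = 56342, 194948, 1351826  # ARTstor row: count, sum, sumOfSquares
mean = total / count                           # 3.460083...
stddev = sqrt((count * sum_sq - total * total) / (count * (count - 1)))  # 3.467169...
print(round(mean, 6), round(stddev, 6))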

The final question was whether there were differences between the Service-Hubs and the Content-Hubs; that breakdown is in the table below.

Hub Type | min | max | count | sum | sumOfSquares | mean | stddev
Content-Hub | 0 | 548 | 5,736,178 | 13,207,489 | 84723999 | 2.302489393 | 3.077118385
Service-Hub | 0 | 1,476 | 2,276,176 | 10,771,995 | 109293849 | 4.73249652 | 5.061612337

It appears that Service-Hubs have a higher number of subjects per record than Content-Hubs: more than twice as many, with 4.73 for Service-Hubs and 2.30 for Content-Hubs.

Another interesting number: 1,590,456 records contributed by Content-Hubs (28% of that collection) do not have subjects, compared to 236,811 records contributed by Service-Hubs (10%).

I think individually we can come up with reasons that these numbers differ the way they do. Where did the records come from? Were they generated as digital resource metadata records from the start, or under an existing set of practices such as AACR2 in the MARC format, and how does that change the numbers? Are there things that the DPLA does to the subjects when it normalizes them that change the way they are represented and calculated? I know that for The Portal to Texas History some of our subject strings are being split into multiple headings in order to improve retrieval within the DPLA, which inflates our numbers a bit in the tables above. I'd be interested to chat with anyone interested in this topic who has some "here's why" explanations for the numbers above.

Hit me up on Twitter if you want to chat about this.

DPLA: Announcing our third class of DPLA Community Reps

Tue, 2015-02-24 17:45

We’re extremely excited to introduce our third class of DPLA Community Reps – volunteers who engage their local communities by leading DPLA outreach activities. We were thrilled with the response to our third call for applicants, and we’re pleased to now add another roster of nearly 100 new Community Reps to our outstanding first and second classes, bringing the total number of Reps to just over 200.

Our third class continues our success at completing the U.S. map, bringing local DPLA advocacy to all 50 states, DC, and a handful of international locations. Our Reps work in K-12 education, public libraries, state libraries, municipal archives, public history and museums, technology, publishing, media, genealogy, and many areas of higher education. This third class in particular solidifies our presence in states where previously we had only one rep, including Alabama, Arkansas, Iowa, Montana, and Oklahoma. We also received a number of excellent applications from people working in museums, technology, genealogy, and K-12, so we’re excited to further these avenues for DPLA involvement and outreach. We’re eager to support this new class’s creative outreach and engagement work, and we thank them for helping us grow the DPLA community!

For more detailed information about our Reps and their plans, including the members of the third class, please visit our Meet the Reps page.

The next call for our fourth class of Reps will take place in one year (January 2016).  To learn more about this program and follow our future calls for applicants, check out our Community Reps page.

LITA: 2015 LITA Forum – Call for Proposals, Deadline Extended

Tue, 2015-02-24 17:18

The LITA Forum is a highly regarded annual event for those involved in new and leading-edge technologies in the library and information technology field. Please send your proposal submissions here by March 13, 2015, and join your colleagues in Minneapolis.

The 2015 LITA Forum Committee seeks proposals for excellent pre-conferences, concurrent sessions, and poster sessions for the 18th annual Forum of the Library Information and Technology Association, to be held in Minneapolis, Minnesota, November 12-15, 2015, at the Hyatt Regency Minneapolis. This year will feature additional programming in collaboration with LLAMA, the Library Leadership & Management Association.

The Forum Committee welcomes creative program proposals related to all types of libraries: public, school, academic, government, special, and corporate.

Proposals could relate to any of the following topics:

• Cooperation & collaboration
• Scalability and sustainability of library services and tools
• Researcher information networks
• Practical applications of linked data
• Large- and small-scale resource sharing
• User experience & users
• Library spaces (virtual or physical)
• “Big Data” — work in discovery, preservation, or documentation
• Data driven libraries or related assessment projects
• Management of technology in libraries
• Anything else that relates to library information technology

Proposals may cover projects, plans, ideas, or recent discoveries. We accept proposals on any aspect of library and information technology, even if not covered by the above list. The committee particularly invites submissions from first time presenters, library school students, and individuals from diverse backgrounds. Submit your proposal through http://bit.ly/lita-2015-proposal by March 13, 2015.

Presentations must have a technological focus and pertain to libraries. Presentations that incorporate audience participation are encouraged. The format of the presentations may include single- or multi-speaker formats, panel discussions, moderated discussions, case studies and/or demonstrations of projects.

Vendors wishing to submit a proposal should partner with a library representative who is testing/using the product.

Presenters will submit draft presentation slides and/or handouts on ALA Connect in advance of the Forum and will submit final presentation slides or electronic content (video, audio, etc.) to be made available on the web site following the event. Presenters are expected to register and participate in the Forum as attendees; discounted registration will be offered.

Please submit your proposal through http://bit.ly/lita-2015-proposal by the deadline of March 13, 2015.

More information about LITA is available from the LITA website, Facebook and Twitter. Or contact Mark Beatty, LITA Programs and Marketing Specialist at mbeatty@ala.org

David Rosenthal: Using the official Linux overlayfs

Tue, 2015-02-24 16:00
I realize that it may not be obvious exactly how to use the newly-official Linux overlayfs implementation. Below the fold, some example shell scripts that may help clarify things.

An overlayfs mount involves four directories:
  • below, a directory that, accessed via the mount, will be read-only.
  • above, a directory that, accessed via the mount, will be read-write.
  • work, a read-write directory in the same file system as the above directory.
  • mount, the directory at which the union of the entries in below and above will appear.
Let's create an example of each of these directories:
% mkdir below above work mount
%
Now we populate below and above with some files:
% for A in one two three
> do
> echo "Content of below/${A}" >below/${A}
> done
% chmod a-w below/* below
% for A in four five six
> do
> echo "Content of above/${A}" >above/${A}
> done
% ls -la below above work mount
above:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 8
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..

work:
total 8
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
%
Now we create the overlay mount and see what happens:
% OPTS="-o lowerdir=below,upperdir=above,workdir=work"
% sudo mount -t overlay ${OPTS} overlay mount
% ls -la below above work mount
above:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 32
drwxr-xr-x 1 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
%
The union of the files in the below and above directories has appeared in the mount directory. Now we add a file to the mount directory:
% echo "Content of mount/seven" >mount/seven
% ls -la below above work mount
above:
total 24
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 36
drwxr-xr-x 1 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
%
The file seven has appeared in the mount directory, and also in the above directory. They have the same content:
% cat mount/seven
Content of mount/seven
% cat above/seven
Content of mount/seven
%
Now we write to a file that is in the below directory:
% cat mount/two
Content of below/two
% echo "New content of mount/two" >mount/two
% cat mount/two
New content of mount/two
% ls -la below above work mount
above:
total 28
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
-rw-r--r-- 1 pi pi 25 Feb 18 10:42 two

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 36
drwxr-xr-x 1 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 25 Feb 18 10:42 two

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
%
A file two has appeared in the above directory which, when viewed through the overlay mount, obscures the file two in the below directory, which is still there with its original content:
% cat above/two
New content of mount/two
% cat below/two
Content of below/two
%
Now we remove a file from the overlay mount directory that is in the below directory:
% rm mount/three
% ls -la below above work mount
above:
total 28
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
c--------- 1 pi pi 0, 0 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 25 Feb 18 10:42 two

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 32
drwxr-xr-x 1 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
-rw-r--r-- 1 pi pi 25 Feb 18 10:42 two

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
%
The file three is still in the below directory, but a character special file three has appeared in the above directory that makes the file in the below directory inaccessible, a "whiteout". Now we remove the file in the below directory that we wrote to earlier:
% rm mount/two
% ls -la below above work mount
above:
total 24
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
c--------- 1 pi pi 0, 0 Feb 18 10:42 three
c--------- 1 pi pi 0, 0 Feb 18 10:42 two

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 28
drwxr-xr-x 1 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
%

The file two in the above directory has been replaced with a whiteout. Now we undo the overlayfs mount:
% sudo umount mount
% ls -la below above work mount
above:
total 24
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
c--------- 1 pi pi 0, 0 Feb 18 10:42 three
c--------- 1 pi pi 0, 0 Feb 18 10:42 two

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 8
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
% cat below/two
Content of below/two
% cat above/seven
Content of mount/seven
%
The content of the read-only below directory is, as expected, unchanged. The data that was written to the overlay mount and not removed remains in the above directory. The data that was written to the overlay mount and subsequently removed is gone. The whiteouts remain.

Now we re-assemble the mount:
% sudo mount -t overlay ${OPTS} overlay mount
% ls -la below above work mount
above:
total 24
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six
c--------- 1 pi pi 0, 0 Feb 18 10:42 three
c--------- 1 pi pi 0, 0 Feb 18 10:42 two

below:
total 20
drwxr-xr-x 2 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 three
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 two

mount:
total 28
drwxr-xr-x 1 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 five
-rw-r--r-- 1 pi pi 22 Feb 18 10:42 four
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 one
-rw-r--r-- 1 pi pi 23 Feb 18 10:42 seven
-rw-r--r-- 1 pi pi 21 Feb 18 10:42 six

work:
total 12
drwxr-xr-x 3 pi pi 4096 Feb 18 10:42 .
drwxr-xr-x 6 pi pi 4096 Feb 18 10:42 ..
d--------- 2 root root 4096 Feb 18 10:42 work
%
The state of the overlay mount is restored.

Thom Hickey: Testing date parsing by fuzzing

Tue, 2015-02-24 15:45


 Fuzz testing, or fuzzing, is a way of stress testing services by sending them potentially unexpected input data. I remember being very impressed by one of the early descriptions of testing software this way (Miller, Barton P., Louis Fredriksen, and Bryan So. 1990. "An empirical study of the reliability of UNIX utilities". Communications of the ACM. 33 (12): 32-44), but had never tried the technique.

Recently, however, Jenny Toves spent some time extending VIAF's date parsing software to handle dates associated with people in WorldCat.  As you might imagine, passing through a hundred million new date strings found some holes in the software.  While we can't guarantee that the parsing always gives the right answer, we would like to be as sure as we can that it won't blow up and cause an exception.

So, I looked into fuzzing.  Rather than sending random strings to the software, the normal techniques now used tend to generate them based on a specification or by fuzzing existing test cases.  Although we do have something close to a specification based on the regular expressions the code uses, I decided to try making changes to the date strings we have that are derived from VIAF dates.

Most frameworks for fuzzing are quite loosely coupled; typically they pass the fuzzed strings to a separate process that is being tested. Rather than do that, I read in each of the strings, did some simple transformations on it, and called the date parsing routine to see if it would cause an exception. Here's what I did for each test string, typically for as many times as the string was long; at each step the parsing routine is called (a minimal sketch of these transformations follows the list below).

  • Shuffle the string ('abc' might get replaced by 'acb')
  • Change the integer value of each character up or down (e.g. 'b' would get replaced by 'a' and then by 'c')
  • Change each character to a random Unicode character
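
A minimal Python sketch of these three transformations (not the actual code used; parse_date stands in for the real parsing routine, and details such as the random ranges are assumptions):

import random

def fuzz_variants(s):
    # 1. Shuffle the characters.
    shuffled = list(s)
    random.shuffle(shuffled)
    yield "".join(shuffled)
    # 2. Nudge each character's code point down and up by one.
    for i, c in enumerate(s):
        for delta in (-1, 1):
            yield s[:i] + chr(min(0x10FFFF, max(0, ord(c) + delta))) + s[i + 1:]
    # 3. Replace each character with a random Unicode code point.
    for i in range(len(s)):
        yield s[:i] + chr(random.randint(0x20, 0x10FFFF)) + s[i + 1:]

def stress_test(test_strings, parse_date):
    # Collect any inputs that make the parser raise; it may return a wrong
    # answer, but it should never blow up.
    failures = []
    for s in test_strings:
        for variant in fuzz_variants(s):
            try:
                parse_date(variant)
            except Exception as exc:
                failures.append((variant, exc))
    return failures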

For our 384K test strings this resulted in 1.9M fuzzed strings. This took about an hour to run on my desktop machine.

While the testing didn't find all the bugs we knew about in the code, it did manage to tickle a couple of holes in it, so I think the rather minimal time taken (less than a day) was worth it, given the confidence it gives us that the code won't blow up on strange input.

The date parsing code in GitHub will be updated soon.  Jenny is adding support for Thai dates (different calendar) and generally improving things.

Possibly the reason I thought of trying fuzzing was a post on lcamtuf's blog, Pulling JPEGs out of thin air. That post is really amazing. By instrumenting some JPEG software so that his fuzzing software could follow code paths at the assembly level, he was able to create byte strings representing valid JPEG images by sending in fuzzed strings, a truly remarkable achievement. My feeling on reading it was very similar to my reaction on reading the original UNIX testing article cited earlier.

--Th

 

Harvard Library Innovation Lab: Link roundup February 24, 2015

Tue, 2015-02-24 14:21

This is the good stuff.

Sit Down. Shut Up. Write. Don’t Stop.

Hard work and working hard consistently. That’s the thing. Not romantic sparks of inspiration.

What makes us human? Videos from the BBC.

Fun, beautifully produced, short videos on what makes us human.

The Future of the Web Is 100 Years Old

Our current version of the Web (HTTP/HTML) is just one (far and away the most successful one) in a series of webs.

“Sea Rambler” Customized Bike by Geoff McFetridge

“I can learn small things that get me to points”

Boston Button Factory – Making Buttons Since 1872

17 pound, beautiful buttons. Want.

Open Knowledge Foundation: Open Data Camp UK: Bursting out of the Open Data Bubble

Tue, 2015-02-24 13:23

“But nobody cares about Open Data”

This thought was voiced in many guises during last weekend’s Open Data Camp. Obviously it is not entirely true, as demonstrated by the 100+ people who had travelled to deepest Hampshire for the first UK camp of its kind, or the many more people involving themselves in Open Data Day activities around the world. However, the sentiment was very clear: while many of us are getting extremely excited about the potential of Open Data in areas including government, crime and health, the rest of the planet is ‘just not interested’.

As a non-technical person I’m keen to see ways that this gap can be bridged.

Open Data Camp was a 2-day unconference that aimed to let the technical and making sit alongside the story-telling and networking. There was also lots of cake!

Open Data Camp t-shirts

Open Data Board Game

After a pitch from session leaders we were left with that tricky choice about what to go for. I attended a great session led by Ellen Broad from the Open Data Institute on creating an Open Data board game. Creating a board game is no easy task but has huge potential as a way to reach out to people. Those behind the Open Data Board Game Project are keen to create something informative and collaborative which still retains elements of individual competition.

In the session we spent some time thinking about what data could underpin the game: Should it use data sets that affect most members of the general public (transport, health, crime, education – almost a replication of the national information infrastructure)? Or could there be data set bundles (think environmental related datasets that help you create your own climate co-op or food app)? Or what about sets for different levels of the game (a newbie version, a government data version)?

What became clear quite early on was that there were two ways to go with the board game idea: one was creating something that could share the merits of Open Data with new communities; the other was something (complex) that those already interested in Open Data could play. Setting out to create a game that is ‘all things to all people’ is unfortunately likely to fail.

Discussion moved away from the practicalities of board game design to engaging with ‘other people’. The observation was made that while the general public don’t care about Open Data per se, they do care about the results it brings. One concrete example given was Uber, which connects riders to drivers through an app and is now in mainstream use.

One project taking an innovative approach is Numbers that Matter. They are looking to bypass the dominant demographic (white, male, middle class, young) of technology users and focus on communities and explore with them how Open Data will affect their well-being. They’ve set out to make Open Data personal and relevant (serving the individual rather than civic-level participant). Researchers in the project began by visiting members of the general public in their own environment (so taxi drivers, hairdressers,…) and spoke to them about what problems or issues they were facing and what solutions could be delivered. The team also spent time working with neighbourhood watch schemes – these are not only organised but have a ‘way in’ with the community. Another project highlighted that is looking at making Open Data and apps meaningful for people is Citadel on the Move which aims to make it easier for citizens and application developers from across Europe to use Open Data to create the type of innovative mobile applications they want and need.

The discussion about engagement exposed some issues around trust and exploitation; ultimately people want to know where the benefits are for them. These benefits needs to be much clearer and articulated better. Tools like Open Food Facts, a database of food products from the entire world, do this well: “we can help you identify products that contain the ingredient you are allergic to“.

Saturday’s unconference board

“Data is interesting in opposition to power”

Keeping with the theme of community engagement, I attended a session led by RnR Organisation, who support grassroots and minority cultural groups in changing, developing and enhancing their skills in governance, strategic development, operational and project management, and funding. They used the recent Release of Data fund, which targets the release of specific datasets prioritised by the Open Data User Group, to support the development of a Birmingham Data and Skills Hub. However, their training sessions (on areas including data visualization and the use of Tableau and Google Fusion Tables) have not generated much interest, and on reflection they now realise that they pitched too high.

Open Data understanding and recognition is clearly part of a broader portfolio of data literacy needs that begins with tools like Excel and Wikipedia. RnR work has identified 3 key needs of 3rd sector orgs: data and analysis skills; data to learn and improve activities; and measurement of impacts.

Among the group, some observations were made on the use of data by community groups, including the need for timely data (“you need to show people today”) and relevant information driven by community needs (“nobody cares about Open Data but they do care about stopping bad things from happening in their area”). An example cited was a project to stop the go-ahead of a bypass in Hereford, for which they specifically needed GIS data. One person remarked that “data is interesting in opposition to power”, and we have a role to support here. Other questions raised related to the different needs of communities of geography and communities of interest. Issues like the longevity of data also come into play: Armchair Auditor is a way to quickly find out where the Isle of Wight council has been spending money, but unfortunately a change in formats by the council has resulted in the site being compromised.

Sessions were illustrated by Drawnalism

What is data literacy?

Nicely following on from these discussions, a session later in the day looked at data literacy. The idea was inspired by an Open Data 4 Development research project led by Mark Frank and Johanna Walker (University of Southampton), in which they discovered that even technically literate individuals still found Open Data challenging to understand. The session ended up producing a series of questions: What exactly is data literacy? Is it a homogeneous set of expertise (e.g. finding data), or is the context everything? Are there many approaches (such as those suggested in the Open Data cook book), or is there a definitive guide such as the Open Data Handbook, or a step-by-step way to learn such as through School of Data? Is the main issue asking the right questions? Is there a difference between data literacy and data fluency? Are there two types of specialism: domain specialism and computer expertise? And can you offset a lack of data expertise with better-designed data?

The few answers seemed to emerge through analogies. Maybe data literacy is like traditional literacy – it is essential to all, and it is everyone’s job to make it happen (a collaboration between parents and teachers). Or maybe it is more like plumbing – having some understanding can help you assess situations, but you often end up bringing in an expert. Then again it could be more like politics or PSHE – it enables you to interact with the world and understand the bigger picture. The main conclusion from the session was that it is the responsibility of everyone in the room to be an advocate and explainer of Open Data!

“Backbone of information for the UK”

The final session I attended was an informative introduction to the National Information Infrastructure (NII), an iterative framework that lists strategically important data and documents the services that provide access to the data and connect it to other data. It is intended as the “backbone of information” for the UK, rather as the rail and road networks cater for transport. The NII team began work by carrying out a data inventory, followed by analysis of the quality of the data available. Much decision making has used the concept of “data that is of strategic value to the country” – a type of ‘core reference data’. Future work will involve thinking about what plan the country needs to put into play to support this core data. Does being part of the NII protect data? Does the requirement for a particular data set compel release? More recently there has been engagement with the Open Data User Group / transparency board / ODI / Open Knowledge and beyond to understand what people are using and why; this may prioritise release.

It seems that at this moment the NII is too insular, it may need to break free from consideration of just publicly owned data and begin to consider privately owned data not owned by the government (e.g. Ordnance Survey data). Also how can best practices be shared? The Local Government Association are creating some templates for use here but there is scope for more activity.

With event organiser Mark Braggins

Unfortunately I could only attend one day of Open Data Camp and there was way too much for one person to take in anyway! For more highlights read the Open Data Camp blog posts or see summaries of the event on Conferieze and Eventifier. The good news is that with the right funding and good will the Open Data Camp will become an annual roving event.

Where did people come from?

Terry Reese: MarcEdit 6 Update

Tue, 2015-02-24 05:38

A new version of MarcEdit has been made available.  The update includes the following changes:

  • Bug Fix: Export Tab Delimited Records: When working with control data, if a position is requested that doesn’t exist, the process crashes.  This behavior has been changed so that a missing position results in a blank delimited field (as is the case if a field or field/subfield isn’t present).
  • Bug Fix: Task List — Corrected a couple reported issues related to display and editing of tasks.
  • Enhancement: RDA Helper — Abbreviations have been updated so that users can select the fields in which abbreviation expansion occurs.
  • Enhancement: Linked Data Tool — I’ve vastly improved the process by which items are linked. 
  • Enhancement: Improved VIAF Linking — thanks to Ralph LeVan for pointing me in the right direction to get more precise matching.
  • Enhancement: Linked Data Tool — I’ve added the ability to select the index from VIAF to link to.  By default, LC (NACO) is selected.
  • Enhancement: Task Lists — Added the Linked Data Tool to the Task Lists
  • Enhancement: MarcEditor — Added the Linked Data Tool as a new function.
  • Improvements: Validate ISBNs — Added some performance enhancements and finished working on some code that should make it easier to begin checking remote services to see if an ISBN is not just valid (structurally) but actually assigned (a generic sketch of the structural check follows this list).
  • Enhancement: Linked Data Component — I’ve separated out the linked data logic into a new MarcEdit component.  This is being done so that I can work on exposing the API for anyone interested in using it.
  • Informational: Current version of MarcEdit has been tested against MONO 3.12.0 for Linux and Mac.
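
As an aside, "structurally valid" here just means the length and check digit work out; a generic ISBN-13 check-digit sketch (illustrative only, not MarcEdit's code) might look like this:

def isbn13_is_structurally_valid(isbn):
    # Structural check only: correct length and check digit. It says nothing
    # about whether the ISBN has actually been assigned by an agency.
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

print(isbn13_is_structurally_valid("978-0-306-40615-7"))  # True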

Linked Data Tool Improvements:

A couple specific notes of interest around the linked data tool.  First, over the past few weeks, I’ve been collecting instances where id.loc.gov and viaf have been providing back results that were not optimal.  On the VIAF side, some of that was related to the indexes being queried, some of it relates to how queries are made and executed.  I’ve done a fair bit of work added some additional data checks to ensure that links occur correctly.  At the same time, there is one known issue that I wasn’t able to correct while working with id.loc.gov, and that is around deprecated headings.  id.loc.gov currently provides no information within any metadata provided through the service that relates a deprecated item to the current preferred heading.  This is something I’m waiting for LC to correct.

To improve the Linked Data Tool, I’ve added the ability to query by specific index.  By default, the tool will default to LC (NACO), but users can select from a wide range of vocabularies (including, querying all the vocabularies at once).  The new screen for the Linked Data tool looks like the following:

In addition to the changes to the Linked Data Tool – I’ve also integrated the Linked Data Tool with the MarcEditor:

And within the Task Manager:

The idea behind these improvements is to allow users the ability to integrate data linking into normal cataloging workflows – or at least start testing how these changes might impact local workflows.

Downloads:

You can download the current version by utilizing MarcEdit’s automatic update within the Help menu, or by going to http://marcedit.reeset.net/downloads.html and downloading it there.

–tr

SearchHub: Parsing and Indexing Multi-Value Fields in Lucidworks Fusion

Mon, 2015-02-23 21:14
Recently, I was helping a client move from a pure Apache Solr implementation to Lucidworks Fusion. Part of this effort entailed the recreation of indexing processes (implemented using Solr’s REST APIs) in the Fusion environment, taking advantage of Indexing Pipelines to decouple the required ETL from Solr and provide reusable components for future processing.

One particular feature that was heavily used in the previous environment was the definition of “field separator” in the REST API calls to the Solr UpdateCSV request handler. For example:

curl "http://localhost:8888/solr/collection1/update/csv/?commit=true&f.street_names.split=true&f.street_names.separator=%0D" --data-binary @input.csv -H 'Content-type:text/plain; charset=utf-8'

The curl command above posts a CSV file to the /update/csv request handler, with the request parameters "f.aliases.split=true" and "f.aliases.separator=%0D" identifying the field in column “aliases” as a multi-value field, with the character “\r” separating the values (%0D is a mechanism for escaping the carriage return by providing the hexadecimal ASCII code that represents “\r”). This provided a convenient way to parse and index multi-value fields that had been stored as delimited strings. Further information about this parameter can be found here.

After investigating possible approaches to this in Fusion, it was determined that the most straightforward way to accomplish this (and provide a measure of flexibility and reusability) was to create an index pipeline with a JavaScript stage.

The Index Pipeline

Index pipelines are a framework for plugging together a series of atomic steps, called “stages,” that can dynamically manipulate documents flowing in during indexing. Pipelines can be created through the admin console by clicking “Pipelines” in the left menu, then entering a unique and arbitrary ID and clicking “Add Pipeline” – see below.

After creating your pipeline, you’ll need to add stages. In our case, we have a fairly simple pipeline with only two stages: a JavaScript stage and a Solr Indexer stage. Each stage has its own context and properties and is executed in the configured order by Fusion. Since this post is about document manipulation, I won’t go into detail regarding the Solr Indexer stage; you can find more information about it here. Below is our pipeline with its two new stages, configured so that the JavaScript stage executes before the Solr Indexer stage.

The JavaScript Index Stage

We chose a JavaScript stage for our approach, which gave us the ability to directly manipulate every document indexed via standard JavaScript – an extremely powerful and convenient approach. The JavaScript stage has four properties:
  • “Skip This Stage” – a flag indicating whether this stage should be executed
  • “Label” – an optional property that allows you to assign a friendly name to the stage
  • “Conditional Script” – JavaScript that executes before any other code and must return true or false.  Provides a mechanism for filtering documents processed by this stage; if false is returned, the stage is skipped
  • “Script Body” – required; the JavaScript that executes for each document indexed (where the script in “Conditional Script,” if present, returned true)
Below is our JavaScript stage, with “Skip This Stage” set to false and labeled “JavaScript_Split_Fields.”

Our goal is to split a field (called “aliases”) in each document on a carriage return (CTRL-M). To do this, we’ll define a “Script Body” containing JavaScript that checks each document for the presence of a particular field and, if present, splits it and assigns the resulting values to that field.

The function defined in “Script Body” can take one argument (doc, the pipeline document) or two (doc and _context, the pipeline context maintained by Fusion). Since this stage is the first to be executed in the pipeline, and there are no custom variables to be passed to the next stage, the function only requires the doc argument. The function is then obligated to return doc (if the document is to be indexed) or null (if it should not be indexed); in actuality we’ll never return null, as the purpose of this stage is to manipulate documents, not determine whether they should be indexed.

function (doc) {
  return doc;
}

Now that we have a reference to the document, we can check for the field “aliases” and split it accordingly. Once it’s split, we need to remove the previous value and add the new values to “aliases” (which is defined as a multi-value field in our Solr schema). Here’s the final code:

function (doc) {
  var f_aliases = doc.getFirstField("aliases");
  if (f_aliases != null) {
    var v_aliases = f_aliases.value;
  } else {
    var v_aliases = null;
  }
  if (v_aliases != null) {
    doc.removeFields("aliases");
    var aliases = v_aliases.split("\r");
    for (var i = 0; i < aliases.length; i++) {
      doc.addField('aliases', aliases[i]);
    }
  }
  return doc;
}

Click “Save Changes” at the bottom of this stage’s properties.

Bringing It All Together

Now we have an index pipeline – let’s see how it works! Fusion does not tie an index pipeline to a specific collection or datasource, allowing for easy reusability. We’ll need to associate this pipeline with the datasource from which we’re retrieving “aliases,” which is accomplished by setting the pipeline ID in that datasource to point to our newly created pipeline. Save your changes, and the next time you start indexing that datasource your index pipeline will be executed.

You can debug your JavaScript stage by taking advantage of the “Pipeline Result Preview” pane, which allows you to test your code against static data right in the browser. Additionally, you can add log statements to the JavaScript by calling a method on the logger object, to which your stage already has a handle. For example:

logger.debug("This is a debug message");

will write a debug-level log message to <fusion home>/logs/connector/connector.log. By combining these two approaches, you should be able to quickly determine the root cause of any issues encountered.

A Final Note

You have probably already recognized the potential afforded by the JavaScript stage; Lucidworks calls it a “Swiss Army knife.” In addition to allowing you to execute any ECMA-compliant JavaScript, you can import Java libraries, allowing you to utilize custom Java code within the stage and opening up a myriad of possible solutions. The JavaScript stage is a powerful tool for any pipeline!

About the Author

Sean Mare is a technologist with over 18 years of experience in enterprise application design and development. As Solution Architect with Knowledgent Group Inc., a leading Big Data and Analytics consulting organization and partner with Lucidworks, he leverages the power of enterprise search to enable people and organizations to explore their data in exciting and interesting ways. He resides in the greater New York City area.

The post Parsing and Indexing Multi-Value Fields in Lucidworks Fusion appeared first on Lucidworks.

District Dispatch: Registration opens for the 16th annual Natl. Freedom of Information Day Conference

Mon, 2015-02-23 20:38

Registration is now open for the 16th annual National Freedom of Information (FOI) Day Conference, which will be held on Friday, March 13, 2015, at the Newseum in Washington, D.C. The annual FOI Day Conference is hosted by the Newseum Institute’s First Amendment Center in cooperation with OpenTheGovernment.org and the American Library Association (ALA). The event brings together access advocates, government officials, judges, lawyers, librarians, journalists, educators and others to discuss timely issues related to transparency in government and freedom of information laws and practices.

Madison Award Awardee Patrice McDermott

This year’s program will feature a discussion of the first ten years of the “Sunshine Week” national open records initiative, presented by the Reporters Committee for Freedom of the Press and the American Society of News Editors. Additionally, the event will include a preview of a major reporting package from The Associated Press, McClatchy and Gannett/USA Today on a decade of open government activity. Miriam Nisbet, former director of the National Archives’ Office of Government Information Services, will address attendees at the FOI Day Conference.

During the event, ALA will announce this year’s recipient of the James Madison Award, which is presented annually to individuals or groups that have championed, protected and promoted public access to government information and the public’s right to know. Previous Madison Award recipients include Internet activist Aaron Swartz, Representative Zoe Lofgren (D-CA) and the Government Printing Office. ALA Incoming President Sari Feldman will present the Madison Award this year.

The program is open to the public, but seating is limited. To reserve a seat, please contact Ashlie Hampton, at 202-292-6288, or ahampton[at]newseum[dot]org. The program will be streamed “live” at www.newseum.org.

The post Registration opens for the 16th annual Natl. Freedom of Information Day Conference appeared first on District Dispatch.

Nicole Engard: Bookmarks for February 23, 2015

Mon, 2015-02-23 20:30

Today I found the following resources and bookmarked them:

  • codebender: Online development & collaboration platform for Arduino users, makers and engineers


The post Bookmarks for February 23, 2015 appeared first on What I Learned Today....


Islandora: Looking Back at iCampBC

Mon, 2015-02-23 19:48

Last week we took Islandora Camp to Vancouver, BC, for the first time, and it was pretty awesome. 40 attendees and instructors came together for three days to talk about, use, and generally show off what's new with Islandora. We were joined mostly by folks from around BC itself, with attendees from Simon Fraser University, the University of Northern British Columbia, Vancouver Public Library, Prince Rupert Library (one of the very first Islandora sites in the world!), Emily Carr University of Art and Design, and the University of British Columbia.

iCampBC featured the largest-ever Admin Track, with 29 of us coming together to build our own demo Islandora sites on a camp Virtual Machine. With the help of my co-instructor, Erin Tripp from discoverygarden, we made collections, played with forms, and built some very nice Solr displays for cat pictures (and the occasional dog). While we were sorting out the front end of Islandora, the Developer Track, with instructors Mark Jordan and Mitch MacKenzie, was digging into the code and ended up developing a new demo module, Islandora Weather, which you can use to display the current weather for locations described in the metadata of an Islandora object (should you ever need to do that...)

For sessions, we had a great panel on Digital Humanities, within and outside of Islandora, featuring Mark Leggott (UPEI), Karyn Huenemann (SFU), Mimi Lam (UBC), and Rebecca Dowson (SFU). SFU's Alex Garnett and Carla Graebner took us on a tour of the tools that SFU has built to manage research data in Islandora. Justin Simpson from Artefactual showed us how Islandora and Archivematica can play together in Archidora. Slides for these and most other presentations are available via the conference schedule.

Camp Awards were handed out on the last day (and tuques were earned via trivia. Can you name three Interest Groups without looking it up?). A few highlights:

  • So Many Lizards Award: Ashok Modi for all of his rescues in the Dev track
  • Camp Kaleidoscope Award: Kay Cahill, for rocking at least three laptops at one point
  • Camp Mojo Award: Karyn Huenemann, for infectious enthusiasm

Thank you to all of our attendees for making this camp such a success. We hope some of you will join us for our next big event, the first Islandora Conference, this summer in PEI.

We had a feeling it was going to be a good week when we east-coast camp instructors got to leave behind this:

And were greeted by this:

Seriously, Vancouver. You know you're supposed to be in Canada too, right?

Journal of Web Librarianship: Use and Usability of a Discovery Tool in an Academic Library

Mon, 2015-02-23 17:18
DOI: 10.1080/19322909.2014.983259
Scott Hanrath

Library of Congress: The Signal: Introducing the Federal Web Archiving Working Group

Mon, 2015-02-23 16:07

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

View of Library of Congress from U.S. Capitol dome in winter. Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

“Publishing of federal information on government web sites is orders of magnitude more than was previously published in print.  Having GPO, NARA and the Library, and eventually other agencies, working collaboratively to acquire and provide access to these materials will collectively result in more information being available for users and will accomplish this more efficiently.” – Mark Sweeney, Associate Librarian for Library Services, Library of Congress.

“Harvesting born-digital agency web content, making it discoverable, building digital collections, and preserving them for future generations all fulfill the Government Publishing Office’s mission, Keeping America Informed. We are pleased to be partnering with the Library and NARA to get this important project off the ground. Networking and collaboration will be key to our success government-wide.” – Mary Alice Baish, Superintendent of Documents, Government Publishing Office.

“The Congressional Web Harvest is an invaluable tool for preserving Congress’ web presence. The National Archives first captured Congressional web content for the 109th Congress in 2006, and has covered every Congress since, making more than 25 TB of content publicly available at webharvest.gov. This important resource chronicles Congress’ increased use of the web to communicate with constituents and the wider public. We value this collaboration with our partners at the Government Publishing Office and the Library of Congress, and look forward to sharing our results with the greater web archiving community.” – Richard Hunt, Director of the Center for Legislative Archives, National Archives and Records Administration.

Today most information that federal government agencies produce is created in electronic format and disseminated over the World Wide Web. Few federal agencies have any legal obligation to preserve web content that they produce long-term and few deposit such content with the Government Publishing Office or the National Archives and Records Administration – such materials are vulnerable to being lost.

Exterior of Government Printing Office I [Today known as the Government Publishing Office]. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

How much information are we talking about? Just quantifying an answer to that question turns out to be a daunting task. James Jacobs, Data Services Librarian Emeritus, University of California, San Diego, prepared a background paper (PDF) looking at the problem of digital government information for the Center for Research Libraries for the “Leviathan: Libraries and Government in the Age of Big Data” conference organized in April 2014:

The most accurate count we currently have is probably from the 2008 “end of term crawl.” It attempted to capture a snapshot “archive” of “the U.S. federal government Web presence” and, in doing so, revealed the broader scope of the location of government information on the web. It found 14,338 .gov websites and 1,677 .mil websites. These numbers are certainly a more comprehensive count than the official GSA list and more accurate as a count of websites than the ISC count of domains. The crawl also included government information on sites that are not .gov or .mil. It found 29,798 .org, 13,856 .edu, and 57,873 .com websites that it classified as part of the federal web presence. Using these crawl figures, the federal government published information on 135,215 websites in 2008.

In other words, a sea of information in 2008, and still more now in 2015.
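
The raw tallying behind numbers like these is easy to sketch (deciding which .com or .org sites count as part of the federal web presence is the hard part): group the hosts in a crawl’s host or seed list by top-level domain and count them. The following is a minimal illustration under that assumption, not the End of Term project’s actual tooling; eot_hosts.txt stands in for a hypothetical host list with one hostname or URL per line.

```python
from collections import Counter
from urllib.parse import urlparse

def tld_counts(path):
    """Tally hosts by top-level domain (.gov, .mil, .org, ...)."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            host = line.strip().lower()
            if not host:
                continue
            if "://" in host:          # accept full URLs as well as bare hostnames
                host = urlparse(host).netloc
            host = host.split(":")[0]  # drop any port number
            counts[host.rsplit(".", 1)[-1]] += 1
    return counts

if __name__ == "__main__":
    counts = tld_counts("eot_hosts.txt")  # hypothetical host list, one entry per line
    for tld, n in counts.most_common():
        print(".{}\t{}".format(tld, n))
    print("total\t{}".format(sum(counts.values())))
```

A tally like this is, of course, only as good as the host list it is fed.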

A function of government is to provide information to its citizens through publishing and to preserve some selected portion of these publications. Clearly some (if not most) .gov web sites are “government publications,” and the U.S. federal government puts out information on .mil, .com, and other domains as well. Which government agencies are archiving federal government sites for future research on a regular basis? And why? To what extent?

In part inspired by discussions at last year’s Leviathan conference, and in part fulfilling earlier conversations, managers and staff of three federal agencies that each selectively harvest federal web sites (the Government Publishing Office, the National Archives and Records Administration, and the Library of Congress) decided to start meeting and talking on a regular basis.

Managers and staff involved in web archiving at these three agencies have now met five times and plan to continue meeting monthly through the remainder of 2015. At the most recent meeting we added a representative from the National Library of Medicine. So far we have been learning what each agency is doing to harvest and provide access to federal web sites, and why – whether because of a legal mandate or because of other collection development policies. We expect to involve representatives of other federal agencies as appropriate over time.

Entrance of National Archives on Constitution Ave. side I. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

So far one thing we have agreed on is that we have enjoyed our meetings – the world of web archiving is a small one, and sharing our experiences with each other turns out to be both productive and pleasant. Now that we better understand what we are all doing individually and collectively, we can discuss how to make the aggregate of our efforts more efficient and effective going forward, for example by reducing duplication of effort.

And that’s the kind of thing we hope comes out of this – a shared collective development strategy, even if only an informal one. The following are some specific activities we are looking at:

  • Developing and describing web archiving best practices for federal agencies, a web archiving “FADGI” (Federal Agencies Digitization Guidelines Initiative), which could also be of interest to others outside the federal agency community.
  • Investigating common metrics for measuring use of our web archives of federal sites.
  • Establishing outreach to the federal agency staff members who create the sites, in order to improve our ability to harvest them.
  • Understanding what federal agencies are doing (those that are doing anything) to archive their own sites, and how that work can be integrated with our efforts.
  • Maintaining a seed list of federal URLs and a record of who is archiving what (and which sites are not being harvested); see the sketch of one possible format after this list.
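
On that last point, as one purely illustrative possibility (not an adopted standard), a seed list could be kept as a simple shared CSV. In the sketch below, the file name federal_seed_list.csv, the column names, and the example rows are all invented for demonstration.

```python
import csv

# Illustrative columns only; not an agreed-upon standard.
FIELDS = ["url", "archiving_agency", "last_capture", "notes"]

# Invented example rows for demonstration purposes.
seeds = [
    {"url": "http://www.gpo.gov/", "archiving_agency": "GPO",
     "last_capture": "2015-01-15", "notes": ""},
    {"url": "http://agency.example.gov/", "archiving_agency": "",
     "last_capture": "", "notes": "no one harvesting this yet"},
]

# Write the seed list out as CSV so it can be shared among agencies.
with open("federal_seed_list.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(seeds)

# Report the seeds that no agency has claimed.
unclaimed = [s["url"] for s in seeds if not s["archiving_agency"]]
print("Sites not being harvested:", unclaimed)
```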

As the work progresses we look forward to communicating via blog posts and other means about what we accomplish. We hope to hear from you, via the comments on blog posts like this one, with your questions or ideas.

District Dispatch: Last call: Comment on draft national policy agenda for libraries by 2-27!

Mon, 2015-02-23 16:07

Amid the hundreds of powerful connections and conversations that took place at the 2015 American Library Association (ALA) Midwinter Meeting, librarians of all backgrounds began commenting on a draft national policy agenda for libraries. They asked how libraries can secure additional funding at a time of government budget cuts. Several noted and appreciated the inclusion of federal libraries, and most specifically welcomed the premise of all libraries being linked together into a national infrastructure. Many also saw potential for the national agenda to serve as a template for state- and local-level policy advocacy.

The draft agenda is the first step towards answering the questions “What are the U.S. library interests and priorities for the next five years that should be emphasized to national decision makers?” and “Where might there be windows of opportunity to advance a particular priority at this particular time?”

This work of outlining key issues and proposals is being pursued through the Policy Revolution! Initiative, led by the ALA Office for Information Technology Policy (OITP) and the Chief Officers of State Library Agencies (COSLA). A Library Advisory Committee, which includes broad representation from across the library community, provides overall guidance to the national effort. The three-year initiative, funded by the Bill & Melinda Gates Foundation, has three major elements: to develop a national public policy agenda, to initiate and deepen national stakeholder interactions based on policy priorities, and to build library advocacy capacity for the long term.

“We are asking big questions, and I’m really encouraged by the insightful feedback we’ve received in face-to-face meetings, emails and letters,” said OITP Director Alan Inouye. “I hope more people will share their perspectives and aspirations for building the capacity libraries of all kinds need to achieve shared national policy goals.”

The current round of public input closes Friday, February 27, 2015. Send your comments, questions and recommendations now to the project team at oitp[at]alawash[dot]org.

The draft agenda provides an umbrella of timely policy priorities and is understood to be too extensive to serve as the single policy agenda for any given entity in the community. Rather, the goal is that various library entities and their members can fashion their national policy priorities under the rubric of this national public policy agenda.

From this foundation, the ALA Washington Office will match priorities to windows of opportunity and confluence, and will begin advancing them in mid-2015, in partnership with other library organizations and allies with whom there is alignment.

“In a time of increasing competition for resources and challenges to fulfilling our core missions, libraries and library organizations must come together to advocate proactively and strategically,” said COSLA President Kendall Wiggin. “Sustainable libraries are essential to sustainable communities.”

The post Last call: Comment on draft national policy agenda for libraries by 2-27! appeared first on District Dispatch.

DPLA: Let’s Talk about Ebooks

Mon, 2015-02-23 15:45

Books are among the richest artifacts of human culture. In the last half-millennium, we have written over a hundred million of them globally, and within their pages lie incredibly diverse forms of literature, history, and science, poetry and prose, the sacred and the profane. Thanks to our many partners, the Digital Public Library of America already contains over two million ebooks, fully open and free to read.

But we have felt since DPLA’s inception that even with the extent of our ebook collection, we could be doing much more to connect the public, in more frictionless ways, with the books they wish to read. It is no secret that the current landscape for ebooks is rocky and in many ways inhospitable to libraries and readers. Ebook apps are often complicated for new users, and the selection of ebooks is a mere fraction of what is on the physical shelves. To their credit, publishers have become more open recently to sharing books through library apps and other digital platforms, but pricing, restrictions, and the availability of titles still vary widely.

At the same time, new models for provisioning ebooks are arising from within the library community. In Colorado, Arizona, Massachusetts, and Connecticut, among other places, libraries and library consortia are exploring ways to expand their e-collections. Some are focusing on books of great local interest, such as genre writers within their areas or biographies of important state figures; others are working with small and independent publishers to provide a wider market for their works; and still others are attempting to recast the economics of ebook purchasing to the benefit of readers and libraries as well as publishers and authors through bulk purchases. Moreover, new initiatives such as the recent push from the National Endowment for the Humanities and the Andrew W. Mellon Foundation to open access to existing works, and the Authors Alliance, which is helping authors to regain their book rights, offer new avenues for books to be made freely available.

At the DPLA, we are particularly enthusiastic about the role that our large and expanding national network of hubs can play. Many of our service hubs have already scanned books from their regions, and are generously sharing them through DPLA. Public domain works are being aggregated by content hubs such as HathiTrust, with more coming online every month. It is clear that we can bring these threads together to create a richer, broader tapestry of ebooks for readers of all ages and interests.

That’s why we’re delighted to announce today that we have received generous funding from the Alfred P. Sloan Foundation to start an in-depth discussion about ebooks and their future, and what DPLA and our partners can do to help push things forward. Along with the New York Public Library, a leader in library technology and services, we plan to intensify the discussions we have already been having with publishers, authors, libraries, and the public about how to connect the maximal number of ebooks with the maximal number of readers.

This conversation will be one of the central events at DPLAfest. If you haven’t registered for the fest yet, this is your call to join us in Indianapolis on April 17-18 to kick off this conversation about the future of ebooks. It is a critical discussion, and we welcome all ideas and viewpoints. We look forward to hearing your thoughts about ebooks in Indy, and in other discussions throughout 2015.
