Feed aggregator

Library of Congress: The Signal: Unlocking the Imagery of 500 Years of Books

planet code4lib - Mon, 2014-12-22 20:03

The following is a guest post by Kalev H. Leetaru of Georgetown University (Former), Robert Miller of Internet Archive and David A. Shamma from Yahoo Labs/Flickr.

In 1994, linguist Geoff Nunberg stated, in an article in the journal “Representations,” “reading what people have had to say about the future of knowledge in an electronic world, you sometimes have the picture of somebody holding all the books in the library by their spines and shaking them until the sentences fall out loose in space…” What would these fragments look like if you took every page of every book from 2.5 million volumes dating back over 500 years? Could every illustration, drawing, chart, map, photograph and image be extracted, indexed and displayed? That was the question that launched the Internet Archive’s Book Images Project to catalog the imagery from half a millennium of books.

Over 14.7 million images were extracted from over 600 million pages covering an enormous variety of topics and stretching back to the year 1500. Yet, perhaps what is most remarkable about this montage is that these images come not from some newly-unearthed archive being seen for the first time, but rather from the books we have been digitizing for the past decade that have been resting in our digital libraries.

The history of book digitization has focused on creating text-based searchable collections–the identification of tens of millions of images on the pages of those books has historically been regarded as merely a byproduct of the digitization process. We inverted that model, reimagining books as containers of images rather than text. We then explored how digital libraries can yield rich visual collections by using modern image recognition technology coupled with Flickr to ultimately create one of the largest visual book collections in history. This involved automatically identifying visual content, cropping it out, extracting surrounding metadata via optical character recognition and uploading and indexing the structured data to Flickr. In effect, we are repurposing the vast archives of digital content created for text search and transforming it into a visual gallery of imagery. In doing so, we are creating a new way of “seeing” our cultural history.


Albert Bierstadt, “Among the Sierra Nevada Mountains” courtesy of Wikimedia.

How does one go about creating an archive of images spanning over 500 years? After seeing Albert Bierstadt’s “Among the Sierra Nevada Mountains, California” Kalev Leetaru (a co-author of this post) became curious as to how the American West of the nineteenth century was portrayed in literature–in particular how it was portrayed in imagery rather than words. Yet, searches in various digital libraries for Albert Bierstadt’s name yielded almost exclusively textual descriptions of his artwork, not photographs or illustrations – the same was true for searches of other nineteenth century technologies like the telegraph, the telephone, and the railroad.

Current digital libraries are designed to search for a word or phrase and return a list of pages mentioning it. The notion of searching for images appearing beside a word or phrase is simply not supported. While the concept of “image search” has become ingrained in our daily lives, it has largely remained a tool exclusively for searching the web, rather than other modalities like the printed word. As we’ve sought to find structure online, we have seemingly overlooked the inherent knowledge within printed books. The call was clear: expand beyond a simple search box query and unlock the imagery of the world’s books. With this call in hand, Leetaru approached the Internet Archive, as one of the world’s largest open collections of digitized books, for permission to use their vast collection of over 600 million pages of digitized books stretching back half a millennium, and Flickr, as one of the largest online image services, to host the final collection of images in an interactive and searchable form.

A System Prototype

Historically, PDF versions of scanned books were created as what is called “image over text” files: the scanned image of each page is displayed with the OCR text hidden underneath. This works well for desktops and laptops with their unlimited storage and bandwidth, but can easily generate files in the tens or hundreds of megabytes. The rise of eReader devices, with their limited storage and processing capacities, necessitated more optimized file formats that save books as ASCII text and extract each visual element as a separate image file. Many digital libraries, including the Internet Archive, make their books available this way in the open EPUB file format, which is essentially a compressed ZIP file containing a set of HTML pages and the book’s images as JPG, PNG or GIF files. Extracting the images from each book is therefore as simple as unzipping its EPUB file, saving the images to disk, and searching the HTML pages to locate where each image appears in the book to extract the text surrounding it.
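As a rough illustration of how little machinery the EPUB route needs, here is a minimal Python sketch that lists the image files inside an EPUB using only the standard library. It is not the project's actual code: the function names, the demo archive and its file layout are all made up for the example, and a real pipeline would also read the HTML pages to find the text around each image.

```python
import io
import zipfile
from pathlib import PurePosixPath

# File extensions named in the post (JPG, PNG, GIF); ".jpeg" is added as an assumption.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def list_epub_images(epub_bytes):
    """Return the names of image files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            if PurePosixPath(name).suffix.lower() in IMAGE_EXTENSIONS
        ]


def build_demo_epub():
    """Build a tiny, hypothetical EPUB-like ZIP in memory for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "OEBPS/page1.html",
            "<p>A view of the valley.</p><img src='images/fig1.jpg'/>",
        )
        archive.writestr("OEBPS/images/fig1.jpg", b"fake image bytes")
    return buffer.getvalue()


demo_images = list_epub_images(build_demo_epub())
print(demo_images)
```

Working from bytes in memory keeps the example self-contained; with a file on disk, `zipfile.ZipFile(path)` would work the same way.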

The simplicity here is what makes it so powerful. The hard part of creating an image gallery from books lies in the image recognition process needed to identify and extract each image from the page, yet this task is already performed in the creation of the EPUB files. This also means that this process can be easily repeated for any digital library that offers EPUB versions of their works, making it possible to one day create a single master repository of every image published in every book ever digitized. The entire processing pipeline was performed on a single four-processor virtual machine in the Internet Archive’s Virtual Reading Room. While all of the books used are available for public download on the Archive’s website, using the Virtual Reading Room made it possible to work much more easily with the Archive’s collections, dramatically reducing the time it took to complete the project.

Creating the Final Archive

The chief limitation of this prototype solution was that the images in EPUB files are downsampled to minimize storage space and processing time on the small portable devices they are designed for. Creating a gallery of rich high-resolution imagery from the world’s books requires returning to the original raw page scan imagery and using the OCR results to identify and crop each image from those scans.

From the beginning, it was decided to leverage the existing OCR results for each book instead of trying to develop new algorithms for identifying images from scanned book pages. It is hard to improve upon the accuracy of OCR software designed solely for the purpose of identifying text and images from scanned pages and developed by companies with large research and development staffs focusing exclusively on this very topic. In addition, few libraries have large computational clusters capable of running sophisticated image processing software, with OCR either outsourced to a vendor or run in scheduled fashion on dedicated OCR servers. By using the existing OCR results, the most computationally-intensive portion of the process is simply reused and all that remains is to crop the images from the page scans using the OCR results.

The final system uses the raw output of the Abbyy OCR software that the Internet Archive runs over each book, coupled with the scanning data produced when the book was originally digitized, to identify and extract the images at the original digitization resolution. As part of the OCR process to extract searchable text from each page, the OCR software identifies images on each page and sets them aside. Similar to the process used to create the EPUB files, this image information was used to extract each of these images to create the final gallery, along with the text surrounding each image.

The process begins by downloading from the Internet Archive the master list of all files available for a book, including its original page scan images and Abbyy OCR file. The original page scan imagery (JPEG2000, JPEG or TIFF format) and the OCR’s XML output must be present or the book is skipped. Next, the “scandata.xml” file (which contains a list of all pages in the book and their status) is examined to locate page scans that were flagged by the human operator of the book scanning system for exclusion. An example might be a bad page scan where the page was not in proper position for scanning and thus was rescanned. In this case, instead of being deleted, the bad scan images are simply flagged in the scandata.xml file. Similarly, to ensure proper color calibration, a “color card” and ruler are photographed before and after each book is scanned (and sometimes periodically at random intervals throughout). These frames are also flagged in the XML file for exclusion. These page scans were dropped before the OCR software was run on the book by the Archive and thus must be identified to align the page numbering.

Next, the Abbyy OCR XML file is examined to locate each image in the book and its surrounding text. This file has an enormous wealth of information calculated by the OCR software. It breaks each page into a series of paragraphs, each paragraph into lines, and each line into individual characters. It also provides a confidence measure on how “sure” it is of each individual letter. Each region, line, character and image includes “l,t,r,b” parameters yielding its “left,” “top,” “right” and “bottom” coordinates in pixels within the original page scan image. Thus, to extract each image from a book, one simply scans the XML file for the “image” regions, reads its coordinate and crops this region from the original page scan imagery – no further analysis is necessary. The text surrounding each image is also extracted from the OCR file to enable keyword searching of the images by their context.
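The scan-and-crop step can be sketched in a few lines of Python. The sample XML below is a simplified stand-in, not real OCR output: the `block` element name and the `blockType="Picture"` marker are assumptions made for the example (real files also carry XML namespaces and far more detail), and only the `l`, `t`, `r`, `b` attributes come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified OCR output for a two-page book.
SAMPLE_OCR_XML = """
<document>
  <page width="2000" height="3000">
    <block blockType="Text" l="100" t="100" r="1900" b="400"/>
    <block blockType="Picture" l="250" t="500" r="1750" b="1600"/>
  </page>
  <page width="2000" height="3000">
    <block blockType="Text" l="100" t="100" r="1900" b="2900"/>
  </page>
</document>
"""


def find_image_boxes(xml_text):
    """Return one crop box (in page-scan pixels) per image region found."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page_index, page in enumerate(root.iter("page")):
        for block in page.iter("block"):
            if block.get("blockType") != "Picture":
                continue  # skip text regions; only images are cropped
            left, top, right, bottom = (
                int(block.get(name)) for name in ("l", "t", "r", "b")
            )
            boxes.append(
                {
                    "page": page_index,
                    "left": left,
                    "top": top,
                    "width": right - left,
                    "height": bottom - top,
                }
            )
    return boxes


image_boxes = find_image_boxes(SAMPLE_OCR_XML)
print(image_boxes)
```

Each box can then be handed to whatever tool crops the region from the full-resolution page scan.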

Now that the list of images and their surrounding text has been compiled, the system must download the full-resolution page scans to extract the images from. The problem is that the page scans for each book are delivered as ZIP files that are usually several hundred megabytes, occasionally exceeding one gigabyte. While at first this may not seem like much, when multiplied by a large number of books, the network bandwidth and the speed of computer hard-drives become critical limiting factors. A set of 22-core virtual machines were used to handle much of the computing needs of this project.

Typically one might attempt to process one book per core, meaning 22 books would be processed in parallel at any given time. If all 22 books had 500MB ZIP files containing their full resolution page scan imagery, this would require downloading 11GB of data over the network per second. Assuming that each book takes less than a second of CPU time on average to process, this would require a 100Gbps network link working at 100% capacity to sustain these processing needs. Even if this was broken into 22 separate virtual machines, each with a single core, all 22 machines would still each require their own four Gbps network link working at 100% capacity and delivering 500MB/s bandwidth.

Even if the network bandwidth limitations are overcome, the greatest challenge is actually the disk bandwidth of writing a 500MB ZIP file to disk from the network and then unpacking it (reading the 500MB) and updating the file system metadata to handle several hundred new files being written to disk (writing its 500MB of contents to disk). Thus, all said and done, a single 500MB ZIP file requires 500MB to be read twice and written twice, totaling two GB of total IO. Processing 22 files per second would require 44GB/s of disk bandwidth.
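The back-of-the-envelope numbers above are easy to check. The short Python sketch below redoes the arithmetic under the same assumptions (22 books per second, 500MB per ZIP) and uses decimal units (1GB = 1000MB), which is an assumption on our part; the variable names are ours.

```python
BOOKS_PER_SECOND = 22   # one book per core, each finishing in about a second
ZIP_SIZE_MB = 500       # assumed size of each page-scan ZIP

# Network: every ZIP has to arrive over the wire.
network_mb_per_s = BOOKS_PER_SECOND * ZIP_SIZE_MB   # total MB/s across all cores
network_gbps = network_mb_per_s * 8 / 1000           # megabytes -> gigabits
per_machine_gbps = ZIP_SIZE_MB * 8 / 1000            # one single-core machine

# Disk: each ZIP is written, read back, and its contents written again,
# which the post counts as two reads and two writes of 500MB.
io_per_zip_mb = ZIP_SIZE_MB * 4
disk_gb_per_s = BOOKS_PER_SECOND * io_per_zip_mb / 1000

print(network_mb_per_s / 1000, "GB/s over the network")
print(network_gbps, "Gbps in total,", per_machine_gbps, "Gbps per single-core machine")
print(disk_gb_per_s, "GB/s of disk IO")
```

Under these assumptions the aggregate network load works out to 88 Gbps, which is why a 100Gbps link would be running close to capacity.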

Many cloud computing vendors limit virtual machines to around 100MB/s sustained writes and 200MB/s sustained read performance, while even a dedicated physical three Gbps SATA hard drive operating at peak capability would still require two seconds just to write the ZIP file to disk and another two seconds to unpack it and write its contents to disk as individual files (assuming the intermediate reads are kernel buffered). Since writes are linear, SSD disks do not provide a speed advantage over traditional mechanical disks. While more exotic storage infrastructures exist that can support these needs, they are rarely found in library environments.

Our processing pipeline was therefore designed to minimize network and disk IO even at the cost of slowing down the processing of a single book. In the final system, if a book had fewer than 50 pages containing images to be extracted, the page images were downloaded individually via the Archive’s ZIP API. Instead of downloading the entire ZIP file to local disk, the ZIP API allows a caller to request a single file from a ZIP archive and the Archive will extract, uncompress and return that single file from the ZIP archive. For optimization, contents of the ZIP were extracted individually via calls to the Archive’s web service at an average delay of around two seconds per image.

A book with 50 pages containing images would therefore require around 1.6 minutes to download all of the images, whereas on the virtual machines used for this project the raw page scan ZIP could be downloaded in under 30 seconds. However, downloading multiple full ZIP files would exceed the network and disk bandwidth on the virtual machines, raising the total time required to download the full ZIP file to about 10 minutes. For books with more than 50 pages of images, the full ZIP was downloaded, but only the needed pages were unpacked from the ZIP file instead of all pages, again reducing the read/write bandwidth. Combined, these two techniques reduced the IO load on each virtual machine to the point that the CPUs were able to be kept largely occupied when running 44 books in parallel.
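The trade-off can be written down as a small decision rule. This Python sketch is illustrative only: the function names are hypothetical, and sending a book with exactly 50 image pages down the full-ZIP path is an assumption, since the text only covers "fewer than" and "more than" 50.

```python
PAGE_THRESHOLD = 50            # cut-off described in the post
SECONDS_PER_PAGE_REQUEST = 2   # average delay per single-page request


def choose_download_strategy(pages_with_images):
    """Pick how to fetch page scans for one book."""
    if pages_with_images < PAGE_THRESHOLD:
        return "per-page"   # request each needed page individually
    return "full-zip"       # download the ZIP, unpack only the needed pages


def estimated_per_page_seconds(pages_with_images):
    """Rough time to fetch every needed page one request at a time."""
    return pages_with_images * SECONDS_PER_PAGE_REQUEST


for pages in (10, 49, 120):
    print(pages, choose_download_strategy(pages), estimated_per_page_seconds(pages))
```

At the threshold the per-page route costs about 100 seconds, slower than one uncontended ZIP download but far gentler on shared network and disk bandwidth.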

Finally, the book images needed to be cropped from the full resolution page scans. The de facto tool for such image operations is ImageMagick; however, its performance is extremely slow on large JPEG2000 images. Instead, we used the specialized “kdu_expand” tool from Kakadu Software to extract images and CJPEG to write the resulting image to disk. This resulted in a speedup of eight to 30 times by comparison to ImageMagick. To minimize memory requirements, the final system actually generates a shell script and then exits and invokes the shell script to perform all of the actual image processing components of the pipeline, freeing up the memory originally used to process the XML files since per-core memory was highly limited on the systems available for this project. Finally, the extracted JPEG images and a tab-delimited inventory text file listing the attributes of each image and the text surrounding it are compressed into one ZIP file per book, ready for use. Thus, while conceptually quite simple, the final system required considerable iterative development and enormous effort to minimize IO at all costs.

The Archive, the Flickr Commons, and the Future

With the mechanics of data management and image extraction completed, the first 2.6 million images from the collection were uploaded to the Internet Archive Book Images collection in the Flickr Commons in July/August 2014, making them searchable and browseable to the entire world. Each image is uploaded with a set of indexed tags such as book year (like 1879) or book subject (like sailing) and the book’s id (like bookidkittenscatsbooko00grov). Each image’s description has detailed information about the image caption, text before and after the image, book’s title, and the page number. In the Flickr Commons, the images themselves carry no known copyright restrictions; this itself carries new implications for libraries and archives online. In a recent conversation about this collection, Cathy Marshall stated:

“…here are over 2.5 million images, ripe for reuse and reorganized into different collections and subcollections. Do we necessarily want to read the whole of a monograph about polycystins from 1869? Probably not. But might we use a stippled protozoan illustration as a homescreen background (in this case, with little concern for copyright restrictions). And we would be in good company: in a recent study, Frank Shipman and I found that over 80% of our participants (202 out of 242) downloaded photos they found on the Internet with little concern for copyright restrictions. Almost 3/4 reused these photos in new contexts. It’s easy to see how book images, extracted and offered without copyright concerns, would be an attractive online resource for purposes many of us haven’t even foreseen yet.”

We also look towards annotations. From people in the communities and in the libraries, human annotation can help classify – formally and informally – the content to be discovered in the books. Even signals that one might take for granted, such as marking an image as a favorite or leaving a comment, can be quite valuable in social computing to further understand a corpus and help tell the stories contained across all the books. The structured data and human annotations here are the first steps; computer vision systems can further index concepts.

We are just now at the beginning of what is possible with this collection. From here, we hope to begin a transformation of the Internet Archive’s collections into a dynamic, growing collection where concepts, objects, locations and even music can be discovered, not as just an index of text around images, but rather a deeper knowledge model that utilizes the structure of books, publishing and libraries to understand the world and allow scholars to begin asking an entirely new generation of research questions.

LITA: Long Live Firefox!

planet code4lib - Mon, 2014-12-22 11:00

Until I became a librarian, I never gave much thought to web browsers. In the past I used Safari when working on a Mac, Chrome on my Android tablet, and showed the typical disdain for Internet Explorer. If I ever used Firefox it was purely coincidental, but now it’s my first choice and here’s why.

This month Mozilla launched Firefox 34 and announced a deal to make Yahoo their default search engine. I wasn’t alone in wondering if that move would be bad for business (if you’re like me, you avoid Yahoo like the plague). Mozilla also raised some eyebrows by asking for donations on their home page this year.

I switched to Firefox a few months ago, prior to all the commotion, when I came across Mozilla’s X-Ray Goggles, an add-on that allows you to view how a webpage is constructed (the Denver Public Library has a great project tutorial using X-Ray Goggles that I highly recommend). I was pleasantly surprised to find a slew of other resources for teaching the web and after doing a little more digging, I was taken by Mozilla’s support of an open web and intrigued by their non-profit status.

At the library I frequently encounter patrons who have pledged their allegiance to Google or Apple or Microsoft and I’m the same way. I was excited to update to Lollipop on my tablet and I’m saving up for an iMac, but I cringe when I think about Google’s privacy policies or Apple’s sweatshops. Are these companies that I really want to support?

I was teaching an Android class the other day and a patron asked me which browser is the best. I told her that I use Firefox because I support Mozilla and what they stand for. She chuckled at my response. Maybe it’s silly to stand up for any corporation, but given the choice I want to support the one that does the most good (or the least evil).

Mozilla’s values and goals are very much in line with the modern library. If you’re on the fence about Firefox, take a look at their Privacy Policy and Add-Ons, and see how easy it is to switch your default search engine back to Google. You just might change your mind.

Mark E. Phillips: Calculating a Use.

planet code4lib - Mon, 2014-12-22 01:42

In the previous post I talked a little about what a “use” is for our digital library and why we use it as a unit of measurement.  We’ve found that it is a very good indicator of engagement with our digital items.  It allows us to understand how our digital items are being used in ways that go beyond Google Analytics reports.

In this post I wanted to talk a little bit about how we calculate these metrics.

First of all a few things about the setup that we have for delivering content.

We have a set of application servers that are running an Apache/Python/Django stack for receiving and delivering requests for digital items.  A few years ago we decided that it would be useful to proxy all of our digital object content through these content servers for delivery so we would have access to adjust and possibly restrict content in some situations.  This means two things: one, that all traffic goes in and out of these application servers for requests and delivery of content, and two, that we are able to rely on the log files that Apache produces to get a good understanding of what is going on in our system.

We decided to base our metrics on best practices of the Counter initiative whenever possible, to try to align our numbers with that community.

Each night at about 1:00 AM CT we start a process that aggregates all of the log files for the previous day from the different application servers.  These are collocated on a server that is responsible for calculating the daily uses.

We are using the NCSA extended/combined log format for the log file format coming from our servers.  We also append the domain name for the request because we operate multiple interfaces and domains from the same servers.  A typical set of logs looks something like this.

- - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74680/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
- - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74678/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
- - [21/Dec/2014:03:49:28 -0600] "GET /search/?q=%22Bowles%2C+Flora+Gatlin%2C+1881-%22&t=dc_creator&display=grid HTTP/1.0" 200 17900 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp;"
- - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74662/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
- - [21/Dec/2014:03:49:28 -0600] "GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 "" "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"


Here are the steps that we go through for calculating uses.

  • Remove requests that are not for an ARK identifier,  the request portion must start with /ark:/
  • Remove requests from known robots (this list, plus additional robots)
  • Remove requests with no user agent
  • Remove all requests that are not 200 or 304
  • Remove requests for thumbnails, raw metadata, feedback, urls from the object as they are generally noise

From the lines in the first example there would be only one line left after processing,  the last line.

The lines that remain are sorted by date and then grouped by IP address.  A time window of 30 minutes is run on the requests.  The IP address and ARK identifier are used as the key to the time window, which allows us to group requests by a single IP address for a single item into a single use.  These uses are then fed into a Django application where we store our stats data.
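To make the filtering and windowing steps concrete, here is a minimal Python sketch. It is not our production code: it works on already-parsed log records, the robot and "noise" lists are short hypothetical stand-ins, the IP addresses are invented, and it assumes a fixed 30-minute window that starts at the first request for a given IP address and item.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)
KNOWN_ROBOTS = ("Slurp",)  # hypothetical excerpt of a robot list
NOISE_SEGMENTS = ("/thumbnail/", "/metadata.", "/feedback/", "/urls/")  # hypothetical


def is_countable(request):
    """Apply the filtering steps to one parsed log record."""
    path, agent = request["path"], request["agent"]
    if not path.startswith("/ark:/"):
        return False
    if not agent or agent == "-":
        return False
    if any(robot in agent for robot in KNOWN_ROBOTS):
        return False
    if request["status"] not in (200, 304):
        return False
    return not any(segment in path for segment in NOISE_SEGMENTS)


def ark_of(path):
    """'/ark:/67531/metapth74679/m1/1/' -> 'ark:/67531/metapth74679'."""
    return "/".join(path.split("/")[1:4])


def count_uses(requests):
    """Count uses: one per (IP, ARK) pair per 30-minute window."""
    window_start = {}
    uses = 0
    for request in sorted(filter(is_countable, requests), key=lambda r: r["time"]):
        key = (request["ip"], ark_of(request["path"]))
        start = window_start.get(key)
        if start is None or request["time"] - start >= WINDOW:
            window_start[key] = request["time"]
            uses += 1
    return uses


FIREFOX = "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"
demo_requests = [
    {"ip": "192.0.2.10", "time": datetime(2014, 12, 21, 3, 49, 28), "status": 200,
     "path": "/ark:/67531/metapth74679/m1/1/high_res/", "agent": FIREFOX},
    {"ip": "192.0.2.10", "time": datetime(2014, 12, 21, 3, 55, 0), "status": 200,
     "path": "/ark:/67531/metapth74679/m1/2/high_res/", "agent": FIREFOX},
    {"ip": "192.0.2.10", "time": datetime(2014, 12, 21, 4, 30, 0), "status": 200,
     "path": "/ark:/67531/metapth74679/m1/3/high_res/", "agent": FIREFOX},
    {"ip": "192.0.2.20", "time": datetime(2014, 12, 21, 3, 49, 28), "status": 200,
     "path": "/acquire/ark:/67531/metapth74680/", "agent": "Python-urllib/2.7"},
    {"ip": "192.0.2.30", "time": datetime(2014, 12, 21, 3, 49, 28), "status": 200,
     "path": "/search/?q=test", "agent": "Mozilla/5.0 (compatible; Yahoo! Slurp;"},
]
demo_uses = count_uses(demo_requests)
print(demo_uses)
```

In the demo, the first two page views fall inside one window and count as a single use, the third arrives more than 30 minutes after the window opened and starts a second use, and the non-ARK requests are dropped.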

There are of course many other specifics about the process of getting these stats processed and moved into our system.  I hope this post was helpful for explaining the kinds of things that we do and don’t count when calculating uses.


Patrick Hochstenbach: It Is Raining Cat Doodles

planet code4lib - Sat, 2014-12-20 15:48
Filed under: Doodles Tagged: cat, Cats, doodle, elag

District Dispatch: Sony leak reveals efforts to revive SOPA

planet code4lib - Sat, 2014-12-20 15:17

Working in Washington, D.C. tends to make one a bit jaded: the revolving door, the bipartisan attacks, not enough funding for libraries — the list goes on. So, yes, I am D.C.-weary and growing more cynical. Now I have another reason to be fed up.

The Sony Pictures Entertainment data breach has uncovered documents that show that the Motion Picture Association of America (MPAA) has been trying to pull a fast one: reviving the ill-conceived Stop Online Piracy Act (SOPA) legislation (that failed spectacularly in 2012) by apparently working in tandem with state Attorneys General. Documents show that MPAA has been devising a scheme to get the result they could not get with SOPA: shutting down web sites and, along with them, freedom of expression and access to information.

Sony Pictures studio.

The details have been covered by a number of media outlets, including The New York Times. The MPAA seems to think that the best solution to shutting down piracy is to “make invisible” the web sites of suspected culprits. You may think that libraries have little to worry about; after all we aren’t pirates. But the good guys will be yanked offline, as well as the alleged bad guys. Our provision of public access to the internet would then be in jeopardy because a few library users allegedly posted protected content on, for example, Pinterest or YouTube. Our protection from liability for the activities of library patrons using public computers could be thrown out the window along with internet access. This makes no sense.

SOPA, touted initially by Congress as a solution to online piracy, also made no sense from the start because it was too broad. If passed, it would have required that libraries police the internet and block web sites whenever asked by law enforcement officials. Technical experts confirmed that the implementation of SOPA could threaten cybersecurity and undermine the Domain Name System (DNS), also known as the very “backbone of the Internet.”

After historically overwhelming public outcry, the content community and internet supporters were encouraged to work together on a compromise; parties promised to collaborate, and some work was actually accomplished. Now it seems that, as far as MPAA was concerned, collaboration was just hype. They were, the leaked documents show, planning all along to get SOPA one way or another.

The library community opposes piracy. But we also oppose throwing the baby out with the bath water.

Update: The Verge has reported that Mississippi Attorney General Hood did indeed launch his promised attack on behalf of the MPAA by serving Google with a 79-page subpoena charging that Google engaged in “deceptive” or “unfair” trade practices under the Mississippi Consumer Protection Act. Google has filed a brief asking the federal court to set aside the subpoena and noting that Mississippi (or any state for that matter) has no jurisdiction over these matters.

For more on efforts to revive SOPA, see this post as well.

The post Sony leak reveals efforts to revive SOPA appeared first on District Dispatch.

Terry Reese: Working with SPARQL in MarcEdit

planet code4lib - Sat, 2014-12-20 06:06

Over the past couple of weeks, I’ve been working on expanding the linking services that MarcEdit can work with in order to create identifiers for controlled terms and headings.  One of the services that I’ve been experimenting with is NLM’s beta SPARQL endpoint for MESH headings.  MESH has always been something that is a bit foreign to me.  While I had been a cataloger in my past, my primary area of expertise was with geographic materials (analog and digital), as well as traditional monographic data.  While MESH looks like LCSH, it’s quite different as well.  So, I’ve been spending some time trying to learn a little more about it, while working on a process to consistently query the endpoint to retrieve the identifier for a preferred Term. It’s been a process that’s been enlightening, but also one that has led me to think about how I might create a process that could be used beyond this simple use-case, and potentially provide MarcEdit with an RDF engine that could be utilized down the road to make it easier to query, create, and update graphs.

Since MarcEdit is written in .NET, this meant looking to see what components currently exist that provide the type of RDF functionality that I may be needing down the road.  Fortunately, a number of components exist; the one I’m utilizing in MarcEdit is dotnetrdf.  The component provides a robust set of functionality that supports everything I want to do now, and should want to do later.

With a tool kit found, I spent some time integrating it into MarcEdit, which is never a small task.  However, the outcome will be a couple of new features to start testing out the toolkit and start providing users with the ability to become more familiar with a key semantic web technology,  SPARQL.  The first new feature will be the integration of MESH as a known vocabulary that will now be queried and controlled when run through the linked data tool.  The second new feature is a SPARQL Browser.  The idea here is to give folks a tool to explore SPARQL endpoints and retrieve the data in different formats.  The proof of concept supports XML, RDFXML, HTML, CSV, Turtle, NTriple, and JSON as output formats.  This means that users can query any SPARQL endpoint and retrieve data back.  In the current proof of concept, I haven’t added the ability to save the output – but I likely will prior to releasing the Christmas MarcEdit update.

Proof of Concept

While this is still somewhat conceptual, the current SPARQL Browser looks like the following:

At present, the Browser assumes that data resides at a remote endpoint, but I’ll likely include the ability to load local RDF, JSON, or Turtle data and provide the ability to query that data as a local endpoint.  Anyway, right now, the Browser takes a URL to the SPARQL Endpoint, and then the query.  The user can then select the format in which the result set should be output.

Using NLM as an example, say a user wanted to query for the specific term: Congenital Abnormalities – utilizing the current proof of concept, the user would enter the following data:

SPARQL Endpoint:


PREFIX rdf: <>
PREFIX rdfs: <>
PREFIX xsd: <>
PREFIX owl: <>
PREFIX meshv: <>
PREFIX mesh: <>

SELECT distinct ?d ?dLabel
FROM <>
WHERE {
  ?d meshv:preferredConcept ?q .
  ?q rdfs:label 'Congenital Abnormalities' .
  ?d rdfs:label ?dLabel .
}
ORDER BY ?dLabel

Running this query within the SPARQL Browser produces a resultset that is formatted internally into a Graph for output purposes.

The images snapshot a couple of the different output formats.  For example, the full JSON output is the following:

{
  "head": {
    "vars": [ "d", "dLabel" ]
  },
  "results": {
    "bindings": [
      {
        "d": { "type": "uri", "value": "" },
        "dLabel": { "type": "literal", "value": "Congenital Abnormalities" }
      }
    ]
  }
}
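For anyone who wants to work with this output outside of MarcEdit, the JSON follows the standard SPARQL query results layout (a `head` listing the variables and a `results.bindings` array), so it can be flattened with a few lines of code. The Python sketch below is only an illustration and is not part of MarcEdit; the function name is made up, and the `d` value is empty because the URI is missing from the output as reproduced here.

```python
import json

# The result set shown above, pasted in as-is (the URI value is empty here).
SPARQL_JSON = """
{ "head": { "vars": [ "d", "dLabel" ] },
  "results": { "bindings": [
    { "d": { "type": "uri", "value": "" },
      "dLabel": { "type": "literal", "value": "Congenital Abnormalities" } }
  ] } }
"""


def bindings_to_rows(result_text):
    """Flatten SPARQL JSON results into one {variable: value} dict per row."""
    result = json.loads(result_text)
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # A variable may be unbound in a given row, so check before reading it.
        rows.append(
            {name: binding[name]["value"] for name in variables if name in binding}
        )
    return rows


rows = bindings_to_rows(SPARQL_JSON)
print(rows)
```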

The idea behind creating this as a general purpose tool is that, in theory, this should work for any SPARQL endpoint, for example the Project Gutenberg Metadata endpoint.  The same type of exploration can be done, utilizing the Browser.

Future Work

At this point, the SPARQL Browser represents a proof of concept tool, but one that I will make available as part of the MARCNext research toolset in the next update.  Going forward, I will likely refine the Browser based on feedback, but more importantly, start looking at how the new RDF toolkit might allow for the development of dynamic form generation for editing RDF/BibFrame data…at least somewhere down the road.


William Denton: Intersecting circles

planet code4lib - Sat, 2014-12-20 03:21

A couple of months ago I was chatting about Venn diagrams with a nine-year-old (almost ten-year-old) friend named N. We learned something interesting about intersecting circles, and along the way I made some drawings and wrote a little code.

We started with two sets, but here let’s start with one. We’ll represent it as a circle on the plane. Call this circle c1.

Everything is either in the circle or outside it. It divides the plane into two regions. We’ll label the region inside the circle 1 and the region outside (the rest of the plane) x.

Now let’s look at two sets, which is probably the default Venn diagram everyone thinks of. Here we have two intersecting circles, c1 and c2.

We need to consider both circles when labelling the regions now. For everything inside c1 but not inside c2, use 1x; for the intersection use 12; for what’s in c2 but not c1 use x2; and for what’s outside both circles use xx.

We can put this in a table:

1    2
1    x
1    2
x    2
x    x

This looks like part of a truth table, which of course is what it is. We can use true and false instead of the numbers:

1    2
T    F
T    T
F    T
F    F

It takes less space to just list it like this, though: 1x, 12, x2, xx.

It’s redundant to use the numbers, but it’s clearer, and in the elementary school math class they were using them, so I’ll keep with that.

Three circles gives eight regions: 1xx, 12x, 1x3, 123, x2x, xx3, x23, xxx.

Four intersecting circles gets busier and gives 14 regions: 1xxx, 12xx, 123x, 12x4, 1234, 1xx4, 1x34, x2xx, x23x, x234, xx3x, xx34, xxx4, xxxx.

Here N and I stopped and made a list of circles and regions:

Circles  Regions
1        2
2        4
3        8
4        14

When N saw this he wondered how much it was growing by each time, because he wanted to know the pattern. He does a lot of that in school. We subtracted the previous row from each row to find how much it grew:

Circles  Regions  Difference
1        2
2        4        2
3        8        4
4        14       6

Aha, that’s looking interesting. What’s the difference of the differences?

Circles  Regions  Difference  DiffDiff
1        2
2        4        2
3        8        4           2
4        14       6           2

Nine-year-old (almost ten-year-old) N saw this was important. I forget how he put it, but he knew that if the second-level difference is constant then that’s the key to the pattern.

I don’t know what triggered the memory, but I was pretty sure it had something to do with squares. There must be a proper way to deduce the formula from the numbers above, but all I could do was fool around a little bit. We’re adding a new 2 each time, so what if we take it away and see what that gives us? Let’s take the number of circles as n and the result as ?(n) for some unknown function ?.

n    ?(n)
1    0
2    2
3    6
4    12

I think I saw that 3 x 2 = 6 and 4 x 3 = 12, so n x (n-1) seems to be the pattern, and indeed 2 x 1 = 2 and 1 x 0 = 0, so there we have it.

Adding the 2 back we have:

Given n intersecting circles, the number of regions formed = n x (n - 1) + 2

Therefore we can predict that for 5 circles there will be 5 x 4 + 2 = 22 regions.
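For anyone who wants the "proper way" to get from the table to the formula, here is a short Python sketch of the method of finite differences. It relies on the standard fact that a constant second difference d means the sequence is a quadratic a·n² + b·n + c with a = d/2. The variable and function names are mine, chosen just for this illustration.

```python
def differences(values):
    """Differences between consecutive entries: [2, 4, 8] -> [2, 4]."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Regions counted by hand for 1, 2, 3 and 4 circles.
counted = [2, 4, 8, 14]

first = differences(counted)   # the "Difference" column
second = differences(first)    # the "DiffDiff" column

# A constant second difference d means a quadratic a*n^2 + b*n + c, a = d/2.
a = second[0] // 2
# R(2) - R(1) = 3a + b gives b; R(1) = a + b + c then gives c.
b = (counted[1] - counted[0]) - 3 * a
c = counted[0] - a - b


def regions(n):
    """Number of regions predicted by the fitted quadratic."""
    return a * n * n + b * n + c


print(first, second, (a, b, c), regions(5))
```

The fitted coefficients come out as a = 1, b = -1, c = 2, that is n² - n + 2, which is the same thing as n x (n - 1) + 2.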

I think that here I drew five intersecting circles and counted up the regions and almost got 22, but there were some squidgy bits where the lines were too close together so we couldn’t quite see them all, but it seemed like we’d solved the problem for now. We were pretty chuffed.

When I got home I got to wondering about it more and wrote a bit of R.

I made three functions; the third uses the first two:

  • circle(x,y): draw a circle at (x,y), default radius 1.1
  • roots(n): return the n nth roots of unity (when using complex numbers, x^n = 1 has n solutions)
  • drawcircles(n): draw circles of radius 1.1 around each of those n roots

# Draw a circle of radius rad centred at (x, y), as a many-sided polygon.
circle <- function(x, y, rad = 1.1, vertices = 500, ...) {
  rads <- seq(0, 2*pi, length.out = vertices)
  xcoords <- cos(rads) * rad + x
  ycoords <- sin(rads) * rad + y
  polygon(xcoords, ycoords, ...)
}

# Return the n nth roots of unity as a list of c(x, y) pairs.
roots <- function(n) {
  lapply(
    seq(0, n - 1, 1),
    function(x) c(round(cos(2*x*pi/n), 4), round(sin(2*x*pi/n), 4))
  )
}

# Set up an empty square plot, then draw a circle around each root.
drawcircles <- function(n) {
  centres <- roots(n)
  plot(-2:2, type="n", xlim = c(-2,2), ylim = c(-2,2), asp = 1, xlab = "", ylab = "", axes = FALSE)
  lapply(centres, function (c) circle(c[[1]], c[[2]]))
}

drawcircles(2) does what I did by hand above (without the annotations):

drawcircles(5) shows clearly what I drew badly by hand:

Pushing on, 12 intersecting circles:

There are 12 x 11 + 2 = 134 regions there.

And 60! This has 60 x 59 + 2 = 3542 regions, though at this resolution most can’t be seen. Now we’re getting a bit op art.

This is covered in Wolfram MathWorld as Plane Division by Circles, and (2, 4, 8, 14, 22, …) is A014206 in the On-Line Encyclopedia of Integer Sequences: “Draw n+1 circles in the plane; sequence gives maximal number of regions into which the plane is divided.”

Somewhere along the way while looking into all this I realized I’d missed something right in front of my eyes: the intersecting circles stopped being Venn diagrams after 3!

A Venn diagram represents “all possible logical relations between a finite collection of different sets” (says Venn diagram on Wikipedia today). With n sets there are 2^n possible relations. Three intersecting circles divide the plane into 3 x (3 - 1) + 2 = 8 = 2^3 regions, but with four circles we have 14 regions, not 16! 1x3x and x2x4 are missing: there is nowhere where only c1 and c3 or c2 and c4 intersect without the other two. With five intersecting circles we have 22 regions, but logically there are 2^5 = 32 possible combinations. (What’s an easy way to calculate which are missing?)
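One rough way to explore that last question numerically is to sample points on a grid, record which circles each point falls inside, and compare the combinations seen against all 2^n possibilities. The Python sketch below does this for circles of radius 1.1 centred on the nth roots of unity, as in the drawings here. The grid spacing and extent are my own assumptions, and a grid can miss very thin regions, so for larger n the list of "missing" combinations may be too long.

```python
import itertools
import math


def centres(n):
    """Centres of n circles placed on the n-th roots of unity."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def label(inside):
    """Turn (True, False, True, False) into the notation used here: '1x3x'."""
    return "".join(str(i + 1) if hit else "x" for i, hit in enumerate(inside))


def regions_seen(n, radius=1.1, step=0.02, extent=2.2):
    """Labels of every region that contains at least one grid sample point."""
    pts = centres(n)
    steps = round(2 * extent / step)
    signatures = set()
    for i in range(steps + 1):
        x = -extent + i * step
        for j in range(steps + 1):
            y = -extent + j * step
            # One boolean per circle: is the sample point strictly inside it?
            signatures.add(tuple(
                (x - cx) ** 2 + (y - cy) ** 2 < radius ** 2 for cx, cy in pts
            ))
    return {label(sig) for sig in signatures}


def missing(n, **grid_options):
    """Combinations a true Venn diagram needs that the sampling never finds."""
    wanted = {label(bits) for bits in itertools.product([True, False], repeat=n)}
    return sorted(wanted - regions_seen(n, **grid_options))


print(missing(4))
```

For four circles this prints ['1x3x', 'x2x4'], the same two combinations identified above, and for three circles the list is empty. An exact answer would need geometry rather than sampling, so treat this as a quick check, not a proof.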

It turns out there are various ways to draw four- (or more) set Venn diagrams on Wikipedia, like this two-dimensional oddity (which I can’t imagine librarians ever using when teaching search strategies):

You never know where a bit of conversation about Venn diagrams is going to lead!

District Dispatch: Sony leak reveals efforts to revive SOPA

planet code4lib - Fri, 2014-12-19 22:35

Working in Washington, D.C., tends to make one a bit jaded: the revolving door, the bipartisan attacks, not enough funding for libraries — the list goes on. So, yes, I am D.C.-weary and growing more cynical. Now I have another reason to be fed up.

Sony Pictures studio.

The Sony Pictures Entertainment data breach has uncovered documents that show that the Motion Picture Association of America (MPAA) has been trying to pull a fast one: reviving the ill-conceived Stop Online Piracy Act (SOPA) legislation (that failed spectacularly in 2012) by apparently buying off state Attorneys General. Documents show that MPAA has been devising a scheme to get the result they could not get with SOPA: shutting down web sites and along with them freedom of expression and access to information.

The details have been covered by a number of media outlets, including The New York Times. The MPAA seems to think that the best solution to shutting down piracy is to “make invisible” the web sites of suspected culprits. You may think that libraries have little to worry about; after all we aren’t pirates. But the good guys will be yanked offline, as well as the alleged bad guys. Our provision of public access to the Internet would then be in jeopardy because a few library users allegedly posted protected content on, for example, Pinterest or YouTube. Our protection from liability for the activities of library patrons using public computers could be thrown out the window along with internet access. This makes no sense.

SOPA, touted initially by Congress as a solution to online piracy, also made no sense from the start because it was too broad. If passed, it would have required that libraries police the internet and block web sites whenever asked by law enforcement officials. Technical experts confirmed that the implementation of SOPA could threaten cybersecurity and undermine the Domain Name System (DNS), also known as the very “backbone of the Internet.”

After a historically overwhelming public outcry, the content community and internet supporters were encouraged to work together on a compromise. Parties promised to collaborate, and some work was actually accomplished. Now it seems that, as far as MPAA was concerned, collaboration was just hype. They were, the leaked documents show, planning all along to get SOPA one way or another.

The library community opposes piracy. But we also oppose throwing the baby out with the bath water.

Update: The Verge has reported that Mississippi Attorney General Hood did indeed launch his promised attack on behalf of the MPAA by serving Google with a 79-page subpoena charging that Google engaged in “deceptive” or “unfair” trade practices under the Mississippi Consumer Protection Act. Google has filed a brief asking the federal court to set aside the subpoena and noting that Mississippi (or any state for that matter) has no jurisdiction over these matters.


The post Sony leak reveals efforts to revive SOPA appeared first on District Dispatch.

Library Hackers Unite blog: NTP and git client vulnerabilities

planet code4lib - Fri, 2014-12-19 21:00

Git client vulnerabilities on case-insensitive filesystems:

NTPd vulnerabilities announced:

OSX and MS Windows users, start by updating your github apps and plugins and then your regular command-line git client. NTP fixes still pending for most platforms.

Library of Congress: The Signal: Dodging the Memory Hole: Collaborations to Save the News

planet code4lib - Fri, 2014-12-19 17:55

The news is often called the “first draft of history” and preserved newspapers are some of the most used collections in libraries. The Internet and other digital technologies have altered the news landscape. There have been numerous stories about the demise of the newspaper and disruption at traditional media outlets. We’ve seen more than a few newspapers shutter their operations or move to strictly digital publishing. At the same time, niche news blogs, citizen-captured video, hyper-local news sites, news aggregators and social media have all emerged to provide a dynamic and constantly changing news environment that is sometimes confusing to consume and definitely complex to encapsulate.

With these issues in mind and with the goal to create a network to preserve born-digital journalism, the Reynolds Journalism Institute at the University of Missouri sponsored part one of the meeting Dodging the Memory Hole as part of the Journalism Digital News Archive 2014 forum, an initiative at the Reynolds Institute. Edward McCain (the focus of a recent Content Matters interview on The Signal) has a unique joint appointment at the Institute and the University of Missouri Library as the Digital Curator of Journalism. He and Katherine Skinner, Executive Director of the Educopia Institute (which will host part two of the meeting in May 2015 in Charlotte, N.C.) developed the two-day program, which attracted journalists, news librarians, technologists, academics and administrators.

Cliff Lynch, Director of the Coalition for Networked Information, opened the meeting with a thoughtful assessment of the state of digital news production and preservation. An in-depth case study followed recounting the history of the Rocky Mountain News, its connection to the Denver, CO community, its eventual demise as an actively published newspaper and, ultimately, the transfer of its assets to the Denver Public Library where the content and archives of the Rocky Mountain News remain accessible.

This is the first known arrangement of its kind, and DPL has made its donation agreement with E.W. Scripps Company openly accessible so it can serve as a model for other newspapers and libraries or archives. A roundtable discussion of news executives also revealed opportunities to engage in new types of relationships with the creators of news. Particularly, opening a dialog with the maintainers of content management systems that are used in newsrooms could make the transfer of content out of those systems more predictable and archivable.

Ben Welsh, a database producer at the Los Angeles Times, next debuted his tool Storytracker, which is based on PastPages, a tool he developed to capture screenshots of newspaper websites.  Storytracker allows for the capture of screenshots and the extraction of URLs and their associated text so links and particular stories or other content elements from a news webpage can be tracked over time and analyzed. Storytracker is free and available for download and Welsh is looking for feedback on how the tool could be more useful to the web archiving community. Tools like these have the potential to aid in the selection, capture and analysis of web based content and further the goal of preserving born-digital news.

Katherine Skinner closed the meeting with an assessment of the challenges ahead for the community, including: unclear definitions and language around preservation; the copyright status of contemporary news content; the technical complexity of capturing and preserving born-digital news; ignorance of emerging types of content; and the lack of relationships between new content creators and stewardship organizations.

In an attempt to meet some of these challenges, three action areas were defined: awareness, standards and practices, and legal framework. Participants volunteered to work toward progress in advocacy messaging, exploring public-private partnerships, preserving pre-print newspaper PDFs, preserving web-based news content and exploring metadata and news content management systems. Groups will attempt to demonstrate some progress in these areas over the next six months and share results at the next Dodging the Memory Hole meeting in Charlotte. If you have ideas or want to participate in any of the action areas let us know in the comments below and we will be in touch.

Casey Bisson: Parable of the Polygons is the future of journalism

planet code4lib - Fri, 2014-12-19 17:35

Okay, so I’m probably both taking that too far and ignoring the fact that interactive media have been a reality for a long time. So let me say what I really mean: media organizations that aren’t planning out how to tell stories with games and simulators will miss out.

Here’s my example: Vi Hart and Nicky Case’s Parable of the Polygons shows us how bias, even small bias, can affect diversity. It shows us this problem using interactive simulators, rather than tells us in text or video. We participate by moving shapes around and pulling the levers of change on bias.

This nuclear power plant simulator offers some insight into the complexity that contributed to Fukushima, and I can’t help thinking the whole net neutrality argument would be better explained with a simulator.

Manage Metadata (Diane Hillmann and Jon Phipps): The Jane-athon Prototype in Hawaii

planet code4lib - Fri, 2014-12-19 15:22

The planning for the Midwinter Jane-athon pre-conference has been taking up a lot of my attention lately. It’s a really cool idea (credit to Deborah Fritz) to address the desire we’ve been hearing for some time for a participatory, hands-on session on RDA. And let’s be clear, we’re not talking about the RDA instructions–this is about the RDA data model, vocabularies, and RDA’s availability for linked data. We’ll be using RIMMF (RDA in Many Metadata Formats) as our visualization and data creation tool, setting up small teams with leaders who’ve been prepared to support the teams and a wandering phalanx of coaches to give help on the fly.

Part of the planning has to do with building a set of RIMMF ‘records’ to start with, for participants to add on their own resources and explore the rich relationships in RDA. We’re calling these ‘r-balls’ (a cross between RIMMF and tarballs). These zipped-up r-balls will be available for others to use for their own homegrown sessions, along with instructions for using RIMMF and setting up a Jane-athon (or other themed -athon), and also how to contribute their own r-balls for the use of others. In case you’ve not picked it up, this is a radically different training model, and we’d like to make it possible for others to play, too.

That’s the plan for the morning. After lunch we’ll take a look at what we’ve done, and prise out the issues we’ve encountered, and others we know about. The hope is that the participants will walk out the door with both an understanding of what RDA is (more than the instructions) and how it fits into the emerging linked data world.

I recently returned from a trip to Honolulu, where I did a prototype Jane-athon workshop for the Hawaii Library Association. I have to admit that I didn’t give much thought to how difficult it would be to do solo, but I did have the presence of mind to give the organizer of the workshop some preliminary setup instructions (based on what we’ll be doing in Chicago) to ensure that there would be access to laptops with software and records pre-loaded, and a small cadre of folks who had been working with RIMMF to help out with data creation on the day.

The original plan included a day before the workshop with a general presentation on linked data and some smaller meetings with administrators and others in specialized areas. It’s a format I’ve used before and the smaller meetings after the presentation generally bring out questions that are unlikely to be asked in a larger group.

What I didn’t plan for was that I wouldn’t be able to get out of Ithaca on the appointed day (the day before the presentation) thanks not to bad weather, but instead to a non-functioning plane which couldn’t be repaired. So after a phone discussion with Hawaii, I tried again the next day, and everything went smoothly. On the receiving end there was lots of effort expended to make it all work in the time available, with some meetings dribbling into the next day. But we did it, thanks to organizer Nancy Sack’s prodigious skills and the flexibility of all concerned.

Nancy asked the Jane-athon participants to fill out an evaluation, and sent me the anonymized results. I really appreciated that the respondents added many useful (and frank) comments to the usual range of questions. Those comments in particular were very helpful to me, and were passed on to the other MW Jane-athon organizers. One of the goals of the workshop was to help participants visualize, using RIMMF, how familiar MARC records could be automatically mapped into the FRBR structure of RDA, and how that process might begin to address concerns about future workflow and reuse of MARC records. Another goal was to illustrate how RDA’s relationships enhanced the value of the data, particularly for users. For the most part, it looked as if most of the participants understood the goals of the workshop and felt they had gotten value from it.

But there were those who provided frank criticism of the workshop goals and organization (as well as the presenter, of course!). Part of these criticisms involved the limitations of the workshop, wanting more information on how they could put their new knowledge to work, right now. The clearest expression of this desire came in as follows:

“I sort of expected to be given the whole road map for how to take a set of data and use LOD to make it available to users via the web. In rereading the flyer I see that this was not something the presenter wanted to cover. But I think it was apparent in the afternoon discussion that we wanted more information in the big picture … I feel like I have an understanding of what LOD is, but I have no idea how to use it in a meaningful way.”

Aside from the time constraints–which everyone understood–there’s a problem inherent in the fact that very few active LOD projects have moved beyond publishing their data (a good thing, no doubt about it) to using the data published by others. So it wasn’t so much that I didn’t ‘want’ to present more about the ‘bigger picture’; there wasn’t really anything to say aside from the fact that the answer to that question is still unclear (and I probably wasn’t all that clear about it either). If I had a ‘road map’ to talk about and point them to, I certainly would have shared it, but sadly I have nothing to share at this stage.

But I continue to believe that just as progress in this realm is iterative, it is hugely important that we not wait for the final answers before we talk about the issues. Our learning needs to be iterative too, to move along the path from the abstract to the concrete along with the technical developments. So for MidWinter, we’ll need to be crystal clear about what we’re doing (and why), as well as why there are blank areas in the road-map.

Thanks again to the Hawaii participants, and especially Nancy Sack, for their efforts to make the workshop happen, and the questions and comments that will improve the Jane-athon in Chicago!

For additional information, including a link to register, look here. Although I haven’t seen the latest registration figures, we’re expecting to fill up, so don’t delay!

[these are the workshop slides]

[these are the general presentation slides]

Chris Prom: Configuring WordPress Multisite as a Content Management System

planet code4lib - Fri, 2014-12-19 14:56

In summer/fall 2012, I posted a series regarding the implementation of WordPress as a content management system. Time prevented me from describing how we decided to configure WordPress for use in the University of Illinois Archives. In my next two posts, I’d like to rectify that, first by describing our basic implementation, then by noting (in the second post) some WordPress configuration steps that proved particularly handy. It’s an opportune time to do this because our Library is engaged in a project to examine options for a new CMS, and WordPress is one option.

When we went live with the main University Archives site in August 2012, one goal was to manage related sites (the American Library Association Archives, the Sousa Archives and Center for American Music, and the Student Life and Culture Archives) in one technology, but to allow a certain amount of local flexibility in the implementation. Doing this, I felt, would minimize development and maintenance costs while making it easier for staff to add and edit content. We had a strong desire to avoid staff training sessions and sought to help our many web writers and editors become self sufficient, without letting them wander too far afield from an overall design aesthetic (even if my own design sense was horrible, managing everything in one system would make it easier to apply a better design at a later date).

I began by setting up a WordPress multisite installation and by selecting the Thematic theme framework. In retrospect, these decisions have proven to be good ones, allowing us to achieve the goals described above.

Child Theme Development

Thematic is a theme framework, and is not suitable for those who don’t like editing CSS or delving into code (i.e. for people who want to set colors and do extensive styling in the admin interface). That said, its layout and div organization are easy to understand, and it is well documented. It includes a particularly strong set of widget areas, so that is a huge plus. It is developer friendly since it is easy to do site customizations in the child theme, without affecting the parent Thematic style or the WordPress core.

Its best feature: You can spin off child themes, while reusing the same content blocks and staying in sync with WordPress best practices.  Even those with limited CSS and/or php skills can quickly develop attractive designs simply by editing the styles and including a few hooks to load images (in the functions file).  In addition to me, two staff members (Denise Rayman and Angela Jordan) have done this for the ALA Archives and SLC Archives.

Another plus: The Automattic “Theme division” developed and supports Thematic, which means that it benefits from close alignment with WP’s core developer group. Our site has never broken on upgrade when using my thematic child themes; at most we have done a few minutes of work to correct minor problems.

In the end, the decision to use Thematic required more upfront work, but it forced me to learn about theme development and to begin grappling with the WordPress API (e.g. hooks and filters), while setting in place a method for other staff to develop spin off sites. More on that in my next post.

Plugin Selection

Once WordPress multisite was running, we spent time selecting and installing plug-ins that could be used on the main site and that would help us achieve desired effects.  The following proved to be particularly valuable and have proven to have good forward compatibility (i.e. not breaking the site when we upgraded WordPress):

  • WPTouch Mobile
  • WP Table Reloaded (adds table editor)
  • wp-jquery Lightbox (image modal windows)
  • WordPress SEO
  • Simple Section Navigation Widget (builds local navigation menus from page order)
  • Search and Replace (admin tool for bulk updating paths, etc.)
  • List Pages Shortcode
  • Jetpack by
  • Metaslider (image carousel)
  • Ensemble Video  Shortcodes (allows embedding AV access copies in campus streaming service)
  • Google Analytics by Yoast
  • Formidable (form builder)
  • CMS Page Order (drag and drop menu for arranging overall site structure)
  • Disqus Comment System

Again, I’ll write more about how we are using these, in my next post.


William Denton: The best paper I read this year: Polster, Reconfiguring the Academic Dance

planet code4lib - Fri, 2014-12-19 01:58

The best paper I read this year is Reconfiguring the Academic Dance: A Critique of Faculty’s Responses to Administrative Practices in Canadian Universities by Claire Polster, a sociologist at the University of Regina, in Topia 28 (Fall 2012). It’s aimed at professors but public and academic librarians should read it.

Unfortunately, it’s not gold open access. There’s a two year rolling wall and it’s not out of it yet (but I will ask—it should have expired by now). If you don’t have access to it, try asking a friend or following the usual channels. Or wait. Or pay six bucks. (Six bucks? What good does that do, I wonder.)

Here’s the abstract:

This article explores and critiques Canadian academics’ responses to new administrative practices in a variety of areas, including resource allocation, performance assessment and the regulation of academic work. The main argument is that, for the most part, faculty are responding to what administrative practices appear to be, rather than to what they do or accomplish institutionally. That is, academics are seeing and responding to these practices as isolated developments that interfere with or add to their work, rather than as reorganizers of social relations that fundamentally transform what academics do and are. As a result, their responses often serve to entrench and advance these practices’ harmful effects. This problem can be remedied by attending to how new administrative practices reconfigure institutional relations in ways that erode the academic mission, and by establishing new relations that better serve academics’—and the public’s—interests and needs. Drawing on the work of various academic and other activists, this article offers a broad range of possible strategies to achieve the latter goal. These include creating faculty-run “banks” to transform the allocation of institutional resources, producing new means and processes to assess—and support—academic performance, and establishing alternative policy-making bodies that operate outside of, and variously interrupt, traditional policy-making channels.

This is the dance metaphor:

To offer a simplified analogy, if we imagine the university as a dance floor, academics tend to view new administrative practices as burdensome weights or shackles that are placed upon them, impeding their ability to perform. In contrast, I propose we see these practices as obstacles that are placed on the dance floor and reconfigure the dance itself by reorganizing the patterns of activity in and through which it is constituted. I further argue that because most academics do not see how administrative practices reorganize the social relations within which they themselves are implicated, their reactions to these practices help to perpetuate and intensify these transformations and the difficulties they produce. Put differently, most faculty do not realize that they can and should resist how the academic dance is changing, but instead concentrate on ways and means to keep on dancing as best they can.

A Dance to the Music of Time, by Nicolas Poussin (from Wikipedia)

About the constant struggle for resources:

Instead of asking administrators for the resources they need and explaining why they need them, faculty are acting more as entrepreneurs, trying to convince administrators to invest resources in them and not others. One means to this end is by publicizing and promoting ways they comply with administrators’ desires in an ever growing number of newsletters, blogs, magazines and the like. Academics are also developing and trying to “sell” to administrators new ideas that meet their needs (or make them aware of needs they didn’t realize they had), often with the assistance of expensive external consultants. Ironically, these efforts to protect or acquire resources often consume substantial resources, intensifying the very shortages they are designed to alleviate. More importantly, these responses further transform institutional relations, fundamentally altering, not merely adding to, what academics do and what they are.

About performance assessment:

Another academic strategy is to respect one’s public-serving priorities but to translate accomplishments into terms that satisfy administrators. Accordingly, one might reframe work for a local organization as “research” rather than community service, or submit a private note of appreciation from a student as evidence of high-quality teaching. This approach extends and normalizes the adoption of a performative calculus. It also feeds the compulsion to prove one’s value to superiors, rather than to engage freely in activities one values.

Later, when she covers the many ways people try to deal with or work around the problems on their own:

There are few institutional inducements for faculty to think and act as compliant workers rather than autonomous professionals. However, the greater ease that comes from not struggling against a growing number of rules, and perhaps the additional time and resources that are freed up, may indirectly encourage compliance.

Back to the dance metaphor:

If we return to the analogy provided earlier, we may envision academics as dancers who are continually confronted with new obstacles on the floor where they move. As they come up to each obstacle, they react—dodging around it, leaping over it, moving under it—all the while trying to keep pace, appear graceful and avoid bumping into others doing the same. It would be more effective for them to collectively pause, step off the floor, observe the new terrain and decide how to resist changes in the dance, but their furtive engagement with each obstacle keeps them too distracted to contemplate this option. And so they keep on moving, employing their energies and creativity in ways that further entangle them in an increasingly difficult and frustrating dance, rather than trying to move in ways that better serve their own—and others’ —needs.

Dance II, by Henri Matisse (from Wikipedia)

She closes with a number of useful suggestions about how to change things, and introduces this by saying:

Because so many academic articles are long on critique but short on solutions, I present a wide range of options, based on the reflections and actions of many academic activists both in the past and in the present, which can challenge and transform university relations in positive ways.

Every paragraph hit home. At York University, where I work, we’re going through a prioritization process using the method set out by Robert Dickeson. It’s being used at many universities, and everything about it is covered by Polster’s article. Every reaction she lists, we’ve had. Also, the university is moving to activity-based costing, a sort of internal market system, where some units (faculties) bring in money (from tuition) and all the other units (including the libraries) don’t, and so are cost centres. Cost centres! This has got people in the libraries thinking about how we can generate revenue. Becoming a profit centre! A university library! If thinking like that gets set in us deep the effects will be very damaging.

Library of Congress: The Signal: NDSR Applications Open, Projects Announced!

planet code4lib - Thu, 2014-12-18 18:32

The Library of Congress, Office of Strategic Initiatives and the Institute of Museum and Library Services are pleased to announce the official open call for applications for the 2015 National Digital Stewardship Residency, to be held in the Washington, DC area. The application period is from December 17, 2014 through January 30, 2015. To apply, go to the official USAJobs page.

Looking down Pennsylvania Avenue. Photo by Susan Manus

To qualify, applicants must have a master’s degree or higher, graduating between spring 2013 and spring 2015, with a strong interest in digital stewardship. Currently enrolled doctoral students are also encouraged to apply. Application requirements include a detailed resume and cover letter, undergraduate and graduate transcripts, two letters of recommendation and a creative video that defines an applicant’s interest in the program.  (Visit the NDSR application webpage for more application information.)

For the 2015-16 class, five residents will be chosen for a 12-month residency at a prominent institution in the Washington, D.C. area.  The residency will begin in June, 2015, with an intensive week-long digital stewardship workshop at the Library of Congress. Thereafter, each resident will move to their designated host institution to work on a significant digital stewardship project. These projects will allow them to acquire hands-on knowledge and skills involving the collection, selection, management, long-term preservation and accessibility of digital assets.

We are also pleased to announce the five institutions, along with their projects, that have been chosen as residency hosts for this class of the NDSR. Listed below are the hosts and projects, chosen after a very competitive round of applications:

  • District of Columbia Public Library: Personal Digital Preservation Access and Education through the Public Library.
  • Government Publishing Office: Preparation for Audit and Certification of GPO’s FDsys as a Trustworthy Digital Repository.
  • American Institute of Architects: Building Curation into Records Creation: Developing a Digital Repository Program at the American Institute of Architects.
  • U.S. Senate, Historical Office: Improving Digital Stewardship in the U.S. Senate.
  • National Library of Medicine: NLM-Developed Software as Cultural Heritage.

The inaugural class of the NDSR was also held in Washington, DC in 2013-14. Host institutions for that class included the Association of Research Libraries, Dumbarton Oaks Research Library, Folger Shakespeare Library, Library of Congress, University of Maryland, National Library of Medicine, National Security Archive, Public Broadcasting Service, Smithsonian Institution Archives and the World Bank.

George Coulbourne, Supervisory Program Specialist at the Library of Congress, explains the benefits of the program: “We are excited to be collaborating with such dynamic host institutions for the second NDSR residency class in Washington, DC. In collaboration with the hosts, we look forward to developing the most engaging experience possible for our residents.  Last year’s residents all found employment in fields related to digital stewardship or went on to pursue higher degrees.  We hope to replicate that outcome with this class of residents as well as build bridges between the host institutions and the Library of Congress to advance digital stewardship.”

The residents chosen for NDSR 2015 will be announced by early April 2015. Keep an eye on The Signal for that announcement. For additional information and updates regarding the National Digital Stewardship Residency, please see our website.

See the Library’s official press release here.

