
planet code4lib

Planet Code4Lib - http://planet.code4lib.org
Updated: 1 day 22 hours ago

District Dispatch: National Impact of Library Public Programs

Tue, 2014-12-23 21:59

Within the library community, we understand the value of public programming—at least from an experiential perspective, seeing how our users benefit. But how can we understand the benefits and challenges of public programming systematically across libraries, and ultimately at a national level?

The National Impact of Library Public Programs Assessment (NILPPA), a project of the American Library Association’s (ALA) Public Programs Office, is addressing these questions. Research work during the past year has yielded initial findings. You may find these findings of interest, and your comments will help to move this work forward.

The ALA Office for Information Technology Policy (OITP) thinks about the public policy implications of public programming. For many in the library community, the focus is on the substantive programming itself and the direct benefits to communities. From our policy perspective, public programming also gives libraries visibility (think marketing and advertising) in their communities as important cultural and educational institutions. Public programming may also advance specific policy objectives, such as improving literacy (including digital literacy), deepening understanding of the challenges of privacy and surveillance in society, or highlighting the importance of widespread access to advanced technology (e.g., high-speed broadband).

The post National Impact of Library Public Programs appeared first on District Dispatch.

District Dispatch: In time for the holidays

Tue, 2014-12-23 21:47

Forget the New York Times best seller list when deciding what to read on any days off you might have in front of you. The Federal Communications Commission (FCC) released the second E-rate Modernization Order in plenty of time for you to print it out and stuff it in your carry-on (if you’re lucky enough to be traveling somewhere sunny) or keep it on your bedside table if you’re like me and not so lucky.

Among the major changes adopted in this order are those geared to close the broadband capacity gap for libraries and schools, particularly for those in rural areas. These include:

  • Suspending the amortization rules for special construction;
  • Allowing applicants to pay their non-discounted portion for construction costs over multiple years;
  • Equalizing the treatment of dark and lit fiber;
  • Permitting self-construction of high-speed broadband networks; and
  • Adding discounts when states match funds for broadband construction projects.

These are the changes directly related to addressing the lack of affordable access to high-capacity broadband. The Commission also increased the funding available by $1.5 billion, bringing the program up to $3.9 billion. And as always, there are a number of other important program changes that provide new opportunities for libraries. We are preparing a summary of the order, but in the meantime, the FCC has one of their own which explains the major changes, some of which take effect in the 2015 funding year.

As we alluded to earlier, there is a lot of work ahead to make sure that libraries have the support they need to take advantage of the new funding and program changes. To that end, Susan Hildreth, director of the Institute of Museum and Library Services (IMLS), and FCC Chairman Tom Wheeler held a conference call which was both a recognition of the hard work of the American Library Association (ALA) and our library partners and a call to action. We are heeding the call to action and planning ongoing outreach and education to provide as much information to applicants and library leaders as we can. As a first step, we are working with the Public Library Association to hold a webinar on January 8 that goes into detail on the second E-rate order. And there will be more to come in the weeks ahead.

Exparte drinks

Read the FCC’s summary and you can get back to the book you put aside, but if you take on the full 106 pages of the E-rate order, try our official E-rate cocktail:

“The Exparte”
  • 2 ounces Campari
  • 1 ounce Gin (your choice)
  • 3 drops Bitters (try Angostura or orange)
  • Topped with club soda and garnished with an orange twist

For those of you who have been closely following the E-rate proceeding for the last 18 months, or for those intimately aware of the intricacies of the E-rate application cycle, here’s an ode to help you ring in the new E-rate year.

An E-rate Holiday Ode

The end of 2014 is now very near
and we have E-rate reform, so we hear.
Santa Wheeler has a very full sack
and the FCC/USAC elves have much on their backs.
All the new and confusing program regs
for some needed clarity we humbly beg.
With so many questions now still pending
we fear that 2015 may be never ending.
So much is new, so much has changed
even E-rate veterans’ minds go insane!
Like C2 reforms – yes we’ve waited so long
that useless 2-in-5 is finally gone!
C2 budgets, a bit hard to understand
but the FCC says your C2 funding’s in hand.
New rules for fiber, which is good news
changes like this we can certainly use.
A large increase in funding with money that’s new
to be spread among many and not just a few.
No need to amortize those big requests
get all funds in one year, likely the best.
The state match will stretch our limited funds
may not be too much but it’s more than just some.
How to do CEP, when will we hear?
the time to do this is very near.
The bad urban/rural change last summer
the new order fixes, it’s no longer a bummer.
To complete the new 471, I can hardly wait
though when it’s done I’ll be in a catatonic state.
But we’ve finally reached the end of program reform
so in the New Year let’s celebrate – the E-rate’s reborn.

–Poem by Bob Bocher, OITP Fellow

Happy New Year

The post In time for the holidays appeared first on District Dispatch.

Nicole Engard: Bookmarks for December 23, 2014

Tue, 2014-12-23 20:30

Today I found the following resources and bookmarked them:

  • Contiki Contiki is an open source operating system for the Internet of Things. Contiki connects tiny low-cost, low-power microcontrollers to the Internet.

Digest powered by RSS Digest

The post Bookmarks for December 23, 2014 appeared first on What I Learned Today....

Related posts:

  1. What’s new in Ubuntu?
  2. December is Here
  3. Who's afraid of Google?

District Dispatch: Cromnibus Christmas / Chromibus Chanukkah

Tue, 2014-12-23 20:10

The 113th Congress concluded its work in time to leave town for the holidays. While not the most productive Congress in terms of bills passed, the 113th was able to finish one of the mandatory “must do” items of funding the Federal government for Fiscal Year 2015.

One might notice that while the Fiscal Year actually began October 1, for Congress, a three month delay is not uncommon in the highly partisan and dysfunctional climate. The Federal government has been operating under a Continuing Resolution, a Congressionally-enacted measure to provide short term funding to keep the doors of government open while Appropriators hammer out details of longer term funding levels.

What exactly is a Cromnibus? It’s not a Nightmare Before Christmas, but rather a massive bill that provides short-term funding to keep the Federal government open (a Continuing Resolution) and long-term funding for eleven Federal agencies (an Omnibus)…thus the marvelously named CR-Omnibus!

How did libraries fare in the Cromnibus funding package? Mostly, programs supported by the libraries received level funding, which is good news in the austere atmosphere on Capitol Hill. For example, the Library Services and Technology Act, Head Start, Innovative Approaches to Literacy, and Career and Technical Education State Grants all received the same level of funding as FY 2014.

A few programs received slight increases or decreases. Small increases were granted to the Institute of Museum and Library Services, Striving Readers, Library of Congress, and the Government Publishing Office (formerly known as the Government Printing Office). Slight decreases were dealt to Assessment programs, National Archives, and Electronic Government initiatives.

You can view an expanded chart displaying the funding levels of top ALA priority programs by clicking here.

Now that the FY 15 budget is done and the 113th Congress has concluded, the 114th Congress will arrive in a few weeks and work on the FY 16 budget will begin.

The post Cromnibus Christmas / Chromibus Chanukkah appeared first on District Dispatch.

LITA: Jobs in Information Technology: December 23

Tue, 2014-12-23 19:33

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

User Experience Librarian, University of Arkansas, Fayetteville, AR

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

FOSS4Lib Recent Releases: Koha - Security and maintenance releases v 3.14.12, 3.16.5.1 and 3.18.2

Tue, 2014-12-23 16:27
Package: Koha
Release Date: Monday, December 22, 2014

Last updated December 23, 2014. Created by David Nind on December 23, 2014.

Bug fix, security and maintenance releases for Koha. See the release announcements for the details.

William Denton: Bakewell West Scales

Tue, 2014-12-23 02:58

“Joan Bakewell interviews Prunella Scales and Timothy West” is a fifteen-minute BBC radio interview, with journalist Joan Bakewell interviewing actors Prunella Scales and her husband Timothy West about Scales’s Alzheimer’s. You can hear the Alzheimer’s in what she says, but as sad as that is, the good humour of the two of them dealing with it—and all three of them dealing with old age—is remarkable. This is wonderful listening.

Bakewell: Do you remember Fawlty Towers?

Scales: Yes, what about—what do you mean, the lines?

District Dispatch: “Son of SOPA,” Internet-killing MPAA horror sequel, garners serious screams

Mon, 2014-12-22 20:46

Last Friday, District Dispatch readers were no doubt given the heebie-jeebies by Carrie Russell, who told an eerily familiar and terrifying tale. For the past week or so in the wake of the Sony email hack, mainstream and online media have exposed a veritable cauldron of connivance by the Motion Picture Association of America (MPAA) with several of the nation’s state Attorneys General.

Photo still from “Son of Frankenstein,” ©Universal Pictures Company, Inc. (1944)

Apparently, MPAA wasn’t satisfied with rousing more than 14 million Americans to grab their keyboards, pitchforks and torches in 2012 to storm Congress’ castle and kill the monster called SOPA (the “Stop Online Piracy Act”). Instead, they immediately embarked upon a secret, potentially million dollar campaign to convince state AGs to reanimate SOPA’s corpse by (ab?)using their investigative and litigation powers. Specifically, MPAA has been trying to force Google and potentially other internet search companies to prevent the public from being able to find, and thus access, websites that the MPAA and friends unilaterally find (no judge involved) to infringe federal copyright law. (Ars Technica laid out the whole sordid campaign brilliantly this past week.)

Faced with that legitimately horrifying prospect, Google fought back late Friday by counter-suing Attorney General Jim Hood of Mississippi, MPAA’s lead laboratory assistant, who announced just hours later that he was calling a “time out” on (but not permanently abandoning) further suspect experiments in breaking the Internet.

Today, ALA’s name leads the list of signers of a letter to Attorney General Hood (pdf) (copied to all state attorneys general) by a veritable Who’s Who of the nation’s other leading technology policy and public interest organizations reminding him of SOPA’s fate “and of the principled opposition to curtailing free speech that it first provoked.”

Time will tell if the nation’s Attorneys General will leave federal copyright law enforcement to Congress and reserve state taxpayers’ money for public safety issues and other matters closer to home. Meantime, do you know where your pitchfork and torch are?

Between the holidays just might be a good time to sharpen and prime them just in case we have to help make “Son of SOPA” the least profitable, shortest running movie in policy horror film history.

The post “Son of SOPA,” Internet-killing MPAA horror sequel, garners serious screams appeared first on District Dispatch.

Library of Congress: The Signal: Unlocking the Imagery of 500 Years of Books

Mon, 2014-12-22 20:03

The following is a guest post by Kalev H. Leetaru of Georgetown University (Former), Robert Miller of Internet Archive and David A. Shamma from Yahoo Labs/Flickr.

In 1994, linguist Geoff Nunberg stated, in an article in the journal “Representations,” “reading what people have had to say about the future of knowledge in an electronic world, you sometimes have the picture of somebody holding all the books in the library by their spines and shaking them until the sentences fall out loose in space…” What would these fragments look like if you took every page of every book from 2.5 million volumes dating back over 500 years? Could every illustration, drawing, chart, map, photograph and image be extracted, indexed and displayed? That was the question that launched the Internet Archive Book Images project to catalog the imagery from half a millennium of books.

Over 14.7 million images were extracted from over 600 million pages covering an enormous variety of topics and stretching back to the year 1500. Yet, perhaps what is most remarkable about this montage is that these images come not from some newly-unearthed archive being seen for the first time, but rather from the books we have been digitizing for the past decade that have been resting in our digital libraries.

The history of book digitization has focused on creating text-based searchable collections–the identification of tens of millions of images on the pages of those books has historically been regarded as merely a byproduct of the digitization process. We inverted that model, reimagining books as containers of images rather than text. We then explored how digital libraries can yield rich visual collections by using modern image recognition technology coupled with Flickr to ultimately create one of the largest visual book collections in history. This involved automatically identifying visual content, cropping it out, extracting surrounding metadata via optical character recognition and uploading and indexing the structured data to Flickr. In effect, we are repurposing the vast archives of digital content created for text search and transforming it into a visual gallery of imagery. In doing so, we are creating a new way of “seeing” our cultural history.

Motivations

Albert Bierstadt, “Among the Sierra Nevada Mountains” courtesy of Wikimedia.

How does one go about creating an archive of images spanning over 500 years? After seeing Albert Bierstadt’s “Among the Sierra Nevada Mountains, California” Kalev Leetaru (a co-author of this post) became curious as to how the American West of the nineteenth century was portrayed in literature–in particular how it was portrayed in imagery rather than words. Yet, searches in various digital libraries for Albert Bierstadt’s name yielded almost exclusively textual descriptions of his artwork, not photographs or illustrations – the same was true for searches of other nineteenth century technologies like the telegraph, the telephone, and the railroad.

Current digital libraries are designed to search for a word or phrase and return a list of pages mentioning it. The notion of searching for images appearing beside a word or phrase is simply not supported. While the concept of “image search” has become ingrained in our daily lives, it has largely remained a tool exclusively for searching the web, rather than other modalities like the printed word. As we’ve sought to find structure online we have seemingly overlooked the inherent knowledge within printed books. The call was clear: expand beyond a simple search box query and unlock the imagery of the world’s books. With this call in hand, Leetaru approached the Internet Archive, as one of the world’s largest open collections of digitized books, for permission to use their vast collection of over 600 million pages of digitized books stretching back half a millennium, and Flickr, as one of the largest online image services, to host the final collection of images in an interactive and searchable form.

A System Prototype

Historically, PDF versions of scanned books were created as what is called “image over text” files: the scanned image of each page is displayed with the OCR text hidden underneath. This works well for desktops and laptops with their unlimited storage and bandwidth, but can easily generate files in the tens or hundreds of megabytes. The rise of eReader devices, with their limited storage and processing capacities, necessitated more optimized file formats that save books as ASCII text and extract each visual element as a separate image file. Many digital libraries, including the Internet Archive, make their books available this way in the open EPUB file format, which is essentially a compressed ZIP file containing a set of HTML pages and the book’s images as JPG, PNG or GIF files. Extracting the images from each book is therefore as simple as unzipping its EPUB file, saving the images to disk, and searching the HTML pages to locate where each image appears in the book to extract the text surrounding it.
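
As a rough illustration of how simple that extraction is, here is a minimal sketch using only Python's standard library. It is a simplification for illustration, not the pipeline actually used; the file-name filters and the naive <img> scan are our own placeholders.

import os
import posixpath
import re
import zipfile

IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif")

def extract_epub_images(epub_path, out_dir="images"):
    """Unzip an EPUB, save its image files to out_dir, and record which
    HTML pages embed each image so the surrounding text can be located."""
    os.makedirs(out_dir, exist_ok=True)
    references = {}  # image file name -> list of HTML pages that embed it
    with zipfile.ZipFile(epub_path) as epub:
        names = epub.namelist()
        images = [n for n in names if n.lower().endswith(IMAGE_EXTS)]
        pages = [n for n in names if n.lower().endswith((".html", ".xhtml", ".htm"))]
        # Save every image to disk.
        for name in images:
            with open(os.path.join(out_dir, posixpath.basename(name)), "wb") as fh:
                fh.write(epub.read(name))
        # Scan the HTML pages for <img src="..."> to see where each image appears.
        for page in pages:
            html = epub.read(page).decode("utf-8", errors="ignore")
            for src in re.findall(r'<img[^>]+src="([^"]+)"', html):
                references.setdefault(posixpath.basename(src), []).append(page)
    return references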

The simplicity here is what makes it so powerful. The hard part of creating an image gallery from books lies in the image recognition process needed to identify and extract each image from the page, yet this task is already performed in the creation of the EPUB files. This also means that this process can be easily repeated for any digital library that offers EPUB versions of their works, making it possible to one day create a single master repository of every image published in every book ever digitized. The entire processing pipeline was performed on a single four-processor virtual machine in the Internet Archive’s Virtual Reading Room. While all of the books used are available for public download on the Archive’s website, using the Virtual Reading Room made it possible to work much more easily with the Archive’s collections, dramatically reducing the time it took to complete the project.

Creating the Final Archive

The chief limitation of this prototype solution was that the images in EPUB files are downsampled to minimize storage space and processing time on the small portable devices they are designed for. Creating a gallery of rich high-resolution imagery from the world’s books requires returning to the original raw page scan imagery and using the OCR results to identify and crop each image from those scans.

From the beginning, it was decided to leverage the existing OCR results for each book instead of trying to develop new algorithms for identifying images from scanned book pages. It is hard to improve upon the accuracy of OCR software designed solely for the purpose of identifying text and images from scanned pages and developed by companies with large research and development staffs focusing exclusively on this very topic. In addition, few libraries have large computational clusters capable of running sophisticated image processing software, with OCR either outsourced to a vendor or run in scheduled fashion on dedicated OCR servers. By using the existing OCR results, the most computationally-intensive portion of the process is simply reused and all that remains is to crop the images from the page scans using the OCR results.

The final system uses the raw output of the Abbyy OCR software that the Internet Archive runs over each book, coupled with the scanning data produced when the book was originally digitized, to identify and extract the images at the original digitization resolution. As part of the OCR process to extract searchable text from each page, the OCR software identifies images on each page and sets them aside. Similar to the process used to create the EPUB files, this image information was used to extract each of these images to create the final gallery, along with the text surrounding each image.

The process begins by downloading from the Internet Archive the master list of all files available for a book, including its original page scan images and Abbyy OCR file. The original page scan imagery (JPEG2000, JPEG or TIFF format) and the OCR’s XML output must be present or the book is skipped. Next, the “scandata.xml” file (which contains a list of all pages in the book and their status) is examined to locate page scans that were flagged by the human operator of the book scanning system for exclusion. An example might be a bad page scan where the page was not in proper position for scanning and thus was rescanned. In this case, instead of deleting the bad scan images, they are simply flagged in the scandata.xml file. Similarly, to ensure proper color calibration, a “color card” and ruler are photographed before and after each book is scanned (and sometimes periodically at random intervals throughout). These frames are also flagged in the XML file for exclusion. These page scans were dropped before the OCR software was run on the book by the Archive and thus must be identified to align the page numbering.

Next, the Abbyy OCR XML file is examined to locate each image in the book and its surrounding text. This file has an enormous wealth of information calculated by the OCR software. It breaks each page into a series of paragraphs, each paragraph into lines, and each line into individual characters. It also provides a confidence measure on how “sure” it is of each individual letter. Each region, line, character and image includes “l,t,r,b” parameters yielding its “left,” “top,” “right” and “bottom” coordinates in pixels within the original page scan image. Thus, to extract each image from a book, one simply scans the XML file for the “image” regions, reads their coordinates and crops each region from the original page scan imagery – no further analysis is necessary. The text surrounding each image is also extracted from the OCR file to enable keyword searching of the images by their context.
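
A sketch of that cropping step for a single page scan, in Python with Pillow. The element name "block", the blockType="Picture" marker and the l/t/r/b attributes follow the structure described above, but the exact schema varies between OCR versions, so treat the names as illustrative rather than as the production code.

import xml.etree.ElementTree as ET
from PIL import Image  # Pillow

def crop_page_images(ocr_xml_path, page_scan_path, out_prefix):
    """Crop every region the OCR marked as a picture out of one page scan.
    For simplicity this handles a single page; the real OCR file covers
    the whole book, one page element per scan."""
    tree = ET.parse(ocr_xml_path)
    # Ignore XML namespaces so the search works regardless of the schema URI.
    blocks = (el for el in tree.iter() if el.tag.split("}")[-1] == "block")
    page = Image.open(page_scan_path)
    count = 0
    for block in blocks:
        if block.get("blockType") != "Picture":
            continue
        # l/t/r/b are pixel coordinates within the original page scan.
        box = tuple(int(block.get(attr)) for attr in ("l", "t", "r", "b"))
        page.crop(box).save(f"{out_prefix}_{count:03d}.jpg")
        count += 1
    return count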

Now that the list of images and their surrounding text has been compiled, the system must download the full-resolution page scans to extract the images from. The problem is that the page scans for each book are delivered as ZIP files that are usually several hundred megabytes, occasionally exceeding one gigabyte. While at first this may not seem like much, when multiplied by a large number of books, the network bandwidth and the speed of computer hard-drives become critical limiting factors. A set of 22-core virtual machines were used to handle much of the computing needs of this project.

Typically one might attempt to process one book per core, meaning 22 books would be processed in parallel at any given time. If all 22 books had 500MB ZIP files containing their full resolution page scan imagery, this would require downloading 11GB of data over the network per second. Assuming that each book takes less than a second of CPU time on average to process, this would require a 100Gbps network link working at 100% capacity to sustain these processing needs. Even if this was broken into 22 separate virtual machines, each with a single core, all 22 machines would still each require their own four Gbps network link working at 100% capacity and delivering 500MB/s bandwidth.

Even if the network bandwidth limitations are overcome, the greatest challenge is actually the disk bandwidth of writing a 500MB ZIP file to disk from the network and then unpacking it (reading the 500MB) and updating the file system metadata to handle several hundred new files being written to disk (writing its 500MB of contents to disk). Thus, all said and done, a single 500MB ZIP file requires 500MB to be read twice and written twice, two GB of IO in total. Processing 22 files per second would require 44GB/s of disk bandwidth.
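
The back-of-the-envelope numbers above can be reproduced in a few lines of Python; this is purely the arithmetic from this paragraph, using the 500MB-per-book and 22-books-per-second figures as given.

ZIP_MB = 500          # size of one book's page-scan ZIP, per the example above
BOOKS_PER_SEC = 22    # one book per core on a 22-core machine, ~1 s of CPU per book

network_mb_per_sec = ZIP_MB * BOOKS_PER_SEC    # 11,000 MB/s = 11 GB/s over the network
network_gbps = network_mb_per_sec * 8 / 1000   # 88 Gbps, i.e. roughly a 100 Gbps link

io_per_zip_mb = ZIP_MB * 4                     # read twice, written twice: 2 GB of IO per book
disk_mb_per_sec = io_per_zip_mb * BOOKS_PER_SEC  # 44,000 MB/s = 44 GB/s of disk bandwidth

print(network_gbps, io_per_zip_mb / 1000, disk_mb_per_sec / 1000)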

Many cloud computing vendors limit virtual machines to around 100MB/s sustained writes and 200MB/s sustained read performance, while even a dedicated physical three Gbps SATA hard drive operating at peak capability would still require two seconds just to write the ZIP file to disk and another two seconds to unpack it and write its contents to disk as individual files (assuming the intermediate reads are kernel buffered). Since writes are linear, SSD disks do not provide a speed advantage over traditional mechanical disks. While more exotic storage infrastructures exist that can support these needs, they are rarely found in library environments.

Our processing pipeline was therefore designed to minimize network and disk IO even at the cost of slowing down the processing of a single book. In the final system, if a book had fewer than 50 pages containing images to be extracted, the page images were downloaded individually via the Archive’s ZIP API. Instead of downloading the entire ZIP file to local disk, the ZIP API allows a caller to request a single file from a ZIP archive and the Archive will extract, uncompress and return that single file from the ZIP archive. For optimization, contents of the ZIP were extracted individually via calls to the Archive’s web service at an average delay of around two seconds per image.

A book with 50 pages containing images would therefore require around 1.6 minutes to download all of the images, whereas on the virtual machines used for this project the raw page scan ZIP could be downloaded in under 30 seconds. However, downloading multiple full ZIP files would exceed the network and disk bandwidth on the virtual machines, raising the total time required to download the full ZIP file to about 10 minutes. For books with more than 50 pages of images, the full ZIP was downloaded, but only the needed pages were unpacked from the ZIP file instead of all pages, again reducing the read/write bandwidth. Combined, these two techniques reduced the IO load on each virtual machine to the point that the CPUs were able to be kept largely occupied when running 44 books in parallel.

Finally, the book images needed to be cropped from the full resolution page scans. The de facto tool for such image operations is ImageMagick, however its performance is extremely slow on large JPEG2000 images. Instead, we used the specialized “kdu_expand” tool from Kakadu Software to extract images and CJPEG to write the resulting image to disk. This resulted in a speedup of eight to 30 times by comparison to ImageMagick. To minimize memory requirements, the final system actually generates a shell script and then exits and invokes the shell script to perform all of the actual image processing components of the pipeline, freeing up the memory originally used to process the XML files since per-core memory was highly limited on the systems available for this project. Finally, the list of extracted JPEG images and a tab-delimited inventory text file listing the attributes of each image and the text surrounding them is compressed into one ZIP file per book, ready for use. Thus, while conceptually quite simple, the final system required considerable iterative development and enormous effort to minimize IO at all costs.

The Archive, the Flickr Commons, and the Future

With the mechanics of data management and image extraction completed, the first 2.6 million images from the collection were uploaded to the Internet Archive Book Images collection in the Flickr Commons in July/August 2014, making it searchable and browseable to the entire world. Each image is uploaded with a set of indexed tags such as book year (like 1879) or book subject (like sailing) and the book’s id (like bookidkittenscatsbooko00grov). Each image’s description has detailed information about the image caption, text before and after the image, book’s title, and the page number. In the Flickr Commons, the images themselves carry no known copyright restrictions; this itself carries new implications for libraries and archives online. In a recent conversation about this collection, Cathy Marshall stated:

“…here are over 2.5 million images, ripe for reuse and reorganized into different collections and subcollections. Do we necessarily want to read the whole of a monograph about polycystins from 1869? Probably not. But might we use a stippled protozoan illustration as a homescreen background (in this case, with little concern for copyright restrictions). And we would be in good company: in a recent study, Frank Shipman and I found that over 80% of our participants (202 out of 242) downloaded photos they found on the Internet with little concern for copyright restrictions. Almost 3/4 reused these photos in new contexts. It’s easy to see how book images, extracted and offered without copyright concerns, would be an attractive online resource for purposes many of us haven’t even foreseen yet.”

We also look towards annotations. From people in the communities and in the libraries, human annotation can help classify – formally and informally – the content to be discovered in the books. Even signals that one might take for granted, such as marking an image as a favorite or leaving a comment, can be quite valuable in social computing to further understand a corpus and help tell the stories contained across all the books. The structured data and human annotations here are the first steps; computer vision systems can further index concepts.

We are just now at the beginning of what is possible with this collection. From here, we hope to begin a transformation of the Internet Archive’s collections into a dynamic, growing collection where concepts, objects, locations and even music can be discovered, not as just an index of text around images, but rather a deeper knowledge model that utilizes the structure of books, publishing and libraries to understand the world and allow scholars to begin asking an entirely new generation of research questions.

LITA: Long Live Firefox!

Mon, 2014-12-22 11:00


Until I became a librarian, I never gave much thought to web browsers. In the past I used Safari when working on a Mac, Chrome on my Android tablet, and showed the typical disdain for Internet Explorer. If I ever used Firefox it was purely coincidental, but now it’s my first choice and here’s why.

This month Mozilla launched Firefox 34 and announced a deal to make Yahoo their default search engine. I wasn’t alone in wondering if that move would be bad for business (if you’re like me, you avoid Yahoo like the plague). Mozilla also raised some eyebrows by asking for donations on their home page this year.

I switched to Firefox a few months ago, prior to all the commotion, when I came across Mozilla’s X-Ray Goggles, an add-on that allows you to view how a webpage is constructed (the Denver Public Library has a great project tutorial using X-Ray Goggles that I highly recommend). I was pleasantly surprised to find a slew of other resources for teaching the web and after doing a little more digging, I was taken by Mozilla’s support of an open web and intrigued by their non-profit status.

At the library I frequently encounter patrons who have pledged their allegiance to Google or Apple or Microsoft and I’m the same way. I was excited to update to Lollipop on my tablet and I’m saving up for an iMac, but I cringe when I think about Google’s privacy policies or Apple’s sweatshops. Are these companies that I really want to support?

I was teaching an Android class the other day and a patron asked me which browser is the best. I told her that I use Firefox because I support Mozilla and what they stand for. She chuckled at my response. Maybe it’s silly to stand up for any corporation, but given the choice I want to support the one that does the most good (or the least evil).

Mozilla’s values and goals are very much in line with the modern library. If you’re on the fence about Firefox, take a look at their Privacy Policy, Add-Ons, and see how easy it is to switch your default search engine back to Google. You just might change your mind.

Mark E. Phillips: Calculating a Use.

Mon, 2014-12-22 01:42

In the previous post I talked a little about what a “use” is for our digital library and why we use it as a unit of measurement.  We’ve found that it is a very good indicator of engagement with our digital items.  It allows us to understand how our digital items are being used in ways that go beyond Google Analytics reports.

In this post I wanted to talk a little bit about how we calculate these metrics.

First of all a few things about the setup that we have for delivering content.

We have a set of application servers that are running an Apache/Python/Django stack for receiving and delivering requests for digital items.  A few years ago we decided that it would be useful to proxy all of our digital object content through these application servers for delivery so we would have access to adjust and possibly restrict content in some situations.  This means two things: one, that all traffic goes in and out of these application servers for requests and delivery of content, and two, that we are able to rely on the log files that Apache produces to get a good understanding of what is going on in our system.

We decided to base our metrics on the best practices of the COUNTER initiative whenever possible, so as to align our numbers with that community.

Each night at about 1:00 AM CT we start a process that aggregates all of the log files for the previous day from the different application servers.  These are collocated on a server that is responsible for calculating the daily uses.

We are using the NCSA extended/combined log format for the log files coming from our servers.  We also append the domain name for the request because we operate multiple interfaces and domains from the same servers.  A typical set of log lines looks something like this.

texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74680/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74678/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.180.228.104 - - [21/Dec/2014:03:49:28 -0600] "GET /search/?q=%22Bowles%2C+Flora+Gatlin%2C+1881-%22&t=dc_creator&display=grid HTTP/1.0" 200 17900 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74662/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.229.222.120 - - [21/Dec/2014:03:49:28 -0600] "GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 "http://forum.skyscraperpage.com/showthread.php?t=173080&page=15" "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"

 

Here are the steps that we go through for calculating uses.

  • Remove requests that are not for an ARK identifier,  the request portion must start with /ark:/
  • Remove requests from known robots (this list, plus additional robots)
  • Remove requests with no user agent
  • Remove all requests that are not 200 or 304
  • Remove requests for thumbnails, raw metadata, and feedback URLs from the object, as they are generally noise
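
A simplified sketch of this filtering pass in Python (not our production code; the robot patterns and noise paths below are placeholders standing in for the real lists):

import re

ROBOT_PATTERNS = [r"bot", r"crawler", r"spider", r"slurp", r"python-urllib"]  # simplified list
NOISE_PATHS = ("/thumbnail", "/metadata", "/feedback")  # illustrative, not the real patterns

LOG_RE = re.compile(
    r'^(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d+) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def keep(line):
    """Return the parsed request if the line survives the filtering rules, else None."""
    m = LOG_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    if not d["path"].startswith("/ark:/"):                               # only ARK requests
        return None
    if d["status"] not in ("200", "304"):                                # only 200 or 304
        return None
    if not d["agent"] or d["agent"] == "-":                              # no user agent
        return None
    if any(re.search(p, d["agent"], re.I) for p in ROBOT_PATTERNS):      # known robots
        return None
    if any(seg in d["path"] for seg in NOISE_PATHS):                     # thumbnails, metadata, feedback
        return None
    return d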

From the lines in the example above, only one line would be left after processing: the last line.

The lines that remain are sorted by date and then grouped by IP address.  A time window of 30 minutes is run on the requests.  The IP address and ARK identifier are used as the key to the time window, which allows us to group requests from a single IP address for a single item into a single use.  These uses are then fed into a Django application where we store our stats data.
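
Sketched in Python, that grouping step looks roughly like this (a simplification of our actual process; the 30-minute window per IP/ARK pair is the important part):

from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)

def count_uses(requests):
    """Group filtered requests into uses.

    `requests` is an iterable of (timestamp, ip, ark) tuples, e.g.
    (datetime(2014, 12, 21, 3, 49, 28), "68.229.222.120", "ark:/67531/metapth74679").
    Requests from the same IP for the same ARK within a 30-minute window
    count as a single use.
    """
    last_seen = {}   # (ip, ark) -> timestamp that opened the current window
    uses = []
    for ts, ip, ark in sorted(requests):   # sort by timestamp
        key = (ip, ark)
        opened = last_seen.get(key)
        if opened is None or ts - opened > WINDOW:
            uses.append((ts, ip, ark))     # a new use begins
            last_seen[key] = ts
        # otherwise the request falls inside an existing use and is ignored
    return uses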

There are of course many other specifics about the process of getting these stats processed and moved into our system.  I hope this post was helpful for explaining the kinds of things that we do and do not count when calculating uses.

 

Patrick Hochstenbach: It Is Raining Cat Doodles

Sat, 2014-12-20 15:48
Filed under: Doodles Tagged: cat, Cats, doodle, elag

District Dispatch: Sony leak reveals efforts to revive SOPA

Sat, 2014-12-20 15:17

Working in Washington, D.C. tends to make one a bit jaded: the revolving door, the bipartisan attacks, not enough funding for libraries — the list goes on. So, yes, I am D.C.-weary and growing more cynical. Now I have another reason to be fed up.

The Sony Pictures Entertainment data breach has uncovered documents that show that the Motion Picture Association of America (MPAA) has been trying to pull a fast one —reviving the ill-conceived Stop Online Piracy Act (SOPA) legislation (that failed spectacularly in 2012) by apparently working in tandem with state Attorneys General. Documents show that MPAA has been devising a scheme to get the result they could not get with SOPA—shutting down web sites and, along with them, freedom of expression and access to information.

Sony Pictures studio.

The details have been covered by a number of media outlets, including The New York Times. The MPAA seems to think that the best solution to shutting down piracy is to “make invisible” the web sites of suspected culprits. You may think that libraries have little to worry about; after all, we aren’t pirates. But the good guys will be yanked offline, as well as the alleged bad guys. Our provision of public access to the internet would then be in jeopardy because a few library users allegedly posted protected content on, for example, Pinterest or YouTube. Our protection from liability for the activities of library patrons using public computers could be thrown out the window along with internet access. This makes no sense.

SOPA, touted initially by Congress as a solution to online piracy, also made no sense from the start because it was too broad. If passed, it would have required that libraries police the internet and block web sites whenever asked by law enforcement officials. Technical experts confirmed that the implementation of SOPA could threaten cybersecurity and undermine the Domain Name System (DNS), also known as the very “backbone of the Internet.”

After historically overwhelming public outcry, the content community and internet supporters were encouraged to work together on a compromise, parties promised to collaborate, and some work was actually accomplished. Now it seems that, as far as MPAA was concerned, collaboration was just hype. They were, the leaked documents show, planning all along to get SOPA one way or another.

The library community opposes piracy. But we also oppose throwing the baby out with the bath water.

Update: The Verge has reported that Mississippi Attorney General Hood did indeed launch his promised attack on behalf of the MPAA by serving Google with a 79-page subpoena charging that Google engaged in “deceptive” or “unfair” trade practices under the Mississippi Consumer Protection Act. Google has filed a brief asking the federal court to set aside the subpoena and noting that Mississippi (or any state for that matter) has no jurisdiction over these matters.

For more on efforts to revive SOPA, see this post as well.

The post Sony leak reveals efforts to revive SOPA appeared first on District Dispatch.

Journal of Web Librarianship: Reimagining the Bibliography: Database of the Smokies

Sat, 2014-12-20 13:31
10.1080/19322909.2014.973986
Ken Wise

Terry Reese: Working with SPARQL in MarcEdit

Sat, 2014-12-20 06:06

Over the past couple of weeks, I’ve been working on expanding the linking services that MarcEdit can work with in order to create identifiers for controlled terms and headings.  One of the services that I’ve been experimenting with is NLM’s beta SPARQL endpoint for MESH headings.  MESH has always been something that is a bit foreign to me.  While I had been a cataloger in my past, my primary area of expertise was with geographic materials (analog and digital), as well as traditional monographic data.  While MESH looks like LCSH, it’s quite different as well.  So, I’ve been spending some time trying to learn a little more about it, while working on a process to consistently query the endpoint to retrieve the identifier for a preferred term. It’s been an enlightening process, but also one that has led me to think about how I might create a process that could be used beyond this simple use-case, and potentially provide MarcEdit with an RDF engine that could be utilized down the road to make it easier to query, create, and update graphs.

Since MarcEdit is written in .NET, this meant looking to see what components currently exist that provide the type of RDF functionality that I may be needing down the road.  Fortunately, a number of components exist, the one I’m utilizing in MarcEdit is dotnetrdf (https://bitbucket.org/dotnetrdf/dotnetrdf/wiki/browse/).  The component provides a robust set of functionality that supports everything I want to do now, and should want to do later.

With a tool kit found, I spent some time integrating it into MarcEdit, which is never a small task.  However, the outcome will be a couple of new features to start testing out the toolkit and start providing users with the ability to become more familiar with a key semantic web technology, SPARQL.  The first new feature will be the integration of MESH as a known vocabulary that will now be queried and controlled when run through the linked data tool.  The second new feature is a SPARQL Browser.  The idea here is to give folks a tool to explore SPARQL endpoints and retrieve the data in different formats.  The proof of concept supports XML, RDFXML, HTML, CSV, Turtle, NTriple, and JSON as output formats.  This means that users can query any SPARQL endpoint and retrieve data back.  In the current proof of concept, I haven’t added the ability to save the output – but I likely will prior to releasing the Christmas MarcEdit update.

Proof of Concept

While this is still somewhat conceptual, the current SPARQL Browser looks like the following:

At present, the Browser assumes that data resides at a remote endpoint, but I’ll likely include the ability to load local RDF, JSON, or Turtle data and provide the ability to query that data as a local endpoint.  Anyway, right now, the Browser takes a URL to the SPARQL Endpoint, and then the query.  The user can then select the format that the result set should be outputted.

Using NLM as an example, say a user wanted to query for the specific term: Congenital Abnormalities – utilizing the current proof of concept, the user would enter the following data:

SPARQL Endpoint: http://id.nlm.nih.gov/mesh/sparql

SPARQL Query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>

SELECT distinct ?d ?dLabel
FROM <http://id.nlm.nih.gov/mesh2014>
WHERE {
  ?d meshv:preferredConcept ?q .
  ?q rdfs:label 'Congenital Abnormalities' .
  ?d rdfs:label ?dLabel .
}
ORDER BY ?dLabel

Running this query within the SPARQL Browser produces a resultset that is formatted internally into a Graph for output purposes.

The images show snapshots of a couple of the different output formats.  For example, the full JSON output is the following:

{ "head": { "vars": [ "d", "dLabel" ] }, "results": { "bindings": [ { "d": { "type": "uri", "value": "http://id.nlm.nih.gov/mesh/D000013" }, "dLabel": { "type": "literal", "value": "Congenital Abnormalities" } } ] } }

The idea behind creating this as a general-purpose tool is that, in theory, it should work for any SPARQL endpoint, for example the Project Gutenberg Metadata endpoint.  The same type of exploration can be done utilizing the Browser.
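
As an aside for anyone who wants to poke at an endpoint outside of MarcEdit: any endpoint that implements the standard SPARQL protocol can be queried with a few lines of Python. This is a rough sketch, not MarcEdit's internals; it reuses the NLM endpoint and query from above, and some endpoints may want POST or their own format parameters instead.

import requests

ENDPOINT = "http://id.nlm.nih.gov/mesh/sparql"

QUERY = """
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?d ?dLabel
FROM <http://id.nlm.nih.gov/mesh2014>
WHERE {
  ?d meshv:preferredConcept ?q .
  ?q rdfs:label 'Congenital Abnormalities' .
  ?d rdfs:label ?dLabel .
}
ORDER BY ?dLabel
"""

def run_query(endpoint, query):
    """Send a SPARQL query and ask for JSON results via content negotiation."""
    response = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = run_query(ENDPOINT, QUERY)
    for binding in results["results"]["bindings"]:
        print(binding["d"]["value"], binding["dLabel"]["value"])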

Future Work

At this point, the SPARQL Browser represents a proof-of-concept tool, but one that I will make available as part of the MARCNext research toolset in the next update.

Going forward, I will likely refine the Browser based on feedback, but more importantly, start looking at how the new RDF toolkit might allow for the development of dynamic form generation for editing RDF/BibFrame data…at least somewhere down the road.

–TR

[1] SPARQL (W3C): http://www.w3.org/TR/rdf-sparql-query/
[2] SPARQL (Wikipedia): http://en.wikipedia.org/wiki/SPARQL
[3] SPARQL Endpoints: http://www.w3.org/wiki/SparqlEndpoints
[4] MarcEdit: http://marcedit.reeset.net
[5] MARCNext: http://blog.reeset.net/archives/1359

William Denton: Intersecting circles

Sat, 2014-12-20 03:21

A couple of months ago I was chatting about Venn diagrams with a nine-year-old (almost ten-year-old) friend named N. We learned something interesting about intersecting circles, and along the way I made some drawings and wrote a little code.

We started with two sets, but here let’s start with one. We’ll represent it as a circle on the plane. Call this circle c1.

Everything is either in the circle or outside it. It divides the plane into two regions. We’ll label the region inside the circle 1 and the region outside (the rest of the plane) x.

Now let’s look at two sets, which is probably the default Venn diagram everyone thinks of. Here we have two intersecting circles, c1 and c2.

We need to consider both circles when labelling the regions now. For everything inside c1 but not inside c2, use 1x; for the intersection use 12; for what’s in c2 but not c1 use x2; and for what’s outside both circles use xx.

We can put this in a table:

1  2
1  x
1  2
x  2
x  x

This looks like part of a truth table, which of course is what it is. We can use true and false instead of the numbers:

1  2
T  F
T  T
F  T
F  F

It takes less space to just list it like this, though: 1x, 12, x2, xx.

It’s redundant to use the numbers, but it’s clearer, and in the elementary school math class they were using them, so I’ll keep with that.

Three circles gives eight regions: 1xx, 12x, 1x3, 123, x2x, xx3, x23, xxx.

Four intersecting circles gets busier and gives 14 regions: 1xxx, 12xx, 123x, 12x4, 1234, 1xx4, 1x34, x2xx, x23x, x234, xx3x, xx34, xxx4, xxxx.

Here N and I stopped and made a list of circles and regions:

Circles  Regions
1        2
2        4
3        8
4        14

When N saw this he wondered how much it was growing by each time, because he wanted to know the pattern. He does a lot of that in school. We subtracted each row from the previous to find how much it grew:

Circles  Regions  Difference
1        2
2        4        2
3        8        4
4        14       6

Aha, that’s looking interesting. What’s the difference of the differences?

Circles  Regions  Difference  DiffDiff
1        2
2        4        2
3        8        4           2
4        14       6           2

Nine-year-old (almost ten-year-old) N saw this was important. I forget how he put it, but he knew that if the second-level difference is constant then that’s the key to the pattern.

I don’t know what triggered the memory, but I was pretty sure it had something to do with squares. There must be a proper way to deduce the formula from the numbers above, but all I could do was fool around a little bit. We’re adding a new 2 each time, so what if we take it away and see what that gives us? Let’s take the number of circles as n and the result as ?(n) for some unknown function ?.

n  ?(n)
1  0
2  2
3  6
4  12

I think I saw that 3 x 2 = 6 and 4 x 3 = 12, so n x (n-1) seems to be the pattern, and indeed 2 x 1 = 2 and 1 x 0 = 0, so there we have it.

Adding the 2 back we have:

Given n intersecting circles, the number of regions formed = n x (n - 1) + 2
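
One standard way to see why this formula holds (a step we didn’t actually work through that afternoon): each new circle crosses each circle already drawn in two points, as long as no two circles are disjoint or nested and no three meet at a point, so the k-th circle is cut into 2(k - 1) arcs, and each arc splits one existing region in two. In symbols:

R(1) = 2, \qquad R(k) = R(k-1) + 2(k-1)

R(n) = 2 + 2\sum_{k=2}^{n}(k-1) = n(n-1) + 2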

Therefore we can predict that for 5 circles there will be 5 x 4 + 2 = 22 regions.

I think that here I drew five intersecting circles and counted up the regions and almost got 22, but there were some squidgy bits where the lines were too close together so we couldn’t quite see them all, but it seemed like we’d solved the problem for now. We were pretty chuffed.

When I got home I got to wondering about it more and wrote a bit of R.

I made three functions; the third uses the first two:

  • circle(x,y): draw a circle at (x,y), default radius 1.1
  • roots(n): return the n nth roots of unity (when using complex numbers, x^n = 1 has n solutions)
  • drawcircles(n): draw circles of radius 1.1 around each of those n roots
circle <- function(x, y, rad = 1.1, vertices = 500, ...) { rads <- seq(0, 2*pi, length.out = vertices) xcoords <- cos(rads) * rad + x ycoords <- sin(rads) * rad + y polygon(xcoords, ycoords, ...) } roots <- function(n) { lapply( seq(0, n - 1, 1), function(x) c(round(cos(2*x*pi/n), 4), round(sin(2*x*pi/n), 4)) ) } drawcircles <- function(n) { centres <- roots(n) plot(-2:2, type="n", xlim = c(-2,2), ylim = c(-2,2), asp = 1, xlab = "", ylab = "", axes = FALSE) lapply(centres, function (c) circle(c[[1]], c[[2]])) }

drawcircles(2) does what I did by hand above (without the annotations):

drawcircles(5) shows clearly what I drew badly by hand:

Pushing on, 12 intersecting circles:

There are 12 x 11 + 2 = 134 regions there.

And 60! This has 60 x 59 + 2 = 3542 regions, though at this resolution most can’t be seen. Now we’re getting a bit op art.

This is covered in Wolfram MathWorld as Plane Division by Circles, and (2, 4, 8, 14, 22, …) is A014206 in the On-Line Encyclopedia of Integer Sequences: “Draw n+1 circles in the plane; sequence gives maximal number of regions into which the plane is divided.”

Somewhere along the way while looking into all this I realized I’d missed something right in front of my eyes: the intersecting circles stopped being Venn diagrams after 3!

A Venn diagram represents “all possible logical relations between a finite collection of different sets” (says Venn diagram on Wikipedia today). With n sets there are 2^n possible relations. Three intersecting circles divide the plane into 3 x (3 - 1) + 2 = 8 = 2^3 regions, but with four circles we have 14 regions, not 16! 1x3x and x2x4 are missing: there is no region inside only c1 and c3, or only c2 and c4, without the other circles. With five intersecting circles we have 22 regions, but logically there are 2^5 = 32 possible combinations. (What’s an easy way to calculate which are missing?)
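
One brute-force answer (not something we tried at the time) is to sample the plane numerically: reuse the same circles of radius 1.1 centred on the roots of unity, record which circles each sample point falls inside, and see which of the 2^n membership patterns never show up. A rough sketch in Python with NumPy:

import numpy as np

def region_signatures(n, rad=1.1, step=0.0025, extent=2.3):
    """Sample a grid of points and encode, for each point, which of the n
    circles (centred on the n-th roots of unity, as in drawcircles above)
    contain it, as an n-bit integer. Returns the set of codes that occur.
    The step has to be small enough not to miss any thin sliver regions."""
    xs = np.arange(-extent, extent, step)
    X, Y = np.meshgrid(xs, xs)
    codes = np.zeros(X.shape, dtype=np.int64)
    for k in range(n):
        cx, cy = np.cos(2 * np.pi * k / n), np.sin(2 * np.pi * k / n)
        inside = (X - cx) ** 2 + (Y - cy) ** 2 <= rad ** 2
        codes |= inside.astype(np.int64) << k   # set bit k when inside circle k+1
    return set(np.unique(codes).tolist())

def label(code, n):
    """Format a code the way we labelled regions, e.g. 1x3x."""
    return "".join(str(k + 1) if code >> k & 1 else "x" for k in range(n))

n = 4
found = region_signatures(n)
missing = [label(code, n) for code in range(2 ** n) if code not in found]
print(missing)   # with four circles this prints ['1x3x', 'x2x4']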

It turns out there are various ways to draw four- (or more) set Venn diagrams on Wikipedia, like this two-dimensional oddity (which I can’t imagine librarians ever using when teaching search strategies):

You never know where a bit of conversation about Venn diagrams is going to lead!

