
Feed aggregator

Patrick Hochstenbach: Merry Christmas!

planet code4lib - Fri, 2014-12-26 15:14
Filed under: Doodles Tagged: cartoon, cartoons, cat, christmas, comic, doodle, doodles, mouse

District Dispatch: Library experts to talk 3D printing at 2015 ALA Midwinter Meeting

planet code4lib - Fri, 2014-12-26 06:23

Technological developments in 3D printing are empowering people to learn new skills, launch business ventures and solve complex health problems. As this cutting-edge technology becomes more common in libraries, what do librarians need to know? Join a panel of information professionals for the session “Library 3D Printing—Unlocking the Opportunities, Understanding the Challenges” which takes place during the 2015 American Library Association’s (ALA) Midwinter Meeting in Chicago. The session will be held from 10:30–11:30 a.m. on Sunday, February 1, 2015, in the McCormick Convention Center room W470A.

Photo by Creative Tools via Flickr

The panel will tackle the policy implications of 3D printing from all angles, with a view to helping the library community establish smart user policies. Topics of discussion will include intellectual property and intellectual freedom issues, product liability questions, the educational and entrepreneurial applications of library 3D printing and more.

Speakers include Barbara Jones, director of the ALA Office for Intellectual Freedom; Tom Lipinski, dean and professor at the University of Wisconsin-Milwaukee School of Information Studies; and Charlie Wapner, information policy analyst at the ALA Office for Information Technology Policy.

View other ALA Washington Office Midwinter Meeting conference sessions

The post Library experts to talk 3D printing at 2015 ALA Midwinter Meeting appeared first on District Dispatch.

Terry Reese: MarcEdit Update – Christmas Edition

planet code4lib - Wed, 2014-12-24 21:03

Happy Christmas! It is my sincere wish that everyone reading this is having, or has had, a wonderful time with family and good friends over the holiday season. This year marks the second year that I’ve been away from my family – my parents, my brothers, my in-laws – all still in Oregon. It’s the hardest time to be away from family; we have always been close, and while my wife and kids have our own Christmas traditions, we’ve always found time around the holidays to be together as an extended family. But this second year in Ohio has been very different from the first. Last year, we were still trying to settle into our new community, make new friends and absorb the Midwest culture (which is very different from the west coast). Ohio had become home, and yet, it wasn’t.

This year has been different. We’ve made good friends, our kids have found a place to fit in; we’ve bought a house and are putting down roots. Ohio State University continues to be a place with challenges and opportunities to learn and grow – but more importantly, it has become a place not just with colleagues that I respect and continue to learn from, but a place where friendships have been made. When my older son had a bit of a health scare, it was my community at Ohio State and the friends we’ve made in our neighborhood that helped to provide immediate support, and continue to support us. As I look back on 2014 and all the wonderful friends and adventures that we’ve had in our new adopted state, I realize just how fortunate and blessed my family has been to find a community, job, and friends that have just fit.

This last year also saw the continued growth of both MarcEdit and its user community. On the application side, this year saw the release of MarcEdit 6, the MARCNext tool kit, integration with OCLC’s WorldCat, new language tools, automation tools, etc.  The user community…well, I’m consistently amazed by the large and diverse user community that has grown up around something that I really made available with the hope that maybe just one other person might find it useful. This is a great community, and I’m always humbled by the kindness and helpfulness displayed. I’m told often how much people appreciate this work. Well, I appreciate you as well. I have always appreciated the opportunity to work with so many interesting people on projects and problems that potentially can have lasting impacts. It has been, and always will be, one of my great pleasures.

On to the update… In what has become a tradition, I’m releasing the MarcEdit Christmas update. I’d already provided a little bit of information related to what was changing in a previous blog post: http://blog.reeset.net/archives/1632 – but I’m including the full list below:

 Changes:
  • Enhancement: MARCCompare: Added options to allow users to define colors for added and deleted content.
  • Enhancement: MARCCompare: Added options to support automatic sorting of data prior to comparison. Users can define the field used for sorting (the default is the 001).
  • Enhancement: MARCEngine: Improved support for automated conversion of NCR (numeric character reference) notation in UTF8 data to ensure proper representation of UTF8 characters.
  • Modified Behavior: Automated Update: Previously, MarcEdit would check for an update every time the application was run. If an update had occurred, the program would prompt the user for action. If the user cancelled the action, the program would re-prompt the user each time the program was started. Because many users work in environments where their updates are managed by central IT staff, this constant re-prompting was problematic. Often, it would lead to users simply disabling update notification.   To make this more user friendly, the new behavior works as follows: When the program determines an update has been made, the program will prompt the user. If the user takes no action, the program will no longer prompt for action, but instead will provide an icon denoting the presence of an update in the lower right corner, next to the Preferences shortcut.
  • Enhancement: Link Identifiers Tool: I’ve added support for MeSH headings through the use of NLM’s beta SPARQL endpoint. Records run through the linking tool with identified MeSH headings will automatically be resolved against the NLM database.
  • Enhancement: SPARQL Browser: This was described in blog post: http://blog.reeset.net/archives/1632; it is a new tool added to the MARCNext toolkit.
  • Enhancement: RDF Toolkit: In building the SPARQL Browser, I integrated a new RDF toolkit into MarcEdit. At this point, the SPARQL Browser is the only resource making use of its functionality – but I anticipate that this new functionality will be utilized for a variety of other functions as the cataloging community continues to explore new metadata models.
  • Bug Fix: Diacritic insertion via intellisense:  When typing a diacritic, selecting it by double clicking on the value would result in the file scrolling to the top.  The program now resets to the cursor position.  When the user just clicked enter to select the value, a new line was inserted behind the diacritic mnemonic — both of these have been fixed.
  • Bug Fix: Mac Threading issues: One of the things that came to my attention on the last update is that Mono has some issues when generating system dialog boxes within separate threads. It appears that the new garbage collector in Mono may be sweeping object pointers prematurely. The easy solution is to remove the need to generate these system messages or move them when necessary. This has been done. Immediately, this corrects the issue related to MarcEdit crashing when the update message was generated.
  • Bug Fix: Mac Fonts: I was having some trouble with Mac systems not failing gracefully when the program requested a font not found on the system. In the Windows and Linux implementations of Mono, the default process is to fall through the requested font family until an appropriate font is found. Under OSX, Mono’s behavior is different: unresolved fonts return no value, and text defaults to undefined blocks. I’ve reworked the font selection class to ensure that a fallback font is always selected on all systems – which has corrected this problem on the Mac.

The MarcEdit update is available for download from http://marcedit.reeset.net/downloads for all systems. You may also download and install the update via the automatic update utility from within MarcEdit itself.

Again, Happy Christmas!

–tr

Open Library: Open Library Scheduled Hardware Maintenance

planet code4lib - Wed, 2014-12-24 16:46

Open Library will be down for 3 hours starting from Dec 24, 10:00PM SF Time (PST, UTC/GMT -8 hours) due to scheduled hardware maintenance.

We’ll post updates here and on @openlibrary twitter.

Thank you for your cooperation.

Library of Congress: The Signal: The Top 10 Blog Posts of 2014 on The Signal

planet code4lib - Wed, 2014-12-24 15:53

We’re fans of lists here at the Library of Congress and there is no better way to close out the year on The Signal than taking a look back at our popular blog posts of the year.

Happy new year. Print by Currier & Ives. c1876. http://hdl.loc.gov/loc.pnp/cph.3b50424

Our most viewed post of the year, and our second most viewed post of all time since our blog launched in 2011, was the post about the discovery of unreleased Duke Nukem video game code. It generated quite a lot of buzz and was picked up by the gaming and technical news sites, including: Polygon, Engadget, Eurogamer, The Verge, Gamasutra, and CNET.

Here’s the entire list of top 10 posts of 2014 (out of 189 total posts), ranked by page views based on data as of December 22:

  1. Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress
  2. Personal Digital Archiving: The Basics of Scanning
  3. What Do you Mean by Archive? Genres of Usage for Digital Preservers
  4. Research is Magic: An Interview with Ethnographers Jason Nguyen & Kurt Baer
  5. Exhibiting .gifs: An Interview with curator Jason Eppink
  6. New NDSA Report: The Benefits and Risks of the PDF/A-3 File Format for Archival Institutions
  7. We’re All Digital Archivists Now: An Interview with Sibyl Schaefer
  8. The PDF’s Place in a History of Paper Knowledge: An Interview with Lisa Gitelman
  9. What Does it Take to Be a Well-rounded Digital Archivist?
  10. Digital Archiving: Making It Personal at the Public Library

And here are the top 10 posts with the most comments, based on data as of December 22:

  1. Personal Digital Archiving: The Basics of Scanning
  2. Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress
  3. What Do you Mean by Archive? Genres of Usage for Digital Preservers
  4. When it Comes to Keepsakes, What’s the Difference Between Physical and Digital?
  5. Where are the Born-Digital Archives Test Data Sets?
  6. Research and Development for Digital Cultural Heritage Preservation: A (Virtual and In-Person) Open Forum
  7. What Does it Take to Be a Well-rounded Digital Archivist?
  8. Comparing Formats for Still Image Digitizing: Part One
  9. Data: A Love Story in the Making
  10. Tag and Release: Acquiring & Making Available Infinitely Reproducible Digital Objects

It is heartening to see that most of the top posts of the year talk about jobs and skills in our profession, along with posts of interviews with practitioners working on stewarding various types of digital content. Looking specifically at the blog posts that generated the most comments, we were really excited to see excellent engagement and conversation occurring between commenters.

Thank you to all of our readers and commenters for making 2014 a memorable one on The Signal!

District Dispatch: National Impact of Library Public Programs

planet code4lib - Tue, 2014-12-23 21:59

Within the library community, we understand the value of public programming—at least from an experiential perspective, seeing how our users benefit. But how can we understand the benefits and challenges of public programming systematically across libraries, and ultimately at a national level?

The National Impact of Library Public Programs Assessment (NILPPA), a project of the American Library Association’s (ALA) Public Programs Office, is addressing these questions. Research work during the past year has yielded initial findings. You may find these findings of interest, and your comments will help to move this work forward.

The ALA Office for Information Technology Policy (OITP) thinks about the public policy implications of public programming. For many in the library community, the focus is on the substantive programming itself and the direct benefits to communities. For our orientation, public programming provides libraries with visibility (think marketing and advertising) in communities as important cultural and educational institutions. Public programming may also advance specific policy objectives such as improving literacy (including digital literacy), understanding challenges of privacy and surveillance in society, or the importance of widespread access to advanced technology (e.g., high-speed broadband).

The post National Impact of Library Public Programs appeared first on District Dispatch.

District Dispatch: In time for the holidays

planet code4lib - Tue, 2014-12-23 21:47

Forget the New York Times best seller list when deciding what to read on any days off you might have in front of you. The Federal Communications Commission (FCC) released the second E-rate Modernization Order in plenty of time for you to print it out and stuff it in your carryon (if you’re lucky enough to be traveling somewhere sunny) or keep it on your bedside table if you’re like me and not so lucky.

Among the major changes adopted in this order are those geared to close the broadband capacity gap for libraries and schools, particularly for those in rural areas. These include:

  • Suspending the amortization rules for special construction;
  • Allowing applicants to pay their non-discounted portion for construction costs over multiple years;
  • Equalizing the treatment of dark and lit fiber;
  • Permitting self-construction of high-speed broadband networks; and
  • Adding discounts when states match funds for broadband construction projects.

These are the changes directly related to addressing the lack of affordable access to high-capacity broadband. The Commission also increased the funding available by $1.5 billion, bringing the program up to $3.9 billion. And as always, there are a number of other important program changes that provide new opportunities for libraries. We are preparing a summary of the order, but in the meantime, the FCC has one of their own which explains the major changes, some of which take effect in the 2015 funding year.

As we alluded to earlier, there is a lot of work ahead to make sure that libraries have the supports that they need to take advantage of the new funding and program changes. To that end Susan Hildreth, director of the Institute of Museum and Library Services (IMLS), and FCC Chairman Tom Wheeler held a conference call which was both a recognition of the hard work of the American Library Association (ALA) and our library partners and a call to action. We are heeding the call to action and planning ongoing outreach and education to provide as much information to applicants and library leaders as we can. As a first step we are working with the Public Library Association to hold a webinar, January 8, to go into detail on the second E-rate order. And there will be more to come in the weeks ahead.

Exparte drinks

Read the FCC’s summary and you can get back to the book you put aside, but if you take on the full 106 pages of the E-rate order, try our official E-rate cocktail:

“The Exparte”
  • 2 ounces Campari
  • 1 ounce Gin (your choice)
  • 3 drops Bitters (try Angustura or orange)
  • Topped with club soda and garnished with an orange twist

For those of you who have been closely following the E-rate proceeding for the last 18 months, or for those intimately aware of the intricacies of the E-rate application cycle, here’s an ode to help you ring in the new E-rate year.

An E-rate Holiday Ode

The end of 2014 is now very near
and we have E-rate reform, so we hear.
Santa Wheeler has a very full sack
and the FCC/USAC elves have much on their backs.
All the new and confusing program regs
for some needed clarity we humbly beg.
With so many questions now still pending
we fear that 2015 may be never ending.
So much is new, so much has changed
even E-rate veterans’ minds go insane!
Like C2 reforms – yes we’ve waited so long
that useless 2-in-5 is finally gone!
C2 budgets, a bit hard to understand
but the FCC says your C2 funding’s in hand.
New rules for fiber, which is good news
changes like this we can certainly use.
A large increase in funding with money that’s new
to be spread among many and not just a few.
No need to amortize those big requests
get all funds in one year, likely the best.
The state match will stretch our limited funds
may not be too much but it’s more than just some.
How to do CEP, when will we hear?
the time to do this is very near.
The bad urban/rural change last summer
the new order fixes, it’s no longer a bummer.
To complete the new 471, I can hardly wait
though when it’s done I’ll be in a catatonic state.
But we’ve finally reached the end of program reform
so in the New Year let’s celebrate – the E-rate’s reborn.

–Poem by Bob Bocher, OITP Fellow

Happy New Year

The post In time for the holidays appeared first on District Dispatch.

Nicole Engard: Bookmarks for December 23, 2014

planet code4lib - Tue, 2014-12-23 20:30

Today I found the following resources and bookmarked them:

  • Contiki Contiki is an open source operating system for the Internet of Things. Contiki connects tiny low-cost, low-power microcontrollers to the Internet.

Digest powered by RSS Digest

The post Bookmarks for December 23, 2014 appeared first on What I Learned Today....

Related posts:

  1. What’s new in Ubuntu?
  2. December is Here
  3. Who's afraid of Google?

District Dispatch: Cromnibus Christmas / Chromibus Chanukkah

planet code4lib - Tue, 2014-12-23 20:10

The 113th Congress concluded its work in time to leave town for the holidays. While not the most productive Congress in terms of bills passed, the 113th was able to finish one of the mandatory “must do” items of funding the Federal government for Fiscal Year 2015.

One might notice that while the Fiscal Year actually began October 1, for Congress, a three month delay is not uncommon in the highly partisan and dysfunctional climate. The Federal government has been operating under a Continuing Resolution, a Congressionally-enacted measure to provide short term funding to keep the doors of government open while Appropriators hammer out details of longer term funding levels.

What exactly is a Cromnibus? It’s not a Nightmare Before Christmas, but rather a massive funding bill that provides funding to keep the Federal government open for a short period of time (a Continuing Resolution) and also provides long-term funding for eleven Federal agencies in one bill (an Omnibus)…thus the marvelously named CR-Omnibus!

How did libraries fare in the Cromnibus funding package? Mostly, programs supported by the libraries received level funding, which is good news in the austere atmosphere on Capitol Hill. For example, the Library Services and Technology Act, Head Start, Innovative Approaches to Literacy, and Career and Technical Education State Grants all received the same level of funding as FY 2014.

A few programs received slight increases or decreases. Small increases were granted to the Institute of Museum and Library Services, Striving Readers, the Library of Congress, and the Government Publishing Office (formerly known as the Government Printing Office). Slight decreases were dealt to Assessment programs, the National Archives, and Electronic Government initiatives.

You can view an expanded chart displaying the funding levels of top ALA priority programs by clicking here.

Now that the FY 15 budget is done and the 113th Congress has concluded, the 114th Congress will arrive in a few weeks and work on the FY 16 budget will begin.

The post Cromnibus Christmas / Chromibus Chanukkah appeared first on District Dispatch.

LITA: Jobs in Information Technology: December 23

planet code4lib - Tue, 2014-12-23 19:33

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

User Experience Librarian, University of Arkansas, Fayetteville, AR

Visit the LITA Job Site for more available jobs and for information on submitting a  job posting.

FOSS4Lib Recent Releases: Koha - Security and maintenance releases v 3.14.12, 3.16.5.1 and 3.18.2

planet code4lib - Tue, 2014-12-23 16:27
Package: Koha
Release Date: Monday, December 22, 2014

Last updated December 23, 2014. Created by David Nind on December 23, 2014.

Bug fix, security and maintenance releases for Koha. See the release announcements for the details.

William Denton: Bakewell West Scales

planet code4lib - Tue, 2014-12-23 02:58

Joan Bakewell interviews Prunella Scales and Timothy West is a fifteen-minute BBC radio interview, with journalist Joan Bakewell interviewing actors Prunella Scales and her husband Timothy West about Scales’s Alzheimer’s. You can hear the evidence of it in what she says, but as sad as that is, the good humour of the two of them dealing with it—and all three of them dealing with old age—is remarkable. This is wonderful listening.

Bakewell: Do you remember Fawlty Towers?

Scales: Yes, what about—what do you mean, the lines?

District Dispatch: “Son of SOPA,” Internet-killing MPAA horror sequel, garners serious screams

planet code4lib - Mon, 2014-12-22 20:46

Last Friday, District Dispatch readers were no doubt given the heebie-jeebies by Carrie Russell who told an eerily familiar and terrifying tale. For the past week or so in the wake of the Sony email hack, mainstream and online media have exposed a veritable cauldron of connivance by the Motion Picture Association of America (MPAA) with several of the nation’s state Attorneys General.

Photo still from “Son of Frankenstein,” ©Universal Pictures Company, Inc. (1944)

Apparently, MPAA wasn’t satisfied with rousing more than 14 million Americans to grab their keyboards, pitchforks and torches in 2012 to storm Congress’ castle and kill the monster called SOPA (the “Stop Online Piracy Act”). Instead, they immediately embarked upon a secret, potentially million dollar campaign to convince state AGs to reanimate SOPA’s corpse by (ab?)using their investigative and litigation powers. Specifically, MPAA has been trying to force Google and potentially other internet search companies to prevent the public from being able to find, and thus access, websites that the MPAA and friends unilaterally find (no judge involved) to infringe federal copyright law. (Ars Technica laid out the whole sordid campaign brilliantly this past week.)

Faced with that legitimately horrifying prospect, Google fought back late Friday by counter-suing Attorney General Jim Hood of Mississippi, MPAA’s lead laboratory assistant, who announced just hours later that he was calling a “time out” on (but not permanently abandoning) further suspect experiments in breaking the Internet.

Today, ALA’s name leads the list of signers of a letter to Attorney General Hood (pdf) (copied to all attorneys general) by a veritable Who’s Who of the nation’s other leading technology policy and public interest organizations reminding him of SOPA’s fate “and of the principled opposition to curtailing free speech that it first provoked.”

Time will tell if the nation’s Attorneys General will leave federal copyright law enforcement to Congress and reserve state taxpayers’ money for public safety issues and other matters closer to home. Meantime, do you know where your pitchfork and torch are?

Between the holidays just might be a good time to sharpen and prime them just in case we have to help make “Son of SOPA” the least profitable, shortest running movie in policy horror film history.

The post “Son of SOPA,” Internet-killing MPAA horror sequel, garners serious screams appeared first on District Dispatch.

Library of Congress: The Signal: Unlocking the Imagery of 500 Years of Books

planet code4lib - Mon, 2014-12-22 20:03

The following is a guest post by Kalev H. Leetaru of Georgetown University (Former), Robert Miller of Internet Archive and David A. Shamma from Yahoo Labs/Flickr.

In 1994, linguist Geoff Nunberg stated, in an article in the journal “Representations,” “reading what people have had to say about the future of knowledge in an electronic world, you sometimes have the picture of somebody holding all the books in the library by their spines and shaking them until the sentences fall out loose in space…” What would these fragments look like if you took every page of every book from 2.5 million volumes dating back over 500 years? Could every illustration, drawing, chart, map, photograph and image be extracted, indexed and displayed? That was the question that launched the Internet Archive Book Images project to catalog the imagery from half a millennium of books.

Over 14.7 million images were extracted from over 600 million pages covering an enormous variety of topics and stretching back to the year 1500. Yet, perhaps what is most remarkable about this montage is that these images come not from some newly-unearthed archive being seen for the first time, but rather from the books we have been digitizing for the past decade that have been resting in our digital libraries.

The history of book digitization has focused on creating text-based searchable collections–the identification of tens of millions of images on the pages of those books has historically been regarded as merely a byproduct of the digitization process. We inverted that model, reimagining books as containers of images rather than text. We then explored how digital libraries can yield rich visual collections by using modern image recognition technology coupled with Flickr to ultimately create one of the largest visual book collections in history. This involved automatically identifying visual content, cropping it out, extracting surrounding metadata via optical character recognition and uploading and indexing the structured data to Flickr. In effect, we are repurposing the vast archives of digital content created for text search and transforming it into a visual gallery of imagery. In doing so, we are creating a new way of “seeing” our cultural history.

Motivations

Albert Bierstadt, “Among the Sierra Nevada Mountains” courtesy of Wikimedia.

How does one go about creating an archive of images spanning over 500 years? After seeing Albert Bierstadt’s “Among the Sierra Nevada Mountains, California” Kalev Leetaru (a co-author of this post) became curious as to how the American West of the nineteenth century was portrayed in literature–in particular how it was portrayed in imagery rather than words. Yet, searches in various digital libraries for Albert Bierstadt’s name yielded almost exclusively textual descriptions of his artwork, not photographs or illustrations – the same was true for searches of other nineteenth century technologies like the telegraph, the telephone, and the railroad.

Current digital libraries are designed to search for a word or phrase and return a list of pages mentioning it. The notion of searching for images appearing beside a word or phrase is simply not supported. While the concept of “image search” has become ingrained in our daily lives, it has largely remained a tool exclusively for searching the web, rather than other modalities like the printed word. As we’ve sought to find structure online, we have seemingly overlooked the inherent knowledge within printed books. The call was clear: expand beyond a simple search box query and unlock the imagery of the world’s books. With this call in hand, Leetaru approached the Internet Archive, as one of the world’s largest open collections of digitized books, for permission to use their vast collection of over 600 million pages of digitized books stretching back half a millennium, and Flickr, as one of the largest online image services, to host the final collection of images in an interactive and searchable form.

A System Prototype

Historically, PDF versions of scanned books were created as what is called “image over text” files: the scanned image of each page is displayed with the OCR text hidden underneath. This works well for desktops and laptops with their unlimited storage and bandwidth, but can easily generate files in the tens or hundreds of megabytes. The rise of eReader devices, with their limited storage and processing capacities, necessitated more optimized file formats that save books as ASCII text and extract each visual element as a separate image file. Many digital libraries, including the Internet Archive, make their books available this way in the open EPUB file format, which is essentially a compressed ZIP file containing a set of HTML pages and the book’s images as JPG, PNG or GIF files. Extracting the images from each book is therefore as simple as unzipping its EPUB file, saving the images to disk, and searching the HTML pages to locate where each image appears in the book to extract the text surrounding it.
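
To make the mechanics concrete, here is a minimal Python sketch of that prototype approach: unzip an EPUB, save its images, and grab a window of text around each <img> reference. The file-extension checks and the 500-character context window are illustrative assumptions rather than the project’s actual code.

    import re
    import zipfile
    from pathlib import Path

    IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif"}

    def extract_epub_images(epub_path, out_dir):
        """Unzip an EPUB, save its images and collect the text surrounding each one."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        results = []
        with zipfile.ZipFile(epub_path) as epub:
            names = epub.namelist()
            # Save every image file found in the container.
            for name in names:
                if Path(name).suffix.lower() in IMAGE_EXTS:
                    (out / Path(name).name).write_bytes(epub.read(name))
            # Scan the XHTML pages for <img> tags and keep the text around them.
            for name in names:
                if not name.lower().endswith((".htm", ".html", ".xhtml")):
                    continue
                html = epub.read(name).decode("utf-8", errors="replace")
                for match in re.finditer(r'<img[^>]+src="([^"]+)"', html):
                    start, end = match.span()
                    nearby = html[max(0, start - 500):end + 500]
                    text = " ".join(re.sub(r"<[^>]+>", " ", nearby).split())
                    results.append({"page": name,
                                    "image": Path(match.group(1)).name,
                                    "context": text})
        return results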

The simplicity here is what makes it so powerful. The hard part of creating an image gallery from books lies in the image recognition process needed to identify and extract each image from the page, yet this task is already performed in the creation of the EPUB files. This also means that this process can be easily repeated for any digital library that offers EPUB versions of their works, making it possible to one day create a single master repository of every image published in every book ever digitized. The entire processing pipeline was performed on a single four-processor virtual machine in the Internet Archive’s Virtual Reading Room. While all of the books used are available for public download on the Archive’s website, using the Virtual Reading Room made it possible to work much more easily with the Archive’s collections, dramatically reducing the time it took to complete the project.

Creating the Final Archive

The chief limitation of this prototype solution was that the images in EPUB files are downsampled to minimize storage space and processing time on the small portable devices they are designed for. Creating a gallery of rich high-resolution imagery from the world’s books requires returning to the original raw page scan imagery and using the OCR results to identify and crop each image from those scans.

From the beginning, it was decided to leverage the existing OCR results for each book instead of trying to develop new algorithms for identifying images from scanned book pages. It is hard to improve upon the accuracy of OCR software designed solely for the purpose of identifying text and images from scanned pages and developed by companies with large research and development staffs focusing exclusively on this very topic. In addition, few libraries have large computational clusters capable of running sophisticated image processing software, with OCR either outsourced to a vendor or run in scheduled fashion on dedicated OCR servers. By using the existing OCR results, the most computationally-intensive portion of the process is simply reused and all that remains is to crop the images from the page scans using the OCR results.

The final system uses the raw output of the Abbyy OCR software that the Internet Archive runs over each book, coupled with the scanning data produced when the book was originally digitized, to identify and extract the images at the original digitization resolution. As part of the OCR process to extract searchable text from each page, the OCR software identifies images on each page and sets them aside. Similar to the process used to create the EPUB files, this image information was used to extract each of these images to create the final gallery, along with the text surrounding each image.

The process begins by downloading from the Internet Archive the master list of all files available for a book, including its original page scan images and Abbyy OCR file. The original page scan imagery (JPEG2000, JPEG or TIFF format) and the OCR’s XML output must be present or the book is skipped. Next, the “scandata.xml” file (which contains a list of all pages in the book and their status) is examined to locate page scans that were flagged by the human operator of the book scanning system for exclusion. An example might be a bad page scan where the page was not in proper position for scanning and thus was rescanned. In this case, instead of deleting the bad scan images, they are simply flagged in the scandata.xml file. Similarly, to ensure proper color calibration, a “color card” and ruler are photographed before and after each book is scanned (and sometimes periodically at random intervals throughout). These frames are also flagged in the XML file for exclusion. These page scans were dropped before the OCR software was run on the book by the Archive and thus must be identified to align the page numbering.
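
A short sketch of that exclusion check might look like the following; the <page>/<addToAccessFormats> element names are assumptions based on typical Internet Archive scandata files and could differ from item to item.

    import xml.etree.ElementTree as ET

    def excluded_leaves(scandata_path):
        """Return leaf numbers flagged for exclusion (rescans, color cards and rulers)."""
        skipped = set()
        for page in ET.parse(scandata_path).iter("page"):
            leaf = page.get("leafNum")
            keep = page.findtext("addToAccessFormats", default="true")
            if leaf is not None and keep.strip().lower() == "false":
                skipped.add(int(leaf))
        return skipped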

Next, the Abbyy OCR XML file is examined to locate each image in the book and its surrounding text. This file has an enormous wealth of information calculated by the OCR software. It breaks each page into a series of paragraphs, each paragraph into lines, and each line into individual characters. It also provides a confidence measure on how “sure” it is of each individual letter. Each region, line, character and image includes “l,t,r,b” parameters yielding its “left,” “top,” “right” and “bottom” coordinates in pixels within the original page scan image. Thus, to extract each image from a book, one simply scans the XML file for the “image” regions, reads its coordinate and crops this region from the original page scan imagery – no further analysis is necessary. The text surrounding each image is also extracted from the OCR file to enable keyword searching of the images by their context.
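
Putting the OCR output to work, a simplified crop step might look like the sketch below. It assumes the Abbyy XML marks images as <block blockType="Picture"> elements carrying the l/t/r/b attributes described above, and it uses Pillow for the crop itself (opening JPEG2000 scans requires OpenJPEG support); the production pipeline described later shells out to JPEG2000-specific tools instead.

    import xml.etree.ElementTree as ET
    from PIL import Image  # Pillow

    def picture_blocks(abbyy_xml_path):
        """Yield (page_index, (left, top, right, bottom)) for each picture block."""
        page_index = -1
        for event, elem in ET.iterparse(abbyy_xml_path, events=("start", "end")):
            tag = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace prefix
            if event == "start" and tag == "page":
                page_index += 1
            elif event == "end" and tag == "block" and elem.get("blockType") == "Picture":
                box = tuple(int(elem.get(side)) for side in ("l", "t", "r", "b"))
                yield page_index, box
            if event == "end":
                elem.clear()  # keep memory use flat; Abbyy files can be very large

    def crop_region(page_scan_path, box, out_path):
        """Crop one image region out of a full-resolution page scan."""
        with Image.open(page_scan_path) as page:
            page.crop(box).save(out_path)  # output format follows the file extension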

Now that the list of images and their surrounding text has been compiled, the system must download the full-resolution page scans to extract the images from. The problem is that the page scans for each book are delivered as ZIP files that are usually several hundred megabytes, occasionally exceeding one gigabyte. While at first this may not seem like much, when multiplied by a large number of books, the network bandwidth and the speed of computer hard-drives become critical limiting factors. A set of 22-core virtual machines were used to handle much of the computing needs of this project.

Typically one might attempt to process one book per core, meaning 22 books would be processed in parallel at any given time. If all 22 books had 500MB ZIP files containing their full resolution page scan imagery, this would require downloading 11GB of data over the network per second. Assuming that each book takes less than a second of CPU time on average to process, this would require a 100Gbps network link working at 100% capacity to sustain these processing needs. Even if this was broken into 22 separate virtual machines, each with a single core, all 22 machines would still each require their own four Gbps network link working at 100% capacity and delivering 500MB/s bandwidth.

Even if the network bandwidth limitations are overcome, the greatest challenge is actually the disk bandwidth of writing a 500MB ZIP file to disk from the network and then unpacking it (reading the 500MB) and updating the file system metadata to handle several hundred new files being written to disk (writing its 500MB of contents to disk). Thus, all said and done, a single 500MB ZIP file requires 500MB to be read twice and written twice, totaling two GB of total IO. Processing 22 files per second would require 44GB/s of disk bandwidth.

Many cloud computing vendors limit virtual machines to around 100MB/s sustained writes and 200MB/s sustained read performance, while even a dedicated physical three Gbps SATA hard drive operating at peak capability would still require two seconds just to write the ZIP file to disk and another two seconds to unpack it and write its contents to disk as individual files (assuming the intermediate reads are kernel buffered). Since writes are linear, SSD disks do not provide a speed advantage over traditional mechanical disks. While more exotic storage infrastructures exist that can support these needs, they are rarely found in library environments.

Our processing pipeline was therefore designed to minimize network and disk IO even at the cost of slowing down the processing of a single book. In the final system, if a book had fewer than 50 pages containing images to be extracted, the page images were downloaded individually via the Archive’s ZIP API. Instead of downloading the entire ZIP file to local disk, the ZIP API allows a caller to request a single file from a ZIP archive and the Archive will extract, uncompress and return that single file from the ZIP archive. Using this optimization, the contents of the ZIP were extracted individually via calls to the Archive’s web service at an average delay of around two seconds per image.

A book with 50 pages containing images would therefore require around 1.6 minutes to download all of the images, whereas on the virtual machines used for this project the raw page scan ZIP could be downloaded in under 30 seconds. However, downloading multiple full ZIP files would exceed the network and disk bandwidth on the virtual machines, raising the total time required to download the full ZIP file to about 10 minutes. For books with more than 50 pages of images, the full ZIP was downloaded, but only the needed pages were unpacked from the ZIP file instead of all pages, again reducing the read/write bandwidth. Combined, these two techniques reduced the IO load on each virtual machine to the point that the CPUs were able to be kept largely occupied when running 44 books in parallel.
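
A condensed sketch of that decision logic is below. The archive.org ZIP API URL pattern and the per-item file names (the _jp2.zip suffix and zero-padded leaf numbers) are assumptions that vary across items, and the 50-page threshold simply mirrors the heuristic described above.

    import zipfile
    from pathlib import Path

    import requests

    ARCHIVE = "https://archive.org/download"
    THRESHOLD = 50  # pages with images; below this, fetch page scans one at a time

    def fetch_page_scans(identifier, leaves_needed, out_dir):
        """Download only the page scans we need, per file or via the whole ZIP."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        zip_name = f"{identifier}_jp2.zip"  # assumed naming convention
        members = {leaf: f"{identifier}_jp2/{identifier}_{leaf:04d}.jp2"
                   for leaf in leaves_needed}
        if len(leaves_needed) < THRESHOLD:
            # Few pages: ask the ZIP API to extract one member at a time.
            for leaf, member in members.items():
                resp = requests.get(f"{ARCHIVE}/{identifier}/{zip_name}/{member}",
                                    timeout=120)
                resp.raise_for_status()
                (out / Path(member).name).write_bytes(resp.content)
        else:
            # Many pages: grab the whole ZIP once, then unpack only what we need.
            local_zip = out / zip_name
            with requests.get(f"{ARCHIVE}/{identifier}/{zip_name}",
                              stream=True, timeout=600) as resp:
                resp.raise_for_status()
                with open(local_zip, "wb") as fh:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        fh.write(chunk)
            wanted = set(members.values())
            with zipfile.ZipFile(local_zip) as zf:
                for member in zf.namelist():
                    if member in wanted:
                        zf.extract(member, out)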

Finally, the book images needed to be cropped from the full resolution page scans. The de facto tool for such image operations is ImageMagick, however its performance is extremely slow on large JPEG2000 images. Instead, we used the specialized “kdu_expand” tool from Kakadu Software to extract images and CJPEG to write the resulting image to disk. This resulted in a speedup of eight to 30 times by comparison to ImageMagick. To minimize memory requirements, the final system actually generates a shell script and then exits and invokes the shell script to perform all of the actual image processing components of the pipeline, freeing up the memory originally used to process the XML files since per-core memory was highly limited on the systems available for this project. Finally, the list of extracted JPEG images and a tab-delimited inventory text file listing the attributes of each image and the text surrounding them is compressed into one ZIP file per book, ready for use. Thus, while conceptually quite simple, the final system required considerable iterative development and enormous effort to minimize IO at all costs.

The Archive, the Flickr Commons, and the Future

With the mechanics of data management and image extraction completed, the first 2.6 million images from the collection were uploaded to the Internet Archive Book Images collection in the Flickr Commons in July/August 2014, making it searchable and browseable to the entire world. Each image is uploaded with a set of indexed tags such as book year (like 1879) or book subject (like sailing) and the book’s id (like bookidkittenscatsbooko00grov). Each image’s description has detailed information about the image caption, text before and after the image, book’s title, and the page number. In the Flickr Commons, the images themselves carry no known copyright restrictions; this itself carries new implications for libraries and archives online. In a recent conversation about this collection, Cathy Marshall stated:

“…here are over 2.5 million images, ripe for reuse and reorganized into different collections and subcollections. Do we necessarily want to read the whole of a monograph about polycystins from 1869? Probably not. But might we use a stippled protozoan illustration as a homescreen background (in this case, with little concern for copyright restrictions). And we would be in good company: in a recent study, Frank Shipman and I found that over 80% of our participants (202 out of 242) downloaded photos they found on the Internet with little concern for copyright restrictions. Almost 3/4 reused these photos in new contexts. It’s easy to see how book images, extracted and offered without copyright concerns, would be an attractive online resource for purposes many of us haven’t even foreseen yet.”

We also look towards annotations. From people in the communities and in the libraries, human annotation can help classify – formally and informally – the content to be discovered in the books. Even signals that one might take for granted, such as marking an image as a favorite or leaving a comment, can be quite valuable in social computing to further understand a corpus and help tell the stories contained across all the books. The structured data and human annotations here are the first steps; computer vision systems can further index concepts.

We are just now at the beginning of what is possible with this collection. From here, we hope to begin a transformation of the Internet Archive’s collections into a dynamic, growing collection where concepts, objects, locations and even music can be discovered, not as just an index of text around images, but rather a deeper knowledge model that utilizes the structure of books, publishing and libraries to understand the world and allow scholars to begin asking an entirely new generation of research questions.

LITA: Long Live Firefox!

planet code4lib - Mon, 2014-12-22 11:00


Until I became a librarian, I never gave much thought to web browsers. In the past I used Safari when working on a Mac, Chrome on my Android tablet, and showed the typical disdain for Internet Explorer. If I ever used Firefox it was purely coincidental, but now it’s my first choice and here’s why.

This month Mozilla launched Firefox 34 and announced a deal to make Yahoo their default search engine. I wasn’t alone in wondering if that move would be bad for business (if you’re like me, you avoid Yahoo like the plague). Mozilla also raised some eyebrows by asking for donations on their home page this year.

I switched to Firefox a few months ago, prior to all the commotion, when I came across Mozilla’s X-Ray Goggles, an add-on that allows you to view how a webpage is constructed (the Denver Public Library has a great project tutorial using X-Ray Goggles that I highly recommend). I was pleasantly surprised to find a slew of other resources for teaching the web and after doing a little more digging, I was taken by Mozilla’s support of an open web and intrigued by their non-profit status.

At the library I frequently encounter patrons who have pledged their allegiance to Google or Apple or Microsoft and I’m the same way. I was excited to update to Lollipop on my tablet and I’m saving up for an iMac, but I cringe when I think about Google’s privacy policies or Apple’s sweatshops. Are these companies that I really want to support?

I was teaching an Android class the other day and a patron asked me which browser is the best. I told her that I use Firefox because I support Mozilla and what they stand for. She chuckled at my response. Maybe it’s silly to stand up for any corporation, but given the choice I want to support the one that does the most good (or the least evil).

Mozilla’s values and goals are very much in line with the modern library. If you’re on the fence about Firefox, take a look at their Privacy Policy, Add-Ons, and see how easy it is to switch your default search engine back to Google. You just might change your mind.

Mark E. Phillips: Calculating a Use.

planet code4lib - Mon, 2014-12-22 01:42

In the previous post I talked a little about what a “use” is for our digital library and why we use it as a unit of measurement.  We’ve found that it is a very good indicator of engagement with our digital items.  It allows us to understand how our digital items are being used in ways that go beyond Google Analytics reports.

In this post I wanted to talk a little bit about how we calculate these metrics.

First of all a few things about the setup that we have for delivering content.

We have a set of application servers that are running an Apache/Python/Django stack for receiving and delivering requests for digital items.  A few years ago we decided that it would be useful to proxy all of our digital object content through these servers for delivery so we would have access to adjust and possibly restrict content in some situations.  This means two things: one, that all traffic goes in and out of these application servers for requests and delivery of content, and two, that we are able to rely on the log files that Apache produces to get a good understanding of what is going on in our system.

We decided to base our metrics on the best practices of the COUNTER initiative whenever possible, so as to align our numbers with that community.

Each night at about 1:00 AM CT we start a process that aggregates all of the log files for the previous day from the different application servers.  These are collected on a server that is responsible for calculating the daily uses.

We use the NCSA extended/combined log format for the logs coming from our servers.  We also prepend the domain name for the request because we operate multiple interfaces and domains from the same servers.  A typical set of logs looks something like this.

texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74680/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74678/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.180.228.104 - - [21/Dec/2014:03:49:28 -0600] "GET /search/?q=%22Bowles%2C+Flora+Gatlin%2C+1881-%22&t=dc_creator&display=grid HTTP/1.0" 200 17900 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
texashistory.unt.edu 129.120.90.73 - - [21/Dec/2014:03:49:28 -0600] "GET /acquire/ark:/67531/metapth74662/ HTTP/1.0" 200 95 "-" "Python-urllib/2.7"
texashistory.unt.edu 68.229.222.120 - - [21/Dec/2014:03:49:28 -0600] "GET /ark:/67531/metapth74679/m1/1/high_res/?width=930 HTTP/1.0" 200 59858 "http://forum.skyscraperpage.com/showthread.php?t=173080&page=15" "Mozilla/5.0 (Windows NT 5.1; rv:34.0) Gecko/20100101 Firefox/34.0"

 

Here are the steps that we go through for calculating uses.

  • Remove requests that are not for an ARK identifier; the request portion must start with /ark:/
  • Remove requests from known robots (this list, plus additional robots)
  • Remove requests with no user agent
  • Remove all requests that are not 200 or 304
  • Remove requests for thumbnails, raw metadata, and feedback URLs from the object, as they are generally noise

From the lines in the first example, only one line would remain after processing: the last line.
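
For readers who want to experiment with their own logs, a rough Python sketch of this filtering pass follows. The regular expression, the robot substrings, and the noise paths are illustrative stand-ins, not our production lists.

    import re

    LOG_RE = re.compile(
        r'^(?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
    )
    ROBOT_HINTS = ("bot", "slurp", "spider", "crawler", "python-urllib")  # sample list
    NOISY_PATHS = ("/thumbnail", "/metadata", "/feedback")                # sample list

    def keep_request(line):
        """Return the parsed request if it should count toward a use, else None."""
        match = LOG_RE.match(line.strip())
        if not match:
            return None
        path, status, agent = match.group("path", "status", "agent")
        if not path.startswith("/ark:/"):
            return None
        if status not in ("200", "304"):
            return None
        if not agent or agent == "-" or any(h in agent.lower() for h in ROBOT_HINTS):
            return None
        if any(noise in path for noise in NOISY_PATHS):
            return None
        return match.groupdict()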

The lines that remain are sorted by date and then grouped by IP address.  A time window of 30 minutes is run on the requests.  The IP address and ARK identifier are used as the key for the time window, which allows us to group requests from a single IP address for a single item into a single use.  These uses are then fed into a Django application where we store our stats data.
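
Continuing the sketch, the windowing step can be small as well: sort the surviving requests by time, key them by (IP address, ARK), and open a new use whenever more than 30 minutes have passed since the current window for that key began (how the window is anchored is one reasonable interpretation, not a statement of our exact rule). The timestamp format string is the standard Apache one; the request dictionaries come from the keep_request() sketch above.

    import re
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=30)
    ARK_RE = re.compile(r"/(ark:/\d+/\w+)")  # pull the ARK identifier out of the path

    def count_uses(parsed_requests):
        """Collapse parsed requests into uses: one per (IP, ARK) per 30-minute window."""
        def when(req):
            return datetime.strptime(req["time"], "%d/%b/%Y:%H:%M:%S %z")

        uses = []
        window_start = {}  # (ip, ark) -> time the current window opened
        for req in sorted(parsed_requests, key=when):
            found = ARK_RE.search(req["path"])
            if not found:
                continue
            key = (req["ip"], found.group(1))
            now = when(req)
            if key not in window_start or now - window_start[key] > WINDOW:
                window_start[key] = now
                uses.append({"ip": req["ip"], "ark": found.group(1), "first_seen": now})
        return uses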

There are of course many other specifics about the process of getting these stats processed and moved into our system.  I hope this post was helpful for explaining the kinds of things that we do and don’t count when calculating uses.

 
