Feed aggregator

DuraSpace News: CALL for Proposals for Sixth Annual VIVO Conference Workshops

planet code4lib - Wed, 2015-02-04 00:00

Boston, MA  The Sixth Annual VIVO Conference will be held August 12-14, 2015 at the Hyatt Regency Cambridge, overlooking Boston. The VIVO Conference creates a unique opportunity for people from across the country and around the world to come together to explore ways to use semantic technologies and linked open data to promote scholarly collaboration and research discovery.

OCLC Dev Network: Developer House Project: More of the Same - Faster Results with Related Resources

planet code4lib - Tue, 2015-02-03 20:00

We have started sharing projects created at our Developer House in December, and this week we’re happy to share another. This project and post come to you from Bill Jones and Steelsen Smith.    

Tim Ribaric: OLA SuperConference 2015 Presentation Material and Recap

planet code4lib - Tue, 2015-02-03 17:23


(View from the Podium)

OLA SuperConference 2015 was last week. I had the good fortune to attend as well as to present.

Mita Williams: Be future compatible

planet code4lib - Tue, 2015-02-03 16:42
Hmmm, I thought kindly published my last post but did not update the RSS feed, so I made this re-post:

On February 1st, I gave a presentation at the American Library Association Midwinter Conference in Chicago, Illinois as part of the ALA Masters Series called Mechanic Institutes, Hackerspaces, Makerspaces, TechShops, Incubators, Accelerators, and Centers of Social Enterprise. Where do Libraries fit in? ::
But after inspection, it looks like the RSS feed that is generated by Feedburner has been updated in such a way that I, using feedly, needed to re-subscribe. Now, I'm not sure who is at fault for this: Feedburner, feedly, or myself for using a third party to distribute a perfectly good RSS feed.

I don't follow my reading statistics very closely but I do know that the traffic to this site is largely driven by Twitter and Facebook -- much more than hits from, say, other blogs.  And yet, I'm disturbed that the 118 readers using the now defunct feedly rss feed will not know about what I'm writing now. I'm sad because while I've always had a small audience for this blog - I have always been very proud and humbled that I had this readership because attention is a gift.

LITA: Why Everyone Should be a STEMinist

planet code4lib - Tue, 2015-02-03 16:10

Flickr, ITU/R. Farrell 2013

This blog post is not solely for the attention of women. Everyone can benefit from challenging themselves in the STEM fields. STEM stands for Science, Technology, Engineering and Math. Though there is debate on whether there is a general shortage of Americans in STEM fields, women and minorities are markedly underrepresented in these areas of study. It goes without saying that all members of the public should invest in educating themselves in at least one of the STEM fields.

STEM is versatile

There is nothing wrong with participating in the humanities and liberal arts. In fact, we need those facets of our cultural identity. The addition of STEM skills can greatly enhance those areas of knowledge and make you a more diverse, dynamic and competitive employee. You can be a teacher, but you can also be an algebra or biology teacher for K-12 students. You can be a librarian, but you can also be a systems librarian, medical reference librarian, web and data librarian etc. Organization, communication and creative skills are not lost in the traditionally analytical STEM fields. Fashion designers and crafters can use computer-aided design (CAD) to create graphic representations of a product. CAD can also be used to create 3D designs. Imagine a 22nd Century museum filled with 3D printed sculptures as a representation of early 21st Century art.

Early and mid-career professionals

You’re not stuck with a job title/description. Most employers would be more than happy if you inquired about being the technology liaison for your department. Having background knowledge in your area, as well as tech knowledge, could place you in a dynamic position to recommend training, updates or software to improve your organization’s management. It is also never too late to learn. In fact, having spent decades in an industry, you’re more experienced. Experience brings with it a treasure of potential.

Youth and pre-professionals

According to the National Girls Collaborative Project, at the K-12 level girls are holding their ground with boys in the math and science curriculum. The reality is that, contrary to previously held beliefs, girls and young women are entering the STEM fields. The drop-off occurs at the post-secondary level. Women account for more than half of the population at most American universities. However, women earn the smallest share of university degrees in math, engineering and computer science (approx. 18%). Hispanics, blacks and Native Americans combined accounted for 13% of science and engineering degrees earned in 2011. Though there is an abundance of articles listing the reasons why women are dropping like flies from STEM education, many of them boil down to options: either there aren’t any or there aren’t enough. Girls are discouraged at a young age, so they opt to shift toward other areas of study to which they believe their skills are better suited. There is also the unfortunate tale of the post-graduate woman who is forced to trade her dream career in order to care for her children (or is marginalized for employment because she has children).

A little encouragement

As a female and a minority, the statistics mirror my personal narrative. I was a late arrival to the information technology field because, when I was younger, I was terrible at math. I unfortunately assumed, like many other girls, that I would never be able to work in the sciences because I wasn’t “naturally” talented. I also didn’t receive my first personal computer until I was in middle school (late 90s to early 2000s). Back then a computer would easily set you back a couple grand. The few computer courses I took championed boys as future programmers and technicians. I thought that boys were naturally great with technology without realizing that it was a symptom of their environment. It would have been great if someone pulled me aside and explained that natural talent is a myth. That if I was willing to work diligently, I could be as good as the boys.

Organizations like exist to remind everyone that women and minorities are capable of holding positions in STEM fields. This is by no means an endorsement for STEMinist, I just thought the addition of STEM as another frontier for feminism should be recognized. If you type “women in STEM” into a search engine, you’ll be inundated with other organizations that are adding to the conversation. No matter where your views fall on the concept of feminism, or females and minorities in the sciences, we all have a role to play in encouraging them to pursue their interests as a career.


Are you or do you know a woman/minority who is contributing to science and technology? No matter how small you believe the contribution to be, leave a comment in hopes of encouraging someone.

DPLA: Getting Involved: Our Expanding Community Reps Program

planet code4lib - Tue, 2015-02-03 15:51

There are so many ways to get involved with the Digital Public Library of America, each of which contributes enormously to our mission of connecting people with our shared cultural heritage. Obviously we have our crucial hubs and institutional partners, who work closely with us to bring their content to the world. If you’re a software developer, you can build upon our code, write your own, and create apps that help to spread that content far and wide. And if you want to provide financial support, that’s easy too.

But I’m often asked about more general ways that one can advance DPLA’s goals, especially in your local community, and perhaps if you’re not necessarily a librarian, archivist, museum professional, or technologist.

Our Community Reps program exists precisely for that reason. DPLA staff and partners can only get the word out so far each year, and we need people from all walks of life who appreciate what we do to step forward and act as our representatives across the country and around the globe. We currently have another round of applications, and I ask you to strongly consider joining the program.

Two hundred people, from all 50 states and from five foreign countries, have already done so. They orchestrate public meetings, hold special events, go into classrooms, host hackathons, and use their creativity and voices to let others know about DPLA and what everyone can do with it. I know this first-hand through my interactions with Reps in virtual meetings, on our special Reps email list, and from meeting many in person as I travel across the country. Reps are a critical part of the DPLA family, and they often have innovative ideas that make a huge difference to our efforts.

We’re incredibly fortunate to have so many like-minded and energetic people join us and the expanding DPLA community through the Reps program. It’s a great chance to be a part of our mission. Apply today!

Library of Congress: The Signal: Office Opens up with OOXML

planet code4lib - Tue, 2015-02-03 14:40

The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in the Office of Strategic Initiatives.

Before VisiCalc, Lotus 1-2-3, and Microsoft Excel, spreadsheets were manual although their compilers took advantage of adding machines. And there were contests, natch. This 1937 photograph from the Library’s Harris & Ewing collection portrays William A. Offutt of the Washington Loan and Trust Company. It was produced on the occasion of Offutt’s victory over 29 competitors in a speed and accuracy contest for adding machine operators sponsored by the Washington Chapter, American Institute of Banking.

We are pleased to announce the publication of nine new format descriptions on the Library’s Format Sustainability Web site. This is a closely related set, each of which pertains to a member of the Office Open XML (OOXML) family.

Readers should focus on the word Office, because these are the most recent expression of the formats associated with Microsoft’s family of “Office” desktop applications, including Word, PowerPoint and Excel. Formerly, these applications produced files in proprietary, binary formats that carried the filename extensions doc, ppt, and xls. The current versions employ an XML structure for the data and an x has been added to the extensions: docx, pptx, and xlsx.
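For readers who want to see the XML structure for themselves, here is a minimal Python sketch. It relies on the fact that the Open Packaging Conventions store the XML parts inside a ZIP container with a [Content_Types].xml part at the root. The function names and the tiny demo package are hypothetical and for illustration only; this is a rough heuristic, not a conformance check.

```python
import io
import zipfile


def list_package_parts(source):
    """Return the names of the parts stored in a ZIP-based package.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return package.namelist()


def looks_like_ooxml(source):
    """Heuristic: an OPC package carries a [Content_Types].xml part."""
    try:
        return "[Content_Types].xml" in list_package_parts(source)
    except zipfile.BadZipFile:
        # Older binary doc/ppt/xls files are not ZIP containers.
        return False


def build_demo_package():
    """Build a tiny in-memory stand-in for a docx (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
    buffer.seek(0)
    return buffer
```

Calling list_package_parts on a real docx, pptx, or xlsx file should show a set of XML parts, which is the structural difference from the older binary formats.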

In addition to giving the formats an XML expression, Microsoft also decided to move the formats out of proprietary status and into a standardized form (now focus on the word Open in the name). Three international organizations cooperated to standardize OOXML. Ecma International, an international, membership-based organization, published first in 2006. At that time, Caroline Arms (co-compiler of the Library’s Format Sustainability Web site) served on the ECMA work group, which meant that she was ideally situated to draft these descriptions.

In 2008, a modified version was approved as a standard by two bodies who work together on information technology standards through a Joint Technical Committee (JTC 1): International Organization for Standardization and International Electrotechnical Commission. These standards appear in a series with identifiers that lead off with ISO/IEC 29500. Subsequent to the initial publication by ISO/IEC, ECMA produced a second edition with identical text. Clarifications and corrections were incorporated into editions published by this trio in 2011 and 2012.

Here’s a list of the nine:

  • OOXML_Family, OOXML Format Family — ISO/IEC 29500 and ECMA 376
  • OPC/OOXML_2012, Open Packaging Conventions (Office Open XML), ISO 29500-2:2008-2012
  • DOCX/OOXML_2012, DOCX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • DOCX/OOXML_Strict_2012, DOCX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • PPTX/OOXML_2012, PPTX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • PPTX/OOXML_Strict_2012, PPTX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • XLSX/OOXML_2012, XLSX Transitional (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 1-4
  • XLSX/OOXML_Strict_2012, XLSX Strict (Office Open XML), ISO 29500:2008-2012; ECMA-376, Editions 2-4
  • MCE/OOXML_2012, Markup Compatibility and Extensibility (Office Open XML), ISO 29500-3:2008-2012, ECMA-376, Editions 1-4

Microsoft is not the only corporate entity to move formerly proprietary specifications into the realm of public standards. Over the last several years, Adobe has done the same thing with the PDF family. There seems to be a new business model here: Microsoft and Adobe are proud of the capabilities of their application software–that is where they can make money–and they feel that wider implementation of these data formats will help their business rather than hinder it.

Office work in the days before computer support. This photograph of the U.S. Copyright Office (part of the Library of Congress) was made in about 1920 by an unknown photographer. Staff members are using typewriters and a card file to track and manage copyright information. The original photograph is held in the Geographical File in the Library’s Prints and Photographs Division.

Although an aside in this blog, it is worth noting that Microsoft and Adobe also provide open access to format specifications that are, in a strict sense, still proprietary. Microsoft now permits the dissemination of its specifications for binary doc, ppt, and xls, and copies have been posted for download at the Library’s Format Sustainability site. Meanwhile, Adobe makes its DNG photo file format specification freely available, as well as its older TIFF format specification.

Both developments–standardization for Office XML and PDF and open dissemination for Office, DNG and TIFF–are good news for digital-content preservation. Disclosure is one of our sustainability factors and these actions raise the disclosure levels for all of these formats, a good thing.

Meanwhile, readers should remember that the Format Sustainability Web site is not limited to formats that we consider desirable. We list as many formats (and subformats) as we can, as objectively as we can, so that others can choose the ones they prefer for a particular body of content and for particular use cases.

The Library of Congress, for example, has recently posted its preference statements for newly acquired content. The acceptable category for textual content on that list includes the OOXML family as well as OpenDocument (aka Open Document Format or ODF), another XML-based office document format. ODF was developed by the Organization for the Advancement of Structured Information Standards, an industry consortium. ODF’s standardization as ISO/IEC 26300 in 2006 predates ISO/IEC’s standardization of OOXML. The Format Sustainability team plans to draft descriptions for ODF very soon.

Nick Ruest: An Exploratory look at 13,968,293 #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo tweets

planet code4lib - Tue, 2015-02-03 14:29
#JeSuisCharlie #JeSuisAhmed #JeSuisJuif #CharlieHebdo

I've spent the better part of a month collecting tweets from the #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, and #CharlieHebdo hashtags. Last week, I pulled together all of the collection files, did some clean up, and some more analysis on the data set (76G of json!). This time I was able to take advantage of Peter Binkley's twarc-report project. According to the report, the earliest tweet in the data set is from 2015-01-07 11:59:12 UTC, and the last tweet in the data set is from 2015-01-28 18:15:35 UTC. This data set includes 13,968,293 tweets (10,589,910 retweets - 75.81%) from 3,343,319 different users over 21 days. You can check out a word cloud of all the tweets here.

First tweet in data set (numeric sort of tweet ids):


— Thierry Puget (@titi1960) January 7, 2015


If you want to experiment/follow along with what I've done here, you can "rehydrate" the data set with twarc. You can grab the Tweet ids for the data set from here (Data & Analysis tab).

% --hydrate JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweet-ids-20150129.txt > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

The hydration process will take some time. I'd highly suggest using GNU Screen or tmux, and grabbing approximately 15 pots of coffee.


In this data set, we have 133,970 tweets with geo coordinates available. This represents about 0.96% of the entire data set.
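If you want to check a figure like this on your own hydrated file, here is a minimal Python sketch. It assumes the file holds one tweet per line as JSON and that geotagged tweets carry a non-null "coordinates" field; the function name geo_share is hypothetical, and this is not necessarily how the number above was produced.

```python
import json


def geo_share(lines):
    """Count tweets that carry geo coordinates.

    `lines` is any iterable of JSON strings, one tweet per line
    (an open file works). Returns (with_geo, total, percentage).
    """
    total = 0
    with_geo = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        total += 1
        # Assumption: tweets without a location have "coordinates": null
        # or no such key at all.
        if tweet.get("coordinates"):
            with_geo += 1
    percentage = 100.0 * with_geo / total if total else 0.0
    return with_geo, total, percentage
```

Because the function only iterates over lines, passing it an open file handle keeps memory use flat even for a 76G file.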

The map is available here in a separate page since the geojson file is 83M and will potato your browser while everything loads. If anybody knows how to stream that geojson file to Leaflet.js so the browser doesn't potato, please comment! :-)


These are the top 10 users in the data set.

  1. 35,420 tweets Promo_Culturel
  2. 33,075 tweets BotCharlie
  3. 24,251 tweets YaMeCanse21
  4. 23,126 tweets yakacliquer
  5. 17,576 tweets YaMeCanse20
  6. 15,315 tweets iS_Angry_Bird
  7. 9,615 tweets AbraIsacJac
  8. 9,318 tweets AnAnnoyingTweep
  9. 3,967 tweets rightnowio_feed
  10. 3,514 tweets russfeed

This comes from twarc-report's

$ ~/git/twarc-report/ -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Hashtags

These are the top 10 hashtags in the data set.

  1. 8,597,175 tweets #charliehebdo
  2. 7,911,343 tweets #jesuischarlie
  3. 377,041 tweets #jesuisahmed
  4. 264,869 tweets #paris
  5. 186,976 tweets #france
  6. 177,448 tweets #parisshooting
  7. 141,993 tweets #jesuisjuif
  8. 140,539 tweets #marcherepublicaine
  9. 129,484 tweets #noussommescharlie
  10. 128,529 tweets #afp
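A tally like the one above can be roughed out in a few lines of Python. This is a sketch under stated assumptions, not twarc-report's own method: it assumes one tweet per line as JSON, reads hashtags from each tweet's "entities" field, lowercases them, and counts each hashtag at most once per tweet. The function name top_hashtags is hypothetical.

```python
import json
from collections import Counter


def top_hashtags(lines, n=10):
    """Return the n most common hashtags as (hashtag, tweet_count) pairs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        # Lowercase so #CharlieHebdo and #charliehebdo are one tag, and
        # use a set so a tag repeated inside one tweet is counted once.
        counts.update({tag["text"].lower() for tag in hashtags})
    return counts.most_common(n)
```

Whether a repeated hashtag in a single tweet should count once or several times is a design choice; counting once matches the "tweets" wording of the list, but your totals may differ from the report's if it counts differently.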

These are the top 10 URLs in the data set. 3,771,042 tweets (27.00%) had a URL associated with them.

These are all shortened urls. I'm working through an issue with

  1. (43,708)
  2. (19,328)
  3. (17,033)
  4. (14,118)
  5. (13,252)
  6. (12,407)
  7. (9,228)
  8. (9,044)
  9. (8,721)
  10. (8,581)

This comes from twarc-report's

$ ~/git/twarc-report/ -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

Media

These are the top 10 media urls in the data set. 8,141,552 tweets (58.29%) had a media URL associated with them.

36,753 Occurrences

35,942 Occurrences

33,501 Occurrences

31,712 Occurrences

29,359 Occurrences

26,334 Occurrences

25,989 Occurrences

23,974 Occurrences

22,659 Occurrences

22,421 Occurrences

This comes from twarc-report's

$ ~/git/twarc-report/ -o text JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-20150129.json

tags: #JeSuisCharlie, #JeSuisAhmed, #JeSuisJuif, #CharlieHebdo, twarc, twarc-report

LITA: To Infinity (Well, LibGuides 2.0) And Beyond

planet code4lib - Tue, 2015-02-03 13:00

Ah, but do I? Credit: Buffy Hamilton


LibGuides is a content management system distributed by Springshare and used by approximately 4800 libraries worldwide to curate and annotate resources online. Generally librarians use it to compile subject guides, but more and more libraries are also using it to build their websites. In 2014, Springshare went public with a new and improved version called LibGuides 2.0.

When my small university library upgraded to LibGuides 2.0, we went the whole hog. After migrating our original LibGuides to version 2, I redid the entire library website using LibGuides, integrating all our content into one unified, flexible content management system (CMS).

Today’s post considers my library’s migration to LibGuides 2.0 and assesses the product. My next post will look at how we turned a bunch of subject guides into a high-performing website.

A faculty support page built using LibGuides 2.0 (screenshot credit: Michael Rodriguez)


According to the LibGuides Community pages, 913 libraries worldwide are running LibGuides v1, 439 are running LibGuides v1 CMS, and 1005 are running some version of LibGuides 2.0. This is important because (1) a lot of libraries haven’t upgraded yet, and (2) Springshare has a virtual monopoly on the market for library resource guides. Notwithstanding, Springshare does offer a quality product at a reasonable price. My library pays about $2000 per year for LibGuides CMS, which adds enhanced customization and API features to the regular LibGuides platform.

We did consider dropping LibGuides in favor of WordPress or another open source system, but we concluded that consolidating our web services as much as possible would enhance ease-of-use and ease training and transitions among staff. Our decision was also influenced by the fact that we use LibCal 2.0 for our study room reservation system, while Florida’s statewide Ask a Librarian virtual reference service, in which we participate, is switching to LibAnswers 2.0 by summer 2015. LibGuides, LibCal, and LibAnswers now all integrate seamlessly behind a single “LibApps” login.

LibApps admin interface (screenshot credit: Michael Rodriguez)


Since the upgrade is free, we decided to migrate before classes recommenced in September 2014. We relentlessly weeded redundant, dated, or befuddling content. I deleted or consolidated four or five guides, eliminated the inconsistent tagging system, and rearranged the subject categories. We picked a migration date, and Springshare made it happen within 30 minutes of the hour we chose.

I do suggest carefully screening your asset list prior to migration, because you have the option of populating your A-Z Database List from existing assets simply by checking the box next to each link you want to add to the database list. We overlooked this stage of the process and then had to manually add 140 databases to our A-Z list post-migration. Otherwise, migration was painless. Springshare’s tech support staff were helpful and courteous throughout the process.

Check out Margaret Heller and Will Kent’s ALA TechConnect blog post on Migrating to LibGuides 2.0 or Bill Coombs’ review of LibGuides 2 for other perspectives on the product migration.

Benefits of LibGuides 2.0

Mobile responsive. All the pages automatically realign themselves to match the viewport (tablet, smartphone, laptop, or desktop) through which they are accessed. This is huge.

Modern code. LibGuides 2.0 is built in compliance with HTML5, CSS3, and Bootstrap 3.2, which is a vast improvement given that the previous version’s code seems to date from 1999.

Custom URLs. Did you know that you can work with Springshare and your IT department to customize the URLs for your Guides? Or that you can create a custom homepage with a simple redirect? My library’s website now lives at a delightfully clean URL:

Hosting. Springshare hosts its apps on Amazon servers, so librarians can focus on building content instead of dealing with IT departments, networks, server crashes, domain renewals, or FTP.

A-Z Database List. Pool all your databases into one easily sortable, searchable master list. Sort by subject, database type, and vendor and highlight “best bets” for each category.

Customizations. Customize CSS and Bootstrap for individual guides and use the powerful new API to distribute content outside LibGuides (for LibGuides CMS subscribers only). The old API has been repurposed into a snazzy widget generator to which any LibGuides subscriber has access.

Dynamic design. New features include carousels, image galleries, and tabbed boxes.

Credit: Flickr user Neal Jennings


Hidden code. As far as I can tell, librarians can’t directly edit the CSS or HTML as in WordPress. Instead, you have to add custom code to empty boxes in order to override the default code.

Inflexible columns. LibGuides 2.0 lacks the v1 feature wherein librarians could easily adjust the width of guides’ columns. Now we are assigned a preselected range of column widths, which we can only alter by going through the hassle of recoding certain guide templates. Grr.

Slow widgets. Putting multiple widgets on one page can increase load time, and occasionally a widget won’t load at all in older versions of IE, forcing frustrated users to refresh the page.

Closed documentation. Whereas the older documentation is available on the open web for anyone to see, Springshare has locked most of its v2 documentation behind a LibApps login wall.

No encryption. Alison Macrina’s recent blog post on why we need to encrypt the web resonated because LibGuides’ public side isn’t encrypted. Springshare can provide SSL for domains, but they can’t help with custom domains maintained by library IT on local servers.

Proprietary software. As a huge advocate for open source, I wince at relying on a proprietary CMS rather than on WordPress or Drupal, even though Springshare beats most vendors hollow.


That said, we are delighted with the new and improved LibGuides. The upgrade has significantly enhanced our website’s user-friendliness, visual appeal, and performance. The next post in this two-part series will look at how we turned a bunch of subject guides into a library website.

Over to you, dear readers! What is your LibGuides experience? Any alternatives to suggest?

Open Knowledge Foundation: India’s Science and Technology Outputs are Now Under Open Access

planet code4lib - Tue, 2015-02-03 11:13

This is a cross-post from the Open Knowledge India blog, see the original here.

As a New Year 2015 gift to the scholars of the world, the two departments under the Ministry of Science and Technology, Government of India (the Department of Biotechnology [DBT] and the Department of Science and Technology [DST]) have unveiled an Open Access Policy covering all the research they fund.

The policy document dated December 12, 2014 states that “Since all funds disbursed by the DBT and DST are public funds, it is important that the information and knowledge generated through the use of these funds are made publicly available as soon as possible, subject to Indian law and IP policies of respective funding agencies and institutions where the research is performed“.

As the Ministry of Science and Technology funds basic, translational and applied scientific research in the country through various initiatives and schemes for individual scientists, scholars, institutes, start-ups, etc., this policy assumes great significance and brings almost all the science and technology outputs (here, published articles only) generated at various institutes under Open Access.

The policy underscores that providing free online access to publications is the most effective way of ensuring that publicly funded research is accessed, read and built upon.

The Ministry under this policy has set up two central repositories of its own and a central harvester, which will harvest the full text and metadata from these repositories and from other repositories of various institutes established or funded by DBT and DST in the country.

According to the Open Access policy, “the final accepted manuscript (after refereeing, revision, etc. [post-prints]) resulting from research projects, which are fully or partially funded by DBT or DST, or were performed using infrastructure built with the support of these organizations, should be deposited“.

The policy is not only limited to the accepted manuscripts, but extends to all scholarship and data which received funding from DBT or DST from the fiscal year 2012-13 onwards.

As mentioned above, many of the research projects at various institutes in the country are funded by DBT or DST, so this policy will definitely encourage the institutes to establish Open Access Institutional Repositories and to open up access to all the publicly funded research in the country.

Terry Reese: MarcEdit 6 Update

planet code4lib - Tue, 2015-02-03 06:04

This MarcEdit update includes a couple fixes and an enhancement to one of the new validation components.  Updates include:

** Bug Fix: Task Manager: When selecting the Edit Subfield function, once the delete subfield checkbox is selected and saved, you cannot reopen the task to edit.  This has been corrected.
** Bug Fix: Validate ISBNs: When processing ISBNs, validation appeared to be working incorrectly.  This has been corrected.  The ISBN validator now automatically validates $a and $z of any field specified.
** Enhancement: Validate ISBNs: When selecting the field to validate — if just the field is entered, the program automatically examines the $a and $z.  However, you can specify a specific field and subfield for validation. 


Validate ISBNs

This is a new function (as of the last update) that utilizes the mathematical formula to examine the ISBN and determine if the number is mathematically correct.  As I work into the future, I’ll add functionality to enable users to ensure that the ISBN is actually in use and linked to the record referenced in the record.  To use the function, open the MarcEditor, Select the Reports Menu, and then Validate ISBNs. 

Once selected, you will be asked to specify a field or field and subfield to process.  If just the field is selected, the program will automatically evaluate the $a and $z if present.  If the field and subfield is specified, the program will only evaluate the specified subfield.

When run, the program will output any ISBN fields that cannot be mathematically validated.
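For anyone curious about the math involved, here is a minimal Python sketch of the standard ISBN-10 and ISBN-13 check-digit tests. It is not MarcEdit's code, and the function names are hypothetical. It also assumes the input is the bare number, with any qualifier text such as "(pbk.)" already removed.

```python
DIGITS = "0123456789"


def clean_isbn(raw):
    """Drop hyphens and spaces and uppercase a trailing x."""
    return raw.replace("-", "").replace(" ", "").upper()


def is_valid_isbn10(isbn):
    """ISBN-10: weights 10 down to 1, sum divisible by 11 (X means 10)."""
    if len(isbn) != 10:
        return False
    if not all(ch in DIGITS for ch in isbn[:9]) or isbn[9] not in DIGITS + "X":
        return False
    total = sum((10 - i) * int(ch) for i, ch in enumerate(isbn[:9]))
    total += 10 if isbn[9] == "X" else int(isbn[9])
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    if len(isbn) != 13 or not all(ch in DIGITS for ch in isbn):
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(raw):
    """Pick the ISBN-10 or ISBN-13 test based on length after cleanup."""
    isbn = clean_isbn(raw)
    if len(isbn) == 10:
        return is_valid_isbn10(isbn)
    if len(isbn) == 13:
        return is_valid_isbn13(isbn)
    return False
```

A number that passes is only mathematically well formed; as noted above, that does not show the ISBN is actually in use or belongs to the record.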


To get the update, utilize the automated update utility or go to to get the current download.


William Denton: Measure the Library Freedom

planet code4lib - Tue, 2015-02-03 01:41

The winners of the Knight News Challenge: Libraries were announced a few days ago. I didn’t know about the Knight Foundation (those are the same Knights as in Knight Ridder (“Not to be confused with Knight Rider or Night Rider”)) but they’re giving out lots of money to lots of good projects. DocumentCloud got funded a few years ago, and the Internet Archive got $600,000 this round, and well deserved it is. I was struck by how two winners fit together: the Library Freedom Project, which got $244,700, and Measure the Future, which got $130,000.

The Library Freedom Project has this goal:

Providing librarians and their patrons with tools and information to better understand their digital rights by scaling a series of privacy workshops for librarians.

Measure the Future says:

Imagine having a Google-Analytics-style dashboard for your library building: number of visits, what patrons browsed, what parts of the library were busy during which parts of the day, and more. Measure the Future is going to make that happen by using simple and inexpensive sensors that can collect data about building usage that is now invisible. Making these invisible occurrences explicit will allow librarians to make strategic decisions that create more efficient and effective experiences for their patrons.

Our goal is to enable libraries and librarians to make the tools that measure the future of the library as physical space. We are going to build open tools using open hardware and open source software, and then provide open tutorials so that libraries everywhere can build the tools for themselves.

Moss is boss.

I like collecting and analyzing data, I like measuring things, I like small computers and embedded devices, even smart dust—it always comes back to Vernor Vinge, this time A Deepness In the Sky—but I must say I don’t like Google Analytics even though we use it at work. Any road up:

We will be producing open tutorials that outline both the open hardware and the open source software we will be using, so that any library anywhere will be able to purchase inexpensive parts, put them together, and use code that we provide to build their own sensor networks for their own buildings.

The people behind Measure the Future are all top in the field, but, cripes, it looks like they want to combine users, analytics, metrics, sensors, embedded devices, free software, open hardware and “library as place” into a well-intentioned ROI-demonstrating panopticon.

Delicious Mondrian cake. So moist, so geometric.

I’m not going to get all Michel Foucault on you, but I recently read The Inspection House: An Impertinent Field Guide to Modern Surveillance by Tim Maly and Emily Horne:

The panopticon is the inflexion point and the culmination point of this new regime. It is the platonic ideal of the control the disciplinary society is trying to achieve. Operation of the panopticon does not require special training or expertise; anyone (including the children or servants of the director, as Bentham suggests) can provide the observation that will produce the necessary effects of anxiety and paranoia in the prisoner. The building itself allows power to be instrumentalized, redirecting it to the accomplishment of specific goals, and the institutional architecture provides the means to achieve that end.

Measure the Future has all the best intentions and will use safe methods, but still, it vibes hinky, this idea of putting sensors all over the library to measure where people walk and talk and, who knows, where body temperature goes up or which study rooms are the loudest … and then that would get correlated with borrowing or card swipes at the gate … and knowing that the spy agencies can hack into anything unless the most extreme security measures are taken and there’s never a moment’s lapse … well, it makes me hope they’ll be in close collaboration with the Library Freedom Project.

And maybe the Library Freedom Project can ask them why, when we’re trying to help users protect themselves as their own governments try to eliminate privacy forever, we’re planting sensors around our buildings because we now think that neverending monitoring of users will help us improve our services and show our worth to our funders.

Mita Williams: Hackerspaces, Makerspaces, Fab Labs, TechShops, Incubators, Accelerators... Where do libraries fit in?

planet code4lib - Mon, 2015-02-02 22:18
[On February 1st, I gave this presentation at the American Library Association Midwinter Conference in Chicago, Illinois as part of the ALA Masters Series. Thank you, good people of ALA.]

Today’s session is going to start out as a field guide but it’s going to end with a history lesson.

We’re going to start here - with a space station called c-base that was found/ed in Berlin in 1995.

And then we are going to travel through time and space to the present day, where business start-up incubator innovation labs are everywhere, including CBASE, which is the College of Business and Economics at the University of Guelph.

But before we figure out where library makerspaces fit in, we’re going to use the c-base space station to go back in time, just before the very first public libraries were established around the world, so we can figure out how to go back to the future we want. It is 2015, after all.

But before we can talk about library makerspaces, we need to talk about hackerspaces.

This is the inside of c-base.

c-base is considered one of - or perhaps even - the very first hackerspaces. It was established in 1995 by self-proclaimed nerds, sci-fi fans, and digital activists who tell us that c-base was built from a reconstructed space station that fell to earth, then somehow became buried, and when it was uncovered it was found to bear the inscription: be future compatible.

The c-base is described as a system of seven concentric rings that can move in relation to each other. These rings are called core, com, culture, creative, cience, carbon and clamp.

Beyond its own many activities, c-base has become the meeting place for German Wikipedians and it’s where the German Pirate Party was first established.

Members of c-base have been known to present at events hosted by the Chaos Computer Club, which is Europe's largest association of hackers that's been around for 30 years now.

So c-base is a hackerspace that is actually inhabited by what we commonly think of as hackers.  

Some of the earliest hackerspaces were directly inspired by c-base. There is a story that goes that in August of 2007, a group of North American hackers visited Germany for Chaos Communication Camp and were so impressed that when they came back, they formed the first hackerspaces in the United States, including NYC Resistor (2007), HacDC (2007), and Noisebridge (San Francisco, 2008).

Since then, many, many more hackerspaces have been developed - there are at least a thousand - but behind these new spaces are organizations that are much less counter-culture in their orientation than the mothership of c-base. In fact, at this moment, you could say there isn’t a clear delineation between hackerspaces and makerspaces at all.

But before we can start talking about makerspaces, I think it’s necessary to pay a visit to two branches of the hackerspace evolutionary tree: TechShops and Fab Labs.

TechShop is a business that started in 2006 which provides - in return for a monthly membership - access to space that contains over half a million dollars of equipment, generally including an electronics lab, a machine shop, a wood shop, a metal working shop, etc. There are only 8 of these TechShops across the US despite earlier predictions that there would be about 20 of them by now. They have been slow to open because the owner has stated that the business requires at least 800 people willing to pay over $100 a month in order for a TechShop to be viable.

The motto of TechShop is Build Your Dreams here. But TechShops have been largely understood as places where members dream of prototypes for their future Kickstarter projects. And such dreams have already come true: the prototype of the Square credit card processing reader, for example, was built in a TechShop. I think it's telling that the Detroit TechShop has a bright red phone in the space that connects you directly to the United States Patent and Trademark Office in case of a patent emergency.

Three out of the 8 TechShops have backing from other organizations. TechShop's Detroit center opened in 2012 in partnership with Ford, which gives its employees free membership for three months. Ford employees can claim patents for themselves or they can give them to Ford in exchange for a share in revenue generated. Ford claims that this partnership with TechShop has led to a 50% rise in the number of patentable ideas put forward by the carmaker's employees in one year.

TechShop's offices in Washington DC and Pittsburgh are being sponsored by DARPA, an agency of the Defense Department. DARPA is reported to have invested $3.5 million into TechShop as part of its “broad mission to see if regular citizens can outinvent military contractors on some of its weirder projects.” But DARPA is not just helping pay for the space, they supposedly use the space themselves. According to the Bloomberg Business Week story I read, DARPA employees arrive at midnight to work when the TechShop is closed to its regular members.

You might be surprised, but we're going to be talking about DARPA again during this talk. But before that, we need to visit another franchise-like type of makerspace called the Fab Lab.

In 1998, Neil Gershenfeld started a class at MIT called "How to make (almost) anything". Gershenfeld wanted to introduce industrial-size machines normally inaccessible to technical students. However, he found his class also attracted a lot of students from various backgrounds including artists, architects, and designers. This led to a larger collaboration which eventually resulted in the Fab Lab Project which began in 2001. Fab Lab began as an educational outreach program from MIT but the idea has since developed into an ambitious network of labs located around the world.

The idea behind Fab Lab is that the space should provide a core set of tools powered by open source software that allow novice makers to make almost anything given a brief introduction to engineering and design education. Anyone can create a recognized Fab Lab as long as it makes a strong effort to uphold the criteria of a Fab Lab, with the most important being that Fab Labs are required to be regularly open to the public for little or no cost. While it's not required, a Fab Lab is also strongly encouraged to communicate and collaborate with the 350 or so other Fab Labs around the world. The idea is that, for example, if you design and make something using Fab Lab equipment in Boston, you could send the files and documents to someone in the Cape Town Fab Lab who could make the same thing using their equipment.

The first library makerspace was a Fab Lab. It was established in 2011 in the Fayetteville Free Library in the state of New York. That's Lauren Britton pictured on screen who was a driving force that helped make that happen.

Now we don't tend to talk about Fab Labs in libraries. We talk about makerspaces. I think this is for several reasons with one of the main ones being - as admirable as I personally find the goals of international collaboration through open source and standardization - the established minimum baseline for such a Fab Lab generally costs between $25,000 and $65,000 in capital costs alone. This means that a proper Fab Lab is out of reach for many communities and smaller organizations.

I think there's another reason why we think of makerspaces before we think of Fab Labs, TechShops or hackerspaces. And that's because of Make Magazine.

Started in 2005 from the influential source of so many essential computer books, O'Reilly Publishing, Make Magazine was going to be called Hack. But then the daughter of founder Dale Dougherty told him that hacking didn’t sound good, and she didn’t like it. Instead, she suggested he call the magazine MAKE, because ‘everyone likes making things’.

And there is something to be said for having a more inclusive name, and something less threatening than hackerspace. But I think there's more to it as well. There is a freedom that comes with the name of makerspace.

One of my favourite things about makerspaces is that most of them are open to everyone - artists, scientists, educators, hobbyists, hackers and entrepreneurs - and it is the possibility of cross-pollination of ideas that is one of the espoused benefits of these spaces for their members. In a world where there's so much specialization, makerspaces are a force that is trying to bring different groups of people together.

Here's such an example. This is i3Detroit, which calls itself a DIY co-working space that is "a collision of art, technology and collaboration".

There are also makerspaces that are more heavily arts-based.  Miss Despoinas is a salon for experimental research and radical aesthetics that hosts workshops using code in contemporary art practice. It is physically located in Hobart, Tasmania.

There are presumably makerspaces that are designed primarily for the launching of new companies, although the only one I could find was Haxlr8r. Haxlr8r is a hardware business accelerator that combines workshop space with mentorship and venture capital opportunities, and it has official bases in San Francisco and Shenzhen, China.

That being said, I can't help but note that most of the makerspaces I've found that are designed specifically to support start-ups have been in universities. Pictured here is the "Industrial Courtyard" where students and recent graduates of the university where I work can have access for prototype or product development.

In some ways, this brings us full circle because it's been said the originators of the first hackerspaces set them up deliberately outside of universities, governments, and businesses because they wanted a form of political independence and even to be a place for resistance to the bad actors of these organizations.

As Willow Brugh describes this transition from the earliest hackerspaces and hacklabs :

The commercialization of the space means more people have access to the ideals of these spaces - but just as when "Open Source" opened up the door to more participants, the blatant political statement of "Free Software" was lost - hacklabs have turned from a political statement on use of space and voice into a place for production and participation in mainstream culture.

As neutral and benign as makerspaces seemingly are ("everyone likes to make things"), there are reasons to be mindful of the organizations behind them. For one, in 2012 Make Magazine received a grant from DARPA to establish makerspaces in 1000 U.S. high schools over the next four years.

Now it's one thing if makerspaces simply exist as a place where friends and hobbyists can meet, work and learn from each other. It's quite another if the makerspace becomes the basis of a model to address STEM anxieties in education.

As much as I appreciate how the Maker Movement is trying to bring a playful approach to learning through building, it's important to recognize that makerspaces tend to collect successful makers rather than produce them. The community that participates in hackerspaces and makerspaces is pronouncedly skewed white and male. In 2012, Make Magazine reported that of its total readership of 300,000, 81% are male, the median age is 44, and the median household income is $106,000.

Lauren Britton, the librarian who was responsible for the very first Library Fab Lab/Makerspace, is now a doctoral student at Syracuse University in Information Science and Technology and a researcher for their Information Institute. She's been doing discourse analysis on the maker movement and last year she informally published some of her findings so far. She's already tackled STEM anxiety and I'm particularly looking forward to what she has to say about gender and the makerspace movement.

But there's no time to get into all of that now, because it is now time to hop into c-base and travel through time and space to the time before public libraries. We are going to travel up the makerspace evolutionary tree to what I like to consider the proto-species of the makerspace: The Mechanics Institute.

The world's first Mechanics' Institute was established in Edinburgh, Scotland in October 1821. Mechanics Institutes were formed to provide libraries and forms of adult education, particularly in technical subjects, to working men. As such, they were often funded by local industrialists on the grounds that they would ultimately benefit from having more knowledgeable and skilled employees. Mechanics Institutes as an institution did not last very long - the movement lasted only fifty years or so - although at their peak there were 700 of them worldwide.

What I think is so particularly poetic is that many of the buildings and core book collections of these Mechanics Institutes - especially where I'm from, which is the province of Ontario in Canada - became the foundation for the very first public libraries.

There are still some Mechanics Institutes among us, like coelacanths, evolutionarily speaking - most notably Montreal's Atwater Library and San Francisco's beautiful Mechanics Institute and Chess Room.

Now, I have to admit, when I see some makerspaces, they remind me of mechanics institutes: subsidized spaces that exist to provide access to technologies to be used for potential start-ups. And if that remains their primary focus, I think their moment will pass, just like mechanics institutes. The forces that made industrial technology accessible to small groups will presumably continue to develop into consumer technology. To live by disruption is to die by disruption.

This is one reason why I'm so happy and proud of the way so many libraries have embraced makerspaces and have made them their own. Because by and large, libraries keep people at the centre of the space - not technology.

Librarians - by and large - have opted for accessible materials and activities in their spaces and have hosted activities that emphasize creativity, personal expression and learning through play.

This is The Bubbler, which is a visual arts based makerspace from the Madison Public Library. I have never been but from what I can see, they are doing many wonderful things. They host events that involve bike hacking, audio engineering, board game making, and media creation projects. I was particularly impressed by how they are working with juvenile justice programs to bring these activities and workshops to justice involved youth.

As long as libraries can continue to focus on building a better future for all of us, then we can continue to be a space where that future can be built.

This concludes our tour through time and space. Thank you kindly for your attention.

May your libraries and your makerspaces be future compatible.

Nicole Engard: Bookmarks for February 2, 2015

planet code4lib - Mon, 2015-02-02 20:30

Today I found the following resources and bookmarked them:

  • Coggle Coggle is about redefining the way documents work: the way we share and store knowledge. It’s a space for thoughts that works the way that people do — not in the rigid ways of computers.


The post Bookmarks for February 2, 2015 appeared first on What I Learned Today....


District Dispatch: President Obama’s budget increases library funding

planet code4lib - Mon, 2015-02-02 20:09

President Barack Obama today transmitted to Congress the Obama Administration’s nearly $4 trillion budget request to fund the federal government for fiscal year 2016, which starts October 1, 2015. The President’s budget reflected many of the ideas and proposals outlined in his January 20th State of the Union speech.

Highlights for the library community include $186.5 million in assistance to libraries through the Library Services and Technology Act (LSTA). This important program provides funding to states through the Institute of Museum and Library Services (IMLS).

“We applaud the President for recognizing the tremendous contributions libraries make to our communities,” said American Library Association (ALA) President Courtney Young in a statement. “The American Library Association appreciates the importance of federal support for library services around the country, and we look forward to working with the Congress as they draft a budget for the nation.

“The biggest news for the library community is the announcement of $8.8 million funding for a national digital platform for library and museum services, which will give more Americans free and electronic access to the resources of libraries, archives, and museums by promoting the use of technology to expand access to the holdings of museums, libraries, and archives. This new program will be funded through the IMLS National Leadership Grant programs for Libraries ($5.3 million) and Museums ($3.5 million).

Statutory Authority         | FY 2010 | FY 2011 | FY 2012 | FY 2013 | FY 2014 Request | FY 2014 Enacted | FY 2015 Request
Grants to States            | 172,561 | 160,032 | 156,365 | 150,000 | 150,000         | 154,848         | 152,501
Native Am/Haw. Libraries    | 4,000   | 3,960   | 3,869   | 3,667   | 3,869           | 3,861           | 3,869
Nat. Leadership / Libraries | 12,437  | 12,225  | 11,946  | 11,377  | 13,200          | 12,200          | 12,232
Laura Bush 21st Century     | 24,525  | 12,818  | 12,524  | 10,000  | 10,000          | 10,000          | 10,000
Subtotal, LSTA              | 213,523 | 189,035 | 184,704 | 175,044 | 177,069         | 180,909         | 178,602

(View the full chart on the budget cuts from IMLS.)

“With the appropriations process beginning, we look forward to working for continued support of key programs, including early childhood learning, digital literacy, and the Library Services and Technology Act.”

The post President Obama’s budget increases library funding appeared first on District Dispatch.

LITA: LITA Board Meeting Two – ALA Midwinter 2015

planet code4lib - Mon, 2015-02-02 19:33

If you would like to listen in to the LITA Board meeting at ALA Midwinter 2015, it is streaming (in audio) below:

Islandora: Islandora/Fedora 4 Project Update

planet code4lib - Mon, 2015-02-02 19:21

The Islandora 7.x/Fedora 4.x integration that we announced in December has officially begun. Work began on January 19th and our first team meeting was Friday, January 30th. We will be meeting on the 4th Friday of every month at 1:00 PM Eastern time. Here's what's going on so far:

Project Updates

The new, Fedora 4 friendly version of Islandora is being built under the working designation of Islandora 7.x-2.x (as opposed to the 7.x-1.x series that encompasses current Fedora 3.x updates to Islandora, which are not going away any time soon). A new GitHub organization is in place for development and testing, and the Islandora Fedora 4 Interest Group has been reconvened under new Terms of Reference to act as a project group for the Fedora 4 integration. If you want to participate, please sign up as part of this group. If you don't have time to participate in regular meetings, we would still love to hear your use case. You can submit it for discussion in the issue queue of the interest group. Need help getting into the GitHub of it all? Contact us and we'll get you there.

There is also a new chef recipe in the works to quickly spin up development and testing environments with the latest for 7.x-2.x. Special thanks to MJ Suhonos and the team at Ryerson University for Islandora Chef!

The project is under the direction of Project Lead Nick Ruest (York University) and Tech Lead Danny Lamb (discoverygarden, Inc.), with participation from:

  • The University of Toronto Scarborough
  • The University of Oklahoma
  • The University of Manitoba
  • The University of Virginia
  • The University of Prince Edward Island
  • The University of Limerick
  • Simon Fraser University
  • Common Media
  • The Colorado Alliance

Special thanks go to Aaron Coburn, whose fcrepo Camel module is going to be an integral part of our own designs for Fedora 4 and Islandora.

If you would like to talk to Nick and Danny about the project, or even offer up some help while they code away on an unofficial 'sprint,' you can meet up with them at discoverygarden's table at Code4Lib 2015 in Portland, OR February 9 - 12.

Technical Planning

Danny Lamb has kicked off the design of the next stage of Islandora with a Technical Design Doc that you should definitely read and comment on if you have any plans to use Islandora with Fedora 4 in the future. We are still at the stage of hearing use cases and making plans, so now is the time to get your needs into the mix. The opening line sums up the basic approach: Islandora version 7.x-2.x is middleware built using Apache Camel to orchestrate distributed data processing and to provide web services required by institutions who would like to use Drupal as a frontend to a Fedora 4 JCR repository. 

Some preliminary Big Ideas:

  • No more Tuque. No more GSearch. No more xml forms. The Java middleware layer will handle many things that were previously done in PHP and Drupal.
  • It will treat Drupal like any other component of the stack. There will be indexing in Drupal for display using nodes, fields, and other parts of the Drupal ecosystem.
  • It will use persistent queues, so the middleware layer can exist on separate servers.
  • The Fedora-Drupal connection comes first. An admin interface will be developed later.

And some preliminary Wild Ideas (we'd love to hear your opinions):

  • Headless Drupal 7.x
  • Make the REST API endpoints the same for Drupal 7 and Drupal 8 so migration is easier.
  • Dropbox-style ingest.

Or rather, upgration (a portmanteau of upgrade and migration, and our new favourite word). Nick Ruest and York University are working through a Fedora 3.x -> 4.x upgration path. Because York's Islandora stack is as close to generic as you can reasonably get in production, this should provide a model for a generic upgration path that others can follow - as well as keeping the needs of the Islandora community on the radar for the Fedora 4 development team, so that all of the pieces evolve to work together.


We launched the project with a funding goal of $100,000 to get a functioning prototype and Fedora 3.x -> 4.x migration path. We are very pleased to announce that we have achieved more than half of that funding goal and are well set to see things through to the end. 

Many, many thanks to our supporters, all of whom are now members of the Islandora Foundation as Partners:

  • York University
  • McMaster University
  • University of Prince Edward Island
  • University of Manitoba
  • University of Limerick

If your institution would like to join up, whether as a $10,000 Partner or at some other level of support, please contact us.


State Library of Denmark: British Library and IIPCTech15

planet code4lib - Mon, 2015-02-02 13:42

For a change of pace: A not too technical tale of my recent visit to England.

The people behind IIPC Technical Training Workshop – London 2015 had invited yours truly as a speaker and participant in the technical training. IIPC stands for International Internet Preservation Consortium and I was to talk about using Solr for indexing and searching preserved Internet resources. That sounded interesting and Statsbiblioteket encourages interinstitutional collaboration, so the invitation was gladly accepted. Some time passed and British Library asked if I might consider arriving a few days early to visit their IT development department. Well played, BL, well played.

I kid. For those not in the know, British Library made the core software we use for our Net Archive indexing project and we are very thankful for that. Unfortunately they do have some performance problems. Spending a few days, primarily talking about how to get their setup to work better, was just reciprocal altruism working. Besides, it turned out to be a learning experience for both sides.

It is the little things, like the large buses

At British Library, Boston Spa

The current net archive oriented Solr setup at British Library uses SolrCloud with live indexes on machines with spinning drives (aka harddisks) and a – relative to index size – low amount of RAM. At Statsbiblioteket, our experience tells us that such setups generally have very poor performance. Gil Hoggarth and I discussed Solr performance at length and he was tenacious in exploring every option available. Andy Jackson partook in most of the debates. Log file inspections and previous measurements from the Statsbiblioteket setups seemed to sway them in favour of different base hardware, or to be specific: Solid State Drives. The open question is how much such a switch would help or if it would be a better investment to increase the amount of free memory for caching.

  • A comparative analysis of performance with spinning drives vs. SSDs for multi-TB Solr indexes on machines with low memory would help other institutions tremendously, when planning and designing indexing solutions for net archives.
  • A comparative analysis of performance with different amounts of free memory for caching, as a fraction of index size, for both spinning drives and SSDs, would be beneficial on a broader level; this would give an idea of how to optimize bang-for-the-buck.

Illuminate the road ahead

Logistically the indexes at British Library are quite different from the index at Statsbiblioteket: They follow the standard Solr recommendation and treat all shards as a single index, both for indexing and search. At Statsbiblioteket, shards are built separately and only treated as a whole index at search time. The live indexes at British Library have some downsides, namely re-indexing challenges, distributed indexing logistics overhead and higher hardware requirements. They also have positive features, primarily homogeneous shards and the ability to update individual documents. The updating of individual documents is very useful for tracking meta-data for resources that are harvested at different times, but have unchanged content. Tracking of such content, also called duplicate handling, is a problem we have not yet considered in depth at Statsbiblioteket. One of the challenges of switching to static indexes is thus:

  • When a resource is harvested multiple times without the content changing, it should be indexed in such a way that all retrieval dates can be extracted and such that the latest (and/or the earliest?) harvest date can be used for sorting, grouping and/or faceting.

One discussed solution is to add a document for each harvest date and use Solr’s grouping and faceting features to deliver the required results. The details are a bit fluffy as the requirements are not strictly defined.
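As a rough illustration of the one-document-per-harvest idea, here is a minimal sketch in plain Python. It makes no Solr calls, and the field names (`content_hash`, `harvest_date`) and the example resource are assumptions made up for the illustration, not the schema used at either library. It only shows how the earliest and latest harvest dates, and the full list, fall out once documents are grouped on a content key.

```python
from collections import defaultdict


def expand_harvests(resource):
    """Return one index document per harvest date for a single resource."""
    return [
        {
            # Hypothetical ID scheme: content hash plus harvest date.
            "id": f"{resource['content_hash']}@{date}",
            "url": resource["url"],
            "content_hash": resource["content_hash"],
            "harvest_date": date,
        }
        for date in resource["harvest_dates"]
    ]


def group_by_content(docs):
    """Group documents on content hash and summarise their harvest dates."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["content_hash"]].append(doc["harvest_date"])
    # ISO 8601 dates sort correctly as plain strings.
    return {
        key: {"earliest": min(dates), "latest": max(dates), "all": sorted(dates)}
        for key, dates in groups.items()
    }


# Hypothetical resource harvested three times with unchanged content.
resource = {
    "url": "http://example.org/",
    "content_hash": "sha1:abc",
    "harvest_dates": ["2014-06-01", "2013-01-15", "2015-01-20"],
}
docs = expand_harvests(resource)
summary = group_by_content(docs)
```

In Solr the grouping step would presumably map to result grouping on such a content field. The cost of the approach is visible in the sketch: the document count grows with the number of harvests rather than with the number of distinct resources.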

At the IIPC Technical Training Workshop, London 2015

The three pillars of the workshop were harvesting, presentation and discovery, with the prevalent tools being Heritrix, Wayback and Solr. I am a newbie in two thirds of this world, so my outsider thoughts will focus on discovery. Day one was filled with presentations, with my Scaling Net Archive Indexing and Search as the last one. Days two and three were hands-on with a lot of discussions.

As opposed to the web archive specific tools Heritrix and Wayback, Solr is a general purpose search engine: There is not yet a firmly established way of using Solr to index and search net archive material, although the work from UKWA is a very promising candidate. Judging by the questions asked at the workshop, large scale full-text search is relatively new in the net archive world and as such the community lacks collective experience.

Two large problems of indexing net archive material are analysis and scaling. As stated, UKWA has the analysis part well in hand. Scaling is another matter: Net archives typically contain billions of documents, many of them with a non-trivial amount of indexable data (webpages, PDFs, DOCs etc). Search responses ideally involve grouping or faceting, which requires markedly more resources than simple search. Fortunately, at least from a resource viewpoint, most countries do not allow harvested material to be made available to the general public: The number of users and thus concurrent requests tend to be very low.

General recommendations for performant Solr systems tend to be geared towards small indexes or high throughput, minimizing the latency and maximizing the number of requests that can be processed by each instance. Down to Earth, the bottleneck tends to be random reads from the underlying storage, easily remedied by adding copious amounts of RAM for caching. While the advice arguably scales to net archive indexes in the multiple TB-range, the cost of terabytes of RAM, as well as the number of machines needed to hold them, is often prohibitive. Bearing in mind that the typical user groups on net archives consist of very few people, the part about maximizing the number of supported requests is overkill. With net archives as outliers in the Solr world, there is very little existing shared experience to provide general recommendations.

  • As hardware cost is a large fraction of the overall cost of doing net archive search, in-depth descriptions of setups are very valuable to the community.

All different, yet the same

Measurements from British Library as well as Statsbiblioteket show that faceting on high cardinality fields is a resource hog when using SolrCloud. This is problematic for exploratory use of the index. While it can be mitigated with more hardware or software optimization, switching to heuristic counting holds promise of very large speed-ups.

  • The performance benefits and the cost in precision of approximate search results should be investigated further. This area is not well-explored in Solr and mostly relies on custom implementations.
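To make the trade-off between speed and precision concrete, here is a minimal sketch of one possible heuristic: estimating facet counts from a random sample and scaling them up. Sampling is an assumption chosen for illustration; the post does not say which heuristic was discussed, and this is plain Python rather than anything built into Solr. The function names and the example field values are hypothetical.

```python
import random
from collections import Counter


def exact_facets(values):
    """Count every value: precise, but touches the whole result set."""
    return Counter(values)


def sampled_facets(values, sample_size, seed=0):
    """Estimate facet counts from a random sample, scaled to the full size."""
    if sample_size >= len(values):
        return Counter(values)
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    scale = len(values) / sample_size
    # Rare terms may be missed entirely; common terms are estimated well.
    return Counter(
        {term: round(count * scale) for term, count in Counter(sample).items()}
    )


# Hypothetical, heavily skewed facet field (for example a domain field).
values = ["example.org"] * 9000 + ["example.com"] * 1000
exact = exact_facets(values)
estimate = sampled_facets(values, sample_size=1000)
```

The sample touches a tenth of the values here, so the work drops accordingly, while the ordering of the dominant terms is very likely to be preserved. The loss shows up in the long tail, which is exactly the precision cost the bullet above asks to have measured.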

On the flipside of fast exploratory access is the extraction of large result sets for further analysis. SolrCloud does not scale for certain operations, such as deep paging within facets and counting of unique groups. Certain operations, such as percentiles in the AnalyticsComponent, are not currently possible. As the alternative to using the index tends to be very heavy Hadoop processing of the raw corpus, this is an area worth investing in.

  • The limits of result set extractions should be expanded and alternative strategies, such as heuristic approximation and per-shard processing with external aggregation, should be attempted.
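Per-shard processing with external aggregation can be sketched as follows. This is a minimal, assumption-laden illustration in plain Python: each "shard" is just a list of dictionaries, and the `domain` field and function names are hypothetical. It shows the shape of the approach, where each shard produces its own counts independently and a separate step merges them.

```python
from collections import Counter


def shard_facets(shard_docs, field):
    """Count values of one field within a single shard."""
    return Counter(doc[field] for doc in shard_docs if field in doc)


def merge_facets(per_shard_counts, top_n):
    """Aggregate per-shard counts outside the index and return the top terms."""
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)
    return total.most_common(top_n)


# Two hypothetical shards, processed independently of each other.
shards = [
    [{"domain": "a.uk"}, {"domain": "b.uk"}],
    [{"domain": "a.uk"}],
]
per_shard = [shard_facets(shard, "domain") for shard in shards]
merged = merge_facets(per_shard, top_n=1)
```

Merging complete per-shard counts, as done here, gives exact totals. Truncating each shard's list to its own top terms before merging would reduce the data moved between machines but make the result approximate, which ties this strategy back to the heuristic approach.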

On a personal note

Visiting British Library and attending the IIPC workshop was a blast. Being embedded in tech talk with intelligent people for 5 days was exhausting and very fulfilling. Thank you all for the hospitality and for pushing back when my claims sounded outrageous.

