
Thom Hickey: Testing date parsing by fuzzing

planet code4lib - Tue, 2015-02-24 15:45

 Fuzz testing, or fuzzing, is a way of stress testing services by sending them potentially unexpected input data. I remember being very impressed by one of the early descriptions of testing software this way (Miller, Barton P., Louis Fredriksen, and Bryan So. 1990. "An empirical study of the reliability of UNIX utilities". Communications of the ACM. 33 (12): 32-44), but had never tried the technique.

Recently, however, Jenny Toves spent some time extending VIAF's date parsing software to handle dates associated with people in WorldCat.  As you might imagine, passing through a hundred million new date strings found some holes in the software.  While we can't guarantee that the parsing always gives the right answer, we would like to be as sure as we can that it won't blow up and cause an exception.

So, I looked into fuzzing.  Rather than sending purely random strings to the software, the techniques now in common use generate inputs from a specification or by mutating existing test cases.  Although we do have something close to a specification, based on the regular expressions the code uses, I decided to try making changes to the date strings we have that are derived from VIAF dates.

Most frameworks for fuzzing are quite loosely coupled: they typically pass the fuzzed strings to a separate process under test.  Rather than do that, I read in each of the strings, applied some simple transformations, and called the date parsing routine to see if it would cause an exception.  Here's what I did for each test string, typically repeated as many times as the string was long, calling the parser at each step:

  • Shuffle the string ('abc' might get replaced by 'acb')
  • Change the integer value of each character up or down (e.g. 'b' would get replaced by 'a' and then by 'c')
  • Change each character to a random Unicode character
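Sketched in Python, the loop over each test string looks roughly like this (a sketch, not the actual VIAF code: the parse function and test-string set are placeholders for the real date parser and data):

```python
import random

def fuzz_variants(s):
    """Generate fuzzed variants of a test string, one pass per
    character position, mirroring the three transformations above."""
    variants = []
    chars = list(s)
    for i in range(len(s)):
        # 1. Shuffle the string ('abc' might become 'acb')
        shuffled = chars[:]
        random.shuffle(shuffled)
        variants.append("".join(shuffled))
        # 2. Nudge one character's code point down and up
        for delta in (-1, 1):
            nudged = chars[:]
            code = ord(nudged[i]) + delta
            if 0 <= code <= 0x10FFFF:
                nudged[i] = chr(code)
            variants.append("".join(nudged))
        # 3. Replace one character with a random Unicode character
        randomized = chars[:]
        randomized[i] = chr(random.randint(0, 0xFFFF))
        variants.append("".join(randomized))
    return variants

def fuzz_test(parse, strings):
    """Call the parser on every variant, recording any exceptions."""
    failures = []
    for s in strings:
        for v in fuzz_variants(s):
            try:
                parse(v)
            except Exception as exc:
                failures.append((v, exc))
    return failures
```

Run against a parser that must never raise, an empty failures list is the pass condition; each input of length n yields 4n variants, which matches the roughly 5-to-1 ratio of fuzzed strings (1.9M) to test strings (384K) reported below.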

For our 384K test strings this resulted in 1.9M fuzzed strings. This took about an hour to run on my desktop machine.

While the testing didn't find all the bugs we knew about in the code, it did manage to tickle a couple of holes in it, so I think the rather minimal time taken (less than a day) was worth it, given the confidence it gives us that the code won't blow up on strange input.

The date parsing code in GitHub will be updated soon.  Jenny is adding support for Thai dates (different calendar) and generally improving things.

Possibly the reason I thought of trying fuzzing was lcamtuf's remarkable blog post Pulling JPEGs out of thin air.  By instrumenting JPEG software so that his fuzzer could follow code paths at the assembly level, he was able to create byte strings representing valid JPEG images purely from fuzzed input, a truly remarkable achievement. My feeling on reading it was very similar to my reaction to the original UNIX testing article cited earlier.



Harvard Library Innovation Lab: Link roundup February 24, 2015

planet code4lib - Tue, 2015-02-24 14:21

This is the good stuff.

Sit Down. Shut Up. Write. Don’t Stop.

Hard work and working hard consistently. That’s the thing. Not romantic sparks of inspiration.

What makes us human? Videos from the BBC.

Fun, beautifully produced, short videos on what makes us human.

The Future of the Web Is 100 Years Old

Our current version of the Web (HTTP/HTML) is just one in a series of webs, though far and away the most successful one.

“Sea Rambler” Customized Bike by Geoff McFetridge

“I can learn small things that get me to points”

Boston Button Factory – Making Buttons Since 1872

17 pound, beautiful buttons. Want.

Open Knowledge Foundation: Open Data Camp UK: Bursting out of the Open Data Bubble

planet code4lib - Tue, 2015-02-24 13:23

“But nobody cares about Open Data”

This thought was voiced in many guises during last weekend’s Open Data Camp. Obviously not entirely true, as demonstrated by the 100+ people who had travelled to deepest Hampshire for the first UK camp of its kind, or the many more people involving themselves in Open Data Day activities around the world. However the sentiment that, while many of us are getting extremely excited about the potential of Open Data in areas including government, crime and health, the rest of the planet is ‘just not interested’, was very clear.

As a non-technical person I’m keen to see ways that this gap can be bridged.

Open Data Camp was a 2-day unconference that aimed to let the technical and making sit alongside the story-telling and networking. There was also lots of cake!

Open Data Camp t-shirts

Open Data Board Game

After a pitch from session leaders we were left with that tricky choice about what to go for. I attended a great session led by Ellen Broad from the Open Data Institute on creating an Open Data board game. Creating a board game is no easy task but has huge potential as a way to reach out to people. Those behind the Open Data Board Game Project are keen to create something informative and collaborative which still retains elements of individual competition.

In the session we spent some time thinking about what data could underpin the game: Should it use data sets that affect most members of the general public (transport, health, crime, education – almost a replication of the national information infrastructure)? Or could there be data set bundles (think environmental related datasets that help you create your own climate co-op or food app)? Or what about sets for different levels of the game (a newbie version, a government data version)?

What became clear quite early on was that there were two ways to go with the board game idea: one was creating something that could share the merits of Open Data with new communities; the other was something (complex) that those already interested in Open Data could play. Setting out to create a game that is ‘all things to all people’ is unfortunately likely to fail.

Discussion moved away from the practicalities of board game design to engaging with ‘other people’. The observation was made that while the general public don’t care about Open Data per se, they do care about the results it brings. One concrete example given was Uber, which connects riders to drivers through an app and is now in mainstream use.

One project taking an innovative approach is Numbers that Matter. They are looking to bypass the dominant demographic (white, male, middle class, young) of technology users, focus on communities, and explore with them how Open Data will affect their well-being. They’ve set out to make Open Data personal and relevant (serving the individual rather than the civic-level participant). Researchers in the project began by visiting members of the general public in their own environment (so taxi drivers, hairdressers,…) and spoke to them about what problems or issues they were facing and what solutions could be delivered. The team also spent time working with neighbourhood watch schemes – these are not only organised but have a ‘way in’ with the community. Another project highlighted as making Open Data and apps meaningful for people is Citadel on the Move, which aims to make it easier for citizens and application developers from across Europe to use Open Data to create the type of innovative mobile applications they want and need.

The discussion about engagement exposed some issues around trust and exploitation; ultimately people want to know where the benefits are for them. These benefits need to be much clearer and better articulated. Tools like Open Food Facts, a database of food products from the entire world, do this well: “we can help you identify products that contain the ingredient you are allergic to“.

Saturday’s unconference board

“Data is interesting in opposition to power”

Keeping with the theme of community engagement, I attended a session led by RnR Organisation, who support grassroots and minority cultural groups to change, develop and enhance their skills in governance, strategic development, operational and project management, and funding. They used the recent Release of Data fund, which targets the release of specific datasets prioritised by the Open Data User Group, to support the development of a Birmingham Data and Skills Hub. However, their training sessions (on areas including data visualization and the use of Tableau and Google Fusion Tables) have not attracted much interest, and on reflection they now realise that they pitched too high.

Open Data understanding and recognition is clearly part of a broader portfolio of data literacy needs that begins with tools like Excel and Wikipedia. RnR work has identified 3 key needs of 3rd sector orgs: data and analysis skills; data to learn and improve activities; and measurement of impacts.

Among the group some observations were made on the use of data by community groups, including the need for timely data (“you need to show people today“) and relevant information driven by community needs (“nobody cares about Open Data but they do care about stopping bad things from happening in their area“). An example cited was a project to stop the go-ahead of a bypass in Hereford; the campaigners specifically needed GIS data. One person remarked that “data is interesting in opposition to power“, and we have a role to support here. Other questions raised related to the different needs of communities of geography and communities of interest. Issues like the longevity of data also come into play: Armchair Auditor is a way to quickly find out where the Isle of Wight council has been spending money; unfortunately, a change in formats by the council has resulted in the site being compromised.

Sessions were illustrated by Drawnalism

What is data literacy?

Nicely following on from these discussions, a session later in the day looked at data literacy. The idea was inspired by an Open Data 4 Development research project led by Mark Frank and Johanna Walker (University of Southampton), in which they discovered that even technically literate individuals still found Open Data challenging to understand. The session ended up resulting in a series of questions: So what exactly is ‘data literacy’? Is it a homogeneous set of expertise (e.g. finding data), or is the context everything? Are there many approaches (such as those suggested in the Open Data cook book), or is there a definitive guide (such as the Open Data Handbook), or a step-by-step way to learn (such as through School of Data)? Is the main issue asking the right questions? Is there a difference between data literacy and data fluency? Are there two types of specialism: domain specialism and computer expertise? And can you offset a lack of data expertise with better-designed data?

The few answers that did emerge came through analogies. Maybe data literacy is like traditional literacy – it is essential to all, and it is everyone’s job to make it happen (a collaboration between parents and teachers). Or maybe it is more like plumbing – having some understanding can help you assess situations, but you often end up bringing in an expert. Then again it could be more like politics or PSHE – it enables you to interact with the world and understand the bigger picture. The main conclusion from the session was that it is the responsibility of everyone in the room to be an advocate and explainer of Open Data!

“Backbone of information for the UK”

The final session I attended was an informative introduction to the National Information Infrastructure (NII), an iterative framework that lists strategically important data and documents the services that provide access to the data and connect it to other data. It is intended as the “backbone of information” for the UK, rather as the rail and road networks cater for transport. The NII team began work by carrying out a data inventory, followed by analysis of the quality of the data available. Much decision making has used the concept of “data that is of strategic value to the country” – a type of ‘core reference data’. Future work will involve thinking about what plan the country needs to put into play to support this core data. Does being part of the NII protect data? Does the requirement for a particular data set compel release? More recently there has been engagement with the Open Data User Group, transparency board, ODI, Open Knowledge and beyond to understand what people are using and why; this may prioritise release.

It seems that at this moment the NII is too insular; it may need to break free from considering only publicly owned data and begin to consider privately held data not owned by the government (e.g. Ordnance Survey data). Also, how can best practices be shared? The Local Government Association is creating some templates for use here, but there is scope for more activity.

With event organiser Mark Braggins

Unfortunately I could only attend one day of Open Data Camp, and there was way too much for one person to take in anyway! For more highlights read the Open Data Camp blog posts or see summaries of the event on Conferize and Eventifier. The good news is that, with the right funding and goodwill, the Open Data Camp will become an annual roving event.

Where did people come from?

Terry Reese: MarcEdit 6 Update

planet code4lib - Tue, 2015-02-24 05:38

A new version of MarcEdit has been made available.  The update includes the following changes:

  • Bug Fix: Export Tab Delimited Records: When working with control data, if a position is requested that doesn’t exist, the process crashes.  This behavior has been changed so that a missing position results in a blank delimited field (as is the case if a field or field/subfield isn’t present).
  • Bug Fix: Task List — Corrected a couple of reported issues related to the display and editing of tasks.
  • Enhancement: RDA Helper — Abbreviations have been updated so that users can select the fields in which abbreviation expansion occurs.
  • Enhancement: Linked Data Tool — I’ve vastly improved the process by which items are linked. 
  • Enhancement: Improved VIAF Linking — thanks to Ralph LeVan for pointing me in the right direction to get more precise matching.
  • Enhancement: Linked Data Tool — I’ve added the ability to select the index from VIAF to link to.  By default, LC (NACO) is selected.
  • Enhancement: Task Lists — Added the Linked Data Tool to the Task Lists
  • Enhancement: MarcEditor — Added the Linked Data Tool as a new function.
  • Improvements: Validate ISBNs — Added some performance enhancements and finished working on some code that should make it easier to begin checking remote services to see if an ISBN is not just valid (structurally) but actually assigned.
  • Enhancement: Linked Data Component — I’ve separated out the linked data logic into a new MarcEdit component.  This is being done so that I can work on exposing the API for anyone interested in using it.
  • Informational: Current version of MarcEdit has been tested against MONO 3.12.0 for Linux and Mac.
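The control-field behavior in the first fix (a missing position yields a blank field rather than a crash) amounts to bounds-safe substring extraction. A minimal sketch in Python, with the field data and positions purely illustrative, not MarcEdit's actual code:

```python
def control_positions(field_value, start, end):
    """Return the requested fixed positions from a control field.

    Python slicing is bounds-safe, so a position that doesn't exist
    falls through to '' (a blank delimited field) instead of raising.
    """
    return field_value[start:end]

# An 008-style fixed-length control field (data purely illustrative)
field_008 = "750101s1977    nyu" + " " * 17 + "eng  "
print(control_positions(field_008, 35, 38))        # prints 'eng'
print(repr(control_positions(field_008, 90, 93)))  # prints '' (missing position)
```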

Linked Data Tool Improvements:

A couple of specific notes of interest around the Linked Data Tool.  First, over the past few weeks, I’ve been collecting instances where and VIAF have been providing back results that were not optimal.  On the VIAF side, some of that was related to the indexes being queried, and some of it relates to how queries are made and executed.  I’ve done a fair bit of work adding some additional data checks to ensure that links occur correctly.  At the same time, there is one known issue that I wasn’t able to correct while working with, and that is around deprecated headings: currently provides no information within any metadata provided through the service that relates a deprecated item to the current preferred heading.  This is something I’m waiting for LC to correct.

To improve the Linked Data Tool, I’ve added the ability to query by specific index.  The tool defaults to LC (NACO), but users can select from a wide range of vocabularies (including querying all the vocabularies at once).  The new screen for the Linked Data Tool looks like the following:

In addition to the changes to the Linked Data Tool – I’ve also integrated the Linked Data Tool with the MarcEditor:

And within the Task Manager:

The idea behind these improvements is to allow users the ability to integrate data linking into normal cataloging workflows – or at least start testing how these changes might impact local workflows.


You can download the current version by utilizing MarcEdit’s automatic update within the Help menu, or by going to the MarcEdit website and downloading it there.


SearchHub: Parsing and Indexing Multi-Value Fields in Lucidworks Fusion

planet code4lib - Mon, 2015-02-23 21:14
Recently, I was helping a client move from a pure Apache Solr implementation to Lucidworks Fusion.  Part of this effort entailed the recreation of indexing processes (implemented using Solr’s REST APIs) in the Fusion environment, taking advantage of Indexing Pipelines to decouple the required ETL from Solr and provide reusable components for future processing.

One particular feature that was heavily used in the previous environment was the definition of a “field separator” in REST API calls to the Solr UpdateCSV request handler. For example:

curl "http://localhost:8888/solr/collection1/update/csv/?commit=true&f.aliases.split=true&f.aliases.separator=%0D" --data-binary @input.csv -H 'Content-type:text/plain; charset=utf-8'

The curl command above posts a CSV file to the /update/csv request handler, with the request parameters "f.aliases.split=true" and "f.aliases.separator=%0D" identifying the field in column “aliases” as a multi-value field, with the character “\r” separating the values (%0D escapes the carriage return with its hexadecimal ASCII code).  This provided a convenient way to parse and index multi-value fields that had been stored as delimited strings.  Further information about this parameter can be found here.

After investigating possible approaches in Fusion, it was determined that the most straightforward way to accomplish this (and provide a measure of flexibility and reusability) was to create an index pipeline with a JavaScript stage.

The Index Pipeline

Index pipelines are a framework for plugging together a series of atomic steps, called “stages,” that can dynamically manipulate documents flowing in during indexing.  Pipelines can be created through the admin console by clicking “Pipelines” in the left menu, then entering a unique and arbitrary ID and clicking “Add Pipeline” – see below.

After creating your pipeline, you’ll need to add stages.
In our case, we have a fairly simple pipeline with only two stages: a JavaScript stage and a Solr Indexer stage.  Each stage has its own context and properties and is executed in the configured order by Fusion.  Since this post is about document manipulation, I won’t go into detail regarding the Solr Indexer stage; you can find more information about it here.  Below is our pipeline with its two new stages, configured so that the JavaScript stage executes before the Solr Indexer stage.

The JavaScript Index Stage

We chose a JavaScript stage for our approach, which gave us the ability to directly manipulate every document indexed via standard JavaScript – an extremely powerful and convenient approach.  The JavaScript stage has four properties:
  • “Skip This Stage” – a flag indicating whether this stage should be executed
  • “Label” – an optional property that allows you to assign a friendly name to the stage
  • “Conditional Script” – JavaScript that executes before any other code and must return true or false.  Provides a mechanism for filtering documents processed by this stage; if false is returned, the stage is skipped
  • “Script Body” – required; the JavaScript that executes for each document indexed (where the script in “Conditional Script,” if present, returned true)
Below, our JavaScript stage with “Skip This Stage” set to false and labeled “JavaScript_Split_Fields.”

Our goal is to split a field (called “aliases”) in each document on a carriage return (CTRL-M).  To do this, we’ll define a “Script Body” containing JavaScript that checks each document for the presence of that field and, if it is present, splits it and assigns the resulting values to the field.

The function defined in “Script Body” can take one argument (doc, the pipeline document) or two (doc and _context, the pipeline context maintained by Fusion).  Since this stage is the first to be executed in the pipeline, and there are no custom variables to be passed to the next stage, the function only requires the doc argument.  The function is then obligated to return doc (if the document is to be indexed) or null (if it should not be indexed); in actuality we’ll never return null, as the purpose of this stage is to manipulate documents, not to determine whether they should be indexed.

function (doc) {
  return doc;
}

Now that we have a reference to the document, we can check for the field “aliases” and split it accordingly.  Once it’s split, we need to remove the previous value and add the new values to “aliases” (which is defined as a multi-value field in our Solr schema).  Here’s the final code:

function (doc) {
  var f_aliases = doc.getFirstField("aliases");
  var v_aliases = (f_aliases != null) ? f_aliases.value : null;
  if (v_aliases != null) {
    doc.removeFields("aliases");
    var aliases = v_aliases.split("\r");
    for (var i = 0; i < aliases.length; i++) {
      doc.addField('aliases', aliases[i]);
    }
  }
  return doc;
}

Click “Save Changes” at the bottom of this stage’s properties.

Bringing It All Together

Now we have an index pipeline – let’s see how it works!  Fusion does not tie an index pipeline to a specific collection or datasource, allowing for easy reusability.
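Before wiring the pipeline to a datasource, the stage’s split-and-add logic can be sanity-checked outside Fusion. Here is a minimal Python re-implementation, with the pipeline document reduced to a plain dict (the “aliases” field name comes from the article; everything else is illustrative, not the Fusion API):

```python
def split_multivalue(doc, field, separator="\r"):
    """Mimic the JavaScript stage: if the field is present, replace
    its delimited string value with the list of split values."""
    value = doc.get(field)
    if value is not None:
        doc[field] = value.split(separator)
    return doc

doc = {"id": "1", "aliases": "Main St\rBroadway\rElm Ave"}
split_multivalue(doc, "aliases")
print(doc["aliases"])  # prints ['Main St', 'Broadway', 'Elm Ave']
```

A document without the field passes through unchanged, matching the null check in the JavaScript version.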
We’ll need to associate this pipeline with the datasource from which we’re retrieving “aliases,” which is accomplished by setting the pipeline ID in that datasource to point to our newly-created pipeline.  Save your changes, and the next time you start indexing that datasource, your index pipeline will be executed.

You can debug your JavaScript stage by taking advantage of the “Pipeline Result Preview” pane, which allows you to test your code against static data right in the browser.  Additionally, you can add log statements to the JavaScript by calling a method on the logger object, to which your stage already has a handle.  For example:

logger.debug("This is a debug message");

will write a debug-level log message to <fusion home>/logs/connector/connector.log.  By combining these two approaches, you should be able to quickly determine the root cause of any issues encountered.

A Final Note

You have probably already recognized the potential afforded by the JavaScript stage; Lucidworks calls it a “Swiss Army knife.”  In addition to allowing you to execute any ECMA-compliant JavaScript, you can import Java libraries – allowing you to utilize custom Java code within the stage and opening up a myriad of possible solutions.  The JavaScript stage is a powerful tool for any pipeline!

About the Author

Sean Mare is a technologist with over 18 years of experience in enterprise application design and development.  As Solution Architect with Knowledgent Group Inc., a leading Big Data and Analytics consulting organization and partner with Lucidworks, he leverages the power of enterprise search to enable people and organizations to explore their data in exciting and interesting ways.  He resides in the greater New York City area.

The post Parsing and Indexing Multi-Value Fields in Lucidworks Fusion appeared first on Lucidworks.

District Dispatch: Registration opens for the 16th annual Natl. Freedom of Information Day Conference

planet code4lib - Mon, 2015-02-23 20:38

Registration is now open for the 16th annual National Freedom of Information (FOI) Day Conference, which will be held on Friday, March 13, 2015, at the Newseum in Washington, D.C. The annual FOI Day Conference is hosted by the Newseum Institute’s First Amendment Center in cooperation with the American Library Association (ALA). The event brings together access advocates, government officials, judges, lawyers, librarians, journalists, educators and others to discuss timely issues related to transparency in government and freedom of information laws and practices.

Madison Award Awardee Patrice McDermott

This year’s program will feature a discussion of the first ten years of the “Sunshine Week” national open records initiative, presented by the Reporters Committee for Freedom of the Press and the American Society of News Editors. Additionally, the event will include a preview of a major reporting package from The Associated Press, McClatchy and Gannett/USA Today on a decade of open government activity. Miriam Nisbet, former director of the National Archives’ Office of Government Information Services, will address attendees at the FOI Day Conference.

During the event, ALA will announce this year’s recipient of the James Madison Award, which is presented annually to individuals or groups that have championed, protected and promoted public access to government information and the public’s right to know. Previous Madison Award recipients include Internet activist Aaron Swartz, Representative Zoe Lofgren (D-CA) and the Government Printing Office. ALA Incoming President Sari Feldman will present the Madison Award this year.

The program is open to the public, but seating is limited. To reserve a seat, please contact Ashlie Hampton, at 202-292-6288, or ahampton[at]newseum[dot]org. The program will be streamed “live” at

The post Registration opens for the 16th annual Natl. Freedom of Information Day Conference appeared first on District Dispatch.

Nicole Engard: Bookmarks for February 23, 2015

planet code4lib - Mon, 2015-02-23 20:30

Today I found the following resources and bookmarked them online.

  • codebender: Online development & collaboration platform for Arduino users, makers and engineers

Digest powered by RSS Digest

The post Bookmarks for February 23, 2015 appeared first on What I Learned Today....

Related posts:

  1. Home Automation with Arduino/RaspberryPi
  2. NFAIS: Embracing New Measures of Value
  3. Another way to use Zoho

Islandora: Looking Back at iCampBC

planet code4lib - Mon, 2015-02-23 19:48

Last week we took Islandora Camp to Vancouver, BC for the first time, and it was pretty awesome. 40 attendees and instructors came together for three days to talk about, use, and generally show off what's new with Islandora. We were joined mostly by folks from around BC itself, with attendees from Simon Fraser University, the University of Northern British Columbia, Vancouver Public Library, Prince Rupert Library (one of the very first Islandora sites in the world!), Emily Carr University of Art and Design, and the University of British Columbia.

iCampBC featured the largest-ever Admin Track, with 29 of us coming together to build our own demo Islandora sites on a camp Virtual Machine. With the help of my co-instructor, Erin Tripp from discoverygarden, we made collections, played with forms, and built some very nice Solr displays for cat pictures (and the occasional dog). While we were sorting out the front-end of Islandora, the Developer Track, with instructors Mark Jordan and Mitch MacKenzie, was digging into the code and ended up developing a new demo module, Islandora Weather, which you can use to display the current weather for locations described in the metadata of an Islandora object (should you ever need to do that...)

For sessions, we had a great panel on Digital Humanities, within and outside of Islandora, featuring Mark Leggott (UPEI), Karyn Huenemann (SFU), Mimi Lam (UBC), and Rebecca Dowson (SFU). SFU's Alex Garnett and Carla Graebner took us on a tour of the tools that SFU has built to manage research data in Islandora. Justin Simpson from Artefactual showed us how Islandora and Archivematica can play together in Archidora. Slides for these and most other presentations are available via the conference schedule.

Camp Awards were handed out on the last day (and tuques were earned via trivia. Can you name three Interest Groups without looking it up?). A few highlights:

  • So Many Lizards Award: Ashok Modi for all of his rescues in the Dev track
  • Camp Kaleidoscope Award: Kay Cahill, for rocking at least three laptops at one point
  • Camp Mojo Award: Karyn Huenemann, for infectious enthusiasm

Thank you to all of our attendees for making this camp such a success. We hope some of you will join us for our next big event, the first Islandora Conference, this summer in PEI.

We had a feeling it was going to be a good week when us east-coast camp instructors got to leave behind this:

And were greeted by this:

Seriously, Vancouver. You know you're supposed to be in Canada too, right?

Library of Congress: The Signal: Introducing the Federal Web Archiving Working Group

planet code4lib - Mon, 2015-02-23 16:07

The following is a guest post from Michael Neubert, a Supervisory Digital Projects Specialist at the Library of Congress.

View of Library of Congress from U.S. Capitol dome in winter. Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

“Publishing of federal information on government web sites is orders of magnitude more than was previously published in print.  Having GPO, NARA and the Library, and eventually other agencies, working collaboratively to acquire and provide access to these materials will collectively result in more information being available for users and will accomplish this more efficiently.” – Mark Sweeney, Associate Librarian for Library Services, Library of Congress.

“Harvesting born-digital agency web content, making it discoverable, building digital collections, and preserving them for future generations all fulfill the Government Publishing Office’s mission, Keeping America Informed. We are pleased to be partnering with the Library and NARA to get this important project off the ground. Networking and collaboration will be key to our success government-wide.” – Mary Alice Baish, Superintendent of Documents, Government Publishing Office.

“The Congressional Web Harvest is an invaluable tool for preserving Congress’ web presence. The National Archives first captured Congressional web content for the 109th Congress in 2006, and has covered every Congress since, making more than 25 TB of content publicly available. This important resource chronicles Congress’ increased use of the web to communicate with constituents and the wider public. We value this collaboration with our partners at the Government Publishing Office and the Library of Congress, and look forward to sharing our results with the greater web archiving community.” – Richard Hunt, Director of the Center for Legislative Archives, National Archives and Records Administration.

Today most information that federal government agencies produce is created in electronic format and disseminated over the World Wide Web. Few federal agencies have any legal obligation to preserve the web content that they produce long-term, and few deposit such content with the Government Publishing Office or the National Archives and Records Administration; such materials are vulnerable to being lost.

Exterior of Government Printing Office I [Today known as the Government Publishing Office]. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

How much information are we talking about? Just quantifying an answer to that question turns out to be a daunting task. James Jacobs, Data Services Librarian Emeritus, University of California, San Diego, prepared a background paper (PDF) looking at the problem of digital government information for the Center for Research Libraries for the “Leviathan: Libraries and Government in the Age of Big Data” conference organized in April 2014:

The most accurate count we currently have is probably from the 2008 “end of term crawl.” It attempted to capture a snapshot “archive” of “the U.S. federal government Web presence” and, in doing so, revealed the broader scope of the location of government information on the web. It found 14,338 .gov websites and 1,677 .mil websites. These numbers are certainly a more comprehensive count than the official GSA list and more accurate as a count of websites than the ISC count of domains. The crawl also included government information on sites that are not .gov or .mil. It found 29,798 .org, 13,856 .edu, and 57,873 .com websites that it classified as part of the federal web presence. Using these crawl figures, the federal government published information on 135,215 websites in 2008.

In other words, a sea of information in 2008 and now, in 2015, still more.

A function of government is to provide information to its citizens through publishing and to preserve some selected portion of these publications. Clearly some (if not most) .gov web sites are “government publications” and the U.S. federal government puts out information on .mil, .com, and other domains as well. What government agencies are archiving federal government sites for future research on a regular basis? And why? To what extent?

In part inspired by discussions at last year’s Leviathan conference, and in part fulfilling earlier conversations, managers and staff of three federal agencies that each do selective harvesting of federal web sites decided to start meeting and talking on a regular basis – the Government Publishing Office, the National Archives and Records Administration and the Library of Congress.

Managers and staff involved in web archiving from these three agencies have now met five times and have plans to continue meeting on a monthly basis during the remainder of 2015. At the most recent meeting we added a representative from the National Library of Medicine. So far we have been learning about what each of the agencies is doing with harvesting and providing access to federal web sites and why – whether it is the result of a legal mandate or because of other collection development policies. We expect to involve representatives of other federal agencies as seems appropriate over time.

Entrance of National Archives on Constitution Ave. side I. Courtesy of the Library of Congress, Prints & Photographs Division, by Theodor Horydczak.

So far one thing we have agreed on is that we have enjoyed our meetings – the world of web archiving is a small one, and sharing our experiences with each other turns out to be both productive and pleasant. Now that we better understand what we are all doing individually and collectively, we are able to discuss how we can be more efficient and effective in the aggregated results of what we are doing going forward, for example by reducing duplication of effort.

And that’s the kind of thing we hope comes out of this – a shared collective development strategy, if only informally developed. The following are some specific activities we are looking at:

  • Developing and describing web archiving best practices for federal agencies, a web archiving “FADGI” (Federal Agencies Digitization Guidelines Initiative), that could also be of interest to others outside of the federal agency community.
  • Investigating common metrics for use of our web archives of federal sites.
  • Establishing outreach to federal agency staff members who create the sites in order to improve our ability to harvest them.
  • Understanding what federal agencies are doing (those that do something) to archive their sites themselves and how that work can be integrated with our efforts.
  • Maintaining a seed list of federal URLs and a record of who is archiving what (and which sites are not being harvested).

As the work progresses we look forward to communicating via blog posts and other means about what we accomplish. We hope to hear from you, via the comments on blog posts like this one, with your questions or ideas.

District Dispatch: Last call: Comment on draft national policy agenda for libraries by 2-27!

planet code4lib - Mon, 2015-02-23 16:07

Among the hundreds of powerful connections and conversations that took place at the 2015 American Library Association (ALA) Midwinter Meeting, librarians of all backgrounds began commenting on a draft national policy agenda for libraries. They asked how libraries can secure additional funding at a time of government budget cuts. Several noted and appreciated the inclusion of federal libraries, and most people specifically welcomed the premise of all libraries as linked together into a national infrastructure. And many saw potential for the national agenda to serve as a template for state- and local-level policy advocacy.

The draft agenda is the first step towards answering the questions “What are the U.S. library interests and priorities for the next five years that should be emphasized to national decision makers?” and “Where might there be windows of opportunity to advance a particular priority at this particular time?”

Outlining key issues and proposals is being pursued through the Policy Revolution! Initiative, led by the ALA Office for Information Technology Policy (OITP) and the Chief Officers of State Library Agencies (COSLA). A Library Advisory Committee—which includes broad representation from across the library community—provides overall guidance to the national effort. The three-year initiative, funded by the Bill & Melinda Gates Foundation, has three major elements: to develop a national public policy agenda, to initiate and deepen national stakeholder interactions based on policy priorities, and to build library advocacy capacity for the long term.

“We are asking big questions, and I’m really encouraged by the insightful feedback we’ve received in face-to-face meetings, emails and letters,” said OITP Director Alan Inouye. “I hope more people will share their perspectives and aspirations for building the capacity libraries of all kinds need to achieve shared national policy goals.”

The current round of public input closes Friday, February 27, 2015. Send your comments, questions and recommendations now to the project team at oitp[at]alawash[dot]org.

The draft agenda provides an umbrella of timely policy priorities and is understood to be too extensive to serve as the single policy agenda for any given entity in the community. Rather, the goal is that various library entities and their members can fashion their national policy priorities under the rubric of this national public policy agenda.

From this foundation, the ALA Washington Office will match priorities to windows of opportunity and confluence to begin advancing policy priorities—in partnership with other library organizations and allies with whom there is alignment—in mid-2015.

“In a time of increasing competition for resources and challenges to fulfilling our core missions, libraries and library organizations must come together to advocate proactively and strategically,” said COSLA President Kendall Wiggin. “Sustainable libraries are essential to sustainable communities.”

The post Last call: Comment on draft national policy agenda for libraries by 2-27! appeared first on District Dispatch.

DPLA: Let’s Talk about Ebooks

planet code4lib - Mon, 2015-02-23 15:45

Books are among the richest artifacts of human culture. In the last half-millennium, we have written over a hundred million of them globally, and within their pages lie incredibly diverse forms of literature, history, and science, poetry and prose, the sacred and the profane. Thanks to our many partners, the Digital Public Library of America already contains over two million ebooks, fully open and free to read.

But we have felt since DPLA’s inception that even with the extent of our ebook collection, we could be doing much more to connect the public, in more frictionless ways, with the books they wish to read. It is no secret that the current landscape for ebooks is rocky and in many ways inhospitable to libraries and readers. Ebook apps are often complicated for new users, and the selection of ebooks a mere fraction of what is on the physical shelves. To their credit, publishers have become more open recently to sharing books through library apps and other digital platforms, but pricing, restrictions, and the availability of titles still vary widely.

At the same time, new models for provisioning ebooks are arising from within the library community. In Colorado, Arizona, Massachusetts, and Connecticut, among other places, libraries and library consortia are exploring ways to expand their e-collections. Some are focusing on books of great local interest, such as genre writers within their areas or biographies of important state figures; others are working with small and independent publishers to provide a wider market for their works; and still others are attempting to recast the economics of ebook purchasing to the benefit of readers and libraries as well as publishers and authors through bulk purchases. Moreover, new initiatives such as the recent push from the National Endowment for the Humanities and the Andrew W. Mellon Foundation to open access to existing works, and the Authors Alliance, which is helping authors to regain their book rights, offer new avenues for books to be made freely available.

At the DPLA, we are particularly enthusiastic about the role that our large and expanding national network of hubs can play. Many of our service hubs have already scanned books from their regions, and are generously sharing them through DPLA. Public domain works are being aggregated by content hubs such as HathiTrust, with more coming online every month. It is clear that we can bring these threads together to create a richer, broader tapestry of ebooks for readers of all ages and interests.

That’s why we’re delighted to announce today that we have received generous funding from the Alfred P. Sloan Foundation to start an in-depth discussion about ebooks and their future, and what DPLA and our partners can do to help push things forward. Along with the New York Public Library, a leader in library technology and services, we plan to intensify the discussions we have already been having with publishers, authors, libraries, and the public about how to connect the maximal number of ebooks with the maximal number of readers.

This conversation will be one of the central events at DPLAfest. If you haven’t registered for the fest yet, this is your call to join us in Indianapolis on April 17-18 to kick off this conversation about the future of ebooks. It is a critical discussion, and we welcome all ideas and viewpoints. We look forward to hearing your thoughts about ebooks in Indy, and in other discussions throughout 2015.

Chris Prom: Arrangement and Description of Electronic Records

planet code4lib - Mon, 2015-02-23 15:21

Over the next few days, I’ll be updating the curriculum for the Society of American Archivists DAS course, “Arrangement and Description of Electronic Records” (ADER), which I developed several years ago. Ania Jarosek from the SAA Education Office tells me the course has been taught 17 times since May 2013 and that it is scheduled for an additional four offerings between now and September, making this a good time to undertake a thorough update.

The ADER course seeks to put into practice the philosophy that led me to start this site in the first place: to demystify digital preservation techniques as they apply to archival practice, facilitating practical methods and steps that can be applied in any archival repository.

From my perspective, the best thing about teaching the course has been the sense of community it seeks to engender. Sure, the course provides everyone who attends practical and achievable steps they can take to get digital materials under intellectual and ‘physical’ control. But more than that, it offers an opportunity to think deeply and to learn from each other, to grow in a common understanding of what it means to be an archivist in the digital age.

That goes as much for me as it does for course attendees.  Every time I teach, I come away with new ideas to implement at the University of Illinois and to make the course an even better experience the next time I teach it.

In this respect, I’ve integrated many direct suggestions from participants over the years, as well as some tool guides provided by Carol Kussman. In addition, Sam Meister and Seth Shaw, the other course instructors, have helped improve the course in meaningful ways. Over time, we’ve worked to incorporate more and more active learning concepts and activities into the two days we spend together, since we don’t believe people learn all that much just by hearing us talk non-stop for two days!

Over the next week, we’ll be doing a larger-than-normal update to the course materials, using tips from many course participants. I’ll be teaching this version of the course for the first time at the University of Hawai’i at Manoa. For those of you looking to enhance your skills in a wonderful setting, it is not too late to register; the early-bird deadline is March 1st.

Specifically, we’ll be making the following enhancements to the course:

  • Increasing emphasis on tool demonstration and use, integrating more direct demonstrations and directed use in small groups
  • Revising tool lists and providing a tool selection grid.
  • Introducing additional community building elements (collaboration spaces)
  • Adding a processing workflow demonstration and discussion
  • Improving the overall class ‘flow’ by tracking specific arrangement and description tasks to a model workflow
  • Adding new (and better!) sample collections to use in day-two exercises: (1) planning to process, (2) arranging records, and (3) describing records

All of these suggestions were provided directly by prior participants. In this respect, and in many others, the course is a true collaborative effort of the Society, putting into practice SAA’s core organizational values. I hope you can join me in Hawai’i, but if not, ADER is scheduled to be taught three more times between now and September, and it can be hosted elsewhere. SAA lists its many fine educational offerings on its education calendar.



John Miedema: Lila’s four cognitive extensions to the writing process

planet code4lib - Mon, 2015-02-23 15:09

Lila is cognitive writing technology. It uses natural language processing to extend the cognitive abilities of a writer engaged in a project. In the previous post I described the seven root categories used to organize a non-fiction writing project and to optimize Lila’s analytics. These categories are considered a natural fit with the writing process and can be visualized as folders that contain notes. In this post I present a diagram that maps Lila’s four cognitive extensions through the folders to the writing process.

A “slip” is the unit of text in Lila. A slip is equivalent to a “note”: usually one or a few sentences, with no hard limit on length.

  1. The early stages of the writing process focus on thinking and research. An author sends slips to an Inbox and begins filing them in a Work folder. Documents and books that have not been read are sent to the TLDR folder. Lila processes the unread content, generating slips that are also filed in the Work folder.
  2. As the slips build up the author analyzes them. Using Lila, an author can visualize the connections between slips. The author can  “pin” interesting connections and discard others.
  3. Connections are made between the author slips, and from author slips to unread content slips. Where the connections are made to unread content, a link is provided to the original document or book. Authors can read both slips and original material in the context of their own content. This is called “embedded reading”, allowing for swifter analysis of new material.
  4. Analysis leads to organizing and writing drafts. An author will organize content in a particular hierarchical view, a table of contents. The author can get new insight by viewing the content in alternate hierarchical views generated by Lila.

The writing process is iterative: thinking, research, and analysis recur throughout, and Lila can perform its cognitive extensions at any step, e.g., integrating a new unread document late in the process. As the writing continues, slips are edited and integrated into a longer work for publication. Lila maintains a sense of “slips” in the background even when the author is working on a long integrated unit of text.
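Extensions 2 and 3 above hinge on proposing connections between slips that the author can then pin or discard. As a rough illustration only, the sketch below uses a crude word-overlap score as the similarity measure; the class and function names, and the measure itself, are assumptions, not Lila's actual NLP.

```python
# Illustrative sketch of proposing "connections" between slips.
# The word-overlap similarity is an assumption for illustration;
# Lila's actual natural language processing is not described here.

def words(text):
    """Lowercased vocabulary of a slip."""
    return set(text.lower().split())

def connection_strength(slip_a, slip_b):
    """Jaccard-style overlap between two slips' vocabularies (0..1)."""
    a, b = words(slip_a), words(slip_b)
    return len(a & b) / max(len(a | b), 1)

slips = [
    "Quality cannot be defined",
    "Defining quality destroys it",
    "The motorcycle runs on gasoline",
]

# Propose every pair whose overlap clears a threshold; the author
# may then "pin" interesting connections and discard the rest.
proposed = [
    (i, j, connection_strength(slips[i], slips[j]))
    for i in range(len(slips))
    for j in range(i + 1, len(slips))
    if connection_strength(slips[i], slips[j]) > 0.1
]
pinned = list(proposed)  # in practice, the author chooses which to keep
```

Here only the first two slips share vocabulary ("quality"), so only that pair is proposed; the motorcycle slip connects to neither.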

Open Knowledge Foundation: Building a Free & Open World-wide Address Dataset

planet code4lib - Mon, 2015-02-23 11:39

Finding your way through the world is a basic need, so it makes sense that satellite navigation systems like GPS and Galileo are among open data’s most-cited success stories. But as wonderful as those systems are, they’re often more useful to robots than people. Humans usually navigate by addresses, not coordinates. That means that address data is an essential part of any complete mapping system.

Unfortunately, address data has historically been difficult to obtain. At best, it was sold for large amounts of money by a small set of ever-more consolidated vendors. These were often the product of public-private partnerships set up decades ago, under which governments granted exclusive franchises before the digital era unveiled the data’s full importance. In some cases, data exclusivity means that the data simply isn’t available at any price.

Fortunately, the situation is improving. Scores of governments are beginning to recognize that address data is an important part of their open data policy. This is thanks in no small part to the community of advocates working on the issue. Open Knowledge has done important work surveying the availability of parcel and postcode data, both of which are essential parts of address data. OpenAddresses UK has recently launched an ambitious plan to collect and release the country’s address data. And in France, the national OpenStreetMap community’s BANO project has been embraced by the government’s own open data portal.

This is why we’re building OpenAddresses, a global community collecting openly available address data. I and my fellow contributors were pleased to recently celebrate our 100 millionth address point.

Getting involved in OpenAddresses is easy and can quickly pay dividends. Adding a new dataset is as easy as submitting a form, and you’ll benefit by improving a global open address dataset in one consistent format that anyone can use. Naturally, we also welcome developers: there are interesting puzzles and mountains of data that still need work.
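That "one consistent format" is the heart of the project: however a source publishes its addresses, they come out as uniform rows. The sketch below is a simplified illustration of such a normalization step; the field names, mapping, and sample record are all assumptions, not OpenAddresses' actual conform machinery.

```python
import csv
import io

# Hypothetical normalization of one source's records into a uniform
# lon/lat/number/street schema. Field names are illustrative only.
def conform(records, mapping):
    """Map a source's own field names onto a consistent output schema."""
    for rec in records:
        yield {
            "LON": rec[mapping["lon"]],
            "LAT": rec[mapping["lat"]],
            "NUMBER": rec[mapping["number"]],
            "STREET": rec[mapping["street"]].upper().strip(),
        }

# A made-up source record and its field mapping.
source = [{"x": "-77.03", "y": "38.90", "housenum": "1600",
           "str_name": " Pennsylvania Ave NW "}]
mapping = {"lon": "x", "lat": "y", "number": "housenum", "street": "str_name"}

# Emit the normalized rows as CSV, the format anyone can use.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["LON", "LAT", "NUMBER", "STREET"])
writer.writeheader()
writer.writerows(conform(source, mapping))
```

Every source, whatever its original schema, yields rows in the same shape, which is what makes a single global dataset possible.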

Our most important tools to gather more data are email and search engines. Addresses are frequently buried in aging cadastral databases and GIS portals. Time spent hunting for them often reveals undiscovered resources. A friendly note to a person in government can unlock new data with surprising success. Many governments simply don’t know that citizens need this data or how to release it as an open resource.

If you work in government and care about open data, we’d like to hear from you. Around the world, countries are acknowledging that basic geographic data belongs in the commons. We need your help to get it there.

Code4Lib: Code4Lib 2016 Conference Proposals

planet code4lib - Sun, 2015-02-22 18:53

Proposals accumulated by the closing date of 2015-02-20

Los Angeles Proposal - Los Angeles, CA

Philadelphia Proposal - Philadelphia, PA

Code4Lib Wiki 2016 Proposals Page

Host voting is NOW OPEN, from 2015-02-23 00:00:00 UTC to 2015-03-07 08:00:00 UTC. You can also watch the results.

Topic: conferences

Ian Davis: Another Blog Refresh

planet code4lib - Sun, 2015-02-22 10:04

It’s time for another blog refresh, this time back to a static site after a few years on a hosted platform. Once again I’m convinced by Aaron’s argument that baking is better than frying. It’s not about performance, it’s about simplicity and control.

While I liked the convenience of the hosted service, it never really felt like a place I could tailor to my own requirements. I thought having a nice web UI and mobile apps to edit posts would encourage me to post more. It actually made no difference whatsoever. Whatever holds me back from blogging isn’t related to the editing UI.

For this move I looked at various static site generators such as Jekyll, Hyde and Hugo but I settled on a minimal one: gostatic. My reasoning (which I admit may not be entirely justified) is that feature-led software gets updated at a much higher rate than I post to my blog. When I come to post, invariably something important has changed in the core software or in a dependency and so I’ll need to upgrade or fix that before being able to publish. I find this particularly true of larger systems in dynamic languages like Ruby or PHP.

This time around I have a single binary (gostatic) to generate the site with no dependencies. It’s deliberately feature-poor so I don’t rely on things that may be changed or deprecated some time in the future and I have a script that does the rebuild and can sync to whatever laptop I’ll be using in the coming years. It’s documented for a future me.
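Part of baking's appeal is how conceptually small it is: read sources, render pages, write files, done. The toy sketch below illustrates that idea only; it is not gostatic, the file layout and template are assumptions, and a real generator adds markdown rendering, templating, and feeds.

```python
import html
import pathlib

# Toy "baking" step: turn a folder of source posts into standalone
# HTML files. The template and layout are assumptions for illustration.
TEMPLATE = "<html><head><title>{title}</title></head><body>{body}</body></html>"

def bake(src_dir, out_dir):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for post in pathlib.Path(src_dir).glob("*.md"):
        lines = post.read_text().splitlines()
        # First line doubles as the title; remaining lines become paragraphs.
        title = lines[0].lstrip("# ").strip() if lines else post.stem
        body = "".join(f"<p>{html.escape(l)}</p>" for l in lines[1:] if l.strip())
        (out / f"{post.stem}.html").write_text(TEMPLATE.format(title=title, body=body))
```

Pointing `bake` at a posts folder leaves plain HTML files that any web server can serve as-is; no runtime dependency ever needs upgrading before you can publish.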

A few technical notes:

  • Posts are written in markdown and are compatible with all the static site generators I mentioned above
  • This move is partly motivated by moving to a new web server. I’m going to be using nginx and serving either static files or fronting Go services. This is the first time I’ve had a server that isn’t running Apache+PHP.
  • Hopefully the atom feed works ok – it looks ok, but there’s almost certainly some weird software out there that will break on it.
  • There will be broken links, but already I have fixed hundreds of bad internal links by being able to grep over all the posts locally.
  • Formatting will be weird in places since the posts were exported via Wordpress’s XML export. I’ll get to tidying up the individual posts as an ongoing job.
  • There are no comments. I have the comments as part of the Wordpress export, and I’m planning to take a look at how to incorporate them into the blog archives. However I’m not planning on adding commenting to the blog. Thank you to all of you who have commented on my posts in the past, I have enjoyed reading them. But… it’s time to admit that commenting is a broken form of communication.

For contrary views on baked vs. fried and blog commenting, see my post on moving from Movable Type to Wordpress back in 2004, my post on moving from a dynamic system to Movable Type, or even my post on moving to a hosted blog on Posterous ;)

For early blog archeology see my post on early versions of Internet Alchemy

Code4Lib: Code4Lib North 2015: St. Catharines, ON

planet code4lib - Sat, 2015-02-21 21:00
Topic: meetings

The sixth Code4Lib North meeting will be on June 4-5, 2015 at the St. Catharines Public Library, 54 Church St., St. Catharines, Ontario. St. Catharines is on the Niagara Peninsula on the south side of Lake Ontario, close to the American cities of Buffalo and Rochester in New York. See the wiki page for details and to sign up for a talk.

Code4Lib: Code4Lib 2015 videos

planet code4lib - Sat, 2015-02-21 20:51
Topic: code4lib 2015

All of Code4Lib 2015 was recorded and the videos are available on the YouTube Code4Lib channel.

John Miedema: Seven root categories for organizing non-fiction writing and optimizing Lila’s analytics

planet code4lib - Sat, 2015-02-21 14:03

Lila technology collaborates with an author engaged in a writing project. A model of the writing process is assumed, one that is considered natural for writing non-fiction, at least, and compliant with existing writing software. In this model, an author writes notes and organizes them into categories. Seven root categories are assumed to be fundamental to a writing project, folders that contain the written material. The categories are presented here not so much as Lila system requirements, but as a best practice, structures that optimize the writing process and Lila’s analytics. If you do not use these categories to organize your non-fiction writing project, you might consider doing so, whether or not you intend to use Lila.

  1. The author begins a project: a root Project folder is created, a repository for everything else.
     Folder: Project
     Description: A single root folder that contains all other folders and slips. It may hold high-level instructions regarding project plans, to-do lists, etc., but these are not content for Lila’s analysis.
     Pirsig comparison: Like Pirsig’s PROGRAM slips, the Project folder may contain “instructions for what to do with the rest of the slips,” but this information will not operate as a “program.” All programming functions will be handled by Lila code.

  2. The author takes notes on ideas using various software programs on different devices. Many notes will require further thought before filing into the project, so they are sent to an inbox, a temporary queue, a point for later conscious attention and classification.
     Folder: Project > Inbox
     Description: The Inbox may be an email inbox or an Evernote notebook dedicated to an inbox function; there can be multiple inboxes. Notes in the inbox may be tentatively assigned categories and/or tags, but these will be reviewed.
     Pirsig comparison: The Inbox corresponds to Pirsig’s UNASSIMILATED category, “new ideas that interrupted what he was doing.”

  3. Notes are filed into a main folder, a workspace for all the active content.
     Folder: Project > Work
     Description: Notes in the Work folder are organized by categories and subject-classified by tags. These notes are the target of Lila’s analytics. See the upcoming post on subject classification for more information.
     Pirsig comparison: The Work folder contains all the topic categories Pirsig developed as he was working.

  4. Some ideas are considered worth noting, but are either not sufficiently relevant or too disruptive to file into the main work. These notes should not be trashed, but parked for later evaluation.
     Folder: Project > Park
     Description: Parked notes are excluded from Lila’s analytics, but can be brought back into play later.
     Pirsig comparison: Park corresponds to Pirsig’s CRIT and TOUGH categories. I see these two categories as the positive and negative versions of the same thing, i.e., disruptive ideas. Don’t let them take over, but don’t ignore them either; let them hang out in the Park for a while.

  5. A primary function of Lila is to assist with the large volume of content that an author does not have time to read. On the web, the acronym TLDR is used: “Too Long; Didn’t Read.”
     Folder: Project > TLDR
     Description: TLDR is not a flippant term; content management systems typically have special handling for large files. Lila will generate notes (slips) from this unread content, and present it in context for embedded reading.
     Pirsig comparison: Pirsig provided no special classification for unread content. Likely it just went in a pile, perhaps left unread.

  6. Some notes, and chains of notes, seem important at one time but are later considered irrelevant or out of scope, typically as the project matures and editing is undertaken. These notes are not trashed but archived for possible reuse later.
     Folder: Project > Archive
     Description: Archived notes are excluded from Lila’s analytics; perhaps a switch will allow them to be included. The archive could tie into version control for successive drafts.
     Pirsig comparison: Pirsig filed these notes in JUNK, “slips that seemed of high value when he wrote them down but which now seemed awful.”

  7. Other notes are just plain trash: duplicates, dead lines of thought. To avoid noise in the archive it’s best to trash them.
     Folder: Project > Trash
     Description: Trashed notes are excluded from Lila’s analytics and may be purged on occasion.
     Pirsig comparison: Pirsig filed these notes in JUNK, but maintained them indefinitely.
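The table above can be condensed into a small piece of configuration: which folders exist, and which one feeds the analytics. This is an illustrative sketch only; the field names and structure are assumptions, not Lila's actual schema.

```python
# Sketch of the seven root categories as folder metadata, mirroring
# the table above. Field names are illustrative assumptions.
FOLDERS = {
    "Project": {"analyzed": False, "note": "root; plans and to-do lists, not content"},
    "Inbox":   {"analyzed": False, "note": "temporary queue awaiting classification"},
    "Work":    {"analyzed": True,  "note": "active content; the target of Lila's analytics"},
    "Park":    {"analyzed": False, "note": "disruptive ideas held for later evaluation"},
    "TLDR":    {"analyzed": False, "note": "unread content; Lila files generated slips in Work"},
    "Archive": {"analyzed": False, "note": "out-of-scope notes kept for possible reuse"},
    "Trash":   {"analyzed": False, "note": "duplicates and dead ends, purged on occasion"},
}

def analytics_targets(folders):
    """Names of the folders whose slips feed Lila's analytics directly."""
    return sorted(name for name, meta in folders.items() if meta["analyzed"])
```

Only Work qualifies: TLDR content reaches the analytics indirectly, through the slips Lila generates and files into Work, while Park, Archive, and Trash are deliberately excluded.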


