Feed aggregator

Zotero: Indiana University Survey of Zotero Users

planet code4lib - Wed, 2016-04-13 23:08

As part of a grant funded by the Alfred P. Sloan Foundation to analyze altmetrics and expand the Zotero API, our research partners at Indiana University are studying the readership of reference sources across a range of platforms. Cassidy Sugimoto and a team of researchers at IU have developed an anonymous, voluntary survey that seeks to analyze the bibliometrics of Zotero data. The survey includes questions regarding user behavior, discoverability, networking, the research process, open access, open source software, scholarly communication, and user privacy. It is a relatively short survey and your input is greatly appreciated. We will post a follow-up to the Zotero blog that analyzes the results of the survey. Follow this link to take the survey.

LibUX: 036 – Penelope Singer

planet code4lib - Wed, 2016-04-13 21:08

I — you know, Michael! — talk shop with the user interface designer Penelope Singer. We chat about cross-platform and cross-media brand, material design and web animation, cats, and anticipatory design.

As usual, you can download the MP3 directly – or now on SoundCloud, too.

  • 0:55 – What does Penelope do as a user interface designer?
  • 3:00 – “The brand is nothing more than … what the user says you are.”
  • 7:16 – Cats
  • 9:00 – “People’s paradigms are always related to physical things”
  • 12:38 – Animations that communicate state change
  • 15:20 – Resistance to brand and style guidelines
  • 19:47 – Anticipatory design [1] and concerns around privacy

You can subscribe to LibUX on Stitcher, iTunes, SoundCloud or plug our feed right into your podcatcher of choice. Help us out and say something nice. You can find every podcast right here on

  1. Yes, again.

The post 036 – Penelope Singer appeared first on LibUX.

LITA: Jobs in Information Technology: April 13, 2016

planet code4lib - Wed, 2016-04-13 18:58

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Qualcomm, Inc., Content Integration Librarian, San Diego, CA

Multnomah County Library, Front End Drupal Web Developer, Portland, OR

City of Virginia Beach Library Department, Librarian I/Web Specialist #7509, Virginia Beach, VA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Equinox Software: As You Wish: No Programmer Required

planet code4lib - Wed, 2016-04-13 18:08

Every year, Library Journal investigates the Library Systems Landscape.  In this series of articles, libraries and vendors alike are polled and interviewed in order to come up with a cohesive glimpse into Libraryland’s inner workings.  We greatly appreciate the hard work Matt Enis puts in each year to give a well-rounded perspective on both proprietary and open source solutions for libraries.

Equinox President Mike Rylander was interviewed for the Open Invitation portion of this series.  But what really caught our attention here at Equinox was a separate portion:  Wish List.  In this article, librarians were surveyed on their current ILS and the things they would like to see in their own libraries.  

We were most interested in this quote (emphasis added):

“Others expressed concerns about whether on-staff expertise would be needed to operate an open source ILS—a perception that development houses such as ByWater, Equinox, and LibLime have been trying to battle. One respondent who had served on an ILS search team noted that despite being “extremely unhappy” with one commercial ILS, the library ultimately moved to another proprietary solution “due to insufficient funds for open source support staff.” Another wrote that an “open source solution has great appeal due to the customization possibilities. The cost and maintenance factors also play into this. But you have to have the internal capacity to support open source, and with budget reductions, we just haven’t been able to consider an open source option.””

While the prevalence of this misunderstanding has decreased in recent years, it’s very unfortunate to see it repeated by libraries again and again.  The only winners when those beliefs persist are the proprietary vendors selling ILSs that make their users “extremely unhappy[.]”

Equinox would like to take this opportunity to say that “Open Source support staff” is NOT necessary to use Open Source solutions in ANY library.  Equinox (as well as other open source support vendors) offers the full gamut of services.  We can migrate, host, install, maintain, and develop new features, at a cost lower than what you would expect from a proprietary ILS vendor.  There is absolutely no need for extra library staff.  There is absolutely no need to have a programmer on staff.  We can do the programming, tech support, and training for you.

You can have all the benefits an open source product brings (truly open APIs and code, product portability, and flexibility) along with the security of having a team of experts supporting you.  You get real choice and a real voice, and you don’t have to hire any new staff to get them.

We don’t want you to be extremely unhappy with your current ILS.  We want you to be over the moon happy with your ILS solution because it has the functionality and flexibility you expect and deserve.  We can help you with this–no computer science degree required.

David Rosenthal: The Architecture of Emulation on the Web

planet code4lib - Wed, 2016-04-13 17:00
In the program my talk at the IIPC's 2016 General Assembly in Reykjavík was entitled Emulation & Virtualization as Preservation Strategies. But after a meeting to review my report called by the Mellon Foundation I changed the title to The Architecture of Emulation on the Web. Below the fold, an edited text of the talk with an explanation for the change, and links to the sources.

It's a pleasure to be here, and I'm grateful to the organizers for inviting me to talk today. As usual, you don't need to take notes or ask for the slides; the text of the talk with links to the sources will go up on my blog shortly.

Thanks to funding from the Mellon Foundation, I spent last summer researching and writing a report, on behalf of the Mellon and Sloan Foundations and IMLS, entitled Emulation & Virtualization as Preservation Strategies. Jeff Rothenberg's 1995 Ensuring the Longevity of Digital Documents identified emulation and migration as the two possible techniques and came down strongly in favor of emulation. Despite this, migration has been overwhelmingly favored until recently. What has changed is that emulation frameworks have been developed that present emulations as a normal part of the Web.

Last month there was a follow-up meeting at the Mellon Foundation. In preparing for it, I realized that there was an important point that the report identified but didn't really explain properly. I'm going to try to give a better explanation today, because it is about how emulations of preserved software appear on the web, and thus how they can become part of the Web that we collect, preserve and disseminate. I'll start by describing how the three emulation frameworks I studied appear on the Web, then illustrate the point with an analogy, and suggest how it might be addressed.

When I gave a talk about the report at CNI I included live demos. It was a disaster; Olive was the only framework that worked via hotel WiFi. I have pre-recorded the demos using Kazam and a Chromium browser on my Ubuntu 14.04 system.

Theresa Duncan's CD-ROMs

[Image: The Theresa Duncan CD-ROMs]

From 1995 to 1997 Theresa Duncan produced three seminal feminist CD-ROM games, Chop Suey, Smarty and Zero Zero. Rhizome, a project hosted by the New Museum in New York, has put emulations of them on the Web. You can visit, click any of the "Play" buttons and have an experience very close to that of playing the CD on MacOS 7.5. This has proved popular. For several days after their initial release they were being invoked on average every 3 minutes.

What Is Going On?

What happened when I clicked Smarty's Play button?
  • The browser connects to a session manager in Amazon's cloud, which notices that this is a new session.
  • Normally it would authenticate the user, but because this CD-ROM emulation is open access it doesn't need to.
  • It assigns one of its pool of running Amazon instances to run the session's emulator. Each instance can run a limited number of emulators. If no instance is available when the request comes in it can take up to 90 seconds to start another.
  • It starts the emulation on the assigned instance, supplying metadata telling the emulator what to run.
  • The emulator starts. After a short delay the user sees the Mac boot sequence, and then the CD-ROM starts running.
  • At intervals, the emulator sends the session manager a keep-alive signal. Emulators that haven't sent one in 30 seconds are presumed dead, and their resources are reclaimed to avoid paying the cloud provider for unused resources.
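
The keep-alive step in the list above can be sketched as follows. This is a minimal illustration, not bwFLA's actual session manager: the `SessionManager` class, its method names and the in-memory session table are all assumptions made for the example. Only the 30-second timeout comes from the description above.

```scala
object KeepAliveSketch {
  // From the talk: emulators silent for more than 30 seconds are presumed dead.
  val TimeoutMillis: Long = 30000L

  // Hypothetical session table: session id -> time of the last keep-alive (ms).
  final class SessionManager {
    private var lastSeen = Map.empty[String, Long]

    // Record a keep-alive signal from an emulator.
    def keepAlive(sessionId: String, nowMillis: Long): Unit =
      lastSeen = lastSeen.updated(sessionId, nowMillis)

    // Forget sessions whose last keep-alive is older than the timeout and
    // return their ids, so the caller can reclaim their cloud resources.
    def reap(nowMillis: Long): Set[String] = {
      val dead = lastSeen.filter { case (_, seen) => nowMillis - seen > TimeoutMillis }.keySet
      lastSeen = lastSeen -- dead
      dead
    }

    // Sessions currently believed to be alive.
    def live: Set[String] = lastSeen.keySet
  }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real service would presumably read the system clock and run `reap` on a timer.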

    bwFLA

    [Image: bwFLA architecture]

    Rhizome, and others such as Yale, the DNB and ZKM Karlsruhe, use technology from the bwFLA team at the University of Freiburg to provide Emulation As A Service (EAAS). Their GPLv3 licensed framework runs in "the cloud" to provide comprehensive management and access facilities wrapped around a number of emulators. It can also run as a bootable USB image or via Docker. bwFLA encapsulates each emulator so that the framework sees three standard interfaces:
    • Data I/O, connecting the emulator to data sources such as disk images, user files, an emulated network containing other emulators, and the Internet.
    • Interactive Access, connecting the emulator to the user using standard HTML5 facilities.
    • Control, providing a Web Services interface that bwFLA's resource management can use to control the emulator.
    The communication between the emulator and the user takes place via standard HTTP on port 80; there is no need for a user to install software, or browser plugins, and no need to use ports other than 80. Both of these are important for systems targeted at use by the general public.

    bwFLA's preserved system images are stored as a stack of overlays in QEMU's "qcow2" format. Each overlay on top of the base system image represents a set of writes to the underlying image. For example, the base system image might be the result of an initial install of Windows 95, and the next overlay up might be the result of installing Word Perfect into the base system. Or the next overlay up might be the result of redaction. Each overlay contains only those disk blocks that differ from the stack of overlays below it. The stack of overlays is exposed to the emulator as if it were a normal file system via FUSE.
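
The overlay idea can be sketched in a few lines. This is a simplified model rather than the qcow2 on-disk format or bwFLA's code: blocks are represented as strings, each layer is an in-memory map, and the names `readBlock` and `makeOverlay` are invented for the illustration.

```scala
object OverlaySketch {
  type Block = String // stand-in for the bytes of one disk block

  // Each layer maps block numbers to contents. The head of the list is the
  // topmost overlay; the last element is the base system image.
  def readBlock(stack: List[Map[Long, Block]], blockNo: Long): Option[Block] =
    stack.collectFirst { case layer if layer.contains(blockNo) => layer(blockNo) }

  // Build a new overlay that keeps only the written blocks whose contents
  // differ from what the stack below already returns.
  def makeOverlay(stack: List[Map[Long, Block]], writes: Map[Long, Block]): Map[Long, Block] =
    writes.filter { case (n, data) => !readBlock(stack, n).contains(data) }
}
```

With a base layer standing in for the Windows 95 install, `makeOverlay` applied to the writes made while installing an application yields a layer holding just the changed blocks, and `readBlock` on the two-layer stack falls through to the base for everything else.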

    The technical metadata that encapsulates the system disk image is described in a paper presented to the iPres conference in November 2015, using the example of emulating CD-ROMs. Broadly, it falls into two parts, describing the software and hardware environments needed by the CD-ROM in XML. The XML refers to the software image components via the Handle system, providing a location-independent link to access them.

    TurboTax

    [Image: TurboTax97 on Windows 3.1]

    I can visit and get 1997's TurboTax running on Windows 3.1. The pane in the browser window has top and bottom menu bars, and between them is the familiar Windows 3.1 user interface.

    What Is Going On?

    The top and bottom menu bars come from a program called VMNetX that is running on my system. Chromium invoked it via a MIME-type binding, and VMNetX then created a suitable environment in which it could invoke the emulator that is running Windows 3.1 and TurboTax. The menu bars include buttons to power-off the emulated system, control its settings, grab the screen, and control the assignment of the keyboard and mouse to the emulated system.

    The interesting question is "where is the Windows 3.1 system disk with TurboTax installed on it?"

    Olive

    The answer is that the "system disk" is actually a file on a remote Apache Web server. The emulator's disk accesses are being demand-paged over the Internet using standard HTTP range queries to the file's URL.

    This system is Olive, developed at Carnegie Mellon University by a team under my friend Prof. Mahadev Satyanarayanan, and released under GPLv2. VMNetX uses a sophisticated two-level caching scheme to provide good emulated performance even over slow Internet connections. A "pristine cache" contains copies of unmodified disk blocks from the "system disk". When a program writes to disk, the data is captured in a "modified cache". When the program reads a disk block, it is delivered from the modified cache, the pristine cache or the Web server, in that order. One reason this works well is that successive emulations of the same preserved system image are very similar, so pre-fetching blocks into the pristine cache is effective in producing YouTube-like performance over 4G cellular networks.
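
The read path described above (modified cache, then pristine cache, then the Web server) can be sketched as below. This is an illustration under stated assumptions, not VMNetX's implementation: the `CachedDisk` class, the string-valued blocks and the `fetchFromServer` callback are hypothetical, and pre-fetching is left out. The `rangeHeader` helper shows the form of the standard HTTP `Range` header value that a demand-paged block read would send.

```scala
import scala.collection.mutable

object OliveCacheSketch {
  // Value of the HTTP Range header covering one fixed-size block.
  def rangeHeader(blockNo: Long, blockSize: Int): String = {
    val start = blockNo * blockSize
    s"bytes=$start-${start + blockSize - 1}"
  }

  // fetchFromServer stands in for an HTTP range request to the image's URL.
  final class CachedDisk(fetchFromServer: Long => String) {
    private val modified = mutable.Map.empty[Long, String] // blocks written locally
    private val pristine = mutable.Map.empty[Long, String] // unmodified server blocks
    var serverFetches = 0

    // Writes never go back to the server; they land in the modified cache.
    def write(blockNo: Long, data: String): Unit = modified(blockNo) = data

    // Look in the modified cache, then the pristine cache, then the server.
    def read(blockNo: Long): String =
      modified.getOrElse(blockNo,
        pristine.getOrElseUpdate(blockNo, { serverFetches += 1; fetchFromServer(blockNo) }))
  }
}
```

Keeping writes in a separate cache means the preserved image on the server is never altered, and the pristine cache can be shared between successive emulations of the same image.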

    VisiCalc

    [Image: VisiCalc on Apple ][]

    You can visit and run Dan Bricklin and Bob Frankston's VisiCalc from 1979 on an emulated Apple ][. It was the world's first spreadsheet. Some of the key-bindings are strange to users conditioned by decades of Excel, but once you've found the original VisiCalc reference card, it is perfectly usable.

    What Is Going On?

    The Apple ][ emulator isn't running in the cloud, as bwFLA's does, nor is it running as a process on my machine, as Olive's does. Instead, it is running inside my browser. The emulators have been compiled into JavaScript, using emscripten. When I clicked on the link to the emulation, metadata describing the emulation including the emulator to use was downloaded into my browser, which then downloaded the JavaScript for the emulator and the system image for the Apple ][ with VisiCalc installed.

    Emularity

    This is the framework underlying the Internet Archive's software library, which currently holds nearly 36,000 items, including more than 7,300 for MS-DOS, 3,600 for Apple, 2,900 console games and 600 arcade games. Some can be downloaded, but most can only be streamed.

    The oldest is an emulation of a PDP-1 with a DEC 30 display running the Space War game from 1962, more than half a century ago. As I can testify having played this and similar games on Cambridge University’s PDP-7 with a DEC 340 display seven years later, this emulation works well.

    The quality of the others is mixed. Resources for QA and fixing problems are limited; with a collection this size problems are to be expected. Jason Scott crowd-sources most of the QA; his method is to see if the software boots up and if so, put it up and wait to see whether visitors who remember it post comments identifying problems, or whether the copyright owner objects. The most common problem is the sound.

    It might be thought that the performance of running the emulator locally by adding another layer of emulation (the JavaScript virtual machine) would be inadequate, but this is not the case for two reasons. First, the user’s computer is vastly more powerful than an Apple ][ and, second, the performance of the JavaScript engine in a browser is critical to its success, so large resources are expended on optimizing it.

    The movement supported by major browser vendors to replace the JavaScript virtual machine with a byte-code virtual machine called WebAssembly has borne fruit. Last month four major browsers announced initial support, all running the same game, a port of Unity's Angry Bots. This should greatly reduce the pressure for multi-core and parallelism support in JavaScript, which was always likely to be a kludge. Improved performance for in-browser emulation is also likely to make in-browser emulation more competitive with techniques that need software installation and/or cloud infrastructure, reducing the barrier to entry.

    The PDF Analogy

    Let's make an analogy between emulation and something that everyone would agree is a Web format, PDF. Browsers lack built-in support for rendering PDF. They used to depend on external PDF renderers, such as Adobe Reader, via a MIME-type binding. Now, they download pdf.js and render the PDF internally even though it's a format for which they have no built-in support. The Webby, HTML5 way to provide access to formats that don't qualify for built-in support is to download a JavaScript renderer. We don't preserve PDFs by wrapping them in a specific PDF renderer, we preserve them as PDF plus a MimeType. At access time the browser chooses an appropriate renderer, which used to be Adobe Reader and is now pdf.js.

    Landing Pages

    [Image: ACM landing page]

    There's another interesting thing about PDFs on the web. In many cases the links to them don't actually get you to the PDF. The canonical, location-independent link to the LOCKSS paper in ACM ToCS is, which currently redirects to which is a so-called "landing page", not the paper but a page about the paper, on which if you look carefully you can find a link to the PDF.

    The fact that it is very difficult for a crawler to find this link makes it hard for archives to collect and preserve scholarly papers. Herbert Van de Sompel and Michael Nelson's Signposting proposal addresses this problem, as to some extent do W3C activities called Packaging on the Web and Portable Web Publications for the Open Web Platform.

    Like PDFs, preserved system images, the disk image for a system to be emulated and the metadata describing the hardware it was intended for, are formats that don't qualify for built-in support. The Webby way to provide access to them is to download a JavaScript emulator, as Emularity does. So is the problem of preserving system images solved?

    Problem Solved? NO!

    No, it isn't. We have a problem that is analogous to, but much worse than, the landing page problem. The analogy would be that, instead of a link on the landing page leading to the PDF, embedded in the page was a link to a rendering service. The metadata indicating that the actual resource was a PDF, and the URI giving its location, would be completely invisible to the user's browser or a Web crawler. At best all that could be collected and preserved would be a screenshot.

    All three frameworks I have shown have this problem. The underlying emulation service, the analogy of the PDF rendering service, can access the system image and the necessary metadata, but nothing else can. Humans can read a screenshot of a PDF document; a screenshot of an emulation is useless. Wrapping a system image in an emulation like this makes it accessible in the present, not preservable for the future.

    If we are using emulation as a preservation strategy, shouldn't we be doing it in a way that is itself able to be preserved?

    A MimeType for Emulations?

    What we need is a MimeType definition that allows browsers to follow a link to a preserved system image and construct an appropriate emulation for it in whatever way suits them. This would allow Web archives to collect preserved system images and later provide access to them.

    The linked-to object that the browser obtains needs to describe the hardware that should be emulated. Part of that description must be the contents of the disks attached to the system. So we need two MimeTypes:
    • A metadata MimeType, say Emulation/MachineSpec, that describes the architecture and configuration of the hardware, which links to one or more resources of:
    • A disk image MimeType, say DiskImage/qcow2, with the contents of each of the disks.
    Emulation/MachineSpec is pretty much what the hardware part of bwFLA's internal metadata format does, though from a preservation point of view there are some details that aren't ideal. For example, using the Handle system is like using a URL shortener or a DOI, it works well until the service dies. When it does, as for example last year when's domain registration expired, all the identifiers become useless.

    I suggest DiskImage/qcow2 because QEMU's qcow2 format is a de facto standard for representing the bits of a preserved system's disk image.

    And binding to "emul.js"

    Then, just as with pdf.js, the browser needs a binding to a suitable "emul.js" which knows, in this browser's environment, how to instantiate a suitable emulator for the specified machine configuration and link it to the disk images. This would solve both problems:
    • The emulated system image would not be wrapped in a specific emulator; the browser would be free to choose appropriate, up-to-date emulation technology.
    • The emulated system image and the necessary metadata would be discoverable and preservable because there would be explicit links to them.
    The details need work but the basic point remains. Unless there are MimeTypes for disk images and system descriptions, emulations cannot be first-class Web objects that can be collected, preserved and later disseminated.
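
To make the proposal concrete, here is a rough sketch of what the two MimeTypes and the binding might look like as data. Everything in it is hypothetical: `MachineSpec`, `DiskImage`, the `bindings` table and the example URLs are invented for the illustration, and the fields a real machine specification would need are exactly the details that still need work.

```scala
object EmulationBindingSketch {
  // Hypothetical shapes for the two proposed MimeTypes.
  final case class DiskImage(url: String, mimeType: String = "DiskImage/qcow2")
  final case class MachineSpec(architecture: String, disks: List[DiskImage])

  // Hypothetical browser-side binding table: MimeType -> script to load.
  val bindings: Map[String, String] = Map(
    "application/pdf" -> "pdf.js",
    "Emulation/MachineSpec" -> "emul.js"
  )

  // Which renderer or emulator script should handle a linked resource?
  def handlerFor(mimeType: String): Option[String] = bindings.get(mimeType)

  // What a crawler must collect for one emulation: the spec itself plus
  // every disk image the spec links to.
  def resourcesToPreserve(specUrl: String, spec: MachineSpec): List[String] =
    specUrl :: spec.disks.map(_.url)
}
```

The point of the sketch is that both the machine description and the disk images are reachable by explicit links, so neither the browser nor an archive's crawler depends on a particular emulation service to find them.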

    SearchHub: Better Feature Engineering with Spark, Solr, and Lucene Analyzers

    planet code4lib - Wed, 2016-04-13 15:32

    This blog post is about new features in the Lucidworks spark-solr open source toolkit. For an introduction to the spark-solr project, see Solr as an Apache Spark SQL DataSource.

    Performing text analysis in Spark

    The Lucidworks spark-solr open source toolkit now contains tools to break down full text into words a.k.a. tokens using Lucene’s text analysis framework. Lucene text analysis is used under the covers by Solr when you index documents, to enable search, faceting, sorting, etc. But text analysis external to Solr can drive processes that won’t directly populate search indexes, like building machine learning models. In addition, extra-Solr analysis can allow expensive text analysis processes to be scaled separately from Solr’s document indexing process.

    Lucene text analysis, via LuceneTextAnalyzer

    The Lucene text analysis framework, a Java API, can be used directly in code you run on Spark, but the process of building an analysis pipeline and using it to extract tokens can be fairly complex. The spark-solr LuceneTextAnalyzer class aims to simplify access to this API via a streamlined interface. All of the analyze*() methods produce only text tokens – that is, none of the metadata associated with tokens (so-called “attributes”) produced by the Lucene analysis framework is output: token position increment and length, beginning and ending character offset, token type, etc. If these are important for your use case, see the “Extra-Solr Text Analysis” section below.

    LuceneTextAnalyzer uses a stripped-down JSON schema with two sections: the analyzers section configures one or more named analysis pipelines; and the fields section maps field names to analyzers. We chose to define a schema separately from Solr’s schema because many of Solr’s schema features aren’t applicable outside of a search context, e.g.: separate indexing and query analysis; query-to-document similarity; non-text fields; indexed/stored/doc values specification; etc.

    Lucene text analysis consists of three sequential phases: character filtering – whole-text modification; tokenization, in which the resulting text is split into tokens; and token filtering – modification/addition/removal of the produced tokens.
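
The three phases can be illustrated without Lucene at all. The sketch below is plain Scala and deliberately does not use Lucene's API: the type aliases, the `analyze` function and the three toy components are assumptions made for the example, standing in for real character filters, tokenizers and token filters.

```scala
import java.util.Locale

object AnalysisPhasesSketch {
  type CharFilter = String => String            // whole-text modification
  type Tokenizer = String => Seq[String]        // text -> tokens
  type TokenFilter = Seq[String] => Seq[String] // modify/add/remove tokens

  // Run the three phases in order: char filters, tokenizer, token filters.
  def analyze(charFilters: Seq[CharFilter], tokenizer: Tokenizer,
              tokenFilters: Seq[TokenFilter], text: String): Seq[String] = {
    val filteredText = charFilters.foldLeft(text)((t, f) => f(t))
    val tokens = tokenizer(filteredText)
    tokenFilters.foldLeft(tokens)((ts, f) => f(ts))
  }

  // Toy components, far cruder than their Lucene counterparts.
  val stripTags: CharFilter = _.replaceAll("<[^>]*>", " ")
  val whitespace: Tokenizer = _.trim.split("\\s+").toList.filter(_.nonEmpty)
  val lowercase: TokenFilter = _.map(_.toLowerCase(Locale.ROOT))
}
```

For example, running `stripTags`, `whitespace` and `lowercase` over `"<b>Ignorance</b> extends Bliss"` yields the tokens `ignorance`, `extends` and `bliss`.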

    Here’s the skeleton of a schema with two defined analysis pipelines:

    { "analyzers": [{ "name": "...",
                      "charFilters": [{ "type": "...", ...}, ... ],
                      "tokenizer": { "type": "...", ... },
                      "filters": [{ "type": "...", ... }, ... ] },
                    { "name": "...",
                      "charFilters": [{ "type": "...", ...}, ... ],
                      "tokenizer": { "type": "...", ... },
                      "filters": [{ "type": "...", ... }, ... ] } ],
      "fields": [{ "name": "...", "analyzer": "..." },
                 { "regex": ".+", "analyzer": "..." }, ... ] }

    In each JSON object in the analyzers array, there may be:

    • zero or more character filters, configured via an optional charFilters array of JSON objects;
    • exactly one tokenizer, configured via the required tokenizer JSON object; and
    • zero or more token filters, configured by an optional filters array of JSON objects.

    Classes implementing each one of these three kinds of analysis components are referred to via the required type key in these components’ configuration objects, the value for which is the SPI name for the class, which is simply the case-insensitive class’s simple name with the -CharFilterFactory, -TokenizerFactory, or -(Token)FilterFactory suffix removed. See the javadocs for Lucene’s CharFilterFactory, TokenizerFactory and TokenFilterFactory classes for a list of subclasses, the javadocs for which include a description of the configuration parameters that may be specified as key/value pairs in the analysis component’s configuration JSON objects in the schema.
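
The naming rule just described can be written down as a small function. This is a sketch of the rule as stated, not Lucene's own lookup code; the `spiName` helper is hypothetical, and it normalizes to lower case as one way of treating names case-insensitively.

```scala
import java.util.Locale

object SpiNameSketch {
  // Longest suffixes first, so "CharFilterFactory" is not cut down as "FilterFactory".
  private val suffixes =
    List("CharFilterFactory", "TokenFilterFactory", "TokenizerFactory", "FilterFactory")

  // Derive the value for the "type" key from a factory class's simple name.
  def spiName(simpleClassName: String): String = {
    val stripped = suffixes
      .find(s => simpleClassName.endsWith(s))
      .map(s => simpleClassName.dropRight(s.length))
      .getOrElse(simpleClassName)
    stripped.toLowerCase(Locale.ROOT)
  }
}
```

So `StandardTokenizerFactory` gives `standard`, `LowerCaseFilterFactory` gives `lowercase`, and `StopFilterFactory` gives `stop`, which are the `type` values used in the schemas in this post.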

    Below is a Scala snippet to display counts for the top 10 most frequent words extracted from spark-solr’s top-level README.adoc file, using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer (which implements the word break rules from Unicode’s UAX#29 standard) and LowerCaseFilter, a filter to downcase the extracted tokens. If you would like to play along at home: clone the spark-solr source code from GitHub; change directory to the root of the project; build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):

    import com.lucidworks.spark.analysis.LuceneTextAnalyzer
    val schema = """{ "analyzers": [{ "name": "StdTokLower",
                   |                  "tokenizer": { "type": "standard" },
                   |                  "filters": [{ "type": "lowercase" }] }],
                   |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
                 """.stripMargin
    val analyzer = new LuceneTextAnalyzer(schema)
    val file = sc.textFile("README.adoc")
    val counts = file.flatMap(line => analyzer.analyze("anything", line))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
                     .sortBy(_._2, false) // descending sort by count
    println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))

    The top 10 token(count) tuples will be printed out:

    the(158), to(103), solr(86), spark(77), a(72), in(44), you(44), of(40), for(35), from(34)

    In the schema above, all field names are mapped to the StdTokLower analyzer via the "regex": ".+" mapping in the fields section – that’s why the call to analyzer.analyze() uses "anything" as the field name.
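
The way a fields section picks an analyzer can be pictured with a small lookup function. This is only a sketch: it assumes rules are tried in the order they are declared and that a regex must match the whole field name, neither of which is confirmed here for LuceneTextAnalyzer, and the `FieldRule` types, the `analyzerFor` helper and the "Fallback" analyzer name are invented for the example.

```scala
object FieldMappingSketch {
  // One entry of a "fields" array: either an exact name or a regex.
  sealed trait FieldRule { def analyzer: String }
  final case class ByName(name: String, analyzer: String) extends FieldRule
  final case class ByRegex(regex: String, analyzer: String) extends FieldRule

  // Assumption: the first rule that matches the field name wins.
  def analyzerFor(rules: List[FieldRule], field: String): Option[String] =
    rules.collectFirst {
      case ByName(n, a) if n == field => a
      case ByRegex(r, a) if field.matches(r) => a
    }
}
```

With a single `ByRegex(".+", "StdTokLower")` rule, any non-empty field name, including "anything", resolves to StdTokLower.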

    The results include lots of prepositions (“to”, “in”, “of”, “for”, “from”) and articles (“the” and “a”) – it would be nice to exclude those from our top 10 list. Lucene includes a token filter named StopFilter that removes words that match a blacklist, and it includes a default set of English stopwords that includes several prepositions and articles. Let’s add another analyzer to our schema that builds on our original analyzer by adding StopFilter:

    import com.lucidworks.spark.analysis.LuceneTextAnalyzer
    val schema = """{ "analyzers": [{ "name": "StdTokLower",
                   |                  "tokenizer": { "type": "standard" },
                   |                  "filters": [{ "type": "lowercase" }] },
                   |                { "name": "StdTokLowerStop",
                   |                  "tokenizer": { "type": "standard" },
                   |                  "filters": [{ "type": "lowercase" },
                   |                              { "type": "stop" }] }],
                   |  "fields": [{ "name": "all_tokens", "analyzer": "StdTokLower" },
                   |             { "name": "no_stopwords", "analyzer": "StdTokLowerStop" } ]}
                 """.stripMargin
    val analyzer = new LuceneTextAnalyzer(schema)
    val file = sc.textFile("README.adoc")
    val counts = file.flatMap(line => analyzer.analyze("no_stopwords", line))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
                     .sortBy(_._2, false)
    println(counts.take(10).map(t => s"${t._1}(${t._2})").mkString(", "))

    In the schema above, instead of mapping all fields to the original analyzer, only the all_tokens field will be mapped to the StdTokLower analyzer, and the no_stopwords field will be mapped to our new StdTokLowerStop analyzer.

    spark-shell will print:

    solr(86), spark(77), you(44), from(34), source(32), query(25), option(25), collection(24), data(20), can(19)

    As you can see, the list above contains more important tokens from the file.

    For more details about the schema, see the annotated example in the LuceneTextAnalyzer scaladocs.

    LuceneTextAnalyzer has several other analysis methods: analyzeMV() to perform analysis on multi-valued input; and analyze(MV)Java() convenience methods that accept and emit Java-friendly datastructures. There is an overloaded set of these methods that take in a map keyed on field name, with text values to be analyzed – these methods return a map from field names to output token sequences.

    Extracting text features in pipelines

    Spark’s machine learning library includes a limited number of transformers that enable simple text analysis, but none support more than one input column, and none support multi-valued input columns.

    The spark-solr project includes LuceneTextAnalyzerTransformer, which uses LuceneTextAnalyzer and its schema format, described above, to extract tokens from one or more DataFrame text columns, where each input column’s analysis configuration is specified by the schema.

    If you don’t supply a schema (via e.g. the setAnalysisSchema() method), LuceneTextAnalyzerTransformer uses the default schema, below, which analyzes all fields in the same way: StandardTokenizer followed by LowerCaseFilter:

    { "analyzers": [{ "name": "StdTok_LowerCase",
                      "tokenizer": { "type": "standard" },
                      "filters": [{ "type": "lowercase" }] }],
      "fields": [{ "regex": ".+", "analyzer": "StdTok_LowerCase" }] }

    LuceneTextAnalyzerTransformer puts all tokens extracted from all input columns into a single output column. If you want to keep the vocabulary from each column distinct from other columns’, you can prefix the tokens with the input column from which they came, e.g. word from column1 becomes column1=word – this option is turned off by default.
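
The effect of the prefixing option can be shown with a plain-Scala stand-in. This is not the transformer's code: `mergeColumns` and its parameters are hypothetical, and the input is assumed to be tokens that have already been extracted per column.

```scala
object TokenPrefixSketch {
  // Merge the tokens from every input column into one output sequence,
  // optionally tagging each token with the column it came from.
  def mergeColumns(tokensByColumn: Seq[(String, Seq[String])],
                   prefixWithColumn: Boolean): Seq[String] =
    tokensByColumn.flatMap { case (column, tokens) =>
      if (prefixWithColumn) tokens.map(t => s"$column=$t") else tokens
    }
}
```

Without prefixing, the same word from two columns becomes indistinguishable in the output; with prefixing, `word` from column1 and `word` from column2 stay separate vocabulary entries, at the cost of a larger vocabulary.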

    You can see LuceneTextAnalyzerTransformer in action in the spark-solr MLPipelineScala example, which shows how to use LuceneTextAnalyzerTransformer to extract text features to build a classification model to predict the newsgroup an article was posted to, based on the article’s text. If you wish to run this example, which expects the 20 newsgroups data to be indexed into a Solr cloud collection, follow the instructions in the scaladoc of the NewsgroupsIndexer example, then follow the instructions in the scaladoc of the MLPipelineScala example.

    The MLPipelineScala example builds a Naive Bayes classifier by performing K-fold cross validation with hyper-parameter search over, among several other params’ values, whether or not to prefix tokens with the column from which they were extracted, and 2 different analysis schemas:

    val WhitespaceTokSchema =
      """{ "analyzers": [{ "name": "ws_tok", "tokenizer": { "type": "whitespace" } }],
        |  "fields": [{ "regex": ".+", "analyzer": "ws_tok" }] }""".stripMargin
    val StdTokLowerSchema =
      """{ "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
        |                  "filters": [{ "type": "lowercase" }] }],
        |  "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] }""".stripMargin
    [...]
    val analyzer = new LuceneTextAnalyzerTransformer().setInputCols(contentFields).setOutputCol(WordsCol)
    [...]
    val paramGridBuilder = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1000, 5000))
      .addGrid(analyzer.analysisSchema, Array(WhitespaceTokSchema, StdTokLowerSchema))
      .addGrid(analyzer.prefixTokensWithInputCol)
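
A hyper-parameter grid like this one is a Cartesian product of the candidate values on each axis. The sketch below shows that idea in plain Scala; it is not Spark's ParamGridBuilder, the `grid` helper is hypothetical, and it covers only the three axes visible in the snippet above (the real example also searches other parameters, so its grid is larger).

```scala
object ParamGridSketch {
  // Cartesian product of per-parameter candidate values:
  // one Map per combination to be cross-validated.
  def grid(axes: List[(String, List[Any])]): List[Map[String, Any]] =
    axes.foldLeft(List(Map.empty[String, Any])) { case (combos, (name, values)) =>
      for (combo <- combos; v <- values) yield combo + (name -> v)
    }
}
```

Two feature counts, two analysis schemas and the two values of a boolean parameter give 2 × 2 × 2 = 8 combinations, each of which is trained and evaluated on every fold.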

    When I run MLPipelineScala, the following log output says that the std_tok_lower analyzer outperformed the ws_tok analyzer, and not prepending the input column onto tokens worked better:

    2016-04-08 18:17:38,106 [main] INFO CrossValidator - Best set of parameters:
    {
      LuceneAnalyzer_9dc1a9c71e1f-analysisSchema: { "analyzers": [{ "name": "std_tok_lower", "tokenizer": { "type": "standard" },
                      "filters": [{ "type": "lowercase" }] }],
      "fields": [{ "regex": ".+", "analyzer": "std_tok_lower" }] },
      hashingTF_f24bc3f814bc-numFeatures: 5000,
      LuceneAnalyzer_9dc1a9c71e1f-prefixTokensWithInputCol: false,
      nb_1a5d9df2b638-smoothing: 0.5
    }

    Extra-Solr Text Analysis

    Solr’s PreAnalyzedField field type enables the results of text analysis performed outside of Solr to be passed in and indexed/stored as if the analysis had been performed in Solr.

    As of this writing, the spark-solr project depends on Solr 5.4.1, but prior to Solr 5.5.0, querying against fields of type PreAnalyzedField was not fully supported – see Solr JIRA issue SOLR-4619 for more information.

    There is a branch on the spark-solr project, not yet committed to master or released, that adds the ability to produce JSON that can be parsed, then indexed and optionally stored, by Solr’s PreAnalyzedField.

    Below is a Scala snippet to produce pre-analyzed JSON for a small piece of text using LuceneTextAnalyzer configured with an analyzer consisting of StandardTokenizer+LowerCaseFilter. If you would like to try this at home: clone the spark-solr source code from GitHub; change directory to the root of the project; check out the branch (via git checkout SPAR-14-LuceneTextAnalyzer-PreAnalyzedField-JSON); build the project (via mvn -DskipTests package); start the Spark shell (via $SPARK_HOME/bin/spark-shell --jars target/spark-solr-2.1.0-SNAPSHOT-shaded.jar); type :paste into the shell; and finally paste the code below into the shell after it prints // Entering paste mode (ctrl-D to finish):

    import com.lucidworks.spark.analysis.LuceneTextAnalyzer
    val schema = """{ "analyzers": [{ "name": "StdTokLower",
                   |                  "tokenizer": { "type": "standard" },
                   |                  "filters": [{ "type": "lowercase" }] }],
                   |  "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }] }
                 """.stripMargin
    val analyzer = new LuceneTextAnalyzer(schema)
    val text = "Ignorance extends Bliss."
    val fieldName = "myfield"
    println(analyzer.toPreAnalyzedJson(fieldName, text, stored = true))

    The following will be output (whitespace added):

    {"v":"1","str":"Ignorance extends Bliss.","tokens":[
      {"t":"ignorance","s":0,"e":9,"i":1},
      {"t":"extends","s":10,"e":17,"i":1},
      {"t":"bliss","s":18,"e":23,"i":1}]}

    If we make the value of the stored option false, then the str key, with the original text as its value, will not be included in the output JSON.
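To make the token fields in that JSON concrete, here is a minimal plain-Scala sketch that produces the same shape of output for simple input. It is not the spark-solr implementation and does not use Lucene: the regex is only a rough stand-in for StandardTokenizer plus LowerCaseFilter, `PreAnalyzedSketch` and its members are hypothetical names, and the text is assumed to contain no characters that need JSON escaping. In the output, "t" is the token text, "s" and "e" are the start and end character offsets, and "i" is the position increment.

```scala
object PreAnalyzedSketch {
  // One analyzed token: text, start offset, end offset, position increment.
  final case class Token(t: String, s: Int, e: Int, i: Int)

  // Rough stand-in for StandardTokenizer + LowerCaseFilter (assumption):
  // runs of letters or digits, lowercased, each advancing the position by 1.
  private val Word = "[\\p{L}\\p{N}]+".r

  def tokens(text: String): Seq[Token] =
    Word.findAllMatchIn(text)
      .map(m => Token(m.matched.toLowerCase, m.start, m.end, 1))
      .toList

  // Assumes `text` needs no JSON escaping (no quotes, backslashes, or control characters).
  def toJson(text: String, stored: Boolean): String = {
    val toks = tokens(text)
      .map(tok => s"""{"t":"${tok.t}","s":${tok.s},"e":${tok.e},"i":${tok.i}}""")
      .mkString(",")
    // The original text is carried along only when it should be stored.
    val str = if (stored) "\"str\":\"" + text + "\"," else ""
    "{\"v\":\"1\"," + str + "\"tokens\":[" + toks + "]}"
  }
}
```

Calling `PreAnalyzedSketch.toJson("Ignorance extends Bliss.", stored = false)` drops the "str" key and keeps only the version and the token list.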


    LuceneTextAnalyzer simplifies Lucene text analysis, and enables use of Solr’s PreAnalyzedField. LuceneTextAnalyzerTransformer allows for better text feature extraction by leveraging Lucene text analysis.

    The post Better Feature Engineering with Spark, Solr, and Lucene Analyzers appeared first on

    District Dispatch: Archived webinar on libertarians and copyright now available

    planet code4lib - Wed, 2016-04-13 13:43

    Check out the latest CopyTalk webinar.

    An archived copy of the CopyTalk webinar “What do Libertarians think about copyright law?” is now available. The webinar was originally webcast on April 7, 2016 by the Office for Information Technology Policy’s Copyright Education Subcommittee. The presenters were Zach Graves, director of digital strategy and a senior fellow working at the intersection of policy and technology at R Street, and Dr. Wayne Brough, chief economist and vice president for research at FreedomWorks. They described how thinkers from across the libertarian spectrum view copyright law. Is copyright an example of big government? Does the law focus too much on content companies and their interests? Is it in the interests of freedom and individual choice? FreedomWorks and R Street (along with ALA) are founding members of Re:Create, a new DC copyright coalition that supports balanced copyright, creativity, understandable copyright law, and freedom to tinker and innovate!

    Plan ahead! One-hour CopyTalk webinars occur on the first Thursday of every month at 11 am Pacific/2 pm Eastern Time. It’s free! On May 5, university copyright education programs will be on tap!

    The post Archived webinar on libertarians and copyright now available appeared first on District Dispatch.

    LITA: Creating a Technology Needs Pyramid

    planet code4lib - Wed, 2016-04-13 12:00

    “Technology Training in Libraries” by Sarah Houghton has become my bible. It was published as part of LITA’s Tech Set series back in 2010 and acts as a no-nonsense guide to technology training for librarians. Before I started my current position, implementing a technology training model seemed easy enough, but I’ve found that there are many layers, including (but certainly not limited to) things like curriculum development, scheduling, learning styles, generational differences, staff buy-in, and assessment. It’s a prickly pear and one of the biggest challenges I’ve faced as a professional librarian.

    After several months of training attempts I took a step back after finding inspiration in the bible. In her chapter on planning, Houghton discusses the idea of developing a Technology Needs Pyramid similar to the Maslow Hierarchy of Needs (originally proposed by Aaron Schmidt on the Walking Paper blog). Necessary skills and competencies make up the base and more idealistic areas of interest are at the top. Most of my research has pointed me towards creating a list of competencies, but the pyramid was much more appealing to a visual thinker like me.

    In order to construct a pyramid for the Reference Services department, we held a brainstorming session where I asked my co-workers what they feel they need to know to work at the reference desk, followed by what they want to learn. At Houghton’s suggestion, I also bribed them with treats. The results were a mix of topics (things like data visualization and digital mapping) paired with specific software (Outlook, Excel, Photoshop).

    Once we had a list I created four levels for our pyramid. “Need to Know” is at the bottom and “Want to Learn” is at the top, with a mix of both in between. I hope that this pyramid will act as a guideline for our department, but more than anything it will act as a guide for me going forward. I’ve already printed it and pinned it to my bulletin board as a friendly daily reminder of what my department needs and what they’re curious about. While I’d like to recommend the Technology Needs Pyramid to everyone, I don’t have the results just yet! I look forward to sharing our progress as we rework our technology plan. In the meantime I can tell you that whether it’s a list, graphic, or narrative, collecting (and demanding) feedback from your colleagues is vital. It’s not always easy, but it’s definitely worth the cost of a dozen donuts.

    Harvard Library Innovation Lab: LIL at IIPC: Noticing Reykjavik

    planet code4lib - Tue, 2016-04-12 23:33

    Matt, Jack, and Anastasia are in Reykjavik this week, along with Genève Campbell of the Berkman Center for Internet and Society, for the annual meeting of the International Internet Preservation Consortium. We’ll have lots of details from IIPC coming soon, but for this first post we wanted to share some of the things we’re noticing here in Reykjavik. 

    [Genève] Nothing in Reykjavik seems to be empty space. There is always room for something different, new, creative, or odd to fill voids. This is the parking garage of the Harpa concert hall. Traditional fluorescent lamps are interspersed with similar ones in bright colors.

    [Jack] I love how many ways there are to design something as simple as a bathroom. Here are some details I noticed in our guest house:

    Clockwise from top left: shower set into floor; sweet TP stash as design element; soap on a spike and exposed hot/cold pipes; toilet tank built into wall.

    [Matt] Walking around the city is colorful and delightful. Spotting an engaging piece of street art is a regular occurrence. A wonderful, regular occurrence.

    [Anastasia] After returning from Iceland for the first time a year ago, I found myself missing something I don’t normally give much thought to: Icelandic money is some of the loveliest currency I have ever seen.

    The banknotes are quite complex artistically, and yet every denomination abides by thoughtful design principles. Each banknote’s side shows either a culturally-significant figure or scene. The denomination is displayed prominently, the typography is ornate but consistent. The colors, beautiful.

    But what trumps the aesthetics is the banknotes’ dimensions. Icelandic paper money is sized according to amount: the 500Kr note is smaller than the 1000Kr note, which in turn is outsized by the 5000Kr note. This is incredibly important — it allows visually impaired people to move about more freely in the world.

    In comparison, our money looks silly and our treasury department negligent, as it is impossible to differentiate the values by touch alone. And, confoundingly, there don’t seem to be movements to amend this either: in 2015 the department made “strides” by announcing it would start providing money readers, little machines that read value to people who filled out what I’m sure is not a fun amount of paperwork, instead of coming up with a simple design solution.

    The coins are a different story. When I first arrived the clunky coins were a happy surprise — they’re delightfully weighty (maybe even a little too bulky for normally non-cash-carrying types), adorned with beautifully thoughtful designs. On one side of each of the coins (gold or silver), the denomination stands out in large type along with local sea creatures: a big Lumpfish, three small Capelin fish, a dolphin, a Shore crab.

    On the reverse side the four great guardians of Iceland gaze intensely. They are the dragon (Dreki), the griffin (Gammur), the bull (Griðungur), and the giant (Bergrisi), who each protected Iceland from Danish invasion in turn, according to the Saga of King Olaf Tryggvason. On the back of the 1 Krona, only the giant stands, commanding.

    And that’s it. No superfluous information. No humans, either, only mythology and fish.

    Returning home is a good thing, but sometimes it also means re-entering a world where money is just sad green rectangles (and oddly sized coins) full of earthly men.

    Villanova Library Technology Blog: CFP: Libraries and Archives in the Anthropocene: A Colloquium at NYU

    planet code4lib - Tue, 2016-04-12 21:54

    Call for Proposals

    Libraries and Archives in the Anthropocene: A Colloquium
    May 13-14, 2017
    New York University

    As stewards of a culture’s collective knowledge, libraries and archives are facing the realities of cataclysmic environmental change with a dawning awareness of its unique implications for their missions and activities. Some professionals in these fields are focusing new energies on the need for environmentally sustainable practices in their institutions. Some are prioritizing the role of libraries and archives in supporting climate change communication and influencing government policy and public awareness. Others foresee an inevitable unraveling of systems and ponder the role of libraries and archives in a world much different from the one we take for granted. Climate disruption, peak oil, toxic waste, deforestation, soil salinity and agricultural crisis, depletion of groundwater and other natural resources, loss of biodiversity, mass migration, sea level rise, and extreme weather events are all problems that indirectly threaten to overwhelm civilization’s knowledge infrastructures, and present information institutions with unprecedented challenges.

    This colloquium will serve as a space to explore these challenges and establish directions for future efforts and investigations. We invite proposals from academics, librarians, archivists, activists, and others.

    Some suggested topics and questions:

    • How can information institutions operate more sustainably?
    • How can information institutions better serve the needs of policy discussions and public awareness in the area of climate change and other threats to the environment?
    • How can information institutions support skillsets and technologies that are relevant following systemic unraveling?
    • What will information work look like without the infrastructures we take for granted?
    • How does information literacy instruction intersect with ecoliteracy?
    • How can information professionals support radical environmental activism?
    • What are the implications of climate change for disaster preparedness?
    • What role do information workers have in addressing issues of environmental justice?
    • What are the implications of climate change for preservation practices?
    • Should we question the wisdom of preserving access to the technological cultural legacy that has led to the crisis?
    • Is there a new responsibility to document, as a mode of bearing witness, the historical event of society’s confrontation with the systemic threat of climate change, peak oil, and other environmental problems?
    • Given the ideological foundations of libraries and archives in Enlightenment thought, and given that Enlightenment civilization may be leading to its own environmental endpoint, are these ideological foundations called into question? And with what consequences?

    Lightning talk (5 minutes)
    Paper (20 minutes)

    Proposals are due August 1, 2016.
    Notifications of acceptance will be sent by September 16, 2016.
    Submit your proposal here:

    Planning committee:



    SearchHub: Introducing Lucidworks View!

    planet code4lib - Tue, 2016-04-12 16:14

    Lucidworks is pleased to announce the release of Lucidworks View.

    View is an extensible search interface designed to work with Fusion, allowing for the deployment of an enterprise-ready search front end with minimal effort. View has been designed to harness the power of Fusion query pipelines and signals, and provides essential search capabilities including faceted navigation, typeahead suggestions, and landing page redirects.

    View showing automatic faceted navigation:

    View showing typeahead query pipelines, and the associated config file on the right:

    View is powered by Fusion, Gulp, AngularJS, and Sass allowing for the easy deployment of a sophisticated and customized search interface. All visual elements of View can be configured easily using SCSS styling.

    View is easy to customize; quickly change styling with a few edits:

    Additional features:

    • Document display templates for common Fusion data sources.
    • Included templates are web, file, Slack, Twitter, Jira and a default.
    • Landing Page redirects.
    • Integrates with Fusion authentication.

    Lucidworks View 1.0 is available for immediate download at

    Read the release notes or documentation, learn more on the Lucidworks View product page, or browse the source on GitHub.

    The post Introducing Lucidworks View! appeared first on

    Mark E. Phillips: DPLA Descriptive Metadata Lengths: By Provider/Hub

    planet code4lib - Tue, 2016-04-12 15:30

    In the last post I took a look at the length of the description fields for the Digital Public Library of America as a whole.  In this post I wanted to spend a little time looking at these numbers on a per-provider/hub basis to see if there is anything interesting in the data.

    I’ll jump right in with a table that shows all 29 of the providers/hubs that are represented in the snapshot of metadata that I am working with this time.  In this table you can see the minimum record length, max length, the number of descriptions (remember values can be multi-valued so there are more descriptions than records for a provider/hub),  sum (all of the lengths added together), the mean of the length and then finally the standard deviation.

    provider                     min      max      count          sum    mean   stddev
    artstor                        0    6,868    128,922    9,413,898   73.02   178.31
    bhl                            0      100    123,472      775,600    6.28     8.48
    cdl                            0    6,714    563,964   65,221,428  115.65   211.47
    david_rumsey                   0    5,269    166,313   74,401,401  447.36   861.92
    digital-commonwealth           0   23,455    455,387   40,724,507   89.43   214.09
    digitalnc                      1    9,785    241,275   45,759,118  189.66   262.89
    esdn                           0    9,136    197,396   23,620,299  119.66   170.67
    georgia                        0   12,546    875,158  135,691,768  155.05   210.85
    getty                          0    2,699    264,268   80,243,547  303.64   273.36
    gpo                            0    1,969    690,353   33,007,265   47.81    58.20
    harvard                        0    2,277     23,646    2,424,583  102.54   194.02
    hathitrust                     0    7,276  4,080,049  174,039,559   42.66    88.03
    indiana                        0    4,477     73,385    6,893,350   93.93   189.30
    internet_archive               0    7,685    523,530   41,713,913   79.68   174.94
    kdl                            0      974    144,202      390,829    2.71    24.95
    mdl                            0   40,598    483,086  105,858,580  219.13   345.47
    missouri-hub                   0  130,592    169,378   35,593,253  210.14  2325.08
    mwdl                           0  126,427  1,195,928  174,126,243  145.60   905.51
    nara                           0    2,000    700,948    1,425,165    2.03    28.13
    nypl                           0    2,633  1,170,357   48,750,103   41.65   161.88
    scdl                           0    3,362    159,681   18,422,935  115.37   164.74
    smithsonian                    0    6,076  2,808,334  139,062,761   49.52   137.37
    the_portal_to_texas_history    0    5,066  1,271,503  132,235,329  104.00    95.95
    tn                             0   46,312    151,334   30,513,013  201.63   248.79
    uiuc                           0    4,942     63,412    3,782,743   59.65   172.44
    undefined_provider             0      469     11,436        2,373    0.21     6.09
    usc                            0   29,861  1,076,031   60,538,490   56.26   193.20
    virginia                       0      268     30,174      301,042    9.98    17.91
    washington                     0    1,000     42,024    5,258,527  125.13   177.40
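For readers who want to reproduce columns like these, here is a minimal plain-Scala sketch of the per-provider summary, assuming the description lengths for one provider are already available as a sequence of integers. `LengthStats` and `summarize` are hypothetical names, and the population form of the standard deviation (dividing by the count) is an assumption; the post does not say which form was used.

```scala
object LengthStats {
  // Summary of the description lengths for a single provider/hub.
  final case class Summary(min: Int, max: Int, count: Int, sum: Long, mean: Double, stddev: Double)

  def summarize(lengths: Seq[Int]): Summary = {
    require(lengths.nonEmpty, "at least one description length is needed")
    val count = lengths.size
    // Sum as Long: totals in the table exceed the range of Int.
    val sum = lengths.map(_.toLong).sum
    val mean = sum.toDouble / count
    // Population variance (assumption): average squared distance from the mean.
    val variance = lengths.map(l => math.pow(l - mean, 2)).sum / count
    Summary(lengths.min, lengths.max, count, sum, mean, math.sqrt(variance))
  }
}
```

Because a record can carry several description values, `lengths` would hold one entry per description rather than one per record, which is why the count column is larger than the number of records.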

    This table is very helpful to reference as we move through the post but it is rather dense.  I’m going to present a few graphs that I think illustrate some of the more interesting things in the table.

    Average Description Length

    The first is to just look at the average description length per provider/hub to see if there is anything interesting in there.

    Average Description Length by Hub

    For me I see that there are several bars that are very small on this graph, specifically for the providers bhl, kdl, nara, undefined_provider, and virginia.  I also noticed that david_rumsey has the highest average description length, at about 450 characters.  Following david_rumsey is getty at about 300, and then mdl, missouri-hub, and tn, which are at about 200 characters for the average length.

    One thing to keep in mind from the previous post is that the average length for the whole DPLA was 83.32 characters in length, so many of the hubs were over that and some significantly over that number.

    Mean and Standard Deviation by Partner/Hub

    I think it is also helpful to take a look at the standard deviation in addition to just the average,  that way you are able to get a sense of how much variability there is in the data.

    Description Length Mean and Stddev by Hub

    There are a few providers/hubs that I think stand out from the others by looking at the chart. First david_rumsey has a stddev just short of double its average length.  The mwdl and the missouri-hub have a very high stddev compared to their average. For this dataset, it appears that these partners have a huge range in their lengths of descriptions compared to others.

    There are a few that have a relatively small stddev compared to the average length.  There are just two partners that actually have a stddev lower than the average, those being the_portal_to_texas_history and getty.
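The comparison of stddev against the average can be written down as a simple ratio. The sketch below uses a handful of rows copied from the table above; `VariabilitySketch` and `relativeSpread` are hypothetical names chosen for illustration, not something the post itself uses.

```scala
object VariabilitySketch {
  // (provider, mean, stddev) copied from the table for a handful of providers.
  val rows: Seq[(String, Double, Double)] = Seq(
    ("david_rumsey", 447.36, 861.92),
    ("getty", 303.64, 273.36),
    ("missouri-hub", 210.14, 2325.08),
    ("mwdl", 145.60, 905.51),
    ("the_portal_to_texas_history", 104.00, 95.95)
  )

  // Ratio of stddev to mean; a value below 1 means the spread is smaller than the average.
  def relativeSpread(mean: Double, stddev: Double): Double = stddev / mean

  // Providers in the sample whose stddev is lower than their mean.
  val belowOne: Seq[String] =
    rows.collect { case (name, mean, sd) if relativeSpread(mean, sd) < 1.0 => name }
}
```

On these rows, david_rumsey comes out just under 2, while missouri-hub is above 11, which matches the impression from the chart.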

    Longest Description by Partner/Hub

    In the last blog post we saw that there was a description that was over 130,000 characters in length.  It turns out that there were two partner/hubs that had some seriously long descriptions.

    Longest Description by Hub

    Remember the chart before this one that showed the average and the stddev next to each other for each provider/hub, where we saw a pretty large stddev for missouri-hub and mwdl? You may see why that is with the chart above.  Both of these hubs have descriptions of over 120,000 characters.

    There are six providers/hubs that have some seriously long descriptions: digital-commonwealth, mdl, missouri-hub, mwdl, tn, and usc.  I could be wrong but I have a feeling that descriptions that long probably aren’t that helpful for users and are most likely the full-text of the resource making its way into the metadata record.  We should remember that “metadata is data about data”… not the actual data.

    Total Description Length of Descriptions by Provider/Hub

    Total Description Length of All Descriptions by Hub

    Just for fun I was curious about how the total lengths of the description fields per provider/hub would look on a graph, those really large numbers are hard to hold in your head.

    It is interesting to note that hathitrust, which has the most records in the DPLA, doesn’t contribute the most description content. In fact the most is contributed by mwdl.  If you look into the sourcing of these records you will understand why: the majority of the records in the hathitrust set come from MARC records, which typically don’t have the same notion of “description” that records from digital libraries and formats like Dublin Core have. The provider/hub mwdl is an aggregator of digital library content and has quite a bit more description content per record.
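As a quick arithmetic check of that comparison, the mean column can be recomputed from the sum and count columns of the table. This is only a sketch; `MeanCheckSketch`, `mean`, and `round2` are hypothetical names.

```scala
object MeanCheckSketch {
  // Average description length = total characters / number of descriptions.
  def mean(sum: Long, count: Long): Double = sum.toDouble / count

  // Round to two decimal places, as in the table.
  def round2(x: Double): Double = math.round(x * 100) / 100.0

  // Sum and count values copied from the hathitrust and mwdl rows.
  val hathitrustMean: Double = round2(mean(174039559L, 4080049L))
  val mwdlMean: Double = round2(mean(174126243L, 1195928L))
}
```

The two totals are nearly equal, but hathitrust spreads its total over more than three times as many descriptions, so its average is far lower.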

    Other providers/hubs of note are georgia, mdl, smithsonian, and the_portal_to_texas_history which all have over 100,000,000 characters in their descriptions.

    Closing for this post

    Are there other aspects of this data that you would like me to take a look at?  One idea I had was to try and determine on a provider/hub basis what might be a notion of “too long” for a given provider based on some methods of outlier detection,  I’ve done the work for this but don’t know enough about the mathy parts to know if it is relevant to this dataset or not.
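One simple way to frame “too long” is a cutoff a fixed number of standard deviations above a provider's mean. The sketch below is only an illustration of that idea, not the method used for this post: `OutlierSketch` and `tooLong` are hypothetical names, the multiplier k is an arbitrary choice, and a mean-plus-stddev rule may suit heavily skewed length data poorly.

```scala
object OutlierSketch {
  def mean(xs: Seq[Int]): Double = xs.map(_.toLong).sum.toDouble / xs.size

  // Population standard deviation (assumption: divide by the count).
  def stddev(xs: Seq[Int]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  // Lengths more than k standard deviations above the mean for this provider.
  def tooLong(lengths: Seq[Int], k: Double): Seq[Int] = {
    require(lengths.nonEmpty, "at least one description length is needed")
    val cutoff = mean(lengths) + k * stddev(lengths)
    lengths.filter(_ > cutoff)
  }
}
```

A very long description inflates both the mean and the stddev it is compared against, so a stricter k can end up flagging nothing; that sensitivity is one reason to compare this against other outlier methods before trusting it.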

    I have about a dozen more metrics that I want to look at for these records so I’m going to have to figure out a way to move through them a bit quicker otherwise this blog might get a little tedious (more than it already is?).

    If you have questions or comments about this post,  please let me know via Twitter.

    Evergreen ILS: Statement on North Carolina House Bill 2

    planet code4lib - Tue, 2016-04-12 15:04

    Due to the recent passage of House Bill 2 in North Carolina, the Evergreen Oversight Board, on behalf of the Evergreen Project, has released the following statement. Our main concern is that our conference is a safe and inclusive space for all Evergreen Community members. While other organizations have cancelled their conferences in North Carolina over this matter, our conference was simply too imminent to move or cancel without significant harm to the Project and its members. Please feel free to contact the Evergreen Oversight Board if you have any questions about this statement.

    Grace Dunbar
    Chair, Evergreen Oversight Board

    Per the Evergreen Project’s Event Code of Conduct, Evergreen event organizers are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion. We do not tolerate harassment of event participants in any form. It is now important to reemphasize that commitment.

    In particular, the Evergreen Oversight Board is disappointed that the North Carolina General Assembly has chosen to pass legislation that nullifies city ordinances that protect LGBT individuals from discrimination, including one such ordinance passed by the Raleigh City Council. Since the 2016 Evergreen Conference is to be held in Raleigh, the Oversight Board is taking the following steps to protect conference attendees:

    • We are working with the conference venue and hotels to ensure that their staff and organizations will not discriminate on the basis of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion.
    • Members of the Evergreen Safety Committee will be available at the conference to advocate for and assist conference attendees, including accompanying attendees to and from the conference venues.
    • The Evergreen Project, via its fiscal agent Software Freedom Conservancy, will refund in full registrations from individuals who feel that they can no longer safely attend the conference. We’re committed to processing refunds requested on this basis, but it will take us some time to process them. Please be patient if you request a refund.

    Library of Congress: The Signal: Expanding NDSA Levels of Preservation

    planet code4lib - Tue, 2016-04-12 15:01

    This is a guest post by Shira Peltzman from the UCLA Library.

    Shira Peltzman. Photo by Alan Barnett.

    Last month Alice Prael and I gave a presentation at the annual Code4Lib conference in which I mentioned a project I’ve been working on to update the NDSA Levels of Digital Preservation so that it includes a metric for access. (You can see the full presentation on YouTube at the 1:24:00 minute mark.)

    For anyone who is unfamiliar with NDSA Levels, it’s a tool that was developed in 2012 by the National Digital Stewardship Alliance as a concise and user-friendly rubric to help organizations manage and mitigate digital preservation risks. The original version of the Levels of Digital Preservation includes four columns (Levels 1-4) and five rows. The columns/levels range in complexity, from the least you can do (Level 1) to the most you can do (Level 4). Each row represents a different conceptual area: Storage and Geographic Location, File Fixity and Data Integrity, Information Security, Metadata, and File Formats. The resulting matrix contains a tiered list of concrete technical steps that correspond to each of these preservation activities.

    It has been on my mind for a long time to expand the NDSA Levels so that the table includes a means of measuring an organization’s progress with regard to access. I’m a firm believer in the idea that access is one of the foundational tenets of digital preservation. It follows that if we are unable to provide access to the materials we’re preserving, then we aren’t really doing such a great job of preserving those materials in the first place.

    When it comes to digital preservation, I think there’s been an unfortunate tendency to give short shrift to access, to treat it as something that can always be addressed in the future. In my view, the lack of any access-related fields within the current NDSA Levels reflects this.

    Of course I understand that providing access can be tricky and resource-intensive in general, but particularly so when it comes to born-digital. From my perspective, this is all the more reason why it would be useful for the NDSA Levels to include a row that helps institutions measure, build, and enhance their access initiatives.

    While some organizations use NDSA Levels as a blueprint for preservation planning, other organizations — including the UCLA Library where I work — employ NDSA Levels as a means to assess compliance with preservation best practices and identify areas that need to be improved.

    In fact, it was in this vein that the need originally arose for a row in NDSA Levels explicitly addressing access. After suggesting that we use NDSA Levels as a framework for our digital preservation gap analysis, it quickly became apparent to me that its failure to address Access would be a blind spot too great to ignore.

    Providing access to the material in our care is so central to UCLA Library’s mission and values that failing to assess our progress/shortcomings in this area was not an option for us. To address this, I added an Access row to the NDSA Levels designed to help us measure and enhance our progress in this area.
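A gap analysis of this kind can be pictured as a small lookup over the matrix. The sketch below is a hypothetical illustration, not UCLA Library's actual process: `NdsaGapSketch`, `achievedLevel`, and `gaps` are invented names, and it assumes the levels are cumulative, so an area only counts as reaching a level if every lower level is also met.

```scala
object NdsaGapSketch {
  // The five original rows plus the proposed Access row.
  val Areas: Seq[String] = Seq(
    "Storage and Geographic Location",
    "File Fixity and Data Integrity",
    "Information Security",
    "Metadata",
    "File Formats",
    "Access"
  )

  // Highest level n such that levels 1..n are all met; 0 if Level 1 is not met.
  def achievedLevel(metLevels: Set[Int]): Int =
    (1 to 4).takeWhile(level => metLevels.contains(level)).lastOption.getOrElse(0)

  // Areas whose achieved level falls short of the target, with the level they reached.
  def gaps(assessment: Map[String, Set[Int]], target: Int): Seq[(String, Int)] =
    Areas
      .map(area => area -> achievedLevel(assessment.getOrElse(area, Set.empty[Int])))
      .filter { case (_, level) => level < target }
}
```

An area missing from the assessment is treated as level 0, so leaving Access out of the rubric altogether would show up as a gap rather than pass silently.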

    My aims in crafting the Access row were twofold: First, I wanted to acknowledge the OAIS reference model by explicitly addressing the creation of Dissemination Information Packages (which in turn necessitated mentioning other access-related terms like Designated Community, Representation Information and Preservation Description Information). This resulted in the row feeling rather jargon-heavy, so eventually I’d like to adjust this so that it better matches the tone and language of the other rows.

    Second, I tried to remain consistent with the model already in place. That meant designing the steps for each column/level so that they are both content agnostic and system agnostic and can be applied to various collections or systems. For the sake of consistency I also tried to maintain the sub-headings for each column/level (i.e., “protect your data,” “know your data,” “monitor your data,” and “repair your data”) even though some have questioned their usefulness in the past; for more on this, see the comments at the bottom of Trevor Owens’ blog post.

    While I’m happy with the end result overall, these categories map better in some instances than in others. I welcome feedback from you and the digital preservation community at large about how they could be improved. I have deliberately set the permissions to allow anyone to view/edit the document, since I’d like for this to be something to which the preservation community at large can contribute.

    Fortunately, NDSA Levels was designed to be iterative. In fact, in a paper titled “The NDSA Levels of Digital Preservation: An Explanation and Uses,” published shortly after NDSA Levels’ debut, its authors solicited feedback from the community and acknowledged future plans to revise the chart. Tools like this ultimately succeed because practitioners push for them to be modified and refined so that they can better serve the community’s needs. I hope that enough consensus builds around some of the updates I proposed for them to eventually become officially incorporated into the next iteration of the NDSA Levels if and when it is released.

    My suggested updates are in the last row of the Levels of Preservation table below, labeled Access. If you have any questions please contact me: Shira Peltzman, Digital Archivist, UCLA Library, (310) 825-4784.


    Level One
    (Protect Your Data) Level Two
    (Know Your data) Level Three
    (Monitor Your Data) Level Four
    (Repair Your Data) Storage and Geographic Location Two complete copies that are not collocated

    For data on heterogeneous media (optical disks, hard drives, etc.) get the content off the medium and into your storage system

    At least three complete copies

    At least one copy in a different geographic location/

    Document your storage system(s) and storage media and what you need to use them

    At least one copy in a geographic location with a different disaster threat

    Obsolescence monitoring process for your storage system(s) and media

    At least 3 copies in geographic locations with different disaster threats

    Have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems

    File Fixity and Data Integrity Check file fixity on ingest if it has been provided with the content

    Create fixity info if it wasn’t provided with the content

    Check fixity on all ingestsUse write-blockers when working with original media

    Virus-check high risk content

    Check fixity of content at fixed intervals

    Maintain logs of fixity info; supply audit on demand

    Ability to detect corrupt data

    Virus-check all content

    Check fixity of all content in response to specific events or activities

    Ability to replace/repair corrupted data

    Ensure no one person has write access to all copies

    Information Security Identify who has read, write, move, and delete authorization to individual files

    Restrict who has those authorizations to individual files

    Document access restrictions for content Maintain logs of who performed what actions on files, including deletions and preservation actions Perform audit of logs Metadata Inventory of content and its storage location

    Ensure backup and non-collocation of inventory

    Store administrative metadata

    Store transformative metadata and log events

    Store standard technical and descriptive metadata

    Store standard preservation metadata

    File Formats

    When you can give input into the creation of digital files, encourage use of a limited set of known open file formats and codecs

    Inventory of file formats in use

    Monitor file format obsolescence issues

    Perform format migrations, emulation and similar activities as needed

    Access

    Determine designated community [1]

    Ability to ensure the security of the material while it is being accessed. This may include physical security measures (e.g. someone staffing a reading room) and/or electronic measures (e.g. a locked-down viewing station, restrictions on downloading material, restricting access by IP address, etc.)

    Ability to identify and redact personally identifiable information (PII) and other sensitive material

    Have publicly available catalogs, finding aids, inventories, or collection descriptions so that researchers can discover material

    Create Submission Information Packages (SIPs) and Archival Information Packages (AIPs) upon ingest [2]

    Ability to generate Dissemination Information Packages (DIPs) on ingest [3]

    Store Representation Information and Preservation Description Information [4]

    Have a publicly available access policy

    Ability to provide access to obsolete media via its native environment and/or emulation

    [1] Designated Community essentially means “users”; the term comes from the Reference Model for an Open Archival Information System (OAIS).
    [2] The Submission Information Package (SIP) is the content and metadata received from an information producer by a preservation repository. An Archival Information Package (AIP) is the set of content and metadata managed by a preservation repository, and organized in a way that allows the repository to perform preservation services.
    [3] A Dissemination Information Package (DIP) is distributed to a consumer by the repository in response to a request, and may contain content spanning multiple AIPs.
    [4] Representation Information refers to any software, algorithms, standards, or other information that is necessary to properly access an archived digital file. Or, as the Preservation Metadata and the OAIS Information Model put it, “A digital object consists of a stream of bits; Representation Information imparts meaning to these bits.” Preservation Description Information refers to the information necessary for adequate preservation of a digital object, for example Provenance, Reference, Fixity, Context, and Access Rights Information.
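
The fixity checks named in the File Fixity and Data Integrity row can be illustrated with a short sketch. This is only one possible reading of "check fixity", not something taken from the Levels of Preservation themselves: it assumes SHA-256 as the checksum and a hypothetical `manifest` dictionary that maps file paths to the digests recorded at ingest.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Compare recorded digests with freshly computed ones.

    `manifest` (a hypothetical structure) maps a file path to the digest
    recorded at ingest. Returns the paths that are missing or whose
    current digest no longer matches the recorded one.
    """
    failures = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != recorded:
            failures.append(path)
    return failures
```

Running `check_fixity` at fixed intervals and logging its return value would roughly correspond to the "check fixity of content at fixed intervals" and "maintain logs of fixity info" items; repairing a failed file from another copy is outside this sketch.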

    Access Conference: Peer Reviewers Wanted!

    planet code4lib - Tue, 2016-04-12 13:22

    Interested in helping out with Access 2016? Looking to gain some valuable professional experience? The Access 2016 program committee is looking for volunteers to serve as peer-reviewers!

    If you’re interested, send us an email at by Friday, April 22, attaching a copy of your current CV and answers to the following:

    • Name
    • Current Position (student reviewers are also welcome!)
    • Institution
    • Have you attended Access before?
    • Have you presented at Access before?
    • Have you been a peer reviewer for Access before?

    Questions or comments? Drop us a line at

    DuraSpace News: Registration for the VIVO 2016 Conference is Now Open!

    planet code4lib - Tue, 2016-04-12 00:00

    From the VIVO 2016 Conference organizers

    Join us in beautiful Denver, Colorado, August 17 to 19 for the VIVO 2016 Conference. To reserve your hotel room at the VIVO conference discount, book now before rooms sell out.

    David Rosenthal: Brewster Kahle's "Distributed Web" proposal

    planet code4lib - Mon, 2016-04-11 20:21
    Back in August last year Brewster Kahle posted Locking the Web Open: A Call for a Distributed Web. It consisted of an analysis of the problems of the current Web, a set of requirements for a future Web that wouldn't have those problems, and a list of pieces of current technology that he suggested could be assembled into a working if simplified implementation of those requirements layered on top of the current Web. I meant to blog about it at the time, but I was busy finishing my report on emulation.

    Last November, Brewster gave the EE380 lecture on this topic (video from YouTube or Stanford), reminding me that I needed to write about it. I still didn't find time to write a post. On 8th June, Brewster, Vint Cerf and Cory Doctorow are to keynote a Decentralized Web Summit. I encourage you to attend. Unfortunately, I won't be able to, and this has finally forced me to write up my take on this proposal. Follow me below the fold for a brief discussion; I hope to write a more detailed post soon.

    I should start by saying that I agree with Brewster's analysis of the problems of the current Web, and his requirements for a better one. I even agree that the better Web has to be distributed, and that developing it by building prototypes layered on the current Web is the way to go in the near term. I'll start by summarizing Brewster's set of requirements and his proposed implementation, then point out some areas where I have concerns.

    Brewster's requirements are:
    • Peer-to-Peer Architecture to avoid the single points of failure and control inherent in the endpoint-based naming of the current Web.
    • Privacy to disrupt the current Web's business model of pervasive, fine-grained surveillance.
    • Distributed Authentication for Identity to avoid the centralized control over identity provided by Facebook and Google.
    • Versioning to provide the memory the current Web lacks.
    • Easy payment mechanism to provide an alternate way to reward content generators.
    There are already a number of attempts at partial implementations of these requirements, based as Brewster suggests on JavaScript, public-key cryptography, blockchain, Bitcoin, and Bittorrent. An example is IPFS (also here). Pulling these together into a coherent and ideally interoperable framework would be an important outcome of the upcoming summit.

    Thinking of these as prototypes, exploring the space of possible features, they are clearly useful. But we have known the risks of allowing what should be prototypes to become "interim" solutions since at least the early 80s. The Alto "Interim" File Server (IFS) was designed and implemented by David R. Boggs and Ed Taft in the late 70s. In 1977 Ed wrote:
    "The interim nature of the IFS should be emphasized. The IFS is not itself an object of research, though it may be used to support other research efforts such as the Distributed Message System. We hope that Juniper will eventually reach the point at which it can replace IFS as our principal shared file system."

    Because IFS worked well enough for people at PARC to get the stuff they needed done, the motivation to replace it with Juniper was never strong enough. The interim solution became permanent. Jim Morris, who was at PARC at the time, and who ran the Andrew Project at C-MU on which I worked from 1983-85, used IFS as the canonical example of a "success disaster", something whose rapid early success entrenches it in ways that cause cascading problems later.

    And in this case the permanent solution is at least as well developed as the proposed "interim" one. For at least the last decade, rather than build a “Distributed Web”, Van Jacobson and many others have been working to build a “Distributed Internet”. The Content-Centric Networking project at Xerox PARC, which has become the Named Data Networking (NDN) project spearheaded by UCLA, is one of the NSF’s four projects under the Future Internet Architecture Program. Here is a list of 68 peer-reviewed papers published in the last 7 years relating to NDN.

    By basing the future Internet on the name of a data object rather than the location of the object, many of the objectives of the “Distributed Web” become properties of the network infrastructure rather than something implemented in some people’s browsers.

    Another way of looking at this is that the current Internet is about moving data from one place to another, whereas NDN is about copying data. By making the basic operation in the net a copy, caching works properly (unlike in the current Internet). This alone is a huge deal, and not just for the Web. The Internet is more than just the Web, and the reasons for wanting to be properly “Distributed” apply just as much to the non-Web parts. And Web archiving should be, but currently isn't, about persistently caching selected parts of the Web.
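
To make the "name rather than location" idea concrete, here is a deliberately tiny sketch. It is not NDN: it assumes hash-based names (closer to what IPFS does), whereas NDN uses its own naming and signing scheme, and the `Node` class and its methods are hypothetical. The point it illustrates is only that when a request names the data, any node holding a verified copy can answer it.

```python
import hashlib


def name_for(data: bytes) -> str:
    """Derive a name from the content itself (hash-based naming, an assumption)."""
    return hashlib.sha256(data).hexdigest()


class Node:
    """A hypothetical node that caches any named data passing through it."""

    def __init__(self, upstream=None):
        self.cache = {}
        self.upstream = upstream

    def publish(self, data: bytes) -> str:
        """Store data under its content-derived name and return that name."""
        name = name_for(data)
        self.cache[name] = data
        return name

    def fetch(self, name: str) -> bytes:
        """Answer from the local cache, else copy from upstream and keep it."""
        if name not in self.cache:
            if self.upstream is None:
                raise KeyError(name)
            data = self.upstream.fetch(name)
            # The requester can verify the copy without trusting its source.
            if name_for(data) != name:
                raise ValueError("data does not match its name")
            self.cache[name] = data
        return self.cache[name]
```

For example, after `edge = Node(upstream=origin)` and one `edge.fetch(name)`, the edge node can keep answering requests for `name` even if the origin's copy disappears, which is the caching behaviour the paragraph above describes.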

    I should stress that I believe that implementing these concepts initially on top of IP, and even on top of HTTP, is a great and necessary idea; it is how NDN is being tested. But doing so with the vision that eventually IP will be implemented on top of a properly “Distributed” infrastructure is also a great idea; IP can be implemented on top of NDN. For a detailed discussion of these ideas see my (long) 2013 blog post reviewing the 2012 book Trillions.

    There are other risks in implementing Brewster's requirements using JavaScript, TCP/IP, the blockchain and the current Web:
    • JavaScript poses a fundamental risk, as we see from Douglas Crockford's attempt to define a "safe" subset of the language. It isn't clear that it is possible to satisfy Brewster's requirements in a safe subset of JavaScript, even if one existed. Allowing content from the Web to execute in your browser is a double-edged sword; it enables easy implementation of new capabilities, but if they are useful they are likely to pose a risk of being subverted.
    • Implementing anonymity on top of a communication infrastructure that explicitly connects endpoints turns out to be very hard. Both Tor and Bitcoin users have been successfully de-anonymized.
    • I have written extensively about the economic and organizational issues that plague Bitcoin, and will affect other totally distributed systems, such as the one Brewster wants to build. It is notable that Postel's Law (RFC 793) or the Robustness Principle has largely prevented these problems affecting the communication infrastructure level that NDN addresses.
    So there are very good reasons why this way of implementing Brewster's requirements should be regarded as creating valuable prototypes, but we should be wary of the Interim File System effect. The Web we have is a huge success disaster. Whatever replaces it will be at least as big a success disaster. Let's not have the causes of the disaster be things we knew about all along.

    Mark E. Phillips: DPLA Description Field Analysis: Yes there really are 44 “page” long description fields.

    planet code4lib - Mon, 2016-04-11 16:00

    In my previous post I mentioned that I was starting to take a look at the descriptive metadata fields in the metadata collected and hosted by the Digital Public Library of America.  That last post focused on records, how many records had description fields present, and how many were missing.  I also broke those numbers into the Provider/Hub groupings present in the DPLA dataset to see if there were any patterns.

    Moving on, the next thing I wanted to look at was data related to each instance of the description field. I parsed each of the description fields, calculated a variety of statistics for it, and then loaded the results into my current data analysis tool, Solr, which acts as my data store and my full-text index.

    After about seven hours of processing I ended up with 17,884,946 description fields from the 11,654,800 records in the dataset. You will notice that we have more descriptions than records; this is because a record can have more than one instance of a description field.

    Let's take a look at a few of the high-level metrics.


    I first wanted to find out the cardinality of the lengths of the description fields. When I indexed each of the descriptions, I counted the number of characters in the description and saved that as an integer in a field called desc_length_i in the Solr index. Once it was indexed, it was easy to retrieve the number of unique values for length that were present. There are 5,287 unique description lengths in the 17,884,946 descriptions that we are analyzing. This isn’t too surprising or meaningful by itself, just a bit of description of the dataset.
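
The counting described above was done in Solr, but the idea can be sketched in plain Python. This is an illustration under assumptions, not the code used for the post: each record is assumed to be a dict with an optional "description" key holding a list of strings, and the function names are made up for the example.

```python
from collections import Counter


def description_lengths(records):
    """Yield one character count per description field across all records.

    A record with no description contributes a single length of 0,
    mirroring how missing descriptions are treated in this analysis.
    """
    for record in records:
        descriptions = record.get("description") or [""]
        for text in descriptions:
            yield len(text)


def length_histogram(records):
    """Map each distinct length to the number of descriptions with that length."""
    return Counter(description_lengths(records))
```

With a histogram in hand, `len(histogram)` is the cardinality of the lengths and `sum(histogram.values())` is the total number of descriptions, which can exceed the number of records because a record may carry several description fields.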

    I tried to make a few graphs to show the lengths and how many descriptions had what length.  Here is what I came up with.

    Length of Descriptions in dataset

    You can barely see a blue line. The problem is that there are over 4 million zero-length descriptions, while the longest lengths occur only once each.

    Here is a second try using a log scale for the x axis.

    Length of Descriptions in dataset (x axis log)

    This reads a little better, I think. You can see that there is a dive down from zero lengths, and then at about 10 characters long there is a spike up again.

    One more graph to see what we can see, this time a log-log plot of the data.

    Length of Descriptions in dataset (log-log)

    Average Description Lengths

    Now that we are finished with the cardinality of the lengths,  next up is to figure out what the average description length is for the entire dataset.  This time the Solr StatsComponent is used and makes getting these statistics a breeze.  Here is a small table showing the output from Solr.

    min   max       count        missing   sum             sumOfSquares        mean    stddev
    0     130,592   17,884,946   0         1,490,191,622   2,621,904,732,670   83.32   373.71

    Here we see that the minimum length for a description is zero characters (a record without a description present has a length of zero for that field in this model). The longest description in the dataset is 130,592 characters long. The total number of characters present in the dataset was nearly one and a half billion. Finally, the number that we were after is the average length of a description, which turns out to be 83.32 characters.
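
As a sanity check, the mean and standard deviation can be recomputed from the count, sum, and sumOfSquares values reported above. The sketch below assumes the sample (n - 1) form of the standard deviation; with nearly 18 million values the population form gives practically the same result. The function name is made up for the example.

```python
import math


def mean_and_stddev(count, total, sum_of_squares):
    """Recover the mean and sample standard deviation from running totals."""
    mean = total / count
    # Sample variance written in terms of the running totals.
    variance = (count * sum_of_squares - total ** 2) / (count * (count - 1))
    return mean, math.sqrt(variance)


# Values copied from the statistics reported above.
mean, stddev = mean_and_stddev(17_884_946, 1_490_191_622, 2_621_904_732_670)
print(round(mean, 2), round(stddev, 2))
```

The printed values should land very close to the reported mean and stddev, which suggests the figures above are internally consistent.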

    For those that might be curious what 84 characters (I rounded up instead of down) of description looks like,  here is an example.

    Aerial photograph of area near Los Angeles Memorial Coliseum, Los Angeles, CA, 1963.

    So not a horrible looking length for a description.  It feels like it is just about one sentence long with 13 “words” in this sentence.

    Long descriptions

    Jumping back a bit to look at the length of the longest description field, that description is 130,592 characters long. If you assume that the average single-spaced page is 3,000 characters long, this description field is 43.5 pages long. A reader of this post who has spent time with aggregated metadata will probably say “looks like someone put the full-text of the item into the record”. If you’ve spent some serious (or maybe not that serious) time in the metadata mines (trenches?) you would probably mumble something like “ContentDM grumble grumble”, and you would be right on both counts. Here is the record on the DPLA site with the 130,592 character long description –

    The next thing I was curious about was the number of descriptions that were “long”.  To answer this I am going to require a little bit of back of the envelope freedom right now to decide what “long” is for a description field in a metadata record.  (In future blog posts I might be able to answer this with different analysis on the data but this hopefully will do for today.)  For now I’m going to arbitrarily decide that anything over 325 characters in length is going to be considered “too long”.

    Descriptions: Too Long and Not Too Long

    Looking at that pie chart, 5.8% of the descriptions are “too long” based on my ad-hoc metric from above. This 5.8% of the descriptions makes up 708,050,671, or 48%, of the 1,490,191,622 characters in the entire dataset. I bet if you looked a little harder you would find that the description field gets very close to the 80/20 rule, with 20% of the descriptions accounting for 80% of the overall description length.
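
The two shares quoted above (share of descriptions and share of characters) can be computed from a list of lengths as in the following sketch. The function name and the plain list input are assumptions for illustration; the 325-character threshold is the ad-hoc cut-off chosen earlier.

```python
def long_description_share(lengths, threshold=325):
    """Return (share of descriptions, share of characters) above the threshold."""
    lengths = list(lengths)
    if not lengths:
        return 0.0, 0.0
    long_lengths = [n for n in lengths if n > threshold]
    total_chars = sum(lengths)
    share_of_descriptions = len(long_lengths) / len(lengths)
    share_of_characters = sum(long_lengths) / total_chars if total_chars else 0.0
    return share_of_descriptions, share_of_characters
```

Re-running the same function with different thresholds would be one way to test the 80/20 hunch, by looking for the cut-off at which the longest 20% of descriptions hold about 80% of the characters.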

    Short descriptions

    Now that we’ve worked with long descriptions, the next thing we should look at is the number of descriptions that are “short” in length.

    There are 4,113,841 records that don’t have a description in the DPLA dataset. This means that for this analysis 4,113,841 (23%) of the descriptions have a length of 0. There are 2,041,527 (11%) descriptions that are between 1 and 10 characters long. Below is the breakdown of these ten counts; you can see that a surprising number (777,887) of descriptions have a single character as their descriptive contribution to the dataset.

    Descriptions 10 characters or less

    There is also an interesting spike at ten characters in length where suddenly we jump to over 500,000 descriptions in the DPLA.

    So what?

    Now that we have the average length of a description in the DPLA dataset, the number of descriptions that we consider “long”, and the number that we consider “short”, I think the very next question that gets asked is “so what?”

    I think there are four big reasons that I’m working on this kind of project with the DPLA data.

    One is that the DPLA is the largest aggregation of descriptive metadata in the US for digital resources in cultural heritage institutions. This is important because you get to take a look at a wide variety of data input rules, practices, and conversions from local systems to an aggregated metadata system.

    Secondly this data is licensed with a CC0 license and in a bulk data format so it is easy to grab the data and start working with it.

    Thirdly, there haven’t been that many studies on descriptive metadata like this that I’m aware of. OCLC will publish analysis on their MARC catalog data from time to time, and the research that was happening at UIUC in the GSLIS with IMLS-funded metadata isn’t going on anymore (great work to look at, by the way), so there really aren’t that many discussions about using large-scale aggregations of metadata to understand the practices in place in cultural heritage institutions across the US. I am pretty sure that there is work being carried out across the Atlantic with the Europeana datasets that are available.

    Finally I think that this work can lead to metadata quality assurance practices and indicators for metadata creators and aggregators about what may be wrong with their metadata (a message saying “your description is over a page long, what’s up with that?”).

    I don’t think there are many answers so far in this work, but I feel that these analyses are moving us in the direction of a better understanding of our descriptive metadata world in the context of these large aggregations of metadata.

    If you have questions or comments about this post,  please let me know via Twitter.

    District Dispatch: Involve young people in library advocacy

    planet code4lib - Mon, 2016-04-11 14:00

    Guest post by Katie Bowers, Campaigns Director for the Harry Potter Alliance.

    The Harry Potter Alliance (HPA) is an organization that uses the power of story to turn fans into heroes. Each spring, the HPA hosts an annual worldwide literacy campaign known as “Accio Books”. Started in 2009, Accio Books began as a book drive where HPA members donated over 30,000 books to the Agahozo Shalom Youth Village in Rwanda. Today, we’ve donated over 250,000 books to communities in need around the world.

    The HPA runs Accio Books every year because we believe that the power of story should be accessible to everyone. That’s why we’re so excited to be using our unique model to help young people advocate for libraries on National and Virtual Library Legislative Day (VLLD) this year! We’ve invited our members to call, write, and even visit their lawmakers in support of the ALA’s asks. Our members are excited: we have already received 475 pledges to send owls (letters and calls) to Washington on May 3.

    Want to get young people at your library excited for VLLD? The HPA has created a guide for library staff on using pop culture to help young people send their own owls to Washington. The first step? Start an HPA chapter! Chapters are fun, youth-led, and entirely free to start. As a chapter, library staff and your organizers get access to more free resources and lots of support from HPA staff and volunteers.

    Once you have your chapter, talk with members about why they love libraries, and how they feel about issues like accessibility, copyright, and technology. Brainstorming parallels from beloved stories can be a great way to have this conversation. For example, adequate funding is a major concern for libraries. Imagine – if the Hogwarts library had its funding cut, then Hermione, Harry and Ron might never have learned about the Sorcerer’s Stone! Use storytelling to help make VLLD understandable and relatable, and then ask your chapter what they want to do to support libraries.

    HPA members have come up with all sorts of creative ideas, from hosting letter writing socials and call centers to creating YouTube videos and Tumblr photosets celebrating libraries. You might be asking yourself, “Why would young people care?” Through HPA campaigns, young people made 3,000 phone calls for marriage equality, donated 250,000 books to libraries and literacy organizations worldwide, and organized over 20,000 YouTube video creators and fans to advocate for net neutrality. The truth is, young people want to make a difference, and advocacy can give them that chance. Virtual Library Legislative Day makes Washington advocacy available to anyone, but you and the HPA can bring the inspiration and excitement!

    You can check out the full VLLD resource on the HPA’s website, along with ideas for bringing Accio Books to your library and celebrating World Book Night! You can also learn how to help our chapter in Masaka, Uganda stock the shelves of a brand new school library. 

    Editor’s note: If you plan to participate in VLLD this year, let us know!

    The post Involve young people in library advocacy appeared first on District Dispatch.

