Planet Code4Lib - http://planet.code4lib.org

FOSS4Lib Recent Releases: TemaTres Vocabulary Server - 2.0

Tue, 2015-08-11 15:45

Last updated August 11, 2015. Created by Peter Murray on August 11, 2015.

Package: TemaTres Vocabulary Server
Release Date: Monday, August 10, 2015

David Rosenthal: Patents considered harmful

Tue, 2015-08-11 15:00
Although at last count I'm a named inventor on at least a couple of dozen US patents, I've long believed that the operation of the patent system, like the copyright system, is profoundly counter-productive. Since "reform" of these systems is inevitably hijacked by intellectual property interests, I believe that at least the patent system, if not both, should be completely abolished. The idea that an infinite supply of low-cost, government enforced monopolies is in the public interest is absurd on its face. Below the fold, some support for my position.

The Economist is out with a trenchant leader, and a fascinating article on patents, and they agree with me. What is more, they point out that they made this argument first on July 26th, 1851, 164 years ago:
the granting of patents “excites fraud, stimulates men to run after schemes that may enable them to levy a tax on the public, begets disputes and quarrels betwixt inventors, provokes endless lawsuits [and] bestows rewards on the wrong persons.” In perhaps our first reference to what are now called “patent trolls”, we fretted that “Comprehensive patents are taken out by some parties, for the purpose of stopping inventions, or appropriating the fruits of the inventions of others.”

Every one of these criticisms is as relevant today. Alas, even after pointing out the failure of previous reforms, The Economist's current leader ends up arguing for yet another round of "reform".

Even more than in the past those who extract rents via patents are able to divert a minuscule fraction of their ill-gotten gains to rewarding politicians who prevent or subvert reform. Although The Economist's proposals, including a "use-it-or-lose-it" rule, stronger requirements for non-obviousness, and shorter terms, are all worthy, they would all encounter resistance almost as fierce as abolition:
Six bills to reform patents in some way ... have been proposed to the current American Congress. None seeks abolition: any lawmaker brave enough to propose doing away with them altogether, or raising similar questions about the much longer monopolies given to copyright holders, would face an onslaught from the intellectual-property lobby.

The Economist's article draws heavily on the work of Michele Boldrin and David Levine:
Reviewing 23 20th-century studies [they] found “weak or no evidence that strengthening patent regimes increases innovation”—all it does is lead to more patents being filed, which is not the same thing. Several of these studies found that “reforms” aimed at strengthening patent regimes, such as one undertaken in Japan in 1988, for the most part boosted neither innovation nor its supposed cause, R&D spending.

The exception was interesting:
A study of Taiwan’s 1986 reforms found that they did lead to more R&D spending in the country and more American patents being granted to Taiwanese people and enterprises. This shows that countries whose patent protection is weaker than others’ can divert investment and R&D spending to their territory by strengthening it. But it does not demonstrate that the overall amount of spending or innovation worldwide has been increased.

It is clear that far too many patents are being filed and granted. First, companies need them for defense:
In much of the technology industry companies file large numbers of patents (see chart 2), but this is mostly to deter their rivals: if you sue me for infringing one of your thousands of patents, I’ll use one of my stash of patents to sue you back.

The number of patents in the stash, rather than the quality of those patents, is what matters for this purpose. And second, counting patents has become a (misleading) measure of innovation:
In some industries and countries they have become a measure of progress in their own right—a proxy for innovation, rather than a spur. Chinese researchers, under orders to be more inventive, have filed a flurry of patents in recent years. But almost all are being filed only with China’s patent office. If they had real commercial potential, surely they would also have been registered elsewhere, too. Companies pay their employees bonuses for filing patents, irrespective of their merits.

In the same way that rewarding authors by counting papers leads to the Least Publishable Unit phenomenon, and bad incentives in science, rewarding inventors by counting patents leads to the Least Patentable Unit phenomenon. It is made worse by the prospect of litigation. Companies file multiple overlapping patents on the same invention not just to inflate their number of trading beans, but also to ensure that, even if some are not granted, their eventual litigators will have as many avenues of attack against alleged infringement as possible.

Companies forbid their engineers and scientists to read patents, in case they might be accused of "willful infringement" and be subject to triple damages. Thus the idea that patents provide:

the tools whereby others can innovate, because the publication of good ideas increases the speed of technological advance as one innovation builds upon another.

is completely obsolete. The people who might build new innovations on the patent disclosures aren't allowed to know about them. The people providing the content as patent lawyers write new patents aren't allowed to know about related patents already issued. Further:
The evidence that the current system encourages companies to invest in research in a way that leads to innovation, increased productivity and general prosperity is surprisingly weak. A growing amount of research in recent years, including a 2004 study by America’s National Academy of Sciences, suggests that, with a few exceptions such as medicines, society as a whole might even be better off with no patents than with the mess that is today’s system.

The Economist noted that the original purpose of patents was that the state share in the rent they allowed to be extracted:
in the early 17th century King James I was raising £200,000 a year from granting patents.

The only thing that's changed is that the state now hands patents out cheaply, rather than charging the market rate for their rent-extraction potential. How about limiting the number of patents issued each year and auctioning the slots?

DPLA: New DPLA Contract Opportunity: Metadata Ingestion Development Contractor

Tue, 2015-08-11 14:45

The Digital Public Library of America (http://dp.la) invites interested and qualified individuals or firms to submit a proposal for development related to Heidrun and Krikri, DPLA’s metadata ingestion and aggregation systems.

Proposal Deadline: 5:00 PM EDT (GMT-04:00), August 31, 2015

Background

DPLA aggregates metadata for openly available digital materials from America’s libraries, archives and museums through its ingestion process. The ingestion process has three steps: 1) harvesting metadata from partner sources, 2) mapping harvested records to the DPLA Metadata Application Profile, an RDF model based on the Europeana Data Model, and 3) enriching the mapped metadata to clean and add value to the data (e.g. normalization of punctuation, geocoding, etc.). New metadata providers are subject to a quality assurance process that allows DPLA staff to verify the accuracy of metadata mappings, enrichments, and indexing strategies.
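
As a rough illustration of step 1, most hubs expose their metadata over OAI-PMH, a protocol Krikri already supports for harvesting; the requests a harvester issues look something like the following sketch, where the base URL and set name are placeholders rather than an actual hub:

# Hypothetical OAI-PMH harvest: ListRecords returns a page of simple Dublin Core records
curl "https://example-hub.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=example_set"
# Later pages are fetched with the resumptionToken returned in the previous response
curl "https://example-hub.org/oai?verb=ListRecords&resumptionToken=TOKEN_FROM_PREVIOUS_RESPONSE"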

DPLA technology staff has implemented these functions as part of Krikri, a Ruby on Rails engine which provides the core functionality for the DPLA ingestion process. Krikri includes abstract classes and implementations of harvester modules, a metadata mapping domain specific language, and a framework for building enrichments (Audumbla). DPLA deploys Krikri as part of Heidrun. More information about Heidrun can be found on its project page. Krikri uses Apache Marmotta as a backend triple store, PostgreSQL as a backend database, Redis and Resque for job queuing, and Apache Solr and Elasticsearch as search index platforms.

Krikri, Heidrun, Audumbla, and metadata mappings are released as free and open source software under the MIT License. All metadata aggregated by DPLA is released under a CC0 license.

Statement of Needs

The selected contractor will provide programming staff, as needed, for DPLA’s development of Krikri and Heidrun. These resources will be under the direction of Mark A. Matienzo, DPLA Director of Technology.

DPLA staff is geographically distributed, so there is no requirement for the contractor to be located in a particular place. Responses may provide options or alternatives so that DPLA gets the best value for the price. If the contractor’s staff is distributed, the response should include detail on how communications will be handled between the contractor and DPLA staff. We expect the contractor will provide a primary technical/operations contact and a business contact; these contacts may be the same person, but they must be identified in the response. In addition, we expect that the technical/operations contact and the business contact will be available for occasional meetings between 9:00 AM and 5:00 PM Eastern Time (GMT-04:00).

Core implementation needs include the following two tracks, with Track 1 being the primary deliverable. Work in Track 2, and other work identified by DPLA staff, is subject to available resources remaining in the contract.

  • Track 1 (highest priority; work to be completed by December 24, 2015):
    • 1a. Development of mappings for 20 DPLA hubs (providers) to be used by Heidrun using the Krikri metadata mapping DSL.
      • Harvested metadata includes, but is not limited to, the following schemas and formats: MARCXML, MODS, OAI Dublin Core, Qualified Dublin Core, JSON-LD.
      • Sample mappings can be found in our GitHub repository.
      • Mappings are understood to be specific to each DPLA hub.
      • Revisions and/or refactoring of the metadata mapping DSL implementation may be needed to support effective mapping.
    • 1b. Development of 5 harvesters for site-specific application programming interfaces, static file harvests, etc.
      • Krikri currently supports harvesting from OAI-PMH providers, CouchDB, and a sample generic API harvester. Heidrun includes an implementation of an existing site-specific API harvester.
    • 1c. As needed, development or modification of enrichment modules for Krikri.
  • Track 2 (additional development work to be completed as resources allow after Track 1):
    • Refactoring and development to allow Krikri applications to more effectively queue batches of jobs to improve concurrency and throughput.
    • Refactoring for support for Rails 4.2 and Blacklight 5.10+.
    • Expanding the Krikri “dashboard” staff-facing application, which currently supports the quality assurance process, to allow non-technical staff to start, schedule, and enqueue harvest, mapping, enrichment, and indexing activities.

All code developed as part of this contract is subject to code review by DPLA technology staff, using GitHub. In addition, implemented mappings will be subject to quality assurance processes led by Gretchen Gueguen, DPLA Data Services Coordinator.

Proposal guidelines

All proposals must adhere to the following submission guidelines and requirements.

  • Proposals are due no later than 5:00PM EDT, Monday, August 31, 2015.
  • Proposals should be sent via email to ingest-contract@dp.la, as a single PDF file attached to the message. Questions about the proposal can also be sent to this address.
  • Please format the subject line with the phrase “DPLA Metadata Ingestion Proposal – [Name of respondent]”.

All proposals should include the following:

  • Pricing, as an hourly rate in US Dollars, and as costs for each work item to be completed in Track 1
  • Proposed staffing plan, including qualifications of project team members (resumes/CVs, links or descriptions of previous projects such as open source contributions)
  • References, listing all clients/organizations with whom the proposer has done business similar to that required by this solicitation within the last 3 years
  • Qualifications and experience, including
    • General qualifications and development expertise
      • Information about development and project management skills and philosophy
      • Examples of successful projects, delivered on time and on budget
      • Preferred tools and methodologies used for issue tracking, project management, and communication
      • Preferences for change control tools and methodologies
    • Project specific strategies
      • History of developing software in the library, archives, or museum domain
      • Evidence of experience with Ruby on Rails, search platforms such as Solr and Elasticsearch, domain specific language implementations, and queuing systems
      • Information about experience with extract-transform-load workflows and/or metadata harvesting, mapping, and cleanup at scale, using automated processes
      • Information about experience with RDF metadata, triple stores, and implementations of W3C Linked Data Platform specification
Timeline
  • RFP issued: August 11, 2015
  • Work is to be performed no sooner than September 1, 2015.
  • Work for Track 1 must be completed by December 24, 2015.
  • Any additional work, such as Track 2 or other work mutually agreed upon by DPLA and contractor, is to be completed no later than March 31, 2016.
Contract guidelines
  • Proposals must be submitted by the due date.
  • Proposers are asked to guarantee their proposal prices for a period of at least 60 days from the date of the submission of the proposal.
  • Proposers must be fully responsible for the acts and omissions of their employees and agents.
  • DPLA reserves the right to include a mandatory meeting via teleconference to meet with submitters of the proposals individually before acceptance. Top scored proposals may be required to participate in an interview and/or site visit to support and clarify their proposal.
  • The contractor will enter into a contract with DPLA that is consistent with DPLA’s standard contracting policies and procedures.
  • DPLA reserves the right to negotiate with each contractor.
  • There is no allowance for project expenses, travel, or ancillary expenses that the contractor may incur.
  • Individuals or companies based outside the US are eligible to submit proposals, but will have to comply with US and host country labor and tax laws.
About DPLA

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated 11 million items from 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

FOSS4Lib Recent Releases: Koha - Security and maintenance releases - v 3.20.2, 3.18.9, 3.16.13

Tue, 2015-08-11 12:09
Package: Koha
Release Date: Thursday, July 30, 2015

Last updated August 11, 2015. Created by David Nind on August 11, 2015.

Monthly security and maintenance releases for Koha.

See the release announcements for the details:

SearchHub: Basics of Storing Signals in Solr with Fusion for Data Engineers

Tue, 2015-08-11 11:36

In April we featured a guest post Mixed Signals: Using Lucidworks Fusion’s Signals API, which is a great introduction to the Fusion Signals API. In this post I work through a real-world e-commerce dataset to show how quickly the Fusion platform lets you leverage signals derived from search query logs to rapidly and dramatically improve search results over a products catalog.

Signals, What’re They Good For?

In general, signals are useful any time information about outside activity, such as user behavior, can be used to improve the quality of search results. Signals are particularly useful in e-commerce applications, where they can be used to make recommendations as well as to improve search. Signal data comes from server logs and transaction databases which record items that users search for, view, click on, like, or purchase. For example, clickstream data which records a user’s search query together with the item which was ultimately clicked on is treated as one “click” signal and can be used to:

  • enrich the results set for that search query, i.e., improve the items returned for that query
  • enrich the information about the item clicked on, i.e., improve the queries for that item
  • uncover similarities between items, i.e., cluster items based on the queries for which they were clicked on
  • make recommendations of the form:
    • “other customers who entered this query clicked on that”
    • “customers who bought this also bought that”
Signals Key Concepts
  • A signal is a piece of information, event, or action, e.g., user queries, clicks, and other recorded actions that can be related back to a document or documents which are stored in a Fusion collection, referred to as the “primary collection”.
    • A signal has a type, an id, and a timestamp. For example, signals from clickstream information are of type “click” and signals derived from query logs are of type “query”.
    • Signals are stored in an auxiliary collection, and naming conventions link the two: the name of the signals collection is the name of the primary collection plus the suffix “_signals”.
  • An aggregation is the result of processing a stream of signals into a set of summaries that can be used to improve the search experience. Aggregation is necessary because in the usual case there is a high volume of signals flowing into the system but each signal contains only a small amount of information in and of itself.
    • Aggregations are stored in an auxiliary collection, and naming conventions link the two: the name of the aggregations collection is the name of the primary collection plus the suffix “_signals_aggr”.
    • Query pipelines use aggregated signals to boost search results.
    • Fusion provides an extensive library of aggregation functions allowing for complex models of user behavior. In particular, date-time functions provide a temporal decay function so that over time, older signals are automatically downweighted.
  • Fusion’s job scheduler provides the mechanism for processing signals and aggregations collections in near real-time.
Some Assembly Required

In a canonical e-commerce application, your primary Fusion collection is the collection over your products, services, customers, and similar. Event information from transaction databases and server logs would be indexed into an auxiliary collection of raw signal data and subsequently processed into an aggregated signals collection. Information from the aggregated signals collection would be used to improve search over the primary collection and make product recommendations to users.

In the absence of a fully operational ecommerce website, the Fusion distribution includes an example of signals and a script that processes this signal data into an aggregated signals collection using the Fusion Signals REST-API. The script and data files are in the directory $FUSION/examples/signals (where $FUSION is the top-level directory of the Fusion distribution). This directory contains:

  • signals.json – a sample data set of 20,000 signal events. These are ‘click’ events.
  • signals.sh – a script that loads signals, runs one aggregation job, and gets recommendations from the aggregated signals.
  • aggregations_definition.json – examples of how to write custom aggregation functions. These examples demonstrate several different advanced features of aggregation scripting, all of which are outside of the scope of this introduction.

The example signals data comes from a synthetic dataset over Best Buy query logs from 2011. Each record contains the user search query, the categories searched, and the item ultimately clicked on. In the next sections I create the product catalog, the raw signals, and the aggregated signals collections.

Product Data: the primary collection ‘bb_catalog’

In order to put the use of signals in context, first I recreate a subset of the Best Buy product catalog. Lucidworks cannot distribute the Best Buy product catalog data that is referenced by the example signals data, but that data is available from the Best Buy Developer API, which is a great resource both for data and example apps. I have a copy of previously downloaded product data which has been processed into a single file containing a list of products. Each product is a separate JSON object with many attribute-value pairs. To create your own Best Buy product catalog dataset, you must register as a developer via the above URL. Then you can use the Best Buy Developer API query tool to select product records or you can download a set of JSON files over the complete product archives.
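
For reference, a single product record can be pulled from the Best Buy Developer API with a request along these lines. The query syntax below is based on my recollection of their v1 products endpoint and should be checked against the current API documentation; YOUR_API_KEY is a placeholder for the key issued at registration, and the idea that the signals' docId values correspond to Best Buy SKUs is my assumption:

# Hypothetical lookup of one product by SKU from the Best Buy Developer API
curl "https://api.bestbuy.com/v1/products(sku=2125233)?apiKey=YOUR_API_KEY&format=json"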

I create a data collection called “bb_catalog” using the Fusion 2.0 UI. By default, this creates collections for the signals and aggregated signals as well.

Although the collections panel only lists collection “bb_catalog”, collections “bb_catalog_signals” and “bb_catalog_signals_aggr” have been created as well. Note that when I’m viewing collection “bb_catalog”, the URL displayed in the browser is: “localhost:8764/panels/bb_catalog”:

By changing the collection name to “bb_catalog_signals” or “bb_catalog_signals_aggr”, I can view the (empty) contents of the auxiliary collections:

Next I index the Best Buy product catalog data into collection “bb_catalog”. If you choose to get the data in JSON format, you can ingest it into Fusion using the “JSON” indexing pipeline. See blog post Preliminary Data Analysis in Fusion 2 for more details on configuring and running datasources in Fusion 2.

After loading the product catalog dataset, I check to see that collection “bb_catalog” contains the products referenced by the signals data. The first entry in the example signals file “signals.json” is a search query with query text: “Televisiones Panasonic 50 pulgadas” and docId: “2125233”. I do a quick search to find a product with this id in collection “bb_catalog”, and the results are as expected:

Raw Signal Data: the auxiliary collection ‘bb_catalog_signals’

The raw signals data in the file “signals.json” are the synthetic Best Buy dataset. I’ve modified the timestamps on the search logs in order to make them seem like fresh log data. This is the first signal (timestamp updated):

{ "timestamp": "2015-06-01T23:44:52.533Z", "params": { "query": "Televisiones Panasonic 50 pulgadas", "docId": "2125233", "filterQueries": [ "cat00000", "abcat0100000", "abcat0101000", "abcat0101001" ] }, "type": "click" },

The top-level attributes of this object are:

  • type – As stated above, all signals must have a “type”, and as noted in the earlier post “Mixed Signals”, section “Sending Signals”, the value should be applied consistently to ensure accurate aggregation. In the example dataset, all signals are of type “click”.
  • timestamp – This data has timestamp information. If not present in the raw signal, it will be generated by the system.
  • id – These signals don’t have distinct ids; they will be generated automatically by the system.
  • params – This attribute contains a set of key-value pairs, using a set of pre-defined keys which are appropriate for search-query event information. In this dataset, the information captured includes the free-text search query entered by the user, the document id of the item clicked on, and the set of Best Buy site categories that the search was restricted to. These are codes for categories and sub-categories such as “Electronics” or “Televisions”.

In summary, this dataset is an unremarkable snapshot of user behaviors between the middle of August and the end of October, 2011 (updated to May through June 2015).

The example script “signals.sh” loads the raw signal via a POST request to the Fusion REST-API endpoint: /api/apollo/signals/<collectionName> where <collectionName> is the name of the primary collection itself. Thus, to load raw signal data into the Fusion collection “bb_catalog_signals”, I send a POST request to the endpoint: /api/apollo/signals/bb_catalog.

Like all indexing processes, an indexing pipeline is used to process the raw signal data into a set of Solr documents. The pipeline used here is the default signals indexing pipeline named “_signals_ingest”. This pipeline consists of three stages, the first of which is a Signal Formatter stage, followed by a Field Mapper stage, and finally a Solr Indexer stage.
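
If you want to inspect those stages yourself, the pipeline definition can be fetched from Fusion's REST API. The path below reflects my understanding of the Fusion 2 Index Pipelines API and is worth confirming against your installation:

# Fetch the definition of the default signals indexing pipeline (sketch; verify the endpoint against your Fusion version)
curl -u admin:password123 http://localhost:8764/api/apollo/index-pipelines/_signals_ingest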

(Note that in a production system, instead of doing a one-time upload of some server log data, raw signal data could be streamed into a signals collection on an ongoing basis by using a Logstash or JDBC connector together with a signals indexing pipeline. For details on using a Logstash connector, see blog post on Fusion with Logstash).

Here is the curl command I used, running Fusion locally in single server mode on the default port:

curl -u admin:password123 -X POST -H 'Content-type:application/json' http://localhost:8764/api/apollo/signals/bb_catalog?commit=true --data-binary @new_signals.json

This command succeeds silently. To check my work, I use the Fusion 2 UI to view the signals collection, by explicitly specifying the URL “localhost:8764/panels/bb_catalog_signals”. This shows that all 20K signals have been indexed:

Further exploration of the data can be done using Fusion dashboards. To configure a Fusion dashboard using Banana 3, I specify the URL “localhost:8764/banana”. (For details and instructions on Banana 3 dashboards, see this post on log analytics). I configure a signals dashboard and view the results:

The top row of this dashboard shows that there are 20,000 clicks in the collection bb_catalog_signals that were recorded in the last 90 days. The middle row contains a bar chart showing the time at which the clicks came in and a pie chart display of the top 200 documents that were clicked on. The bottom row is a table over all of the signals – each signal records a single click.

The pie chart allows us to visualize a simple aggregation of clicks per document. The most popular document got 232 clicks, roughly 1% of the total clicks. The 200th most popular document got 12 clicks, and the vast majority of documents got only one click each. In order to use information about documents clicked on, we need to make this information available in a form that Solr can use. In other words, we need to create a collection of aggregated signals.
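
The per-document click counts behind that pie chart can also be pulled with a plain Solr facet query against the signals collection. The sketch below assumes Fusion's bundled Solr is listening on its default port (8983) and that the clicked document's id lands in a field named doc_id_s; both assumptions should be checked against your own signal documents:

# Sketch: top 10 most-clicked documents via a Solr facet over the signals collection
curl "http://localhost:8983/solr/bb_catalog_signals/select?q=*:*&rows=0&facet=true&facet.field=doc_id_s&facet.limit=10&wt=json"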

Aggregated Signals Data: the auxiliary collection ‘bb_catalog_signals_aggr’

Aggregation is the “processing” part of signals processing. Fusion runs queries over the documents in the raw signals collection in order to synthesize new documents for the aggregated signals collection. Synthesis ranges from counts to sophisticated statistical functions. The nature of the signals collected determines the kinds of aggregations performed. For click signals from query logs, the processing is straightforward: an aggregated signal record contains a search query, a count of the number of raw signals that contained that search query, and aggregated information from all of those raw signals: timestamps, ids of documents clicked on, and search query settings (in this case, the product catalog categories over which the search was carried out).

To aggregate the raw signals in collection “bb_catalog_signals” from the Fusion 2 UI, I choose the “Aggregations” control listed in the “Index” section of the “bb_catalog_signals” home panel:

I create a new aggregation called “bb_aggregation” and define the following:

  • Signal Types = “click”
  • Time Range = “[* TO NOW]” (all signals)
  • Output Collection = “bb_catalog_signals_aggr”

The following screenshot shows the configured aggregation. The circled fields are the fields which I specified explicitly; all other fields were left at their default values.

Once configured, the aggregation is run via controls on the aggregations panel. This aggregation only takes a few seconds to run. When it has finished, the number of raw signals processed and aggregated signals created are displayed below the Start/Stop controls. This screenshot shows that the 20,000 raw signals have been synthesized into 15,651 aggregated signals.

To check my work, I use the Fusion 2 UI to view the aggregated signals collection, by explicitly specifying the URL “localhost:8764/panels/bb_catalog_signals_aggr”. Aggregated click signals have a “count” field. To see the more popular search queries, I sort the results in the search panel on field “count”:

The most popular searches over the Best Buy catalog are for major electronic consumer goods: TVs and computers, at least according to this particular dataset.

Fusion REST-API Recommendations Service

The final part of the example signals script “signals.sh” calls the Fusion REST-API’s Recommendation service endpoints “itemsForQuery”, “queriesForItem”, and “itemsForItems”. The first endpoint, “itemsForQuery” returns the list of items that were clicked on for a query phrase. In the “signals.sh” example, the query string is “laptop”. When I do a search on query string “laptop” over collection “bb_catalog”, using the default search pipeline, the results don’t actually include any laptops:

With properly specified fields, filters, and boosts, the results could probably be improved. With aggregated signals, we see improvements right away. I can get recommendations using the “itemsForQuery” endpoint via a curl command:

curl -u admin:password123 http://localhost:8764/api/apollo/recommend/bb_catalog/itemsForQuery?q=laptop

This returns the following list of ids: [ 2969477, 9755322, 3558127, 3590335, 9420361, 2925714, 1853531, 3179912, 2738738, 3047444 ], most of which are popular laptops:
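
The companion endpoint “queriesForItem” runs in the other direction, returning the queries most associated with a given document. A sketch of the call is below; the docId parameter name is my assumption rather than something taken from the signals.sh script, so verify it against the Fusion REST API documentation:

# Sketch: queries associated with a given document (parameter name assumed)
curl -u admin:password123 "http://localhost:8764/api/apollo/recommend/bb_catalog/queriesForItem?docId=2125233"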

When not to use signals

If the textual content of the documents in your collection provides enough information such that for a given query, the documents returned are the most relevant documents available, then you don’t need Fusion signals. (If it ain’t broke, don’t fix it.) If the only information about your documents is the documents themselves, you can’t use signals. (Don’t use a hammer when you don’t have any nails.)

Conclusion

Fusion provides the tools to create, manage, and maintain signals and aggregations. It’s possible to build extremely sophisticated aggregation functions, and to use aggregated signals in many different ways. It’s also possible to use signals in a simple way, as I’ve done in this post, with quick and impressive results.

In future posts in this series, we will show you:

  • How to write query pipelines to harness this power for better search over your data, your way.
  • How to harness the power of Apache Spark for highly scalable, near-real-time signal processing.

The post Basics of Storing Signals in Solr with Fusion for Data Engineers appeared first on Lucidworks.

William Denton: Jesus, to his credit

Tue, 2015-08-11 01:30

A quote from a sermon given by an Anglican minister a couple of weeks ago: “Jesus, to his credit, was a lot more honourable than some of us would have been.”

DuraSpace News: 2016 DLF eResearch Network

Tue, 2015-08-11 00:00

From Rita Van Duinen, Council on Library and Information Resources (CLIR)/Digital Library Federation (DLF)

Interested in joining next year's Digital Library Federation’s (DLF) eResearch Network?

Karen Coyle: Google becomes Alphabet

Mon, 2015-08-10 22:24
I thought it was a joke, especially when the article said that they have two investment companies, Ventures and Capital. But it's all true, so I have this to say:

G is for Google, H is for cHutzpah. In addition to our investment companies Ventures and Capital, we are instituting a think tank, Brain, and a company focused on carbon-based life forms, Body. Servicing these will be three key enterprises: Food, Water, and Air. Support will be provided by Planet, a subsidiary of Universe. Of course, we'll also need to provide Light. Let there be. Singularity. G is for God.

Code4Lib: Code4Lib 2016

Mon, 2015-08-10 20:37

The Code4Lib 2016 Philadelphia Committee is pleased to announce that we have finalized the dates and location of the 2016 conference.

The 2016 conference will be held from March 7 through March 10 in the Old City District of Philadelphia. This location puts conference attendees within easy walking distance of many of Philadelphia’s historical treasures, including Independence Hall, the Liberty Bell, the Constitution Center, and the house where Thomas Jefferson drafted the Declaration of Independence. Attendees will also be a very short distance from the Delaware River waterfront and will be a short walk from numerous excellent restaurants.

As we’ll be reserving almost all of the space within the hotel for our conference (both rooms and conference spaces), Code4Lib 2016 will have the tight-knit community feel we know is important.

More details to come soon; in the meantime, the Keynote Committee is gearing up to open submissions for the conference keynote speaker, so be sure to contact them at pacella@gmail.com for more information, or you can nominate a keynote speaker at http://wiki.code4lib.org/2016_Invited_Speakers_Nominations.

Also, our Sponsorship Committee is actively looking for sponsors for 2016, so please contact the committee via Shaun Ellis to learn about all the ways your organization can help sponsor our 2016 conference.

It’s shaping up to be a great conference this year, and there will be lots more opportunities to volunteer. Our team is looking forward to seeing you on March 7!

The Code4Lib 2016 Philadelphia Committee

Islandora: Call for 7.x-1.6 Release Team volunteers -- We want you on our team!!!

Mon, 2015-08-10 18:30

The 7.x-1.6 Release Team will be working on the next release very soon, and you could be our very next release team member! Want some more motivation? Release team members get really awesome shirts!

We are looking for members for all four release team roles. We added a new auditor role this time around in order to break up some of the responsibilities between testers and documentors.

Release team roles

  • Documentors: Documentation will need to be updated for the next release. Any new components will also need to be documented. If you are interested in working on the documentation for a given component, please add your name to any component listed here.
     
  • Testers: All components with JIRA issues set to 'Ready for Test' will need to be tested and verified. Additionally, testers test the overall functionality of a given component. If you are interested in being a tester for a given component, please add your name to any component listed here. Testers will be provided with a release candidate virtual machine to do testing on.
     
  • Auditors: Each release we audit our README and LICENSE files. Auditors will be responsible for auditing a given component. If you are interested in being an auditor for a given component, please add your name to any component listed here.
     
  • Component managers: Are responsible for the code base of their components. If you are interested in being a component manager, please add your name to any component listed here.

Time lines

  • Code Freeze: Tuesday, September 1, 2015
  • Release Candidate: Tuesday, September 15, 2015
  • Release: Friday October 30, 2015

If you have any questions about being a member of the release team, feel free to ask here.
 

DPLA: New DPLA Job Opportunity: Ebook Project Manager

Mon, 2015-08-10 17:00

Come work with us! We’re pleased to share an exciting new DPLA job opportunity: Ebook Project Manager. The deadline to apply is August 31. We encourage you to share this posting far and wide!

Ebook Project Manager

The Digital Public Library of America seeks a full-time Ebook Project Manager to assist DPLA with its new ebook initiatives. The Ebook Project Manager should be a knowledgeable, creative community leader who can move our early stage ebook work from conversation to action. We are seeking a creative individual who demonstrates strong organizational and project management skills, with a broad knowledge of the ebook landscape. The Ebook Project Manager will work closely with the Business Development Director to develop DPLA’s ebook strategy and services, and will coordinate DPLA’s National Ebook Working Group, organize future meetings, and administer discrete pilots targeting key areas of our framework for ebooks.

Responsibilities of the Ebook Project Manager:

  • Serves as DPLA’s primary point person for service development, community engagement and other aspects of DPLA’s developing ebook program;
  • Leads community convenings; facilitates stakeholder conversations; and synthesizes issues, decisions, and system/service requirements;
  • Organizes and directs the DPLA ebook curation group;
  • Coordinates external communications to the broader DPLA community;
  • Works with DPLA network partners to identify and curate open content for use by content distribution partners.

Requirements for the position:

  • Strong knowledge of current ebook landscape, with a preference given to candidates who demonstrate deep understanding of the public library marketplace, publisher distribution/acquisition processes, and library collection development/acquisition workflow;
  • Understanding of the technology behind ebooks, including EPUB and EPUB conversion processes, web- and app-based display of ebooks;
  • Experience with project management, especially as it relates to large-scale digital projects;
  • MLS or equivalent experience with books, cataloguing, and metadata;
  • Demonstrated commitment to DPLA’s mission to maximize access to our shared culture.

This position is full-time, ideally situated either in DPLA’s Boston headquarters, or remotely in New York, Washington, or another location in the northeast corridor, but other locations will also be considered.

Like its collection, DPLA is strongly committed to diversity in all of its forms. We provide a full set of benefits, including health care, life and disability insurance, and a retirement plan. Starting salary is commensurate with experience.

Please send a letter of interest, a resume/cv, and contact information for three references to jobs@dp.la by August 31, 2015. Please put “Ebook Project Manager” in the subject line. Questions about the position may also be directed to jobs@dp.la.

About DPLA

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated 11 million items from 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

Islandora: Goodbye, Islandoracon. You were awesome.

Mon, 2015-08-10 14:43

Last week marked a huge milestone for the Islandora Community as we came together for our first full length conference, in the birthplace of Islandora at the University of Prince Edward Island in Charlottetown, PEI. With a final headcount of 80 attendees, a line up of 28 sessions and 16 workshops, and a day-long Hackfest to finish things off, there is a lot to reflect on.

Mark Leggott opened the week with a Keynote that looked back over the history of the Islandora project through the lens of evolution - from its single-celled days as an idea at UPEI to the "Futurzoic" era ahead of us. We spent the rest of Day One talking about repository strategies, how Islandora works as a community, and how Islandora can work for communities of end users. The day ended with a BBQ on the lawn of the Robertson Library and a first exposure to a variety of Canadian potato chip flavours (roast chicken flavour, anyone?)

Day Two split the conference into two tracks, which meant some tough choices between some really great sessions on Islandora tools, sites, migration strategies, working with the arts and digital humanities, and the future with Fedora 4. You can find the slides from many sessions linked in the conference schedule. We ended with beer, snacks, and brutally hard bar trivia.

Day Three launched two days of 90-minute workshops in two tracks, delving into the details of Islandora with some hands-on training from Islandora experts. We covered everything from the basics of setting up the Drupal side of an Islandora Site, to a detailed look at the Tuque API or mass-migrating content with Drush scripts. The social events continued as well, with our big conference dinner at Fishbones, complete with live music and an oyster bar on Wednesday night, and a seafood tasting hosted by conference sponsor discoverygarden, Inc, where this view from the deck was augmented with an actual, literal, rainbow:

Not pictured: Rainbow

After the workshops finished up on Thursday, we held the first Islandora Foundation AGM, where new Chairman Mark Jordan (Simon Fraser University) received the ceremonial screaming monkey and former Chairman Mark Leggott took a new place as the Foundation's Treasurer. There was also lively debate around the subject of individual membership in the IF (more on that in days to come). 

Finally, we had the Hackfest, which went off better than we could have hoped. In addition to some bug fixes and improvements, the teams of the Hackfest produced a whopping four new tools, one of which is so ready for use that it has been proposed for adoption in the next release. The Hackfest tools are:

With apologies to anyone whose name I've left out. It was a big crowd and everyone did great work.

From the conference planning team and the Islandora Foundation, thank you very much to our attendees for making our first conference a big success. We hope you enjoyed yourself and learned a ton. And we hope you'll join us again at our next conference!

Next up, for those who can't wait for the second Islandoracon: Islandora Camp CT in Hartford, Connecticut, October 20 - 23

Shelley Gullikson: Adventures in Information Architecture Part 2: Thinking Big, Thinking Small

Mon, 2015-08-10 12:45

When we last saw them in Part 1, our Web Committee heroes were stuck with a tough decision: do we shoehorn the Ottawa Room content into an information architecture that doesn’t really fit it, or do we try to revamp the whole IA?

There was much hand-wringing and arm-waving. (Okay, I did a lot of hand-wringing and arm-waving.) Our testing showed that users were either using Summon or asking someone to get information, and that when they needed to use the navigation they were stymied. Almost no one looked at the menus. What are our menus for if no one is using them? Are they just background noise? If so, should we just try to make the background noise more pleasant? What if the IA isn’t there primarily to organize and categorize our content, but to tell our users something about our library? Maybe our menus are grouping all the rocks in one spot and all the trees in another spot and all the sky bits somewhere else and what we really need to do is build a beautiful path that leads them…

Oh, hey, (said our lovely and fabulous Web Committee heroes) why don’t you slow down there for a second? What is the problem we need to solve? We’ve already tossed around some ideas that might help, why don’t we look at those to see if they solve our problem? Yes, those are interesting questions you have, and that thing about the beautiful path sounds swell, but… maybe it can wait.

And they kindly took me by the hand — their capes waving in the breeze — and led me out of the weeds. And we realized that we had already come up with a couple of solutions. We could use our existing category of “Research” (which up to now only had course guides and subject guides in it) to include other things like the resources in the Ottawa Room and all our Scholarly Communications / Open Access stuff. We could create a new category called “In the Library” (or maybe “In the Building” is better?) and add information about the physical space that people are searching our site for because it doesn’t fit anywhere in our current IA.

The more we talked about small, concrete ideas like this, the more we realized they might also help with some of the issues left back in the weeds. The top-level headings on the main page (and in the header menu) would read: “Find Research Services In the Building.” Which is not unpleasant background noise for a library.


DuraSpace News: NOW AVAILABLE: Fedora 4.3.0—Towards Meeting Key Objectives

Mon, 2015-08-10 00:00

Winchester, MA: On July 24, 2015, Fedora 4.3.0 was released by the Fedora team. Full release notes are included in this message and are also available on the wiki: https://wiki.duraspace.org/display/FF/Fedora+4.3.0+Release+Notes. This new version furthers several major objectives including:

  • Moving Fedora towards a clear set of standards-based services

  • Moving Fedora towards runtime configurability

Terry Reese: MarcEdit 6 Wireframes — Validating Headings

Sun, 2015-08-09 14:44

Over the last year, I’ve spent a good deal of time looking for ways to integrate many of the growing linked data services into MarcEdit.  These services, mainly revolving around vocabularies, provide some interesting opportunities for augmenting our existing MARC data, or enhancing local systems that make use of these particular vocabularies.  Examples like those at the Bentley (http://archival-integration.blogspot.com/2015/07/order-from-chaos-reconciling-local-data.html) are real-world demonstrations of how computers can take advantage of these endpoints when they are available. 

In MarcEdit, I’ve been creating and testing linking tools for close to a year now, and one of the areas I’ve been waiting to explore is whether libraries could utilize linking services to build their own authorities workflows.  Conceptually, it should be possible – the necessary information exists…it’s really just a matter of putting it together.  So, that’s what I’ve been working on.  Utilizing the linked data libraries found within MarcEdit, I’ve been working to create a service that will help users identify invalid headings and records where those headings reside.

Working Wireframes

Over the last week, I’ve prototyped this service.  The way that it works is pretty straightforward.  The tool extracts the data from the 1xx, 6xx, and 7xx fields, and if they are tagged as being LC controlled, I query the id.loc.gov service to see what information I can learn about the heading.  Additionally, since this tool is designed for work in batch, there is a high likelihood that headings will repeat – so MarcEdit is generating a local cache of headings as well – this way it can check against the local cache rather than the remote service when possible.  The local cache will be grown constantly, with materials set to expire after a month.  I’m still toying with what to do with the local cache, expirations, and what the best way to keep it in sync might be.  I’d originally considered pulling down the entire LC names and subjects headings – but for a desktop application, this didn’t make sense.  Together, these files, uncompressed, consumed GBs of data.  Within an indexed database, this would continue to be true.  And again, this file would need to be updated regularly.  So, I’m looking for an approach that will give some local caching, without the need to make the user download and manage huge data files.

Anyway – the function is being implemented as a Report.  Within the Reports menu in the MarcEditor, you will eventually find a new item titled Validate Headings.

When you run the Validate Headings tool, you will see the following window:

You’ll notice that there is a Source file.  If you come from the MarcEditor, this will be prepopulated.  If you come from outside the MarcEditor, you will need to define the file that is being processed.  Next, you select the elements to authorize.  Then click Process.  The Extract button will initially be disabled until after the data run.  Once completed, users can extract the records with invalid headings.

When completed, you will receive the following report:

This includes the total processing time, average response from LC’s id.loc.gov service, total number of records, and the information about how the data validated.  Below, the report will give you information about headings that validated, but were variants.  For example:

Record #846
Term in Record: Arnim, Bettina Brentano von, 1785-1859
LC Preferred Term: Arnim, Bettina von, 1785-1859

This would be marked as an invalid heading, because the data in the record is incorrect.  But the reporting tool will provide back the preferred LC label so the user can then see how the data should currently be structured.  Actually, now that I’m thinking about it – I’ll likely include one more value – the URI to the dataset, so you can actually go to the authority file page from this report.
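
For anyone curious what one of these lookups involves, id.loc.gov offers a label resolution service that can be exercised directly with curl. The request below is a sketch based on my understanding of that service: a recognized label (including, in my experience, many variant forms) comes back as a redirect whose headers identify the authority URI and preferred label, though the exact header names should be treated as an assumption:

curl -I "http://id.loc.gov/authorities/names/label/Arnim,%20Bettina%20Brentano%20von,%201785-1859"
# A matched heading returns a redirect; headers such as X-Uri and X-Preflabel
# point to the authority record and its preferred form (here, "Arnim, Bettina von, 1785-1859").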

This report can be copied or printed – and as I noted, when this process is finished, the Extract button is enabled so the user can extract the data from the source records for processing. 

Couple of Notes

So, this process takes time to run – there just isn’t any way around it.  For this set, there were 7702 unique items queried.  Each request from LC averaged 0.28 seconds.  In my testing, depending on the time of day, I’ve found that response rate can run between 0.20 seconds per request to 1.2 seconds per response.  None of those times are that bad individually, but taken in aggregate against 7700 queries – it adds up.  If you did the math, 7702*0.2 = 1540 seconds just to ask for the data.  Divide that by 60 and you get 25.6 minutes.  The total time to process means that there are 11 minutes of “other” things happening here.  My guess is that other 11 minutes is being eaten up by local lookups, character conversions (since LC requests UTF8 and my data was in MARC8) and data normalization.  Since there isn’t anything I can do about the latency between the user and the LC site – I’ll be working over the next week to try and remove as much local processing time from the equation as possible.

Questions – let me know.

–tr

Manage Metadata (Diane Hillmann and Jon Phipps): Five Star Vocabulary Use

Fri, 2015-08-07 18:50

Most of us in the library and cultural heritage communities interested in metadata are well aware of Tim Berners-Lee’s five star ratings for linked open data (in fact, some of us actually have the mug).

The five star rating for LOD, intended to encourage us to follow five basic rules for linked data is useful, but, as we’ve discussed it over the years, a basic question rises up: What good is linked data without (property) vocabularies? Vocabulary manager types like me and my peeps are always thinking like this, and recently we came across solid evidence that we are not alone in the universe.

Check out: “Five Stars of Linked Data Vocabulary Use”, published last year as part of the Semantic Web Journal. The five authors posit that TBL’s five star linked data is just the precondition to what we really need: vocabularies. They point out that the original 5 star rating says nothing about vocabularies, but that Linked Data without vocabularies is not useful at all:

“Just converting a CSV file to a set of RDF triples and linking them to another set of triples does not necessarily make the data more (re)usable to humans or machines.”

Needless to say, we share this viewpoint!

I’m not going to steal their thunder and list here all five star categories–you really should read the article (it’s short), but only note that the lowest level is a zero star rating that covers LD with no vocabularies. The five star rating is reserved for vocabularies that are linked to other vocabularies, which is pretty cool, and not easy to accomplish by the original publisher as a soloist.

These five star ratings are a terrific start to good practices documentation for vocabularies used in LOD, which we’ve had in our minds for some time. Stay tuned.

Patrick Hochstenbach: Penguin in Africa II

Fri, 2015-08-07 18:01
Filed under: Comics Tagged: africa, cartoon, comic, comics cartoons, inking, kinshasa, Penguin

District Dispatch: Envisioning copyright education

Fri, 2015-08-07 16:52

I have been an ALA employee for a while now, working primarily on copyright policy and education. During that time, I have worked with several librarian groups, taught a number of copyright workshops, and appreciate that more librarians have a better understanding of what copyright is than was true several years ago. Nonetheless, on a regular basis, librarians across the country, primarily academic but also school librarians, find themselves tasked with the assignment to be the “copyright person” for their library or educational institution. These new job responsibilities are usually unwanted, because the victims recognize that they don’t know anything about copyright. The fortunate among them make connections with more knowledgeable colleagues, or perhaps have the funding to attend a copyright workshop here or there that may be, but often is not, reliable. In short, their graduate degree in library and information science, accredited or not, has not prepared them for the assignment. Information policy course work in library school is limited to a discussion of censorship and Banned Books Week.

Sounds a bit harsh, doesn’t it?

I don’t expect or recommend that graduate students become fluent in the details of every aspect of the copyright law. What they do need to know is the purpose of the copyright law, why information professionals in particular have a responsibility for upholding balanced copyright law by representing the information rights of their communities, why information policy understanding must go hand in hand with librarianship, and of course, what is fair use? They need to understand copyright law as a concept, not a set of dos and don’ts.

Recently, this void in library and information science education is being investigated. I know several librarians that are conducting research on MLIS programs, the need for copyright education, how copyright is taught and the requirements of those teaching information policy courses. More broadly, the University of Maryland Information School published Re-envisioning the MLS: Findings, Issues and Considerations, the first year report of a three-year study on the future of the Masters of Library Science degree and how we prepare information professionals for their careers. If you already have your masters’ degrees, don’t feel left out. Look forward to new learning, knowing that not all of the old learning is for naught.  The values of librarianship have survived and will continue to be at the heart of what we need to know and do.

The post Envisioning copyright education appeared first on District Dispatch.

Open Knowledge Foundation: Onwards to AbreLatAm 2015: what we learned last year

Fri, 2015-08-07 14:44

This post was co-written by Mor Rubinstein and Neal Bastek. It is cross-posted and available in Spanish at the AbreLatAm blog.

AbreLatAm, for us “gringos”, is magical. Even in the age where everyone is glued to a screen, face to face connection is still the strongest connection humans can have; it fosters the trust that can lead to new cooperations and innovations. However, in the case of Latin America, it also creates a family. This feeling creates both a sense of solidarity and security that lets people share and consult about their open data and transparency issues with greater passion and awareness of the challenges and conditions we face daily in our own communities. It is unique, and difficult to replicate. You may not realise it, but in our experience, this feeling is not so common in other parts of the world, where the culture of work is more strict and, with all due respect for our differences, less personal. AbreLatAm therefore is a gift to the movement itself and not just to those of us lucky enough to attend.

For open data practitioners from outside of America Latina like us, AbreLatAm is a place to learn how communities evolve and how they work together. It is a place for us to listen, deeply. So, our command of the Spanish language is not so great (pero es mejor que ayer!) but we don’t need Spanish to feel the atmosphere, see the sparks and contribute, in English, with hand gestures to amplify the event. We try hard to understand the context and the words (and are grateful for the support we have from patient translators!) and have come to understand the unique problems in the region: for example, the high levels of corruption, the low levels of trust in government, and the highest rates of inequality in the world. However, other problems are universal, and we should all examine how to solve them together. The question is how?

The Open Knowledge Network has gained tremendous inspiration from AbreLatAm. What appeared early on as a good opportunity to promote the Global Open Data Index and build connections with the Latin American community has become so much more — a fertile ground for sharing and feedback. Some of the processes we are following in this year’s Index, such as our methodology consultation and dataset selection, are a direct result of our participation in AbreLatAm last year.

Neal and Mor promoting the Index at last year’s AbreLatAm

We are very excited to see what we will learn this year. As AbreLatAm matures, it also receives more attention and attracts more participants. AbreLatAm was, and still is, a pioneering community participatory event. The challenges now are about scaling, and it is a mirror to similar challenges around the globe. How can we harness the energy of an un-conference with such a vast number of participants? How can we go from talking and sharing to coordinated global action?

The movement’s ability to scale will only be a success if it’s rooted in community-based, citizen-driven needs and not handed down from on high by way of intellectual and academic arguments rooted in a Eurocentric experience. AbreLatAm is an ideal setting for discovering this demand in the Latin American context and matching and adapting it to global practices and experiences that have succeeded elsewhere, be it in the North or South! Likewise, the LATAM community has much to share in terms of their own experiences and successes, and at Open Knowledge we’re keenly interested in bringing those back to our global network for reflection and consideration.
