planet code4lib

Planet Code4Lib - http://planet.code4lib.org

Nicole Engard: Bookmarks for June 8, 2015

Mon, 2015-06-08 20:30

Today I found the following resources and bookmarked them on Delicious.

  • The bttn: Control almost any Internet-enabled device or service with the push of a button, or give it to your customer for super-easy ordering of your product or services
  • Guacamole — HTML5 Clientless Remote Desktop Guacamole is a clientless remote desktop gateway. It supports standard protocols like VNC and RDP. We call it clientless because no plugins or client software are required. Thanks to HTML5, once Guacamole is installed on a server, all you need to access your desktops is a web browser.
  • Internet of Things News – The CIO Report – WSJ
  • IoT News Network Updates from the Internet of Things

Digest powered by RSS Digest

The post Bookmarks for June 8, 2015 appeared first on What I Learned Today....

Related posts:

  1. Social Networking on your Desktop
  2. ATO2014: Open Source & the Internet of Things
  3. Library Association Rant

District Dispatch: Best practices webinar now available

Mon, 2015-06-08 20:21

An archive of the CopyTalk webinar on best practices for fair use, originally broadcast on May 7, 2015, is now available. You can also find PowerPoint slides from the presentation here.

Since 2005, practitioners including filmmakers, poets, K-12 teachers, film and communications scholars, open courseware providers, journalists and — of course — librarians have come together to articulate Codes and Statements of Best Practices in Fair Use for their communities. Many of these efforts have been facilitated by researchers at the American University Law School and School of Communications. In the last few months, two new documents have been released — the College Art Association’s Code of Best Practices for Fair Use in the Visual Arts and the Statement of Best Practices in Fair Use of Orphan Works for Libraries & Archives.

Our speaker: Peter Jaszi is a Professor of Law at American University Washington College of Law, where he teaches copyright law and courses in law and cinema, as well as supervising students in the Glushko-Samuelson Intellectual Property Law Clinic, which he helped to establish, along with the Program on Intellectual Property and Information Justice. He has served as a Trustee of the Copyright Society of the U.S.A. and is a member of the editorial board of its journal. In 2007, he received the American Library Association’s L. Ray Patterson Copyright Award, and in 2009 the Intellectual Property Section of the District of Columbia Bar honored him as the year’s Champion of Intellectual Property. He has written about copyright history and theory and co-authored a standard copyright textbook.

CopyTalk is a bimonthly webinar brought to you by the Office for Information Technology Policy’s subcommittee on Copyright Education.

CopyTalks are scheduled for the first Thursday of even-numbered months. The series is currently on hiatus; the next CopyTalk is scheduled for October 1, 2015.

The post Best practices webinar now available appeared first on District Dispatch.

District Dispatch: Library leaders to discuss veteran services at 2015 ALA Annual Conference

Mon, 2015-06-08 18:31

Cropped Photo by Adam Baker via Flickr

How can libraries better connect veterans and their families to veteran benefits and services? Learn more about providing veteran services at “Veterans Connect @ Your Library: Veterans Resource Centers in California Libraries,” an interactive conference session that will take place at the 2015 American Library Association (ALA) Annual Conference in San Francisco. The session takes place at the Moscone Convention Center from 3:00 to 4:00 p.m. on Saturday, June 27, 2015, in room 2014 in the West building.

During the session, library leaders will discuss the success of the California State Library’s Veterans Resource Centers, which connect veterans and their families to benefits and services for which they are eligible. A panel of expert leaders will detail ways that libraries can support returning veterans and positively change the way they participate in their communities. The conference session is cosponsored by the ALA Committee on Legislation’s (COL) E-government Services Subcommittee and the ALA Federal and Armed Forces Libraries Roundtable (FAFLRT).

Speakers
  • Moderator: Jennifer E. Manning, information research specialist, Congressional Research Service, Library of Congress; co-chair, American Library Association Subcommittee on E-Government Services
  • Christy Aguirre, branch supervisor, Sacramento Public Library
  • Karen Bosch Cobb, library consultant, Infopeople; co-manager, Veterans Connect @ the Library program in California
  • Kevin Graves, coordinator, Bay Area and North Coast Local Interagency Network, California Department of Veterans Affairs (CalVet)
  • Susan Hildreth, executive director, Peninsula Library System, Pacific Library Partnership and Califa; immediate past director, Institute of Museum and Library Services (IMLS)

View all ALA Washington Office conference sessions

The post Library leaders to discuss veteran services at 2015 ALA Annual Conference appeared first on District Dispatch.

Peter Murray: Seeking new opportunity in library technology

Mon, 2015-06-08 14:23

Dear Colleagues,

Know of someone looking for a skilled library technologist? The funding for my position at LYRASIS will run out at the end of June, and I am looking for a new opportunity for my skills in library technology, open source, and community engagement. My resume/c.v. is online, and I welcome any information about potential positions.

Over the course of my career I have worked in a variety of library environments ranging from medium- and large-sized academic institutions to a professional graduate school. Most recently I have worked for two consortia of academic libraries and a broad spectrum of archives and museums. I have used my skills to write local custom code when needed, adapt open source systems when possible, and contribute code to open source projects when able. I have also led teams to deploy open source software for cultural heritage organizations (most recently at LYRASIS). My professional activities include leading the Innovative Interfaces Users group, contributing to the development of the W3C’s Library Linked Data Incubator Group report, participating in NISO committees (Discovery to Delivery Topic Committee, Publications Committee, and NISO/OAI ResourceSync Protocol Working Group), and communing with fellow Code4Lib library technologists.

I place a high value on the development of relationships between organizations and professionals. At LYRASIS I was the principal investigator on an Andrew W. Mellon-funded project to identify and promote sustainable practices in open source software development, which culminated in two symposia: one invitational face-to-face and one that was open to a worldwide audience. I have also worked to build community consensus, particularly with several key OhioLINK consortium strategic initiatives such as research on e-textbook development and licensing, and leading a task force on a discovery system.

If these skills would be valuable to your organization or if you know of a position where I can use this experience, please e-mail me at jester@dltj.org.

Islandora: Islandora and the Public Knowledge Project Announce Strategic Relationship

Mon, 2015-06-08 14:04

Islandora and the Public Knowledge Project (PKP) are pleased to announce a strategic relationship to advocate for and support the use of open source technologies for scholarly publishing and research. Both Islandora and PKP are successful open source projects that serve a very similar community of academics and researchers. Both parties will encourage collaboration amongst their respective networks, partners and user communities to use “open” tools and technologies. One immediate example of this collaboration is Mark Jordan of the SFU Library, who has been active in PKP development activities for the last decade and is currently serving as a Director on the Board of the Islandora Foundation.

Brian Owen, PKP’s Managing Director stated: “PKP and Islandora are representative of the healthy and growing spectrum of open source solutions that provide more than just functional and cost-effective alternatives to proprietary software. It is also about openness, collaboration and self-sufficiency. Our communities are not just consumers but also creators and supporters of open source technology.”

Mark Leggott, Islandora Foundation Board Chair agreed: “Canada hits above its weight class in the development of strong open source software, especially in the areas of scholarly communication and digital preservation. The Islandora Foundation is looking forward to intersecting with the PKP community on new initiatives and innovations.”

About Islandora

Islandora is an open-source software framework designed to help institutions and organizations and their audiences collaboratively manage and discover digital assets using a best-practices framework. Islandora was originally developed by the University of Prince Edward Island's Robertson Library, but is now implemented and contributed to by an ever-growing international community. More information can be found at http://islandora.ca/

About PKP

The Public Knowledge Project was established in 1998 at the University of British Columbia. Since that time PKP has expanded and evolved into an international and virtual operation with two institutional anchors at Stanford University and Simon Fraser University Library. OJS is open source software made freely available to journals worldwide for the purpose of making open access publishing a viable option for more journals, as open access can increase a journal’s readership as well as its contribution to the public good on a global scale. More information about PKP and its software and services is available at pkp.sfu.ca.

LITA: Agile Development: Sprint Review

Mon, 2015-06-08 14:00

At the boundary between sprints, there are three different tasks that an Agile team should perform:

  1. Review and demo the work completed in the finished sprint (Sprint Review)
  2. Plan the next iteration (Sprint Planning)
  3. Evaluate the team’s performance during the sprint and look for improvements (Sprint Retrospective)

While it may be tempting to package that entire list into one long meeting, these should really be separate sessions. Each one requires a different focus and a unique cast of characters; plus, meetings can only last so long before efficiency plummets. Over the next set of posts I will be discussing each of these tasks separately, in the order in which I listed them. Even though planning happens first in any particular sprint, I prefer to look at the transition between sprints as a single process. So, we’ll start with Sprint Review.

Objective

As we’ve discussed previously, one of the core values of Agile is to prioritize working software; an Agile team should deliver a potentially shippable product at the end of each sprint. This approach increases flexibility (each new sprint is an opportunity to begin with a clean slate and completely change direction if necessary), and forces the entire team (on both the technology and business sides) to optimize feature design. Therefore, at the end of every sprint, the team should be able to demo a new piece of working software, and that is indeed what the focus of the Sprint Review meeting should be.

Participants

This should be the most crowded meeting out of the three; it should be open to all project stakeholders. The development team should attend and be ready to demo their product. This should be an informal presentation, focused on the software itself rather than presentation slides or backlog lists. A more formal approach would take the focus away from the product itself, and it would mean unnecessary prep work for the developers. The team knows what they built, and they should present that to the rest of the participants.

The Product Owner will be there to compare the completed product to the predefined sprint goals, but the focus should be on whether the software fulfills user needs, rather than checking off assigned tickets. The team may have found a better way to solve the problem during the sprint, and the Product Owner would have been consulted at that time, so there should be no surprises.

The demo should also include the Scrum Master, organization management, customers (if possible), and anyone else who is invested in the success of the project. Developers from other teams should be allowed to attend if desired, as a way to promote cross-team communication and organizational knowledge sharing.

Meeting Agenda

The Product Owner should begin the meeting by briefly articulating the sprint goals and providing context for attendees. Again, this shouldn’t be a big formal presentation. Sprints typically only last 1-3 weeks, so no one should be taking up valuable hours getting ready for a review meeting. Each participant is only asked to talk about her specific areas of expertise, so she should be able to almost improvise her remarks.

Next, the development team will demo the completed work, within an environment that mimics production as closely as possible. If the object of the sprint is to produce releasable software, the demo should feel as if the software has been released, so the audience can get a feel for the true user experience. Hard-coded constants and mock-ups should be used minimally, if at all.

Next, the audience should ask questions and offer feedback on the demo, followed by the Product Owner describing next steps and goals. The purpose of this meeting is not for the Product Owner to approve completed work, or for the development team to share their work with one another; that will have happened during the course of the sprint. It is for the PO and the development team to showcase their work and receive feedback on whether the product fulfills the needs of the user.

If you want to learn more about sprint review meetings, you can check out the following resources:

I’ll be back next month to discuss sprint planning.

What are your thoughts on how your organization implements sprint reviews? Who should (or shouldn’t) be present at a sprint review demo? What do you do about work that has not been completed by the end of the iteration?

"BIS-Sprint-Final-24-06-13-05" image by Birkenkrahe (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Tara Robertson: Clint Lalonde’s post On Using OpenEd: An Opprotunity

Sat, 2015-06-06 13:49

 

This was posted on Clint’s blog clintlalonde.net on June 1, 2015.

For the past 6 months my organization BCcampus has been in a dispute with the University of Guelph over our use of this:

Current BCcampus Open Education logo

Like many of you, we have always used the term OpenEd as a short form way of saying Open Education. It’s a term that is familiar to anyone working in the field of open education. In our community, many of us host forums and events using the term OpenEd. Around the world, people write blog posts, create websites, and host conferences using the term OpenEd. Our global community uses the term OpenEd interchangeably with Open Education to mean a series of educational practices and processes built on a foundation of collaboration and sharing.

BCcampus has been working with higher education institutions in British Columbia for over a decade on open education initiatives, so when it came time to redesign our main open education website (open.bccampus.ca), it was only natural that we would gravitate to the term that many people in BC and beyond associate with us: OpenEd. Our graphic designer, Barb Murphy, developed this logo in the fall of 2013 and, at the end of November, 2013, we launched our new website with our new OpenEd logo. We thought nothing of it and went along our merry way chugging along on the BC Open Textbook Project.

Little did we know that, on December 18, 2013, the University of Guelph trademarked OpenEd.

Last fall, we received an email from UGuelph asking us to stop using OpenEd. At first, we thought it was a joke. Someone trademarking OpenEd? Anyone involved in the open education community would realize how ridiculous that sounds. But after numerous emails, it became apparent that they were, indeed, serious about wanting us to stop using OpenEd.

We went back and forth with Guelph until it became apparent that they were not going to give up on their trademark claim, but for the cost of their legal paperwork to write up a permission contract ($500), they would allow us to use the term in perpetuity to describe any open education activities in BC that we were associated with.

We considered the offer, and thought it a fair request from Guelph. They didn’t ask us for a licensing fee. They would give us the rights to use the mark for basically the cost of their lawyers writing up the contract. $500 is not a lot of money.

But then we thought about the rest of the open education community in Canada and how they would not be able to use the term unless they negotiated with Guelph as well. And we thought that, if we agreed to the terms, we would be legitimizing a claim that runs against the very ethos of what we practice. We decided we couldn’t do it.

Then we thought perhaps we should fight and win the mark back? Wrestle the trademark from Guelph and then turn around and release the trademark with a CC0 license for the entire community to use (even Guelph). We thought we could prove our prior use, not only based on the fact that we started using the logo on our new website weeks before their claim was finalized in December of 2013, but going back even further to the 2009 OpenEd conference BCcampus sponsored at UBC in Vancouver where a wordmark very similar to what Guelph has trademarked was first used.

The 2009 Open Education Conference Logo. The conference was at UBC and sponsored by BCcampus

But after speaking with a lawyer, we discovered that the best we could do is win prior use rights for BCcampus, which would be good for BCcampus, but lousy for the entire open education community.

So in the end, we have decided to change. We are currently working on dropping the term OpenEd from our logo and replacing it with the words Open Education.

This will not be cheap for us. The redesign is simple, but that BCcampus OpenEd mark is used in many places. Most notably, we now have to redo the covers for close to 90 textbooks in our open textbook collection as that OpenEd mark appears on the cover of every book.

Each cover on every open textbook in our collection needs to be changed

And then once the cover is changed, we need to update 3 different websites where that cover might be used. Plus, we have created a ton of additional material that has the mark OpenEd on it that will now need to be scrapped.

In my mind, however, this is the right move. If BCcampus pays even a modest fee, then we accept that it is ok to copyright and trademark something that, I believe, should rightly belong to the community. Given my own personal values around openness and sharing of resources, it’s a bargain I did not want to make. And it doesn’t make sense to fight a battle that will win a victory for BCcampus, but not for the wider open education community. It would feel less than hollow.

So, we change.

The opportunity. If you are from Guelph and are reading this, there is another alternative. You have the trademark to the OpenEd mark. You control the IP. You can always choose to release the mark with a Creative Commons license and show the wider open education community that you understand the community and the open values that drive our work in education every day. You can be a leader here by taking the simple act of licensing your mark with a CC license and releasing it to the community for everyone to use.

Update June 2, 2015: Trademarks and copyright are different ways to protect intellectual property, and the suggestion I made in the post is probably too simplistic a wish, as CC licenses are meant to alleviate copyright, not trademark, restrictions (h/t to David Wiley for pointing me to this distinction). However, it appears that the two can co-exist and you can openly license and protect trademarks at the same time, as this document from Creative Commons on trademarks & copyright suggests.

SearchHub: Query Autofiltering Extended – On Language and Logic in Search

Sat, 2015-06-06 13:20
Spock: "The logical thing for you to have done was to have left me behind." McCoy: "Mr. Spock, remind me to tell you that I'm sick and tired of your logic."

This is the third in a series of blog posts on a technique that I call Query Autofiltering – using the knowledge built into the search index itself to do a better job of parsing queries and therefore giving better answers to user questions. The first installment set the stage by arguing that a better understanding of language, and how it is used when users formulate queries, can help us to craft better search applications – especially how adjectives and adverbs, which can be thought of as attributes or properties of subject or action words (nouns and verbs), should be made to refine rather than to expand search results – and why the OOTB search engine doesn't do this correctly. Solving this at the language level is a hard problem. A more tractable solution involves leveraging the descriptive information that we may have already put into our search indexes for the purposes of navigation and display to parse or analyze the incoming query. Doing this enables us to produce results that more closely match the user's intent.

The second post describes an implementation approach using the Lucene FieldCache that can automatically detect when terms or phrases in a query are contained in metadata fields and then use that information to construct more precise queries. So rather than searching and then navigating, the user just searches and finds (even if they are not feeling lucky).

An interesting problem developed from this work: what to do when more than one term in a query matches the same metadata field? It turns out that the answer is one of the favorite phrases of software consultants – "It Depends". It depends on whether the field is single- or multi-valued. Understanding why this is so leads to a very interesting insight – logic in language is not ambiguous, it is contextual, and part of the context is knowing what type of field we are talking about. Solving this enables us to respond correctly to boolean terms ("and" and "or") in user queries, rather than simply ignoring them (by treating them as stop words) as we typically do now.

Logic in Mathematics vs Logic in Language

Logic is of course fundamental to both mathematics and language. It is especially important in computer engineering, as it forms the operational basis of the digital computer. Another area where logic reigns is Set Theory – the mathematics of groups – and it is in this arena that language and search collide, because search is all about finding a set of documents that match a given query (sets can have zero or more elements in them). When we focus on the mathematical aspects of sets, we need to define precise operators to manipulate them – intersection, union, exclusion – AND, OR, NOT, etc. Logic in software needs to be explicit or handled with global defaults. Logic in language is contextual – it can be implicit or explicit.

An example of implied logic is the use of adjectives as refiners, such as the "red sofa" example that I have been using. Here, the user is clearly looking for things that are sofas AND are red in color. If the user asks for "red or blue sofas", there are two logical operators, one implied and one explicit – they want to see sofas that are either red or blue. But what if the user asks for "red and blue sofas"? They could be asking to see sofas in both colors if referring to sets, or individual sofas that have both red and blue in them.
So this is somewhat ambiguous, because the refinement field "color" is not clearly defined yet – can a single sofa have more than one color or just one? Let's choose something that is definitely single-valued – size. If I say "show me large or extra-large t-shirts", the language use of logic is the same as the mathematical one, but if I say "show me large and extra-large t-shirts" it is not. Both of these phrases in language mean the same thing, because we instinctively understand that a single shirt has only one size: if we use "and" we mean "show me shirts in both sizes" and for "or" we mean "show me shirts of either size", which in terms of set theory translates to the same operation – UNION or OR. In other words, "and" and "or" are synonyms in this context! For search, only OR can be supported for single-value fields, because using AND gives the non-result – zero records.

The situation is not the same when dealing with attributes for which a single entity can have more than one value. If I say "show me shirts that are soft, warm and machine-washable", then I mean the intersection of these attributes – I only want to see shirts that have all of these qualities. But if I say "show me shirts that are comfortable or lightweight", I expect to see shirts with at least one of these attributes, or in other words the union of comfortable and lightweight shirts. "And" and "or" are now antonyms, as they are in mathematics and computer science. It also makes sense from a search perspective, because we can use either AND or OR in the context of a multi-value field and still get results. Getting back to implied vs. explicit, it is AND that is implied in this case, because I can say "show me soft, warm, machine-washable shirts", which means the same as "soft, warm and machine-washable shirts".

So we conclude that how the vernacular use of "and" and "or" should be interpreted depends on whether the values for that attribute are exclusive or not (i.e. single- or multi-valued). That is, "and" means "both" (or "all" if more than two values are given) and "or" means "either" (or "any", respectively). For "and", if the attribute is single-valued we mean "show me both things"; if it is multi-valued we mean "show me things with both values". For "or", single-valued attributes translate to "show me either thing" and multi-valued to "show me things with either value". As Mr. Spock would say, it's totally logical (RIP Leonard – we'll miss you!)

Implementing Contextual Logic in Search

Armed with a better understanding of how logic works in language and how that relates to the mathematics of search operations, we can do a better job of responding to implied or explicit logic embedded in search queries – IF we know how terms map to fields and what type of fields they map to. It turns out that the Query Autofiltering component can give us this context – it uses the Lucene FieldCache to create a map of field values to field names – and once it knows what field a part of the query maps to, it knows whether that field is single- or multi-valued. So given this, if there is more than one value for a given field in a query and the field is single-valued, we always use OR. If the field is multi-valued, then we use AND if no operator is specified, and OR if that term is used within the positional context of the set of field values. In other words, we see if the term "or" occurs somewhere between the first and last instances of a particular field, such as in "lightweight or comfortable".
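To make the decision rule concrete, here is a minimal, self-contained Java sketch of my own (it is not part of the original component; the class name, method and example values are invented for illustration) that applies the same single- vs. multi-valued logic to the t-shirt examples above:

import java.util.Arrays;
import java.util.List;

// Illustrative only: a stripped-down version of the "and"/"or" decision rule,
// with no Solr dependencies.
public class ContextualBooleanSketch {

    // Build a Lucene-style filter clause such as size:(large OR extra-large).
    static String buildClause(String fieldName, boolean multiValued,
                              List<String> values, List<String> queryTokens,
                              int firstPos, int lastPos) {
        // Single-valued fields can only be ORed: one shirt has one size, so
        // "large and extra-large t-shirts" still means the union of both sizes.
        boolean useAnd = multiValued;
        if (useAnd) {
            // Multi-valued fields default to AND unless the user put an "or"
            // somewhere between the first and last matched values.
            for (int i = firstPos + 1; i < lastPos; i++) {
                if (queryTokens.get(i).equalsIgnoreCase("or")) {
                    useAnd = false;
                    break;
                }
            }
        }
        StringBuilder clause = new StringBuilder();
        for (String value : values) {
            if (clause.length() > 0) clause.append(useAnd ? " AND " : " OR ");
            clause.append(value);
        }
        return fieldName + ":(" + clause + ")";
    }

    public static void main(String[] args) {
        // "large and extra-large t-shirts": size is single-valued, so OR.
        System.out.println(buildClause("size", false,
                Arrays.asList("large", "extra-large"),
                Arrays.asList("large", "and", "extra-large", "t-shirts"), 0, 2));
        // "soft, warm and machine-washable shirts": style is multi-valued, so AND.
        System.out.println(buildClause("style", true,
                Arrays.asList("soft", "warm", "machine-washable"),
                Arrays.asList("soft", "warm", "and", "machine-washable", "shirts"), 0, 3));
        // "lightweight or comfortable shirts": multi-valued, but an explicit "or", so OR.
        System.out.println(buildClause("style", true,
                Arrays.asList("lightweight", "comfortable"),
                Arrays.asList("lightweight", "or", "comfortable", "shirts"), 0, 2));
    }
}

The actual component code excerpted below makes the same decision inside Solr, using the schema to discover whether a field is multi-valued.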
This also allows us to handle phrases that have multiple logical operators, such as "soft, warm, machine-washable shirts that come in red or blue". Here the "or" does not override the attribute list's implied "and" because it is outside of the list. It instead refers to values of color – which, if a single-value field in the index, is ignored and defaults to OR.

Here is the code that does this contextual interpretation. As the sub-phrases in the query are mapped to index fields, the first and last positions of the phrase set are captured. Then, if the field is multi-valued, AND is used unless the term "or" has been interspersed:

SolrIndexSearcher searcher = rb.req.getSearcher();
IndexSchema schema = searcher.getSchema();
SchemaField field = schema.getField( fieldName );
boolean useAnd = field.multiValued() && useAndForMultiValuedFields;
// if query has 'or' in it and or is at a position
// 'within' the values for this field ...
if (useAnd) {
  for (int i = termPosRange[0] + 1; i < termPosRange[1]; i++ ) {
    String qToken = queryTokens.get( i );
    if (qToken.equalsIgnoreCase( "or" )) {
      useAnd = false;
      break;
    }
  }
}

StringBuilder qbldr = new StringBuilder( );
for (String val : valList ) {
  if (qbldr.length() > 0) qbldr.append( (useAnd ? " AND " : " OR ") );
  qbldr.append( val );
}
return fieldName + ":(" + qbldr.toString() + ")" + suffix;

The full source code for the QueryAutofilteringComponent is available on github for both Solr 4.x and Solr 5.x. (Due to API changes introduced in Solr 5.0, two versions of this code are needed.)

Demo

To show these concepts in action, I created a sample data set for a hypothetical department store (available on the github site). The input data contains a number of fields: product_type, product_category, color, material, brand, style, consumer_type and so on. Here are a few sample records:

<doc>
  <field name="id">17</field>
  <field name="product_type">boxer shorts</field>
  <field name="product_category">underwear</field>
  <field name="color">white</field>
  <field name="brand">Fruit of the Loom</field>
  <field name="consumer_type">mens</field>
</doc>
. . .
<doc>
  <field name="id">95</field>
  <field name="product_type">sweatshirt</field>
  <field name="product_category">shirt</field>
  <field name="style">V neck</field>
  <field name="style">short-sleeve</field>
  <field name="brand">J Crew Factory</field>
  <field name="color">grey</field>
  <field name="material">cotton</field>
  <field name="consumer_type">womens</field>
</doc>
. . .
<doc>
  <field name="id">154</field>
  <field name="product_type">crew socks</field>
  <field name="product_category">socks</field>
  <field name="color">white</field>
  <field name="brand">Joe Boxer</field>
  <field name="consumer_type">mens</field>
</doc>
. . .
<doc>
  <field name="id">135</field>
  <field name="product_type">designer jeans</field>
  <field name="product_category">pants</field>
  <field name="brand">Calvin Klein</field>
  <field name="color">blue</field>
  <field name="style">pre-washed</field>
  <field name="style">boot-cut</field>
  <field name="consumer_type">womens</field>
</doc>

The dataset contains built-in ambiguities in which a single token can occur as part of a product type, brand name, color or style. Color names are good examples of this, but there are others (boxer shorts the product vs. Joe Boxer the brand). The 'style' field is multi-valued.
Here are the schema.xml definitions of the fields:

<field name="brand" type="string" indexed="true" stored="true" multiValued="false" />
<field name="color" type="string" indexed="true" stored="true" multiValued="false" />
<field name="colors" type="string" indexed="true" stored="true" multiValued="true" />
<field name="material" type="string" indexed="true" stored="true" multiValued="false" />
<field name="product_type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="product_category" type="string" indexed="true" stored="true" multiValued="false" />
<field name="consumer_type" type="string" indexed="true" stored="true" multiValued="false" />
<field name="style" type="string" indexed="true" stored="true" multiValued="true" />
<field name="made_in" type="string" indexed="true" stored="true" multiValued="false" />

To make these string fields searchable from a "freetext" box-and-a-button query (e.g. q=red socks), the data is copied to the catch-all text field 'text':

<copyField source="color" dest="text" />
<copyField source="colors" dest="text" />
<copyField source="brand" dest="text" />
<copyField source="material" dest="text" />
<copyField source="product_type" dest="text" />
<copyField source="product_category" dest="text" />
<copyField source="consumer_type" dest="text" />
<copyField source="style" dest="text" />
<copyField source="made_in" dest="text" />

The solrconfig file has these additions for the QueryAutoFilteringComponent and a request handler that uses it:

<requestHandler name="/autofilter" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">description</str>
  </lst>
  <arr name="first-components">
    <str>queryAutofiltering</str>
  </arr>
</requestHandler>

<searchComponent name="queryAutofiltering" class="org.apache.solr.handler.component.QueryAutoFilteringComponent" />

Example 1: "White Linen perfume"

There are many examples of this query problem in the data set where a term such as "white" is ambiguous because it can occur in a brand name and as a color, but this one has two ambiguous terms, "white" and "linen", so it is a good example of how the autofiltering parser works. The phrase "White Linen" is known from the dataset to be a brand and "perfume" maps to a product type, so the basic autofiltering algorithm would match "White" as a color, then reject that for "White Linen" as a brand – since it is a longer match. It will then correctly find the item "White Linen perfume".

However, what if I search for "white linen shirts"? In this case, the simple algorithm won't match, because it will fail to provide the alternative parsing "color:white AND material:linen". That is, now the phrase "White Linen" is ambiguous. In this case, an additional bit of logic is applied to see if there is more than one possible parsing of this phrase, so the parser produces the following query:

((brand:"White Linen" OR (color:white AND material:linen)) AND product_category:shirt)

Since there are no instances of shirts made by White Linen (and if there were, the result would still be correct), we just get shirts back. Similarly for the perfume: since perfume made from linen doesn't exist, we only get the one product. That is, some of the filtering here is done in the collection. The parser doesn't know what makes "sense" at the global level and what doesn't, but the dataset does – so between the two, we get the right answer.
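As a usage sketch (mine, not from the original post), a query like the one above could be sent through the /autofilter handler from SolrJ roughly as follows; the Solr 5.x client API is assumed, and the host and core name ("demo") are invented:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Hypothetical client-side check of the /autofilter handler configured above.
public class AutofilterQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/demo");
        try {
            SolrQuery query = new SolrQuery("white linen shirts");
            query.setRequestHandler("/autofilter"); // route the request through the autofiltering handler
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("product_type") + " | "
                        + doc.getFieldValue("color") + " | "
                        + doc.getFieldValue("material"));
            }
        } finally {
            solr.close();
        }
    }
}

Running the same query against the default /select handler is an easy way to see the precision difference described in the next example.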
Example 2: "white and grey dress shirts"

In this case, I have created two color fields: "color", which is used for solid-color items and is single-valued, and "colors", which is used for multicolored items (like striped or patterned shirts) and is multi-valued. So if I have dress shirts in the data set that are solid-color white and solid-color grey, and also striped shirts that have grey and white stripes, and I search for "white and grey dress shirts", my intent is interpreted by the autofiltering parser as "show me solid-color shirts in both white and grey or multi-colored shirts that have both white and grey in them". This is the boolean query that it generates:

((product_type:"dress shirt" OR ((product_type:dress OR product_category:dress) AND (product_type:shirt OR product_category:shirt))) AND (color:(white OR grey) OR colors:(white AND grey)))

(Note that it also creates a redundant query for dress and shirt since "dress" is also a product type – but this query segment returns no results since no item is both a "dress" and a "shirt" – so it is just a slight performance waster.)

If I don't want the solid colors, I can search for "striped white and grey dress shirts" and get just those items (or use the facets). (We could also have a style like "multicolor" vs "solid color" to disambiguate, but that may not be too intuitive.) In this case, the query that the autofilter generates looks like this:

((product_type:"dress shirt" OR ((product_type:dress OR product_category:dress) AND (product_type:shirt OR product_category:shirt))) AND (color:(white OR grey) OR colors:(white AND grey)) AND style:striped)

Suffice it to say that the out-of-the-box /select request handler doesn't do any of this. To be fair, it does a good job of relevance ranking for these examples, but its precision (percentage of true or correct positives) is very poor. You can see this by comparing the number of results that you get with the /select handler vs. the /autofilter handler – in terms of precision, it's night and day. But is this dataset "too contrived" to be of real significance? For eCommerce data, I don't think so; many of these examples are real-world products, and marketing data is rife with ambiguities that standard relevance algorithms operating at the single-token level simply can't address. The autofilter deals with ambiguity by noting that phrases tend to be less ambiguous than terms, but goes further by providing alternate parsings when the phrase is ambiguous. We want to remove ambiguities that stem from the tokenization that we do on the documents and queries – we cannot remove real ambiguities; rather, we need to respond to them appropriately.

Performance Considerations

On the surface it would appear that the Query Autofiltering component adds some overhead to the query parsing process – how much is a matter for further research on my part – but let's look at the potential cost-benefit analysis. Increasing the precision of search results helps both in terms of result quality and performance, especially with very large datasets, because two of the most expensive things that a search engine has to do are sorting and faceting. Both of these require access to the full result set, so fewer false positives means fewer things to sort (i.e. demote) and facet – and overall faster responses. And while relevance ranking can push false positives off of the first page (or shove them under the rug, so to speak), the faceting engine does not – it shows all.
In some examples shown here, the precision gains are massive – in some cases an order of magnitude better. On very large datasets, one would expect that to have a significant positive impact on performance.

Autofiltering vs. Autophrasing

A while back, I introduced another solution for dealing with phrases and handling multi-term synonyms called "autophrasing" (1,2). What is the difference between these two things? They basically do the same thing – handle noun phrases as single things – but use different methods and different resources. Both can solve the multi-term synonym problem. The autophrasing solution requires an additional metadata file, "autophrases.txt", that contains a list of multi-word terms that are used to represent singular entities. The autofiltering solution gets this same information from collection fields, so it doesn't need this extra file. It can also work across fields and can solve other problems, such as converting colloquial logic to search engine logic as discussed in this post. In contrast, the autophrasing solution lacks this "relational context" – it knows that a phrase represents a single thing, but it doesn't know what type of thing it is and how it relates to other things. Therefore, it can't know what to do when user queries contain logical semantic constructs that cross field boundaries.

So, if there already is structured information in your index, which is typically the case for eCommerce data, use autofiltering. Autophrasing is more appropriate when you don't have structured information – as with record sets that have lots of text in them (bibliographic) – and you simply want phrase disambiguation. Or, you can generate the structured data needed for autofiltering by categorizing your content using NLP or taxonomic approaches. The choice of categorization method may be informed by the need to have this "relational context" that I spoke about above. Taxonomic tagging can give this context – a taxonomy can "know" things about terms and phrases, like what type of entities they represent. This gives it an advantage over machine learning classification techniques, where term relationships and interrelationships are statistically rather than semantically defined. For example, if I am crawling documents on software technology and encounter the terms "Cassandra", "Couchbase", "MongoDB", "Neo4J", "OrientDB" and "NoSQL DB", both machine learning and taxonomic approaches can determine/know that these terms are related. However, the taxonomy understands the difference between a term that represents a class or type of thing ("NoSQL DB") and an instance of a class/type ("MongoDB"), whereas an ML classifier would not – it learns that they are related but not how those relationships are structured, semantically speaking. The taxonomy would also know that "Mongo" is a synonym for "MongoDB". It is doubtful that an ML algorithm would get that. This is a critical aspect for autofiltering, because it needs to know both what sets of tokens constitute a phrase and also what those phrases represent.

Entity extraction techniques can also be used – regular expression, person, company, location extractors – that associate a lexical pattern with a metadata field value. Gazetteer or white-list entity extractors can do the same thing for common phrases that need to be tagged in a specific way. Once this is done, autofiltering can apply all of this effort to the query, to bring that discovered context to the sharp tip of the spear – search-wise.
Just as we traditionally apply the same token analysis in Lucene/Solr to both the query and the indexed documents, we can do the same with classification technologies. So it is not that autofiltering can replace traditional classification techniques – these typically work on the document where there is a lot of text to process. Autofiltering can leverage this effort because it works on the query where there is not much text to work with. Time is also of the essence here and we don’t have time for expensive text crunching algorithms as we do when we index data (well … sort of), because in this case we are dealing with what I call HTT – Human Tolerable Time. Expect a blog post on this in the near future.

The post Query Autofiltering Extended – On Language and Logic in Search appeared first on Lucidworks.

District Dispatch: Which strategies yield dollars for libraries?

Fri, 2015-06-05 16:39

New York Public Library Photo by Alex Pang via Flickr

The revolution in libraries creates many new opportunities for serving communities and campuses. But those in power—decision makers and influencers—often do not understand these new roles and possibilities for libraries. And without this understanding, the needed funding and other forms of support are not likely to flow. What can library leaders do to accelerate this understanding for the benefit of all?

The Policy Revolution!, an initiative of the American Library Association (ALA) sponsored by the Bill & Melinda Gates Foundation, provides some insight into this challenge. The initiative, now in its second year, just completed a national public policy agenda and strategic communications plan, which also provides a framework for state and local policy advocacy.

Come learn about the current state of this initiative and how it may apply to your locality and state at “Policy Revolution: Federal Dollars for Local Libraries,” a conference session that takes place during the 2015 ALA Annual Conference. An interactive program will be held from 3:00 to 4:00 p.m. on Sunday, June 28, 2015. The session will be held at the Moscone Convention Center in room 2022 of the West building.

Speakers include Alan S. Inouye, Director, ALA Office for Information Technology Policy, Washington, D.C.; Julie Todaro, Incoming ALA President-elect and Dean, Library Services, Austin Community College; and Ken Wiggin, State Librarian of Connecticut and President, Chief Officers of State Library Agencies (COSLA).

View all ALA Washington Office conference sessions

The post Which strategies yield dollars for libraries? appeared first on District Dispatch.

Open Knowledge Foundation: Putting Open at the Heart of the Digital Age

Fri, 2015-06-05 15:40
Introduction

I’m Rufus Pollock.

In 2004 I founded a non-profit called Open Knowledge

The mission we set ourselves was to open up all public interest information – and see it used to create insight that drives change.

What sort of public interest information? In short, all of it. From big issues like how our government spends our taxes or how fast climate change is happening to simple, everyday, things like when the next bus is arriving or the exact address of that coffee shop down the street.

For the last decade, we have been pioneers and leaders in the open data and open knowledge movement. We wrote the original definition of open data in 2005, and we’ve helped unlock thousands of datasets. And we’ve built tools like CKAN, which powers dozens of open data portals, like data.gov in the US and data.gov.uk in the UK. We’ve created a network of individuals and organizations in more than 30 countries, who are all working to make information open, because they want to drive insight and change.

But today I’m not here to talk specifically about Open Knowledge or what we do.

Instead, I want to step back and talk about the bigger picture. I want to talk to you about the digital age, where all that glitters is bits, and why we need to put openness at its heart.

Gutenberg and Tyndale

To do that I first want to tell you a story. It’s a true story and it happened a while ago – nearly 500 years ago. It involves two people. The first one is Johannes Gutenberg. In 1450 Gutenberg invented this: the printing press. Like the Internet in our own time, it was revolutionary. It is estimated that before the printing press was invented, there were just 30,000 books in all of Europe. 50 years later, there were more than 10 million. Revolutionary, then, though it moved at the pace of the fifteenth century, a pace of decades not years. Over the next five hundred years, Gutenberg’s invention would transform our ability to share knowledge and help create the modern world.

The second is William Tyndale. He was born in England around 1494, so he grew up in world of Gutenberg’s invention.

Tyndale followed the classic path of a scholar at the time and was ordained as a priest. In the 1510s, when he was still a young man, the Reformation still hadn’t happened and the Pope was supreme ruler of a united church across Europe. The Church – and the papacy – guarded its power over knowledge, forbidding the translation of the bible from Latin so that only its official priests could understand and interpret it.

Tyndale had an independent mind. There’s a story that he got into an argument with a local priest. The priest told him:

“We are better to be without God’s laws than the Pope’s.”

Tyndale replied:

“If God spare my life ere many years, I will cause the boy that drives the plow to know more of the scriptures than you!”

What Tyndale meant was that he would open up the Bible to everyone.

Tyndale made good on his promise. Having fled abroad to avoid persecution, between 1524 and 1527 he produced the first printed English translation of the Bible which was secretly shipped back to England hidden in the barrels of merchant ships. Despite being banned and publicly burnt, his translation spread rapidly, giving ordinary people access to the Bible and sowing the seeds of the Reformation in England.

However, Tyndale did not live to see it. In hiding because of his efforts to liberate knowledge, he was betrayed and captured in 1534. Convicted of heresy for his work, on the 6th October 1536, he was strangled then burnt at the stake in a prison yard at Vilvoorden castle just north of modern day Brussels. He was just over 40 years old.

Internet

So let’s fast forward now back to today, or not quite today – the late 1990s.

I go to college and I discover the Internet.

It just hit me: wow! I remember days spent just surfing around. I’d always been an information junkie, and I felt like I’d found this incredible, never-ending information funfair.

And I got that I was going to grow up in a special moment, at the transition to an information age. We’d be living in this magical world, where the main thing we create and use – information – could be instantaneously and freely shared with everyone on the whole planet.

But … why Openness

So, OK the Internet’s awesome …

Bet you haven’t heard that before!

BUT … – and this is the big but.

The Internet is NOT my religion.

The Internet – and digital technology – are not enough.

I’m not sure I have a religion at all, but if I believe in something in this digital age, I believe in openness.

This talk is not about technology. It’s about how putting openness at the heart of the digital age is essential if we really want to make a difference, really create change, really challenge inequity and injustice.

Which brings me back to Tyndale and Gutenberg.

Tyndale revisited

Because, you see, the person that inspired me wasn’t Gutenberg. It was Tyndale.

Gutenberg created the technology that laid the groundwork for change. But the printing press could very well have been used to pump out more Latin bibles, which would then only have made it easier for local priests to be in charge of telling their congregations the word of God every Sunday. More of the same, basically.

Tyndale did something different. Something so threatening to the powers that be that he was executed for it.

What did he do? He translated the Bible into English.

Of course, he needed the printing press. In a world of hand-copying by scribes or painstaking woodcut printing, it wouldn’t make much difference if the Bible was in English or not because so few people could get their hands on a copy.

But, the printing press was just the means: it was Tyndale’s work putting the Bible in everyday language that actually opened it up. And he did this with the express purpose of empowering and liberating ordinary people – giving them the opportunity to understand, think and decide for themselves. This was open knowledge as freedom, open knowledge as systematic change.

Now I’m not religious, but when I talk about opening up knowledge I am coming from a similar place: I want anyone and everyone to be able to access, build on and share that knowledge for themselves and for any purpose. I want everyone to have the power and freedom to use, create and share knowledge.

Knowledge power in the 16th century was controlling the Bible. Today, in our data-driven world, it’s much broader: it’s about everything from maps to medicines, sonnets to statistics. It’s about opening up all the essential information and building insight and knowledge together.

This isn’t just dreaming – we have inspiring, concrete examples of what this means. Right now I’ll highlight just two: medicines and maps.

Example: Medicines

Everyday, millions of people around the world take billions of pills, of medicines.

Whether those drugs actually do you good – and what side effects they have – is obviously essential information for researchers, for doctors, for patients, for regulators – pretty much everyone.

We have a great way of assessing the effectiveness of drugs: randomized control trials in which a drug is compared to its next best alternative.

So all we need is all the data on all those trials (this would be non-personal information only – any information that could identify individuals would be removed). In an Internet age you’d imagine that this would be a simple matter – we just need all the data openly available and maybe some way to search it.

You’d be wrong.

Many studies, especially negative ones, are never published – the vast majority of studies are funded by industry, which uses restrictive contracts to control what gets published. Even where pharmaceutical companies are required to report on the clinical trials they perform, the regulator often keeps the information secret or publishes it as 8,000-page PDFs, each page hand-scanned and unreadable by a computer.

If you think I’m joking I’ll give just one very quick example which comes straight from Ben Goldacre’s Bad Pharma. In 2007 researchers in Europe wanted to review the evidence on a diet drug called rimonabant. They asked the European regulator for access to the original clinical trials information submitted when the drug was approved. For three years they were refused access on a variety of grounds. When they did get access this is what they got initially – that’s right 60 pages of blacked out PDF.

We might think this was funny if it weren’t so deadly serious: in 2009, just before the researchers finally got access to the data, rimonabant was removed from the market on the grounds that it increased the risk of serious psychiatric problems and suicide.

This situation needs to change.

And I’m happy to say something is happening. Working with Ben Goldacre, author of Bad Pharma, we’ve just started the OpenTrials project. This will bring together all the data, on all the trials and link it together and make it open so that everyone from researchers to regulators, doctors to patients can find it, access it and use it.

Example: Maps

Our second example is maps. If you were looking for the “scriptures” of this age of digital data, you might well pick maps, or, more specifically the geographic data on which they are built. Geodata is everywhere: from every online purchase to the response to the recent earthquakes in Nepal.

Though you may not realize it, most maps are closed and proprietary – you can’t get the raw data that underpins the map, you can’t alter it or adapt it yourself.

But since 2004 a project called OpenStreetMap has been creating a completely open map of the planet – raw geodata and all. Not only is it open for access and reuse, the database itself is collaboratively built by hundreds of thousands of contributors from all over the world.

What does this mean? Just one example. Because of its openness OpenStreetMap is perfect for rapid updating when disaster strikes – showing which bridges are out, which roads are still passable, what buildings are still standing. For example, when a disastrous earthquake struck Nepal in April this year, volunteers updated 13,199 miles of roads and 110,681 buildings in under 48 hours providing crucial support to relief efforts.

The Message not the Medium

To repeat then: technology is NOT teleology. The medium is NOT the message – and it’s the message that matters.

The printing press made possible an “open” bible but it was Tyndale who made it open – and it was the openness that mattered.

Digital technology gives us unprecedented potential for creativity, sharing, and freedom. But these are possibilities, not inevitabilities. Technology alone does not make the choice for us.

Remember that we’ve been here before: the printing press was revolutionary but we still ended up with a print media that was often dominated by the few and the powerful.

Think of radio. If you read about how people talked about it in the 1910s and 1920s, it sounds like the way we talk about the Internet today. The radio was going to revolutionize human communications and society. It was going to enable a peer-to-peer world where everyone could broadcast, it was going to allow new forms of democracy and politics, etc. What happened? We got a one-way medium, controlled by the state and a few huge corporations.

Look around you today.

The Internet’s costless transmission can just as easily create – and is creating – information empires and information robber barons as it can create digital democracy and information equality.

We already know that this technology offers unprecedented opportunities for surveillance, for monitoring, for tracking. It can just as easily exploit us as empower us.

We need to put openness at the heart of this information age, and at the heart of the Net, if we are really to realize its possibilities for freedom, empowerment, and connection.

The fight, then, is for the soul of this information age, and we have a choice.

A choice of open versus closed.

Of collaboration versus control.

Of empowerment versus exploitation.

It’s a long road ahead – longer perhaps than our lifetimes. But we can walk it together.

In this 21st century knowledge revolution, William Tyndale isn’t one person. It’s all of us, making small and big choices: from getting governments and private companies to release their data, to building open databases and infrastructures together, from choosing apps on your phone that are built on openness to using social networks that give you control of your data rather than taking it from you.

Let’s choose openness, let’s choose freedom, let’s choose the infinite possibilities of this digital age by putting openness at its heart.

Thank you.

David Rosenthal: Archiving games

Fri, 2015-06-05 15:00
This is just a quick note to flag two good recent posts on the important but extremely difficult problem of archiving computer games. Gita Jackson at Boing-Boing, in The vast, unplayable history of video games, describes the importance to scholars of archiving games. Kyle Orland at Ars Technica, in The quest to save today's gaming history from being lost forever, covers the technical reasons why it is so difficult in considerable detail, including quotes from many of the key players in the space.

My colleagues at the Stanford Libraries are actively working to archive games. Back in 2013, on the Library of Congress' The Signal digital preservation blog, Trevor Owens interviewed Stanford's Henry Lowood, who curates our games collection.

Richard Wallis: Schema.org 2.0

Fri, 2015-06-05 14:32

About a month ago Version 2.0 of the Schema.org vocabulary hit the streets.

This update includes loads of tweaks, additions and fixes that can be found in the release information.  The automotive folks have new vocabulary for describing Cars, including useful properties such as numberOfAirbags, fuelEfficiency, and knownVehicleDamages. The new property mainEntityOfPage (and its inverse, mainEntity) provides the ability to tell the search engine crawlers which thing a web page is really about.  With a new ScreeningEvent type to support movie/video screenings, a gtin12 property for Product, and more besides, there is much useful stuff in there.

But does this warrant the version number clicking over from 1.xx to 2.0?

These new types and properties are only the tip of the 2.0 iceberg.  There is a heck of a lot of other stuff going on in this release apart from these additions.  Some of it is in the vocabulary itself; some of it is in the potential, documentation, supporting software, and organisational processes around it.

Sticking with the vocabulary for the moment, there has been a bit of cleanup around property names. As the vocabulary has grown organically since its release in 2011, inconsistencies and conflicts between different proposals have been introduced, so part of the 2.0 effort has included some rationalisation.  For instance, the Code type is being superseded by SoftwareSourceCode – the term code has many different meanings, many of which have nothing to do with software; surface has been superseded by artworkSurface; and area is being superseded by serviceArea, for similar reasons. Check out the release information for full details.  If you are using any of the superseded terms there is no need to panic, as the original terms are still valid but with updated descriptions to indicate that they have been superseded.  However, you are encouraged to move towards the updated terminology as convenient.

The question of what is in which version brings me to an enhancement to the supporting documentation.  Starting with Version 2.0, a snapshot view of the full vocabulary will be published – here is http://schema.org/version/2.0.  So if you want to refer to a term at a particular version, you now can.

How often is Schema.org being used? That is a question often asked, and a new feature has been introduced to give you some indication.  Check out the description of one of the newly introduced properties, mainEntityOfPage, and you will see the following: ‘Usage: Fewer than 10 domains’.  Unsurprisingly for a newly introduced property, there is virtually no usage of it yet.  If you look at the description for the type this term is used with, CreativeWork, you will see ‘Usage: Between 250,000 and 500,000 domains’.  Not a direct answer to the question, but a good and useful indication of the popularity of a particular term across the web.
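
To make the terms above concrete, here is a minimal sketch (not taken from the Schema.org documentation) of what CreativeWork markup with a mainEntityOfPage property can look like when expressed as JSON-LD, one of the syntaxes Schema.org supports. The structure is built in Python and printed as JSON; the URL, names and title are hypothetical.

```python
import json

# A minimal, hypothetical example of Schema.org markup as JSON-LD,
# using two terms discussed above: CreativeWork and mainEntityOfPage.
markup = {
    "@context": "http://schema.org",
    "@type": "CreativeWork",
    "name": "An Example Article",
    "author": {"@type": "Person", "name": "Jane Example"},
    # mainEntityOfPage points at the page that has this work as its main subject.
    "mainEntityOfPage": "http://example.org/articles/an-example-article",
}

# The output would typically be embedded in a web page inside a
# <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```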

Extensions
In the release information you will find the following cryptic reference: ‘Fix to #429: Implementation of new extension system.’

This refers to the introduction of the functionality, on the Schema.org site, to host extensions to the core vocabulary.  The motivation for this new approach to extending is explained thus:

Schema.org provides a core, basic vocabulary for describing the kind of entities the most common web applications need. There is often a need for more specialized and/or deeper vocabularies, that build upon the core. The extension mechanisms facilitate the creation of such additional vocabularies.
With most extensions, we expect that some small frequently used set of terms will be in core schema.org, with a long tail of more specialized terms in the extension.

As yet there are no extensions published.  However, there are some on the way.

As Chair of the Schema Bib Extend W3C Community Group, I have been closely involved with a proposal by the group for an initial bibliographic extension (bib.schema.org) to Schema.org.  The proposal includes new Types for Chapter, Collection, Agent, Atlas, Newspaper & Thesis; CreativeWork properties to describe the relationship between translations; plus types & properties to describe comics.  I am also following the proposal’s progress through the system – a bit of a learning exercise for everyone.  Hopefully I can share the news in the not too distant future that bib will be one of the first released extensions.

W3C Community Group for Schema.org
A subtle change has also taken place in the way the vocabulary, its proposals, extensions and direction can be followed and contributed to.  The creation of the Schema.org Community Group has now provided an open forum for this.

So is 2.0 a bit of a milestone?  Yes, taking all things together, I believe it is. I get the feeling that Schema.org is maturing into the kind of vocabulary, supported by a professional community, that will give confidence to those using it and to those recommending that others should.

Harvard Library Innovation Lab: Link roundup June 5, 2015

Fri, 2015-06-05 14:08

Some links to start your June with

In A Digital Chapter, Paper Notebooks Are As Relevant As Ever : NPR

What do you take to a meeting? Laptop or paper?

A Lovely Sunny Day | Zachary Levi and Bert From Sesame Street | Mashable – YouTube

Go outside!

An Innovative New Timelapse Video Constructed From Online Photographs | Mental Floss

Amazing crowdsourced timelapse video

Pigs and Mice Have the Lowliest Jobs in Richard Scarry’s Busytown | Mental Floss

Busytown pigs and mice have less prestigious professions

I’ve been texting with an astronaut – Boing Boing

An iOS game puts you in charge of saving an astronaut crashed on a distant moon. Using text messages.

Library of Congress: The Signal: DPOE Interview with Jim Corridan

Fri, 2015-06-05 14:00

The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress.

Jim Corridan

This post is part of a series about digital preservation training inspired by the Library’s Digital Preservation Outreach & Education (DPOE) Program. Today I’ll focus on an exceptional individual, who among other things, hosted one of the DPOE Train-the-Trainer workshops in 2012. This is an interview with Jim Corridan, Indiana Commission on Public Records Director and State Archivist, as well as Council of State Archivists (CoSA) Secretary-Treasurer. In both his roles as a state archivist and a CoSA officer, he supports missions to preserve state government information as a public good.

Barrie: Jim, your tenure as a champion of the DPOE Train-the-trainer Workshop parallels your time as a CoSA Board member in one capacity or another. I’d like to eventually focus on what you’re up to with CoSA, but first can you recount a little about your experience hosting the DPOE Workshop and what kind of impact that had on the participants?

Jim:  Indiana was fortunate a number of years ago to be able to sponsor a Midwest DPOE regional training. We had participants from more than 10 states interact and go through the weeklong training.  The train-the-trainers attendees seemed to enjoy the content and the experience.  Hundreds if not thousands of people in the Midwest have now benefited from that one week of training, not to mention the benefit of their institutions.

Barrie: The Workshop that you hosted was supported by funding from the Institute of Museum and Library Services (IMLS), right? How did that work, and could other state archives do the same?

Jim:  Yes; Indiana received a grant from the State Library using IMLS’s Library Service and Technology Act (LSTA) grant funds to hold the conference. We used the grant funds to cover travel, lodging and some meals, as well as the instructors’ costs.  If another state archives has access to LSTA funding, or a state library was interested in hosting a statewide or regional DPOE training, that would be a great use of the funds and likely make a significant contribution to the state of digital preservation in the area.

Barrie: Speaking of IMLS funding and training, CoSA has developed a series of institutes through its State Electronic Records Initiative (SERI) with a focus on digital preservation for state archivists and records managers. Can you tell the readers a little about that project?

Jim:  The SERI project began about four years ago to provide state archives staff with resources and training on electronic records management and preservation. It’s been a terrific program that has significantly enhanced, encouraged, and of course, educated the government archives community to move forward in the areas of electronic records.

The SERI Institutes looked at DPOE as one of the main training frameworks in our initial planning, among others to develop the week-long basic and advanced programs. About a third of the states and territories attended the primary class held in Indiana and all the states and territories attended one of the two advanced programs held in Richmond, Virginia and Salt Lake City, Utah.  Each institute was followed by monthly webinars.  The states are going through a process now, an electronic records program self-assessment, to determine how much progress has been made during the past few years.  I think the results will be significant in many areas.

Barrie: So the SERI institutes offer training in both traditional in-person learning environments and distance learning options like webinars? What do you think are the merits and challenges of each approach?

Jim:  We all have different learning styles.  Personally, I become distracted by email, staff or other things when I’m attending webinars. So for me, in-person training is significantly more useful, and CoSA finds there is tremendous benefit to the networking, sharing and sometimes commiserating over challenges and opportunities involved with e-records training.

When addressing complex topics like digital preservation, it’s better to focus on it, rather than trying to learn remotely at a slow pace.  SERI’s preference was to follow the Institutes with webinars to look more deeply at specific tasks or topics once the institutes were completed, which allowed for in-depth training in areas where the group may have needed additional assistance or more detail. The webinars have provided information on various topics, and have provided an opportunity for colleagues to share successes and concerns.

Barrie: You wear many hats, as you also sit on the coordinating committee of the National Digital Stewardship Alliance in addition to your above-mentioned credentials. Given your knowledge and experience, what’s missing from digital preservation education and training today?

Jim:  The three biggest challenges facing digital preservation, and specifically education, are resources, awareness and leadership. It’s evident the field lacks the resources to educate, develop, research and implement digital preservation programs. It sometimes appears we are challenged as a field by an inability to simply and clearly articulate the opportunities and threats to the national digital memory posed by a lack of training and infrastructure. While some efforts are being made around governance, this is an area ripe for a much broader conversation and engagement by our colleagues in digital preservation.

As far as leadership, the nation has been fortunate to have had the guidance of Laura Campbell and her team at the Library of Congress. As NDSA has matured and the Library of Congress has redefined its mission, NDSA is looking for a new host institution, and I suspect that NDSA will take the opportunity to review its mission. Perhaps another respected federal institution could take on the role the Library of Congress had provided as the convener and central hub for digital preservation; but it seems to me that a centralized resource provider with broad community input, much like NDSA with its academic, corporate, government and non-profit members, would be a good partner and leader.

Barrie: Any advice on developing a skillset for managing digital content you’d like to share with aspiring digital archivists and electronic records managers?

Jim:  Four things come to mind when hiring an electronic records employee: 1) technical competence; 2) innovation; 3) flexibility and adaptability; 4) articulate communication skills.  Competence stands on its own merits. As for the others, in a resource-poor arena, innovation and flexibility are key factors in success.  How can I implement or fix “X” within the existing infrastructure, and how can I make things more efficient and easier?

But perhaps one of the most important factors is written and verbal communication.  In a complex field with a technical vocabulary, it is essential to be able to accurately and clearly describe the issues at hand.  You will work with IT professionals who, for instance, will have different definitions from our field for the same words, like “archive.”  As you work with administrators, the public and IT staff, you will eliminate much difficulty if you can quickly grasp the disconnects and articulate the opportunities at hand.  In a developing field, grant writing, presentations and clarity are all going to play a factor in a program’s success. I look forward to the new employees joining the field making significant progress in digital preservation, and to continuing the success in preserving the nation’s digital memory.

LibUX: LibGuides — How Usable is the Three Column Layout?

Fri, 2015-06-05 13:29

Almost all LibGuides in use — 4,800 libraries; 5.9 million monthly users — look like this:

A three-column, flexible-width page with tabs across the top is the default template, introduced by Springshare years ago in version 1. With version 2, they introduced – alongside a much more robust templating system – a new out-of-the-box template called “Side Nav.”

Which one’s #WordPress, which one’s #LibGuides? Mwuahaha. #libux #libweb Making an example here folks. pic.twitter.com/9ZDdfon2sY

— Michael Schofield (@schoeyfield) February 5, 2015

This is a single column of content with a sidebar menu. Both can be wildly modified, so I should note that we can’t really infer much about the quality of the design from the rough data Springshare was willing to share. That said, I thought it might be fun to talk briefly about the implications of three-columns-plus-tabs default in terms of its usability.

The right side is a blind spot

According to a 2010 writeup by the Nielsen Norman Group, people spend almost three-quarters of their time on a page looking at the left side. Given that the average visit lasts less than 1 minute, we should presume that the right-most column is a no-man’s land where content goes to die.

This isn’t always true. For instance, the Gutenberg principle shows that when design elements are introduced, the traditional F-pattern tends to break down and the viewer’s attention is drawn toward the bottom right. In general, though, the rule of thumb should be that it’s a wasteland.

Center content source order is problematic

Traditionally, the center column is the broadest, and librarians put the most important content there. This is fine and makes sense for large screens, but LibGuides 2 uses Bootstrap, which (good news) is mobile-first and determines the organization of content on mobile by its source order.

On small screens, the right-most column falls beneath the center, which in turn falls beneath the left.

Image created by Luke Wroblewski

As libraries approach their mobile moment, it is increasingly important that the most important information on the page appears at the top. This is in line with the best practice of structuring content as an inverted pyramid, but as the device landscape gets increasingly weird (and increasingly in-hand), we have to consider the literal source order, where our content lives in relation to elements on the page, and adjust.

The menu isn’t neutral

By which I mean, the left-most menu items hold more weight than those further right. This is a feature, not a bug.

Source – Top Navigation vs Left Navigation: Which Works Better?

That said, as the number of menu items increases, so too does the interaction cost: it actually becomes more work for the user to navigate. This can be a devastating timesuck in the precious moment it takes the user to determine whether or not this page is worth their time. The higher the percentage of users that bounce because of the lack of scannability, the lower the perceived credibility of the library.

It’s not all bad

Multiple, minimal columns (no separate boxes or borders) can increase the scannability and reading speed of a single type of content, long-form text, and so on – that is, if the columns can be kept above the bottom of the screen so the user doesn’t have to scroll vertically.

Rather, the way that “content boxes” on LibGuides are used as literal blocks of content with distinct borders separated from other boxes of content, it is more accurate to think of the default LibGuides template as something like Pinterest. There are some serious usability implications for libraries using Pinterest-style layouts, about which I have been critical on twitter, but used smartly, boxes can make it easier for the patron to browse.
I don’t really want to spill all the beans regarding my thoughts on Pinterest-style layouts for libraries; that’s another thinkpiece.

Anyway, I am an on-record fan of LibGuides 2, and Springshare has proven to be worthy of these increasingly high expectations a lot of us in the #libweb hold them to: their fresh APIs, robust templating, and customer service have a lot to do with this. They have always been popular with libraries, but not so with web services folk; that they are largely winning us over update by update is to their credit.

There is a cost to the columns-and-tabs default that is probably greater than that of the side-menu template bundled with LibGuides. It is a choice libraries make without too much thought, yet it is one of the few high-impact library web decisions that libraries without dedicated web folk are logistically capable of making.

It is one I encourage my friends to make consciously.

The post LibGuides — How Usable is the Three Column Layout? appeared first on LibUX.

William Denton: Interest rates and climate change as music

Fri, 2015-06-05 01:47

Two great examples of turning data into sound and music: Interest Rates: The Musical (“70 years of interest rates ups and downs, with one month represented by one beat, and quarter-point changes given by semitones”) and Planetary Bands, Warming World, composed by Daniel Crawford (“The cello matches the temperature of the equatorial zone. The viola tracks the mid latitudes. The two violins separately follow temperatures in the high latitudes and in the arctic”).

Here’s “Interest Rates: The Musical.” If you don’t know what interest rates were like in the seventies, this will make it clear. And you won’t miss 2008.

And “Planetary Bands, Warming World.” Composer Daniel Crawford (an undergraduate at U Minnesota) worked with geography professor Scott St. George. Using an open climate change data set, Crawford turned surface temperatures in the northern hemisphere into a composition for string quartet.

DuraSpace News: Fedora Workshops, Presentations, and Posters at OR2015

Fri, 2015-06-05 00:00

From David Wilcox, Fedora Product Manager

Eric Hellman: Towards the Post-Privacy Library?

Thu, 2015-06-04 21:27
I have an article in this month's Digital Futures, a supplement to American Libraries magazine. The full issue is an important one, so go take a look. In addition to my article, be sure to read the article starting on page 20, entitled "Empowering Libraries to Innovate," in which I am quoted.

I'm reprinting the article here so as to have a good place for discussion.


Alice, a 17 year old high school student, goes to her local public library and reads everything she can find about pregnancy. Noticing this, a librarian calls up some local merchants and tells them that Alice might be pregnant. When Alice visits her local bookstore, the staff has some great suggestions about newborn care for her. The local drugstore sends her some coupons for scent-free skin lotion. She reads "what you can expect..." at the library and a few months later she starts getting mail about diaper services.

Unthinkable? In the physical library, I hope this never happens. It would be too creepy!

In the digital library, this future could be happening now. Libraries and their patrons are awash in data that really isn't sensitive until aggregated, and the data is getting digested by advertising networks and flowing into "big data" archives. The scenario in which advertisers exploit Alice's library usage is not only thinkable, it needs to be defended against. It's a "threat model" that's mostly unfamiliar to libraries.

Recently, I read a book called Half Life. Uranium theft, firearms technology and computer hacking are important plot elements, but I'm not worried about people knowing that I loved it. The National Security Agency (NSA) is not going to identify me as a potential terrorist because I'm reading Half Life. On the contrary, I'd love for my reading behavior to be broadcast to the entire world, because maybe more people would discover what a wonderful writer S.L. Huang is. A lot of a library user's digital usage data is like that. It's not particularly private, and most users would gladly trade usage information for convenience or to help improve the services they rely on. It would be a waste of time and energy for a library to worry much about keeping that information secret. Quite the opposite: libraries are helping users share their behavior with things like Facebook Like buttons and social media widgets.

Which is why Alice should be very worried and why it's important for libraries to understand new threat models. What breaches of user privacy are most likely to occur, and which are most likely to present harm?

A 2012 article in the New York Times Magazine described a real situation involving Target (the retailer).  Target's "big data" analytics team developed a customer model that identified pregnant women based on shopping behavior. Purchases of scent-free skin lotion, vitamin supplements, and cotton balls turned out to be highly predictive of subsequent purchases of baby diapers. Using the model, Target sent ads for baby-oriented products to the customers their algorithm had identified. In one case, an irate father whose daughter had received ads for baby clothes and cribs accused the store of encouraging his daughter to get pregnant. When a manager called to apologize, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

Among the companies collecting "big data" about users are the advertising networks, companies that sit in between advertisers and websites. They use their data to decide which ad from a huge inventory is most likely to result in a user response. If I were Alice, I don't think I would want my search for pregnancy books broadcast to advertising networks. Yet that's precisely what happens when I do a search on my local public library's online catalog. I very much doubt that many advertisements are being targeted based on that searching ... yet. But the digital advertising industry is extremely competitive, and unless libraries shift their practices, it's only a matter of time before library searches get factored into advanced customer models.

But it doesn't have to happen that way. Libraries have a strong tradition of protecting user privacy. Once all the "threat models" associated with the digital environment are considered, practices will certainly change.

So let's get started. In the rest of this article, I'll examine the process of borrowing and reading an ebook, and identify privacy weaknesses in the processes that advertisers and their predictive analytics modeling could exploit.
  1. Most library catalogs allow non-encrypted searches. This exposes Alice's ebook searches to internet providers between Alice and the library's server. The X-UIDH header has been used by providers such as Verizon and AT&T to help advertisers target mobile users. By using HTTPS for their catalogs, libraries can limit this intrusion. This is relatively easy and cheap, and there's no good excuse in 2015 for libraries not to make the switch (a quick way to check a catalog's behavior is sketched after this list).

  2. Some library catalogs use social widgets such as AddThis or ShareThis that broadcast a user's search activity to advertising networks. Similarly, Facebook "Like" buttons send a user's search activity to Facebook whether or not the user is on Facebook. Libraries need to carefully evaluate the benefits of these widgets against the possibility that advertising networks will use Alice's search history inappropriately.

  3. Statistics and optimization services like Google Analytics and NewRelic don't currently share Alice's search history with advertising networks, but libraries should evaluate the privacy assurances from these services to see if they are consistent with their own policies and local privacy laws.

  4. When Alice borrows a book from a vendor such as OverDrive or 3M, the vendor monitors Alice's reading behavior, albeit anonymously. At this date, it's very difficult for an advertiser to exploit Alice's use of reading apps from OverDrive or 3M. Although many have criticized the use of Adobe digital rights management (DRM) in these apps, both 3M and OverDrive use the “vendorID” method, which avoids the disclosure of user data to Adobe, so there is no practical way for an advertising network to exploit Alice's use of these services. Here again, libraries should review their vendor contracts to make sure that can't change.

  5. If Alice reads her ebook using a 3rd party application such as Adobe Digital Editions (ADE), the privacy behavior of the third party comes into play. Last year, ADE was found to be sending user reading data back to Adobe without encryption; even today, it's known to phone home with encrypted reading data. Other applications, such as Bluefire Reader, have a better reputation for privacy, but as they say, “past performance is no guarantee of future returns”.

  6. If Alice wants to read her borrowed ebook on a Kindle (via OverDrive), it's very likely that Amazon will be able to exploit her reading behavior for marketing purposes. To avoid it, Alice would need to create an anonymous account on Amazon for reading her library books. Most people will just use their own (non-anonymous) accounts for convenience. If Alice shares her Amazon account with others, they'll know what she reads.

    This is a classic example of the privacy vs. convenience tradeoff that libraries need to consider. A Kindle user trusts that Amazon will not do anything too creepy, and Amazon has every incentive to make that user comfortable with their data use. Libraries need to let users make their own privacy decisions, but at the same time libraries need to make sure that users understand the privacy implications of what they do.

  7. The library's own records are also a potential source of a privacy breach. This "small-data" threat model is perhaps more familiar to librarians. Alice's parents could come in and demand to know what she's been reading. A schoolmate might hack into the library's lightly defended databases looking for ways to embarrass Alice. A staff member might be a friend of Alice's family. Libraries need clear policies and robust processes to be worthy of Alice's trust.
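
As promised under point 1, here is a minimal sketch, using Python's requests library, of how a library might check a catalog's HTTPS behavior. The catalog URL is hypothetical; the script fetches the plain-HTTP search URL and reports whether the server redirects the user to an encrypted connection.

```python
import requests

# Hypothetical catalog search URL; substitute your own library's OPAC.
CATALOG_SEARCH = "http://catalog.example-library.org/search?q=pregnancy"

def check_https_redirect(url):
    """Request the plain-HTTP URL and report where the server sends us."""
    response = requests.get(url, allow_redirects=True, timeout=10)
    final_url = response.url
    if final_url.startswith("https://"):
        print("Good: %s redirects to %s" % (url, final_url))
    else:
        print("Warning: %s is served without encryption (final URL: %s)" % (url, final_url))

if __name__ == "__main__":
    check_https_redirect(CATALOG_SEARCH)
```

This checks only the redirect; a fuller audit would also look at whether HTTPS is enforced (for example with HSTS) and whether third-party widgets on the page are loaded securely.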

In the digital environment, it's easy for libraries to be unduly afraid of using the data from Alice's searches and reading to improve her experience and make the library a more powerful source of information. Social networks are changing the way we think about our privacy, and often the expectation is that services will make use of personal information that's been shared. Technologies exist to protect the user's control over that data, but advertising networks have no incentive to employ them. I want my library to track me, not advertising networks! I want great books to read, and no, I'm not in the market for uranium-238!

Eric Lease Morgan: Boxplots, histograms, and scatter plots. Oh, my!

Thu, 2015-06-04 20:12

I have started adding visualizations literally illustrating the characteristics of the various “catalogs” generated by the HathiTrust Workset Browser. These graphics (box plots, histograms, and scatter plots) make it easier to see what is in the catalog and the features of the items it contains.

For example, read the “about page” reporting on the complete works of Henry David Thoreau. For more detail, see the “home page” on GitHub.
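
The snippet below is not the Workset Browser's own code; it is just a minimal sketch of how such box plots, histograms, and scatter plots can be produced with pandas and matplotlib. It assumes a tab-delimited catalog file named catalog.tsv with hypothetical numeric columns ("pages" and "year"), which stand in for whatever features a real catalog records about each HathiTrust item.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical tab-delimited catalog with one row per item;
# the column names are assumptions, not the Workset Browser's actual schema.
catalog = pd.read_csv("catalog.tsv", sep="\t")

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 4))

# Box plot: the spread of page counts across the items in the catalog.
ax1.boxplot(catalog["pages"].dropna())
ax1.set_title("Pages per item")

# Histogram: the distribution of publication years.
ax2.hist(catalog["year"].dropna(), bins=20)
ax2.set_title("Publication years")

# Scatter plot: page count plotted against publication year.
ax3.scatter(catalog["year"], catalog["pages"], s=10)
ax3.set_title("Pages vs. year")

plt.tight_layout()
plt.savefig("catalog-features.png")
```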
