### Planet Code4Lib

Blogs and feeds of interest to the Code4Lib community, aggregated.


July 22, 2014

Jenny Rose Halperin

Numbers are not enough: Why I will only attend conferences with explicitly enforceable Codes of Conduct and a commitment to accessibility

I recently had a bad experience at a programming workshop where I was the only woman in attendance and eventually had to leave early out of concern for my safety.

Having to repeatedly explain the situation to a group of men who promised me that “they were working on fixing this community” was not only degrading, but also unnecessary. I was shuttled to three separate people, eventually receiving some of my money back approximately a month later (which was all I asked for) along with promises and placating statements about “improvement.”

What happened could have been prevented: each participant signed a “Code of Conduct” that was buried in the payment process for the workshop, but there was no method of enforcement and nowhere to turn when issues arose.

At one point while I was attempting to resolve the issue, this community’s Project Manager told me, “Three other women signed up, but they dropped out at the last minute because they had to work. It was very strange and unexpected that you were the only woman.” I felt immediately silenced. The issue is not numbers, but instead inviting people to safe spaces and building supportive structures where people feel welcomed and not marginalized. Increasing the variety of people involved in an event is certainly a step, but it is only part of the picture. I realize now that the board members of this organization were largely embarrassed, but they could have handled my feelings in a way where I didn’t feel like their “future improvements” were silencing my very real current concerns.

Similarly, I’ve been thinking a lot about a conversation I had with some members of the German Python community a few months ago. Someone told me that Codes of Conduct are an American hegemonic device and that introducing the idea of abuse opens the community up for it, particularly in places that do not define “diversity” in the same way as Americans. This was my first exposure to this argument, and it definitely gave me a lot of food for thought, though I adamantly disagree.

In my opinion, the open-source tech community is a multicultural community, and organizers and contributors have the responsibility to set their rules for participation. Mainstream Western society, which unfortunately dictates many of the social rules on the Internet, does a bad job teaching people how to interact with one another in a positive and genuine way, and going beyond the “be excellent to one another, we’re all friends here!” argument helps us participate in a way in which people feel safe both on and off the Web.

At a session at the Open Knowledge Festival this week, we were discussing accessibility and realized that the Code of Conduct (called a “User Guide”) was not easily located and many participants were probably not aware of its existence. The User Guide is quite good: it points to other codes of conduct, provides clear enforcement, and emphasizes collaboration and diversity.

At the festival, accessibility was not addressed in any kind of cohesive manner: the one gender-neutral bathroom in the huge space was difficult to find; sessions were loud, noisy, and often upstairs, making it impossible for anyone with a hearing or mobility issue to participate; and finally, the conference organizers did not inform participants that food would not be free, dramatically increasing the effective cost of attending in an expensive neighborhood in Berlin.

In many ways, I’m conflating two separate issues here (accessibility and the behavior of participants at an event). I would counter that creating a safe space is not only about behavior on the part of the participants, but also on the part of the conference organizers. Thinking about how participants interact at your event not only has to do with how people interact with one another, but also how people interact with the space. A commitment to accessibility and “diversity” hinges upon more than words and takes concerted and long-term action. It may mean choosing a smaller venue or limiting the size of the conference, but it’s not impossible, and it’s incredibly important. It also doesn’t have to be expensive! A small hack that I appreciated at Ada Camp and Open Source Bridge was a quiet chill-out room; being able to escape from the hectic buzz made a real difference.

Ashe Dryden writes compellingly about the need for better Codes of Conduct and the impetus to have events reflect not only what a community looks like now, but also where it wants to go. As she writes,

I worry about the conferences that are adopting codes of conduct without understanding that their responsibility doesn’t end after copy/pasting it onto their site. Organizers and volunteers need to be trained about how to respond, need to educate themselves about the issues facing marginalized people attending their events, and need to more thoughtfully consider their actions when responding to reports.

Dryden’s Code of Conduct 101 and FAQ should be required reading for all event organizers and Community Managers. Codes of Conduct remove the grey areas surrounding appropriate and inappropriate behavior and allow groups to set the boundaries for what they want to see happening in their communities. In my opinion, there should not only be a Code of Conduct, but also an accessibility statement that collaboratively outlines what the organizers are doing to make the space accessible and inclusive, and that addresses and invites concerns and edits. In her talk at the OKFestival, Penny pointed out that accessibility and inclusion actually make things better for everyone involved in an event. As she said, “No one wants to sit in a noisy room! For you, it may be annoying, but for me it’s impossible.”

Diversity is not only about getting more women in the room; it is about thinking intersectionally and educating oneself so that all people feel welcome regardless of class, race, physicality, or level of education. I’ve had the remarkable opportunity to go to conferences all over the world this year, and the spaces that have made a real effort to think beyond “We have 50% women speakers!” stand out almost immediately. I felt safe and welcomed at Open Source Bridge and Ada Camp. From food I could actually eat to lanyards that indicated comfort with photography to accessibility lanes, the conference organizers were thoughtful, available, and also kind enough that I could approach them if I needed anything or wanted to talk.

From now on, unless I’m presented a Code of Conduct that is explicit in its enforcement, defines harassment in a comprehensive manner, makes accessibility a priority, and provides trained facilitators to respond to issues, you can count me out of your event.

We can do better in protecting our friends and communities, but change can only begin internally. I am a Community Manager because we get together to educate ourselves and each other as a collaborative community of people from around the world. We should feel safe in the communities of practice that we choose, whether that community is the international Python community, or a local soccer league, or a university. We have the power to change our surroundings and, by extension, our future, but it will take a solid commitment from each of us.

Events will never be perfect, but I believe that at least in this respect, we can come damn close.

by jennierosehalperin at July 22, 2014 01:17 PM

July 21, 2014

Terry Reese

Code4Lib Article: Opening the Door: A First Look at the OCLC WorldCat Metadata API

For those interested in some code and feedback on experiences using the OCLC Metadata API, you can find my notes here: http://journal.code4lib.org/articles/9863


–TR

by reeset at July 21, 2014 06:24 PM

Code4Lib Journal

Editorial introduction: On libraries, code, support, inspiration, and collaboration

Reflections on the occasion of the 25th issue of the Code4Lib Journal: sustaining a community for support, inspiration, and collaboration at the intersection of libraries and information technology.

by Dan Scott at July 21, 2014 05:54 PM

Getting What We Paid for: a Script to Verify Full Access to E-Resources

Libraries regularly pay for packages of e-resources containing hundreds to thousands of individual titles. Ideally, library patrons could access the full content of all titles in such packages. In reality, library staff and patrons inevitably stumble across inaccessible titles, but no library has the resources to manually verify full access to all titles, and basic URL checkers cannot check for access. This article describes the E-Resource Access Checker—a script that automates the verification of full access. With the Access Checker, library staff can identify all inaccessible titles in a package and bring these problems to content providers’ attention to ensure we get what we pay for.

by Kristina M. Spurgin at July 21, 2014 05:54 PM

Opening the Door: A First Look at the OCLC WorldCat Metadata API

Libraries have long relied on OCLC’s WorldCat database as a way to cooperatively share bibliographic data and declare library holdings to support interlibrary loan services. As curator, OCLC has traditionally mediated all interactions with the WorldCat database through their various cataloging clients to control access to the information. As more and more libraries look for new ways to interact with their data and streamline metadata operations and workflows, these clients have become bottlenecks and an inhibitor of library innovation. To address some of these concerns, in early 2013 OCLC announced the release of a set of application programming interfaces (APIs) supporting read and write access to the WorldCat database. These APIs offer libraries their first opportunity to develop new services and workflows that directly interact with the WorldCat database, and provide opportunities for catalogers to begin redefining how they work with OCLC and their data.

by Terry Reese at July 21, 2014 05:54 PM

Docker: a Software as a Service, Operating System-Level Virtualization Framework

Docker is a relatively new method of virtualization available natively for 64-bit Linux. Compared to more traditional virtualization techniques, Docker is lighter on system resources, offers a git-like system of commits and tags, and can be scaled from your laptop to the cloud.

by John Fink at July 21, 2014 05:54 PM

A Metadata Schema for Geospatial Resource Discovery Use Cases

We introduce a metadata schema that focuses on GIS discovery use cases for patrons in a research library setting. Text search, faceted refinement, and spatial search and relevancy are among GeoBlacklight's primary use cases for federated geospatial holdings. The schema supports a variety of GIS data types and enables contextual, collection-oriented discovery applications as well as traditional portal applications. One key limitation of GIS resource discovery is the general lack of normative metadata practices, which has led to a proliferation of metadata schemas and duplicate records. The ISO 19115/19139 and FGDC standards specify metadata formats, but are intricate, lengthy, and not focused on discovery. Moreover, they require sophisticated authoring environments and cataloging expertise. Geographic metadata standards target preservation and quality measure use cases, but they do not provide for simple inter-institutional sharing of metadata for discovery use cases. To this end, our schema reuses elements from Dublin Core and GeoRSS to leverage their normative semantics, community best practices, open-source software implementations, and extensive examples already deployed in discovery contexts such as web search and mapping. Finally, we discuss a Solr implementation of the schema using a "geo" extension to MODS.

by Darren Hardy and Kim Durante at July 21, 2014 05:54 PM

Ebooks without Vendors: Using Open Source Software to Create and Share Meaningful Ebook Collections

The Community Cookbook project began with wondering how to take local cookbooks in the library’s collection and create a recipe database. The final website is both a recipe website and collection of ebook versions of local cookbooks. This article will discuss the use of open source software at every stage in the project, which proves that an open source publishing model is possible for any library.

by Matt Weaver at July 21, 2014 05:54 PM

Within Limits: mass-digitization from scratch

The provincial library of West-Vlaanderen (Belgium) is digitizing a large part of its iconographic collection. Due to various (technical and financial) reasons, no specialist software was used. FastScan is a set of VBS scripts developed by the author using off-the-shelf software that was either included in MS Windows (XP & 7) or already installed (ImageMagick, IrfanView, LittleCMS, exiv2). This scripting package has increased the digitization efforts immensely. The article will show what software was used, the problems that occurred, and how the pieces were scripted together.

by Pieter De Praetere at July 21, 2014 05:54 PM

A Web Service for File-Level Access to Disk Images

Digital forensics tools have many potential applications in the curation of digital materials in libraries, archives and museums (LAMs). Open source digital forensics tools can help LAM professionals to extract digital contents from born-digital media and make more informed preservation decisions. Many of these tools have ways to display the metadata of the digital media, but few provide file-level access without having to mount the device or use complex command-line utilities. This paper describes a project to develop software that supports access to the contents of digital media without having to mount or download the entire image. The work examines two approaches in creating this tool: first, a graphical user interface running on a local machine; second, a web-based application running in a web browser. The project incorporates existing open source forensics tools and libraries including The Sleuth Kit and libewf along with the Flask web application framework and custom Python scripts to generate web pages supporting disk image browsing.

by Sunitha Misra, Christopher A. Lee and Kam Woods at July 21, 2014 05:54 PM

Processing Government Data: ZIP Codes, Python, and OpenRefine

While there is a vast amount of useful US government data on the web, some of it is in a raw state that is not readily accessible to the average user. Data librarians can improve accessibility and usability for their patrons by processing data to create subsets of local interest and by appending geographic identifiers to help users select and aggregate data. This case study illustrates how census geography crosswalks, Python, and OpenRefine were used to create spreadsheets of non-profit organizations in New York City from the IRS Tax-Exempt Organization Masterfile. This paper illustrates the utility of Python for data librarians and should be particularly insightful for those who work with address-based data.

by Frank Donnelly at July 21, 2014 05:54 PM

Indexing Bibliographic Database Content Using MariaDB and Sphinx Search Server

Fast retrieval of digital content has become mandatory for library and archive information systems. Many software applications have emerged to handle the indexing of digital content, from low-level ones such as Apache Lucene, to more RESTful and web-services-ready ones such as Apache Solr and ElasticSearch. Solr’s popularity among library software developers makes it the “de facto” standard software for indexing digital content. For content (full-text content or bibliographic description) already stored inside a relational DBMS such as MariaDB (a fork of MySQL) or PostgreSQL, Sphinx Search Server (Sphinx) is a suitable alternative. This article will cover an introduction on how to use Sphinx with MariaDB databases to index database content as well as some examples of Sphinx API usage.

by Arie Nugraha at July 21, 2014 05:54 PM

Solving Advanced Encoding Problems with FFMPEG

Previous articles in the Code4Lib Journal touch on the capabilities of FFMPEG in great detail, and given these excellent introductions, the purpose of this article is to tackle some of the common problems users might face, dissecting more complicated commands and suggesting their possible uses.

by Josh Romphf at July 21, 2014 05:54 PM

HathiTrust Ingest of Locally Managed Content: A Case Study from the University of Illinois at Urbana-Champaign

In March 2013, the University of Illinois at Urbana-Champaign Library adopted a policy to more closely integrate the HathiTrust Digital Library into its own infrastructure for digital collections. Specifically, the Library decided that the HathiTrust Digital Library would serve as a trusted repository for many of the library’s digitized book collections, a strategy that favors relying on HathiTrust over locally managed access solutions whenever this is feasible. This article details the thinking behind this policy, as well as the challenges of its implementation, focusing primarily on technical solutions for “remediating” hundreds of thousands of image files to bring them in line with HathiTrust’s strict specifications for deposit. This involved implementing HTFeed, a Perl 5 application developed at the University of Michigan for packaging content for ingest into HathiTrust, and its many helper applications (JHOVE to detect metadata problems, Exiftool to detect metadata issues and repair missing image metadata, and Kakadu to create JPEG 2000 files), as well as a file format conversion process using ImageMagick. Today, Illinois has over 1600 locally managed volumes queued for ingest, and has submitted over 2300 publicly available titles to the HathiTrust Digital Library.

by Kyle R. Rimkus & Kirk M. Hess at July 21, 2014 05:54 PM

Open Library

Open Library’s been doing that the whole time… for free

Amazon’s “Kindle Unlimited” announcement has been helping raise awareness of Open Library.

Last week, Amazon informed us that for ten dollars per month, Kindle users can have unlimited access to over six hundred thousand books in its library. But it shouldn’t cost a thing to borrow a book, Amazon, you foul, horrible, profiteering enemies of civilization. For a monthly cost of zero dollars, it is possible to read six million e-texts at the Open Library, right now. On a Kindle, or any other tablet or screen thing.

Don’t forget our easy-to-use interface or downloading with your choice of device or software!

[Image: the Sesame Street Book of Nonsense in the BookReader]

by Jessamyn West at July 21, 2014 05:10 PM

James Grimmelmann

Internet Law: Cases and Problems Version 4.0

Version 4.0 of Internet Law: Cases and Problems is now available. This is the 2014 update of my casebook, and it has been a busy year. I produced a special supplemental chapter on the NSA and the Fourth Amendment in December, and it was out of date within a week. The new edition has over twenty new cases and other principal materials and dozens of new questions and problems. Here is a partial list of what’s new:

I have also gone over every question in the book, tightening up wording, removing redundancies, and focusing the inquiries on what really matters. As before, the book is available through Semaphore Press as a pay-what-you-want DRM-free PDF download at a suggested price of $30. The price has stayed the same, but compared with the first edition you now get 55% more casebook for your dollar. The book is still targeted at law students but written, I hope, to be broadly interesting.

Download it while it’s hot!

by James Grimmelmann (james@grimmelmann.net) at July 21, 2014 03:21 PM

Islandora

A Summer in the GTA: iCamp Schedule Now Available

What will be happening at #iCampGTA? Check out the full schedule of events.

Our instructors have prepared a curriculum which blends discussion, hands-on experience, collaboration, and community presentations.

Days One and Three will focus on developments, initiatives, and, of course, you, the Islandora Community. Day Two will be a full-day, hands-on workshop where you can either build and configure a new Islandora site (Admin Track) or learn how to build a custom Islandora module (Developers Track).

Having a hard time deciding on which track to attend?

Admin Group

For repository and collection managers, librarians, archivists, and anyone else who deals primarily with the front-end experience of Islandora and would like to learn how to get the most out of it, or developers who would like to learn more about the front-end experience.

Developer Group

For developers, systems people, and anyone dealing with Islandora at the code-level, or any front-end Islandora users who are interested in learning more about the developer side.

We will be posting slides after the camp for those wanting to check out discussions and material from the other track.


by sfritz at July 21, 2014 03:15 PM

Harvard Library Innovation Lab

Link roundup July 21, 2014

Summertime is the best time to share a few pieces of the Web we’ve enjoyed lately.

Google Is Designing the Font of the Future — NYMag

Motion Silhouette: An Interactive Shadow Picture Book | Colossal

It’s gotta be the shoes

Book smell is back – 25 paper-scented perfumes and candles

Bibliocycle is Boston Public Library’s bicycle based library

by Annie at July 21, 2014 03:14 PM

July 20, 2014

John Miedema

What makes a system cognitive? Conclusion to the Whatson iteration.

In March I asked the question, Can I build Watson Jr in my basement? I performed two iterations of a basement build that I dubbed “What, son?” or “Whatson” for short. In the first iteration, I recreated the Question-Answer system outlined in Taming Text by Ingersoll, Morton, and Farris. In the second iteration, I did a deep dive into a first build of my own, writing code samples for essential parts and charting out architectural decisions. Of course there is plenty more to be done, but I consider the second iteration complete. I have to put the next “Wilson” iteration on hold for a bit as my brain is required elsewhere. I would like to conclude this iteration with a final post that covers what I believe to be the most important question in this emerging field … What makes a system cognitive?

Here are some key features of a cognitive system:

Big Data. Cognitive systems can process large amounts of data from multiple sources in different formats. They are not limited to a well-defined domain of enterprise data but can also access data across domains and integrate it into analytics. One might call this feature “big open data” to reflect its oceanic size and readiness for adventure. You would expect this feature from an intelligent system, just as humans process large amounts of experience outside their comfort zone.

Unstructured Data. Structured data is handled nicely by relational database management systems. A cognitive system extracts patterns from unstructured data, just as human intelligence finds meaning in unstructured experience.

Natural Language Processing (NLP). A true artificial intelligence should be able to process raw sensory experience, and smart people are working on that. An entry-level cognitive system should at least be able to perform NLP on text. Language is a model of human intelligence, and the system should be able to understand Parts of Speech and grammar. The deeper the NLP processing, the smarter the system.

Pattern-Based Entity Recognition. Traditional database systems and even the modern linked data approach rely heavily on arbitrary unique identifiers, e.g., GUID, URI. A cognitive system strives to uniquely identify entities based on meaningful patterns, e.g., language features.

Analytic. Meaning is a two-step between context and focus, sometimes called figure and ground. Interpretation and analytics are cognitive acts, using contextual information to understand the meaning of the focus of attention.

Game Knowledge. Game knowledge is high order understanding of context. A cognitive system does not simply spit out results, but understands the user and the stakes surrounding the question.

Summative. A traditional search system spills out a list of results, leaving the user to sort through them for relevance. A cognitive system reduces the results to the smallest possible number that satisfies the question, and presents them in summary format.

Adaptive. A cognitive system needs to be able to learn. This is expressed in trained models, and also in the ability to accept feedback. A cognitive system uses rules, but these rules are learned “bottom-up” from data rather than “top-down” from hard-wired rules. This approach is probabilistic and associated with a margin of error. To err is human. It allows systems to learn from new experience.

I believe the second Whatson iteration demonstrates these features.

by johnmiedema at July 20, 2014 03:08 AM

July 19, 2014

John Miedema

QA Architecture III: Enrichment and Answer. Playing the game with confidence.

[Figure: 1-3 QA Enrich Answer]

The Question and Answer Architecture of Whatson can be divided into three major processes. Previous posts covered I – Initialization and II – Natural Language Processing and Queries. This post describes the third and final process, III – Enrichment and Answer, as shown in the chart to the right.

  1. Confidence. At this point, candidate results have been obtained from data sources and analyzed for answers. The work has involved a number of Natural Language Processing (NLP) steps that are associated with probabilities. Probabilities at different steps are combined to calculate an aggregate confidence for a result. There will be one final confidence value for each result obtained from each data source. The system must decide if it has the confidence to risk an answer. The risk depends on Game Rules; in Jeopardy, IBM’s Watson was penalized for a wrong answer. (A rough sketch of this confidence check follows the list below.)
  2. Spell Correction. If the confidence is low, the system can check the original question text for probable spelling mistakes. A corrected query can be resubmitted to Process 2 to obtain new search results, hopefully with higher confidence. Depending on the Game being played, a system might suggest spell correction before the first search is submitted, i.e., Did You Mean … ?
  3. Synonyms. If the confidence is still low, the system can expand the original question text with synonyms. E.g., ‘writer’ = ‘author’. The query is submitted, with the intent of obtaining higher confidence in the results.
  4. Clue Enrichment Automatic. The system is built to understand unstructured text and respond with answers. This build can be used to enrich a question with additional clues. Suppose a person asked for the author of a particular quote. The quote might be cited by several blog authors, but the system could deduce that the question refers to the primary or original author.
  5. Clue Enrichment Dialog. If all else fails the system will admit it does not know the answer. Depending on the Game, the system could ask the user to restate the question with more clues.
  6. Answer. Once the confidence level is high enough, the system will present the Answer. In a Game like Jeopardy only one answer is allowed. Providing only one answer is also a design goal, i.e., the system should be smart enough to know the answer, and not return pages of search results. In some cases, a smart system should return more than one answer, e.g., if there are two different but equally probable answers. The format of the answer will depend on the Game. It makes sense to utilize templates to format the answer in a natural language format. Slapping on text-to-speech will be easy at this point.
  7. Evidence. Traditional search engines typically highlight keywords embedded in text snippets. The user can read the full document and try to evaluate why a particular result was selected. In a cognitive system, a single answer is returned based on a confidence value. It can demonstrate why the answer was selected. A user might click on an “Evidence” link to see detailed information about the decision process and supporting documents.
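
To make the Confidence step above a little more concrete, here is a minimal Ruby sketch of combining per-step probabilities into an aggregate confidence and letting a Game-Rule threshold decide whether to risk an answer. The method names and the independence assumption are mine, not Whatson’s; treat it as an illustration only.

# Combine the probabilities attached to each NLP step for one candidate result.
# Multiplying assumes the steps are independent; a real system might weight or
# train this combination instead.
def aggregate_confidence(step_probabilities)
  step_probabilities.reduce(1.0) { |product, p| product * p }
end

# Game Rules set how much confidence is needed before risking an answer
# (e.g., Jeopardy penalizes wrong answers, so the threshold is high).
def answer_or_enrich(candidates, risk_threshold: 0.7)
  best = candidates.max_by { |c| aggregate_confidence(c[:step_probabilities]) }
  confidence = aggregate_confidence(best[:step_probabilities])
  if confidence >= risk_threshold
    { answer: best[:text], confidence: confidence }
  else
    # Low confidence: fall back to spell correction, synonyms, or clue enrichment.
    { answer: nil, next_step: :enrich }
  end
end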

This post concludes the description of the three processes in The Question and Answer Architecture of Whatson.

by johnmiedema at July 19, 2014 07:56 PM

Eric Lease Morgan

Fun with Koha

These are brief notes about my recent experiences with Koha.

Introduction

As you may or may not know, Koha is a grand daddy of library-related open source software, and it is an integrated library system to boot. Such are no small accomplishments. For reasons I will not elaborate upon, I’ve been playing with Koha for the past number of weeks, and in short, I want to say, “I’m impressed.” The community is large, international, congenial, and supportive. The community is divided into a number of sub-groups: developers, committers, commercial support employees, and, of course, librarians. I’ve even seen people from another open source library system (Evergreen) provide technical support and advice. For the most part, everything is on the ‘Net, well laid out, and transparent. There are some rather “organic” parts to the documentation akin to an “English garden”, but that is going to happen in any de-centralized environment. All in all, and without any patronizing intended, “Kudos to Koha!”

Installation

Looking through my collection of tarballs, I see I’ve installed Koha a number of times over the years, but this time it was challenging. Sparing you all the details, I needed to use a specific version of MySQL (version 5.5), and I had version 5.6. The installation failure was not really Koha’s fault. It is more the fault of MySQL because the client of MySQL version 5.6 outputs a warning message to STDOUT when a password is passed on the command line. This message confused the Koha database initialization process, thus making Koha unusable. After downgrading to version 5.5 the database initialization process was seamless.

My next step was to correctly configure Zebra — Koha’s default underlying indexer. Again, I had installed from source, and my Zebra libraries, etc. were saved in a directory different from the configuration files created by the Koha’s installation process. After correctly updating the value of modulePath to point to /usr/local/lib/idzebra-2.0/ in zebra-biblios-dom.cfg, zebra-authorities.cfg, zebra-biblios.cfg, and zebra-authorities-dom.cfg I could successfully index and search for content. I learned this from a mailing list posting.

Koha “extras”

Koha comes (for free) with a number of “extras”. For example, the Zebra indexer can be deployed as both a Z39.50 server as well as an SRU server. Turning these things on was as simple as uncommenting a few lines in the koha-conf.xml file and opening a few ports in my firewall. Z39.50 is inherently unusable from a human point of view so I didn’t go into configuring it, but it does work. Through the use of XSL stylesheets, SRU can be much more usable. Luckily I have been here before. For example, a long time ago I used Zebra to index my Alex Catalogue as well as some content from the HathiTrust (MBooks). The hidden interface to the Catalogue sports faceted searching and used to support spelling corrections. The MBooks interface transforms MARCXML into simple HTML. Both of these interfaces are quite zippy. In order to get Zebra to recognize my XSL I needed to add an additional configuration directive to my koha-conf.xml file. Specifically, I needed to add a docpath element to my public server’s configuration. Once I re-learned this fact, implementing a rudimentary SRU interface to my Koha index was easy and results are returned very fast. I’m impressed.

My big goal is to figure out ways Koha can expose its content to the wider ‘Net. To this end Koha comes with an OAI-PMH interface. It needs to be enabled, and can be done through the Koha Web-based backend under Home -> Koha Administration -> Global Preferences -> General Systems Preferences -> Web Services. Once enabled, OAI sets can be created through the Home -> Administration -> OAI sets configuration module. (Whew!) Once this is done Koha will respond to OAI-PMH requests. I then took it upon myself to transform the OAI output into linked data using a program called OAI2LOD. This worked seamlessly, and for a limited period of time you can browse my Koha’s cataloging data as linked data. The viability of the resulting linked data is questionable, but that is another blog posting.
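
Once OAI-PMH is turned on, it is easy to sanity-check from a script. Below is a small Ruby sketch that issues a ListRecords request; the host name is made up, and the endpoint path shown is only the usual Koha default, so adjust both for your own installation.

# Issue a simple OAI-PMH ListRecords request against a Koha server.
require 'net/http'
require 'uri'

endpoint = URI('http://koha.example.org/cgi-bin/koha/oai.pl')
endpoint.query = URI.encode_www_form(
  verb: 'ListRecords',       # standard OAI-PMH verb
  metadataPrefix: 'oai_dc'   # simple Dublin Core records
)

response = Net::HTTP.get_response(endpoint)
puts response.code
puts response.body[0, 500]   # peek at the beginning of the XML payload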

Ideas and next steps

Library catalogs (OPACs, “discovery systems”, whatever you want to call them) are not simple applications/systems. They are a mixture of very specialized inventory lists, various types of people with various skills and authorities, indexing, and circulation, etc. Then we — as librarians — add things like messages of the day, record exporting, browsable lists, visualizations, etc. that complicate the whole thing. It is simply not possible to create a library catalog in the “Unix way”. The installation of Koha was not easy for me. There are expenses with open source software, and I all but melted down my server during the installation process. (Everything is now back to normal.) I’ve been advocating open source software for quite a while, and I understand the meaning of “free” in this context. I’m not complaining. Really.

Now that I’ve gotten this far, my next step is to investigate the feasibility of using a different indexer with Koha. Zebra is functional. It is fast. It is multi-faceted (all puns intended). But configuring it is not straightforward, and its community of support is tiny. I see from rooting around in the Koha source code that Solr has been explored. I have also heard through the grapevine that ElasticSearch has been explored. I will endeavor to explore these things myself and report on what I learn. Different indexers, with more flexible APIs, may make the possibility of exposing Koha content as linked data more feasible as well.

Wish me luck.

by Mini-musings at July 19, 2014 06:16 PM

July 18, 2014

Dan Scott

Posting on the Laurentian University library blog

Since returning from my sabbatical, I've felt pretty strongly that one of the things our workplace is lacking is open communication about the work that we do--not just outside of the library, but within the library as well. I'm convinced that the more that we know about the demands on each other's time and the goals that we're trying to achieve, the more likely we'll be able to work together towards the same goals and have a better understanding of each other's challenges.

Towards that end, I've decided to try maintaining a work blog so that my colleagues will have a better idea about what I've been up to. I wouldn't be surprised if some of my peers think that I sit in my office all day browsing the internet (which, actually, happens sometimes, but I swear I'm doing it to try and find a solution to a problem!), because the day-to-day work of a systems librarian can be pretty esoteric. And when you know that they have many expectations for you to fix the many small annoyances they have to deal with, it might help them develop some empathy if they understand what you are actually spending your time on.

Anyway, I decided not to mirror the content here because, well, it's probably too site-specific to really be of interest to you, my dear readers. Whoever you are. However, I will link to the two entries that I've cranked out so far; you can decide if you want to follow along from there:

by dan@coffeecode.net (Dan Scott) at July 18, 2014 09:40 PM

District Dispatch

ALA, ACRL file network neutrality comments with FCC

Today the American Library Association (ALA) and the Association of College and Research Libraries (ACRL) urged the Federal Communications Commission (FCC) to adopt the legally enforceable network neutrality rules necessary to fulfill library missions and serve communities nationwide. The ALA and ACRL joined nine other national higher education and library organizations in filing joint public comments (pdf) with the FCC.

The joint comments build on the ALA resolution adopted by Council at the 2014 Annual Conference and align with the 2014 legislative agenda developed by ACRL. They also provide greater detail for the network neutrality principles released July 10 and suggest ways to strengthen the FCC’s proposed rules (released May 15, 2014) to preserve an open internet for libraries, higher education and the communities they serve. For instance, the FCC should:

The joint comments mark another definitive statement on behalf of all types of libraries and the communities we serve, but are simply one more step in a long journey toward our goal. There’s more to be done, and librarians can make their voices heard in a number of ways:

  1. Email the ALA Washington Office (lclark[at]alawash[dot]org) with examples of Internet Service Provider (ISP) slowdowns, lost quality of service relative to your subscribed ISP speeds, and any other harm related to serving your community needs. Alternately, please share examples of potential harm if we do not preserve the open internet (e.g., impact on cloud-based services and/or ability to disseminate digitized or streaming content on an equal footing with commercial content providers that otherwise might pay for faster “lanes” for their content over library content).
  2. Ask your board to support and/or adopt the network neutrality principles. Several people in attendance at the Annual Conference program on the topic suggested this, and the ALA Washington Office will develop and share a template for this purpose in the coming weeks.

The post ALA, ACRL file network neutrality comments with FCC appeared first on District Dispatch.

by Larra Clark at July 18, 2014 04:58 PM

Islandora

iCamp GTA: Meet your Instructors

#iCampGTA will begin in just 19 days, and we are so excited to see participation from over a dozen institutions in this summer camp!

We are pleased to introduce the team ready to lead these workshops:

Admin Track:

Kirsta Stapelfeldt, once the Repository Manager of the Islandora Project, is currently the Digital Scholarship Unit Coordinator at the University of Toronto Scarborough. She is also an active member of the Roadmap Committee, a frequent guest-blogger on this site, and your maven of all things Islandora admin-interface.

David Wilcox is the Product Manager for the Fedora Project and a long-term member of both the Fedora and Islandora communities. He is a representative on the Islandora Roadmap Committee, a convenor for the Fedora 4 Interest Group, and regularly provides support and mentoring to new and current repository users.

Developer Track:

Nick Ruest is the Digital Assets Librarian at York University. He is also a member of the Islandora Roadmap Committee, a prolific committer and contributor of new modules to the community, and is a convenor for the Preservation Interest Group. Nick blogs at ruebot.net.

Jordan Dukart is a developer at discoverygarden and a new addition to the Islandora instructor circuit. He is an active member within the Islandora community, a dedicated committer, an established contributor of new modules and master of the pull request.

by sfritz at July 18, 2014 03:47 PM

July 17, 2014

Nicole Engard

Bookmarks for July 17, 2014

Today I found the following resources and bookmarked them.

Digest powered by RSS Digest

The post Bookmarks for July 17, 2014 appeared first on What I Learned Today....

by Nicole C. Engard at July 17, 2014 08:30 PM

CrossRef

Persistence at CrossRef

CrossRef's promise is persistent scholarly citation, and in order to fulfill that promise, CrossRef takes persistence seriously. Persistence depends equally on the technology of CrossRef's systems and services and the social infrastructure of its membership agreement.

CrossRef has taken steps in four areas to ensure persistence:

Continue reading about the steps here.

by Anna Tolwinska at July 17, 2014 08:23 PM

New CrossRef Members

Updated July 17, 2014

Voting Members
Codon Publications
Croatian Dairy Journal
Encuentros/Encounters/Rencontres on Education
Instituto Educacional Piracicabano da Igreja Metodista
Instituto Metodista de Ensino Superior
Instituto Metodista de Servicos Educacionais
Instituto Metodista Izabela Hendrix
Instituto Porto Alegre da Igreja Metodista
Journal of Exercise Therapy and Rehabilitation
Journal of the ASEAN Federation of Endocrine Societies (JAFES)
PNRPU Publishing Office
Precast/Prestressed Concrete Institute
Research and Development Centre for Marine and Fisheries Product Processing and Biotechnology
University and Research Librarians' Association
Wyzsza Szkola Spoleczno-Przyrodnicza im. Wincentego Pola w Lublinie (Vincent Pol University)

Sponsored Members
Aleksandras Stulginskis University
Institute of Liquid Atomization and Spray Systems - Korea
Instituto Educacional Piracicabano da Igreja Metodista
Instituto Metodista de Ensino Superior
Instituto Metodista de Servicos Educacionais
Instituto Metodista Izabela Hendrix
Instituto Porto Alegre da Igreja Metodista
Japanese Political Science Review
Korean Academy of Esthetic Dentistry
Korean Academy of Traditional Oncology
Korean Society of Medical History
Max Weber Studies
Medicinos Mintis
Quotus Publishing
Revista Cientifica de Producao Animal
Revista de Ensino de Engenharia
Schleuen Verlag
Society of Allied Health Services
Society of Conservatoire at Jardin Botaniques de la Ville de Geneve
The Korean Society for Marine Biotechnology
The Korean Society of Earth Science Education
The Korean Society of School Health
The Society for Chromatographic Sciences
Todas as Letras: Revista de Lingua e Literatura

Represented Members

The Korean Society for Biomedical Laboratory Sciences
Vellalar College for Women
Indian Association of Health, Research, and Welfare (IAHRW)
International Society of University Colon and Rectal Surgeons
The Korean Association of Political Science and Communication

Last update July 8, 2014

Voting Members
Centers for Disease Control MMWR Office
Co. Ltd. Ukrinformnauka
Fundacion para el Analisis Estrategico y Desarrollo de la Pequena y Mediana Empresa (FAEDPYME)
Fundacion Universidad de Oviedo
Geography Department, Alexandru Ioan Cuza University of Iasi
Group of Companies Med Expert, LLC
Intercom - Sociedade Brasileira de Estudos Interdisciplinares da Comunicacao
Medicinos Mintis
Science Gate Publishing PC
Universidad de Navarra

Represented Members

Admiral Makarov National University of Shipbuilding
Asian Business Consortium
Global Journal of Enterprise Information System
Indian Journal of Peritoneal Dialysis
National Institute for Health Research
Private Company Technology Center
The Korean Society for Microsurgery
The Korean Society of Art Theories
Zaporizhzhia National Technical University

by Anna Tolwinska at July 17, 2014 08:14 PM

Jonathan Rochkind

ActiveRecord Concurrency in Rails4: Avoid leaked connections!

My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.

I’m in the middle of updating my app, which uses multi-threaded concurrency in unusual ways, to Rails4. The good news is that the significant bugs I ran into in Rails 3.1 etc., reported in the earlier post, have been fixed.

However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.

Background: The ActiveRecord Concurrency Model

Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.

Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.

You can check out a connection explicitly using the `checkout` and `checkin` methods. Or, better yet, use the `with_connection` method to wrap database use. So far so good.
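
For readers who haven’t touched this API directly, here is a minimal sketch of the two styles. The `checkout`, `checkin`, and `with_connection` methods are the real ConnectionPool API; the queries are just placeholders.

# Explicit checkout/checkin: you are responsible for the checkin, even on error.
conn = ActiveRecord::Base.connection_pool.checkout
begin
  conn.execute("SELECT 1")
ensure
  ActiveRecord::Base.connection_pool.checkin(conn)
end

# with_connection: checkout and checkin are handled for you, even if the
# block raises. This is the form you almost always want.
ActiveRecord::Base.connection_pool.with_connection do |connection|
  connection.execute("SELECT 1")
end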

But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.

And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)

And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before.  A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.

And after the request has been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back in connections, if any, that were checked out.

The danger of leaked connections

So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).

But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do. If you proceed blithely to use AR like you are used to in Rails, but have created Threads yourself — then connections will be automatically checked out to you when needed… and never checked back in.

The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.

And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it.  If you leak one connection like this, that’s one less connection available in the ConnectionPool.  If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.
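
Concretely, the safe pattern in a hand-rolled thread looks something like the sketch below. `Widget` is a made-up model name; the point is just that every ActiveRecord touch is wrapped in `with_connection`, so nothing can leak even if the thread dies.

threads = 5.times.map do |i|
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      # All AR work happens inside the block; the connection is checked
      # back in when the block exits, even if it raises.
      Widget.where(batch: i).find_each { |widget| widget.touch }
    end
    # Outside the block this thread holds no checked-out connection,
    # so there is nothing to leak if the thread ends here.
  end
end
threads.each(&:join)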

In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would  go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed.   You could call this method from time to time yourself to try and clean up after yourself.

And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.

But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually.  As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.

So this makes it pretty important to avoid leaking connections.

(Note: There is still a method `clear_stale_cached_connections` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup. The fact that it uses the same method name is, I think, based on a misunderstanding by Rails devs of what it’s doing. See Fear the Reaper below.)

Monkey-patch AR to avoid leaked connections

I understand where Rails is coming from with the ‘implicit checkout’ thing.  For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).

So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.

The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection, that never gets checked back in, gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.

That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, and so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).

So, here, in a gist, is a couple-dozen-line monkey patch to ActiveRecord that lets you, on a thread-by-thread basis, disable the “implicit checkout”. Apply this monkey patch (just throw it in a config/initializers file, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:

Thread.new do 
   ActiveRecord::Base.forbid_implicit_checkout_for_thread!

   # stuff
end

Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.

If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.

This way you can enforce your code to only use `with_connection` like it should.

Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.

DO fear the Reaper

In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections.  In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”

The problem is, as far as I can tell by reading the code, it simply does not do this.

What does the reaper do?  As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.

A leaked connection hasn’t necessarily dropped its network connection. That really depends on the database and its settings — most databases will drop unused connections after a certain idle timeout, by default often hours long. A leaked connection probably hasn’t yet had its network connection closed, and a properly checked out, not-leaked connection can have its network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).

The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network connections, not checked-out-but-never-checked-in leaked connections). A dropped network connection is a legit problem you want handled gracefully; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, and I have not put it through its paces myself). But it’s got nothing to do with leaked connections.

Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)

So, yeah, Rails4 has redefined the existing `clear_stale_cached_connections!` method to do something entirely different than it did in Rails3, and it’s triggered in entirely different circumstances. Yeah, kind of confusing.

Oh, maybe fear ruby 1.9.3 too

When I was working on upgrading the app, I was occasionally getting a mysterious deadlock exception:

ThreadError: deadlock; recursive locking:

In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, the fact that my errors resulted in that exception, rather than a more meaningful one, may have been a bug in ruby 1.9.3 that’s fixed in ruby 2.0.

If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.

Can you use an already loaded AR model without a connection?

Let’s say you’ve already fetched in an AR model. Can a thread then use it, read-only, without ever trying to `save`, and without needing a connection checkout?

Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.

Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.

But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).

Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection, it doesn’t really consider that one way or another.

Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.

Concurrency Patterns to Avoid in ActiveRecord?

Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).

So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.

If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.

I think you’re probably fairly safe doing that too, and that’s the way background task pools are often set up.
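
A rough sketch of that kind of setup, as I understand it (the worker count and the job queue are invented for illustration; the key point is that the `pool` size in database.yml needs to be at least as large as the number of threads):

require "thread"         # for Queue on older rubies

JOB_QUEUE    = Queue.new # whatever feeds work to the pool
WORKER_COUNT = 5         # keep this <= the `pool` size in database.yml

workers = WORKER_COUNT.times.map do
  Thread.new do
    # Each worker reserves one AR connection for its whole lifetime.
    ActiveRecord::Base.connection_pool.with_connection do
      while (job = JOB_QUEUE.pop) != :shutdown
        job.call         # AR calls in here reuse the held connection
      end
    end
  end
end

# ...push callable jobs onto JOB_QUEUE, then shut down:
WORKER_COUNT.times { JOB_QUEUE.push(:shutdown) }
workers.each(&:join)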

That’s not what my app does.  I wouldn’t necessarily design my app the same way today if I was starting from scratch (the app was originally written for Rails 1.0, which gives you a sense of how old some of its design choices are; although the concurrency-related stuff really only dates from the relatively recent Rails 2.1 (!)).

My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote APIs, which is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communication go through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)

So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine: just wrap all the AR access in a `with_connection`.  But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and with parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.

I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point,  I’d recommend avoiding using ActiveRecord concurrency this way.

What to do?

What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.

At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.”  DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).

Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).

I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel).  I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.

So what would I do different? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other, safer main thread, convert it to a pure business object without any AR, and pass that into my worker threads.  Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for its entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated thread pool of writers.
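
In rough sketch form, something like the below; the names (`Result`, `jobs`, `job.run`) are invented placeholders, and only the small fixed writer pool ever touches ActiveRecord:

require "thread"

results = Queue.new   # threadsafe queue between workers and writers

# Small fixed pool of writers; each holds one AR connection for its lifetime.
writers = 2.times.map do
  Thread.new do
    ActiveRecord::Base.connection_pool.with_connection do
      while (output = results.pop) != :done
        Result.create!(output)   # `Result` stands in for whatever AR model stores output
      end
    end
  end
end

# Indeterminate number of short-lived workers; they never touch AR directly.
workers = jobs.map do |job|
  Thread.new { results.push(job.run) }   # `jobs` / `job.run` stand in for the plugin work
end

workers.each(&:join)
writers.size.times { results.push(:done) }
writers.each(&:join)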

That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I’m currently using, patterns that AR officially supports but isn’t very happy about.


Filed under: General

by jrochkind at July 17, 2014 07:59 PM

CrossRef

CrossRef Indicators

Updated July 15, 2014

Total no. participating publishers & societies: 5090
Total no. voting members: 2407
% of non-profit publishers: 57%
Total no. participating libraries: 1883
No. journals covered: 35,286
No. DOIs registered to date: 68,198,387
No. DOIs deposited in previous month: 552,871
No. DOIs retrieved (matched references) in previous month: 34,385,296
DOI resolutions (end-user clicks) in previous month: 98,365,532

by Anna Tolwinska at July 17, 2014 07:57 PM

District Dispatch

FCC Chairman Tom Wheeler speaks on your libraries, E-rate and calls for improvement

During the ALA Annual Conference that wrapped  up a few weeks ago, the Washington Office secured a video from the Chairman of the Federal Communications Commission, Tom Wheeler. We are pleased to share four clips from the video for use in your own advocacy work on the importance of high-capacity broadband and the E-rate program for your libraries and the communities you serve.

“The staff at the FCC is appreciative of the commitment ALA has shown to engaging in the modernization process while also remaining staunch advocates for their members. We hope the Chairman’s message is helpful to all of the work you do on behalf of your libraries,” Gigi Sohn, Special Counsel for External Affairs to Chairman Wheeler told us.

In these videos, the Chairman speaks about the changing nature of libraries today, emphasizing the importance of providing access to digital resources, technology, and free public Wi-Fi. He states

Libraries are where large numbers of Americans go to get online. And it’s not just to research information. Libraries are where Americans go to apply for their VA benefits, or apply for their healthcare benefits, or apply for jobs… Put another way, as I have learned, libraries complete Education, jump start Employment and Entrepreneurship, Empower people of all ages and backgrounds and foster community Engagement—“The 5 E’s of Libraries.”

The four clips take listeners through:

In addition to the short clips, we are pleased to provide the full transcript (pdf) to the Chairman’s remarks during the video for Annual Conference.

ALA has engaged deeply with the Commission on the E-rate proceeding and expects to participate fully in the coming weeks as we move from the order adopted on July 11 (not publicly released as of this writing) into the next phase of this multi-step process. We take heed of the Chairman’s call to action:

So my request to ALA is simple – let’s work together to get this process in motion starting now. Let’s make some meaningful improvements to the program for libraries starting now. And let’s keep working together over the coming months to address those issues we don’t tackle in this order as part of an ongoing process to make the E-rate program work as well as it possibly can for libraries.

The post FCC Chairman Tom Wheeler speaks on your libraries, E-rate and calls for improvement appeared first on District Dispatch.

by Marijke Visser at July 17, 2014 07:33 PM

James Grimmelmann

Three Letters About the Facebook Study

My colleague Leslie Meltzer Henry and I have sent letters asking three institutions—the Proceedings of the National Academy of Sciences, the federal Office for Human Research Protections, and the Federal Trade Commission—to investigate the Facebook emotional manipulation study. We wrote three letters, rather than one, because responsibility for the study was diffused across PNAS, Cornell, and Facebook, and it is important that each of them be held accountable for its role in the research. The letters overlap, but each has a different focus.

Our letters deal with cleaning up the mistakes of the past. But they also look to the future. The Facebook emotional manipulation study offers an opportunity to put corporate human subjects research on a firmer ethical footing, one in which individuals are given meaningful informed consent and in which there is meaningful oversight. We invite PNAS, OHRP, and the FTC to take leading roles in establishing appropriate ethical rules for research in an age of big data and constant experiments.

UPDATE, July 17, 2014, 1:30 PM: I am reliably informed that Cornell has “unchecked the box”; its most recent Federalwide Assurance now commits to apply the Common Rule only to federally funded research, not to all research undertaken at Cornell. (I made the mistake of relying on the version of its FWA that the Cornell IRB posted on its own website. I regret the error.) This affects the issue of the OHRP’s jurisdiction, but not the soundness of the Cornell IRB’s reasoning, which rested on the activities of Cornell affiliates rather than on the source of funding.

by James Grimmelmann (james@grimmelmann.net) at July 17, 2014 05:18 PM

John Miedema

QA Architecture II: Natural Language Processing and Queries. Context-Focus pairing of the question and results.

The Question and Answer Architecture of Whatson can be divided into three major processes: I – Initialization, II – Natural Language Processing and Queries, and III – Enrichment and Answer. This post describes the second process, as shown in the chart:

[Chart: Process II – Natural Language Processing and Queries]

  1. Context for the Question. There are two pairs of green Context and Focus boxes. The first pair is about Natural Language Processing (NLP) for the Question Text. Context refers to all the meaningful clues that can be extracted from the question text. The Initialization process determined that the domain of interest is English Literature. In this step, custom NLP models will be used to recognize domain entities: book titles, authors, characters, settings, quotes, and so on.
  2. Focus for the Question. The Context provides known facts from the question and helps determine what is not known, i.e., the focus. The Focus is classified as a type, e.g., a question about an author, a question about a setting.
  3. Data Source Identification. Once the question has been analyzed into entities, the appropriate data sources can be selected for queries. The Data Source Catalog associates sources with domain entities. More information about the Catalog can be found under the discussion of the Tank-less architecture.
  4. Queries. Once the data sources have been identified, queries can be constructed using the Context and Focus entities as parameters (see the sketch after this list). Results are obtained from each source.
  5. Parts of Speech. Basic parts of speech (POS) analysis is performed on the results, just like in the Initialization process.
  6. Context for the Results. The second pair of green Context and Focus boxes is for the Results text. Domain entities are extracted from the results. Now the question and answer can be lined up to find relevant results.
  7. Focus for the Results. The final step is to resolve the focus, asked by the question and hopefully answered by the result. The basic work is matching up entities in the Question and Results. Additional cognitive analysis may be applied here.
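
As a very rough illustration of steps 3 and 4, data source selection and query construction might look something like this sketch; the catalog contents, entity types, and query format here are invented for illustration and are not Whatson’s actual implementation:

# Hypothetical catalog mapping domain entity types to data sources.
DATA_SOURCE_CATALOG = {
  author:  ["dbpedia", "openlibrary"],
  setting: ["dbpedia"],
  quote:   ["wikiquote"]
}

# Build one query per relevant source, using the Context entities as
# parameters and the Focus type as the thing being asked for.
def build_queries(context_entities, focus_type)
  DATA_SOURCE_CATALOG.fetch(focus_type, []).map do |source|
    { source: source, params: context_entities, asking_for: focus_type }
  end
end

# e.g. build_queries({ title: "Nostromo", character: "Charles Gould" }, :author)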

The third and final post will describe how the system evaluates results before offering an answer.

by johnmiedema at July 17, 2014 03:22 PM

OCLC Dev Network

Corrected: Systems Maintenance on July 20

Please see below where AEDT times have been corrected from our initial post.

 

by Shelley Hostetler at July 17, 2014 03:00 PM

Take a Look: Updated VIAF API Documentation

As we were preparing for today's (free) VIAF API Workshop - at 1pm ET - we decided it would be a good opportunity to do a thorough review of the documentation. Working closely with Ralph LeVan, we've made some significant changes to the VIAF API documentation so that you can get a better understanding of how the API works and all of the operations it supports.

by Shelley Hostetler at July 17, 2014 01:00 PM

DuraSpace News

SURVEY: Help Library of Congress Improve Digital Preservation Outreach and Education

From Howard Barrie, IT Project Manager, Library of Congress

Washington, DC: How prepared is your organization to provide long-term, durable access to its mission-critical digital content? What skills and experience do staff need to address the digital preservation needs of their organization?

by carol at July 17, 2014 01:00 AM

Towards Research Assessment Metrics: NISO Releases Draft Altmetrics White Paper

From Cynthia Hodgson, Technical Editor/Consultant, National Information Standards Organization (NISO)
 

by carol at July 17, 2014 01:00 AM

July 16, 2014

District Dispatch

No amendment offered to limit FCC’s funding ability on E-rate.

The ALA Washington Office is getting word from the Hill that a possible amendment to the House Financial Services Appropriations bill that would limit the Federal Communications Commission’s (FCC) ability to increase funding to the federal E-rate program has been withdrawn. Please see Monday’s District Dispatch post.

Sources in the House Democratic Leadership tell ALA that the majority chose not to proceed with the amendment due to widespread opposition from the library and school community.

Thank you to everyone that made calls to their House Member to defeat this amendment.

The post No amendment offered to limit FCC’s funding ability on E-rate. appeared first on District Dispatch.

by Jeffrey Kratz at July 16, 2014 08:38 PM

Thom Hickey

Exploring Golang

Here is yet another blog post giving first impressions of a new language and comparing it with one the writer is familiar with.  In this case, comparing Google's Go language with Python.

A few of us at OCLC have been using Python fairly extensively for the last decade.  In fact, I have the feeling that I used to know it better than I do now, as there has been a steady influx of features into it, not to mention the move to Python 3.0.

Go is relatively new, but caught my eye because of some groups moving their code from Python to Go.  Looking at what they were trying to do in Python, one wonders why they thought Python was a good fit, but maybe you could say the same thing about how we use Python.  All of the data processing for VIAF is done in Python, as well as much of the batch processing that FRBRizes WorldCat.  We routinely push 1.5 billion marcish records through it, processing hundreds or even thousands of gigabytes of data.

We use Python because of ease and speed of writing and testing.  But, it's always nicer if things run faster and Go has a reputation as a reasonably efficient language.  The first thing I tried was to just write a simple filter that doesn't do anything at all, just reads from standard input and writes to standard output one line at a time.  This is basic to most of our map-reduce processing and I've had more than one language fail this test (Clojure comes to mind).  In Python it's simple and efficient:

import sys
for line in sys.stdin:
    sys.stdout.write(line)

The Python script reads a million authority records (averaging 5,300 bytes each) in just under 10 seconds, or about 100,000 records/second.

Go takes a few more lines, but gets the job done fairly efficiently:

package main

import (
    "bufio"
    "io"
    "os"
)

func main() {
    ifile := bufio.NewReader(os.Stdin)
    for {
        line, err := ifile.ReadBytes('\n')
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
        os.Stdout.Write(line)
    }
}

The Go filter takes at least 16 seconds to read the same file, about 62,000 records/second.  Not super impressive and maybe there is a faster way to do it, but fast enough so that it won't slow down most jobs appreciably.

My sample application was to read in a file of MARC-21 and UNIMARC XML records, parse it into an internal data structure, then write it out as JSON.

I had already done this in Python, but it took some effort to do it in Go.  The standard way of parsing XML in Go is very elegant (adding annotations to structures to show how the XML parser should interpret them), but it turned out to both burn memory and run very slowly.  A more basic approach, processing the elements as a stream that the XML package is glad to send you, was both more similar to how we were doing it in Python and much more efficient in Go.  Although there have been numerous JSON interpretations of MARC (and I've done my own), I came up with one more: a simple list (array) of uniform structures (dictionaries in Python) that works well in both Python and Go.

Overall, the Go code was slightly more verbose, mostly because of its insistence on checking return codes (rather than Python's tendency to rely on exceptions), but very readable.  Single-threaded code in Go turns about a thousand MARC XML records/second into JSON.

Which sounds pretty good until you do the same thing in Python and it transforms them at about 1,700 records/second.  I profiled the Go code (nice profiler) and found that at least 2/3 of the time was in Go's XML parser, so no easy speedups there.  Rather than give up, I decided to try out goroutines, Go's rather elegant way of launching routines on a new thread.

Go's concurrency options seem well thought out.  I set up a pool of goroutines to do the XML parsing and managed to get better than a 5x speedup (on a 12-core machine). That would be worthwhile, but we do most of our processing in Hadoop's map-reduce framework, so I tested it in that.  The task was to read 47 million MARC-21/UNIMARC XML authority records stored in 400 files and write the resultant JSON to 120 files.

Across our Hadoop cluster we typically run a maximum of 195 mappers and 195 reducers (running out of memory on Linux is something to avoid!).  The concurrent Go program was able to do the transform of the records in about 7 minutes (at least a couple minutes of that is pushing the output through a simple reducer), and the machines were very busy, substantially busier than when running the equivalent single-threaded Python code.  Somewhat to my surprise, the Python code did the task in 6.5 minutes.  Possibly a single-threaded Go program could be sped up a bit, but my conclusion is that Go offers minimal speed advantages over Python for the work we are doing.  The fairly easy concurrency is nice, but map-reduce is already providing that for us.

I enjoyed working with it, though.  The type safety sometimes got in the way (I especially missed Python dictionaries/maps, which are indifferent to the types of the keys and values), but at other times the type checking caught errors at compile time.  The standalone executables are convenient to move around, the profiler was easy to use, and I really liked that it has a standard source file formatter.  I didn't try the built-in documentation generator, but it looked simple and useful, as does the testing facility.  The libraries aren't as mature as Python's, but they are very good.  We never drop down into C for our Python work, but we do depend on libraries written in C, such as the XML parser cElementTree.  It would be nice to have a system where everything could be done at the same level (Julia or PyPy?), but right now we're still happy with straight Python and feel that its speed seldom gets in the way.

If nothing else, I learned a bit about Go and came up with a simple JSON MARC format that works quite a bit faster in Python (and Go) than my old one did.

--Th

The machine I used for standalone timings is a dual 6-core 3.1 GHz AMD Opteron box and runs Linux 2.6.18 (which precluded loading Go 1.3, so I used 1.1).  I got similar (but slower) timings with Go 1.3 on my 64-bit quad-core 2.66GHz Intel PC running Windows 7, so I don't think that using Go 1.3 would have made much of a difference.  Both the Go and Python programs were executed as streaming map-reduce jobs across 39 dual quad-core 2.6 GHz AMD machines running Cloudera 4.7.

by Thom at July 16, 2014 08:06 PM

District Dispatch

President called by ALA and partners to again threaten veto of privacy-hostile “cybersecurity” info sharing legislation advancing in Senate

As recently reported, Senate Intelligence Committee “markup” and approval of the privacy-hostile Cybersecurity Information Sharing Act of 2014 (CISA), S. 2588, was delayed until after Congress’ brief July 4 recess . . . but not for long.  Again meeting in secret, the Committee approved a somewhat modified (but insufficiently improved) version of the bill on July 10.

Like three other similar bills introduced in the past four  years, CISA is intended to head off and remediate hacking and other threats to communications and government electronic networks by authorizing private communications companies to share evidence of those “cybersecurity threats” with multiple arms of the federal government.  To enable and encourage such reporting, however, it also effectively immunizes those companies against any legal action that might be brought against them by individual customers whose private information is disclosed without their permission.

So what’s wrong with preventing and blunting cyber-attacks?  Nothing, except that CISA and its predecessors foster that laudable objective in the most overbroad way possible without building in important, entirely reasonable and wholly achievable safeguards for Americans’ privacy.  As ALA’s coalition partners at the Open Technology Institute of the New America Foundation and Electronic Frontier Foundation have pointed out in new analyses of the bill, as passed by the Senate Intelligence Committee, CISA:

With CISA now reportedly supported not just by the intelligence community but by powerful interests in the banking, securities and other industries, ALA and its coalition partners are concerned that it could be among the few bills that the Senate actually takes up in the waning days of the current (pre-August break) legislative session and the current Congress, which is likely to adjourn not long after Labor Day until after the November 2014 mid-term elections.  Accordingly, we and our partners yesterday delivered a letter to President Obama calling on him to publicly indicate that he will veto CISA, or any similarly overbroad and dangerously imbalanced “cybersecurity” legislation that fails to much more fully protect all of our personal privacy.  The President issued such a statement in 2012 regarding similar legislation.

ALA and its partners will also continue, of course, to fight CISA in the Senate and it’s entirely likely that we’ll need your help.  Sign up now to learn what you can do when the call comes.

The post President called by ALA and partners to again threaten veto of privacy-hostile “cybersecurity” info sharing legislation advancing in Senate appeared first on District Dispatch.

by Adam Eisgrau at July 16, 2014 06:14 PM

LITA

Jobs in Information Technology: July 16

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.


New This Week

Head Librarian, Penn State Wilkes-Barre campus,  Pennsylvania State University,  Wilkes-Barre, PA

Library Director, Pelham Public Library, Pelham, NH

Summer Reading Regional Presenter,  Kansas Library Consultants for Youth,  Manhattan, KS

Technical Content Specialist, Computercraft Corporation, Bethesda, MD

Visit the LITA Job Site for more available jobs and for information on submitting a  job posting.

 

by vedmonds at July 16, 2014 06:02 PM

District Dispatch

E-rate modernization process in the news

The FCC’s recent 3-2 vote to approve a reform order as part of its ongoing effort to modernize the E-rate program came in the midst of significant discussion and debate among policymakers and program stakeholders. Below are news articles that offer insight into the attitudes of Members of Congress and library and school advocates toward the reforms contained in the order.

Stakeholder responses to E-rate order

Before the vote

(6/20/2014) SETDA Applauds Momentum at FCC in Advancing E-Rate Modernization
http://www.setda.org/2014/06/20/setda-applauds-momentum-at-fcc-in-advancing-e-rate-modernization/

(6/20/2014) Media Statement by ISTE CEO Brian Lewis in Response to FCC Chairman Wheeler’s Action on E-Rate
http://www.iste.org/about/media-relations/news-details/2014/06/20/media-statement-by-iste-ceo-brian-lewis-in-response-to-fcc-chairman-wheeler-s-action-on-e-rate

(6/20/2014) NEA President: FCC E-Rate proposal is bad public policy
http://www.nea.org/home/59434.htm

(6/20/2014)Gov. Bob Wise (President, Alliance for Excellent Education) Comments on FCC Plan to Modernize E-Rate and Expand High-Speed Internet Access in Nation’s Schools and Libraries
http://all4ed.org/press/gov-bob-wise-comments-on-fcc-plan-to-modernize-e-rate-and-expand-high-speed-internet-access-in-nations-schools-and-libraries/

After the vote

(7/11/2014) ALA welcomes forward movement on E-rate modernization
http://www.ala.org/news/press-releases/2014/07/ala-welcomes-forward-movement-e-rate-modernization

(7/11/2014) NEA supports E-Rate proposed changes that keep Internet connectivity intact
http://www.nea.org/home/59738.htm

(7/11/2014) SETDA Comment on FCC Vote on E-Rate Modernization
http://www.setda.org/2014/07/11/setda-comment-on-fcc-vote-on-e-rate-modernization/

(7/11/2014) NTCA Statements on E-Rate Modernization, Rural Broadband Experiments Rulemakings
http://www.ntca.org/2014-press-releases/ntca-statements-on-e-rate-modernization-rural-broadband-experiments-rulemakings.html

(7/11/2014) New America Foundation Statement on FCC’s E-rate Order for School and Library Connectivity
http://oti.newamerica.net/node/117754

(7/11/2014) ROCKEFELLER STATEMENT ON FCC VOTE ON INITIAL STEPS TO MODERNIZE E-RATE
http://www.rockefeller.senate.gov/public/index.cfm/press-releases?ID=d099e21f-26db-4967-b475-34ce829eab2c

(7/11/2014) Rep. Waxman Statement on FCC Proposal to Modernize E-Rate Program
http://democrats.energycommerce.house.gov/index.php?q=news/rep-waxman-statement-on-fcc-proposal-to-modernize-e-rate-program

(7/11/2014) CCSSO Applauds FCC Steps to Modernize E-Rate
http://www.ccsso.org/News_and_Events/Press_Releases/CCSSO_Applauds_FCC_Steps_to_Modernize_E-Rate.html

Reactions to Wi-Fi formula

(7/11/2014) FCC passes controversial $5 billion Wi-Fi plan for schools and libraries
http://arstechnica.com/business/2014/07/fcc-passes-controversial-5-billion-wi-fi-plan-for-schools-and-libraries/

(7/10/2014) Statement of Commissioner Ajit Pai criticizing the E-rate order’s Wi-Fi formula
http://www.e-ratecentral.com/files/bulletins/DOC-328137A1.pdf

Hill reactions

(7/10/2014) Markey Challenges FCC on Modernizing Classroom Internet Access
http://www.boston.com/business/technology/2014/07/10/markey-takes-fcc-over-modernizing-classroom-internet-access/XdUiFGWYWOJZdaqbXh6KCJ/story.html

(7/10) More From Lawmakers Ahead of Friday’s E-Rate Vote
http://blogs.rollcall.com/technocrat/more-from-lawmakers-ahead-of-fridays-e-rate-vote/?dcz=

(7/9/2014) The people who created E-Rate think the FCC’s going about it all wrong
http://www.washingtonpost.com/blogs/the-switch/wp/2014/07/09/the-people-who-created-e-rate-think-the-fccs-going-about-it-all-wrong/

(7/8/2014) Key Senate Dems push back on FCC’s Wi-Fi plan
http://thehill.com/policy/technology/211638-key-senate-dems-push-back-on-fccs-wi-fi-plan

(7/8/2014) FCC’s E-Rate Proposal Draws Questions from Rockefeller, Markey
http://blogs.rollcall.com/technocrat/rockefeller-markey-critical-of-e-rate-proposal/?dcz

Program funding issues

(7/10/2014) FCC’s Pai: E-Rate Proposal Would Slash Connectivity Funding
http://www.broadcastingcable.com/news/washington/fccs-pai-e-rate-proposal-would-slash-connectivity-funding/132317

(7/9) FCC Chairman’s E-Rate Proposal Dogged by Funding Controversies
http://www.bna.com/fcc-chairmans-erate-n17179892064/

(7/7/2014) E-rate Survey: Funding Increase Deemed Critical
http://thejournal.com/Articles/2014/07/07/E-rate-Survey-Funding-Increase-Deemed-Critical.aspx

Reactions from Ed. Community

(7/9/2014) Will the FCC ignore educators in modernizing key U.S. technology program?
http://www.washingtonpost.com/blogs/answer-sheet/wp/2014/07/09/will-the-fcc-ignore-educators-in-modernizing-key-u-s-technology-program/

(7/8) FCC Prepares to Vote on E-Rate Overhaul
http://www.edweek.org/ew/articles/2014/07/09/36erate.h33.html

(7/6/2014) Teachers threaten to derail Wi-Fi push (ALA’s Marijke Visser also quoted)
http://thehill.com/policy/technology/211350-teachers-threaten-to-derail-wi-fi-push

(6/24/2014) Teachers Give Failing Grade to Obama’s Push for Tech in Schools
http://www.nationaljournal.com/tech/teachers-give-failing-grade-to-obama-s-push-for-tech-in-schools-20140624

(6/23/2014) FCC E-Rate Reform Proposal Criticized By School Advocacy Groups And Unions
http://www.ibtimes.com/fcc-e-rate-reform-proposal-criticized-school-advocacy-groups-unions-1608932

(6/20/2014) Proposed E-Rate Reforms Raise Concerns, Uncertainty
https://www.edsurge.com/n/2014-06-20-proposed-e-rate-reforms-deemed-unsatisfactory

(6/20/2014) FCC E-Rate Reforms Don’t Rate With Education Groups
http://www.broadcastingcable.com/news/washington/fcc-e-rate-reforms-dont-rate-education-groups/131932

(6/20/2014) Education community coalition letter expressing concern over the E-rate order’s Priority 2 (Wi-Fi) reforms
http://www.nea.org/assets/docs/Ed_Orgs_LTR_062014.pdf

ALA/library press

(7/10/2014) What libraries need from key US technology program (op-ed by ALA Washington Office executive director Emily Sheketoff)
http://www.washingtonpost.com/blogs/answer-sheet/wp/2014/07/10/what-libraries-need-from-key-u-s-technology-program/

(7/6/2014) Fed WiFi funding plan irks libraries (ALA’s Marijke Visser quoted)
http://www.heraldnet.com/article/20140706/NEWS02/140709468/Fed-WiFi-funding-plan-irks-libraries

(7/3/2014) Urban libraries say they’re getting shortchanged in a battle for WiFi funding (ALA’s Marijke Visser quoted)
http://www.washingtonpost.com/blogs/the-switch/wp/2014/07/03/urban-libraries-say-theyre-getting-shortchanged-in-a-battle-for-wifi-funding/

(7/3/2014) FCC’s E-Rate proposal aims to close the Wi-Fi gap in US schools and libraries (ALA’s Marijke Visser quoted)

http://www.techrepublic.com/article/fccs-e-rate-proposal-aims-to-close-the-wi-fi-gap-in-us-schools-and-libraries/

The post E-rate modernization process in the news appeared first on District Dispatch.

by Charles Wapner at July 16, 2014 05:18 PM

Open Knowledge Foundation

Open Knowledge Festival – the story so far…

It is hot hot hot here in Berlin, and the Festival is in full swing! In every corner, little groups are clustered, sharing ideas, plotting, and putting faces to profiles. From graffiti walls to linked budgets, from destroying printers to building a social contract for open data – the only problem is that you can’t be in five places at once!

Last night we proudly announced our School of Data Fellows – 12 amazing individuals from around the world, who will work with civil society and journalists in their regions to bring the power of open data to their work. You can read all about them here.

Today we heard inspiring keynotes from Patrick Alley, founder of Global Witness, and Beatriz Busaniche, founder of Wikimedia Argentina. We also heard from the awesome Ory Okolloh, activist, lawyer and blogger from Kenya, before dispersing into a whirlwind of workshops, talks and connecting.

Here are a few photos from the past couple of days. We’ll bring you more tales from Berlin soon, and you can keep up to date on twitter, storify and through the Festival website.


by Theodora Middleton at July 16, 2014 04:41 PM

Journal of Web Librarianship

Discovering Usability: Comparing Two Discovery Systems at One Academic Library

Journal of Web Librarianship, Ahead of Print.

by Mireille Djenno et al at July 16, 2014 04:06 PM

Patrick Hochstenbach

Designing an error page

Instead of a boring server error page, Ghent University Library wanted a friendly web page that will be shown when a service goes offline. This is the design I created for them:

Filed under: Doodles Tagged: apache, art, cartoon, comic,

by hochstenbach at July 16, 2014 02:44 PM

DPLA

Summer of Archives: Firing on enemy lines in 16 steps (1918)

The Summer of Archives is back this week with another installment of awesome archival finds from the wide world of DPLA. This week we’re featuring rare footage of US military operations in France during World War I. The 16 GIFs in this installment depict crews of US Naval Railway Batteries preparing and firing artillery rounds deep into German military operations in Verdun, France in 1918.

View on Imgur »

The film footage used in this post is courtesy of the National Archives and Records Administration. To view the entire film, visit http://bit.ly/1zCH24a.

by Kenny Whitebloom at July 16, 2014 02:11 PM

In the Library, With the Lead Pipe

Open Source Outline: Locating the Library within Institutional Oppression

In Brief: A call for articles based on an open source outline

On January 20th, 2014 nina de jesus posted “Outline for a Paper I Probably Won’t Write.” The editors at In the Library with the Lead Pipe approached de jesus to see if she might like to write it after all. We also discussed her idea to release her outline with an open source license and see what others would write. We are thrilled to announce that de jesus agreed to both.

If you are interested in writing an article for us based on this outline and would like to work with a Lead Pipe editor, please email ellie@leadpi.pe by August 13th, 2014. If you would like to write your article without going through the Lead Pipe peer review process, please email ellie@leadpi.pe by September 10th, 2014 with a link to your completed article. You are not bound to follow the outline to the letter. We welcome divergence and dissent.

Depending on the number and quality of submissions we receive, we will either publish all of the articles together as a digital edition or we will publish de jesus’s article here along with links to any other articles published using this outline.

The deadline for the completed article is September 10th, with a publication date of September 24th.

by Ellie Collier at July 16, 2014 01:00 PM

OCLC Dev Network

Systems Maintenance on July 20

Web services that require user level authentication will be down for WMS and Identity Management system (IDM) updates beginning Sunday, July 20th, 2:00 am EDT.

 

by Shelley Hostetler at July 16, 2014 01:00 PM

Eric Lease Morgan

Matisse: "Jazz"

"Arguably one of the most beloved works of twentieth-century art, Henri Matisse's "Jazz" portfolio - with its inventiveness, spontaneity, and pure intensely pigmented color - projects a sense of joy and freedom." These are the gallery notes from an exhibit of Jazz at the Des Moines (Iowa) art museum.

by Eric Lease Morgan (emorgan@nd.edu) at July 16, 2014 04:00 AM

Jazz, (Henri Matisse)

"Jazz (1947) is an artist's book of 250 prints for the folded book version and 100 impressions for the suite, which contains the unfolded pochoirs without the text, based on paper cutouts by Henri Matisse. Teriade, a noted 20th century art publisher, arranged to have Matisse's cutouts rendered as pochoir (stencil) prints."

by Eric Lease Morgan (emorgan@nd.edu) at July 16, 2014 04:00 AM

Context for the creation of Jazz

"In 1943, while convalescing from a serious operation, Henri Matisse began work on a set of collages to illustrate an, as yet, untitled and undecided text. This suite of twenty images, translated into "prints" by the stenciling of gouache paint, became known as Jazz---considered one of his most ambitious and important series of work." These are notes about the work Jazz by Matisse.

by Eric Lease Morgan (emorgan@nd.edu) at July 16, 2014 04:00 AM

Dan Chudnov

Considering "Computational Politics"

Zeynep Tufekci's recent paper in First Monday "Engineering the public: Big Data, surveillance and computational politics" focuses our attention on the shifting power dynamics of data-driven targeting of political influence and the expected effects on society. The term "computational politics" itself seems to have been used here and there for at least 10 years, previously referring more narrowly to computational study of elections and electoral systems, but in her definition it expands to:

"applying computational methods to large datasets derived from online and off-line data sources for conducting outreach, persuasion and mobilization in the service of electing, furthering or opposing a candidate, a policy or legislation."

Her arguments (summarized weakly here to be sure) focus on how big data "can undermine the civic experience," concluding that the conditions she observes lead to a fractured public sphere replaced by privatized targeting of attempts to influence, that this activity is an important result of an implied but just-invisible surveillance state, and that all of this "favors incumbents who already have troves of data," including the already-wealthy, the new platform providers, and those with the means to acquire and apply data at scale to these same ends.

With my limited training, knowledge, and experience I won't deign to offer a substantial critique beyond saying that these are compelling arguments, especially in light of the decades-long trends noted along the way. I encourage you to read for yourself and to dig in to the notes and references, which taken together might, as do many of the best works shared freely online, form an ideal sort of self-contained open online course.

The only minor quibble I'll mention regards how these explicit expectations of how big data affordances will continue to be used are to a degree undercut by an implied expectation that those empowered to do so will themselves act rationally. In the section describing how our shifting understanding of behavioral sciences has moved beyond the rational actor model to incorporate a more nuanced sense of individual and collective irrationality which may be modeled and targeted accurately at the individual level, I got a sense that this assumes that those wielding this power will each apply it uniformly toward singular or related objectives, such as swaying an election toward a candidate, party, or one side of a set of issues. Although this no doubt is how things work in political campaigns, I am less confident that the major platform providers of our day are likely to have such tunnel vision. If you own the platform, would you use it to shift your user base toward a consistent set of political aims at the macro scale (everyone), or, rather, would you engineer the platform to profit off all efforts to do so, at all scales? It seems to me that for-profit platform providers like Facebook and Twitter have an overwhelming motive to profit off empowering the peddling of influence and the manufacturing of consent for every micro-community who finds a voice and audience, especially if they clash into each other in the spaces themselves and create more traffic and sustain attention and charge emotions further. All of this activity enhances the ability of those who can afford to acquire and model all of the data to target their users, and the churn of charged messages itself obscures how only the already-rich and already-powerful can apply the most powerful models over all the data for their own aims.

Because we don't want to know (as is pointed out in the paper) that we are being manipulated, we prefer our platforms to provide at least a meaningful illusion of neutrality. What better way to provide this illusion than to allow the smaller and less rich among us to believe the platform gives us our own insights into our own more modest target audiences? There is a line beyond which the utility of a small organization's or movement's ability to tie in a handful of social media accounts to an inexpensive hosted CRM to target its own members is surpassed by the cost or risk to that organization and thousands of others like it when they stop believing that the platform serves its modest interests rather than those of the platform provider itself. The minute Facebook or Twitter is believed to be taking political sides, they will die like so many other platforms before them. If anything, this reinforces the paper's argument that the opacity and proprietary nature of how these platforms might choose to inject influence should be of greater concern.

I wrote some years ago that user-generated content is a reverse supply chain for information, but in that piece I considered mainly profit, rather naively. It is not lost on me how several cases described in the paper were occurring back in 2007 when I was writing my column, and how small whatever insight I might have been able to share back then seems now. In a surveillance state, my own activities define how my own future actions will be swayed, imperceptibly but precisely so and just for me, by the environments in which I choose to take those activities.

In the months following the events of 9/11 I spent a lot of my transit time on the train between New Haven and Cambridge reading a report from MITRE or Rand on the shift in paradigms from hierarchical to network and peer-to-peer command and control systems (my apologies, I searched several times in the years since but have not found that piece again and cannot provide a reference). If I recall correctly, the main thesis of this report was that our surveillance and military systems were not up to the challenge, that our 20th century infrastructure for intelligence gathering would not enable us to study and fight a distributed 21st century enemy successfully, and that we needed to shift rapidly to adapt to the network model. Although we can readily question how successfully this objective has been achieved militarily, it is startling to consider how successfully every aspect of our computed lives embody this achievement.

One last thought, more grounded in a field I do actually know something about. There are also implications here for the provision of what we blithely refer to as open data. We could argue that by giving data away in a raw form, we are empowering anyone to make what they will of it. This seems, at first, to be a positive result all around. Who doesn't want to be empowered? But the skill, experience, manpower, and computing power needed to bring diverse data together meaningfully and at scale are, unfortunately, not readily available to most of us as individuals, and this paper hammers that message home. Even if I fancy myself a competent programmer and a budding data scientist, even if I have inexpensive at-will computing power at my disposal, how can I afford to make my own targeted models of influence and persuasion and apply them meaningfully?

How can I afford not to?

I don't have answers, but it seems clear that there is a huge gap to fill in service of those who wish to study data meaningfully and at scale, and that more of us would do well to step in and fill it.

by dchud at July 16, 2014 02:33 AM

Galen Charlton

Taking the ALA Statement of Appropriate Conduct seriously

After I got back from this year’s ALA Annual Conference (held in the City of It’s Just a Dry Heat), I saw some feedback regarding B.J. Novak’s presentation at the closing session, where he reportedly marred a talk that by many accounts was quite inspiring with a tired joke alluding to a sexual (and sexist) stereotype about librarians.

Let’s suppose a non-invited speaker, panel participant, or committee member had made a similar joke. Depending on the circumstances, it may or may not have constituted “unwelcome sexual attention” per the ALA Statement of Appropriate Conduct at ALA Conferences, but, regardless, it certainly would not have been in the spirit of the statement’s request that “[s]peakers … frame discussions as openly and inclusively as possible and to be aware of how language or images may be perceived by others.” Any audience member would have been entitled to call out such behavior on the spot or raise the issue with ALA conference services.

The statement of appropriate conduct is for the benefit of all participants: “… members and other attendees, speakers, exhibitors, staff and volunteers…”. It does not explicitly exclude any group from being aware of it and governing their behavior accordingly.

Where does Novak fit in? I had the following exchange with @alaannual on Twitter:

Question to @alaannual: are invited speakers who are not members specifically told about the statement of appropriate conduct?

— Galen Charlton (@gmcharlt) July 3, 2014

@gmcharlt Checking on process – will let you know when we hear back, probably next week with the holiday.

— ALA Annual (@alaannual) July 4, 2014

@gmcharlt Heard back that we don't currently share the Statement with auditorium speakers but will discuss doing that in the future.

— ALA Annual (@alaannual) July 15, 2014

.@alaannual Thanks for the response. I strongly encourage that the Statement of Appropriate Conduct be shared with invited speakers

— Galen Charlton (@gmcharlt) July 15, 2014

A key aspect of many of the anti-harassment policies and codes of conduct that have been adopted by conferences and conventions recently is that the policy applies to all event participants. There is no reason to expect that an invited keynote speaker or celebrity will automatically not cross lines — and there have been several incidents where conference headliners have erred (or worse).

I am disappointed that when the Statement of Appropriate Conduct was adopted in late 2013, it apparently was not accompanied by changes to conference procedures to ensure that invited speakers would be made aware of it. There’s always room for process improvement, however, so for what it’s worth, here are my suggestions to ALA for improving the implementation of the Statement of Appropriate Conduct:

I invite feedback, either here or directly to ALA.

July 16, 2014 12:27 AM

Library Tech Talk (U of Michigan)

Decommissioning BlueStream

After running for over a decade, BlueStream was retired on June 30th, 2014.

by Robert McIntyre at July 16, 2014 12:00 AM

July 15, 2014

code4lib

Code4Lib NorCal 28 July in San Mateo

A one-day Code4Lib NorCal Meeting will be held 10am - 3:00pm on 28 July at 777 Mariner's Island Blvd., San Mateo (the building that the OCLC Research San Mateo office is in).

There is no cost, and OCLC will provide attendees lunch. Parking is plentiful and free.

Please see the wiki for more information.

by rtennant at July 15, 2014 09:29 PM

Casey Bisson

Porn consumption by geography and type

This is shamefully old news, but Pornhub released stats that correlate viewing preferences by geography and pulled out a quote too juicy to ignore:

Dixie loves dicks so much that the percentage of gay viewers for every single state in the South is higher than the average of the legal gay marriage states.

Choropleth of searches for gay porn

I’m concerned that some of the numbers are contradicted in three different places in the same article, but it suits my worldview, so why bother questioning it?

Even older is a 2009 study that showed Utah had the highest rate of porn subscriptions.

[T]hose who attend religious services [on Sunday] shift their consumption of adult entertainment to other days of the week, despite on average consuming the same amount of adult entertainment as others.

Porn subscriptions per capita by state

This too is just too funny to question (though the author seems forthcoming about the limitations of the data).

Thinking of convenient material too funny to question, this was almost exactly a decade ago.

Back to the present, a paper in the Proceedings of the National Academy of Sciences compared “tightness–looseness” (“strength of punishment and degree of latitude/permissiveness”) and came up with the following map (as it appeared in Mother Jones):

Tightness-looseness choropleth map

That particular paper is just too rich with detail to excerpt from. It’s a quick read, just go for it. Though, if you’re too lazy for that, at least you can appreciate the correlation between this map and the others.

by Casey Bisson at July 15, 2014 09:11 PM

District Dispatch

FCC extends network neutrality filing deadline

FCC Press Secretary Kim Hart issued the following statement this afternoon:

The deadline for filing submissions as part of the first round of public comments in the FCC’s Open Internet proceeding arrived today. Not surprisingly, we have seen an overwhelming surge in traffic on our website that is making it difficult for many people to file comments through our Electronic Comment Filing System (ECFS). Please be assured that the Commission is aware of these issues and is committed to making sure that everyone trying to submit comments will have their views entered into the record. Accordingly, we are extending the comment deadline until midnight Friday, July 18. You also have the option of emailing your comments to openinternet@fcc.gov, and your views will be placed in the public record.

The American Library Association will file comments this week. Stay tuned!

The post FCC extends network neutrality filing deadline appeared first on District Dispatch.

by Larra Clark at July 15, 2014 08:48 PM

How fast is your library’s internet (really)?

Last Friday, the Federal Communications Commission approved its first E-rate Order as part of its modernization proceeding.

But the work is far from over. The FCC also seeks additional data and public comment to inform its next steps in E-rate modernization.

As part of its ongoing advocacy, the ALA and other national library partners will gather new data this month to gauge the quality of public access to the internet in our nation’s public libraries. The effort is funded by the Institute of Museum and Library Services (IMLS), and is supported by the Association of Rural and Small Libraries, the Chief Officers of State Library Agencies, the Public Library Association, and the Urban Libraries Council.

The broadband speed test will measure the actual internet speeds delivered to desktops, laptops and mobile devices in public library buildings. Gathering information on the actual speeds to the device will help better describe and improve the library patron experience using library wired and wireless networks. The resulting data—needed from libraries of all sizes in all 50 states—will aid the library community in advocating for adequate E-rate funding for libraries. All participating libraries also will receive their local speed data.

“Strong Wi-Fi and internal broadband connections in libraries and schools are necessary to support individualized learning. We need to better understand this issue from a ‘front-line’ context,” said IMLS Director Susan H. Hildreth. “I hope we will have broad library participation in this effort so that policy decisions will be well informed and can accommodate future broadband needs of library customers.”

Please help us spread the word! Public libraries can log on to the speed test at: digitalinclusion.pnmi.com/speedtest.

The post How fast is your library’s internet (really)? appeared first on District Dispatch.

by Larra Clark at July 15, 2014 08:21 PM

E-rate modernization: Take a breath, now back to work

On July 11, the Federal Communications Commission (FCC) voted 3-2 along party lines to move forward on the next phase of the E-rate Modernization proceeding. The resulting E-rate Report and Order, not yet publicly released, will focus on making Wi-Fi support available to more libraries and schools, streamlining the application and administration of the program, and ensuring the program is cost-effective and efficient to make E-rate dollars go further.

In the final weeks before the Commission vote, ALA invested significant time to make a final play for addressing library issues in the draft order circulated by the Chairman. Through phone calls and emails with the Chairman’s staff and the legal advisors to the Commissioners, as well as in-person meetings (logging 12 ex parte filings in the last week of the public comment period alone), we responded directly to questions about our most important issues and further explained the rationale for the positions we have taken. Most notable, and described in detail below, are the collaboration between the Association for Rural & Small Libraries (ARSL), the Chief Officers of State Library Agencies (COSLA), the Public Library Association (PLA), and the Urban Libraries Council (ULC); the adoption by the Commission of ALA’s formula proposal; and the video of the FCC Chairman speaking on libraries and E-rate for ALA’s Annual Conference.

E-rate and #alaac14


A Las Vegas-Clark County library Gigi Sohn visited during #alaac14

We headed into Annual Conference on the heels of a packed week of Commission meetings, calls, emails, and more meetings. We secured a “feature-length” video of the Chairman speaking for and on behalf of libraries and their important contributions in the E-rate proceeding. And, in person, Gigi Sohn, Special Counsel for External Affairs to Chairman Wheeler, met informally with PLA leadership, representatives from COSLA, ARSL, the OITP Advisory Committee, and the ALA E-rate task force to talk E-rate details and the nature of library services in today’s and tomorrow’s libraries. In addition to these meetings, Gigi took a field trip to the main library and one of the smaller branches of the Las Vegas-Clark County Library District, where we saw a model of what libraries can offer their communities. The experience showed Gigi the depth to which libraries go to understand their demographics (in very specific granularity) and, through that, the needs of the communities they serve, in order to truly become the learning hub for the community.

A significant outcome from Annual, and in no small part due to the meetings with Gigi, was the impetus to bring together the library E-rate community in the final advocacy stages to call for swift action on the part of the Commission to move its first step proposal forward. On the last day of public comment before the vote, ALA, ARSL, COSLA, PLA, and ULC filed a joint letter (pdf) supporting this first step in the E-rate modernization process.

E-rate and the tricky business of square footage formulas

Prior to and after Annual, OITP continued its in-depth review of the cost data for Wi-Fi and related services we had gathered from state libraries, library systems, and individual libraries. While many were celebrating a long Fourth of July weekend with picnics and fireworks, the OITP E-rate team made the weekend even longer. Our team spent the weekend reviewing recently-gathered information in addition to our previous cost analysis to model whether the Commission’s formula of $1.00 per square foot with a 6,000 floor would adequately address library Wi-Fi needs.

We compared itemized lists of equipment and services libraries purchased to support their Wi-Fi and internal connections. We also further studied the potential impact of the proposed Commission library formula for Wi-Fi and related services (the former Priority 2 bucket and the new Category 2) on library applicants. In addition to the data we collected, we consulted with the library organizations that filed the joint letter, who turned to their member leaders for feedback.

In coming to a formula proposal (i.e., $2.30 per square foot, or a floor of $9,200 for libraries at or below 4,000 square feet, over a five-year period), we were careful to take into account the fact that a “library formula” must at once be robust and defensible and account for the needs of the smallest to the largest library applicants. A formula is fundamentally different than the historical, solely needs-based funding model for the E-rate program. By its nature, a formula will not meet all of each applicant’s needs. It is a recognition that the Commission must work within an imperfect system and that distributing Wi-Fi funding equitably to as many libraries (and schools) as possible is a positive change from the long-term lack of such funding to any libraries.

E-rate and the near future

The now Order and Further Notice of Proposed Rulemaking adopted by the Commission on Friday will address a significant shortfall for libraries and schools and seeks to close the “Wi-Fi gap” as well as simplify the administration and application processes, and maximize the cost effectiveness of the program. After long hours of negotiating with Commission staff, we are gratified to learn that ALA’s proposal for the per-square-foot library formula has been adopted. Though the Order is not public as of this writing, we understand that a number of ALA’s other proposals are included in some fashion (refer to the Fact Sheet summarizing the content of the Order). Significantly, we learned the Commission’s continued commitment to addressing the connectivity gap (i.e., the lack of high-capacity broadband to the majority of the nation’s libraries) is part of its continued review of the E-rate program.

In addition to the Order, as part of its 11th-hour negotiations, the Commission elected to call for further public input on the long-term funding issues that have plagued the program, on the new per-pupil and per-square-foot allocation model for Wi-Fi (Category 2) funds, and on outstanding issues that are not folded into the Order. We will weigh in on this important addendum to the process because, as Commissioner Clyburn noted in her statement (pdf), “Our work is not done and we will continue to contemplate how to close these gaps and ensure that all schools and libraries have affordable access to the connectivity to and within their buildings.”

We expect that the Commission will release the Order and FNPRM this week, at which time a number of us in the Washington Office will hit the print button so we can read and digest what we hear is 158 pages. We will provide a summary and are planning outreach to the library community. We anticipate USAC will be providing in-depth outreach, and we are also talking to Commission staff to develop some library-specific outreach materials. More to come!

The post E-rate modernization: Take a breath, now back to work appeared first on District Dispatch.

by Marijke Visser at July 15, 2014 07:25 PM

D-Lib

July/August Issue

Editorial by Laurence Lannom, CNRI

July 15, 2014 06:47 PM

On Being a Hub: Some Details behind Providing Metadata for the Digital Public Library of America

Article by Lisa Gregory and Stephanie Williams, North Carolina Digital Heritage Center

July 15, 2014 06:47 PM

The SIMP Tool: Facilitating Digital Library, Metadata, and Preservation Workflow at the University of Utah's J. Willard Marriott Library

Article by Anna Neatrour, Matt Brunsvik, Sean Buckner, Brian McBride and Jeremy Myntti, University of Utah J. Willard Marriott Library

July 15, 2014 06:47 PM

Realizing Lessons of the Last 20 Years: A Manifesto for Data Provisioning and Aggregation Services for the Digital Humanities (A Position Paper)

Article by Dominic Oldman, British Museum, London; Martin Doerr, FORTH-ICS, Crete; Gerald de Jong, Delving BV; Barry Norton, British Museum, London; and Thomas Wikman, Swedish National Archives

July 15, 2014 06:47 PM

What Do Researchers Need? Feedback On Use of Online Primary Source Materials

Article by Jody L. DeRidder and Kathryn G. Matheny, University of Alabama Libraries

July 15, 2014 06:47 PM

Degrees of Openness: Access Restrictions in Institutional Repositories

Article by Hélène Prost, Institute of Scientific and Technical Information (CNRS) and Joachim Schöpfel, Charles de Gaulle University Lille 3

July 15, 2014 06:47 PM

Managing Ambiguity in VIAF

Article by Thomas B. Hickey and Jenny A. Toves, Online Computer Library Center, Inc.

July 15, 2014 06:47 PM

Report on Libraries in the Digital Age (LIDA 2014)

Conference Report by Darko Lacović, University of Osijek, Croatia and Mate Juric, University of Zadar, Croatia

July 15, 2014 06:47 PM

In Brief: The IDEASc Project Seeks Applicants

July 15, 2014 06:47 PM

In Brief: Unpacking Fedora 4

July 15, 2014 06:47 PM

In Brief: The Pisa Declaration: An Open Access Roadmap for Grey Literature Resources

July 15, 2014 06:47 PM

In Brief: Connected Learning in Digital Heritage Curation

July 15, 2014 06:47 PM

Cynthia Ng

Presentation: Accessible Formats for People with Print Disabilities in Canada: A Short Overview on Public Library Services

This is a presentation that I did recently for an interview. The idea is to cover the distribution of accessible formats for people with print disabilities, which I took to mean how these people can access materials, with a focus on public libraries. I apologize for the PowerPoint look, but I did the slides fairly quickly and figured […]

by Cynthia at July 15, 2014 05:58 PM

District Dispatch

Stories live in libraries, but how to share them?

A lot has happened in my first month as a Google Policy Fellow at the American Library Association (ALA), where today I am formally launching a digital storytelling project called Living Stories, Living Libraries. The blog relies on photo documentary-style submissions to capture the diverse stories of people using libraries. It gives individuals a place to share how libraries have impacted their lives, to hear from others, and to connect ideas, and it provides a space for you to tell your own story.

Social media and the ubiquity of mobile Internet access and mobile photography allow for unprecedented ease of online storytelling. At present, library information shared through social media is largely presented in an editorial format. Living Stories, Living Libraries is based on the belief that libraries could benefit in advocacy and visibility-raising through the more personal approach of letting individual librarians and users document and promote their unique experiences with the library. Currently, the Twitter hashtag #futureoflibraries allows library patrons and librarians to tweet what they would like to see in libraries of the future. One telling example: “Libraries could be doing more to tell the story of how much they’ve changed- eg. adapting to the digital ecology.”

As institutions, libraries are increasingly going beyond providing information and knowledge, and are actually producing information and enabling creation. The new role of the library is becoming ever more important as we shift to a knowledge-based economy that relies on digital skills and innovation. Libraries are changing in many future-forward ways, from providing 3-D printing and teaching coding to acting as publishers, but unfortunately research shows the majority of people are only aware of a small portion of the resources libraries offer. Libraries represent public innovation spaces that fill vital needs in access to opportunities, skills, knowledge, and creative space. But what is the best way to show their growing value?

The idea for this project began to take shape a year ago when I became increasingly frustrated in my international affairs studies with the existing narratives and problematic approaches to international development. I took refuge in the library, which seemed to me the only truly humanitarian institution that helps all people, without any ulterior motive, and I started looking at libraries as drivers of community development. At this point, a bout of madness overtook me and I decided to write a thesis around the topic. I traveled across rural Romania on a grant, interviewing librarians and patrons about how Internet access through the library impacted their communities. And I learned two important things: libraries are unsung liberators, and everyone in a library has a story.

I wanted to find a way to collect the stories people told me about how the library has made differences great and small in their lives. More than that, I wanted to share them in an accessible, human way, because the stories themselves reflected a stunning humanity. Across diverse cultures and communities, libraries fill needs as unique as the people who use them. From a grandmother who spoke to her American-born grandson for the first time on Skype at the library, to a woman who turned to the library to understand cancer treatments when a family member was diagnosed, stories reflected the multitudinous and vital needs for quality access to information and technology.

Since my undergraduate thesis is not about to make the New York Times bestsellers list, the chances of someone finding these stories are unfortunately pretty slim. Instead, I decided to create a blog site in collaboration with ALA where library users and librarians can openly post their own unique stories. While conducting research, I realized that the library has an underused opportunity to further connect, share, and create a wider community freely and simply through social media. The goals of this blog are to:

  1. Interact: Use the social media format to build trust and mirror the personal interaction we value in libraries: real people telling real stories
  2. Show: Promote the image of libraries as vibrant community centers that fill a variety of roles
  3. Share: Allow librarians to get ideas from this blog, to see what other libraries are offering, and to share their own successes using the #librariesinthewild tag
  4. Advocate: Share stories with policymakers for library advocacy at both the local and national level
  5. Grow: Create new library users by reaching people on social media and showing powerful, short, easy-to-consume human stories of impact

More than anything else, stories about other people and their lives capture our attention—the success and popularity of social media in starting movements and developing identities make that clear. This blog is designed to capture meaningful stories and to give voice to all those who have been impacted by the library. It also presents a space to illustrate all of the innovative new roles libraries fill. The project is licensed under Creative Commons, and I encourage anyone to share and adapt its content for any non-commercial advocacy and awareness-raising purposes. My hope is that Living Stories, Living Libraries will provide a useful tool to create an online community and to keep libraries as prominent providers of free and equal access to information and new digital opportunities.

I invite you now to follow the blog, to share the project widely with colleagues and friends, and tell your own story. Please feel free to reach out to me if you or your library would like to be involved, mkavaras [at] alawash [dot] org.

The post Stories live in libraries, but how to share them? appeared first on District Dispatch.

by Margaret Kavaras at July 15, 2014 03:59 PM

DPLA

Finding family information through DPLA

Larry Naukam is the retired Director of Historical Services (Local and Family History, Digitizing, and Newspaper Retrieval) for the Central Library of Rochester and Monroe County, New York. He has been actively involved with local historical and genealogical groups since 1978. He is on the Board of Directors of the Rochester Genealogical Society and has been a member of the RGS Genealogical Educator’s Group since 1991.

Genealogists are getting much more interested in doing serious research and having accurate citations than may have been the case in the past. DPLA offers a place where these researchers can utilize the “serendipitous discovery” potential of items in the DPLA to advance their research. As more people discover DPLA, they will enhance the quality of their research output by accessing this larger pool of available materials. Even having a small piece of a larger collection will stimulate use of that collection. Case in point: a small historical society in mid-New York state digitized an account book from the early years of that town. As it was created before the U.S. was a separate country, this helped people who had colonial ancestors flesh out the stories of some of their ancestors’ lives. Academics used it to reconstruct a look at that community. More searchers are using more materials and doing so in a historically responsible manner. DPLA can greatly enhance this process of discovery.

Any library or archive wants to engage its users. After all, what good is a collection that no one knows about and no one uses? Having materials online to be discovered is a good start. There is far more to be used for family research than just censuses. I wrote an article twenty-five years ago which utilized land records, church records, letters from one family member to another, photographs, wills, and the like. If the Internet and DPLA had existed then, they would have greatly aided this research by pointing me to likely research sources instead of leaving me to guess which institutions might have various kinds of materials.

I am often asked: why should I put everything online? No one will come to use the collection! Not true. Personal experience showed me that by putting some useful tools online, usage of the place that I used to work at septupled. That’s right. Seven times as many people came in to use the collection because they had discovered some content online and wanted to see more. That location continues to be a leader in small public library research facilities.

But there’s another item to consider: why is the DPLA important to researchers? It’s obvious that as DPLA follows its plan to make a growing number of items available and discoverable through different access points, all will benefit. As more and more information goes behind paywalls, it becomes more and more essential to have other items freely available.

Capturing information and presenting it in a useful and discoverable manner is very important. Members and participants in the DPLA can put forth extra effort to include at least samples of what’s in their collections. For example, a local historian can put a selection of pictures with adequate metadata online, and mention in the accompanying text that there is much more to be used in person. I have visited many local historians in my area, and there is a great amount of useful material in their offices. How great it would be to have that material discovered! These historians might or might not be able to do this work themselves. That’s why I am a member of a group which does volunteer scanning and cataloging of such records. In 10 years we have done over 150,000 pages of materials. We get about 15,000 hits a year to that web site, without a lot of publicity (yet). These records will greatly assist family searchers in their quest for additional data about their ancestors and collateral relatives. People have told us so, and a national genealogy group gave us an award for this work in 2010.

Draper family genealogy diagram. Courtesy Bancroft Memorial Library via Digital Commonwealth.

Two African-American corporals, Pearlie Brown and Quitman, Perry, Houston County, Georgia, 1918. Courtesy GA Dept. of Archives and History via DLG.

Immigrants seated on long benches, Main Hall, U.S. Immigration Station, Ellis Island. Courtesy NYPL.

What do genealogists and family historians want to see? Censuses, of course. Those are available already from numerous resources. But a big, big move now underway among both commercial and non-commercial publishers is to get researchers to write up their findings and, more importantly, to amplify them with far more than just simple census data. Biographies, letters, church records, manuscripts, diaries, journals, and many other sources can be used to write a family narrative. With the above-mentioned material, a name becomes a person. Such things can tell us more about how a person lived and felt. Think of some letters written back and forth between a young couple in wartime, and what they would mean to their descendants decades later. The people that the readers knew as older folks would seem vital and alive.

Those involved with putting information online should talk to users about access points. That’s something I did years ago when users would come in to my area. This is what we have. Is it useful to you? What would be useful as well? How do you want to access it? Simply making things available increased usage 7 times, and as of a few years ago visits just to the top guide pages of our web site got 250,000 hits a year. Not bad for a small library!

This means that our funding sources (and visiting users!) took notice of us. We did a great presentation to the American Association for State and Local History in 2008, and hopefully inspired more digitizing efforts. Funders watch use, and grants can be much easier to get if a location is successful. Having a location discovered through DPLA is likely not only to result in increased usage, but also in higher quality research being done as more materials become available.

Remember as well that intellectual activity inspires actions. Genealogists are becoming much more sophisticated in their research. Academics have traditionally had to do rigorous research and fact checking, while genealogists got a somewhat deserved reputation as being mere name collectors who believed anything if it was written. No longer. Major conventions are flooded with those anxious to learn the right way to do things, and by making so much available the DPLA gets used by a wider audience.

Anneke W. Jans-Bogardus, Everardus Bogardus, [and] Genealogy Chart. Courtesy the New York Public Library.

Let me give an example of multiple ways to engage the material. A friend recently was researching the Bogardus family of New York. They were early settlers of the colony and state. The Dutch patriarch lived from 1589 to 1647. The friend found over 150 mentions of the Bogardus family in the collections pointed to by the DPLA. They ranged from drawings to manuscripts to photographs, from English to Dutch to Danish. He was amazed at the sources that he could locate and see through the DPLA. His research has been enriched and he is happy to tell others of his positive experience.

But it can be so much more. “Genealogy” has over 63,000 results in DPLA. One of them is the actual Ellis Island immigration web site; another is an oral history from an African American point of view. The term “genealogical” has almost 6,500 entries. The term “family history” has over 1,200. Searching on a surname can yield a feast of sources that could be useful. People search for terms like “genealogy,” “genealogical,” and “family history” as they start a search and often will come upon something cataloged against that term. Findability is only as good as the terms used to describe an item.

The DPLA has a curated exhibition, Leaving Europe, that provides a learning commentary for those who wish to pursue more avenues of inquiry. I use it to demonstrate how people came to the U.S. during the 19th century.

All in all, DPLA is a magnificent resource that should grow and be advertised and marketed to genealogists and family historians. We all will get a lot from this endeavor by sharing our discoveries.

Cover image: Atlas Universel. Tableau mythologique. 1834. Courtesy David Rumsey CC BY-NC-SA 3.0.


All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

by DPLA at July 15, 2014 03:03 PM

Casey Bisson

Followup: Triggertrap latency and Fuji Instax tips

Short answer: Triggertrap app audio triggering latency is too long to capture a fast moving event.

dice

The app, the dongle, my trusty EOS Rebel XTi, Lensbaby (manual focus, soft edge details), and Neewer flash worked, but too slowly. The phone was just inches from where I was throwing the dice, but the flash and camera were triggered after most of the action happened. Most of the time the die flew off the table before the picture was captured. The delay was such that I was able to slam the die to the table and remove my hand from the scene without being captured on camera.

On the other hand, the Fuji Instax tips and tricks have proved handy over the past couple weeks.

Santa Cruz Boat Rentals

I’ve gotten a bit better at managing focus and I can definitely confirm the camera doesn’t perform in the dark (heck, this was even a good blue hour shot that it couldn’t pull off)*. I still struggle with it though, and rarely get the focus, exposure, and framing exactly to my liking. It’s still great fun, just challenging.

* Yeah, the lesson here is that I should try earlier, during the golden hour, or bring lots of flash. Alternatively, a Belair with instant back and a longer exposure might have done it.

by Casey Bisson at July 15, 2014 09:35 AM

July 14, 2014

John Miedema

QA Architecture: Initialization. The solution is in the question.

1-1 QA Detection

The Question and Answer Architecture of Whatson can be divided into three major processes. The first process may be called Initialization, and is shown in the chart to the left. It involves the following steps (a rough code sketch follows the list):

  1. Accept the Question. A user asks a question, “Who is the author of The Call of the Wild?” Everything flows from the user question. One might say, the solution is in the question. It is assumed that the question is in text format, e.g., from an HTML form. A fancier system might use voice recognition. The user can enter any text. It is assumed at the beginning that the literal text is entered correctly, i.e., no typos, and that there are sufficient clues in the question to find the answer. If these conditions prove wrong, a later step will be used to correct and/or enrich the original question text.
  2. Language Detection. The question text is used to detect the user’s language. The cognitive work performed by the system is derived from its knowledge of a particular language. Dictionaries, grammar, and models are all configured for individual languages. The language to be used for analysis must be selected right at the start.
  3. Parts of Speech. Once we know the language of the question, the right language dictionary and models can be applied to obtain the Parts of Speech that will be used to do Natural Language Processing (NLP).
  4. Domain Detection. A typical NLP application will use English language models to perform tasks such as Named Entity Recognition, the identification of common entities such as People, Locations, Organizations, etc. This common level of analysis is fine for many types of questions, but there are limitations. How can a Person detector know the difference between an Author and a Character? I have shown how to build a custom model for Book Title identification. My intent is to build custom models for all elements of the subject domain. The current domain of interest is English literature, but a system should use the question text to identify other domains too.
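
As a rough illustration of these steps (not Whatson’s actual code), here is a minimal Java sketch. The OpenNLP tokenizer and POS tagger calls are real, but the model file name is an assumption, and the language-detection and domain-detection helpers are placeholder stubs standing in for whatever models the system ends up using.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;

public class Initialization {

    public static void main(String[] args) throws Exception {
        // 1. Accept the question (assumed to arrive as plain text, e.g. from an HTML form).
        String question = "Who is the author of The Call of the Wild?";

        // 2. Language detection. Placeholder: swap in a real detector and its language profiles.
        String lang = detectLanguage(question);          // e.g. "en"

        // 3. Parts of speech, using the dictionary/model for the detected language.
        //    "en-pos-maxent.bin" is the stock OpenNLP English model; adjust per language.
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(question);
        try (InputStream in = new FileInputStream(lang + "-pos-maxent.bin")) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(in));
            String[] tags = tagger.tag(tokens);          // e.g. WP, VBZ, DT, NN, ...

            // 4. Domain detection. Placeholder: inspect tokens/tags for domain clues
            //    (book titles, author names) and choose which custom models to load next.
            String domain = detectDomain(tokens, tags);  // e.g. "english-literature"
            System.out.println(lang + " / " + domain);
        }
    }

    // Stubs: stand-ins for real language and domain classifiers.
    private static String detectLanguage(String text) { return "en"; }
    private static String detectDomain(String[] tokens, String[] tags) { return "english-literature"; }
}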

The next post will describe how to use the inputs for NLP.

by johnmiedema at July 14, 2014 02:24 PM

District Dispatch

Possible amendment damaging to E-rate is expected soon.

The ALA Washington Office just got off a conference call with House Democratic Leadership regarding a possible amendment to the House Financial Services Appropriations bill that would limit the Federal Communications Commission’s (FCC) ability to increase funds for the Federal E-rate program. We expect a possible amendment to come to a vote on the House floor sometime within the next 48 to 72 hours.

Please call your U.S. Representative today and encourage him or her not to support any amendment that would thwart the FCC’s ability to increase funds to the Federal E-rate program.

The E-rate program is part of the Universal Service Fund, which is overseen by the FCC. Through this program, libraries and schools receive discounts on telecommunications services including high capacity broadband.

More information on this amendment will be posted on District Dispatch as it becomes available.

The post Possible amendment damaging to E-rate is expected soon. appeared first on District Dispatch.

by Jeffrey Kratz at July 14, 2014 12:59 PM

Stuart Yeates

BIBFRAME

Adrian Pohl wrote some excellent thoughts about the current state of BIBFRAME at http://www.uebertext.org/2014/07/name-authority-files-linked-data.html The following started as a direct response but, after limiting myself to where I felt I knew what I was talking about and felt I was being constructive, turned out to be much, much narrower in scope.

My primary concern in relation to BIBFRAME is interlinking and in particular authority control. My concern is that a number of the players (BIBFRAME, ISNI, GND, ORCID, Wikipedia, etc) define key concepts differently and that without careful consideration and planning we will end up muddying our data with bad mappings. The key concepts in question are those for persons, names, identities, sex and gender (there may be others that I’m not aware of).

Let me give you an example.

In the 19th Century there was a mass creation of male pseudonyms to allow women to publish novels. A very few of these rose to such prominence that the authors outed themselves as women (think Currer Bell), but the overwhelming majority didn’t. In the late 20th and early 21st Centuries, entries for the books published were created in computerised catalogue systems and some entries found their way into the GND. My understanding is that the GND assigned gender to entries based entirely on the name of the pseudonym (I’ll admit I don’t have a good source for that statement; it may be largely parable). When a new publicly edited encyclopedia based on reliable sources called Wikipedia arose, the GND was very successfully cross-linked with Wikipedia, with hundreds of thousands of articles linked to the catalogues of their works. Information that was in the GND was sucked into a portion of Wikipedia called Wikidata. A problem now arose: because there were no reliable sources for the sex information that had been sucked into Wikidata from the GND, the main part of Wikipedia (which requires strict sources) blocked itself from showing Wikidata sex information. A secondary problem was that the GND sex data was in ISO 5218 format (male/female/unknown/not applicable) whereas Wikipedia talks not about sex but gender and is more than happy for that to include fa'afafine and similar concepts. Fortunately, Wikidata keeps track of where assertions come from, so the sex info can, in theory, be removed; but while people in Wikipedia care passionately about this, no one on the Wikidata side of the fence seems to understand what the problem is. Stalemate.

There were two separate issues here: a mismatch between the Person in Wikipedia and the Pseudonym (I think) in GND; and a mismatch between a cataloguer-assigned ISO 5218 value and a free-form self-identified value. 

The deeper the interactions between our respective authority control systems become, the more these issues are going to come up, but we need them to come up at the planning and strategy stages of our work, rather than halfway through (or worse, once we think we’ve finished).

My proposed solution to this is examples: pick a small number of ‘hard cases’ and map them between as many pairs of these systems as possible.

The hard cases should include at least: Charlotte Brontë (or similar); a contemporary author who has transitioned between genders and published broadly similar work under both identities; a contemporary author who publishes in different genres using different identities; ...

The cases should be accompanied by instructions for dealing with existing mistakes found (and errors will be found; see https://en.wikipedia.org/wiki/Wikipedia:VIAF/errors for some of the errors recently found during the Wikipedia/VIAF matching).

If such an effort gets off the ground, I'll put my hand up to do the Wikipedia component (as distinct from the Wikidata component).


by Stuart Yeates (noreply@blogger.com) at July 14, 2014 10:27 AM

July 13, 2014

John Miedema

Hammering out the Question and Answer Architecture. The big picture.

I settled on the Tankless option for the overall architecture — see diagram and discussion. In that architecture, the Question and Answer piece was one major component. I need to hammer out the details of that component because it has the most complexity, naturally. The following is the complete picture of the Question and Answer Architecture. On the left is the flow from the original question text, to the natural language processing and querying steps in the middle, to the clue enrichment and final answer on the right. All of these pieces need explanation. I will be presenting and discussing the pieces in three posts. Stay tuned.

1 Question and Answer Architecture

by johnmiedema at July 13, 2014 09:18 PM

Patrick Hochstenbach

Superhero study II

My second superhero study. I grabbed a reference image of a shushing girl from the internet and started to draw her in a BatGirl costume.

Filed under: Comics. Tagged: art, batgirl, cartoons, comics, Illustrator, Photoshop, superhero

by hochstenbach at July 13, 2014 12:09 PM

July 12, 2014

SearchHub

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter

In a previous blog post, I introduced the AutoPhrasingTokenFilter. This filter is designed to recognize noun phrases that represent a single entity or ‘thing’. In this post, I show how the use of this filter, combined with a Synonym Filter configured to take advantage of auto phrasing, can help to solve an ongoing problem in Lucene/Solr – how to deal with multi-term synonyms.

The problem with multi-term synonyms in Lucene/Solr is well documented (see Jack Krupansky’s proposal, John Berryman’s excellent summary and Nolan Lawson’s query parser solution). Basically, what it boils down to is a problem with parallel term positions in the synonym-expanded token list – based on the way that the Lucene indexer ingests the analyzed token stream. The indexer pays attention to a token’s position increment but does not record its position length. This causes multi-term tokens to overlap subsequent terms in the token stream rather than maintaining a strictly parallel relation (in terms of both start and end positions) with their synonymous terms. Therefore, rather than getting a clean ‘state graph’, we get a pattern called “sausagination” that does not accurately reflect the 1-1 mapping of terms to synonymous terms within the flow of the text (see the blog post by Mike McCandless on this issue). This problem disappears if all of the synonym pairs are single tokens.

The multi-term synonym problem was described in a Lucene JIRA ticket (LUCENE-1622), which is still marked as “Unresolved” (and, for some reason, as only “Minor”!).

The solution would seem to be to only use synonym expansion at query time, but this also has problems with phrase queries, incorrect boosting of rare synonyms due to IDF, and problems with matching multi-term synonyms – which tend to match more than they should (see the above-cited references). As we search wonks are wont to say, the Lucene/Solr synonym solution has problems with both precision and recall.

One solution to this problem is to avoid it altogether by making sure that the synonym list only contains single tokens. One suggested way to do this is to use one-way expansions such as big apple,new york city => nyc at both index and query time.  However, this doesn’t work since the query parser can’t ‘see’ beyond whitespaces (LUCENE-2605) so that a search for text:big apple gets converted to text:big text:apple and the expected synonym expansion doesn’t happen. It works if you search for text:”big apple”, but having to quote phrases to get their synonyms to work defeats the purpose of having synonyms for phrases in the first place. They should “just work” whenever a user enters the phrase in a query string.

LUCENE-2605 (also currently Unresolved) is at least marked as ‘Major’, but at the time of this writing it is still open (it was opened in 2010).

From this it would seem that the solution to the problem is to avoid multi-term synonyms altogether (if possible), as the underlying problem(s) seem to be intractable – or at least elusive. When this happens in the software world – when a bug fix does not appear to be imminent – we look instead for a … workaround! 8-) This is where the AutoPhrasingTokenFilter comes in – by providing a way to convert multi-term phrases into single tokens, it can be used as a precursor to synonym mapping. The solution has a number of side benefits – it preserves phrase searching and cross-phrase searches like ‘big apple restaurants’. It preserves highlighting and it works at either index or query time (if you are worried about the IDF issue). Why? Because rather than going for a solution of the root problem – it simply avoids it! In other words, “If you can’t beat ‘em, join ‘em”.

Fixing the LUCENE-1622 problem with the Auto Phrasing TokenFilter

The exact use case described in LUCENE-1622 can be “fixed” by noticing that the phrases “Big Apple” and “New York City” are meant to represent a single entity – the great City of New York (another possible synonymous phrase). As described in the previous post, the AutoPhrasingTokenFilter can be used to detect these phrases in a token stream and convert them to single tokens. To preserve character positions, a new attribute, replaceWhitespaceWith, was added so that the length of the autophrased token will equal the original phrase length but the token will not be split by the query parser – because it now has no whitespace characters in it. Replacing whitespace with another character in the indexed data also helps with highlighting – which depends on character positions. The source code for this filter is available on github.

So if we have an autophrases.txt file consisting of:

big apple
new york city
city of new york
new york new york
new york ny
ny city
ny ny
new york

Once we configure the AutophrasingTokenFilter to replace whitespace characters with an underscore character (see configuration below), we can create a synonyms.txt entry like this:

big_apple,new_york_city,city_of_new_york,new_york_new_york,new_york_ny,ny_city,ny_ny,nyc

(Note that the use of the ‘_’ character will break stemming filters so you should probably use a letter such as ‘x’ but the underscore is used here for the sake of clarity)

Note that the ‘of’ in the phrase ‘City of New York’ is normally considered to be a stopword. However, if we put the AutoPhrasing Filter before the StopFilter, it will ‘hide’ the stopword so that it can be used in the phrase. This is useful for cases where we have stop words that are contained in phrases but otherwise should be treated as noise words.

The configuration of the text analyzer looks like this. Note that I put the AutoPhrasingTokenFilter in the index analyzer only (with includeTokens=true so that single term queries and sub phrases will continue to hit). Putting auto phrasing in the query analyzer has no effect because of LUCENE-2605.  The SynonymFilter is also in the index analyzer only. It can also go in the query analyzer if you want – this is better if your synonyms list changes often but it does incur the IDF problem:

<fieldType name="text_autophrase" class="solr.TextField" 
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" 
            phrases="autophrases.txt" includeTokens="true"
            replaceWhitespaceWith="_" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"    
            ignoreCase="true" expand="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.KStemFilterFactory" />
  </analyzer>
</fieldType>
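
As a sanity check on what this chain actually emits, you can dump the analyzer output with the standard Lucene TokenStream API (or use Solr’s Analysis screen). Below is a minimal sketch, assuming a Lucene 4.x-era API; the class and method names are ours, and you would pass in an Analyzer built from the field type above:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDebug {

    // Print each term the given analyzer produces for a piece of text.
    public static void printTokens(Analyzer analyzer, String field, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}

For the ‘City of New York’ document you would expect to see a single city_of_new_york token (plus its synonyms, since expansion here happens at index time) alongside the ordinary single terms.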

Fixing the LUCENE-2605 problem:

Fixing the problem identified in LUCENE-2605 requires a little more work. Because the query parser only sends tokens to the query analyzer one at a time, there is no way to glue them together in the Analyzer’s token filter chain (even though the Solr Analysis console suggests that you can!). The solution is to do auto phrasing at query time before sending the query to the query parser. A QParserPlugin wrapper that preserves query syntax while auto phrasing the query ‘in place’ before passing it off to a ‘real’ query parser implementation does the trick. In other words, it does something similar to what was proposed in LUCENE-2605 by filtering “around” the query operators. The AutoPhrasingQParserPlugin uses the AutoPhrasingTokenFilter internally. Since this is a query parser, it requires a separate configuration in solrconfig.xml:

<requestHandler name="/autophrase" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="df">text</str>
     <str name="defType">autophrasingParser</str>
   </lst>
  </requestHandler>

  <queryParser name="autophrasingParser" 
               class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
    <str name="phrases">autophrases.txt</str>
    <str name="replaceWhitespaceWith">_</str>
  </queryParser>

To test the use case identified in LUCENE-1622, several test documents were created and indexed into a Solr collection (Can you spot the theme here? New Yorkers like myself are chauvinists :-) )

<add>
  <doc>
    <field name="id">1001</field>
    <field name="name">Doc 1</field>
    <field name="text">Example from LUCENE-1622 search for New York City restaurants</field>
  </doc>
  <doc>
    <field name="id">1002</field>
    <field name="name">Doc 2</field>
    <field name="text">There are many fine restaurants in the great City of New York.</field>
  </doc>
  <doc>
    <field name="id">1003</field>
    <field name="name">Doc 3</field>
    <field name="text">Multi-term synonyms in Solr is a big problem, and its not a new one.</field>
  </doc>
  <doc>
    <field name="id">1004</field>
    <field name="name">Doc 4</field>
    <field name="text">The empire state, New York State is a big state. There are many things to do in the State of New York.</field>
  </doc>
  <doc>
    <field name="id">1005</field>
    <field name="name">Doc 5</field>
    <field name="text">Many people like to visit the Big Apple, but they wouldn't want to live there.</field>
  </doc>
  <doc>
    <field name="id">1006</field>
    <field name="name">Doc 6</field>
    <field name="text">I like New York, New York its a hell of a town - the West Side's up and the Battery's down!</field>
  </doc>
  <doc>
    <field name="id">1007</field>
    <field name="name">Doc 7</field>
    <field name="text">I have a nice house near New Paltz. New Paltz has some nice restaurants and apple orchards too.</field>
  </doc>
  <doc>
    <field name="id">1008</field>
    <field name="name">Doc 8</field>
    <field name="text">As a New York baseball fan, you can root for the Yankees or you can root for the Mets. You can't root for both.</field>
  </doc>
  <doc>
    <field name="id">1009</field>
    <field name="name">Doc 9</field>
    <field name="text">The capital of New York is Albany.</field>
  </doc>
  <doc>
    <field name="id">1010</field>
    <field name="name">Doc 10</field>
    <field name="text">The Grand Old Duke of York, he had ten thousand men. He marched them up to the top of the hill and he marched them down again.</field>
  </doc>
  <doc>
    <field name="id">1011</field>
    <field name="name">Doc 11</field>
    <field name="text">There are some great parks in NYC, including Central Park and Riverside Park.</field>
  </doc>
  <doc>
    <field name="id">1012</field>
    <field name="name">Doc 12</field>
    <field name="text">It would be nice to live at 123 Broadway, NY, NY 10013.</field>
  </doc>
</add>

Query Tests: Comparing OOB behavior with auto phrasing:

Since the city of New York is in a State of the same name, queries for ‘New York’ are ambiguous and should return both. Out of the box (‘/select?q=New+York’), Solr will also return documents that merely contain the single terms ‘new’ and ‘york’. That is, consider the two documents about the ‘Grand Old Duke of York’ and ‘Multi-term synonyms in Solr’ that are returned in the result set below. They hit because they have the terms ‘new’ and/or ‘york’ in them but are not really relevant to the probable intent of the query. Furthermore, there are documents about New York that are missing because they use synonyms for New York City. So in this case, the OOTB SearchHandler suffers from both precision and recall errors.

"response": {
  "numFound": 9,
  "start": 0,
  "docs": [
    {
      "id": "1009",
      "text": "The capital of New York is Albany."
    },
    {
      "id": "1006",
      "text": "I like New York, New York its a hell of a town - the West Side's up and the Battery's down!"
    },
    {
      "id": "1002",
      "text": "There are many fine restaurants in the great City of New York."
    },
    {
      "id": "1004",
      "text": "The empire state, New York State is a big state. There are many things to do in the State of New York."
    },
    {
      "id": "1001",
      "text": "Example from LUCENE-1622 search for New York City restaurants"
    },
    {
      "id": "1008",
      "text": "As a New York baseball fan, you can root for the Yankees or you can root for the Mets. You can't root for both."
    },
    {
      "id": "1010",
      "text": "The Grand Old Duke of York, he had ten thousand men."
    },
    {
      "id": "1007",
      "text": "I have a nice house near New Paltz. New Paltz has some nice restaurants and apple orchards too."
    },
    {
      "id": "1003",
      "text": "Multi-term synonyms in Solr is a big problem, and its not a new one."
    }
  ]
}

With the auto phrasing filter in place, searching for New York (/autophrase?q=New+York) only returns documents containing that phrase (i.e. contained in both New York City and New York State), excluding records that contain synonyms like NYC or Big Apple:

"response": {
    "numFound": 6,
    "start": 0,
    "docs": [
      {
        "id": "1009",
        "name": "Doc 9",
        "text": "The capital of New York is Albany.",
        "_version_": 1473362972290056200
      },
      {
        "id": "1002",
        "name": "Doc 2",
        "text": "There are many fine restaurants in the great City of New York.",
        "_version_": 1473362972282716200
      },
      {
        "id": "1004",
        "name": "Doc 4",
        "text": "The empire state, New York State is a big state. There are many things to do in the State of New York.",
        "_version_": 1473362972284813300
      },
      {
        "id": "1001",
        "name": "Doc 1",
        "text": "Example from LUCENE-1622 search for New York City restaurants",
        "_version_": 1473362972255453200
      },
      {
        "id": "1006",
        "name": "Doc 6",
        "text": "I like New York, New York its a hell of a town - the West Side's up and the Battery's down!",
        "_version_": 1473362972285862000
      },
      {
        "id": "1008",
        "name": "Doc 8",
        "text": "As a New York baseball fan, you can root for the Yankees or you can root for the Mets. You can't root for both.",
        "_version_": 1473362972289007600
      }
    ]
  }

And searching for New York City (/autophrase?q=new+york+city) or any of its synonyms (big apple, city of new york, nyc, etc.) only returns records about New York City. Note that records about New York State or the baseball teams are correctly excluded:

"response": {
  "numFound": 6,
  "start": 0,
  "docs": [
    {
      "id": "1002",
      "text": "There are many fine restaurants in the great City of New York."
    },
    {
      "id": "1001",
      "text": "Example from LUCENE-1622 search for New York City restaurants"
    },
    {
      "id": "1005",
      "text": "Many people like to visit the Big Apple, but they wouldn't want to live there."
    },
    {
      "id": "1006",
      "text": "I like New York, New York its a hell of a town - the West Side's up and the Battery's down!"
    },
    {
      "id": "1011",
      "text": "There are some great parks in NYC, including Central Park and Riverside Park."
    },
    {
      "id": "1012",
      "text": "It would be nice to live at 123 W Broadway, NY, NY 10013. "
    }
  ]
}

Finally, getting back to the original use case reported in LUCENE-1622, a boolean search for any synonym of NYC AND restaurants, such as big apple AND restaurants (or +big apple +restaurants), will only return records about the New York City restaurant scene:

"response": {
  "numFound": 2,
  "start": 0,
  "docs": [
    {
      "id": "1002",
      "text": "There are many fine restaurants in the great City of New York."
    },
    {
      "id": "1001",
      "text": "Example from LUCENE-1622 search for New York City restaurants"
    }
  ]
}

Conclusion

The AutoPhrasingTokenFilter can be an important tool in solving one of the more difficult problems with Lucene/Solr search – how to deal with multi-term synonyms. At the same time, it helps mitigate another serious problem that all search engines have – their focus on single tokens and the ambiguities that are present at that level. By shifting the focus more towards phrases that should be treated as semantic entities or units of language (i.e. “things”), the search engine is better able to return results based on ‘what’ the user is looking for rather than just documents containing words that match the query. We are moving from searching with a “bag of words” to searching with a “bag of things”.

by Ted Sullivan at July 12, 2014 03:38 PM

July 11, 2014

Emily Morton-Owens

#deletefacebook. Do it for the German Vagina Statue.

Since I first posted about it, I've seen several excuses for the Facebook experiment, which--even if they come from different people--strike me as kettle logic. The experiment was covered by Facebook's Terms of Use so no IRB approval was required. The research was exempt by an IRB (whose?). The editor thought the authors' IRB(s) had approved it. Who cares if the IRB approved it because they're passé.

by emily at July 11, 2014 03:59 PM

Open Knowledge Foundation

New Local Groups in Cameroon, Guernsey, Kenya, Bermuda and New Zealand!

Once again we can proudly announce the establishment of a new round of Open Knowledge Local Groups, headed by community leaders around the world. This time we welcome Cameroon, Guernsey, Kenya, Bermuda and New Zealand to the family of Local Groups, which brings the global Open Knowledge community tally beyond the 50+ countries mark. In this blog post we would like to introduce the people heading these groups and invite everyone to join the community in these countries.

Cameroon

In Cameroon, the incubating Local Group is headed in unison by Agnes Ebo’o and Jean Brice Tetka. Agnes Ebo’o is the founder of the Citizens Governance Initiatives in Cameroon, a nonprofit association that promotes accountability and citizens’ participation in governance. A pioneer in the promotion of freedom of information and open government in Cameroon, Agnes has been involved in the creation of several regional initiatives that promote open government and the rule of law in Africa. These include the Academy for Constitutional Law and Justice in Africa and the Africa Freedom of Information Centre; a Pan-African NGO and resource centre that promotes the right of access to information across Africa. Agnes is also the Co-founder of the Gulf of Guinea Citizens Network, a network of advocates for participatory, transparent and accountable management of the natural resources in the Gulf of Guinea region of Africa. A lawyer by training, Agnes holds an undergraduate degree from the University of Poitiers, France, and an LLM from the University of Wales Cardiff, UK.

Jean joined Transparency International in February 2014 as Data and Technology Coordinator for the People Engagement Programme, working on technological solutions for anti-corruption, data analysis and visualisation. He has a Bachelors degree in Management ICT Studies from the African Institute of Programming, and his previous experience includes three years as a project manager with an anti-corruption organisation, two years as IT manager for a private company and volunteering for several NGOs.

Kenya

Ahmed Maawy is a Shaper with the Global Shapers Community (an initiative of the World Economic Forum) and an Executive Director at The Mombasa Tech Community (CBO). He is a technology expert working with D8A and Appfrica labs, and a Technology Lead at Abayima. Ahmed is also one of the pioneers of The Amani Institute’s Post Graduate Certificate in Social Innovation Management, a groundbreaking institution that aims to create a world without boundaries. Ahmed has spent more than 10 years developing web, mobile, and enterprise software as well as functioning as a project manager for a number of software products and projects. He has worked with corporations and non-profits alike, as well as media agencies such as Al Jazeera New Media (on 3 important curation projects covering Somalia, Libya and Gaza) and Internews Europe. He has also worked for Ushahidi as a Software Engineer for SwiftRiver, for Datadyne as Product Manager for EpiSurveyor (now MagPi), and with Kenya Airways on their online marketing strategy, bookings and reservations engines, and overall web strategy, to name a few.

Bermuda

Heading up the Open Knowledge efforts in Bermuda by setting up a new Local Group are Andrew Simons and Louis Galipeau. Andrew is Bermudian, born and raised. He attended Stanford University as a Bermuda Government Scholar, and graduated with a BSc in computer science and an MSc in chemical engineering. Before moving home to Bermuda, he worked in the Boston area at EMC, a global technology company. He now works as a catastrophe modeler in the insurance industry. In 2013, Andrew co-founded Bermuda.io, a free online repository of Bermuda public data running on CKAN.

Louis is Canadian and has made Bermuda his home. A self-taught technophile with a diverse background, he has a drive towards the use of new media and technology in art, business, and community efforts. He is involved locally as a core member of TEDxBermuda and works at a law firm as the senior lead applications architect. In 2013, Louis also co-founded Bermuda.io with Andrew.

New Zealand

The Local Group in New Zealand is being bootstrapped by Rowan Crawford, a software developer who originally trained as a pharmacist. He maintains New Zealand’s Freedom of Information requests site, fyi.org.nz, and currently focuses on connecting the public to representatives via askaway.org.nz and on bringing Code for America-style fellowships to New Zealand.

Guernsey

In Guernsey, Philip Smith is the initiator of the new Local Group. He is a project and programme manager heading CBO Projects, has a background with the charity This Is Epic, and is one of the founders of The Dandelion Project, a community-driven initiative aiming to create a better place for people by bringing together citizens to share their knowledge and skills. Dandelion has, among other things, started a small number of community-led projects that involve Guernsey moving forward with open data, for example a bus app for local bus services and an open data portal that will hopefully drive open access to valuable data in Guernsey.

We encourage everyone to get in touch with these new Local Groups – to join, connect and collaborate! Contact information can be found via our global network page.

Photo by Volker Agüeras Gäng, CC-BY.

by Christian Villum at July 11, 2014 12:34 PM

Patrick Hochstenbach

Superhero study

Drawing a bit from Lee Townsend’s book “Drawing Action Comics” to learn about the inking techniques.

Filed under: Comics. Tagged: art, comic, comics, Illustrator, Photoshop, study, superhero

by hochstenbach at July 11, 2014 10:48 AM

Ed Summers

why @congressedits?

Note: as with all the content on this blog, this post reflects my own thoughts about a personal project, and not the opinions or activities of my employer.

Two days ago a retweet from my friend Ian Davis scrolled past in my Twitter stream:

This Twitter bot will show whenever someone edits Wikipedia from within the British Parliament. It was set up by @tomscott using @ifttt.

— Parliament WikiEdits (@parliamentedits)

July 8, 2014

The simplicity of combining Wikipedia and Twitter in this way immediately struck me as a potentially useful transparency tool. So using my experience on a previous side project I quickly put together a short program that listens to all major language Wikipedias for anonymous edits from Congressional IP address ranges (thanks Josh) … and tweets them.
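
The mechanics are simple enough to sketch. The real bot is not written in Java, and the address ranges below are placeholders rather than the actual Congressional blocks, but the core check amounts to: for each anonymous edit coming off the recent changes feed, test whether the editing IP falls inside a watched range, and tweet it if it does. A minimal sketch of that test:

import java.net.InetAddress;
import java.util.Arrays;
import java.util.List;

public class RangeCheck {

    // Placeholder CIDR blocks; the real bot watches the published Congressional ranges.
    static final List<String> WATCHED = Arrays.asList("192.0.2.0/24", "198.51.100.0/24");

    // True if an IPv4 address falls inside any watched CIDR block.
    static boolean isWatched(String ip) throws Exception {
        long addr = toLong(ip);
        for (String cidr : WATCHED) {
            String[] parts = cidr.split("/");
            int bits = Integer.parseInt(parts[1]);
            long mask = bits == 0 ? 0 : (0xFFFFFFFFL << (32 - bits)) & 0xFFFFFFFFL;
            if ((addr & mask) == (toLong(parts[0]) & mask)) {
                return true;
            }
        }
        return false;
    }

    // Convert a dotted-quad IPv4 address to an unsigned 32-bit value.
    static long toLong(String ip) throws Exception {
        byte[] b = InetAddress.getByName(ip).getAddress();
        return ((b[0] & 0xFFL) << 24) | ((b[1] & 0xFFL) << 16) | ((b[2] & 0xFFL) << 8) | (b[3] & 0xFFL);
    }
}

Anonymous Wikipedia edits are attributed to the editor’s IP address, so a check like this is most of what the bot needs in order to decide whether an edit is worth tweeting.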

In less than 48 hours the @congressedits Twitter account had more than 3,000 followers. My friend Nick set up gccaedits for Canada using the same software … and @wikiAssemblee (France) and @RiksdagWikiEdit (Sweden) were quick to follow.


Watching the followers rise, and the flood of tweets from them, brought home something that I believed intellectually but hadn’t felt quite so viscerally before. There is an incredible yearning in this country and around the world to use technology to provide more transparency about our democracies.

Sure, there were tweets and media stories that belittled the few edits that have been found so far. But by and large people on Twitter have been encouraging, supportive and above all interested in what their elected representatives are doing. Despite historically low approval ratings for Congress, people still care deeply about our democracies, our principles and dreams of a government of the people, by the people and for the people.

We desperately want to be part of a more informed citizenry, that engages with our local communities, sees the world as our stage, and the World Wide Web as our medium.


Consider this thought experiment. Imagine if our elected representatives and their staffers logged in to Wikipedia, identified themselves much like Dominic McDevitt-Parks (a federal employee at the National Archives), and used their knowledge of the issues and local history to help make Wikipedia better. Perhaps in the process they enter into conversation on an article’s talk page with a constituent or a political opponent, and learn something from them, or perhaps compromise? The version history becomes a history of the debate and discussion around a topic. Certainly there are issues of conflict of interest to consider, but we always edit topics we are interested in and knowledgeable about, don’t we?

I think there is often fear that increased transparency can lead to increased criticism of our elected officials. It’s not surprising given the way our political party system and media operate: always looking for scandal, and the salacious story that will push public opinion a point in one direction, to someone’s advantage. This fear encourages us to clamp down, to decrease or obfuscate the transparency we have. We all kinda lose, irrespective of our political leanings, because we are ultimately less informed.


I wrote this post to make it clear that my hope for @congressedits wasn’t to expose inanity, or belittle our elected officials. The truth is, @congressedits has only announced a handful of edits, and some of them are pretty banal. But can’t a staffer or politician make a grammatical change, or update an article about a movie? Is it really news that they are human, just like the rest of us?

I created @congressedits because I hoped it could engender more, better ideas and tools like it. More thought experiments. More care for our communities and peoples. More understanding, and willingness to talk to each other. More humor. More human.

@Congressedits is why we invented the Internet

— zarkinfrood (@zarkinfrood)

July 11, 2014

I’m pretty sure zarkinfrood meant @congressedits figuratively, not literally. As if perhaps @congressedits was emblematic, in its very small way, of something a lot bigger and more important. Let’s not forget that when we see the inevitable mockery and bickering in the media. Don’t forget the big picture. We need transparency in our government more than ever, so we can have healthy debates about the issues that matter. We need to protect and enrich our Internet, and our Web … and to do that we need to positively engage in debate, not tear each other down.

Educate and inform the whole mass of the people. Enable them to see that it is their interest to preserve peace and order, and they will preserve them. And it requires no very high degree of education to convince them of this. They are the only sure reliance for the preservation of our liberty. — Thomas Jefferson

Who knew TJ was a Wikipedian…

by ed at July 11, 2014 04:44 AM

District Dispatch

Higher education, library groups release net neutrality principles

Today, higher education and library organizations representing thousands of colleges, universities, and libraries nationwide released a joint set of Net Neutrality Principles they recommend form the basis of an upcoming Federal Communications Commission (FCC) decision to protect the openness of the Internet. The groups believe network neutrality protections are essential to protecting freedom of speech, educational achievement, and economic growth. The organizations endorsing these principles are:

Libraries and institutions of higher education are leaders in creating, fostering, using, extending, and maximizing the potential of the Internet for research, education, and the public good. These groups are extremely concerned that the recent court decision vacating two of the key “open Internet” rules creates an opportunity for Internet providers to block or degrade (e.g., arbitrarily slow) certain Internet traffic, or prioritize certain services, while relegating public interest services to the “slow lane.”

“America’s libraries collect, create, and disseminate essential information to the public over the Internet, and enable our users to create and distribute their own digital content and applications,” said American Library Association President Courtney Young in a statement. “Network neutrality is essential to ensuring open and nondiscriminatory access to Internet content and services for all. The American Library Association is proud to stand with other education and learning organizations in outlining core principles for preserving the open Internet as a vital platform for free speech, innovation, and civic engagement.”

Moving forward, the American Library Association will continue to advocate for an open Internet next week, when the organization comments on the FCC’s Notice of Proposed Rulemaking.

The post Higher education, library groups release net neutrality principles appeared first on District Dispatch.

by Jazzy Wright at July 11, 2014 02:34 AM

DuraSpace News

UPDATE: Towards a Fedora 4 Production Release

From David Wilcox, Fedora Product Manager
 
Winchester, MA – The Fedora 4.0 Beta [1] is available for testing, and progress is being made toward the Fedora 4.0 production release. To focus efforts over the next six months, a planning document [2] outlines the goals for this short period and the steps the Fedora team is taking to reach those goals.

by carol at July 11, 2014 01:00 AM

July 10, 2014

District Dispatch

Heads Up! What does the HathiTrust decision mean for libraries?

Jonathan Band, legal counsel for the Library Copyright Alliance (of which ALA is a member), prepared a document [pdf] detailing how the HathiTrust decision affects libraries interested in mass digitization projects. The document primarily focuses on academic libraries, but there are nuggets for any non-profit library conducting mass digitization.

Now we wait for the decision in Authors Guild v. Google [pdf], on appeal in the U.S. Court of Appeals for the Second Circuit. The key difference between the two cases is that Google displays snippets of the books searched to provide context for the user. Google no longer collects advertising revenue from the database, so for now it seems that the company’s for-profit status will not make as much of a difference between the two cases. But you never know…

The post Heads Up! What does the HathiTrust decision mean for libraries? appeared first on District Dispatch.

by Carrie Russell at July 10, 2014 08:57 PM

Nicole Engard

Bookmarks for July 10, 2014

Today I found the following resources and bookmarked them.

Digest powered by RSS Digest

The post Bookmarks for July 10, 2014 appeared first on What I Learned Today....

by Nicole C. Engard at July 10, 2014 08:30 PM

LITA

Search Engine Optimization (SEO) for Libraries


Being found in commercial search engines, like Google, and writing indexable content have largely been on the periphery of library web development practice. In this course/workshop, we will explore the mechanics and principles of acceptable best practices for SEO, identify components that contribute to successful harvesting of library web sites and microsites, and discuss the need to make library content findable in broader online settings. Come learn why SEO is not just “snake oil” and can be an integral part of library marketing and outreach initiatives.
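
To give a flavor of what “writing indexable content” can look like in practice, here is a minimal, hypothetical sketch (not course material; the item, URL, and field names are invented) of generating schema.org JSON-LD for a digital collection item, one common way to give search engines structured metadata to harvest:

```python
import json

# Hypothetical collection item; the fields and URL are invented for illustration.
item = {
    "title": "Yellowstone Expedition Photographs, 1871",
    "description": "Digitized photographs from the Hayden Geological Survey.",
    "url": "https://library.example.edu/collections/yellowstone/1871",
    "creator": "William Henry Jackson",
}

def to_schema_org(record):
    """Wrap a catalog record in schema.org CreativeWork markup as JSON-LD."""
    return {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "name": record["title"],
        "description": record["description"],
        "url": record["url"],
        "creator": {"@type": "Person", "name": record["creator"]},
    }

# Emit the <script type="application/ld+json"> block for the item's landing page,
# which is the part crawlers read when they index the page.
print('<script type="application/ld+json">')
print(json.dumps(to_schema_org(item), indent=2))
print("</script>")
```

A block like this would be embedded in the item’s landing page so crawlers can index the collection record along with the page.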

Topics include:

Learning Outcomes include:

Who Should Attend:

Instructors:

Team teaching with the Montana State University research team, including: Kenning Arlitsch, Dean of the Library; Jason Clark, Head of Library Informatics and Computing; Patrick O’Brien, Semantic Web Research Director; and Scott Young, Digital Initiatives Librarian.

Course Level & Prerequisites

Participants should know how to use a text editor and a current standard web browser. Some basic knowledge of HTML will be helpful.

Date(s) & Time(s)

10:30 a.m. – 12:30 p.m. (CDT) on each of these days:

Fee

LITA Member: $135

ALA Member: $195

Non-member: $260

How to Register

For more information, visit the web site or contact LITA at (312) 280-4269 or lita@ala.org.

by vedmonds at July 10, 2014 06:38 PM

DPLA

DPLA in the Classroom: Resources on Japanese Internment

As an eighth grade Language Arts teacher, I often find myself sifting through a list of messy links on Google.  I scour crowded Internet pages for background information on Of Mice and Men or the Great Depression. And all too often, after landing upon tens of amateur resources with suspect information, I end my search frustrated and empty-handed. During times like these, the Digital Public Library of America could be an extremely useful tool. The DPLA provides access to thousands of primary sources that teachers can incorporate into classroom units at the middle, high school, or higher-ed levels.

This summer, I am researching ways that teachers can utilize the DPLA to enhance learning and encourage exploration in their classrooms. In this post, I want to share some of the exciting and informative resources I’ve found related to a sample classroom unit: Japanese-American Internment during WWII. Middle and high school Social Studies teachers discuss this topic while teaching about the home front during World War II, often encouraging students to question a black-and-white narrative of the war – one that depicts a heroic U.S. freeing victims of discrimination everywhere. Language Arts teachers might incorporate these resources while reading novels on the Japanese-American experience during that time, such as Jeanne Wakatsuki Houston’s Farewell to Manzanar, or Yoshiko Uchida’s Journey to Topaz. In my own classroom, students learn about Japanese internment through Steve Kluger’s epistolary novel Last Days of Summer. Teachers should feel encouraged to explore the resources I discuss, or to use this explanation as a model for their own research and activities.

In order to get students thinking about the conditions and concerns that led to Japanese internment, teachers may want to begin by sharing samples of World War II propaganda. This poster, depicting Uncle Sam’s fist knocking out Japanese Emperor Hirohito, or this pamphlet cover, proclaiming Japan as the enemy, both clearly illustrate American sentiment toward Japan in the aftermath of the Pearl Harbor attack. After examining these dramatic pictures, together with this warning about enemy spies in our midst, students should be able to infer the general hostility and suspicion directed at Japanese Americans during the time. Teachers interested in engaging artistic students might also find this 1934 painting useful; in it, Japanese-American artist Kenjiro Nomura hints at dark times in his community through stormy clouds and eerie shadows.

"Now for the Knockout" poster of a U.S. arm punching Emperor Hirohito.

Courtesy North Carolina Department of Cultural Resources via North Carolina Digital Heritage Center.

Cover of "Know Your Enemy: Japan!" Pamphlet with image of armed Japanese soldiers.

“Know Your Enemy” pamphlet. Courtesy Portal to Texas History.

Letters and political documents available through the DPLA are additionally useful as students explore attitudes surrounding Japanese relocation. For instance, letters from Arizona Governor Sidney Osborn express his concern regarding internment in his state, while one letter to the governor argues that internment is necessary to protect Japanese Americans from vigilante violence. The latter text could spark important debate in the classroom, as students consider whether the U.S. government was truly “protecting” its residents by forcibly removing them from their homes. The DPLA also provides access to Executive Order 9066 – the order signed by FDR that establishes military zones and ultimately authorizes the relocation of tens of thousands of Japanese Americans.

Of course, teachers will want to expose students to the perspectives and experiences of those interned in the camps. The DPLA provides access to hundreds of useful photographs; some depict the journey to the camps, while others reveal various aspects of daily life within the camps themselves. This photo of a pre-evacuation sale and this one of jumbled piles of baggage each hint at families’ hasty efforts to sort and assemble belongings before leaving home indefinitely. Students might draw inferences about the crowded and uncomfortable life in the camps by examining this picture of the barracks or this one of residents waiting in line for rations of soap.

“Baggage of Japanese during Relocation.” Courtesy NARA.

Internees line up for a soap ration at Manzanar Relocation Center, Manzanar, California. Courtesy NARA.

School children and teachers at Granada Japanese Relocation Camp, Amache, Colorado, ca. 1942. Courtesy USC Libraries.

Other resources available through the DPLA reveal how those living in the camps made time for recreation and education despite difficult circumstances. One photo, for instance, shows a swimming pool at a camp in Poston, Arizona, while another features school children and teachers at a camp in Colorado. Students might also draw inferences about attitudes through newspapers published in the camps. Page 12 of a newspaper from Topaz, displaying a “letter to Washington,” reveals how some internees, perhaps surprisingly, felt eager to serve in the American army despite restrictions prohibiting them from doing so.

A unit on Japanese internment would not feel complete without an analysis of the U.S. government’s efforts to apologize in the 1980s. After reading excerpts from these findings, students will discover how our government ultimately looked back on this period as a “grave injustice.” The act that followed, signed into law by President Reagan in 1988, granted reparations to former internees. Teachers might use these texts to encourage critical evaluation: Was the government’s response adequate? Or was it, instead, too little too late? Can reparations ever undo wrongdoings? And if not, are they worth pursuing?

My tour of the DPLA’s information on Japanese internment revealed a wealth of useful resources for teachers and students. Thousands of photos, posters, and texts – just a sampling of which I listed above – provide insight into the Japanese-American experience, and into the paranoia and panic that pervaded American consciousness during WWII. I hope this post gives teachers some ideas about how they can use the DPLA to create units on this topic, and on other important periods of our collective past.

Cover image: “Poston, Arizona. Living quarters of evacuees of Japanese ancestry at this War Relocation Authority.” Courtesy National Archives.


All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

by Naomi Forman at July 10, 2014 05:01 PM

Open Knowledge Foundation

OpenCorporates invites you to join the launch of #FlashHacks

This is a guest blog post by OpenCorporates.


OpenCorporates is now 3 years old. Looking back at our first post on the Open Knowledge (Foundation) blog, about reaching 20 million companies, it is heartening to see that we have come a long way. We now have over 70 million companies in 80 jurisdictions worldwide, making us the world’s largest open database of companies. The success story of OpenCorporates is not that of a tiny team but of the whole open data community, because it has always been a community effort thanks to Open Knowledge and others. From writing scrapers to alerting us when new data is available, deciphering language issues, or helping us grow our reach – the open data community has been the driver behind OpenCorporates.

Yet, while we are making great progress toward our core target of a URL for every single company in the world, there’s a bigger goal here – de-siloing all the government data that relates to companies and connecting it to those companies. In fact, one of the most frequent questions has been “How can I help get data into OpenCorporates?” Now we have an answer to that. Not just an answer – a brand new platform that makes it possible for the community to help us get company-related data into OpenCorporates.

To start this new era of crowdscraping, we launched a #FlashHacks campaign that aims to get 10 million data points in 10 days. With your help, we are confident we can smash the target.


Why is this important?

Information about the public and private sectors is of monumental importance to understanding and changing the world we live in. Transnational corporations can wield unprecedented influence on politics and the economy, and we have a limited capacity to understand this when we don’t know what these legal entities look like. The influence of these companies can be good or bad, and we don’t have a clear picture of this.

Company information is often not available, and when it is, it is buried under hard-to-use websites and PDFs. Fortunately, the work of the open data and transparency community has brought a tide of change. With the introduction of the Open Government Partnership and the G8 Open Data Charter, governments are committing to make this information easily and publicly available. Yet action on this front remains slow. And that’s why scraping is at the heart of the open data movement! Where would the open data community be if it had not been for bot-writers spending time deciphering formats and writing code to release data?
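
For anyone who has never written one, a scraping bot is usually just a short script that fetches a register, normalizes each record, and emits structured data. The sketch below is a generic, hypothetical illustration – the URL and field names are invented, and it is not the API of OpenCorporates’ new platform:

```python
import csv
import io
import urllib.request

# Hypothetical company register; real bots target whatever registry they are scraping.
REGISTER_URL = "https://example.gov/companies.csv"

def fetch_company_records(url):
    """Download a CSV register and yield one normalized dict per company."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "company_number": row.get("number"),
            "name": row.get("name"),
            "incorporation_date": row.get("incorporated"),
        }

if __name__ == "__main__":
    for record in fetch_company_records(REGISTER_URL):
        # A real bot would submit these records to a platform rather than print them.
        print(record)
```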


We want to use #FlashHacks as a celebration of the commitment of bot-writers and invite others to join us in changing the world through open data.

#FlashHacks at OKFestival

The last day of the campaign coincides with the last day of OKFestival, probably the biggest gathering of the open data community. So we will be putting on three #FlashHacks in partnership with Open Knowledge Germany, Code for Africa, and the Sunlight Foundation.

The OKF Germany #FlashHack will be releasing German data. Sign up here.

The Sunlight Foundation #FlashHack will be releasing political lobbying data. Sign up here.

The Code for Africa #FlashHack will be releasing African data. Sign up here.

How can you join the crowdscraping movement if you can’t make it to OKFest?

If you run into any problems, you can post to our Google Group.

by Guest at July 10, 2014 04:28 PM

Eric Hellman

"Subtleism" is a Useful Word

Allison Kaptur has written about the last of Hacker School's lightweight social rules: "No Subtle -isms":

Our last social rule, "No subtle -isms," bans subtle racism, sexism, homophobia, transphobia, and other kinds of bias. Like the first three rules, it's targeting subtle, accidental, mildly hurtful behavior. This rule isn't targeting slurs, harassment, or threats. These kinds of severe violations would have consequences, up to and including expelling someone from Hacker School. 
Breaking the fourth social rule, like breaking any other social rule, is an accident and a small thing. In theory, someone should be able to say "Hey, that was subtly sexist," get the response "Oops, sorry!" and move on just as easily as if they'd well-actually'ed. In practice, people are less likely to point out when this rule is broken, and more likely to be defensive if they were the rule-breaker. We'd like to change this.
When this was explained to me by Hacker School Co-Founder Sonali Sridhar, I thought it was brilliant, but I heard "subtle -ism" as a single word, "subtleism". "Subtleism" conveyed to me the concept that something could be harmless by itself, but multiplied by a thousand could be oppressive. So, for example, using "you guys" for the second person plural when both men and women are included is never meant to be sexist, and is rarely taken the wrong way. But an ocean of hundreds or even thousands of tiny, insignificant locutions like "you guys" can drown even a strong swimmer.

The reason subtleism is a useful word is that it can convey forgiveness in a context of working together to create a culture that is supportive of a diverse team. Reminding someone of a subtleism doesn't need to be a "shaming ritual"; after all, everyone uses subtleisms all the time. Compare the word "micro-aggression", which is used as an accusation or a lamentation.

Also, the word we should be using for the second person plural is "youse".

by Eric Hellman (noreply@blogger.com) at July 10, 2014 04:43 PM

Casey Bisson

Air-gap flashes for fun, and more fun

This 2011 blog post by Maurice Ribble explains the problem with xenon flash tubes such as those typically used in photography:

[X]enon flash tubes have a minimum duration of 1/40,000th of a second. That’s fast enough for most things, but not for a shooting bullet [that] travels around 1000 feet/second. In 1/40,000th of a second that bullet can travel about 1/3rd of an inch leading to blurry photographs of bullets.

Image blatantly stolen from Maurice Ribble


I realized I could build a sub-microsecond flash for just a few hundred dollars. A sub-microsecond flash means the flash duration is less than 1/1,000,000th of a second or about 25 times faster than a xenon flash.
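
Running the numbers from the quote makes the difference concrete; this sketch simply restates that arithmetic, assuming the quoted bullet speed of 1,000 feet per second:

```python
BULLET_SPEED_FPS = 1000.0   # feet per second, the speed quoted above
INCHES_PER_FOOT = 12.0

def blur_inches(flash_duration_seconds):
    """Distance the bullet travels while the flash is lit, i.e. the motion blur."""
    return BULLET_SPEED_FPS * flash_duration_seconds * INCHES_PER_FOOT

# Xenon tube at ~1/40,000 s: roughly a third of an inch of blur.
print(round(blur_inches(1 / 40_000), 3))      # 0.3
# Sub-microsecond air-gap flash at 1/1,000,000 s: about 0.012 inches.
print(round(blur_inches(1 / 1_000_000), 3))   # 0.012
```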

Fair warning: DIYing this sort of thing can be more dangerous than eating lead paint while skipping blindly across a busy intersection, but it looks awfully fun.

by Casey Bisson at July 10, 2014 04:17 AM

District Dispatch

ALA applauds Congress for passing the Workforce Innovation and Opportunity Act

Today, the U.S. House of Representatives passed the Workforce Innovation and Opportunity Act (WIOA), H.R. 803, in a bipartisan vote of 415-6. The passage of the bill comes after the U.S. Senate passed the legislation by a vote of 95-3 on June 25. In a statement, the American Library Association (ALA) gave special thanks to Senator Jack Reed (D-RI) and Representative Rush Holt (D-NJ) for their long-time efforts to include libraries in this legislation.

“We would like to thank Senator Reed and Representative Holt for their many years of work to ensure that public libraries are recognized as places where jobseekers can go for job search assistance and employment training,” said ALA President Courtney Young. “With over 16,000 public libraries in the United States today, public libraries are often the only place in a community where people can receive these services at no charge. This bill recognizes the many ways that public libraries help Americans get back to work.”

The Workforce Innovation and Opportunity Act includes the following provisions for America’s libraries:

The Workforce Innovation and Opportunity Act will now be sent to the President for his signature.

The post ALA applauds Congress for passing the Workforce Innovation and Opportunity Act appeared first on District Dispatch.

by Jazzy Wright at July 10, 2014 02:38 AM

DuraSpace News

DuraCloud at Goucher College

by carol at July 10, 2014 01:00 AM

July 09, 2014

Jonathan Rochkind

SAGE retracts 60 papers in “peer review citation ring”

A good reminder that a critical approach to scholarly literature doesn’t end with “Beall’s list”, and maybe doesn’t even begin there. I still think academic libraries/librarians should consider it part of their mission to teach students (and faculty) about current issues in the trustworthiness of scholarly literature, and to approach ‘peer review’ critically.

http://www.uk.sagepub.com/aboutus/press/2014/jul/7.htm

London, UK (08 July 2014) – SAGE announces the retraction of 60 articles implicated in a peer review and citation ring at the Journal of Vibration and Control (JVC). The full extent of the peer review ring has been uncovered following a 14 month SAGE-led investigation, and centres on the strongly suspected misconduct of Peter Chen, formerly of National Pingtung University of Education, Taiwan (NPUE) and possibly other authors at this institution.

In 2013 the then Editor-in-Chief of JVC, Professor Ali H. Nayfeh, and SAGE became aware of a potential peer review ring involving assumed and fabricated identities used to manipulate the online submission system SAGE Track powered by ScholarOne Manuscripts™. Immediate action was taken to prevent JVC from being exploited further, and a complex investigation throughout 2013 and 2014 was undertaken with the full cooperation of Professor Nayfeh and subsequently NPUE.

In total 60 articles have been retracted from JVC after evidence led to at least one author or reviewer being implicated in the peer review ring. Now that the investigation is complete, and the authors have been notified of the findings, we are in a position to make this statement.

Some more summary from retractionwatch.com, which notes this isn’t the first time fake identities have been fraudulently used in peer review.


Filed under: General

by jrochkind at July 09, 2014 04:15 PM

Open Knowledge Foundation

#OKStory

Everyone is a storyteller! We are just one week away from the big Open Brain Party that is OKFestival, and we need all the storytelling help you can muster. Trust us: from photos to videos to art to blogs to tweets – share away.

The Storytelling team is a community-driven project. We will work with all participants to decide which tasks are possible and which stories they want to cover. We remix together.

We’ve written up this summary of how to storytell, some story ideas, and suggested formats.

There are a few ways to join:

We highlighted some ways to storytell in this brief 20-minute chat:

by Heather Leson at July 09, 2014 02:58 PM

DuraSpace News

READ The Library of Congress Digital Preservation Newsletter

From Susan Manus, Digital Project Coordinator, Library of Congress
 
Washington, DC – The latest Library of Congress Digital Preservation newsletter is now available here (PDF)!
 
In this issue:
 
• Featuring "Digital Preservation and the Arts" including Web Archiving and Preserving the Arts, and Preserving Digital and Software-Based Artworks

by carol at July 09, 2014 01:00 AM

July 08, 2014

Roy Tennant

Tennant’s Simple Guide to Programming Languages

A colleague recently pointed out that IEEE Spectrum had an interactive tool by which you could explore the top programming languages in various areas (e.g., mobile, web, enterprise, and embedded). Besides noting that my favorite web programming language barely made it into the top ten for the Web (Perl, which they mistakenly called PERL), I was astonished by something.

They included HTML and called it “A specialized language for describing the appearance and content of Web pages.” Say what? If they had called their tool “The Top Languages” (leaving out “Programming”), then fine, but they didn’t.

This nonsense, especially coming from IEEE of all places, set me off. I decided I would have to instruct them about what makes a programming language. I fired up OmniGraffle and got to work. Soon I had the chart you see here, which defines what makes a programming language a programming language for IEEE or anyone else who needs help figuring it out.

Then a different colleague pointed out the XSLT edge case. It’s an edge case because although you can use it to write loops, you can’t run it independently — you need a separate XSLT processing engine like Saxon or xsltproc to execute it. So it is really the combination of XSLT plus a processing engine that can be considered a programming language, given my definition above.
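
To make the edge case concrete, here is a minimal sketch of that combination, assuming Python with lxml (which bundles the libxslt engine). The stylesheet expresses a loop, but only the engine can execute it:

```python
from lxml import etree  # lxml bundles libxslt, which acts as the processing engine

# A stylesheet that loops over <item> elements. XSLT can express the loop,
# but nothing happens until an engine executes it.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/list">
    <xsl:for-each select="item">
      <xsl:value-of select="."/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")

source = etree.XML(b"<list><item>HTML</item><item>XSLT</item><item>Perl</item></list>")

transform = etree.XSLT(stylesheet)  # the engine compiles the stylesheet into a callable
print(str(transform(source)))       # prints HTML, XSLT, Perl -- one per line
```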

But one thing is perfectly clear — HTML does not make the cut. Also, oddly enough, when you look at only the “Web” programming languages HTML comes in 8th in the list — that’s right, 8th! — below my favorite language Perl. Go figure that one out.

by Roy Tennant at July 08, 2014 11:51 PM

District Dispatch

A mad dash to the end (or beginning) as FCC readies for an E-rate vote

Plus a few things we’re likely to know this Friday

Last night, about 40 minutes before the FCC filing system put up its “closed for business” sign, ALA filed a joint letter (pdf) with the Association for Rural & Small Libraries (ARSL), the Chief Officers of State Library Agencies (COSLA), the Public Library Association (PLA), and the Urban Libraries Council (ULC). The letter supports the proposed E-rate order (which is currently behind closed doors and under close review at the Commission in preparation for a vote on Friday). The letter is the result of extensive thought and negotiation among the library community—representing the smallest to the largest public libraries—and speaks to the shared vision that this E-rate order is a critical opportunity to dramatically improve library connectivity to and within the library building. Our message to the Commission and stakeholders at large:

We believe that the time to act is now so that changes made today in Washington, D.C., can take hold immediately in communities across the country… While our diverse organizations may differ on some of the details on the best path forward for program improvements, we are in agreement that to delay this important first step will shortchange our nation’s public libraries and the communities they serve.

Not only does the joint letter urge the Commission to move forward, it lays the groundwork for further strong library engagement in what is likely to be a continued reform process.

ALA filed its formula proposal, over which we have been stewing, well, for several months, if not since last summer when the concept of a budget for the E-rate program first surfaced. After much input from ALA’s E-rate task force, network experts, state library staff, and library directors, among others, we arrived at a square-foot model. Basing Wi-Fi and internal wiring service needs on a library’s square footage is a metric that can work for libraries of all sizes. It does not depend on data that libraries may not all collect the same way (which could lead to delays in funding down the road because of prolonged review by USAC). Libraries already know and report their square footage as part of the data they are required to report to IMLS, so the numbers are publicly available (and easily checked by the Commission and/or USAC).

Gathering cost data as well as descriptions of how libraries design and implement their Wi-Fi networks resulted in a broad range of information that we distilled down to our proposal of $2.30 per square foot and a floor of 4000 square feet (or $9,200). Coming up with a refutable, robust, and reliable formula was a critical part of our advocacy at the Commission. Their current draft proposal appeared to be inadequate and, after an invitation from Commission staff to provide a better number, we were obligated to work a little harder and do a little more modeling (thank goodness for calculators and colleagues who can use them). We are gratified that PLA filed in support of the ALA proposal (also at the eleventh hour last night!). ALA is confident that the Commission hears our voice and is hopeful that the joint letter, our proposal and the independent support from PLA will help the Commission decide on a library formula that results in more libraries receiving more funding for Wi-Fi and internal wiring services.
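
As a rough illustration of how the proposal plays out, here is the arithmetic as I read it: $2.30 per square foot, with buildings under 4,000 square feet funded at the $9,200 floor. The function name and example sizes are mine, not part of the filing:

```python
RATE_PER_SQUARE_FOOT = 2.30   # dollars, from the ALA proposal described above
FLOOR_SQUARE_FEET = 4_000     # smaller buildings are funded as if they were this size

def wifi_budget(square_feet):
    """Wi-Fi / internal wiring budget under the proposed square-foot formula."""
    return RATE_PER_SQUARE_FOOT * max(square_feet, FLOOR_SQUARE_FEET)

print(round(wifi_budget(2_500), 2))    # 9200.0  -- a small branch gets the $9,200 floor
print(round(wifi_budget(15_000), 2))   # 34500.0 -- a larger building scales with its footprint
```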

But wait, what does all this mean…

Well, that’s it exactly. Wait. After what has seemed like non-stop meetings, phone calls, letter drafting, and question answering, the public comment period is at an end. The Commission is set to vote on the draft order on Friday (3 more days!). At its open meeting, stakeholders will find out whether the draft order goes forward, meaning that we’ll know:

  1. How the $2 billion down payment will be spent and how the application process and administration of the program will be streamlined;
  2. Which legacy services will be phased out and over what period of time; and
  3. What issues related to building broadband capacity “to” the library and school the Commission puts on the table to take up in what we hope will be the very near term.

In the meantime, I plan to sit on my hands and patiently wait for Friday.

The post A mad dash to the end (or beginning) as FCC readies for an E-rate vote appeared first on District Dispatch.

by Marijke Visser at July 08, 2014 09:05 PM