Feed aggregator

Open Knowledge Foundation: Call for applications for Data Journalism Philippines 2015

planet code4lib - Wed, 2015-05-27 07:26

Open Knowledge in partnership with the Philippine Center for Investigative Journalism is pleased to announce the launch of Data Journalism Ph 2015. Supported by the World Bank, the program will train journalists and citizen media in producing high-quality, data-driven stories.

In recent years, government and multilateral agencies in the Philippines have published large amounts of data such as the government’s recently launched Open Data platform. These were accompanied by other platforms that track the implementation and expenditure of flagship programs such as Bottom-Up-Budgeting via, Infrastructure via and reconstruction platforms including the Foreign Aid Transparency Hub. The training aims to encourage more journalists to use these and other online resources to produce compelling investigative stories.

Data Journalism Ph 2015 will train journalists on the tools and techniques required to gain and communicate insight from public data, including web scraping, database analysis and interactive visualization. The program will support journalists in using data to back their stories, which will be published by their media organization over a period of five months.

Participating teams will benefit from the following:

  • A 3-day data journalism training workshop by Open Knowledge and PCIJ in July 2015 in Manila
  • A series of online tutorials on a variety of topics from digital security to online mapping
  • Technical support in developing interactive visual content to accompany their published stories
Apply now!

Teams of up to three members working with the same print, TV, or online media agencies in the Philippines are invited to submit an application here.

Participants will be selected on the basis of the data story projects they pitch focused on key datasets including infrastructure, reconstruction, participatory budgeting, procurement and customs. Through Data Journalism Ph 2015 and its trainers, these projects will be developed into data stories to be published by the participants’ media organizations.

Join the launch

Open Knowledge and PCIJ will host a half-day public event for those interested in the program in July in Quezon City. If you would like to receive full details about the event, please sign up here.

To follow the programme as it progresses go to the Data Journalism 2015 Ph project website.

FOSS4Lib Recent Releases: Koha - 3.20.0

planet code4lib - Tue, 2015-05-26 20:47
Package: Koha
Release Date: Friday, May 22, 2015

Last updated May 26, 2015. Created by David Nind on May 26, 2015.

Koha 3.20 is the latest major release. It includes 5 new features, 114 enhancements and 407 bug fixes.

For more details see:

Koha's release cycle:

Nicole Engard: Bookmarks for May 26, 2015

planet code4lib - Tue, 2015-05-26 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Open Hub, the open source network Discover, Track and Compare Open Source
  • Arches: Heritage Inventory & Management System Arches is an innovative open source software system that incorporates international standards and is built to inventory and help manage all types of immovable cultural heritage. It brings together a growing worldwide community of heritage professionals and IT specialists. Arches is freely available to download, customize, and independently implement.

Digest powered by RSS Digest

The post Bookmarks for May 26, 2015 appeared first on What I Learned Today....

Related posts:

  1. Learn about Open Source from Me and Infopeople
  2. Online Presentations
  3. CIL2008: Open Source Solutions to Offer Superior Service

FOSS4Lib Recent Releases: DSpace - 5.2

planet code4lib - Tue, 2015-05-26 20:12
Package: DSpace
Release Date: Thursday, May 21, 2015

Last updated May 26, 2015. Created by David Nind on May 26, 2015.

DSpace 5.2 is now available. This is a bug-fix release and contains no new features.

See for details.

LITA: Learn to Teach Coding – Webinar Recording

planet code4lib - Tue, 2015-05-26 19:10

Tuesday May 26, 2015.

Today we had a lively, free half-hour webinar presentation by Kimberly Bryant and Lake Raymond from Black Girls CODE about their latest efforts and the exciting LITA preconference they will be giving at ALA Annual in San Francisco. Here’s the link to the recording from today’s session:

LITA Learn to Teach Coding Free information webinar recording, May 26, 2015

For more information check out the previous LITA Blog entry:

Did you attend the webinar, or view the recording?  Give us your feedback by taking the Evaluation Survey.

Learn to Teach Coding and Mentor Technology Newbies – in Your Library or Anywhere!

Then register for and attend the LITA preconference at ALA Annual. This opportunity follows up on the 2014 LITA President’s Program at ALA Annual, where then-LITA President Cindi Trainor Blyberg welcomed Kimberly Bryant, founder of Black Girls CODE.

The Black Girls CODE vision is to increase the number of women of color in the digital space by empowering girls of color ages 7 to 17 to become innovators in STEM fields, leaders in their communities, and builders of their own futures through exposure to computer science and technology.

DPLA: Presidents and Their Libraries

planet code4lib - Tue, 2015-05-26 18:30

“To bring together the records of the past and to house them in buildings where they will be preserved for the use of men and women in the future, a Nation must believe in three things.

It must believe in the past.

It must believe in the future.

It must, above all, believe in the capacity of its own people so to learn from the past that they can gain in judgement in creating their own future.”

– Franklin Roosevelt At the dedication of his library on June 30, 1941

Earlier this month it was announced that President Barack Obama’s Presidential Library will be built on the south side of Chicago. It will be our 14th Presidential Library.

The idea originated with FDR, who in his second term, “on the advice of noted historians and scholars, established a public repository to preserve the evidence of the Presidency for future generations.”

Then in 1955, Congress passed the Presidential Libraries Act, establishing a system of privately erected and federally maintained libraries.

Here’s a sampling  of images from the Digital Public Library of America related to our presidents and their libraries. Enjoy!

JFK Library and Museum in Boston. Courtesy of the University of Illinois at Urbana-Champaign University Library.

FDR laying the cornerstone of his presidential library. Courtesy of the Franklin D. Roosevelt Presidential Library and Museum via the Empire State Digital Network.

Herbert Hoover Presidential Library, West Branch, Iowa. Courtesy of the University of Illinois at Urbana-Champaign University Library.

Jimmy Carter Library and Museum. Courtesy of the Jimmy Carter Library via the Digital Library of Georgia

Presidential Room at The Eisenhower Presidential Library, Abilene, Kansas. Courtesy of the University of Illinois at Urbana-Champaign University Library.

Richard Nixon Library & Birthplace Site model, 1971. Photo by Julius Schulman. Courtesy of the J. Paul Getty Trust.

Inside the Harry S. Truman Presidential Library. Courtesy of the University of Illinois at Urbana-Champaign University Library.

Former President Gerald R. Ford and his Cabinet officers at the dedication of the Gerald R. Ford Library in Ann Arbor, Michigan, April 27-28, 1981. Courtesy of the Georgia State University Libraries Special Collections via the Digital Library of Georgia

Mourners pay their final respects to former US President Ronald Reagan as his body lay in repose inside a flag draped coffin at the Ronald Reagan Presidential Library. Courtesy of the National Archives and Records Administration.

Eric Lease Morgan: HathiTrust Research Center Workset Browser

planet code4lib - Tue, 2015-05-26 15:49

In my copious spare time I have hacked together a thing I’m calling the HathiTrust Research Center Workset Browser, a (fledgling) tool for doing “distant reading” against corpora from the HathiTrust. [1]

The idea is to: 1) create, refine, or identify a HathiTrust Research Center workset of interest (your corpus), 2) feed the workset’s rsync file to the Browser, 3) have the Browser download, index, and analyze the corpus, and 4) enable the reader to search, browse, and interact with the result of the analysis. With varying success, I have done this with a number of worksets on topics ranging from literature and philosophy to Rome and cookery. The best working examples are the ones from Thoreau and Austen. [2, 3] The others are still buggy.

As a further example, the Browser can/will create reports describing the corpus as a whole. This analysis includes the size of a corpus measured in pages as well as words, date ranges, word frequencies, and selected items of interest based on pre-set “themes”: usage of color words, names of “great” authors, and a set of timeless ideas. [4] This report is based on more fundamental reports such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]

The whole thing is written in a combination of shell and Python scripts. It should run on just about any out-of-the-box Linux or Macintosh computer. Take a look at the code. [9] No special libraries needed. (“Famous last words.”) In its current state, it is very Unix-y. Everything is done from the command line. Lots of plain text files and the exploitation of STDIN and STDOUT. Like a Renaissance cartoon, the Browser, in its current state, is only a sketch. Only later will a more full-bodied, Web-based interface be created.

The next steps are numerous and listed in no priority order: putting the whole thing on GitHub, outputting the reports in generic formats so other things can easily read them, improving the terminal-based search interface, implementing a Web-based search interface, writing advanced programs in R that chart and graph analysis, providing a means for comparing & contrasting two or more items from a corpus, indexing the corpus with a (real) indexer such as Solr, writing a “cookbook” describing how to use the browser to do “kewl” things, making the metadata of corpora available as Linked Data, etc.

‘Want to give it a try? For a limited period of time, go to the HathiTrust Research Center Portal, create (refine or identify) a collection of personal interest, use the Algorithms tool to export the collection’s rsync file, and send the file to me. I will feed the rsync file to the Browser, and then send you the URL pointing to the results. [10] Let’s see what happens.

Fun with public domain content, text mining, and the definition of librarianship.

  1. HTRC Workset Browser –
  2. Thoreau –
  3. Austen –
  4. Thoreau report –
  5. Thoreau dictionary (frequency list) –
  6. usage of color words in Thoreau —
  7. unique words in the corpus –
  8. Thoreau “catalog” —
  9. source code –
  10. HathiTrust Research Center –

David Rosenthal: Bad incentives in peer-reviewed science

planet code4lib - Tue, 2015-05-26 15:00
The inability of the peer-review process to detect fraud and error in scientific publications is getting some mainstream attention. Adam Marcus and Ivan Oransky, the founders of Retraction Watch, had an op-ed in the New York Times entitled What's Behind Big Science Frauds?, in which they neatly summed up the situation:
Economists like to say there are no bad people, just bad incentives. The incentives to publish today are corrupting the scientific literature and the media that covers it. Until those incentives change, we’ll all get fooled again.

Earlier this year I saw Tom Stoppard's play The Hard Problem at the Royal National Theatre, which deals with the same issue. The tragedy is driven by the characters being entranced by the prospect of publishing an attention-grabbing result. Below the fold, more on the problem of bad incentives in science.

Back in April, after a Wellcome Trust symposium on the reproducibility and reliability of biomedical science, Richard Horton, editor of The Lancet, wrote an editorial entitled What is medicine’s 5 sigma? that is well worth a read. His focus is also on incentives for scientists:
In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory. Or they retrofit hypotheses to fit their data.

and journal editors:
Our acquiescence to the impact factor fuels an unhealthy competition to win a place in a select few journals. Our love of "significance" pollutes the literature with many a statistical fairy-tale. We reject important confirmations.

and Universities:
in a perpetual struggle for money and talent, endpoints that foster reductive metrics, such as high-impact publication. National assessment procedures, such as the Research Excellence Framework, incentivise bad practices.

Horton points out that:
Part of the problem is that no-one is incentivised to be right. Instead, scientists are incentivised to be productive and innovative.

He concludes:
The good news is that science is beginning to take some of its worst failings very seriously. The bad news is that nobody is ready to take the first step to clean up the system.

Six years ago Marcia Angell, the long-time editor of a competitor to The Lancet, wrote in a review of three books pointing out the corrupt incentives that drug companies provide researchers and Universities:
It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines. I take no pleasure in this conclusion, which I reached slowly and reluctantly over my two decades as an editor of The New England Journal of Medicine.

In most fields, little has changed since then. Horton points to an exception:
Following several high-profile errors, the particle physics community now invests great effort into intensive checking and re-checking of data prior to publication. By filtering results through independent working groups, physicists are encouraged to criticise. Good criticism is rewarded. The goal is a reliable result, and the incentives for scientists are aligned around this goal.

Unfortunately, particle physics is an exception. The cost of finding the Higgs Boson was around $13.25B, but no-one stood to make a profit from it. A single particle physics paper can have over 5,000 authors. The resources needed for "intensive checking and re-checking of data prior to publication" are trivial by comparison. In other fields, the incentives for all actors are against devoting resources which would represent a significant part of the total for the research to such checking.

Fixing these problems of science is a collective action problem; it requires all actors to take actions that are against their immediate interests roughly simultaneously. So nothing happens, and the long-term result is, as Arthur Caplan (of the Division of Medical Ethics at NYU's Langone Medical Center) pointed out, a total loss of science's credibility:
The time for a serious, sustained international effort to halt publication pollution is now. Otherwise scientists and physicians will not have to argue about any issue—no one will believe them anyway.

(See also John Michael Greer.) I am not optimistic, based on the fact that the problem has been obvious for many years, and that this is but one aspect of society's inability to deal with long-term problems.

Mark E. Phillips: Metadata normalization as an indicator of quality?

planet code4lib - Tue, 2015-05-26 14:00

Metadata quality and assessment is a concept that has been around for decades in the library community.  Recently it has been getting more interest as new aggregations of metadata become available in open and freely reusable ways such as the Digital Public Library of America (DPLA) and Europeana.  Both of these groups make available their metadata so that others can remix and reuse the data in new ways.

I’ve had an interest in analyzing the metadata in the DPLA for a while and have spent some time working on the subject fields.  This post will continue along those lines by trying to figure out which metrics we can calculate with the DPLA dataset and use to define “quality”.  Ideally we will be able to turn these assessments and notions of quality into concrete recommendations for how to improve metadata records in the originating repositories.

This post will focus on normalization of subject strings, and how those normalizations might be useful as a way of assessing quality of a set of records.

One of the powerful features of OpenRefine is the ability to cluster a set of data and combine these clusters into a single entry.  Often this will significantly reduce the number of values that occur in a dataset in a quick and easy manner.

OpenRefine Cluster and Edit Screen Capture

OpenRefine has a number of different algorithms that can be used for this work that are documented in their Clustering in Depth documentation.  Depending on one’s data, one approach may perform better than others for this kind of clustering.


Case normalization is probably the easiest kind of normalization to understand.  If you have two strings, say “Mark” and “marK”, and you convert each of them to lowercase, you end up with a single value of “mark”. Many more complicated normalizations assume this as a start because it reduces the number of subjects without drastically transforming the original string values.

Case folding is another kind of transformation that is fairly common in the world of libraries.  This is the process of taking a string like “José” and converting it to “Jose” (outside of libraries this is more often called diacritic or ASCII folding, with “case folding” reserved for case-insensitive matching).  While this can introduce issues if a string is meant to have a diacritic and that diacritic makes the word or phrase different than the one without the diacritic, it can often help to normalize inconsistently notated versions of the same string.
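As a rough illustration of the two transformations described so far, here is a minimal Python sketch using only the standard library. The function names are my own for illustration, and the diacritic handling is a simple Unicode decomposition, which is an assumption and not necessarily what any particular library tool does.

```python
import unicodedata


def lowercase(value):
    """Case normalization: "Mark" and "marK" both become "mark"."""
    return value.lower()


def strip_diacritics(value):
    """Fold accented characters to their base letters ("José" -> "Jose").

    Decomposes each character (NFKD) and then drops the combining marks.
    Characters with no decomposition (for example "ø") are left unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(lowercase("marK"))
    print(strip_diacritics("José"))
```

Keeping the two steps as separate functions makes it easy to measure the effect of each one on its own, which is how the numbers later in this post are reported.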

Libraries have been normalizing data for a long time, and beyond case folding and lowercasing there have been efforts in the past to formalize algorithms for the normalization of subject strings for use in matching these strings.  Often referred to as the NACO normalization rules, they are formally the Authority File Comparison Rules.  I’ve always found this work to be intriguing and have a preference for the work and simplified algorithm that was developed at OCLC in their NACO Normalization Service.  In fact we’ve taken the sample Python implementation there and created a stand-alone repository and project called pynaco on GitHub for the code so that we could add tests and then work to port it to Python 3 in the near future.

Another common type of normalization that is performed on strings in library land is stemming. This is often done within search applications so that if you search for any one of the terms run, runs, or running you would get documents that contain each of these.

What I’ve been playing around with is whether we could use the reduction in unique terms for a field in a metadata repository as an indicator of quality.

Here is an example.

If we have the following sets of subjects:

Musical Instruments
Musical Instruments.
Musical instrument
Musical instruments
Musical instruments,
Musical instruments.

If you applied the simplified NACO normalization from pynaco you would end up with the following strings:

musical instruments
musical instruments
musical instrument
musical instruments
musical instruments
musical instruments

If you then applied the porter stemming algorithm to the new set of subjects you would end up with the following:

music instrument
music instrument
music instrument
music instrument
music instrument
music instrument

So in effect you have normalized the original set of six unique subjects down to one unique subject string with a NACO transformation followed by a normalization with the Porter Stemming algorithm.
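The idea of measuring a reduction in unique values can be sketched in a few lines of Python. Note that simple_normalize below is a stand-in I made up for illustration (lowercase, punctuation replaced by spaces, whitespace collapsed); it is not pynaco and it does no stemming, so on the six example subjects it only gets down to two unique strings, matching the intermediate step above.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_normalize(subject):
    """Stand-in normalizer (an assumption, NOT the NACO rules)."""
    cleaned = subject.lower().translate(_PUNCT_TO_SPACE)
    return " ".join(cleaned.split())  # collapse runs of whitespace


def unique_count(values):
    """Number of distinct strings in an iterable."""
    return len(set(values))


subjects = [
    "Musical Instruments",
    "Musical Instruments.",
    "Musical instrument",
    "Musical instruments",
    "Musical instruments,",
    "Musical instruments.",
]

before = unique_count(subjects)
after = unique_count(simple_normalize(s) for s in subjects)

if __name__ == "__main__":
    print(before, after, before - after)
```

A stemming step applied to the normalized strings would then merge the remaining singular and plural forms.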


In some past posts here, here, here, and here, I discussed some of the aspects of the subject fields present in the Digital Public Library of America dataset.  I dusted that dataset off and extracted all of the subjects from the dataset so that I could work with them by themselves.

I ended up with a set of text files that were 23,858,236 lines long that held the item identifier and a subject value for each subject of each item in the DPLA dataset. Here is a short snippet of what that looks like.

d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers
d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States
d8f192def7107b4975cf15e422dc7cf1 Vaults (Strong rooms)
4aea3f45d6533dc8405a4ef2ff23e324 Public works--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 City planning--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 Art, Municipal--Illinois--Chicago
63f068904de7d669ad34edb885925931 Building laws--New York (State)--New York
63f068904de7d669ad34edb885925931 Tenement houses--New York (State)--New York
1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais

Once I had the data in this format I could experiment with different normalizations to see what kind of effect they had on the dataset.

Total vs Unique

The first thing I did was to reduce the 23,858,236-line text file to only its unique values.  I did this with the tried and true method of using Unix sort and uniq.

sort subjects_all.txt | uniq > subjects_uniq.txt

After about eight minutes of waiting I ended up with a new text file subjects_uniq.txt that contains the unique subject strings in the dataset. There are a total of 1,871,882 unique subject strings in this file.
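For anyone who would rather stay in Python, the same de-duplication could be sketched roughly as follows. This assumes each line holds an identifier, some whitespace, and then the subject value, which is a guess at the file layout, and the second identifier in the sample is made up purely to show a repeated subject.

```python
def unique_subjects(lines):
    """Return the sorted, de-duplicated subject values from lines of the
    form "<identifier> <subject>" (layout assumed, see note above)."""
    subjects = set()
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first whitespace run
        if len(parts) == 2:  # skip blank lines and lines with no subject
            subjects.add(parts[1])
    return sorted(subjects)


sample = [
    "d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers\n",
    "d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States\n",
    # Hypothetical second item, included only to demonstrate a duplicate.
    "hypothetical-item-id Bank buildings--United States\n",
    "\n",
]

if __name__ == "__main__":
    for subject in unique_subjects(sample):
        print(subject)
```

Holding every unique subject in a set trades memory for simplicity; sort and uniq remain the lower-memory option for a file of this size.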

Case folding

Using a Python script to perform case folding on each of the unique subjects, I was able to see how much this reduces the number of unique subjects.

I started out with 1,871,882 unique subjects and after case folding ended up with 1,867,129 unique subjects.  That is a difference of 4,753 or a 0.25% reduction in the number of unique subjects.  So nothing huge.

Lowercase


The next normalization tested was lowercasing of the values.  I chose to do this on the set of subjects that were already case folded to take advantage of the previous reduction in the dataset.

By converting the subject strings to lowercase I reduced the number of unique case folded subjects from 1,867,129 to 1,849,682 which is a reduction of 22,200 or a 1.2% reduction from the original 1,871,882 unique subjects.

NACO Normalization

Next we look at the simple NACO normalization from pynaco.  I applied this to the unique lower cased subjects from the previous step.

With the NACO normalization,  I end up with 1,826,523 unique subject strings from the 1,849,682 that I started with from the lowercased subjects.  This is a difference of 45,359 or a 2.4% reduction from the original 1,871,882 unique subjects.

Porter stemming

Moving along, the next thing I looked at for this work was applying the Porter Stemming algorithm to the output of the NACO normalized subjects from the previous step.  I used the Porter implementation from the Natural Language Toolkit (NLTK) for Python.

With the Porter stemmer applied,  I ended up with 1,801,114 unique subject strings from the 1,826,523 that I started with from the NACO normalized subjects. This is a difference of 70,768 or a 3.8% reduction from the original 1,871,882 unique subjects.

Fingerprint


Finally I used a Python port of the fingerprint algorithm that OpenRefine uses for its clustering feature.  This will help to normalize strings like “phillips mark” and “mark phillips” into a single value of “mark phillips”.  I used the output of the previous Porter stemming step as the input for this normalization.
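For readers who have not seen it, a fingerprint key can be approximated in a few lines of Python. This is a simplified sketch of the general approach (trim, lowercase, drop punctuation, fold accents, then sort and de-duplicate the tokens); it is not the port used for the numbers in this post, and details such as the exact punctuation handling are assumptions.

```python
import string
import unicodedata


def fingerprint(value):
    """Approximate fingerprint key: order- and punctuation-insensitive."""
    value = value.strip().lower()
    # Drop ASCII punctuation (an assumption about which characters to remove).
    value = "".join(ch for ch in value if ch not in string.punctuation)
    # Fold accented characters to their base letters.
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(value.split())))


if __name__ == "__main__":
    for name in ("phillips mark", "mark phillips", "Phillips, Mark"):
        print(fingerprint(name))
```

Because the tokens are sorted, any word order collapses to the same key, which is the property that merges inverted and direct forms of a name.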

With the fingerprint algorithm applied, I ended up with 1,766,489 unique fingerprint normalized subject strings. This is a difference of 105,393 or a 5.6% reduction from the original 1,871,882 unique subjects.
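The cumulative reductions quoted in each step can be recomputed from the unique counts alone. The short Python sketch below does that arithmetic; the counts are the ones reported in this post, while the function and variable names are just for illustration.

```python
ORIGINAL_UNIQUE = 1_871_882

# Unique subject strings remaining after each successive normalization.
REMAINING = [
    ("Case Folded", 1_867_129),
    ("Lowercase", 1_849_682),
    ("NACO", 1_826_523),
    ("Porter", 1_801_114),
    ("Fingerprint", 1_766_489),
]


def reduction(original, remaining):
    """Return (strings removed, percent removed rounded to one decimal)."""
    removed = original - remaining
    return removed, round(100 * removed / original, 1)


if __name__ == "__main__":
    for name, count in REMAINING:
        removed, percent = reduction(ORIGINAL_UNIQUE, count)
        print(f"{name:<12} {removed:>8,} {count:>10,} {percent:>5}%")
```

Each percentage is measured against the original 1,871,882 unique subjects, not against the output of the previous step, so the figures are cumulative.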

Overview

Normalization    Reduction    Occurrences    Percent Reduction
Unique                   0      1,871,882                   0%
Case Folded          4,753      1,867,129                 0.3%
Lowercase           22,200      1,849,682                 1.2%
NACO                45,359      1,826,523                 2.4%
Porter              70,768      1,801,114                 3.8%
Fingerprint        105,393      1,766,489                 5.6%

Conclusion

I think that it might be interesting to apply this analysis to the various Hubs in the whole DPLA dataset to see if there is anything interesting to be seen across the various types of content providers.

I’m also curious if there are other kinds of normalizations that would be logical to apply to the subjects that I’m blanking on.  One that I would probably want to apply at some point is the normalization for LCSH that splits a subject into its parts if it has the double hyphen (--) in the string.  I wrote about the effect on the subjects for the DPLA dataset in a previous post.
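Splitting a subject string into its parts on the double hyphen is simple enough to sketch here. The function name is hypothetical, and trimming whitespace around each part is my own assumption about how the pieces should be cleaned up.

```python
def split_lcsh(subject):
    """Split an LCSH-style subject on "--" into its component parts."""
    return [part.strip() for part in subject.split("--") if part.strip()]


if __name__ == "__main__":
    print(split_lcsh("Public works--Illinois--Chicago"))
    print(split_lcsh("Hoggson Brothers"))
```

Subjects without a double hyphen come back as a single-element list, so the same function can be applied to every value in the dataset.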

As always feel free to contact me via Twitter if you have questions or comments.

Peter Murray: Advancing Patron Privacy on Vendor Systems with a Shared Understanding

planet code4lib - Tue, 2015-05-26 13:59

Last week I had the pleasure of presenting a short talk at the second virtual meeting of the NISO effort to reach a Consensus Framework to Support Patron Privacy in Digital Library and Information Systems. The slides from the presentation are below and on SlideShare, followed by a cleaned-up transcript of my remarks.

It looks like in the agenda that I’m batting in the clean-up role, and my message might be pithily summarized as “Can’t we all get along?” A core tenet of librarianship — perhaps dating back to the 13th and 14th century when this manuscript was illuminated — is to protect the activity trails of patrons from unwarranted and unnecessary disclosure.

This is embedded in the ethos of librarianship. As Todd pointed out in the introduction, the third principle of the American Library Association’s Code of Ethics states: “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.” Librarians have performed this duty across time and technology, and as both have progressed the profession has sought new ways to protect the privacy of patrons.

For instance, there was once a time when books had a pocket in the back that held a card showing who had checked out the book and when it was due. Upon checkout the card was taken out, had the patron’s name embossed or written on it, and was stored in a date-sorted file so that the library knew when it was due and who had it checked out. When the book was returned, the name was scratched through before putting the card in the pocket and the book on the shelf. Sometimes, as a process shortcut, the name was left “in the clear” on the card, and anyone that picked the book off the shelf could look on the card to see who had checked it out.

When libraries automated their circulation management with barcodes and database records, the card in the back of the book and the information it disclosed were no longer necessary. This was hailed as one of the advantages of moving to a computerized circulation system. While doing away with circulation cards eliminated one sort of privacy leakage (patrons being able to see what each other had checked out), it enabled another: systematic collection of patron activity in a searchable database. Many automation systems put in features that automatically removed the link between patron and item after it was checked in. Or, if that information was stored for a period of time, it was password protected so only approved staff could view the information. Some, however, did not, and this became a concern with the passage of the USA PATRIOT Act by the United States Congress.

We are now in an age where patron activity is scattered across web server log files, search histories, and usage analytics of dozens of systems, some of which are under the direct control of the library while others are in the hands of second and third party service providers. Librarians that are trying to do their due diligence in living up to the third principle of the Code of Ethics have a more difficult time accounting for all of the places where patron activity is collected. It has also become more difficult for patrons to make informed choices about what information is collected about their library activity and how it is used.

In the mid-2000s, libraries and content providers had a similar problem: the constant one-off negotiation of license terms was a burden to all parties involved. In order to gain new efficiencies in the process of acquiring and selling licensed content, representatives from the library and publisher communities came together under a NISO umbrella to reach a shared understanding of what the terms of an agreement would be and a registry of organizations that subscribed to those terms. Quoting from the foreword of the 2012 edition: “The Shared Electronic Resource Understanding (SERU) Recommended Practice offers a mechanism that can be used as an alternative to a license agreement. The SERU statement expresses commonly shared understandings of the content provider, the subscribing institution and authorized users; the nature of the content; use of materials and inappropriate uses; privacy and confidentiality; online performance and service provision; and archiving and perpetual access. Widespread adoption of the SERU model for many electronic resource transactions offers substantial benefits both to publishers and libraries by removing the overhead of bilateral license negotiation.”

One of SERU’s best qualities is its brevity, and that is likely a significant factor in its success. For instance, the “Confidentiality and Privacy” section states, in its entirety, these two sentences: “The acquiring institution and the provider respect the privacy of the users of the content and will not disclose or distribute personal information about the user to any third party without the user’s consent unless required to do so by law. The provider should develop and post its privacy policy on its website.” As the complexity of the online information landscape has increased, this two-sentence paragraph is no longer sufficient to describe an understanding between library and information provider. Here are some examples of this complexity.

One of the features of the HTTP protocol — the mechanism used by web browsers to get content from web servers — is for the browser to tell the server how it knew to ask for the web page or image file or JavaScript file on that server. This is called the “Referer” header. Does your library catalog include a link to add a book to an Amazon wishlist? Does your library catalog page load a book cover image from Syndetic Solutions? If so, the address of the catalog page is included in those HTTP transactions with Amazon and Syndetic Solutions as the “Referer” header. What is in that library catalog URL? Are the patron’s search terms in that link? Is there personally identifiable information?
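To make that concrete, here is a small Python sketch of what a third party could recover from a “Referer” value it receives. The catalog address and its parameter names are entirely hypothetical, invented only for this illustration; real catalog URLs will differ.

```python
from urllib.parse import parse_qs, urlsplit


def leaked_query_terms(referer):
    """Return the query-string parameters visible in a Referer URL."""
    return parse_qs(urlsplit(referer).query)


# Hypothetical catalog page URL, as a third-party server might see it
# in the Referer header of a request for a book cover image.
referer = "https://catalog.example.org/search?q=divorce+law&patron=12345"

if __name__ == "__main__":
    for name, values in leaked_query_terms(referer).items():
        print(name, values)
```

Nothing here requires special access: any server that receives the request can read the header and parse it the same way.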

Today’s web services are filled with social sharing widgets (Facebook, Twitter, and the like), web analytics tools (Google Analytics), and content from advertising syndicates. While these tools provide useful services to the patrons, libraries and service providers, they also become centralized points of data gathering that can aggregate a user’s activity across the web. Does your library catalog page include a Facebook “Like” button? Whether or not the patron clicks on that button, Facebook knows that user has browsed to that web page and can glean details of user behavior from that. Does your service use Google Analytics to understand user behavior and demographics? Google Analytics tracks user behavior across an estimated one half of the sites on the internet. Your user’s activity as a patron of your services is commingled with their activity as a general user.

A “filter bubble” is a phrase coined by Eli Pariser to describe a system that adapts its output based on what it knows about a user: location, past searches, click activity, and other signals. The system is using these signals to deliver what it deems to be more relevant information to the user. In order to do this, the system must gather, store and analyze this information from patrons. However, a patron may not want his or her past search history to affect their search results. Or, even worse, when activity is aggregated from a shared terminal, the results can be wildly skewed.

Simply using a library-subscribed service can transmit patron activity and intention to dozens of parties, all of it invisible to the user. To uphold that third principle in the ALA Code of Ethics, librarians need to examine the patron activity capturing practices of their information suppliers, and that can be as unwieldy as negotiating bilateral license agreements between each library and supplier. If we start from the premise that libraries, publishers and service providers want to serve the patron’s information needs while respecting their desire to do so privately, what is needed is a shared understanding of how patron activity is captured, used, and discarded. A new gathering of librarians and providers could accomplish for patron activity what they did for electronic licensing terms a decade ago. One could imagine discussions around these topics:

What Information is Collected From the Patron: When is personally identifiable information captured in the process of using the provider’s service? How is activity tagged to a particular patron, both before and after the patron identifies himself or herself? Are search histories stored? Is the patron activity encrypted, both in transit on the network and at rest on the server?

What Activity Can Be Gleaned by Other Parties: If a patron follows a link to another website, how much of the context of the patron’s activity is transferred to the new website? Are search terms included in the URL? Is personally identifiable information in the URL? Does the service provider employ social sharing tools or third party web analytics that can gather information about the patron’s activity? Such information could include the IP address (and therefore rough geolocation), the content of the web page, cross-site web cookies, and so forth.

How does patron activity influence service delivery: Is relevancy ranking altered based on the past activity of the user? Can the patron modify the search history to remove unwanted entries or segregate research activities from each other?

What is the disposition of patron activity data: Is patron activity data anonymized and commingled with that of others? How is that information used and to whom is it disclosed? How long does the system keep patron activity data? Under what conditions would a provider release information to third parties?

It is arguably the responsibility of libraries to protect patron activity data from unwarranted collection and distribution. Service providers, too, want clear guidance from libraries so they can efficiently expend their efforts to develop systems that librarians feel comfortable promoting. To have each library and service provider audit this activity for each bilateral relationship would be inefficient and cumbersome. By coming to a shared understanding of how patron activity data is collected, used, and disclosed, libraries and service providers can advance their educational roles and offer tools to patrons to manage the disclosure of their activity.


Terry Reese: MarcEdit 6 update

planet code4lib - Tue, 2015-05-26 06:26

I’ve been working hard on making a few changes to a couple of the MarcEdit internal components to improve the porting work.  To that end, I’ve posted an update that targets improvements to the Deduping and the Merging tools.


  • Update: Dedup tool — improves the handling of qualified data in the 020, 022, and 035.
  • Update: Merge Records Tool — improves the handling of qualified data in the 020, 022, and 035.
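The release notes do not say how MarcEdit handles qualified data, so the following is only a sketch of the general idea and not MarcEdit's implementation: before comparing ISBNs (020) or ISSNs (022), strip qualifiers such as "(pbk.)" so that qualified and bare forms match. The function name is hypothetical. The 035 field is deliberately left out, because there a parenthetical prefix such as "(OCoLC)" is part of the control number and would need different treatment.

```python
import re


def normalize_isbn_issn(value):
    """Reduce an 020/022 value to a bare identifier for comparison purposes."""
    # Drop parenthetical qualifiers such as "(pbk.)" or "(print)".
    without_qualifier = re.sub(r"\([^)]*\)", " ", value)
    # Assume the first whitespace-delimited token is the identifier itself.
    tokens = without_qualifier.split()
    token = tokens[0] if tokens else ""
    # Remove hyphens and trailing punctuation; upper-case a possible "x" check character.
    return token.replace("-", "").rstrip(":;,.").upper()


if __name__ == "__main__":
    for raw in ["0306406152 (pbk.)", "978-0-306-40615-7 (hardcover) :", "0378-5955 (print)"]:
        print(repr(raw), "->", normalize_isbn_issn(raw))
```

Two records would then be treated as candidate duplicates when their normalized values are equal, rather than when the raw subfield strings are equal.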

Downloads can be picked up using the automated update tool or by going to:


DuraSpace News: REGISTER for the Repository Fringe Festival Aug. 3-4, Edinburgh

planet code4lib - Tue, 2015-05-26 00:00

From Claire Knowles, Library Digital Development Manager, University of Edinburgh

Edinburgh, Scotland - We are pleased to announce that Repository Fringe returns this year on the 3rd and 4th of August 2015. The event will be held at the University of Edinburgh and coincides once again with preview week of the Edinburgh Festival Fringe.

District Dispatch: Ramping up negotiation skills to advance library agenda

planet code4lib - Mon, 2015-05-25 18:08

From the boardroom to City Hall, powerful negotiation skills make a big difference in advancing library goals. Power up your ability to persuade at the 2015 American Library Association (ALA) Annual Conference interactive session “The Policy Revolution! Negotiating to Advocacy Success!” 1:00 to 2:30 p.m. on Saturday, June 27, 2015. The session will be held at the Moscone Convention Center in room 2016 of the West building.

American Library Association Senior Policy Counsel Alan Fishel will bring nearly 30 years of legal practice and teaching effective and creative negotiation to the program. Bill & Melinda Gates Foundation Senior Program Officer Chris Jowaisas will share his experience advocating for and advancing U.S. and global library services. From securing new funding to negotiating licenses to establishing mutually beneficial partnerships, today’s librarians at all levels of service are brokering support for the programs, policies and services needed to meet diverse community demands. The session will jump off from a new national public policy agenda for U.S. libraries to deliver new tools you can use immediately at the local, state, national and international levels.

The Policy Revolution! initiative aims to advance national policy for libraries and our communities and campuses. The grant-funded effort focuses on establishing a proactive policy agenda, engaging national decision makers and influencers, and upgrading ALA policy capacity.

Speakers include Larra Clark, deputy director, ALA Office for Information Technology Policy; Alan G. Fishel, partner, Arent Fox; and Chris Jowaisas, senior program officer, Bill and Melinda Gates Foundation.

View all ALA Washington Office conference sessions

The post Ramping up negotiation skills to advance library agenda appeared first on District Dispatch.

Tim Ribaric: 1/2 of 1/2 done (Sabbatical Part 6)

planet code4lib - Mon, 2015-05-25 15:56


Some time ago I promised I'd keep this space up to date on how my return to grad school was doing. Turns out I'm pretty lazy with doing that.


Islandora: Islandora Ontology

planet code4lib - Mon, 2015-05-25 12:19

While working on the migration mappings for fcrepo3->fcrepo4 properties, I documented all known RELS-EXT and RELS-INT predicates in the Islandora 7.x-1.x code base. The predicates came from two namespaces: fedora and islandora.

The fedora namespace has a published ontology that we use -- relations-external -- and that can be referenced. However, the islandora namespace did not have any published ontologies associated with it.
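To make the two-namespace split concrete, here is a minimal sketch that groups the predicates found in a RELS-EXT datastream by namespace. The sample datastream, the object identifiers, and the function name are made up for illustration, and the namespace URIs and predicate names are given as I understand them from Islandora 7.x rather than taken from the published ontologies.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical RELS-EXT datastream for a page object that belongs to a book.
RELS_EXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
         xmlns:islandora="http://islandora.ca/ontology/relsext#">
  <rdf:Description rdf:about="info:fedora/example:page1">
    <fedora:isMemberOf rdf:resource="info:fedora/example:book1"/>
    <islandora:isPageOf rdf:resource="info:fedora/example:book1"/>
    <islandora:isSequenceNumber>1</islandora:isSequenceNumber>
  </rdf:Description>
</rdf:RDF>"""


def predicates_by_namespace(rdf_xml):
    """Map each namespace URI to the sorted local names of the predicates that use it."""
    root = ET.fromstring(rdf_xml)
    grouped = defaultdict(set)
    for description in root:
        for statement in description:
            # ElementTree writes qualified names as "{namespace}localname".
            namespace, _, local_name = statement.tag[1:].partition("}")
            grouped[namespace].add(local_name)
    return {namespace: sorted(names) for namespace, names in grouped.items()}


if __name__ == "__main__":
    for namespace, names in predicates_by_namespace(RELS_EXT).items():
        print(namespace, names)
```

Running something like this across a repository's RELS-EXT datastreams is one way to build the kind of predicate inventory described here.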

That said, I have worked over the last couple of weeks with some very helpful folks on drafting initial versions of the Islandora RELS-EXT and RELS-INT ontologies, and the Islandora Roadmap Committee voted that they should be published. The published version of the RELS-EXT ontology can be viewed here, and the published version of the RELS-INT ontology can be viewed here. In addition, the ontologies were drafted in rdfs, and include a handy rdf2html.xsl to quickly create a publishable html version. This is available on GitHub.

What does this all mean?

We have now documented what we have been doing for the last number of years, and we have a referenceable version of our ontologies. In addition, this is extremely helpful for referencing and documenting predicates that will be a part of an fcrepo3-fcrepo4 migration.

What's next?

The initial versions of each ontology have proposed rdfs comments, ranges, and skos *matches for a number of predicates. However, this is by no means complete, and I would love to see some community input/feedback on rdfs comments, ranges, additional skos *matches, or anything else that you think should be included in the RELS-EXT ontology.

How to provide feedback?

I'd like to have everything handled through 'issues' on the GitHub repo. If you're comfortable with forking and creating pull requests, by all means do so. If you're more comfortable with replying here, that works as well. All contributions are welcome! The key thing -- for me at least -- is to have community consensus around our understanding of these documented predicates :-)


I have not licensed the repository yet. I had planned on using the Apache 2.0 License as is done with PCDM, but I'd like your thoughts/opinions on proceeding before I make a LICENSE commit.


I hope I have covered it all. But, if you have any questions, don't hesitate to ask.

FOSS4Lib Recent Releases: VuFind - 2.4.1

planet code4lib - Mon, 2015-05-25 12:14
Package: VuFind
Release Date: Monday, May 25, 2015

Last updated May 25, 2015. Created by Demian Katz on May 25, 2015.

Bug fix release.

District Dispatch: Tick tock. Section 215 expiration draws closer.

planet code4lib - Sat, 2015-05-23 13:53


Senate adjourns with no clear path forward on Patriot Act 

It is almost a sure bet that certain NSA programs will expire at the end of the month. The next Senate vote is set for May 31st. We will be sure to provide updates as we hear them.


The post Tick tock. Section 215 expiration draws closer. appeared first on District Dispatch.

Patrick Hochstenbach: Homework assignment #5 Sketchbookskool

planet code4lib - Sat, 2015-05-23 12:45
Filed under: Doodles Tagged: sketchbook, sketchbookskool, urbansketching

Open Knowledge Foundation: Open Knowledge International does IODC2015!

planet code4lib - Sat, 2015-05-23 12:18

It’s that time of the year again. That time when the international open data community descends on an unsuspecting city for a jam-packed week of camps, meet-ups, hacks and conference events. Next week, open data enthusiasts will be taking over Ottawa and Open Knowledge will be there in full force! As we don’t want to miss an opportunity to meet with anyone, we have put together a list of events that we will be involved in and ways to get in touch. We have also started collecting this information in a spreadsheet!

The School of Data team is arriving early for the second annual School of Data Summer Camp. Every year we strive to bring the entire School of Data community together for three intense days to plan future activities, to learn from each other, to improve our skills and ways of working and to give new School of Data fellows the opportunity to meet their future collaborators! This year’s School of Data Summer Camp will take place at the HUB Ottawa. We’ll have a meet and greet on one of the evenings for School of Data family and friends – so watch this space for details, or follow @SchoolofData on Twitter.

On Tuesday, we are partnering with Open North, the Sunlight Foundation, Iniciativa Latinoamericana de Datos Abiertos (ILDA) and Aspiration Tech to put on the Open Data Con Unconference.

Wednesday is going to be a busy day as we will be spread out across three events – CKANCon, organised by the CKAN association, the Opening Parliaments Fringe Event and the Open Data Con Research Symposium, where we will be presenting new work on measuring and assessing open data initiatives and on “participatory data infrastructures”.

At the International Open Data Conference, Open Knowledge team members are co-organising or presenting at the following sessions:

As you can probably see, the week is going to be a busy one and we are aware that it will be difficult to schedule meetings with everyone! To accommodate, the Open Knowledge team and the entire School of Data family are organising informal drinks at The Brig Pub from 7:30 PM Thursday evening! We would love for you to come say hello in person or you can always find Pavel (Open Knowledge’s new CEO!!!!), Zara, Milena, Jonathan, Mor, Sander, Katelyn, School of Data & of course Open Knowledge on Twitter!

Safe travels and we will see you in Ottawa!

Peter Murray: Institution-wide ORCID Adoption Test in U.K. Shows Promise

planet code4lib - Fri, 2015-05-22 21:24

Via Gary Price’s announcement on InfoDocket comes word of a cost-benefit analysis for the wholesale adoption of ORCID identifiers by eight institutions in the U.K. The report, Institutional ORCID, Implementation and Cost Benefit Analysis Report [PDF], looks at the perspectives of stakeholders, a summary of findings from the pilot institutions, a preliminary cost-benefit analysis, and a 10-page checklist of consideration points for higher education institutions looking to adopt ORCID identifiers in their information systems. The most interesting bits of the executive summary came from the part discussing the findings from the pilot institutions.

Perhaps surprisingly, technical issues were not the major issue for most pilot institutions. A range of technical solutions to the storage of researchers’ ORCID iDs were utilised during the pilots. … Of the eight pilot institutions, only one chose to bulk create ORCID iDs for their researchers, the others opted for the ‘facilitate’ approach to ORCID registration.

Most pilot institutions found it relatively easy to persuade senior management about the institutional benefits of ORCID but many found it difficult to articulate the benefits to individual researchers. Several commented that staff saw it as ‘another level of bureaucracy’ and it was also noted that concurrent Open Access (OA), REF and ORCID activities can make the message confused, as they overlap. … Clear and effective messages (as short and precise as possible), creating a well-defined brand for ORCID and the targeting of specific audiences and audience segments were identified as being especially important.

One thing I found surprising in the report was the lack of any mention of the usefulness of ORCID identifiers in the linked data universe. The word “linked” appeared six times in the report; five of the six mentions talk about connections between campus systems and ORCID. It would seem that some of the biggest benefits of ORCID iDs will come when they appear as the object of a subject-predicate-object triple in data published and consumed by various systems on the open web. That is, part of the linked open data.
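To illustrate what "object of a triple" could look like, the sketch below builds a single N-Triples statement that links a work to its creator's ORCID URI. The article address and the function name are hypothetical, the Dublin Core `creator` predicate is one plausible choice rather than anything the report prescribes, and the iD shown is the sample iD that ORCID uses in its own documentation.

```python
import re

# Four groups of four characters; the last character may be the check character "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def creator_triple(work_uri, orcid_id):
    """Build one N-Triples statement with the creator's ORCID URI as the object."""
    if not ORCID_PATTERN.match(orcid_id):
        raise ValueError(f"not a well-formed ORCID iD: {orcid_id}")
    predicate = "http://purl.org/dc/terms/creator"
    return f"<{work_uri}> <{predicate}> <https://orcid.org/{orcid_id}> ."


if __name__ == "__main__":
    # Hypothetical repository address for an article.
    print(creator_triple("https://repository.example.org/article/1", "0000-0002-1825-0097"))
```

Because the object is a URI rather than a name string, any system that consumes the statement can match it against other data about the same researcher without guessing at name variants.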


