Blogs and feeds of interest to the Code4Lib community, aggregated.
July 04, 2009
This sounds too crazy to be true.
As part of a ploy to squeeze more money out of mobile phone companies, the American Society of Composers, Authors, and Publishers (ASCAP) has told a federal court that each time a phone rings in a public place, the phone user has violated copyright law. Therefore, ASCAP argues, phone carriers must pay additional royalties or face legal liability for contributing to what they claim is cell phone users' copyright infringement.
Emphasis added. More info at the Electronic Frontier Foundation.
by David (noreply@blogger.com) at July 04, 2009 02:10 PM
When, in the course of human events, it becomes necessary for one people to dissolve the political bonds which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the laws of nature and of nature’s God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are life, liberty and the pursuit of happiness. That to secure these rights, governments are instituted among men, deriving their just powers from the consent of the governed. That whenever any form of government becomes destructive to these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their safety and happiness. Prudence, indeed, will dictate that governments long established should not be changed for light and transient causes; and accordingly all experience hath shown that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same object evinces a design to reduce them under absolute despotism, it is their right, it is their duty, to throw off such government, and to provide new guards for their future security. –Such has been the patient sufferance of these colonies; and such is now the necessity which constrains them to alter their former systems of government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world.
He has refused his assent to laws, the most wholesome and necessary for the public good.
He has forbidden his governors to pass laws of immediate and pressing importance, unless suspended in their operation till his assent should be obtained; and when so suspended, he has utterly neglected to attend to them.
He has refused to pass other laws for the accommodation of large districts of people, unless those people would relinquish the right of representation in the legislature, a right inestimable to them and formidable to tyrants only.
He has called together legislative bodies at places unusual, uncomfortable, and distant from the depository of their public records, for the sole purpose of fatiguing them into compliance with his measures.
He has dissolved representative houses repeatedly, for opposing with manly firmness his invasions on the rights of the people.
He has refused for a long time, after such dissolutions, to cause others to be elected; whereby the legislative powers, incapable of annihilation, have returned to the people at large for their exercise; the state remaining in the meantime exposed to all the dangers of invasion from without, and convulsions within.
He has endeavored to prevent the population of these states; for that purpose obstructing the laws for naturalization of foreigners; refusing to pass others to encourage their migration hither, and raising the conditions of new appropriations of lands.
He has obstructed the administration of justice, by refusing his assent to laws for establishing judiciary powers.
He has made judges dependent on his will alone, for the tenure of their offices, and the amount and payment of their salaries.
He has erected a multitude of new offices, and sent hither swarms of officers to harass our people, and eat out their substance.
He has kept among us, in times of peace, standing armies without the consent of our legislature.
He has affected to render the military independent of and superior to civil power.
He has combined with others to subject us to a jurisdiction foreign to our constitution, and unacknowledged by our laws; giving his assent to their acts of pretended legislation:
For quartering large bodies of armed troops among us:
For protecting them, by mock trial, from punishment for any murders which they should commit on the inhabitants of these states:
For cutting off our trade with all parts of the world:
For imposing taxes on us without our consent:
For depriving us in many cases, of the benefits of trial by jury:
For transporting us beyond seas to be tried for pretended offenses:
For abolishing the free system of English laws in a neighboring province, establishing therein an arbitrary government, and enlarging its boundaries so as to render it at once an example and fit instrument for introducing the same absolute rule in these colonies:
For taking away our charters, abolishing our most valuable laws, and altering fundamentally the forms of our governments:
For suspending our own legislatures, and declaring themselves invested with power to legislate for us in all cases whatsoever.
He has abdicated government here, by declaring us out of his protection and waging war against us.
He has plundered our seas, ravaged our coasts, burned our towns, and destroyed the lives of our people.
He is at this time transporting large armies of foreign mercenaries to complete the works of death, desolation and tyranny, already begun with circumstances of cruelty and perfidy scarcely paralleled in the most barbarous ages, and totally unworthy the head of a civilized nation.
He has constrained our fellow citizens taken captive on the high seas to bear arms against their country, to become the executioners of their friends and brethren, or to fall themselves by their hands.
He has excited domestic insurrections amongst us, and has endeavored to bring on the inhabitants of our frontiers, the merciless Indian savages, whose known rule of warfare, is undistinguished destruction of all ages, sexes and conditions.
In every stage of these oppressions we have petitioned for redress in the most humble terms: our repeated petitions have been answered only by repeated injury. A prince, whose character is thus marked by every act which may define a tyrant, is unfit to be the ruler of a free people.
Nor have we been wanting in attention to our British brethren. We have warned them from time to time of attempts by their legislature to extend an unwarrantable jurisdiction over us. We have reminded them of the circumstances of our emigration and settlement here. We have appealed to their native justice and magnanimity, and we have conjured them by the ties of our common kindred to disavow these usurpations, which, would inevitably interrupt our connections and correspondence. They too have been deaf to the voice of justice and of consanguinity. We must, therefore, acquiesce in the necessity, which denounces our separation, and hold them, as we hold the rest of mankind, enemies in war, in peace friends.
We, therefore, the representatives of the United States of America, in General Congress, assembled, appealing to the Supreme Judge of the world for the rectitude of our intentions, do, in the name, and by the authority of the good people of these colonies, solemnly publish and declare, that these united colonies are, and of right ought to be free and independent states; that they are absolved from all allegiance to the British Crown, and that all political connection between them and the state of Great Britain, is and ought to be totally dissolved; and that as free and independent states, they have full power to levy war, conclude peace, contract alliances, establish commerce, and to do all other acts and things which independent states may of right do. And for the support of this declaration, with a firm reliance on the protection of Divine Providence, we mutually pledge to each other our lives, our fortunes and our sacred honor.
New Hampshire: Josiah Bartlett, William Whipple, Matthew Thornton
Massachusetts: John Hancock, Samuel Adams, John Adams, Robert Treat Paine, Elbridge Gerry
Rhode Island: Stephen Hopkins, William Ellery
Connecticut: Roger Sherman, Samuel Huntington, William Williams, Oliver Wolcott
New York: William Floyd, Philip Livingston, Francis Lewis, Lewis Morris
New Jersey: Richard Stockton, John Witherspoon, Francis Hopkinson, John Hart, Abraham Clark
Pennsylvania: Robert Morris, Benjamin Rush, Benjamin Franklin, John Morton, George Clymer, James Smith, George Taylor, James Wilson, George Ross
Delaware: Caesar Rodney, George Read, Thomas McKean
Maryland: Samuel Chase, William Paca, Thomas Stone, Charles Carroll of Carrollton
Virginia: George Wythe, Richard Henry Lee, Thomas Jefferson, Benjamin Harrison, Thomas Nelson, Jr., Francis Lightfoot Lee, Carter Braxton
North Carolina: William Hooper, Joseph Hewes, John Penn
South Carolina: Edward Rutledge, Thomas Heyward, Jr., Thomas Lynch, Jr., Arthur Middleton
Georgia: Button Gwinnett, Lyman Hall, George Walton
by ecorrado at July 04, 2009 11:36 AM
The first Swedish Rap recording was made by the great troubadour Evert Taube in 1960. It's called "Muren och böckerna", and here's a YouTube video for your listening pleasure:
I became aware of this recording from another song called "Evert berättar" by Peter Carlsson and the Blå Grodorna (Blue Frogs). My Swedish isn't that good, but one day a few months ago the song came up on my iPod Shuffle while I was running, and I suddenly realized that the song had something to do with burning books and the Great Wall of China. As soon as I got back home I started researching Evert Taube and Qin Shi Huangdi, the subject of the original song (whose title translates as "The Wall and the Books").
Shi Huangdi (pinyin: Shǐ Huáng Dì, Chinese: 始皇帝 ) means literally, "first emperor". Just as Julius Caesar's name became synonymous with Emperor continuing to the present in titles such as "Kaiser", "Czar" and "Shah", Huangdi was the term used for Chinese emperors for over two thousand years. Shi Huangdi was the king of the Qin state from 246 BCE to 221 BCE, when he became the first emperor of a unified China. Even the word "China" comes from his "Qin" state (pronounced “chin”), even though most Chinese people are really "Han" rather than "Qin".
Shi Huangdi's unification of China put an end to what historians call "the Warring States Period". Under his leadership, the Qin state defeated one rival state after the other. The Warring States period, though politically chaotic, saw a great deal of economic, cultural and technological growth. Iron replaced bronze, and both Confucianism and Taoism (the Hundred Schools of Thought) developed in this period. The Qin state, however, grew strong because of the adoption of a competing philosophy, called Legalism, which emphasized the rule of law in a totalitarian state. Like Caesar, Shi Huangdi extended his dominion by improving communications and implementing standards. He built roads and canals to link the different parts of China. He standardized the length of axles of carts, the units of weights and measures, and the coinage. His most important achievement, however, may have been the standardization of the Chinese script. For the first time, the machinery of local governments could communicate with functionaries of government throughout the realm. You might say that this was the first semantic web.
Shi Huangdi's innovations were not achieved by gentle persuasion or community consensus, but rather by imperial edict and brutal force. In order to stifle dissent (not to mention the outlawed non-official scripts), he ordered the destruction of all books other than a few in subjects he deemed to be useful: agriculture, medicine and alchemy, and in particular, he outlawed the works of the competing Hundred Schools of Thought. Those caught possessing any of the illegal works were to be conscripted and sent to work on the public works project now known as the Great Wall of China. In many classical accounts, Shi Huangdi ordered 460 scholars to be buried alive, then beheaded.
Although the Qin dynasty of the first Emperor failed to last more than a decade after his death, the non-political aspects of the unification of China through communication, trade, laws, administration and a standard script have lasted more than 2200 years.
Why would a Swedish troubadour be interested in Shi Huangdi? Why would he invent a form similar to modern HipHop to sing it in? Evert Taube seems to be most interested in Shi Huangdi's act of burning "all the books in China", so that "history could begin with him". Shi Huangdi exiled his mother because of some "court intrigue" and Taube thinks that burning the books was an act of destroying history, forgetting his unhappy past, and thinking only of what can be accomplished for the future. It often strikes me that today we're in another period of forgetting the past: because the internet dates back a relatively short time, modern students often behave as if anything that isn't on the internet doesn't exist, and never has existed. There are ongoing monumental efforts to digitize books and bring them back into view; small wars are being fought over how this will occur and all of the combatants claim the banner of preserving history for eternity. Eternal life was also one of Shi Huangdi's obsessions; the famed army of terra cotta warriors he had made was a product of this obsession.
I think that the musical form chosen by Evert Taube is not an anticipation of HipHop, but rather an evocation of the history that society is so eager to forget. Taube had been a sailor and an adventurer, and no doubt had been exposed to the traditional musical forms of both the Far East and of Africa. I think his intent was to evoke the forgotten primitive past with rhythms that speak across the ages.
We live in a time when the language and mechanisms of human interaction are undergoing great change. We are entering an era in which machines are learning to participate in our conversation. Efforts are under way to standardize and unify notations for the real world concepts and entities that underlie our communication. Success in these endeavors may result in the creation of great wealth and power, and new projections of existing wealth and power. It is possible that we are living in a Warring States/Hundred Schools of Thought period, and standardization of our notations will lend itself to a totalitarian communications regime with global extent such as Shi Huangdi's or Julius Caesar's. Another possibility is that our intercourse will become governed by something like a theocracy, in which texts are governed by a priesthood and preserved by monks. Or perhaps information and its underpinnings will devolve to a dictatorship of the proletariat.
On this 233rd anniversary of the Declaration of Independence, I'd like to suggest that a democratically derived and governed semantic machinery for the internet should also be possible. Humans who interact in large groups, such as they are doing in places like Facebook, Twitter and the like, naturally develop languages and syntax on their own, and machines should bow to our will if they are to participate helpfully in our conversations. We need not only a common language and script to be able to communicate with each other, we need liberty to say what we want to say.
Happy Fourth of July!
by Eric Hellman (noreply@blogger.com) at July 04, 2009 01:09 AM
July 03, 2009
I’m new to the term “data federation.” How about you?
Michael Bergman, federated search luminary, just wrote on the subject, preferring the term “data mixing.” He explains the concept:
What is Data Mixing and Why is it So Hard?
As a new term there is no “official” definition of data mixing. However, I think we can consider it as generally equivalent to the older data federation concept.
Data federation is the bringing together of data from heterogeneous and often physically distributed data sources into a single, coherent view. Sometimes this is the result of searching across multiple sources, in which case it is called federated search. But it is not limited to search. Data federation is a key concept in business intelligence and data warehousing and a driver behind master data management (MDM).
Bergman explains that data federation was a hot research topic in the 1980s. Computers of different hardware, operating systems, databases, and other software were proliferating. Today’s robust and ubiquitous networking protocols were far from mature then. There were no dominant standards for data representation in the 80’s. Today we take interoperability for granted; if two systems don’t speak to one another directly we expect that someone has already developed software to bridge the gap. The whole Internet speaks TCP/IP. XML is everywhere.
So, we can say that it took the solving of some major data federation problems to lay the foundation for the Internet and the Web that we enjoy today.
Bergman further explains that the next major challenge is in semantics:
The Internet and its TCP/IP and Web HTTP protocols and XML standards in particular, have been major contributors to overcoming respective physical and syntactical and data exchange heterogeneities. The current challenge is to resolve differences in meaning, or semantics, between disparate data sources. Your “glad” may be someone else’s “happy” and you may organize the world into countries while others organize by regions or cultures.
I recommend Bergman’s article, especially if you have an interest in the Semantic Web. It’s moderately technical but it’s worth the read to understand where data federation fits into the Semantic Web.
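To make the "single, coherent view" idea concrete, here is a minimal Python sketch of the semantic step Bergman describes. Everything in it is an assumption made for illustration: the two sources, their field names, and the mapping tables are invented, and a real federation layer would need far richer vocabularies than a pair of dictionaries.

```python
# Hypothetical records from two independent sources with different vocabularies.
SOURCE_A = [{"name": "Ann", "mood": "glad", "country": "Sweden"}]
SOURCE_B = [{"person": "Bo", "feeling": "happy", "region": "Scandinavia"}]

# Assumed mappings that translate field names and values to a shared vocabulary.
FIELD_MAP = {"person": "name", "feeling": "mood"}
VALUE_MAP = {"mood": {"glad": "happy"}}


def normalize(record):
    """Rename fields and translate values into the shared vocabulary."""
    out = {}
    for field, value in record.items():
        field = FIELD_MAP.get(field, field)
        value = VALUE_MAP.get(field, {}).get(value, value)
        out[field] = value
    return out


def federate(*sources):
    """Return one combined list of normalized records from all sources."""
    return [normalize(rec) for source in sources for rec in source]


merged = federate(SOURCE_A, SOURCE_B)
```

Note that the sketch reconciles "glad" with "happy" but leaves "country" and "region" side by side. That unresolved difference is the kind of semantic mismatch the quoted passage calls the current challenge.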
by Sol at July 03, 2009 11:11 PM
Earlier in the week I wrote again about "the flow" -- that is, sources of information and content that are mostly about getting your atte...
July 03, 2009 08:45 PM
I was recently going through Innovative’s web site and, out of curiosity, clicked on the small “Legal Notices” link at the bottom of their front page. It instructed me to “PLEASE READ THESE ‘TERMS OF USE’ CAREFULLY BEFORE USING THIS WEB SITE”. It occurred to me that, if they truly wanted all visitors to read the legal notices before using the site, they should probably either feature the link more prominently on the front page or force a redirect to make sure everyone has a chance to read it beforehand. Most of us aren’t accustomed to reading EULAs for websites.
What got my attention was their “Links Policy”. Apparently, you are not allowed to link to III’s site unless you follow specific rules, including:
- “(i) any link to the Web site must be a link clearly marked ‘Innovative Interfaces’ OR ‘iii.com’”;
- “(iii) the link must “point” to the URL (www.iii.com) and not to other pages within the Web site”;
- “(vi) Innovative Interfaces, Inc. reserves the right to revoke its consent to the link at any time and in its sole discretion.”
That means that if you want to point someone to a specific III product, such as Millennium or Encore, you are, according to III, not allowed to provide them with direct links. Evidently, Google doesn’t respect their policy either (of course, it might help if III provided a robots.txt file to help support their links policy).
It’s got a bit of a “Fight Club” ring to it: “The first rule of the Millennium web page is you don’t link to the Millennium web page. The second rule of the Millennium web page is you don’t link to the Millennium web page.”
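For the curious, here is a rough sketch of how a well-behaved crawler would read such a robots.txt file, using Python's standard `urllib.robotparser`. The file contents, the `/products/` path, the `ExampleBot` name and the example.com host are all made up for illustration; nothing here reflects what III actually publishes. A robots.txt file only advises crawlers, and it does nothing about links that people write by hand.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to stay out of /products/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /products/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks each URL before fetching it.
front_page_ok = parser.can_fetch("ExampleBot", "http://www.example.com/")
deep_link_ok = parser.can_fetch(
    "ExampleBot", "http://www.example.com/products/catalog.html"
)
```

With this file, the front page is allowed and the deep link is refused, which is roughly the crawler-side equivalent of the links policy.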

by Warren at July 03, 2009 01:31 PM
It was not unexpected, and happened last week, but since I mentioned the controversy enough on this blog, I figured I’d let everyone know that “OCLC has formally withdrawn the proposed policy [and a] new group will soon be assembled to begin work to draft a new policy with more input and participation from the OCLC membership.” This is good news but those interested in fair, open exchange of data, need to be vigilant about what the new policy will contain, especially with the announcement of WorldCat Local about a month or so ago. Talking about WorldCat Local, I highly recommend listening to the Library 2.0 Gang Podcast on “Library System Suppliers view of OCLC Web-scale.”
by ecorrado at July 03, 2009 01:10 PM
July 02, 2009
By: dempsey
Categories: Analytics and measurement • OCLC
I thought I would post some numbers here which were prepared by my colleague Brian Lavoie for another purpose. The question was: how many of the books in US libraries are in English?
First of all, what is a book? Deciding what a book is involves some choices (are theses in or out, for example?). This analysis uses the definition of 'print books' given in the Google 5 analysis published in D-Lib Magazine a while back [1].
a. All of WorldCat (Apr 09):
135.3 million records
Cataloged as "eng": 46 percent (so 54 percent non-English)
b. Print books only (Apr 09):
91.2 million records
Cataloged as "eng": 40 percent (so 60 percent non-English)
c. Print books in US libraries (Jan 09):
42.5 million records
Cataloged as "eng": 57 percent (so 43 percent non-English)
d. Print books representing combined collections of three academic research libraries participating in GBS (Apr 09):
7.2 million records
Cataloged as "eng": 54 percent (so 46 percent non-English)
Note - c is calculated on a slightly earlier version of the database as we had already pulled out US library holdings. The data in d is being looked at for another purpose: hence the slightly arbitrary selection of 3 libraries.
Note - these numbers are for records in the database, which represent 'manifestations' in FRBR terms. If one were to count holdings or actual copies the numbers would be different. The proportion of 'eng' would go up as English titles will be more widely held and in greater numbers of copies.
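For readers who prefer record counts to percentages, the figures in a through d can be converted with a few lines of Python. This is only a back-of-the-envelope sketch: the percentages are rounded, so the resulting counts are approximate, and the function name is invented for illustration.

```python
# Record counts (in millions) and the share cataloged as "eng", as given above.
SEGMENTS = {
    "a. All of WorldCat": (135.3, 0.46),
    "b. Print books": (91.2, 0.40),
    "c. Print books in US libraries": (42.5, 0.57),
    "d. Three GBS research libraries": (7.2, 0.54),
}


def english_split(total_millions, eng_share):
    """Return (English, non-English) record counts in millions."""
    eng = total_millions * eng_share
    return eng, total_millions - eng


for label, (total, share) in SEGMENTS.items():
    eng, other = english_split(total, share)
    print(f"{label}: about {eng:.1f}M eng, {other:.1f}M non-eng")
```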
[1] Here is how the definition of a 'print book' was decided upon and operationalised for the Google 5 analysis. "Although there is no unambiguous bibliographic definition of a book, libraries have often used monographic language materials as a proxy for books, and this practice is adopted for this study. More specifically, in the context of a MARC21 record, a book is defined as a language-based monograph, identified by the codes "a" and "m" in bytes 6 and 7 of the leader, respectively. For the purposes of this study, theses/dissertations and government documents are excluded from the analysis, since these materials are usually acquired and managed as separate segments of the library collection. Records describing books in print format were identified by eliminating all non-print formats, such as digital, microform, Braille, and so on."
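The leader test in that definition is simple to express in code. The Python sketch below checks only the two leader positions named in the quote; the sample leaders are made up, and the further exclusions (theses, government documents, non-print formats) depend on other parts of the record and are not attempted here.

```python
def is_language_monograph(leader: str) -> bool:
    """True when leader position 06 is 'a' and position 07 is 'm'."""
    if len(leader) != 24:
        raise ValueError("a MARC21 leader is exactly 24 characters long")
    return leader[6] == "a" and leader[7] == "m"


# Made-up leaders for illustration: a monograph and a serial.
book_leader = "00714cam a2200205 a 4500"
serial_leader = "00714cas a2200205 a 4500"

book_result = is_language_monograph(book_leader)
serial_result = is_language_monograph(serial_leader)
```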
July 02, 2009 10:46 PM
from OTTO - Controllerism Instrument at djtechtools.com:
Controllerism continues to take small leaps forward as the software and techniques improve but the giant steps are going to happen in the realm of performance interfaces. Without a solid controller surface that has been designed to play like an instrument we wont be able to leave the realm [...]
by Rob Styles at July 02, 2009 08:31 PM
For Ian Davis’ birthday, Danny Ayers sent out an email asking people to make some previously unavailable datasets accessible as linked data as Ian’s present. It was a pretty neat idea. One that I wish I had thought of.
Given that Ian is my boss (prior to about a month ago, Ian was just nebulously “above me” somewhere in the Talis hierarchy, but I now report to him directly) one could cynically make the claim that by providing Ian a ‘linked data gift’ that I would just be currying favor by being a kiss-ass. You could make that claim, sure, but evidently you are not aware of how I hurt the company.
Anyway, as my contribution, I decided to take the data dumps from LibraryThing that Tim Spalding pretty graciously makes available [whoa, in the time that I first started this post until now, the data has gone AWOL, I suppose I did this just in time]. The data isn’t always very current and not all of the files are terribly useful (the tags one, for example, doesn’t offer much since the tags aren’t associated with anything — it’s just words and their counts), but it’s data and between ThingISBN and the WikipediaCitations I thought it would be worth it.
I wanted to take a very pragmatic approach to this: no triple store, no search, no rdf libraries, minimal interface. Mostly this was inspired by Ed Summers’ work with the Library of Congress Authorities, but, also, if Tim (or, whoever at LibraryThing) saw that making LibraryThing linked data was as easy as a few template tweaks (as opposed to a major change in their development stack) this exercise was much more likely to actually make its way into LibraryThing.
What I ended up with (the first pass released before the end of Ian’s birthday, I might add) was LODThing: a very simple application written in Ruby’s Sinatra framework, DataMapper and SQLite. The entire application is less than 230 lines of Ruby (including the web app and data loader) plus 2 HAML templates and 2 builder templates for the HTML/RDFa and RDF/XML, respectively. The SQL database has three tables, including the join table. This is really simple stuff. The only real reason it took a couple days to create was trying to get the data loaded into SQLite from these huge XML files. Nokogiri is fast (well, Ruby fast), but a 200 MB XML file is pretty big. It was nice to get acquainted with Nokogiri’s pull parser, though.
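To give a flavour of the streaming approach, here is a rough Python analogue using the standard library's `iterparse` and `sqlite3`. It is not the LODThing loader, which is Ruby with Nokogiri and DataMapper. The element names (`feed`, `work`, `workcode`, `isbn`), the table layout and the sample data are assumptions made for this sketch.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for a large dump; the element names are assumed.
SAMPLE = b"""<feed>
  <work workcode="1"><isbn>0441172717</isbn><isbn>0441013597</isbn></work>
  <work workcode="2"><isbn>0618260307</isbn></work>
</feed>"""


def load_works(xml_stream, db):
    """Stream <work> elements into SQLite, discarding each one once stored."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS isbns (isbn TEXT PRIMARY KEY, work_id INTEGER)"
    )
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "work":
            work_id = int(elem.get("workcode"))
            rows = [(isbn.text, work_id) for isbn in elem.iter("isbn")]
            db.executemany("INSERT OR IGNORE INTO isbns VALUES (?, ?)", rows)
            elem.clear()  # drop the element's children and text after storing
    db.commit()


db = sqlite3.connect(":memory:")
load_works(io.BytesIO(SAMPLE), db)
count = db.execute("SELECT COUNT(*) FROM isbns").fetchone()[0]
```

Clearing each element after it is stored keeps memory use roughly flat, which is the main reason to reach for a pull or streaming parser on a file of this size.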
There are a few things to take away from this exercise.
- When data is freely available, it’s really quite simple to reconstitute it into linked data without any need to depart from your traditional technology stack. There is nothing even remotely semantic-webby about LODThing except its output.
- We now have an interesting set of URIs and relationships to start to express and model FRBR relationships.
- The Wikipedia citations data is extremely useful and could certainly be fleshed out more. One could imagine querying DBpedia or Freebase on these concepts and identifying if the Wikipedia article is actually referring to the work itself and use that. Right now LODThing makes no claims about the relationships except that it’s a reference from Wikipedia.
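The first point can be illustrated with a minimal sketch: RDF/XML produced by plain string templating, with no RDF library involved. This is Python rather than the Ruby builder templates LODThing uses, and the base URL, the URI patterns and the choice of `dcterms:isVersionOf` to tie an ISBN to its work are all assumptions for illustration, not a description of LODThing's actual output.

```python
from xml.sax.saxutils import escape, quoteattr

# A plain text template; the vocabulary choices here are illustrative only.
RDF_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about={about}>
    <dcterms:identifier>{isbn}</dcterms:identifier>
    <dcterms:isVersionOf rdf:resource={work}/>
  </rdf:Description>
</rdf:RDF>
"""

BASE = "http://example.org"  # placeholder host, not the real service


def isbn_as_rdfxml(isbn, work_id):
    """Fill the template for one ISBN and the work it belongs to."""
    return RDF_TEMPLATE.format(
        about=quoteattr(f"{BASE}/isbns/{isbn}#book"),
        isbn=escape(isbn),
        work=quoteattr(f"{BASE}/works/{work_id}#work"),
    )


doc = isbn_as_rdfxml("0441172717", 1)
```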
LODThing isn’t really intended for human consumption, so there’s no real “default way in”. The easiest way to use it is to make a URI from an ISBN:
If you know the LibraryThing ‘work ID’, you can get in that way, too:
Also, you can get all of these resources as RDF/XML by replacing the .html with .rdf.
So, Tim, you wrote on the LT API page that you would love to see what people are doing with your data: here you go. It would be even more awesome if it made its way back into LT; after all, it would alleviate some of the need for you to have a special API for this stuff.
Also, special thanks to Toby Inkster for providing a ton of help in getting this to resemble something that a linked data aware agent would actually want and finally turning the httpRange-14 light bulb on over my head. He also immediately linked to it from his Amazon API LODifiier, which is sort of cool, too.
I’ll be happy to throw the sources into a github repository if anybody’s interested in them.
by Ross at July 02, 2009 04:39 PM
Bookworm is now available in Spanish!

I’m thrilled to finally have this up as Spanish was one of the languages I was most interested in adding.
by liza at July 02, 2009 03:16 PM
News from OLAC.
CAPC's Moving Image Work-Level Records Task Force has completed a draft of its report and recommendations for operational definitions for a sample of five attributes of or roles needed for moving image work/primary expression records.
We started out with the intention to simply write definitions for each term. However, while thinking about these pieces of information in the context of a shared, online database, we decided that it would be useful to investigate at least some types of "data about data" and to consider how we might be able to accommodate different types of data (e.g., both identifiers and textual strings) and deal with different levels of data reliability. We have tried to explain our reasoning and process in the introductory section. We do not believe that this draft has reached its final form yet, but we do think that we have come to a point where it would be useful to get feedback from a larger group on the perceived viability of our general approach. To evaluate the document, you may find it helpful to attempt to create a few sample records using these guidelines.
This section will also include an annotated list of potential sources for work-level information. The secondary sources section is not quite complete, but we hope to issue a draft in the near future.
The draft report is available on the OLAC web site as Part 3a. We will take comments and suggestions on the draft through Friday, July 31.
Comments are sought.
by David (noreply@blogger.com) at July 02, 2009 04:04 PM
The Centre for Digital Library Research at the University of Strathclyde is currently investigating the links between university library catalogues and digital repositories as part of a JISC funded study.
Can library users find resources in their university’s digital repository through the library catalogue? Do library catalogues and repositories share records for the same items?
If catalogues and repositories generally aren’t linked in these sorts of ways at the moment, could they be?
These are some of the issues being explored by the study. The team are surveying repository managers and others about now - so, if you receive a request to respond to their online survey, your response would be much appreciated.
For further information about the study, please visit the project’s Web site and the project’s Web page on the JISC site.
by Ben Wynne at July 02, 2009 01:27 PM
Research funded by the JISC, RIN and others over recent years has helped to increase understanding of how students and researchers use electronic information resources. Analysis of Web logs - such as the work done for the e-Books Observatory Study by CIBER at UCL - has proved a fruitful line of inquiry.
A new study - which has now been underway for a few months (so apologies for this late post) - seeks to add to this evidence through detailed observation of how individual students and researchers in Business and Economics use a number of information resources in their area (such as Business Source Premier).
The aim is to observe how individuals react to and use particular interfaces and then to explore those behaviours through structured interviews.
The work is being conducted by Middlesex University and is being complemented by an analysis of Web logs for a selection of Business and Economics e-books and e-journals by CIBER.
A report of the findings is expected during the autumn.
For further information, please visit the project Web page on the JISC Web site.
by Ben Wynne at July 02, 2009 01:08 PM
We've added a new feature to LibraryThing for Libraries, suggested by Lare over at the Seattle Public Library. He was looking for a way to show off just some of their reviews: the reviews for their summer reading program.
Libraries can now add "categories" for their reviewers to check off—library book club books, Big Read books, reviews by library staff, etc. And the library can show off just one category of reviews in their LTFL blog widget.
Seattle has made blog-widget pages for their kids section, teen section, and even their adult section of the site. By categorizing the reviews into age-related groups, they can feature items in their catalog that would interest the patrons for each demographic.
We'll be releasing some more cool features at the American Library Association meeting in Chicago next week.
by Sonya (noreply@blogger.com) at July 02, 2009 12:46 PM
The UK Access Management Federation and other similar initiatives worldwide provide a SAML-based single sign-on solution for access to online resources for the education and research community. Typically, a user must sign-on to their home institution, using their local username and password, before being granted access to a remote online resource. In the main, this prevents the user from having to remember a separate username and password for each online resource that they wish to access. However, there is a perceived problem that some users have several affiliations (their university, their employer, the NHS, their professional body, etc.), each of which may grant access to a different set of online resources, and that, currently, online services are not able to make seamless decisions about which resources a given user is entitled to access because they lack knowledge about these multiple affiliations.
We have recently funded Simon McLeish at LSE to undertake an investigation into this area, commonly known as the "Scott Cantor is a member of the IEEE" problem. (Scott Cantor is lead developer of the Shibboleth software and an editor of the SAML 2.0 specification.) This investigation will try to discover the extent of this problem in UK HE - who is affected, how serious stakeholders perceive it to be, and what is expected from a solution - in order to inform future work in this area.
More information about this study can be found through the project's Wiki. As usual, the final report will be made openly available to the community under a Creative Commons licence.
by Andy Powell at July 02, 2009 10:07 AM
The Office for Information Technology Policy (OITP) of the American Library Association (ALA) has released Fiber to the Library: How Public Libraries Can Benefit from Using Fiber Optics for their Broadband Internet Connections. It "articulates the benefits of fiber optic technology for public libraries and strategies to obtain such fiber connectivity. An important goal of this policy brief is to help applicants include “fiber to the library” in their federal broadband stimulus funding proposals under the American Recovery and Reinvestment Act (ARRA)."
My local library, Helen Hall, is receiving Energy Efficiency and Conservation Block Grant funds to get a new cooling system.
by David (noreply@blogger.com) at July 02, 2009 10:02 AM
As part of our quest to optimize the speed of our search front end, I recently tried out Yahoo’s js and css minifier, the YUI Compressor.
At first glance, the nice things about the YUI Compressor are that it is Java based (we are a Java-friendly team), open source, and fairly easy to work with. The YUI Compressor handles both javascript and css, but in this post I have chosen to focus on the js part.
Integrating it into my IDE (IntelliJ IDEA) and the project for the test was quite easy because somebody has taken the time to write YUIAnt. I just downloaded the YUI Compressor version 2.4.2 and the YUIAnt.jar, added them to the project, and modified my build scripts to run the compressor when the website is deployed to the web server. The beauty of this is that you don’t have to look at the minified javascript when editing, and if for some reason you want to debug the code at run time, you can easily set up a debug option in your build script that bypasses the compressor for on-the-fly debugging. If you aren’t into all this build script stuff or have a simple project, there are lots of online YUI Compressor sites out there where you can paste your js code or css and get a compressed version in return.
The version 2.4.2 of the YUI Compressor nearly worked without problems. For some reason – I didn’t bother to investigate further – the YUI Compressor had some issues with unterminated Strings in the jscalendar-1.0 library. I just excluded the directory and went on with my small non scientific test using Firebug as my test environment.
The first screen shot shows the size and load times for our js files. Business as usual – the YUI Compressor is disabled.

The next screen shot shows the size and load times for the same js files now with the compression enabled.

The file sizes have been reduced and the overall load time has shrunk by approximately half a second. When the files are very small, the load times are very sensitive to queuing effects, but the file size is in most cases reduced. For bigger js files the improvement in speed as well as size is clear. I have tried to compensate for caching effects in both cases (compressed/not compressed). It seems that there is about a 20-25% reduction in file size and approximately the same reduction in load time for the js. These numbers are without using the obfuscation option (reduction of variable names to the shortest possible length and other tricks), simply because I don’t think we will be comfortable with it, knowing that it might cause errors.
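For anyone who wants to repeat this kind of comparison from the command line rather than through YUIAnt, here is a rough shell sketch. It assumes the YUI Compressor jar sits in the current directory, or that a YUI_JAR variable points at it. The function names minify_js, percent_saved and report are made up for illustration and are not part of the compressor. The --nomunge option asks the compressor to minify without obfuscating variable names, which matches the setup tested above.

```shell
#!/bin/sh
# Hypothetical helper: whole-number percentage saved when a file
# shrinks from $1 bytes to $2 bytes (integer arithmetic, rounds down).
percent_saved() {
    if [ "$1" -le 0 ]; then
        echo 0          # avoid dividing by zero for empty files
        return
    fi
    echo $(( ($1 - $2) * 100 / $1 ))
}

# Hypothetical wrapper: minify one js file ($1) into $2.
# YUI_JAR is an assumed location for the jar; adjust as needed.
# --nomunge: minify only, do not shorten variable names.
minify_js() {
    java -jar "${YUI_JAR:-yuicompressor-2.4.2.jar}" --type js --nomunge -o "$2" "$1"
}

# Print the size of the original ($1) and minified ($2) file.
report() {
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    echo "$1: $before -> $after bytes ($(percent_saved "$before" "$after")% saved)"
}
```

Usage might look like `minify_js search.js search.min.js && report search.js search.min.js`, where the file names are placeholders. This only measures file size; load times still need a tool like Firebug.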
As I am new to this I am interested to hear about any major drawbacks compressing/minifying may have.
This is of course a small step and not something which alone makes the difference between a slow and a fast site but I am hoping that attention to a number of different optimization issues will make a big difference in the long run.

by Jørn Thøgersen at July 02, 2009 06:12 AM
My fur was raised when I saw Serials Solutions’ claim that their discovery service was an evolutionary step beyond federated search. I raised my concerns a couple of times: here and here. My beef isn’t with Serials Solutions as a business, it’s with their position that it’s fine to not search content that they don’t provide access to. There’s no room (yet) in their discovery service model to include access to quality content that can only be searched live, i.e. via federated search. Carl Grant joined the conversation and various people commented, making the topic a very lively one.
My concern was, and is, that libraries and research organizations would consider giving away their responsibility to select quality sources for their patrons for what I imagine to be two primary reasons: (1) library patrons don’t like to wait 30 seconds for federated search results, and (2) (possibly) cost savings. I don’t have a lot of sympathy for the Google generation. Even though I’m an American and my culture has taught me that immediate gratification is a good thing I think 30 seconds is a small price to pay to see better results. Cost I can’t speak to as I don’t have any figures.
One of my colleagues pointed me to an article by scientist and writer Michael Nielsen, Is scientific publishing about to be disrupted?, which only strengthens my belief that access to content from aggregators only supplements access via other methods such as federated search.
Michael Nielsen is a very accomplished scientist. His bio lists some of his impressive credentials:
Michael Nielsen is one of the pioneers of quantum computation. Together with Ike Chuang of MIT, he wrote the standard text on quantum computation. This is the most highly cited physics publication of the last 25 years, and one of the ten most highly cited physics books of all time (Source: Google Scholar, December 2007). He is the author of more than fifty scientific papers, including invited contributions to Nature and Scientific American. His research contributions include involvement in one of the first quantum teleportation experiments (related), named as one of Science Magazine’s Top Ten Breakthroughs of the Year for 1998, quantum gate teleportation, quantum process tomography, the fundamental majorization theorem for comparing entangled quantum states, and critical contributions to the formula for the quantum channel capacity. A full list of papers is here.
Nielsen’s article argues that there is impending disruption of scientific publishing. The article is fascinating, and Nielsen is a compelling and well-informed writer. I recommend you read the fairly long article and, if you have time, follow at least some of the numerous links. I also want to add that I had the opportunity to spend some time with Nielsen at a conference he helped to organize at the Perimeter Institute, and I very much appreciate how incredibly down to earth the man is.
What I found most valuable in Nielsen’s writing were various examples of science being published in non-traditional ways.
One example is Nielsen’s response to a New York Times editorial about the death of newspapers. Here’s a snippet from the editorial:
There’s a great deal of good commentary out there on the Web, as you say. Frankly, I think it is the task of bloggers to catch up to us, not the other way around… Our board is staffed with people with a wide and deep range of knowledge on many subjects. Phil Boffey, for example, has decades of science and medical writing under his belt and often writes on those issues for us… Here’s one way to look at it: If the Times editorial board were a single person, he or she would have six Pulitzer prizes…
And here’s Nielsen’s poignant response:
[The New York Times editorial piece] demonstrates a deep commitment to high-quality journalism, and the other values that have made the New York Times great. In ordinary times this kind of commitment to values would be a sign of strength. The problem is that as good as Phil Boffey might be, I prefer the combined talents of Fields medallist Terry Tao, Nobel prize winner Carl Wieman, MacArthur Fellow Luis von Ahn, acclaimed science writer Carl Zimmer, and thousands of others. The blogosphere has at least four Fields medalists (the Nobel of math), three Nobelists, and many more luminaries. The New York Times can keep its Pulitzer Prizes.
Nielsen’s point is clear. The blogosphere is a tremendous resource to scientists. Libraries and research organizations miss huge amounts of valuable and current resources if they only provide access to content from major publishers (or their aggregators). I do realize that the writings of probably all of the bloggers that Nielsen mentioned are available through Google and might not make sense to federate. The problem with searching Google for excellent science is that you need the time and discernment to find the good stuff. But, however one might access science content, the power of traditional publishers is waning, which is a really good reason not to depend on them for all the science worth reading.
Here’s another excerpt from Nielsen’s article, this one on innovative ways to communicate science that are sprouting up everywhere:
What’s new today is the flourishing of an ecosystem of startups that are experimenting with new ways of communicating research, some radically different to conventional journals. Consider Chemspider, the excellent online database of more than 20 million molecules, recently acquired by the Royal Society of Chemistry. Consider Mendeley, a platform for managing, filtering and searching scientific papers, with backing from some of the people involved in Last.fm and Skype. Or consider startups like SciVee (YouTube for scientists), the Public Library of Science, the Journal of Visualized Experiments, vibrant community sites like OpenWetWare and the Alzheimer Research Forum, and dozens more. And then there are companies like Wordpress, Friendfeed, and Wikimedia, that weren’t started with science in mind, but which are increasingly helping scientists communicate their research.
These Web 2.0 science offerings, at least the ones that provide an API or other mechanism for efficient search, are prime candidates for federation as they constantly generate new content.
One last quote from Nielsen. I very much enjoyed the great examples Nielsen packed into this paragraph of outstanding science being found in blogs of all places.
It’s easy to miss the impact of blogs on research, because most science blogs focus on outreach. But more and more blogs contain high quality research content. Look at Terry Tao’s wonderful series of posts explaining one of the biggest breakthroughs in recent mathematical history, the proof of the Poincare conjecture. Or Tim Gowers recent experiment in “massively collaborative mathematics”, using open source principles to successfully attack a significant mathematical problem. Or Richard Lipton’s excellent series of posts exploring his ideas for solving a major problem in computer science, namely, finding a fast algorithm for factoring large numbers. Scientific publishers should be terrified that some of the world’s best scientists, people at or near their research peak, people whose time is at a premium, are spending hundreds of hours each year creating original research content for their blogs, content that in many cases would be difficult or impossible to publish in a conventional journal. What we’re seeing here is a spectacular expansion in the range of the blog medium. By comparison, the journals are standing still.
At SLA 2009, Abe delivered a presentation: A Journey to 10,000 sources. The talk was about (this blog’s sponsor) Deep Web Technologies‘ efforts to search initially hundreds, then thousands, and eventually 10,000 sources. The accompanying paper makes this important argument for making a wider range of science information available to researchers:
By relying on only the content available from the major publishers and aggregators, researchers miss other important content, in particular the output of scientists who do not publish in mainstream journals. The world is shrinking, the brain pool is growing, and the output of science is everywhere.
While one may argue about the merits of federation vs. crawling and indexing vs. discovery services those arguments frequently focus on the technological merits of particular approaches. The more important question, I think, is what information is worth your while to see? For most of us that information can’t all be federated, or all indexed, or all provided to us by a discovery service. I think federated search will continue to evolve into this hybrid being where multiple technologies are enlisted to give scientists what they need.
by Sol at July 02, 2009 03:21 AM
Another THATCamp has come and gone and it was, again, a lot of fun. I've grown used to the dynamics of an unconference in the past five years or so because that's the kind of event I attend most of the time, now. JCDL 2009 was the first academic conference I'd attended in years, and though I enjoyed it as well and met a lot of interesting people and learned some useful stuff, it was missing the energy the mix of people at a good unconference can generate. And, though I feel like a self-important prig as I write this, I hated that though I'd made the effort to attend, there was no chance for me to get up and show off some stuff I'd worked on in front of the group. I use software that lets a user become a committer; I value friendships that let a student become a teacher; I attend conferences that let an attendee become a presenter. Take out that dynamic and it's nowhere near as compelling.
Because it features this principle, as any good unconference does, the best part of THATCamp is the people. Both years I've met so many fascinating people and learned about so much amazing work that it's taken the whole week following for my brain to settle back down and follow up on all the threads left dangling on Sunday afternoon like so many thesis topics. There's talk of franchised THATCamps to be staged in Austin and London among other places, and that's exciting. There's a #thatcamp channel on freenode that threatens to become a regular hangout. I've got about 50 more people I'm following on twitter, all of whom already fill my screen with fascinating stuff to read and look at all day, and some of them are even following me, too. What more could you ask for?
Well, there are a few things. I think there are a few tweaks to the formula that could improve the event a bit. I offer these only in the hopes of making THATCamp even better, not to complain or kill anybody's leftover buzz.
- Shorter sessions. This year the sessions were 1:15 long; for intense topics that engage everybody in the room that's what you need to give everybody a chance to go deep. But for open-ended discussions where there's as much airing of concerns about how "this needs to happen" and "we have to do that", 1:15 is about 25 minutes too long. It might have just been the sessions I chose this year, but it seemed like I was in more of the latter type sessions than the former, and that was a bit of a let down. Also, there were as many as five or six sessions running concurrently in several slots on the first day, any three or four of which I would've liked to sit in on. Tightening the schedule could allow for more time blocks and cut down on the number of simultaneous tracks.
- More hacking. When you go from having Bill Turkel teaching people how to fire code into an Arduino and the Omeka developers teaching how to write plugins and even me doing a simple tutorial on how to make little colorful balls dance around on screen with Processing one year to basically none of that the next year, it's a bit of a drop off to somebody like me who likes to learn by doing, especially in realtime at a moment when I'm jazzed up by all the amazing people and ideas in the air.
We talked about this a bit in #thatcamp on IRC last night - maybe if the sessions were a bit shorter and there were fewer concurrent tracks, one of the extra rooms could be a "hackin' room" or some such. Sorta like the chillout room at a rave with plenty of water and comfy couches where people can take a breather but, er, well, the exact opposite of that.
It might just be that I'm a little bit disappointed in myself for not prepping a hackier topic myself. I put a lot of time into hacks just for THATCamp last year and it was great fun pulling them off. I'd like to think that it was fun for the people in the room with me, too, and either way I learned a lot from the experience and I hope that was mutual. This year I was burned out on conference travel and work and didn't have the extra cycles to put something fun and new together, and I'm sorry I didn't. If I get to go again, I promise to do whatever I can to bring the hackin' back in!
- Let us do our own scheduling. This is probably the biggest one. At the Foo Camp I went to the intro evening session ended with everybody mingling around big schedule boards where times, topics, and rooms get worked out among the attendees in realtime. It's messy and takes a while but it ends with drinks and everybody's just happy to be bumping into all the other fascinating people around them anyway so it serves as a nice icebreaker, too. At THATCamp, CHNM staff instead comb through ideas posted in advance to the blog and group and sort and lump and split topics into sessions with titles that don't necessarily match what the idea-posters had in mind. I wanted to talk about improving web sites with linked data but where do I go to talk about that in this schedule? "Standards"? "Publishing"? "Software Development"? "Libraries and Web 2.0"? (that's where I went, and did a bit of the talk, but I'm not sure my topic was what everybody else there had in mind, and I know I wasn't alone in this mode of confusion).
By cutting out this dynamic let-the-people-do-it-themselves step you minimize opportunities for catchy titles to draw people in, for people to negotiate whether or not they should merge their own topics, and for people to simply get to know each other and decide which other people they want to be sure to hear from and hang out with right off the bat. And imho you maximize confusion about which sessions to go to and where you can find the people you want to hear from.
I'd advocate for filling out a big whiteboard with a schedule with people putting the names of their talks and their names with it and leaving a good 60-90 minutes to work it all out. On a real board or on paper (vs. online), so we'd have to occupy the same physical space. With drinks nearby.
I know Jeremy put a ton of work into scheduling because I caught him in the act when I arrived late so I know it was no trivial feat. I just think opening it up would be easier on @clioweb and @digitalhumanist and better for the rest of us too.
- Three word intros. Another nice thing they did at Foo was *very* brief intros of everybody in the room: your name, your affiliation, and *just* three words about who you are or what you're into. Mine would be: "Dan Chudnov, Library of Congress, One Big Library". It's a chance to put names to faces, it's another friendly icebreaker, and it's a chance for all of us 140-charsmiths to be clever.
- The schedule. Maybe it might help to have an evening meeting the night before for the welcoming session, the scheduling, and maybe one or two lead talks to kick things off. Then everybody can go get dinner or drinks and talk and think about what's coming the next morning and maybe work on their slides or demos or whatever overnight. You'd know when your slot is the next day, and which sessions you want to be sure to get to.
I don't want to be all "they do it better at Foo Camp" but these last few points really do reflect things that Foo Camp does a little better that I think THATCamp could adopt to make it just that much better.
And not to repeat myself, but I offer all this up with the hope of leading folks to think about various ways to make a great event even greater. I ain't complaining - the organizers do a great job making a lot of people with diverse backgrounds comfortable in a terrific space with plenty of coffee and wifi and surprisingly good food and nicely designed t-shirts and as long as they'll have me, I'll keep applying to attend again. It's just that I'm a bit of a hacker at heart and I'm always thinking about little optimizations, so take this as nothing more than that.
I hope to see y'all again next year, or even sooner - and next time you're in DC please stop by LC to say hi if you like.
by dchud at July 02, 2009 03:16 AM
"Following People at real-world events in real-time". Pulls in from twitter, flickr, youtube, twitpic, tinyurl, bitly. Looks useful for remote-following of conferences!
by keyvowel at July 02, 2009 01:25 AM
July 01, 2009
"The Archivist is a Windows application that runs on your local system and allows you to archive tweets for later data-mining and analysis for a given search.""If you leave The Archivist open, it will update with the latest results every 10 minutes. You can also close The Archivist and open it later. The Archivist will save the tweets and get all the tweets it can since that search."
by keyvowel at July 01, 2009 11:24 PM
Have you ever watched a web server log? Thirteen years ago, I was starting up a scientific e-journal, and it was very gratifying to watch the monitor and see the traffic coming in from all over the world. Occasionally I would turn on the referrer log to see where people were coming from. One time, I was surprised to see that somebody in Poland was coming to my e-journal site from a Russian web site with "xxx" in the URL. Curious about what sort of site might be linking to my e-journal, I checked out the site, and found it to be about blond, naked women. I wasn't sure about what this indicated about my e-journal. Perhaps the Polish scientists found the e-journal and the xxx site equally stimulating? Perhaps their boss had just walked into the room, and they needed a work-oriented internet site to cover their other browsing?
My perspective on the privacy of my internet browsing changed that day. I've become mildly paranoid about things that might spy on me. I am very selective about the Facebook apps that I load, for example, but I don't bother to flush my browsing history or block web bugs or things like that. I enjoyed finding out "what Google knows about me" (post it to Facebook and tag your friends to do the same!). I really worry about Firefox extensions (or "Add-ons"), because I know how extremely powerful and/or intrusive they can be. Even so, the 3 or 4 things I add to Firefox are the main reason I don't use Safari, despite its integration advantages. I'm not surprised that IE and Safari have declined to support practical extension mechanisms; they're sort of scary. On the other hand, Firefox Add-ons have presented very few spyware-related problems; this is due in part to the fact that they must be written in Javascript and delivered as source. It's relatively easy to go and open an Add-on and inspect its code, so if an Add-on does something other than what it says it does, it's likely that sooner or later someone will discover the truth.
A really interesting Firefox Add-on called "Glue" is being offered by a venture-funded company called AdaptiveBlue. (no relation whatsoever to my company, Gluejar, Inc.) Glue watches you browse the internet and when it sees you on one of a set of sites that it knows about, it reports the pages you're on to AdaptiveBlue, enabling them to construct a "Social Network of Things", where the Things might be Books, Music, Products, Wine, Companies, etc.

Overall there are over 300 sites that the Glue Add-on does something with. A lot goes on in Glue, and I didn't take the time to sort everything out. For example, when you go to a topic page in Wikipedia or a book page in WorldCat, or a stock page in Yahoo Finance, the URL that you visited is reported to AdaptiveBlue. Usually, the Add-on then slides down a Glue header which tells you about what the Glue Social Network thinks about the Thing you are looking at. Personally, I find this very distracting, and I don't plan to continue using Glue, but I can imagine that many people will appreciate the consistent interface to the social network and other services that is presented. Other sites handled by Glue include LibraryThing, Epicurious, Last.fm, ESPN, theStreet, ToysRUs, Expedia, GameSpy, Metacritic, WineLibrary, Flixster, Connotea, Flickr, Technorati, Walmart and eBay, just to name a few. It was very difficult to find the official list of sites that Glue works with on the GetGlue web site; I wish the AdaptiveBlue people were more upfront about exactly what they do on these sites. Nonetheless, the Add-on appears to do what it says it does. I also would like to see the user given more control over the sorts of things that are reported to AdaptiveBlue: I'm much more relaxed about sharing my Wine and Sports browsing than I am about my Wikipedia and Stocks browsing. And I really don't want to share my Russian XXX site browsing!
It's interesting to compare Glue to the OpenURL linking services that have been almost universally adopted in libraries. (I developed one of the first OpenURL link servers, which is now owned by OCLC, Inc.) Like Glue, the OpenURL link servers present users with relevant information and links to services surrounding "things" which are typically journal articles or books. One library that I worked with even used a social network to connect users to other users who had viewed the same item, just like Glue. There was even a Firefox Add-on developed that routed "thing" links to link servers. The link server vendor community worked with publishers closely to enable OpenURL linking; although AdaptiveBlue promotes its "SmartLinks", I doubt that many of the sites Glue is aware of understand what they are doing.
Glue makes heavy use of Amazon web services, including the product information web service, the SimpleDB service and the S3 simple storage service. It's smart these days to outsource scalability and concentrate on your application's functions. Glue also makes nice use of the Dojo and Mochikit Javascript toolkits. In browsing the code, I noticed that many of the problems it addressed were exactly the same ones we encountered developing Linkbaton 9 years ago, and the solutions look quite similar (in other words, I think the developers have done a pretty good job!) except that the tools available today are so much more advanced than what we had to work with 9 years ago.
Given that AdaptiveBlue makes a big deal about the Semantic-ness of its technology, I was surprised to find out how it identifies "Things". The canonical way to identify a Thing on the semantic web is to give it a URI, and then attach properties to it. When I spoke with AdaptiveBlue founder and CEO Alex Iskold at the Semantic Technology Conference, he told me that they only use title and author strings to define book Things. In fact, they bundle these strings into keys (such as books/cryptonomicon/neal_stephenson), then use the keys as if they identified a book, when in the real world, it's more complicated. So the "Things" in the AdaptiveBlue "Social Network of Things" are entities that do not correspond to books, but rather correspond to descriptions of books. Interestingly, this is exactly the approach taken in OpenURL URIs, which are really descriptive metadata packages, not entity URIs.
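To make the distinction concrete, here is a small shell sketch of how a title string and an author string could be bundled into a key like the one above. The normalization rules (lowercase everything, collapse runs of other characters into an underscore) are an assumption chosen only because they reproduce the example key; they are not AdaptiveBlue's actual rules, and slug and book_key are hypothetical names.

```shell
#!/bin/sh
# slug: lowercase a string and collapse every run of characters other
# than a-z and 0-9 into a single "_", trimming "_" from both ends.
# (Assumed normalization, for illustration only.)
slug() {
    printf '%s' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -e 's/[^a-z0-9][^a-z0-9]*/_/g' -e 's/^_//' -e 's/_$//'
}

# book_key: bundle a title ($1) and an author ($2) into a
# description-based key of the form books/<title>/<author>.
book_key() {
    printf 'books/%s/%s\n' "$(slug "$1")" "$(slug "$2")"
}

book_key "Cryptonomicon" "Neal Stephenson"
# prints: books/cryptonomicon/neal_stephenson
```

Every edition of the book, and any other book that happens to share the same title and author strings, collapses to the same key. That is the sense in which the key names a description rather than a particular thing.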
The first of Tim Berners-Lee's "Four Rules" for Linked Data is "Use URIs as names for things". Both Glue and OpenURL, which were designed separately as practical solutions for linking to things, seem to break this rule. Instead they build URIs using descriptions of the things, and don't bother naming the things themselves.
Maybe Tim BL's first rule is wrong!
by Eric Hellman (noreply@blogger.com) at July 01, 2009 10:50 PM
I’m surprised that there isn’t an easier way to go from a Gimp file (.xcf) to a PDF. Sure, you can always “print to pdf” if you are working with a single layer image, but what if you have a multi-layer image that you want to turn into a PDF with multiple pages (each page being a layer from the image)?
Here is one way that I’ve found to accomplish this. I’m using Ubuntu so any install stuff will be specific to that distribution, but the software I’m using should work on any Linux distro.
First, you’ll need Gimp. I’m assuming that’s already installed.
Gimp won’t save a multi-layer image to a .ps, .tif, or .pdf by itself, though, so you need to install a script called “Save Layers as Individual Files” (this script can be downloaded for Gimp 2.4 or newer from Panotools) .
Once you download this script it needs to be put in your Gimp scripts directory.
unzip -d ~/.gimp-2.6/scripts Save-layers-tiff-24.zip
Your scripts directory may be named something else if you are using another version of Gimp (other than 2.6). Once the script is in that directory, it will appear in the Script-Fu > Utils menu within Gimp (and can be applied to any open image).
Next, you need to install ImageMagick. If you don’t already have it installed, it’s as easy (on Ubuntu) as:
sudo aptitude install imagemagick
Once that is installed, you’ll be able to use the mogrify program which comes with ImageMagick. From within the directory that contains all your TIF files, type:
mogrify -format pdf *.tif
This will generate a PDF for each of your TIF files. You can then merge all the PDF files into one using a program called PDFTK. To install that, just type:
sudo aptitude install pdftk
Running that program is as easy as typing:
pdftk filename*.pdf cat output singlename.pdf
The filename*.pdf argument will catch all the individually named files created by the mogrify program (filename1.pdf, filename2.pdf, filename3.pdf, filename4.pdf, etc.)
And, that’s it! You can open your new singlename.pdf file and have all those XCF layers now represented by individual pages within the PDF. This is the easiest way that I’ve found to accomplish this task, but if you know of a better/easier way I’d love to hear it!
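The last two steps can also be wrapped in a small script. This is only a sketch: tifs_to_pdf and run are names made up here, it assumes mogrify and pdftk are installed as described above, and it assumes the exported layer TIFs all sit in one directory. Setting DRY_RUN=1 prints the commands instead of running them, which is handy for checking the page order first.

```shell
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# tifs_to_pdf DIR OUTPUT
# Convert every .tif in DIR to a PDF, then merge the PDFs into OUTPUT.
# OUTPUT is written relative to DIR unless it is an absolute path.
# The body runs in a subshell so the cd does not affect the caller.
tifs_to_pdf() (
    dir=$1
    out=$2
    cd "$dir" || exit 1

    # Collect the TIF names (the shell expands *.tif in lexical order).
    set --
    for f in *.tif; do
        [ -e "$f" ] || continue
        set -- "$@" "$f"
    done
    [ "$#" -gt 0 ] || { echo "no .tif files in $dir" >&2; exit 1; }

    # One PDF per layer: layer1.tif -> layer1.pdf, and so on.
    run mogrify -format pdf "$@"

    # Swap each .tif name for the matching .pdf name, keeping the order.
    for f in "$@"; do
        shift
        set -- "$@" "${f%.tif}.pdf"
    done

    # Merge the per-layer PDFs into a single multi-page PDF.
    run pdftk "$@" cat output "$out"
)
```

After loading the functions, `tifs_to_pdf ~/layers singlename.pdf` would run both steps (the directory name is a placeholder). One thing to watch: because names are sorted lexically, layer10.tif comes before layer2.tif, so zero-padded layer names (layer02.tif) keep the pages in order.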
by ksclarke at July 01, 2009 09:00 PM
The Professional Development Committee of the Hesburgh Libraries at the University of Notre Dame sponsored a “mini-symposium” on the topic of mass digitization on Thursday, May 21, 2009. This text documents some of what the speakers had to say. Given the increasingly wide availability of free full text information provided through mass digitization, the forum offered an opportunity for participants to learn how such a thing might affect learning, teaching, and scholarship. *
Setting the Stage
After introductions by Leslie Morgan, I gave a talk called “Mass digitization in 15 minutes” where I described some of the types of library services and digital humanities processes that could be applied to digitized literature. “What might libraries be like if 51% or more of our collections were available in full text?”
Maura Marx
The Symposium really got underway with the remarks of Maura Marx (Executive Director of the Open Knowledge Commons) in a talk called “Mass Digitization and Access to Books Online.” She began by giving an overview of mass digitization (such as the efforts of the Google Books Project and the Internet Archive) and compared it with large-scale digitization efforts. “None of this is new,” she said, and gave examples including Project Gutenberg, the Library of Congress Digital Library, and the Million Books Project. Because the Open Knowledge Commons is an outgrowth of the Open Content Alliance, she was able to describe in detail the mechanical digitizing process of the Internet Archive with its costs approaching 10¢/page. Along the way she advocated the HathiTrust as a preservation and sharing method, and she described it as a type of “radical collaboration.” “Why is mass digitization so important?” She went on to list and elaborate upon six reasons: 1) search, 2) access, 3) enhanced scholarship, 4) new scholarship, 5) public good, and 6) the democratization of information.
The second half of Ms. Marx’s presentation outlined three key issues regarding the Google Books Settlement. Specifically, the settlement will give Google a sort of “most favored nation” status because it prevents Google from getting sued in the future, but it does not protect other possible digitizers the same way. Second, it circumvents, through contract law, the problem of orphan works; the settlement sidesteps many of the issues regarding copyright. Third, the settlement is akin to a class action suit, but in reality the majority of people affected by the suit are unknown since they fall into the class of orphan works holders. To paraphrase, “How can a group of unknown authors and publishers pull together a class action suit?”
She closed her presentation with a more thorough description of the Open Knowledge Commons agenda, which includes: 1) the production of digitized materials, 2) the preservation of said materials, and 3) the building of tools to make the materials increasingly useful. Throughout her presentation I was repeatedly struck by the idea of the public good the Open Knowledge Commons was trying to create. At the same time, her ideas were not so naive as to ignore the new business models that are coming into play and the necessity for libraries to consider new ways to provide library services. “We are a part of a cyber infrastructure where the key word is ‘shared.’ We are not alone.”
Gary Charbonneau
Gary Charbonneau (Systems Librarian, Indiana University - Bloomington) was next and gave his presentation called “The Google Books Project at Indiana University.”
Indiana University, in conjunction with a number of other CIC (Committee on Institutional Cooperation) libraries, has begun working with Google on the Google Books Project. Like many previous Google Book Partners, Charbonneau was not authorized to share many details regarding the Project; he was only authorized “to paint a picture” with the metaphoric “broad brush.” He described the digitization process as rather straightforward: 1) pull books from a candidate list, 2) charge them out to Google, 3) put the books on a truck, 4) wait for them to return in a few weeks or so, and 5) charge the books back into the library. In return for this work they get: 1) attribution, 2) access to snippets, and 3) sets of digital files which are in the public domain. About 95% of the works are still under copyright and none of the books come from their rare book library, the Lilly Library.
Charbonneau thought the real value of the Google Book search was the deep indexing, something mentioned by Marx as well.
Again, not 100% of the library’s collection is being digitized, but there are plans to get closer to that goal. For example, they are considering plans to digitize their “Collections of Distinction” as well as some of their government documents. Like Marx, he advocated the HathiTrust but he also suspected commercial content might make its way into its archives.
One of the more interesting things Charbonneau mentioned was in regard to URLs. Specifically, there are currently no plans to insert the URLs of digitized materials into the 856 $u field of MARC records denoting the location of items. Instead they plan to use an API (application programming interface) to display the location of files on the fly.
Indiana University hopes to complete their participation in the Google Books Project by 2013.
Sian Meikle
The final presentation of the day was given by Sian Meikle (Digital Services Librarian, University of Toronto Libraries) whose comments were quite simply entitled “Mass Digitization.”
The massive (no pun intended) University of Toronto library system consisting of a whopping 18 million volumes spread out over 45 libraries on three campuses began working with the Internet Archive to digitize books in the Fall of 2004. With their machines (the “scribes”) they are able to scan about 500 pages/hour and, considering the average book is about 300 pages long, they are scanning at a rate of about 100,000 books/year. Like Indiana and the Google Books Project, not all books are being digitized. For example, they can’t be too large, too small, brittle, tightly bound, etc. Of all the public domain materials, only 9% or so do not get scanned. Unlike the output of the Google Book Project, the deliverables from their scanning process include images of the texts, a PDF file of the text, an OCRed version of the text, a “flip book” version of the text, and a number of XML files complete with various types of metadata.
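As a back-of-the-envelope check of these figures, the short Python sketch below works out how many scanner-hours the reported throughput implies. The three input numbers are the ones quoted above; the function names are made up for this illustration, and reading the 500 pages/hour as a per-machine rate is my assumption.

```python
import math

PAGES_PER_HOUR = 500       # scanning rate quoted in the talk (assumed to be per machine)
PAGES_PER_BOOK = 300       # average book length quoted in the talk
BOOKS_PER_YEAR = 100_000   # reported yearly throughput
HOURS_IN_A_YEAR = 365 * 24


def scanner_hours_per_year(books=BOOKS_PER_YEAR,
                           pages_per_book=PAGES_PER_BOOK,
                           pages_per_hour=PAGES_PER_HOUR):
    """Total machine time needed to scan the reported number of books."""
    return books * pages_per_book / pages_per_hour


def minimum_scanners(books=BOOKS_PER_YEAR):
    """Fewest machines that could do the work if each ran around the clock."""
    return math.ceil(scanner_hours_per_year(books) / HOURS_IN_A_YEAR)


if __name__ == "__main__":
    print(scanner_hours_per_year())  # 60000.0 scanner-hours per year
    print(minimum_scanners())        # 7
```

In other words, 100,000 books/year works out to 30 million pages and 60,000 scanner-hours, which is more hours than a single machine has in a year; under the per-machine assumption, at least seven scribes would have to be running, and more in practice since none runs nonstop.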
Considering Meikle’s experience with mass digitized materials, she was able to make a number of observations and distinctions. For example, we — the library profession — need to understand the difference between “born digital” materials and digitized materials. Because of formatting, technology, errors in OCR, etc, the different manifestations have different strengths and weaknesses. Some things are more easily searched. Some things are displayed better on screens. Some things are designed for paper and binding. Another distinction is access. According to some of her calculations, materials that are in electronic form get “used” more than their printed form. In this case “used” means borrowed or downloaded. Sometimes the ratio is as high as 300-to-1. There are three hundred downloads to one borrow. Furthermore, she has found that proportionately, English language items are not used as heavily as materials in other languages. One possible explanation is that material in other languages can be harder to locate in print. Yet another difference is the type of reading one format offers over another; compare and contrast “intentional reading” with “functional reading.” Books on computers make it easy to find facts and snippets. Books on paper tend to lend themselves better to the understanding of bigger ideas.
Lastly, Meikle alluded to ways the digitized content will be made available to users. Specifically, she imagines it will become a part of an initiative called the Scholar’s Portal — a single index of journal article literature, full text books, and bibliographic metadata. In my mind, such an idea is the heart of the “next generation” library catalog.
Summary and Conclusion
The symposium was attended by approximately 125 people. Most were from the Hesburgh Libraries of the University of Notre Dame. Some were from regional libraries. There were a few University faculty in attendance. The event was a success in that it raised the awareness of what mass digitization is all about, and it fostered communication during the breaks as well as after the event was over.
The opportunities for librarianship and scholarship in general are almost boundless considering the availability of full text content. The opportunities are even greater when the content is free of licensing restrictions. While the idea of complete collections totally free of restrictions is a fantasy, the idea of significant amounts of freely available full text content is easily within our grasp. During the final question and answer period, someone asked, “What skills and resources are necessary to do this work?” The answer was agreed upon by the speakers, “What is needed? An understanding that the perfect answer is not necessary prior to implementation.” There were general nods of agreement from the audience.
Now is a good time to consider the possibilities of mass digitization and to be prepared to deal with them before they become the norm as opposed to the exception. This symposium, generously sponsored by the Hesburgh Libraries Professional Development Committee, as well as library administration, provided the opportunity to consider these issues. “Thank you!”
Notes
* This posting was originally “published” as a part of the Hesburgh Libraries of the University of Notre Dame website, and it is duplicated here because “Lots of copies keep stuff safe.”
by Eric Lease Morgan at July 01, 2009 05:23 PM
Call for Papers....
Cataloging & Classification Quarterly
CCQ welcomes the submission of research, theory, and practice papers relevant to the broad field of bibliographic organization.
This journal, published now 8 times a year by Taylor & Francis, LLC, is respected as an international forum that emphasizes research and review articles, description of new programs and technologies relevant to cataloging and classification, and considered speculative articles on improved methods of bibliographic control for the future.
Articles are particularly welcome in areas dealing with research-based cataloging practice, including user behavior, user needs and benefits.
Authors are encouraged to submit manuscripts via email, with an attached Word document, to the Editor, Sandra K. Roe, Bibliographic Services Librarian, Illinois State University (email: skroe@ilstu.edu).
Special Issues
Colleagues interested in guest editing a special issue or expanded double issue are invited to contact the Editor with a general proposal, tentative schedule, and CVs. Previous special issues have included:
- Metadata and Open Access Repositories (Michael Babinec and Holly Mercer, Guest Editors)
- Bibliographic Database Quality (Jeffrey Beall and Stephen Hearn, Guest Editors)
- The Intellectual and Professional World of Cataloging (Qiang Jin, Guest Editor)
- Knitting the Semantic Web (Jane Greenberg and Eva Méndez, Guest Editors)
- Cataloger, Editor and Scholar: Essays in Honor of Ruth C. Carter (Robert Holley, Guest Editor)
Annual Best Paper Award
Taylor & Francis sponsors an annual prize for CCQ with a small financial stipend for the Best Paper of the Year.
Free Print Sample
A free print specimen copy may be obtained by sending an email to marisa.starr@taylorandfrancis.com.
For More Details
Further details may be found at the CCQ home page.
by David (noreply@blogger.com) at July 01, 2009 05:12 PM
So I just gave (or co-gave) a presentation on Umlaut as deployed here as our Find It service.
One of the most exciting parts to me was that various (non-IT) librarians in the room, unprompted, started throwing out ideas of what it could do in the future. Quite good ideas. I had to resist the techie’s urge to respond to them with “Well, yeah, but see, that’s harder than it might seem to make work like that…”, and instead try to be encouraging and positive, because it was great to have such a conversation. We hardly ever have such conversations.
Why? I think because usually a non-technical librarian has absolutely no way to put such innovative thoughts into practice. As Karen Schneider talked about in her 2007 Code4Lib Keynote, libraries have ended up outsourcing a significant part of their core business to vendors, in a way that we pay for it, and we get it, and we pretty much take what we get.
My experience made me realize today that one of the (many) negative side effects of this is that librarians have lost the opportunity (and thus been implicitly ‘trained’ not to even bother trying) of doing what librarians should be doing in this era when so many of our services are delivered over the web: Figuring out how to make these services meet our users’ needs better!
Contrary to popular belief, you can’t just let your users tell you what your services will be. Sure, of course you need to listen to your users. And if you listen and observe very carefully, you can figure out what your users’ needs are, some of which they may not even be able to articulate themselves, but others of which they most certainly can. But you can’t count on your users to identify the best solutions to these needs. That’s what we’re for, that’s why we’re professionals!
And, to me at least, it’s one of the most interesting and rewarding parts of our jobs.
But the outsourcing of much of the library’s business to vendors has taken the opportunity to do that away from most of us (an IT geek like me, in a library that lets him get away with it, still has some). Most non-IT librarians have had it reinforced that they shouldn’t even bother. And while you have to be an IT type to implement new online services or features, you shouldn’t have to be one to be engaged in dreaming up and planning them.
One thing open source can do is return this power to us. I’m pretty pleased that Umlaut (and my ability to explain it) is finally at the point where its future potential can be seen enough to encourage non-technical librarians to start suggesting “Hey, but what if it could do this and that too? Wouldn’t that be great?”
And, if I can somehow find the time amongst the way too many really great things that I’d like to do if I had time, maybe soon it will!
by jrochkind at July 01, 2009 03:52 PM
Digital librarians and archivists might be interested in this. NASA is seeking ideas on how to analyze and catalog notes from Wernher von Braun into an electronic system.
On the eve of the 40th anniversary of the historic first moon landing, NASA is seeking ideas from the public, academia, and industry about how to analyze and catalog notes from spaceflight pioneer Wernher von Braun into an electronic, searchable database or other system.
Von Braun was the first director of NASA's Marshall Space Flight Center in Huntsville, Ala., and a key figure in the development of the Saturn V rocket and NASA's Apollo program. NASA has a full collection of "Weekly Notes" von Braun wrote during the 1960s and 1970s. These notes were used to track programmatic and institutional issues at Marshall, and are considered by many historians to be a valuable source of data.
NASA has issued a request for information and is looking for concepts that will provide an innovative resource for agency engineers and scientists, as well as researchers in academia
by David (noreply@blogger.com) at July 01, 2009 04:01 PM
Stefan Tilkov recently announced the availability of the video of a presentation he gave a few months ago on design patterns (& anti-patterns) for REST. I recommend having a look at it, as it covers a lot of ground and has lots of useful examples, and I find his presentational style strikes a nice balance of technical detail and reflection. If you haven't got time to listen, the slides are also available in PDF (though I do think hearing the audio clarifies quite a lot of the content).
One of the questions that this presentation (and other similar ones) planted at the back of my mind is that of how some of the patterns presented might be impacted by the W3C TAG's httpRange-14 resolution and the Cool URIs conventions for distinguishing between what it calls "real world objects" and "Web documents", some of which describe those "real world objects". The Cool URIs document focuses on the implications of this distinction on the use of the HTTP protocol to request representations of resources, using the GET method, but does not touch on the question of whether/how it affects the use of HTTP methods other than GET.
In the early part of his presentation, Stefan introduces the notion of "representation" and the idea that a single resource may have multiple representations. Some of the resources referred to in his examples, like "customers" (slide 16 in the PDF; slide 16 in the video presentation), when seen from the perspective of the Cool URIs document, fall, I think, into the category of "real world objects" - things which may be described (by distinct resources) but are not themselves represented on the Web. So, following the Cool URIs guidelines, the URI of a customer would be a "hash URI" (URI with fragment id) or a URI for which the response to an HTTP GET request is a 303 redirect to the (distinct) URI of a document describing the customer.
But what about non-"read-only" interactions, and using methods other than GET? The third "design pattern" in the presentation is one for "resource creation" (slide 55 in the PDF; slide 98 in the video presentation). Here a client POSTs a representation of a resource to a "collection resource" (slide 50 in the PDF; slide 93 in the video presentation). The example of a "collection resource" used is a collection of customers, with the implication, I think, that the corresponding "resource creation" example would involve the posting of a representation of a customer, and the server responding 201 with a new URI for the customer.
I think (but I'm not sure, so please do correct me!) that the implication of the httpRange-14 resolution is that in this example, the "collection resource", the resource to which a POST is submitted, would be a collection of "customer descriptions", and the thing posted would be a representation of a customer description for the new customer, and the URI returned for the newly created resource would be the URI of a new customer description. And a GET for the URI of the description would return a representation which included the URI of the new customer.
(In the diagram above, http://example.org/customers/123 is the URI of a customer; http://example.org/docs/customers/123 is the URI of a document describing that customer.)
And, finally, a GET for the URI of the customer (assuming it isn't a "hash URI") would - following the Cool URIs conventions - return a 303 redirect to the URI of the description.
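To make that reading concrete, here is a toy, in-memory Python sketch of the exchange as I understand it. It is only an illustration of my interpretation, not something taken from Stefan's presentation or the Cool URIs document: the CustomerStore class, the collection URI http://example.org/docs/customers and the "about" key are all invented for the example, and no real HTTP is involved. Each method simply returns a (status, headers, body) tuple.

```python
class CustomerStore:
    """Toy model: POST creates descriptions, GET on a 'thing' URI answers 303."""

    BASE = "http://example.org"

    def __init__(self):
        self.descriptions = {}   # description (document) URI -> representation
        self.next_id = 123       # starts at 123 to match the URIs used above

    def post(self, uri, representation):
        """POST a customer description to the (hypothetical) collection of descriptions."""
        if uri != f"{self.BASE}/docs/customers":
            return 404, {}, None
        customer_id = self.next_id
        self.next_id += 1
        doc_uri = f"{self.BASE}/docs/customers/{customer_id}"
        customer_uri = f"{self.BASE}/customers/{customer_id}"
        # The stored description names the real-world customer it is about.
        self.descriptions[doc_uri] = dict(representation, about=customer_uri)
        return 201, {"Location": doc_uri}, None

    def get(self, uri):
        """GET a description (200) or a customer URI (303 to its description)."""
        if uri in self.descriptions:
            return 200, {}, self.descriptions[uri]
        doc_uri = uri.replace("/customers/", "/docs/customers/", 1)
        if doc_uri in self.descriptions:
            return 303, {"Location": doc_uri}, None
        return 404, {}, None


if __name__ == "__main__":
    store = CustomerStore()
    status, headers, _ = store.post("http://example.org/docs/customers", {"name": "Example Customer"})
    print(status, headers["Location"])
    print(store.get(headers["Location"]))
    print(store.get("http://example.org/customers/123"))
```

The point of the sketch is simply that every write and every 200 response involves a document URI, while the customer URI only ever produces a redirect.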
There is some discussion of this in a short post by Richard Cyganiak, and I think the comments there bear out what I'm suggesting here, i.e. that POST/PUT/DELETE are applied to "Web documents" and not to "real-world objects".
The comment by Leo Sauermann on that post refers to the use of a SPARQL endpoint for updates - the SPARQL Update specification certainly addresses this area. It talks in terms of adding/deleting triples to/from a graph, and adding/deleting graphs to/from a "graph store". I think the "adding a graph to a graph store" case is pretty close to the requirement that is being addressed by the "post representation to Collection Resource" pattern. But I admit I struggle slightly to reconcile the SPARQL Update approach with Stefan's design pattern - and indeed, he highlights the "endpoint" notion, with different methods embedded in the content of the representation, as part of one of his "anti-patterns", their presence typically being an indicator that an architecture is not really RESTful.
I should emphasise that I'm trying to avoid seeming to adopt a "purist" position here: I recognise that "RESTfulness" is a choice rather than an absolute requirement. However, interest in the RESTful use of HTTP has grown considerably in recent years (to the extent that some developers seem keen to apply the label "RESTful", regardless of whether their application meets the design constraints specified by the architectural style or not). And now the "linked data" approach - which of course makes use of the httpRange-14 conventions - also seems to be gathering momentum, not least following the announcement by the UK government that Tim Berners-Lee would be advising them on opening up government data (and his issuing of a new note in his Design Issues series focussed explicitly on government data). It seems to me it would be helpful to be clear about how/where these two approaches intersect, and how/where they diverge (if indeed they do!). Purely from a personal perspective, I would like to be clearer in my own mind about whether/how the sort of patterns recommended by Stefan apply in the post-httpRange-14/linked data world.
by PeteJ at July 01, 2009 02:00 PM
By Tom Scott
This guest post originally appeared on Tom Scott’s blog; republished under a Creative Commons license, and with kind permission of the author.

It’s starting to feel like the world has suddenly woken up to the whole Linked Data thing, and that’s clearly a very, very good thing. Not only are Google (and Yahoo!) now using RDFa, but a whole bunch of other things are going on, all rather exciting. Below is a round-up of some of the best. But if you don’t know what I’m talking about you might like to start off with TimBL’s talk at TED.
TimBL is working with the UK Cabinet Office (as an advisor) to make our information more open and accessible on the web [cabinetoffice.gov.uk]
The blog states that he’s working on:
- overseeing the creation of a single online point of access and work with departments to make this part of their routine operations.
- helping to select and implement common standards for the release of public data
- developing Crown Copyright and ‘Crown Commons’ licenses and extending these to the wider public sector
- driving the use of the internet to improve consultation processes.
- working with the Government to engage with the leading experts internationally working on public data and standards
The Guardian has an article on the appointment.
Closer to home there have been a few interesting developments
Media Meets Semantic Web – How the BBC Uses DBpedia and Linked Data to Make Connections [pdf]
Our paper at this year’s European Semantic Web Conference (ESWC2009) looks at how the BBC has adopted semantic web technologies, including DBpedia, to help provide a better, more coherent user experience. We won best paper of the in-use track for it; congratulations to Silver and Georgie.
The BBC has announced a couple of SPARQL endpoints, hosted by Talis and OpenLink [welcomebackstage.com]
Both platforms allow you to search and query the BBC data in a number of different ways, including SPARQL — the standard query language for semantic web data. If you’re not familiar with SPARQL, the Talis folk have published a tutorial that uses some NASA data.
A social semantic BBC?
Nice presentation from Simon and Ben on how social discovery of content could work… “show me the radio programmes my friends have listened to, show me the stuff my friends like that I’ve not seen” all built on people’s existing social graph. People meet content via activity.
PricewaterhouseCoopers’ spring technology forecast focuses on Linked Data [pwc.com]
“Linked Data is all about supply and demand. On the demand side, you gain access to the comprehensive data you need to make decisions. On the supply side, you share more of your internal data with partners, suppliers, and—yes—even the public in ways they can take the best advantage of. The Linked Data approach is about confronting your data silos and turning your information management efforts in a different direction for the sake of scalability. It is a component of the information mediation layer enterprises must create to bridge the gap between strategy and operations… The term “Semantic Web” says more about how the technology works than what it is. The goal is a data Web, a Web where not only documents but also individual data elements are linked.”
Including an interview with me!
You should also check out…
sameas.org a service to help link up equivalent URIs
It helps you to find co-references between different data sets. Interestingly it’s also licensed under CC0, which means all copyright and related or neighboring rights are waived.
Image: “Semantic Web Rubik’s Cube” by dullhunk, CC License, via flickr
by admin at July 01, 2009 01:45 PM
Maybe it's just me but sometimes I need to recharge my batteries. Here is my solution: spend a couple of days with energized library technologists, FOSS developers, and systems librarians. Well, I did say that maybe it's just me. Fortunately my batteries got a full charge this week at Access2008, Canada's premier library technology conference, which was being hosted just down the road from me by McMaster University. The librarians attending Access2008 totally get the need to take a holistic approach to ICT in libraries. And they mostly get FOSS as well. In fact I think I met more dedicated proponents of FOSS in libraries over the course of this conference than I had ever known existed.
One of the highlights for me was the opportunity to see keynote speaker Karen Schneider, whose blog has long been a must for librarians concerned with technology. Karen is now Community Librarian for Equinox Software which is the principal support company for Evergreen, a FOSS ILS. I thoroughly enjoyed her talk entitled Open++ - dispatches from the OSS frontlines. Karen was sharing some of the pluses (or "++" - which signals praise and potential karma points in the IRC channels that library technology geeks frequent) and a few minuses of her task of explaining open source on the ground in libraries. It is no small task to set out to demystify the FOSS community and ethos, but it is all part of the effort to spread the word about Evergreen.
Perhaps it is just the nature of the Access conferences, or maybe it is a reflection of the state of libraries in North America at the moment, but I found FOSS everywhere I turned. Dale Askey of Kansas State University gave a great talk about the anxieties some of us have about letting people see our code, and the real need to get it out there. Eric Lease Morgan spoke about his MyLibrary project at the University of Notre Dame. Walter Lewis and Slavko Manojlovich spoke about the partnership between AlouetteCanada and OurOntario.ca. All of these are FOSS efforts, naturally.
Other FOSS-relevant talks were given by a whole panel of librarians demonstrating their various uses of the Drupal content management system, and I was astounded by the simplicity and elegance of LibX (which started as a Firefox plugin but is also available for IE). Karen Coombs from the University of Houston gave a great presentation on the extremely modular approach she takes there for library services, disavowing monolithic solutions and instead knitting her library web space together with contributions from both proprietary and FOSS components. And of course one of the talks I was most keen on hearing was that of John Fink of McMaster University and Dan Scott of Laurentian University on progress in the Conifer project, which brings together a number of Canadian university libraries in one very large Evergreen instantiation. Dan, of course, is no stranger to eIFL.net, having led the Evergreen training component of the eIFL-FOSS ILS project working in Armenia earlier this year. The news on Conifer is that progress is going well and the current expected date for all of these libraries to "go live" with Evergreen is the spring of 2009.
Evergreen did tend to be ever present at this conference. But other FOSS ILSs were also heard from. At least one group of public libraries located in the Ontario hinterland has decided to band together and share expertise on Koha.
Of course this conference wasn't entirely about or for FOSS in libraries. Access2008 is a conference for library technologists and there were lots of other solutions being canvassed. But perhaps it is only human for the most exciting buzz to come from IT solutions that librarians are creating for themselves that they can share with their peers. Thus one appeal of open source, perhaps.
I haven't followed news of mass digitization projects closely so perhaps I was the only one astounded by the talk by Jonathan Bengtson and Sian Meikle of the University of Toronto on the mass digitization project going on there. I confess I had no idea of the scale of this. It is immense. Literally thousands of books are being digitized on a daily basis. This is impressive even as merely a feat of organization. But the results were also impressive. Sadly this mammoth effort has a shaky foundation now that Microsoft has decided to end its funding. But it certainly gave us food for thought about what is possible with sufficient resources.
The conference was rounded off after two and a half days with an inspiring talk from Bob Young, famed local entrepreneur and co-founder of RedHat.com and Lulu.com. Bob is always good value as a speaker but I found him especially insightful today as he contrasted his life in technology firms with one of his current roles as owner of a professional sports franchise, the Hamilton Tiger-Cats.
I should finish this conference report with a mention of something that happened the day before the conference began: Hackfest. Hackfest is a day-long event in which librarians and programmers gather, divvy up a problem set, and set to work. You might say, it is the very spirit of what Access2008 is all about. You might also be wondering just how much real development work can actually get done in a day. The answer: lots! I was consistently impressed as the various groups that had worked together reported back during the conference. Here, for example, is Dan Scott's blog post on his Hackfest activity in which he was sorting out how to use Zotero with Evergreen. Cool!
My thanks to the organisers of Access2008. My batteries are re-charged. Full steam ahead!
by randy-m at July 01, 2009 11:41 AM
(Guest blog post from Amos Kujenga, National University of Science and Technology (NUST) Library, Zimbabwe)
The week before last, Misheck Nyaluso and I were at the University of Nairobi where we conducted a 5-day Greenstone Workshop (from Monday 22 - Friday 26 September). The event was sponsored by UNESCO (Nairobi Cluster Office), organised by the Kenya Information Preservation Society (KIPS), and held at the Jomo Kenyatta Memorial Library. Several people spoke at the opening ceremony and of interest was the presence of Mrs Jacinta Were, the eIFL country coordinator for Kenya. She was also part of a Steering Committee which in 2005 was involved in the initial feasibility study for the establishment of a Greenstone Support Organisation for Africa.
A total of about 24 participants (mostly librarians) were trained and given attendance certificates on the closing day. We borrowed a bit from the Lesotho workshop style by concentrating on the general DL issues on the first day. This was of great benefit to some of the participants who (believe it or not) thought Greenstone was some scanning software! It was also interesting to note how digital libraries have been so closely associated with scanning that people sometimes fail to realize that there are many, many collections that they can build from "born digital" material.
KIPS has to date produced a Greenstone CD-ROM of a list of abstracts of Theses and Dissertations about Kenya. In fact, most of the participants were drawn from organisations that contributed content towards this collection. There was also a demo of a Greenstone CD-ROM of articles on Gender Issues from Kenyan newspapers by the Kenya Indexing Project.
With KIPS playing a leading role, there's much potential for big time Greenstone projects in Kenya, more so since they've already set the pace by virtue of their existing collection. They expressed great zeal to establish a network of "Greenstoners" in Kenya and judging from the performance of some of the participants, the future looks bright. If KIPS can work closely with other organisations, e.g., the local eIFL consortium (Kenyan Libraries and Information Services Consortium - KLISC), much can be achieved to build an effective user and support network. We also continually encouraged the participants to play an active role on the sagreenstone discussion list, in addition to using the other technical support resources.
We also had Zoe Cormack from the Rift Valley Institute giving a brief talk on their work on the Sudan Open Archive project. This was an eye opener for many, who came away with ideas about how to handle complex scanning/digitisation issues.
The UNESCO representative, Mr Hezekiel Dlamini (to whom we're quite grateful for inviting us to assist in running this workshop) also indicated willingness to have an advanced workshop - which, however, would only be for those institutions that would have evidence of some work with Greenstone.
On the whole, we had an interesting week in Nairobi, not to mention the confusion on the roads! To quote one taxi driver, some tourist once exclaimed, "Anyone who can drive in Nairobi for a month without a scratch deserves an international driving license!"
by randy-m at July 01, 2009 11:37 AM

The code for our search front end has over time grown to a considerable size and we have started to suspect that the web site’s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end – especially when the underlying search engine Summa has proven to be blazingly fast.
There are a lot of things we could do better such as:
1. Optimizing the javascript code by trawling through the lot and removing redundancy as well as rewriting some of the methods to be more efficient.
2. A thorough cleanup of the css. There is a lot we can do here as we have loads of redundancy, classes which are not in use anymore and declarations which could be handled way cooler. Another thing I noticed is we like divs – loads of divs.
3. Taking a critical look at our numerous DOM transformations. Some of them are downright unnecessary.
4. General optimizing of the server side code. In fact this part isn’t all that bad but a general clean up once in a while doesn’t hurt anybody.
Because my summer holiday is coming up soon I have chosen to start with some lightweight stuff. I have tried out the newest version of the YUI Compressor, a tool to compress/minify javascript and css. As we don’t use minifying at the moment we should be able to benefit from it performance-wise. In order not to clutter up this post I will write up my experience with this in a separate post soonish.
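To give a feel for what minifying a stylesheet actually does, here is a deliberately naive sketch in Python. It is not how YUI Compressor works internally, and the function name and sample rule are made up for illustration. It only strips comments and whitespace, and it would mangle selectors or string values where that whitespace is significant.

```python
import re

def naive_minify_css(css: str) -> str:
    """Very rough CSS minifier: illustration only, not YUI Compressor's algorithm."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)  # drop /* ... */ comments
    css = re.sub(r"\s+", " ", css)                   # collapse runs of whitespace
    css = re.sub(r"\s*([{}:;,])\s*", r"\1", css)     # no spaces around punctuation
    css = css.replace(";}", "}")                     # last semicolon in a block is optional
    return css.strip()

# Hypothetical rule standing in for a real stylesheet.
sample = """
/* search box */
.search-box {
    margin: 0 auto;
    color: #333333;
}
"""
minified = naive_minify_css(sample)
print(minified)
print(len(sample), "->", len(minified), "characters")
```

A real minifier does a good deal more (and is much more careful), but comparing the two character counts on an actual stylesheet gives a quick first estimate of what there is to gain.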

by Jørn Thøgersen at July 01, 2009 11:15 AM
I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness.
Andy was looking for discussion on how to solve the problem of ambiguity in hash tags [...]
by Rob Styles at July 01, 2009 10:30 AM
I decided the other day that it would be useful to have a representative accession or two to play with. This way we could test for scalability and robustness (in dealing with different file formats, crazy filenames, and the like) of the various tools that will make up BEAM and also try out some of our ideas regarding packaging, disk images and such.
It isn't really possible to use a real accession for this purpose, mostly due to the confidentiality of some of our content. But I did want the accession to be as genuine as possible and here is how I did it. Any ideas for alternatives would be great!
The way I saw it, I needed three things to create the accession:
- A list of files and folders that formed a real accession
- A set of data that could be used - real documents, images, sound files, system files, etc.
- Some way of tying these together to create an accession modelled on a real one but containing public data
Fortunately Susan already had a list of files that made up a 2GB hard drive from a donor - created from the forensic machine - which I thought would be a good starting point. Point 1 covered!
Next question was where to get the data. My first thought was to use public (open licensed) content from the Web - obtaining images when required through Flickr, getting documents via Scribd, etc. This is still a good approach for certain accessions. However, looking at my file list I quickly realised I wasn't just dealing with nice, obvious "documents". The accession contained a working copy of Windows 95 for example, jammed full of DLLs and EXEs. It also contained files pulled from old PCW disks by the owner, with no extension, applications from older systems, and all manner of oddities - "~$oto ", "~~S", "!.bk!" are just some examples.
It occurred to me that I needed a more diverse source of files - most likely a real live system that could meet my request for a DLL while not revealing much about the original file. Where would I find this source? My own PC of course!
In theory my PC is dual-boot, running Ubuntu and Windows XP. The Windows XP partition is rarely used (I've nothing against XP, it just isn't so good a software development environment as Linux), but it struck me it'd make an excellent source of files, even if it was a version of Windows some way down the tracks from 95. By pulling files from my Windows disk (mounted by Ubuntu) I could, hopefully, create a more representative accession with a few more problems to solve than just "document" content.
(I also thought I could try creating a file system with a representative set of files to choose from - dlls from Windows 95 disks, etc. - but that would mean some manual collation of said files. This may be where I go next!).
So, 1 and 2 covered, what I needed next was a way to tie the file list to the data. I decided to use the file extension for this. For example, if the file list contains:
C:\WINDOWS\SYSTEM32\ABC.DLL
I wanted to grab any file with a ".DLL" extension from my data source (the XP disk): any random file, rather than one that matched the accession, because the randomness is likely to cause problems when it comes to processing this artificial accession, and problems are what we really need to test something.
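The extension-matching idea can be sketched in a few lines of Python. This is only an illustration of the lookup, not the Java code described in this post: the function names are invented, and it walks the source directory directly rather than querying an index.

```python
import os
import random
from collections import defaultdict
from pathlib import PureWindowsPath

def index_by_extension(root):
    """Map lower-cased extension -> list of file paths found under root."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            index[ext].append(os.path.join(dirpath, name))
    return index

def pick_stand_in(accession_path, index, rng=random):
    """Return a random source file sharing the accession entry's extension, or None."""
    # The accession list holds Windows-style paths, so parse them as such
    # even when this runs on Linux.
    ext = PureWindowsPath(accession_path).suffix.lower()
    candidates = index.get(ext)
    return rng.choice(candidates) if candidates else None
```

Called as `pick_stand_in(r"C:\WINDOWS\SYSTEM32\ABC.DLL", index_by_extension("/media/xp"))` (the mount point is hypothetical), it returns some ".dll" file from the source disk, or `None` when the source has nothing with that extension. Entries with no extension at all are matched against other extensionless files.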
This suggested I needed a way to ask my file system "What do you have that has '.dll' at the end of the path?". There were lots of ways to do this - and here is where Linux shines. We have 'find', 'locate', 'which', etc. on the command line to discover files. There is also 'tracker' that I could have set indexing the XP filesystem. In the end I opted for Solr.
Solr provides a very quick and easy way to build an index of just about anything - it uses Lucene behind the scenes. (I like the way that almost rhymes!) If you're unfamiliar with either, then find out all about them quickly! In short, you tell it which fields you want to index (and how you want them indexed) and create XML documents that contain those fields. POST these to the Solr update service and it indexes them there and then.
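For anyone who hasn't seen it, the "create XML documents and POST them" step looks roughly like this. It is a sketch in Python rather than the Java used for this work; the field names ("id", "extension") and the URL are assumptions that have to match your own schema and install.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: adjust to wherever your Solr update handler actually lives.
SOLR_UPDATE_URL = "http://localhost:8080/solr/update"

def build_add_xml(docs):
    """Serialise a list of {field: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")

def post_xml(payload, url=SOLR_UPDATE_URL):
    """POST an XML message to the update handler and return the raw response."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()

# Hypothetical document: one file from the mounted source disk.
payload = build_add_xml([
    {"id": "/media/xp/WINDOWS/system32/abc.dll", "extension": "dll"},
])
# post_xml(payload)        # send the document for indexing
# post_xml(b"<commit/>")   # commit so the document becomes searchable
```

The two POST calls are commented out so the snippet runs without a Solr instance; building the payload is the part worth checking first.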
I installed the Solr Web app., tweaked the configuration (including setting the data directory because no matter what I did with the environment and JNDI variables, it kept writing data to the directory from which Tomcat was started!), and then started posting documents for indexing to it. The document creation and POSTs were done with a simple Java "program" (really a script, and I could've used just about any language, but since we're mostly using Java and I'm trying to de-rust my Java skills, I figured why not do this with Java too). Indexing around 140,000 files took about 15 minutes (I've no idea if that is good or not).
(Renhart suggested an offshoot of this indexing too - namely the creation of a set of OS profiles, so that we can have a service that can be asked things like "What OS(es) does a file with SHA-1 hash XYZ belong to?" - enabling us to profile OSes and remove duplicates from our accessions).
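A possible shape for that OS-profile lookup, sketched in Python: hash every file on a known install and remember which OS each hash was seen in. The function names, OS labels and mount point are all hypothetical, and a real service would want persistent storage rather than an in-memory dictionary.

```python
import hashlib
import os
from collections import defaultdict

def sha1_of_file(path, chunk_size=1 << 20):
    """SHA-1 of a file's contents, read in chunks to cope with large files."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def add_os_profile(profiles, os_name, root):
    """Record every file under root as belonging to os_name."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            profiles[sha1_of_file(os.path.join(dirpath, name))].add(os_name)

def which_oses(profiles, sha1_hex):
    """Answer: which OS(es) does a file with this SHA-1 belong to?"""
    return sorted(profiles.get(sha1_hex, ()))

profiles = defaultdict(set)
# add_os_profile(profiles, "Windows XP", "/media/xp")   # hypothetical mount point
```

Any file in an accession whose hash comes back with a non-empty answer is a candidate for being treated as stock OS content rather than the donor's own material.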
The final step was to use another Java "program" to kludge together the list of files in the accession with a lookup on the Solr index and some file copying. Then it is done - one accession that mirrors a real live file structure, contains real live files, but none of those files are "private" or a problem if they're lost. Even better, because we used more recent files, the accession is now 8GB rather than 2GB, aligning more with what we'd expect to get in the future.
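The copying step might look something like the following Python sketch. It is not the Java program itself: `choose_stand_in` is a hypothetical callable standing in for whatever does the lookup, and the layout (drive letter as a top-level folder) is just one reasonable choice.

```python
import shutil
from pathlib import Path, PureWindowsPath

def mirrored_target(accession_path, output_root):
    r"""Turn 'C:\WINDOWS\ABC.DLL' into output_root/C/WINDOWS/ABC.DLL."""
    win = PureWindowsPath(accession_path)
    drive = win.drive.rstrip(":")                         # 'C:' -> 'C'
    parts = win.parts[1:] if win.anchor else win.parts    # drop the 'C:\' anchor
    return Path(output_root, drive, *parts)

def build_accession(file_list, choose_stand_in, output_root):
    """Copy a stand-in for each listed path; return the entries with no match."""
    unmatched = []
    for accession_path in file_list:
        source = choose_stand_in(accession_path)
        if source is None:
            unmatched.append(accession_path)
            continue
        target = mirrored_target(accession_path, output_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)   # contents only; the name comes from the list
    return unmatched
```

Returning the unmatched entries rather than failing on them makes it easy to see which oddball extensions the source disk could not supply.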
Hooray! Now gotta pack it into disk images and start exploring processing!
Should anyone be interested, the source code is available for download.
by pixelatedpete (pixelatedpete@gmail.com) at July 01, 2009 10:30 AM
The MODS Editorial Committee is looking for community input on geospatial information in MODS.
In considering changes for future versions of MODS, the MODS/MADS Editorial Committee is starting to think about how to better handle geospatial information. Detailed geospatial information in the form of coordinates, etc. is becoming more and more common and can promote many innovative user interactions with resources. Currently MODS has poor support for this information.
The committee would like to bring together use cases for supporting geospatial access to resources from MODS and/or MADS implementations. We are interested both in use cases that you already have in your MODS/MADS implementation and that any local geospatial experts you have access to can provide, to help us inform how MODS and/or MADS should evolve to better handle this information. It should be noted this discussion came to the Editorial Committee from the more specific geospatial elements (latitude/longitude, equinox/epoch) in RDA, although we want to look beyond RDA for guidance in this decision.
So far, we have identified the following use cases for geospatial data:
- To allow resources with a geospatial component (interpreted widely) to be plotted on an interactive map-based interface
- Interactively overlaying different maps, including aerial photographs, digitized historic maps, and current maps in a GIS environment
- To index coordinate data about the geographic coverage of a resource for retrieval purposes
- To index coordinate data about the geographic origin of a resource for retrieval purposes
What others can you provide?
Are there more specific use cases both for geospatial *coverage* (what a resource is about or represents) and geospatial *origin* (where a resource is from, for example, a soil sample)? This distinction seems important but it would be useful to understand what is done differently in each case.
There is some question as to whether the appropriate place for this information is MODS or MADS - thoughts on this issue? Should MODS/MADS be looking to embedding or referencing other standards for this information, and, if so, which and where? What is the best balance between functionality (and potentially complexity) and ease of creation/maintenance/use?
We look forward to hearing discussion on this issue - it's a complex but important one that will benefit from community contribution.
by David (noreply@blogger.com) at July 01, 2009 10:17 AM
A few years ago I wrote one of my Library Journal "Digital Libraries" columns on the phenomenon of "flow" ("Hustle and Flo...
July 01, 2009 03:45 AM
June 30, 2009
A brief personal note: I have left my job at Ball State University so I can have more time to pursue my career as a freelance web developer.
For about two years, now, my wife and I have been building up our business, Adelie Design, Stephanie doing the design, me doing the coding. Business is good—good enough that working nights and weekends isn’t enough anymore (and doesn’t leave enough time to see the family!). So I’m “retiring”, which is to say that I’ll be staying home and working 40 fewer hours each week.
I’m glad to be leaving, but it’s been a pretty good four years at Ball State. To my colleagues and friends there: thank you. I hope to remain involved in libraries, and especially the code4lib community. Hopefully I’ll have some free time I can devote to “fun” projects.
If you, reader, know of anyone needing a web developer or designer, please have them contact me.
by Jonathan Brinley at June 30, 2009 08:00 PM
New legislation was introduced in the U.S. Senate last week to support the publication of federally-sponsored research results under open access terms.
Sponsored by Senator Lieberman of Connecticut and co-sponsored by Senator Cornyn of Texas, it mandates open access to author pre-print versions with peer review changes in federally-run repositories within six months of publication. Designated S.1373, it is nearly identical to the Federal Research Public Access Act that these two senators introduced in 2006 (as S.2695), which ultimately died in committee. The 2006 version was supported by a wide variety of organizations including the American Library Association, as tracked by the Alliance for Taxpayer Access (ATA).
In his statement on the floor of the Senate introducing the bill, Senator Cornyn described the benefits of the legislation:
Our bill will ask all Federal departments and agencies that invest $100 million or more annually in research to develop a public access policy. Our goal is to have the results of all government-funded research to be disseminated and made available to the largest possible audience. By speeding access to this research, we can help promote the advancement of science, accelerate the pace of new discoveries and innovations, and improve the lives and welfare of people at home and abroad.
The practical effect of the legislation would be to endorse the NIH open access policy, apply it to a wider array of departments, and run counter to U.S. Representative John Conyers’s proposed “Fair Copyright in Research Works Act” (discussed earlier on DLTJ). The differences between the 2006 and 2009 versions of the bill are shown below. The major change is the exclusion of progress reports at meetings or conferences from the open access provisions, plus oversight by additional committees in the U.S. House and Senate. ATA released a statement that supports this version of the bill.
| Line 1: | Line 1: |
| - | 109th CONGRESS | + | 111th CONGRESS |
| | | | |
| - | 2d Session | + | 1st Session |
| | | | |
| - | S. 2695 | + | S. 1373 |
| | | | |
| | To provide for Federal agencies to develop public access policies relating to research conducted by employees of that agency or from funds administered by that agency. | | To provide for Federal agencies to develop public access policies relating to research conducted by employees of that agency or from funds administered by that agency. |
| Line 9: | Line 9: |
| | IN THE SENATE OF THE UNITED STATES | | IN THE SENATE OF THE UNITED STATES |
| | | | |
| - | May 2, 2006 | + | June 25, 2009 |
| | | | |
| - | Mr. CORNYN (for himself and Mr. LIEBERMAN) introduced the following bill; which was read twice and referred to the Committee on Homeland Security and Governmental Affairs | + | Mr. LIEBERMAN (for himself and Mr. CORNYN) introduced the following bill; which was read twice and referred to the Committee on Homeland Security and Governmental Affairs |
| | | | |
| | A BILL | | A BILL |
| Line 21: | Line 21: |
| | SECTION 1. SHORT TITLE. | | SECTION 1. SHORT TITLE. |
| | | | |
| - | This Act may be cited as the `Federal Research Public Access Act of 2006‘. | + | This Act may be cited as the `Federal Research Public Access Act of 2009‘. |
| | | | |
| | SEC. 2. FINDINGS. | | SEC. 2. FINDINGS. |
| Line 79: | Line 79: |
| | (d) Exclusions- Each Federal research public access policy shall not apply to– | | (d) Exclusions- Each Federal research public access policy shall not apply to– |
| | | | |
| | + | (1) research progress reports presented at professional meetings or conferences; |
| | | | |
| - | (1) laboratory notes, preliminary data analyses, notes of the author, phone logs, or other information used to produce final manuscripts; | + | (2) laboratory notes, preliminary data analyses, notes of the author, phone logs, or other information used to produce final manuscripts; |
| | | | |
| - | (2) classified research, research resulting in works that generate revenue or royalties for authors (such as books) or patentable discoveries, to the extent necessary to protect a copyright or patent; or | + | (3) classified research, research resulting in works that generate revenue or royalties for authors (such as books) or patentable discoveries, to the extent necessary to protect a copyright or patent; or |
| | | | |
| - | (3) authors who do not submit their work to a journal or works that are rejected by journals. | + | (4) authors who do not submit their work to a journal or works that are rejected by journals. |
| | (e) Patent or Copyright Law- Nothing in this Act shall be construed to affect any right under the provisions of title 17 or 35, United States Code. | | (e) Patent or Copyright Law- Nothing in this Act shall be construed to affect any right under the provisions of title 17 or 35, United States Code. |
| Line 93: | Line 95: |
| | (A) the Committee on Homeland Security and Governmental Affairs of the Senate; | | (A) the Committee on Homeland Security and Governmental Affairs of the Senate; |
| | | | |
| - | (B) the Committee on Government Reform of the House of Representatives; and | + | (B) the Committee on Oversight and Government Reform of the House of Representatives; |
| | + | (C) the Committee on Science and Technology of the House of Representatives; |
| | + | (D) the Committee on Commerce, Science, and Transportation of the Senate; |
| | + | (E) the Committee on Health, Education, Labor, and Pensions of the Senate; and |
| | | | |
| - | (C) any other committee of Congress of appropriate jurisdiction. | + | (F) any other committee of Congress of appropriate jurisdiction. |
| | | | |
| | (2) CONTENT- Each report under this subsection shall include– | | (2) CONTENT- Each report under this subsection shall include– |
Gavin Baker at Open Access News has some things to watch as the bill makes its way through the legislative process.
Post from: Disruptive Library Technology Jester
Federal Research Public Access Act Reintroduced
by the Jester at June 30, 2009 07:29 PM
"Papers Past contains more than one million pages of digitised New Zealand newspapers and periodicals. The collection covers the years 1839 to 1932 and includes 52 publications from all regions of New Zealand. "
by keyvowel at June 30, 2009 07:17 PM
"NASA is taking the rare step of reaching out to the public for help. The space agency is looking for the best way to analyze and electronically catalog a precious collection of notes that chronicle the early history of the human space flight program."
by keyvowel at June 30, 2009 06:51 PM
For those of us, like me, who missed it in the LITA Update last week:
Friday, July 10, 2009
LITA Happy Hour
5:30 pm - 8:00 pm, Potter’s Lounge, Palmer House
Please join the LITA Membership Development Committee and members from around the country for networking, good cheer, and great fun! Expect lively conversation and excellent drinks. Cash bar.
by AaronDobbs at June 30, 2009 04:41 PM
Annual Conference Highlights for Those Attending
All programs and meetings details
LITA BIGWIG gCal
Friday Evening with LITA
Friday, July 10, 2009
LITA 101: Open House
4:00 pm - 5:30 pm, Water Tower Place in the Palmer House Hotel
LITA Open House is a great opportunity for current and prospective members to talk with Library and Information Technology Association (LITA) leaders and learn how to make connections and become more involved in LITA activities.
Andrew Pace, LITA President; Donald Lemke, LITA Membership Development Committee Chair; Holly Yu, LITA Interest Group Coordinator; and Scott Muir, LITA Committee Coordinator and many other LITA leaders will be present.
(and the reason we all get to annual on Friday night)
LITA Happy Hour
5:30 pm - 8:00 pm, Potter’s Lounge, Palmer House
Please join the LITA Membership Development Committee and members from around the country for networking, good cheer, and great fun! Expect lively conversation and excellent drinks. Cash bar.
Sunday Afternoon with LITA
Sunday, July 12, 2009
Top Technology Trends
1:30 pm - 3:00 pm, the Grand Ballroom at the InterContinental Hotel
This program features our ongoing roundtable discussion about trends and advances in library technology by a panel of LITA technology experts. The panelists will describe changes and advances in technology that they see having an impact on the library world, and suggest what libraries might do to take advantage of these trends.
LITA Awards and Scholarships Reception/Ceremony
3:00 pm - 4:00 pm, the Empire Ballroom at the InterContinental
Presentation of LITA Awards and Scholarships. John Blyberg will receive the Brett Butler Award for entrepreneurship, Bill Misho will receive the Frederick G. Kilgour Award for research, Meredith Farkas will receive the Library High Tech award for communications in continuing education, and Michael Silver will receive the Student Writing Award.
LITA President’s Program: Make Stories, Tell Stories, Keep Stories
4:00 pm - 5:30 pm, the Grand Ballroom at the InterContinental
In 2007, Erik Boekesteijn, Jaap van de Geer, and Geert van den Boogaard took off from DOK Delft Public Library to embark on a North American tour of libraries en route to the Internet Librarian Conference. Their popular video tour captured the passion and enthusiasm of the people working on library innovation in the States, a theme that they have recently repeated in Australia. Now it’s time to tell their story. Come learn about innovations from our library colleagues in the Netherlands and join Erik Boekesteijn (DOK Delft Public Library), Jenny Levine (The Shifted Librarian), and Michael Stephens (Tame the Web) as they discuss the current state and future of library innovation and the opportunities to learn from the vast network of international stories about library innovation. The panel discussion will be followed by a book signing, Shanachie Tour – a library roadtrip across America, with all three authors present.
Speakers: Erik Boekesteijn, Jaap van de Geer, Geert van den Boogaard, Jenny Levine, and Michael Stephens
Many other excellent programs are being offered as well. Get a complete list with descriptions and locations provided.
Annual Conference Highlights for Those Attending or Not
LITA is offering two pre-conferences on Friday, July 10, from 9:00am to 5:00pm in Chicago. You do not need to attend Annual Conference to register for a LITA preconference. Also, please note that LITA will accept registrations on site. The registration rate for each is: LITA Member $235, ALA Member $315, or Non-Member $380.
A Thousand Words: Taking Better Photos for Telling Stories in Your Library
9:00am to 5:00pm, McCormick Place, W-475
Speakers: Cindi Trainor, joined by Helene Blowers and Michael Porter
In this hands-on workshop, learn techniques for shooting and editing better photos, camera settings that make for the best photos, and basics of editing an image. Learn how to capture library events more effectively and artistically, take and select better photos for websites and promotional materials. Licensing work and finding others via Creative Commons will also be covered. Participants should bring a digital camera and laptop; familiarity with moving photos from camera to computer is a must.
Creating Library Web Services: Mashups and APIs
9:00am to 5:00pm, McCormick Place, W470a
Speaker: Karen Coombs, University of Houston
del.icio.us subject guides, Flickr library displays, YouTube library orientation; with mashups and APIs, it’s easier to bring pieces of the web together with library data. Learn what an API is and what it does, the components of web services, how to build a mashup, how to work with PHP, and how to create web services for your library. Participants should be comfortable with HTML markup and have an interest in learning about web scripting and programming and are encouraged to bring a laptop for hands-on participation.
[cribbed from LITA Update on 6/19]
by AaronDobbs at June 30, 2009 04:39 PM
I just returned from the 2009 m-Libraries conference at the University of British Columbia in Vancouver (which–let’s get this out of the way up front– was an exceptionally beautiful and well-organized venue). The topic of the conference, the second annual, was the influence of the rapid expansion of mobile technology on libraries.
While those of us here at the Digital Experience Group have plenty on our plate with the upcoming overhaul of NYPL.org and more, a good part of our “charter” is to anticipate ways that changes in technology can keep the Library relevant in the future. It was in this spirit that I approached the conference, with an ear towards how other institutions are dealing with one massive technological upheaval: the explosive growth of mobile technologies.
As a good conference on new and speculative topics should do, m-Libraries 2009 left me with more questions than answers, but fortunately they are much better questions than I had when I arrived. I have already seen a few blog posts covering the conference session by session (see Paul Coyne, Vicki Owen, and the official conference blog, among others), so I won’t post another. Rather, in the spirit of pushing the conversation forward, I’d like to enumerate the questions I feel went incompletely answered.
What is mobile technology?
“Mobile” is one of those large amorphous concepts that means something different to different people, and all attendees seemed to be coming at it in different ways. It’s just too big. Are we talking about just phones? Text? Mobile web? How about Wifi? Is it enough to just be wireless? If I have a satellite Internet connection in my shack in the middle of the desert but I never move, am I a mobile learner? In a sense, this question of “what is mobile?” was implied but never addressed head-on, and I naively wanted to try to get people to answer it. It’s not really necessary to have an answer to this question, but I think it would have been a good plenary or workshop topic leading to interesting discussion.
What is the difference between the mobile user, the web user, and the physical user?
I would have liked to have seen more research about the patterns of mobile use, and specifically more about the movement between modes. There was a fair amount of talk about “the mobile user”, yet the mobile user is also the wired user as well as the offline user. The interesting question is where those modes overlap. This can be observed anecdotally just by walking into the Rose Reading Room–our most physical and traditional of spaces–any day of the week and observing the sheer variety of mobile devices spread out in front of each patron. This is fertile ground for future research.
What is the context of mobile tools?
A related question. The applications of mobile technology can be brought to bear on the library in any number of ways, and at all levels. There is neither a “right” way nor a necessary way to use mobile technology. In session after session, presenters stressed the need to listen to users and meet their expectations rather than expect them to adapt to the library’s tools.
One of my favorite sessions was by a pair of reference librarians from Temple University who described their mobile reference-on-demand project as a “complete failure”. Their plan was to let students who might be wandering in the stacks of their (dark, low-ceilinged, 1960s Brutalist) library send a text to the reference desk when they needed help. Cheekily titled “Ask Us Upstairs”, the project fared dismally. Yet they offered a half-dozen reasons why things went wrong, and in the process illuminated the intimacy with which mobile tech operates: Students didn’t like being approached in the stacks by people they didn’t know, the stacks were considered “student-occupied territory”, voice calls were a better fit than text in the stacks, students did very little browsing in the stacks and were already very directed and intentional before going into what was considered a slightly claustrophobic space, and so on. By sharing the problems they ran into, the Temple folks perfectly illustrated the degree of psychological and emotional design challenges that need to be overcome to make a successful mobile application.
Are we doing enough with SMS text messages?
A common refrain of the conference was that mobile doesn’t necessarily mean mobile internet: SMS text messages are still the standard means of communication for a vast majority of the wired world, due to both the lack of mobile data infrastructure (in the developing world) and the prohibitive expense of data plans for students and low-income users (in the developed world). While there’s understandably lots of excitement about iPhone apps and the mobile Web, a strong takeaway was that we should not undersell the utility of text-based reminders and other tools.
Do we see this coming?
The scale of the mobile revolution is staggering, and hearing about how rapidly the developing world is becoming connected was a real “We’re gonna need a bigger boat” moment for many in the audience. Ken Banks of the FrontlineSMS project pointed out in a highly engaging talk that 65% of the world’s population now has access to a mobile phone (with more coming online all the time), an absolutely staggering number. As technologies start to piggyback on those connections, there is serious potential for the mobile network to be a disruptive technology (to use Clay Christensen’s term) in any number of ways. What does society look like when a majority of citizens carries an always-on global communication device in its pockets? For The Library, we need to have mobile tech on our radar both in terms of reaching new audiences (perhaps a future source of distance learning resources) and expectations of the tools we deliver to our existing audiences.
Where do we start?
We’ve already (sort of) launched NYPL Mobile, but that’s still a beta project. There’s a lot of room for experimentation, and the rest of the world is driving innovation at breakneck pace. Mobile traffic on NYPL.org, while still a small part of the overall, has increased over 300% in the past year without any concession to mobile web surfing on our part. At a certain point, a critical mass of dedicated local users (and our mobile users are overwhelmingly local; there’s that intimacy thing again) are going to expect state-of-the-art mobile services for exploring our collections and physical spaces, and not as a cool luxury, but as a standard practice.
As we plan our strategy for the next couple of years, we need to keep an eye on developments in this field to ensure that we don’t look irrelevant to our most dedicated users.
by Michael Lascarides at June 30, 2009 03:13 PM
As much as I’m sometimes frustrated by our common inherited legacy cataloging practices, I actually do think the cataloging theory developed by Lubetzky, Svenonius, Cutter, and others is still useful — sometimes you just need to ‘translate’ it to the modern environment.
I’ve been thinking about how having persistent unique identifiers (bib IDs) for our records is really important — but not generally prioritized in some of our legacy cataloging practice. There are a bunch of ways to explain why this is important (and it’s kind of obvious to the CS-perspective-inclined).
But I realized another way goes back to some language used in my cataloging class. A cataloging record is called a ’surrogate’ for the physical item described. That’s exactly what it is, even more so in the digital age: it allows the physical item to be ‘projected’ into the digital environment as a digital object which is a ’surrogate’ for the physical object (or sets of objects, depending on context you consider it in) it represents.
Perhaps this helps explain, in cataloging theory language, why a persistent bib ID is important. Because the record is a surrogate for the physical object in the digital environment, we want to be able to link to the surrogate in different ways, from simply bookmarking it to building more complicated ‘semantic’ relationships based upon it. All of that depends on having a persistent identifier (a persistent bib ID) for the surrogate. Changing the bib ID of the surrogate in the digital environment in unpredictable ways would be analogous to periodically changing where the physical item is physically shelved in unpredictable ways! The internal unique identifier for the surrogate is essentially its digital “location”.
[That's a bit of an oversimplification -- giving the digital surrogate a reliable digital 'location' requires some layering on top of the unique internal ID, to give it a unique persistent URI too. But the pre-requisite for that is a persistent unique internal ID.]
[And, incidentally, for the semantic web geeks reading, this gets at some of my dissatisfaction with this focus on 'real world objects' vs 'documents' or whatever they're currently calling the second class. I don't think it's at all a clear distinction, and can often get confusing right quick, and I think it's probably a mistake to rely on such a confusing distinction for crucial parts of your 'specs'. A cataloging record is a 'web document', surely, but it's also a surrogate (not JUST a 'description') for a real world object. Sure, we can split hairs and talk about how to handle that. But the fact that it gets so confusing and abstract and hair-splitting and subject to debate worries me and makes me suspicious of relying on such a distinction for describing how to 'do business' in the sem web.]
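To make the layering of a persistent URI on top of a persistent internal ID a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the base URI, the bib ID format, and the idea of keeping a small table of retired IDs (say, after duplicate records are merged) are hypothetical and not taken from any real catalog.

```python
# Hypothetical base for public record URIs; not from any real catalog.
BASE_URI = "https://catalog.example.org/bib/"

# Hypothetical table of retired bib IDs and the IDs that replaced them,
# for example after two duplicate records are merged.
RETIRED_IDS = {"b1000042": "b1000007"}


def resolve_bib_id(bib_id: str) -> str:
    """Follow retirements until we reach the current ID for a record."""
    seen = set()
    while bib_id in RETIRED_IDS:
        if bib_id in seen:
            raise ValueError(f"redirect loop at {bib_id}")
        seen.add(bib_id)
        bib_id = RETIRED_IDS[bib_id]
    return bib_id


def persistent_uri(bib_id: str) -> str:
    """Build the public URI for a record from its internal bib ID."""
    return BASE_URI + resolve_bib_id(bib_id)
```

The point of the sketch is only that old IDs keep resolving: a bookmark made against a retired ID still lands on the surviving record, which is the digital equivalent of leaving a "moved to" note on the shelf.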
Posted in General

by jrochkind at June 30, 2009 03:02 PM
The Library of Congress now has content on iTunes U. iTunes U is the area of the iTunes Store which offers open educational audio and video content from universities and other educational institutions. The Library’s initial iTunes U content includes historical videos such as original Edison films and a series of 1904 films from the Westinghouse Works, as well as event videos such as author talks from the National Book Festival, the "Books and Beyond" series, discussions with curators, and lectures from the Kluge Center. The audio content includes Library podcast series such as "Music and the Brain," slave narratives from the American Folklife Center, and interviews with authors from the National Book Festival. The collection also includes Library-produced classroom and educational materials, such as courses from the Catalogers’ Learning Workshop.
You must be running iTunes to be able to view the LoC content.
by Leslie Johnston (noreply@blogger.com) at June 30, 2009 03:35 PM
I'm not the most diligent of bloggers, by any means, and the contents of this blog are pretty narrow in terms of topics. Mostly I have written about Google books, about RDA and other library metadata developments, and recently about OCLC. Although each post is probably offensive to someone out there, the total number of enemies that I can make is probably quite small -- and compared to some bloggers nearly infinitesimal.
So imagine my surprise this morning when I received a notice from Google saying that my blog had been marked as Spam, and would be removed if I didn't take action. There are two ways that your blog can get the Spam qualification: 1) if it is caught by Google's automatic spam detectors and 2) if someone clicks on the "flag blog" link and reports it as spam.
Given the technical nature of my posts, I find the first possibility highly unlikely. This means that I must consider the latter. I hope it is only coincidence that my latest post (and one that has lingered here as the latest for a bit too long, perhaps) is a critique of OCLC and its record use policy. I would love to be able to say that I know that OCLC would not stoop to this kind of censorship, but unfortunately I have experience to the contrary.
Earlier this year I arrived in Dublin only to be refused admittance to a meeting that they had agreed that I could attend (and that I had flown all of the way to Ohio to attend). Then, a few months ago, when OCLC was told that I would be writing an article for InfoToday on their "web-scale service", the journal's editor received numerous phone calls from OCLC's press person voicing OCLC management's "concern" that I had been chosen to write the article. What the editor was supposed to do about that concern wasn't articulated, but she kept me on the story and even resisted their request to review the article before it was published. It was a dramatic couple of days, and I'm very grateful to her for her unwavering defense of freedom of the press.
I admit that it is at least equally likely that some random person with a cosmic grudge decided to click on "this is spam," but you may understand why I'm beginning to be a bit paranoid, and wondering if I don't have real enemies.
by Karen Coyle (noreply@blogger.com) at June 30, 2009 01:48 PM
Sorry everybody, but I've been attacked by spammers of late, and have had to switch moderation on, at least for now, but I'm terribly liberal and will approve every single message that talks badly of my, uh, bum. When things calm down again I'll turn it off I'm sure, but I seriously wish Blogger.com had a better comments system (or even a better way to kill spam from an infected site; the current way is just absolute rubbish and painful!). Or maybe this is another sign from below to switch to WordPress which I've got a half-finished Topic Maps plugin for and integrates against my shiny new xSiteable Framework 3.0. Hmm.
by Alexander Johannesen (noreply@blogger.com) at June 30, 2009 10:46 AM
I've been inspired to write this post based on a discussion on the Code4Lib list about embedding HTML in MARC records. Even worse, perhaps, it turn...
June 30, 2009 05:15 AM
Short version: If the Web knows I like a TV show, why can’t my TV be more useful?
So I have just joined a Facebook group, “Spaced Appreciation Society“:
Basic Info
Type: Common Interest - Pets & Animals
Description: If you’ve ever watched (and therefore loved) the TV series Spaced, then come and pay homage to the great Simon Pegg and Jess Stevenson. “You f’ing plum”
Contact Details
Website: http://www.spaced-out.org.uk/
Location: Meteor Street
That URL is (as with many of these groups) from a site whose primary topic is the thing the group’s about. In this case, about a TV show. It’s even in the public page for that group:
<tr><td class="label">Website:</td>
<td class="data"><div class="datawrap"><a href="http://www.spaced-out.org.uk/" onmousedown="return wait_for_load(this, event, function() { UntrustedLink.bootstrap($(this), &quot;&quot;, event) });" target="_blank" rel="nofollow">http://www.spaced-out.org.uk/</a></div></td></tr>
If I search Google (Yahoo BOSS might be wiser, they have APIs) with:
link:http://www.spaced-out.org.uk/ site:wikipedia.org
It finds me:
http://en.wikipedia.org/wiki/Spaced
Although “link:http://www.spaced-out.org.uk/ site:dbpedia.org” doesn’t find anything, some URL rewriting gets me to:
http://dbpedia.org/page/Spaced
“Spaced is a British television situation comedy written by and starring Simon Pegg and Jessica Stevenson, and directed by Edgar Wright. It is noted for its rapid-fire editing, frequent dropping of pop-culture references, and occasional displays of surrealism. Two series of seven episodes were broadcast in 1999 and 2001 on Channel 4.”
dbpedia-owl:author
* dbpedia:Jessica_Hynes
* dbpedia:Simon_Pegg
dbpedia-owl:completionDate
* 2001-04-13 (xsd:date)
dbpedia-owl:director
* dbpedia:Edgar_Wright
dbpedia-owl:episodenumber
* 14
dbpedia-owl:executiveproducer
* dbpedia:Humphrey_Barclay
dbpedia-owl:genre
* dbpedia:Situation_comedy
dbpedia-owl:language
* dbpedia:English_language
dbpedia-owl:network
* dbpedia:Channel_4
dbpedia-owl:producer
* dbpedia:Gareth_Edwards
* dbpedia:Nira_Park
dbpedia-owl:releaseDate
* 1999-09-24 (xsd:date)
dbpedia-owl:runtime
* 24
dbpedia-owl:starring
* dbpedia:Jessica_Hynes
* dbpedia:Simon_Pegg
There are also links from here to Cyc (but an incorrect match) and to Freebase (to http://www.freebase.com/view/en/spaced).
Unfortunately, the Wikipedia “external links” section, with the URL for http://www.spaced-out.org.uk/ (marked “official, fan-operated site”), is not part of the DBpedia RDF export, I guess because it is not in an infobox. Extracting these external-link URLs, at least for the TV, Actor and Movie related sections of Wikipedia, might be worthwhile. And DBpedia would be useful for identifying the relevant subset to re-extract.
This idea of using such URLs as keys into Wikipedia/dbpedia data would also work with Identi.ca groups and others. In fact the matching might be easier in Identi.ca - I’m not sure how the Facebook APIs expose this stuff.
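The "URL rewriting" step mentioned above is small enough to sketch in Python. This is only an illustration of the one mapping shown in this post (the English Wikipedia article URL for Spaced to its DBpedia page); the function name is made up, and it assumes the article title carries over unchanged, which will not hold for every page.

```python
from urllib.parse import urlsplit

WIKI_PREFIX = "/wiki/"


def wikipedia_to_dbpedia(url: str) -> str:
    """Rewrite an English Wikipedia article URL to the matching DBpedia page URL.

    Assumes the article title can be reused as-is, as in the Spaced example.
    """
    parts = urlsplit(url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith(WIKI_PREFIX):
        raise ValueError(f"not an English Wikipedia article URL: {url}")
    title = parts.path[len(WIKI_PREFIX):]
    if not title:
        raise ValueError(f"no article title in URL: {url}")
    return "http://dbpedia.org/page/" + title
```

Chained after the link: search, this would turn a group's "Website" URL into a DBpedia key in two hops, with all the usual caveats about search results not being stable.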
Anyway, if a show is about to be broadcast that includes eg. an interview with dbpedia:Jessica_Hynes or dbpedia:Simon_Pegg I’d like to hear about it.
So… is there any way I can use BBC’s /programmes to get upcoming information about who will be on the radio or telly, in a way that could be matched against dbpedia URIs?
Edit: I should’ve mentioned that Facebook in particular also has a more explicit “is a fan of” construct, with Products, Celebs, TV shows and Stores as types of thing you can be a fan of. Furthermore these show up on your public page, eg. here’s mine. I’m certainly interested in using that data, but also in a model that uses general groups, since it is applicable to other sites that allow a group to indicate itself with a topical URL.
by danbri at June 30, 2009 04:17 AM
By: dempsey
Categories: General - systems and technologies• Libraries - systems and technologies• User experience
Layar created a ripple of interest a while ago. It is yet to be released. It is an application for Android based phones which will allow data from various partner resources to be 'layered' over the view through a camera phone. Partners discussed include banks (for ATMs), realtors, and a social network site with data about venues. They describe it as an augmented reality application: objects viewed through the camera may be augmented by data about those objects.
Layar is derived from location based services and works on mobile phones that include a camera, GPS and a compass. Layar is first available for handsets with the Android operating system (the G1 and HTC Magic). It works as follows: Starting up the Layar application automatically activates the camera. The embedded GPS automatically knows the location of the phone and the compass determines in which direction the phone is facing. Each partner provides a set of location coordinates with relevant information which forms a digital layer. By tapping the side of the screen the user easily switches between layers. This makes Layar a new type of browser which combines digital and reality, which offers an augmented view of the world. [Sprxmobile - Layar]

[image: layar.eu via Tito Sierra M-Libraries 09 ppt]
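The description above boils down to a bit of geometry: given the phone's position (GPS) and heading (compass), decide which points in a layer lie in front of the camera. The Python sketch below is not Layar's code, which I have not seen; it uses the standard initial-bearing formula, and the 60 degree field of view and the function names are assumptions.

```python
import math


def initial_bearing(lat1, lon1, lat2, lon2):
    """Compass bearing in degrees (0 = north, 90 = east) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def in_view(heading, bearing, field_of_view=60.0):
    """True if the bearing is within half the field of view of the heading.

    The 60 degree default is an assumed camera field of view.
    """
    # Signed difference folded into the range [-180, 180).
    diff = (bearing - heading + 180.0) % 360.0 - 180.0
    return abs(diff) <= field_of_view / 2.0


def visible_points(lat, lon, heading, points):
    """Names of the (name, lat, lon) points that fall inside the camera's view."""
    return [name for name, plat, plon in points
            if in_view(heading, initial_bearing(lat, lon, plat, plon))]
```

A real application would also filter by distance and sort by it, but the bearing test is the part that depends on the compass.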
Tito Sierra referenced Layar in his presentation about the WolfWalk project at the Second M-Libraries Conference in Vancouver last week. WolfWalk is a pilot project at NCSU, as I mentioned the other day. It is working on an iPod application which aims to create what Tito called a 'geomobile collection'. Here is what the project pages say:
A pilot project to create a mobile application that enables users to explore NC State campus history using a location-aware map-based interface. The application supports a map view (using Google Maps) with geotagged placemarks for approximately 60 major sites of interest on the NCSU campus, and a browse view for quickly locating a known site by name. Each site has several historical images associated with it that are sourced from NCSU Special Collections Research Center digital archives. [WolfWalk]


Tito discussed how they thought this would be of interest to alumni and was careful to describe it as a modest proof-of-concept. I thought it succeeded very well in demonstrating his contention that the challenge is not just to provide small-screen versions of digital collections but to leverage the capabilities of new mobile technologies to provide new ways of experiencing those collections. In this case, the collections augment the experience of the buildings on campus by providing historic context at the point of interaction; at the same time, the app provides a map-based approach to digital collections.
The screencast and powerpoint presentation are well worth a look. The WolfWalk pictures above are screenshots of the screencast.
Mlib09
June 30, 2009 02:45 AM
Back in March I wrote a longish post about “My personal journey into ebooks.” Things have since changed so I feel that I ought to add some commentary to those thoughts.
As a caveat, these comments only pertain to me, at least as intended. They may apply to you as an individual reader but I do not intend for them to be generalized.
I have for all intents and purposes currently quit reading ebooks on my Touch. None of the issues I mentioned in the original post are the issue though. Simply put …
I came to the realization that the circumstances in which I was using my Touch to read books were not good circumstances in which to do so. Other than as stated in my previous post, and to no greater extent, there are no interface issues that have brought about this change.
Context: I was reading books on my Touch during bus rides to and from campus, waiting for the bus at the end of the day, and at lunch. My bus rides are about 10 minutes long and my average and usual bus wait is 10 minutes.
Trying to read while watching for the correct bus or the correct stop does not make for quality reading. Perhaps if I had a longer bus ride reading on the bus would be better. But I don’t. So I quit.
[I have also not been reading much in the way of print lately either but for other reasons. I am trying to get back in the swing since between all the other things I have going on I do need to "relax" and sustained reading is good for that.]
Today I did start reading from my Touch again at lunch (The Importance of Being Earnest). Lunch is a longer sustained period than the bus waiting/riding and it is easier to choose my stopping point so retention is greatly improved. Also, truth be told, it is easier to read from the Touch at lunch than a print book. It lays flat and stays open with no problems. If I need to eat with my fingers it becomes a small problem but I eat at a place where I need a fork (or chopsticks) most days of the week.
I have no aversion to reading on my Touch at home if need be and I will on occasion. But then I also have several hundred print books here that need reading (a very conservative estimate).
I did read several more books than those mentioned in my earlier post before I quit using the Touch to do so. Assuming I can find more sources of free books for the Touch I imagine I will continue to use it for reading at times where I can have a semi-sustained reading experience but it is inconvenient to carry a print book.
So I guess the main point is I realized that the situations in which I was trying to read ebooks were generally not good for reading for me. It was the situations and not ebooks or the Touch itself that caused me to quit. I will just have to see where it goes from here.
by Mark at June 30, 2009 02:11 AM
LITA needs a system or process to gather, post, and share LITA sponsored programs presented at conferences and events online.
Charge:
To explore and recommend a systematic approach to gather and post LITA programs presented at events such as ALA Annual Conference, LITA National Forum, LITA Camp, etc.
- Identify the types of programs that are presented and which are most appropriate for online posting
- Identify other organizations such as ACRL, PLA, WebJunction, etc, who are currently providing this service to learn about their experiences
- How should the content be delivered online i.e. live webcast, produced in a studio, screencast, etc.
- Identify, evaluate, and analyze available systems
- Identify which systems are best for delivering each type of program
- Determine who should have access to what types of programs and how. This should include:
- Identifying the appropriate delivery method
- Who should have access? Should it be available to all LITA members, available to everyone, etc.?
- Should access be different for different user communities?
- Should there be a registration fee for certain types of programming?
Task Force:
The task force should include representatives from:
- LITA Program Planning Committee
- LITA National Forum Committee 2009
- LITA National Forum Committee 2010
- LITA Web Coordinating Committee
- LITA Education Committee
- LITA BIGWIG
Thanks to the following people for agreeing to serve on this task force:
- Aaron Dobbs, chair
- Melissa Shepard
- Anne Graham
- Cody Hanson
- Michael Witt
- Jenny Emanuel
- Kristine Feery
- David Ward
Timeframe:
The task force should submit their recommendations to the LITA Board of Directors no later than ALA Midwinter Conference 2010 and run a pilot project at the Annual Conference 2010.
by mfrisque at June 30, 2009 01:36 AM
June 29, 2009
Engineers and technologists generally resent needing to know anything about the law, because most often the lawyers are telling them they can't do something for some inane reason. For their part, many lawyers are surprisingly interested in technical matters, but even the most technically informed lawyers resent having to acknowledge that technology often trumps the law into irrelevance.
Today's announcement by the Open Knowledge Foundation of the release of version 1.0 of the Open Database License (ODbL) will create resentment in both professions: information technologists need to understand some of its complications, and lawyers will need to understand some technological limits of the license. In this post, I will try to articulate what some of the hard bits are.
The goal of the ODbL is to provide a means in which databases can be made widely available on a share-alike basis. Suppose for example, that you spent a lot of time assembling a cooperative of volunteers to compile a database of conferences and their hashtags. If you then made it available under the ODC Public Domain Dedication and License (PDDL), a commercial company could copy the database and begin competing with your cooperative without being obliged to contribute their additions and corrections to your effort. Under a share-alike arrangement, they would be obligated to make their derivative work available under the same terms as the original work. So-called "copyleft" licenses with share-alike provisions have proven to be very useful in software as the legal basis for Open Source development projects.
The difficulty with applying copyleft licenses to databases is that the open source licenses that implement them are fundamentally rooted in copyright, which cannot easily be applied to databases; hence the need for the work of the Open Knowledge Foundation. Usually, databases are collections of facts, and you can't use copyright to protect facts. However, it gets more complicated than that. In the US, it's also not possible to copyright collections of facts, which can in fact be copyrighted in Europe under the "Sweat of the Brow" doctrine.
So copyright protection (and thus licenses including GPL and Creative Commons) can be asserted on entire dataspaces, but that protection is invalid in the United States. What the ODbL does to address that issue is to invoke contract law to paper over the gaps created by international non-uniformity of copyright for databases. The catch is that contract law, and thus the share-alike provisions it carries in the ODbL, can only be enforced if there is agreement to the license by the licensee. That's the thing that causes such user-experience monstrosities as click-through licenses and the like. So pay attention, engineers and technologists (Linked Data people, I'm talking to you!): if the provisions of the ODbL are what you want, you'll also need to implement some sort of equivalent to the click-through license. If you expect to involve machines in the distribution of data, you'll need to figure out how to ensure that a human is somewhere in the chain so they can consent to a license, or at least you'll need to socialize the expectation of a license.
Pay attention too, you legal eagles. Be aware that mechanisms of the click-through ilk can be prohibitively expensive if implemented without thought about the full system design. The most valuable databases may have hundreds of millions of records and can be sliced and diced all sorts of ways, so you want to avoid doing much on a record-by-record basis. Also be aware that databases can be fluid. Legitimate uses will mix and link multiple databases together, and the interlinks will be a fusion that should not be judged a derived work of either of the source databases. Records will get sent all over and recombined without anyone being able to tell that they came from a database covered by ODbL.
Any organization considering the use of ODbL should study the criticisms of ODbL posted by the Science Commons people. My own view is that there are lots of different types of databases with different characteristics of size, application, and maintenance effort. ODbL provides an important new option for those situations where neither PDDL nor a conventional proprietary license will maximize benefits to the stakeholders in the database. But most of all, technologists and engineers need to consider the requirements for successful open licensing early on in the development of database distribution infrastructure.
More on Linked Data Business models to come.
by Eric Hellman (noreply@blogger.com) at June 29, 2009 10:57 PM
Issue 7, 2009-06-26. JAbbr is an online tool developed at Cornell University to help users decipher journal title abbreviations.
by bcarson at June 29, 2009 06:41 PM
Laurentian University is part of the Ontario Council of University Libraries (OCUL), and a user of the centrally hosted Ontario Scholars Portal SFX link resolver, so one of the things we needed when we migrated to Evergreen was a target parser for our link resolver. This is the target associated with Search the library catalogue that is the last resort when the resolver fails to turn up any full-text resources for a given OpenURL - so hopefully it won't need to be invoked too often, as we have a very rich set of full-text electronic resources at Laurentian University.
The code
Here is a quick implementation of a target parser that generates search URLs based on ISSN, ISBN, book title, or journal title. Pretty impoverished from an OpenURL perspective, but it maintains the same level of functionality from our previous system. In TargetParser/Evergreen/Conifer.pm I created a target parser called Evergreen::Conifer that implements a subset of the Parsers::TargetParser API for SFX as follows:
package Parsers::TargetParser::Evergreen::Conifer;
use Parsers::TargetParser;
use base qw(Parsers::TargetParser);
use strict;

sub getHolding {
    my ($this, $genRequestObj) = @_;
    my $objectType   = $genRequestObj->{'objectType'};
    my $ISBN         = $genRequestObj->{'ISBN'};
    my $eISBN        = $genRequestObj->{'eISBN'};
    my $ISSN         = $genRequestObj->{'ISSN'};
    my $eISSN        = $genRequestObj->{'eISSN'};
    my $CODEN        = $genRequestObj->{'CODEN'};
    my $bookTitle    = $genRequestObj->{'bookTitle'};
    my $journalTitle = $genRequestObj->{'journalTitle'};

    # Canonical search results URL for simple searches:
    # http://laurentian.concat.ca/opac/en-CA/skin/lul/xml/rresult.xml?rt=keyword&tp=keyword&t=0895-2779&l=105&d=2&f=&av=
    my $svc       = $this->{svc};
    my $egHost    = $svc->parse_param('eg_host');
    my $egLocale  = $svc->parse_param('eg_locale');
    my $egSkin    = $svc->parse_param('eg_skin');
    my $egOrgUnit = $svc->parse_param('eg_org_unit');
    my $egDepth   = $svc->parse_param('eg_depth');

    my $path = "http://${egHost}/opac/${egLocale}/skin/${egSkin}/xml/rresult.xml?l=${egOrgUnit}&d=${egDepth}";
    my $searchString = '&rt=keyword&tp=keyword&t=';

    if (defined($ISSN)) {
        if ($ISSN =~ m/x/i) {
            # Current indexer doesn't deal well with ISSNs containing an X, so break it up
            $ISSN =~ s/^(\d{4})-?(\d+)x/$1 -$2 x/i;
            $searchString .= $ISSN;
        } else {
            $searchString .= "\"$ISSN\""; # format 9999-9999 for MARC
        }
    }
    elsif (defined($ISBN)) {
        # Evergreen doesn't force ISBNs to be stripped of hyphens, so take whatever
        $searchString .= "\"$ISBN\"";
    }
    elsif (defined($journalTitle)) {
        # Restrict searches to title index, with bibliographic level = s
        $searchString .= "ti:${journalTitle}&bl=s";
    }
    elsif (defined($bookTitle)) {
        # Restrict searches to title index, with bibliographic level = m
        $searchString .= "ti:${bookTitle}&bl=m";
    }

    return ($path . $searchString);
}

1;
And here's the help that I added to the corresponding Conifer.hlp file:
General Information
Target - LOCAL_CATALOGUE_EVERGREEN_CONIFER
Service - getHolding
Parser - Evergreen::Conifer
Information needed in the Target Service:
In the PARSE_PARAM field, replace the following information:
eg_host = $$LOCAL_CATALOGUE_SERVER
eg_locale = Locale (en-US, en-CA, fr-CA, etc)
eg_skin = algoma, default, lul, nohin, uwin
eg_org_unit = 103, 1, etc
eg_depth = 0, 1, 2, 3, etc
Findings and wishlists
While it's quite easy to set up Evergreen as a searchable resource, thanks to its straightforward URL syntax, one of the things that leaps out at me is that Evergreen, by default, has no identifier index for limiting searches by ISBN / ISSN / LCCN / OCLCnum. Ideally, we would disable full-text indexing on this index so that we can more accurately search for ISSNs that include an x. Right now we have to split ISSNs with an "x" into constituent parts and generate searches on those parts, which results in false hits from across the database. This would also be useful for limiting Z39.50 searches.
I would also like to teach Evergreen about ISBN-10/ISBN-13 equivalence, to broaden the search while maintaining precision. And I would like to automatically normalize ISSN and ISBN formats so that I don't have to worry about whether a cataloguer entered hyphens or not - and the same for incoming search terms.
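As a rough illustration of what that normalization and ISBN-10/ISBN-13 equivalence could look like, here is a Python sketch (Python rather than the Perl above, and not Evergreen code). The function names are made up; the ISBN-13 check digit uses the standard alternating 1 and 3 weights, and the ISBN-10 check digit of the input is not verified.

```python
import re


def normalize_isbn(raw: str) -> str:
    """Strip hyphens and whitespace and upper-case a trailing x."""
    return re.sub(r"[\s-]", "", raw).upper()


def isbn10_to_isbn13(isbn10: str) -> str:
    """Return the ISBN-13 equivalent of an ISBN-10 (978 prefix, new check digit)."""
    compact = normalize_isbn(isbn10)
    if not re.fullmatch(r"\d{9}[\dX]", compact):
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    core = "978" + compact[:9]  # drop the old check digit
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    return core + str((10 - total % 10) % 10)


def normalize_issn(raw: str) -> str:
    """Return an ISSN in the 9999-9999 form, whatever hyphenation was entered."""
    compact = re.sub(r"[\s-]", "", raw).upper()
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        raise ValueError(f"not an ISSN: {raw!r}")
    return compact[:4] + "-" + compact[4:]


def isbn_search_terms(raw: str) -> list:
    """Both forms of an ISBN-10, so that either one can match in the catalogue."""
    compact = normalize_isbn(raw)
    if len(compact) == 10:
        return [compact, isbn10_to_isbn13(compact)]
    return [compact]
```

Applying the same functions to incoming search terms and to the indexed values is what would make the cataloguer's hyphenation irrelevant.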
Finally, to support services like xISBN that search for multiple formats and editions of a given work by generating a shotgun blast of ISBNs for all known representations, I would love to teach Evergreen how to accept a list of identifiers as search input.
Don't ask me when these things will happen, though; if it requires work from me, it will probably be 2010 before any of it happens.
by dan@coffeecode.net (Dan Scott) at June 29, 2009 05:00 PM
code4lib journal, Issue 7, 2009-06-26
by pablog at June 29, 2009 04:46 PM
This article presents the application of part-of-speech (POS) based statistical text analysis to the task of bibliographic metadata extraction from electronic dissertations. By using the approach described here it is possible to detect the title of a Ph.D. paper with an accuracy of about 80%. The accuracy measurements are done using a conceptually simple approach and implementation.
by htomren at June 29, 2009 03:31 PM
NYU has gone live with Umlaut. I’m holding my breath hoping that nothing will go wrong with their installation that’s my fault.
Hi all,
We’ve deployed Umlaut to our production Primo environment at NYU.
Umlaut is available through the “GetIt” link on a search results page at
http://www.bobcat.nyu.edu and is hosted at http://getit.library.nyu.edu
Thanks,
Scot Dalton
Web Development
Division of Libraries
New York University
It’s interesting to me that they are using Umlaut to work around an exceptionally poor part of Primo’s user experience — the page (or really pages in a ‘tabbed’ frameset wrapper) that actually gets the user to accessing the document (physical location/availability or electronic availability etc).
Turns out Umlaut is exceptionally well suited to take over this role in Primo, because Primo already supports (and relies on) calling out to an OpenURL receiver, and because Umlaut is designed for this kind of ‘known item’ and/or ‘last mile’ service. I think (un-humbly) that the mark of a well-thought-out piece of software is that it can serve well in situations that aren’t exactly what it was designed for. A ‘known item service provider’ is something we needed all along without realizing it, and once you have one you can find ways to use it that I never thought of. I expect that more Primo customers will become interested in Umlaut.
And, my understanding is that Summon will also rely on sending out an OpenURL for actual local ‘last mile’ access, so I predict that Summon customers will similarly be interested in Umlaut.
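For readers who haven't looked closely at what "sending out an OpenURL" involves, here is a minimal Python sketch of building an OpenURL 1.0 (Z39.88-2004) key/value request for a journal citation. The resolver base URL and the function name are hypothetical, and I am not claiming this is what Primo or Summon actually emits; the keys are the standard KEV journal-format ones.

```python
from urllib.parse import urlencode

# Hypothetical resolver endpoint; substitute your own link resolver or Umlaut URL.
RESOLVER_BASE = "https://resolver.example.org/resolve"


def journal_openurl(issn=None, jtitle=None, atitle=None,
                    volume=None, spage=None, date=None):
    """Build an OpenURL 1.0 KEV query for a journal article citation."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    fields = (
        ("rft.issn", issn),
        ("rft.jtitle", jtitle),
        ("rft.atitle", atitle),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.date", date),
    )
    # Only include the citation elements we actually know.
    params.extend((key, value) for key, value in fields if value)
    return RESOLVER_BASE + "?" + urlencode(params)
```

Whatever receives that URL, whether SFX or Umlaut, then has to work out what can be delivered for the citation, which is the ‘last mile’ part.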
I hope anyway! Thanks very much to Scot from NYU for spearheading the Umlaut deployment there; I have been very impressed by how quickly Scot was able to get things up and running, with little help from me, including writing some new features and plug-ins to talk to Aleph. Although I’d like to think that the quality of Umlaut’s code and documentation gets some credit here, Scot has been a pleasure to work with, and I hope he will continue working on Umlaut.
Somewhat oddly from my point of view, NYU has deployed Umlaut only in the context of their Primo OPAC/discovery layer. Traditional link resolver use still goes right to SFX. Personally, I think that our users in most of our libraries already have too many different interfaces to deal with, and I place a priority on consolidating and integrating them. Umlaut’s goal is to serve this role by providing a ‘known item last mile’ interface in as many contexts as possible. But I understand that politically it can be difficult to make big changes at once, and my understanding is that NYU does eventually plan to target Umlaut for traditional link resolver use too.
Posted in General

by jrochkind at June 29, 2009 03:21 PM
About a year ago, in Bye, bye Athens... hello UK Federation, I questioned the rather grand claims being made about the then new UK Access Management Federation for Education and Research, notably that it would deliver "improved services to users", and wondered what the reality would be like.
I think we're still waiting to find out to be honest but there doesn't yet seem to be much evidence that anything has really improved over what we had before - certainly not in terms of usability for the end-user!
Last week I attended a meeting set up by the JISC-funded Service Provider Interface Study project, looking specifically at usability issues within the federation as things currently stand.
Firstly... hats off to both the project team and JISC for being so open about the issues. The meeting was a real eye-opener (for me at least), not only in that it demonstrated just how poor usability is across all the players that make up the federation, but also in the realisation that, actually, most service provider access control is done via IP address checking rather than by SAML-based authentication, in part because the usability issues are so great. For most users, SAML only comes into play when they are off-site (at least according to the service providers present at the meeting). Note: I appreciate that this was also the case under the old Athens system... I mention it here only because it seems to me that the continued use of IP address checking hasn't been widely acknowledged in the way the federation is generally presented, so it came as something of a surprise (to me at least).
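To illustrate the split between the two access routes, here is a minimal Python sketch of the kind of decision a service provider might make: recognise on-site users by IP address and only send everyone else through federated sign-on. The network ranges are reserved documentation addresses standing in for a campus, and the function name and return values are made up; this is not taken from any provider's implementation.

```python
import ipaddress

# Hypothetical campus ranges, using address blocks reserved for documentation.
CAMPUS_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def access_route(client_ip: str) -> str:
    """Return "ip" for on-site clients and "saml" for everyone else."""
    addr = ipaddress.ip_address(client_ip)
    # Membership tests across IPv4/IPv6 simply evaluate to False.
    if any(addr in network for network in CAMPUS_NETWORKS):
        return "ip"
    return "saml"
```

Seen this way, it is less surprising that SAML usability only bites off-site: on campus the user never reaches the sign-on step at all.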
Usability problems hit almost every aspect of the Federation as it is currently deployed, from the point that the end-user is initially asked to sign-on right thru to the ways in which service provider services are personalised (or not). Overall usability is made worse because the end-to-end experience is distributed across several players - the service provider, the identity provider, and (optionally) a 'where are you from?' service - each of which can, and does, make different decisions about naming and design. The result is a confusing experience for the end-user, combining poor usability of the individual components in the system with a lack of consistency between the different parts, leading to a situation where it must be near impossible, for example, to write user-support documentation (i.e. help pages) covering everything in a comprehensive form. Even trivial issues, such as what 'sign-on' is called and where it is positioned on the page, are handled differently by different players.
It seems to me that privacy and security issues seem to have driven much of the thinking behind the Federation in its current form. At one point I asked the meeting whether anyone had actually asked real end-users whether they would like to have the option of sharing more information between the identity provider and the service provider in order to enjoy a more seamless and usable experience overall (even if it theoretically compromised their privacy in some way)? I'm not sure I got a clear answer... but it is hard not to draw the conclusion that the Federation has been designed by people with a primary interest in the technology rather than the user experience. A bit like the 'good old days' when we let the techies have full control over firewall policies, disregarding the fact that people actually had jobs that they needed to get done.
I'm sorry if all this seems very blunt but the current deployments are so un-friendly that something has got to be done - otherwise we might as well just bite the bullet and go back to having separate login accounts for every service we access (something that most people are perfectly accustomed to these days anyway!).
So... I want to focus on two, related, aspects of usability for the remainder of this post: naming the authentication process and discovering the identity provider.
Firstly... what do you call the process by which you gain access to a service provider in a SAML-based world? How are things 'branded' if you like? This is a non-trivial question to answer but a great example of how largely technical considerations (like the need for federations) have been allowed to get in the way of user-oriented usability issues. It's also something that the OpenID crowd have got cracked. That's not to say that there aren't other problems with OpenID - there are - but at least they have a single global brand (and associated logo) which makes it easy for any user, anywhere in the world to realise when they are being asked to sign-in using their OpenID.
What's the equivalent brand in the SAML world? There isn't one. Nobody pushes the use of a "SAML sign-on" (quite rightly in my opinion) since it would be meaningless to people. Shibboleth, as I've argued before, names a particular bit of software rather than an approach, and so is inappropriate. Some service providers in the UK still use 'Athens' (because it's what end-users are used to!) - again, clearly wrong in a SAML world. That leaves branding at the level of the federation... but who on earth wants to refer to their "UK Access Management Federation login" - what a horrible mouthful that is. And remember... most service providers offer their services globally, so if we brand things at the federation level then service providers have to start asking their users which federation they are part of - something that I suspect most of us neither know nor care!
That brings us on to my second usability issue. In a SAML-based world, service providers have to direct the end-user back to their institution in order that they can sign-in using their institutional username and password before being redirected back to the target service. This is typically done using a 'where are you from' service, either stand-alone on the network or embedded into the service provider website. Typically, this involves the end-user selecting from a pull-down list of identity providers (there are over 500 in the UK Federation currently), optionally preceded by a pull-down list of possible federations. This is horrible.
I'd like to propose a new rule of thumb for the design of user-interfaces in a SAML world... if we ever have to explicitly ask the end-user to choose from a list which federation they are part of, then we have a totally borked approach and we need to do something different. This seems obvious to me - yet it is exactly the direction we are heading in right now :-( .
I'd actually go much further and say that if we ever have to explicitly ask the end-user to tell us which institution they are a member of just so they can sign-in to something, then we have similarly broken the system - but I appreciate that is a more difficult part of the process to remove given the current technology. We can, however, make the selection of the institution rather easier than scrolling thru a list of 500 (or 1000, or 5000) names. How about looking at the way TheTrainLine lets you select a station name? How about using the jQuery Auto-Complete function to narrow down the list of available names as the end-user begins to type? Here's a demo of just that. Much more intuitive than a pull-down list. (Thanks to my colleague, Mike Edwards, for the sample code to build this, based on the JISC "what do we do?" page.)
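To make the "narrow down the list as the end-user types" idea concrete, here is a minimal sketch of the matching logic in Python. It is not the jQuery plugin or the demo code mentioned above; the function name `suggest_institutions` and the sample identity provider names are assumptions made up for illustration.

```python
def suggest_institutions(typed, institutions, limit=10):
    """Return up to `limit` institution names matching what the user has typed so far."""
    needle = typed.strip().casefold()
    if not needle:
        return []
    # Names that start with the typed text are ranked ahead of names that merely contain it.
    starts = [n for n in institutions if n.casefold().startswith(needle)]
    contains = [n for n in institutions
                if needle in n.casefold() and not n.casefold().startswith(needle)]
    return (sorted(starts) + sorted(contains))[:limit]


# Hypothetical sample of identity provider names, for illustration only.
IDPS = [
    "University of Bath",
    "University of Bristol",
    "Bath Spa University",
    "University of Leeds",
]

print(suggest_institutions("bath", IDPS))
```

Matching on any part of the name, rather than only the start, matters here because so many institution names begin with "University of".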
It'll be interesting to see what recommendations the Service Provider Interface Study project comes up with. Here are mine. Let's stop thinking in terms of asking the end-user what federation they belong to and think instead of questions they are likely to know the answers to. What is the name of the institution you belong to? What's the URL of your institutional website? What country are you in? Let's make the machines do the hard work of sorting out which federation is relevant. In short, let's start building user-interfaces, no... scrub that... let's start designing the system as a whole such that usability and the needs of the end-user are put first rather than being tacked on as an after-thought!
Finally... I'm surprised that publishers, and other service providers, aren't driving this much more forcibly than they appear to have done to date. There was a strong feeling in the meeting that things have got much worse (in usability terms) since the demise of Athens - yet the publishers present seemed rather defeatist about what they could do about it. Every time the usability of the system breaks, a service provider somewhere stands to lose a customer and while they are not typically paying for content individually it ultimately all adds up. Publishers should be pressing the UK Federation (and all other federations) for a system that works end-to-end, not just because of their own self-interest, but because of the interests of their primary user-base. I also think that they should be working much more closely together to bring greater consistency to the way that SAML-based sign-on is presented to the end-user.
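As a rough illustration of "making the machines do the hard work", the sketch below takes something the end-user does know (the URL of their institutional website) and looks up the identity provider and federation from it. Everything here is an assumption for the sake of the example: the `REGISTRY` table, its domains and its identity provider URLs are invented, and a real service would presumably build such a table from the metadata that federations publish.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping institutional domains to (identity provider, federation).
# The entries are invented for illustration only.
REGISTRY = {
    "example.ac.uk": ("https://idp.example.ac.uk/shibboleth", "UK federation"),
    "example.edu": ("https://idp.example.edu/idp", "Hypothetical US federation"),
}


def find_identity_provider(website_url):
    """Return (identity provider, federation) for an institutional URL, or None if unknown."""
    # Accept both "https://www.example.ac.uk/" and a bare "www.example.ac.uk".
    if "://" not in website_url:
        website_url = "//" + website_url
    host = (urlparse(website_url).hostname or "").lower()
    parts = host.split(".")
    # Try the full host name first, then progressively shorter parent domains.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in REGISTRY:
            return REGISTRY[candidate]
    return None


print(find_identity_provider("https://www.example.ac.uk/library"))
```

The end-user never sees the word "federation"; it only appears in the value the machine looks up.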
by Andy Powell at June 29, 2009 03:03 PM
Catching up on something from last month: Sharing Standards for Bibliographic Data Worldwide: An Overview of Changes in Cataloguing Practices, a talk by Barbara Tillett at the Atlantic Provinces Library Association Conference 2009 in Halifax, Nova Scotia.
Built on foundations established by the Anglo-American Cataloguing Rules (AACR), RDA (Resource Description and Access) will provide a comprehensive set of guidelines and instructions on resource description and access covering all types of content and media. The new standard is being developed for use primarily in libraries, but consultations are being undertaken with other communities (archives, museums, publishers, etc.) in an effort to attain an effective level of alignment between RDA and the metadata standards used in those communities, increasing the ability to share metadata among diverse communities. Cataloguers aren’t the only professionals who will be affected by these new rules. Increasing the ability to share metadata outside of our own organizations and changing description and access rules will impact the entire information profession. Along with providing an overview of RDA and its underlying conceptual model (FRBR - Functional Requirements for Bibliographic Records), examples of how FRBR can benefit circulation, reference and serials will be explored.
Laurel Tarulli says it was a very good talk:
Not only did she explain RDA and FRBR in a way that made complete sense (and I’ve been to other RDA sessions), but she also touched on how this is something the entire profession needs to be paying attention to, not just cataloguers. This is interesting because, up until now, many librarians have brushed it aside as a cataloguing issue. Not so! How information is retrieved, what it will retrieve and how it is presented will all change. The relationship gathering is what really excites me. And, it should excite all librarians in and out of the cataloguing department.
by William Denton (wtd@pobox.com) at June 29, 2009 02:50 PM
A few weeks ago, a reporter at the Chronicle of Higher Education interviewed Adam Smith, Google’s director of product management, about the Google Book Search settlement and posted the interview in audio form. The page isn’t dated, but guessing from metadata in the URL it was somewhere around the publication of the paper issue dated June 26, 2009. I’m calling out this particular interview because Mr. Smith said things that I hadn’t heard in other forms yet: Google’s intentions about privacy in Google Book Search, an explicit statement about the Book Rights Registry releasing information about the status of orphan works, and a statement on what Google expects the size of the orphan works problem to be once the Registry has been in operation for a while.
Below is a rough transcript of portions of the interview. I’ve added emphasis in the transcript to the parts that I hadn’t heard Google representatives say before.
Chronicle host: There has been a lot of concern among librarians and in the library community about access and privacy. Can you allay some of those fears?
Adam Smith: There has been a lot of discussion about how this settlement affect things such as access and privacy, and what we are really looking at is creating a product that will be broadly accessible to the university community as well as the internet community generally. [...] I think with respect to privacy, Google hasn’t designed the product yet so it is hard to have a privacy policy for it, but we fully intend to have a policy that is consistent with a lot of the standard procedures in the library community today. Things such as allowing authentication to happen via IP. But we take privacy seriously and it will be consistent with Google’s privacy policy as well as have some specific provisions when we actually get down to designing the product.
Chronicle host: There has been a lot of interest and concern in so-called “orphan works”. Where do those fit into the settlement, and how do you respond to some of the anxiety about that?
Adam Smith: So there is no technical definition of “orphan works” but for the purposes here we’ll say a book for which no rightsholder exists. Google’s mission in this is to really provide broad access to all of these books and when you look at the corpus as a whole, the percentage of books that are available — say — is about 20% are in the public domain or more, about 5% are kind of in print. What that leaves is this center of books that are not in print but may be or may be not in copyright. And what we believe is through the settlement agreement and the establishment of the Books Rights Registry, which is an author- and publisher-controlled entity that will try to track down the rights holders of the particular book, we believe that over time what will happen is that rightsholders will come forward to claim the money that was generated via the economic models and this will allow for better identification of the specific rightsholders to the works. And the Books Rights Registry has committed to making any information — or making the information about whether or not a book has been claimed — making that public so that someone who’s interested in making use of one of these potentially orphan works can understand as to whether or not a rightsholder has come forward for that particular book.
[...]
Chronicle host: Another concern is maybe the one that Google encounters the most — is the question of monopoly. And why we should be happy that the idea that a private company has essential control over 10 million plus works?
Adam Smith: So I think at its root what’s really important here is to look at the agreements. And Google has non-exclusive agreements at the root of all of its agreements. So, its agreements with its library partners are non-exclusive, its agreements with its publishers and authors are non-exclusive. So anyone is free to enter into agreements with those institutions or those publishers. With respect to the settlement agreement, for all works for which a rightsholder comes forward, the Books Rights Registry will have the ability to license or enter into economic models with other parties for those works. So really this is not an exclusive license to Google, but rather it’s establishing the ability for them to get access to these. Obviously for the public domain works, there is no rights or contract associated with that. So what this really leaves is what we believe is a very thin slice of the remaining books, which are the orphan worked books.
I’m glad to see some sensitivity to the notion of privacy in Mr. Smith’s response to that question. The notion of privacy goes beyond using IP address authentication to enable institutional subscription users to access the scanned books, of course — specifically to the collection and disposition of log files related to individuals’ use of the Google books database. I wonder if Google will really consider severing the link between reader and work, as is common practice in libraries today. In the case of online books, that would mean not collecting — or at least immediately anonymizing — the IP address of the machine used to read portions of the book. Time will tell, and this is certainly an area where I hope there is more dialog between Google and academic libraries (should the settlement agreement be approved).
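For readers wondering what "immediately anonymizing" an IP address might look like in practice, here is a minimal sketch using Python's standard `ipaddress` module. It shows one common approach (truncating the address before it is written to a log), not anything Google or any library system is known to do; the function name `anonymize_ip` and the choice of a /24 prefix for IPv4 and /48 for IPv6 are assumptions for illustration.

```python
import ipaddress


def anonymize_ip(address):
    """Zero out the host part of an IP address so it no longer points at a single machine."""
    ip = ipaddress.ip_address(address)
    # Assumed prefix lengths: keep the first 24 bits of IPv4, the first 48 bits of IPv6.
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


print(anonymize_ip("192.0.2.123"))
print(anonymize_ip("2001:db8:1234:5678::1"))
```

Note that a truncated address still identifies the network, and therefore usually the institution, so this weakens the link between reader and work without fully severing it.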
It is interesting that a Google representative is making statements about what the Books Rights Registry will do with orphan works information. I would think it would be up to the registry’s board of directors to decide whether or not they publicly release information about the orphan status of a work. I don’t recall reading in the settlement agreement that it would be mandatory.
Mr. Smith’s answer to the monopoly question ignores the “most favored nation” clause in the settlement agreement that says the Registry cannot offer licensing terms to another party that are more favorable than the ones offered to Google. While that might not be a monopoly in the strictest sense, it certainly makes it harder for any other entity to compete effectively with Google. That same answer also shows Google’s optimism in the estimate that there will be “a very thin slice” of works that will turn out to be orphans — in copyright but without an identified rightsholder. I can only assume that they have internal research to back that up. My gut tells me that there is considerably more than a thin slice, but that part of Mr. Smith’s answer plays well with the notion that Google won’t really have a monopoly because there will be so few books that Google will have the exclusive protections in the class action lawsuit settlement to digitize.
Adam Smith also has answers to questions about why Google didn’t fight it out in court, what Google is doing to help the settlement be approved, and what Google’s reaction might be if the settlement isn’t approved.
Post from: Disruptive Library Technology Jester
Google Book Search Privacy, Orphan Works, and Monopoly
by the Jester at June 29, 2009 12:37 PM
I have been asked to give a few workshops on the value of Twitter to libraries and librarians. So far, I have collected a series of articles and guides and wanted to share the list with you all.
Please feel free to recommend other links I should read or share with my students.
by Nicole at June 29, 2009 12:30 PM
A couple of articles on linked data.
Linked Data - The Story So Far by Christian Bizer, Tom Heath, and Tim Berners-Lee
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
DBpedia - A Crystallization Point for the Web of Data by Christian Bizer, Jens Lehmann, Georgi Kobilarov, Soren Auer, Christian Becker, Richard Cyganiak and Sebastian Hellmann
The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier that can be dereferenced over the Web into a rich RDF description of the entity, including human-readable definitions in 30 languages, relationships to other resources, classifications in four concept hierarchies, various facts as well as data-level links to other Web data sources describing the entity. Over the last year, an increasing number of data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of data. Currently, the Web of interlinked data sources around DBpedia provides approximately 4.7 billion pieces of information and covers domains such as geographic information, people, companies, films, music, genes, drugs, books, and scientific publications. This article describes the extraction of the DBpedia knowledge base, the current status of interlinking DBpedia with other data sources on the Web, and gives an overview of applications that facilitate the Web of Data around DBpedia.
by David (noreply@blogger.com) at June 29, 2009 11:08 AM
The latest issue of SCATNews, the newsletter of the IFLA Cataloguing Section, is now available.
by David (noreply@blogger.com) at June 29, 2009 10:53 AM

IFLA has a new book available, Functional Requirements for Authority Data: A Conceptual Model.
This book represents one portion of the extension and expansion of the Functional Requirements for Bibliographic Records. FRBR has been published as Nr 19 in the present Series. It contains a further analysis of attributes of various entities that are the centre of focus for authority data (persons, families, corporate bodies, works, expressions, manifestations, items, concepts, objects, events, and places), the name by which these entities are known, and the controlled access points created by cataloguers for them. The conceptual model describes the attributes of these entities and the relationships between them.
by David (noreply@blogger.com) at June 29, 2009 09:58 AM
When I reviewed Going Beyond Google I made a mental note to try to find an inexpensive consumer-oriented guide to performing research in the deep Web. While Going Beyond Google is a great book that I highly recommend for use in LIS programs, the book is a class text and at $65 it’s not a book that is aimed at the masses.
When I learned about About.com’s $18 guide to Online Research I became very curious to see if I had found a complement to Going Beyond Google. I got a review copy from the publisher and what follows are my impressions of the book.
The Online Research book is authored by Wendy Boswell, About.com’s guide to Web Search. The book is 276 pages long and has 15 chapters plus several appendices. The book was published in 2007. While this may seem pretty current, depending on what month the book was published it might be two and a half years old. That’s getting old given the numerous references to web resources.
My main interest was in the value of the book for proselytizing about the value of federated and deep Web searching. Chapters 8, 9, and 10 were most relevant:
Chapter 8: Digging Deeper with the Invisible Web
Chapter 9: The Web as Your Personal Librarian
Chapter 10: Evaluating Web Sites for Credibility
For the sake of completeness I’ll list the other chapters although I only skimmed them:
Chapter 1: An Introduction to the World Wide Web
Chapter 2: The Basic Web Search Toolbox
Chapter 3: Using Search Engines
Chapter 4: Google Tips and Tricks
Chapter 5: Searching the Web with RSS
Chapter 6: The Niche Web
Chapter 7: Using the Social Web in Searches
Chapter 11: Finding Multimedia on the Web
Chapter 12: Mining the Blogosphere
Chapter 13: Keeping Your Web Searches Private
Chapter 14: Most-Requested Reader Searches
Chapter 15: Web 2.0
Chapter 8: Digging Deeper with the Invisible Web
This chapter provides a really good introduction to the deep Web. I particularly appreciated this paragraph:
Why is the invisible Web important? I can answer that in one word: quality. Most of the information on the invisible Web is very topic-focused, simply because most of this fantastic information is packaged in various databases concerning everything from archeology to zoology. Because this information is so narrow - and for the most part, academically oriented - you’re more likely to obtain higher than average quality search results in a shorter amount of time, which definitely comes in handy when you’re trying to do a research paper on a deadline.
Bingo! I couldn’t have said it better. I like the author’s clear and simple style of writing. She goes on to discuss the size of the deep Web, citing statistics from Michael Bergman’s seminal Bright Planet paper on the subject. She explains how crawling differs from deep Web searching and how “invisible Web gateways” provide access to deep Web content. Most of the rest of the chapter lists deep Web resources (portals and search engines).
I learned a handy trick for finding deep Web databases in this chapter. Add the word “database” to your queries. Sure enough, when I tried the example of searching for the two words “flowers” and “database” (not as a phrase) the top few results were all to searchable databases of flower-related information. I found a pressed flower database, a gardening plant finder from the BBC, and a searchable database of companies in the flora industry, to name a few.
Chapter 9: Using the Web as Your Personal Librarian
This chapter is about finding a topic to research. It provides more web resources; these are general reference resources intended to give a researcher high-level information about a subject. Some of the resources are deep Web ones: The Arts Database at Yale, SciSearch, Science.gov, and Biography.com are just a few of the ones mentioned. The chapter does touch on how to evaluate resources for credibility but leaves the deeper discussion for the next chapter.
I have to note that this book is chock full of resources. I find this chapter, like much of the book, is filled with descriptions and URLs to many great web sites. While I like the level of detail I also find it a bit overwhelming. Too many sources and not enough time to discover them all. That’s how I feel. So, I find myself skimming much of the book, looking for what’s relevant to me. Maybe that’s the author’s intent.
Chapter 10: Evaluating Web Sites for Credibility
Of course, with the scholarly federated search engines out there one needn’t worry about the credibility of information. It’s when one strays from the deep Web search engines that one has to worry about the credibility of the content found. I do think that this question of credibility is a critical one, especially for researchers. But, even the public should be more concerned about what’s true. Just because it’s online doesn’t make it true, right? So, how do we know what’s true?
This chapter considers factors that determine credibility: outside editorial oversight, double-checking of facts, and maintenance by trained experts. Specific advice is provided on how to evaluate a web-site:
- Who’s in charge?
- Is it absolutely clear which company or organization is responsible for the information on the site?
- Is there a link to a page describing what the company or organization does and the people who are involved (an “About Us” page)?
- Is there a valid way of making sure the company or organization is legit - is this a real place that has real contact information?
- Is the site telling me the truth?
- What is your source really trying to tell you?
I appreciate this exploration of critical thinking skills. These skills are not ones I hear discussed particularly often. As young people enter their college years, given how much time they’re going to be spending online, I think it’s important that they learn to filter what they read.
What do I think of the book? Do I recommend it? I like it and I recommend it. It is certainly not a replacement for Going Beyond Google. It is not an academic book. I wouldn’t use this as the only book in a college course but I would use it as a second source. The book provides a very readable introduction to the deep Web. It provides too many resources but you may not find that overwhelming. It gives a really great introduction to web searching, which applies to federated search as much as it does to searching the crawlers. This is a great book to give to a child heading off to college, especially if the child has an aptitude for or an interest in information science.
by Sol at June 29, 2009 03:40 AM
June 28, 2009
OCLC has published the final report from the OCLC Review Board on Principles of Shared Data Creation and Stewardship and announced the formal withdrawal of the proposed Policy on Use and Transfer of WorldCat Records. In doing so, OCLC has reaffirmed the existence and applicability of the “Guidelines for the Use and Transfer of OCLC-Derived Records” (the 1987 guidelines) and announced its intention to assemble a new group to draft a policy “with more input and participation from the OCLC membership.”
There will also be an open forum at the ALA Annual conference in Chicago where the Review Board will discuss their findings and answer questions. The forum is on Sunday, July 12th from 10:30am to noon in the Waldorf room of the Chicago Hilton. OCLC is offering a registration form for the event.
Found via a post by Jennifer Eustis.
Post from: Disruptive Library Technology Jester
OCLC Formally Withdraws Proposed Record Use Policy
by the Jester at June 28, 2009 07:03 PM
A controversy is starting to pick up in the business librarian community, primarily in the U.K. it would seem, regarding the licensing demands of Harvard Business Press (HBP) for the inclusion of Harvard Business Review articles in EBSCOhost. HBP content in EBSCOhost carries a publisher-specific rider that says use is limited to “private individual use” and explicitly bars the practice of putting “deep links” of articles from EBSCOhost (so called “persistent links”) into learning management systems. In my words, HBP is attempting to limit access to its content in EBSCOhost to those who find it through the serendipity of searching. And now HBP is going after schools that are using persistent linking, and this raises all sorts of troubling questions.
The only visible sign of the publisher-specific rider (that I can find) is text appended to the end of each article from Harvard Business Review in EBSCOhost (copied from a post by Paul Pival):
Harvard Business Review Notice of Use Restrictions, May 2009 Harvard Business Review and Harvard Business Publishing Newsletter content on EBSCOhost is licensed for the private individual use of authorized EBSCOhost users. It is not intended for use as assigned course material in academic institutions nor as corporate learning or training materials in businesses. Academic licensees may not use this content in electronic reserves, electronic course packs, persistent linking from syllabi or by any other means of incorporating the content into course resources. Business licensees may not host this content on learning management systems or use persistent linking or other means to incorporate the content into learning management systems. Harvard Business Publishing will be pleased to grant permission to make this content available through such means. For rates and permission, contact permissions@harvardbusiness.org.
It was Pival’s next statement, though, in that same post where he relays a conversation with a colleague at a different institution that raised my eyebrows (emphasis added):
He also mentioned that HBSP [Harvard Business School Publishing] had leaned on his school and when they decided not to pay, EBSCO turned off the ability for them to create PURLs for that publisher.
Huh? How does HBP know that deep links are being used in that way?
EBSCO Information Services Website Privacy Policy

Graphical representation of the HTTP Referer Header
Pival asks “So how does Harvard BSP know whether a given link is being used for ‘private individual use’ or for within electronic reserves, electronic course packs, a syllabi, or within a learning management system?” The answer is probably the HTTP “referer”[1] header. With every page, your web browser sends the address of the page you came from to the remote web server. You can see this with the BrowserSPY service. If you follow that link, you’ll see that the page you came from was this DLTJ page (http://dltj.org/article/ebsco-hbp/ in the HTTP_REFERER row). Whether you know it or not (and unless you have blocked the HTTP referer header before it gets to the remote server), you leave these traces of where you came from with every web request you make. So what I’m surmising is that the EBSCOhost servers record and process the HTTP referer information for deep links, and can see patterns when a number of people come to EBSCOhost from the same web page; that web page is probably a reading list in a course management system, an electronic reserves page, or something similar.[2]
So now that we know how it is probably happening, we can ask “Is there anything in EBSCO’s terms-of-use that permits them to share usage information with content suppliers?” The answer would seem to be “probably yes”. The place to look is the EBSCO privacy policy. Here is an extract from the policy dated December 26, 2006 (the current version as of the time of writing). The HTTP Referer header seems to fall in the category of “Non-Personal Identifying Information”:
B. Collection of Non-Personal Identifying Information
We collect and use non-personal identifying information, including IP addresses and web server log files to track trends, administer the website, track user movement, and gather demographic information. We use this non-personal identifying information in the aggregate. We do not combine these types of non-personal identifying information with personal identifying information [a term defined earlier in the privacy policy; e.g. user's name, address]. We may also share aggregated demographic information with our business partners, sponsors, advertisers, and companies that control, are controlled by, or are under common control with EBSCO Information Services.
The HTTP referer information comes from the web server log files (it is a byproduct of running a web server), and HBP is probably considered a “business partner”, so HBP can make a request of EBSCO like “Give me all of the HTTP referer addresses that link to HBP articles, and the number of times each referer address is used.” It would be pretty simple then for HBP to determine which institutions were deep linking directly to Harvard Business Review articles.
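To show how little work such a request would involve, here is a minimal sketch in Python that counts referer addresses in web server log lines. It is only a guess at the general shape of the analysis, not a description of what EBSCO actually runs: it assumes logs in the widely used "combined" format, and the `/hbr/` path prefix, the host names and the sample lines are all invented for illustration.

```python
import re
from collections import Counter

# Matches the request, status, size and referer fields of a "combined" format log line.
LOG_LINE = re.compile(
    r'"(?:[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def count_referers(log_lines, path_prefix):
    """Count how often each referring page led to a request under `path_prefix`."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        referer = match.group("referer")
        # "-" (or an empty value) means the browser sent no Referer header.
        if match.group("path").startswith(path_prefix) and referer not in ("", "-"):
            counts[referer] += 1
    return counts


# Invented sample log lines: two visits from one reading list, one with no referer,
# and one request for unrelated content.
SAMPLE_LOG = [
    '192.0.2.10 - - [29/Jun/2009:12:00:00 +0000] "GET /hbr/article/123 HTTP/1.1" 200 5120 "http://lms.example.edu/course/101/readings" "Mozilla/5.0"',
    '192.0.2.11 - - [29/Jun/2009:12:05:00 +0000] "GET /hbr/article/123 HTTP/1.1" 200 5120 "http://lms.example.edu/course/101/readings" "Mozilla/5.0"',
    '192.0.2.12 - - [29/Jun/2009:12:06:00 +0000] "GET /hbr/article/456 HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
    '192.0.2.13 - - [29/Jun/2009:12:07:00 +0000] "GET /other/article/9 HTTP/1.1" 200 2048 "http://library.example.org/guide" "Mozilla/5.0"',
]

for referer, hits in count_referers(SAMPLE_LOG, "/hbr/").most_common():
    print(hits, referer)
```

A referring page that shows up many times, especially one whose address looks like a course reading list, is exactly the pattern surmised above.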
EBSCO Publishing License Agreement (Terms of Use)
So the next logical question might be “Can Harvard Business Publishing create these added restrictions?” The answer to that question is most definitely “yes”. As librarians, we may not like the fact that publishers can put added restrictions on content in our aggregation databases, but EBSCO’s Terms of Use certainly allow for it (emphasis added):
C. Licensee and Authorized Users agree to abide by the Copyright Act of 1976 as well as any contractual restrictions, copyright restrictions, or other restrictions provided by publishers and specified in the Databases. [...] Publishers may impose their own conditions of use applicable only to their content. Such conditions of use shall be displayed on the computer screen displays associated with such content. The Licensee shall take all reasonable precautions to limit the usage of the Databases(s) to those specifically authorized by this Agreement.
And the repercussions of violating the Terms of Use? They are spelled out in the “Termination” section:
A. In the event of a breach of any of its obligations under this Agreement, Licensee shall have the right to remedy the breach within thirty (30) days upon receipt of written notice from EBSCO. Within the period of such notice Licensee shall make every reasonable effort and document said effort to remedy such a breach and shall institute any reasonable procedures to prevent future occurrences of such breaches. If the Licensee fails to remedy such a breach within the period of thirty (30) days, EBSCO may (at its option) terminate this Agreement upon written notice to the Licensee.
B. If EBSCO becomes aware of a material breach of Licensee’s obligations under this Agreement or a breach by Licensee or Authorized Users of the rights of EBSCO or its licensors or an infringement on the rights of EBSCO or its licensors, then EBSCO will notify the Licensee immediately in writing and shall have the right to temporarily suspend the Licensee’s access to the Product(s). Licensee shall be given the opportunity to remedy the breach or infringement within thirty (30) days following receipt of written notice from EBSCO. Once the breach or infringement has been remedied or the offending activity halted, EBSCO shall reinstate access to the Databases. If the Licensee does not satisfactorily remedy the offending activity within thirty (30) days, EBSCO may terminate this Agreement upon written notice to the Licensee.
I haven’t seen mention of EBSCO going so far as to terminate access to all or part of EBSCOhost, but there are indications that EBSCO is disabling the deep links to HBP content. There are reports of HBP asking libraries to pay an additional fee to EBSCO for the ability to have deep linking for HBP content. By one account, a UK university might have to pay an additional £15,000 (about $25,000 at current conversion rates) to “create persistent links for use in teaching.”
Do We Have to Take It?
So those are the technical and legal perspectives on this controversy. The final logical question is “Do we have to put up with it?” Andy Priestner, Judge Business School’s Head Librarian (at Cambridge University), asks several good questions about this twist of electronic article publishing and distribution. In an article a few months ago, Jon Rochkind talks about the eroding of rights associated with the ‘first sale doctrine’ in U.S. Copyright.

Screen capture of a record for a Harvard Business Review article on EBSCOhost
I’m of mixed minds about the issue. HBP can license its content however it likes, and it probably does make a lot of money from reselling articles in course packs and the like — revenue that is lost by the sorts of deep linking that can happen into a journal aggregator’s service (like EBSCOhost). On the other hand, the added restrictions by the publisher are not clearly spelled out on the page where the permanent link is displayed. (See the screen capture of an EBSCOhost record display.) Unless an instructor knows of the special conditions through some other channel (or reads the mind-numbing text at the end of the article PDF — mind-numbing text that never seems to change from article to article, except in these special circumstances), they won’t know that they are doing something that violates the publisher-specific terms of use. And more to the point, the library — the one paying the bills — won’t know that terms have been broken until the publisher comes knocking on the door asking for more money. And if the library balks, who looks like the bad guy?
And what is EBSCO’s role in this? Isn’t a library’s contract with EBSCO, not Harvard Business Publishing? Is EBSCO earning more revenue from this HBP license requirement to enable deep linking to article content? If so, isn’t that just an incentive for EBSCO to do the same with other high-profile publishers?
I don’t have answers to these last questions, but I do get a very uncomfortable feeling.
Update
June 28, 2009: I missed this post from Meredith Farkas when I was looking for reactions to Paul Pival’s post. She says her university has had their deep links into EBSCOhost for Harvard Business Review turned off, although it wasn’t for links in the course management system.
Post from: Disruptive Library Technology Jester
EBSCO in Cahoots With Harvard Business Press
by the Jester at June 28, 2009 06:10 PM
How To Save The Newspapers, Vol. XII: Outlaw Linking
Brilliant. Who said great legal minds needed to understand the world around them to make decisions? Ban linking without permission? Why not just lock us all in our homes, cut the phone/internet lines and send food to the back door? That would surely prevent any use of IP without permission. Brilliant.
Scraped from TechCrunch
by mleggott at June 28, 2009 01:43 PM
June 27, 2009
I posted this update Thursday. It fixes some problems found in 0.9.1. The main issue with 0.9.1 was that the svn got foobared a little bit and some files needed for install were left out of the package. Files that were left out and errors that they caused:
1) migrations 031 & 032 – would cause a method not found error of is_private when adding/editing collections
2) QueryAPI not found – the services directory was left out
3) classic_pagination not found – the classic_pagination plugin was left out
The errors were a bit of a surprise because all the files were in the svn, but they had become weirdly locked. I had to run svn cleanup a few times and then switch branches between the dev and trunk to finally get everything synced up. But, I’ve had outside confirmation that everything is good now. So if you were trying to install 0.9.1 and had trouble, pick up 0.9.2. It should solve your problem.
You can find it at: http://www.libraryfind.org
–TR
by Administrator at June 27, 2009 05:34 PM
This week saw a couple of events around the BagIt specification and tools.
A revision of the BagIt specification went out this week. You will note that it is still 0.96 -- the revisions were only in language to clarify some questions that had been received. There are some discussions going on about 0.97 - join the Digital Curation Google group. I'd like to see some more activity there!
Version 3.0 of BIL, the BagIt Library for Java, was released on SourceForge this week. It's available as binary and source code.
Plus, there was the BagIt video ...
by Leslie Johnston (noreply@blogger.com) at June 27, 2009 04:56 PM
The first in a planned series of digital preservation videos is available on the digitalpreservation.gov site -- an introduction to BagIt! Brian Vargas did a great job as "the talent" -- i.e., the narrator -- but folks should know that Brian was not selected just for his acting experience: he wrote many of our transfer tools (like the transfer scripts on SourceForge) and is a co-author of the BagIt specification.
The video premiered this week at the annual NDIIPP Partner's Meeting to great acclaim. It's aimed at a general audience.
EDIT: The NDIIPP site has added a great new page on the Transfer Tools with a link to the video.
by Leslie Johnston (noreply@blogger.com) at June 27, 2009 04:45 PM
Recently I was in Bamako, Mali, visiting the ever impressive Abdrahamane Anne at the Library of the Faculty of Medicine Pharmacy and Dentistry, University of Bamako. The purpose of my visit was to gather information for a case study on the Koha ILS pilot work that Anne, as he is known, is leading there. I spent two full days with Anne as he walked me through his methodology for exporting and migrating their catalogue data which is currently held in a non-standard CDS/ISIS database. Anne's solutions to a variety of tricky problems to get the data into standard UNIMARC format for importing into Koha were a delight to see. But this post is about something apparently only tangentially related to the eIFL-FOSS ILS project.
The Library of the Faculty of Medicine Pharmacy and Dentistry at the University of Bamako is not presently blessed with an abundance of network connectivity. There are (or were) no public access machines with which researchers might query their legacy CDS/ISIS database. A traditional card catalogue is the primary search tool that students use, unless they wish to request a specific computer search to be undertaken by one of the 3 library staff who share an office. But all that is about to change.
Through the kind offices of a colleague in France, Anne has come into possession of a number of near-obsolete computers. Obsolete, at least, in terms of the environment in which they were originally located. But here, where resources are a bit thinner on the ground, these machines will soon be turned into viable public access machines for users of the library. And the path to this lay in a FOSS solution implemented at Birzeit University: LTSP, which stands for the Linux Terminal Server Project.
Anne had read about the work of Dr. Wasel Ghanem, head of the Electrical and Computer Systems Engineering department at Birzeit University through the eIFL.net spotlight article. Once he had those machines arrive from France he could seriously contemplate doing something with them. It did not take long for me to connect Anne and Dr. Ghanem. And that connection is already bearing fruit.
I heard from Anne last week with news that he has successfully implemented his LTSP installation. He is using a set of Pentium III machines as his thin clients and a Pentium IV as his LTSP server. All that is left now is to gather sufficient network and electrical cables and his library will have public access machines available for its users.
And what exactly will these machines be accessing? They won't be accessing the Internet, at least not for the foreseeable future. What they will be accessing is the Library of the Faculty of Medicine Pharmacy and Dentistry's new Koha OPAC (Online public access catalogue). What started as a project to introduce a full-fledged FOSS integrated library system to this library has grown into the target of a new LTSP installation that will transform the users' interaction with the resources in the library.
One FOSS project connecting to another FOSS project building a new FOSS-enabled future. Yeah, that sounds about right to me.
by randy-m at June 27, 2009 12:57 PM
Seen on David Bigwood’s Catalogablog, quoting something else:
IFLA Working Group on Functional Requirements for Subject Authority Records (FRSAR)
Invitation to participate:
Review of “Functional Requirements for Subject Authority Data (FRSAD) — Draft Report” Available through: http://nkos.slis.kent.edu/FRSAR/index.html or directly from: http://nkos.slis.kent.edu/FRSAR/report090623.pdf (2,800 kb)
Comments deadline: July 31, 2009
FRSAD is the new name for FRSAR, just as FRAD started as FRANAR, Functional Requirements and Numbering of Authority Records. Which you can now hold in your hands, because Functional Requirements for Authority Data is finished and now in book form.
This book represents one portion of the extension and expansion of the Functional Requirements for Bibliographic Records. FRBR has been published as Nr 19 in the present Series. It contains a further analysis of attributes of various entities that are the centre of focus for authority data (persons, families, corporate bodies, works, expressions, manifestations, items, concepts, objects, events, and places), the name by which these entities are known, and the controlled access points created by cataloguers for them. The conceptual model describes the attributes of these entities and the relationships between them.
It costs €69.95 or USD $84 for North Americans.
There are no links on IFLA’s site to a downloadable FRAD, and there’s no mention of the FRSAD draft. The FRSAD group is hosting the draft on their own web site. Neither group announced their news on the FRBR mailing list. I’m bewildered. I assume the final FRAD text will be available to download soon. Open access to FRBR was a major contributor to its success.
by William Denton (wtd@pobox.com) at June 27, 2009 11:16 AM
Editorial Introduction - Code4Lib: Long May You Run by Tom Keays
http://journal.code4lib.org/articles/1695
The Code4Lib Journal mirrors the diversity and depth of interests and expertise of its readership. Our successes, indeed, are yours.
How Hard Can it Be? : Developing in Open Source by Joann Ransom with Chris Cormack and Rosalie Blake
http://journal.code4lib.org/articles/1638
read more
by Christine Schwartz at June 27, 2009 01:48 AM
June 26, 2009
As a new mother, I spend a lot of time awake with Reed when most sensible people are asleep. Consequently, I’ve seen plenty of infomercials and commercials that are rarely if ever on television when sensible people are awake (my personal favorite is the Lee Majors Bionic Ear — “it won’t cost six million, but you’ll think it’s worth it”). The first time I saw a kgb commercial, though, I assumed that I was so sleepy I hadn’t heard it right. It took seeing a second one another night to make me realize that they’re offering for money what we’ve been offering for free forever.
Get this — kgb (short for Knowledge Generation Bureau) is a “unique” service where people can get answers to their questions via text message:
Users who text 542542 (kgbkgb) receive real-time responses to questions any time, day or night, from any cell phone, for a cost of ninety-nine cents.
In one commercial I saw, a man was trying to remember the name of the Red Sox player who lost the World Series for them in 1986 (Bill Buckner) and kgb gave him the answer. Users pay $.99, plus any fees they normally pay to send and receive text messages. Their questions are answered by “agents”, regular folks who are paid 10 cents per answer they give.
Now, what if there was a service where people could ask questions via text message, IM, phone and email for free, only their questions would be answered by individuals with specialized training in finding the most accurate and authoritative answers? If only such a thing existed!
What does this tell us? People don’t think of librarians when they want answers? Librarians aren’t available when people want answers? Librarians don’t get answers to people quickly enough? Many people would rather get answers via text than phone/IM/email? Or all of the above?
What can we learn from the service kgb provides?
by Meredith Farkas at June 26, 2009 06:46 PM
So I have a server with an SSL cert. It requires a passphrase every time I restart. This is slightly annoying. So one of my coworkers recommended the following:
- create a simple perl script that prints the password
- in the ssl.conf file change SSLPassPhraseDialog builtin to SSLPassPhraseDialog exec:/location/to/passphrase.pl
- restart apache
It works beautifully. I am of course writing this down because I will forget what he told me in like 5 minutes.
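For future me, here is a rough sketch of what such a script could look like. It is in Python rather than Perl (mod_ssl only needs an executable that prints the passphrase to standard output), and the hostname, the passphrase, and the `PASSPHRASES` table are placeholders, not anything from my actual server. mod_ssl calls the program with "servername:port" as the first argument and the key type as the second.

```python
#!/usr/bin/env python3
"""Print the SSL key passphrase for Apache's SSLPassPhraseDialog exec: hook."""
import sys

# Hypothetical lookup table: "servername:port" -> passphrase.
# Replace with your own values and keep this file readable by root only.
PASSPHRASES = {
    "www.example.org:443": "not-my-real-passphrase",
}


def passphrase_for(server_and_port, passphrases=PASSPHRASES):
    """Return the passphrase for a 'servername:port' string, or None if unknown."""
    return passphrases.get(server_and_port)


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is "servername:port"; sys.argv[2] (the key type) is ignored here.
    secret = passphrase_for(sys.argv[1])
    if secret is not None:
        print(secret)
```

The script has to be executable, and since the passphrase now sits in a plain file, tight file permissions matter.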
by Rosalyn Metz at June 26, 2009 06:06 PM
Paul Pival wrote today and yesterday about “mafia tactics by Harvard Business School Publishing”, wherein they are trying to charge libraries to link to articles from Harvard Business Review in EBSCO for online classroom use and then are turning off PURLs to HBR articles in Business Source products if the school refuses to pay.
I’ve known about this for almost a year as my library had its links shut off because we didn’t want to pay to be able to link to HBR in our online classes. Fortunately there weren’t any links to HBR in the course management system when our links were shut off, so it didn’t have any real impact on us. I’d assume that we were approached by Harvard because our online programs spend quite a bit of money on case studies from Harvard Business School Press, since we’re certainly not a big fish otherwise. When I was told by our rep about the new service where we could pay to link to HBR articles in EBSCO, I’d had no idea that we had previously been unable to link to them in the first place (how many of us have access to our contracts with our vendors?). The links to HBR articles are available in the same way as links to any other article in the Business Source products. If there’s a persistent link in the database to an article that a professor wants to use for their class, they’re going to use it. And apparently, I’m not the only one who was unaware of this.
These are the current use restrictions, which have changed since my school signed an agreement with EBSCO:
“Harvard Business Review Notice of Use Restrictions, May 2009 Harvard Business Review and Harvard Business Publishing Newsletter content on EBSCOhost is licensed for the private individual use of authorized EBSCOhost users. It is not intended for use as assigned course material in academic institutions nor as corporate learning or training materials in businesses. Academic licensees may not use this content in electronic reserves, electronic course packs, persistent linking from syllabi or by any other means of incorporating the content into course resources. Business licensees may not host this content on learning management systems or use persistent linking or other means to incorporate the content into learning management systems. Harvard Business Publishing will be pleased to grant permission to make this content available through such means. For rates and permission, contact permissions@harvardbusiness.org.”
One has to wonder what “any other means of incorporating the content into course resources” means. Does that mean one can’t tell students in a class to access a HBR article from Business Source Premier without providing a link? Absurd!
Personally, I find the whole thing really sleazy. We are already paying to access the content from Harvard Business Review in the EBSCO database, just like every other journal in there. We link to other journals in EBSCO databases in our course management system without incident. Why not this one? Why would we need to essentially double-pay just to have a direct link to the content? And, as Paul also asks, how does EBSCO know that a school is using links to HBR content in a course management system or e-reserve?
I guess HBSP can make whatever rules they want with regards to their content, since they’re big and basically essential to any MBA program. But I’m curious — are any of your libraries actually paying HBSP to be able to create permalinks? And have any of you had your EBSCO permalinks to HBR shut off because you wouldn’t pay?
by Meredith Farkas at June 26, 2009 05:54 PM
The Code4Lib Journal mirrors the diversity and depth of interests and expertise of its readership. Our successes, indeed, are yours.
by Tom Keays at June 26, 2009 05:22 PM
In 2000 a small public library system in New Zealand developed and released Koha, the world’s first open source library management system. This is the story of how that came to pass and why, and of the lessons learnt in their first foray into developing in open source.
by Joann Ransom with Chris Cormack and Rosalie Blake at June 26, 2009 05:22 PM
This paper discusses the analysis of Apache web server logs from a faceted catalog interface (OPAC) at North Carolina State University. By grouping individual HTTP requests into user sessions and analyzing in that context, requests can be understood as particular user actions, with more specificity as to purpose and effect of an action. Client IP address and time are used as a sufficient proxy for determining user sessions from logs. Some initial exploratory findings of user behavior in the NCSU OPAC are provided, including that users make use of facets less than of text searching, and that some facet groups are used significantly more than others. Links are provided to the scripts used to make this session-based analysis, which could be modified for use with other faceted OPACs which use an Apache front-end.
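As an illustration of the general idea only (this is not the authors' script), grouping requests into sessions by client IP address and time could be sketched as follows. The 30-minute inactivity gap, the tuple layout, and the name `group_sessions` are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Assumed cutoff: a new session starts after 30 minutes of inactivity from an IP.
SESSION_GAP = timedelta(minutes=30)


def group_sessions(requests, gap=SESSION_GAP):
    """Group (ip, timestamp, url) tuples into sessions.

    Requests from the same IP belong to one session as long as each request
    arrives within `gap` of the previous one from that IP.
    Returns a list of sessions ordered by start time; each session is a list
    of the original tuples in time order.
    """
    sessions = []
    current_by_ip = {}  # ip -> the session list currently open for that ip
    for ip, when, url in sorted(requests, key=lambda r: r[1]):
        current = current_by_ip.get(ip)
        if current is None or when - current[-1][1] > gap:
            current = []  # start a new session for this ip
            sessions.append(current)
            current_by_ip[ip] = current
        current.append((ip, when, url))
    return sessions


# Hypothetical requests; the addresses and URLs are made up.
SAMPLE_REQUESTS = [
    ("192.0.2.1", datetime(2009, 6, 26, 10, 0), "/catalog?q=koha"),
    ("192.0.2.2", datetime(2009, 6, 26, 10, 2), "/catalog?q=frbr"),
    ("192.0.2.1", datetime(2009, 6, 26, 10, 5), "/catalog?q=koha&facet=format"),
    ("192.0.2.1", datetime(2009, 6, 26, 11, 0), "/catalog?q=bagit"),
]

if __name__ == "__main__":
    for session in group_sessions(SAMPLE_REQUESTS):
        print(session[0][0], len(session), "request(s)")
```

Once requests are grouped this way, each session can be inspected for whether a facet link followed a text search, which is the kind of question the paper's findings address.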
by Cory Lown and Brad Hemminger at June 26, 2009 05:22 PM
The UW-Madison Libraries Library Course Page system is used to deliver electronic reserves materials and course-focused library instruction webpages to students. As part of a rewrite of our system we broke the application into three component pieces: a file repository, a course timetable data service, and an interface application for building and viewing individual course pages. The new three-piece system was written with an inward facing service-oriented architecture that allowed us to choose the best technologies to solve each of the tasks the entire system needs to accomplish.
by Stephen Meyer at June 26, 2009 05:22 PM
JAbbr is an online tool developed at Cornell University to help users decipher journal title abbreviations. This article discusses why these abbreviations are so problematic, and how traditional tools are often insufficient, and then describes the novel approach used by JAbbr. Given an abbreviation, JAbbr creates a regular expression for fuzzy matching, tests it against a list of serial titles extracted from the library catalog, and returns a list of possible matches to the user. JAbbr is available as a web site and as a web service.
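To give a feel for the approach, here is a minimal sketch of turning an abbreviation into a regular expression and testing it against a title list. It is not JAbbr's actual algorithm: the matching rule (each abbreviated word is a prefix of a title word, with other words allowed in between), the function names, and the sample titles are assumptions for illustration.

```python
import re

# Separator between abbreviated words: at least one non-word character,
# optionally followed by skipped title words such as "of" or "the".
_SEPARATOR = r"\W+(?:\w+\W+)*?"


def abbreviation_to_regex(abbreviation):
    """Build a case-insensitive regex in which each abbreviated word is a word prefix."""
    tokens = re.findall(r"[A-Za-z]+", abbreviation)
    parts = [re.escape(token) + r"\w*" for token in tokens]
    return re.compile(r"\b" + _SEPARATOR.join(parts), re.IGNORECASE)


def find_titles(abbreviation, titles):
    """Return the titles that the abbreviation could stand for."""
    pattern = abbreviation_to_regex(abbreviation)
    return [title for title in titles if pattern.search(title)]


# Hypothetical title list standing in for titles extracted from a catalog.
SAMPLE_TITLES = [
    "Journal of the American Chemical Society",
    "Journal of Applied Mechanics",
    "American Journal of Chemistry",
]

if __name__ == "__main__":
    for title in find_titles("J. Am. Chem. Soc.", SAMPLE_TITLES):
        print(title)
```

A rule this loose will return several candidates for short abbreviations, which fits the description above of returning a list of possible matches rather than a single answer.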
by Keith Jenkins at June 26, 2009 05:22 PM
This article describes the workflow used by the University of Iowa Libraries to populate their institutional repository and their catalog with the data collected by ProQuest UMI Dissertation Publishing during the submission of students' theses and dissertations. Re-purposing the metadata from ProQuest allowed the University of Iowa Libraries to streamline the process for ingesting theses and dissertations into their institutional repository. The article includes a discussion of the benefits and limitations of the workflow described.
by Shawn Averkamp and Joanna Lee at June 26, 2009 05:22 PM
This article presents the application of part-of-speech (POS) based statistical text analysis to the task of bibliographic metadata extraction from electronic dissertations. By using the approach described here it is possible to detect the title of a Ph.D. paper with an accuracy of about 80%. The accuracy measurements are done using a conceptually simple approach and implementation.
by Götz Hatop at June 26, 2009 05:22 PM
By: dempsey
Categories: General - systems and technologies• Libraries - systems and technologies
I traveled home from the 2nd M-Libraries Conference in UBC, Vancouver, yesterday. I was interested to come across several relevant news stories in the reading materials I had bought en route: The Globe and Mail, The Economist (last week's, as it turns out), and The Financial Times. This underlined the topicality of the conference themes.
The iPhone was prominent at the conference, in discussion, but also in practice as they were slipped in and out of pockets and bags throughout. In an interesting presentation about their WolfWalk project, Tito Sierra of NCSU opened with some general remarks about geo-location and touch screens as distinctive capacities supporting new applications. He also reminded people of the importance of the Apple Apps store in reducing transaction costs for users: search and acquisition of apps was now straightforward. What Apple has done is to create a network of developers around its successful platform. The App Store is key to this as it allows app developers to find users, and users to find apps, and in the process the value of the iPhone/iTouch is increased. This point was reinforced in a story about Apple's success in countering the effects of the current downturn in the Financial Times. John Gapper quotes work by Hagel and Seely Brown of Deloitte which shows that lower costs of entry brought about by regulatory and technical trends are creating stronger competitive challenges for companies. Apple's ability to resist this trend depends on the way in which it has created a platform around which a network of partners has built thousands of apps. So, for the Palm Pre to be successful, for example, it not only has to compete with the iPhone on price and features, but also on its ability to become a platform for app developers. Much of the value of the iPhone now derives from the apps which are available to its users.
I was also struck by the number of Mlib09 delegates who were using netbooks. I suppose you would expect this at such a conference, but this did not make it any less striking. The Economist had an article on netbooks, focusing on their challenge to the computer and software industry generally. They report Gartner figures that 21M netbooks will ship this year, twice as many as last year, accounting for more than 15% of the laptop market. By the end of 2008, netbook pioneer Asus had sold nearly 5M Eees. I was interested to read that Microsoft was heavily discounting Windows XP to netbook providers to counter the Linux challenge. Acer and other firms plan to use Android.
One of the hits of the conference was the discussion by Kate Robinson of the use of QR Codes in the catalog at the University of Bath (blogged here earlier this year). It prompted discussion of the variety of ways in which people and materials could be tied into the network.
The Globe and Mail had several stories about capturing data from codes.
- Databars. A discussion of the use of Databars, smaller than barcodes, in retail and supply-chain operations.
- Samplesaint: a story about how this company, which creates digital media for cell phones, now distributes discount coupons for redemption by on-screen scanning at the checkout. Coupons can be received in various ways, including in response to an on-the-spot request by texting a number found on the relevant shelf.
- There is also a general discussion of the use of cell phones as payment devices.
Interestingly, these were opposite an advert for IBM (featuring a barcode image) which promoted its ability to make supply chains smarter and more efficient.
June 26, 2009 04:49 PM
The new issue of TechKNOW is available.
- Joining Together to Recycle Library Discards and E-Waste / Miriam Kahn, MBK Consulting and Adjunct Faculty at KSU School of Library and Information Science
- New ALA ALCTS Public Libraries Technical Services Group Holds Inaugural Meeting in Chicago / Cynthia Whitacre, OCLC
- Coordinator's Corner / Andrea Christman, Technical Services Division Coordinator, Dayton Metro Library
- MarcEdit: A Powerful Program / Roger M. Miller, Cataloging Services Department Manager, Public Library of Cincinnati and Hamilton County
- Book Review: Magic Search: Getting the Best Results from Your Catalog and Beyond
- Social Tagging, Folksonomies and Controlled Vocabularies--Can't They Just be Friends? / Margaret Maurer, Head, Catalog & Metadata, Kent State University Libraries
- SERIES-L: A New Tool for Cooperative Quality Control / Ian Fairclough, George Mason University
- Book Review: Standard Cataloging for School and Public Libraries, 4th edition
by David (noreply@blogger.com) at June 26, 2009 05:19 PM
It took 8 full days for the New York Times to make the same announcement on its "Open" blog that it made last week at the Semantic Technology Conference. Being that it's Friday afternoon, I present here my purely hypothetical speculations on what took so long, based on reading of tea leaves and semantic hyperparsing of the subtle, almost hidden differences between today's text and a transcription of the announcement of last Wednesday.
- A pitched battle between entrenched factions within the New York Times has raged over the past week, pitting a radical cabal of openists against the incumbent "we've always done it that way" faction. The openists slipped the announcement of the announcement into their blog while the traditionalists were occupied with the battle over the type size of the headline for the Michael Jackson story today.
- The TimesOpen team missed last week's deadline for the "Sunday Styles" Announcements section.
- Normally, announcements like these take two weeks to process, but the business section was starting to get worried that USAToday was going to scoop them with a front pager on Monday.
- The written announcement was held up because a patent lawyer feared that the admission that the Thesaurus was "almost 100 years" old could hurt the Times' efforts to obtain a patent on the semantic web.
- The announcement was actually made last Thursday, but the printf() command in the Blog's subtitles crashed some key RSS syndication agents.
- The fact checker was on vacation.
If you have ever worked in an organization of even moderate size, you know that the real reason is almost certainly banal and boring.
On a more serious note, I think it's important to understand how organizations (not just the New York Times) adapt their internal processes to enable semantic technology in general. Over the past 15 years, the necessity to produce a web site has required many organizations to overhaul many of their internal processes, resulting in new efficiencies and capabilities that go well beyond the production of a website. At last week's Semantic Technology Conference, there were a number of presentations that solved problem X using semantic technologies, raising immediate questions about what was so wrong with solving problem X the conventional way. Implicit in the presentations was an assumption that by approaching problems using semantic techniques, one could achieve a level of interoperability and software reuse that is not being achieved with current approaches. That's a sales pitch that's been made for many other technologies. What is certainly true is that many problems that are causing pain these days can only be solved by reengineering of corporate processes; maybe semantic technologies will be a catalyst for this re-engineering, at least in the publishing industry.
A very thoughtful review of last week's conference has been posted by Kurt Cagle. I leave you with this quote from Kurt:
There comes a point in most programmers careers where they make a startling realization. Computer programming has nothing to do with mathematics, and everything to do, ultimately, with language. It’s a sobering thought.
A reassuring thought as well.
by Eric Hellman (noreply@blogger.com) at June 26, 2009 03:25 PM
I’m going to ALA Annual in Chicago. I’ve finally set my schedule sort of…there are a few sessions I’ll go to depending on my mood that day. I decided to be a loser and post what it is I’m hoping to attend, you know just in case anyone cares (or wants to stalk me…why though I’m not so sure).
Not on this list is the fact that I would like to crash the Ex Libris Reception. I’m not sure who from my old gig will be there but it’s always nice to see my favorite product manager (yes you Nettie).
Fri Jul 10
4pm – 5:30pm LITA 101: Open House – Palmer House Chicago, IL
Sat Jul 11
8am – 10:30am RSS All Committee Meeting – Swissotel, Chicago, IL
10:30am – 12pm The Open Library Environment Project: Building an ILS for Service Oriented Architecture Integration – McCormick Place, Chicago, IL
1:30pm – 5:30pm Look Before You Leap: Taking RDA For a Test-Drive – McCormick Place, Chicago, IL
– OR –
1:30pm – 3:30pm The Secret Life of Our Data: Privacy in the Digital Age – McCormick Place, Chicago, IL
3:30pm – 5:30pm Open Access Digital Initiatives in the Humanities: Creation, Dissemination, Preservation – McCormick Place, Chicago, IL
Sun Jul 12
12pm – 1:30pm OCLC Developer Network Luncheon – Intercontinental Chicago Hotel, Chicago, IL
1:30pm – 3pm Top Technology Trends – Intercontinental Chicago Hotel, Chicago, IL
3:30pm – 5:30pm Improving User Services Through Open Source Solutions: Potentials and Pitfalls – McCormick Place, Chicago, IL
– OR –
3:30pm – 5:30pm You Got Me, Do You Like Me? Evaluating Next Generation Catalogs – McCormick Place, Chicago, IL
Mon Jul 13
8am – 10am Resuscitating the Catalog: Next-Generation Strategies for Keeping the Catalog Relevant – McCormick Place, Chicago, IL
10:30am – 12pm Social Software Showcase 2009 – McCormick Place, Chicago, IL
1:30pm – 3pm Content Management Systems in Libraries: Opportunities and Lessons Learned – McCormick Place, Chicago, IL
by Rosalyn Metz at June 26, 2009 02:18 PM
Are any libraries making use of the funds available in the Serve America Act? Seems like there are several areas where libraries could fit the requirements.
by David (noreply@blogger.com) at June 26, 2009 03:08 PM
I came across a very interesting resource today -- the Chesapeake Project Legal Information Archive -- and the just-released results of a study they did on archiving legal resources on the web:
The Chesapeake Project Legal Information Archive has released a comprehensive report evaluating its digital preservation efforts during the project's two-year pilot phase.
The project evaluation reveals that nearly 14 percent — or approximately one in seven — of the online publications archived between March 2007 and March 2009 have already disappeared from their original locations on the Web but, due to the project's efforts, remain accessible via permanent archive URLs. A similar analysis in 2008 showed that slightly more than 8 percent of archived titles had disappeared from their original URLs, demonstrating a dramatic increase in "link rot," or inactive URLs, among archived content over the past year.
During the two-year pilot phase, the libraries participating in the project archived more than 4,300 digital objects and tracked more than 177,000 visits to www.legalinfoarchive.org, the home of The Chesapeake Project's digital archive collections. Users of the project's Web site visited from educational, government, and military institutions in the United States, as well as from countries abroad throughout the Americas, Europe, the Middle East, Asia, Africa, Australia, and the Pacific Islands.
Not too surprisingly, .edu is the class of domain with the second-highest resource loss, after .info. Academic institutions are not always very conscientious about preserving access to their content, and with their academic term structure and the movement of faculty between institutions, web content on .edu sites is highly variable in its longevity. I don't see a characterization of how old the harvested resources are -- that can be very difficult to identify -- but it is a high percentage of bitrot, and there was quite an increase from the end of the first year to the end of the second year.
Download the PDF of their report.
by Leslie Johnston (noreply@blogger.com) at June 26, 2009 02:04 PM