Blogs and feeds of interest to the Code4Lib community, aggregated.


May 21, 2012

Open Knowledge Foundation

Petition the White House to Open Up Publicly Funded Research

John Wilbanks, co-author of the Panton Principles and past OKFN Advisory Board Member, just launched a petition to ask the White House to mandate free access to publicly funded research in the US. Here’s what it says:

We believe in the power of the Internet to foster innovation, research, and education. Requiring the published results of taxpayer-funded research to be posted on the Internet in human and machine readable form would provide access to patients and caregivers, students and their teachers, researchers, entrepreneurs, and other taxpayers who paid for the research. Expanding access would speed the research process and increase the return on our investment in scientific research. The highly successful Public Access Policy of the National Institutes of Health proves that this can be done without disrupting the research process, and we urge President Obama to act now to implement open access policies for all federal agencies that fund scientific research.

If more than 25,000 people sign it within the 30-day time frame, then the White House is required to consider the proposal and to give an official response. By the end of the first day there were over 3,000 signatures.

Anyone can sign the petition – you do not need to be a citizen or resident in the US to support the initiative. If you believe in open access to research, please do consider lending your name, and encouraging friends and colleagues to do the same. You can find the petition here.


by Jonathan Gray at May 21, 2012 07:34 PM

JISC Access Management Focus

Lightning Talks at #TNC2012

I’m attending the lightning talks on the first day at #TNC2012. Some of the things we are hearing about (I didn’t get them all, twitter and unicorns are distracting):

I’ve been trying to find a common theme to cunningly link all of the talks. It’s difficult but I think most of the speakers were saying:

by nicole at May 21, 2012 04:34 PM

Engard, Nicole

Reports in Koha 3.2



Koha’s reporting tool is one of my favorite features of the ILS. This tutorial only scratches the surface of how you use the reporting tool in Koha, but it should give you a good foundation for using it in your library.

If you have an idea for a video, please just let me know and I’ll add it to my list of things to record.

Related posts:

  1. Using the News Tool in Koha 3.2 and 3.4
  2. Disabling Holds in Koha
  3. Saving Koha Bibs to Refworks

by Nicole at May 21, 2012 03:00 PM

Open Knowledge Foundation

Infokultura and Apps4Russia

During recent years, the Russian Federation has undertaken a number of developments in its open data legislation strategy. This trend inspired a team of professionals to get together and start a non-profit organization, “INFOKULTURA”.

Understanding that data availability is crucial for an information society and the development of an information culture, we emphasise the establishment and promotion of the Open Data concept through a number of activities, for example the Apps4Russia Contest, conferences, seminars, research and expertise provision. It is worth mentioning that promoting information culture as an important social topic takes more than technical capacity and access to the Internet: it is essential that data availability is free of administrative burden and of technological, legal, time and other constraints. These problems can be solved with the help of open licenses, open standards and open data sources.

Warfly – one of last year’s Apps4Russia winners

Here in Russia, there is growing interest among society and State authorities in Open Data expertise. In response, “INFOKULTURA” has developed a set of proposals to support the development, implementation and promotion of effective tools for information interaction between government and society.

During discussions and team meetings it was agreed that the main goals of the organization should be focused on promoting the following ideas:

Apps4Russia Contest

In April 2012 the second Apps4Russia contest was announced at RIF+CIB 2012, the Internet industry’s main event in Russia.

The Apps4Russia contest was initiated to promote the idea of Open Data. The main goal is to encourage Russian developers to create projects based on Open Government Data, aiming to increase public benefit and improve government transparency.

In 2011 Apps4Russia was held as a private initiative with a prize fund of EUR 4,000. This year the prize fund has been significantly increased, since more developers at various levels, including school students, are expected to submit to the contest.

In 2011 the following Submissions were nominated and awarded:

Datapult – one of last year’s Apps4Russia winners

This year all submissions will be reviewed under the following categories:

Call for Proposals timeframe: April – September
Summing up date – 12 September 2012 (Day of the Programmer).

Official web-site: http://www.apps4russia.ru

by Ivan Begtin at May 21, 2012 01:45 PM

Kick-starting the School of Data!

Earlier this year, we announced plans to launch the School of Data. Thanks to the generous support of Open Society Foundations and the Shuttleworth Foundation, we’re now ready to go! We’re holding a kick-off sprint next week, and we invite you to get involved.

What is the School of Data?

The School of Data is led by the Open Knowledge Foundation (OKFN) and Peer 2 Peer University (P2PU). The School will provide online training for data ‘wrangling’ skills – the ability to find, retrieve, clean, manipulate, analyze, and represent different types of data.

The School of Data is a collaborative and community-orientated project, and we welcome partners and participants. We’ve already had exciting conversations with several organisations and individuals, and we look forward to drawing upon their expertise during the development of the School. We are particularly excited to welcome the Tactical Technology Collective to our sprint next week, and look forward to benefiting from their wide-ranging experience (see e.g. their drawing by numbers project). We hope many more will join us – read on to find out how you can take part!

For more information about the School of Data, please visit our FAQs

The Kick-off Sprint

Next week, a small team of us will be gathering in Berlin for the School of Data kick-off sprint. This is a great opportunity to get to know one another, and to start building materials and resources for the School.

During the sprint we will be holding a designated virtual session, and we warmly invite you to join us online!

Details of the sprint are as follows:

When: Thursday 24th May, 12pm-4pm UTC
Where: Online, through our IRC channel (#schoolofdata on freenode)
How: Sign-up on the etherpad and then just drop in!

Everyone is welcome, and we will try to have a range of hands-on activities to suit everyone’s interests. We particularly welcome:

If you can’t make the IRC session, you can still get involved. Sign-up to the School of Data mailing list and introduce yourself, or drop a quick note to schoolofdata [@] okfn.org.

In Berlin? Join the Social!

If you are located in or around Berlin, join us for our social! This will be held on the evening of Thursday 24th May at St. Oberholz, (Rosenthaler Straße, 72a). This is a great opportunity for anyone in the area with an interest in Open Data to meet casually for a drink and a chat.

To give us an idea of numbers, please sign-up here.

Get Involved in the School of Data

As well as the kick-off sprint, there are many ways to get involved with the School of Data.

Right now, we are looking for volunteers to:

And once the School is up and running, we will also be seeking volunteers to:

If you’d like to get involved in these or any other ways, please sign-up to the mailing list and introduce yourself!

You can follow us on Twitter: @SchoolofData

You may also be interested in our recent blog post about job vacancies. The Data Wrangler’s role in particular will be closely involved with the School of Data.

Finally, if you are able to contribute towards the ongoing maintenance of the School, please do feel free to donate via our secure PayPal account.

by Laura Newman at May 21, 2012 01:10 PM

del.icio.us

The Code4Lib Journal – Improving the presentation of library data using FRBR and Linked data

by thmmylibrary at May 21, 2012 09:14 AM

Coyle, Karen

Google goes semantic

In a long-awaited move [1], Google has announced that its search will now be "semantic." They don't actually mean "semantic" in the sense of the semantic web, although there are similarities. While what Google is doing may not formally follow the W3C standards for the semantic web, there is no doubt that they are performing acts of "data linking" that make use of the concepts of linked data. The W3C standards for linked data are designed for openness, so that data from disparate communities can come together. Google has no obligation to play well with others and, as we saw with the development of schema.org, is in a position to make its own rules, many of which are known only within the giant Google-verse. They call their technology a "knowledge graph" and talk about "things not strings." I've used this same phrase myself in numerous presentations on linked data.

Google has always been about using links between things on the web to determine its brand of "relevance" of a web resource to a search query. By using existing linked data, via large stores of links like DBpedia, Wikipedia, Freebase, and presumably others, Google can now expand its offerings from a single list of results to additional information about the topic that might be the intended topic of the searcher. I say "might be" without any irony; whether in a web search engine or a library catalog, the communication between the searcher's mind and the device that provides results is always only approximate. What the additional data provides is not only more context but a more ample explanation of the topics that have been retrieved. Users no longer have to guess from snippets the meaning of the results in the result set; they can see a Wikipedia-like entry that not only gives them more information but also contains links to other sources of information on the topic.

Snippet
"Knowledge Graph" result

"Knowledge graph" detail



At a meeting of the Northern California Technical Services Group in Berkeley last Friday, I said to the group:

Imagine that you have an 18-year-old user who finds a novel on your library's shelf by Oliver North. The user looks up the author in your catalog and sees that this person has written a few other books, but oddly always with a "co-author." Is someone so inept worth reading? Now imagine that your catalog also presents the user with the context: Ollie North, Iran Contra, and related persons. Suddenly the user sees where North fits into US history, has a chance to find out what an interesting character he is, and the books take on a whole new meaning.

That was before I saw this Google result.

We treat library users as if they are all-knowing; as if they know each author in our catalog, as if the title of the book and the number of pages is sufficient for them to decide if it is a good read or has the information they need. This is so obviously false that I am at a loss to explain how we continue to work under this illusion.

[1] Google purchased the only linked-data search system, Freebase, in July of 2010, thus tipping their hand that they were moving in that direction. Not only did they acquire Freebase and the skills of its employees, they eliminated a potential rival (although it may be silly to consider that anyone could really be a rival to Google).

by Karen Coyle (noreply@blogger.com) at May 21, 2012 09:03 AM

May 20, 2012

Manage Metadata (Phipps and Hillmann)

Taggregations

The technique described in Using the sub-property ladder works well to “dumb-up” raw, level 0 data from MARC21 fixed-length data fields to interoperate with metadata from other schemas. Unfortunately, it cannot be used with most MARC21 variable data fields (tags) and subfields. We cannot simply dumb-up a subfield to the level of its parent tag because most tags have more than one subfield; the meaning of a tag is a combination of the meanings of its subfields and tag-level data is a composite of subfield-level data.

There is another technique we can use to bridge the semantic gap between a subfield and its tag: tags generally can be treated as “aggregated statements”, where the value of a tag is a literal string, or statement, which is composed of the values of subfields.

For example, a record may contain a tag 260 (Publication, Distribution, etc.) with subfield a (Place of publication, distribution, etc.) = “Edinburgh :”, subfield b (Name of publisher, distributor, etc.) = “Castle Press,”, and subfield c (Date of publication, distribution, etc.) = “2012.”. The contents of the tag, “$aEdinburgh :$bCastle Press,$c2012.” can be turned into a tag-level value, “Edinburgh : Castle Press, 2012.”, by substituting a space for each subfield indicator ($) and code pair. We can then use a tag-level property with the label “Publication, Distribution, etc. (Imprint)” and URI “m21plus:T260” to publish the metadata statement “This resource – has Publication, Distribution, etc. – ‘Edinburgh : Castle Press, 2012.’” as an RDF triple.
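
As a rough illustration, the substitution described above could be sketched in Python as follows. The function name and the regular expression are assumptions made for this example, not part of any MARC21 tooling, and the sketch assumes every subfield code is a single lowercase letter or digit.

```python
import re

# A subfield indicator ($) followed by its one-character code (assumed a-z or 0-9).
SUBFIELD_MARKER = re.compile(r"\$[a-z0-9]")

def aggregate_statement(tag_content):
    """Replace each indicator-and-code pair with a space, then trim the ends."""
    return SUBFIELD_MARKER.sub(" ", tag_content).strip()

print(aggregate_statement("$aEdinburgh :$bCastle Press,$c2012."))
# Edinburgh : Castle Press, 2012.
```

The result only reads well because the punctuation is already embedded in the subfield values; the function itself adds nothing but spaces.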

The instructions for deriving the tag-level value or aggregated statement from the subfield values are known as a syntax encoding scheme (SES). This is part of the Dublin Core abstract model, allowing specific SESs to be used in an application profile. There can be many different ways of deriving the value; the example above works because MARC21 subfields contain embedded punctuation that delineates the component parts when the subfield encoding is removed. This simple SES allows a MARC21 record to conform to the syntax prescribed by the International Standard Bibliographic Description (ISBD) for compound statements. Unfortunately, this makes it difficult to apply any other SES to the subfields without first removing the punctuation.

It would be much better if the instructions for adding ISBD punctuation to MARC21 data were embedded in an SES. Then a different SES could produce “Published in 2012 by Castle Press in Edinburgh” rather than “Published in 2012. by Castle Press, in Edinburgh :”. This is the approach taken by ISBD itself, and there is clearly an opportunity here for collaboration between the MARC21 and ISBD communities. The same approach is envisaged for RDA.
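
To make the idea concrete, here is a minimal sketch of two encoding schemes applied to the same punctuation-free subfield values. It assumes the ISBD punctuation has first been removed from the data, here by a naive helper that would also mangle values legitimately ending in a full stop, such as abbreviations. All function names are hypothetical.

```python
import re

def parse_subfields(tag_content):
    """Split '$aFoo$bBar' into {'a': 'Foo', 'b': 'Bar'}; the first value for a code wins."""
    subfields = {}
    for code, value in re.findall(r"\$([a-z0-9])([^$]*)", tag_content):
        subfields.setdefault(code, value)
    return subfields

def strip_isbd_punctuation(value):
    """Naively drop trailing ISBD punctuation and spaces."""
    return value.rstrip(" :,.;/")

def isbd_ses(clean):
    """Encoding scheme that adds ISBD punctuation back for display."""
    return f"{clean['a']} : {clean['b']}, {clean['c']}."

def sentence_ses(clean):
    """Alternative encoding scheme producing a plain sentence."""
    return f"Published in {clean['c']} by {clean['b']} in {clean['a']}"

raw = parse_subfields("$aEdinburgh :$bCastle Press,$c2012.")
clean = {code: strip_isbd_punctuation(value) for code, value in raw.items()}
print(isbd_ses(clean))      # Edinburgh : Castle Press, 2012.
print(sentence_ses(clean))  # Published in 2012 by Castle Press in Edinburgh
```

With the punctuation held in the scheme rather than in the data, either output can be produced from the same stored values.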

The aggregated statement technique is also very useful when a MARC tag is repeated. Using tag 260 again as an example, a record may contain multiple publication statements for intervening publishers, where the tag’s first indicator has value “2”. If there are two such tags, then there may be two or more publication places and two or more publisher names, for example “$32001-2005$aEdinburgh :$bMudhut Publishing” and “$32006-$aEdinburgh :$bCastle Press” (subfield 3 is for Materials specified). A linked data representation of the record needs to keep the places, names, and dates correctly associated so that they don’t get mixed up, for example “Mudhut Publishing” with “2006-”. The tag-level RDF property (m21plus:T260) can be used with an aggregated statement to keep the level 0 data associated with the correct repeat of the tag, avoiding the use of blank nodes in the RDF graph of a specific record.

RDF graph of MARC21 Publication statement data


As the graph shows, the two Publication statements must have URIs so that they can link to the correct subfield values. The URIs identify the literal strings of the aggregated statements, and are instances of an SES; all SESs are sub-classes of the class of literal strings. A blank node, on the other hand, has no URI and uses a local identifier to make the links; such links appear broken in a non-local environment.
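
The association between each repeat of the tag and its own subfield values might be sketched like this. The record URI, the statement URI pattern and the subfield-level property names (m21plus:T260a and so on) are invented for illustration; only m21plus:T260 comes from the example earlier in this post.

```python
import re

def subfields(tag_content):
    """Return (code, value) pairs for a string such as '$aFoo$bBar'."""
    return re.findall(r"\$([a-z0-9])([^$]*)", tag_content)

def publication_triples(record_uri, tag_contents):
    """Give each repeat of tag 260 its own URI and hang its subfields off it."""
    triples = []
    for n, content in enumerate(tag_contents, start=1):
        # Hypothetical URI pattern for the n-th publication statement.
        statement_uri = f"{record_uri}#T260-{n}"
        triples.append((record_uri, "m21plus:T260", statement_uri))
        for code, value in subfields(content):
            # Hypothetical subfield-level property names.
            triples.append((statement_uri, f"m21plus:T260{code}", value))
    return triples

triples = publication_triples(
    "http://example.org/record/1",
    [
        "$32001-2005$aEdinburgh :$bMudhut Publishing",
        "$32006-$aEdinburgh :$bCastle Press",
    ],
)
for triple in triples:
    print(triple)
```

Because every subfield value hangs off the URI of its own statement, “Mudhut Publishing” can only ever be read together with “2001-2005”.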

To sum up, it seems useful to represent MARC21 tags as RDF properties associated with a syntax encoding scheme. We intend to add these properties to the Open Metadata Registry. Specific encoding schemes can then be assigned using an application profile. There must be many examples of instructions for processing tag subfields for output and display which can form the basis of suitable encoding schemes.

by Gordon Dunsire at May 20, 2012 10:08 PM

Scott, Dan

Running libraries on PostgreSQL: PGCon 2012 talk

On Friday, May 18th I gave a talk at the PGCon 2012 conference on the use of PostgreSQL by the Evergreen project. My talk fell in the case study track, which meant that I had been asked to describe to PostgreSQL developers what Evergreen was and why it was a project they might want to care about, to enumerate the advantages that Evergreen gets from using PostgreSQL, and to explain where our project has some difficulties with PostgreSQL.

I have given a lot of talks before, but I’m used to being on the developer side of the discussion. In this case, the tables were turned; with noted PostgreSQL contributors like Josh Berkus, Chris Brown, Simon Riggs, and Robert Treat in the audience, I was a user talking to the developers of something that I was very much dependent on and which I understood at a much more basic level than they did. This was both liberating and humbling; it definitely adds some perspective to my experiences as a developer in the Evergreen project.

Along with my slides, the whole talk has been professionally recorded - both video and audio - thanks to Heroku’s sponsorship, so you will be able to relive each and every word if you really want to. I’ll summarize the main points that I wanted to convey to the PostgreSQL developers:

  • I was quite candid that most libraries can’t afford dedicated database administrators, and that therefore the more that PostgreSQL can provide reasonable out-of-the-box configuration settings, the better. For example, results from the survey that I sent out at the last minute (THANK YOU to the nine sites that responded!) showed many sites running with a default statistics target of 50, whereas the default had been increased to 100 back in PostgreSQL 8.4 and much higher settings are often recommended to help the planner make its decisions. That said, my survey didn’t ask for table-level statistics settings (did you know that you could change the statistics for particular tables?), so perhaps some sites are using higher statistics levels for particular tables and a lower default threshold.

  • It was probably hokey, but I noted that, as libraries are often called the heart of their community, PostgreSQL was effectively the heart of Evergreen, and I invited the PostgreSQL community to help our heart beat faster. With the Evergreen Oversight Board contemplating a strategic investment fund for initiatives that will have a long-term benefit to Evergreen, this might be an avenue for getting PostgreSQL experts to assist us on areas that represent particular bottlenecks (beyond helping us out of the goodness of their own hearts). As well, I invited the PostgreSQL community to join in advocacy efforts to get their local libraries to consider adopting Evergreen.

  • I described, at a high-level, many of the PostgreSQL features that Evergreen relies on (full-text search, stored procedures, Hstore, inheritance) and tried to convey why our schema takes up 355 tables (and counting) to deal with what, from outside a library perspective, must seem like a relatively simple problem to deal with. And of course I gave most of the credit for Evergreen’s PostgreSQL-savviness on multiple levels to Mike Rylander.
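
For sites without a database administrator, a quick check of the statistics setting mentioned in the first point above might look like the following sketch. It only inspects the text of a postgresql.conf file for default_statistics_target and builds the ALTER TABLE ... ALTER COLUMN ... SET STATISTICS statement that PostgreSQL offers for per-column overrides. The table and column names are placeholders, and the helper names are made up for this example.

```python
import re

# An uncommented default_statistics_target line; the "=" is optional in postgresql.conf.
SETTING = re.compile(
    r"^[ \t]*default_statistics_target[ \t]*=?[ \t]*'?(\d+)'?", re.MULTILINE
)

def configured_statistics_target(conf_text):
    """Return the last uncommented value in the file text, or None if it is unset."""
    values = SETTING.findall(conf_text)
    return int(values[-1]) if values else None

def column_statistics_sql(table, column, target):
    """Build the statement that overrides the statistics target for one column."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} SET STATISTICS {target};"

sample_conf = """
#default_statistics_target = 100   # commented out, so ignored here
default_statistics_target = 50
"""
print(configured_statistics_target(sample_conf))
print(column_statistics_sql("example_table", "example_column", 1000))
```

A result of None means the file does not set the value, so the server's built-in default applies.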

The talk was well-received, based on a number of people who approached me afterward to continue the discussion. Josh called it one of the first times he had seen a presentation designed to solicit assistance directly from the developers in attendance (I probably overplayed the "help us poor harried library system administrators" hand) and thought that it hit the mark for a case study; similarly, Simon was quite interested in Evergreen’s adoption patterns with (I suspect) an eye towards offering possible consulting in administration and optimization efforts.

On the "immediate takeaways" from that talk:

  • For straightforward connection pooling, pgbouncer is the current recommendation over the more flexible but more complicated pgpool-II.

  • Recent versions of Slony have lifted limitations that bit us in the past, like the inability to replicate a TRUNCATE command.

  • Solr, as a potential alternative to PostgreSQL’s full-text search, is seen as fast but brittle to manage, and adds overhead to maintain consistency with the contents of the database. (I’m not so sure about the brittleness, given HathiTrust’s ability to run a massive Solr index, but it is worth following up on…)

  • Streaming replication in 9.1 has improved significantly over 9.0, although you’ll still want to have WAL archiving in case of disaster.

I have a lot more to say about the intersection of the PostgreSQL and Evergreen communities in general, but on the whole I think that a closer relationship has been long overdue. I was delighted that Ben Shum and Robin Isard were both able to attend the conference, and I firmly believe that building more PostgreSQL development and administration expertise within the Evergreen community is critical to our long-term success. While I have long been an advocate of pointing community members to the documentation of the underlying infrastructure components for specific administration recommendations, I believe that effective PostgreSQL tuning and administration is so critical to the successful implementation of a production Evergreen site that we should add a section to the Evergreen documentation containing a small set of considerations and/or processes for going into production—and I hope to start that relatively soon.

by dan@coffeecode.net (Dan Scott) at May 20, 2012 05:57 PM

Voss, Jakob

Goethe explains the Semantic Web

Ever since Google introduced the “Knowledge Graph” a few days ago, the Semantic Web community has been grumbling. Google is simply stealing ideas and techniques that have been developed for years under the names “Linked Data” and “Semantic Web”, and reselling the whole thing under a different name! I find both the outrage and the thoughtless use of words like “Knowledge” and “Semantic” on both sides silly.

Fantasies of thinking machines that present “facts” as if they were objective judgments without social origin or context have simply become mainstream. Yet even with artificial intelligence, it is and remains people who decide what computers link and present. As Frank Rieger just wrote in the FAZ:

They are “our machines”, not “the machines”. They have [...] no consciousness, no will, no intentions. They are designed, built and deployed by people who pursue intentions and goals with them – following the zeitgeist, mostly the maximization of profit and positions of power.

In a weaker form, the mistaken belief in knowing computers shows up in the focus on “information”, while in most cases it is data that is being processed. Instead of a “Knowledge Graph” I would therefore prefer a “Document Graph” in which the origin of statements and the changes made to them can be traced. Ted Nelson, the inventor of hypertext, coined the term “Docuverse” for this. As he writes in his correction of Tim Berners-Lee: “not ‘all the world’s information’, but all the world’s documents.” This transparency, however, is not in Google’s interest, and for the Semantic Web community, handling statements about statements is simply too much effort.

So I had to laugh out loud when Google opened another blog post, on the publication of weighted word lists, with a quotation from Goethe’s Faust:

Yet in each word some concept there must be…

In the “Docuverse”, this quotation would be embedded by transclusion so that the path back to the original could be traced. Here is the context of the quotation, from Wikisource (in my own translation):

Mephistopheles: [...] On the whole – stick to words! Then you will pass through the safe gate into the temple of certainty.

Student: But there must be a concept with the word.

Mephistopheles: Very well! Only one must not torment oneself too anxiously; for just where concepts are lacking, a word turns up at the right time. With words one can argue splendidly, with words devise a system, in words one can believe splendidly, from a word not one iota can be taken away.

On closer inspection, the answer of Google (and not only Google) to the student’s quoted objection resembles the devil’s answer, where the “system” being “devised” for us here is an algorithmic one that rests not on concepts but on word lists and other statistical methods.

In the Zeitschrift für kritische Theorie, Marcus Hawel argues with regard to this very quotation of Goethe’s (or Google’s) that concepts remain uncritical as long as they merely “duplicate” what exists in a positivist manner, without taking into account the “ought-to-be of the thing” (cf. Adorno). But when Google, the Semantic Web or any other computer system is granted normative power, the fun stops (and not only because of the paradoxes of deontic logic). It seems to me that the semantic knowledge world lacks language criticism, semiotics and critical theory.

by jakob at May 20, 2012 01:49 PM

del.icio.us

@ranti @mreidsma @aaroncollie an article on hackathons, since a mini one was mentioned for #code4lib http://t.co/Pq7lsIW2 – ksattler (ksattler) http://twitter.com/ksattler/status/204021499177336834

by ranti at May 20, 2012 05:17 AM

Learning From Hackathons and How Not to Fail at One

by ksattler at May 20, 2012 01:31 AM

May 19, 2012

del.icio.us

Main Page - Code4Lib

by heycarey at May 19, 2012 08:49 PM

Grimmelmann, James

Orphan Works and Error Costs

This is a brief observation about the central role that error costs must play in any discussion of orphan works policy. I made it at the Berkeley orphan works conference last month. By popular request (okay, by one person’s request), I’m putting it online here.

The first, and most obvious, errors are those made by copyright owners whose works become orphaned. Works don’t end up orphaned unless there’s been a mistake by the copyright owner. They make mistakes about whether they’re copyright owners, about whether they’re findable, and about whether there’s a potential audience interested in their works.

But these errors interact with errors made by potential users of the works. If users knew with certainty whether copyright owners would emerge and object to possible uses, there’d be no orphan works problem, because every search would lead either to genuine negotiations or to use without fear of suit. False negatives expose users to the risk of being sued and copyright owners to mistaken uses; false positives chill use without benefitting copyright owners.

And finally, there are error costs in the judicial system, which magnify the effects of errors at the previous two stages. They award remedies more than sufficient to compensate copyright owners, or they fail to award sufficient remedies. And the same problems face any system for dealing with orphan works: it could mistakenly declare that works are orphan when they’re not, or vice versa, and that searches were diligent when they weren’t, or vice-versa.

The point is that the ubiquity of errors isn’t just an incidental feature of the orphan works debate: it’s the defining reality that causes there to be an orphan works problem at all, and with which any response to the problem must grapple.

by James Grimmelmann (james@grimmelmann.net) at May 19, 2012 03:46 PM

Schneider, Jodi

Google Docs ‘research’ tab

Increasingly, I’m using Google Docs with collaborators. Yesterday, one of them pointed out the new “Research” search tab within Google Docs. (Tools->Research). I’m a bit surprised that your searches don’t show up on your collaborators’ screen. I’m particularly surprised that sharing searches doesn’t seem possible.

Google Docs' new 'Research' tab promotes search within Google Docs.

Apparently, it is pretty new. More at the Google Docs blog.

by jodi at May 19, 2012 01:38 PM

Engard, Nicole

WordPress Plugins I’ve Used



This week I taught a workshop in FL for PLAN on WordPress. I was asked what Plugins I use/recommend so I thought I should share a list. I went through all of my sites and grabbed the link from the plugins page that said ‘Visit Plugin Site,’ but I recommend searching the WordPress Plugins database for these and installing them from there.

Depending on the type of site I use all of the following on my WordPress sites:

Related posts:

  1. Speeding up WordPress Dashboard
  2. WordPress Automatic Upgrade
  3. Menus in WordPress 2.7

by Nicole at May 19, 2012 12:32 PM

Farkas, Meredith

Setting priorities



In academic libraries, there are usually so many levels of priorities. There are the priorities of the university. There are the priorities of the library. Each unit probably has its own priorities, as does each individual. Ideally, these all sync up nicely, where an individual can show how their priorities mesh with the library’s and university’s priorities. However, it’s not always easy for the library to support all of those university priorities. That’s often because the library doesn’t have the people-power or financial resources to do everything well. So the library has to choose whether they follow every university priority in a superficial way, or whether they focus on the priorities that they can accomplish well in light of limited resources. Neither is a completely satisfying choice.

At my library, and really at the University as a whole, there is definitely a tug-of-war going on between the original access mission of the University and the growing importance of research. Clearly both are important and both require library support. My colleagues are deeply committed to both roles, but it’s frustrating when you know you can’t do it all as well and completely as you’d like. You can’t develop a vibrant scholarly communications and data management program AND have a comprehensive program of outreach and instruction to the neediest students when the same people are involved in both. And yes, we’re doing all of those things, but not to the extent that we’d like to. Having been at a small place before, we certainly dealt with those limitations too (we still don’t have an institutional repository at Norwich), but the expectations of the academic community were lower because we weren’t a large research institution. And in light of budget cuts, I’m sure many, many academic libraries are feeling similarly frustrated by what they can’t do (or do enough of).

And this tug-of-war is seen in the instruction program as well. We can’t do all of the teaching we’d like given our staffing, so we have to prioritize. But how? With the growing research priority, do we focus more on faculty outreach and graduate-level instruction? With the focus on Freshman retention, do we put more time and effort in teaching first-year students? We have a strong liaison program and a ton of teaching goes on in upper-level undergraduate classes, especially those that are core to majors (like research methods). This is fantastic! I remember when I got to Norwich, very little library instruction was going on outside of the lower-division classes and we worked hard over the years to get information literacy instruction integrated into core courses in the majors. PSU has been there for a long time. Is that less important than reaching Freshman or more? Or is there, as I suspect, no one right answer to that question?

So how do we set priorities? How do we determine how much focus to put on each thing we do? A colleague recently showed me stats on what percentage of the total enrollment each class (Freshman, Sophomore, Junior, etc.) makes up. Do we use that to determine our instructional priorities? Do we say “sophomores make up x% of student enrollment, so we will provide x% of our teaching in 200-level classes?” It’s certainly a concrete way of making decisions, and probably as good as any, but I don’t feel like needs and priorities translate so easily to exact numbers and percentages. We still need to take into account University priorities, student needs, what classes are the most valuable to be involved with, and in what classes we can make the greatest difference. If someone comes up with a formula for figuring this out, they deserve some kind of award.

Another thing we talk a great deal about is using learning objects to augment and/or replace the one-shot. And I’ve started to wonder where is the best place in the curriculum to implement this? Should we replace Freshman-level instruction with online learning modules because most students are not really at an emotional/intellectual space yet where they are capable of serious research or do we focus on face-to-face instruction because they need the high-touch approach? Do we employ learning objects in upper-division classes because the students are more self-motivated once they’re in their majors, or is that the critical time to connect with them because the sort of research they’re doing is higher-level? Do we stop teaching grad students face-to-face because of their much higher motivation level, or is that the perfect reason to focus on them? I don’t know if there have been studies on this, but it would be interesting to figure out at which level does it make the most sense to provide face-to-face instruction and at what level would students benefit most from learning objects. It seems like most suites of learning objects designed to replace face-to-face instruction happen at the Freshman level, but that might just be because there are so many sections of the same few courses and it’s easier to create something that works for many, many, many classes.

None of these issues is unique to my University; in fact, I’d argue that in a world of rising materials costs and shrinking budgets, they’re pretty darn universal. Even at little old Norwich, where the student/librarian ratio was so much smaller, we had to prioritize. It got to a point where I had to start cutting down on the number of history classes I was teaching, because it was taking up such a disproportionate amount of my time (although I really enjoyed it!). So, at your institutions, how have you determined what to prioritize in terms of library instruction? When demand for your services exceeds supply, what do you stop doing? Where have you replaced face-to-face instruction with other lower-touch models and why?

by Meredith Farkas at May 19, 2012 04:56 AM

Bisson, Casey

Composited Timelapse and Real-Time Skateboarding Video

Russell Houghten’s Open Horizon is part skate film, part time lapse, and mostly awesome.

Then somebody pointed to this Jimmy Plumer/Z-Flex video that shares a number of features with Houghten’s work, but is less ambitious in scope. At least they did a behind-the-scenes video that shows the sweet Red camera and rails.

by Casey Bisson at May 19, 2012 03:54 AM

Gorman, Jonathan

Unglue.it launched!

Those following me in my various social circles are probably already sick of hearing about this, but unglue.it launched yesterday.  Unglue.it, started by Eric Hellman (aka @GlueJar) and other folks w/ connections to Code4Lib, is an effort to release copyrighted books to the world.  It works with rights holders, Kickstarter-style, to raise enough money to license an ebook under a Creative Commons non-commercial license.  It's a way of "front-loading" profits so the author can be compensated for their work, but the world gets access.

They also have a mechanism for adding books to a "wishlist" that will give them an indication of works that people want and what rights holders they should track down.

This is a brilliant way to deal with some of those really important and hard-to-find out-of-print books.  For example, I've wishlisted a very good biography of Philip K. Dick by Gregg Rickman called To the High Castle.  It's a work that I stumbled across that's sought out by a small circle of Philip K. Dick fans.

It's not clear if there's enough demand for another printing, but unglue.it offers a chance that it could be made easily available again while also giving the author further profit that he otherwise won't see from this long-since sold-out book.  I have had the good fortune to read it because I work at a major library that has access to a ton of books, but I know many a sci-fi fan who doesn't have the resources I do.

I also must admit that I'm interested to see if this model works.  I've thought about trying to do a small hobby side business of making value-added public domain works and perhaps doing something similar to unglue.it with near-orphaned copyright works.  However, tracking down rights holders has proved troublesome enough that it's remained in my large bucket o' ideas I'd like to do someday.  I'm hoping unglue.it takes off enough that it'll create an infrastructure that might make it easier to do projects of this nature.

Here's what I'm pledging to:

 

by Jon Gorman at May 19, 2012 12:19 AM

May 18, 2012

Grimmelmann, James

GBS: Oral Argument Report in HathiTrust

Yesterday, Judge Baer held an hour-long hearing in the HathiTrust case. Although most of the time was spent on procedural matters, the Authors Guild’s lead attorney, Edward Rosenthal, did a very effective job leveraging them into substantive points.

The first problem for the court was a discovery dispute. Some of the plaintiffs live far from New York and have objected to having their depositions taken there, and a fourth, J.R. Salamanca, is in ill health and bedridden. After some discussion not worth recounting, the defendants’ attorneys agreed to take the deposition of Salamanca’s literary agent instead, and the two sides agreed on logistical arrangements for the others.

The most significant consequence of the deposition skirmish is that the close of discovery has been effectively pushed back. It had been scheduled to be finished by May 20, which is self-evidently impractical now that some depositions won’t even happen until next week. Instead, it now appears that discovery will last until June 8. This fact puts pressure on the schedule for summary judgment. Judge Baer had asked for the motions for summary judgment to be fully briefed by July 20. But allowing the necessary time for each side to respond to the other’s papers means that the actual motions would need to be filed in mid-June, i.e. uncomfortably close to the end of discovery. Judge Baer at one point asked the parties if they could finish their briefing by the start of July so he could “put it under his pillow” when he goes away for the month. They agreed to go off and discuss the schedule, but I’d be quite surprised if the summary judgment deadlines were moved up.

And this scheduling tempest will spill over beyond its teapot: it seems likely to shape how the case will be argued. Joseph Petersen, appearing for the HathiTrust, tried to suggest that a quick ruling from Judge Baer on the motions for partial judgment on the pleadings (HathiTrust’s on associational standing and the Authors Guild’s on the applicability of copyright defenses) would help winnow the issues in the case, making for more narrowly focused summary judgment motions. Judge Baer wasn’t buying. He said, gruffly, that he was inclined to hold over these issues and decide them together with the summary judgment motions. This isn’t good news for HathiTrust, for reasons shortly to become apparent.

The first phase of the substantive oral argument dealt with HathiTrust’s motion to have the Authors Guild and other associations removed from the case for lack of standing (leaving only the individual plaintiffs). W. Andrew Pequignot delivered the argument in a style familiar to anyone who’s watched a moot court. He gave a clear, but completely wooden, summary of HathiTrust’s argument against the associations, focusing on the argument that each copyright plaintiff must prove individual ownership of the works on which they sue, so that an association would need to present individual facts for every one of its members. The judge tried to ask him what practical difference associational versus individual standing would make if HathiTrust ended up losing on the merits, a question which raises subtle questions about the scope of a possible injunction, but Pequignot didn’t engage with the question.

Ed Rosenthal then gave the Authors Guild’s reply, and showed why he’s the chair of his firm’s IP and litigation groups. The defendants copied ten million books, he said, in an act of “preemptive mass digitization,” and now they want to look at individual books in evaluating standing. The response to Rosenthal’s point, if there is one, is that the Copyright Act really does require proof of individual ownership, a requirement that has nothing to do with whether the infringer is accused of copying one book or a million. Rosenthal could have replied by saying that this would leave copyright owners without a way to challenge mass infringement, and the defendants’ natural surreply would have been that individual lawsuits would be more appropriate. But that last point was precisely the question Pequignot ducked—thereby ceding not only much of the standing issue but also the rhetorically intuitive high ground.

That mattered, because Rosenthal used the standing issue as a pivot to his argument that HathiTrust’s copying was substantively impermissible under the Copyright Act. Having set up the issue as a mass challenge to mass digitization, he was ready to roll with his argument that Section 108 provides the only relevant permission for copying here, permission that HathiTrust has far exceeded in copying books wholesale rather than retail. Thus, he claimed, the associations were the perfect plaintiffs to mount a program-wide challenge.

HathiTrust’s next moot court argument came from Allison Roach, who argued that no one had standing to challenge the Orphan Works Program since no identifiable books with copyrights owned by any of the plaintiffs had been made available or were in imminent likelihood of being made available through the OWP. Judge Baer was skeptical, saying that he was bothered that the libraries did “all of this” before there was an opportunity for plaintiffs to complain. Roach said that no books had been made available, only a list of candidates, and that the plaintiffs were asking for an injunction against the entire Orphan Works Project without concrete facts about specific books it would infringe.

Rosenthal’s response here was a little less vivid. He emphasized that the University of Michigan had set up a mechanism for its orphan works. Some plaintiffs found their books on the list; the University suspended the program. If, he argued, this meant there was no right to object because there was currently no program, then there would never be a circumstance in which the program’s legality could be addressed. Any copyright owner who tried to object would be defined out of the class of copyright owners with standing to object, and this couldn’t be. (His point illustrates why the standing argument may be too clever by half when it comes to the Orphan Works Program, and why suspending the program might end up being ineffective in insulating it from judicial review.)

This brought the court to the plaintiffs’ motion for judgment on the pleadings that the libraries couldn’t raise fair use, Section 108, or other Copyright Act defenses. Here, Rosenthal led off by arguing that Congress passed a specific statute with directions for libraries, which the defendants disregarded. He then acted annoyed that the defendants, in their responses (see the bottom of our page on the case) characterized this as a broader attempt to stop libraries from claiming fair use, ever. No, Rosenthal said, the plaintiffs don’t argue against other library uses, just that they can’t digitize every book. They chose to scan in a large project, and the burden should be on them to justify that project. Once again, it was an oral advocacy gem.

Joseph Petersen then gave a rebuttal that ran through HathiTrust’s brief. The plaintiffs, he said, tried to argue that libraries have no fair use rights, but only the specific rights granted in Section 108. When shown how absurd it would be to claim that libraries alone in society have no fair use rights, the plaintiffs changed course and argued that the case isn’t about library copying in general, but only about this program. And this, he said, showed why this issue wasn’t appropriate for the “rule 12” context (i.e. a motion for judgment on the pleadings): it obviously depends on specific facts about the libraries and what they’re doing. He then recounted, quickly, some of the libraries’ arguments about the symbiosis between Section 108 and fair use, about the noncommerciality of the project, and about the text of Section 108.

He was followed by Daniel Goldstein, on behalf of the National Federation of the Blind. He ran through some of the history of accessibility of books to the blind, and emphasized that digitizing books brings the number of accessible titles from tens of thousands to tens of millions. Now, blind and visually disabled students can access HathiTrust’s digital database (when they provide appropriate certification of their disability). They’re the only group that has access to the database, but now they have the same access to the books themselves that sighted students have. He used this to argue that the plaintiffs’ assertions about categorical exclusions from fair use and other copyright defenses would tell libraries that they can’t make the copies for the print-disabled that they need to make to comply with the Americans with Disabilities Act and the Rehabilitation Act.


All in all, yesterday’s skirmish was a minor one in the arc of the case. The discovery disputes were sorted out, and the schedule will be. Because Judge Baer strongly signaled that he’ll put the immediate motions off until he considers the summary judgment motions, that just puts the interesting and important issues off until the even more interesting and important summary judgment ruling.

Still, the skirmish was a clear win for the Authors Guild and its co-plaintiffs. Rosenthal made common-sense arguments about standing that—from the audience at least—seemed like they were persuasive to Judge Baer. He leveraged his responses to the defendants’ motions on standing to bolster his own argument on the applicability of fair use. And because Judge Baer is likely to hold the present motions over, he put the defendants in the difficult position of arguing that they are entitled to a blanket fair use defense at the same time as they argue that fair use is a fact-specific inquiry requiring individual participation.

The defendants’ decision to press the standing issues, at least in the way they did, now appears like a mistake. Both at the hearing and in the case overall, the plaintiffs have been able to use their responses to the standing motion to wrong-foot HathiTrust and take control of the case’s timing. As readers of this blog know, I don’t think much of the plaintiffs’ own judgment on the pleadings motion, but I have to give them and their lawyers credit for using it at the hearing to define the narrative of the case on their terms. They chose their counsel well.

The other matter on display yesterday is how different Judge Baer is from Judge Chin. Where Chin’s attitude is generally thoughtful and gentle, Baer tends more towards the gruff and the impatient. (It may not have helped that the hearing was sandwiched between three criminal matters and an afternoon of conferences, and that Judge Baer’s schedule, as he announced, had no room for lunch.) His fast-track schedule for the case is an indication of where Baer’s priorities lie, and my sense is that he saw the hearing more as a way to keep the case moving properly than as an occasion for deep reflection on the issues.

Assuming no curveballs, the next major dates in these cases will be in mid-June, when a variety of major motions will fall due. Motions for summary judgment will be due June 14 in Authors Guild, the visual artists’ motion for class certification will be due June 13, and summary judgment motions in the HathiTrust case will arrive somewhere around then, too, depending on what the parties agree to.

by James Grimmelmann (james@grimmelmann.net) at May 18, 2012 08:57 PM

OCLC Dev Network

Enhancements for OCLC Web services set for 20 May 2012

The planned install this weekend has a number of enhancements designed to make your life better, as a developer. Of course, I say this with a caveat because it still feels like step 2 of 132 that we have envisioned for having all the services look, act and smell the way we want them to. But, “walk before you run” is what we keep telling ourselves.

What you need to know:

Principle to Principal

Small but important change for all WorldShare (WMS) APIs except NCIP:

by alicesneary at May 18, 2012 07:23 PM

Prom, Chris

Yahoo Mail Download

Courtesy of Seth Shaw and Ben Goldman, who pointed this out to me, I’d like to take note of an email program that has a very specialized, but potentially important, use: the YPOPs program.  This is a Windows application which can be used to establish an IMAP or POP3 connection to a Yahoo email account, from which you would be able to download email for preservation purposes.

The project seems to be dormant, but there is a project website at http://ypopsemail.com/   and a sourceforge site at http://sourceforge.net/projects/yahoopops/.

by Chris Prom at May 18, 2012 06:44 PM

Powell, Andy and Johnston, Pete

Big Data - size doesn't matter, it's the way you use it that counts

...at least, that's what they tell me!

Here's my brief take on this year's Eduserv Symposium, Big Data, big deal?, which took place in London last Thursday and which was, by all the accounts I've seen and heard, a pretty good event.

The day included a mix of talks, from an expansive opening keynote by Rob Anderson to a great closing keynote by Anthony Joseph. Watching either, or both, of these talks will give you a very good introduction to big data. Between the two we had some specifics: Guy Coates and Simon Metson talking about their experiences of big data in genomics and physics respectively (though the latter also included some experiences of moving big data techniques between different academic disciplines); a view of the role of knowledge engineering and big data in bridging the medical research/healthcare provision divide by Anthony Brookes; a view of the potential role of big data in improving public services by Max Wind-Cowie; and three shorter talks immediately after lunch - Graham Prior talking about big data and curation, Devin Gafney talking about his 140Kit twitter-analytics project (which, coincidentally, is hosted on our infrastructure) and Simon Hodson talking about the JISC's big data activities.

All of the videos and slides from the day are available at the links above. Enjoy!

For my part, there were several take-home messages:

As I mentioned in my opening to the day, Eduserv's primary interest in Big Data is somewhat mundane (though not unimportant) and lies in the enabling resources that we can bring to the communities we serve (education, government, health and other charities), either in the form of cloud infrastructure on which big data tools can be run or in the form of data centre space within which physical kit dedicated to Big Data processing can be housed. We have plenty of both and plenty of bandwidth to JANET so if you are interested in working with us, please get in touch.

Overall, I found the day enlightening and challenging and I should end with a note of thanks to all our speakers who took the time to come along and share their thoughts and experiences.

by AndyP at May 18, 2012 03:44 PM

Murray, Peter

Unglue.It — a service to crowdsource book licensing fees — launches

You could say “this is a service to watch” but that would be missing the point. Yesterday the ‘Unglue.It’ service launched as a way to crowdsource the funding of a fee to authors to release their own works under a Creative Commons license.


Unglue.It's launch announcement

At its heart, it is an experiment to see if authors can raise the funds they think are appropriate to allow readers to make unlimited digital copies of their works. It is an alternative to the existing model where single users pay individual licensing fees to download books that are encumbered by forms of digital rights management (DRM).

This is a new model that doesn’t have an equivalent in the analog, print world or the existing digital, ebook world. In the analog world, the right of first sale reigns supreme; as a purchaser of the book, you have the right to do just about anything you want with that copy: you can give it to a friend, you can sell it at a garage sale, you can donate it to a library. Since it only exists as a physical object, once you give it up you no longer have access to it.

In the digital world, exact replicas are as easy as clicking and dragging a file to a flash drive, posting it to a website, or e-mailing it to friends. And you still retain the original copy. The concern for this sort of proliferation of copies is why many (most?) publishers insist on using various forms of DRM. DRM is a royal pain in the butt, though, (and publishers know it) because most forms don’t allow you to convert a file to be read across various readers (e.g. from Kindle to Nook). And you don’t really “own” the file; the DRM software “phones home” resulting in cases where files are removed from devices and files that are permanently locked because “home” has shut down.

Enter the Unglue.It model. Rights holders set a price and promise that, once it is reached, they will release a digital version of the work, unencumbered by DRM, under a Creative Commons license. (The CC BY-NC-ND 3.0 worldwide non-commercial, no-derivatives license is the default.) Readers who want to see that happen pledge an amount that they choose towards the campaign. If the accumulated pledges reach the price by the campaign deadline, readers’ credit cards are charged, the author receives their money (minus a commission fee to the Unglue.It service), and posts the ebook somewhere for anyone to download and read. Rights holders can also offer incentive premiums, such as signed posters from cover illustrators, virtual author visits, and opportunities to interact with the author as they create a new work. If the amount isn’t reached by the deadline, the campaign fails: no one’s credit card is charged and the book isn’t released.
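For the programmers in the audience, the all-or-nothing rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as I understand it from the launch materials, not Unglue.It's actual code or data model: the `Campaign` class, its field names, and the 10% commission used in the usage note are all made up for the example.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Campaign:
    """Hypothetical all-or-nothing campaign (not Unglue.It's real data model)."""
    target: Decimal            # price set by the rights holder
    commission_rate: Decimal   # assumed service fee as a fraction, e.g. Decimal("0.10")
    pledges: dict = field(default_factory=dict)  # reader name -> amount pledged

    def pledge(self, reader: str, amount: Decimal) -> None:
        # A pledge is only a promise; nothing is charged at this point.
        self.pledges[reader] = self.pledges.get(reader, Decimal("0")) + amount

    def total(self) -> Decimal:
        return sum(self.pledges.values(), Decimal("0"))

    def settle(self) -> dict:
        """Run at the deadline: either every pledge is charged or none is."""
        total = self.total()
        if total < self.target:
            # Campaign failed: no cards charged, book not released.
            return {"released": False, "charged": Decimal("0"), "payout": Decimal("0")}
        fee = total * self.commission_rate
        return {"released": True, "charged": total, "payout": total - fee}
```

With a target of 100 and an assumed 10% commission, a single pledge of 40 settles as a failed campaign with nothing charged; a further pledge of 60 makes it succeed, with 100 charged and 90 paid out to the rights holder.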

The economics of this are fascinating. Publishers will let titles go out-of-print when they think the market for new copies reaches a point where maintaining the inventory is no longer cost effective. In some cases, print-on-demand from devices like the Espresso Book Machine has filled that void for single copies, but no one is really sure what a work is worth as time marches on. The Unglue.It model enables us to see what the value of some of these works really is after their initial publication by creating a market for wholesale, after-first-run, unlimited copy licenses. It will be interesting to see how these first few campaigns go and how the market settles out.

The book I'm supporting to unglue.

So I’m into this experiment. I have pledged to unglue Riverwatch by Joseph Nassise. The author has put up a page describing his own reasoning for being a part of the inaugural Unglue.It campaigns:
The Unglue.it funding process serves two very important ends – it helps writers and content providers like myself get paid for the work we produce while at the same time making that work freely available to readers all over the world. No one else, anywhere, is doing anything like this and I couldn’t be more excited about being a part of this ground-breaking launch.

So, rather than to be “something to watch”, go ahead and register at Unglue.It, participate in one of the campaigns or add books to your wish list to encourage other rights holders to join, and see where this new model takes us.

by Peter Murray at May 18, 2012 03:40 PM

Rosenthal, David

Dr. Pangloss' Notes From Dinner

The renowned Dr. Pangloss greatly enjoyed last night's inaugural dinner of the Storage Valley Supper Club, networking with storage industry luminaries, discussing the storage technology roadmap, projecting the storage market, and appreciating the venue. He kindly agreed to share his notes, which I have taken the liberty of elaborating slightly.
Tip of the hat to Jim Handy for correcting my math.

by David. (noreply@blogger.com) at May 18, 2012 10:25 AM

Prom, Chris

University of Illinois Archives: Position Available

The University of Illinois Archives is currently searching for a full-time Archival Operations and Reference Specialist.  The position will report directly to me and will have responsibilities for overseeing the American Library Association Archives and for providing reference services for University Archives.  Although this is initially a two-year appointment, there is a possibility that it could be extended or made into a continuing appointment, depending on funding.  It offers the ability to work with a wide range of archival functions, from pre-custodial work through access.

A copy of the position description is available at https://jobs.illinois.edu/default.cfm?page=job&jobID=18473.

If you are interested in learning more about this job, please contact me via email.

by Chris Prom at May 18, 2012 02:02 AM

Hellman, Eric

We Made Some Matches, Lighting Them is Up To You


Unglue.it launched at noon yesterday. It was a good day, which means nothing awful happened. That's what launching a new website is like.

Here are some stats for our first 12 hours:

The best performing campaign was for Ruth Finnegan's 1970 classic "Oral Literature in Africa", which raised 7% of its target in just 12 hours.  2 campaigns aimed at younger audiences lagged. If you care about six year olds or sixth graders, perhaps you'll consider supporting these.

A good half day, but we need to do better.

The publishing establishment is sneering at us. Its collective voice was captured by "Publishers Lunch" in 1 tweet,  2 dismissive words and 4 dots:
Hope Springs.... Unglue.it Prepares to Launch Effort to Make Books Free Via Crowdsourcing
(Nothing against Publishers Lunch; they're the smartest journalists in the publishing biz; their best and worst moments are when they snipe at mistakes in the Wall Street Journal and the New York Times. And notice how they didn't fill in the dots, leaving room for whatever really happens!)

But when it comes right down to it, Unglue.it hasn't changed a thing. All we've done is to create a new tool. Some self-striking matches. If you want anything to actually change, it's entirely up to you. You have to light the matches.

If you want book publishing and book reading to be chained to a pay-per-copy pretend-it's-print economic model, forget about Unglue.it and go buy ebooks locked to your Kindle and monitored by Amazon.

On the other hand, if you want ebooks that can't be taken away from you and don't subject you to surveillance by Amazon or Apple or Adobe, then it's time to cast your ballot for a different path.

If you want libraries to be able to focus on putting ebooks in front of readers rather than enforcing digital rights management and putting friction into every library transaction, then now is the time to act.

If you want to reward creators who made the books that you love instead of feeding a voracious supply chain that manages to spit a few pennies of royalties to an author for every $14.99 out of your pocket, then now is the time to send a message to that publishing establishment.

Because there are a lot of smart people in the business of creating books. I've met a lot of them. And at least 5 of them are also courageous. (Don't worry, more are coming!) But they are powerless to change anything without YOU. Whether you're a reader, a librarian, a school teacher or a scholar, it's you who gets to decide.

So do it. Unglue.it.

Update 3PM: The second 12 hours weren't bad either.






by Eric Hellman (noreply@blogger.com) at May 18, 2012 01:44 AM

May 17, 2012

Cryer, Phil

HOWTO create a normal MySQL user

I found this online, and it’s a perfect example of a bad habit I’ve been trying to clean up for some time. When I’m trying out software that needs a MySQL database, I’m used to running create database foo; but not creating a specific user for that database. Sure, if it’s in the install steps it’s easy...
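As a rough illustration of the habit in question, here is a small Python sketch that generates the MySQL statements for a database plus a user restricted to it, instead of a bare create database. The function name, the example names `foo` and `foo_app`, and the choice to grant all privileges on the one database are assumptions made for this example; the linked post may well do it differently.

```python
import re

# Conservative whitelists (an assumption): letters, digits and underscore for
# names; the same plus '.', '-' and the '%' wildcard for the host part.
_IDENT = re.compile(r"[A-Za-z0-9_]+")
_HOST = re.compile(r"[A-Za-z0-9_.%-]+")


def dedicated_user_sql(database, user, password, host="localhost"):
    """Return MySQL statements creating a database and a user limited to it."""
    for name in (database, user):
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not _HOST.fullmatch(host):
        raise ValueError(f"unsafe host: {host!r}")
    # Escape backslashes and single quotes so the password stays inside its
    # SQL string literal.
    pw = password.replace("\\", "\\\\").replace("'", "''")
    return [
        f"CREATE DATABASE `{database}`;",
        f"CREATE USER '{user}'@'{host}' IDENTIFIED BY '{pw}';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{user}'@'{host}';",
    ]
```

Calling `dedicated_user_sql("foo", "foo_app", "s3cret")` returns three statements that can be pasted into the mysql client as an administrative user; the application then connects as `foo_app` rather than as root.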

by fak3r at May 17, 2012 07:04 PM

Tennant, Roy

Pay to Free a Book

Eric Hellman of Openly Informatics fame (subsequently bought by my employer OCLC) has launched his new project: Unglue.it. The site uses a crowd-sourced funding (or “crowdfunding”) model to raise enough money to pay book authors to open up their books as ebooks for free. As described on the site:

Unglue.it is a place for individuals and institutions to join together to give their favorite ebooks to the world. We work with rights holders to decide on fair compensation for releasing a free, legal edition of their already-published books, under Creative Commons licensing. Then everyone pledges toward that sum. When the threshold is reached (and not before), we collect the pledged funds and we pay the rights holders. They issue an unglued digital edition; you’re free to read and share it, with everyone, on the device of your choice, worldwide.

This follows the model of sites like Kickstarter.com, where individuals pledge various amounts to support projects. Like Kickstarter, Unglue.it offers various rewards pegged at specific pledge amounts as compensation to contributors. Also like Kickstarter, each book “campaign” on Unglue.it has an end date.

For example, the campaign for Riverwatch by Joseph Nassise is well underway, but nearly $25,000 must be raised in 43 days to open it up for everyone. One of the benefits offered is a 45-minute Skype video chat for your library or reading group for $150.

I’ll be watching this with interest, to see if it is indeed possible to open up books for everyone this way. Who knows? I may even be pledging.

 

by Roy Tennant at May 17, 2012 05:04 PM

Bradley, Fiona

Unintended consequences

One of the outcomes I’m looking for in this series of visits is unintended consequences. Did things happen that we didn’t predict, or plan for in the project? Did things change for the better, or worse? Here in Peru, I’ve been surprised and impressed at how the project has spun off across the country. A small pilot project to visit schools to promote libraries and librarianship as a profession. Collaborations with municipal governments on the development of new regional libraries. Workshops to sensitise mayors to the public library of today. An association that is supporting libraries and library workers across the country, whether they are members or not (in Latin America, many associations are hamstrung by legislation that requires associations, or Colegios, to only admit professional graduates as members).

The project has also had an impact on the library association serving Peru’s Amazon region and peoples. Under threat from resource exploitation and displacement, the association is a critical advocate for preservation of cultural heritage, traditional information, and lifestyles in the country’s northern jungles. The challenges they describe in preserving lifestyles and culture remind me very much of an excellent comic book produced by Survival International a few years ago, “There you go” about the problems associated with imported approaches to sustainable development. If you haven’t seen it, I highly recommend it.

This has been a welcome surprise because we didn’t set out to have an impact on this association as well. They’ve made use of the workshops, and I’m looking forward to sharing what they are doing with others – I cannot think of another association that has such a specific, dedicated advocacy purpose although of course there are many working on traditional knowledge and cultural issues around the world.

Change in Peru is on the up, with more sustainable and environmental initiatives being promoted by the government and a focus on inclusion, both of which are positive for libraries. For more on this, see Beyond Access’ blog, which talks about their recent visit here.

I heard on the weekend while cycling around Barranco and climbing the Temple of the Sun at Pachacamac that pride in local culture and patrimony is “trendy” amongst Peruvians right now (pride in local culture, visits to local sites, purchase of local products), which does bode well for cultural preservation initiatives.

Now, for me, after a great few days in Peru during what they call “Winter” (it’s tshirt weather) it’s time to head some 13,000km North East. Next stop: Ukraine!

by Fiona at May 17, 2012 01:28 PM

Crosstech (CrossRef)

CrossRef and DataCite unify support for HTTP content negotiation

Last year CrossRef and DataCite announced support for HTTP content negotiation for DOI names. Today, we are pleased to report further collaboration on the topic. We think it is very important that the two largest DOI Registration Agencies work together in order to provide metadata services to DOI names.

The current implementation is documented in detail at http://crosscite.org/cn. The documentation explains HTTP content negotiation as implemented by both Registration Agencies and provides a list of supported content types.

An example application of HTTP content negotiation is a citation formatting service. You can try it at http://crosscite.org/citeproc. This service will accept DOIs from both CrossRef and DataCite, unlike the previous formatting service which accepted only CrossRef DOI names (http://citation.crrd.dyndns.org). This is possible because CrossRef and DataCite support a shared, common metadata format. When you input a DOI into the formatting service, it doesn't know where the DOI was registered. The service will make an HTTP content negotiation request to the global DOI resolver specifying which format of the metadata should be returned in the HTTP Accept header. The global DOI resolver will notice (Accept header!) that this is not a regular DOI resolution request; it will turn to CrossRef or DataCite accordingly for the relevant metadata instead of redirecting to a landing page. The format of metadata is shared between both registration agencies so the formatting service can interpret it without knowledge of the DOI origin.

In summary, HTTP content negotiation lets you process a DOI's metadata without knowledge of its origin or specifics of the registration agency.
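
As a rough illustration of the request described above (a sketch, not the formatting service's actual code), a client only has to set the Accept header when it asks the resolver about a DOI. The resolver address, the example DOI and the helper names are assumptions made for this sketch; the content type should be checked against the list at http://crosscite.org/cn before relying on it.

```python
import urllib.parse
import urllib.request

# Assumption: the global DOI resolver address; adjust if yours differs.
RESOLVER = "http://dx.doi.org/"


def metadata_request(doi, content_type):
    """Build (but do not send) a content negotiation request for a DOI."""
    url = RESOLVER + urllib.parse.quote(doi)
    # The Accept header is what tells the resolver to return metadata
    # instead of redirecting to the landing page.
    return urllib.request.Request(url, headers={"Accept": content_type})


def fetch_metadata(doi, content_type):
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(metadata_request(doi, content_type)) as resp:
        return resp.read().decode("utf-8")


# Hypothetical DOI, used only to show the shape of the request.
req = metadata_request("10.1234/example", "application/rdf+xml")
print(req.full_url, req.get_header("Accept"))
```

Nothing in the request depends on whether the DOI was registered with CrossRef or DataCite, which is the point of the shared approach.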

If you have any problems, email us at tech@datacite.org or labs@crossref.org. For general discussion please kindly leave a comment below.

May 17, 2012 01:17 PM

Fiacre O'Duinn

DreamVendor: a 3D printer vending machine

At universities across the country, you’ll find soda and snack vending machines in the hallways of higher learning. I wouldn’t doubt you might also find some machines selling beer. But, here’s something different: in the lobby of Virginia Tech’s College of Engineering, they’ve got a vending machine that is one of the coolest things I’ve seen: it delivers creations, 3D printed ones.

It’s called the DreamVendor, a mysterious sounding name, which actually pretty accurately describes it. It’s a one-of-a-kind, interactive 3D printing station to enable Virginia Tech students to freely and quickly fabricate prototypes for their academic, and even personal, design projects.

I think this offers an excellent example of how 3D printing could be done at public libraries. Anyone want to take on the challenge?

(via 3D Printer)

by Fiacre at May 17, 2012 02:50 AM

May 16, 2012

Singer, Ross

Installing yaz, yazpp and metaproxy on RHEL 6.2

Here are the steps I just took to install metaproxy (which requires yaz and yaz++) on Red Hat Enterprise Linux 6.2. The reason for this exercise is that Indexdata’s RPMs don’t work for 6.2 (the versions of boost-devel and icu-devel they require seem to only be available in 5.5). Since I expect Indexdata to eventually release 6.2 compatible RPMs, I installed all of this into /opt/local (so it’s easy to remove; of course, if you’re already using /opt/local, you might want to try somewhere else). Also, this assumes you’ll put a metaproxy.xml in /opt/local/etc/metaproxy/, so keep that in mind.

  1. yum install boost boost-devel icu icu-devel libxml2 libxml2-devel gnutls gnutls-devel libxslt libxslt-devel gcc-c++ libtool
  2. Install yaz:
    1. wget http://ftp.indexdata.dk/pub/yaz/yaz-4.2.33.tar.gz
    2. tar -zxvf yaz-4.2.33.tar.gz
    3. cd yaz-4.2.33
    4. ./configure --prefix=/opt/local
    5. make
    6. make install
  3. Install yaz++
    1. wget http://ftp.indexdata.dk/pub/yazpp/yazpp-1.3.0.tar.gz
    2. tar -zxvf yazpp-1.3.0.tar.gz
    3. cd yazpp-1.3.0
    4. ./configure --prefix=/opt/local/ --with-yaz=/opt/local/bin
    5. make
    6. make install
  4. Install metaproxy
    1. wget http://ftp.indexdata.dk/pub/metaproxy/metaproxy-1.3.36.tar.gz
    2. tar -zxvf metaproxy-1.3.36.tar.gz
    3. cd metaproxy-1.3.36
    4. ./configure --prefix=/opt/local --with-yazpp=/opt/local/bin/
    5. make
    6. make install
  5. cd /opt/local
  6. mkdir etc; mkdir etc/metaproxy; mkdir etc/sysconfig
  7. Copy this gist as /etc/rc.d/init.d/metaproxy
  8. chmod 744 /etc/rc.d/init.d/metaproxy
  9. Copy this gist as /opt/local/etc/sysconfig/metaproxy
  10. chkconfig --add /etc/rc.d/init.d/metaproxy
  11. /etc/init.d/metaproxy start
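
The three source builds in steps 2 to 4 follow the same pattern, so they could be generated rather than typed. This is only a dry-run sketch: it prints the commands instead of running them, and the helper names (build_commands, plan) are made up for illustration. The versions, URLs and configure flags are the ones listed above.

```python
PREFIX = "/opt/local"

# (name, version, extra configure flags), in dependency order, as listed above.
PACKAGES = [
    ("yaz", "4.2.33", []),
    ("yazpp", "1.3.0", ["--with-yaz=" + PREFIX + "/bin"]),
    ("metaproxy", "1.3.36", ["--with-yazpp=" + PREFIX + "/bin/"]),
]


def build_commands(name, version, extra_flags, prefix=PREFIX):
    """Return (working directory, argv) pairs for fetching and building one package."""
    tarball = "%s-%s.tar.gz" % (name, version)
    srcdir = "%s-%s" % (name, version)
    url = "http://ftp.indexdata.dk/pub/%s/%s" % (name, tarball)
    return [
        (".", ["wget", url]),
        (".", ["tar", "-zxvf", tarball]),
        (srcdir, ["./configure", "--prefix=" + prefix] + extra_flags),
        (srcdir, ["make"]),
        (srcdir, ["make", "install"]),
    ]


def plan():
    """Return the full command list for all packages, in build order."""
    steps = []
    for name, version, flags in PACKAGES:
        steps.extend(build_commands(name, version, flags))
    return steps


if __name__ == "__main__":
    # Dry run: show what would be executed, one command per line.
    for cwd, argv in plan():
        print("(cd %s && %s)" % (cwd, " ".join(argv)))
```

Changing PREFIX in one place is the main convenience if /opt/local is already in use on your machine.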

by Ross at May 16, 2012 08:08 PM

del.icio.us

Midwest - Code4Lib

by ksattler at May 16, 2012 07:48 PM

Ng, Cynthia

The Downsides of a CMS in Keeping Up: WordPress & HTML5

As a web developer, I cringe at deprecated code and try my best to keep up to date, which right now means familiarizing myself with HTML5 and CSS3. In reflecting on how best to update our website, I realized that with a CMS, naturally some things are out of my control.

Giving Up Control & Relying on Developers

Whether it’s the core or plugins, users of a CMS are reliant on its developers to keep things up to date. Is that loss of control worth the benefits? Generally, I would say yes, but that doesn’t stop me from wishing that the technology we use would adopt new specifications.

WordPress & HTML5

Image Tags & Properties

I think it’s interesting that in HTML5 there are now the figure and figcaption elements. If they are taken advantage of, I think it definitely helps to parse information in a webpage and to identify text that is directly related to images.

One thing that does bother me about WordPress (which actually has nothing to do with HTML5) is that it forces users to have a title, and leaves alt text blank by default. I don’t know what the best solution may be, but I would propose inserting the title text into the alt text by default and then allowing the user to change it. If they want to leave it blank, then there should be a checkbox to mark it “intentionally left blank” or something. Perhaps this could be an admin option, but I would definitely want something like that since I would really like to force our users to have alt text, but I don’t want to touch the WP core obviously.

Text Formatting Tags

It’s a bit of a minor thing, and while some may argue about the usefulness of the different semantic tags, users of the rich text editor would have no notion that they’re using <strong> instead of <b> or <em> instead of <i>. While I admit that even I struggle with the appropriate use of each (I have to look it up every time I think about it), if we want to see widespread adoption, then we need to get users to think about their writing and what they intend to do when using any of strong, em, b, i.

Tables

While we avoid tables and they should never be used for layouts, users will still want to insert tables to display data without resorting to an image. I’ve always wondered why WordPress doesn’t have a table insertion button even under the kitchen sink. What worries me is that users who have a basic knowledge of HTML will then insert tables themselves using the HTML view, with improperly formed code.

Layout & Forms

You might wonder why I’d lump the two, and that’s because, other than (using the default) comment form, both of these are dependent on a WordPress setup.

Forms will generally depend on the plugin. Similarly, whether the layout is in HTML5 is very dependent on the theme, along with many elements of accessibility.

Unfortunately, while HTML5 themes are relatively easy to find, most form plugins do not tell you whether they are using HTML5 or how much of it.

Why Not Adopt HTML5

I do realize that while there are a number of advantages to HTML5, especially in terms of structure, it’s still in development. Working in an educational institution, it’s also more work, and in some cases difficult, to ensure backwards compatibility.

In particular, screen readers do not necessarily support all the new HTML5 elements and will frequently ignore whole chunks of text or have difficulty with reading links, etc. Even the newest versions of screen readers do not necessarily recognize elements and properties designed to make webpages easier for screen readers to interpret.

I would like to think that since WordPress talks about trying to be accessible that anything in the WordPress core will be updated once there is widespread adoption not only among browsers, but also screen readers. Obviously, adoption will take time though. For example, many form input types have been adopted by most browsers, but have not been adopted by IE at all (they will be in IE10).

One can only hope that adoption will pick up once the various parts of the HTML5 specifications are ‘cemented.’


Filed under: Web design

by Cynthia at May 16, 2012 06:50 PM

LITA

Jobs in Library Technology: May 16

New vacancy listings are posted weekly on Wednesday at approximately 11:00 a.m. Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Associate University Librarian for Learning and Engagement, Oregon State University Libraries & Press, Corvallis, OR

Business Liaison Librarian, The Pennsylvania State University Libraries, University Park, PA

Senior Library Systems Analyst, Lehigh University, Bethlehem, PA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

by vedmonds at May 16, 2012 04:56 PM

Leggott, Mark

Thank the Flying Spaghetti Monster, There is Still Leadership in This World

UBC announced today that they ARE NOT signing the Access Copyright agreement. This is a sensible decision and one that speaks to what every university in Canada should do. If you are a Canadian institution that has not yet signed on to the Access Copyright/AUCC "license" please forward this to your decision-makers and encourage the same approach. I love the statement encapsulating why:

We believe we are taking the bolder, more principled and sustainable option, which best serves the fundamental and long-term interests of our academic community.

Finally. Leadership that stands up for the fundamental rights and freedoms of learners.

by mleggott at May 16, 2012 12:15 AM

May 15, 2012

del.icio.us

The Code4Lib Journal - Communicat: The Next Generation Catalog That Almost Was…

by verwirrung at May 15, 2012 08:57 PM

The Code4Lib Journal - Beyond OPAC 2.0: Library Catalog as Versatile Discovery Platform

by verwirrung at May 15, 2012 08:57 PM

LITA

Don’t Miss the Top Tech Trends Panel in Anaheim!

We have a great panel this year for Top Tech Trends at the ALA Annual Conference in Anaheim! The panelists will describe changes and advances in technology that they see having an impact on the library world, and suggest what libraries might do to take advantage of these trends. Presentation of LITA Awards and Scholarships will take place prior to the Top Tech Trends program.

When: Sunday, June 24, 2012 – 1:30pm – 3:30pm
Where: Anaheim Convention Center, Ballroom A

The panelists:

 

by mprentice at May 15, 2012 08:47 PM

del.icio.us

blueheadpublishing/bookshop (ebook ruby gem, mobi, epub, etc)

by jrochkind at May 15, 2012 08:30 PM

Evergreen ILS

Return of the Evergreen Newsletter

The May 2012 edition of the Evergreen newsletter focuses on the April International Conference in Indianapolis, Indiana.

You can read the full text of the newsletter by visiting the following Evergreen wiki page.

To submit your own entries for the June newsletter, you can email Amy Terlaga at terlaga@biblio.org.

 

by Amy Terlaga at May 15, 2012 06:29 PM

Summers, Ed

diving into VIAF

Last week saw a big (well big for library data nerds) announcement from OCLC that they are making the data for the Virtual International Authority File (VIAF) available for download under the terms of the Open Data Commons Attribution (ODC-BY) license. If you’re not already familiar with VIAF here’s a brief description from OCLC Research:

Most large libraries maintain lists of names for people, corporations, conferences, and geographic places, as well as lists to control works and other entities. These lists, or authority files, have been developed and maintained in distinctive ways by individual library communities around the world. The differences in how to approach this work become evident as library data from many communities is combined in shared catalogs such as OCLC’s WorldCat.

VIAF’s goal is to make library authority files less expensive to maintain and more generally useful to the library domain and beyond. To achieve this, VIAF seeks to include authoritative names from many libraries into a global service that is available via the Web. By linking disparate names for the same person or organization, VIAF provides a convenient means for a wider community of libraries and other agencies to repurpose bibliographic data produced by libraries serving different language communities

More specifically, the VIAF service: links national and regional-level authority records, creating clusters of related records and expands the concept of universal bibliographic control by:

If you went and looked at the OCLC Research page you’ll notice that last month the VIAF project moved to OCLC. This is evidence of a growing commitment on OCLC’s part to make VIAF part of the library information landscape. It currently includes data about people, places and organizations from 22 different national libraries and other organizations.

Already there has been some great writing about what the release of VIAF data means for the cultural heritage sector. In particular Thom Hickey’s Outgoing is a trove of information about the project, which provides a behind-the-scenes look at the various services it offers.

Rather than paraphrase what others have said already I thought I would download some of the data and report on what it looks like. Specifically I’m interested in the RDF data (as opposed to the custom XML, and MARC variants) since I believe it to have the most explicit structure and relations. The shared semantics in the RDF vocabularies that are used also make it the most interesting from a Linked Data perspective.

Diving In

The primary data structure of interest in the data dumps that OCLC has made available is what they call the cluster. A cluster is essentially a hub-and-spoke model with a resource for the person, place or organization in the middle that is attached via the spokes to conceptual resources at the participating VIAF institutions. As an example here is an illustration of the VIAF cluster for the Canadian archivist Hugh Taylor

Here you can see a FOAF Person resource (yellow) in the middle that is linked to from SKOS Concepts (blue) for Bibliothèque nationale de France, The Libraries and Archives of Canada, Deutschen Nationalbibliothek, BIBSYS (Norway) and the Library of Congress. Each of the SKOS Concepts has its own preferred label, which you can see varies across institutions. This high level view obscures quite a bit of data, which is probably best viewed in Turtle if you want to see it:

<http://viaf.org/viaf/14894854>
    rdaGr2:dateOfBirth "1920-01-22" ;
    rdaGr2:dateOfDeath "2005-09-11" ;
    a rdaEnt:Person, foaf:Person ;
    owl:sameAs <http://d-nb.info/gnd/109337093> ;
    foaf:name "Taylor, Hugh A.", "Taylor, Hugh A. (Hugh Alexander), 1920-", "Taylor, Hugh Alexander 1920-2005" .

<http://viaf.org/viaf/sourceID/BIBSYS%7Cx90575046#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/BIBSYS> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/BNF%7C12688277#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/BNF> ;
    skos:prefLabel "Taylor, Hugh Alexander 1920-2005" ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/DNB%7C109337093#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/DNB> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/LAC%7C0013G3497#skos:Concept>
    a skos:Concept ;
    skos:inScheme <http://viaf.org/authorityScheme/LAC> ;
    skos:prefLabel "Taylor, Hugh A. (Hugh Alexander), 1920-" ;
    foaf:focus <http://viaf.org/viaf/14894854> .

<http://viaf.org/viaf/sourceID/LC%7Cn++82148845#skos:Concept>
    a skos:Concept ;
    skos:exactMatch <http://id.loc.gov/authorities/names/n82148845> ;
    skos:inScheme <http://viaf.org/authorityScheme/LC> ;
    skos:prefLabel "Taylor, Hugh A." ;
    foaf:focus <http://viaf.org/viaf/14894854> .
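
To make the hub-and-spoke shape concrete, here is a small sketch in plain Python (no RDF library) that walks from the hub back out to the spokes. The triples are copied from the Turtle above, with the concept URIs shortened to their source codes for brevity; the function name spokes is made up for illustration.

```python
PERSON = "http://viaf.org/viaf/14894854"

# (subject, predicate, object) triples copied from the Turtle above.
# Subjects are abbreviated to the source code (BIBSYS, BNF, ...) for brevity.
TRIPLES = [
    ("BIBSYS", "foaf:focus", PERSON),
    ("BIBSYS", "skos:prefLabel", "Taylor, Hugh A."),
    ("BNF", "foaf:focus", PERSON),
    ("BNF", "skos:prefLabel", "Taylor, Hugh Alexander 1920-2005"),
    ("DNB", "foaf:focus", PERSON),
    ("DNB", "skos:prefLabel", "Taylor, Hugh A."),
    ("LAC", "foaf:focus", PERSON),
    ("LAC", "skos:prefLabel", "Taylor, Hugh A. (Hugh Alexander), 1920-"),
    ("LC", "foaf:focus", PERSON),
    ("LC", "skos:prefLabel", "Taylor, Hugh A."),
]


def spokes(triples, hub):
    """Return {concept: preferred label} for every concept whose foaf:focus is hub."""
    concepts = {s for s, p, o in triples if p == "foaf:focus" and o == hub}
    return {s: o for s, p, o in triples if p == "skos:prefLabel" and s in concepts}


labels = spokes(TRIPLES, PERSON)
for source in sorted(labels):
    print(source, "->", labels[source])
```

The distinct preferred labels across the five concepts are the same three strings that show up as foaf:name values on the person resource.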

The Numbers

The RDF Cluster Dataset http://viaf.org/viaf/data/viaf-20120422-clusters.xml.gz is 2.1G gzip compressed RDF data. Rather than it being one complete RDF/XML file, each line has a complete RDF/XML document on it, which represents a single cluster. All in all there are 20,379,541 clusters in the file.

I quickly hacked together an rdflib filter that reads the uncompressed line-oriented RDF/XML and writes the RDF as ntriples:

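import sys

import rdflib

# Each input line is a complete RDF/XML document describing one cluster.
for line in sys.stdin:
    g = rdflib.Graph()
    g.parse(data=line, format='xml')
    # Serialize this cluster's triples as N-Triples.
    nt = g.serialize(format='nt')
    # Older rdflib releases return bytes here; newer ones return str.
    if isinstance(nt, bytes):
        nt = nt.decode('utf-8')
    sys.stdout.write(nt)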

This took 4 days to run on my (admittedly old) laptop. If you are interested in seeing the ntriples let me know and I can see about making it available somewhere. It is 2.8G gzip compressed. An ntriples dump might be a useful version of the RDF data for OCLC to make available, since it would be easier to load into triplestores, and otherwise muck around with (more on that below) than the line oriented RDF/XML. I don’t know much about the backend that drives VIAF (has anyone seen it written up?)…but I would understand if someone said it was too expensive to generate, and was intentionally left as an exercise for the downloader.

Given its line-oriented nature, ntriples is very handy for doing analysis from the Unix command line with cut, sort, uniq, etc. From the ntriples file I learned that the VIAF RDF dump is made up of 377,194,224 assertions or RDF triples. Here’s the breakdown on the types of resources present in the data:

Resource Type       Number of Resources
skos:Concept        26,745,286
foaf:Document       20,379,541
foaf:Person         15,043,112
rda:Person          15,043,112
foaf:Organization   3,722,318
foaf:CorporateBody  3,722,318
dbpedia:Place       195,472
Here’s a breakdown of predicates (RDF properties) that are used:

RDF Property        Number of Assertions
rdf:type            84,851,159
foaf:focus          45,510,716
foaf:name           44,729,247
rdfs:comment        41,253,178
owl:sameAs          32,741,138
skos:prefLabel      26,745,286
skos:inScheme       26,745,286
foaf:primaryTopic   20,379,541
void:inDataset      20,379,541
skos:altLabel       16,702,081
skos:exactMatch     8,487,197
rda:dateOfBirth     5,215,150
rda:dateOfDeath     1,364,355
owl:differentFrom   1,045,172
rdfs:seeAlso        1,045,172
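
For anyone who would rather not chain cut, sort and uniq, roughly the same predicate tally can be sketched in Python. This is not the pipeline I actually ran, just an equivalent under the assumption that each line is a well-formed N-Triples statement; the function name and the three sample lines are for illustration only.

```python
from collections import Counter


def count_predicates(lines):
    """Tally the predicate (second term) of each N-Triples line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace in N-Triples,
        # so splitting twice isolates them even if the object is a literal
        # with spaces in it.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


# Three sample statements in the shape of the VIAF data.
SAMPLE = [
    '<http://viaf.org/viaf/14894854> <http://xmlns.com/foaf/0.1/name> "Taylor, Hugh A." .',
    '<http://viaf.org/viaf/14894854> <http://xmlns.com/foaf/0.1/name> "Taylor, Hugh Alexander 1920-2005" .',
    '<http://viaf.org/viaf/14894854> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .',
]

counts = count_predicates(SAMPLE)
for predicate, n in counts.most_common():
    print(n, predicate)
```

Passing sys.stdin instead of SAMPLE would let it run over the whole uncompressed dump, since it only keeps the counter in memory.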

I’m expecting these statistics to be useful in helping target some future work I want to do with the VIAF RDF dataset (to explore what an idiomatic JSON representation for the dataset would be, shhh). In addition to the RDF, OCLC also makes a dump of link data available. It is a smaller file (239M gzip compressed) of tab delimited data, which looks like:

...
http://viaf.org/viaf/10014828   SELIBR:219751
http://viaf.org/viaf/10014828   SUDOC:052584895
http://viaf.org/viaf/10014828   NKC:xx0015094
http://viaf.org/viaf/10014828   BIBSYS:x98003783
http://viaf.org/viaf/10014828   LC:24893
http://viaf.org/viaf/10014828   NUKAT:vtls000425208
http://viaf.org/viaf/10014828   BNE:XX917469
http://viaf.org/viaf/10014828   DNB:121888096
http://viaf.org/viaf/10014828   BNF:http://catalogue.bnf.fr/ark:/12148/cb13566121c
http://viaf.org/viaf/10014828   http://en.wikipedia.org/wiki/Liza_Marklund
...

There are 27,046,631 links in total. With a little more Unix commandline-fu I was able to get some stats on the number of links by institution:

Institution                                       Number of Links
LC NACO (United States)                           8,325,352
Deutschen Nationalbibliothek (Germany)            7,732,546
SUDOC (France)                                    2,031,452
BIBSYS (Norway)                                   1,822,681
Bibliothèque nationale de France                  1,643,068
National Library of Australia                     977,141
NUKAT Center (Poland)                             894,981
Libraries and Archives of Canada                  674,088
National Library of the Czech Republic            598,848
Biblioteca Nacional de España                     519,511
National Library of Israel                        327,455
Biblioteca Nacional de Portugal                   321,064
English Wikipedia                                 301,345
Vatican Library                                   247,574
Getty Union List of Artist Names                  202,711
National Library of Sweden                        161,845
RERO (Switzerland)                                119,366
Istituto Centrale per il Catalogo Unico (Italy)   45,208
Swiss National Library                            33,866
National Széchényi Library (Hungary)              33,727
Bibliotheca Alexandrina (Egypt)                   26,877
Flemish Public Libraries                          4,819
Russian State Library                             997
Extended VIAF Authority                           109
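
The per-institution tally comes down to grouping the second column by its source prefix. Here is a sketch of that grouping in Python, using the sample rows shown earlier; it is not the exact commands I used. The "Wikipedia" bucket name and the function names are assumptions for the sketch, and mapping codes such as SELIBR to the institution names in the table is left out.

```python
from collections import Counter

WIKIPEDIA_PREFIX = "http://en.wikipedia.org/"

SAMPLE = """\
http://viaf.org/viaf/10014828\tSELIBR:219751
http://viaf.org/viaf/10014828\tSUDOC:052584895
http://viaf.org/viaf/10014828\tNKC:xx0015094
http://viaf.org/viaf/10014828\tBIBSYS:x98003783
http://viaf.org/viaf/10014828\tLC:24893
http://viaf.org/viaf/10014828\tNUKAT:vtls000425208
http://viaf.org/viaf/10014828\tBNE:XX917469
http://viaf.org/viaf/10014828\tDNB:121888096
http://viaf.org/viaf/10014828\tBNF:http://catalogue.bnf.fr/ark:/12148/cb13566121c
http://viaf.org/viaf/10014828\thttp://en.wikipedia.org/wiki/Liza_Marklund
"""


def source_of(target):
    """Return the source code for a link target, e.g. 'DNB' for 'DNB:121888096'."""
    if target.startswith(WIKIPEDIA_PREFIX):
        # Wikipedia links are bare URLs with no source prefix.
        return "Wikipedia"
    # Split on the first colon only, so BNF's embedded URL stays intact.
    return target.split(":", 1)[0]


def count_links(lines):
    """Tally links per source from lines of '<viaf uri> <target>'."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        counts[source_of(fields[1])] += 1
    return counts


counts = count_links(SAMPLE.splitlines())
for source, n in counts.most_common():
    print(source, n)
```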

The 301,345 links to Wikipedia are really great to see. It might be a fun project to see how many of these links are actually present in Wikipedia, and if they can be automatically added with a bot if they are missing. I think it’s useful to have the HTTP identifier in the link dump file, as is the case for the BNF identifiers. I’m not sure why the DNB, Sweden, and LC identifiers aren’t expressed as URLs as well.

One other parting observation (I’m sure I’ll blog more about this) is that it would be nice if more of the data that you see in the HTML presentation were available in the RDF dumps. Specifically, it would be useful to have the Wikipedia links expressed in the RDF data, as well as linked works (uniform titles).

Anyway, a big thanks to OCLC for making the VIAF dataset available! It really feels like a major sea change in the cultural heritage data ecosystem.

by ed at May 15, 2012 05:57 PM

LITA

New LITA Webinar: Social Networking the Catalog

LITA is pleased to announce the availability of a new webinar, “Social Networking the Catalog: Community Based Approaches to Building Catalogs and Collections,” presented by Margaret Heller (Dominican University) and held June 7, 11:00 am – Noon CDT. This presentation will introduce the Read/Write Library Chicago, a new model for libraries that exists to illuminate and create connections between people, materials, and institutions in the city of Chicago. Participants will learn about new trends and features in social reading and cataloging, social library catalogs and integrated library systems, and crowdsourcing platforms. The catalog builds a social network, and the social network in turn builds the catalog. We love the read/write internet. It’s time to create the read/write library, where everyone has a voice in the collections, catalog, programming, and mission. This model is extensible to more conventional libraries through a multitude of tools and approaches that are already widely used.

Additional information and registration are available at http://www.ala.org/lita/learning/online/socialcatalog

by mprentice at May 15, 2012 05:54 PM

del.icio.us

The Code4Lib Journal

by mcbrydemlis at May 15, 2012 03:35 PM

Evergreen ILS

Evergreen 2.2 rc1

Hello everyone,

Evergreen 2.2 rc1 was just released today, 15 May 2012. This is the
release candidate. The Evergreen community hopes that Evergreen 2.2.0
will follow in just about two weeks, depending as always on feedback from
those who contribute their feedback after testing.

This release includes various bug fixes; please see the full list of
changes.

The 2.2 series includes many new features over the 2.1 series, including
the Template Toolkit OPAC (TPAC) and too many others to count.

Please report any new bugs on Launchpad.

I would like to particularly thank Thomas Berezansky, Ben Shum, Jason
Stephenson and Dan Scott for assisting in innumerable ways with the
mechanics of publishing this release candidate. I am surely neglecting a
couple of other folks whose help was invaluable, but at least they have
their karma.

Thanks everyone!

by Lebbeous Fogle-Weekley at May 15, 2012 03:10 PM

D-Lib Magazine

An Introduction to the Current Issue

Editorial by Laurence Lannom, CNRI

May 15, 2012 01:18 PM

Implementing DOIs for Research Data

Article by Natasha Simons, Griffith University, Australia

May 15, 2012 01:18 PM

Metadata Clean Sweep: A Digital Library Audit Project

Article by R. Niccole Westbrook and Dan Johnson, University of Houston Libraries; Karen Carter, Rutgers School of Communication and Information; Angela Lockwood, Texas Woman's University School of Library and Information Studies

May 15, 2012 01:18 PM

Bitcurator: Tools and Techniques for Digital Forensics in Collecting Institutions

Article by Christopher A. Lee, Alexandra Chassanoff, and Kam Woods, University of North Carolina, Chapel Hill; Matthew Kirschenbaum and Porter Olsen, University of Maryland

May 15, 2012 01:18 PM

Information Bulletin on Variable Stars - Rich Content and Novel Services for an Enhanced Publication

Article by Andras Holl, Konkoly Observatory, Budapest, Hungary

May 15, 2012 01:18 PM

In Brief: SPRUCE Project Tackles Digital Preservation Challenges with Hands On Events

May 15, 2012 01:18 PM

In Brief: The TIMBUS Project - Timeless Business Processes and Services

May 15, 2012 01:18 PM

In Brief: Network of Museums, Libraries and Public Cultural Institutions in the EU-funded MeLa Project

May 15, 2012 01:18 PM

In Brief: Orphan Works and Mass Digitization: Obstacles and Opportunities

May 15, 2012 01:18 PM

In Brief: Report on the 14th Asia-Pacific Web Conference

May 15, 2012 01:18 PM

In Brief: Report on the Electronic Resources & Libraries Conference (ER&L 2012)

May 15, 2012 01:18 PM

In Brief: Report on the Distance Education Association of New Zealand (DEANZ) 2012 Conference

May 15, 2012 01:18 PM

Open Knowledge Foundation

OKFestival Topics of 2012 Announced, 2nd Call for Proposals Published, Experimentation Encouraged!

OKFestival 2012 Organising Team

For those looking for yet another reason to join us for OKFestival in Helsinki this September, the OKFestival Core Organising Team is proud to announce the inspiring public outcomes of our unconventional First Call for Proposals – and to request your participation for our Second Call to share your ideas in Finland.

As we’ve noted previously, because OKFestival is the first event of its kind, combining Open Knowledge Conference and Open Government Data Camp together for a week-long celebration of action and collaboration, we decided to take a risk by opening up over 2/3 of the week’s programme to you as festival participants.

So last month, we released the First Call for Proposals, crossing our fingers expectantly as we did it. A few of us on the Core Organising Team (photo) were, admittedly, a tad worried – would global communities rise to the challenge? Or would we be left alone in cyberspace without even a programme to our name? We presented the festival to audiences at FREE CITY in Tallinn, at Re:Publica in Berlin and to local stakeholders in Finland. And we waited in anticipation.

In the end, we didn’t have to worry at all. The response to our First Call for Proposals was both overwhelming and encouraging. Open knowledge and data enthusiasts around the world did take the reins – and now, a month later, we have a groundbreaking, action-focused programme planned in co-operation with citizen teams of Guest Programme Planners all over the world. For a summary of the Open Knowledge Festival planning process in 14 slides, see our first Slideshare presentation here.


As you'll see above, the First Call for Proposals allowed the Core Organising Team to determine the most important themes and salient ideas, the subjects of which are highlighted through our 13 guest-organised Topic Streams of 2012:

  1. Open Democracy and Citizen Movements
  2. Open Government Data
  3. Open Cities
  4. Open Design, Hardware & Manufacturing
  5. Open Cultural Heritage
  6. Open Development
  7. Open Research and Education
  8. Open Geodata
  9. Open Source Software
  10. Data Journalism and Data Visualization
  11. Gender / Diversity in Openness
  12. Open Business and Corporate Data
  13. Open Knowledge and Sustainability

The breadth of these topics is quite diverse - indeed, the variance is somewhat unprecedented for an event of this kind. Going through the topics above and learning more about how their Guest Programme Planners are determining the programming on the Public Planning Wiki, it's hard not to feel a sense of excitement about what's to come.

For the Second (and last!) Call for Proposals, we encourage ideas that further enrich each of these themes with new perspectives. We want your lightning talks, lectures, panel discussions, workshops, hackathons and all things in between. Let's fill Helsinki's streets with innovative new ideas, new collaborations between civil society and government, and new projects that provoke openness in unexpected ways.

It is our hope that together, these themes will illustrate the importance of diverse understandings within open knowledge and open data communities - and we look forward to seeing even more of you get involved in this inspiring process.

The Second Call for Proposals is here. Deadline for submission is June 1st - go to okfestival.org for details. And feel free to mix and remix the Slideshare presentation above for your own uses - it's meant to be shared!

Core Organising Team at work in Helsinki

by Kat Braybrooke at May 15, 2012 12:05 PM

May 14, 2012

Hellman, Eric

Unglue.it Launches on Thursday


I started blogging a little over three years ago. I found that it was a great way to organize my thoughts and it gave me an excuse to talk to people and ask questions about things that interested me. It became an extended conversation with so many readers about the future of libraries and the role of books and readers in our changing society. Also polarons, faster-than-light neutrinos, and log-normal distributions.

But I'm not the type to just write about things. We live in a time where it's easier than ever before for small groups of people to build new things, and if you've been reading the blog, you've had a front row seat to watch the development of such a thing. You've heard the story of sculptors who chip away stone to free the figures trapped inside the rock, or the novelist whose characters struggle to tell their stories. For me, Unglue.it is like that: it's something formed from the raw material of ideas from many people. It just wants to exist.

If you've not been paying attention, Unglue.it is an effort to crowd-fund creative commons ebooks. If you can find a way to cover the fixed costs, you can make the ebooks free to everyone, everywhere. Libraries, who can make possible the effective distribution of these ebooks, are tired of being shut out of popular ebook lending and need new ways forward.

One really exciting thing is that it's not just us. There's starting to be a Movement. Making books more available and more useful to everyone, everywhere is a huge undertaking, and there are a variety of efforts nucleating to address many different bits of the problem. Last week, I got together with Frances Pinter, whose "Knowledge Unlatched" effort could revolutionize scholarly monograph publishing. In April, I got together with Ash Kalb, who's bringing vintage science fiction books back to life at Singularity and Co. I've written here about DPLA, Internet Archive, HathiTrust, Library Renewal, Project Gutenberg and more. We're all on the same team.

This morning, we started the last testing of the Unglue.it machinery before launch. We're using real money. I'm offering to "unglue" an ebook comprised of five blog posts I wrote last year on Open Access eBooks. The campaign will end tomorrow no matter what, and we'll verify that we can collect money through Amazon Payments. (See the Unglue.it blog for the payment processor saga.)

If you want, you can help us test the site. You can enter a pledge (remember, it's real money!) and request premiums. Whether you pledge or not, you'll end up with a real ebook with a CC BY-SA license. You can make derivatives, add content, make translations, experiment. (But you might need to wait a week or two to get it.) We'll use any cash we take in to cover some expenses (like the block of ISBNs that we bought). My lawyer says we can't offer premiums that include alcohol, but she didn't say I couldn't let people hit me up for a beer.

Already we've received a bunch of really great bug reports and suggestions. It turns out that if you want to pledge $100 billion billion, for example, the website isn't going to let you, and it won't give you a sensible error message.

We start "real" campaigns at noon (EDT) on Thursday (fingers crossed). Our launch line-up will have 5 campaigns. Until then we're frantically busy making sure everything is working as well as possible.

See you on the other side.

by Eric (noreply@blogger.com) at May 14, 2012 07:29 PM

Engard, Nicole

Using the News Tool in Koha 3.2 and 3.4



The Koha News Tool allows librarians to post news items to the OPAC and the staff client. In Koha 3.4 the news tool can also be used to add news items to your circulation receipts or slips. This tutorial will walk you through the simple steps of adding and viewing news items.

If you have an idea for a video, please just let me know and I’ll add it to my list of things to record.

Related posts:

  1. Koha Offline Circulation Tool
  2. So much Koha news today
  3. Adding a Child Patron in Koha 3.2

by Nicole at May 14, 2012 03:00 PM

Morgan, Eric Lease

Using OAI-PMH to populate the “Catholic Portal” is not straight-forward

Using OAI-PMH to populate the “Catholic Portal” is not straight-forward, and this posting outlines some of my investigations in this regard.

Introduction

As you may or may not know, OAI-PMH is a “standard” protocol designed for harvesting metadata. It only understands six commands (or in OAI-PMH parlance, “verbs”). These commands are sent to remote computers in the form of URLs, and the remote computer is expected to respond in the form of specifically shaped XML streams. The six verbs are Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord.

Through a conversation of these verbs and the returned XML streams, metadata between computers can be exchanged. It is then up to the computer doing the harvesting to implement some sort of cool and interesting service with the harvested content. Here at Catholic Portal Central we want to index the metadata and provide immediate access to remote digitized content.
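As a rough illustration of how these verbs travel as URLs, here is a minimal Python sketch. The base URL is a placeholder rather than any member's real endpoint, and the helper name `oai_url` is made up for this example; the verb names and the `oai_dc` metadata prefix are the ones defined by the OAI-PMH specification.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord")


def oai_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for one verb plus its arguments."""
    if verb not in VERBS:
        raise ValueError("not an OAI-PMH verb: " + verb)
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Hypothetical repository address, for illustration only.
BASE = "http://example.org/oai"

# Ask the repository to describe itself.
identify_url = oai_url(BASE, "Identify")

# Ask for Dublin Core records from one (assumed) set.
records_url = oai_url(BASE, "ListRecords", metadataPrefix="oai_dc", set="coll6")
```

Fetching either URL should return an XML stream, which the harvester then has to parse; that part is left out here.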

Investigations

At least three Catholic Research Resources Alliance (CRRA) members have OAI-PMH repositories: Duquesne University, Boston College, and Loyola University Chicago. Using a little Perl script, I most recently investigated the content of the repositories of Boston College and Loyola University Chicago. Through this process I learned what metadata formats they supported, what sets were used to subdivide their collections, and output Dublin Core metadata from a few selected sets.

The harvested Dublin Core metadata was typical of OAI-PMH repositories: thin, a bit ambiguous, and somewhat inconsistent across repositories. It is thin because many of the Dublin Core elements are left unpopulated. It is ambiguous because many of the fields are repeated, and the values of repeated elements are of different types. For example, a description field may be empty, contain an abstract of the work, the full text of the work, or the process used to digitize the material. It is inconsistent because things like dates, names, and subject entries are formatted differently. In some places names are listed in first name/last name order. Other times it is last name/first name order. Dates can be anything from “February 12, 2012” to “2012-02-12” to “Twelfth Century”. None of this is new to the world of OAI-PMH. It is typical.

All is not lost. There are patterns to this apparent randomness. Using my script I can sometimes output titles, descriptions, subject headings, and URLs of digitized objects. For example, here is such a list from the Loyola University Chicago repository:

item: 46

key: oai:content.library.luc.edu:coll6/45

title(s): Letter to the Secretary of the Literary Agency of London, 1908
title(s): Catholic Women Poets

identifier(s): cudahy219e3

identifier(s): 003_kayesmith_1908;pg3.jpg

identifier(s): http://content.library.luc.edu/u?/coll6,45

subject(s): Shelia Kaye-Smith; poets; women poets; Catholic poets

subject(s): Local

description(s): third page of letter requesting appointment

description(s): does not suit you any other time up to 4 15 will do Would you kindly send a reply to me c o Miss F E Walters Girton College Cambridge With apologies for troubling you believe me Yours faithfully Sheila Kaye Smith

description(s): Master file scanned at 600 dpi RGB in reflective mode from original document using MicroTek ScanMaker 1000XL

description(s): http://www.luc.edu.archives

type: image

From this output it becomes apparent that the first title is the title of the artifact, the third identifier is the URL of the digitized object, the first subject field is a delimited list of keywords, the first description is a sort of abstract, and the type field contains a value denoting what kind of digitized thing is in question. Thus, the output follows a pattern, and computers are very good at patterns. A computer program could therefore easily be written to read this particular OAI-PMH output and store it in the Portal’s index.
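The positional pattern just described can be sketched in a few lines of Python. This is only an illustration, not the Perl script mentioned earlier: the record is typed in by hand from the listing above (with just the first description), and the output field names (`url`, `keywords`, `abstract`) are invented for the example rather than taken from the Portal's index.

```python
# One harvested record, keyed by Dublin Core element name.
# Values are copied from the Loyola listing above.
record = {
    "title": ["Letter to the Secretary of the Literary Agency of London, 1908",
              "Catholic Women Poets"],
    "identifier": ["cudahy219e3",
                   "003_kayesmith_1908;pg3.jpg",
                   "http://content.library.luc.edu/u?/coll6,45"],
    "subject": ["Shelia Kaye-Smith; poets; women poets; Catholic poets",
                "Local"],
    "description": ["third page of letter requesting appointment"],
    "type": ["image"],
}


def to_index_fields(rec):
    """Apply the observed pattern: pick values by position, split keywords."""
    return {
        "title": rec["title"][0],                 # first title
        "url": rec["identifier"][2],              # third identifier
        "keywords": [k.strip() for k in rec["subject"][0].split(";")],
        "abstract": rec["description"][0],        # first description
        "type": rec["type"][0],
    }


fields = to_index_fields(record)
```

The fragility is visible right away: the mapping depends on the third identifier always being the URL, which holds for this repository only as long as its records keep the same shape.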

Next steps

My next steps are two-fold. First, I will harvest and index some of the metadata from selected Loyola University Chicago OAI-PMH sets. Second, I will let colleagues from various CRRA committees (specifically the Digital Access Committee as well as the Collection Committee) peruse the results. In the end I hope to get feedback on how to proceed. Should I index more content? Less? None? If more, then how should records be displayed, and exactly how ought the Dublin Core metadata be mapped to VuFind’s underlying Solr index fields?

All of this work is entirely feasible. At the same time it is not enormously scalable. Hand-crafting the parsing of OAI-PMH output, and hand-crafting how it all gets mapped to Solr’s index, is time-consuming and fragile. The Portal Home Planet can easily do this work for no more than a dozen different repositories, but after that some other means of production will need to be examined.

by Catholic Portal at May 14, 2012 02:59 PM

Schmidt, Aaron

Consider the Checkout Slip

Call me crazy, but I think the secret life of checkout slips is fascinating.
Some moms use their foot-long slips filled with children’s books as a master list, crossing off items as they’re returned. One regular patron I knew kept every checkout slip she ever received. Upon returning items, she’d ask us to cross off the titles on the original slip and initial it. This behavior was the result of a typical “I returned that”/“Not according to our computers” interaction.

And, of course, countless slips are used as bookmarks or refrigerator-mounted notices or simply left in dust jackets for weeks. However small, these slips are touchpoints—ways that people interact with us—and collectively we’re pumping out thousands of these things daily.

Likewise, in some small way, we’re representing ourselves through these little scraps of paper. Yet, most of us are churning out slips that could be easier on the eye and more helpful to our users.

This isn’t something to keep you up at night, but it’s still worth thinking about, because details matter. All of these little touchpoints add up to create people’s experience of our libraries. And dispensing ugly checkout receipts illustrates that we haven’t spent enough time sweating these details. Even worse, this inattention is at the root of complaints about hard-to-use websites and repeated questions about where the restrooms are.

What is a good checkout slip?
To answer this we have to know what a checkout slip is supposed to do. As I see it, there are a few core functions:

Beyond that, some slips have secondary functions:

With these factors in mind, we can now think of some other factors surrounding the design of an ideal checkout slip:

Checking out the fun
After compiling the functions’ lists, I started to think about whether there was a way to make checkout slips more fun, or whether that was a terrible impulse. More seriously, I considered what would be the minimum amount of information required to make an item easily identifiable and other basic considerations, such as why there is a due date listed for each item when most items share a due date with others. With all of these things in mind, I took a crack at designing a checkout slip.

There’s nothing very different about this design, but I reckon it is a bit easier to use when hanging on a refrigerator than the current crop. Aside from sensible typography, the only thing notable is that items are grouped by due date rather than listing a due date for each item.
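For anyone curious how the grouping might work in an ILS receipt template, here is a small Python sketch. The item titles and dates are invented, and the function names are hypothetical; it only shows the idea of printing each due date once with its items underneath.

```python
from collections import defaultdict
from datetime import date

# Hypothetical checkouts: (title, due date).
checkouts = [
    ("Picture book A", date(2012, 6, 4)),
    ("Picture book B", date(2012, 6, 4)),
    ("Feature film on DVD", date(2012, 5, 28)),
]


def group_by_due_date(items):
    """Collect titles under their due date, earliest date first."""
    groups = defaultdict(list)
    for title, due in items:
        groups[due].append(title)
    return dict(sorted(groups.items()))


def render_slip(items):
    """Print each due date once, followed by the items due that day."""
    lines = []
    for due, titles in group_by_due_date(items).items():
        lines.append("Due " + due.isoformat())
        lines.extend("  " + title for title in titles)
    return "\n".join(lines)


slip = render_slip(checkouts)
```

Sorting by date puts the soonest deadline at the top of the slip, which seems like the most useful order for something stuck to a refrigerator.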

I really like the idea of a checkout slip that includes an extra bit that’s specifically meant to be displayed on a refrigerator or corkboard, though such a design could add about three or more inches of length per due date. It might be cumbersome, but consider how much better this communicates your library’s philosophy.

Just remember: the details matter, especially when these checkout slips are the most visible output of your library that most users will see.

This first appeared in “The User Experience,” a column I write for LJ.

by Aaron Schmidt at May 14, 2012 12:35 PM

Rosenthal, David

Lets Just Keep Everything Forever In The Cloud


Dan Olds at The Register comments on an interview with co-director of the Wharton School Customer Analytics Initiative Dr. Peter Fader:

Dr Fader ... coins the terms "data fetish" and "data fetishist" to describe the belief that people and organisations need to capture and hold on to every scrap of data, just in case it might be important down the road. (I recently completed a Big Data survey in which a large proportion of respondents said they intend to keep their data “forever”. Great news for the tech industry, for sure.)
The full interview is worth reading, but I want to focus on one comment, which is similar to things I hear all the time:
But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition."
Follow me below the fold for some thoughts on data hoarding.

Clearly, the value that could be extracted from the data in the future is non-zero, but even the Big Data zealot believes it is probably small. The reason the Big Data zealot gets away with saying things like this is because he and his audience believe that this small value outweighs the cost of keeping the data indefinitely. They believe that because they believe Kryder's Law will continue.

Let's imagine that everyone thought that way, and decided to keep everything forever. The natural place to put it would be in S3. According to IDC, in 2011 the world stored 1.8 Zettabytes (billion TB) of data. If we decided to keep it all for the long term in the cloud, we would be effectively endowing it. How big would the endowment be? Applying our model, starting with S3's current highest-volume price of $0.055/GB/mo and assuming that price continues to drop at the 10%/yr historic rate for S3's largest tier, we need an endowment of about $6.3K/TB. So the net present value of the cost of keeping all the world's 2011 data in S3 would be about $11.4 trillion. The 2011 Gross World Product (GWP) at purchasing power parity is almost $80 trillion. So keeping 2011's data would consume 14% of 2011's GWP. The world would be writing S3 a check each month of the first year for almost $100 billion, unless the world got a volume discount.
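The arithmetic in that paragraph can be replayed in a few lines of Python. This sketch takes the $6.3K/TB endowment as given (it does not reproduce the model that produced it), rounds GWP to $80 trillion, and assumes 1 TB = 1000 GB.

```python
# Inputs quoted in the paragraph above.
TB_STORED_2011 = 1.8e9        # 1.8 Zettabytes expressed in TB
ENDOWMENT_PER_TB = 6300.0     # dollars per TB, taken as given
PRICE_PER_GB_MONTH = 0.055    # S3 highest-volume price, dollars
GWP_2011 = 80e12              # "almost $80 trillion", rounded up

# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000

# Net present value of keeping all of 2011's data.
total_endowment = TB_STORED_2011 * ENDOWMENT_PER_TB

# Fraction of one year's Gross World Product that represents.
share_of_gwp = total_endowment / GWP_2011

# First monthly bill at the starting price, before any price drop.
first_month_bill = TB_STORED_2011 * GB_PER_TB * PRICE_PER_GB_MONTH
```

With these rounded inputs the endowment comes to a little over $11 trillion, the share to roughly 14% of GWP, and the first monthly bill to just under $100 billion, in line with the figures above.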

IDC estimates that 2011's data was 50% larger than 2010's; I believe their figure for the long-run annual growth of data is 57%/yr. Even if it is only 50%, compare that with even the most optimistic Kryder's Law projections of around 30%. But we're using S3, and a 10% rate of cost decrease. So 2012's endowment will be (50-10)=40% bigger than 2011, and so on into the future. The World Bank estimates that in 2010 GWP grew 5.1%. Assuming this growth continues, endowing 2012's data will consume 19% of GWP. On these trends, endowing 2018's data will consume more than the entire GWP for the year.
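A short Python sketch of that projection follows. It keeps the approximation used above, where the endowment grows by (50-10)=40% a year, and starts from the 14% share for 2011; the variable names are made up for the example.

```python
# Rates quoted in the text above.
DATA_GROWTH = 0.50    # yearly growth in data stored
PRICE_DROP = 0.10     # yearly drop in S3 price
GWP_GROWTH = 0.051    # yearly growth in Gross World Product

# The text's approximation: endowment grows by (50 - 10) = 40% a year.
endowment_growth = 1 + (DATA_GROWTH - PRICE_DROP)

# Each year the share of GWP is multiplied by this factor.
yearly_factor = endowment_growth / (1 + GWP_GROWTH)

# Share of GWP needed to endow each year's data, starting from 14% in 2011.
shares = {2011: 0.14}
year = 2011
while shares[year] <= 1.0:
    shares[year + 1] = shares[year] * yearly_factor
    year += 1

# First year whose endowment exceeds that year's entire GWP.
first_year_over_gwp = year
```

Run as written, the share for 2012 rounds to 19% and the crossover year is 2018, matching the text. Compounding the two rates instead (1.5 × 0.9 = 1.35) would slow the growth a little and push the crossover about a year later, which does not change the conclusion.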

So, we're going to have to throw stuff away. Even if we believe keeping stuff is really cheap, it's still too expensive. The bad news is that deciding what to keep and what to throw away isn't free either. Ignoring the problem incurs the costs of keeping the data; dealing with the problem incurs the costs of deciding what to throw away. We may be in the bad situation of being unable to afford either to keep or to throw away the data we generate. Perhaps we should think more carefully before generating it in the first place. Of course, thought of that kind isn't free either ...

by David. (noreply@blogger.com) at May 14, 2012 10:00 AM

Rochkind, Jonathan

RDFa :: HTML5 microdata

RDFa and HTML5 microdata, are, I think, basically interchangeable.

RDF and microdata both use the same fundamental triple data model. Please note that schema.org is just a specific set of vocabularies that can be used with HTML5 microdata; HTML5 microdata goes beyond this. schema.org is a pretty good microdata tutorial though, if you remember you don’t have to use its vocabularies. Here’s the actual microdata spec. Here’s a good microdata tutorial that pre-dates schema.org and is not schema.org-specific.

You can take pretty much anything that’s RDF, from any vocabularies, and use an RDFa style approach to express (basically) the same semantics in HTML5 microdata instead.

This is a good thing for RDF, because there’s no good way to do RDFa in HTML (or anything but xHTML which is basically an abandoned approach — RDFa needs XML namespaces).  You can go from (any) html5 microdata to RDF too — although there are a couple gaps I’ll discuss at the end.

First, let’s show how you’d do RDFa-style RDF semantics expressed in HTML5 microdata. Let’s take the complete example from the RDFa wikipedia article, as it’s small but makes us actually use a pretty complete complement of microdata features. There are in fact a couple weird details I’m not sure about.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    version="XHTML+RDFa 1.0" xml:lang="en">
  <head>
    <title>John's Home Page</title>
    <base href="http://example.org/john-d/" />
    <meta property="dc:creator" content="Jonathan Doe" />
    <link rel="foaf:primaryTopic" href="http://example.org/john-d/#me" />
  </head>
  <body about="http://example.org/john-d/#me">
    <h1>John's Home Page</h1>
    <p>My name is <span property="foaf:nick">John D</span> and I like
      <a href="http://www.neubauten.org/" rel="foaf:interest"
        xml:lang="de">Einstürzende Neubauten</a>.
    </p>
    <p>
      My <span rel="foaf:interest" resource="urn:ISBN:0752820907">favorite
      book is the inspiring <span about="urn:ISBN:0752820907"><cite
      property="dc:title">Weaving the Web</cite> by
      <span property="dc:creator">Tim Berners-Lee</span></span>
     </span>
    </p>
  </body>
</html>

Here’s the same thing, using the same vocabularies, with HTML5 microdata. (Yes, contrary to some belief, you can mix and match more than one vocabulary in microdata too, although you’ve got to spell out the complete URI for all but one in any given scope.)

<html lang="en">
  <head>
    <title>John's Home Page</title>
    <base href="http://example.org/john-d/" />
    <link rel="http://xmlns.com/foaf/0.1/primaryTopic" href="http://example.org/john-d/#me" />
  </head>
  <body itemscope itemtype="http://purl.org/dc/elements/1.1/"  itemid="http://example.org/john-d/#me">
    <h1>John's Home Page</h1>
    <p>My name is <span itemprop="http://xmlns.com/foaf/0.1/nick">John D</span> and I like
      <a href="http://www.neubauten.org/" itemprop="http://xmlns.com/foaf/0.1/interest"
        lang="de">Einstürzende Neubauten</a>.
    </p>
    <p>
      <span itemscope itemtype="http://purl.org/dc/elements/1.1/" itemprop="http://xmlns.com/foaf/0.1/interest" itemid="urn:ISBN:0752820907">
      My favorite
      book is the inspiring <cite
      itemprop="title">Weaving the Web</cite> by
      <span itemprop="creator">Tim Berners-Lee</span>
     </span>
    </p>
  </body>
</html>

Mismatches and missing semantics

While the fundamental approach is compatible, there are a few mismatches and semantics lost or less clear in html5 microdata. Here are some I feel like noting, there may be others.

Whither RDF/RDFa

(That’s “whither”, not “wither”. Hopefully).

There are probably other rough spots than the ones I’ve identified. And the ones I mentioned include some tough ones (the itemtype+itemprop==URI issue).

But by and large, HTML5 microdata’s fundamental model is RDF compatible.  Hopefully the RDFistas are focused on figuring out how to lessen the impedance mismatches, if necessary by lobbying the html5 working group to make minimized interventions.  Hopefully they’re not still stuck on an xhtml/rdfa/why-didn’t-they-do-things-our-way train, because that train isn’t leaving the station.  Instead though, they can contribute to sanding off a few rough spots in microdata to make it quite capable of doing what they want (and, if they’re right, everyone else will eventually realize they want too). Work on tools to turn microdata to RDF, too, hopefully.

microdata could actually be a great thing for RDF.  If handled correctly, it should be possible to express full RDF semantics in microdata; microdata can be the RDF-in-HTML-markup standard that RDFa wanted to be. (microdata’s designers clearly knew about RDF/RDFa and were influenced by it). It’s also possible to leave a lot of semantics out when writing microdata, but often in ways you could do with RDF/RDFa too (lots of blank nodes, etc.); RDF/RDFa just tries to make it inconvenient and non-idiomatic.

While the RDFistas may be rueing that microdata makes it so easy to not have completely specified triples with no blank nodes everywhere, I think the flip side of this is actually what will allow it to possibly get more uptake, and be an easy start on the road to RDF, if RDF plays its cards right. That’s because you have to think through the complete vocabularies and semantics less: you can get started with just the semantics you need, and not be forced to do more up front metadata design than you need for your identified use cases, or more than you can afford or have the skills to do. That, and some of the immediate use cases in ‘Google will use it!’ of course. But if Google had tried to say they used RDFa (didn’t they once, maybe, sort of?), I don’t think it would have gone anywhere; RDFa is just too overwhelming.


Filed under: General

by jrochkind at May 14, 2012 04:59 AM

May 13, 2012

Grimmelmann, James

Spam Alert: The Institute for Cultural Diplomacy

In the past few years, I’ve received numerous emails from the Institute for Cultural Diplomacy announcing upcoming conferences and educational programs. The messages say, “If you do not wish to receive emails from the ICD in the future, please send us an email to info@culturaldiplomacy.org indicating this.” I have, six times, spread out over half a year. It didn’t work. Twice, I cc:ed Mark Donfried, the ICD’s “Director and Founder,” over whose name the emails are written. I never received a response from him, just more spam. I called the ICD’s office, in Germany, and asked to be removed from their list. The woman who answered the phone promised I would be. She lied: the email continued.

This is unethical behavior, inconsistent with the values the ICD supposedly represents. It’s disrespectful, dishonest, and disreputable. I doubt that any of my readers run in ICD circles, but if you do, please think hard about what it says about the ICD as an organization.

by James Grimmelmann (james@grimmelmann.net) at May 13, 2012 06:58 PM

Inside the Georgia State Opinion

On Friday, the long-awaited decision in the Georgia State e-reserves case (a.k.a. Cambridge University Press v. Becker) dropped. By way of context, the case is a challenge by three academic publishers (Oxford University Press, Cambridge University Press, and Sage Publications) against Georgia State University’s e-reserves policy. The publishers sued in April 2008, in a lawsuit funded by the Association of American Publishers and the Copyright Clearance Center, claiming that the e-reserves policy went far beyond the bounds of fair use. Georgia State, as a state university, invoked the doctrine of sovereign immunity, the practical implication of which is that the publishers can only obtain injunctions against future infringements, not damages for past infringements. Since it also tightened up its e-reserves policy in December 2008, it also successfully argued to the court that only the uses made under the new policy should be relevant to any potential injunction.

There was a trial a year ago, and then long silence from the court. Now we know why it was taking so long: the opinion is 350 pages. That number is a little misleading, in that over two thirds of the opinion are dedicated to a highly methodical copyright ownership, infringement, and fair use analysis of seventy-four separate claims of infringement, using standard templates and highly repetitive language. Having now dug through the details, I’d like to offer a few observations.

First, over a third of the claims didn’t even make it to the fair use stage at the heart of the case. In many cases, the publishers were unable to prove to the court’s satisfaction that they owned the copyright in the portions of the books that were copied and uploaded. Sometimes they couldn’t produce a timely registration certificate and there were proof problems with originality; sometimes they couldn’t find a work-made-for-hire agreement or copyright assignment from the authors of individual chapters in edited volumes. The court was unsympathetic: no documented chain of title, no lawsuit. There’s a looming e-rights mess, loosely akin to the robosigning mess around ownership of securitized mortgages: in both cases, the putative owners don’t have all their papers in order. This opinion either recognizes or contributes to the mess, depending on your point of view.

Other claims dropped out before the fair use stage because they were uploaded to the e-reserves system but never downloaded by students. The court dismisses these from the lawsuit as de minimis, explaining that these uses by the University, while technically implicating the copyright owners’ exclusive rights, don’t affect the incentives for authors to create. This puts more teeth in the de minimis doctrine in copyright: it goes beyond the view that de minimis means “not substantially similar.” It also strengthens the argument that “internal use” copies never used to reach an audience that reads them for their content don’t infringe. Think, for example, of the HathiTrust’s archive of scans from Google Books.

(As an aside, the e-reserve logfiles played a key evidentiary role in the case. Specific users were never identified, but if a file had a total hit count of two, it’s unlikely that students actually read it. This stands in contrast to other cases, like American Geophysical, which was tried by sampling: the parties selected a single scientist at random, examined his files looking for photocopies, and treated him as representative of a cohort of 500. Here, the logs permitted an analysis of the copying done for numerous faculty members—presumably all those who assigned any excerpts from any of the plaintiffs’ books.)

When the court did reach fair use, it held across the board that two of the four factors favored Georgia State. The purpose of the use, while not transformative, was nonetheless for highly favored educational purposes by a nonprofit institution. And the nature of the works was consistently informational.

On the third factor, the amount copied, the court repudiated the Classroom Guidelines, calling them “not compatible with the language and intent of § 107.” It noted that the numerical limits in the Guidelines are so stringent that not one of the excerpts at issue in the case would fit within them. It was particularly uninterested in the Guidelines’ position that copying not “be repeated with respect to the same item by the same teacher from term to term,” which the court described as “an impractical, unnecessary limitation.”

Instead, the court fashioned its own quantitative test. For books of nine or fewer chapters, the court set a threshold of 10% of the total page count; for books of ten chapters or more, the threshold was a single complete chapter. (The chapter-based rule creates an odd incentive for publishers to create books with a surfeit of tiny chapters.) Copying of any amount under this threshold, the court held, would be treated as “decidedly small.” In practical terms, this ended up being a one-sided bright-line rule: copying of less than 10% or one chapter always ended in a fair use win for Georgia State.
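To make the shape of the test concrete, here is a rough Python sketch of the threshold as summarized above. It is a simplification for illustration, not a statement of the law: the function name and parameters are invented, the long-book branch simply counts how many chapters were drawn on, and copying exactly 10% is treated as within the threshold.

```python
def decidedly_small(total_pages, chapter_count, pages_copied, chapters_copied):
    """Return True if an excerpt falls within the court's quantitative threshold.

    Simplified reading of the rule described above:
      - nine or fewer chapters: up to 10% of the total page count
      - ten or more chapters:   up to one complete chapter
    """
    if chapter_count <= 9:
        # Integer arithmetic avoids floating-point surprises at the boundary.
        return pages_copied * 10 <= total_pages
    return chapters_copied <= 1


# Hypothetical examples.
short_book_ok = decidedly_small(total_pages=300, chapter_count=8,
                                pages_copied=30, chapters_copied=1)
short_book_over = decidedly_small(total_pages=300, chapter_count=8,
                                  pages_copied=31, chapters_copied=1)
long_book_ok = decidedly_small(total_pages=300, chapter_count=12,
                               pages_copied=40, chapters_copied=1)
long_book_over = decidedly_small(total_pages=300, chapter_count=12,
                                 pages_copied=40, chapters_copied=2)
```

Note how the last two cases copy the same number of pages but come out differently, which is exactly the incentive toward tiny chapters mentioned above.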

Finally, the fourth factor, the effect on the market, favored the publishers whenever CCC was offering a digital license for copying the book in question, and favored Georgia State whenever there was “no evidence in the record to show that digital excerpts from this book were available for licensing” as of the date of infringement. In practice, this was another one-sided bright-line rule: no digital license meant an instant win for Georgia State. The court repeatedly emphasized that students would not have bought the assigned books as a substitute for the excerpts posted on the e-reserve system.

This treatment of licensing is likely to have significant implications. On the one hand, it suggests that libraries may have a freer hand to make expanded uses of orphan works, since by definition, no one will be licensing them. And on the other, the court didn’t consider photocopying licenses to be a suitable substitute for digital licenses. This will put significant pressure on publishers to turn on digital licensing.

Only in seven instances did Georgia State use more than 10% or one chapter of a book that was available for digital licensing. When this happened, the court took a more detailed look at the specifics of the book’s licensing market and the portion copied. Generally, this turned on whether the book made significant revenues via licensing: if so, the use was unfair. (In one instance, the court did a “heart of the work” analysis under factor three to find no fair use because the professor had assigned chapters that “essentially sum up the ideas in the book.”)

Thus, the operational bottom line for universities is that it’s likely to be fair use to assign less than 10% of a book, to assign larger portions of a book that is not available for digital licensing, or to assign larger portions of a book that is available for digital licensing but doesn’t make significant revenues through licensing. This third prong is almost never going to be something that professors or librarians can evaluate, so in practice, I expect to see fair-use e-reserves codes that treat under 10% as presumptively okay, and amounts over 10% but less than some ill-defined maximum as presumptively okay if it has been confirmed that a license to make digital copies of excerpts from the book is not available.

The most interesting issue open in the case is the scope of any possible injunction. Given that Georgia State won on sixty-nine out of seventy-four litigated claims, while the publishers won on only five, I expect that any injunction will need to be rather narrow. But given how amenable the court’s proposed limits are to bright-line treatment, it is likely that the publishers will push to write them into the injunction.

My bottom line on the case is that it’s mostly a win for Georgia State and mostly a loss for the publishers. The big winner is CCC. It gains leverage against universities for coursepack and e-reserve copying with a bright-line rule, and it gains leverage against publishers who will be under much more pressure to participate in its full panoply of licenses.

by James Grimmelmann (james@grimmelmann.net) at May 13, 2012 04:25 PM

O'Steen, Ben

raymarch

I hope many of you have seen the excellent and fun WebGL GLSL sandbox at http://glsl.heroku.com/. This live editing of shaders is an excellent learning tool, as it allows you to watch the consequences of any changes you make.

I am constructing some simple OpenGL ES scripts as a learning resource for anyone new to it, in part as a way to help me learn it too. As part of this, I’ve written a little script that gives a similar kind of experience, but on the commandline, with a Raspberry Pi. It overlays the render display over the top of the framebuffer window, and reloads the shader any time you save the script.
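The reload-on-save idea can be sketched without any dependencies. Note that this is not how the repository's script does it: the script uses pyinotify, whereas the sketch below swaps in plain polling of the file's modification time and size from the standard library. The class and function names are invented for the example.

```python
import os
import time


class ShaderWatcher:
    """Detect saves to a shader file by polling its mtime and size."""

    def __init__(self, path):
        self.path = path
        self.last_seen = self._stamp()

    def _stamp(self):
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_size)

    def changed(self):
        """Return True once for each observed change to the file."""
        stamp = self._stamp()
        if stamp != self.last_seen:
            self.last_seen = stamp
            return True
        return False


def watch(path, on_change, poll_seconds=0.5):
    """Call on_change(source_text) every time the file is saved.

    Runs until interrupted; on_change is where a shader would be
    recompiled and the display refreshed.
    """
    watcher = ShaderWatcher(path)
    while True:
        if watcher.changed():
            with open(path) as handle:
                on_change(handle.read())
        time.sleep(poll_seconds)
```

Polling is cruder than inotify, since it wakes up even when nothing has happened, but it shows the same loop: notice a save, re-read the source, hand it to the renderer.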

You can run this from the terminal (as I do in the video later on) or from within LXDE (X11). (You may wish to replace ‘nano’ with ‘leafpad’ in glsl_sandbox.sh if you are running it in LXDE)

Prerequisites:

* Install pyinotify

$ sudo apt-get install python-pyinotify

* Get the repository:

$ wget https://github.com/benosteen/pyopengles/zipball/master -O pyopengles.zip
$ unzip pyopengles.zip
  Inflating benosteen-pyopen....
  ...

cd into the repository and you are ready to go!

Usage of ‘glsl_sandbox.sh’:

$ bash glsl_sandbox.sh [NAME_OF_SHADER_FILE]

The repository has a few demo shaders that are known to work included for you to try – they are copied from the WebGL sandbox site (http://glsl.heroku.com/)

‘basic.glsl’ – From http://glsl.heroku.com/e#2423.0


‘leds.glsl’ – From http://glsl.heroku.com/e#2450.0


‘raymarch.glsl’ – From http://glsl.heroku.com/e#2171.0

Pass the script any new filename, and it will create a new shader from the template and save it to that location.

This uses the “nano” text editor by default, as that is installed on the reference Debian image, but if you look in the script, it is not hard to change to your preference :)

(NB Ctrl-O, followed by enter to save the file, and Ctrl-X to quit nano.)

Here is a video of it in action, as it is quite hard to describe in words.


by benosteen at May 13, 2012 02:10 PM

Coyle, Karen

RDA, DBMS, RDF

I have written before about some issues relating to RDA and RDF. Today I want to look at some things that should cause us to question the concept of "RDA in RDF."

For many decades we have been using relational databases to store our bibliographic data, bibliographic data that we create and exchange using the MARC format. Doing so was not by any means natural or intuitive because there is nothing about the structure or content of the MARC record that lends itself to being stored and managed in a relational database. The results were often awkward, inefficient, and unsatisfying.

Part of the reason for this is the unitary and flat nature of MARC. In spite of the long history of creating separate authority files, each MARC record is a complete and closed document with no actual connections to data outside of itself. While some database implementations for MARC do create relational tables for headings, the degree to which a MARC record can be separated out into tables is minimal and gains us very little in terms of the functionality of an RDBMS.

The underlying problem, however, is not in the structure of the MARC record but in the content of our catalog records. Moving from the card to a database for our data requires more than adding mark-up coding around the catalog data; to do so successfully requires re-thinking the data in terms of relational database principles. There are two basic principles to relational database design: repetition and combination.

To design for relational databases you look at your data to see what elements will be repeated in many different records. Rather than carrying those data elements in multiple records, you create a separate database table for each repeating element, and you store that element once. For example, if you are creating a database of mailing addresses, you see quickly that elements like state and zip code will appear in multiple records. You therefore create a table of state names and one of zip codes, and perhaps even one that links zip codes to city names. In this way, your database carries the string "Mississippi" only once, and that string is replaced in the records with a database pointer that uses much less internal storage. Ditto the zip code. And if the zip code is associated in a table with a city name, you also only store city names once, and each address record needs only a pointer to the zip code, not a city name. In fact, with a zip code you can get the city and state, and your design might look like:



In this way you have saved a huge amount of storage space. You have also made selection of your records on zip code, city and state much more efficient than if they were stored in every address record, because a search on a zip code, for example, retrieves a single entry in the zip code table, and that entry has database-managed links to the relevant records.
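As a rough sketch of the address design described above, here is one way it might look using Python's built-in sqlite3 module. The table and column names, the sample recipients, and the zip-to-city pairings are all invented for illustration; they are not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (state_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE zip (zip_code TEXT PRIMARY KEY, city TEXT,
                      state_id INTEGER REFERENCES state(state_id));
    CREATE TABLE address (address_id INTEGER PRIMARY KEY, recipient TEXT,
                          street TEXT,
                          zip_code TEXT REFERENCES zip(zip_code));
""")

# The string "Mississippi" is stored exactly once.
conn.execute("INSERT INTO state VALUES (1, 'Mississippi')")
conn.executemany("INSERT INTO zip VALUES (?, ?, ?)",
                 [("39201", "Jackson", 1), ("38655", "Oxford", 1)])

# Each address carries only a zip code; city and state come from the lookups.
conn.executemany(
    "INSERT INTO address (recipient, street, zip_code) VALUES (?, ?, ?)",
    [("A. Reader", "1 Main St", "39201"),
     ("B. Writer", "2 Oak Ave", "39201"),
     ("C. Editor", "3 Elm Rd", "38655")])

rows = conn.execute("""
    SELECT a.recipient, z.city, s.name
    FROM address a
    JOIN zip z ON a.zip_code = z.zip_code
    JOIN state s ON z.state_id = s.state_id
    WHERE z.zip_code = ?
    ORDER BY a.recipient""", ("39201",)).fetchall()

state_rows = conn.execute("SELECT COUNT(*) FROM state").fetchone()[0]
```

The search on a zip code touches one row in the zip table and follows the links from there, which is the efficiency gain the paragraph above describes.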

In a database of customer orders that has your inventory information along with customer addresses, you use the tables in your database to search for things like "all customers in Mississippi who have ordered WidgetX in the last six months." Information about your inventory and information about purchases are all in appropriate sets of tables in your database and you can combine the data elements to develop different views of the data.

Where the goal in relational database design is to identify and isolate data elements that are the same, the goal in library cataloging data is exactly the opposite: headings are developed primarily to differentiate at the data creation point rather than allow combination within the database management system. The goal is to have each data point be as unique as possible and to be assigned to as few records as possible. Thus, library cataloging creates headings whose purpose is to distinguish between entries:

Shakespeare, William, 1564-1616. As you like it
Shakespeare, William, 1564-1616. As you like it. 1905
Shakespeare, William, 1564-1616. As you like it. 1911.
Shakespeare, William, 1564-1616. As you like it. 1919.
Shakespeare, William, 1564-1616. As you like it. Czech
Shakespeare, William, 1564-1616. As you like it. French

These headings are counter to the functioning of a database management system. If moved to a database table to facilitate retrieval, they will each point to only one or a very small number of records. This negates the space-saving aspect of database management, and it also fails to facilitate the combination of data elements for retrieval. In the case of headings, the combination of elements is pre-coordinated in the data, rather than post-coordinated in the database retrieval function.

A database approach might break this data into four tables:




In this way one could search for this data by title, by title + author, date + language, or by any other combination of these four data elements. To search the library headings as anything but a single keyworded string, that is to use these headings to perform searches on title or date or language, would be incredibly inefficient. The upshot is that library headings are not "relational" and do not contribute to the functionality that database management systems can provide. Instead, database management systems make use of the separate coded elements, such as date and language, for combinatorial retrieval. Names and titles, because they are text strings and do not have an identified presence in the stored records, must be searched separately rather than being available for relational combination. The results of this type of searching are less than optimal in speed and accuracy.
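To make the contrast concrete, here is a small Python sketch of post-coordinated retrieval over the Shakespeare headings, with the four elements (author, title, date, language) held separately. The class and function names are hypothetical, and an in-memory list stands in for the database tables.

```python
from typing import NamedTuple, Optional


class Expression(NamedTuple):
    author: str
    title: str
    date: Optional[str]
    language: Optional[str]


AUTHOR = "Shakespeare, William, 1564-1616"
RECORDS = [
    Expression(AUTHOR, "As you like it", None, None),
    Expression(AUTHOR, "As you like it", "1905", None),
    Expression(AUTHOR, "As you like it", "1911", None),
    Expression(AUTHOR, "As you like it", "1919", None),
    Expression(AUTHOR, "As you like it", None, "Czech"),
    Expression(AUTHOR, "As you like it", None, "French"),
]


def search(records, **criteria):
    """Return the records whose fields equal every supplied criterion."""
    return [record for record in records
            if all(getattr(record, field) == value
                   for field, value in criteria.items())]


by_title = search(RECORDS, title="As you like it")
by_title_and_date = search(RECORDS, title="As you like it", date="1911")
by_language = search(RECORDS, language="French")
```

Any combination of the four elements works with the same function, whereas the pre-coordinated heading string would have to be parsed or keyword-searched to answer the date or language questions.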

All of this may seem obvious to some of you, so you may be asking yourselves why I bring this up. I bring it up because even though RDA claims to have as its goal the creation of records in a relational design (see scenario one in this JSC document), it continues to instruct catalogers to create pre-coordinated headings like the ones above. Not only will these not be efficient or fruitful in a relational database, this brings into question whether RDA is truly modeled on the principles it claims to embrace. If it is not, we have cause to worry: we cannot move forward with data that does not conform to a modern model.

Note that in this post I have been emphasizing the use of relational database design for the data. The current plans for a new bibliographic framework appear to plan to create a data model for RDA that is based on semantic web principles. Those principles are yet another significant evolution following on the database model, which is now considered waning technology. Other communities, ones that have been designing for database management requirements for their data for decades, are now looking at ways to transform that data to RDF. It is possible that we can skip the relational database phase of our data development and move directly into a semantic web model. However, to think that data created following RDA instructions, which is not even suitable for a relational database, could be made usable on the semantic web without major modifications is simply wrong. If we create a bibliographic framework that takes RDA as it has been described and ports that, unchanged, to RDF we will create a data model that does not serve us, does not serve our users, and that cannot reasonably interact with other linked data on the web.

What we need is an analysis of our data, not a transformation of it "as is" to a new technology. If we aren't ready to admit that some traditional practices, like headings, are no longer useful or usable in today's technological environment, we cannot have any hope that our data will be relevant in the future.

(p.s. I anticipate that someone will state that headings are needed for alphabetical displays, to "collocate" the records. To that I reply: 1) you can do the same collocation using the data elements, and in fact you could devise multiple collocations by combining the elements in different ways and 2) a linear, alphabetic display is so anachronistic with today's technology, and so seldom used when available, that it is hard to justify the use of human catalogers to create these fields. If you still believe that library records must contain hand-crafted headings, all I can say is: you can believe what you want, but some of us will be exploring other solutions.)

by Karen Coyle (noreply@blogger.com) at May 13, 2012 11:42 AM

Pattern, Dave

A week on Summon (revisited)

After yesterday's blog post, I thought I'd have a go at narrowing down my definition of a "separate search".

If a user enters some search terms, and then uses 2 facets to refine the search before clicking on a result, I was classing that as 3 separate searches — what niggled me overnight was that that approach might inflate the facet use statistics …after all, 30.6% of all searches used at least one facet felt a little high given that I'm forever hearing staff moan that students never use the facets, no?!

For today's blog post, I've removed all searches that didn't lead to a result click. (There's a small caveat that my jQuery code currently doesn't capture a result click for links to the OPAC where the user clicked on the availability message (highlighted in red below) — this is because my jQuery code that captures the result clicks runs once the page has loaded, but before the AJAX'd availability information has been retrieved. When I get some time, I'll see if I can find a way around that.)
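As a sketch of the filtering step, assuming each logged search is a record with a query, a list of selected facets and a flag for whether a result was clicked (the field names and sample rows here are invented, not the real log format):

```python
SEARCH_LOG = [
    {"query": "confocal microscopy", "facets": [], "clicked": False},
    {"query": "confocal microscopy", "facets": ["Journal Article"], "clicked": False},
    {"query": "confocal microscopy", "facets": ["Journal Article", "English"], "clicked": True},
    {"query": "dogs cats", "facets": [], "clicked": True},
    {"query": "library anxiety", "facets": ["Book"], "clicked": False},
    {"query": "open access", "facets": [], "clicked": True},
]


def facet_share(log):
    """Percentage of the given searches that used at least one facet."""
    if not log:
        return 0.0
    with_facets = sum(1 for entry in log if entry["facets"])
    return 100.0 * with_facets / len(log)


# Yesterday's approach: every logged search counts.
share_all = facet_share(SEARCH_LOG)

# Today's approach: only searches that led to a result click count.
clicked_only = [entry for entry in SEARCH_LOG if entry["clicked"]]
share_clicked = facet_share(clicked_only)
```

With this made-up sample the share falls from 50% to about 33%; the real figures depend entirely on the real logs.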

So, let's see how much of a difference that makes to yesterday's stats

So, that overall figure for the % of searches which used at least one facet hasn't dropped by much from yesterday's figure of 30.6%.

Anyone who follows me on Twitter will know that I like to cheekily mock the importance of Boolean and the data from the last 7 days reveals a few things:

  1. no-one who used a Boolean NOT in their search clicked on a result
  2. only 0.07% of searches (that's just 7 searches in every 10,000!) used a Boolean OR, which is arguably the most useful operator to use
  3. unless you're using a search that includes one of the other Boolean operators, the use of AND is pretty much redundant as it's the default Boolean operator in a search (i.e. the search "dogs AND cats" is the same as "dogs cats")… so why are we telling students to use it in Summon?

After poking a bit of fun at someone for entering a 356 word search query yesterday, I can reveal that the longest search in the last 7 days that resulted in a result click was 98 words (it was a paragraph copied and pasted from a journal article).

I guess the big question here is why the disconnect between the "students don't use facets" mantra and the actual usage data?

Finally, I thought I'd figure out how many results are clicked on after a search…

[Chart: Summon clicks per search]

by Dave Pattern at May 13, 2012 09:54 AM

May 12, 2012

Denton, William

Correcting timestamp on photographs taken on an Acer Android phone

I have an Acer Liquid E phone running Android. All of the photographs it takes are timestamped in the internal EXIF metadata to 8 December 2002. Turns out there's a bug in the camera app: it seems that instead of using "insert correct date here" in libcamera.so somehow the December 2002 date got hardcoded in.

The timestamp on the file itself is correct, however, so I wrote this script to use that to edit the EXIF times. It uses ExifTool, which is probably in your package manager:

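#!/bin/bash

# Copy each mis-dated photo to acer-NNN.jpg and rewrite its EXIF dates
# from the file's own modification time. The originals are left untouched.
for I in 2002-12-08*jpg
do
  [ -e "$I" ] || continue   # skip if the glob matched nothing
  TIMESTAMP=$(stat --printf "%y" "$I")
  NAME=$(echo "$I" | sed 's/2002-12-08 12.00.00-//')
  NAME="acer-$NAME"
  echo "$I --> $NAME ($TIMESTAMP)"
  cp -p "$I" "$NAME"
  exiftool -P -DateTimeOriginal="$TIMESTAMP" -CreateDate="$TIMESTAMP" "$NAME"
done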

For example, a file named 2002-12-08 12.00.00-460.jpg timestamped 2012-04-10 19:30 would have the DateTimeOriginal and CreateDate EXIF fields corrected to 2012-04-10 19:30 and the file would be renamed to acer-460.jpg. The original file is left untouched.
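The renaming and timestamp steps of that example can be checked in isolation with a short Python sketch. It does not call ExifTool or write any metadata; it only shows how the new name and an EXIF-style date string (EXIF stores dates as YYYY:MM:DD HH:MM:SS) could be derived from the file itself. The function names are hypothetical.

```python
import datetime
import os
import tempfile

PREFIX = "2002-12-08 12.00.00-"


def corrected_name(filename):
    """Map '2002-12-08 12.00.00-460.jpg' to 'acer-460.jpg'."""
    if not filename.startswith(PREFIX):
        raise ValueError("unexpected file name: " + filename)
    return "acer-" + filename[len(PREFIX):]


def exif_timestamp(path):
    """Format the file's modification time in the EXIF date layout."""
    modified = datetime.datetime.fromtimestamp(os.stat(path).st_mtime)
    return modified.strftime("%Y:%m:%d %H:%M:%S")


# Recreate the example: an empty stand-in photo last modified 2012-04-10 19:30.
with tempfile.TemporaryDirectory() as tmp:
    photo = os.path.join(tmp, PREFIX + "460.jpg")
    open(photo, "wb").close()
    when = datetime.datetime(2012, 4, 10, 19, 30).timestamp()
    os.utime(photo, (when, when))

    new_name = corrected_name(os.path.basename(photo))
    stamp = exif_timestamp(photo)
```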

It worked for me, and it won't delete your files, so use it if it helps. Make sure that whenever you copy files off your Acer phone you use cp -p to preserve the original timestamp. Otherwise your photos will have their internal dates set to today!

by wtd at May 12, 2012 06:26 PM

Manage Metadata (Phipps and Hillmann)

Using the sub-property ladder

I discussed the utility of the sub-property relationship in Getting to higher MARC branches, Netting more MARC fruit, and Adding MARC fruit to the cornucopia. Coincidentally, Bob DuCharme posted Simple federated queries with RDF which outlines the same technique and provides additional information on its use for resource discovery. Those posts are somewhat technical, and I tried to lighten up in my presentation Turtle dreaming at the recent Dublin Core Metadata Initiative (DCMI) seminar Five years on. This post is another attempt to demonstrate in a non-technical way (I hope) how useful and powerful the sub-property relationship can be.

A metadata attribute, like ‘title’, that is to be used for linked data in the Semantic Web is usually represented in Resource Description Framework (RDF) as a property. A property can be used as the predicate part of a triple: “Subject – predicate – object”, where ‘Subject’ is what the triple is about (e.g. a resource), ‘predicate’ is the aspect of the subject, and ‘object’ is the value of that aspect for the specified subject. For example:

“This resource – (has) title – ‘Using the sub-property ladder’”

is a single metadata statement in triple format. We can think of this as conforming to the triple template:

“Specified resource – (has) attribute – value”.

Note that prefixing the predicate with ‘has’ turns it into a verbal phrase and renders the statement in (near) natural language.

We can also make meta-metadata statements in triple format. These are ‘data about metadata’ rather than ‘data about data’, and are often referred to as ontological triples to distinguish them from data triples such as the example above. The triple template for one type of meta-metadata statement is:

“Specified RDF element – (has) relationship – Other specified RDF element”

Note that a relationship between metadata elements is also represented in RDF as a property. In particular, ‘sub-property’ is a pre-defined relationship between two RDF property elements, giving the ontological triple:

“Property 1 – (is) sub-property of – Property 2”

Furthermore, such relationships can embed semantic rules that can be processed automatically by software known as ‘semantic reasoners’ or just plain ‘reasoners’. The rule embedded in the sub-property relationship is: If “P1 – (is) sub-property of – P2”, then any data triple using P1 as its predicate can generate another data triple using P2 as its predicate, with the same subject and object. Let’s call this kind of ontological triple a mapping triple, because it effectively maps one property to another.

Suppose we have two attributes ‘title’ and ‘varying form of title’. I can create the mapping triple:

“‘varying form of title’ – (is) sub-property of – ‘title’”.

If we have a data triple:

“This resource – (has) varying form of title – ‘Pat presents cataloguing for beginners’”

then a reasoner will automatically generate the data triple:

“This resource – (has) title – ‘Pat presents cataloguing for beginners’”

In a similar way, we can create the mapping triple:

“‘title statement’ – (is) sub-property of – ‘title’”

and from the data triple:

“This resource – (has) title statement – ‘Cataloguing for beginners’”

generate:

“This resource – (has) title – ‘Cataloguing for beginners’”

So what? Further suppose that the ‘title’ attribute is from the DCMI metadata terms, and the ‘varying form of title’ and ‘title statement’ attributes are from the MARC21 tags 245 and 246. So a MARC21 record for the resource might contain the set of data triples:

“This resource – (has) 245 [title statement] – ‘Cataloguing for beginners’”
“This resource – (has) 246 [varying form of title] – ‘Pat presents cataloguing for beginners’”

A reasoner can generate the set of data triples:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

In other words, we have generated a DC record from a MARC21 record. Or we have generated a title index for the MARC21 record. Or both.
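For readers who prefer to see the rule run, here is a minimal Python sketch of what a reasoner does with these two mapping triples. The short labels such as `marc21:245` and `dc:title` are placeholders chosen for the example, not real property URIs, and a real reasoner would work over RDF rather than Python tuples.

```python
# Mapping triples: sub-property -> super-property (labels are placeholders).
SUB_PROPERTY_OF = {
    "marc21:245": "dc:title",   # title statement
    "marc21:246": "dc:title",   # varying form of title
}

# Data triples from the MARC21 record: (subject, predicate, object).
DATA = [
    ("this-resource", "marc21:245", "Cataloguing for beginners"),
    ("this-resource", "marc21:246", "Pat presents cataloguing for beginners"),
]


def infer(triples, sub_property_of):
    """Apply the sub-property rule once: same subject and object, broader predicate."""
    return [(subject, sub_property_of[predicate], obj)
            for subject, predicate, obj in triples
            if predicate in sub_property_of]


generated = infer(DATA, SUB_PROPERTY_OF)
```

The original triples are untouched; `generated` holds the two DC-level triples alongside them.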

Let’s add an RDA attribute and an ISBD attribute mapping to the mix:

“[RDA] ‘title proper’ – (is) sub-property of – [DC] ‘title’”
“[ISBD] ‘has title proper’ – (is) sub-property of – [DC] ‘title’”

The data triples:

“That resource – (has) [RDA] title proper – ‘Cataloguing for geeks’”
“Another resource – [ISBD] has title proper – ‘Does cataloguing have a future?’”

can generate the corresponding DC triples, and we end up with:

“This resource – (has) [DC] title – ‘Cataloguing for beginners’”
“That resource – (has) [DC] title – ‘Cataloguing for geeks’”
“Another resource – (has) [DC] title – ‘Does cataloguing have a future?’”
“This resource – (has) [DC] title – ‘Pat presents cataloguing for beginners’”

So now we have a title index to metadata from multiple heterogeneous sources. And the beginnings of a set of records in Dublin Core format.

Note that the attribute which is the sub-property must be entirely narrower in its semantics than the related super-property. If we create the mapping triple:

“‘title’ – (is) sub-property of – ‘varying form of title’”

then we generate the data triple:

“This resource – (has) 246 [varying form of title] – ‘Cataloguing for beginners’”

which is incorrect.

As a result, a data triple generated by a sub-property mapping triple is usually ‘dumber’ than the original data triple; detail is lost because the generated triple uses an attribute which is broader in meaning than the original. This ‘dumbing-up’ is necessary to produce interoperable metadata from different schemas – but data is not permanently lost because the original triple is still available for use in other applications. Needless to say, data triples created with broad attributes cannot be “smartened-down”, at least on their own.

The sub-property relationship can be chained. We can create a new attribute property, MARC21 ‘title’, which could be used in an application for making a title index to MARC21 records, as already mentioned. This new attribute is a super-property of all the MARC21 title-type attributes, and is also a sub-property of the DC ‘title’ attribute:

“[MARC21] ‘title statement’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘varying form of title’ – (is) sub-property of – [MARC21] ‘title’”
“[MARC21] ‘title’ – (is) sub-property of – [DC] ‘title’”

Doing this does not affect the previous mapping triples relating each MARC21 title-type attribute directly to the DC ‘title’ attribute, although it makes them redundant because this new set of mapping triples generates exactly the same data triples at the DC level from the MARC21 originals.
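The redundancy claim can be checked with a small Python sketch that walks the chain upwards. As before, the labels are placeholders rather than real property URIs, and the function names are made up for the example.

```python
def broader_properties(prop, sub_property_of):
    """Collect every property reachable by climbing the sub-property ladder."""
    found = set()
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for parent in sub_property_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def infer_all(triples, sub_property_of):
    """Generate a triple for every broader property of each predicate."""
    return {(subject, parent, obj)
            for subject, predicate, obj in triples
            for parent in broader_properties(predicate, sub_property_of)}


DATA = [
    ("this-resource", "marc21:245", "Cataloguing for beginners"),
    ("this-resource", "marc21:246", "Pat presents cataloguing for beginners"),
]

# Only the chained mappings via the intermediate MARC21 'title' property.
CHAIN_ONLY = {
    "marc21:245": {"marc21:title"},
    "marc21:246": {"marc21:title"},
    "marc21:title": {"dc:title"},
}

# The chain plus the earlier direct mappings to DC 'title'.
WITH_DIRECT = {
    "marc21:245": {"marc21:title", "dc:title"},
    "marc21:246": {"marc21:title", "dc:title"},
    "marc21:title": {"dc:title"},
}

chained = infer_all(DATA, CHAIN_ONLY)
combined = infer_all(DATA, WITH_DIRECT)
dc_titles = sorted(obj for _, predicate, obj in chained if predicate == "dc:title")
```

Both mapping sets produce the same generated triples, which is what makes the direct mappings redundant once the chain exists.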

Different applications can therefore re-use and, if necessary, augment the sub-property chains for each of the high-level core attributes found in most bibliographic metadata schemas, such as title, author/creator/agent, subject, target audience, etc. The chains form a net(work) of mappings, or map, which can automatically dumb-up triples from any level of semantic granularity to any higher level.

We should only have to publish such maps or part-maps once, openly so that anyone can use them and add to them. If the professional communities develop the maps first, much effort will be saved and much authority imparted. This requires collaboration and action real soon now – the ISBD Review Group and the Joint Steering Committee for Development of RDA have started with the development of a mapping between the ISBD and RDA element sets.

These maps should remain valid forever, so the effort is worth expending. The original data triples use the original properties based on the schema attributes at the time and they will be valid “for their time”, in the same way that many catalogues are likely to contain records created under the Anglo-American Cataloguing Rules, with its ‘general material designation’ attribute, long after the successor standard RDA: resource description and access has been adopted with its ‘content type’ and ‘carrier type’ attributes.

And mappings from the MARC21 element sets will show, we hope, that it may not be necessary to convert the entire contents of every MARC21 record as a result of the Bibliographic Framework Transition Initiative!

But the professional communities lack a framework to help them collaborate as a super-community. A network of mappings is more (socially) efficient than an aggregation of one-to-one mappings between pairs of schemas. We need (name)spaces to add intermediary attribute properties and publish the mappings; we need protocols for managing semantic change as schemas evolve; we need lightweight protocols for authorizing mappings; we need systems for ensuring the long-term preservation of RDF element sets and mapping triples.

by Gordon Dunsire at May 12, 2012 06:17 PM

Pattern, Dave

A week on Summon

[ update: slightly revised stats are available here! ]

We've just started collecting in-depth data about how students are searching Summon (keywords entered, facets selected, etc) and I thought some of you might be interested in an early analysis from the last 7 days (just under 40,000 separate searches by 2,807 students)…

notes:

[1] – One student copied & pasted the following 356 word title & abstract into the search box!

Peter J. Shaw, David J. Rawlins Article first published online 2 AUG 2011 DOI:10.1111/j.1365-2818.1991.tb03168.x 1991 Blackwell Science Ltd Issue Journal of Microscopy Volume 163, Issue 2, pages 151–165, August 1991 Additional Information(Show All) How to CiteAuthor InformationPublication History SEARCH Search Scope Search String Advanced >Saved Searches > ARTICLE TOOLS Get PDF (1119K) Save to My Profile E-mail Link to this Article Export Citation for this Article Get Citation Alerts Request Permissions Share Abstract References Cited By Get PDF (1119K) Keywords Confocal microscopy;three-dimensional fluorescence microscopy;point-spread function;deconvolution;computer image processing SUMMARY We have measured the point-spread function (PSF) for an MRC-500 confocal scanning laser microscope using subresolution fluorescent beads. PSFs were measured for two lenses of high numerical aperture—the Zeiss plan-neofluar 63 × water immersion and Leitz plan-apo 63 × oil immersion—at three different sizes of the confocal detector aperture. The measured PSFs are fairly symmetrical, both radially and axially. In particular there is considerably less axial asymmetry than has been demonstrated in measurements of conventional (non-confocal) PSFs. Measurements of the peak width at half-maximum peak height for the minimum detector aperture gave approximately 0·23 and 0·8 μm for the radial and axial resolution respectively (4·6 and 15·9 in dimensionless optical units). This increased to 0·38 and 1·5 μm (7·5 and 29·8 in dimensionless units) for the largest detector aperture examined. The resulting optical transfer functions (OTFs) were used in an iterative, constrained deconvolution procedure to process three-dimensional confocal data sets from a biological specimen—pea root cells labelled in situ with a fluorescent probe to ribosomal genes. The deconvolution significantly improved the clarity and contrast of the data. 
Furthermore, the loss in resolution produced by increasing the size of the detector aperture could be restored by the deconvolution procedure. Therefore for many biological specimens which are only weakly fluorescent it may be preferable to open the detector aperture to increase the strength of the detected signal, and thus the signal-to-noise ratio, and then to restore the resolution by deconvolution. Get PDF (1119K) More content like thisFind more content like this article Find more content written by Peter J. ShawDavid J. RawlinsAll Authors ABOUT USHELPCONTACT USA

…sadly, Summon failed to find a result for that as we don't subscribe to the article!

[2] – Normally, you search Summon by entering your keywords then, after the results appear, you select facets to refine your search and each facet selection invokes a new search. So, if you ran a search and then selected 2 facets, that will be logged as 3 separate searches (1 without any facets, and 2 with).

[3] – Mostly, the publication date facet is being used to limit the search to the X most recent years.

[4] – The vast majority of the content in our Summon instance is in English and, apart from one search that refined the results to just Italian, every use of the language facet was to refine the results to English only.

[5] – Boolean operators have to be entered in UPPER CASE in Summon, with an invisible AND being implicit in any multi-keyword search that doesn't include a Boolean operator. Looking at the search queries that included a Boolean operator, 6% were entered entirely in upper case, implying that the user wasn't consciously invoking a Boolean search.
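A rough Python sketch of how queries might be sorted into those groups follows. The function name and the three labels are invented for illustration, and the rule for "entirely in upper case" (every letter in the query is a capital) is an assumption about how that 6% was identified.

```python
import re

# Summon only treats these words as operators when they are in upper case,
# so the pattern is deliberately case sensitive.
BOOLEAN = re.compile(r"\b(AND|OR|NOT)\b")


def classify(query):
    """Label a query 'plain', 'boolean', or 'all-caps'."""
    if not BOOLEAN.search(query):
        return "plain"
    letters = [char for char in query if char.isalpha()]
    if all(char.isupper() for char in letters):
        # The whole query is capitalised, so the operator may be accidental.
        return "all-caps"
    return "boolean"


labels = [classify(query) for query in
          ["dogs cats", "dogs and cats", "dogs AND cats", "DOGS AND CATS"]]
```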

by Dave Pattern at May 12, 2012 01:24 PM

May 11, 2012

Equinox Software Incorporated

Here we grow again! Link checker functionality in Evergreen

Equinox Software, Inc. is excited to announce the development of link checker functionality in Evergreen. Evergreen currently has no built-in mechanism for verifying the validity of URLs stored in MARC records. The ability to verify URLs will be of particular benefit to locations with large electronic resource collections. The requirements for this project are being developed in partnership with NRCan Library and Statistics Canada Library. The technical specifications for this project will be shared with the Evergreen Community once they are ready. Equinox developers estimate that coding will be completed no later than the end of the third quarter of 2012.

Once the coding is finished, the code will be submitted to Launchpad, where another developer will need to review and approve it. Once it has been signed off on by another developer, it can be included in the next major release of Evergreen. End user documentation will also be made available to the Evergreen Community. For additional information, contact George Duimovich, NRCan Library, or Suzannah Lipscomb, Equinox Software.


by slipscomb at May 11, 2012 07:08 PM

del.icio.us

The Code4Lib Journal - Building an Archival Collections Portal

by jrcwiok at May 11, 2012 06:47 PM

Umlaut - Code4Lib

by mindhiker at May 11, 2012 04:46 PM

Pattern, Dave

Dave's Law

I'd love to have a law named after me, so here goes:


Dave's Law

users should not have to become mini-librarians in order to use the library


If you ever find yourself needing to invoke Dave's Law, please let me know :-)

by Dave Pattern at May 11, 2012 11:52 AM

Open Knowledge Foundation

#OpenDataEDB 2: 16th May

Following the fun we had at March’s Meet-up ‘launch’, we will be having another gathering of people interested in open data next Wednesday 16th May. Hosted by the Wash Bar, Edinburgh, from 19.00, come and join us to discuss ideas, projects and plans in relation to openness.

Lightning Talks will include Federico Sangati on crowdsourcing and education, ahead of his presentation at Dev8ed later this month, and a sneak preview of the hackathon that Open Biblio will be running 12-14th June in collaboration with OKFN’s Open GLAM and Cultural Heritage Working Group and DevCSI.

If you would like to give a lightning talk (informal 2-3 minute presentations) about anything related to open data or knowledge, contact naomi.lillie [@] okfn.org.

Sign up here and we’ll see you there!


For this and other events in Edinburgh and the rest of Scotland, sign up here.

by Naomi Lillie at May 11, 2012 08:30 AM

Denton, William

Ref desk 5: Fifteen minutes for under one per cent

This is the fifth and last in a series about using R to look at reference desk statistics recorded in LibStats. Previously:

I've been making some other charts showing other kinds of ratios and calculations but I'm going to skip to one last pair of charts where I bring in the number of our students to figure out how many students we give research help to each week and for how long.

First, a brief review of the four branches of the York University Libraries system we're looking at:

(Osgoode is law but they don't use LibStats so we'll forget about them for now.)

I calculated how many "home students" each library has. Bronfman handles everyone in the business school and in the administrative studies program in another faculty. Steacie handles everyone in the science and health faculties (except psychology, which is handled at Scott). Frost handles everyone at Glendon. Scott handles everyone else. The York University Factbook let me look up how many students were in each faculty, and I did a bit of adding and subtracting and figured out:

* Bronfman: 6,050
* Frost: 2,677
* Scott: 34,388
* Steacie: 10,018

That's 53,133 students total, as of last fall. (We have about 43 librarians and archivists, for a ratio of 1235 students to each librarian, which is one of the worst in Canada.)

You can figure out something very similar for your library, probably.

With those numbers, we're all set for some more work in R.

First, I make a libstats.bigscott data frame, which gloms together all of the reference desk activities that happen in the Scott Library building (which as I said contains three smaller libraries) into one. This is necessary to group together all possible arts/humanities/social sciences questions. These lines below rename certain library.name fields by saying, for example for SMIL, for every entry in this data frame where library.name equals "SMIL", make library.name equal "Scott." Nice example of vector thinking in R.

> libstats.bigscott <- libstats
> libstats.bigscott$library.name[libstats.bigscott$library.name == "SMIL"] <- "Scott"
> libstats.bigscott$library.name[libstats.bigscott$library.name == "ASC"] <- "Scott"
> libstats.bigscott$library.name[libstats.bigscott$library.name == "Maps"] <- "Scott"
> libstats.bigscott$week <- as.Date(cut(as.Date(libstats.bigscott$timestamp, format="%m/%d/%Y %r"), "week", start.on.monday=TRUE))

Next, use our old friend ddply to count how many research questions are asked each week.

> research.users <- ddply(subset(libstats.bigscott,
                                 question.type %in% c("4. Strategy-Based", "5. Specialized")),
                          .(library.name, week), nrow)
> names(research.users)[3] <- "users"
> research.users$user.ratio <- NA
> head(research.users)
  library.name       week users user.ratio
1     Bronfman 2011-01-31    48         NA
2     Bronfman 2011-02-07    80         NA
3     Bronfman 2011-02-14    42         NA
4     Bronfman 2011-02-21    61         NA
5     Bronfman 2011-02-28    53         NA
6     Bronfman 2011-03-07    59         NA
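For anyone following along without R, the relabel-then-count step can be sketched in Python. The sample rows are invented (including the "1. Non-Resource" question type), and only the date part of the timestamp is parsed, assuming the month/day/year layout used in the R code above.

```python
import datetime
from collections import Counter

# Libraries inside the Scott building are folded into "Scott".
BIG_SCOTT = {"SMIL": "Scott", "ASC": "Scott", "Maps": "Scott"}
RESEARCH = {"4. Strategy-Based", "5. Specialized"}

# (library.name, question.type, timestamp): made-up sample rows.
ROWS = [
    ("SMIL", "4. Strategy-Based", "02/01/2011 10:15:00 AM"),
    ("Scott", "5. Specialized", "02/03/2011 02:30:00 PM"),
    ("Scott", "1. Non-Resource", "02/03/2011 02:45:00 PM"),
    ("Bronfman", "4. Strategy-Based", "02/08/2011 11:00:00 AM"),
]


def week_start(timestamp):
    """Return the Monday of the week containing the timestamp's date."""
    day = datetime.datetime.strptime(timestamp.split()[0], "%m/%d/%Y").date()
    return day - datetime.timedelta(days=day.weekday())


research_users = Counter(
    (BIG_SCOTT.get(library, library), week_start(timestamp))
    for library, question_type, timestamp in ROWS
    if question_type in RESEARCH
)
```

The `Counter` plays the role of ddply with nrow: one count per (library, week) pair, after the non-research questions have been dropped.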

Now, another probably heinous non-R way of dividing the number of users (or, actually, questions) each week by the number of "home students":

> for (i in 1:nrow(research.users)) { 
    if (research.users[i,1] == "Bronfman"          ) { research.users[i,4] = research.users[i,3] / 6050  }
    if (research.users[i,1] == "Frost"             ) { research.users[i,4] = research.users[i,3] / 2677  }
    if (research.users[i,1] == "Scott"             ) { research.users[i,4] = research.users[i,3] / 34388 }
    if (research.users[i,1] == "Steacie"           ) { research.users[i,4] = research.users[i,3] / 10018 }
  }
> head(research.users)
  library.name       week users  user.ratio
1     Bronfman 2011-01-31    48 0.007933884
2     Bronfman 2011-02-07    80 0.013223140
3     Bronfman 2011-02-14    42 0.006942149
4     Bronfman 2011-02-21    61 0.010082645
5     Bronfman 2011-02-28    53 0.008760331
6     Bronfman 2011-03-07    59 0.009752066

user.ratio there is what we're after. It looks low, doesn't it? Multiply it by 100 to get a percentage. It's still low.

Percentage of students seen regarding research

The y-axis is per cent, so this shows that usually through term time we give research help to under 1% of our students. There are a few weeks in some branches where it gets above that, but it's never above 1.5%.

That really surprised me. I have no idea what the numbers are like at other universities. If you figure it out for where you work, let me know. Perhaps one per cent is a common figure? Could it be five per cent at some universities? It would have to be a small university, I think, or have a lot of librarians.

Now that we know how many students we help with research, I wondered how long we spend helping them. More calculations in R, using ref.desk.spent, the function I defined in the last post to add up an estimate of how much time is spent at the desk. Here we break it down by branch by week to create a research.time.bigscott data frame, which I then merge with research.users so I can divide to create the research.mins.ratio, which is what I'm after:

> research.time.bigscott <- data.frame(library.name = factor(), week = factor(), research.mins = numeric())
> branches <- c("Scott", "Frost", "Bronfman", "Steacie")
> for (i in 1:length(branches)) {
    branchname <- branches[i]
    for (j in 1:length(weeks)) {
      spent <- desk.time.spent(ddply(subset(libstats.bigscott,
                                            library.name == branchname & week==weeks[j] &
                                            question.type %in% c("4. Strategy-Based", "5. Specialized")),
                                     .(time.spent), nrow))
      rbind(research.time.bigscott,
            data.frame(library.name = branchname, week = weeks[j], research.mins = spent)) -> research.time.bigscott
    }
  }
> research.users$week <- as.factor(research.users$week) # Necessary for merge to work cleanly
> research.time.bigscott <- merge(research.time.bigscott, research.users, by=c("library.name", "week"))
> research.time.bigscott$research.mins.ratio <- research.time.bigscott$research.mins / research.time.bigscott$users
> head(research.time.bigscott)
  library.name       week research.mins users  user.ratio research.mins.ratio
1     Bronfman 2011-01-31           758    48 0.007933884            15.79167
2     Bronfman 2011-02-07          1340    80 0.013223140            16.75000
3     Bronfman 2011-02-14           595    42 0.006942149            14.16667
4     Bronfman 2011-02-21           997    61 0.010082645            16.34426
5     Bronfman 2011-02-28           775    53 0.008760331            14.62264
6     Bronfman 2011-03-07           901    59 0.009752066            15.27119
> xyplot(research.mins.ratio ~ as.Date(week) | library.name, data = research.time.bigscott,
         type = "h",
         ylab = "Length of average research interaction (minutes)",
         xlab = "Week",
         main = "Average length of research interactions (Scott includes ASC/Maps/SMIL)",
         sub = paste("From Feb 2011 to", up.to.week),
         abline=list(h=15, lty=3, col="lightgrey")
        )

In this xyplot command I throw in an extra abline to draw a dashed light grey line along y=15 to help point out that generally we spend about fifteen minutes on each research interaction.

Time spent per research interaction

The Steacie library stands out from the others, and there are some peaks here and there, but overall we spend on average about fifteen minutes on each research interaction with students.

Put those two charts together and it shows that during term time we spend on average about fifteen minutes a week giving research help to each of under one per cent of our students.

by wtd at May 11, 2012 05:44 AM

May 10, 2012

Rochkind, Jonathan

The Semi-Isolated Rails Engine

(All of this is accurate, so far as I know, in Rails 3.2.3. If you are reading this later in future Rails versions, mileage may vary).

Rails 3 introduced plugins-as-gems, and the special case of Engines. An Engine is basically a library of code that can define its own views, controllers, models, assets, etc., in its own codebase, that will be available for the Rails app. (An Engine doesn’t actually need to be defined in its own gem; it can be defined anywhere that ends up in the load path, but its own gem is typical.) You can have Rails generate a skeleton for an Engine plugin-as-gem with `rails plugin new enginename --full`. (Without the `--full`, it’d be a less powerful plugin without full Engine features; actually it ends up being pretty much just an ordinary gem.)

A “plain” engine (as opposed to the ‘isolated’ engine we’ll discuss later) basically “inserts” controllers, views, and models into the host app: they’re added to the load paths and become part of the host app, the same as any locally defined controller, view or model.

Additionally, routes defined in your $engine/config/routes.rb will be automatically included in the host app. I’m not sure if they’ll be included before or after host app routes; route definition order matters in Rails3 routing.

Name collisions?

If there’s a name collision, the thing with the same name in the host app will usually ‘win’, and the one in the gem will be inaccessible to most code (in gem or in host app). If there’s a name collision between two gems, it probably depends on load order (what order they’re referenced in the Gemfile, usually).

This is pretty much what you’d expect to happen, so long as the host app version really wins, I think it’s “right”.  (With helpers specifically, things can sometimes get confusing and not behave how you expect. I now can’t find the message I think I sent to the rails-user listserv on this at some point, and maybe it’s been changed/fixed in recent versions of rails.)

You can put your models, views, and controllers in module namespaces just exactly the same as you can if you were adding ’em to any Rails app, in order to try and prevent namespace collision. They’ll work just exactly the same way: the point of an Engine is that the stuff in an engine is in the host app’s load paths just the same as if it was really in the host app source locations.

Avoiding routing name collisions can be handled the same way, in a ‘plain’ engine, using the Rails3 router :namespace function, or any of the other related router functions (:as, :module, :path, etc.)

Some Engines handle routing by not including routes in $engine/config/routes.rb, where they’ll be automatically loaded by Rails, but instead loading routes into the host app using their own logic, so it can be done just so. This is especially useful for routes that should be changed by host app configuration. For instance, Devise and its `devise_for` method, which the host app calls manually in its own routes.rb.

Isolated Engines: Rails 3.1

Rails 3.1 introduced the “isolate_namespace” directive, which you can add to your engine module.

The one main effect this has is actually on routing. $engine/config/routes.rb are not added to the host app’s routing.  Instead, Rails creates a little Rack mini-app out of your engine (or maybe any Engine already is this?), with your engine’s routing in it, so that host app can mount the Engine into the host app’s own routing, using the standard Rails routing ‘mount’ directive for Rack apps. See the Engines guide (or the edgeguide version, with slightly expanded information).

It also makes the engine’s $engine/config/routes.rb behave a bit differently as far as default routing params: assuming all routes are :namespace’d, making sure the routing helper methods are available to your Engine’s controllers and helpers (and at the right method names), etc.

On top of this, it changes how rails generators work inside your engine. You can use rails generators inside an engine to add controllers and models. In a ‘plain’ engine, if you call `rails generate controller foo`, it’ll add an $engine/controllers/foo_controller.rb, just like any rails app.  It’ll add an `$engine/views/foo` directory and an `$engine/helpers/foo_helper.rb`. Just like an app.

In an Engine with `isolate_namespace`, if you call `rails generate controller foo`, it’ll namespace everything it generates for you: `$engine/controllers/$enginename/foo_controller.rb` will contain a controller whose class is EngineName::FooController. Similarly, the view folder ends up in `$engine/views/$enginename/foo`, etc.

Isolated engines are convenient for many cases.  You can have Rails generate a new skeleton for an isolated engine with `rails plugin new enginename --full --mountable`

There’s one aspect of them, though, that you may or may not want — and is fortunately pretty easy to change, giving you what I’ll call a Semi-Isolated Engine.

More Isolation Than you Might Want: Controller inheritance

There’s one aspect of isolated engines that ends up being a bit confusing — It’s actually not caused by the `isolate_namespace` directive in the Engine, but purely by the Rails generators — in fact, purely by the `--mountable` arg to `rails plugin new engine_name --full --mountable`.

Let’s look at how controller inheritance works.

If you use `rails generate controller` to generate a controller in your engine, and you look at it, you’ll see that it’s defined as `< ApplicationController` (inheriting from the class called ApplicationController), just like a controller in a normal app. But your engine gem doesn’t have an ApplicationController (at least it ought not to, at least not a top-level-namespace ::ApplicationController), so what’s it inheriting? Well, it’s inheriting from the ApplicationController in whatever host app it happens to be running in.

This means common logic in the host app’s ApplicationController is available to engine controllers. (Say, a current_user? method; the engine would obviously need to document its conventions.) It also means all the helper methods loaded into the host app in a way that they apply to all controllers will be available to engine controllers/views. It also means that, by default, the default Rails template layout for controllers in the engine is the host app’s `application` layout, or any other default layout specified in the host app ApplicationController.

Sometimes that’s all actually nice, but sometimes you want more isolation. If you generate an engine with `rails plugin new x --full --mountable`, you get it. But how you get it is a bit confusing at first.

mountable/isolated generation of Engine::ApplicationController

If you generate a `mountable` (i.e., isolated) engine, and then you use `rails g controller` to generate a controller, you’ll see it’s still defined as `< ApplicationController`. And yet it doesn’t actually inherit the behavior of the host app ApplicationController: it’s got no logic from the host app ApplicationController, no helpers, won’t find its layout, etc.

What’s going on? It’s a different ApplicationController. When you generate an engine with `rails plugin new enginename --full --mountable`, it generates an EngineName::ApplicationController in $engine/controllers/$engine_name/application_controller.rb.

Because of the way Rails constant lookup works, it’s finding this ApplicationController.

And it generated a layout in your engine too at $engine/views/layouts/$engine_name/application.html.erb.

That’s the layout used by all your engine controllers, by default too.
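
The lookup behaviour described above can be sketched in plain Ruby, without Rails. This is only an illustration: the Widgetizer engine name, FooController and the layout_name method are made up, and the classes are bare stand-ins for what the generators produce (real ones would descend from ActionController::Base).

```ruby
# Stand-in for the host app's top-level controller (hypothetical, no Rails needed).
class ApplicationController
  def layout_name
    "application"
  end
end

module Widgetizer
  # Stand-in for what a --mountable engine generates: a second,
  # namespaced ApplicationController that lives inside the engine module.
  class ApplicationController
    def layout_name
      "widgetizer/application"
    end
  end

  # The bare name is resolved lexically, so Ruby checks the enclosing
  # Widgetizer module before falling back to the top level.
  class FooController < ApplicationController
  end
end

puts Widgetizer::FooController.superclass       # => Widgetizer::ApplicationController
puts Widgetizer::FooController.new.layout_name  # => widgetizer/application
```

The top-level ApplicationController is untouched; it is simply shadowed for any code written inside `module Widgetizer`.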

multiple ApplicationController’s, really?

While this level of isolation is perhaps useful for many (most?) Engines, I question the decision to ‘override’ the ApplicationController class name and count on ruby constant-lookup in namespaces to get to the right one. ruby namespaced constant lookup is notoriously confusing, and changes from ruby version to version not always in documented ways.  I think it’s just asking for developer confusion and bugs.

Fortunately, it’s only a feature of the Rails generators (both `rails plugin new` and `rails generate controller` within an isolated_namespace engine). It’s got nothing to do with actual Rails runtime logic.

If you want to do it differently, no problem. Go rename $engine/controllers/$engine_name/application_controller.rb to, say, engine_name_controller.rb instead (renaming the class inside to match), and the layout to engine_name.html.erb. All of your engine controllers should now be “< EngineNameController” instead of “< ApplicationController”.

You’ve got the exact same behavior, just with less confusing and error prone names.

Sadly, `rails g controller` in an isolated_namespace engine will still generate “< ApplicationController”; you’ll have to manually change it each time you use the generator.

Now, for the Semi-Isolated Engine

Okay, now we can get to the actual point. While isolating controllers like this can be useful sometimes, sometimes it’s not. You might still want the routing isolation that “isolate_namespace” gives you, and the convenient change in behavior of the rails generators under that condition.

But you do want your engine controllers to inherit from the host app ApplicationController. No problem!  Just change that engine ‘main’ controller to “< ApplicationController”. You could do that even without the name change we discussed above, by properly scoping to top-level namespace, but that would lead to the confusing (but correct!) EngineName::ApplicationController < ::ApplicationController.

Less confusing if we changed the name as recommended above, say if your engine is the Widgetizer, Widgetizer::WidgetizerController < ApplicationController.
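
Here is a rough plain-Ruby sketch of that semi-isolated arrangement, again without Rails. The Widgetizer and FooController names and the current_user method are assumptions for illustration only; a real host ApplicationController would descend from ActionController::Base.

```ruby
# Stand-in for the host app's controller (hypothetical, no Rails needed).
class ApplicationController
  def current_user
    "host app user"
  end
end

module Widgetizer
  # The renamed engine base controller. There is no
  # Widgetizer::ApplicationController any more, so the leading :: is
  # optional here, but it makes the intent explicit.
  class WidgetizerController < ::ApplicationController
  end

  # Every other engine controller inherits from the engine base.
  class FooController < WidgetizerController
  end
end

puts Widgetizer::FooController.ancestors.include?(ApplicationController)  # => true
puts Widgetizer::FooController.new.current_user                           # => host app user
```

Engine controllers keep a single engine-specific parent for shared engine logic, while still picking up whatever the host app defines on its own ApplicationController.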

Now,

So there you have it, the “Semi-Isolated Rails Engine”, a design that works well for me for certain kinds of engines. It’s a testament to Rails 3.x’s nice, clean, flexible, consistent, well-designed architecture that we don’t need to fight with Rails’ actual runtime logic at all to do this; we don’t even need to change it, we just need to make different choices than the Rails engine generators make. If someone wanted to, they could even make their own generators that behaved this way for a ‘semi-isolated rails engine’.


Filed under: General

by jrochkind at May 10, 2012 08:17 PM

Equinox Software Incorporated

Bibliomation has planted a seed – look at what’s growing in Evergreen now!

Bibliomation, Inc., Connecticut’s largest library consortium, is sponsoring the integration of Syndetic Solutions by Bowker with Template Toolkit OPAC (TPAC) in Evergreen. Equinox developers will be writing the code for this project. TPAC will be able to support cover images, reviews, summaries, tables of contents, excerpts, and author notes from Syndetic Solutions. Once the code is written, it will be submitted on Launchpad, where another developer will need to review and approve it. Once the code is signed off on by another developer, it can be submitted for inclusion in the next major release of Evergreen. For more information, contact Amy Terlaga at Bibliomation / terlaga@biblio.org or Suzannah Lipscomb at Equinox / slipscomb@esilibrary.com

 



 

 


by slipscomb at May 10, 2012 05:26 PM

Open Knowledge Foundation

Launching YourTopia Italia: Progress in Italy, defined by You

How do we measure social progress? Academics and international institutions have struggled with employing measures of human development which go beyond GDP per capita: education, health, the economy. But then what values do we attach to these?

In countries like Italy stark regional differences have dominated over time. Particularly in times of fiscal austerity when the country attempts to recover from an economic crisis with major social consequences, seeing how and why the South and the North differ is an important step in a consensus-building process to find solutions and realise collaboration with the citizens.

Sliders

The Open Economics Working Group of the Open Knowledge Foundation released YourTopia Italia – an application which gives the users a chance to input their priorities in eight categories of socio-economic progress:

Each category comprises sub-indicators, e.g. Neighbourhood Safety, Income Inequality, Problems with Air Quality or Friends Networks. While the Northern regions fare rather well in most indicators, which are highly correlated with income per capita, Social Life seems to be better in the Italian South, where more people get married, fewer people separate and more people meet friends in their free time.

Maps-YourTopia

YourTopia Italia gives a chance to the user to adjust weights of their personal priorities and see how the map changes when some indicators are excluded altogether. A timeline visualisation also gives the perspective of how Italian regions have developed over time.

Timeline

All YourTopias can be saved and shared through social media.

So, join our efforts: go to italia.yourtopia.net and define the YourTopia that reflects your vision of social progress!

The application was created with a dataset assembled from Istat, and the source code of the application is released under an open license. This project is the result of a team effort and follows up on ideas initiated during the Open Economics Hackday in January this year.

by Velichka Dimitrova at May 10, 2012 01:00 PM

Bradley, Fiona

Collaborative librarianship: working smarter

Tomorrow, I will travel to Peru for the third BSLA review meeting. I’m looking forward to hearing what they’ve been up to (a lot!) and sharing what’s already come out of Botswana and Cameroon. One of the best things about this project has been how enthusiastically everyone has embraced working with colleagues in their own region, and across the world when we meet. I’m sorry to say it, but far, far too often at home I encounter that all-too-familiar syndrome: Not Invented Here. Every country, organisation, and many individuals want to put their own stamp on the profession, which is great, but at the same time that can lead to a lot of reinventing of the wheel and ignorance of resources that already exist. I’ll list just one obvious institutional example: study guides. I’ve been guilty of this too. Research, statistics, standards and training materials are other resources that also tend to be rewritten frequently. There is a barrier between research and practice. One way we’ve tried to help with that is by distilling research into practical case studies.

In countries with few resources, there’s a tendency by many librarians to work smarter, across borders and regions to get the information, specialists, and advice that’s needed to develop the profession and library services. A librarian may travel to Cameroon from Senegal to train on digitisation. A librarian in Cameroon may go to Angola to advise on LIS curricula. Regional associations provide a common meeting place. Certainly, it would be preferable to have enough resources in the country itself, and replicating projects is never so simple as just running the same thing again in a different place.

The recognition of the necessity to work together, across borders and sectors, building on where you are and what you already have, is something that we could all gain from.

by Fiona at May 10, 2012 12:33 PM

Bigwood, David

Open Annotation Core Data Model

The W3C has published the Open Annotation Core Data Model.

Annotating, the act of creating associations between distinct pieces of information, is a pervasive activity online in many guises but lacks a structured approach. Web citizens make comments about online resources using either tools built in to the hosting web site, external web services, or the functionality of an annotation client. Comments about photos on Flickr, videos on YouTube, people's posts on Facebook, or mentions of resources on Twitter could all be considered as annotations associated with the resource being discussed. In addition, there are a plethora of closed and proprietary web-based "sticky note" systems, and stand-alone multimedia annotation systems. The primary complaint about all of these systems is that the user-created annotations cannot be shared or reused, due to a deliberate "lock-in" strategy within the environments where they were created, or at the very least the lack of a common approach to expressing the annotations.

The Open Annotation data model provides an extensible, interoperable framework for expressing annotations such that they can easily be shared between platforms, with sufficient richness of expression to satisfy complex requirements while remaining simple enough to also allow for the most common use cases, such as attaching a piece of text to a single web resource.
Seen on Digital Koans.

by David (noreply@blogger.com) at May 10, 2012 10:51 AM

Open Knowledge Foundation

Boundless Learning Got Served. What does it all Mean for Open Textbooks?

If you are at all familiar with the open textbook world, you’ve likely heard of the startup called Boundless Learning. Leveraging information in the public domain, as well as dipping into the enormous stockpile of learning that is Open Education Resources, Boundless Learning has created a tool that hopes to eventually replace the traditional textbook model.

Just like “open” anything, however, Boundless Learning has not gone without its fair share of trouble from vested industry interests. Recently, the textbook publishing giant Pearson, along with MacMillan and Cengage, filed a complaint alleging copyright infringement. Even though Boundless Learning culls its information from material available to the public through Creative Commons licensing, the publishers allege that “Defendant [Boundless Learning] exploits and profits from Plaintiffs’ successful textbooks by making and distributing the free “Boundless Version” of those books in the hopes that it can later monetize the user base that it draws to its Boundless Web site. In short, to build its business on Plaintiff’s intellectual property rights.”

Boundless Learning, on the other hand, claims that the accusations are patently false. The startup states that it only uses information already in the public domain, and said in a Boston.com article, “you can’t copyright facts and ideas. When you look at educational information, it’s primarily facts and ideas.” Boundless Learning will soon send out a legal response, and has expressed disappointment that the textbook publishers didn’t communicate with Boundless Learning amicably before resorting to litigation.

So what does this mean for the open textbook movement? Can we expect more lawsuits of this nature against innovative businesses? For one, Boundless Learning has truly launched a paradigm-shifting product. Most open textbooks are presented to students in PDF format using e-readers and other devices. However, Boundless Learning has extended beyond just digitizing traditional books by offering more. Their content is distinctly interactive, and students may build upon Boundless Learning material in a way that closely resembles both Facebook and Wikipedia. You can study along with other students, help each other in the learning process, and do it all online. For free.

Lawsuits of this sort aren’t anything new, and it’s important for those of us who are believers in the open textbook movement that we understand what we’ll have to fight against to live in a more open society. While Boundless Learning may have been careless in copying the format of copyrighted textbooks, down to the pagination, it does offer a platform that is new, that goes beyond mere open versions of closed textbooks. It’s with this innovative spirit that we can effectively, legally, and affordably, make information available to all. The world is not yet open, but we can get it there.

This guest post is contributed by Katheryn Rivas, who writes on the topics of online university. She welcomes your comments at her email Id: katherynrivas87@gmail.com.

by Katheryn Rivas at May 10, 2012 09:30 AM

New Brazilian Data Portal dados.gov.br – powered by CKAN

Last Friday (May 4), the Ministry of Planning in Brazil launched the final version of the Brazilian Open Data Portal. In line with the federal government policy to promote the use of free software in public administration, the portal was made using only free and open source tools. Among them is the Open Knowledge Foundation’s open-source data portal software CKAN. Moreover, the whole process of development of the portal was conducted with the participation of concerned citizens in an open way to promote open data.

dados.gov.br

Opening Data Openly

The development project of the Brazilian Open Data Portal takes the concept of social participation to the extreme. From the beginning, planning meetings and development forums were open to any interested citizen, announced in advance on open discussion lists and where possible relayed via live streaming video to the Internet (webcast).

In each planning meeting, the development tasks were selected by a flexible development process, in which people presented ideas of what they thought was needed on small ticket records. At the end of the round, the tickets were grouped, categorized and prioritized. At the end of the meeting, the outcomes were recorded in a publicly accessible wiki (INDA wiki) and a publicly visible task manager (Trac).

We engaged the participation of people from civil society and of civil servants, who collaborated in various ways. Some people were involved right through the process, while others made contributions along the way. We had contributions in the form of software development, design, and information architecture, among others. The latter began with an experimental “card sorting” conducted with the participants of the event Campus Party 2012 in Sao Paulo. This synergy between government and citizens working together for the common good is what we mean by open government.

The Portal has gone through several versions, but the most important are the first (a simple HTML page with a tagcloud of catalogue data), followed by its beta, a little more prepared and documented, and then the current version with a new set of features and extensive reference and learning material.

dados.gov.br now has 78 data sets with 849 resources. These have mostly been catalogued based on a survey of data that public bodies already publish on the Internet, but that until then were scattered and lacked a central access point where the public could find them. They are, however, the tip of the iceberg compared to what there is to be opened around public data in Brazil.

Recognizing this and the urgency in meeting the new law on access to information, the Secretariat for Logistics and Information Technology is preparing a workshop to guide public bodies on how to include their data in the catalogue. This will take place in early June.

The portal is part of a larger project called the National Infrastructure Open Data – INDA. The general idea of INDA is to establish technical standards for open data, promote training and support public bodies in the task of publishing open data. This entire process is done through intra-government cooperation and cooperation between government and citizens, always aiming to achieve a real platform for open government.

by Rufus Pollock at May 10, 2012 08:59 AM

May 09, 2012

Schneider, Jodi

Error reporting: it’s easier in Kindle

One thing I can say about Kindle: error reporting is easier.

You report problems in context, by selecting the offending text. No need to explain where - just what the problem is.

Feedback receipt is confirmed, along with the next steps for how it will be used.

By contrast, to report problems to academic publishers, you often must fill out an elaborate form (e.g. Springer or Elsevier). Digging up contact information often requires going to another page (e.g. ACM.). Some make you *both* go to another page to leave feedback and then fill out a form (e.g. EBSCO). Do any academic publishers keep the context of what journal article or book chapter you’re reporting a problem with? (If so, I’ve never noticed!)

by jodi at May 09, 2012 09:45 PM

del.icio.us

The Code4Lib Journal – Presenting results as dynamically generated co-authorship subgraphs in semantic digital library collections

by gaps96 at May 09, 2012 09:15 PM

Rochkind, Jonathan

Rails Engine Guide

The Rails Guides are actually really good overview documentation. The days of saying Rails documentation is terrible are over, with the good guides, and good api-level docs too.

I knew there was a Plugin Guide, but I only just noticed there’s an Engine Guide too.

For reasons I don’t know, the Engines guide is not listed on the Guides home table of contents page, even though it’s available at guides.rubyonrails.org.   It also doesn’t google very well.

So here’s my part to publicize it. Both the Engines and Plugin guides are pretty good. They’re also overlapping in coverage. As the Engines guide says:

Engines are also closely related to plugins where the two share a common lib directory structure and are both generated using the rails plugin new generator. The difference being that an engine is considered a “full plugin” by Rails as indicated by the --full option that’s passed to the generator command, but this guide will refer to them simply as “engines” throughout. An engine can be a plugin, and a plugin can be an engine.

Maybe ideally they’d be merged, but they’re both good guides; you’ll probably want to review both if you’re working on Rails plugins OR engines.

Actually, you won’t find those exact words above in the current stable release of the Engines Guide; you’ll find them on edgeguides.rubyonrails.org instead. I believe the actual guides are versioned with Rails releases: after one is released to guides.rubyonrails.org with the most recent Rails version, it’s never changed (except perhaps for very serious errors); rather, any changes will be released with the next Rails release. But you can preview ’em on edgeguides. (I’m not sure if new ‘stable’ guides at guides.rubyonrails are pushed with new patch-releases of Rails, or just new minor-releases.)

So you might find edgeguides contains some content that doesn’t apply to the current Rails release; but it may also contain improved explication or better examples or more coverage, as a result of contributors improving things. It’s worth checking edgeguides for a complicated topic. In this case, the Engines guide is rather improved in the ‘edge’ version as of this writing, and I don’t believe it includes anything that won’t work in Rails 3.2, so it’s worth checking out.

A while ago I learned about the `rails plugin new plugin_name` command to give you a gem plugin skeleton, including a very useful dummy app for testing. Before that I had been doing it by hand! But this generates a very lightweight plugin; I was going in by hand and adding the files and methods to turn it into a heavyweight engine. I only just now, from the Engines Guide, noticed you can do `rails plugin new plugin_name --full` to get a fully engine-ized plugin.

Note on plugins ‘vs’. gems

It’s not a “vs.”.  Since Rails 3.0, Rails plugins could be distributed as gems. Since Rails 3.2, distributing a plugin as anything but a gem is deprecated — vendor/plugins is probably going to go away in future releases.

The great architecture of Rails 3 is such that it wouldn’t be that hard to put it back into a particular app yourself, if you really needed to for legacy purposes. But in general you don’t want to: plugins-as-gems are great, making dependency management a lot more sane.

The current guides.rubyonrails.org plugin guide still gives you the option of a /vendor/plugin or a plugin-as-gem — it ought not to, since the current Rails version deprecates /vendor/plugin.  So don’t do that with a new plugin.

The edgeguide is more clear.  (I made the commit myself! Did you know anyone can commit to docrails github repo?  The commits are reviewed by editors before being merged into actual rails, and in the case of guides, eventually deployed to guides.rubyonrails).


Filed under: General

by jrochkind at May 09, 2012 06:46 PM

LITA

Jobs in Library Technology: May 9

New vacancy listings are posted weekly on Wednesday at approximately 11:00 a.m. Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Digital Services and Instruction Librarian, Full-time, Tenure-Track Position, San Mateo County Community College District, San Mateo, CA

Visit the LITA Job Site for more available jobs and for information on submitting a  job posting.

 

by vedmonds at May 09, 2012 05:04 PM

Library Hackers Unite blog

Groklaw has the goods on Oracle v. Google

I’m expecting you already know about the important Oracle v. Google court case over Android’s use of Java APIs, including both copyright and patent claims. But it would be hard to find a more detailed and direct account than Groklaw’s series of notes from the courtroom like this one from the copyright phase, ultimately subtitled Partial Verdict; Oracle Wins Nothing That Matters.  For the entire ongoing catalog, try this rabbit hole.

Having read direct courtroom reporting and the Court’s own documents, the headlines in some mainstream news outlets declaring Oracle the “winner” and Google “guilty” will start to look awfully remote and more than a little bizarre.  Google has moved for a new case since this jury was unable to determine whether their code constitutes a “fair use” of the Java API, so we might get to see the whole thing play out again.

More likely, the Court itself could deliver definitive resolution to the question of whether APIs are copyrightable, particularly if the Court’s opinion converges with its EU counterpart in a very recent case (Ars Technica via Google Cache, since the original is 404-ing for some reason). One can hope. The EU ruling was particularly bold because it protects reimplementation to the extent of voiding any agreements (read: EULAs) inhibiting that right. That is a smart extra step, so that the rights are not immediately click-through disposable.

For more, follow along during the trial’s current patent phase at Groklaw.

by Joe Atzberger at May 09, 2012 04:34 PM

Open Knowledge Foundation

Hackathon alert: BiblioHack!

The Open Knowledge Foundation’s Open Biblio group, and Working Group on Open Data in Cultural Heritage, along with DevCSI, present BiblioHack: an open Hackathon to kick-start the summer months. From Wednesday 13th – Thursday 14th June, we’ll be meeting at Queen Mary, University of London, East London, and any budding hackers are welcome, along with anyone interested in opening up metadata and the open cause – this free event aims to bring together software developers, project managers, librarians and experts in the area of Open Bibliographic Data. A workshop will run alongside the coding on the 13th, and a meet-up on the evening of the 12th is open to all whether you’re attending the Hackathon or not.

What is BiblioHack?

BiblioHack will be two days of hacking and sharing ideas about open bibliographic metadata.

There will be opportunities to hack on open bibliographic datasets and experiment with new prototypes and tools. The focus will be on building things and improving existing systems that enable people and institutions to get the most out of bibliographic data.

If you’re a non-coder there are sessions for you too. We will be running a hands-on workshop addressing the technical aspects to opening up cultural heritage data looking at best of breed open source tools for doing that, preparing your data for a hackathon and the best standards for storing and exposing your data to make it more easily re-used.

When and where?

Who is organising the event?

Who else is involved?

We’ve already lined up a whole host of speakers and groups who’ll be attending both the hack and the workshop. The list so far includes UK Discovery, CKAN, Europeana, Total Impact, Neontribe, The British Library with many more to be added in the coming days…

You’re giving your time and expertise – what do you get if you attend the whole hack?

How can I sign up?

Please note, if you wish to attend all 3 events you should sign up for each, and the Workshop will run in parallel with the hacking on the morning of the 13th.

More questions?

Contact Naomi Lillie on admin [@] okfn.org.

See you there!

by Naomi Lillie at May 09, 2012 12:04 PM

Sefton, Peter

File wrangling for researchers / Feral-data capture

[Had some problems with the images in this post at first, should be fixed now]

At UWS we’re about to start work on our Research Data Repository project, which you can read about over on the UWS eResearch blog. The starting point will be the Research Data Catalogue component of the repository. The main point of the catalogue is to describe research data collections for the purposes of discovery, reuse, reporting and archiving. But what is a research data collection and how might a researcher put one together?

I won’t attempt an all-encompassing answer to that, but I would like to look at one common case, where a data collection is a set of files. How can we help researchers deal with file-based data efficiently, and as generically as possible?

This is important: we know from talking to eResearch and IT people from other universities that if you provide raw storage to researchers, people will use it; it will start to fill up, and at some point the institution will be scratching its eResearch head and asking “what exactly do we have here?”. We really need to get data described both early and often, and to think about data in the context of the research lifecycle: applying for grants, reporting on grants, publishing and so on.

This post tackles the question “How can we help our researchers keep track of the vast amounts of stuff that will start accumulating on our servers when we roll out file storage services?” It summarizes a recent demo I gave at Intersect NSW, our eResearch partner, at a meeting organized by Ingrid Mason (@1n9r1d). The demo is designed to show how generic file management services could help researchers to select and package file-based data for easy deposit into long-term curated, managed storage in a couple of scenarios.

I have written about this before, and showed some other Intersect people a similar demo last year in the hope that the demo might be of interest to the team working on the application formerly known as FieldHelper. FieldHelper is about getting files labeled and bundled for repository deposit as efficiently as possible. I’d love to hear from the Intersect team about what other applications they’ve found in this space, and their experiences with The Fascinator software. Comments are open below, or there is an active mailing list for the software.

Previously I have shown the same demo software looking at other kinds of data such as computational chemistry in the Beyond the PDF workshop organized by Prof Phil Bourne, and documents, such as Joss Winn’s thesis.

The use-case for a data collection is where there are a number of files that need to be grouped together:

So, there’s a bunch of data files sitting somewhere: on a laptop, a share, a USB memory stick, in Dropbox, etc.

For the data for this demo I chose an example from the University of Western Sydney, where I work, using a data collection collected by Professor Roger Dean and Dr Freya Bailes from the MARCS institute. This data set is one of the exemplars from the university’s Seeding the Commons project, funded by the Australian National Data Service.  It’s a collection of measurements of audio intensity in a range of musical works consisting of 51 files, all plain text. This data set is explained in a journal article, A rise-fall temporal asymmetry of intensity in composed and improvised electroacoustic music.

There’s a web application (The Fascinator) watching the relevant storage, finding all the files you put there and showing them to you as best it can through a web browser. There are two ways to package the files:

  1. In a hand-curated ‘package’ where you can corral a group of files, optionally provide some navigation hierarchy and describe the data. This was the main focus of this particular demo.

  2. In a dynamic view of the working storage that watches the storage for data with certain properties such as a location on disk, a tag, or a metadata field and does something with it, like routing it to a repository or a collaboration-space.

The demo:

  1. There’s a Dropbox folder (as in dropbox.com) on my machine. I put the sound intensity data files in there:

  2. I’ve set up a server using a free (as in beer) virtual machine from the NeCTAR research cloud, funded by the Australian government.

    On the server I have installed a copy of The Fascinator in its default, un-customized guise – but remember the same software could be installed on a laptop, or in the lab. (The Fascinator is the Free Software toolkit that was used to build the ReDBox Research Data Catalogue that’s being widely deployed in Australian Universities now, including at UWS).

  3. The server also has the Dropbox folder so anything I put in the folder on my machine turns up there (there’s still no compelling Free (as in libre) alternative to Dropbox that I could have used, but we keep looking – has anyone tried OwnCloud or SparkleShare? Let me know in the comments).

  4. The Fascinator is, by virtue of a few lines of configuration, watching the Dropbox folder. Anything that appears in the folder gets processed. Metadata is extracted, web-previews are generated for office documents, images, videos etc using an extensible set of plugins. If there was a business case someone could write a plugin for the sound intensity data, to show it as a graph, or do analysis across samples.

  5. You can see the files in the web interface via the file system :

  6. And via a search interface:

  7. And there is a mechanism to package several files together, and build a navigational structure for them. This produces a navigable package outline.

  8. Here’s what it looks like when browsing the package online to find an individual file:

    So we have:

    • Found the data.

    • Packaged it together and ordered it.

    • An interface, using The Fascinator where we can eyeball the data file by file, tag things, or apply formal metadata; there’s a huge list of features in The Fascinator, we would need to work out which ones are useful to which researchers if we deployed it in this kind of role.

  9. The next step is not done yet, but soon we will demonstrate a very simple workflow showing a path from files on disc to a package in the institutional Research Data Repository. I could tag this package as ‘CurateMe’ and the institutional Research Data Catalogue could pick it up and put it in the work-queue for a research librarian to help with long-term curation. This is exactly the same model we described for linking our Data Capture project for ecological data to the Research Data Catalogue.
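As a rough illustration of step 4 in the demo (a watched folder whose new files are handed to plugins), here is a minimal Python sketch. It is not how The Fascinator is implemented or configured; the polling approach, the `scan_once` function and the extension-keyed `plugins` table are all assumptions made for illustration.

```python
import os
import tempfile

def scan_once(folder, seen, plugins):
    """Process files in `folder` not yet in `seen`; return what was produced."""
    results = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if path in seen or not os.path.isfile(path):
            continue
        seen.add(path)
        ext = os.path.splitext(name)[1].lower()
        handler = plugins.get(ext, plugins["*"])  # fall back to the default plugin
        results.append(handler(path))
    return results

def describe(path):
    """Default (hypothetical) plugin: record basic file metadata."""
    return {"file": os.path.basename(path), "bytes": os.path.getsize(path)}

def count_lines(path):
    """Hypothetical plugin for plain-text data files: also count the lines."""
    with open(path, encoding="utf-8") as fh:
        return dict(describe(path), lines=sum(1 for _ in fh))

plugins = {"*": describe, ".txt": count_lines}

with tempfile.TemporaryDirectory() as folder:
    seen = set()
    with open(os.path.join(folder, "intensity01.txt"), "w", encoding="utf-8") as fh:
        fh.write("0.1\n0.4\n0.3\n")
    first = scan_once(folder, seen, plugins)   # picks up the new file
    second = scan_once(folder, seen, plugins)  # nothing new this time
    print(first, second)
```

A real deployment would call something like `scan_once` in a loop with a sleep, or use operating-system file notifications instead of polling.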

This work is a demo that was built by the team back at the University of Southern Queensland. The work there was halted, but now with many institutions building institutional Research Data Catalogues with their free ANDS Metadata Stores money it is time to think about how we might capture some of the long-tail research data which is never going to have a $200,000 data capture project devoted to it and how we are going to keep track of data throughout the research lifecycle.

Copyright Peter Sefton 2012. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>

by ptsefton at May 09, 2012 12:09 AM

May 08, 2012

Ng, Cynthia

TRY 2012: Drupal for Libraries at UTL

Just a warning that some of this gets fairly technical, especially with hardware setup, and without the related diagrams, it may be difficult to understand, but the basics are there.

Presenters

Evolution of the UTL Website

Drupal

Drupal Related Additions

Implementation

Training

Performance

Setup

Results

As I posted on twitter, I’m quite glad we don’t take care of our own hardware, especially since we just don’t have the people and resources (including not having any server admin), but I was quite impressed with the UTL Drupal setup. Quite interesting to hear what they’re doing.


Filed under: Academic, Technology

by Cynthia at May 08, 2012 11:33 PM

LITA

LITA @ #ala12 Anaheim

Once again I bring you a gCal of the LITA-sponsored events at ALA Annual conference.

Also, I encourage you to go and check out the awesome ALA Annual Conference Scheduler and build your own custom calendar for the conference.

Please note that the LITA Happy Hour will be on Sunday 6/24/2012 from 5:30pm – 8:00pm (a change from our usual Friday evening, so I felt it should be highlighted)

by AaronDobbs at May 08, 2012 10:41 PM

Ng, Cynthia

TRY 2012: Evolving Services with Technology

Angela Hamilton, U of T Scarborough, spoke about technologies that she has used particularly at a campus where many are commuter or distance education students.

Libguides: Customized tools

Online Meeting Software

Screencapture Videos

I think some of the ideas presented here are great ways to give students further reference on how to do their research, especially on-the-spot screencasts for customized tutorials for them to review later.


Filed under: Library, Technology

by Cynthia at May 08, 2012 08:32 PM

TRY 2012: Digital Signage at the Robarts Library (UTL)

This presentation actually not only talks about digital signage itself, but also the work culture change that happened in the systems department at UTL.

Presenters

Good Signs Can Make a Difference

Writing the Message

I don’t know that I agree with all of these, but then it was clear that it depends on the size and distance of the sign as well as where it is.

Presenting the Message

What Makes Digital Signage Different?

What Users Say

  1. Help me make better decisions
    • chat with a librarian, workshops
  2. Save me time
    • maps: library, stacks, workstations
    • directories: by floor, service, name, library
  3. Show me something relevant to me
    • news, community content
  4. Tell me something new and interesting
    • exhibits, events, news
  5. Give me ideas
    • collection highlights

This is not what their actual users were saying. These ideas were based on a talk done by someone outside of the library and the list here is how those ideas might be applied in a library setting.

Touchscreen Kiosks

Interaction

What’s Next

This is interesting, because we’re working on something similar at our library and we were considering how responsive to make the site. Obviously, we need to seriously consider designing from desktop down to mobile.

Overhead Signage

Features

What’s Next

Building Directories

Development – Devops Movement

On System Administration

Devops Goals

  1. Eliminate stereotypes
    • developers are careless, arrogant while sysadmins always say no and work all night
  2. Increase communication between developers, operations, and management
  3. Continuous systems improvement
  4. Break down barriers and silos
  5. Develop methods to encourage all team members to see the organization’s goals

Advantages

Implementing DevOps With Digital Signage

I thought it was interesting that they spoke a lot about the more technical aspect as well as development methodology. I think it’s a good lesson for a lot of library IT departments that agile development with integrated back and front end staff can be very beneficial, particularly because it makes development faster and more flexible.

One of the things that came up during the code4lib conference too is that developers should have a small amount of time to work on whatever seems interesting to develop new tools or services.


Filed under: Events, Technology, Web design

by Cynthia at May 08, 2012 08:19 PM

Bisson, Casey

Find Neighbors On The Same IP

What other sites share the same infrastructure with your site, or any other? Bing‘s IP search can answer. Do a search by IP number:

by Casey Bisson at May 08, 2012 07:39 PM

Grimmelmann, James

Google's Wardriving: A Retrospective

We now know much more about the Google Street View WiFi story, thanks to Google’s decision to release an unredacted version of the FCC report, to the New York Times’s identification of the Google employee involved as Marius Milner, and to further reporting from Ars Technica. The picture it paints is in some respects more flattering to Google, and in some respects worse.

Milner is the creator of NetStumbler, a tool for detecting and analyzing WiFi access points. It makes sense in hindsight that he ended up using his 20% time for the part of the Street View project that aimed to build a database of WiFi networks. And it turns out that he thought about the ethics and legality of recording payload data. He appears to have read some law-review scholarship on wardriving. He considered potential privacy issues, and concluded that the mobility of the Street View cars would minimize the risk of extensive data-gathering from any one user. Further, he emphasized that none of the data would be shared with Google users.

This is, I have to say, above the baseline of ethical cognition for programmers. Looking to legal scholarship at all is quite unusual. In fact, Milner’s thoughtfulness strikes me as roughly par for the course for front-line Google technologists. It’s a company that hires reasonably thoughtful people and encourages them to think about the implications of what they do for society, both good and bad.

But if Google is a company of smart, reflective, and well-intended individuals, collectively they make bad choices. Milner put his privacy concerns and the details of the WiFi payload recording in a design document. The document included a “to do”: “[D]iscuss privacy considerations with Product Counsel.” He talked to a member of the search quality team about the idea; he circulated the design document together with his code to Street View’s project leaders, who forwarded it to the entire Street View team. And he exchanged emails with other Street View programmers and managers that made clear Google was collecting payload data. But nothing happened. For fifteen months, Google Street View cars sucked up and recorded WiFi payload data.

As I said in an earlier post:

When it comes to privacy, this is a company out of control. Google’s management is literally not in control of the company.

Google’s Street View managers failed badly at their jobs. One of them “pre-approved” the design document before it was written, demonstrating complete failure to understand the purpose of managerial review. No one followed up to make sure the discussion with Product Counsel actually happened. Other engineers read the design document and Milner’s code, but either missed the fact that it was collecting payload data or didn’t realize that this could be a potential issue. Again, this is a failure of management: it’s an important part of their job to make programmers aware of the possible legal trouble zones in the areas they’re working on.

Milner has invoked the Fifth and isn’t talking to reporters. He made a mistake, but he’s not a legal expert and it’s a bit unfair to expect him to be. No, his managers let him—and the rest of us—down.

by James Grimmelmann (james@grimmelmann.net) at May 08, 2012 05:09 PM

Rochkind, Jonathan

Reddit’s actual? (or a variation?) story ranking algorithm explained (significant typos in previously published version (or not))

So it turns out there’s a significant typo, that keeps the algorithm from working right, in the several previously blogged descriptions of reddit’s story-ranking algorithm.


update 6:28PM ET 8 May: On reddit, someone with a flag suggesting they ought to know tells me I got it wrong, and that the original algorithm was correct and is used in production. All I can say is I can’t figure out how that could be: I could not get it to work in a non-reddit codebase, but I could get it to work with my amendment.

If I haven’t corrected a typo, then I guess I’ve derived my own variation which works a lot better for me (that is, works at all for me, but also seems to mimic reddit), in my own codebase. Good enough for me. If you are trying to reimplement this algorithm in a non-reddit codebase, I suspect you’ll find my investigation useful.

Now back to the blog post as originally written.


More oddly, this same significant typo is in the public version of reddit’s code released on github.

I’ve found myself finding joy in code-for-code’s-sake like I haven’t since past days of being an undergrad staying up all the night in the CS lab working on the most fun homework ever. And so I found myself staying up into the wee hours last night investigating reddit’s story ranking algorithm (the one used for stories/posts in the default ‘hot’ ranking,  that is time-of-posting sensitive. A different algorithm is used for comments).

The wrong algorithm

The (typo’d) algorithm is most nicely described by Amir Salihefendic. He even provides a python implementation. I figured I’d translate it to my preferred comfortable language ruby, and play around with it changing different parameters to get a feel for it, and get a feel of how it might be modified/tuned to behave somewhat differently if one wanted.

My assumption was that this algorithm outputs a number which can be stored in the database, and stories can be ordered purely by this number, to produce the on-page ranking.   This seems indeed to be true, although I was doubting it a bit in the middle. (Another nice thing about this particular algorithm, that everyone did catch on to even in the typo’d version, is that a ranking order calculation only needs to be done when a ‘vote’ happens, then it can be stored in the db unchanging forever (until another vote happens)).

So I translated Amir’s python to ruby and started playing, and the results made no sense.  They didn’t match how things work on reddit, and they didn’t result in any kind of useful ranking algorithm.

Users of reddit know that the story listings are mostly chronological. The vote count will change a story’s position somewhat, but not put it dozens of weeks or pages ahead of or behind its strictly chronological order.  But that’s what this algorithm did.  It also gave any story with a net-negative vote a negative score, which would put all the net-positive vote stories before any of the net-negative voted stories. Which is not how reddit works.

Looking at the math again now, I’m kind of embarrassed I didn’t immediately see the problem; it’s not complicated math. But I didn’t, it just made my head swim. I’ll give you the relevant line from Amir’s python version here. Maybe you can do better than I did, now that I’ve primed you:

return round(order + sign * seconds / 45000, 7)

Before I give you the answer, I’m going to tell you all the things I did first:

(Yep, the implementation in github public reddit has the typo, and is wrong!).

At this point, I gave up on understanding the reddit algorithm, I figured there was something I was missing (wrong, only thing I was missing was the typo). But, okay, I dove back into the math, trying to understand it and convert it to something that would work for me.

Take a moment to note lesson learned

Like many programmers, I rather like working from fixed assumptions and constraints, and building on top. This is kind of the nature of abstraction, don’t question the lower levels, take em as assumptions, don’t question em,  build upon them.

This is the second time recently that’s led me astray into butting my head against a dead end wall repeatedly, assuming the problem was in my own implementation or understanding, instead of in the framework code I was using, or the published algorithm or explanation I was working from.

Sometimes you’ve got to start questioning the validity of the algorithm you’re working from or the correctness of the library/framework code you’re using sooner rather than later, to save yourself some time. However, do it privately, if you start questioning your dependencies in public without evidence, everyone’s probably (rightly) going to tell you “occam’s razor, the bug is probably in your code, not the dependency.”

The Fix

WRONG:  return round(order + sign * seconds / 45000, 7)
RIGHT:  return round(  (order * sign) + (seconds / 45000),   7)

Is it obvious now that you see it, that the first one makes no sense, but the second one does?  Maybe if you see it in context, here’s my ruby implementation of the corrected algorithm. 
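To see the difference between the two lines concretely, here is a minimal Python sketch. Only the two `return` expressions come from the post; the helper names (`order_and_sign`, `hot_as_published`, `hot_amended`) and the sample inputs are made up for illustration, and `order` is assumed to be the base-10 log of the absolute vote difference, as the explanation further down describes. Keep in mind that reddit says the published form is what runs in production, so “amended” here means this post’s variation, not an official fix.

```python
from math import log10

def order_and_sign(ups, downs):
    """Log-scaled size of the vote difference, plus its sign (-1, 0 or 1)."""
    diff = ups - downs
    order = log10(max(abs(diff), 1))
    sign = (diff > 0) - (diff < 0)
    return order, sign

def hot_as_published(ups, downs, seconds):
    # Operator precedence makes this order + ((sign * seconds) / 45000):
    # the sign flips the *time* term, not the vote term.
    order, sign = order_and_sign(ups, downs)
    return round(order + sign * seconds / 45000, 7)

def hot_amended(ups, downs, seconds):
    # The sign is applied to the vote term; time always counts forward.
    order, sign = order_and_sign(ups, downs)
    return round((order * sign) + (seconds / 45000), 7)

if __name__ == "__main__":
    seconds = 9_000_000  # hypothetical posting time, in seconds since some epoch
    for ups, downs in [(11, 1), (1, 11), (5, 5)]:
        print(ups, downs,
              hot_as_published(ups, downs, seconds),
              hot_amended(ups, downs, seconds))
```

With these inputs the net-negative story gets a negative score from the published line, so it would sort below every net-positive story regardless of age, which is the behavior described above. In the amended form it lands just behind its chronological spot.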

I feel kind of stupid  for not noticing this right away; on the other hand, as far as I can find on google, nobody’s pointed out the typo bug before, and several have commented on the (wrong, typo buggy) algorithm.

The Explanation of the Algorithm

With the typo corrected, it’s much easier to explain the algorithm. The crucial line, from my ruby version, with variables named how I think is clearer:

return (displacement * sign.to_f) + ( epoch_seconds(date) / 45000 )

It plots each story on a fixed timeline by posting time, and then displaces a story on the timeline by its votes.  It uses only the vote difference between up and down for displacement; the total number of votes is irrelevant. First:

... + ( epoch_seconds(date) / 45000 )

This just plots each story on a fixed timeline, with distance between two stories always exactly proportional to difference between absolute posting time.

The `/45000` fixes the units of the timeline as “12.5 hour periods” (45000 seconds is 12.5 hours), rather than seconds. This reduces the order of magnitude of the units by 4.5ish, making them conveniently less likely to overflow wherever you’re keeping them. But more importantly, choosing the units matters for how much displacement the actual votes will cause, making sure they match appropriately. Then:

(displacement * sign.to_f) + ....

Here’s our displacement. `displacement` is based on the vote difference (up – down), but on a logarithmic scale.  The way the logarithmic scale is calculated, it loses the sign, so the sign just has to be added back in, so that net-down-votes will displace the story to be older on the timeline, and net-up-votes will displace the story to be newer on the timeline.

Why is a logarithmic scale used? Other explainers have said something like “to weight the first votes higher than the rest.”  While it might have this effect because of reddit voters’ behavior, this is a misleading explanation.  The algorithm pays no attention to which votes were made first, either in absolute chronological time or in sequence. It’s just vote-difference.  “10 up, 1 down” has exactly the same effect as “100 up, 91 down” or “1000 up, 991 down”.  And it doesn’t matter what order the ups and downs were placed.
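A quick check of that claim, as a sketch: `displacement` below is a hypothetical helper, assuming the base-10 log of the absolute vote difference with a floor of 1 (so a difference of 0 or 1 gives no displacement).

```python
from math import log10

def displacement(ups, downs):
    """Signed, log-scaled displacement in timeline units (assumed form)."""
    diff = ups - downs
    sign = (diff > 0) - (diff < 0)
    return sign * log10(max(abs(diff), 1))

# Same difference (+9), very different totals: identical displacement.
samples = [(10, 1), (100, 91), (1000, 991)]
values = [displacement(u, d) for u, d in samples]
print(values)
print(len(set(values)) == 1)
```

Because only `ups - downs` enters the calculation, neither the total number of votes nor the order they arrived in can change the result.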

The logarithmic scale is in fact used to prevent the displacement-from-voting from displacing the display order too much.  Reddit doesn’t want a very high or low voted story to be months ahead or behind, the reddit ‘hot’ order is mostly chronological, with just some displacement from votes.

I don’t do this kind of mathematical analysis much, and don’t know how to get, say, R, to make you a pretty plot (it ought to be an actual function plot, not a bar graph, for explanatory power). So I’ll just give you some samples of how much displacement a given vote-diff can get. Again, vote-diff is just ups minus downs; the total number of votes doesn’t matter. I’ve converted from the “12.5-hour units” the displacement is actually expressed in to more comprehensible ‘in hours’ units.

vote-diff   displacement in hours
0           0
1           0
2           3.8
3           6.0
4           7.5
5           8.7
6           9.7
7           10.6
8           11.3
9           11.9
10          12.5
100         25.0
1000        37.5
10000       50.0

As you see, even something with an absurdly high 10000 vote-diff only gets put 50 hours ahead of its usual place in the timeline. Likewise, if it had a -10000 vote-diff (10k more downvotes than upvotes), it would be only 50 hours behind its usual place in the timeline.
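Selected rows of the table can be recomputed with a few lines of Python. This is a sketch under the same assumption as the corrected formula: displacement is `log10` of the absolute vote difference (floored at 1), measured in units of 45000 seconds. `displacement_hours` and `UNIT_SECONDS` are names made up here.

```python
from math import log10

UNIT_SECONDS = 45000  # one timeline unit, taken from the ranking formula

def displacement_hours(vote_diff):
    """Hours a story is shifted on the timeline for a given ups-minus-downs."""
    units = log10(max(abs(vote_diff), 1))
    hours = units * UNIT_SECONDS / 3600
    return hours if vote_diff >= 0 else -hours

for diff in [0, 1, 5, 10, 100, 1000, 10000, -10000]:
    print(f"{diff:>7} {displacement_hours(diff):6.1f}")
```

Each factor of ten in the vote difference adds one more unit, that is 12.5 hours, which is why the displacement grows so slowly.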

Keeps votes from changing the position of a story too much, keeping it at the top forever, or moving it so many pages in that nobody ever sees it.  That’s what the log scale does.

And that scale pretty well matches what we reddit users actually observe on reddit, I went and checked it against some popular reddits; reddit only displays approximate posting time of a story as far as I can tell (“1 day ago” could mean 28 hours or 32 hours or whatever), so can’t check completely, but the actual ordering could be explained by the corrected algorithm.

Wrong in the public source on github?

update 6:55pm ET 8 May 2012.  reddit assures me that this code is what reddit runs live, and I have made some really stupid mistake. Fair enough. Struck out this section.

Unless I’m making some really stupid mistake, this typo-bug is present in the reddit source publicly shared on github as of time of this writing. [1], [2]

This means that there’s pretty much no way actual reddit.com is using the code they’ve posted publicly on github. At the very least, they’ve fixed this bug in the implementation they’re actually running.

It probably means nobody else is using the reddit github source either, cause it wouldn’t work right with ‘hot’ ranking. (Or someone else is using it, and fixed the bug in their source but didn’t send it back).

How did this bug end up in the publicly shared reddit source? Not fixed yet? I’m kind of curious, and curious as to what relationship this publicly shared source has to what reddit actually runs.

Considering tweaking the algorithm

Now that we understand the basic “timeline + displacement” algorithm, we can consider tweaks/modifications/tuning of the algorithm to behave differently in different environments, which curiosity was my original motivation for looking into this in the first place.

You might want vote displacement to have more of an effect, or have the effect trail off faster or slower. You’d still want to use a log-scale (or a mathematical function with similar properties) to keep very high vote-diffs from displacing a story too much; you still want a trail-off effect.

You could change the log from base-10 to base-something-else to affect the velocity of the trail-off effect.  You could also introduce a factor into the operand of the log, taking `log(factor * vote-diff)` instead of just `log(vote-diff)`.   You potentially could change the units from 12.5-hour units to something else (the 45000 number), but that could get confusing quick, and you might need to add another factor on the left-hand side of the sum to compensate. So actually, instead, you can just add a factor on the left-hand side: `factor * log(votediff)` instead of just `log(votediff)`.
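Those knobs can be sketched as parameters. Nothing here is reddit’s code: `tuned_displacement`, `base`, `inner` and `outer` are hypothetical names for, respectively, the log base, the factor inside the log, and the factor on the left-hand side of the sum.

```python
from math import log

def tuned_displacement(vote_diff, base=10.0, inner=1.0, outer=1.0):
    """outer * log_base(inner * |vote_diff|), signed, in timeline units."""
    magnitude = max(inner * abs(vote_diff), 1.0)  # floor keeps the log >= 0
    sign = (vote_diff > 0) - (vote_diff < 0)
    return sign * outer * log(magnitude, base)

for label, kwargs in [("default", {}),
                      ("base 2", {"base": 2.0}),
                      ("inner x10", {"inner": 10.0}),
                      ("outer x2", {"outer": 2.0})]:
    row = [round(tuned_displacement(d, **kwargs), 2) for d in (1, 10, 100, 1000)]
    print(f"{label:<10} {row}")
```

One thing the sketch makes visible: changing the base is the same as multiplying by a constant outer factor, since log_b(x) = log10(x) / log10(b). The inner factor, once `inner * |diff|` is above 1, just adds a constant offset of `outer * log_base(inner)`.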

I’m not enough of a math guy to predict exactly what all those things would do, I’d want to actually plot the function in R (or something else) and see what it looks like when you change the various factors, and I don’t know enough R (or anything else) to do it. I think plotting vote-diff vs hours-of-displacement is the right thing to plot though to give you the right feedback.

You could also try to introduce something to the equation to take account of total number of votes, so “10 up, 1 down” and “100 up, 91 down” don’t have exactly the same effect. You’d want to base this somehow on the Wilson score confidence interval used by reddit for default comment ranking, which is the right way to take account of total number of votes, but it’s not immediately clear to me where or how you’d introduce that into the equation (did I mention I’m not a math guy?).  That would make it a bit harder to see what it does by plotting it, since it’ll be a multi-variate function now, doh.
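For reference, the lower bound of the Wilson score interval can be computed as below. This is the textbook formula, not a transcription of reddit’s source; the function name and the choice of `z = 1.96` (roughly 95% confidence) are assumptions, and how to fold the result into the ‘hot’ ranking is left open, as in the text.

```python
from math import sqrt

def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval for the upvote fraction."""
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

# Same vote difference, different totals: the bound tells them apart.
for ups, downs in [(10, 1), (100, 91), (1000, 991)]:
    print(ups, downs, round(wilson_lower_bound(ups, downs), 3))
```

Unlike the vote difference, this value depends on both the ratio and the total, so the three samples no longer tie.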

And you might not want to trust that the algorithm found in reddit’s public github source for Wilson score confidence interval is actually bug free. Last year someone said they found a bug in at least one published implementation; I think I saw someone say it had been fixed on reddit.com, but I don’t know if it’s been fixed in the github public source.

You might also want to make upvotes worth more or less than downvotes, instead of equivalent. Not quite sure how you’d do that. You could make net-negative votes worth more or less than the same absolute value net-positive, just by using a factor in `factor * log(diff)` that depends on diff being positive or negative.
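The last idea, a factor that depends on whether the difference is positive or negative, might look like this. It is a sketch only; `asymmetric_displacement`, `up_factor` and `down_factor` are invented names, and the base-10 log with a floor of 1 is assumed as before.

```python
from math import log10

def asymmetric_displacement(vote_diff, up_factor=1.0, down_factor=1.0):
    """factor * log10(|diff|), where the factor depends on the sign of diff."""
    magnitude = log10(max(abs(vote_diff), 1))
    if vote_diff > 0:
        return up_factor * magnitude
    if vote_diff < 0:
        return -down_factor * magnitude
    return 0.0

# With down_factor=2, a net -100 pushes a story back twice as far
# as a net +100 pulls it forward.
print(asymmetric_displacement(100, down_factor=2.0))
print(asymmetric_displacement(-100, down_factor=2.0))
```

This weights net-negative against net-positive totals; weighting individual upvotes against individual downvotes would need the separate up and down counts rather than just their difference.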


Filed under: General

by jrochkind at May 08, 2012 04:18 PM