planet code4lib

Planet Code4Lib - http://planet.code4lib.org
Updated: 20 weeks 5 days ago

Morgan, Eric Lease: Three RDF data models for archival collections

Sun, 2014-03-30 18:49

Listed and illustrated here are three examples of RDF data models for archival collections. It is interesting to literally see the complexity or thoroughness of each model, depending on your perspective.


This one was designed by Aaron Rubinstein. I don’t know whether or not it was ever put into practice.


This is the model used in Project LOACH by the Archives Hub.


This final model — OAD — is being implemented in a project called ReLoad.

There are other ontologies of interest to cultural heritage institutions, but these three seem to be the most apropos to archivists.

This work is a part of a yet-to-be published book called the LiAM Guidebook, a text intended for archivists and computer technologists interested in the application of linked data to archival description.

Rochkind, Jonathan: Academic freedom in Israel and Palestine

Sun, 2014-03-30 16:03

While I mostly try to keep this blog focused on professional concerns, I do think academic freedom is a professional concern for librarians, and I’m going to again use this platform to write about an issue of concern to me.

On December 17th, 2013, the American Studies Association membership endorsed a Resolution on Boycott of Israeli Academic Institutions. This resolution endorses and joins in a campaign organized by Palestinian civil society organizations for boycott of Israel for human rights violations against Palestinians — and specifically, for an academic boycott called for by Palestinian academics.

In late December and early January, very many American university presidents released letters opposing and criticizing the ASA boycott resolution, usually on the grounds that the ASA action threatened the academic freedom of Israeli academics.

Here at Johns Hopkins, the President and Provost issued such a letter on December 23rd. I am quite curious about what organizing took place that resulted in letters from so many university presidents within a few weeks. Beyond letters of disapproval from presidents, there has also been organizing to prevent scholars, departments, and institutions from affiliating with the ASA or to retaliate against scholars who do so (such efforts are, ironically, quite a threat to academic freedom themselves).

The ASA resolution (and the Palestinian academic boycott campaign in general) does not call for prohibition of cooperation with Israeli academics, but only against formal collaborations with Israeli academic institutions — and, in the case of the ASA, only formal partnerships by the ASA itself; it is not trying to require any particular actions by members as a condition of membership in the ASA.  You can read more about the parameters of the ASA resolution, and the motivation that led to it, in the ASA’s FAQ on the subject, a concise and well-written document I definitely recommend reading.

So I don’t actually think the ASA resolution will have significant effect on academic freedom for scholars at Israeli institutions.  It’s mostly a symbolic action, although the fierce organizing against it shows how threatening the symbolic action is to the Israeli government and those who would like to protect it from criticism.

But, okay, especially if academic boycott of Israel continues to gain strength, then some academics at Israeli institutions will, at the very least, be inconvenienced in their academic affairs.  I can understand why some people find academic boycott an inappropriate tactic — even though I disagree with them.

But here’s the thing. The academic freedom of Palestinian scholars and students has been regularly, persistently, and severely infringed for quite some time.  In fact, acting in solidarity with Palestinian colleagues facing restrictions on freedom of movement and expression and inquiry was the motivation of the ASA’s resolution in the first place, as they write in their FAQ and the language of the resolution itself.

You can read more about restrictions in Palestinian academic freedom, and the complicity of Israeli academic institutions in these restrictions, in a report from Palestinian civil society here; or this campaign web page from Birzeit University and other Palestinian universities;  this report from the Israeli Alternative Information Center;  or in this 2006 essay by Judith Butler; or this 2011 essay by Riham Barghouti, one of the founding members of the Palestinian Campaign for the Academic and Cultural Boycott of Israel.

What are we to make of the fact that so many university presidents spoke up in alarm at an early sign of what they saw as possible impingements on the academic freedom of scholars at Israeli institutions, but none have spoken up to defend significantly beleaguered Palestinian academic freedom?

Here at Hopkins, Students for Justice in Palestine believes that we do all have a responsibility to speak up in solidarity with our Palestinian colleagues, students and scholars, whose freedoms of inquiry and expression are severely curtailed; and that administrators’ silence on the issue does not in fact represent our community.  Hopkins SJP thinks the community should speak out in concern and support for Palestinian academic freedom, and they’ve written a letter Hopkins affiliates can sign on to.

I’ve signed the letter. I’d urge any readers who are also affiliated with Hopkins to read it, and consider signing it as well. Here it is.


Filed under: General

Miedema, John: Good-bye database design and unique identifiers. Strong NLP and the singularity of Watson.

Sat, 2014-03-29 18:17

Every so often the game changes. Newton thought time was a constant. Einstein showed that time slows down for travelers approaching light speed. A change of singular proportion is happening in computing today because of the challenges of big data and the rise of Strong Natural Language Processing technologies.

Step back to the world of small data and database design 101. An entity such as a Customer is defined by a list of attributes: Name, Address, Phone Number, and so on. These attributes are structured as fields in a Customer table. Each Customer record is assigned a unique identifier (UID). The Customer ID is a primary key, allowing database designers to create relationships between Customers and other entities. Each record in a Products table has a Product ID, and each record in an Invoices table has an Invoice ID. The Invoices table will have extra foreign key columns for Customer ID and Product ID so that queries can efficiently pull out a purchase history for a customer.
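
As a concrete (if minimal) sketch of such a schema, the Java/JDBC fragment below creates the three tables and joins them to pull a purchase history. The table and column names, and the in-memory H2 database, are illustrative assumptions, not a reference to any particular system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative in-memory H2 database; any JDBC-compliant database would do.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:shop");
             Statement stmt = conn.createStatement()) {

            // Each Customer record gets a unique identifier (UID) used as the primary key.
            stmt.execute("CREATE TABLE Customers (CustomerID INT PRIMARY KEY, " +
                    "Name VARCHAR(100), Address VARCHAR(200), Phone VARCHAR(20))");
            stmt.execute("CREATE TABLE Products (ProductID INT PRIMARY KEY, Title VARCHAR(100))");

            // Invoices carry foreign keys pointing back to Customers and Products.
            stmt.execute("CREATE TABLE Invoices (InvoiceID INT PRIMARY KEY, " +
                    "CustomerID INT REFERENCES Customers(CustomerID), " +
                    "ProductID INT REFERENCES Products(ProductID))");

            // A purchase history is an efficient join over those keys.
            ResultSet rs = stmt.executeQuery(
                    "SELECT c.Name, p.Title FROM Invoices i " +
                    "JOIN Customers c ON i.CustomerID = c.CustomerID " +
                    "JOIN Products p ON i.ProductID = p.ProductID");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " bought " + rs.getString(2));
            }
        }
    }
}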

Traditional database design is effective, as long as you don’t want to integrate databases across organizations. In the Customers database, John Smith has a specific Customer ID. Good enterprise design will share that information across databases, but in another enterprise John Smith has a completely different UID. Assuming permissions to share data, the only way to line up these two UIDs is to compare record fields: Name, Address, Phone Number, again. Hopefully the information has not changed and there are not too many typos.

The largest volume of data being generated today is unstructured: documents, emails, blog posts, tweets, and so forth. Most of this data is accessible on the open web. This is the world of ‘big data.’ Search technologies help. A Google search yields possible results ranked by its sophisticated ranking algorithm. A likely match is often found in the first page of results. We accept this as a good thing. Wouldn’t it be nice if the old I’m Feeling Lucky button could correctly answer a question in one try?

Natural Language Processing (NLP) is a big data technology. Beyond keyword matching, NLP parses words and sentences for meaning. Standard patterns identify people, companies, and locations. Custom domain models are built to identify business concepts and detect relationships. NLP transforms unstructured content into structured data. One begins to wonder if NLP can replace good-old database design and its unique identifiers. Not quite. There is still an unsatisfactory margin of error. John Smith might get identified correctly as a Customer who purchased Twinkies, but NLP might struggle in other cases with variants in proper names and products. Error rates get compounded. If person identification is 95% accurate, and product identification 90%, the overall confidence is only about 86%. NLP still depends on human search and analytics for uniquely resolving an answer.
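
A back-of-the-envelope illustration of that compounding, using the accuracy figures above and assuming the two extraction steps are independent:

public class CompoundedConfidence {
    public static void main(String[] args) {
        double personAccuracy = 0.95;  // person names resolved correctly 95% of the time
        double productAccuracy = 0.90; // product names resolved correctly 90% of the time

        // If the two steps are independent, the chance that both are right
        // is the product of the two rates.
        double overall = personAccuracy * productAccuracy;
        System.out.printf("Overall confidence: %.1f%%%n", overall * 100); // prints 85.5%
    }
}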

Stronger NLP technologies are emerging. In 2011, IBM challenged the world’s two best Jeopardy players to compete with its Watson supercomputer. The game requires nuanced knowledge of human language and culture. Watson used NLP on a database of 200 million pages of structured and unstructured content, including Wikipedia. Thing is, the game only permits one answer to a question, not pages of possible answers. Certainly, Watson would come up with a list of likely answers, ranked by probability, but it could only submit its single best answer. To win, Watson had to answer correctly more often than its skilled human competitors throughout the game. Watson won. Strong NLP is the ability to process big data to produce one correct answer.1 Strong NLP is a singularity in computing history. We can say good-bye to traditional database design and its unique identifiers. I can imagine much bigger changes.

  1. Of course, it would have the ability to answer successive questions correctly. Also, if there are two equally correct answers, both answers would be given. This would not work on Jeopardy but it would be necessary in real life.

Hess, M Ryan: mryanhess

Fri, 2014-03-28 23:40

At the recent SXSW conference, Edward Snowden supplied people with tips to complicate the lives of, if not totally block, those who stick their noses in your online business.

Not to be confused with trying to ruin the chances of the NSA averting a nuclear strike by terrorists on my own country, I do feel there are some well-reasoned limits to what the US government should be doing, especially when it comes to figuring out ways to undermine secure Internet protocols. After all, when, as purported by Snowden, the NSA begins devising backdoor hacks into our web browsers, you can be certain that this only makes it easier for other (perhaps dangerous) individuals to do the same.

In other words, in the name of the War on Terror, the NSA might actually be planting the seeds for the death of the Internet…or at least a 9/11 style assault on the world’s computer infrastructure. Students of the origins of Bin Laden and his connections with the US War on Communism might be right to feel a little déjà vu.

A related threat, of course, is that criminals might stand on the shoulders of the NSA’s good work and do some very bad work against you and your bank account and your identity.

Anyway, that’s my soapbox speech on this.

But back to my recent spate of blog posts on privacy and how to cover your virtual butts. Snowden did hand out a few treats for the kids at SXSW: two browser plugins that he regards as good ways to enhance your privacy against NSA or NSA-inspired hackers.

The first is Ghostery, which allows you to view what web services are collecting data on you when you visit a given web page. It goes further by letting you (Ad Block style) block, pause or allow such collection.

I’ve been using it for a few days and have found it fascinating just how many scripts are gathering info on me when I land on a given page. Right now, I have everything turned off, so that should take care of that.

I did experience one problem watching an embedded video on a website. In these cases, you can pause all of Ghostery or try to figure out which one of the dozen or so scripts it’s blocking is the required one for the video and then decide if it’s worth it.

The other plugin is called NoScript, which simply shuts down all scripts, including JavaScript, Flash, etc. I haven’t tried this out, but I’m expecting it to be something I will only use sparingly given the amount of jQuery and other useful bits embedded in many web interfaces.

 


del.icio.us: www.youtube.com

Fri, 2014-03-28 09:09

Rosenthal, David: PREMIS & LOCKSS

Fri, 2014-03-28 09:00
We were asked if the CLOCKSS Archive uses PREMIS metadata. The answer is no, and a detailed explanation is below the fold.

The CLOCKSS archive is implemented using LOCKSS technology. LOCKSS systems do not use PREMIS. As with OAIS, there are significant conceptual mismatches between the PREMIS model (which is based on OAIS) and the reality of the content LOCKSS typically preserves. For example, the concept of "digital object" is hard to apply to preserving an artifact such as an e-journal that continually publishes new, compound objects with only a loose semantic structure. The view that e-journals consist of volumes that consist of issues that consist of articles only loosely corresponds to the real world.

As regards format metadata such as is generated by JHOVE, we are skeptical of its utility in the LOCKSS system because it is expensive to generate, unreliable, and of marginal relevance to content which is unlikely to suffer format obsolescence in the foreseeable future, and if it does may well be rendered via emulation rather than format migration.

Nevertheless, we integrated FITS into one version of the LOCKSS  daemon and used it to generate format metadata for the content in the CLOCKSS Archive. We do not use this version of the daemon in production CLOCKSS boxes:
  • FITS is several times bigger than the production LOCKSS daemon.
  • We do not have the resources to audit it for potential risks to the preserved content that it might pose.
  • The computational and I/O resources it consumes are significant.
  • Even if the metadata FITS generates were reliable, it would not be of operational significance in the CLOCKSS environment. 
As regards bibliographic metadata, to be affordable at the scale at which they operate, LOCKSS networks generally depend heavily on extracting metadata automatically, in whatever form it can be found in the content, and performing the minimal processing needed to support the needs of users, primarily for DOI and OpenURL resolution. Human intervention can be considered only at a very coarse level, a journal volume or above.

The CLOCKSS Archive uses bibliographic metadata for four purposes:
  • For billing, the number of articles received from each publisher must be counted. The article-level metadata needed is only the existence of an article.
  • For Keepers and KBART reports. These need volume-level metadata.
  • To locate content that is the subject of a board-approved trigger event in order to extract a copy from the archive. This typically needs volume-level metadata.
  • Once content has been triggered, to update DOI and OpenURL resolvers. This needs detailed article-level metadata.
Although we extract detailed article-level metadata, note that for the vast majority of the archive's content there is no operational need for it. It is needed only for the tiny fraction that is triggered, and only after the trigger event.

CLOCKSS is a dark archive. Until it is triggered, there are no readers to access the content, so there are no readers demanding the kinds of access that PREMIS bibliographic metadata would support. If the CLOCKSS board were to decide that PREMIS metadata support was important enough to justify the rather significant development costs that would be involved, it would be possible to implement it, because LOCKSS supports semantic units similar to those that PREMIS describes. Although they are internally factored differently from the PREMIS data model, it would be possible to externalize a data dictionary for our content or to respond to a query in terms of the data model it describes.

Information is stored in several places within the CLOCKSS network, including the LOCKSS repository (storage-level metadata), the title database (preservation-unit level metadata), and the metadata database (bibliographic-level metadata). This information is tied together internally using a preservation-unit level "archival unit" identifier (AUID). Traversing these databases would enable us to generate a PREMIS-compliant data dictionary. A query for any PREMIS-defined entity could be answered by mapping it to a range of AUIDs, and from there to information stored in these databases.

For example, a PREMIS Intellectual Entity (e.g. a journal article) is represented in the metadata database. It can be located using an Intellectual Entity key such as its DOI or an ISSN and other bibliographic information. Using the AUID associated with that article allows us to retrieve its preservation-level metadata, such as the Archival Unit parameters and attributes that specify its Agents and Rights. It also enables us to retrieve the Object Entities and their physical characteristics, and the Events related to provenance, from the associated storage-level metadata in the repository.
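
A purely hypothetical sketch of that traversal is below. None of these class or method names come from the actual LOCKSS codebase; they simply stand in for the three lookups just described (bibliographic, preservation-unit, and storage-level).

// Hypothetical interfaces standing in for the three CLOCKSS stores described above.
interface MetadataDatabase {
    String findAuidByDoi(String doi);                       // bibliographic-level metadata
}
interface TitleDatabase {
    java.util.Map<String, String> auParams(String auid);    // preservation-unit level metadata
}
interface LockssRepository {
    java.util.List<String> provenanceEvents(String auid);   // storage-level metadata
}

public class PremisLookupSketch {
    static void describeArticle(String doi, MetadataDatabase md,
                                TitleDatabase td, LockssRepository repo) {
        // 1. Locate the Intellectual Entity (the article) and its AUID.
        String auid = md.findAuidByDoi(doi);

        // 2. Retrieve preservation-level metadata (agents, rights, AU parameters).
        System.out.println("AU parameters: " + td.auParams(auid));

        // 3. Retrieve storage-level metadata (objects, provenance events).
        System.out.println("Events: " + repo.provenanceEvents(auid));
    }
}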

But note that this isn't an operation that would ever be performed in the CLOCKSS archive. It is a dark archive; no access to preserved content is permitted unless and until it is triggered. Triggering happens externally at a journal (or conceptually at a volume level) and internally at an AUID level, not at an article level. It is a one-time process initiated by the board that hands content off under CC license to multiple re-publishing sites, from where readers can access it in the same way that they accessed it from the original publisher, via the Web. The CLOCKSS archive has no role in these reader accesses.

Philip Gust of the LOCKSS team provided some of the content above.

Rochkind, Jonathan: “users hate change”

Fri, 2014-03-28 03:22

reddit comment with no particularly significant context:

Would be really interesting to put a number on “users hate change”.

Based on my own experience at a company where we actually researched this stuff, the number I would forward is 30%. Given an existing user base, on average 30% will hate any given change to their user experience, independent of whether that experience is actually worse or better.

“Some random person on reddit” isn’t scientific evidence or anything, but it definitely seems pretty plausible to me that some very significant portion of any user base will generally dislike any change at all — I think I’ve been one of those users for software I don’t develop, I’m thinking of recent changes to Google Maps, many changes to Facebook, etc.

I’m not quite sure what to do with that though, or how it should guide us.  Because, if our users really do want stability over (in the best of cases) improvement, we should give it to them, right? But if it’s say 1/3rd of our users who want this, and not necessarily the other 2/3rds, what should that mean?  And might we hear more from that 1/3rd than the other 2/3rds and over-estimate them yet further?

But, still, say, 1/3rd, that’s a lot. What’s the right balance between stability and improvement? Does it depend on the nature of the improvement, or how badly some other portion of your userbase are desiring change or improvement?

Or, perhaps, work on grouping changes into more occasional releases instead of constant releases, to at least minimize the occurrences of disruption?  How do you square that with software improvement through iteration, so you can see how one change worked before making another?

Eventually users will get used to change, or even love the change and realize it helped them succeed at whatever they do with the software (and then the change-resistant won’t want the new normal changed either!) — does it matter how long this period of adjustment is? Might it be drastically different for different user bases or contexts?

Does it matter how much turnover you should expect or get in your user base?  If you’re selling software, you probably want to keep all the users you’ve got and keep getting more, but the faster you’re growing, the quicker the old users (the only ones to whom a change is actually a change) get diluted by newcomers.   If you’re developing software for an ‘enterprise’ (such as most kinds of libraries), then the turnover of your userbase is a factor of the organization not of your market or marketing.  Either way, if you have less turnover, does that mean you can even less afford to irritate the change-resistant portion of the userbase, or is it irrelevant?

In commercial software development, the answer (for better or worse) is often “whatever choice makes us more money”, and the software development industry has increasingly sophisticated tools for measuring the effect of proposed changes on revenue. If the main goal(s) of your software development effort is something other than revenue, then perhaps it’s important to be clear about exactly what those goals are,  to have any hope of answering these questions.


Filed under: General

Morgan, Eric Lease: LiAM Guidebook – a new draft

Fri, 2014-03-28 02:44

I have made available a new draft of the LiAM Guidebook. Many of the lists of things (tools, projects, vocabulary terms, Semantic browsers, etc.) are complete. Once the lists are done I will move back to the narratives. Thanks go to various people I’ve interviewed lately (Gregory Colati, Karen Gracy, Susan Pyzynski, Aaron Rubinstein, Ed Summers, Diane Hillman, Anne Sauer, and Eliot Wilczek) because without them I would not have been able to get this far nor see a path forward.

Morgan, Eric Lease: Linked data projects of interest to archivists (and other cultural heritage personnel)

Fri, 2014-03-28 02:22

It is not really possible to list every linked data project; listed below are only those that will be presently useful to the archivist and computer technologist working in cultural heritage institutions, and even then the list is not complete. These are websites of interest today. This list is a part of the yet-to-be published LiAM Guidebook.

Introductions

The following introductions are akin to directories or initial guides filled with pointers to information about RDF especially meaningful to archivists (and other cultural heritage workers).

  • Datahub (http://datahub.io/) – This is a directory of data sets. It includes descriptions of hundreds of data collections. Some of them are linked data sets. Some of them are not.
  • LODLAM (http://lodlam.net/) – LODLAM is an acronym for Linked Open Data in Libraries Archives and Museums. LODLAM.net is a community, both virtual and real, of linked data aficionados in cultural heritage institutions. It, like OpenGLAM, is a good place to discuss linked data in general.
  • OpenGLAM (http://openglam.org) – GLAM is an acronym for Galleries, Libraries, Archives, and Museums. OpenGLAM is a community fostered by the Open Knowledge Foundation and a place to discuss linked data that is “free”. It, like LODLAM, is a good place to discuss linked data in general.
  • semanticweb.org (http://semanticweb.org) – semanticweb.org is a portal for publishing information on research and development related to the topics Semantic Web and Wikis. Includes data.semanticweb.org and data.semanticweb.org/snorql.
Data sets and projects

The data sets and projects range from simple RDF dumps to full-blown discovery systems. In between are some simple browsable lists and raw SPARQL endpoints.

  • 20th Century Press Archives (http://zbw.eu/beta/p20) – This is an archive of digitized newspaper articles which is made accessible not only as HTML but also in a number of other metadata formats such as RDFa, METS/MODS, and OAI-ORE. It is a good example of how metadata publishing can be mixed and matched in a single publishing system.
  • AGRIS (http://agris.fao.org/openagris/) – Here you will find a very large collection of bibliographic information from the field of agriculture. It is accessible via quite a number of methods including linked data.
  • D2R Server for the CIA Factbook (http://wifo5-03.informatik.uni-mannheim.de/factbook/) – The content of the World Fact Book distributed as linked data.
  • D2R Server for the Gutenberg Project (http://wifo5-03.informatik.uni-mannheim.de/gutendata/) – This is a data set of Project Gutenberg content — a list of digitized public domain works, mostly books.
  • Dbpedia (http://dbpedia.org/About) – In the simplest terms, this is the content of Wikipedia made accessible as RDF.
  • Getty Vocabularies (http://vocab.getty.edu) – A set of data sets used to “categorize, describe, and index cultural heritage objects and information”.
  • Library of Congress Linked Data Service (http://id.loc.gov/) – A set of data sets used for bibliographic classification: subjects, names, genres, formats, etc.
  • LIBRIS (http://libris.kb.se) – This is the joint catalog of the Swedish academic and research libraries. Search results are presented in HTML, but the URLs pointing to individual items are really actionable URIs resolvable via content negotiation, thus supporting distribution of bibliographic information as RDF; a small content negotiation sketch follows this list. This initiative is very similar to OpenCat.
  • Linked Archives Hub Test Dataset (http://data.archiveshub.ac.uk) – This data set is RDF generated from a selection of archival finding aids harvested by the Archives Hub in the United Kingdom.
  • Linked Movie Data Base (http://linkedmdb.org/) – A data set of movie information.
  • Linked Open Data at Europeana (http://pro.europeana.eu/datasets) – A growing set of RDF generated from the descriptions of content in Europeana.
  • Linked Open Vocabularies (http://lov.okfn.org/dataset/lov/) – A linked data set of linked data sets.
  • Linking Lives (http://archiveshub.ac.uk/linkinglives/) – While this project has had no working interface, it is a good read on the challenges of presenting linked data to people (as opposed to computers). Its blog site enumerates and discusses issues from provenance to unique identifiers, from data clean up to interface design.
  • LOCAH Project (http://archiveshub.ac.uk/locah/) – This is/was a joint project between Mimas and UKOLN to make Archives Hub data available as structured Linked Data. (All three organizations are located in the United Kingdom.) EAD files were aggregated. Using XSLT, they were transformed into RDF/XML, and the RDF/XML was saved in a triple store. The triple store was then dumped as a file as well as made searchable via a SPARQL endpoint.
  • New York Times (http://data.nytimes.com/) – A list of New York Times subject headings.
  • OCLC Data Sets & Services (http://www.oclc.org/data/) – Here you will find a number of freely available bibliographic data sets and services. Some are available as RDF and linked data. Others are Web services.
  • OpenCat (http://demo.cubicweb.org/opencatfresnes/) – This is a library catalog combining the authority data (available as RDF) provided by the National Library of France with works of a second library (Fresnes Public Library). Item level search results have URIs whose RDF is available via content negotiation. This project is similar to LIBRIS.
  • PELAGIOS (http://pelagios-project.blogspot.com/p/about-pelagios.html) – A data set of ancient places.
  • ReLoad (http://labs.regesta.com/progettoReload/en) – This is a collaboration between the Central State Archive of Italy, the Cultural Heritage Institute of Emilia Romagna Region, and Regesta.exe. It is the aggregation of EAD files from a number of archives which have been transformed into RDF and made available as linked data. Its purpose and intent are very similar to the purpose and intent of the combined LOCAH Project and Linking Lives.
  • VIAF (http://viaf.org/) – This data set functions as a name authority file.
  • World Bank Linked Data (http://worldbank.270a.info/.html) – A data set of World Bank indicators, climate change information, finances, etc.

Morgan, Eric Lease: RDF tools for the archivist

Fri, 2014-03-28 02:11

This posting lists various tools for archivists and computer technologists wanting to participate in various aspects of linked data. Here you will find pointers to creating, editing, storing, publishing, and searching linked data. It is a part of the yet-to-be published LiAM Guidebook.

Directories

The sites listed in this section enumerate linked data and RDF tools. They are jumping off places to other sites:

RDF converters, validators, etc.

Use these tools to create RDF:

  • ead2rdf (http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl) – This is the XSLT stylesheet previously used by the Archives Hub in their LOCAH Linked Archives Hub project. It transforms EAD files into RDF/XML; a small sketch of applying such a stylesheet follows this list. A slightly modified version of this stylesheet was used to create the LiAM “sandbox”.
  • Protégé (http://protege.stanford.edu) – Install this well-respected tool locally or use it as a hosted Web application to create OWL ontologies.
  • RDF2RDF (http://www.l3s.de/~minack/rdf2rdf/) – A handy Java jar file enabling you to convert various versions of serialized RDF into other versions of serialized RDF.
  • Vapour, a Linked Data Validator (http://validator.linkeddata.org/vapour) – Much like the W3C validator, this online tool will validate the RDF at the other end of a URI. Unlike the W3C validator, it echoes back and forth the results of the content negotiation process.
  • W3C RDF Validation Service (http://www.w3.org/RDF/Validator/) – Enter a URI or paste an RDF/XML document into the text field, and a triple representation of the corresponding data model as well as an optional graphical visualization of the data model will be displayed.
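
As a small illustration of how a stylesheet such as ead2rdf might be applied, the sketch below runs an EAD finding aid through an XSLT transformation using the standard Java API. The file names are placeholders, and the stylesheet may require an XSLT 2.0 processor such as Saxon on the classpath rather than the JDK default.

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Ead2RdfSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder file names: an EAD finding aid in, RDF/XML out.
        File stylesheet = new File("ead2rdf.xsl");
        File eadFile = new File("findingaid.xml");
        File rdfFile = new File("findingaid.rdf");

        // Standard JAXP transformation driven by the stylesheet.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));
        transformer.transform(new StreamSource(eadFile), new StreamResult(rdfFile));

        System.out.println("Wrote " + rdfFile.getName());
    }
}
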
Linked data frameworks and publishing systems

Once RDF is created, use these systems to publish it as linked data:

  • 4store (http://4store.org/) – A linked data publishing framework for managing triple stores, querying them locally, querying them via SPARQL, dumping their contents to files, as well as providing support via a number of scripting languages (PHP, Ruby, Python, Java, etc.).
  • Apache Jena (http://jena.apache.org/) – This is a set of tools for creating, maintaining, and publishing linked data, complete with a SPARQL engine, a flexible triple store application, and an inference engine; a small query sketch follows this list.
  • D2RQ (http://d2rq.org/) – Use this application to provide a linked data front-end to any (well-designed) relational database. It supports SPARQL, content negotiation, and RDF dumps for direct HTTP access or uploading into a triple store.
  • oai2lod (https://github.com/behas/oai2lod) – This is a particular implementation of D2RQ Server. More specifically, this tool is an intermediary between OAI-PMH data providers and a linked data publishing system. Configure oai2lod to point to your OAI-PMH server and it will publish the server’s metadata as linked data.
  • OpenLink Virtuoso Open-Source Edition (https://github.com/openlink/virtuoso-opensource/) – An open source version of OpenLink Virtuoso. Feature-rich and well-documented.
  • OpenLink Virtuoso Universal Server (http://virtuoso.openlinksw.com) – This is a commercial version of OpenLink Virtuoso Open-Source Edition. It seems to be a platform for modeling and accessing data in a wide variety of forms: relational databases, RDF triples stores, etc. Again, feature-rich and well-documented.
  • openRDF (http://www.openrdf.org/) – This is a Java-based framework for implementing linked data publishing including the establishment of a triple store and a SPARQL endpoint.
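
As a small illustration of how such a framework is used from code, the sketch below sends a SPARQL query to an endpoint with Apache Jena. It assumes a recent Jena release (3.x or later) on the classpath, and the endpoint URL is a placeholder.

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class SparqlSketch {
    public static void main(String[] args) {
        // Placeholder endpoint; substitute any public SPARQL endpoint.
        String endpoint = "http://example.org/sparql";

        // List a handful of triples just to confirm the endpoint answers.
        Query query = QueryFactory.create(
                "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");

        try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("s") + " " + row.get("p") + " " + row.get("o"));
            }
        }
    }
}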

LITA: LITA Bylaws Review Underway

Fri, 2014-03-28 01:44

Based on conversations at Board meetings, as well as an attempt to fix a number of issues that have arisen over the last 2-3 years (specifically issues around officers and timing of elections), the Bylaws Committee has begun work on analyzing the LITA Bylaws.

For those of you that are new to LITA, or just haven’t been enthralled by parliamentary process like some of us, the Bylaws are the rules by which the Division operates. The LITA Manual lays out responsibilities and operational issues, but the Bylaws are the rules by which the organization operates. Want to know how to start an IG? That’s in the bylaws.

After discussions within the Bylaws Committee, and examining how such a review of those specific sections relating to elections and officers would need to occur, our conclusion was that a large part of the current issues have been caused by just this sort of partial-rewriting over the years. Because the Bylaws are an interconnected document, we felt like the best way to tackle solutions to the issues presented would be with a comprehensive review, starting at the sections that are most needed, but then following the implications throughout the Bylaws in order to ensure that we cover all possible areas of disagreement.

As a result, we’ve begun this process. Our goals are twofold: to specifically close the holes that we have uncovered over the last few years, but also to compare/contrast our bylaws with those of ALA proper and harmonize them when that makes sense to do so.

Our timeline is to try to review 2 sections of the Bylaws per month, and then review and discuss at monthly meetings to ensure that we all understand what’s been done and agree that the changes are appropriate. We are doing this in a public google doc:

http://bit.ly/lita_bylaws_review

We have just met and discussed our first round of comments and suggested changes…we wanted to test the process before we presented it to both the Board and the membership.

The google doc is open to editing by the Committee, but open to comment by anyone with the link. We would like to have as transparent a process as possible, by asking for commentary and inviting members to follow along as we work our way through a more streamlined set of bylaws. We will also be publicizing our next few meetings and streaming them here on LITABlog so that members can chime in with questions, or just follow our progress.

The goal that we have set for ourselves is to have a draft of the revised bylaws to present to the Board prior to the Annual meeting, with the expectation of discussing issues at that meeting. Assuming the discussion is satisfactory, we’ll then begin the process of moving it to the membership for formal review, before putting the Bylaws changes up to a vote. There’s a process in, you guessed it, the Bylaws about how this is done. This isn’t going to be something that happens tomorrow; it will likely take most of the rest of 2014 to complete. But I think that at the end we’ll have a set of bylaws that will enable LITA to be a more flexible and nimble division moving forward.

OCLC Dev Network: VIAF Autocomplete Widget

Thu, 2014-03-27 19:20

One of the projects to come out of OCLC Developer House in February 2014 was the VIAF Autocomplete widget.

Ng, Cynthia: Code4Lib 2014: Day 3 Morning Presentations

Thu, 2014-03-27 16:08
Presentations for Day 3 of Code4Lib 2014. Under the Hood of Hadoop Processing at OCLC Research – Roy Tennant previously using MapReduce hardware with lots of processing nodes with several copies of Worldcat Java Native, but can use any language you want if you use the “streaming” option best kept as shell script mappers and […]

Miedema, John: Use Apache Tika to extract metadata and convert different content types into plain text

Thu, 2014-03-27 15:47

The first step in text analytics is to crawl document sources. A full crawling solution involves three tasks: creating authenticated connections to data sources, collecting document metadata, and pre-processing different content types into a single internal format, usually plain text. Typically a crawl is scheduled to obtain content updates. These tasks allow for the next steps of indexing and searching.

The following code sample illustrates how Apache Tika can be used to extract metadata and convert different content types into plain text.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

// source documents include different content types
processDocument("resources/mobydick.htm");
processDocument("resources/robinsoncrusoe.txt");
processDocument("resources/callofthewild.pdf");

private static void processDocument(String pathfilename) {
    try {
        InputStream input = new FileInputStream(new File(pathfilename));
        // Apache Tika
        ContentHandler textHandler = new BodyContentHandler(10 * 1024 * 1024);
        Metadata meta = new Metadata();
        Parser parser = new AutoDetectParser(); // handles documents in different formats
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, meta, context);
        // extract metadata
        System.out.println("Title: " + meta.get(DublinCore.TITLE));
        // content is plain text
        System.out.println("Body: " + textHandler.toString());
    } catch (Exception ex) {
        System.out.println(ex.getMessage());
    }
}

(Nod to example in Taming Text)