
Peter Murray: Thursday Threads: Advertising and Privacy, Giving Away Linux, A View of the Future

planet code4lib - Thu, 2015-06-11 10:49

In just a few weeks there will be a gathering of 25,000 librarians in the streets of San Francisco for the American Library Association annual meeting. The topics on my mind as the meeting draws closer? How patrons intersect with advertising and privacy when using our services. What one person can do to level the information access divide using free software. Where technology in our society is going to take us next. Heady topics for heady times.

On a personal note: funding for my current position at LYRASIS runs out at the end of June, so I am looking for my next challenge. Check out my resume/c.v. and please let me know of job opportunities in library technology, open source, and/or community engagement.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted there are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.

Internet Users Don’t Care For Ads and Do Care About Privacy

In advertising, an old adage holds, half the money spent is wasted; the problem is that no one knows which half. This should be less of a problem in online advertising, since readers’ tastes and habits can be tracked, and ads tailored accordingly. But consumers are increasingly using software that blocks advertising on the websites they visit. If current trends continue, the saying in the industry may well become that half the ads aimed at consumers never reach their screens. This puts at risk online publishing’s dominant business model, in which consumers get content and services free in return for granting advertisers access to their eyeballs.

Block shock: Internet users are increasingly blocking ads, including on their mobiles, The Economist, 6-Jun-2015

A new report into U.S. consumers’ attitude to the collection of personal data has highlighted the disconnect between commercial claims that web users are happy to trade privacy in exchange for ‘benefits’ like discounts. On the contrary, it asserts that a large majority of web users are not at all happy, but rather feel powerless to stop their data being harvested and used by marketers.

The Online Privacy Lie Is Unraveling, by Natasha Lomas, TechCrunch, 6-Jun-2015

This week The Economist printed a story about how users are starting to use software in their desktop and mobile browsers to block advertisements, and about what the reaction may be from websites that rely on advertising to fund their activities. I found it interesting that “younger consumers seem especially intolerant of intrusive ads”; as that cohort ages, of course, more of the population will be using ad-blocking software. Reactions range from gentle prodding to support the website in other ways, to lawsuits against the makers of ad-blocking software, to mixing advertising with editorial content.

Also this week the news outlet TechCrunch reported on a study by the Annenberg School for Communication finding that “a majority of Americans are resigned to giving up their data” when they “[believe] an undesirable outcome is inevitable and [feel] powerless to stop it.” This sort of thing has come up in the NISO Patron Privacy working group discussions that have occurred over the past couple of weeks and will culminate in a day-and-a-half working meeting at ALA. It is also something that I have been blogging about recently.

Welcome to America: Here’s your Linux computer

So, the following Monday I delivered a lovely Core2Duo desktop computer system with Linux Mint 17.1 XFCE installed. This computer was recently surplussed from the public library where I work. Installed on the computer were:

  • LibreOffice, for writing and documenting
  • Klavaro, a touch-typing tutor
  • TuxPaint, a painting program for kids
  • Scratch, to learn computer programming
  • TeamViewer, so I can volunteer to remotely support this computer

In 10 years' time, these kids and their mom may well remember that first Linux computer the family received. Tux was there, as I see it, waiting to welcome these youth to their new country. Without Linux, that surplussed computer might have gotten trashed. Now that computer will get two, four, or maybe even six more years' use from students who really value what it has to offer them.

Welcome to America: Here’s your Linux computer, by Phil Shapiro, 5-June-2015

This is a heartwarming story of making something out of nearly nothing: a surplus computer, free software, and a little effort. This is a great example of how one person can make a significant difference for a needy family.

What Silicon Valley Can Learn From Seoul

“When I was in S.F., we called it the mobile capital of the world,” [Mike Kim] said. “But I was blown away because Korea is three or four years ahead.” Back home, Kim said, people celebrate when a public park gets Wi-Fi. But in Seoul, even subway straphangers can stream movies on their phones, deep beneath the ground. “When I go back to the U.S., it feels like the Dark Ages,” he said. “It’s just not there yet.”

What Silicon Valley Can Learn From Seoul, by Jenna Wortham, New York Times Magazine, 2-Jun-2015

What is moving the pace of technology faster than Silicon Valley? South Korea. Might that country’s citizens be divining the path that the rest of us will follow?


Eric Hellman: Protect Reader Privacy with Referrer Meta Tags

planet code4lib - Thu, 2015-06-11 03:10
Back when the web was new, it was fun to watch a website monitor and see the hits come in. The IP address told you the location of the user, and if you turned on the referer header display, you could see what the user had been reading just before. There was a group of scientists in Poland who'd be on my site regularly; I reported the latest news on nitride semiconductors, and my site was free. Every day around the same time, one of the Poles would check my site, and I could tell he had a bunch of sites he'd look at in order. My site came right after a Russian web site devoted to photographs of unclothed women.

The original idea behind the HTTP referer header (yes, that's how the header is spelled) was that webmasters like me needed it to help other webmasters fix hyperlinks. Or at least that was the rationalization. The real reason for sending the referer was to feed webmaster narcissism. We wanted to know who was linking to our site, because those links were our pats on the back. They told us about other sites that liked us. That was fun. (Still true today!)

The fact that my nitride semiconductor website ranked up there with naked Russian women amused me; reader privacy issues didn't bother me because the Polish scientist's habits were safe with me.

Twenty years later, the referer header seems like a complete privacy disaster. Modern web sites use resources from all over the web, and a referer header, including the complete URL of the referring web page, is sent with every request for those resources. The referer header can send your complete web browsing log to websites that you didn't know existed.

Privacy leakage via the referer header plagues even websites that ostensibly believe in protecting user privacy, such as those produced by or serving libraries. For example, a request to the WorldCat page for What you can expect when you're expecting results in the transmission of referer headers containing the user's request to the following hosts:
  • (with tracking cookies)
  • (with tracking cookies)
None of the resources requested from these third parties actually need to know what page the user is viewing, but WorldCat causes that information to be sent anyway. In principle, this could allow advertising networks to begin marketing diapers to carefully targeted WorldCat users. (I've written about AddThis and how they sell data about you to advertising networks.)

It turns out there's an easy way to plug this privacy leak in HTML5. It's called the referrer meta tag. (Yes, that's also spelled correctly.)

The referrer meta tag is put in the head section of an HTML5 web page. It allows the web page to control the referer headers sent by the user's browser. It looks like this:

<meta name="referrer" content="origin" />
If this one line were used on WorldCat, only the fact that the user is looking at a WorldCat page would be sent to Google, AddThis, and BibTip. This is reasonable: library patrons typically don't expect their visits to a library to be private; they do expect that what they read there should be private.

Because use of third-party resources is often necessary, most library websites leak lots of privacy in referer headers. The meta referrer policy is a simple way to stop it. You may well ask why this isn't already standard practice. I think it's mostly lack of awareness. Until very recently, I had no idea that this worked so well. That's because it's taken a long time for browser vendors to add support. Chrome and Safari have supported the referrer meta tag for more than two years; Firefox only added it in January of 2015. Internet Explorer will support it with the Windows 10 release this summer. Privacy will still leak for users with older browser software, but this problem will gradually go away.

There are four options for the meta referrer tag in addition to the "origin" policy. The origin policy sends only the host name of the originating page.

For the strictest privacy, use

<meta name="referrer" content="no-referrer" />

If you use this setting, other websites won't know you're linking to them, which can be a disadvantage in some situations. And if the web page links to resources that still use the archaic "referer authentication", they'll break.

The prevailing default policy for most browsers is equivalent to

<meta name="referrer" content="no-referrer-when-downgrade" />

"Downgrade" here refers to following http links from https pages.

If you need the referer for your own website but don't want other sites to see it you can use

<meta name="referrer" content="origin-when-cross-origin" />
Finally, if you want the user's browser to send the full referer header no matter what, and experience the thrills of privacy brinkmanship, you can set

<meta name="referrer" content="unsafe-url" />
Widespread deployment of the referrer meta tag would be a big boost for reader privacy all over the web. It's easy to implement, has little downside, and is widely deployable. So let's get started!
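To make the five policy values above concrete, here is a small Python sketch that approximates what Referer value a browser would send under each policy. This is an illustration only, not any browser's actual implementation; real referrer-policy processing has more cases than shown here.

```python
from urllib.parse import urlsplit

def referer_for(policy, page_url, target_url):
    """Approximate the Referer header sent for a request to target_url
    made from page_url, under the given referrer policy.
    Returns None when no Referer header is sent. (Simplified sketch.)"""
    page, target = urlsplit(page_url), urlsplit(target_url)
    origin = f"{page.scheme}://{page.netloc}/"
    if policy == "no-referrer":
        return None                      # never send anything
    if policy == "origin":
        return origin                    # host only, no path or query
    if policy == "no-referrer-when-downgrade":
        # Suppress the header only on https -> http "downgrade" links
        if page.scheme == "https" and target.scheme == "http":
            return None
        return page_url
    if policy == "origin-when-cross-origin":
        # Full URL to the same site, origin only to third parties
        return page_url if page.netloc == target.netloc else origin
    if policy == "unsafe-url":
        return page_url                  # full URL, always
    raise ValueError(f"unknown policy: {policy}")

page = "https://www.worldcat.org/title/12345"
print(referer_for("origin", page, "https://s7.addthis.com/js/addthis.js"))
# -> https://www.worldcat.org/
```

Under the "origin" policy, a third party like AddThis learns only that the visitor was somewhere on worldcat.org, not which title they were viewing.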


Peter Murray: Can Google’s New “My Account” Page be a Model for Libraries?

planet code4lib - Thu, 2015-06-11 00:30

One of the things discussed in the NISO patron privacy conference calls has been the need for transparency with patrons about what information is being gathered about them and what is done with it. The recent announcement by Google of a "My Account" page and a privacy question/answer site got me thinking about what such a system might look like for libraries. Google and libraries are different in many ways, but one similarity we share is that people use both to find information. (This is not the only use of Google and libraries, but it is a primary use.) Might we be able to learn something about how Google puts users in control of their activity data? Even though our motivations and ethics are different, I think we can.

What the Google "My Account" page gives us

Last week I got an e-mail from Google that invited me to visit the new My Account page for "controls to protect and secure my Google account."

Google’s “My Account” home page

I think the heart of the page is the "Security Checkup" tool and the "Privacy Checkup" tool. The "Privacy Checkup" link takes you through five steps:

The five areas that Google offers when you run the “Privacy Checkup”.

  1. Google+ settings (including what information is shared in your public profile)
  2. Phone numbers (whether people can find you by your real life phone numbers)
  3. YouTube settings (default privacy settings for videos you upload and playlists you create)
  4. Account history (what of your activity with Google services is saved)
  5. Ads settings (what demographic information Google knows about you for tailoring ads)

These are broad brushes of control; the settings available here are pretty global. For instance, if you want to see your search history and what links you followed from the search pages, you would need to go to a separate page. In the “Privacy Checkup” the only option that is offered is whether or not your search history is saved. Still, for someone who wants to go with an “everything off” or “everything on” approach, the Privacy Checkup is a good way to do that.

Sidebar: I would also urge you to go through the “Security Checkup”. There you can change your password and account recovery options, see what other services have access to your Google account data, and make changes to account security settings.

The more in-depth settings can be reached by going to the "Personal Information and Privacy" page. This is a really long page, and you can see the full page content separately.

First part of the My Account “Personal Information and Privacy” page. The full screen capture is also available.

There you can see individual searches and search results that you followed.

My Account “Search and App Activity” page

Same with activity on YouTube.

My Account ‘YouTube Activity’ page

Google clearly put some thought and engineering time into developing this. What would a library version of this look like?

Google's Privacy site

The second item in the Google announcement was its privacy site. There they cover these topics:

  • What data does Google collect?
  • What does Google do with the data it collects?
  • Does Google sell my personal information?
  • What tools do I have to control my Google experience?
  • How does Google keep my information safe?
  • What can I do to stay safe online?

Each has a brief answer that leads to more information and sometimes to an action page like updating your password to something more secure or setting search history preferences.

Does this apply to libraries?

It could. It is clearly easier for Google because it has control over all the web properties and can do a nice integration like what is on the My Account page. We will have a more difficult task because libraries use many service providers, and there are no programming interfaces libraries can use to pull together all the privacy settings onto one page. There isn't even a consistent vocabulary or set of setting labels that service providers could use to build such a page for making choices. Coming to an agreement on:

  1. how service providers should be transparent on what is collected, and
  2. how patrons can opt-in to data collection for their own benefit, see what data has been collected, and selectively delete and/or download that activity

…would be a significant step forward. Hopefully that is the level of detail that the NISO Patron Privacy framework can describe.


DuraSpace News: MOVING Content: Institutional Tools and Strategies for Fedora 3 to 4 Upgrations

planet code4lib - Thu, 2015-06-11 00:00

Winchester, MA: The Fedora team has made available tools that simplify content migration from Fedora 3 to Fedora 4, to assist institutions in establishing production repositories. Using the Hydra-based Fedora-Migrate tool (built in advance of Penn State's deadline to have Fedora 4 in production, before the generic Fedora Migration Utilities was released), Penn State's ScholarSphere moved all data from production instances of Fedora 3 to Fedora 4 in about 20 hours.

District Dispatch: Experts to talk library hacker spaces at 2015 ALA Annual Conference

planet code4lib - Wed, 2015-06-10 18:45

Woman playing classic video games using Makey Makey and coins as controllers at ALA conference. Photo by Jenny Levine.

How can libraries ensure that learners of all ages stay curious, develop their passions, and immerse themselves in learning? Learn about developing library learning spaces at this year's 2015 American Library Association (ALA) Annual Conference in San Francisco. The interactive session, "Hacking the Culture of Learning in the Library," takes place from 1:00 to 2:00 p.m. on Sunday, June 28, 2015. The session will be held at the Moscone Convention Center in room 2018 of the West building.

Leaders will discuss ways that libraries serve as informal learning spaces that encourage exploration and discovery, while librarians lead in creating new opportunities to engage learners and make learning happen. During the session, library leaders will explore ways that libraries are creating incubation spaces to hack education and create new paradigms where learners own their education.

  • Moderator: Christopher Harris, school library system director, Genesee Valley Educational Partnership; ALA Office for Information Technology Policy Fellow for Program on Youth and Technology Policy
  • Erica Compton, project coordinator, Idaho Commission for Libraries
  • Megan Egbert, youth services manager, Meridian Library District (Idaho)
  • Connie Williams, teacher librarian, Petaluma High School (Calif.)

View all ALA Washington Office conference sessions

The post Experts to talk library hacker spaces at 2015 ALA Annual Conference appeared first on District Dispatch.

Brown University Library Digital Technologies Projects: Best bets for library search

planet code4lib - Wed, 2015-06-10 17:33

The library has added “best bets” to the new easySearch tool. Best bets are commonly searched-for library resources; examples include JSTOR, PubMed, and Web of Science. Searches for these phrases (as well as known alternate names and misspellings) will return a best bet highlighted at the top of the search results.

To get started, 64 resources have been selected as best bets and are available now via easySearch. We would like to know how useful this feature is, so please leave us feedback.

Thanks to colleagues at North Carolina State University for leading the way in adding best bets to library search and writing about their efforts.

Technical details

Library staff analyzed search logs to find commonly used search terms and matched those terms to appropriate resources. The name, URL, and description for each resource are entered into a shared Google Spreadsheet. A script runs regularly to convert the spreadsheet data into Solr documents and posts the updates to a separate Solr core. The Blacklight application searches for best bet matches when users enter a search into the default search box.

Since the library maintains a database of e-resources, in many cases only the identifier for a resource is needed to populate the best bets index.  The indexing script is able to retrieve the resource from the database and use that information to create the best bet.  This eliminates maintaining data about the resources in multiple places.
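The spreadsheet-to-Solr conversion step might look something like the following Python sketch. The field names, ID scheme, and update URL are illustrative assumptions, not the library's actual schema or code.

```python
import json

def rows_to_solr_docs(rows):
    """Convert (name, url, description) spreadsheet rows into
    Solr-style JSON documents for a hypothetical best-bets core."""
    return [
        {
            # Hypothetical ID scheme: slugified resource name
            "id": name.lower().replace(" ", "-"),
            "name": name,
            "url": url,
            "description": desc,
        }
        for name, url, desc in rows
    ]

rows = [("JSTOR", "http://www.jstor.org/", "Scholarly journal archive")]
payload = json.dumps(rows_to_solr_docs(rows))
# The payload could then be POSTed to the core's update handler,
# e.g. http://localhost:8983/solr/bestbets/update (hypothetical URL).
```

A periodic job doing exactly this keeps the Google Spreadsheet as the single editable source of truth, with the Solr core as a derived index.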

SearchHub: Indexing Performance in Solr 5.2 (now twice as fast)

planet code4lib - Wed, 2015-06-10 16:31
About this time last year (June 2014), I introduced the Solr Scale Toolkit and published some indexing performance metrics for Solr 4.8.1. Solr 5.2.0 was just released and includes some exciting indexing performance improvements, especially when using replication. Before we get into the details about what we fixed, let's see how things have improved empirically. Using Solr 4.8.1 running in EC2, I was able to index 130M documents into a collection with 10 shards and replication factor of 2 in 3,727 seconds (~62 minutes) using ten r3.2xlarge instances; please refer to my previous blog post for specifics about the dataset. This equates to an average throughput of 34,881 docs/sec. Today, using the same dataset and configuration, with Solr 5.2.0, the job finished in 1,704 seconds (~28 minutes), which is an average of 76,291 docs/sec. To rule out any anomalies, I reproduced these results several times while testing release candidates for 5.2. To be clear, the only notable difference between the two tests is a year of improvements to Lucene and Solr!

So now let's dig into the details of what we fixed. First, I cannot stress enough how much hard work and sharp thinking has gone into improving Lucene and Solr over the past year. Also, special thanks goes out to Solr committers Mark Miller and Yonik Seeley for helping identify the issues discussed in this post, recommending possible solutions, and providing oversight as I worked through the implementation details. One of the great things about working on an open source project is being able to leverage other developers' expertise when working on a hard problem.

Too Many Requests to Replicas

One of the key observations from my indexing tests last year was that replication had higher overhead than one would expect. For instance, when indexing into 10 shards without replication, the test averaged 73,780 docs/sec, but with replication, performance dropped to 34,881.
You’ll also notice that once I turned on replication, I had to decrease the number of Reducer tasks (from 48 to 34) I was using to send documents to Solr from Hadoop to avoid replicas going into recovery during high-volume indexing. Put simply, with replication enabled, I couldn’t push Solr as hard. When I started digging into the reasons behind replication being expensive, one of the first things I discovered is that replicas receive up to 40x the number of update requests from their leader when processing batch updates, which can be seen in the performance metrics for all request handlers on the stats panel in the Solr admin UI.

Batching documents into a single request is a common strategy used by client applications that need high-volume indexing throughput. However, batches sent to a shard leader are parsed into individual documents on the leader, indexed locally, and then streamed to replicas using ConcurrentUpdateSolrClient. You can learn about the details of the problem and the solution in SOLR-7333. Put simply, Solr’s replication strategy caused CPU load on the replicas to be much higher than on the leaders, as you can see in the screenshots below.

CPU Profile on Leader

CPU Profile on Replica (much higher than leader)

Ideally, you want all servers in your cluster to have about the same amount of CPU load. The fix provided in SOLR-7333 helps reduce the number of requests and CPU load on replicas by sending more documents from the leader per request when processing a batch of updates. However, be aware that the batch optimization is only available when using the JavaBin request format (the default used by CloudSolrClient in SolrJ); if your indexing application sends documents to Solr using another format (JSON or XML), then shard leaders won’t utilize this optimization when streaming documents out to replicas. We’ll likely add a similar solution for processing other formats in the near future.
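The client-side batching strategy mentioned here can be sketched as a simple generator. This is a generic illustration of grouping documents into fixed-size batches, not SolrJ's or Solr's actual implementation.

```python
def batched(docs, size=1000):
    """Group an iterable of documents into lists of at most `size`,
    so each HTTP update request carries many documents instead of one."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:          # flush the final, possibly partial, batch
        yield batch

# Each yielded batch would be sent to Solr as one update request.
sizes = [len(b) for b in batched(range(25), size=10)]
print(sizes)  # -> [10, 10, 5]
```

The point of the SOLR-7333 fix is that this batching should survive past the shard leader: the leader forwards documents to replicas in larger groups rather than one request per document.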
Version Management Lock Contention

Solr adds a _version_ field to every document to support optimistic concurrency control. Behind the scenes, Solr's transaction log uses an array of version "buckets" to keep track of the highest known version for a range of hashed document IDs. This helps Solr detect if an update request is out-of-date and should be dropped. Mark Miller ran his own indexing performance tests and found that expensive index housekeeping operations in Lucene can stall a Solr indexing thread. If that thread happens to be holding the lock on a version bucket, it can stall other threads competing for the lock. To address this, we increased the default number of version buckets used by Solr's transaction logs from 256 to 65536, which helps reduce the number of concurrent requests that are blocked waiting to acquire the lock on a version bucket. You can read more about this problem and solution in SOLR-6820. We're still looking into how to deal with Lucene using the indexing thread to perform expensive background operations, but for now it's less of an issue.

Expensive Lookup for a Document's Version

When adding a new document, the leader sets the _version_ field to a long value based on the CPU clock time; incidentally, you should use a clock synchronization service for all servers in your Solr cluster. Using the YourKit profiler, I noticed that replicas spent a lot of time trying to look up the _version_ for new documents to ensure update requests were not re-ordered. Specifically, the expensive code was where Solr attempts to find the internal Lucene ID for a given document ID. Of course, for new documents there is no existing version, so Solr was doing a fair amount of wasted work looking for documents that didn't exist.
Yonik pointed out that if we initialize the version buckets used by the transaction log to the maximum value of the _version_ field before accepting new updates, then we can avoid this costly lookup for every new document coming into the replica. In other words, if a version bucket is seeded with the max value from the index, then when new documents arrive with a version value that is larger than the current max, we know the update request has not been reordered. Of course, the max version for each bucket gets updated as new documents flow into Solr. Thus, as of Solr 5.2.0, when a Solr core initializes, it seeds version buckets with the highest known version from the index; see SOLR-7332 for more details.

With this fix, when a replica receives a document from its leader, it can quickly determine if the update was reordered by consulting the highest value of the version bucket for that document (based on a hash of the document ID). In most cases, the version on an incoming document to a replica will have a higher value than the version bucket, which saves an expensive lookup to the Lucene index and increases overall throughput on replicas. If by chance the replica sees a version that is lower than the bucket max, it will still need to consult the index to ensure the update was not reordered.

These three tickets taken together achieve a significant increase in indexing performance and allow us to push Solr harder now. Specifically, I could only use 34 reducers with Solr 4.8.1 but was able to use 44 reducers with 5.2.0 and still remain stable. Lastly, if you're wondering what you need to do to take advantage of these fixes: just upgrade to Solr 5.2.0; no additional configuration changes are needed. I hope you're able to take advantage of these improvements in your own environment, and please file JIRA requests if you have other ideas on how to improve Solr indexing performance.
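The bucket-based reorder check described above can be sketched in Python roughly as follows. This is a much-simplified illustration of the idea behind SOLR-6820/SOLR-7332 (no locking, assumed names), not Solr's actual Java implementation.

```python
NUM_BUCKETS = 65536  # Solr 5.2 default, raised from 256 to cut lock contention

class VersionBuckets:
    """Per-bucket high-water marks for document versions, used by a
    replica to detect possibly reordered updates cheaply."""

    def __init__(self, max_version_from_index=0):
        # Seed every bucket with the highest version found in the index
        # at startup (the SOLR-7332 fix), so brand-new documents with
        # larger versions skip the index lookup entirely.
        self.max = [max_version_from_index] * NUM_BUCKETS

    def _bucket(self, doc_id):
        return hash(doc_id) % NUM_BUCKETS

    def maybe_reordered(self, doc_id, version):
        """Return False when the update is certainly in order; True when
        the replica must fall back to consulting the Lucene index."""
        b = self._bucket(doc_id)
        if version > self.max[b]:
            # Common case: newer than anything seen -- no index lookup.
            self.max[b] = version
            return False
        # Rare case: version at or below the bucket max; check the index.
        return True
```

The fast path (`version > self.max[b]`) is what saves the expensive per-document lookup on replicas during high-volume indexing.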
The Solr Scale Toolkit has been upgraded to support Solr 5.2.0 and the dataset I used is publicly shared on S3 if you want to reproduce these results.

The post Indexing Performance in Solr 5.2 (now twice as fast) appeared first on Lucidworks.

LITA: Sunday Routines: Susan Sharpless Smith

planet code4lib - Wed, 2015-06-10 14:58

In this series, inspired by the New York Times’ Sunday Routines, we gain a glimpse into the lives of the people behind LITA. This post focuses on Susan Sharpless Smith, who was recently elected 2015-2018 Director-at-Large.

Susan Sharpless Smith is an Associate Dean at Wake Forest University’s Z. Smith Reynolds Library. She’s been in that role since 2011, but has worked in a range of positions at that library since 1996. Her current job provides a wide variety of responsibilities and opportunities, and fills her week with interesting, meaningful professional work.

Sunday is the day Susan reserves to enjoy her family and her interests. It normally unfolds slowly. Susan is an early riser, often heading for the first cup of coffee and the Sunday newspaper before 6 am. In the summer, the first hour of the day is spent watching the day emerge from her screen porch in Winston-Salem, NC. She is not a big TV watcher but always tunes into the Today Show on Sunday mornings.

Bicycling is one of Susan’s passions, so a typical Sunday will include a 15-40 mile bike ride, either around town or out into the surrounding countryside. It’s her belief that bicycling is good for the soul. It also is one of the best ways to get acquainted with new places, so a bike ride is always on her agenda when traveling. Plans are already underway for a San Francisco bike excursion during ALA!

Susan’s second passion is photography, so whatever she is up to on any given Sunday, a camera accompanies her. (The best camera is the one you have with you!). She has been archiving her photographs on Flickr since 2006 and has almost 10,500 of them. Her most relaxing Sunday evening activity is settling in on her MacBook Air to process photos from that day in Photoshop.

Her son and daughter are grown, so often the day is planned around a family gathering of some sort. This can involve a road trip to Durham, an in-town Sunday brunch, a drive to North Carolina wine country, or a hike at nearby Hanging Rock.

Her best Sunday would be spent in her favorite place in the world, at her family’s beach house in Rehoboth Beach, DE. Susan’s family has been going there for vacations since she was a child. It’s where she heads whenever she has a long weekend and wants to recharge. Her perfect Sunday is spent there (either for real, or in her imagination when she can’t get away). This day includes a sunrise walk on the beach, a morning bike ride on the boardwalk stopping for breakfast while people-watching, reading a book on the beach, eating crab cakes for every meal (there are no good ones in Piedmont North Carolina), a photo shoot at the state park, a kayak trip on the bay, and an evening at Funland riding bumper cars and playing skeeball. It doesn’t get any better than that!

Open Library Data Additions: Amazon Crawl: part eo

planet code4lib - Wed, 2015-06-10 11:29

Part eo of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Metadata, Text

Peter Murray: My View of the NISO Patron Privacy Working Group

planet code4lib - Tue, 2015-06-09 21:34

Yesterday Bobbi Newman posted Thinking Out Loud About Patron Privacy and Libraries on her blog. Both of us are on the NISO committee to develop a Consensus Framework to Support Patron Privacy in Digital Library and Information Systems, and her article sounded a note of discouragement that I hope to dispel while also outlining what I’m hoping to see come out of the process. I think we share a common belief: the privacy of our patrons’ activity data is paramount to the essence of being a library. I want to pull out a couple of sentences from her post:

Libraries negotiate with vendors on behalf of their patrons. Library users trust the library, and the choices librarians make need to be worthy of that trust.

Wholeheartedly agree!

Librarians should be able to tell users exactly what information vendors are collecting about them and what they are doing with that data.

This is why I am engaged in the NISO effort. I don’t think we librarians have a good handle on the patron activity data we are collecting, or on the intersection of our service offerings with what third parties might do with it. Eric Hellman lays out a somewhat dark scenario in his Towards the Post-Privacy Library? article, published in the recent American Libraries Digital Futures supplement. What I’m hoping comes out of this is a framework for awareness and a series of practices that libraries can take to improve patron privacy.

  1. A statement of principles of what privacy means for library patrons in the highly digital distributed environment that we are in now.
  2. A recognition that protecting privacy is an incremental process, so we need something like the “SANS Critical Security Controls” to help libraries take an inventory of their risks and to seek resources to address them.
  3. A “Shared Understanding” between service subscribers and service providers around expectations for privacy.

A statement of principles…

We have lived through a radical shift in how information and services are delivered to patrons, and I’d argue we haven’t thought through the impacts of that shift. There was a time when libraries collected information just in case for the needs of their patrons: books, journals/periodicals, newspapers and the catalogs and indexes that covered them. Not so long ago — at least in my professional lifetime — we were making the transition from paper indexes to CD-ROM indexes. We saw the beginnings of online delivery in services like Dialog and FirstSearch, but for the most part everything was under our roof.

Nowadays, however, we purchase or subscribe to services where information is delivered just in time. Gone are the days of shelf-after-shelf of indexes and the tedium of swapping CD-ROMs in large towers. “The resource is online, and it is constantly updated!” we trumpeted. And in recent years even the library’s venerable online catalog is often hosted by service providers. It makes for more efficient information delivery, but it also brings more actors into the interaction between our patrons and our information providers. It is that reality we need to account for, and to educate each other on the privacy implications of those new actors.

A recognition that protecting privacy is an incremental practice…

One of the important lessons from the information security field is that protecting software systems is never “done” — it is never a checklist or a completed audit or a one-time task. Security professionals developed the “Critical Security Controls” list to get a handle on persistent and new forms of attack. From the introduction:

Over the years, many security standards and requirements frameworks have been developed in attempts to address risks to enterprise systems and the critical data in them. However, most of these efforts have essentially become exercises in reporting on compliance and have actually diverted security program resources from the constantly evolving attacks that must be addressed. … The Critical Security Controls focuses first on prioritizing security functions that are effective against the latest Advanced Targeted Threats, with a strong emphasis on “What Works” – security controls where products, processes, architectures and services are in use that have demonstrated real world effectiveness.

The current edition has 20 items, listed in priority order. If an organization does nothing more than the first five, it has already done a lot to protect itself from the most common threats.

Patron privacy needs to be addressed in the same way. There are things we can do that will have the most impact for the effort, and once we get a handle on those we can move on to other, less impactful areas. And just as the SANS organization regularly convenes professionals to review and make recommendations based on new threats and practices, so too must our “critical privacy controls” be updated as new service models are introduced and new points of privacy threat are found.

A shared understanding…

Libraries will not be able to raise the privacy levels of their patrons’ activities without involving the service providers that we now rely on. At the second open teleconference of the NISO patron privacy effort, I briefly presented my thoughts on why a shared understanding between libraries and service providers is important. I found it interesting that during the same teleconference service providers identified a need for a “service level agreement” of sorts that covers how libraries must react to detected breaches in proxy systems2. With NISO acting as an ideal intermediary, the parties can come together and create a shared understanding of what each other needs in this highly distributed world.

Making Progress

The TL;DR-at-the-bottom summary? Take heart, Bobbi. I think we are seeing the part of the process where a bunch of ideas are thrown out (including mine above!) and we begin the steps to condense all of those ideas into a plan of action. I, for one, am not interested in improving services at the expense of our core librarian ethic to hold in confidence the activities of our patrons. I don’t see it as a matter of matching the competition; in fact, I see this activity as a distinguishing characteristic for libraries. This week the news outlet TechCrunch reported on a study by the Annenberg School for Communication on how “a majority of Americans are resigned to giving up their data” when they “[believe] an undesirable outcome is inevitable and [feel] powerless to stop it.” If libraries can honestly say — because we’ve studied the issue and proactively protected patrons — that we are a reliable source of exceptional information provided in a way that is respectful of the patron’s desire to control how their activity information is used, then I think we have a good story to tell and a compelling service to offer.

  1. While I have a stage, can I point out: “Is there any irony in ALA's ‘Digital Futures’ document being a hunkin' flash app leading to a 4.5MB PDF?” — Peter Murray (@DataG) May 28, 2015

  2. A proxy server, while enabling a patron to get access to third-party information services while not on the library’s network, also acts as an anonymizing agent of sorts. The service provider only sees the aggregate activity of all patrons coming through the proxy server. That makes it impossible, though, for a service provider to fight off a bulk-download attack without help from the library.

SearchHub: What’s new in Apache Solr 5.2

planet code4lib - Tue, 2015-06-09 19:46
Apache Lucene and Solr 5.2.0 were just released with tons of new features, optimizations, and bug fixes. Here are the major highlights from the release:

Rule based replica assignment

This feature allows users fine-grained control over the placement of new replicas during collection, replica, and shard creation. A rule is a set of conditions, comprising a shard, a replica, and a tag, that must be satisfied before a replica can be created. This can be used to restrict replica creation, for example:
  • Keep less than 2 replicas of a collection on any node
  • For a shard, keep less than 2 replicas on any node
  • (Do not) Create shards on a particular rack, or host.
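A hedged sketch of what creating a collection with such a rule might look like. The host, collection name, and rule string here are illustrative; the rule syntax (comma-separated conditions) follows Solr's rule-based replica placement in the Collections API:

```python
from urllib.parse import urlencode

# Illustrative parameters for a Collections API CREATE call.
# The rule keeps fewer than 2 replicas of any shard on a single node.
params = {
    "action": "CREATE",
    "name": "mycollection",
    "numShards": 2,
    "replicationFactor": 2,
    "rule": "shard:*,replica:<2,node:*",
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Sending a GET request to the printed URL would create the collection on a running SolrCloud cluster.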
More details about this feature are available in this blog post.

Restore API

So far, Solr provided a feature to back up an existing index using a call like:

http://localhost:8983/solr/techproducts/replication?command=backup&name=backup_name

The new restore API allows you to restore an existing backup via a command like:

http://localhost:8983/solr/techproducts/replication?command=restore&name=backup_name

The location of the index backup defaults to the data directory but can be overridden by the location parameter.

JSON Facet API

unique() facet function: The unique facet function is now supported for numeric and date fields. Example:

json.facet={ unique_products : "unique(product_type)" }

The “type” parameter (flatter requests): There’s now a way to construct a flatter JSON Facet request using the “type” parameter. The following request from 5.1:

top_authors : { terms : { field:author, limit:5 } }

is equivalent to this request in 5.2:

top_authors : { type:terms, field:author, limit:5 }

mincount parameter and range facets: The mincount parameter is now supported by range facets to filter out ranges that don’t meet a minimum document count. Example:

prices:{ type:range, field:price, mincount:1, start:0, end:100, gap:10 }

multi-select faceting: A new parameter, excludeTags, disregards any matching tagged filters for that facet. Example:

q=cars
&fq={!tag=COLOR}color:black
&fq={!tag=MODEL}model:xc90
&json.facet={
  colors:{type:terms, field:color, excludeTags:COLOR},
  model:{type:terms, field:model, excludeTags:MODEL}
}

The above example shows a request where a user selected “color:black”. This query would do the following:
  • Get a document list with the filter applied.
  • colors facet:
    • Exclude the color filter so you get back facets for all colors instead of just getting the color ‘black’.
    • Apply the model filter.
  • Similarly compute facets for the model i.e. exclude the model filter but apply the color filter.
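That multi-select request can be sketched programmatically. The host and collection name here are assumptions; the facet JSON is the same shape as the example above:

```python
from urllib.parse import urlencode

# Tagged filters let each facet exclude "its own" filter so that
# all options remain visible even after the user selects one.
json_facet = (
    "{colors:{type:terms,field:color,excludeTags:COLOR},"
    "model:{type:terms,field:model,excludeTags:MODEL}}"
)
params = [
    ("q", "cars"),
    ("fq", "{!tag=COLOR}color:black"),
    ("fq", "{!tag=MODEL}model:xc90"),
    ("json.facet", json_facet),
]
url = "http://localhost:8983/solr/techproducts/select?" + urlencode(params)
print(url)
```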
hll facet function

The JSON facet API has an option to use the HyperLogLog implementation for computing unique values. Example:

json.facet={ unique_products : "hll(product_type)" }

Choosing facet implementation

Before Solr 5.2, interval faceting had a different implementation than range faceting, based on DocValues, which at times is faster and doesn’t rely on filters and the filter cache. Solr 5.2 has support for choosing between the filters-based and DocValues-based implementations. Functionally, the results of the two are the same, but there could be a difference in performance. The facet.range.method parameter allows for specifying the implementation to be used. Some numbers on the performance of the two methods can be found here:

Stats component

The Solr stats component now has support for HyperLogLog based cardinality estimation; the same is also used by the new JSON facet API. The cardinality option uses the probabilistic “HyperLogLog” (HLL) algorithm to estimate the cardinality of the sets in a fixed amount of memory. The cardinality parameter can also be tuned, trading off accuracy against the amount of RAM used at query time, with relatively minor impacts on response time performance. More about this can be read here:

Solr security

SolrCloud allows for hosting multiple collections within the same cluster but, until 5.1, didn’t provide a mechanism to restrict access. The authentication framework in 5.2 allows for plugging in a custom authentication plugin or using the Kerberos plugin that is shipped out of the box. This allows for authenticating requests to Solr. The authorization framework allows for implementing a custom plugin to authorize access to resources in a SolrCloud cluster. Here’s a Solr reference guide link for the same:

Solr streaming expressions

Streaming Expressions provide a simple query language for SolrCloud that merges search with parallel computing. This builds on the Solr streaming API introduced in 5.1.
The Solr reference guide has more information about the same.

Other features

A few configurations in Solr need to be in place as part of the bootstrapping process and before the first Solr node comes up, e.g. to enable SSL. The CLUSTERPROP call provides an API to do so, but requires a running Solr instance. Starting with Solr 5.2, a cluster-wide property can be added/edited/deleted using the zkcli script and doesn’t require a running Solr instance.

On the spatial front, this release introduces the new spatial RptWithGeometrySpatialField, based on CompositeSpatialStrategy, which blends RPT indexes for speed with serialized geometry for accuracy. It includes a Lucene segment based in-memory shape cache.

There is now a refactored Admin UI built using AngularJS. This new UI isn’t the default, but an optional interface, so users can report issues and provide feedback before it migrates to become the default UI. The new UI can be accessed at: http://hostname:port/solr/index.html

Though it’s an internal detail, it’s certainly an important one: Solr has internally been upgraded to use Jetty 9. This allows us to move towards using async calls and more.

Indexing performance improvement

This release also comes with a substantial indexing performance improvement, bumping it up by almost 100% as compared to Solr 4.x. Watch out for a blog on that real soon.

Beyond the features and improvements listed above, Solr 5.2.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release. For more information, the detailed change logs for both Lucene and Solr can be found here: Lucene: Solr: Featured image by David Precious
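As an illustrative sketch of the stats-component cardinality option described above (the endpoint and field name are invented), such a request might be composed like this:

```python
from urllib.parse import urlencode

# The {!cardinality=true} local param asks the stats component for a
# HyperLogLog-based estimate instead of an exact unique count.
params = {
    "q": "*:*",
    "rows": 0,
    "stats": "true",
    "stats.field": "{!cardinality=true}product_type",
}
url = "http://localhost:8983/solr/techproducts/select?" + urlencode(params)
print(url)
```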

The post What’s new in Apache Solr 5.2 appeared first on Lucidworks.

LITA: LITA Annual Report, 2014-2015

planet code4lib - Tue, 2015-06-09 16:41

As we reflect on 2014-2015, it’s fair to say that LITA, despite some financial challenges, has had numerous successes and remains a thriving organization. Three areas – membership, education, and publications – bring in the most revenue for LITA. Of those, membership is the largest money generator. However, membership has been on a decline, a trend seen across the American Library Association (ALA) for the past decade. In response, the Board, committees, interest groups, and many individuals have been focused on improving the member experience to retain current members and attract potential ones. With all the changes to the organization and leadership, LITA is on the road to becoming profitable again and will remain one of ALA’s most impactful divisions.

Read more in the LITA Annual Report.

DPLA: Digital Public Library of America receives $96,000 grant from the Whiting Foundation to expand its impact in education

planet code4lib - Tue, 2015-06-09 14:45

The Digital Public Library of America (DPLA) is pleased to announce that it has received $96,000 from the Whiting Foundation to begin creating resources for users in K-12 and higher education. The grant will allow DPLA to develop and share primary source sets built on the foundation of national educational standards and under the guidance of a diverse group of education experts. DPLA will also refine tools for creating user-generated content so that students and teachers can curate their own resources as part of the learning process.

“We are very grateful to the Whiting Foundation for their continued support of our education work, and we look forward to connecting a growing community of students with the rich primary source materials provided by our partners,” said Franky Abbott, DPLA’s Project Manager and American Council of Learned Societies Public Fellow.

This grant builds on DPLA’s recent Whiting-funded efforts to understand the ways large­-scale digital collections can best adapt their resources to address classroom needs. This work culminated in a comprehensive research paper published in April 2015.

“The growing collection of primary sources created by the Digital Public Library of America and its partners has the potential to become an unparalleled educational resource for teachers, students – and indeed anyone with a spark of curiosity and an internet connection,” said Whiting Foundation Executive Director Daniel Reid. “The Whiting Foundation is proud to continue our support for the DPLA’s work to build out new features and content to meet this purpose more effectively.”

“DPLA seeks not only to bring together openly available materials, but to maximize their use, and education is an essential realm for that use,” said Dan Cohen, DPLA’s Executive Director. “Thanks to the Whiting Foundation, we can move forward with the creation of easy-to-use resources, which we believe will help students and teachers in both K-12 and college.”

If you are interested in learning more about DPLA’s education work or getting involved, please email

# # #

About the Digital Public Library of America

The Digital Public Library of America ( strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated over 10 million items from over 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

About the Whiting Foundation

The Whiting Foundation ( has supported scholars and writers for more than forty years. This grant is part of the Foundation’s efforts to infuse the humanities into American public culture.

OCLC Dev Network: Change to Dewey Web Services

planet code4lib - Tue, 2015-06-09 14:00

The Dewey web service is not available at this time. There is no projected date yet for its return.

FOSS4Lib Upcoming Events: What's New in Archivematica 1.4

planet code4lib - Tue, 2015-06-09 13:57
Date: Tuesday, June 16, 2015 - 11:00 to 12:00. Supports: Archivematica

Last updated June 9, 2015. Created by Peter Murray on June 9, 2015.

From the meeting announcement:

Please join us for a free webinar highlighting what's new in Archivematica version 1.4, released last month. The webinar will be one hour long, with 45 minutes for demonstration and 15 minutes for question and answer.

Date: June 16
Time: 11 am - 12 PM Pacific Standard Time
Topics: General usability enhancements, CONTENTdm integration, improvements to bag ingest and more!

FOSS4Lib Upcoming Events: ArchivesSpace Member Meeting

planet code4lib - Tue, 2015-06-09 13:53
Date: Saturday, August 22, 2015 - 13:00 to 17:00. Supports: ArchivesSpace

Last updated June 9, 2015. Created by Peter Murray on June 9, 2015.

From the meeting announcement

State Library of Denmark: Net Archive Search building blocks

planet code4lib - Tue, 2015-06-09 13:25

An extremely webarchive-discovery and Statsbiblioteket centric description of some of the technical possibilities with Net Archive Search. This could be considered internal documentation, but we like to share.

There are currently 2 generations of indexes at Statsbiblioteket: v1 (22TB) & v2 (8TB). External access is currently to v1. As the features of v2 are a superset of v1, v1 will be disabled as soon as v2 catches up in terms of the amount of indexed material. ETA: July or August 2015.

Aggregation fields

The following fields have the DocValues option enabled, meaning that it is possible to export them efficiently as well as to sort, group and facet on them at a low memory cost.

Fields present in both index generations are marked (v1, v2); the remainder are available in v2 only.

Network

  • url_norm (v2): The resource URL, lightly normalised (lowercased, www removed, etc.) the same way as links. The two fields together can be used to generate graphs of resource interconnections. This field is also recommended for grouping, e.g. to get unique URLs in their earliest versions.
  • links (v2): Outgoing links from web pages, normalised the same way as url_norm. As cardinality is non-trivial (~600M values per TB of index), it is strongly recommended to enable low-memory mode if this field is used for faceting: facet=true&facet.field=links&f.links.facet.sparse.counter=nplanez
  • host (v1, v2): The host part of the URL for the resource.
  • domain (v1, v2): The domain part of the URL for the resource.
  • links_hosts (v2): The host part of all outgoing links from web pages.
  • links_domain (v2): The domain part of all outgoing links from web pages.
  • links_public_suffixes (v2): The suffix of all outgoing links from web pages. Samples: dk,, nu

Time and harvest-data

  • last_modified (v2): As reported by the web server. Note that this is not a reliable value, as servers are not reliable.
  • last_modified_year (v2): The year part of last_modified.
  • crawl_date (v2): The full and authoritative timestamp for when the resource was collected. For coarser grained (and faster) statistics, consider using crawl_date_year.
  • crawl_date_year (v2): The year part of the crawl_date.
  • publication_date (v2): The publication date as reported by the web server. Not authoritative.
  • publication_date_year (v2): The year part of the publication_date.
  • arc_orig (v2): Where the ARC file originated from. Possible values are sb & kb. If used for faceting, it is recommended to use enum: facet=true&facet.field=arc_orig&f.arc_orig.facet.method=enum
  • arc_job (v2): The ID of the harvest job as used by the Danish Net Archive when performing a crawl.

Content

  • url (v1, v2): The resource URL as requested by the harvester. Consider using url_norm instead to reduce the amount of noise.
  • author (v2): As stated in PDFs, Word documents, presentations etc. Unfortunately the content is highly unreliable, with a high degree of garbage.
  • content_type (v2): The MIME type returned by the web server.
  • content_length (v2): The size of the raw resource, measured in bytes. Consider using this with Solr range faceting.
  • content_encoding (v2): The character set for textual resources.
  • content_language (v2): Auto-detected language of the resource. Unreliable for small text samples, but quite accurate on larger ones.
  • content_type_norm (v1, v2): Highly normalised content type. Possible values are: html, text, pdf, other, image, audio, excel, powerpoint, video & word.
  • content_type_version, content_type_full, content_type_tika, content_type_droid, content_type_served, content_type_ext (v2): Variations of content type, resolved from servers and third party tools.
  • server (v2): The web server, as self-reported.
  • generator (v2): The web page generator.
  • elements_used (v2): All HTML elements used on web pages.

Search & stored fields

It is not recommended to sort, group or facet on the following fields. If it is relevant to do so, DocValues can be enabled for v3.

  • id (v1, v2): The unique Solr-ID of the resource. Used together with highlighting or for graph exploration.
  • source_files_s (v2): The name of the ARC file and the offset of the resource. This can be used as a CDX-lookup replacement by limiting the fields returned.
  • arc_harvest (v2): The harvest-ID from the crawler.
  • hash (v2): SHA1-hash of the content. Can be used for finding exact duplicates.
  • ssdeep_hash_bs_3, ssdeep_hash_bs_6, ssdeep_hash_ngram_bs_3, ssdeep_hash_ngram_bs_6 (v2): Fuzzy hashes. Can be used for finding near-duplicates.
  • content (v1, v2): Available as content_text in v1. The full extracted text of the resource. Used for text-mining or highlighting.


  1. Get core data for a single page: .../select? +crawl_year%3A2010+content_type_norm%3Ahtml &rows=1&fl=url%2Ccrawl_date%2Csource_file_s &wt=json

    gives us

    "docs": [ { "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735", "url": "", "crawl_date": "2010-06-18T14:33:29Z" } ]
  2. Request the resource 86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735 from storage, extract all links to images, css etc. The result is a list of URLs like
  3. Make a new request for the URLs from #2, grouped by unique URL, sorted by temporal distance to the originating page: .../select?q=url%3A( DRLogos%2FDR_logo.jpg%22) &rows=5&fl=url%2Ccrawl_date%2Csource_file_s&wt=json &group=true&group.field=url_norm &group.sort=abs(sub(ms(2010-06-18T14:33:29Z),%20crawl_date))%20asc

    gives us

    "groups": [ { "groupValue": "", "doclist": { "numFound": 331, "start": 0, "docs": [ { "source_file_s": "87154-32-20100624134901-00003-sb-prod-har-005.arc@7259371", "url": "", "crawl_date": "2010-06-24T13:51:10Z" } ] } }, { "groupValue": "", "doclist": { "numFound": 796, "start": 0, "docs": [ { "source_file_s": "86727-117-20100618142303-00001-sb-prod-har-006.arc@19369735", "url": "", "crawl_date": "2010-06-18T14:33:29Z" } ] } } ]

Et voilà: Reconstruction of a webpage from a given point in time, using only search and access to the (W)ARC-files.
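The grouped lookup in step 3 can be sketched as follows. The host, collection name, and URLs are placeholders, but the grouping and group.sort parameters mirror the request above, sorting each group by temporal distance to the originating page's crawl_date:

```python
from urllib.parse import urlencode

# crawl_date of the originating page (from step 1).
page_crawl_date = "2010-06-18T14:33:29Z"

# Group hits per unique URL; within each group, sort by the absolute
# difference between the hit's crawl_date and the page's crawl_date.
params = {
    "q": 'url:("http://example.org/logo.jpg" OR "http://example.org/style.css")',
    "rows": 5,
    "fl": "url,crawl_date,source_file_s",
    "wt": "json",
    "group": "true",
    "group.field": "url_norm",
    "group.sort": f"abs(sub(ms({page_crawl_date}),crawl_date)) asc",
}
url = "http://server/solr/netarchive/select?" + urlencode(params)
print(url)
```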

Richard Wallis: Is There Still a Case for Aggregations of Cultural Data

planet code4lib - Tue, 2015-06-09 13:10

A bit of a profound question – triggered by a guest post on Museums Computer Group by Nick Poole, CEO of The Collections Trust, about Culture Grid, with an overview of recent announcements about it.

Broadly the changes are that:

  • The Culture Grid closed to ‘new accessions’ (ie. new collections of metadata) on the 30th April
  • The existing index and API will continue to operate in order to ensure legacy support
  • Museums, galleries, libraries and archives wishing to contribute material to Europeana can still do so via the ‘dark aggregator’, which the Collections Trust will continue to fund
  • Interested parties are invited to investigate using the Europeana Connection Kit to automate the batch-submission of records into Europeana

The reasons he gave for the ending of this aggregation service are enlightening for all engaged with or thinking about data aggregation in the library, museum, and archives sectors.

Throughout its history, the Culture Grid has been tough going. Looking back over the past 7 years, I think there are 3 primary and connected reasons for this:

  • The value proposition for aggregation doesn’t stack up in terms that appeal to museums, libraries and archives. The investment of time and effort required to participate in platforms like the Culture Grid isn’t matched by an equal return on that investment in terms of profile, audience, visits or political benefit. Why would you spend 4 days tidying up your collections information so that you can give it to someone else to put on their website? Where’s the kudos, increased visitor numbers or financial return?
  • Museum data (and to a lesser extent library and archive data) is non-standard, largely unstructured and dependent on complex relations. In the 7 years of running the Culture Grid, we have yet to find a single museum whose data conforms to its own published standard, with the result that every single data source has required a minimum of 3-5 days and frequently much longer to prepare for aggregation. This has been particularly salutary in that it comes after 17 years of the SPECTRUM standard providing, in theory at least, a rich common data standard for museums;
  • Metadata is incidental. After many years of pump-priming applications which seek to make use of museum metadata it is increasingly clear that metadata is the salt and pepper on the table, not the main meal. It serves a variety of use cases, but none of them is ‘proper’ as a cultural experience in its own right. The most ‘real’ value proposition for metadata is in powering additional services like related search & context-rich browsing.

The first two of these issues represent a fundamental challenge for anyone aiming to promote aggregation. Countering them requires a huge upfront investment in user support and promotion, quality control, training and standards development.

The 3rd is the killer though – countering these investment challenges would be possible if doing so were to lead directly to rich end-user experiences. But they don’t. Instead, you have to spend a huge amount of time, effort and money to deliver something which the vast majority of users essentially regard as background texture.

As an old friend of mine would depressingly say – Makes you feel like packing up your tent and going home!

Interestingly, earlier in the post Nick gives us an insight into the purpose of Culture Grid:

… we created the Culture Grid with the aim of opening up digital collections for discovery and use …

That basic purpose is still very valid for both physical and digital collections of all types.  The what [helping people find, discover, view and use cultural resources] is as valid as it has ever been.  It is the how [aggregating metadata and building shared discovery interfaces and landing pages for it] that has been too difficult to justify continuing in Culture Grid’s case.

In my recent presentations to library audiences I have been asking a simple question: “Why do we catalogue?”  Sometimes immediately, sometimes after some embarrassed shuffling of feet, I inevitably get the answer “So we can find stuff!“.  In libraries, archives, and museums, helping people find the stuff we have is core to what we do – all the other things we do are a little pointless if people can’t find, or even be aware of, what we have.

If you are hoping your resources will be found they have to be referenced where people are looking.  Where are they looking?

It is exceedingly likely they are not looking in your aggregated discovery interface, or your local library, archive or museum interface either.  Take a look at this chart detailing the discovery starting point for college students and others.  Starting in a search engine is up in the high eighty percents, with things like library web sites and other targeted sources only just making it over the 1% hurdle to get on the chart.  We have known about this for some time – the chart comes from an OCLC report, ‘College Students’ Perceptions of Libraries and Information Resources‘, published in 2005.  I would love to see a similar report from recent times; it would have to include elements such as Siri, Cortana, and other discovery tools built into our mobile devices, which of course are powered by the search engines.  Makes me wonder how few cultural heritage specific sources would actually make that 1% cut today.

Our potential users are in the search engines in one way or another, yet in the vast majority of cases our [cultural heritage] resources are not there for them to discover.

Culture Grid, I would suggest, is probably not the only organisation, with an ‘aggregate for discovery’ reason for their existence, that may be struggling to stay relevant, or even in existence.

You may well ask about OCLC, with its iconic discovery interface. It is a bit simplistic to say that its 320 million plus bibliographic records are in WorldCat only for people to search and discover through the user interface.  Those records also underpin many of the services, such as cooperative cataloguing, record supply, inter-library loan, and general library back office tasks, etc., that OCLC members and partners benefit from.  Also, for many years WorldCat has been at the heart of syndication partnerships supplying data to prominent organisations, including Google, that reference resources and, via a find-in-a-library capability, lead to clicks onwards to individual libraries. [Declaration: OCLC is the company name on my current salary check]   Nevertheless, even though WorldCat has a broad spectrum of objectives, it is not totally immune from the influences that are troubling the likes of Culture Grid.  In fact they are one of the web trends that have been driving the Linked Data and efforts from the WorldCat team, but more of that later.

How do we get our resources visible in the search engines then?  By telling the search engines what we [individual organisations] have. We do that by sharing a relevant view of our metadata about our resources, not necessarily all of it, in a form that the search engines can easily consume. Basically this means sharing data embedded in your web pages, marked up using the vocabulary. To see how this works, we need look no further than the rest of the web – commerce, news, entertainment etc.  There are already millions of organisations, measured by domains, that share structured data in their web pages with the search engines using the vocabulary.  This data is being used to direct users with more confidence directly to a site, and is contributing to the global web of data.
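As a minimal illustration of the kind of markup involved (the book and all its values are invented), a library item page could embed a description as JSON-LD, assembled here in Python:

```python
import json

# A hypothetical library item described with the Book type.
book = {
    "@context": "http://schema.org",
    "@type": "Book",
    "name": "Example Title",
    "author": {"@type": "Person", "name": "Jane Example"},
    "isbn": "9780000000000",
}
# Embedding this as a <script type="application/ld+json"> block in the
# item's web page makes it consumable by search engine crawlers.
print(json.dumps(book, indent=2))
```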

There used to be a time when people in the commercial world complained of always ending up being directed to shopping [aggregation] sites instead of directly to where they could buy the TV or washing machine they were looking for.  Today you are far more likely to be given some options in the search engine that link you directly to the retailer.  I believe this is symptomatic of the disintermediation of the aggregators by individual syndication of metadata from those retailers.

Can these lessons be carried through to the cultural heritage sector – of course they can.  This is where there might be a bit of light at the end of the tunnel for those behind aggregations such as Culture Grid.  Not for continuation as an aggregation/discovery site, but as a facilitator for the individual contributors.  This stuff, when you first get into it, is not simple, and many organisations do not have the time and resources to understand how to share data about their resources with the web.  The technology itself is comparatively simple, in web terms; it is the transition and implementation that many may need help with. is not the perfect solution to describing resources; it is not designed to be. It is there to describe them sufficiently to be found on the web. Nevertheless it is also being evolved by community groups to enhance its capabilities. Through my work with the Schema Bib Extend W3C Community Group, enhancements to to enable better description of bibliographic resources have been successfully proposed and adopted.  This work is continuing towards a bibliographic extension –  There is obvious potential for other communities to help evolve and extend Schema to better represent their particular resources – archives for example. I would be happy to talk with others who want insights into how they may do this for their benefit. is not a replacement for our rich common data standards, such as MARC for libraries and SPECTRUM for museums as Nick describes. Those serve purposes beyond sharing information with the wider world, and should continue to be used for those purposes whilst relevant. However we can not expect the rest of the world to get its head around our internal vocabularies and formats in order to point people at our resources. It needs to be a compromise. We can continue to use what is relevant in our own sectors whilst sharing data so that our resources can be discovered and then explored further.

So to return to the question I posed – Is There Still a Case for Cultural Heritage Data Aggregation? – If the aggregation is purely for the purpose of supporting discovery, I think the answer is a simple no.  If it has broader purpose, such as for WorldCat, it is not as clear cut.

I do believe nevertheless that many of the people behind the aggregations are in the ideal place to help facilitate the eventual goal of making cultural heritage resources easily discoverable.  With some creative thinking, the adoption of ‘web’ techniques, technologies and approaches to provide facilitation services, and a review of what their real goals are [which may not include running a search interface], I believe we are moving into an era where shared authoritative sources of easily consumable data could make our resources more visible than we previously could have hoped.

Are there any black clouds on this hopeful horizon?  Yes, there is one, in the shape of traditional cultural heritage technology conservatism.  The tendency to assume that our vocabulary or ontology is the only way to describe our resources, coupled with a reticence to be seen to engage with the commercial discovery world, could still hold back the potential.

As an individual library, archive, or museum scratching your head about how to get your resources visible in Google, and not having the in-house ability to react, try talking within the communities around and behind the aggregation services you already know.  They should all be learning, and a problem shared is more easily solved.  None of this is rocket science, but trying something new is often better as a group.

Open Library Data Additions: Amazon Crawl: part dj

planet code4lib - Tue, 2015-06-09 07:21

Part dj of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

