Planet Code4Lib - http://planet.code4lib.org

Brown University Library Digital Technologies Projects: Search relevancy tests

Thu, 2015-06-11 13:37

We are creating a set of relevancy tests for the library’s Blacklight implementation.  These tests use predetermined phrases to search Solr, Blacklight’s backend, mimicking the results a user would retrieve.  This provides useful data that can be systematically analyzed.  We use the results of these tests to verify that users will get the results we, as application managers and librarians, expect.  It also will help us protect against regressions, or new, unexpected problems, when we make changes over time to Solr indexing schema or term weighting.
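For a sense of what one of these tests involves, here is a minimal, illustrative sketch in Python (this is not the test code linked below; the Solr URL, query parameters, and record ID are placeholders):

# Search Solr with a predetermined phrase and assert that an expected
# record appears near the top of the results. All values are placeholders.
import requests

SOLR_SELECT = "http://localhost:8983/solr/blacklight-core/select"

def top_ids(query, rows=10):
    params = {"q": query, "rows": rows, "fl": "id", "wt": "json"}
    response = requests.get(SOLR_SELECT, params=params)
    response.raise_for_status()
    return [doc["id"] for doc in response.json()["response"]["docs"]]

def test_known_title_search():
    # "b1234567" stands in for the record we expect users to find first.
    assert "b1234567" in top_ids('"the souls of black folk"', rows=5)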

This work is heavily influenced by colleagues at Stanford who have both written about their (much more thorough at this point) relevancy tests and developed a Ruby Gem to assist others with doing similar work.

We are still working to identify common and troublesome searches but have already seen benefits of this approach and used it to identify (and resolve) deficiencies in title weighting and searching by common identifiers, among other issues.  Our test code and test searches are available on Github for others to use as an example or to fork and apply to their own project.

Brown library staff who have examples of searches not producing expected results, please pass them on to Jeanette Norris or Ted Lawless.

— Jeanette Norris and Ted Lawless

Hydra Project: Booking for Hydra Connect 2015 open!

Thu, 2015-06-11 12:04

We are delighted to announce that booking for Hydra Connect 2015 is now open.  This year’s Connect takes place in Minneapolis, MN, from Monday September 21st to Thursday September 24th.  Details are at https://wiki.duraspace.org/display/hydra/Hydra+Connect+2015.  We intend to publish a draft program in the first week of July.

Hydra Connect meetings are intended to be the major annual event in the Hydra year.  Hydra advertises them with the slogan “as a Hydra Partner or user, if you can only make it to one Hydra meeting this academic year, this is the one to attend!”

The three-day meetings are preceded by an optional day of workshops.  The meeting proper is a mix of plenary sessions, lightning talks, show and tell sessions, and unconference breakouts.  The evenings are given over to a mix of conference-arranged activities and opportunities for private meetings over dinner and/or drinks!  The meeting program is aimed at existing users, managers and developers and at new folks who may be just “kicking the tires” on Hydra and who want to know more.

We hope to see you there!

 

Peter Sefton: Ozmeka: extending the Omeka repository to make linked-data research data collections for (any and) all research disciplines

Thu, 2015-06-11 11:10

Ozmeka: extending the Omeka repository to make linked-data research data collections for (any and) all research disciplines by Peter Sefton, Sharyn Wise, Katrina Trewin is licensed under a Creative Commons Attribution 4.0 International License.

[Update 2015-06-11, fixing typos]


Ozmeka: extending the Omeka repository to make linked-data research data collections for (any and) all research disciplines

Peter Sefton, University of Technology, Sydney, peter.sefton@uts.edu.au; Sharyn Wise, University of Technology, Sydney, Sharyn.Wise@uts.edu.au; Peter Bugeia, Intersect Australia Ltd, Sydney, peter.bugeia@intersect.org.au; Katrina Trewin, University of Western Sydney, k.trewin@uws.edu.au

There have been some adjustments to the authorship on this presentation: Peter Bugeia was on the abstract but didn’t end up contributing to the presentation, whereas Katrina Trewin withdrew her name from the proposal for a while, but then produced the Farms to Freeways collection and decided to come back into the fold. The notes here are written in the first person, to be delivered in this instance by Peter, but they come from all of the authors.

Abstract as submitted

The Ozmeka project is an Australian open source project to extend the Omeka repository system. Our aim is to support Open Scholarship, Open Science, and Cultural Heritage via repository software that can manage a wide range of Research (and Open) Data, both Open and access-restricted, providing rich repository services for the gathering, curation and publishing of diverse data sets. The Ozmeka project places a great deal of importance on integrating with external systems, to ensure that research data is linked to its context, and high quality identifiers are used for as much metadata as possible. This will include links to the ‘traditional’ staples of the Open Repositories conference series, publications repositories, and to the growing number of institutional and discipline research data repositories.

In this presentation we will take a critical look at how the Omeka system, extended with Ozmeka plugins and themes, can be used to manage (a) a large cross-disciplinary archive of research data about water resources, (b) an ethno-historiography built around a published book, and (c) large research data sets in a scientific institute, and talk about how this work paves the way for eResearch and repository support teams to supply similar services to researchers in a wide variety of fields. This work is intended to reduce the cost and complexity of creating new research data repository systems.

Slightly different scope now

I will be talking about Dharmae, the database of water-resources-themed research data; the project to put the book data into Omeka took a different turn, and the scientific data repository is still being developed.



How does this presentation fit in to the conference?

Which Conference Themes are we touching on?

  • Supporting Open Scholarship, Open Science, and Cultural Heritage

  • Managing Research (and Open) Data

  • Building the Perfect Repository

  • Integrating with External Systems

  • Re-using Repository Content

Things we want to cover:
  • A bit about the research data projects we’ve worked on.

  • How we’ve implemented Linked Data for metadata (stamping out strings!)

  • What about this Omeka thing?

(The picture is one I took of the conference hotel)



What’s Omeka?

We like to call Omeka the “WordPress of repositories”

It’s a PHP application which is easy to install and get up and running, and yes – it is a ‘repository’: it lets you upload digital objects and describe them with Dublin Core metadata. And no, it’s not perfect.



The Perfect Repository?

So let’s talk about this phrase “the perfect repository”. I have been following Jason Scott at the Internet Archive (who would make a great keynote speaker for this conference, by the way) and his work on rescuing and making available cultural heritage such as computer-related ephemera and programs for obsolete computing and gaming platforms. He uses the phrase “Perfect is the enemy of done” and talks about how making some tradeoffs and compromises and then just doing it means that stuff, you know, actually gets done that otherwise wouldn’t.

No, we’re not calling Omeka “third best”, but one of the points of this talk is that instead of waiting for or trying to build the ‘perfect’ research data repository Omeka is a low-barrier-to-entry, cheap way to build some kinds of working-data-repositories or data-publishing websites. I have talked to quite a few people who say they have looked at Omeka and decided that it is too simple, too limited for whatever project they were doing. Indeed, it does have some limitations; the two big ones are that it does not handle access control at all and it has no approval workflow, at least not in this version.

The quote on the slide is via the Wikipedia page Perfect is the Enemy of Good



The Portland Common Data Model

Omeka more-or-less implements a subset of the Portland Common Data Model, which I was introduced to yesterday in the Fedora workshop, although as I just mentioned it is not strong on access control, having only a published/unpublished flag on items.



Why Omeka? We’ll come back to this – but the ability to do Linked Data was one of the main attractions of Omeka. We had to add some code to make the relations appear like this, and to be easier to apply than in the ‘trunk’ version of Omeka 2.x, but that development was not hard or expensive compared to what it might have cost on top of other repository systems with more complex application stacks.

(Note – if you look at the current version of Dharmae, the item relations will appear a little differently, as not all the Ozmeka enhanced code has been rolled out).


Australian National Data Service (ANDS) funded project … to kick-start a major open data collection

I’m going to give you a quick backgrounder on our project by way of introduction: ANDS approached us with a funding opportunity to create an open data collection. Many of you will be familiar with the frustrations of funding rules: our constraint was that we were not allowed to develop software, although we could adapt it.

The UTS team put the word out for publishable research data collections but got little response. Then, thanks to the library eScholarship team, Sharyn met Professor Heather Goodall and Jodi Frawley, who had data from a large Oral History project on the impacts of water management on stakeholders in the Murray Darling Basin – called Talking Fish.

And they had had the amazing foresight – the foresight of the historian – to obtain informed consent to publicly archive the interview data.


Field science in MDB (from Dharmae)

In the image above, MDB means the Murray Darling Basin, a big, long river system with hardly any water in it.

First up I’ll talk about Dharmae. It was conceived as a multi-disciplinary data hub themed around water-related research, with the “ecocultures” concept intended to flag that we welcome scientific data contributors (ecological or otherwise), as well as cultural data, because both are equally crucial if we want research to have an impact on the world.

This position is also supported in the literature of the intergovernmental science policy community and environmental sustainability and resilience research.

One paper expressed it this way – for research to have a transformative impact, it’s not simply more knowledge that we need, but different types of knowledge.

The literature emphasizes the need for improved connectivity between knowledge systems: those applied to researching the natural world, such as science, and those that investigate socio-cultural practices such as social sciences, history and particularly also indigenous knowledge.

But because these different knowledge systems each come with their own practices and terminologies, we have an interesting information science problem:

How to support data deposit and discovery by users from all disciplines?


Linked data & disambiguation

Essentially by using linked data. We extended the open source repository Omeka by allowing all named entities (like places, people, and species) to be linked to an authoritative source of truth.

Let’s take location – it is one of the obvious correspondences between scientific and cultural data.

That still doesn’t mean it’s an easy thing to link on. Place names are rarely unique, as we see Kendell noticing above.

But by using authoritative sources, like Geonames, we can disambiguate place names, and better still we can derive their coordinates.
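To make that concrete, here is a rough sketch of the kind of lookup involved, assuming Python with the requests library and a registered GeoNames account (the username is a placeholder, and this is illustrative only, not the code used in Dharmae):

# Ask GeoNames for candidate matches for a place name and return their
# coordinates so a human (or further logic) can pick the right one.
import requests

def geonames_candidates(place_name, country="AU", max_rows=5):
    params = {
        "q": place_name,
        "country": country,
        "maxRows": max_rows,
        "username": "demo_user",  # placeholder: GeoNames requires a registered username
    }
    response = requests.get("http://api.geonames.org/searchJSON", params=params)
    response.raise_for_status()
    return [
        (g["name"], g.get("adminName1", ""), g["lat"], g["lng"], g["geonameId"])
        for g in response.json().get("geonames", [])
    ]

# Many Australian places share a name, so several candidates may come back.
for candidate in geonames_candidates("Darlington"):
    print(candidate)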

Now we want users of Dharmae who are interested in finding data by location to access it in the way that makes sense to them – and that may not be by name.


Lower Darling/Anabranch

In Dharmae readers can search by place name or they can use a map.

Here is one of 12 study regions from the Talking Fish data, showing the Lower Darling and Anabranch above Murray Sunset National Park.

We georeferenced these regions using a Geonode map server, but we have superimposed the researchers’ hand-drawn map as a layer on top to preserve the sense of human-scale interaction.

You can click through from here to read or listen to the oral histories completed in this region, look at photos or investigate the species identified by participants.

You can also search by Indigenous language Community if you prefer.

How else could this be useful?


Lower Darling/Anabranch:

It just so happens that we also have a satellite remote sensing dataset that corresponds reasonably well to this region above the national park.

It shows the Normalized Difference Vegetation Index for the region – that is, the vegetation change over the decade 1996-2006.

Relative increase in vegetation shows as green and relative decrease as pink.

Could the interviews with participants from that region provide any clues as to why?

I can’t tell you that, but the point is that the more we enrich and link data, the more possible hypotheses we can generate.


The Graph

Here’s the graph of our solution: We created local records, so that the Dharmae hub could maintain its own set of ‘world-views’ while still interfacing with the semantic web knowledge graph.

This design pattern is something we want to explore more: having a local record for an entity or concept, with links to external authorities. So, for example, we might use a DBpedia URI for a person, and quote a particular ‘approved’ version of the Wikipedia page about them, so there is a local, stable proxy for an external URI, but the local record is still part of the global graph. With the species data, this will allow researchers to explore the way the participants in Talking Fish talked about fish and compare this to what the Atlas of Living Australia says about nomenclature and distribution.
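To make the pattern concrete, here is a loose sketch of the kind of thing a local proxy record might hold (the field names and URIs are illustrative examples, not the actual Dharmae/Ozmeka schema):

# An illustrative local record that proxies external authorities: the local
# identifier is the stable thing we cite, while the external URIs keep the
# record connected to the global graph. Values here are examples only.
local_species_record = {
    "local_id": "ozmeka:species:murray-cod",
    "label": "Murray cod",
    "participant_terms": ["cod", "goodoo"],
    "same_as": [
        "http://dbpedia.org/resource/Murray_cod",  # example external authority URI
    ],
    "note": "Quote an 'approved' revision of the external page rather than the live one.",
}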



From the Journey to Horseshoe Bend website at the University of Western Sydney:

TGH Strehlow’s biographical memoir, Journey to Horseshoe Bend, is a vivid ethno-historiographic account of the Aboriginal, settler and Lutheran communities of Central Australia in the 1920’s. The ‘Journey to Horseshoe Bend’ project elaborates on Strehlow’s book in the form of an extensive digital hub – a database and website – that seeks to ‘visualise’ the key textual thematics of Arrernte* identity and sense of “place”, combined with a re-mapping of European and Aboriginal archival objects related to the book’s social and cultural histories.

Thus far the project has produced a valuable collection of unique historical and contemporary materials developed to encourage knowledge sharing and to initiate knowledge creation. By bringing together a wide variety of media – including photographs, letters, journals, Government files, audio recordings, moving images, newspaper, newsletters, interviews, manuscripts, an electronic version of the text and annotations – the researchers hope to ‘open out’ the histories of Central Australia’s Aboriginal, settler and missionary communities.

JTHB research work entailed creating annotations relating to sections of the book text. The existing book text, marked up with TEI, was converted to HTML and the annotations were anchored within the HTML. The plan was to create an Omeka plugin to display the text and co-display or footnote the annotations relating to each part of the text.
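For illustration, a conversion step along those lines might look roughly like this in Python with lxml (the file names and stylesheet are placeholders, not the actual JTHB workflow):

# Convert a TEI-encoded book text to HTML by applying an XSLT stylesheet.
# Both input files are placeholders for illustration.
from lxml import etree

tei_doc = etree.parse("journey_to_horseshoe_bend.tei.xml")
transform = etree.XSLT(etree.parse("tei_to_html.xsl"))
html_doc = transform(tei_doc)

with open("journey_to_horseshoe_bend.html", "wb") as out:
    out.write(etree.tostring(html_doc, pretty_print=True, method="html"))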

Issues

  • The existing annotations were incomplete and the research team wished to continue adding annotations and material. This meant that the HTML would need to be continuously edited (outside Omeka), giving rise to issues around workflow, researcher skills, and version control.
  • Cultural sensitivities were also a barrier to open publication (not an Omeka issue but a MODC one)


Katrina Trewin is a data librarian working at the University of Western Sydney. While the Journey to Horseshoe Bend project could not be completed using Omeka due to resource constraints, another project could be. Using Omeka, Katrina was able to build a web site around an oral-history data set without needing any development. This work took place in parallel with the work on Dharmae at UTS, so it was not able to make use of some of the innovations introduced in that project, such as enhancements to the Item Relations plugin to allow rich interlinking between resources.

Katrina’s notes:

Material had been in care of researcher for 20+ years.

  • Audio interviews on cassette, photographs, transcripts (some electronic)
  • Digitised all the material
  • Created csv files for upload of item metadata into Omeka
  • Once collections of items were created, then used exhibit plugin to bring material relating to each interviewee together.

This worked well because the collection was complete – it is fine to edit metadata in Omeka, but the items themselves need to be stable (unlike the JTHB text).

Omeka allows item-level description, which is not possible via the institutional repository. This could have been done in the Omeka interface but was more efficient via csv upload. The csv files, bundled item files, readme and Omeka xml output were made available from the institutional repository record for longer-term availability, as a hosting arrangement is not in place. Chambers, Deborah; Liston, Carol; Wieneke, Christine (2015): Interview material from Western Sydney women’s oral history project: ‘From farms to freeways: Women’s memories of Western Sydney’. University of Western Sydney. http://dx.doi.org/10.4225/35/555d661071c76
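As an illustration of the csv step, here is a minimal Python sketch that writes item metadata in the general shape Omeka’s CSV Import plugin can map to Dublin Core elements at upload time (the column headings and the example row are made up, not the Farms to Freeways data):

# Write item metadata to a csv file for upload into Omeka via CSV Import.
# Column names and the sample row are illustrative only; the plugin lets
# you map columns to Dublin Core elements when you import the file.
import csv

items = [
    {
        "Title": "Interview with participant 01",
        "Creator": "Chambers, Deborah",
        "Date": "1986",
        "Type": "Oral history audio",
        "File": "interview_01_side_a.mp3",
    },
]

with open("oral_history_items.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)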





Katrina and team have published all the data as a set of files, with a link to the website, in the institutional research data repository. This screenshot shows the data files available for download for re-use. My team at UTS are doing a similar thing with the Dharmae data.



At UTS we are constructing a growing ‘grid’ of research data services. This diagram is a sketch of how Omeka fits into this bigger picture, showing the Geonode mapping service, which supplies map display services and can harvest maps from Omeka as well. In this architecture, all items ultimately end up in an archival repository with a catalogue description, as I showed earlier for the Farms to Freeways data.



Interested? Check out / clone our Ozmeka GitHub repositories



Conclusion

Omeka is a very simple-seeming repository solution which is easy to dismiss for projects that demand the ‘perfect’ repository, but looking beyond its limitations it has some strengths that make it attractive for creating ‘micro repository services’ (Field & McSweeney 2014). Our work has made it easier to set up new research-data repositories that adhere to linked-data principles and create rich semantic web interfaces to data collections. This paves the way for a new generation of micro or workgroup-level research data repositories which link to and re-use a wide range of data sources.

References

Johnson, Ian. “Heurist Scholar,” 2014. http://heuristnetwork.org/.

Kucsma, Jason, Kevin Reiss, and Angela Sidman. “Using Omeka to Build Digital Collections: The METRO Case Study.” D-Lib Magazine 16, no. 3/4 (March 2010). doi:10.1045/march2010-kucsma.

Nahl, Diane. “A Discourse Analysis Technique for Charting the Flow of Micro-Information Behavior.” Journal of Documentation 63, no. 3 (2007): 323–39. doi:http://dx.doi.org.ezproxy.lib.uts.edu.au/10.1108/00220410710743270.

Palmer, Carole L., and Melissa H. Cragin. “Scholarship and Disciplinary Practices.” Annual Review of Information Science and Technology 42, no. 1 (2008): 163–212. doi:10.1002/aris.2008.1440420112.

Palmer, Carole L. “Thematic Research Collections”, Chapter 24 in Schreibman, Susan, Ray Siemens, and John Unsworth. Companion to Digital Humanities (Blackwell Companions to Literature and Culture). Hardcover. Blackwell Companions to Literature and Culture. Oxford: Blackwell Publishing Professional, 2004. http://www.digitalhumanities.org/companion/.

Simon, Herbert. “Rational Choice and the Structure of the Environment.” Psychological Review 63, no. 2 (1956): 129–38.

Strehlow, Theodor George Henry. Journey to Horseshoe Bend. [Sydney]: Angus and Robertson, 1969.


Credits
  • Researchers:
    • Prof. Heather Goodall
    • Dr Michelle Voyer
    • Associate professor Carol Liston
    • Dr Jodi Frawley
    • Dr Kevin Davies
  • eResearch: Sharyn Wise, Peter Sefton, Mike Lynch, Paul Nguyen, Mike Lake, Carmi Cronje, Thom McIntyre and Kevin Davies, Kim Heckenberg, Andrew Leahy, Lloyd Harischandra
  • Library: Duncan Loxton (eScholarship) & Kendell Powell (Aboriginal & Torres Strait Islander Data Archive Officer), Katrina Trewin, Michael Gonzalez
  • Thanks to: State Library of NSW Indigenous Unit, Atlas of Living Australia, Terrestrial Ecosystems Research Network and our funder, ANDS.

I didn’t have this slide when I presented, and forgot to acknowledge the contribution of all of the above, and anyone who’s been left off by accident.

Peter Murray: Thursday Threads: Advertising and Privacy, Giving Away Linux, A View of the Future

Thu, 2015-06-11 10:49

In just a few weeks there will be a gathering of 25,000 librarians in the streets of San Francisco for the American Library Association annual meeting. The topics on my mind as the meeting draws closer? How patrons intersect with advertising and privacy when using our services. What one person can do to level the information access divide using free software. Where technology in our society is going to take us next. Heady topics for heady times.

On a personal note: funding for my current position at LYRASIS runs out at the end of June, so I am looking for my next challenge. Check out my resume/c.v. and please let me know of job opportunities in library technology, open source, and/or community engagement.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted there are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.

Internet Users Don’t Care For Ads and Do Care About Privacy

In advertising, an old adage holds, half the money spent is wasted; the problem is that no one knows which half. This should be less of a problem in online advertising, since readers’ tastes and habits can be tracked, and ads tailored accordingly. But consumers are increasingly using software that blocks advertising on the websites they visit. If current trends continue, the saying in the industry may well become that half the ads aimed at consumers never reach their screens. This puts at risk online publishing’s dominant business model, in which consumers get content and services free in return for granting advertisers access to their eyeballs.

Block shock: Internet users are increasingly blocking ads, including on their mobiles, The Economist, 6-Jun-2015

A new report into U.S. consumers’ attitude to the collection of personal data has highlighted the disconnect between commercial claims that web users are happy to trade privacy in exchange for ‘benefits’ like discounts. On the contrary, it asserts that a large majority of web users are not at all happy, but rather feel powerless to stop their data being harvested and used by marketers.

The Online Privacy Lie Is Unraveling, by Natasha Lomas, TechCrunch, 6-Jun-2015

This week The Economist printed a story about how users are starting to use software in their desktop and mobile browsers to block advertisements, and what the reaction may be from websites that rely on advertising to fund their activities. I found it interesting that “younger consumers seem especially intolerant of intrusive ads” and as they get older, of course, more of the population would be using ad-blocking software. Reactions range from gentle prodding to support the website in other ways, lawsuits against the makers of ad-blocking software, and mixing advertising with editorial content.

Also this week the news outlet TechCrunch reported on a study by the Annenberg School for Communication on how “a majority of Americans are resigned to giving up their data” when they “[believe] an undesirable outcome is inevitable and [feel] powerless to stop it.” This sort of thing is coming up in the NISO Patron Privacy working group discussions that have occurred over the past couple weeks and will culminate in a day-and-a-half working meeting at ALA. It is also something that I have been blogging about recently as well.

Welcome to America: Here’s your Linux computer

So, the following Monday I delivered a lovely Core2Duo desktop computer system with Linux Mint 17.1 XFCE installed. This computer was recently surplussed from the public library where I work. Installed on the computer was:

  • LibreOffice, for writing and documenting
  • Klavaro, a touch-typing tutor
  • TuxPaint, a painting program for kids
  • Scratch, to learn computer programming
  • TeamViewer, so I can volunteer to remotely support this computer

In 10 years time, these kids and their mom may well remember that first Linux computer the family received. Tux was there, as I see it, waiting to welcome these youth to their new country. Without Linux, that surplussed computer might have gotten trashed. Now that computer will get two, four, or maybe even six more years use from students who really value what it has to offer them.

Welcome to America: Here’s your Linux computer, by Phil Shapiro, OpenSource.com, 5-June-2015

This is a heartwarming story of making something out of nearly nothing: a surplus computer, free software, and a little effort. This is a great example of how one person can make a significant difference for a needy family.

What Silicon Valley Can Learn From Seoul

“When I was in S.F., we called it the mobile capital of the world,” [Mike Kim] said. “But I was blown away because Korea is three or four years ahead.” Back home, Kim said, people celebrate when a public park gets Wi-Fi. But in Seoul, even subway straphangers can stream movies on their phones, deep beneath the ground. “When I go back to the U.S., it feels like the Dark Ages,” he said. “It’s just not there yet.”

What Silicon Valley Can Learn From Seoul, by Jenna Wortham, New York Times Magazine, 2-Jun-2015

What is moving the pace of technology faster than Silicon Valley? South Korea. Might that country’s citizens be divining the path that the rest of us will follow?


Eric Hellman: Protect Reader Privacy with Referrer Meta Tags

Thu, 2015-06-11 03:10
Back when the web was new, it was fun to watch a website monitor and see the hits come in. The IP address told you the location of the user, and if you turned on the referer header display, you could see what the user had been reading just before.  There was a group of scientists in Poland who'd be on my site regularly- I reported the latest news on nitride semiconductors, and my site was free. Every day around the same time, one of the Poles would check my site, and I could tell he had a bunch of sites he'd look at in order. My site came right after a Russian web site devoted to photographs of unclothed women.

The original idea behind the HTTP referer header (yes, that's how the header is spelled) was that webmasters like me needed it to help other webmasters fix hyperlinks. Or at least that was the rationalization. The real reason for sending the referer was to feed webmaster narcissism. We wanted to know who was linking to our site, because those links were our pats on the back. They told us about other sites that liked us. That was fun. (Still true today!)

The fact that my nitride semiconductor website ranked up there with naked Russian women amused me; reader privacy issues didn't bother me because the Polish scientist's habits were safe with me.


Twenty years later, the referer header seems like a complete privacy disaster. Modern web sites use resources from all over the web, and a referer header, including the complete URL of the referring web page, is sent with every request for those resources. The referer header can send your complete web browsing log to websites that you didn't know existed.

Privacy leakage via the referrer header plagues even websites that ostensibly believe in protecting user privacy, such as those produced by or serving libraries. For example, a request to the WorldCat page for What you can expect when you're expecting  results in the transmission of referer headers containing the user's request to the following hosts:
  • http://ajax.googleapis.com
  • http://www.google.com (with tracking cookies)
  • http://s7.addthis.com (with tracking cookies)
  • http://recommender.bibtip.de
None of the resources requested from these third parties actually need to know what page the user is viewing, but WorldCat causes that information to be sent anyway. In principle, this could allow advertising networks to begin marketing diapers to carefully targeted WorldCat users. (I've written about AddThis and how they sell data about you to advertising networks.)

It turns out there's an easy way to plug this privacy leak in HTML5. It's called the referrer meta tag. (Yes, that's also spelled correctly.)

The referrer meta tag is put in the head section of an HTML5 web page. It allows the web page to control the referer headers sent by the user's browser. It looks like this:

<meta name="referrer" content="origin" />
If this one line were used on WorldCat, only the fact that the user is looking at a WorldCat page would be sent to Google, AddThis, and BibTip. This is reasonable: library patrons typically don't expect their visits to a library to be private; they do expect that what they read there should be private.

Because use of third party resources is often necessary, most library websites leak lots of privacy in referer headers. The meta referrer policy is a simple way to stop it. You may well ask why this isn't already standard practice. I think it's mostly lack of awareness. Until very recently, I had no idea that this worked so well. That's because it's taken a long time for browser vendors to add support. Chrome and Safari have supported the referrer meta tag for more than two years, but Firefox only added it in January of 2015. Internet Explorer will support it with the Windows 10 release this summer. Privacy will still leak for users with older browser software, but this problem will gradually go away.

There are 4 options for the meta referrer tag, in addition to the "origin" policy. The origin policy sends only the host name for the originating page.

For the strictest privacy, use

<meta name="referrer" content="no-referrer" />

If you use this setting, other websites won't know you're linking to them, which can be a disadvantage in some situations. If the web page links to resources that still use the archaic "referer authentication", they'll break.

 The prevailing default policy for most browsers is equivalent to

<meta name="referrer" content="no-referrer-when-downgrade" />

"downgrade" here refers to http links in https pages.

If you need the referer for your own website but don't want other sites to see it you can use

<meta name="referrer" content="origin-when-cross-origin" />
Finally, if you want the user's browser to send the full referrer, no matter what, and experience the thrills of privacy brinksmanship, you can set

<meta name="referrer" content="unsafe-url" />
Widespread deployment of the referrer meta tag would be a big boost for reader privacy all over the web. It's easy to implement, has little downside, and is widely deployable. So let's get started!
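If you want a quick way to audit a page you maintain, here is a small illustrative sketch (assuming Python and the requests library; the URL is a placeholder) that fetches a page and reports whether it declares a referrer policy:

# Fetch a page and report the value of its referrer meta tag, if any.
# The URL is a placeholder; the regex is a rough check, not a full HTML parse.
import re
import requests

def referrer_policy(url):
    html = requests.get(url, timeout=10).text
    match = re.search(
        r'<meta\s+name=["\']referrer["\']\s+content=["\']([^"\']+)["\']',
        html,
        re.IGNORECASE,
    )
    return match.group(1) if match else None

print(referrer_policy("https://example.org/catalog/record/123") or "no referrer meta tag found")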


Peter Murray: Can Google’s New “My Account” Page be a Model for Libraries?

Thu, 2015-06-11 00:30

One of the things discussed in the NISO patron privacy conference calls has been the need for transparency with patrons about what information is being gathered about them and what is done with it. The recent announcement by Google of a "My Account" page and a privacy question/answer site got me thinking about what such a system might look like for libraries. Google and libraries are different in many ways, but one similarity we share is that people use both to find information. (This is not the only use of Google and libraries, but it is a primary use.) Might we be able to learn something about how Google puts users in control of their activity data? Even though our motivations and ethics are different, I think we can.

What the Google "My Account" page gives us

Last week I got an e-mail from Google that invited me to visit the new My Account page for "controls to protect and secure my Google account."

Google’s “My Account” home page

I think the heart of the page is the "Security Checkup" tool and the "Privacy Checkup" tool. The "Privacy Checkup" link takes you through five steps:

The five areas that Google offers when you run the “Privacy Checkup”.

  1. Google+ settings (including what information is shared in your public profile)
  2. Phone numbers (whether people can find you by your real life phone numbers)
  3. YouTube settings (default privacy settings for videos you upload and playlists you create)
  4. Account history (what of your activity with Google services is saved)
  5. Ads settings (what demographic information Google knows about you for tailoring ads)

These are broad brushes of control; the settings available here are pretty global. For instance, if you want to see your search history and what links you followed from the search pages, you would need to go to a separate page. In the “Privacy Checkup” the only option that is offered is whether or not your search history is saved. Still, for someone who wants to go with an “everything off” or “everything on” approach, the Privacy Checkup is a good way to do that.

Sidebar: I would also urge you to go through the “Security Checkup” as well. There you can change your password and account recovery options, see what other services have access to your Google account data, and make changes to account security settings.

The more in-depth settings can be reached by going to the "Personal Information and Privacy" page. This is a really long page, and you can see the full page content separately.

First part of the My Account “Personal Information and Privacy” page. The full screen capture is also available.

There you can see individual searches and search results that you followed.

My Account “Search and App Activity” page

Same with activity on YouTube.

My Account ‘YouTube Activity’ page

Google clearly put some thought and engineering time into developing this. What would a library version of this look like?

Google's Privacy site

The second item in the Google announcement was its privacy site. There they cover these topics:

  • What data does Google collect?
  • What does Google do with the data it collects?
  • Does Google sell my personal information?
  • What tools do I have to control my Google experience?
  • How does Google keep my information safe?
  • What can I do to stay safe online?

Each has a brief answer that leads to more information and sometimes to an action page like updating your password to something more secure or setting search history preferences.

Does this apply to libraries?

It could. It is clearly easier for Google because they have control over all the web properties and can do a nice integration like what is on the My Account page. We will have a more difficult task because libraries use many service providers and there are no programming interfaces libraries can use to pull together all the privacy settings onto one page. There isn't even consistency of vocabulary or setting labels that service providers could use to build such a page for making choices. Coming to an agreement on:

  1. how service providers should be transparent on what is collected, and
  2. how patrons can opt-in to data collection for their own benefit, see what data has been collected, and selectively delete and/or download that activity

…would be a significant step forward. Hopefully that is the level of detail that the NISO Patron Privacy framework can describe.


DuraSpace News: MOVING Content: Institutional Tools and Strategies for Fedora 3 to 4 Upgrations

Thu, 2015-06-11 00:00

Winchester, MA  The Fedora team has made tools that simplify content migration from Fedora 3 to Fedora 4 available to assist institutions in establishing production repositories. Using the Hydra-based Fedora-Migrate tool — which was built in advance of Penn State’s deadline to have Fedora 4 in production, before the generic Fedora Migration Utilities was released —  Penn State’s ScholarSphere moved all data from production instances of Fedora 3 to Fedora 4 in about 20 hours.

District Dispatch: Experts to talk library hacker spaces at 2015 ALA Annual Conference

Wed, 2015-06-10 18:45

Woman playing classic video games using Makey Makey and coins as controllers at ALA conference. Photo by Jenny Levine.

How can libraries ensure that learners of all ages stay curious, develop their passions, immerse themselves in learning? Learn about developing library learning spaces at this year’s 2015 American Library Association (ALA) Annual Conference in San Francisco. The interactive session, “Hacking the Culture of Learning in the Library,” takes place from 1:00–2:00p.m. on Sunday, June 28, 2015. The session will be held at the Moscone Convention Center in room 2018 of the West building.

Leaders will discuss ways that libraries serve as informal learning spaces that encourage exploration and discovery, while librarians lead in creating new opportunities to engage learners and make learning happen. During the session, library leaders will explore ways that libraries are creating incubation spaces to hack education and create new paradigms where learners own their education.

Speakers
  • Moderator: Christopher Harris, school library system director, Genesee Valley Educational Partnership; ALA Office for Information Technology Policy Fellow for Program on Youth and Technology Policy
  • Erica Compton, project coordinator, Idaho Commission for Libraries
  • Megan Egbert, youth services manager, Meridian Library District (Idaho)
  • Connie Williams, teacher librarian, Petaluma High School (Calif.)

View all ALA Washington Office conference sessions

The post Experts to talk library hacker spaces at 2015 ALA Annual Conference appeared first on District Dispatch.

Brown University Library Digital Technologies Projects: Best bets for library search

Wed, 2015-06-10 17:33

The library has added “best bets” to the new easySearch tool.  Best bets are commonly searched for library resources.  Examples include JSTOR, Pubmed, and Web of Science.  Searches for these phrases (as well as known alternate names and misspellings) will return a best bet highlighted at the top of the search results.

To get started, 64 resources have been selected as best bets and are available now via easySearch.  As we would like to know how useful this feature is, please leave us feedback.

Thanks to colleagues at North Carolina State University for leading the way in adding best bets to library search and writing about their efforts.

Technical details

Library staff analyzed search logs to find commonly used search terms and matched those terms to appropriate resources.  The name, URL, and description for each resource are entered into a shared Google Spreadsheet.  A script runs regularly to convert the spreadsheet data into Solr documents and posts the updates to a separate Solr core.  The Blacklight application searches for best bet matches when users enter a search into the default search box.

Since the library maintains a database of e-resources, in many cases only the identifier for a resource is needed to populate the best bets index.  The indexing script is able to retrieve the resource from the database and use that information to create the best bet.  This eliminates maintaining data about the resources in multiple places.
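For illustration, the spreadsheet-to-Solr step could look roughly like the following Python sketch; the spreadsheet export URL, core name, and field names are placeholders, not the library's actual script:

# Convert best-bet rows (exported from the shared spreadsheet as csv) into
# Solr documents and post them to a separate best-bets core. All URLs and
# field names are placeholders.
import csv
import io
import requests

SHEET_CSV_URL = "https://docs.google.com/spreadsheets/d/EXAMPLE/export?format=csv"
BEST_BETS_UPDATE = "http://localhost:8983/solr/best_bets/update?commit=true"

rows = csv.DictReader(io.StringIO(requests.get(SHEET_CSV_URL).text))
docs = [
    {
        "id": row["name"],
        "name_t": row["name"],
        "url_s": row["url"],
        "description_t": row["description"],
    }
    for row in rows
]

response = requests.post(BEST_BETS_UPDATE, json=docs)
response.raise_for_status()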

SearchHub: Indexing Performance in Solr 5.2 (now twice as fast)

Wed, 2015-06-10 16:31
About this time last year (June 2014), I introduced the Solr Scale Toolkit and published some indexing performance metrics for Solr 4.8.1. Solr 5.2.0 was just released and includes some exciting indexing performance improvements, especially when using replication.

Before we get into the details about what we fixed, let's see how things have improved empirically. Using Solr 4.8.1 running in EC2, I was able to index 130M documents into a collection with 10 shards and a replication factor of 2 in 3,727 seconds (~62 minutes) using ten r3.2xlarge instances; please refer to my previous blog post for specifics about the dataset. This equates to an average throughput of 34,881 docs/sec. Today, using the same dataset and configuration, with Solr 5.2.0, the job finished in 1,704 seconds (~28 minutes), which is an average of 76,291 docs/sec. To rule out any anomalies, I reproduced these results several times while testing release candidates for 5.2. To be clear, the only notable difference between the two tests is a year of improvements to Lucene and Solr!

So now let's dig into the details of what we fixed. First, I cannot stress enough how much hard work and sharp thinking has gone into improving Lucene and Solr over the past year. Also, special thanks goes out to Solr committers Mark Miller and Yonik Seeley for helping identify the issues discussed in this post, recommending possible solutions, and providing oversight as I worked through the implementation details. One of the great things about working on an open source project is being able to leverage other developers' expertise when working on a hard problem.

Too Many Requests to Replicas

One of the key observations from my indexing tests last year was that replication had higher overhead than one would expect. For instance, when indexing into 10 shards without replication, the test averaged 73,780 docs/sec, but with replication, performance dropped to 34,881. You'll also notice that once I turned on replication, I had to decrease the number of Reducer tasks (from 48 to 34) I was using to send documents to Solr from Hadoop to avoid replicas going into recovery during high-volume indexing. Put simply, with replication enabled, I couldn't push Solr as hard.

When I started digging into the reasons behind replication being expensive, one of the first things I discovered is that replicas receive up to 40x the number of update requests from their leader when processing batch updates, which can be seen in the performance metrics for all request handlers on the stats panel in the Solr admin UI. Batching documents into a single request is a common strategy used by client applications that need high-volume indexing throughput. However, batches sent to a shard leader are parsed into individual documents on the leader, indexed locally, and then streamed to replicas using ConcurrentUpdateSolrClient. You can learn about the details of the problem and the solution in SOLR-7333. Put simply, Solr's replication strategy caused CPU load on the replicas to be much higher than on the leaders, as you can see in the screenshots below.

CPU Profile on Leader

CPU Profile on Replica (much higher than leader)

Ideally, you want all servers in your cluster to have about the same amount of CPU load. The fix provided in SOLR-7333 helps reduce the number of requests and CPU load on replicas by sending more documents from the leader per request when processing a batch of updates.

However, be aware that the batch optimization is only available when using the JavaBin request format (the default used by CloudSolrClient in SolrJ); if your indexing application sends documents to Solr using another format (JSON or XML), then shard leaders won't utilize this optimization when streaming documents out to replicas. We'll likely add a similar solution for processing other formats in the near future.

Version Management Lock Contention

Solr adds a _version_ field to every document to support optimistic concurrency control. Behind the scenes, Solr's transaction log uses an array of version "buckets" to keep track of the highest known version for a range of hashed document IDs. This helps Solr detect if an update request is out-of-date and should be dropped. Mark Miller ran his own indexing performance tests and found that expensive index housekeeping operations in Lucene can stall a Solr indexing thread. If that thread happens to be holding the lock on a version bucket, it can stall other threads competing for the lock. To address this, we increased the default number of version buckets used by Solr's transaction logs from 256 to 65536, which helps reduce the number of concurrent requests that are blocked waiting to acquire the lock on a version bucket. You can read more about this problem and solution in SOLR-6820. We're still looking into how to deal with Lucene using the indexing thread to perform expensive background operations, but for now it's less of an issue.

Expensive Lookup for a Document's Version

When adding a new document, the leader sets the _version_ field to a long value based on the CPU clock time; incidentally, you should use a clock synchronization service for all servers in your Solr cluster. Using the YourKit profiler, I noticed that replicas spent a lot of time trying to look up the _version_ for new documents to ensure update requests were not re-ordered. Specifically, the expensive code was where Solr attempts to find the internal Lucene ID for a given document ID. Of course for new documents there is no existing version, so Solr was doing a fair amount of wasted work looking for documents that didn't exist. Yonik pointed out that if we initialize the version buckets used by the transaction log to the maximum value of the _version_ field before accepting new updates, then we can avoid this costly lookup for every new document coming into the replica. In other words, if a version bucket is seeded with the max value from the index, then when new documents arrive with a version value that is larger than the current max, we know this update request has not been reordered. Of course the max version for each bucket gets updated as new documents flow into Solr. Thus, as of Solr 5.2.0, when a Solr core initializes, it seeds version buckets with the highest known version from the index; see SOLR-7332 for more details. With this fix, when a replica receives a document from its leader, it can quickly determine if the update was reordered by consulting the highest value of the version bucket for that document (based on a hash of the document ID). In most cases, the version on an incoming document to a replica will have a higher value than the version bucket, which saves an expensive lookup to the Lucene index and increases overall throughput on replicas. If by chance the replica sees a version that is lower than the bucket max, it will still need to consult the index to ensure the update was not reordered.

These three tickets taken together achieve a significant increase in indexing performance and allow us to push Solr harder now. Specifically, I could only use 34 reducers with Solr 4.8.1 but was able to use 44 reducers with 5.2.0 and still remain stable. Lastly, if you're wondering what you need to do to take advantage of these fixes, you only need to upgrade to Solr 5.2.0; no additional configuration changes are needed. I hope you're able to take advantage of these improvements in your own environment, and please file JIRA requests if you have other ideas on how to improve Solr indexing performance. The Solr Scale Toolkit has been upgraded to support Solr 5.2.0 and the dataset I used is publicly shared on S3 if you want to reproduce these results.
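As a footnote on the client-side batching strategy mentioned above: even from a non-Java client you can and should batch documents per request, although, as noted, the leader-to-replica optimization in SOLR-7333 currently applies only to JavaBin/SolrJ traffic. A rough Python sketch, with a placeholder collection name and fields:

# Send documents to Solr in batches rather than one request per document,
# committing once at the end. Collection name, batch size, and fields are
# illustrative placeholders.
import requests

UPDATE_URL = "http://localhost:8983/solr/perf_test/update"
BATCH_SIZE = 500

def index_in_batches(docs):
    for start in range(0, len(docs), BATCH_SIZE):
        batch = docs[start:start + BATCH_SIZE]
        requests.post(UPDATE_URL, json=batch).raise_for_status()
    requests.get(UPDATE_URL, params={"commit": "true"}).raise_for_status()

index_in_batches([{"id": str(i), "title_s": "document %d" % i} for i in range(5000)])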

The post Indexing Performance in Solr 5.2 (now twice as fast) appeared first on Lucidworks.

LITA: Sunday Routines: Susan Sharpless Smith

Wed, 2015-06-10 14:58

In this series, inspired by the New York Times’ Sunday Routines, we gain a glimpse into the lives of the people behind LITA. This post focuses on Susan Sharpless Smith, who was recently elected 2015-2018 Director-at-Large.

Susan Sharpless Smith is an Associate Dean at Wake Forest University’s Z. Smith Reynolds Library. She’s been in that role since 2011, but has worked in a range of positions at that library since 1996. Her current job provides a wide variety of responsibilities and opportunities, and fills her week with interesting, meaningful professional work.

Sunday is the day Susan reserves to enjoy her family and her interests. It normally unfolds slowly. Susan is an early riser, often heading for the first cup of coffee and the Sunday newspaper before 6 am. In the summer, the first hour of the day is spent watching the day emerge from her screen porch in Winston-Salem, NC. She is not a big TV watcher but always tunes into the Today Show on Sunday mornings.

Bicycling is one of Susan’s passions, so a typical Sunday will include a 15-40 mile bike ride, either around town or out into the surrounding countryside. It’s her belief that bicycling is good for the soul. It also is one of the best ways to get acquainted with new places, so a bike ride is always on her agenda when traveling. Plans are already underway for a San Francisco bike excursion during ALA!

Susan’s second passion is photography, so whatever she is up to on any given Sunday, a camera accompanies her. (The best camera is the one you have with you!). She has been archiving her photographs on Flickr since 2006 and has almost 10,500 of them. Her most relaxing Sunday evening activity is settling in on her MacBook Air to process photos from that day in Photoshop.

Her son and daughter are grown, so often the day is planned around a family gathering of some sort. This can involve a road trip to Durham, an in-town Sunday brunch, a drive to North Carolina wine country or a hike at nearby Hanging Rock.

Her best Sunday would be spent in her favorite place in the world, at her family’s beach house in Rehoboth Beach, DE. Susan’s family has been going there for vacations since she was a child. It’s where she heads whenever she has a long weekend and wants to recharge. Her perfect Sunday is spent there (either for real, or in her imagination when she can’t get away). This day includes a sunrise walk on the beach, a morning bike ride on the boardwalk stopping for a breakfast while people-watching, reading a book on the beach, eating crab cakes for every meal (there’s no good ones in Piedmont North Carolina), a photo-shoot at the state park, a kayak trip on the bay and an evening at Funland riding bumper cars and playing skeeball. It doesn’t get any better than that!

Open Library Data Additions: Amazon Crawl: part eo

Wed, 2015-06-10 11:29

Part eo of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Peter Murray: My View of the NISO Patron Privacy Working Group

Tue, 2015-06-09 21:34

Yesterday Bobbi Newman posted Thinking Out Loud About Patron Privacy and Libraries on her blog. Both of us are on the NISO committee to develop a Consensus Framework to Support Patron Privacy in Digital Library and Information Systems, and her article sounded a note of discouragement that I hope to dispel while also outlining what I’m hoping to see come out of the process. I think we share a common belief: the privacy of our patrons’ activity data is paramount to the essence of being a library. I want to pull out a couple of sentences from her post:

Libraries negotiate with vendors on behalf of their patrons. Library users trust the library, and the choices librarians make need to be worthy of that trust.

Wholeheartedly agree!

Librarians should be able to tell users exactly what information vendors are collecting about them and what they are doing with that data.

This is why I am engaged in the NISO effort. As librarians, I don’t think we do have a good handle on the patron activity data that we are collecting and the intersection of our service offerings with what third parties might do with it. Eric Hellman lays out a somewhat dark scenario in his Towards the Post-Privacy Library? article published in the recent American Libraries Digital Futures supplement. 1 What I’m hoping comes out of this is a framework for awareness and a series of practices that libraries can take to improve patron privacy.

  1. A statement of principles of what privacy means for library patrons in the highly digital distributed environment that we are in now.
  2. A recognition that protecting privacy is an incremental process, so we need something like the “SANS Critical Security Controls” (https://www.sans.org/critical-security-controls) to help libraries take an inventory of their risks and to seek resources to address them.
  3. A “Shared Understanding” between service subscribers and service providers around expectations for privacy.

A statement of principles…

We have lived through a radical shift in how information and services are delivered to patrons, and I’d argue we haven’t thought through the impacts of that shift. There was a time when libraries collected information just in case for the needs of their patrons: books, journals/periodicals, newspapers and the catalogs and indexes that covered them. Not so long ago — at least in my professional lifetime — we were making the transition from paper indexes to CD-ROM indexes. We saw the beginnings of online delivery in services like Dialog and FirstSearch, but for the most part everything was under our roof.

Nowadays, however, we purchase or subscribe to services where information is delivered just in time. Gone are the days of shelf-after-shelf of indexes and the tedium of swapping CD-ROMs in large towers. “The resource is online, and it is constantly updated!” we trumpeted. And in recent years even the library’s venerable online catalog is often hosted by service providers. It makes for more efficient information delivery, but it also brings more actors into the interaction between our patrons and our information providers. It is that reality we need to account for, and to educate each other on the privacy implications of those new actors.

A recognition that protecting privacy is an incremental practice…

One of the important lessons from the information security field is that protecting software systems is never “done” — it is never a checklist or a completed audit or a one-time task. Security professionals developed the “Critical Security Controls” list to get a handle on persistent and new forms of attack. From the introduction:

Over the years, many security standards and requirements frameworks have been developed in attempts to address risks to enterprise systems and the critical data in them. However, most of these efforts have essentially become exercises in reporting on compliance and have actually diverted security program resources from the constantly evolving attacks that must be addressed. … The Critical Security Controls focuses first on prioritizing security functions that are effective against the latest Advanced Targeted Threats, with a strong emphasis on “What Works” – security controls where products, processes, architectures and services are in use that have demonstrated real world effectiveness.

The current edition has 20 items, and they are listed in a priority order. If an organization does nothing more than the first five, then it has already done a lot to protected itself from the most common threats.

Patron privacy needs to be addressed in the same way. There are things we can do that will have the most impact for the effort, and once we get a handle on those then we can move on to other less impactful areas. And just as the SANS organization regularly convenes professionals to review and make recommendations based on new threats and practices, so too must our “critical privacy controls” be updated as new service models are introduced and new points of privacy threats are found.

A shared understanding…

Libraries will not be able to raise the privacy levels of their patrons’ activities without involving the service providers that we now rely on. At the second open teleconference of the NISO patron privacy effort, I briefly presented my thoughts on why a shared understanding between libraries and service providers was important. I found it interesting that during the same teleconference we identified a need from service providers for a “service level agreement” of sorts that covers how libraries must react to detected breaches in proxy systems2. With NISO acting as an ideal intermediary, the parties can come together and create a shared understanding of what each needs in this highly distributed world.

Making Progress

The TL;DR-at-the-bottom summary? Take heart, Bobbi. I think we are seeing the part of the process where a bunch of ideas are thrown out (including mine above!) and we begin the steps to condense all of those ideas into a plan of action. I, for one, am not interested in improving services at the expense of our core librarian ethic to hold in confidence the activities of our patrons. I don’t see it as a matter of matching the competition; in fact, I see this activity as a distinguishing characteristic for libraries. This week the news outlet TechCrunch reported on a study by the Annenberg School for Communication on how “a majority of Americans are resigned to giving up their data” when they “[believe] an undesirable outcome is inevitable and [feel] powerless to stop it.” If libraries can honestly say — because we’ve studied the issue and proactively protected patrons — that we are a reliable source of exceptional information provided in a way that is respectful of the patron’s desire to control how their activity information is used, then I think we have a good story to tell and a compelling service to offer.

Footnotes
  1. While I have a stage, can I point out this tweet: “Is there any irony in ALA's 'Digital Futures' document being a hunkin' flash app leading to a 4.5MB PDF? http://t.co/hjrO14buxr” — Peter Murray (@DataG) May 28, 2015


  2. A proxy server, which enables a patron to access third-party information services while off the library’s network, also acts as an anonymizing agent of sorts. The service provider only sees the aggregate activity of all patrons coming through the proxy server. That, however, makes it impossible for a service provider to fight off a bulk-download attack without help from the library.

SearchHub: What’s new in Apache Solr 5.2

Tue, 2015-06-09 19:46
Apache Lucene and Solr 5.2.0 were just released with tons of new features, optimizations, and bug fixes. Here are the major highlights from the release:

Rule based replica assignment

This feature gives users fine-grained control over the placement of new replicas during collection, replica, and shard creation. A rule is a set of conditions on shard, replica, and a tag that must be satisfied before a replica can be created. This can be used to restrict replica creation in ways like the following (a rough sketch of such a call appears after this list):
  • Keep less than 2 replicas of a collection on any node
  • For a shard, keep less than 2 replicas on any node
  • (Do not) Create shards on a particular rack or host.
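As a rough sketch of how the first restriction might be expressed when creating a collection (the rule syntax and parameter values below are assumptions, so check them against the blog post linked next before relying on them):

http://localhost:8983/solr/admin/collections?action=CREATE&name=techproducts&numShards=2&replicationFactor=2&rule=replica:<2,node:*

Here the rule asks that fewer than 2 replicas of the collection be placed on any single node.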
More details about this feature are available in this blog post: https://lucidworks.com/blog/rule-based-replica-assignment-solrcloud/

Restore API

Solr already provided a feature to back up an existing index using a call like:

http://localhost:8983/solr/techproducts/replication?command=backup&name=backup_name

The new restore API allows you to restore an existing backup via a command like:

http://localhost:8983/solr/techproducts/replication?command=restore&name=backup_name

The location of the index backup defaults to the data directory but can be overridden by the location parameter.

JSON Facet API

unique() facet function

The unique facet function is now supported for numeric and date fields. Example:

json.facet={ unique_products : "unique(product_type)" }

The “type” parameter: flatter requests

There’s now a way to construct a flatter JSON Facet request using the “type” parameter. The following request from 5.1:

top_authors : { terms : { field:author, limit:5 } }

is equivalent to this request in 5.2:

top_authors : { type:terms, field:author, limit:5 }

mincount parameter and range facets

The mincount parameter is now supported by range facets to filter out ranges that don’t meet a minimum document count. Example:

prices:{ type:range, field:price, mincount:1, start:0, end:100, gap:10 }

Multi-select faceting

A new parameter, excludeTags, disregards any matching tagged filters for that facet. Example:

q=cars
&fq={!tag=COLOR}color:black
&fq={!tag=MODEL}model:xc90
&json.facet={
  colors:{type:terms, field:color, excludeTags=COLOR},
  model:{type:terms, field:model, excludeTags=MODEL}
}

The above example shows a request where a user selected “color:black”. This query would do the following:
  • Get a document list with the filter applied.
  • colors facet:
    • Exclude the color filter so you get back facets for all colors instead of just getting the color ‘black’.
    • Apply the model filter.
  • Similarly compute facets for the model i.e. exclude the model filter but apply the color filter.
hll facet function

The JSON facet API has an option to use the HyperLogLog implementation for computing unique values. Example:

json.facet={ unique_products : "hll(product_type)" }

Choosing facet implementation

Before Solr 5.2, interval faceting had a different implementation than range faceting, one based on DocValues, which at times is faster and doesn’t rely on filters and the filter cache. Solr 5.2 adds support for choosing between the filter-based and DocValues-based implementations. Functionally, the results of the two are the same, but there could be a difference in performance. The facet.range.method parameter allows for specifying the implementation to be used (a rough usage sketch appears at the end of this section). Some numbers on the performance of the two methods can be found here: https://issues.apache.org/jira/browse/SOLR-7406?focusedCommentId=14497338&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14497338

Stats component

The Solr stats component now has support for HyperLogLog-based cardinality estimation, which is also used by the new JSON facet API. The cardinality option uses the probabilistic “HyperLogLog” (HLL) algorithm to estimate the cardinality of the sets in a fixed amount of memory. It also allows for tuning the cardinality parameter, which lets you trade off accuracy against the amount of RAM used at query time, with relatively minor impact on response time performance. More about this can be read here: https://lucidworks.com/blog/hyperloglog-field-value-cardinality-stats/

Solr security

SolrCloud allows for hosting multiple collections within the same cluster but, until 5.1, didn’t provide a mechanism to restrict access. The authentication framework in 5.2 allows for plugging in a custom authentication plugin or using the Kerberos plugin that ships out of the box. This allows requests to Solr to be authenticated. The authorization framework allows for implementing a custom plugin to authorize access to resources in a SolrCloud cluster. Here’s a Solr Reference Guide link for the same: https://cwiki.apache.org/confluence/display/solr/Security

Solr streaming expressions

Streaming expressions provide a simple query language for SolrCloud that merges search with parallel computing. This builds on the Solr streaming API introduced in 5.1. The Solr Reference Guide has more information: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions

Other features

A few configurations in Solr need to be in place as part of the bootstrapping process, before the first Solr node comes up, e.g. to enable SSL. The CLUSTERPROP call provides an API to do so but requires a running Solr instance. Starting with Solr 5.2, a cluster-wide property can be added, edited, or deleted using the zkcli script and doesn’t require a running Solr instance.

On the spatial front, this release introduces the new spatial field type RptWithGeometrySpatialField, based on CompositeSpatialStrategy, which blends RPT indexes for speed with serialized geometry for accuracy. It includes a Lucene segment based in-memory shape cache.

There is now a refactored Admin UI built using AngularJS. This new UI isn’t the default but an optional interface, so users can report issues and provide feedback before it migrates to become the default UI. The new UI can be accessed at: http://hostname:port/solr/index.html

Though it’s an internal detail, it’s certainly an important one: Solr has been upgraded internally to use Jetty 9. This allows us to move towards using async calls and more.
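The facet.range.method parameter and the stats cardinality option are described above without inline examples, so here is a rough sketch of typical usage. The collection and field names (techproducts, price, product_type) are illustrative assumptions, and the parameters should be verified against the Solr Reference Guide:

Range faceting using the DocValues-based method:

http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&facet=true&facet.range=price&facet.range.start=0&facet.range.end=100&facet.range.gap=10&facet.range.method=dv

HyperLogLog cardinality estimate from the stats component:

http://localhost:8983/solr/techproducts/select?q=*:*&rows=0&stats=true&stats.field={!cardinality=true}product_type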
Indexing performance improvement

This release also comes with a substantial indexing performance improvement, bumping throughput by almost 100% compared to Solr 4.x. Watch out for a blog post on that real soon.

Beyond the features and improvements listed above, Solr 5.2.0 also includes many other new features as well as numerous optimizations and bug fixes from the corresponding Apache Lucene release. For more information, the detailed change logs for both Lucene and Solr can be found here:

Lucene: http://lucene.apache.org/core/5_2_0/changes/Changes.html
Solr: http://lucene.apache.org/solr/5_2_0/changes/Changes.html

Featured image by David Precious

The post What’s new in Apache Solr 5.2 appeared first on Lucidworks.

LITA: LITA Annual Report, 2014-2015

Tue, 2015-06-09 16:41

As we reflect on 2014-2015, it’s fair to say that LITA, despite some financial challenges, has had numerous successes and remains a thriving organization. Three areas – membership, education, and publications – bring in the most revenue for LITA. Of those, membership is the largest money generator. However, membership has been on a decline, a trend that’s been seen across the American Library Association (ALA) for the past decade. In response, the Board, committees, interest groups, and many individuals have been focused on improving the member experience to retain current members and attract potential ones. With all the changes to the organization and leadership, LITA is on the road to becoming profitable again and will remain one of ALA’s most impactful divisions.

Read more in the LITA Annual Report.

DPLA: Digital Public Library of America receives $96,000 grant from the Whiting Foundation to expand its impact in education

Tue, 2015-06-09 14:45

The Digital Public Library of America (DPLA) is pleased to announce that it has received $96,000 from the Whiting Foundation to begin creating resources for users in K-12 and higher education. The grant will allow DPLA to develop and share primary source sets built on the foundation of national educational standards and under the guidance of a diverse group of education experts. DPLA will also refine tools for creating user-generated content so that students and teachers can curate their own resources as part of the learning process.

“We are very grateful to the Whiting Foundation for their continued support of our education work, and we look forward to connecting a growing community of students with the rich primary source materials provided by our partners,” said Franky Abbott, DPLA’s Project Manager and American Council of Learned Societies Public Fellow.

This grant builds on DPLA’s recent Whiting-funded efforts to understand the ways large­-scale digital collections can best adapt their resources to address classroom needs. This work culminated in a comprehensive research paper published in April 2015.

“The growing collection of primary sources created by the Digital Public Library of America and its partners has the potential to become an unparalleled educational resource for teachers, students – and indeed anyone with a spark of curiosity and an internet connection,” said Whiting Foundation Executive Director Daniel Reid. “The Whiting Foundation is proud to continue our support for the DPLA’s work to build out new features and content to meet this purpose more effectively.”

“DPLA seeks not only to bring together openly available materials, but to maximize their use, and education is an essential realm for that use,” said Dan Cohen, DPLA’s Executive Director. “Thanks to the Whiting Foundation, we can move forward with the creation of easy-to-use resources, which we believe will help students and teachers in both K-12 and college.”

If you are interested in learning more about DPLA’s education work or getting involved, please email education@dp.la.

# # #

About the Digital Public Library of America

The Digital Public Library of America (http://dp.la) strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated over 10 million items from over 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

About the Whiting Foundation

The Whiting Foundation (www.whitingfoundation.org) has supported scholars and writers for more than forty years. This grant is part of the Foundation’s efforts to infuse the humanities into American public culture.

OCLC Dev Network: Change to Dewey Web Services

Tue, 2015-06-09 14:00

Dewey.info is not available at this time. There is no projected date yet for the return of Dewey.info.

FOSS4Lib Upcoming Events: What's New in Archivematica 1.4

Tue, 2015-06-09 13:57
Date: Tuesday, June 16, 2015 - 11:00 to 12:00
Supports: Archivematica

Last updated June 9, 2015. Created by Peter Murray on June 9, 2015.

From the meeting announcement:

Please join us for a free webinar highlighting what's new in Archivematica version 1.4, released last month. The webinar will be one hour long, with 45 minutes for demonstration and 15 minutes for question and answer.

Date: June 16
Time: 11 am - 12 PM Pacific Standard Time
Topics: General usability enhancements, CONTENTdm integration, improvements to bag ingest and more!

FOSS4Lib Upcoming Events: ArchivesSpace Member Meeting

Tue, 2015-06-09 13:53
Date: Saturday, August 22, 2015 - 13:00 to 17:00
Supports: ArchivesSpace

Last updated June 9, 2015. Created by Peter Murray on June 9, 2015.

From the meeting announcement:
