Feed aggregator

State Library of Denmark: Sudden Solr performance drop

planet code4lib - Fri, 2014-10-31 20:33

There we were, minding other assignments and keeping a quarter of an eye on our web archive indexer and searcher. The indexer finished its 900GB Solr shard number 22 and the searcher was updated to a total of 19TB / 6 billion documents. With a bit more than 100GB free for disk cache (or about 1/2 percent of total index size), things were relatively unchanged, compared to ~120GB free a few shards ago. We expected no problems. As part of the index update, an empty Solr was created as entry-point, with a maximum of 3 concurrent connections, to guard against excessive memory use.

But something was off. User-issued searches seemed slower. Quite a lot slower for some of them. Time for a routine performance test and comparison with old measurements.

Graph: 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index

As the graph shows very clearly, response times rose sharply with the number of hits in a search in our 19TB index. At first glance that seems natural, but as the article Ten times faster explains, this should be a bell curve, not an ever-upgoing hill. The bell curve can be seen for the old 12TB index. Besides, those new response times were horrible.

Investigating the logs showed that most of the time was spent resolving facet terms for fine-counting. There are hundreds of those for the larger searches, and the log said it took 70ms for each, neatly explaining total response times of 10 or 20 seconds. Again, this would not have been surprising if we were not used to much better numbers. See Even sparse faceting is limited for details.

A Systems guy turned off swap, then cleared the disk cache, as disk cache clearing has helped us before in similar puzzling situations. That did not help this time: Even non-faceted searches had outliers above 10 seconds, which is 10 times worse than with the 12TB index.

Due to unrelated circumstances, we then raised the number of concurrent connections for the entry-point-Solr from 3 to 50 and restarted all Solr instances.

Graph: 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index, post Solr-restart

Welcome back, great performance! You were sorely missed. The spread as well as the average for the 19TB index is larger than its 12TB counterpart, but not drastically so.

So what went wrong?

  • Did the limiting of concurrent searches at the entry-Solr introduce a half-deadlock? That seems unlikely, as the low-level logs showed the unusually high 70ms/term lookup time, which happens without contacting the other Solrs.
  • Did the Solr-restart clean up OS-memory somehow, leading to better overall memory performance and/or disk caching?
  • Were the Solrs somehow locked in a state with bad performance? Maybe a lot of garbage collection? Their Xmx is 8GB, which has been fine since the beginning: As each shard runs in a dedicated Tomcat, the new shards should not influence the memory requirements of the Solrs handling the old ones.

We don’t know what went wrong and which action fixed it. If performance starts slipping again, we’ll focus on trying one thing at a time.

Why did we think clearing the disk cache might help?

It is normally advisable to use Huge Pages when running a large Solr server. Whenever a program requests memory from the operating system, this is done in pages. If the page size is small and the system has a lot of memory, there will be a lot of bookkeeping: with 4KB pages, 256GB of RAM means roughly 67 million pages to track, while 2MB huge pages bring that down to about 131,000. It makes sense to use larger pages and have less bookkeeping.

Our indexing machine has 256GB of RAM, a single 32GB Solr instance and constantly starts new Tika processes. Each Tika process takes up to 1GB of RAM and runs for an average of 3 minutes. 40 of these are running at all times, so at least 10GB of fresh memory is requested from the operating system each minute.

We observed that the indexing speed of the machine fell markedly after some time, down to 1/4th of the initial speed. We also observed that most of the processing time was spent in kernel space (the %sy in a Linux top). Systems theorized that we had a case of OS memory fragmentation due to the huge pages. They tried flushing the disk cache (echo 3 >/proc/sys/vm/drop_caches) to reset part of the memory, and performance was restored.

A temporary fix of clearing the disk cache worked fine for the indexer, but the lasting solution for us was to disable the use of huge pages on that server.

The searcher got the same no-huge-pages treatment, which might have been a mistake. Contrary to the indexer, the searcher rarely allocates new memory and as such looks like an obvious candidate for using huge pages. Maybe our performance problems stemmed from too much bookkeeping of pages? Not fragmentation as such, but simply the size of the structures? But why would it help to free most of the memory and re-allocate it? Does size and complexity of the page-tracking structures increase with use, rather than being constant? Seems like we need to level up in Linux memory management.

Note: I strongly advise against using repeated disk cache flushing as a solution. It is symptom curing and introduces erratic search performance. But it can be very useful as a poking stick when troubleshooting.

On the subject of performance…

The astute reader will have noticed that the performance difference is strange at the 10³ mark. This is because the top of the bell curve moves to the right as the number of shards increases. See Even sparse faceting is limited for details.

In order to make the performance comparison apples-to-apples, the no_cache numbers were used. Between the 12TB and the 19TB mark, sparse facet caching was added, providing a slight speed-up to distributed faceting. Let’s add that to the chart:

Graph: 256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index, post Solr-restart

 

Although the index size was increased by 50%, sparse facet caching kept performance at the same level or better. It seems that our initial half-dismissal of the effectiveness of sparse facet caching was not fair. Now we just need to come up with similar software improvements each month and we will never need to buy new hardware.

Do try this at home

If you want to try this on your own index, simply use sparse solr.war from GitHub.


LITA: Free Web Tools for Top-Notch Presentations

planet code4lib - Fri, 2014-10-31 17:00

Visually appealing and energizing slideshows are the lifeblood of conference presentations. But using animated PowerPoints or zooming Prezis to dizzy audiences delivers little more appeal than packing slides with text on a low-contrast background. Key to winning hearts and minds are visual flair AND minimalism, humor, and innovative use of technology.

Memes

Delightfully whimsical, memes are a fantastic ice-breaker and laugh-inducer. My last two library conference presentations used variants of the crowd-pleasing “One does not simply…” Boromir meme, which never fails to generate laughter and praise. Memes.com offers great selections, is free of annoying popup ads, and is less likely than other meme generators to be blocked by your workplace’s Internet filters for being “tasteless.” (Yes, I speak from personal experience…)

Keep Calm-o-matic 

Do you want your audience to chuckle and identify with you? Everyone who’s ever panicked or worked under a deadline will appreciate the Keep Calm-o-matic. As with memes, variations are almost infinite.

Recite This

Planning to include quotations on some of your slides? Simply copy and paste your text into Recite This, then select an aesthetically pleasing template in which the quote will appear. Save time, add value.

Wordle

This free web tool enables you to paste text or a URL to generate a groovy word cloud. Vary sizes, fonts, and color schemes too. Note that Wordle’s Java applet refuses to function smoothly in Chrome. There are other word cloud generators, but Wordle is still gold.

Dictation

This is the rare dictation tool that doesn’t garble what you say, at least not excessively. It’s free, online, and available as a Chrome app. Often when preparing presentations, I simply start talking and then read over what I said. This is a valuable exercise in prewriting and a way to generate zingers and lead-ins to substantive content.

Poll Everywhere

Conduct live polls of your audience using texting and Twitter! Ask open-ended or multiple-choice questions and then watch the live results appear on your PowerPoint slide or web browser.  Poll Everywhere and equivalents such as EverySlide engage audiences and heighten interest more than a mere show of hands, especially for larger audiences in which many members otherwise would not be able to contribute to the discussion. Use whenever appropriate.

Emaze

This online presentation software offers incredible visual appeal and versatility without inducing either vertigo or snoozes. Create your slides in the browser, customize a range of attractive templates, and access from any device with an Internet connection (major caveat, that). You must pay to go premium to download slideshows, but this reservation aside, the free version is an outstanding product.

DoNotLink

Ever attempted to show a website containing misinformation or hate speech as part of an information literacy session but didn’t want to drive traffic to the site? DoNotLink is your friend! Visit or link to shady sites without increasing their search engine ranking.

Serendip-o-matic

Simply paste some text, and this serendipity search tool will draw on the Digital Public Library of America (DPLA), Flickr, Europeana, and other open digital repositories to produce related photographs, art, and documents that are visually displayed. Serendip-o-matic reveals unexpected connections between diverse materials and offers good, nerdy fun to boot. “Let your sources surprise you!”

So . . . what free web tools do you use to jazz up your presentations?

Riley Childs: TTS Video

planet code4lib - Fri, 2014-10-31 15:59

(Video is on its way; there is an issue with the camera on my laptop.)

Hello, I am Riley Childs, a 17-year-old student at Charlotte United Christian Academy. I am deeply involved there and am in charge/support of our network, *nix servers, viz servers, library stuff and of course end-user computers. One of my favorite things to do is work in the Library and administer IT awesomeness. I also work in the theater at CPCC as an electrician. Another thing that I love to do is participate in a community called code4lib, where I assist others and post about library technology. I also post to the Koha mailing list, where I help out others who have issues with Koha. Overall I love technology, and I believe in the freedom of information; that is why I love librarians, because they are all about the distribution of information. In addition to all this indoor stuff I also enjoy a good day hike and like to go backpacking every once in a while.
Once again, I am very sorry this isn’t a video; I will try to post one soon (I kinda jumped the gun on submitting my app!).
Thanks
//Riley

The post TTS Video appeared first on Riley's blog at https://rileychilds.net.


OCLC Dev Network: Learn More About Software Development Practices at November Webinars

planet code4lib - Fri, 2014-10-31 15:15

We're excited to announce two new webinars based on our recent popular blog series covering some of our favorite software development practices. Join Karen Coombs as she walks you through a collection of tools designed to close communication gaps throughout the development process. Registration for both 1-hour webinars is free and now open.

David Rosenthal: This is what an emulator should look like

planet code4lib - Fri, 2014-10-31 15:00
Via hackaday, [Jörg]'s magnificently restored PDP10 console, connected via a lot of wiring to a BeagleBone running the SIMH PDP10 emulator. He did the same for a PDP11. Two computers that gave me hours of harmless fun back in the day!

Kids today have no idea what a computer should look like. But even they can run [Jörg]'s Java virtual PDP10 console!

Islandora: Islandora 7.x-1.4 Release Announcement

planet code4lib - Fri, 2014-10-31 13:42

I am extremely pleased to announce the release of Islandora 7.x-1.4!

This is our second community release, and I couldn't be more happy with how much we've grown and progressed as a community. This software has continued to improve because of you!

We have an absolutely amazing team to thank for this:

Adam Vessey
Alan Stanley
Dan Aiken
Donald Moses
Ernie Gillis
Gabriela Mircea
Jordan Dukart
Kelli Babcock
Kim Pham
Kirsta Stapelfeldt
Lingling Jiang
Mark Jordan
Melissa Anez
Nigel Banks
Paul Pound
Robin Dean
Sam Fritz
Sara Allain
Will Panting
 

Now for the release info!

Release notes and download links are here along with updated documentation, and you can grab an updated VM here (sandbox.islandora.ca will be updated soon).

I'd like to highlight a few things. This release includes 48 bug fixes since the last release, and 23 documentation improvements. Along with those improvements, we have two new modules: Islandora Videojs (an Islandora viewer module using Video.js) and Islandora Solr Views (exposes Islandora Solr search results as a Drupal view).

Our next release will be out in April. If you would like to be a part of the release team (you’ll get an awesome t-shirt!!!), keep an eye out on the list for a call for 7.x-1.5 volunteers. We’ll need folks as component managers, testers, and documenters.

That's all I have for now.

cheers!

-nruest

Library of Congress: The Signal: An Online Event & Experimental Born Digital Collecting Project: #FolklifeHalloween2014

planet code4lib - Fri, 2014-10-31 12:14

If you haven’t heard, as the title of the press release explains, the Library of Congress Seeks Halloween Photos For American Folklife Center Collection.  As of writing this morning, there are now 288 photos shared on Flickr with the #folklifehalloween2014 tag. If you browse through the results, you can see a range of ways folks are experiencing, seeing, and documenting Halloween and Dia de los Muertos. Everyone has until November 5th to participate. So send this, or some of the links in this post, along to a few other people to spread the word.

Svayambhunath Buddha O’Lantern, Shared by user birellsalsh on Flickr

Because of the nature of this event, you can follow along in real time and see how folks are responding to this in the photostream. See the American Folklife Center’s blog posts on this for a more in-depth explanation and some additional context of this project and a set of step-by-step directions about how people can participate. As this is still a live and active event, I wanted to make sure we had a post up about it today for people to share these links with others.

Consider emailing a link to this to any shutterbug friends and colleagues you have. In particular, there is an explicit interest in photos that document the diverse range of communities’ experiences of the holiday. So if you are part of an often underrepresented community it would be great to see that point of view in the photo stream. With that noted, I also wanted to take this opportunity to highlight some of the things about this event that I think are relevant to the digital collecting and preservation focus of The Signal.

Rapid Response Collecting & a Participatory Online Event

Aside from the fun of this project (I mean, its people’s Halloween photos!) I am interested to see how it plays out as a potential mode of contemporary collecting. I think there is a potential for this kind of public event focused on documenting our contemporary world to fit in with ideas like “rapid response collecting” that the Victoria and Albert Museum has been forwarding as well as notions of shared historical authority and conceptions of public participation in collection development.

We can’t know how this will end up playing out over the next few days of the event. However, I can already see how something like this could serve cultural institutions as a means to work with communities to document, interpret and share our perspectives on themes and issues that cultural heritage organizations document and collect in.

Oh and just a note of thanks to Adriel Luis, who shared a bit of wisdom and lessons learned from his work at the Asian Pacific American Center on the Day in the Life of Asian Pacific American event.

So, consider helping to spread the word and sharing some photos!

LibUX: Mobile Users are Demanding

planet code4lib - Fri, 2014-10-31 10:24

As library (public and academic) and higher ed websites approach their mobile moment, it is more crucial than ever that new sites, services, redesigns, whatever are optimized for performance. I would even go so far as to say that speed is more important than a responsive layout, but it’s obviously better to improve the former by optimizing the latter all in one go.

There are caveats: it may take more time upfront to develop a performant mobile-first responsive website. This is an important distinction. As I mentioned in this ACRL TechConnect article, not all responsive websites are created equal.

38% of smartphone users have screamed at, cursed at, or thrown their phones when pages take too long to load.

Anticipate the user trying to check library hours from the road in peak-time rush hour traffic on an iPhone 4S over 3G. Latency alone (the time it takes just to communicate with the server) will take 2 seconds.

Your website has just 2 seconds to load at your patron’s point of need before a certain percentage will give up, which may literally affect your foot traffic. Rather than chance the library being closed, your potential patron may change plans. After 10 seconds, 30% will never return to the site.

This data is from Radware’s 2014 State of the Union: Ecommerce Page Speed & Performance

The post Mobile Users are Demanding appeared first on LibUX.

Casey Bisson: Unit test WordPress plugins like a ninja (in progress)

planet code4lib - Fri, 2014-10-31 03:53

cc-by Zach Dischner

Unit testing a plugin can be easy, but if the plugin needs dashboard configuration or has dependencies on other plugins, it can quickly go off the tracks. And if you haven’t set up Travis integration, you’re missing out.

Activate Travis CI

To start with, go sign in to Travis now and activate your repos for testing. If you’re not already using Github to host the plugin, please start there.

Set the configuration

If your plugin needs options to be set that are typically set in the WP dashboard, do so in tests/bootstrap.php.

In bCMS, I’m doing a simple update_option( 'bcms_searchsmart', '1' ) just before loading the plugin code. For that plugin, that option is checked when loading the components. That’s not ideal, but it works for this plugin (until I refactor the plugin to solve the shortcomings this exposes).
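For reference, here is a minimal sketch of what such a tests/bootstrap.php can look like, assuming the standard WordPress plugin test scaffold; the WP_TESTS_DIR fallback path and the bcms.php filename are illustrative assumptions, not copied from the repo:

    <?php
    // tests/bootstrap.php (sketch) -- assumes the stock WordPress
    // plugin test harness; paths and file names are illustrative.
    $_tests_dir = getenv( 'WP_TESTS_DIR' ) ?: '/tmp/wordpress-tests-lib';

    require_once $_tests_dir . '/includes/functions.php';

    tests_add_filter( 'muplugins_loaded', function () {
        // Set the options the plugin expects from the dashboard
        // *before* the plugin code loads and checks them.
        update_option( 'bcms_searchsmart', '1' );

        // Load the plugin under test.
        require dirname( __DIR__ ) . '/bcms.php';
    } );

    require $_tests_dir . '/includes/bootstrap.php';

An option set this way behaves just like one saved from the dashboard, so the plugin’s own checks run unmodified.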

Download and activate dependencies

Some plugins depend on others. An example is bStat, which depends on libraries from GO-UI. The dependency in that case is appropriate, but it can add frustration to unit testing. To solve that problem, I’ve made some changes to download the plugins in the Travis environment and activate them for the tests.

It starts with the tests/dependencies-array.php, where I’ve specified the plugins and their repo paths. That file is used both by bin/install-dependencies.php, which downloads the plugin in Travis, and tests/bootstrap.php, where the plugins are included at runtime.
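Concretely, the shared file can be as simple as an array that both scripts read. This is only a hypothetical sketch of the pattern; the real repo names and file layout in bStat’s test suite may differ:

    <?php
    // tests/dependencies-array.php (sketch) -- plugin slug => GitHub repo.
    // Read by bin/install-dependencies.php (to download in Travis) and
    // by tests/bootstrap.php (to include the plugins at runtime).
    return array(
        'go-ui' => 'example-org/go-ui', // hypothetical owner/repo path
    );

And in tests/bootstrap.php the same list can drive inclusion, assuming the install script drops each plugin into the test install’s plugins directory and each plugin’s main file matches its slug:

    foreach ( array_keys( require __DIR__ . '/dependencies-array.php' ) as $slug ) {
        require_once WP_PLUGIN_DIR . "/{$slug}/{$slug}.php";
    }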

Of course, if those additional plugins need configuration settings, then do that in the tests/bootstrap.php as in the section above.

DuraSpace News: The Islandora Foundation Releases Islandora 7.1.4

planet code4lib - Fri, 2014-10-31 00:00

From Melissa Anez, Islandora Foundation

Charlottetown, Prince Edward Island, CA  The Islandora Foundation is extremely pleased to announce the release of Islandora 7.x-1.4.

District Dispatch: After privacy glitch, the ball is now in our court

planet code4lib - Thu, 2014-10-30 21:28

Photo by John Leben Art Prints via Deviant Art

Last week, Adobe announced that with its software update (Digital Editions 4.0.1), the collection and transmission of user data has been secured. Adobe was true to its word that a fix would be made by the week of October 20 correcting this apparent oversight.

For those who might not know, a recap: Adobe Digital Editions is widely used software in the e-book trade for both library and commercial ebook transactions to authenticate legitimate library users, apply DRM to encrypt e-book files, and in general facilitate the e-book circulation process, such as deleting an e-book from a device after the loan period has expired. Earlier in October, librarians and others discovered that the new Adobe Digital Editions software (4.0) had a tremendous security and privacy glitch. A large amount of unencrypted data reflecting e-book loan and purchase transactions was being collected and transmitted to Adobe servers.

The collection of data “in the clear” is a hacker’s dream because it can be so easily obtained. Information about books, including publisher, title and other metadata, was also unencrypted, raising alarms about reader privacy and the collection of personal information. Some incorrectly reported that Adobe was scanning hard drives and spying on readers. After various librarians conducted a few tests, they confirmed that Adobe was not scanning or spying, but nonetheless this was clearly a security nightmare and alleged assault on reader privacy.

ALA contacted Adobe about the breach and asked to talk to Adobe about what was going on. Conversations did take place and Adobe responded to several questions raised by librarians.

Now that the immediate problem of unencrypted data is fixed, let’s step back and consider what we have learned and ponder what to do next.

We learned that few librarians have the knowledge base to explain how these software technologies work. To a great extent, users (librarians and otherwise) do not know what is going on behind the curtain (without successfully hacking various layers of encryption).

We can no longer ensure user privacy by simply destroying circulation records, or refusing to reveal information without a court order. This just isn’t enough in the digital environment. Data collection is a permanent part of the digital landscape. It is lucrative and highly valued by some, and is often necessary to make things work.

We learned that most librarians continue to view privacy as a fundamental value of the profession, and something we should continue to support through awareness and action.

We should hold vendors and other suppliers to account—any data collected to enable services should be encrypted, retained for only as long as necessary, with no personal information collected, shared or sold.

What’s next? We have excellent policy statements regarding privacy, but we do not have a handy dandy guide to help us and our library communities understand how digital technologies work and how they can interfere with reader privacy. We need a handy dandy guide with diagrams and narrative that is not too technicalese (new word, modeled after “legalese”).

We have to inform our users that whenever they key in their name for a service or product, all privacy bets are off. We need to understand how data brokers amass boat loads of data and what they do with it. We need to know how to opt out of data collection when possible, or never opt in in the first place. We need to better inform our library communities.

A good suggestion is to collaborate with vendors and other suppliers and not just talk to one another at the license negotiating table. By working together we can renew our commitment to privacy. The vendors have extended an invitation by asking to work with us on best practices for privacy. Let’s RSVP “yes.”

The post After privacy glitch, the ball is now in our court appeared first on District Dispatch.

District Dispatch: Webinar archive available: “$2.2 Billion reasons libraries should care about WIOA”

planet code4lib - Thu, 2014-10-30 21:04

Photo by the Knight Foundation

On Monday, more than one thousand people participated in the American Library Association’s (ALA) webinar “$2.2 Billion Reasons to Pay Attention to WIOA,” an interactive webinar that focused on ways that public libraries can receive funding for employment skills training and job search assistance from the recently-passed Workforce Innovation and Opportunity Act (WIOA).

During the webinar, leaders from the Department of Education and the Department of Labor explored the new federal law. Watch the webinar.

An archive of the webinar is available now:

The Workforce Innovation and Opportunity Act allows public libraries to be considered additional One-Stop partners, prohibits federal supervision or control over selection of library resources and authorizes adult education and literacy activities provided by public libraries as an allowable statewide employment and training activity. Additionally, the law defines digital literacy skills as a workforce preparation activity.

View slides from the webinar presentation:

Webinar speakers included:

  • Susan Hildreth, director, Institute of Museum and Library Services
  • Kimberly Vitelli, chief of Division of National Programs, Employment and Training Administration, U.S. Department of Labor
  • Heidi Silver-Pacuilla, team leader, Applied Innovation and Improvement, Office of Career, Technical, and Adult Education, U.S. Department of Education

We are in the process of developing a WIOA Frequently Asked Questions guide for library leaders—we’ll publish the report on the District Dispatch shortly. Subscribe to the District Dispatch, ALA’s policy blog, to be alerted when additional WIOA information becomes available.

The post Webinar archive available: “$2.2 Billion reasons libraries should care about WIOA” appeared first on District Dispatch.

District Dispatch: Fun with Dick and Jane, and Stephen Colbert

planet code4lib - Thu, 2014-10-30 17:28

Photo by realworldracingphotog via Flickr

The Library Copyright Alliance (LCA) issued this letter (pdf) in response to Stephen Colbert’s suggestion that librarians just “make up” data. Enjoy!

The post Fun with Dick and Jane, and Stephen Colbert appeared first on District Dispatch.

Library of Congress: The Signal: Gossiping About Digital Preservation

planet code4lib - Thu, 2014-10-30 15:35

ANTI-ENTROPY by user 51pct on Flickr.

In September the Library held its annual Designing Storage Architectures for Digital Collections meeting. The meeting brings together technical experts from the computer storage industry with decision-makers from a wide range of organizations with digital preservation requirements to explore the issues and opportunities around the storage of digital information for the long-term. I always learn quite a bit during the meeting and more often than not encounter terms and phrases that I’m not familiar with.

One I found particularly interesting this time around was the term “anti-entropy.”  I’ve been familiar with the term “entropy” for a while, but I’d never heard “anti-entropy.” One definition of “entropy” is a “gradual decline into disorder.” So is “anti-entropy” a “gradual coming-together into order?” Turns out that the term has a long history in information science and is important to get an understanding of some very important digital preservation processes regarding file storage, file repair and fixity checking.

The “entropy” we’re talking about when we talk about “anti-entropy” might also be called “Shannon Entropy” after the legendary information scientist Claude Shannon. His ideas on entropy were elucidated in a 1948 paper called “A Mathematical Theory of Communication” (PDF), developed while he worked at Bell Labs. For Shannon, entropy was the measure of the unpredictability of information content. He wasn’t necessarily thinking about information in the same way that digital archivists think about information as bits, but the idea of the unpredictability of information content has great applicability to digital preservation work.

“Entropy” here is the “noise” that begins to slip into information processes over time. It made sense that computer science would co-opt the term, and in that context “anti-entropy” has come to mean “comparing all the replicas of each piece of data that exist (or are supposed to) and updating each replica to the newest version.” In other words, what information scientists call “bit flips” or “bit rot” are examples of entropy in digital information files, and anti-entropy protocols (a subtype of “gossip” protocols) use methods to ensure that files are maintained in their desired state. This is an important concept to grasp when designing digital preservation systems that take advantage of multiple copies to ensure long-term preservability, LOCKSS being the most obvious example of this.
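To make the replica-comparison idea concrete, here is a toy sketch of an anti-entropy pass, written in PHP purely for illustration; it is not how LOCKSS itself works (LOCKSS uses sampled opinion polls among peer caches), and the file paths are hypothetical. The commented steps line up with the monitoring, consensus and failure-detection ideas discussed below.

    <?php
    // Toy anti-entropy pass over replicas of one preservation file.
    // Illustrative only: real systems use sampling, polling and
    // versioning rather than naive majority-copying.
    function anti_entropy_pass( array $replica_paths ) {
        // Monitoring: fingerprint every replica.
        $digests = array();
        foreach ( $replica_paths as $path ) {
            $digests[ $path ] = hash_file( 'sha256', $path );
        }

        // Consensus: the digest shared by the most replicas wins.
        $votes = array_count_values( $digests );
        arsort( $votes );
        $good_digest = key( $votes );
        $good_copy   = array_search( $good_digest, $digests, true );

        // Failure detection and repair: bring any disagreeing
        // replica back in line with the consensus copy.
        foreach ( $digests as $path => $digest ) {
            if ( $digest !== $good_digest ) {
                copy( $good_copy, $path );
                error_log( "anti-entropy: repaired $path" );
            }
        }
    }

    // Example: three copies of the same file on different storage nodes.
    anti_entropy_pass( array(
        '/mnt/nodeA/items/0001.tiff',
        '/mnt/nodeB/items/0001.tiff',
        '/mnt/nodeC/items/0001.tiff',
    ) );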

gossip_bench by user ricoslounge on Flickr.

Anti-entropy and gossip protocols are the means to ensure the automated management of digital content that can take some of the human overhead out of the picture. Digital preservation systems invoke some form of content monitoring in order to do their job. Humans could do this monitoring, but as digital repositories scale up massively, the idea that humans can effectively monitor the digital information under their control with something approaching comprehensiveness is a fantasy. Thus, we’ve got to be able to invoke anti-entropy and gossip protocols to manage the data.

An excellent introduction to how gossip protocols work can be found in the paper “GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems.”  The authors note three key parameters to gossip protocols: monitoring, failure detection and consensus.  Not coincidentally, LOCKSS “consists of a large number of independent, low-cost, persistent Web caches that cooperate to detect and repair damage to their content by voting in “opinion polls” (PDF). In other words, gossip and anti-entropy.

I’ve only just encountered these terms, but they’ve been around for a long while.  David Rosenthal, the chief scientist of LOCKSS, has been thinking about digital preservation storage and sustainability for a long time and he has given a number of presentations at the LC storage meetings and the summer digital preservation meetings.

LOCKSS is the most prominent example in the digital preservation community on the exploitation of gossip protocols, but these protocols are widely used in distributed computing. If you really want to dive deep into the technology that underpins some of these systems, start reading about distributed hash tables, consistent hashing, versioning, vector clocks and quorum in addition to anti-entropy-based recovery. Good luck!

One of the more hilarious anti-entropy analogies was recently supplied by the Register, which suggested that a new tool that supports gossip protocols “acts like [a] depressed teenager to assure data reliability” and “constantly interrogates itself to make sure data is ok.”

You learn something new every day.

LibUX: Web for Libraries: The UX Bandwagon

planet code4lib - Thu, 2014-10-30 15:26

This issue of The Web for Libraries was mailed Wednesday, October 29th, 2014. Want to get the latest from the cutting-edge web made practical for libraries and higher ed every Wednesday? You can subscribe here!

The UX Bandwagon

Is it a bad thing? Throw a stone and you’ll hit a user experience talk at a library conference (or even a whole library conference). There are books, courses, papers, more books, librarians who understand the phrase “critical rendering path,” this newsletter, this podcast, interest groups, and so on.

It is the best fad that could happen for library perception. The core concept behind capital-u Usability is continuous data-driven decision making that invests in the library’s ability to iterate upon itself. Usability testing that stops is usability testing done wrong. What’s more, libraries concerned with UX are thus concerned about measurable outward perception – marketing–which libraries used to suck at–that can neither be haphazard nor half-assed. This bandwagon values experimentation, permits change, and increases the opportunities to create delight.

Latest Podcast: A High-Functioning Research Site with Sean Hannan

Sean Hannan talks about designing a high-functioning research site for the Johns Hopkins Sheridan Libraries and University Museums. It’s a crazy fast API-driven research dashboard mashing up research databases, LibGuides, and a magic, otherworldly carousel actually increasing engagement. Research tools are so incredibly difficult to build well, especially when libraries rely so heavily on third parties, that I’m glad to have taken the opportunity to pick Sean’s brain. You can catch this and every episode on Stitcher, iTunes, or on the Web.

Top 5 Problems with Library Websites – a Review of Recent Usability Studies

Emily Singley looked at 16 library website usability studies over the past two years and broke down the biggest complaints. Can you guess what they are?

“Is the semantic web still a thing?”

Jonathan Rochkind sez: “The entire comment, and, really the entire thread, are worth a read. There seems to be a lot of energy in libraryland behind trying to produce “linked data”, and I think it’s important to pay attention to what’s going on in the larger world here.
Especially because much of the stated motivation for library “linked data” seems to have been: “Because that’s where non-library information management technology is headed, and for once let’s do what everyone else is doing and not create our own library-specific standards.” It turns out that may or may not be the case ….”

How to Run a Content-Planning Workshop

Let’s draw a line. There are libraries that blah-blah “take content seriously” enough in that they pare down the content patrons don’t care about, ensure that hours and suchlike are findable, that their #libweb is ultimately usable. Then there are libraries that dive head-first into content creation. They podcast, make lists, write blogs, etc. For the latter, the library without a content strategy is going to be a mess, and I think these suggestions by James Deer on Smashing Magazine are really helpful.

New findings: For top ecommerce sites, mobile web performance is wildly inconsistent

I’m working on a new talk and maybe even a #bigproject about treating library web services and apps as e-commerce – because, think about it, what a library website does and what a web-store wants you to do aren’t too dissimilar. That said, I think we need to pay a lot of attention to stats that come out of e-commerce. Every year, Radware studies the mobile performance of the top 100 ecommerce sites to see how they measure up to user expectations. Here’s the latest report.

These are a few gems I think particularly important to us:

  • 1 out of 4 people worldwide own a smartphone
  • On mobile, 40% will abandon a page that takes longer than 3 seconds to load
  • Slow pages are the number one issue that mobile users complain about. 38% of smartphone users have screamed at, cursed at, or thrown their phones when pages take too long to load.
  • The median page is 19% larger than it was one year ago

There is also a lot of ink dedicated to sites that serve m-dot versions to mobile users, mostly making the point that this is ultimately dissatisfying and, moreover, tablet users definitely don’t want that m-dot site.

The post Web for Libraries: The UX Bandwagon appeared first on LibUX.

Galen Charlton: Reaching LITA members: a datapoint

planet code4lib - Thu, 2014-10-30 00:00

I recently circulated a petition to start a new interest group within LITA, to be called the Patron Privacy Technologies IG.  I’ve submitted the formation petition to the LITA Council, and a vote on the petition is scheduled for early November.  I also held an organizational meeting with the co-chairs; I’m really looking forward to what we all can do to help improve how our tools protect patron privacy.

But enough about the IG, let’s talk about the petition! To be specific, let’s talk about when the signatures came in.

I’ve been on Twitter since March of 2009, but a few months ago I made the decision to become much more active there (you see, there was a dearth of cat pictures on Twitter, and I felt it my duty to help do something about it).  My first thought was to tweet the link to a Google Form I created for the petition. I did so at 7:20 a.m. Pacific Time on 15 October:

LITA members interested in the @ALA_LITA Patron Privacy Technologies IG – please sign the petition to form the IG: https://t.co/kOggjNSKYi

— Galen Charlton (@gmcharlt) October 15, 2014

Also, if you are interested in being co-chair of the LITA Patron Privacy Tech IG, please indicate on the petition or drop me a line.

— Galen Charlton (@gmcharlt) October 15, 2014

Since I wanted to gauge whether there was interest beyond just LITA members, I also posted about the petition on the ALA Think Tank Facebook group at 7:50 a.m. on the 15th.

By the following morning, I had 13 responses: 7 from LITA members, and 6 from non-LITA members. An interest group petition requires 10 signatures from LITA members, so at 8:15 on the 16th, I sent another tweet, which got retweeted by LITA:

Just a few more signatures from LITA members needed for the Patron Privacy IG formation petition: https://t.co/i4mXsJps1p @ALA_LITA

— Galen Charlton (@gmcharlt) October 16, 2014

By early afternoon, that had gotten me one more signature. I was feeling a bit impatient, so at 2:28 p.m. on the 16th, I sent a message to the LITA-L mailing list.

That opened the floodgates: 10 more signatures from LITA members arrived by the end of the day, and 10 more came in on the 17th. All told, a total of 42 responses to the form were submitted between the 15th and the 23rd.

The petition didn’t ask how the responder found it, but if I make the assumption that most respondents filled out the form shortly after they first heard about it, I arrive at my bit of anecdata: over half of the petition responses were inspired by my post to LITA-L, suggesting that the mailing list remains an effective way of getting the attention of many LITA members.

By the way, the petition form is still up for folks to use if they want to be automatically subscribed to the IG’s mailing list when it gets created.

DuraSpace News: The Society of Motion Picture and Television Engineers (SMPTE) Archival Technology Medal Awarded to Neil Beagrie

planet code4lib - Thu, 2014-10-30 00:00

From William Kilbride, Digital Preservation Coalition

Heslington, York  At a ceremony in Hollywood on October 23, 2014, the Society of Motion Picture and Television Engineers® (SMPTE®) awarded the 2014 SMPTE Archival Technology Medal to Neil Beagrie in recognition of his long-term contributions to the research and implementation of strategies and solutions for digital preservation.

District Dispatch: ALA opposes e-book accessibility waiver petition

planet code4lib - Wed, 2014-10-29 21:29

ALA and the Association of Research Libraries (ARL) renewed their opposition to a petition filed by the Coalition of E-book Manufacturers seeking a waiver from complying with disability legislation and regulation (specifically Sections 716 and 717 of the Communications Act as Enacted by the Twenty-First Century Communications and Video Accessibility Act of 2010). Amazon, Kobo, and Sony are the members of the coalition, and they argue that they do not have to make their e-readers’ Advanced Communications Services (ACS) accessible to people with print disabilities.

Why? The coalition argues that because basic e-readers (Kindle, Sony Reader, Kobo E-Reader) are primarily used for reading and have only rudimentary ACS, they should be exempt from CVAA accessibility rules. People with disabilities can buy other more expensive e-readers and download apps in order to access content. Asking the Coalition to modify its basic e-readers would, they say, impose a regulatory burden, raise consumer prices, ruin the streamlined look of basic e-readers, and inhibit innovation (I suppose for other companies and start-ups that want to make even more advanced inaccessible readers).

The library associations have argued that these basic e-readers do have ACS capability as a co-primary use. In fact, the very companies asking for this waiver market their e-readers as being able to browse the web, for example. The Amazon Webkit that comes with the basic Kindle can “render HyperText Markup Language (HTML) pages, interpret JavaScript code, and apply webpage layout and styles from Cascading Style Sheets (CSS).” “The combination of HTML, JavaScript, and CSS demonstrates that this basic e-reader’s browser leaves open a wide array of ACS capability, including mobile versions of Facebook, Gmail, and Twitter, to name a few widely popular services.”

We believe denying the Coalition’s petition will not only increase access to ACS, but also increase access to more e-content for more people. As we note in our FCC comments: “Under the current e-reader ACS regime proposed by the Coalition and tentatively adopted by the Commission, disabled persons must pay a ‘device access tax.’ By availing oneself of one of the ‘accessible options’ as suggested by the Coalition, a disabled person would pay at minimum $20 more a device for a Kindle tablet that is heavier and has less battery life than a basic Kindle e-reader.” Surely it is right that everyone ought to be able to buy and use basic e-readers just like everybody has the right to drink from the same water fountain.

This decision will rest on the narrow question of whether or not ACS is offered, marketed and used as a co-primary purpose in these basic e-readers. We believe the answer to that question is “yes,” and we will continue our advocacy to support more accessible devices for all readers.

The post ALA opposes e-book accessibility waiver petition appeared first on District Dispatch.

Eric Hellman: GITenberg: Modern Maintenance Infrastructure for Our Literary Heritage

planet code4lib - Wed, 2014-10-29 20:51
One day back in March, the Project Gutenberg website thought I was a robot and stopped letting me download ebooks. Frustrated, I resolved to put some Project Gutenberg ebooks into GitHub, where I could let other people fix problems in the files. I decided to call this effort "Project Gitenhub". On my second or third book, I found that Seth Woodworth had had the same idea a year earlier, and had already moved about a thousand ebooks into GitHub. That project was named "GITenberg". So I joined his email list and started submitting pull requests for PG ebooks that I was improving.

Recently, we've joined forces to submit a proposal to the Knight Foundation's News Challenge, whose theme is "How might we leverage libraries as a platform to build more knowledgeable communities?" Here are some excerpts:
Abstract

Project Gutenberg (PG) offers 45,000 public domain ebooks, yet few libraries use this collection to serve their communities. Text quality varies greatly, metadata is all over the map, and it's difficult for users to contribute improvements. We propose to use workflow and software tools developed and proven for open source software development - GitHub - to open up the PG corpus to maintenance and use by libraries and librarians. The result - GITenberg - will include MARC records, covers, OPDS feeds and ebook files to facilitate library use. A version-controlled fork and merge workflow, combined with a change-triggered back-end build environment, will allow scalable, distributed maintenance of the greatest works of our literary heritage.

Description

Libraries need metadata records in MARC format, but in addition they need to be able to select from the corpus those works which are most relevant to their communities. They need covers to integrate the records with their catalogs, and they need a level of quality assurance so as not to disappoint patrons. Because this sort of metadata is not readily available, most libraries do not include PG records in their catalogs, resulting in unnecessary disappointment when, for example, a patron wants to read Moby Dick from the library on their Kindle.

Progress

43,000 books and their metadata have been moved to the git version control software; this will enable librarians to collaboratively edit and control the metadata. The GITenberg website, mailing list and software repository have been launched at https://gitenberg.github.io/. Software for generating MARC records and OPDS feeds has already been written.

Background

Modern software development teams use version control, continuous integration, and workflow management systems to coordinate their work. When applied to open-source software, these tools allow diverse teams from around the world to collaboratively maintain even the most sprawling projects. Anyone wanting to fix a bug or make a change first forks the software repository, makes the change, and then makes a "pull request". A best practice is to submit the pull request with a test case verifying the bug fix. A developer charged with maintaining the repository can then review the pull request and accept or reject the change. Often, there is discussion asking for clarification. Occasionally versions remain forked and diverge from each other. GitHub has become the most popular site for this type of software repository because of its well-developed workflow tools and integration hooks.

The leaders of this team recognized the possibility of using GitHub for the maintenance of ebooks, and we began the process of migrating the most important corpus of public domain ebooks, Project Gutenberg, onto GitHub, thus the name GITenberg. Project Gutenberg has grown over the years to 50,000 ebooks, audiobooks, and related media, including all the most important public domain works of English language literature. Despite the great value of this collection, few libraries have made good use of this resource to serve their communities. There are a number of reasons why. The quality of the ebooks and the metadata around the ebooks is quite varied. MARC records, which libraries use to feed their catalog systems, are available for only a subset of the PG collection. Cover images and other catalog enrichment assets are not part of PG. To make the entire PG corpus available via local libraries, massive collaboration among librarians and ebook developers is essential.

We propose to build integration tools around GitHub that will enable this sort of collaboration to occur.
  1. Although the PG corpus has been loaded into GITenberg, we need to build a backend that automatically converts the version-controlled source text into well-structured ebooks. We expect to define a flavor of MarkDown or Asciidoc which will enable this automatic, change-triggered building of ebook files (EPUB, MOBI, PDF). (MarkDown is a human-readable plain text format used on GitHub for documentation; MarkDown for ebooks is being developed independently by several teams of developers. Asciidoc is a similar format that works nicely for ebooks.)
  2. Similarly, we will need to build a parallel backend server that will produce MARC and XML formatted records from version-controlled plain-text metadata files.
  3. We will generate covers for the ebooks using a tool recently developed by NYPL and include them in the repository.
  4. We will build a selection tool to help libraries select the records best suited to their libraries.
  5. Using a set of "cleaned up" MARC records from NYPL, and adding custom cataloguing, we will seed the metadata collection with ~1000 high quality metadata records.
  6. We will provide a browsable OPDS feed for use in tablet and smartphone ebook readers.
  7. We expect that the toolchain we develop will be reusable for creation and maintenance of a new generation of freely licensed ebooks.

The rest of the proposal is on the Knight News Challenge website. If you like the idea of GITenberg, you can "applaud" it there. The "applause" is not used in the judging of the proposals, but it makes us feel good. There are lots of other interesting and inspiring proposals to check out and applaud, so go take a look!
