Feed aggregator

David Rosenthal: Three Good Reads

planet code4lib - Mon, 2014-09-22 12:11
Below the fold I'd like to draw your attention to two papers and a post worth reading.



Cappello et al have published an update to their seminal 2009 paper Towards Exascale Resilience called Towards Exascale Resilience: 2014 Update. They review progress in some critical areas over the past five years. I've referred to the earlier paper as an example of the importance of, and the difficulty of, fault-tolerance at scale. As scale increases, faults become part of the normal state of the system; they cannot be treated as an exception. It is nevertheless disappointing that the update, like its predecessor, deals only with exascale computation, not also with exascale long-term storage. Their discussion of storage is limited to the performance of short-term storage in checkpointing applications. This is a critical issue, but a complete exascale system will need a huge amount of longer-term storage. The engineering problems in providing it should not be ignored.

Dave Anderson of Seagate first alerted me to the fact that, in the medium term, manufacturing economics make it impossible for flash to displace hard disk as the medium for most long-term near-line bulk storage. Fontana et al from IBM Almaden have now produced a comprehensive paper, The Impact of Areal Density and Millions of Square Inches (MSI) of Produced Memory on Petabyte Shipments of TAPE, NAND Flash, and HDD Storage Class Memories that uses detailed industry data on flash, disk and tape shipments, technologies and manufacturing investments from 2008-2012 to reinforce this message. They also estimate the scale of investment needed to increase production to meet an estimated 50%/yr growth in data. The mismatch between the estimates of data growth and the actual shipments of media on which to store it is so striking that they are forced to cast doubt on the growth estimates. It is clear from their numbers that the industry will not make the mistake of over-investing in manufacturing capacity, driving prices, and thus margins, down. This provides significant support for our argument that Storage Will Be Much Less Free Than It Used To Be.

Henry Newman has a post up at Enterprise Storage entitled Ensuring the Future of Data Archiving, discussing the software architecture that future data archives require. Although I agree almost entirely with Henry's argument, I think he doesn't go far enough. We need to fix the system, not just the software. I will present my much more radical view of future archival system architecture in a talk at the Library of Congress' Designing Storage Architectures workshop. The text will go up here in a few days.

Nick Ruest: Islandora and nginx

planet code4lib - Mon, 2014-09-22 11:40
Background

I have been doing a fair bit of scale testing for York University Digital Library over the last couple of weeks. Most of it has been focused on horizontal scaling of the traditional Islandora stack (Drupal, Fedora Commons, FedoraGSearch, Solr, and aDORe-Djatoka). The stack is traditionally run with Apache2 in front of it, which reverse proxies the parts of the stack that are Tomcat webapps. I was curious whether the stack would work with nginx, and whether I would get any noticeable improvements just by switching from Apache2 to nginx. The preliminary good news is that the stack works with nginx (I'll outline the config below). The not surprising news, according to this, is that I'm not seeing any noticeable improvements. If time permits, I'll do some real benchmarking.

Islandora nginx configurations

Having no experience with nginx, I started searching around and found a config by David StClair that worked. With a few slight modifications, I was able to get the stack up and running with no major issues. The only major item I needed to figure out was how to reverse proxy aDORe-djatoka so that it would play nice with the default settings for Islandora OpenSeadragon. All this turned out to be was figuring out what the ProxyPass and ProxyPassReverse directive equivalents were for nginx. Turns out that it is very straightforward. With Apache2, we needed:

#Fedora Commons/Islandora proxying
ProxyRequests Off
ProxyPreserveHost On
<Proxy *>
  Order deny,allow
  Allow from all
</Proxy>
ProxyPass /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka
ProxyPassReverse /adore-djatoka http://digital.library.yorku.ca:8080/adore-djatoka

This gives us a nice dog in a hat with Apache2.

With nginx we use the proxy_redirect directive.

server {
  location /adore-djatoka {
    proxy_pass http://localhost:8080/adore-djatoka;
    proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
  }
}

This gives us a nice dog in a hat with nginx.

That's really the only major modification I had to make to get the stack running with nginx. Here is my config, adapted from David StClair's example.

server {
  server_name kappa.library.yorku.ca;
  root /path/to/drupal/install; ## <-- Your only path reference.

  # Enable compression, this will help if you have for instance advagg module
  # by serving Gzip versions of the files.
  gzip_static on;

  location = /favicon.ico {
    log_not_found off;
    access_log off;
  }

  location = /robots.txt {
    allow all;
    log_not_found off;
    access_log off;
  }

  # Very rarely should these ever be accessed outside of your lan
  location ~* \.(txt|log)$ {
    allow 127.0.0.1;
    deny all;
  }

  location ~ \..*/.*\.php$ {
    return 403;
  }

  # No no for private
  location ~ ^/sites/.*/private/ {
    return 403;
  }

  # Block access to "hidden" files and directories whose names begin with a
  # period. This includes directories used by version control systems such
  # as Subversion or Git to store control files.
  location ~ (^|/)\. {
    return 403;
  }

  location / {
    # This is cool because no php is touched for static content
    try_files $uri @rewrite;
    proxy_read_timeout 300;
  }

  location /adore-djatoka {
    proxy_pass http://localhost:8080/adore-djatoka;
    proxy_redirect http://localhost:8080/adore-djatoka /adore-djatoka;
  }

  location @rewrite {
    # You have 2 options here
    # For D7 and above:
    # Clean URLs are handled in drupal_environment_initialize().
    rewrite ^ /index.php;
    # For Drupal 6 and below:
    # Some modules enforce no slash (/) at the end of the URL
    # Else this rewrite block wouldn't be needed (GlobalRedirect)
    #rewrite ^/(.*)$ /index.php?q=$1;
  }

  # For Munin
  location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
  }

  location ~ \.php$ {
    fastcgi_split_path_info ^(.+\.php)(/.+)$;
    #NOTE: You should have "cgi.fix_pathinfo = 0;" in php.ini
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $request_filename;
    fastcgi_intercept_errors on;
    fastcgi_pass 127.0.0.1:9000;
  }

  # Fighting with Styles? This little gem is amazing.
  # This is for D7 and D8
  location ~ ^/sites/.*/files/styles/ {
    try_files $uri @rewrite;
  }

  location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {
    expires max;
    log_not_found off;
  }
}

tags: drupal, islandora, apache2, nginx

DuraSpace News: Open Repository: Work, Plus Hack Day Fun

planet code4lib - Mon, 2014-09-22 00:00

From James Evans, Open Repository, BioMed Central

Peter Sefton: Digital Object Pattern (DOP) vs chucking files in a database, approaches to repository design

planet code4lib - Sun, 2014-09-21 23:09

At work, in the eResearch team at the University of Western Sydney, we’ve been discussing various software options for working-data repositories for research data, and holding a series of ‘tool days’: informal hack-days where we try out various software packages. For the last few months we’ve been looking at “working-data repository” software for researchers in a principled way, searching for one or more perfect Digital Object Repositories for Academe (DORAs).

One of the things I’ve been ranting on about is the flexibility of the “Digital Object Pattern” (we always need more acronyms, so let’s call it DOP) for repository design, as implemented by the likes of ePrints, DSpace, Omeka, CKAN and many of the Fedora Commons based repository solutions. At its most basic, this means a repository that is built around a core set of objects (which might be called something like an Object, an ePrint, an Item, or a Data Set depending on which repository you’re talking to). These Digital Objects have:

  • Object level Metadata
  • One or more ‘files’ or ‘datastreams’ or ‘bitstreams’, which may themselves be metadata. Depending on the repository these may or may not have their own metadata.

Basic DOP Pattern
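To make the pattern concrete, here is a minimal sketch of a Digital Object as a plain data structure (illustrative only: the class and field names below are mine, not ePrints’, DSpace’s or Fedora’s):

from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Datastream:
    # A named 'file'/'bitstream'/'datastream' attached to a Digital Object.
    content: bytes
    mime_type: str
    metadata: Dict[str, str] = field(default_factory=dict)  # optional, per-file


@dataclass
class DigitalObject:
    # The core unit of the DOP: object-level metadata plus named datastreams.
    identifier: str
    metadata: Dict[str, str] = field(default_factory=dict)
    datastreams: Dict[str, Datastream] = field(default_factory=dict)


# A data set with descriptive metadata and two datastreams:
obj = DigitalObject("uws:example-1", metadata={"title": "Rainfall time series"})
obj.datastreams["data.csv"] = Datastream(b"...", "text/csv")
obj.datastreams["README.txt"] = Datastream(b"...", "text/plain")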

There are infinite ways to model a domain but this is a tried-and-tested pattern which is worth exploring for any repository, if only because it’s such a common abstraction that lots of protocols and user interface conventions have grown up around it.

I found this discussion of the Digital Object used in CNRI’s Digital Object Repository Server (DORS), obviously a cousin of DORA.

This data structure allows an object to have the following:

  • a set of key-value attributes that describe the object, one of which is the object’s identifier

  • a set of named ‘data elements’ that hold potentially large byte sequences (analogous to one or more data files)

  • a set of key-value attributes for each of the data elements

This relatively simple data structure allows for the simple case, but is sufficiently flexible and extensible to incorporate a wide variety of possible structures, such as an object with extensive metadata, or a single object which is available in a number of different formats. This object structure is general enough that existing services can easily map their information-access paradigm onto the structure, thus enhancing interoperability by providing a common interface across multiple and diverse information and storage systems. An example application of the DO data model is illustrated in Figure 1.

To the above list of features and advantages I’d add a couple of points on how to implement the ideal Digital Object repository:

  • Every modern repository should make it easy for people to do linked data. Instead of merely key-value attributes that describe the object, it would be better to allow for and encourage RDF-style predicate / object metadata where both the predicate and object are HTTP URIs with human-friendly labels. This is implemented natively in Fedora Commons v4. But when you are using the DOP it’s not essential as you can always add an RDF metadata data-element/file.
  • It’s nice if the files also get metadata, as in the CNRI Digital Object, but using the DOP you can always add a ‘file’ that describes the file relationships rather than relying on conventions like file-extensions or suffixes to say stuff like “This is a thumbnail preview of img01.jpg”
  • There really should be a way to do relationships with other objects but again, the DOP means you can DIY this feature with a ‘relationships’ data element.
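To sketch the last two points, file descriptions and object-to-object links can be expressed as data rather than inferred from file-name conventions: attach a small RDF (Turtle) ‘relationships’ data element to the object. The identifiers and vocabulary below are purely illustrative:

# Illustrative only: a 'relationships' data element holding RDF (Turtle) that
# describes the other data elements and links to another Digital Object.
relationships_ttl = """
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

<info:local/object-1>
    dcterms:hasPart  <info:local/object-1/img01.jpg> ,
                     <info:local/object-1/img01-thumb.jpg> ;
    dcterms:relation <info:local/object-2> .

# "This is a thumbnail preview of img01.jpg", stated explicitly:
<info:local/object-1/img01.jpg>
    foaf:thumbnail   <info:local/object-1/img01-thumb.jpg> .
"""

digital_object = {
    "identifier": "object-1",
    "metadata": {"title": "Example artefact"},
    "datastreams": {
        "img01.jpg":         {"mime_type": "image/jpeg"},
        "img01-thumb.jpg":   {"mime_type": "image/jpeg"},
        "relationships.ttl": {"mime_type": "text/turtle",
                              "content": relationships_ttl},
    },
}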

(I’m trying to keep this post reasonably short, but just quickly, another really good repository pattern that complements DOP is to keep separate the concerns of Storing Stuff from Indexing Stuff for Search and Browse. That is, the Digital Objects should be stashed somewhere with all their metadata and data, and no matter what metadata type you’re using you build one or more discovery indexes from that. This is worth mentioning because as soon as some people see RDF they immediately think Triple Store, OK, but for repository design I think it’s more helpful to think Triple Index. That is, treat the RDF reasoner, SPARQL query endpoint etc as a separate concern from repositing.)
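A minimal sketch of that separation of concerns (the index.add() call stands in for whatever search engine you use; it is hypothetical, not a specific product’s API): the stored objects remain the source of truth, and the discovery index is a derived view that can be dropped and rebuilt at any time.

def build_discovery_index(stored_objects, index):
    # Derive flat search documents from stored Digital Objects. Storage never
    # depends on the index; the index is rebuilt from storage whenever needed.
    for obj in stored_objects:
        doc = {"id": obj["identifier"]}
        doc.update(obj["metadata"])                      # flatten object metadata
        doc["datastreams"] = list(obj["datastreams"])    # names only, for display
        index.add(doc)                                   # hypothetical indexer call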

The DOP contrasts with a file-centric pattern where every file is modelled separately, with its own metadata, which is the approach taken by HIEv, the environmental science Working Data Repository we looked at last week. Theoretically, this gives you infinite flexibility, but in practice it makes it harder to build a usable data repository.

Files as primary repository objects

Once your repository starts having a lot of stuff in it, like image thumbnails, derived files like OCRed text, and transcoded versions of files (say from the proprietary TOA5 format into NetCDF), then you’re faced with the challenge of indexing them all for search and browse in a way that makes them appear to clump together. I think that as HIEv matures and more and more relationships between files become important, we’ll probably want to add container objects that automatically bundle together all the related bits and pieces to do with a single ‘thing’ in the repository. For example, a time series data set may have the original proprietary file format, some rich metadata, a derived file in a standard format, a simple plot to preview the file contents, and a re-sampled data set at lower resolution, all of which really have more or less the same metadata about where they came from, when, and some shared utility. So, we’ll probably end up with something like this:

Adding an abstraction to group files into Objects (once the UI gets unmanageable)

Draw a box around that and what have you got?

The Digital Object Pattern, that’s what, albeit probably implemented in a somewhat fragile way.

With the DOP, as with any repository implementation pattern, you have to make some design decisions. Gerry Devine asked at our tools day this week: what do you do about data-items that are referenced by multiple objects?

First of all, it is possible for one object to reference another, or data elements in another, but if there’s a lot of this going on then maybe the commonly re-used data elements could be put in their own object. A good example of this is the way WordPress, which is probably where you’re reading this, works. All images are uploaded into a media collection, and then referenced by posts and pages: an image doesn’t ‘belong’ to a document except by association, if the document calls it in. This is a common approach for content management systems, allowing for re-use of assets across objects, but if you were building a museum collection project with a Digital Object for each physical artefact, it might be better for practical reasons to store images of objects as data elements on the object, and other images which might be used for context etc separately as image objects.

Of course if you’re a really hardcore developer you’ll probably want to implement the most flexible possible pattern and put one file per object, with a ‘master object’ to tie them together. This makes development of a usable repository significantly harder. BTW, you can do it using the DOP with one file per Digital Object, and lots of relationships. Just be prepared for orders of magnitude more work to build a robust, usable system.


Digital Object Pattern (DOP) vs chucking files in a database, approaches to repository design is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Code4Lib: Code4Lib North (Ottawa): Tuesday October 7th, 2014

planet code4lib - Sun, 2014-09-21 18:38

Speakers:

  • Mark Baker - Principal Architect at Zepheira will provide a brief overview of some of Zepheira’s BibFrame tools in development.
  • Jennifer Whitney - Systems Librarian at MacOdrum Library will present OpenRefine (formerly Google Refine) – a neat and powerful tool for cleaning up messy data.
  • Sarah Simpkin, GIS and Geography Librarian & Catherine McGoveran, Government Information Librarian (both from UOttawa Library) - will team up to present on a recent UOttawa sponsored Open Data Hackfest as well as to introduce you to Open Data Ottawa.

Date: Tuesday October 7th, 2014, 7:00PM (19h00)

Location: MacOdrum Library, Carleton University, 1125 Colonel By Drive, Ottawa, ON (map)

RSVP: You can RSVP on the code4lib Ottawa Meetup page.

David Rosenthal: Utah State Archives has a problem

planet code4lib - Sun, 2014-09-21 04:55
A recent thread on the NDSA mailing list featured discussion of the Utah State Archives' struggle with the costs of being forced to use Utah's state IT infrastructure for preservation. Below the fold, some quick comments.



Here's a summary of the situation the Archives finds itself in:
we actually have two separate copies of the AIP. One is on m-disc and the other is on spinning disk (a relatively inexpensive NAS device connected to our server, for which we pay our IT department each month). ... We have centralized IT, where there is one big data center and servers are virtualized. Our IT charges us a monthly rate for not just storage, but also all of their overhead to exist as a department. ... and we are required by statute to cooperate with IT in this model, so we can't just go out and buy/install whatever we want. For an archives, that's a problem, because our biggest need is storage but we are funded based upon the number of people we employ, not the quantity of data we need to store, and convincing the Legislature that we need $250,000/year for just one copy of 50 TB of data is a hard sell, never mind additional copies for SIP, AIP, and/or DIP.

Michelle Kimpton, who is in the business of persuading people that using DuraCloud is cheaper and better than doing it yourself, leaped at the opportunity this offered (my emphasis):
If I look at Utah State Archive storage cost, at $5,000 per year per TB vs. Amazon S3 at $370/year/TB it is such a big gap I have a hard time believing that Central IT organizations will be sustainable in the long run. Not that Amazon is the answer to everything, but they have certainly put a stake in the ground regarding what spinning disk costs, fully loaded (meaning this includes utilities, building and personnel). Amazon S3 also provides 3 copies, 2 onsite and one in another data center.

I am not advocating by any means that S3 is the answer to it all, but it is quite telling to compare the fully loaded TB cost from an internal IT shop vs. the fully loaded TB cost from Amazon.
I appreciate you sharing the numbers Elizabeth and it is great your IT group has calculated what I am guessing is the true cost for managing data locally.

Elizabeth Perkes for the Archives responded:
I think using Amazon costs more than just their fees, because someone locally still has to manage any server space you use in the cloud and make sure the infrastructure is updated. So then you either need to train your archives staff how to be a system administrator, or pay someone in the IT community an hourly rate to do that job. Depending on who you get, hourly rates can cost between $75-150/hour, and server administration is generally needed at least an hour per week, so the annual cost of that service is an additional $3,900-$7,800. Utah's IT rate is based on all costs to operate for all services, as I understand it. We have been using a special billing rate for our NAS device, which reflects more of the actual storage costs than the overhead, but then the auditors look at that and ask why that rate isn't available to everyone, so now IT is tempted to scale that back. I just looked at the standard published FY15 rates, and they have dropped from what they were a couple of years ago. The official storage rate is now $0.2386/GB/month, which is $143,160/year for 50 TB, or $2,863.20 per TB/year.

But this doesn't get at the fundamental flaws in Michelle's marketing:
  • She suggests that Utah's IT charges reflect "the true cost for managing data locally". But that isn't what the Utah Archives are doing. They are buying IT services from a competitor to Amazon, one that they are required by statute to buy from. 
  • She compares Utah's IT with S3. S3 is a storage-only product. Using it cost-effectively, as Elizabeth points out, involves also buying AWS compute services, which is a separate business of Amazon's with its own P&L and pricing policies. For the Archives, Utah IT is in effect both S3 and AWS, so the comparison is misleading.
  • The comparison is misleading in another way. Long-term, reliable storage is not the business Utah IT is in. The Archives are buying storage services from a compute provider, not a storage provider. It isn't surprising that the pricing isn't competitive.
  • But more to the point, why would Utah IT bother to be competitive? Their customers can't go any place else, so they are bound to get gouged. I'm surprised that Utah IT is only charging 10 times the going rate for an inferior storage product.
  • And don't fall for the idea that Utah IT is only charging what they need to cover their costs. They control the costs, and they have absolutely no incentive to minimize them. If an organization can hire more staff and pass the cost of doing so on to customers who are bound by statute to pay for them, it is going to hire a lot more staff than an organization whose customers can walk.
As I've pointed out before, Amazon's margins on S3 are enviable. You don't need to be very big to have economies of scale enough to undercut S3, as the numbers from Backblaze demonstrate. The Archive's 50TB is possibly not enough to do this if they were actually managing the data locally.
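For what it's worth, the arithmetic in the numbers quoted above works out as below; a back-of-the-envelope sketch only, using the Archives' 50 TB figure, the two Utah IT rates, Michelle Kimpton's ~$370/TB/yr S3 figure and Elizabeth Perkes's estimate of local sysadmin overhead.

TB = 50                                      # the Archives' collection size

utah_original  = 5_000 * TB                  # ~$5,000/TB/yr figure
utah_fy15_rate = 0.2386 * 1_000 * 12 * TB    # $0.2386/GB/month published rate
s3_storage     = 370 * TB                    # ~$370/TB/yr S3 figure
admin_low, admin_high = 3_900, 7_800         # estimated local sysadmin cost, $/yr

print(f"Utah IT (original figure): ${utah_original:>9,.0f}/yr")
print(f"Utah IT (FY15 rate):       ${utah_fy15_rate:>9,.0f}/yr")
print(f"S3 plus local admin:       ${s3_storage + admin_low:>9,.0f} "
      f"to ${s3_storage + admin_high:,.0f}/yr")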

But the Archive might well employ a strategy similar to the one I suggested for the Library of Congress Twitter collection. They already keep a copy on m-disc. Suppose they kept two copies on m-disc, as the Library keeps two copies on tape, and regarded that as their preservation solution. Then they could use Amazon's Reduced Redundancy Storage and AWS virtual servers as their access solution. Running frequent integrity checks might take an additional small AWS instance, and any damage detected could be repaired from one of the m-disc copies.
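A rough sketch of that access-side integrity checking, with the checksums of record taken from the m-disc preservation copies; the fetch and restore helpers are hypothetical placeholders for whatever cloud client and recovery procedure are actually used.

import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit(manifest, fetch_cloud_copy, restore_from_mdisc):
    # manifest maps object key -> SHA-256 recorded when the m-disc copies were
    # written; fetch_cloud_copy and restore_from_mdisc are supplied elsewhere.
    for key, expected in manifest.items():
        if sha256(fetch_cloud_copy(key)) != expected:
            # The Reduced Redundancy access copy is damaged or missing:
            # repair it from an m-disc preservation copy, then re-check.
            restore_from_mdisc(key)
            assert sha256(fetch_cloud_copy(key)) == expected, key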

Using the cloud for preservation is almost always a bad idea. Preservation is a base-load activity whereas the cloud is priced as a peak-load product. But the spiky nature of current access to archival collections is ideal for the cloud.

John Miedema: “Book Was There” by Andrew Piper. If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.

planet code4lib - Sat, 2014-09-20 18:56

“I can imagine a world without books. I cannot imagine a world without reading” (Piper, ix). In these last few generations of print there is nothing keeping book lovers from reading print books. Yet with each decade the print book yields further to the digital. But there it is, we are the first few generations of digital, and we are still discovering what that means for reading. It is important to document this transition. In Book Was There: Reading in Electronic Times, Piper describes how the print book is shaping the digital screen and what it means for reading.

Book was there. It is a quote from Gertrude Stein, who understood that it matters deeply where one reads. Piper: “my daughter … will know where she is when she reads, but so too will someone else.” (128) It is a warm promise, and an observation that could be ominous but is still being explored for possibilities.

The differences between print and digital are complex, and Piper is not making a case for or against books. The book is a physical container of letters. The print book is “at hand,” a continuous presence, available for daily reference and so capable of reinforcing new ideas. The word, “digital,” comes from “digits” (at least in English), the fingers of the hand. Digital technology is ambient, but could allow for more voices, more debate. On the other hand, “For some readers the [print] book is anything but graspable. It embodies … letting go, losing control, handing over.” (12) And internet users are known to flock together, reinforcing what they already believe, ignoring dissent. Take another example. Some criticize the instability of the digital. Turn off the power and the text is gone. Piper counters that digital text is incredibly hard to delete, with immolation of the hard drive being the NSA recommended practice.

Other differences are still debated. There is a basic two-dimensional nature to the book, with pages facing one another and turned. One wonders if this duality affords reflection. Does the return to one-dimensional scrolling of the web page numb the mind? Writing used to be the independent act of one or two writers. Reading was a separate event. Digital works like Wikipedia are written by many contributors, organized into sections. Piper wonders whether it is possible to have collaborative writing that is also tightly woven like literature. There is the recent example of 10 PRINT, written by ten authors in one voice. Books have always been shared, a verb that has its origins in “shearing … an act of forking.” (88) With digital, books can be shared more easily, and readers can publish endings of their own. Books are forked into different versions. Piper cautions that over-sharing can lead to the forking that ended the development of Unix. But we now have the successful Unix. Is there a downside?

Scrolling aside, digital is really a multidimensional media. Text has been rebuilt from the ground up, with numbers first. New deep kinds of reading are becoming possible. Twenty-five years ago a professor of mine lamented that he could not read all the academic literature in his discipline. Today he can. Piper introduces what is being called “distant reading”: the use of big data technologies, natural language processing, and visualization, to analyze the history of literature at the granular level of words. In his research, he calculates how language influences the writing of a book, and how in turn the book changes the language of its time. It measures a book in a way that was never possible with disciplined close reading or speed reading. “If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.” (148)

Piper embraces the fact that we now have new kinds of reading. He asserts that these practices need not replace the old. Certainly there will always be print books for those of us who love a good slow read. I do think, however, that trade-offs are being made. Books born digital are measurably shorter than print, more suited to quick reading and analysis by numbers. New authors are writing to digital readers. Readers and reading are being shaped in turn. The reading landscape is changing. These days I am doubtful that traditional reading of print books — or even ebooks — will remain a common practice. There it is.

District Dispatch: “Outside the Lines” at ICMA

planet code4lib - Fri, 2014-09-19 21:14

(From left) David Singleton, Director of Libraries for the Charlotte Mecklenburg Library, with Public Library Association (PLA) Past President Carolyn Anthony, PLA Director Barb Macikas and PLA President Larry Neal after a tour of ImaginOn.

This week, many libraries are inviting their communities to reconnect as part of a national effort called Outside the Lines (September 14-20). Since my personal experience of new acquaintances often includes an exclamation of “I didn’t know libraries did that,” and this experience is buttressed by Pew Internet Project research that finds that only about 23 percent of people who already visit our libraries feel they know all or most of what we do, the need to invite people to rethink libraries is clear.

On the policy front, this also is a driving force behind the Policy Revolution! initiative—making sure national information policy matches the current and emerging landscape of how libraries are serving their communities. One of the first steps is simply to make modern libraries more visible to key decision-makers and influencers.

One of these influential groups, particularly for public libraries, is the International City/County Management Association (ICMA), which concluded its 100th anniversary conference in Charlotte this past week. I enjoyed connecting with city and county managers and their professional staffs over several days, both informally and formally through three library-related presentations.

The Aspen Institute kicked off my conference experience with a preview and discussion of its work emerging from the Dialogue on Public Libraries. Without revealing any details that might diminish the national release of the Aspen Institute report to come in October, I can say it was a lively and engaged discussion with city and county managers from communities of all sizes across the globe. One theme that emerged and resonated throughout the conference was breaking down silos and increasing collaboration. One participant described this force factor as “one plus one equals three” and referenced the ImaginOn partnership between the Charlotte Mecklenburg Library and the Children’s Theatre of Charlotte.

A young patron enjoys a Sunday afternoon at ImaginOn.

While one might think that the level of library knowledge and engagement in the room was perhaps exceptional, throughout my conversations, city and county managers described new library building projects and renovations, efforts to increase local millages, and proudly touted the energy and expertise of the library directors they work with in building vibrant and informed communities. In fact, they sounded amazingly like librarians in their enthusiasm and depth of knowledge!

Dr. John Bertot and I shared findings and new tools from the Digital Inclusion Survey, with a particular focus on how local communities can use the new interactive mapping tools to connect library assets to community demographics and concerns. ICMA is a partner with the American Library Association (ALA) and the University of Maryland Information Policy & Access Center on the survey, which is funded by the Institute of Museum and Library Services (IMLS). Through our presentation (ppt), we explored the components of digital inclusion and key data related to technology infrastructure, digital literacy and programs and services that support education, civic engagement, workforce and entrepreneurship, and health and wellness. Of greatest interest was—again—breaking down barriers…in this case among diverse datasets relating libraries and community priorities.

Finally, I was able to listen in on a roundtable on Public Libraries and Community Building in which the Urban Libraries Council (ULC) shared the Edge benchmarks and facilitated a conversation about how the benchmarks might relate to city/county managers’ priorities and concerns. One roundtable participant from a town of about 3,300 discovered during a community listening tour that the library was the first place people could send a fax, and often where they used a computer and the internet for the first time. How could the library continue to be the “first place” for what comes next in new technology? The answer: you need to have a facility and culture willing to be nimble. One part of preparing the facility was to upgrade to a 100 Mbps broadband connection, which has literally increased traffic to this community technology hub as people drive in with their personal devices.

I was proud to get Outside the Lines at the ICMA conference, and am encouraged that so many of these city and county managers already had “met” the 21st century library and were interested in working together for stronger cities, towns, counties and states. Thanks #ICMA14 for embracing and encouraging library innovation!

The post “Outside the Lines” at ICMA appeared first on District Dispatch.

FOSS4Lib Recent Releases: Evergreen - 2.5.7-rc1

planet code4lib - Fri, 2014-09-19 20:28

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Friday, September 5, 2014

FOSS4Lib Recent Releases: Evergreen - 2.6.3

planet code4lib - Fri, 2014-09-19 20:27

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Friday, September 5, 2014

FOSS4Lib Recent Releases: Evergreen - 2.7.0

planet code4lib - Fri, 2014-09-19 20:27

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Thursday, September 18, 2014

FOSS4Lib Upcoming Events: Fedora 4.0 in Action at The Art Institute of Chicago and UCSD

planet code4lib - Fri, 2014-09-19 20:16
Date: Wednesday, October 15, 2014 - 13:00 to 14:00
Supports: Fedora Repository

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Presented by: Stefano Cossu, Data and Application Architect, Art Institute of Chicago and Esmé Cowles, Software Engineer, University of California San Diego
Join Stefano and Esmé as they showcase new pilot projects built on Fedora 4.0 Beta at the Art Institute of Chicago and the University of California San Diego. These projects demonstrate the value of adopting Fedora 4.0 Beta and taking advantage of new features and opportunities for enhancing repository data.

HangingTogether: Talk Like a Pirate – library metadata speaks

planet code4lib - Fri, 2014-09-19 19:32

Pirate Hunter, Richard Zacks

Friday, 19 September is of course well known as International Talk Like a Pirate Day. In order to mark the day, we created not one but FIVE lists (rolled out over this whole week). This is part of our What In the WorldCat? series (#wtworldcat lists are created by mining data from WorldCat in order to highlight interesting and different views of the world’s library collections).

If you have a suggestion for something you’d like us to feature, let us know or leave a comment below.


FOSS4Lib Upcoming Events: VuFind Summit 2014

planet code4lib - Fri, 2014-09-19 19:18
Date: Monday, October 13, 2014 - 08:00 to Tuesday, October 14, 2014 - 17:00
Supports: VuFind

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

This year's VuFind Summit will be held on October 13-14 at Villanova University (near Philadelphia).

Registration for the two-day event is $40 and includes both morning refreshments and a full lunch for both days.

It is not too late to submit a talk proposal and, if accepted, have your registration fee waived.

State Library of Denmark: Sparse facet caching

planet code4lib - Fri, 2014-09-19 14:40

As explained in Ten times faster, distributed faceting in standard Solr is two-phase:

  1. Each shard performs standard faceting and returns the top limit*1.5+10 terms. The merger calculates the top limit terms. Standard faceting is a two-step process:
    1. For each term in each hit, update the counter for that term.
    2. Extract the top limit*1.5+10 terms by running through all the counters with a priority queue.
  2. Each shard returns the number of occurrences of each term in the top limit terms, calculated by the merger from phase 1. This is done by performing a mini-search for each term, which takes quite a long time. See Even sparse faceting is limited for details.
    1. Addendum: If the number for a term was returned by a given shard in phase 1, that shard is not asked for that term again.
    2. Addendum: If the shard returned a count of 0 for any term as part of phase 1, that means it has delivered all possible counts to the merger. That shard will not be asked again.
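A compressed sketch of the merger's side of this protocol, including the two addenda (simplified: shard.top_terms and shard.count_terms are hypothetical stand-ins for the real Solr shard requests, which run asynchronously):

from collections import Counter

def over_request(limit):
    return int(limit * 1.5 + 10)

def distributed_facet(shards, field, limit):
    # Phase 1: each shard facets locally and returns its top limit*1.5+10 terms.
    phase1 = {shard: shard.top_terms(field, over_request(limit)) for shard in shards}

    merged = Counter()
    for counts in phase1.values():
        merged.update(counts)
    candidates = [term for term, _ in merged.most_common(limit)]

    # Phase 2: refine the candidate counts. Skip terms a shard already reported
    # (addendum 1) and skip shards that returned a zero count and therefore
    # have nothing left to contribute (addendum 2).
    for shard, counts in phase1.items():
        if 0 in counts.values():
            continue
        missing = [term for term in candidates if term not in counts]
        if missing:
            merged.update(shard.count_terms(field, missing))

    return sorted(((t, merged[t]) for t in candidates), key=lambda tc: -tc[1])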
Sparse speedup

Sparse faceting speeds up phase 1 step 2 by only visiting the updated counters. It also speeds up phase 2 by repeating phase 1 step 1, then extracting the counts directly for the wanted terms. Although it sounds heavy to repeat phase 1 step 1, the total time for phase 2 for sparse faceting is a lot lower than standard Solr. But why repeat phase 1 step 1 at all?
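The core sparse idea can be sketched like this (illustrative Python, not the actual Solr implementation): a counter structure that remembers which term ordinals were touched, so both the top-X extraction and the phase 2 look-ups only visit those, and a reset that only clears the touched slots so the big allocation can be re-used.

class SparseCounter:
    # Counts term occurrences per query and tracks which slots were touched.
    def __init__(self, num_terms):
        self.counts = [0] * num_terms   # large allocation, re-used across queries
        self.touched = []               # ordinals updated for the current query

    def increment(self, ordinal):
        if self.counts[ordinal] == 0:
            self.touched.append(ordinal)
        self.counts[ordinal] += 1

    def top(self, limit):
        # Phase 1 step 2: only the touched slots are visited, not all num_terms.
        pairs = ((o, self.counts[o]) for o in self.touched)
        return sorted(pairs, key=lambda oc: -oc[1])[:limit]

    def get(self, ordinal):
        # Phase 2: direct look-up of the count for a requested term.
        return self.counts[ordinal]

    def clear(self):
        # Cheap reset: only the touched slots are zeroed before re-use.
        for o in self.touched:
            self.counts[o] = 0
        self.touched = []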

Caching

Today, caching of the counters from phase 1 step 1 was added to Solr sparse faceting. Caching is tricky business to get just right, especially since the sparse cache must contain a mix of empty counters (to avoid re-allocation of large structures on the Java heap) as well as filled structures (from phase 1, intended for phase 2). But theoretically, it is simple: When phase 1 step 1 is finished, the counter structure is kept and re-used in phase 2. So time for testing:

15TB index / 5B docs / 2565GB RAM, faceting on 6 fields, facet limit 25, unwarmed queries

Note that there are no measurements of standard Solr faceting in the graph. See the Ten times faster article for that. What we have here are 4 different types of search:

  • no_facet: Plain searches without faceting, just to establish the baseline.
  • skip: Only phase 1 sparse faceting. This means inaccurate counts for the returned terms, but as can be seen, the overhead is very low for most searches.
  • cache: Sparse faceting with caching, as described above.
  • nocache: Sparse faceting without caching.
Observations

For 1-1000 hits, nocache is actually a bit faster than cache. The peculiar thing about this hit-range is that chances are high that all shards return all possible counts (phase 2 addendum 2), so phase 2 is skipped for a lot of searches. When phase 2 is skipped, this means wasted caching of a filled counter structure, which needs to be either cleaned for re-use or discarded if the cache is getting too big. This means a bit of overhead.

For more than 1000 hits, cache wins over nocache. Filter through the graph noise by focusing on the medians. As the difference between cache and nocache is that the base faceting time is skipped with cache, the difference of their medians should be about the same as the difference of the medians from no_facet and skip. Are they? Sorta-kinda. This should be repeated with a larger sample.

Conclusion

Caching with distributed faceting means a small performance hit in some cases and a larger performance gain in others. Nothing Earth-shattering, and as it works best when there is more memory allocated for caching, it is not clear in general whether it is best to use it or not. Download a Solr sparse WAR from GitHub and try for yourself.


Library of Congress: The Signal: Emerging Collaborations for Accessing and Preserving Email

planet code4lib - Fri, 2014-09-19 13:02

The following is a guest post by Chris Prom, Assistant University Archivist and Professor, University of Illinois at Urbana-Champaign.

I’ll never forget one lesson from my historical methods class at Marquette University.  Ronald Zupko–famous for his lecture about the bubonic plague and a natural showman–was expounding on what it means to interrogate primary sources–to cast a skeptical eye on every source, to see each one as a mere thread of evidence in a larger story, and to remember that every event can, and must, tell many different stories.

He asked us to name a few documentary genres, along with our opinions as to their relative value.  We shot back: “Photographs, diaries, reports, scrapbooks, newspaper articles,” along with the type of ill-informed comments graduate students are prone to make.  As our class rattled off responses, we gradually came to realize that each document reflected the particular viewpoint of its creator–and that the information a source conveyed was constrained by documentary conventions and other social factors inherent to the medium underlying the expression. Settling into the comfortable role of skeptics, we noted the biases each format reflected.  Finally, one student said: “What about correspondence?”  Dr Zupko erupted: “There is the real meat of history!  But, you need to be careful!”

Dangerous Inbox by Recrea HQ. Photo courtesy of Flickr through a CC BY-NC-SA 2.0 license.

Letters, memos, telegrams, postcards: such items have long been the stock-in-trade for archives.  Historians and researchers of all types, while mindful of the challenges in using correspondence, value it as a source for the insider perspective it provides on real-time events.   For this reason, the library and archives community must find effective ways to identify, preserve and provide access to email and other forms of electronic correspondence.

After I researched and wrote a guide to email preservation (pdf) for the Digital Preservation Coalition’s Technology Watch Report series, I concluded that the challenges are mostly cultural and administrative.

I have no doubt that with the right tools, archivists could do what we do best: build the relationships that underlie every successful archival acquisition.  Engaging records creators and donors in their digital spaces, we can help them preserve access to the records that are so sorely needed for those who will write histories.  But we need the tools, and a plan for how to use them.  Otherwise, our promises are mere words.

For this reason, I’m so pleased to report on the results of a recent online meeting organized by the National Digital Stewardship Alliance’s Standards and Practices Working Group.  On August 25, a group of fifty-plus experts from more than a dozen institutions informally shared the work they are doing to preserve email.

For me, the best part of the meeting was that it represented the diverse range of institutions (in terms of size and institutional focus) that are interested in this critical work. Email preservation is not something of interest only to large government archives, or to small collecting repositories, but also to every repository in between. That said, the representatives displayed a surprisingly similar vision for how email preservation can be made effective.

Robert Spangler, Lisa Haralampus, Ken Hawkins and Kevin DeVorsey described challenges that the National Archives and Records Administration has faced in controlling and providing access to large bodies of email. Concluding that traditional records management practices are not sufficient to the task, NARA has developed the Capstone approach, seeking to identify particular accounts that must be preserved as a record series, and is currently revising its transfer guidance.  Later in the meeting, Mark Conrad described the particular challenge of preserving email from the Executive Office of the President, highlighting the point that “scale matters”–a theme that resonated across the board.

The whole account approach that NARA advocates meshes well with activities described by other presenters.  For example, Kelly Eubank from North Carolina State Archives and the EMCAP project discussed the need for software tools to ingest and process email records while Linda Reib from the Arizona State Library noted that the PeDALS Project is seeking to continue their work, focusing on account-level preservation of key state government accounts.

Functional comparison of selected email archives tools/services. Courtesy Wendy Gogel.

Ricc Ferrante and Lynda Schmitz Fuhrig from the Smithsonian Institution Archives discussed the CERP project which produced, in conjunction with the EMCAP project, an XML schema for email objects among its deliverables. Kate Murray from the Library of Congress reviewed the new email and related calendaring formats on the Sustainability of Digital Formats website.

Harvard University was up next.  Andrea Goethels and Wendy Gogel shared information about Harvard’s Electronic Archiving Service.  EAS includes tools for normalizing email from an account into EML format (conforming to the Internet Engineering Task Force RFC 2822), then packaging it for deposit into Harvard’s digital repository.
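As a rough illustration of that kind of normalization step (this is not the EAS implementation, just a minimal sketch using Python's standard library): read an mbox-format account export, write each message out as an individual RFC 2822 .eml file, and record fixity for the deposit package.

import hashlib
import mailbox
from pathlib import Path

def mbox_to_eml(mbox_path: str, out_dir: str) -> None:
    # Normalize an mbox account export into individual .eml files plus a
    # SHA-256 manifest, ready to be packaged for repository deposit.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    for i, message in enumerate(mailbox.mbox(mbox_path)):
        raw = message.as_bytes()            # full RFC 2822 message, headers + body
        name = f"{i:06d}.eml"
        (out / name).write_bytes(raw)
        manifest.append(f"{hashlib.sha256(raw).hexdigest()}  {name}")
    (out / "manifest-sha256.txt").write_text("\n".join(manifest) + "\n")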

One of the most exciting presentations was provided by Peter Chan and Glynn Edwards from Stanford University.  With generous funding from the National Historical Publications and Records Commission, as well as some internal support, the ePADD Project (“Email: Process, Appraise, Discover, Deliver”) is using natural language processing and entity extraction tools to build an application that will allow archivists and records creators to review email, then process it for search, display and retrieval.  Best of all, the web-based application will include a built-in discovery interface, and users will be able to define a lexicon and to provide visual representations of the results.  Many participants in the meeting commented that the ePADD tools may provide a meaningful focus for additional collaborations.  A beta version is due out next spring.

In the discussion that followed the informal presentations, several presenters congratulated the Harvard team on a slide Wendy Gogel shared, comparing the functions provided by various tools and services (reproduced above).

As is apparent from even a cursory glance at the chart, repositories are doing wonderful work—and much yet remains.

Collaboration is the way forward. At the end of the discussion, participants agreed to take three specific steps to drive email preservation initiatives to the next level: (1) providing tool demo sessions; (2) developing use cases; and (3) working together.

The bottom line: I’m more hopeful about the ability of the digital preservation community to develop an effective approach toward email preservation than I have been in years.  Stay tuned for future developments!

LITA: Tech Yourself Before You Wreck Yourself – Vol. 1

planet code4lib - Fri, 2014-09-19 12:30
Art from Cécile Graat

This post is for all the tech librarian caterpillars dreaming of one day becoming empowered tech butterflies. The internet is full to the brim with tools and resources for aiding in your transformation (and your job search). In each installment of Tech Yourself Before You Wreck Yourself – TYBYWY, pronounced tie-buy-why – I’ll curate a small selection of free courses, webinars, and other tools you can use to learn and master technologies.  I’ll also spotlight a presentation opportunity so that you can consider putting yourself out there- it’s a big, beautiful community and we all learn through collaboration.

MOOC of the Week -

Allow me to suggest you enroll in The Emerging Future: Technology Issues and Trends, a MOOC offered by the School of Information at San Jose State University through Canvas. Taking a Futurist approach to technology assessment, Sue Alman, PhD offers participants an opportunity to learn “the planning skills that are needed, the issues that are involved, and the current trends as we explore the potential impact of technological innovations.”

Sounds good to this would-be Futurist!

Worthwhile Webinars –

I live in the great state of Texas, so it is with some pride that I recommend the recurring series, Tech Tools with Tine, from the Texas State Library and Archives Commission.  If you’re like me, you like your tech talks in manageable bite-size pieces. This is just your style.

September 19th, 9-10 AM EST – Tech Tools with Tine: 1 Hour of Google Drive

September 26th, 9-10 AM EST – Tech Tools with Tine: 1 Hour of MailChimp

October 3rd, 9-10 AM EST – Tech Tools with Tine: 1 Hour of Curation with Pinterest and Tumblr

Show Off Your Stuff –

The deadline to submit a proposal to the 2015 Library Technology Conference at Macalester College in beautiful St. Paul is September 22nd. Maybe that tight timeline is just the motivation you’ve been looking for!

What’s up, Tiger Lily? -

Are you a tech caterpillar or a tech butterfly? Do you have any cool free webinars or opportunities you’d like to share? Write me all about it in the comments.

District Dispatch: OITP Director appointed to University of Maryland Advisory Board

planet code4lib - Fri, 2014-09-19 08:46

This week, the College of Information Studies at the University of Maryland appointed Alan Inouye, director of the American Library Association’s (ALA) Office for Information Technology Policy (OITP), to the inaugural Advisory Board for the university’s Master of Library Science (MLS) degree program.

“This appointment supports OITP’s policy advocacy and its Policy Revolution! initiative,” said OITP Director Alan S. Inouye. “Future librarians will be working in a rapidly evolving information environment. I look forward to the opportunity to help articulate the professional education needed for success in the future.”

The Advisory Board comprises 17 leaders and students in the information professions who will guide the future development of the university’s MLS program. The Board’s first task will be to engage in a strategic “re-envisioning the MLS” discussion.

Serving three-year terms, the members of the Board will:

  • Provide insights on how the MLS program can enhance the impact of its services on various stakeholder groups;
  • Provide advice and counsel on strategy, issues, and trends affecting the future of the MLS Program;
  • Strengthen relationships with libraries, archives, industry, and other key information community partners;
  • Provide input for assessing the progress of the MLS program;
  • Provide a vital link to the community of practice for faculty and students to facilitate research, inform teaching, and further develop public service skills;
  • Support the fundraising efforts to support the MLS program; and
  • Identify the necessary entry-level skills, attitudes and knowledge competencies as well as performance levels for target occupations.

Additional Advisory Board Members include:

  • Tahirah Akbar-Williams, Education and Information Studies Librarian, McKeldin Library, University of Maryland
  • Brenda Anderson, Elementary Integrated Curriculum Specialist, Montgomery County Public Schools
  • R. Joseph Anderson, Director, Niels Bohr Library and Archives, American Institute of Physics
  • Jay Bansbach, Program Specialist, School Libraries, Instructional Technology and School Libraries, Division of Curriculum, Assessment and Accountability, Maryland State Department of Education
  • Sue Baughman, Deputy Executive Director, Association of Research Libraries
  • Valerie Gross, President and CEO, Howard County Public Library
  • Lucy Holman, Director, Langsdale Library, University of Baltimore
  • Naomi House, Founder, I Need a Library Job (INALJ)
  • Erica Karmes Jesonis, Chief Librarian for Information Management, Cecil County Public Library
  • Irene Padilla, Assistant State Superintendent for Library Development and Services, Maryland State Department of Education
  • Katherine Simpson, Director of Strategy and Communication, American University Library
  • Lissa Snyders, MLS Candidate, University of Maryland iSchool
  • Pat Steele, Dean of Libraries, University of Maryland
  • Maureen Sullivan, Immediate Past President, American Library Association
  • Joe Thompson, Senior Administrator, Public Services, Harford County Public Library
  • Paul Wester, Chief Records Officer for the Federal Government, National Archives and Records Administration

The post OITP Director appointed to University of Maryland Advisory Board appeared first on District Dispatch.

OCLC Dev Network: Release Scheduling Update

planet code4lib - Thu, 2014-09-18 21:30

To accommodate additional performance testing and optimization, the September release of WMS, which includes changes to the WMS Vendor Information Center API, is being deferred.  We will communicate the new date for the release as soon as we have confirmation.
