Feed aggregator

Code4Lib: Code4Lib North (Ottawa): Tuesday October 7th, 2014

planet code4lib - Sun, 2014-09-21 18:38

Speakers:

  • Mark Baker, Principal Architect at Zepheira, will provide a brief overview of some of Zepheira’s BibFrame tools in development.
  • Jennifer Whitney, Systems Librarian at MacOdrum Library, will present OpenRefine (formerly Google Refine), a neat and powerful tool for cleaning up messy data.
  • Sarah Simpkin, GIS and Geography Librarian, and Catherine McGoveran, Government Information Librarian (both from the UOttawa Library), will team up to present on a recent UOttawa-sponsored Open Data Hackfest and to introduce you to Open Data Ottawa.

Date: Tuesday October 7th, 2014, 7:00PM (19h00)

Location: MacOdrum Library, Carleton University, 1125 Colonel By Drive, Ottawa, ON (map)

RSVP: You can RSVP on the code4lib Ottawa Meetup page.

David Rosenthal: Utah State Archives has a problem

planet code4lib - Sun, 2014-09-21 04:55
A recent thread on the NDSA mailing list featured discussion of the Utah State Archives' struggle with the costs of being forced to use Utah's state IT infrastructure for preservation. Below the fold, some quick comments.



Here's a summary of the situation the Archives finds itself in:
we actually have two separate copies of the AIP. One is on m-disc and the other is on spinning disk (a relatively inexpensive NAS device connected to our server, for which we pay our IT department each month). ... We have centralized IT, where there is one big data center and servers are virtualized. Our IT charges us a monthly rate for not just storage, but also all of their overhead to exist as a department. ... and we are required by statute to cooperate with IT in this model, so we can't just go out and buy/install whatever we want. For an archives, that's a problem, because our biggest need is storage but we are funded based upon the number of people we employ, not the quantity of data we need to store, and convincing the Legislature that we need $250,000/year for just one copy of 50 TB of data is a hard sell, never mind additional copies for SIP, AIP, and/or DIP.

Michelle Kimpton, who is in the business of persuading people that using DuraCloud is cheaper and better than doing it yourself, leaped at the opportunity this offered (my emphasis):
If I look at Utah State Archive storage cost, at $5,000 per year per TB vs. Amazon S3 at $370/year/TB it is such a big gap I have a hard time believing that Central IT organizations will be sustainable in the long run.  Not that Amazon is the answer to everything, but they have certainly put a stake in the ground regarding what spinning disk costs, fully loaded (meaning this includes utilities, building and personnel). Amazon S3 also provides 3 copies, 2 onsite and one in another data center.

I am not advocating by any means that S3 is the answer to it all, but it is quite telling to compare the fully loaded TB cost from an internal IT shop vs. the fully loaded TB cost from Amazon.
I appreciate you sharing the numbers Elizabeth and it is great your IT group has calculated what I am guessing is the true cost for managing data locally.

Elizabeth Perkes for the Archives responded:
I think using Amazon costs more than just their fees, because someone locally still has to manage any server space you use in the cloud and make sure the infrastructure is updated. So then you either need to train your archives staff how to be a system administrator, or pay someone in the IT community an hourly rate to do that job. Depending on who you get, hourly rates can cost between $75-150/hour, and server administration is generally needed at least an hour per week, so the annual cost of that service is an additional $3,900-$7,800. Utah's IT rate is based on all costs to operate for all services, as I understand it. We have been using a special billing rate for our NAS device, which reflects more of the actual storage costs than the overhead, but then the auditors look at that and ask why that rate isn't available to everyone, so now IT is tempted to scale that back. I just looked at the standard published FY15 rates, and they have dropped from what they were a couple of years ago. The official storage rate is now $0.2386/GB/month, which is $143,160/year for 50 TB, or $2,863.20 per TB/year.

But this doesn't get at the fundamental flaws in Michelle's marketing:
  • She suggests that Utah's IT charges reflect "the true cost for managing data locally". But that isn't what the Utah Archives are doing. They are buying IT services from a competitor to Amazon, one that they are required by statute to buy from. 
  • She compares Utah's IT with S3. S3 is a storage-only product. Using it cost-effectively, as Elizabeth points out, involves also buying AWS compute services, which is a separate business of Amazon's with its own P&L and pricing policies. For the Archives, Utah IT is in effect both S3 and AWS, so the comparison is misleading.
  • The comparison is misleading in another way. Long-term, reliable storage is not the business Utah IT is in. The Archives are buying storage services from a compute provider, not a storage provider. It isn't surprising that the pricing isn't competitive.
  • But more to the point, why would Utah IT bother to be competitive? Their customers can't go any place else, so they are bound to get gouged. I'm surprised that Utah IT is only charging 10 times the going rate for an inferior storage product.
  • And don't fall for the idea that Utah IT is only charging what they need to cover their costs. They control the costs, and they have absolutely no incentive to minimize them. If an organization can hire more staff and pass the cost of doing so on to customers who are bound by statute to pay for them, it is going to hire a lot more staff than an organization whose customers can walk.
As I've pointed out before, Amazon's margins on S3 are enviable. You don't need to be very big to have economies of scale enough to undercut S3, as the numbers from Backblaze demonstrate. The Archives' 50TB is possibly not enough to do this even if they were actually managing the data locally.

But the Archives might well employ a strategy similar to the one I suggested for the Library of Congress Twitter collection. They already keep a copy on m-disc. Suppose they kept two copies on m-disc, as the Library keeps two copies on tape, and regarded that as their preservation solution. Then they could use Amazon's Reduced Redundancy Storage and AWS virtual servers as their access solution. Running frequent integrity checks might take an additional small AWS instance, and any damage detected could be repaired from one of the m-disc copies.
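To make that integrity-check loop concrete, here is a hedged sketch in Java of what such a small instance might run. It assumes a checksum manifest written when the m-disc copies were made; fetchAccessCopy() is a placeholder for however the access copy is actually read (for example an S3 GET), not a real AWS API call.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Hedged sketch of a fixity-checking loop: compare each access-copy object against a
// checksum manifest produced alongside the m-disc preservation copies, and flag anything
// that needs to be repaired from m-disc. Paths and method names are illustrative only.
public class IntegrityCheck {

    // Placeholder for reading the access copy (e.g. an S3 GET in a real deployment).
    static InputStream fetchAccessCopy(String objectId) throws Exception {
        return Files.newInputStream(Paths.get("/mnt/access", objectId));
    }

    static String sha256(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path manifest = Paths.get(args[0]);   // lines of "<objectId> <sha256>"
        for (String line : Files.readAllLines(manifest)) {
            if (line.trim().isEmpty()) continue;
            String[] parts = line.trim().split("\\s+");
            try (InputStream in = fetchAccessCopy(parts[0])) {
                if (!sha256(in).equals(parts[1])) {
                    System.out.println("DAMAGED, repair from m-disc: " + parts[0]);
                }
            }
        }
    }
}
```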

Using the cloud for preservation is almost always a bad idea. Preservation is a base-load activity whereas the cloud is priced as a peak-load product. But the spiky nature of current access to archival collections is ideal for the cloud.

John Miedema: “Book Was There” by Andrew Piper. If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.

planet code4lib - Sat, 2014-09-20 18:56

“I can imagine a world without books. I cannot imagine a world without reading” (Piper, ix). In these last few generations of print there is nothing keeping book lovers from reading print books. Yet with each decade the print book yields further to the digital. But there it is, we are the first few generations of digital, and we are still discovering what that means for reading. It is important to document this transition. In Book Was There: Reading in Electronic Times, Piper describes how the print book is shaping the digital screen and what it means for reading.

Book was there. It is a quote from Gertrude Stein, who understood that it matters deeply where one reads. Piper: “my daughter … will know where she is when she reads, but so too will someone else.” (128) It is a warm promise, and also an observation that could be ominous; its possibilities are still being explored.

The differences between print and digital are complex, and Piper is not making a case for or against books. The book is a physical container of letters. The print book is “at hand,” a continuous presence, available for daily reference and so capable of reinforcing new ideas. The word “digital” comes from “digits” (at least in English), the fingers of the hand. Digital technology is ambient, but could allow for more voices, more debate. On the other hand, “For some readers the [print] book is anything but graspable. It embodies … letting go, losing control, handing over.” (12)  And internet users are known to flock together, reinforcing what they already believe, ignoring dissent. Take another example. Some criticize the instability of the digital. Turn off the power and the text is gone. Piper counters that digital text is incredibly hard to delete, with immolation of the hard drive being the NSA-recommended practice.

Other differences are still debated. There is a basic two-dimensional nature to the book, with pages facing one another and turned. One wonders if this duality affords reflection. Does the return to one-dimensional scrolling of the web page numb the mind? Writing used to be the independent act of one or two writers. Reading was a separate event. Digital works like Wikipedia are written by many contributors, organized into sections. Piper wonders if it is possible to have collaborative writing that is also tightly woven, like literature. There is the recent example of 10 PRINT, written by ten authors in one voice. Books have always been shared, a verb that has its origins in “shearing … an act of forking.” (88) With digital, books can be shared more easily, and readers can publish endings of their own. Books are forked into different versions. Piper cautions that over-sharing can lead to the kind of forking that ended the development of Unix. Yet Unix turned out to be a success all the same. Is there a downside?

Scrolling aside, digital is really a multidimensional medium. Text has been rebuilt from the ground up, with numbers first. New, deep kinds of reading are becoming possible. Twenty-five years ago a professor of mine lamented that he could not read all the academic literature in his discipline. Today he can. Piper introduces what is being called “distant reading”: the use of big data technologies, natural language processing, and visualization to analyze the history of literature at the granular level of words. In his own research, he calculates how language influences the writing of a book, and how in turn the book changes the language of its time. This approach measures a book in a way that was never possible with disciplined close reading or speed reading. “If we’re going to have ebooks that distract us, we might as well have ones that help us analyse too.” (148)

Piper embraces the fact that we now have new kinds of reading. He asserts that these practices need not replace the old. Certainly there will always be print books for those of us who love a good slow read. I do think, however, that trade-offs are being made. Books born digital are measurably shorter than print, more suited to quick reading and analysis by numbers. New authors are writing to digital readers. Readers and reading are being shaped in turn. The reading landscape is changing. These days I am doubtful that traditional reading of print books — or even ebooks — will remain a common practice. There it is.

District Dispatch: “Outside the Lines” at ICMA

planet code4lib - Fri, 2014-09-19 21:14

(From left) David Singleton, Director of Libraries for the Charlotte Mecklenburg Library, with Public Library Association (PLA) Past President Carolyn Anthony, PLA Director Barb Macikas and PLA President Larry Neal after a tour of ImaginOn.

This week, many libraries are inviting their communities to reconnect as part of a national effort called Outside the Lines (September 14-20). My personal experience of new acquaintances often includes an exclamation of “I didn’t know libraries did that,” and Pew Internet Project research backs this up: only about 23 percent of people who already visit our libraries feel they know all or most of what we do. The need to invite people to rethink libraries is clear.

On the policy front, this also is a driving force behind the Policy Revolution! initiative—making sure national information policy matches the current and emerging landscape of how libraries are serving their communities. One of the first steps is simply to make modern libraries more visible to key decision-makers and influencers.

One of these influential groups, particularly for public libraries, is the International City/County Management Association (ICMA), which concluded its 100th anniversary conference in Charlotte this past week. I enjoyed connecting with city and county managers and their professional staffs over several days, both informally and formally through three library-related presentations.

The Aspen Institute kicked off my conference experience with a preview and discussion of its work emerging from the Dialogue on Public Libraries. Without revealing any details that might diminish the national release of the Aspen Institute report to come in October, I can say it was a lively and engaged discussion with city and county managers from communities of all sizes across the globe. One theme that emerged and resonated throughout the conference was breaking down silos and increasing collaboration. One participant described this force factor as “one plus one equals three” and referenced the ImaginOn partnership between the Charlotte Mecklenburg Library and the Children’s Theatre of Charlotte.

A young patron enjoys a Sunday afternoon at ImaginOn.

While one might think the level of library knowledge and engagement in the room was exceptional, throughout my conversations city and county managers described new library building projects and renovations and efforts to increase local millages, and proudly touted the energy and expertise of the library directors they work with in building vibrant and informed communities. In fact, they sounded amazingly like librarians in their enthusiasm and depth of knowledge!

Dr. John Bertot and I shared findings and new tools from the Digital Inclusion Survey, with a particular focus on how local communities can use the new interactive mapping tools to connect library assets to community demographics and concerns. ICMA is a partner with the American Library Association (ALA) and the University of Maryland Information Policy & Access Center on the survey, which is funded by the Institute of Museum and Library Services (IMLS). Through our presentation (ppt), we explored the components of digital inclusion and key data related to technology infrastructure, digital literacy and programs and services that support education, civic engagement, workforce and entrepreneurship, and health and wellness. Of greatest interest was—again—breaking down barriers…in this case among diverse datasets relating libraries and community priorities.

Finally, I was able to listen in on a roundtable on Public Libraries and Community Building in which the Urban Libraries Council (ULC) shared the Edge benchmarks and facilitated a conversation about how the benchmarks might relate to city/county managers’ priorities and concerns. One roundtable participant from a town of about 3,300 discovered during a community listening tour that the library was the first place people could send a fax, and often where they used a computer and the internet for the first time. How could the library continue to be the “first place” for what comes next in new technology? The answer: you need a facility and a culture willing to be nimble. One part of preparing the facility was to upgrade to a 100 Mbps broadband connection, which has literally increased traffic to this community technology hub as people drive in with their personal devices.

I was proud to get Outside the Lines at the ICMA conference, and am encouraged that so many of these city and county managers already had “met” the 21st century library and were interested in working together for stronger cities, towns, counties and states. Thanks #ICMA14 for embracing and encouraging library innovation!

The post “Outside the Lines” at ICMA appeared first on District Dispatch.

FOSS4Lib Recent Releases: Evergreen - 2.5.7-rc1

planet code4lib - Fri, 2014-09-19 20:28

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Friday, September 5, 2014

FOSS4Lib Recent Releases: Evergreen - 2.6.3

planet code4lib - Fri, 2014-09-19 20:27

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Friday, September 5, 2014

FOSS4Lib Recent Releases: Evergreen - 2.7.0

planet code4lib - Fri, 2014-09-19 20:27

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Package: Evergreen
Release Date: Thursday, September 18, 2014

FOSS4Lib Upcoming Events: Fedora 4.0 in Action at The Art Institute of Chicago and UCSD

planet code4lib - Fri, 2014-09-19 20:16
Date: Wednesday, October 15, 2014 - 13:00 to 14:00
Supports: Fedora Repository

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

Presented by: Stefano Cossu, Data and Application Architect, Art Institute of Chicago and Esmé Cowles, Software Engineer, University of California San Diego
Join Stefano and Esmé as they showcase new pilot projects built on Fedora 4.0 Beta at the Art Institute of Chicago and the University of California San Diego. These projects demonstrate the value of adopting Fedora 4.0 Beta and taking advantage of new features and opportunities for enhancing repository data.

HangingTogether: Talk Like a Pirate – library metadata speaks

planet code4lib - Fri, 2014-09-19 19:32

Pirate Hunter, Richard Zacks

Friday, 19 September is of course well known as International Talk Like a Pirate Day. In order to mark the day, we created not one but FIVE lists (rolled out over this whole week). This is part of our What In the WorldCat? series (#wtworldcat lists are created by mining data from WorldCat in order to highlight interesting and different views of the world’s library collections).

If you have a suggestion for something you’d like us to feature, let us know or leave a comment below.


FOSS4Lib Upcoming Events: VuFind Summit 2014

planet code4lib - Fri, 2014-09-19 19:18
Date: Monday, October 13, 2014 - 08:00 to Tuesday, October 14, 2014 - 17:00
Supports: VuFind

Last updated September 19, 2014. Created by Peter Murray on September 19, 2014.

This year's VuFind Summit will be held on October 13-14 at Villanova University (near Philadelphia).

Registration for the two-day event is $40 and includes both morning refreshments and a full lunch for both days.

It is not too late to submit a talk proposal and, if accepted, have your registration fee waived.

State Library of Denmark: Sparse facet caching

planet code4lib - Fri, 2014-09-19 14:40

As explained in Ten times faster, distributed faceting in standard Solr is two-phase (a code sketch of the flow follows the list):

  1. Each shard performs standard faceting and returns the top limit*1.5+10 terms. The merger calculates the top limit terms. Standard faceting is a two-step process:
    1. For each term in each hit, update the counter for that term.
    2. Extract the top limit*1.5+10 terms by running through all the counters with a priority queue.
  2. Each shard returns the number of occurrences of each term in the top limit terms, calculated by the merger from phase 1. This is done by performing a mini-search for each term, which takes quite a long time. See Even sparse faceting is limited for details.
    1. Addendum: If the number for a term was returned by a given shard in phase 1, that shard is not asked for that term again.
    2. Addendum: If the shard returned a count of 0 for any term as part of phase 1, that means it has delivered all possible counts to the merger. That shard will not be asked again.
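To make the two-phase flow concrete, here is a minimal sketch of the merger-side logic in Java. This is not Solr's actual code: the class and method names are invented, and countOnShard() stands in for the per-term mini-search a shard performs in phase 2.

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of two-phase distributed facet merging; not Solr's actual code.
public class TwoPhaseFacetMerge {

    /** Phase 1, step 2: a shard extracts its top limit*1.5+10 counters with a priority queue. */
    static Map<String, Integer> shardTopTerms(Map<String, Integer> counters, int limit) {
        int over = (int) (limit * 1.5 + 10);
        Comparator<Map.Entry<String, Integer>> byCount = Comparator.comparingInt(Map.Entry::getValue);
        PriorityQueue<Map.Entry<String, Integer>> pq = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> e : counters.entrySet()) {
            pq.offer(e);
            if (pq.size() > over) pq.poll();          // keep only the 'over' largest counts
        }
        Map<String, Integer> top = new HashMap<>();
        pq.forEach(e -> top.put(e.getKey(), e.getValue()));
        return top;
    }

    /** Merger: sum phase-1 contributions, pick the top 'limit' terms, then fill gaps (phase 2). */
    static Map<String, Integer> merge(List<Map<String, Integer>> perShardTop, int limit) {
        Map<String, Integer> totals = new HashMap<>();
        perShardTop.forEach(shard -> shard.forEach((term, count) -> totals.merge(term, count, Integer::sum)));
        List<String> wanted = totals.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        // Phase 2: ask a shard only for terms it did not return (addendum 1),
        // and skip shards that reported a zero count, i.e. already delivered everything (addendum 2).
        for (int s = 0; s < perShardTop.size(); s++) {
            Map<String, Integer> reported = perShardTop.get(s);
            if (reported.containsValue(0)) continue;
            for (String term : wanted) {
                if (!reported.containsKey(term)) {
                    totals.merge(term, countOnShard(s, term), Integer::sum);
                }
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        wanted.forEach(term -> result.put(term, totals.get(term)));
        return result;
    }

    /** Stand-in for the expensive per-term mini-search a shard performs in phase 2. */
    static int countOnShard(int shard, String term) {
        return 0; // placeholder
    }
}
```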
Sparse speedup

Sparse faceting speeds up phase 1 step 2 by only visiting the updated counters. It also speeds up phase 2 by repeating phase 1 step 1, then extracting the counts directly for the wanted terms. Although it sounds heavy to repeat phase 1 step 1, the total time for phase 2 with sparse faceting is a lot lower than with standard Solr. But why repeat phase 1 step 1 at all?
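As background, the sparse principle itself can be illustrated with a small counter structure that remembers which term ordinals were touched, so extraction and reset only visit those updated counters. This is a simplified sketch of the idea, not the actual sparse faceting implementation:

```java
// Simplified sketch of a sparse counter structure: alongside the big counter array,
// keep a list of the ordinals that were touched, so top-N extraction and reset only
// visit updated counters instead of scanning the whole array.
public class SparseCounters {
    private final int[] counts;
    private final int[] touched;   // ordinals that currently have a non-zero count
    private int touchedSize = 0;

    public SparseCounters(int numTerms) {
        counts = new int[numTerms];
        touched = new int[numTerms];
    }

    public void increment(int termOrdinal) {
        if (counts[termOrdinal]++ == 0) {      // first hit for this term: remember it
            touched[touchedSize++] = termOrdinal;
        }
    }

    public int count(int termOrdinal) {
        return counts[termOrdinal];
    }

    /** Visit only the updated counters (phase 1 step 2, or direct extraction in phase 2). */
    public void forEachUpdated(java.util.function.IntConsumer consumer) {
        for (int i = 0; i < touchedSize; i++) {
            consumer.accept(touched[i]);
        }
    }

    /** Reset for re-use without clearing the whole array when only few terms were touched. */
    public void clear() {
        for (int i = 0; i < touchedSize; i++) {
            counts[touched[i]] = 0;
        }
        touchedSize = 0;
    }
}
```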

Caching

Today, caching of the counters from phase 1 step 1 was added to Solr sparse faceting. Caching is a tricky business to get just right, especially since the sparse cache must contain a mix of empty counters (to avoid re-allocation of large structures on the Java heap) and filled structures (from phase 1, intended for phase 2). But theoretically it is simple: when phase 1 step 1 is finished, the counter structure is kept and re-used in phase 2.
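A minimal sketch of such a cache, reusing the SparseCounters structure sketched above, could look like the following. The class and method names are invented for illustration; the real cache also has to bound memory use and handle concurrency far more carefully.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical counter cache: hands out empty counter structures (avoiding re-allocation
// of large arrays) and parks filled ones from phase 1, keyed by query, so phase 2 can
// pick them up again instead of recounting.
public class CounterCache {
    private final Deque<SparseCounters> empty = new ArrayDeque<>();
    private final Map<String, SparseCounters> filled = new HashMap<>();
    private final int maxEntries;
    private final int numTerms;

    public CounterCache(int maxEntries, int numTerms) {
        this.maxEntries = maxEntries;
        this.numTerms = numTerms;
    }

    /** Phase 1: get a counter structure, preferring a recycled empty one. */
    public synchronized SparseCounters acquire() {
        SparseCounters c = empty.poll();
        return c != null ? c : new SparseCounters(numTerms);
    }

    /** Phase 1 done: keep the filled counters around for phase 2 of the same query. */
    public synchronized void parkFilled(String queryKey, SparseCounters c) {
        if (filled.size() + empty.size() < maxEntries) {
            filled.put(queryKey, c);
        } // else: cache is full, the filled counters are simply dropped
    }

    /** Phase 2: reuse the phase-1 counts if still cached; a null return means recount. */
    public synchronized SparseCounters takeFilled(String queryKey) {
        return filled.remove(queryKey);
    }

    /** If phase 2 never happens (all counts delivered in phase 1), recycle the structure. */
    public synchronized void release(SparseCounters c) {
        c.clear();
        if (empty.size() + filled.size() < maxEntries) {
            empty.push(c);
        }
    }
}
```

So time for testing: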

15TB index / 5B docs / 2565GB RAM, faceting on 6 fields, facet limit 25, unwarmed queries

Note that there are no measurements of standard Solr faceting in the graph. See the Ten times faster article for that. What we have here are 4 different types of search:

  • no_facet: Plain searches without faceting, just to establish the baseline.
  • skip: Only phase 1 sparse faceting. This means inaccurate counts for the returned terms, but as can be seen, the overhead is very low for most searches.
  • cache: Sparse faceting with caching, as described above.
  • nocache: Sparse faceting without caching.
Observations

For 1-1000 hits, nocache is actually a bit faster than cache. The peculiar thing about this hit-range is that chances are high that all shards return all possible counts (phase 2 addendum 2), so phase 2 is skipped for a lot of searches. When phase 2 is skipped, caching a filled counter structure is wasted work: the structure needs to be either cleaned for re-use or discarded if the cache is getting too big. This means a bit of overhead.

For more than 1000 hits, cache wins over nocache. Filter through the graph noise by focusing on the medians. As the difference between cache and nocache is that the base faceting time is skipped with cache, the difference of their medians should be about the same as the difference of the medians from no_facet and skip. Are they? Sorta-kinda. This should be repeated with a larger sample.

Conclusion

Caching with distributed faceting means a small performance hit in some cases and a larger performance gain in others. Nothing Earth-shattering, and since it works best when there is more memory allocated for caching, it is not clear in general whether it is best to use it or not. Download a Solr sparse WAR from GitHub and try for yourself.


Library of Congress: The Signal: Emerging Collaborations for Accessing and Preserving Email

planet code4lib - Fri, 2014-09-19 13:02

The following is a guest post by Chris Prom, Assistant University Archivist and Professor, University of Illinois at Urbana-Champaign.

I’ll never forget one lesson from my historical methods class at Marquette University.  Ronald Zupko–famous for his lecture about the bubonic plague and a natural showman–was expounding on what it means to interrogate primary sources–to cast a skeptical eye on every source, to see each one as a mere thread of evidence in a larger story, and to remember that every event can, and must, tell many different stories.

He asked us to name a few documentary genres, along with our opinions as to their relative value.  We shot back: “Photographs, diaries, reports, scrapbooks, newspaper articles,” along with the type of ill-informed comments graduate students are prone to make.  As our class rattled off responses, we gradually came to realize that each document reflected the particular viewpoint of its creator–and that the information a source conveyed was constrained by documentary conventions and other social factors inherent to the medium underlying the expression. Settling into the comfortable role of skeptics, we noted the biases each format reflected.  Finally, one student said: “What about correspondence?”  Dr Zupko erupted: “There is the real meat of history!  But, you need to be careful!”

Dangerous Inbox by Recrea HQ. Photo courtesy of Flickr through a CC BY-NC-SA 2.0 license.

Letters, memos, telegrams, postcards: such items have long been the stock-in-trade for archives.  Historians and researchers of all types, while mindful of the challenges in using correspondence, value it as a source for the insider perspective it provides on real-time events.   For this reason, the library and archives community must find effective ways to identify, preserve and provide access to email and other forms of electronic correspondence.

After I researched and wrote a guide to email preservation (pdf) for the Digital Preservation Coalition’s Technology Watch Report series, I concluded that the challenges are mostly cultural and administrative.

I have no doubt that with the right tools, archivists could do what we do best: build the relationships that underlie every successful archival acquisition.  Engaging records creators and donors in their digital spaces, we can help them preserve access to the records that are so sorely needed for those who will write histories.  But we need the tools, and a plan for how to use them.  Otherwise, our promises are mere words.

For this reason, I’m so pleased to report on the results of a recent online meeting organized by the National Digital Stewardship Alliance’s Standards and Practices Working Group.  On August 25, a group of fifty-plus experts from more than a dozen institutions informally shared the work they are doing to preserve email.

For me, the best part of the meeting was that it represented the diverse range of institutions (in terms of size and institutional focus) that are interested in this critical work. Email preservation is not something of interest only to large government archives or to small collecting repositories, but to every repository in between. That said, the representatives displayed a surprisingly similar vision for how email preservation can be made effective.

Robert Spangler, Lisa Haralampus, Ken Hawkins and Kevin DeVorsey described challenges that the National Archives and Records Administration has faced in controlling and providing access to large bodies of email. Concluding that traditional records management practices are not sufficient to the task, NARA has developed the Capstone approach, which seeks to identify particular accounts that must be preserved as a record series, and is currently revising its transfer guidance.  Later in the meeting, Mark Conrad described the particular challenge of preserving email from the Executive Office of the President, highlighting the point that “scale matters”–a theme that resonated across the board.

The whole-account approach that NARA advocates meshes well with activities described by other presenters.  For example, Kelly Eubank from the North Carolina State Archives and the EMCAP project discussed the need for software tools to ingest and process email records, while Linda Reib from the Arizona State Library noted that the PeDALS Project is seeking to continue its work, focusing on account-level preservation of key state government accounts.

Functional comparison of selected email archives tools/services. Courtesy Wendy Gogel.

Ricc Ferrante and Lynda Schmitz Fuhrig from the Smithsonian Institution Archives discussed the CERP project which produced, in conjunction with the EMCAP project, an XML schema for email objects among its deliverables. Kate Murray from the Library of Congress reviewed the new email and related calendaring formats on the Sustainability of Digital Formats website.

Harvard University was up next.  Andrea Goethels and Wendy Gogel shared information about Harvard’s Electronic Archiving Service.  EAS includes tools for normalizing email from an account into EML format (conforming to the Internet Engineering Task Force RFC 2822), then packaging it for deposit into Harvard’s digital repository.
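As a rough illustration of that normalization step (not Harvard's EAS code), a single raw message can be parsed and re-serialized as an .eml file with the standard JavaMail API:

```java
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Minimal sketch of email normalization: parse one raw message and write it back out
// as a canonical RFC 2822 .eml file. A real workflow would run this per message over
// a whole account before packaging the results for repository deposit.
public class EmlNormalizer {
    public static void normalize(Path rawMessage, Path emlOut) throws Exception {
        Session session = Session.getInstance(new Properties());
        try (InputStream in = Files.newInputStream(rawMessage);
             OutputStream out = Files.newOutputStream(emlOut)) {
            MimeMessage msg = new MimeMessage(session, in);  // parses headers and body
            msg.writeTo(out);                                // writes the normalized EML
        }
    }

    public static void main(String[] args) throws Exception {
        normalize(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```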

One of the most exciting presentations was provided by Peter Chan and Glynn Edwards from Stanford University.  With generous funding from the National Historical Publications and Records Commission, as well as some internal support, the ePADD Project (“Email: Process, Appraise, Discover, Deliver”) is using natural language processing and entity extraction tools to build an application that will allow archivists and records creators to review email, then process it for search, display and retrieval.  Best of all, the web-based application will include a built-in discovery interface and users will be able to define a lexicon and to provide visual representations of the results.  Many participants in the meeting commented that the ePADD tools may provide a meaningful focus for additional collaborations.  A beta version is due out next spring.
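The entity-extraction idea behind tools like ePADD can be sketched with an off-the-shelf NLP toolkit. The snippet below is not ePADD's code; it assumes Apache OpenNLP and its pretrained en-ner-person.bin model, and simply pulls person names out of a message body so they could feed a browsing or discovery index:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

// Hypothetical illustration of entity extraction over an email body using Apache OpenNLP.
// ePADD uses its own NLP pipeline; this only shows the general technique.
public class EmailEntitySketch {
    public static void main(String[] args) throws Exception {
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            NameFinderME personFinder = new NameFinderME(new TokenNameFinderModel(modelIn));
            String body = "Dear Ada, please forward the budget to Charles Babbage before Friday.";
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(body);
            Span[] names = personFinder.find(tokens);
            for (String name : Span.spansToStrings(names, tokens)) {
                System.out.println("person: " + name);   // candidate entry for a browsing index
            }
        }
    }
}
```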

In the discussion that followed the informal presentations, several presenters congratulated the Harvard team on a slide Wendy Gogel shared, comparing the functions provided by various tools and services (reproduced above).

As is apparent from even a cursory glance at the chart, repositories are doing wonderful work—and much yet remains.

Collaboration is the way forward. At the end of the discussion, participants agreed to take three specific steps to drive email preservation initiatives to the next level: (1) providing tool demo sessions; (2) developing use cases; and (3) working together.

The bottom line: I’m more hopeful about the ability of the digital preservation community to develop an effective approach toward email preservation than I have been in years.  Stay tuned for future developments!

LITA: Tech Yourself Before You Wreck Yourself – Vol. 1

planet code4lib - Fri, 2014-09-19 12:30
Art from Cécile Graat

This post is for all the tech librarian caterpillars dreaming of one day becoming empowered tech butterflies. The internet is full to the brim with tools and resources for aiding in your transformation (and your job search). In each installment of Tech Yourself Before You Wreck Yourself – TYBYWY, pronounced tie-buy-why – I’ll curate a small selection of free courses, webinars, and other tools you can use to learn and master technologies.  I’ll also spotlight a presentation opportunity so that you can consider putting yourself out there – it’s a big, beautiful community and we all learn through collaboration.

MOOC of the Week -

Allow me to suggest you enroll in The Emerging Future: Technology Issues and Trends, a MOOC offered by the School of Information at San Jose State University through Canvas. Taking a Futurist approach to technology assessment, Sue Alman, PhD offers participants an opportunity to learn “the planning skills that are needed, the issues that are involved, and the current trends as we explore the potential impact of technological innovations.”

Sounds good to this would-be Futurist!

Worthwhile Webinars –

I live in the great state of Texas, so it is with some pride that I recommend the recurring series, Tech Tools with Tine, from the Texas State Library and Archives Commission.  If you’re like me, you like your tech talks in manageable bite-size pieces. This is just your style.

September 19th, 9-10 AM EST – Tech Tools with Tine: 1 Hour of Google Drive

September 26th, 9-10 AM EST – Tech Tools with Tine: 1 Hour of MailChimp

October 3rd, 9-10 AM EST – Tech Tools with Tine: 1 Hour of Curation with Pinterest and Tumblr

Show Off Your Stuff –

The deadline to submit a proposal to the 2015 Library Technology Conference at Macalester College in beautiful St. Paul is September 22nd. Maybe that tight timeline is just the motivation you’ve been looking for!

What’s up, Tiger Lily? -

Are you a tech caterpillar or a tech butterfly? Do you have any cool free webinars or opportunities you’d like to share? Write me all about it in the comments.

District Dispatch: OITP Director appointed to University of Maryland Advisory Board

planet code4lib - Fri, 2014-09-19 08:46

This week, the College of Information Studies at the University of Maryland appointed Alan Inouye, director of the American Library Association’s (ALA) Office for Information Technology Policy (OITP), to the inaugural Advisory Board for the university’s Master of Library Science (MLS) degree program.

“This appointment supports OITP’s policy advocacy and its Policy Revolution! initiative,” said OITP Director Alan S. Inouye. “Future librarians will be working in a rapidly evolving information environment. I look forward to the opportunity to help articulate the professional education needed for success in the future.”

The Advisory Board comprises 17 leaders and students in the information professions who will guide the future development of the university’s MLS program. The Board’s first task will be to engage in a strategic “re-envisioning the MLS” discussion.

Serving three-year terms, the members of the Board will:

  • Provide insights on how the MLS program can enhance the impact of its services on various stakeholder groups;
  • Provide advice and counsel on strategy, issues, and trends affecting the future of the MLS Program;
  • Strengthen relationships with libraries, archives, industry, and other key information community partners;
  • Provide input for assessing the progress of the MLS program;
  • Provide a vital link to the community of practice for faculty and students to facilitate research, inform teaching, and further develop public service skills;
  • Support the fundraising efforts to support the MLS program; and
  • Identify the necessary entry-level skills, attitudes and knowledge competencies as well as performance levels for target occupations.

Additional Advisory Board Members include:

  • Tahirah Akbar-Williams, Education and Information Studies Librarian, McKeldin Library, University of Maryland
  • Brenda Anderson, Elementary Integrated Curriculum Specialist, Montgomery County Public Schools
  • R. Joseph Anderson, Director, Niels Bohr Library and Archives, American Institute of Physics
  • Jay Bansbach, Program Specialist, School Libraries, Instructional Technology and School Libraries, Division of Curriculum, Assessment and Accountability, Maryland State Department of Education
  • Sue Baughman, Deputy Executive Director, Association of Research Libraries
  • Valerie Gross, President and CEO, Howard County Public Library
  • Lucy Holman, Director, Langsdale Library, University of Baltimore
  • Naomi House, Founder, I Need a Library Job (INALJ)
  • Erica Karmes Jesonis, Chief Librarian for Information Management, Cecil County Public Library
  • Irene Padilla, Assistant State Superintendent for Library Development and Services, Maryland State Department of Education
  • Katherine Simpson, Director of Strategy and Communication, American University Library
  • Lissa Snyders, MLS Candidate, University of Maryland iSchool
  • Pat Steele, Dean of Libraries, University of Maryland
  • Maureen Sullivan, Immediate Past President, American Library Association
  • Joe Thompson, Senior Administrator, Public Services, Harford County Public Library
  • Paul Wester, Chief Records Officer for the Federal Government, National Archives and Records Administration

The post OITP Director appointed to University of Maryland Advisory Board appeared first on District Dispatch.

OCLC Dev Network: Release Scheduling Update

planet code4lib - Thu, 2014-09-18 21:30

To accommodate additional performance testing and optimization, the September release of WMS, which includes changes to the WMS Vendor Information Center API, is being deferred.  We will communicate the new date for the release as soon as we have confirmation.

District Dispatch: The Goodlatte, the bad and the ugly…

planet code4lib - Thu, 2014-09-18 20:55

My Washington Office colleague Carrie Russell, ALA’s copyright ace in the Office of Information Technology Policy, provides a great rundown here in DD on the substantive ins and outs of the House IP Subcommittee’s hearing yesterday. The Subcommittee met to take testimony on the part of the 1998 Digital Millennium Copyright Act (Section 1201, for those of you keeping score at home) that prohibits anyone from “circumventing” any kind of “digital locks” (aka, “technological protection measures,” or “TPMs”) used by their owners to protect copyrighted works. The hearing was also interesting, however, for the politics of the emerging 1201 debate on clear display.

First, the good news.  Rep. Bob Goodlatte (VA), Chairman of the full House Judiciary Committee, made time in a no doubt very crowded day to attend the hearing specifically for the purpose of making a statement in which he acknowledged that targeted reform of Section 1201 was needed and appropriate.  As one of the original authors of 1201 and the DMCA, and the guy with the big gavel, Mr. Goodlatte’s frank and informed talk was great to hear.

Likewise, Congressman Darrell Issa of California (who’s poised to assume the Chairmanship of the IP Subcommittee in the next Congress and eventually to succeed Mr. Goodlatte at the full Committee’s helm) agreed that Section 1201 might well need modification to prevent it from impeding technological innovation — a cause he’s championed over his years in Congress as a technology patent-holder himself.

Lastly, Rep. Blake Farenthold added his voice to the reform chorus.  While a relatively junior Member of Congress, Rep. Farenthold clearly “gets” the need to assure that 1201 doesn’t preclude fair use or valuable research that requires digital locks to be broken precisely to see if they create vulnerabilities in computer apps and networks that can be exploited by real “bad guys,” like malware- and virus-pushing lawbreakers.

Of course, any number of other members of the Subcommittee were singing loudly in the key of “M” for yet more copyright protection.  Led by the most senior Democrat on the full Judiciary Committee, Rep. John Conyers (MI), multiple members appeared (as Carrie described yesterday) to believe that “strengthening” Section 1201 in unspecified ways would somehow thwart … wait for it … piracy, as if another statute and another penalty would do anything to affect the behavior of industrial-scale copyright infringers in China who don’t think twice now about breaking existing US law.  Sigh….

No legislation is yet pending to change Section 1201 or other parts of the DMCA, but ALA and its many coalition partners in the public and private sectors will be in the vanguard of the fight to reform this outdated and ill-advised part of the law (including the triennial process by which exceptions to Section 1201 are granted, or not) next year.  See you there!

The post The Goodlatte, the bad and the ugly… appeared first on District Dispatch.

SearchHub: Say Hello to Lucidworks Fusion

planet code4lib - Thu, 2014-09-18 20:43

The team at Lucidworks is proud to announce the release of our next-generation platform for building powerful, scalable search applications: Lucidworks Fusion.

Fusion extends any Solr deployment with the enterprise-grade capabilities you need to deliver a world-class search experience:

Full support for any Solr deployment including Lucidworks Search, SolrCloud, and stand-alone mode.

Deeper support for recommendations including Item-to-Query, Query-to-Item, and Item-to-Item with aggregated signals.

Advanced signal processing including any datapoint (click-through, purchases, ratings) – even social signals like Twitter.

Enhanced application development with REST APIs, index-side and query-time pipelines, with sophisticated connector frameworks.

Advanced web and filesystem crawlers with multi-threaded HTML/document connectors, de-duping, and incremental crawling.

Integrated security management for roles and users supporting HTTPS, form-based, Kerberos, LDAP, and native methods.

Search, log, and trend analytics for any log type with real-time and historical data with SiLK.

Ready to learn more? Join us for our upcoming webinar:

Webinar: Meet Lucidworks Fusion

Join Lucidworks CTO Grant Ingersoll for a ‘first look’ at our latest release, Lucidworks Fusion. You’ll be among the first to see the power of the Fusion platform and how it gives you everything you need to design, build, and deploy amazing search apps.

Webinar: Meet Lucidworks Fusion
Date: Thursday, October 2, 2014
Time: 11:00 am Pacific Daylight Time (San Francisco, GMT-07:00)

Click here to register for this webinar.

Or learn more at http://lucidworks.com/product/fusion/

John Miedema: Wilson iteration plans: Topics on text mining the novel.

planet code4lib - Thu, 2014-09-18 20:27

The Wilson iteration of my cognitive system will involve a deep dive into topics on text mining the novel. My overly ambitious plans are the following, roughly in order:

  • Develop a working code illustration of genre detection.
  • Develop another custom entity recognition model for literature, using an annotated corpus.
  • Visualization of literary concepts using time trends.
  • Collection of open data, open access articles, and open source tools for text analysis of literature.
  • Think about a better teaching tool for building models. Distinguish teaching computers from programming.

We’ll see where it goes.

DPLA: Nearly 100,000 items from the Getty Research Institute now available in DPLA

planet code4lib - Thu, 2014-09-18 20:03

More awesome news from DPLA! Hot on the heels of announcements earlier this week about newly added materials from the Medical Heritage Library and the Government Printing Office, we’re excited to share today that nearly 100,000 items from the Getty Research Institute are now available via DPLA.

To view the Getty in DPLA, click here.

From an announcement posted today on the Getty Research Institute Blog:

As a DPLA content hub, the Getty Research Institute has contributed metadata—information that enables search and retrieval of material—for nearly 100,000 digital images, documentary photograph collections, archives, and books dating from the 1400s to today. We’ve included some of the most frequently requested and significant material from our holdings of more than two million items, including some 5,600 images from the Julius Shulman photography archive, 2,100 images from the Jacobson collection of Orientalist photography, and dozens of art dealers’ stockbooks from the Duveen and Knoedler archives.

The Getty will make additional digital content available through DPLA as their collections continue to be cataloged and digitized.

All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.
