Planet Code4Lib

FOSS4Lib Recent Releases: TemaTres Vocabulary Server - 2.1

Fri, 2016-03-18 20:00

Last updated March 18, 2016. Created by Peter Murray on March 18, 2016.

Package: TemaTres Vocabulary Server
Release Date: Monday, March 14, 2016

pinboard: Twitter

Fri, 2016-03-18 17:28
We named our API-first archives data model after the beloved Trapper Keeper #c4l16 #code4lib

pinboard: RAD’s Code4Lib 2016 presentation is now online!...

Fri, 2016-03-18 17:28
We named our API-first archives data model after the beloved Trapper Keeper #c4l16 #code4lib

FOSS4Lib Recent Releases: Evergreen - 2.10.0

Fri, 2016-03-18 15:38
Package: Evergreen
Release Date: Thursday, March 17, 2016

Last updated March 18, 2016. Created by gmcharlt on March 18, 2016.

New features and enhancements of note in Evergreen 2.10.0 include:

Open Knowledge Foundation: International Open Data Day in Addis Ababa, Ethiopia

Fri, 2016-03-18 15:14

This blog post was written by Solomon Mekonnen, co-founder of Code4Ethiopia and local organizer for Open Knowledge.

Twenty-five participants from universities, NGOs, CSOs and government ministries attended an open data event on 5 March 2016 with the theme “Raising Open Data awareness in the grassroots community of Ethiopia”. The event was organized by Code4Ethiopia and Open Knowledge Ethiopia, with the support of Open Knowledge International and Addis Ababa University, in connection with Open Data Day, a global celebration of openness.

The event was opened by Mr. Mesfin Gezehagn, University Librarian at Addis Ababa University (AAU). Mr. Mesfin told the participants that Addis Ababa University has been providing training on open research data and open science to postgraduate students and academics, with the aim of seeing more researchers practice open data sharing (making data free to use, reuse, and redistribute) and open science (making scientific research, data and other results and workflows available to all). He also stated that the University collaborates with open data communities such as Open Knowledge Ethiopia and Code4Ethiopia.

Mr. Mesfin also informed the participants that AAU has started drafting a policy mandating the submission of research data for research sponsored by the University, so that the data can be opened to the public.

Following the opening, three of the co-founders of Code4Ethiopia (Solomon Mekonnen, Melkamu Beyene and Teshome Alemu) and a lecturer at the Department of Computer Science of AAU (Desalegn Mequanint) presented discussion areas for participants. The presentations focused on the Code4Ethiopia and Open Knowledge Ethiopia programmes, raising open data awareness in the grassroots community of Ethiopia, open data experience in African countries, and the social, cultural and economic factors affecting open data implementation in Ethiopia.

The workshop was then opened for discussion by Daniel Beyene, co-founder of Code4Ethiopia. The participants recommended that advocacy be done top-down, starting from policy makers and reaching the grassroots community of Ethiopia, and also proposed that Code4Ethiopia and Open Knowledge Ethiopia, in collaboration with international partners, organize a national Open Data hackathon to sensitize various stakeholders.

The workshop also identified challenges to open data implementation in Ethiopia, including lack of awareness, absence of policy-level commitment from government, and lack of appropriate data science skills and data literacy. The participants also selected datasets that should be prioritized for the country’s development and that interest the general public, including budget data, expenditure (accounts) data, census data, trade information, election data, and health and educational data.

The workshop concluded with thanks to our partners Open Knowledge International and Addis Ababa University for their contribution to the success of the event. All of the participants were invited to join Code4Ethiopia and the Open Knowledge community, and most agreed to join the two communities to build an open data ecosystem in Ethiopia.

State Library of Denmark: CDX musings

Fri, 2016-03-18 13:16

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry.

There is currently some debate on CDX, used by the Wayback Engine, Open Wayback and other web archive oriented tools, such as Warcbase. A CDX Server API is being formulated, as is the CDX format itself. Inspired by a post by Kristinn Sigurðsson, here comes another input.

CDX components, diagram by Ulrich Karstoft Have

CDX Server API

There is an ongoing use case-centric discussion of needed features for a CDX API. Compared to that, the CDX Server API – BETA seems a bit random. For example: a feature such as regexp-matching on URLs can be very heavy on the backend and opens the door to denial of service (intentional as well as unintentional). It should only be in the API if it is of high priority. One way of handling all wishes is to define a core API with the bare necessities and add extra functionality as API extensions. What is needed and what is wanted?

The same weighing problem can be seen for the fields required for the server. Kristinn discusses CDX as a format and boils the core fields down to canonicalized URL, timestamp, original URL, digest, record type, WARC filename and WARC offset. Everything else is optional. This approach matches the core vs. extensions functionality division.

With optional features and optional fields, a CDX server should have some mechanism for stating what is possible.

URL canonicalization and digest

The essential lookup field is the canonicalised URL. Unfortunately that is also the least formalised, which is really bad from an interoperability point of view. When the CDX Server API is formalised, a strict schema for URL canonicalisation is needed.
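To make the interoperability point concrete, here is a minimal Python sketch of one possible canonicalisation scheme. The specific rules (case folding, default-port stripping, query sorting) are illustrative assumptions, not the ones a formal spec would necessarily mandate:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def canonicalize(url):
    """One possible canonicalisation: lowercase the scheme and host,
    drop the default port and the fragment, default the path to "/",
    and sort query parameters so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port is not None and parts.port != default_port:
        host = "%s:%d" % (host, parts.port)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, parts.path or "/", query, ""))
```

With these rules, `HTTP://Example.COM:80/a?b=2&a=1` and `http://example.com/a?a=1&b=2` map to the same key. The danger is exactly that two institutions picking different rules here will produce incompatible indexes, hence the need for a strict schema.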

Likewise, the digest format needs to be fixed, to allow for cross-institution lookups. As the digests do not need to be cryptographically secure, the algorithm chosen (hopefully) does not become obsolete with age.

It would be possible to allow for variations of both canonicalisation and digest, but in that case it would be as extensions rather than core.
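For illustration, the digest form commonly seen in CDX files today is a SHA-1 over the payload, Base32-encoded; whatever algorithm the spec ultimately fixes, the point is that it must be identical everywhere for cross-institution lookups to work:

```python
import base64
import hashlib

def payload_digest(payload: bytes) -> str:
    """SHA-1 over the payload bytes, Base32-encoded: the digest form
    commonly seen in CDX files today. SHA-1's 20 bytes encode to
    exactly 32 Base32 characters, so no padding appears."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
```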

CDX (external) format

CDX can be seen as a way of representing a corpus, as discussed at the RESAW workshop in December 2015.

  • From a shared external format perspective, tool-specific requirements such as sorting of entries or order of fields are secondary or even irrelevant.
  • Web archives tend to contain non-trivial amounts of entries, so the size of a CDX matters.

With this in mind, the pure minimal amount of fields would be something like original URL, timestamp and digest. The rest is a matter of expanding the data from the WARC files. On a more practical level, having the WARC filename and WARC offset is probably a good idea.
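A record along those lines might look like this; the field names, order, separator and digest value are illustrative only, not a specification:

```python
# Hypothetical minimal CDX record: original URL, timestamp, digest,
# plus the pragmatic WARC filename and WARC offset.
FIELDS = ("original_url", "timestamp", "digest", "warc_filename", "warc_offset")

def parse_minimal_cdx(line):
    """Parse one whitespace-separated minimal CDX line into a dict."""
    record = dict(zip(FIELDS, line.split()))
    record["warc_offset"] = int(record["warc_offset"])
    return record

record = parse_minimal_cdx(
    "http://example.com/page 20160318130000 "
    "C7O6HDRYHZ3F2DISZQRQCMQEPMA6T5NA crawl-2016.warc.gz 4096"
)
```

Everything beyond these fields can be regenerated by seeking to the offset in the named WARC file and re-reading the record.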

The thing not to require is the canonicalized URL: It is redundant, as it can be generated directly from the original URL, and it unnecessarily freezes the canonicalization method to the CDX format.

Allowing for optional extra fields after the core is again pragmatic. JSON is not the most compact format when dealing with tables of data, but it has the flexibility of offering entry-specific fields. CDXJ deals with this, although it does not specify the possible keys inside of the JSON blocks, meaning that the full CDX file has to be iterated to get those keys. There is also a problem of simple key:value JSON entries, which can be generically processed, vs. complex nested JSON structures, which requires implementation-specific code.
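For the simple key:value case, a flat CDXJ-style entry can be processed generically; the field names inside the JSON block here are hypothetical:

```python
import json

def parse_cdxj(line):
    """Split a CDXJ-style line into its sort key (SURT-form URL),
    timestamp, and JSON block. maxsplit=2 keeps the spaces inside
    the JSON block intact. Only flat key:value blocks like this one
    can be processed generically; nested structures need
    implementation-specific code."""
    key, timestamp, blob = line.split(" ", 2)
    return key, timestamp, json.loads(blob)

key, ts, fields = parse_cdxj(
    'com,example)/page 20160318130000 '
    '{"url": "http://example.com/page", "mime": "text/html", "offset": 4096}'
)
```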

CDX (internal) format

Having a canonicalized and SURTed URL as the primary field and having the full CDX file sorted is an optimization towards specific tools. Kristinn touches lightly on the problem with this by suggesting that the record type might be better positioned as the secondary field (as opposed to timestamp) in the CDX format.

It follows easily that the optimal order or even representation of fields depends on the tools as well as the use case. But how the tools handle CDX data internally really does not matter, as long as they expose the CDX Server API correctly and allow for export in an external CDX format. The external format should not be dictated by internal use!

CDX (internal vs. external) format

CDX is not a real format yet, but tools do exist and they expect some loosely defined common representation in order to work together. As such, it is worth considering whether some of the traits of current CDXes should spill over to an external format. Primarily that would be the loosely defined canonicalized URL as the first field and the sorted nature of the files. In practice that would mean a near doubling of file size due to the redundant URL representation.

CDX implementation

The sorted nature of current CDX files has two important implications: Calculating the intersections between two files is easy and lookup on primary key (canonicalised URL) can algorithmically be done in O(log n) time using binary search.

In reality, simple binary search works poorly on large datasets, due to the lack of locality. This is worsened by slow storage types such as spinning drives and/or networked storage. There are a lot of tricks to remedy this, from building indexes to re-ordering the entries. The shared theme is that the current non-formalised CDX files are not directly usable: They require post-processing and extensions by the implementation.
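For reference, the plain O(log n) lookup that the sorted format enables can be sketched with Python's bisect module. Note that this toy version materialises the key list in memory; a real implementation would bisect on byte offsets within the file and run straight into the locality problems described above:

```python
import bisect

def cdx_lookup(sorted_lines, key):
    """Find all entries for one canonicalised URL in sorted CDX data.
    bisect_left locates the start of the matching run in O(log n)
    comparisons; a linear scan then collects the run itself."""
    keys = [line.split(" ", 1)[0] for line in sorted_lines]
    start = bisect.bisect_left(keys, key)
    matches = []
    for line in sorted_lines[start:]:
        if line.split(" ", 1)[0] != key:
            break
        matches.append(line)
    return matches
```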

The takeaway is that the value of a binary-search-optimized external CDX format diminishes as scale goes up and specialized implementations are created. Wholly different lookup technologies, such as general purpose databases or search engines, have zero need for the format-oriented optimizations.



Library Tech Talk (U of Michigan): Developing Pathways to Full-text Resources with User Journeys

Fri, 2016-03-18 00:00

How does a library present the right information to patrons at the right time and place in the face of changing services, new technologies and vendors? User Journeys provide a way to create and improve what information, services and tools will help users on their path to the resources and services they seek. Find out what insights our team gained from developing User Journeys and we'll tell you about tools, resources and templates you can use to make your own!

Evergreen ILS: Evergreen 2.10.0 released

Thu, 2016-03-17 22:57

Thanks to the efforts of many contributors, the Evergreen community is pleased to announce the release of version 2.10.0 of the Evergreen open source integrated library system. Please visit the download page to get it!

New features and enhancements of note in Evergreen 2.10.0 include:

  • Improved password management and authentication. Evergreen user passwords are now stored with additional layers of encryption and may only be accessed directly by the database, not the application layer.
  • To improve privacy and security, Evergreen now stores less data about credit card transactions.
  • A new library setting has been added which enables a library to prevent their patrons from being opted in at other libraries.
  • To improve patron privacy, patron checkout history is now stored in a separate, dedicated database table instead of being derived from the main circulation data.
  • Patrons can now delete titles that they do not wish to appear in their checkout history.
  • A new action/trigger event definition (“New User Created Welcome Notice”) has been added that will allow you to send a notice after a new patron has been created.
  • The web staff client now includes a patron editor/registration form.
  • Funds are now marked as paid when the invoice is marked as closed rather than when the invoice is created.
  • A new “paid” label appears along the bottom of each line item in the PO display when every non-canceled copy on the line item has been invoiced.
  • The MARC stream importer is now able to import authority records as well as bibliographic records.
  • When inspecting a queue in MARC Batch Import/Export, there is now a link to download to MARC file any records in the queue that were not imported into the catalog.
  • Coded value maps have been added for a variety of fixed fields.
  • MARC batch import can now assign monograph part labels when adding or overlaying copies.
  • The stock indexing definitions now include a search and facet index on the form/genre field (tag 655).
  • The web staff client now includes preview functionality for cataloging, including MARC record editing, authority maintenance, and volume/copy management.
  • HTML reports can now be sorted by clicking on the header for a given column.
  • Evergreen’s unAPI support now includes access to many more record types.

With the release of 2.10.0, bugfixes for the web staff client will now be considered for backporting to maintenance releases in the 2.10.x series, particularly in the areas of circulation and patron management.  Libraries are encouraged to try out the web staff client and file bug reports for it.

Support for PostgreSQL 9.1 is deprecated as of the release of Evergreen 2.10. Users are recommended to install Evergreen on PostgreSQL 9.2 or later. In the next major release following 2.10, Evergreen will no longer officially support PostgreSQL 9.1.

For more information about what’s in the release, check out the release notes.

Jenny Rose Halperin: Media for Everyone?

Thu, 2016-03-17 21:42
Media for Everyone? User empowerment and community in the age of subscription streaming media

The Netflix app is displayed alongside other streaming media services. (Photo credit: Matthew Keys / Flickr Creative Commons)

Fragments of an Information Architecture

In 2002, Tim O’Reilly wrote the essay “Piracy is Progressive Taxation and other thoughts on the evolution of online distribution,” which makes several salient points that remain relevant as unlimited, native, streaming media continues to take the place of the containerized media product. In the essay, he predicts the rise of streaming media as well as the intermediary publisher on the Web that serves its purpose as content exploder. In an attempt to advocate for flexible licensing in the age of subscription streaming media, I’d like to begin by discussing two points in particular from that essay: “Obscurity is a far greater threat to authors and creative artists than piracy” and “’Free’ is eventually replaced by a higher-quality paid service.”

As content becomes more fragmented and decontainerized across devices and platforms (the “Internet of Things”), I have faith that expert domain knowledge will prevail in the form of vetted, quality materials, and streaming services provide that curation layer for many users. Subscription services could provide greater visibility to artists by providing unlimited access and new audiences. However, the current licensing regulations surrounding content on streaming subscription services privilege the corporation rather than the creator, further exercising the hegemony of the media market. The first part of this essay will discuss the role of serendipity and discovery in streaming services and how they contribute to user engagement. Next, I will explore how Creative Commons and flexible licensing in the age of unlimited subscription media can return power to the creator by supporting communities of practice around content creation in subscription streaming services.

Tim O’Reilly’s second assertion that “’Free’ is eventually replaced by a higher-quality paid service” is best understood through the lens of information architecture. In their seminal work Information Architecture for the World Wide Web, Morville, Arango, and Rosenfeld write about how most software solutions are designed to solve specific problems, and as they outgrow their shells they become ecosystems, thereby losing clarity and simplicity. While the physical object’s data is constrained within its shell, the digital object provides a new set of metadata based on the user’s usage patterns and interests. As media spreads out among a variety of types, devices, and spaces, platforms cease to define the types of content that people consume, with native apps replacing exportable, translatable solutions like the MP3 or PDF. Paid services utilize the data from these ecosystems and create more meaningful consumption patterns within a diverse media landscape.

What content needs is coherency, that ineffable quality that helps us create taxonomy and meaning across platforms. Streaming services provide a comfortable architecture so users don’t have to rely on the shattered, seemingly limitless, advertising-laden media ecosystem of the Internet. Unlimited streaming services provide the coherency that users seek in content, and their focus should be on discoverability and engagement.

If you try sometimes, you might get what you need: serendipity and discoverability in streaming media

Not all streaming services operate within the same content model, which provides an interesting lens to explore the roles of a variety of products. Delivering the “sweet spot” of content to users is an unfulfillable ideal for most providers, and slogging through a massive catalog of materials can quickly cause information overload.

When most content is licensed and available outside of the service, discoverability and user empowerment should be the primary aim of the streaming media provider.

While Spotify charges $9.99 per month for more music than a person can consume in their entire lifetime, the quality of the music is not often listed as a primary reason why users engage with the product. In fact, I would describe most of the music on Spotify as “not to my taste,” and yet I pay every month for premium access to the entire library. At Safari Books Online, we focused on content quality in addition to scope, with “connection to expert knowledge” and subject matter coherency being a primary reason why subscribers paid premium prices rather than relying on StackOverflow or other free services.

Spotify’s marketing slogan, “Music for everyone,” focuses on its content abundance, freemium model, and ease of use rather than its quality. The marketing site does not mention the size of Spotify’s library, but the implications are clear: it’s huge.

These observations raise a few questions:

  1. Would I still pay $9.99 per month for a similar streaming service that only provided music in the genres I enjoy like jazz, minimal techno, or folk by women in the 1970s with long hair and a bone to pick with Jackson Browne?
  2. What would I pay to discover more music in these genres? What about new music created by lesser-known artists?
  3. What is it about Spotify that brought me back to the service after trying Apple Music and Rdio? What would bring me back to Safari if I tried another streaming media service like Lynda or Pluralsight?
  4. How much will users pay for what is, essentially, an inflexible native discoverability platform that exists to allow them access to other materials that are often freely available on the Web in other, more exportable formats?

Serendipity and discoverability were the two driving factors in my decision to stay with Spotify as a streaming music service. Spotify allows for almost infinite taste flexibility and makes discoverability easy through playlists and simple search. In addition, a social feed allows me to follow my friends and discover new music. Spotify bases its experience on my taste preferences and social network, and I can easily skip content that is irrelevant or not to my liking.

By contrast, at Safari, while almost every user lauded the diversity of content, most found the amount of information overwhelming and discoverability problematic. As a counter-example, the O’Reilly Learning Paths product has been immensely popular on Safari, even though the “paths” consist of recycled content from O’Reilly Media repackaged to improve discoverability. While the self-service discovery model worked for some, for most of our users guidance through the library in the form of “paths” provides a serendipitous adventure through content that keeps them wanting more.

Music providers like Tidal have experimented with exclusive content, but content wants to be free on the Internet, and streaming services should focus on user need and findability, not exclusivity. Just because a Beyonce single drops first on Tidal, it doesn’t mean I can’t torrent it soon after. In Spotify, the “Discover Weekly” playlists as well as the ease of use of my own user-generated playlists serve the purpose of “exclusive content.” By providing me the correct dose of relevant content through playlists and social connection, Spotify delivers a service that I cannot find anywhere else, and these discoverability features are my primary product incentive. Spotify’s curated playlists, even algorithmically calculated ones, feel home-spun, personal, and unique, which is close to product magic.

There seems to be an exception to this rule in the world of streaming television, where users generally want to be taken directly to the most popular exclusive content. I would argue that the Netflix ecosystem is much smaller than that of a streaming business, technical library, or music service. This is why Netflix can provide a relatively limited list of rotating movies while focusing on its exclusive content, while services like Spotify and Safari consistently grow their libraries to delight their users with the extensive amount of content available.

In fact, most people subscribe to Netflix for its exclusive content, and streaming television providers that lag behind (like Hulu), often provide access to content that is otherwise easily discoverable other places on the Web. Why would I watch Broad City with commercials on Hulu one week after it airs when I can just go to the Free TV Project and watch it an hour later for free? There is no higher quality paid service than free streaming in this case, and until Hulu strikes the balance between payment, advertising, licensed content, and exclusive content, they will continue to lag behind Netflix.

As access to licensed content becomes centralized and ubiquitous among a handful of streaming providers, it should be the role of the streaming service to provide a new paradigm that supports the role of artists in the 21st century and challenges the dominant power structures within the licensed media market.

Shake it off, Taylor: the dream of Creative Commons and the power of creators

As a constantly evolving set of standards, Creative Commons is one way that streaming services can focus on a discoverability and curation layer that provides the maximum benefit to both users and creators. If we allow subscription media to work with artists rather than industry, we can increase the power of the content creator and loosen stringent, outdated copyright regulations. I recognize that much of this section is a simplification of the complex issue of copyright, but I wish to create a strawman that brings to light what Lawrence Lessig calls “a culture in which creators get to create only with the permission of the powerful, or of creators from the past.” The unique positioning of streaming, licensed content is no longer an issue that free culture can ignore, and creating communities of practice around licensing could ease some of the friction between artists and subscription services.

When Taylor Swift withheld her album from Apple Music because the company would not pay artists for its temporary three-month trial period, it sent a message to streaming services that withholding pay from artists is not acceptable. I believe that Swift made the correct choice to take a stand against Apple for not paying artists, but I want to throw a wrench into her logic.

Copies of 1989 have probably been freely available on the Internet since before its “official” release. (The New Yorker ran an excellent piece on leaked albums last year.) By not providing her album to Apple Music but also not freely licensing it, Swift chose to operate under the old rules that govern content, where free is the exception, not the norm.

Creative Commons provides the framework and socialization that could provide streaming services relevancy and artists the new audiences they seek. The product that users buy via streaming services is not necessarily music or books (they can buy those anywhere), it is the ability to consume it in a manner that is organized, easy, and coherent across platforms: an increased Information Architecture. The flexible licensing of Creative Commons could at least begin the discussion to cut out the middle man between streaming services, licensing, and artists, allowing these services to act more like Soundcloud, Wattpad, or Bandcamp, which provide audience and voice to lesser-known artists. These services do what streaming services have so far failed to do because of their licensing rules: they create social communities around media based on user voice and community connection.

The outlook for both the traditional publishing and music industries is similarly grim, and to ignore the power of the content creator is to lapse into obscurity. While many self-publishing platforms present Creative Commons licensing as a matter of course and pride, subscription streaming services usually present all content as equally, stringently licensed. Spotify’s largest operating costs are licensing costs, and most of the revenue in these transactions goes to the licensor, not the artist. To rethink a model that puts trust and power in the creator could provide a new paradigm under which creators and streaming services thrive. This could take shape in a few ways:

  • Content could be licensed directly from the creator and promoted by the streaming service.
  • Content could be exported outside of the native app, allowing users to distribute and share content freely according to the wishes of its creator.
  • Content could be directly uploaded to the streaming service, vetted or edited by the service, and signal boosted according to the editorial vision of the streaming content provider.

When Safari moved from books as exportable PDFs to a native environment, some users threatened to leave the service, viewing the native app as a loss of functionality. This exodus reminds me that while books break free of their containers, the coherence of the ecosystem maintains that users want their content in a variety of contexts, usable in a way that suits them. Proprietary native apps do not provide that kind of flexibility. By relying on native apps as a sole online/offline delivery mechanism, streaming services ultimately disenfranchise users who rely on a variety of IoT devices to consume media. Creative Commons could provide a more ethical licensing layer to rebalance the power differential that streaming services continue to uphold.

The right to read, listen, and watch: streaming, freedom, and pragmatism

Several years ago, I would probably have scoffed at this essay, wondering why I even considered streaming services as a viable alternative to going to the library or searching for content through torrents or music blogs, but I am fundamentally a pragmatist and seek to work on systems that present the most exciting vision for creators. 40 million Americans have a Netflix account and Spotify has over 10 million daily active users. The data they collect from users is crucial to the media industry’s future.

To ignore or deny the rise of streaming subscription services as a content delivery mechanism has already damaged the free culture movement. While working with subscription services feels antithetical to its goals, content has moved closer and closer toward Stallman’s dystopian vision from 1997 and we need to continue to create viable alternatives or else continue to put the power in the hands of the few rather than the many.

Licensed streaming services follow the through line of unlimited content on the Web, and yet most users want even more content, and more targeted content for their specific needs. The archetype of the streaming library is currently consumption, with social sharing as a limited exception. Services like Twitter’s Vine and Google’s YouTube successfully create communities based on creation rather than consumption and yet they are constantly under threat, with large advertisers still taking the lion’s share of profits.

I envision an ecosystem of community-centered content creation services that are consistently in service to their users, and I think that streaming services can take the lead by considering licensing options that benefit artists rather than corporations.

The Internet turns us all into content creators, and rather than expanding ecosystems into exclusivity, it would be heartening to see a streaming app that is based on community discoverability and the “intertwingling” of different kinds of content, including user-generated content. The subscription streaming service can be considered as industry pushback in the age of user-generated content, yet it’s proven to be immensely popular. For this reason, conversations about licensing, user data, and artistic community should be a primary focus within free culture and media.

The final lesson of Tim O’Reilly’s essay is: “There’s more than one way to do it,” and I will echo this sentiment as the crux of my argument. As he writes, “’Give the wookie what he wants!’… Give it to [her] in as many ways as you can find, at a fair price, and let [her] choose which works best for [her].” By amplifying user voice in curation and discoverability as well as providing a more fair, free, and open ecosystem for artists, subscription services will more successfully serve their users and creators in ways that make the artistic landscape more humane, more diverse, and yes, more remixable.

DPLA: Announcing our fourth class of Community Reps

Thu, 2016-03-17 17:00

We are extremely excited to introduce and welcome our fourth class of DPLA Community Reps: volunteers who engage their local communities by leading DPLA outreach activities. We received a great response to our fourth call for applicants, and we’re pleased to now add another fantastic group of Community Reps to our outstanding and dedicated corps of volunteers from the first three classes.

Our fourth class continues our success at bringing together volunteers from all over the US representing diverse fields and backgrounds. Our newest reps work in K-12 schools, public libraries, state libraries, municipal archives, public history and museums, technology, genealogy, education technology, and many areas of higher education. This round we are excited to have a very strong cohort of educators as well as representation from diverse disciplines including psychology, social work, art history, and studio art.

Our newest reps have already shared some of their great ideas for connecting new communities with DPLA and we’re eager to support this new class’ creative outreach and engagement work.  We thank them for helping us grow the DPLA community! For more detailed information about our Reps and their plans, including the members of the fourth class, please visit our Meet the Reps page.

The next call for our fifth class of Reps will take place early next year (January 2017).  To learn more about this program and follow our future calls for applicants, check out our Community Reps page.

David Rosenthal: Dr. Pangloss loves technology roadmaps

Thu, 2016-03-17 15:00
It's nearly three years since we last saw the renowned Dr. Pangloss chuckling with glee at the storage industry's roadmaps. But last week he was browsing Slashdot and found something much to his taste. Below the fold, an explanation of what the good Doctor enjoyed so much.

The Slashdot entry that caught the Doctor's eye was this:
"Several key technologies are coming to market in the next three years that will ensure data storage will not only keep up with but exceed demand. Heat-assisted magnetic recording and bit-patterned media promise to increase hard drive capacity initially by 40% and later by 10-fold, or as Seagate's marketing proclaims: 20TB hard drives by 2020. At the same time, resistive RAM technologies, such as Intel/Micron's 3D XPoint, promise storage-class memory that's 1,000 times faster and more resilient than today's NAND flash, but it will be expensive — at first. Meanwhile, NAND flash makers have created roadmaps for 3D NAND technology that will grow to more than 100 layers in the next two to three generations, increasing performance and capacity while ultimately lowering costs to that of hard drives. 'Very soon flash will be cheaper than rotating media,' said Siva Sivaram, executive vice president of memory at SanDisk."

The article by Lucas Mearian that sparked the post has the wonderfully Panglossian title "These technologies will blow the lid off data storage," and it has some quotes that the Doctor really loves, such as the one above from Siva Sivaram, and the ASTC technology roadmap (2015) showing a 30% Kryder rate from next year with HAMR and BPM shipping in 2021.

But the good Doctor tends not to notice that the article is actually careful to balance things like "HAMR technology will eventually allow Seagate to achieve a linear bit density of around 10 trillion (10Tbits) per square inch" with "Seagate has already demonstrated HAMR HDDs with 1.4Tbits per square inch" (my emphasis). If you pay attention to these caveats it is actually a good survey of the technologies still out on the roadmap.
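A quick sanity check on the roadmap arithmetic is instructive. The sketch below assumes an 8TB flagship drive shipping in 2016; that base figure is an illustrative assumption, not a number from the article:

```python
# What annual capacity growth does Seagate's "20TB by 2020" claim imply,
# starting from an assumed 8TB flagship drive in 2016?
base_tb, target_tb, years = 8, 20, 4

implied_rate = (target_tb / base_tb) ** (1 / years) - 1
print(f"implied annual growth: {implied_rate:.1%}")  # ~25.7% per year

# Conversely, the ASTC roadmap's 30% Kryder rate from the same base gives:
projected_tb = base_tb * 1.30 ** years
print(f"8TB at 30%/yr for 4 years: {projected_tb:.1f}TB")  # ~22.8TB
```

So the marketing claim and the roadmap are mutually consistent; the question, as the slipping HAMR dates show, is whether either rate will actually be achieved.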

But curmudgeons like me remember that back in 2013 the Doctor was rubbing his hands over statements like:
Seagate is projecting HAMR drives in 2014 and WD in 2016.

In 2016 we hear that:

Seagate plans to begin shipping HAMR HDDs next year.

So in three years HAMR has gone from next year to "next year". Not to mention the graph I keep pointing to from 2008 showing HAMR taking over in 2009 and BPM taking over in 2013. So actually HAMR has taken 8 years to go from next year to next year. And BPM has taken 8 years to go from 5 years out to 5 years out.

Why is this? As the technologies get closer and closer to the physical limits, the difficulty and cost of moving from "demonstration" to "shipping" increase. For example, let's suppose Seagate could demonstrate HAMR in 2013 and will ship it in 2017. BPM is even harder than HAMR, so if it is going to ship in 2021 it should be demonstrable this year. Has anyone heard that it will be?

The article also discusses 3D NAND flash, which also featured in Robert Fontana's wonderful presentation to the Library of Congress Storage Architecture workshop. From his slides I extracted the cost ratio between flash and hard disk for the period 2008-2014, showing that it was converging very slowly. Eric Brewer made the same point in his FAST 2016 keynote. Flash is a better medium than hard disk, so even if the manufacturing cost per byte were the same, the selling price for flash would be higher. But, as the article points out:
factories to build 3D NAND are vastly more expensive than plants that produce planar NAND or HDDs -- a single plant can cost $10 billion

so no-one is going to make the huge investment needed for 3D NAND to displace hard disks from the cloud storage market because it wouldn't generate a return.
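To see how slowly a cost gap like this closes, consider a toy convergence model. The starting ratio and decline rates below are illustrative assumptions, not Fontana's actual figures:

```python
# Toy model: flash starts at 10x the $/GB of disk; flash cost falls 20%/yr,
# disk cost falls 10%/yr. How long until flash reaches price parity?
import math

ratio = 10.0          # flash $/GB divided by disk $/GB (assumed)
flash_decline = 0.20  # flash cost decline per year (assumed)
disk_decline = 0.10   # disk cost decline per year (assumed)

# Each year the ratio shrinks by this factor:
per_year = (1 - flash_decline) / (1 - disk_decline)
years = math.log(ratio) / -math.log(per_year)
print(f"years to parity: {years:.0f}")  # ~20 years at these rates
```

Even with flash costs falling twice as fast as disk's, parity is decades away at these rates, which is why "very soon" in Sivaram's quote deserves skepticism.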

The article also covers "Storage Class Memories" (SCM) such as Intel/Micron's Xpoint, mentioning price:
even if Intel's Xpoint ReRAM technology enters the consumer PC marketplace this year, its use will be limited to the highest-end products due to cost.

Actually, Intel isn't positioning Xpoint as a consumer storage technology but, as shown in the graph, as initially being deployed as an ultra-fast but non-volatile layer between DRAM and flash.

As I commented on Fontana's presentation:
The roadmaps for the post-flash solid state technologies such as 3D Xpoint are necessarily speculative, since they are still some way from shipping in volume. But by analogy with flash we can see the broad outlines. They are a better technology than flash, 1000 times faster than NAND, 1000 times the endurance, and 100 times denser. So even if the manufacturing cost were the same, they would command a price premium. The manufacturing cost will initially be much higher because of low volumes, and will take time to ramp down.

So, despite the good Doctor's enthusiasm, revolutionary change in the storage landscape is unlikely. We are unlikely to see ASTC's 30% Kryder rate, 3D NAND will not become cheaper for bulk storage than hard disk, and SCM will not have a significant effect on the cost of storage in the foreseeable future.

Islandora: The State of the CLAW

Thu, 2016-03-17 14:25

The state of the work is that it is in progress. Like Fedora 4, CLAW is a complete rewrite of the entire Islandora stack. It is a collaborative community effort, and needs the resources of the community. An update on the project was included in the most recent Islandora Community Newsletter. You can check that out here.

  • We have weekly CLAW Calls that you are more than welcome to join us on, and add items to the agenda.
  • We send updates to the list each week after each call, and you can view them all here.
  • We have monthly sprints which are held during the last two weeks of the month. If you (or your colleagues) are in a position to join, you are more than welcome to join us there too.
  • We also have weekly CLAW lessons which are led by Diego Pino Navarro. You can find more information on them here.
  • Data model, and Hydra interoperability? We're working on implementing the Portland Common Data Model (PCDM). More is available on that here and here.

If you want to see CLAW completed faster, you can help! 

  • Contribute developer time. Either your own, or some developer time from your institution. Not comfortable with the stack? That's what CLAW lessons are for!
  • Contribute funds: The Islandora Foundation is very close to having the necessary membership funding to hire a Technical Lead, who could devote a lot more time to coordinating the development of CLAW than our current volunteer team has available. Joining the Islandora Foundation has many benefits, but adding a Technical Lead to the project will be a big one in the CLAW corner.
  • Contribute opinions: We need to know how you want CLAW to work. You are welcome to attend the weekly CLAW Call detailed above. Please also watch the listserv for questions about features and use cases.

pinboard: Update: Digital Film Historiography – A Bibliography | Film History in the Making

Thu, 2016-03-17 14:02
Video annotation for film historiography: An update to @DerKleineMozart's bibliography #dighum #code4lib

Equinox Software: Welcome, Rogan!

Thu, 2016-03-17 13:48

Equinox is pleased to welcome Rogan Hamby to the team!  He will join us as a Project and Data Analyst.  Rogan got his undergrad degree in English Lit with minors in Computer Science and Sociology and then received his MLIS from the University of South Carolina.  By repeatedly graduating he proved to be really bad at being the professional student he had aspirations to be.  Deciding that being a reference librarian was the next best thing, he went off and over twenty years has done nearly every job you can do in a public library, many simultaneously.

In 2009 Rogan was involved with the creation of an Evergreen based public library consortium that he supported for eight years, overseeing operations, migrations and special projects.  Believing in the philosophical principles of open source software and the cultural mission of libraries he found Evergreen to be the tool he needed and the community to be people he shared values with.  When the time came for his next adventure Rogan realized that he wanted to remain in the Evergreen community and find new ways to aid libraries in their missions.  Fortunately, he found a place at Equinox to do that.

Outside of work Rogan has managed to be married, have three children, learn archery from Buddhist monks and once play D&D with Dave Arneson.  He doesn’t have a favorite book but has re-read works by and about J. R. R. Tolkien more than is probably healthy.

Grace Dunbar, Equinox Vice President, had this to say about the hire:  “We’re delighted to be adding Rogan to our team of librarians and techies… and techie-librarians.  Many of us have known Rogan for years and we deeply value his experience both in libraries and in the Evergreen community.  We also respect his deep and abiding love of Tolkien.”

Open Knowledge Foundation: Open Data Day Guyana – Bringing Open Street Map to the classroom

Thu, 2016-03-17 13:21

This blog post was written by Vijay Datadin from the GIS collective

Open Data is a new and still not very well understood concept in Guyana, as is probably the case in other countries as well. The GIS Collective, a group of volunteers, each highly skilled and experienced in Geographic Information Systems (GIS), know the value of data being available to help a country to develop, and the hurdles posed by unavailable or outdated data.

Secondary school teachers can impart their knowledge to the upcoming generation of youth on the subject. The GIS Collective therefore offered a short seminar on open data for secondary school Geography and IT teachers based in and around the capital city, Georgetown, working through the office of the Guyana Chief Education Officer (CEO) and with the support of the Assistant CEO for Secondary Schools. The event was hosted on the 11 March 2016 at the National Centre for Educational Resource Development (NCERD) located in the Kingston ward of Georgetown.

The idea of open data was briefly presented and discussed, that is ‘What is Open Data?’ and ‘What Open Data does for National Development’. However the main part of the seminar involved the teachers learning-by-doing, producing open data themselves.

The teachers were introduced to a source of open spatial data – Open Street Map (OSM) and taught to use and edit it themselves. The teachers were organised into groups of 4-6 people and using Field Papers to make notes, they walked and surveyed various parts of the surrounding area of the city. Using laptops and the OSM iD editor the teachers then transferred their observations to OSM, digitizing building outlines, naming and describing landmarks, and so on.

The group enriched OSM by adding information on Government Ministries, Embassies, private companies and other buildings, and historic structures such as the Georgetown Lighthouse (built 1830), the Umana Yana (a national landmark built by indigenous peoples) and the Georgetown Seawall Roundhouse (built 1860).

The teachers were enthusiastic participants, and enjoyed the hands-on approach of the seminar. Some have apparently already continued to edit OSM in other areas of Guyana in the days following the seminar. The organisers are grateful for the support of the Guyana Ministry of Education and Open Knowledge International.

Library of Congress: The Signal: 22 Opportunities in Web Archiving! A Harvard Library Report

Thu, 2016-03-17 12:22

The following is a guest post by Andrea Goethals, Digital Preservation and Repository Manager at Harvard Library.

It’s St. Patrick’s Day, so I wanted to have a catchy Irish saying for the title but, believe it or not, Irish sayings about web archiving or even the web are hard to find. I did find some great phrases though, especially “You must take the little potato with the big potato.” Potatoes seem to be a common theme in Irish sayings, along with rain.

In the last couple years within Harvard Library, when we haven’t been thinking about our own frequently inclement weather, we have been thinking a lot about web archiving and what our strategy should be for scaling up our web archiving activities. We wanted to know more about the current practices, needs and expectations of other institutions who are either actively engaged in web archiving or would like to be, and if our institutions had common needs that might be addressed by collaborative efforts.

With the generous support of the Arcadia Fund, my colleague Abigail Bordeaux and I worked closely with Gail Truman of Truman Technologies to conduct a five-month environmental scan of web archiving programs, practices, tools and research. The final report is now available from Harvard’s open access repository, DASH.

The heart of the study was a series of interviews with web archiving practitioners from archives, museums and libraries worldwide; web archiving service providers; and researchers who use web archives. The interviewees were selected from the membership of the International Internet Preservation Consortium, the Web Archiving Roundtable at the Society of American Archivists, the Internet Archive’s Archive-It Partner Community, the Ivy Plus institutions, Working with Internet archives for REsearch (Rutgers/WIRE Group), and the Research infrastructure for the Study of Archived Web materials (RESAW).

The interviews of web archiving practitioners covered a wide range of areas, everything from how the institution is maintaining their web archiving infrastructure (e.g. outsourcing, staffing, location in the organization), to how they are (or aren’t) integrating their web archives with their other collections. From this data, profiles were created for 23 institutions, and the data was aggregated and analyzed to look for common themes, challenges and opportunities.

In the end, the environmental scan revealed 22 opportunities for future research and development. These opportunities are listed in Table 1 (below) and described in more detail in the report. At a high level, these opportunities fall under four themes: (1) increase communication and collaboration, (2) focus on “smart” technical development, (3) focus on training and skills development and (4) build local capacity.

Table 1: The 22 opportunities for further research and development that emerged from the environmental scan

One of the biggest takeaways is that the first theme, the need to radically increase communication and collaboration among all individuals and organizations involved in some way in web archiving, was the most prevalent. Thirteen of the 22 opportunities fell under this theme. Clearly much more communication and collaboration is needed among those collecting web content but also between those who are collecting it and researchers who would like to use it.

This environmental scan has given us a great deal of insight into how other institutions are approaching web archiving, which will inform our own web archiving strategy at Harvard Library in the coming years. We hope that it has also highlighted key areas for research and development that need to be addressed if we are to build efficient and sustainable web archiving programs that result in complementary and rich collections that are truly useful to researchers.

Jonathan Rochkind: Followup: Reliable Capybara JS testing with RackRequestBlocker

Wed, 2016-03-16 18:19

My post on Struggling Towards Reliable Capybara Javascript Testing attracted a lot of readers, and some discussion on reddit.

I left there thinking I had basically got my Capybara JS tests reliable enough… but after that, things degraded again.

But now I think I really have fixed it for real, with some block/wait rack middleware based on the original concept by Joel Turkel, which I’ve released as RackRequestBlocker. This is middleware to keep track of ‘outstanding’ requests in your app that were triggered by a feature spec that has finished, and let the main test thread wait until they are complete before DatabaseCleaning and moving on to the next spec.

My RackRequestBlocker implementation is based on the new hotness concurrent-ruby (a Rails5 dependency, great collection of ruby concurrency primitives) instead of Turkel’s use of the older `atomic` gem, and using actual signal/wait logic instead of polling, and refactored to have IMO a more convenient packaged API. Influenced by Dan Dorman’s unfinished attempts to gemify Turkel’s design.

It’s only a few dozen lines of code, check it out for an example of using concurrent-ruby’s primitives to build something concurrent.
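The core idea can be sketched in plain Ruby. This is a simplified illustration using the stdlib Monitor rather than concurrent-ruby, and it is not the actual RackRequestBlocker API: the middleware counts in-flight requests, and the test thread blocks until the count drains to zero.

```ruby
require 'monitor'

# Sketch of block/wait middleware: track requests still being processed
# by the app under test, and let another thread wait for them to finish.
class RequestBlocker
  @lock = Monitor.new
  @cond = @lock.new_cond
  @count = 0

  class << self
    def increment
      @lock.synchronize { @count += 1 }
    end

    def decrement
      @lock.synchronize do
        @count -= 1
        @cond.broadcast if @count.zero?  # wake any waiting test thread
      end
    end

    # Called from the test thread, e.g. before DatabaseCleaner.clean.
    def wait_for_idle
      @lock.synchronize { @cond.wait_until { @count.zero? } }
    end

    def count
      @lock.synchronize { @count }
    end
  end

  def initialize(app)
    @app = app
  end

  def call(env)
    self.class.increment
    begin
      @app.call(env)
    ensure
      self.class.decrement
    end
  end
end
```

In a real suite you would insert this into the test-only middleware stack and call `RequestBlocker.wait_for_idle` in an after hook before database cleaning. Note one simplification: this sketch decrements as soon as `call` returns, while production middleware also has to account for lazily streamed response bodies (typically by wrapping the body's `close`).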

And my Capybara JS feature tests now appear to be very very reliable, and I expect them to stay that way. Woot.

To be clear, I also had to turn off DatabaseCleaner transactional strategy entirely, even for non-JS tests.  Just RackRequestBlocker wasn't enough, and neither was just turning off transactional strategy.  With either one by itself I still had crazy race conditions — including pg:deadlocks… and actual segfaults!

Why? I honestly am not sure. There’s no reason transactional fixture strategy shouldn’t work when used only for non-JS tests, even with RackRequestBlocker.  The segfaults suggest a bug in something C; MRI, pg, poltergeist? (poltergeist was very unpopular in the reddit thread on my original post, but I still think it’s less bad than other options for my situation.)  A bug of some kind in the test_after_commit gem we were using to make things work even with transactional fixture strategy? Honestly, I have no idea — I just accepted it, and was happy to have tests that were working.

Try out RackRequestBlocker, see if it helps with your JS Capybara race condition problems, let me know in comments if you want, I’m curious. I can’t support this super well, I just provide the code as a public service, because I fantasize of the day nobody has to go through as many hours as I have fighting with JS feature tests.

Filed under: General

District Dispatch: WHCLIST application deadline approaches

Wed, 2016-03-16 17:13

Library advocates! There is still time to apply for this year’s White House Conference on Library and Information Services Taskforce (WHCLIST) award. Applications are due on April 1, 2016 (no joke!).

This award is granted to a non-librarian participant in National Library Legislative Day (NLLD). The winner receives a stipend of $300 and two free nights at the Liaison hotel.

Masood Cajee accepts the 2015 award from former ALA president Courtney Young.

The criteria for the WHCLIST Award are:

  • The recipient should be a library supporter (trustee, friend, general supporter) and not a professional librarian.
  • Recipient should be a first-time attendee of NLLD.
  • Representatives of WHCLIST and the ALA Washington office will choose the recipient. The ALA Washington Office will contact the recipient’s senators and representatives to announce the award. The winner of the WHCLIST Award will be announced at NLLD.
  • The deadline for applications is April 1, 2016.

To apply for the WHCLIST award, please submit a completed NLLD registration form; a letter explaining why you should receive the award; and a letter of reference from a library director, school librarian, library board chair, Friend’s group chair, or other library representative to:

Lisa Lindle
Grassroots Communications Coordinator
American Library Association
1615 New Hampshire Ave., NW
First Floor
Washington, DC 20009
202-628-8419 (fax)

Note: Applicants must register for NLLD and pay all associated costs. Applicants must make their own travel arrangements. The winner will be reimbursed for two free nights in the NLLD hotel in D.C. and receive the $300 stipend to defray the costs of attending the event.

The post WHCLIST application deadline approaches appeared first on District Dispatch.

Library of Congress: The Signal: Advancing Institutional Progress through Digital Repository Assessment

Wed, 2016-03-16 17:00

The following is a guest post by Jessica Tieman.

U.S. Government Publishing Office Director Davita Vance-Cooks reveals the new govinfo site during the launch event on February 3rd. Photo: U.S. Government Publishing Office

Three quarters of the way into my twelve-month National Digital Stewardship Residency at the U.S. Government Publishing Office, I reflect on the successes and challenges of my project. I also recognize how the outcome of my work will impact the future of the GPO, its business units, the communities within the Federal Government and the general public that are all invested in the success of GPO’s audit and certification of the govinfo (formerly FDsys) repository.

In this post, I’ll give a brief update on my assigned role at GPO: to prepare govinfo for ISO 16363 Trustworthy Digital Repository Audit and Certification. I’ll also explain how preparing for the audit has served as a way for GPO to advance its strategic plan to transition from a print-centric model to a content-centric digital agency.

I have been collecting and evaluating all existing documentation relating to GPO’s govinfo to satisfy the requirements of the 109 criteria outlined in ISO 16363. Where necessary, I have also had the opportunity to participate in, and sometimes guide, the writing and development of documentation for procedures and processes based on digital preservation best practices that were not yet fully captured in existing documentation when I arrived at GPO.

In preparation for evaluating GPO’s documentation and readiness for an ISO 16363 audit, I interviewed TRAC-certified and OAIS-compliant digital repository managers to gather feedback about repository assessment to share with GPO internal staff. I am beginning the internal audit process.

GPO implemented a SharePoint folder-based system to organize all of their documentation and evidence by each criterion. Documentation includes workflows, roles and responsibilities, organizational charts, strategic plans, technical documentation, project specifications, meeting notes, planning documents and vision statements, data management definitions, risk registries, standard operating procedures, policies, gathered statistics on systems and users, and more. (The repository’s Architecture System Design document is available online.)

For me to evaluate GPO against each criterion, I will assess the content I have gathered within the SharePoint system for:
• Adequacy: how well it satisfies the specific criteria requirements;
• Transparency: whether the documentation truly captures the repository’s activities and practices and is written with clarity;
• Measurability: whether procedures have been documented in a manner that can be substantiated through measurable outcomes, such as not only listing end-user requirements but also providing the data collected on users to validate those requirements;
• Sustainability: whether the current processes are scalable and will remain effective over time as systems, people and funding change.

This project has been time-consuming and highly detailed. What surprised me most about it has been the dynamic nature of evaluating and creating “good” documentation. Many times I found that a document seemed to perfectly meet the expectations of a criterion, but later I realized the many ways in which it wasn’t actually enough.

There is another dynamic aspect to the project: advancing institutional change through assessment. The U.S. Government Publishing Office was the U.S. Government Printing Office for over 150 years until its recent name change in 2014. In addition to the intentional name change, GPO has increasingly been engaging in business and fostering internal developments to further its commitment to authenticating, preserving and distributing Federal information that remains critically important to American democracy in the digital age.

This is a unique opportunity for an ISO 16363 Audit and Certification of the govinfo repository to support GPO’s efforts. In many ways, the govinfo repository plays a critical role in this transformation as it exists as the primary source of Federal information products for all of GPO’s stakeholders and its user community, including all branches of Federal government, the depository and non-depository library community, local and state government, private industries, non-profit organizations, transparency organizations, legal professionals, researchers, data consumers, and the general public.

The value of the govinfo repository to the Federal Depository Library Program is changing due to GPO’s overall digital transformation. Indeed, the present-day FDLP program was initially codified in Title 44, Chapter 19 of the U.S. Code to mandate availability of government publications for public access. In 1993, Title 44 was expanded to include a mandate for electronic access to government publications in an online facility managed by GPO.

Since this time, GPO has fulfilled this responsibility by providing an open, free, publicly accessible preservation repository, with the goal of functioning as the official resource for government information products. In order to meet this goal, however, collection development for govinfo is essential to increase the variety of digital content within the repository, including digitized content submitted by FDLP libraries.

The repository itself is both impacting and reacting to this “digital stimulus.” This has an effect on how I determine the sustainability of the repository’s documentation in the context of the audit. The repository’s underlying technology will need to be flexible enough to anticipate content that may be arriving from new library partnerships. Staff must be agile enough to develop the workflows for handling producer-archive agreements, digitization guidelines and ingest processes.

For GPO, the eagerness for a certification process impels cross-functional decision-making across business units, receptiveness to new policies and procedures centered around digital publication and preservation standards, and a strong commitment to communicating these new policies and values to its user community, which includes its Federal stakeholders, the depository library community and the American public at large.