planet code4lib

Planet Code4Lib - http://planet.code4lib.org
Updated: 3 days 23 hours ago

District Dispatch: School libraries can’t afford to wait

Fri, 2015-02-27 19:46

A student at the ‘Iolani School in Hawaii

Today, we are ending a very busy week of work to include school library provisions in the reauthorization of the Elementary and Secondary Education Act (ESEA). As we move forward on this very important legislative effort to secure federal funding for school libraries, we are asking library advocates, teachers, parents and students to ask their Senators to become co-sponsors of the SKILLS Act (S 312). We are encouraging those lucky few advocates who live in states where one of their Senators is on the Senate HELP Committee to call their Washington office and ask their Senators to co-sponsor Senator Sheldon Whitehouse’s (D-RI) efforts to include the SKILLS Act in the Committee’s ESEA bill.

I’ve had many meetings in the Senate recently, and none of the congressional staff members I’ve met with have heard from any library supporters… so PLEASE MAKE THESE CALLS, AND ASK EVERYONE ELSE YOU KNOW TO CALL TOO.

Students deserve to go to a school with an effective school library program: take action now!

The post School libraries can’t afford to wait appeared first on District Dispatch.

District Dispatch: ALA welcomes Alternative Spring Break student Natalie Yee

Fri, 2015-02-27 19:25

Natalie Yee

Next week, we welcome University of Michigan student Natalie Yee to the American Library Association (ALA) Washington Office, where she will learn about information-related fields as part of her university’s School of Information Alternative Spring Break program. Yee will conduct research on online resources and tools produced by the ALA Washington Office (under my guidance as the press officer).

Yee is currently working to earn a master’s degree in Information, with a focus in Human-Computer Interaction. She previously earned a bachelor’s degree in Anthropology with a minor in Asian Studies from Colorado College.

The University of Michigan Alternative Spring Break program creates the opportunity for students to engage in a service-oriented integrative learning experience; connects public sector organizations to the knowledge and abilities of students through a social impact project; and facilitates and enhances the relationship between the School and the greater community.

In addition to ALA, the students are hosted by other advocacy groups such as the Future of Music Coalition as well as federal agencies such as the Library of Congress, the Smithsonian Institution, and the National Archives. The students get a taste of work life here in D.C. and an opportunity to network with information professionals.

“We are pleased to support students from the University of Michigan’s spring break program,” said Alan S. Inouye, director of ALA’s Office for Information Technology Policy. “We look forward to working collaboratively with Yee in the next week.”

The post ALA welcomes Alternative Spring Break student Natalie Yee appeared first on District Dispatch.

M. Ryan Hess: Three Emerging Digital Platforms for 2015

Fri, 2015-02-27 15:43

‘Twas a world of limited options for digital libraries just a few short years back. Nowadays, however, there are many more options, and the features and functionality are truly groundbreaking.

Before I dive into some of the latest whizzbang technologies that have caught my eye, let me lay out the platforms we currently use and why we use them.

  • Digital Commons for our institutional repository. This is a simple yet powerful hosted repository service. It has customizable workflows built in for managing and publishing online journals, conferences, e-books, media galleries and much more. And I’d emphasize the “service” aspect: the subscription includes notable SEO power, robust publishing tools, reporting and stellar customer service, and, of course, you don’t have to worry about the technical upkeep of the platform.
  • CONTENTdm for our digital collections. There was a time when OCLC’s digital collections platform appeared to be on a development trajectory that would take it out of the clunky mire it was in circa 2010. They’ve made strides, but development has not kept pace.
  • LUNA for restricted image reserve services. You and your faculty can build collections in this system popular with museums and libraries alike. Your collection also sits within the LUNA Commons, which means users of LUNA can take advantage of collections outside their institutions.
  • Omeka.net for online exhibits and digital humanities projects. The limited cousin to the self-hosted Omeka, this version is an easy way to launch multiple sites for your campus without having to administer multiple installs. But it has a limited number of plugins and options, so your users will quickly grow out of it.
The Movers and Shakers of 2015

There are some very interesting developments out there, so here is a brief overview of the three most ground-breaking, in my opinion.

PressForward

If you took Blog DNA and spliced it with Journal Publishing, you’d get a critter called PressForward: a WordPress plug-in that allows users to launch publications that approach publishing from a contemporary web perspective.

There are a number of ways you can use PressForward, but the most basic publishing model it’s intended for starts with treating other online publications (RSS feeds from individuals, organizations, other journals) as sources of submissions. Editors add external content feeds to their submission feed, which brings that content into their PressForward queue for consideration. Editors can then go through everything brought in automatically from outside and decide what to include in their publication. And of course, locally produced content can be included as well, if you’re so inclined.

Examples of PressForward include:

Islandora

Built on Fedora Commons with a Drupal front-end layer, Islandora is a truly remarkable platform that is growing in popularity at a good clip. A few years back, I worked with a local consortium examining various platforms, and we looked at Islandora. At the time, there were no examples of the platform in production use, and it felt more like an interesting concept than a tool we should recommend for our needs. Had we been looking at it today, I think it would have been our number one choice.

Part of the magic with Islandora is that it uses RDF triples to flatten your collections and items into a simple array of objects that can have unlimited relationships to each other. In other words, a single image can be associated with other objects that together form a single object (say, a book of images), and that book object can be part of a collection-of-books object, or, in fact, be connected to multiple other collections. This is a technical way of saying that it’s hyper flexible and yet very simple.
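
To make the idea concrete, here is a minimal sketch (Java, not Islandora’s actual API) of that flat objects-plus-relationships model; the object identifiers and predicate names are illustrative stand-ins for the RDF relationships Fedora actually stores:

    // A sketch of "everything is an object with relationships", using plain String
    // triples in place of a real RDF store. Identifiers and predicates are made up.
    import java.util.List;

    public class RelationshipSketch {
        record Triple(String subject, String predicate, String object) {}

        public static void main(String[] args) {
            // One image object relates to a book object, and that book object can be
            // a member of several collection objects at the same time.
            List<Triple> triples = List.of(
                new Triple("islandora:page1", "isMemberOf",           "islandora:book1"),
                new Triple("islandora:book1", "isMemberOfCollection", "islandora:rareBooks"),
                new Triple("islandora:book1", "isMemberOfCollection", "islandora:mapsAndPlates")
            );
            triples.forEach(t ->
                System.out.println(t.subject() + " --" + t.predicate() + "--> " + t.object()));
        }
    }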

And because Islandora is built on two widely used open source platforms, finding tech staff to help manage it is easy.

But if you don’t have the staff to run a Fedora-Drupal server, Lyrasis now offers hosted options that are just as powerful. In fact, one subscription model they offer allows you complete access to the Drupal back end if customization and development are important to you but you don’t want to waste staff time on updates and monitoring/testing server performance.

Either way, this looks like a major player in this space and I expect it to continue to grow exponentially. That’s a good thing too, because some aspects of the platform feel a little “not ready for prime time.” The Newspaper solution pack, for example, while okay, is nowhere near as cool as what Veridian can currently do.

ArtStor’s SharedShelf

Rapid development has taken this digital image collection platform to a new level with promises of more to come. SharedShelf integrates the open web, including DPLA and Google Images, with their proprietary image database in novel ways that I think put LUNA on notice.

Like LUNA, SharedShelf allows institutions to build local collections that can contain copyrighted works to be used in classroom and research environments. But what sets it apart is that it allows users to build beyond their institutions and push that content to the open web (or not, depending on the rights to the images they are publishing).

SharedShelf also integrates with other ArtStor services such as their Curriculum Guides that allow faculty to create instructional narratives using all the resources available from ArtStor.

The management layer is pretty nice and works well with a host of schema.

And, oh, apparently audio and video support is on the way.


CrossRef: Update for CrossCheck Users: iThenticate bibliography exclusion issue fix

Fri, 2015-02-27 14:18

This is an update for CrossRef members who participate in the CrossCheck service, powered by iThenticate.

Users of the iThenticate system had reported that the bibliography/reference exclusion option did not work properly when tables were appended after the bibliography of a document and a cell of the table contained a bibliography keyword (e.g. “references”, “works cited”, “bibliography”).

A fix for this issue was released on February 16th, 2015, so reference exclusion now works for documents that fit the criteria above. Please note that the enhancement applies to new manuscripts submitted to iThenticate from February 16th onward.

CrossCheck users should feel free to test out the bibliography exclusion feature, and do contact us if you have any questions or if you continue to experience this problem on any new submissions to the service.

LITA: Librarians: We Open Access

Fri, 2015-02-27 12:00

Open Access (storefront). Credit: Flickr user Gideon Burton

In his February 11 post, my fellow LITA blogger Bryan Brown interrogated the definitions of librarianship. He concluded that librarianship amounts to a “set of shared values and duties to our communities,” nicely summarized in the ALA’s Core Values of Librarianship. These core values are access, confidentiality / privacy, democracy, diversity, education and lifelong learning, intellectual freedom, preservation, the public good, professionalism, service, and social responsibility. But the greatest of these is access, without which we would revert to our roots as monastic scriptoriums and subscription libraries for the literate elite.

Bryan experienced some existential angst given that he is a web developer and not a “librarian” in the sense of job title or traditional responsibilities–the ancient triad of collection development, cataloging, and reference. In contrast, I never felt troubled about my job, as my title is e-learning librarian (got that buzzword going for me, which is nice) and as I do a lot of mainstream librarian-esque things, especially camping up front doing reference or visiting classes doing information literacy instruction.

Meme by Michael Rodriguez using Imgflip

However, I never expected to become manager of electronic resources, systems, web redesign, invoicing and vendor negotiations, and hopefully a new institutional repository fresh out of library school. I did not expect to spend my mornings troubleshooting LDAP authentication errors, walking students through login issues, running cost-benefit analyses on databases, and training users on screencasting and BlackBoard.

But digital librarians like Bryan and myself are the new faces of librarianship. I deliver and facilitate electronic information access in the library context; therefore, I am a librarian. A web developer facilitates access to digital scholarship and library resources. A reference librarian points folks to information they need. An instruction librarian teaches people how to find and evaluate information. A cataloger organizes information so that people can access it efficiently. A collection developer selects materials that users will most likely desire to access. All of these job descriptions–and any others that you can produce–are predicated on the fundamental tenet of access, preferably open, necessarily free.

Democracy, diversity, and the public good are our vision. Our active mission is to open access to users freely and equitably. Within that mission lie intellectual freedom (open access to information regardless of moralistic or political beliefs), privacy (fear of publicity can discourage people from openly accessing information), preservation (enabling future users to access the information), and other values that grow from the opening of access to books, articles, artifacts, the web, and more.

The Librarians (Fair use – parody)

By now you will have picked up on my wordplay. The phrase “open access” (OA) typically refers to scholarly literature that is “digital, online, free of charge, and free of most copyright and licensing restrictions” (Peter Suber). But when used as a verb rather than an adjective, “open” means not simply the state of being unrestricted but also the action of removing barriers to access. We librarians must not only cultivate the open fields–the commons–but also strive to dismantle paywalls and other obstacles to access. Recall Robert Frost’s Mending Wall:

Before I built a wall I’d ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
Something there is that doesn’t love a wall,
That wants it down.’ I could say ‘Elves’ to him…

Or librarians, good sir. Or librarians.

District Dispatch: Free webinar: Library partnerships

Fri, 2015-02-27 01:49

Library e-government resource Lib2gov today announced “A Library Partnership = Neighborhood Resource Center,” a free webinar that will explore library community partnerships:

Alachua County Library District

This webinar highlights the Library Partnership of the Alachua County Library District (ACLD), which is a collaboration with the Partnership for Strong Families and over 30 other service providers expanding the traditional library role in its community. By sharing space, library staff and social service partners provide coordinated and complementary services to meet a client’s full needs. ACLD has now modeled a second branch, Cone Park, to offer similar services.

Both of these branches are in at-risk, low-income areas of the City of Gainesville. Alachua County Library District (ACLD) originally became involved with e-government through a collaboration with NEFLIN (Northeast Florida Library Information Network) to administer a two-year LSTA grant. The grant goals were to research, develop and provide e-government service at libraries throughout the north-central Florida region. As networking for this project expanded, the concept of a neighborhood resource center combining a library and all its resources with community service providers, both governmental and non-profit, became a reality.

Date: March 4, 2015
Time: 3:00-4:00 p.m. EST
Register for this free event

Speaker: Chris Culp is Public Services Division Director in the Alachua County Library District. Chris has a background in academic and school libraries, as well as non-profit organizations.

If you cannot attend this live session, a recorded archive will be available to view at your convenience. To view past webinars also done in collaboration with iPAC, please visit Lib2Gov.org.

The post Free webinar: Library partnerships appeared first on District Dispatch.

District Dispatch: ALA applauds FCC vote to protect open Internet

Fri, 2015-02-27 01:04

The Federal Communications Commission (FCC) voted today to assert the strongest possible open Internet protections—banning paid prioritization and the blocking and throttling of lawful content and services. In a statement issued today, the American Library Association (ALA), applauded this bold step forward in ensuring a fair and open Internet.

ALA President Courtney Young

“America’s libraries collect, create and disseminate essential information to the public over the Internet, and ensure our users are able to access the Internet and create and distribute their own digital content and applications. Network neutrality is essential to meeting our mission in serving America’s communities,” said ALA President Courtney Young. “Today’s FCC vote in favor of strong, enforceable net neutrality rules is a win for students, creators, researchers and learners of all ages.”

As is usually the case, the final Order language is not yet available, but statements from Chairman Tom Wheeler and fellow Commissioners, as well as an earlier fact sheet (pdf) on the draft Order, outline several key provisions. The Order:

  • Reclassifies “broadband Internet access service,” including both fixed and mobile, as a telecommunications service under Title II.
  • Asserts “bright line” rules that ban blocking or throttling of legal content, applications and services; and paid prioritization of some Internet traffic over other traffic.
  • Enhances transparency rules regarding network management and practices.
  • Distinguishes between the public and private networks.

“After almost a year of robust engagement across the spectrum of stakeholders, the FCC has delivered the rules we need to ensure equitable access to online information, applications and services for all,” said Larra Clark, deputy director for the ALA Office for Information Technology Policy. “ALA worked closely with nearly a dozen library and higher education organizations to develop and advocate for network neutrality principles, and we are pleased the FCC’s new rules appear to align nearly perfectly.”

The FCC vote marks the end of one chapter in a lively debate over the future of the Internet, but it’s unlikely to be the last word on the matter. Yesterday the House Energy & Commerce Committee held a hearing to discuss the issue, and several Internet service providers (ISPs) have signaled they will challenge the rules in court. ALA, working with our allies, will continue our engagement to maintain net neutrality.

“On the eve of the FCC’s vote, the House Energy and Commerce Committee provided a preview of the challenges ahead in defending the open Internet,” said Kevin Maher, assistant director of the ALA Office of Government Relations. “Committee Chairman Greg Walden (R-OR) argued that the Order may lead to future regulation while not protecting consumers, while ranking member Anna Eshoo (D-CA) countered that the Order, in fact, guarantees an open Internet.”

More information on libraries and network neutrality is available on the ALA website. Stay tuned at the District Dispatch for updates.

The post ALA applauds FCC vote to protect open Internet appeared first on District Dispatch.

DuraSpace News: Report From the Advanced DSpace Course

Fri, 2015-02-27 00:00
Winchester, MA  Last month DuraSpace and the Texas Digital Library partnered to offer Advanced DSpace Training at the Perry-Castañeda Library on the University of Texas at Austin campus. DuraSpace Members were eligible to register at reduced prices. Course participants learned about advanced features and customization.

DPLA: Profit & Pleasure in Goat Keeping

Thu, 2015-02-26 22:17

Two weeks ago, we officially announced the initial release of Krikri, our new metadata aggregation, mapping, and enrichment toolkit.

In light of its importance, we would like to take a moment for a more informal introduction to the newest members of DPLA’s herd. Krikri and Heiðrún (a.k.a. Heidrun; pronounced like hey-droon) are key to many of DPLA’s plans and serve as a critical piece of infrastructure for DPLA.

They are also names for, or types of, goats.

National Archives and Records Administration.

Why goats? As naturally curious, browsing animals that will try to consume almost anything, our new caprine friends are especially suited to their role in uncovering and sharing treasures of our cultural heritage (by consuming metadata and producing enriched Linked Data).

Krikri makes it possible to harvest from multiple sources (e.g. OAI-PMH), map the resulting metadata to Linked Data, and perform quality control and data enrichment.

We’ve taken care to build these features to be broadly useful so we can share with others doing similar work.  So the name is fitting, as Kri-Kri is a feral goat.

Heidrun, Krikri’s ruminating counterpart, is DPLA’s local implementation of the Krikri features. This is the application that handles the harvests specific to our partners and enriches metadata for the DPLA Portal and Platform API. Heiðrún is named after the mythical Norse goat that eats the leaves and buds off the tree Læraðr and produces mead—enough for all to have their fill.  We couldn’t think of a better goat to represent DPLA!

The title of this post notwithstanding, we remain as committed as ever to openness and to free, democratic access to knowledge. Accordingly, both Krikri and Heidrun are free and open source software.

Goats are notorious escape artists, anyhow. Let us know if any of ours wander into your backyards.

You can find more information about Krikri and Heidrun at our overview page for the project, and in the form of a recent presentation at Code4lib 2015 (video).

The Miriam and Ira D. Wallach Division of Art, Prints and Photographs: Print Collection, The New York Public Library.

FOSS4Lib Recent Releases: Hydra - 8.0.0

Thu, 2015-02-26 21:18

Last updated February 26, 2015. Created by Peter Murray on February 26, 2015.

Package: Hydra
Release Date: Wednesday, February 25, 2015

Journal of Web Librarianship: A Review of “User Experience (UX) Design for Libraries”

Thu, 2015-02-26 18:34
10.1080/19322909.2014.983413
Aida M. Smith

Eric Hellman: "Free" can help a book do its job

Thu, 2015-02-26 18:32


(Note: I wrote this article for NZCommons, based on my presentation at the 2015 PSP Annual Conference in February.)

Every book has a job to do. For many books, that job is to make money for its creators. But a lot of books have other jobs to do. Sometimes the fact that people pay for books helps that job, but other times the book would be able to do its job better if it was free for everyone.

That's why Creative Commons licensing is so important. But while CC addresses the licensing problem nicely, free ebooks face many challenges that make it difficult for them to do their jobs.

Let's look at some examples.

When Oral Literature in Africa was first published in 1970, its number one job was to earn tenure for the author, a rising academic. It succeeded, and then some. The book became a classic, elevating an obscure topic and creating an entire field of scholarly inquiry in cultural anthropology. But in 2012, it was failing to do any job at all. The book was out of print and unavailable to young scholars on the very continent whose culture it documented. Ruth Finnegan, the author, considered it her life's work and hoped it would continue to stimulate original research and new insights. To accomplish that, the book needed to be free. It needed to be translatable, it needed to be extendable.


Nga Reanga Youth Development: Maori Styles, an Open Access book by Josie Keelan, is another example of an academic book with important jobs to do. While its primary job is a local one, the advancement of understanding and practice in Maori youth development, it has another job, a global one. Being free helps it speak to scholars and researchers around the world.

Leanne Brown's Good and Cheap is a very different book. It's a cookbook. But the job she wanted it to do made it more than your usual cookbook. She wanted to improve the lives of people who receive "nutrition assistance"- food stamps, by providing recipes for nutritious and healthy meals that can be made without spending much money. By being free, Good and Cheap helps more people in need eat well.

My last example is Casey Fiesler's Barbie™ I Can Be A Computer Engineer The Remix! Now With Less Sexism! The job of this book is to poke fun at the original Barbie™ I Can Be A Computer Engineer, in which Barbie needs boys to do the actual computer coding. But because Fiesler uses material from the original under "fair use", anything other than free, non-commercial distribution isn't legal. Barbie, remixed can ONLY be a free ebook.

But there's a problem with free ebooks. The book industry runs on a highly evolved and optimized cradle-to-grave supply chain, comprising publishers, printers, production houses, distributors, wholesalers, retailers, aggregators, libraries, publicists, developers, cataloguers, database suppliers, reviewers, used-book dealers, even pulpers. And each entity in this supply chain takes its percentage. The entire chain stops functioning when an ebook is free. Even libraries (most of them) lack the processes that would enable them to include free ebooks in their collections.

At Unglue.it, we ran smack into this problem when we set out to bring books into the creative commons. We helped Open Book Publishers crowd fund a new ebook edition of Oral Literature in Africa. The ebook was then freely available, but it wasn't easy to make it free on Amazon, which dominates the ebook market. We couldn't get the big ebook aggregators that serve libraries to add it to their platforms. We realized that someone had to do the work that the supply chain didn't want to do.

Over the past year, we've worked to turn Unglue.it into a "bookstore for free books". The transformation isn't done yet, but we've built a database of over 1200 downloadable ebooks, licensed under Creative Commons or other free licenses. We have a long way to go, but we're distributing over 10,000 ebooks per month. We're providing syndication feeds, developing relationships with distributors, improving metadata, and promoting wonderful books that happen to be free.

The creators of these books still need to find support. To help them, we’ve developed three revenue programs. For books that already have free licenses, we help the creators ask for financial support in the one place where readers are most appreciative of their work: inside the books themselves. We call this “thanks for ungluing”.

For books that exist as ebooks but need to recoup production costs, we offer "buy-to-unglue". We'll sell these books until they reach a revenue target, after which they'll become open access. For books that exist in print but need funding for conversion to open access ebook, we offer "pledge-to-unglue", which is a way of crowd-funding the conversion.

After a book has finished its job, it can look forward to a lengthy retirement. There's no need for books to die anymore, but we can help them enjoy retirement, and maybe even enjoy a second life. Project Gutenberg has over 50,000 books that have "retired" into the public domain. We're starting to think about the care these books need. Formats change along with the people that use them, and the book industry's supply chain does its best to turn them back into money-earners to pay for that care.

Recently we received a grant from the Knight Foundation to work on ways to provide the long-term care that these books need to be productive AND free in their retirements. GITenberg, a collaboration between the folks at Unglue.it and ebook technologist Seth Woodward, is exploring the use of GitHub for free ebook maintenance. GitHub is a website that supports collaborative software development with source control and workflow tools. Our hope is that the ingredients that have made GitHub wildly successful in the open source software world will prove to be similarly effective in supporting ebooks.

It wasn't so long ago that printing costs made free ebooks impossible. So it's no wonder that free ebooks haven't realized their full potential. But with cooperation and collaboration, we can really make wonderful things happen.

State Library of Denmark: Long tail packed counters for faceting

Thu, 2015-02-26 18:14

Our home brew Sparse Faceting for Solr is all about counting: When calculating a traditional String facet such as

Author
- H.C. Andersen (120)
- Brothers Grimm (90)
- Arabian Nights (7)
- Carlo Collodi (5)
- Ronald Reagan (2)

the core principle is to have a counter for each potential term (author name in this example) and update that counter by 1 for each document with that author. There are different ways of handling such counters.

Level 0: int[]

Stock Solr uses an int[] to keep track of the term counts, meaning that each unique term takes up 32 bits or 4 bytes of memory for counting. Normally that is not an issue, but with 6 billion terms (divided between 25 shards) in our Net Archive Search, this means a whopping 24GB of memory for each concurrent search request.

Level 1: PackedInts

Sparse Faceting tries to be clever about this. An int can count from 0 to 2 billion, but if the maximum number of documents for any term is 3000, there will be a lot of wasted space. 2^12 = 4096, so in the case of maxCount=3000, we only need 12 bits/term to keep track of it all. Currently this is handled by using Lucene’s PackedInts to hold the counters. With the 6 billion terms, this means 9GB of memory. Quite an improvement on the 24GB from before.
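
As a rough illustration of what such packed counters look like in code, here is a minimal sketch using Lucene’s PackedInts API; the figures are the example numbers from above, not Sparse Faceting’s actual code:

    import org.apache.lucene.util.packed.PackedInts;

    public class PackedCounters {
        public static void main(String[] args) {
            long maxCount = 3000;                                  // highest count any term can reach
            int bitsPerValue = PackedInts.bitsRequired(maxCount);  // 12 bits for 3000
            int uniqueTerms = 240_000_000;                         // roughly one shard's worth (illustrative)

            // One packed counter per term, each 12 bits wide instead of a 32-bit int
            PackedInts.Mutable counters =
                    PackedInts.getMutable(uniqueTerms, bitsPerValue, PackedInts.COMPACT);

            counters.set(42, counters.get(42) + 1);  // increment the counter for term ordinal 42
            System.out.println(counters.get(42));    // 1
        }
    }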

Level 2: Long tail PackedInts with fallback

Packing counters has a problem: the size of all the counters is dictated by the maxCount. Just a single highly used term can nullify the gains: if all documents share one common term, the size of the individual counters will be log2(docCount) bits. With a few hundred million documents, that puts the size at 27-29 bits/term, very close to the int[] representation’s 32 bits.

Looking at the Author-example at the top of this page, it seems clear that the counts for the authors are not very equal: The top-2 author has counts a lot higher than the bottom-3. This is called a long tail and it is a very common pattern. This means that the overall maxCount for the terms is likely to be a lot higher than the maxCount for the vast majority of the terms.

While I was loudly lamenting all the wasted bits, Mads Villadsen came by and solved the problem: what if we keep track of the terms with high maxCount in one structure and the ones with a lower maxCount in another structure? Easy enough to do with a lot of processing overhead, but tricky to do efficiently. Fortunately Mads also solved that (my role as primary bit-fiddler is in serious jeopardy). The numbers in the following explanation are just illustrative and should not be seen as the final numbers.

The counter structure

We have 200M unique terms in a shard. The terms are long tail-distributed, with the most common ones having maxCount in the thousands and the vast majority with maxCount below 100.

We locate the top-128 terms and see that their maxCounts range from 2921 down to 87. We create an int[128] to keep track of their counts and call it head.

head
                  Bit 31   30   29   …   2   1   0
Term h_0               0    0    0   …   0   0   0
Term h_1               0    0    0   …   0   0   0
…                      0    0    0   …   0   0   0
Term h_126             0    0    0   …   0   0   0
Term h_127             0    0    0   …   0   0   0

The maxCount for the terms below the 128 largest ones is 85. 2^7=128, so we need 7 bits to hold each of those. We allocate a PackedInt structure with 200M entries of 7+1 = 8 bits and call it tail.

tail
                  Bit 7*   6   5   4   3   2   1   0
Term 0                 0   0   0   0   0   0   0   0
Term 1                 0   0   0   0   0   0   0   0
…                      0   0   0   0   0   0   0   0
Term 199,999,998       0   0   0   0   0   0   0   0
Term 199,999,999       0   0   0   0   0   0   0   0

The tail has an entry for all terms, including those in head. For each of the large terms in head, we locate its position in tail. At that tail counter, we set the value to the term’s index in the head structure and set the highest bit to 1.

Let’s say that head entry term h_0 is located at position 1 in tail, h_126 is located at position 199,999,998 and h_127 is located at position 199,999,999. After marking the head entries in the tail structure, it would look like this:

tail with marked heads
                  Bit 7*   6   5   4   3   2   1   0
Term 0                 0   0   0   0   0   0   0   0
Term 1                 1   0   0   0   0   0   0   0
…                      0   0   0   0   0   0   0   0
Term 199,999,998       1   1   1   1   1   1   1   0
Term 199,999,999       1   1   1   1   1   1   1   1

Hanging on so far? Good.

Incrementing the counters
  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if (count & 128 == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
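
A minimal sketch of steps 1-4 in Java, assuming head is the int[128] from above and tail is the 8-bit packed structure (an illustration, not the production code):

    // The 7 low bits of a tail entry hold either a count or an index into head;
    // bit 7 tells us which.
    static void increment(int ordinal, int[] head, PackedInts.Mutable tail) {
        long count = tail.get(ordinal);        // 1. read the tail entry for this term
        if ((count & 128) == 128) {            // 2. is the head-marker bit set?
            head[(int) (count & 127)]++;       // 3. yes: the low 7 bits index into head
        } else {
            tail.set(ordinal, count + 1);      // 4. no: it is a plain counter, bump it in place
        }
    }
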
Pros and cons

In this example, the counters take up 6 billion * 8 bits + 25 * 128 * 32 bits = 5.7GB. The performance overhead, compared to the PackedInts version, is tiny: whenever a head bit is encountered, there will be an extra read to get the old head value before writing the value+1. As head will statistically be heavily used, it is likely to be in Level 2/3 cache.

This is just an example, but it should be quite realistic, as approximate values from the URL field in our Net Archive Index have been used. Nevertheless, it must be stressed that the memory gains from long tail PackedInts are highly dictated by the shape of the long tail curve.

Afterthought

It is possible to avoid the extra bit in tail by treating the large terms as any other term, until their tail-counters reaches maximum (127 in the example above). When a counter’s max has been reached, the head-counter can then be located using a lookup mechanism, such as a small HashMap or maybe just a linear scan through a short array with the ordinals and counts for the large terms. This would reduce the memory requirements to approximately 6 billion * 7 bits = 5.3GB. Whether this memory/speed trade-off is better or worse is hard to guess and depends on result set size.

Implementation afterthought

The long tail PackedInts could implement the PackedInts-interface itself, making it usable elsewhere. Its constructor could take another PackedInts filled with maxCounts or a histogram with maxbit requirements.

Heureka update 2015-02-27

There is no need to mark the head terms in the tail structure up front. All the entries in tail act as standard counters until the highest bit is set. At that moment the bits used for counting switch to being a pointer into the next available entry in the head counters. The update workflow thus becomes

  1. Read the counter value from the tail structure: count = tail.get(ordinal)
  2. Check if bit 7 is set: if (count & 128 == 128)
  3. If bit 7 is set, increment the head counter: head.inc(count & 127)
  4. If bit 7 is not set, increment the tail counter: tail.set(ordinal, count+1)
  5. If the counter reaches bit 7, change the counter-bits to be pointer-bits:
    if ((count+1) == 128) tail.set(ordinal, 128 | headpos++)
Pros and cons

The new logic means that initialization and resetting of the structure is simply a matter of filling it with 0. Update performance will be on par with the current PackedInts implementation for all counters whose value is within the cutoff. After that the penalty of an extra read is paid, but only for the overflowing values.

The memory overhead is unchanged from the long tail PackedInts implementation and still suffers from the extra bit used for signalling count vs. pointer.

Real numbers 2015-02-28

The store-pointers-as-values approach has the limitation that there can only be as many head counters as the maxCount for tail. Running the numbers on the URL field for three of the shards in our net archive index resulted in bad news: the tip of the long tail shape was not very pointy, and it is only possible to shave 8% off the counter size, far less than the estimated 30%. The Packed64 in the table below is the current structure used by sparse faceting.

Shard 1 URL: 228M unique terms, Packed64 size: 371MB

tail BPV   required memory   saved         head size
11         342MB             29MB /  8%    106
12         371MB              0MB /  0%    6

However, we are in the process of experimenting with faceting on links, which has a considerably pointier long tail shape. From a nearly fully built test shard we have:

8/9 built shard, links: 519M unique terms, Packed64 size: 1427MB

tail BPV   required memory   saved          head size
15         1038MB            389MB / 27%    14132
16         1103MB            324MB / 23%    5936
17         1168MB            260MB / 18%    3129
18         1233MB            195MB / 14%    1374
19         1298MB            130MB /  9%    909
20         1362MB             65MB /  5%    369
21         1427MB              0MB /  0%    58

For links, the space saving was 27% or 389MB for the nearly-finished shard. To zoom out a bit: Doing faceting on links for our full corpus with stock Solr would take 50GB. Standard sparse faceting would use 35GB and long tail would need 25GB.

Due to sparse faceting, response time for small result sets is expected to be a few seconds for the links-facet. Larger result sets, not to mention the dreaded *:* query, would take several minutes, with worst-case (qualified guess) around 10 minutes.

Three-level long tail 2015-02-28

Previously:

  • pointer-bit: Letting the values in tail switch to pointers when they reach maximum has the benefit of very little performance overhead, with the downside of taking up an extra bit and limiting the size of head.
  • lookup-signal: Letting the values in tail signal “find the counter in head” when they reach maximum, has the downside that a sparse lookup-mechanism, such as a HashMap, is needed for head, making lookups comparatively slow.

New idea: Mix the two techniques. Use the pointer-bit principle until there is no more room in head. head-counters beyond that point all get the same pointer (all value bits set) in tail and their position in head is determined by a sparse lookup-mechanism ord2head.

This means that

  • All low-value counters will be in tail (very fast).
  • The most used high-value counters will be in head and will be referenced directly from tail (fast).
  • The less used high-value counters will be in head and will require a sparse lookup in ord2head (slow).

Extending the pseudo-code from before:

    value = tail.get(ordinal)
    if (value == 255) {                    // indirect pointer signal
      head.inc(ord2head.get(ordinal))
    } else if ((value & 128) == 128) {     // pointer-bit set
      head.inc(value & 127)
    } else {                               // term-count = value
      value++
      if (value != 128) {                  // tail-value ok
        tail.set(ordinal, value)
      } else {                             // tail-value overflow
        head.set(headpos, value)
        if (headpos < 127) {               // direct pointer
          tail.set(ordinal, 128 | headpos++)
        } else {                           // indirect pointer
          tail.set(ordinal, 255)
          ord2head.put(ordinal, headpos++)
        }
      }
    }

Raffaele Messuti: SKOS Nuovo Soggettario, API and autocomplete

Thu, 2015-02-26 17:00

How to create an API for a form with autocompletion using terms from the Nuovo Soggettario, with Redis Sorted Sets and Nginx+Lua.
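
For readers who do not follow the link, here is a minimal, self-contained sketch of the prefix-lookup idea in Java, with a TreeSet standing in for the Redis sorted set that ZRANGEBYLEX would scan server-side; the sample terms are made up:

    import java.util.SortedSet;
    import java.util.TreeSet;

    public class AutocompleteSketch {
        public static void main(String[] args) {
            SortedSet<String> terms = new TreeSet<>();
            // Hypothetical subject terms loaded from the SKOS vocabulary
            terms.add("biblioteche digitali");
            terms.add("biblioteconomia");
            terms.add("bibliografia");

            String prefix = "biblio";
            // Every term sharing the prefix falls in the half-open range [prefix, prefix + '\uffff')
            SortedSet<String> matches = terms.subSet(prefix, prefix + '\uffff');
            matches.forEach(System.out::println);
        }
    }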

District Dispatch: Boots on the ground advocacy

Thu, 2015-02-26 15:27

I don’t think I need to overemphasize for you the important roles that libraries play in our communities. The epicenter of progress and knowledge, libraries have evolved to meet the educational and technological needs of their patrons.

These institutions of learning and creation need to be protected and improved at all costs. And in the wake of the sweeping changes to both the House and the Senate in the 2014 Congressional elections, it is more important than ever that we speak up on behalf of libraries and the communities they serve.

Your firsthand library experience – from behind the reference desk or as a patron – is an invaluable part of helping legislators to understand the impact that libraries have in the day to day lives of their constituents. Without you, they may not realize what happens to a community when library budgets get cut and staff are let go, let alone how legislation on net neutrality, copyright, or privacy can involve libraries too. We need to urge Members of Congress to think about how the policy and legislation they are working on could harm or help libraries.  To do that, we need boots on the ground here in Washington, D.C. – we need library advocates. That’s why we’re inviting YOU to National Library Legislative Day 2015!

This two-day advocacy event brings hundreds of librarians, trustees, library supporters, and patrons to Washington, D.C. to meet with their Members of Congress and rally support for library issues and policies. This year, National Library Legislative Day will be held May 4-5, 2015. Participants will receive advocacy tips and training, along with important issue briefings, prior to their meetings.

Registration information and hotel booking information are available on the ALA Washington Office website.

The post Boots on the ground advocacy appeared first on District Dispatch.

Hydra Project: Hydra-Head 8.0.0 released

Thu, 2015-02-26 13:26

We are pleased to announce the final release of hydra-head version 8.0.0!

Hydra-head 8.x is planned to be the final major version of the software to support Fedora Commons Repository version 3.x.

Release notes are here: https://github.com/projecthydra/hydra-head/releases/tag/v8.0.0.

DuraSpace News: Layne Johnson to Leave Position as VIVO Project Director

Thu, 2015-02-26 00:00

Winchester, MA  Dr. Layne Johnson will leave his position as VIVO Project Director effective February 28, 2015.  Layne has worked hard in the interests of VIVO's growth and sustainability, and we've seen some excellent progress. After reaching the important milestone of documenting a strategic plan for the project, Layne has decided to pass the role of leading the community through its implementation to a successor.  

HangingTogether: Transcription vs. Transliteration

Wed, 2015-02-25 20:44

This post is co-authored by Karen Coombs, OCLC Senior Product Analyst

Our virtual dialog began with Karen C’s tweet:

But Karen S-Y couldn’t respond in just the 140 characters Twitter allows. Instead she sent an email:

“Transcribing transliteration” from a piece is almost an oxymoron. It rarely occurs. Transliteration by definition is converting one writing system (e.g., Chinese characters) into another writing system (e.g., Latin-script characters, or romanization). Catalogers in Anglo-American countries will transliterate non-Latin titles using ALA/LC romanization for the writing system on the piece; other countries may use other transliteration schemes.

You will generally find transliterated titles whenever there is a non-Latin title (in MARC, stored in the 880-245 field). But OCLC doesn’t support all scripts, and not everyone takes advantage of the scripts OCLC does support – e.g., we support Cyrillic but only 10% of all Russian-language titles in WorldCat have the Cyrillic that appears on the piece.  The ALA/LC romanization for Cyrillic is distinctly different from the ISO standard used by almost everyone else, so where we rely only on the romanized strings, the same title in Cyrillic may be represented by different clusters using different transliteration schemes. (In the graphic that precedes this entry, two romanizations are shown for the Russian “War and Peace”.)

In general, it’s better to rely on the non-Latin script title if we have it than any transliteration that may be also be in the record. The non-Latin script titles will be transcribed from the piece and any transliteration will be supplied by a cataloger, which may or may not match the transliteration supplied by another cataloger…

Karen C. wrote back: 

I think you answered the question the user was asking when you said that “The ALA/LC romanization for Cyrillic is distinctly different from the ISO standard used by almost everyone else, so where we rely only on the romanized strings, the same title (with the same Cyrillic string) will be represented by different clusters using different transliteration schemes.”

The user asked, “Your API returns texts in Russian in a strange transliteration format. As I see, it’s not ISO-9. For example, this text: “Oni vernulis? na rodnui?u? planetu, gde za vremi?a? mezhzve?znogo pole?ta proshlo bol?she sta let i vse? tak izmenilos?, chto Zemli?a? stala chuzhoi? im”. Please, can you tell me, how to convert this format into correct Cyrillic?”

At least I understand the why now.

Karen S-Y commented:

It also happens to be the case where there is almost a one-to-one correspondence between romanized Russian and its Cyrillic counterpart. That is why most libraries didn’t bother adding the Cyrillic. Since the system requires that if you put in non-Latin script you also enter the romanization, it represents “double work.”

This prompted Karen C. to ask:

Does the MARC record have any way to tell you if a title was romanized?

Karen S-Y answered:

By inference, yes.

If the language code is for a language not written in Latin characters, and there is no 880 in the MARC record, then the non-English information in the record is by definition all romanized (non-English information if the language of cataloging is English).
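
A minimal sketch of that inference in Java, assuming the record has already been parsed into a language code and a set of present field tags; the set of non-Latin-script language codes shown here is only a small illustrative sample:

    import java.util.Set;

    public class RomanizationCheck {
        // Illustrative sample of MARC language codes whose native scripts are non-Latin
        private static final Set<String> NON_LATIN_LANGS =
                Set.of("rus", "chi", "jpn", "kor", "ara", "heb", "hin");

        /** True if the record's non-English data is, by inference, romanized only. */
        static boolean isRomanizedOnly(String languageCode, Set<String> fieldTags) {
            return NON_LATIN_LANGS.contains(languageCode) && !fieldTags.contains("880");
        }

        public static void main(String[] args) {
            // A Russian-language record with no 880 (alternate graphic representation) fields
            System.out.println(isRomanizedOnly("rus", Set.of("245", "260", "300")));  // true
        }
    }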

The following table shows, for the top 15 languages written in non-Latin scripts that WorldCat supports, the percentages of WorldCat records represented by the original script (transcribed from the piece) and by transliteration only (supplied by the cataloger). Most records for languages written in Cyrillic and Indic scripts contain transliterations only.

 Top 15 languages in WorldCat written in non-Latin character sets

About Karen Smith-Yoshimura

Karen Smith-Yoshimura, program officer, works on topics related to renovating descriptive and organizing practices with a focus on large research libraries and area studies requirements.

SearchHub: Infographic: 15 Years of The Apache Software Foundation

Wed, 2015-02-25 18:43
As the commercial stewards of Apache Solr, we know how crucial the open source software movement is to organizations all over the world and how important the Apache Software Foundation is in governing dozens of projects used by thousands of companies. Let’s take a look at how it all got started:

The post Infographic: 15 Years of The Apache Software Foundation appeared first on Lucidworks.
