Feed aggregator

Open Library Data Additions: OL.120301.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120301.meta.mrc 6003 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121201.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.121201.meta.mrc 7088 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121001.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.121001.meta.mrc 5235 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120601.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120601.meta.mrc 6018 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120501.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120501.meta.mrc 4685 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120401.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120401.meta.mrc 3851 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120201.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120201.meta.mrc 6433 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120101.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120101.meta.mrc 5284 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121101.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.121101.meta.mrc 6896 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120901.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120901.meta.mrc 6035 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120801.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120801.meta.mrc 5760 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120701.meta.mrc

planet code4lib - Fri, 2015-04-10 23:51

OL.120701.meta.mrc 5421 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

State Library of Denmark: Facet filtering

planet code4lib - Fri, 2015-04-10 21:24

In generation 2 of our net archive search we plan to experiment with real time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is with regular expressions. The main problem with that approach is that all the Strings for the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.* which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp-requirement are requested, chances are that millions of terms must be resolved to Strings and checked before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher chance of wasp-related links and would thus require far fewer resolutions.
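The cost described above can be illustrated with a minimal Python sketch. It is not the Sparse Faceting implementation: the function name, the resolve callback and the sample links are hypothetical, and it assumes that a term passes when it fully matches every whitelist and no blacklist. Candidates are walked in descending count order and each one is resolved to a String before it can be checked, so a rare whitelist term forces many resolutions.

```python
import re
from typing import Callable, Iterable, List, Sequence, Tuple


def filter_facet_terms(
    candidates: Iterable[Tuple[int, int]],
    resolve: Callable[[int], str],
    limit: int,
    whitelists: Sequence[str] = (),
    blacklists: Sequence[str] = (),
) -> Tuple[List[Tuple[str, int]], int]:
    """Return (accepted terms with counts, number of terms resolved).

    candidates: (ordinal, count) pairs, already sorted by descending count.
    resolve:    hypothetical lookup from a term ordinal to its String.
    Assumption: a term passes if it fully matches every whitelist
    and fully matches no blacklist.
    """
    white = [re.compile(p) for p in whitelists]
    black = [re.compile(p) for p in blacklists]
    accepted: List[Tuple[str, int]] = []
    resolved = 0
    for ordinal, count in candidates:
        if len(accepted) >= limit:
            break  # enough allowed terms found, stop resolving
        term = resolve(ordinal)  # the expensive step
        resolved += 1
        if not all(w.fullmatch(term) for w in white):
            continue
        if any(b.fullmatch(term) for b in black):
            continue
        accepted.append((term, count))
    return accepted, resolved


# Made-up sample data: ordinal -> link, plus (ordinal, count) by popularity.
terms = [
    "http://a.example/wasp.html",
    "http://b.example/bee.html",
    "http://c.example/wasp-nest.png",
    "http://d.example/ant.html",
]
candidates = [(1, 90), (0, 50), (3, 20), (2, 5)]
top, resolved = filter_facet_terms(
    candidates, terms.__getitem__, limit=1, whitelists=[r".*wasp.*"]
)
```

The second return value counts how many terms had to be resolved, which is the number that explodes for a match-all query combined with a rarely matching whitelist.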

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally (regular expression: [^/]*example\.com)
  3. The most popular images that pages on a given site link to internally (regular expression: [^/]*example\.com/.*\.(gif|jpg|jpeg|png)$)
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative for the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

900GB, 200M+ docs, 500M+ unique values, 5 billion+ references

  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.

Worst case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As this is tightly correlated to the result set size, which is fast to calculate, adding a warning with “This query is likely to take minutes to process” could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead for regular expressions was less than feared. Real-time graphs, or at least fill-the-coffee-cup-time graphs, seem doable, at the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes (updated 2015-04-11)

@maxxkrakoa noted “@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each.” He is right and it does help significantly for our 6 field faceting. For single field faceting, support for real multi-thread counting would be needed.

The simple way of doing multi-thread counting is to update multiple copies of the counter structure and merge them at the end. For a 500M+ field, that is likely to be prohibitive with regard to both memory and speed: The time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the “things to ponder when sleep is not coming” list.
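The simple copy-and-merge approach can be sketched in Python as below. This is an illustration of the structure only, not Solr or Sparse Faceting code: the function names are hypothetical, and Python threads will not give a real speed-up for this loop. The point is the cost model: every thread holds a full counter array of num_terms entries, and the merge touches threads × num_terms counters, which is what makes the approach unattractive for a 500M+ field.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def count_chunk(ordinals: Sequence[int], num_terms: int) -> List[int]:
    """Count term references in one chunk using a private counter array."""
    counters = [0] * num_terms  # one full-size copy per thread
    for ordinal in ordinals:
        counters[ordinal] += 1
    return counters


def parallel_count(references: Sequence[int], num_terms: int, threads: int = 4) -> List[int]:
    """Split the references, count each part separately, then merge."""
    chunk_size = max(1, -(-len(references) // threads))  # ceiling division
    chunks = [
        references[i:i + chunk_size]
        for i in range(0, len(references), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(lambda c: count_chunk(c, num_terms), chunks))
    # Merge phase: cost is len(partials) * num_terms additions.
    merged = [0] * num_terms
    for partial in partials:
        for index, value in enumerate(partial):
            merged[index] += value
    return merged


counts = parallel_count([0, 1, 1, 3, 3, 3, 2, 1], num_terms=4, threads=3)
```

Splitting the counter space instead, so that each thread owns a disjoint range of ordinals, would avoid both the extra copies and the merge, at the price of every thread having to scan all references.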

John Miedema: Cognitive computing. Computers already know how to do math, now they can read and understand.

planet code4lib - Fri, 2015-04-10 20:06

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. It is characterized by the following:

  1. Life-world data. Operates on data that is large, varied, complex and dynamic, the stuff of daily human life.
  2. Natural questions. A question is more than a keyword query. A question embodies unstructured meaning. It may be asked in natural language. A dialog allows for refinement of questions.
  3. Reading and understanding. Computers already know how to do math. Cognitive computing provides the ability to read. Reading includes understanding context, nuance, and meaning.
  4. Analytics. Understanding is extended with statistics and reasoning. The system finds patterns and structures. It considers alternatives and chooses a best answer.
  5. Answers are structured and ordered. An answer is an “assembly,” a wiki-type summary, or a visualization such as a knowledge graph. It often includes references to additional information.

Cognitive computing is not artificial intelligence. Solutions are characterized by a partnership with humans:

  1. Taught rather than just programmed. Cognitive systems “borrow” from human intelligence. Computers use resources compiled from human knowledge and language.
  2. Learn from human interaction. A knowledge base is improved by feedback from humans. Feedback is ideally implicit in an interaction, or it may be explicit, e.g., thumbs up or down.

DPLA: DPLAfest 2015: Special Indy Activities and Attractions

planet code4lib - Fri, 2015-04-10 19:29

There’s lots to do in the Indianapolis area during DPLAfest 2015, just a week away! Here is a sampling of some of the excellent options.

Check out the Indiana Historical Society’s array of exhibitions (free for fest attendees!)

Did you know that you can get free admission to the Indiana Historical Society with your DPLAfest name badge? Simply show your DPLAfest name badge at the Historical Society welcome center and you’ll receive a wristband to explore the wonderful exhibits and activities inside:

  • Step into three-dimensional re-creations of historic photographs complete with characters come to life in You Are There, featuring That Ayres Look, 1939: Healing Bodies, Changing Minds, and 1904: Picture This.
  • Let the latest technology take you back in time on virtual journeys throughout the state in Destination Indiana.
  • Pull up a stool at a cabaret and immerse yourself in the music of Hoosier legend Cole Porter in the Cole Porter Room.
  • In the W. Brooks and Wanda Y. Fortune History Lab, get a behind-the-scenes, hands-on look at conservation and the detective work involved in history research.
  • Check out Lilly Hall, home of the Mezzanine Gallery, INvestigation Stations and Hoosier Legends.

Take a walking tour of the Indiana State Library’s stunning architecture

Join Indiana State Library staffer Steven Schmidt for a guided tour of the historic Indiana State Library. Please meet at least 5 minutes before the tour is slated to begin; see the DPLAfest schedule for more details.

Learn about the history and development of Indianapolis

If you’re interested in history, architecture, and anything in between, be sure to check out Indianapolis Historical Development: A Quick Overview on Saturday, 4/18 at 1:15 PM in IUPUI UL Room 1116. Led by William L. Selm, IUPUI Adjunct Faculty of the School of Engineering & Technology, this presentation will provide an “overview of the development of Indianapolis since its founding in 1821 as the capital city of Indiana in the center of the state. The factors that shaped the city will be presented as well as the buildings and monuments that are the products of these factors and forces.” Find out more about this session here.

Get inspired at the Indianapolis Museum of Art:

  • At the Innovative Museum Leaders Speaker Series on April 16, hear from Mar Dixon, the Founder of MuseoMix UK, Museum Camp and Ask a Curator Day, London.
  • Peruse a number of interesting art exhibitions. This includes a Monuments Men-inspired look at the provenance research and ownership discussion surrounding one of the IMA’s European pieces, “Interior of Antwerp Cathedral.”

Learn more at The Eiteljorg Museum of American Indians and Western Art:

  • With a mission to inspire an appreciation and understanding of the cultures of the indigenous peoples of North America, the Eiteljorg Museum has a number of interesting offerings. See art from the American West, as well as an exhibit about the gold rush. 

Show off your sports side:

For other child (or child-at-heart) friendly options:

  • Visit the world’s largest children’s museum, the Children’s Museum of Indianapolis. And, yes, that’s Optimus Prime you’ll see parked out front–it’s part of a new Transformers exhibit, one of many fun options at the museum.
  • Explore at the Indianapolis Zoo, which now has a new immersive “Butterfly Kaleidoscope” conservatory, with 40 different butterfly species.

Don’t forget to take a second look at the DPLAfest schedule to make sure you don’t miss any of the exciting fest sessions! See you soon in Indianapolis!

Brown University Library Digital Technologies Projects: What is ORCID?

planet code4lib - Fri, 2015-04-10 15:46

ORCID is an open, non-profit initiative founded by academic institutions, professional bodies, funding agencies, and publishers to resolve authorship confusion in scholarly work. The ORCID repository of unique scholar identification numbers will reliably identify and link scholars in all disciplines with their work, analogous to the way ISBN and DOI identify books and articles.

Brown is a member of ORCID, which allows the University, among other things, to create ORCID records on behalf of faculty, students, and affiliated individuals; integrate authenticated ORCID identifiers into grant application processes; ingest ORCID data to maintain internal systems such as institutional repositories; and link ORCID identifiers to other IDs and registry systems. ORCID identifiers will facilitate the gathering of publication, grant, and other data for use in Researchers@Brown profiles. The library, with long experience in authority control, is coordinating this effort.

Brown University Library Digital Technologies Projects: What is OCRA?

planet code4lib - Fri, 2015-04-10 15:40

OCRA is a platform for faculty to request digital course reserves in all formats. Students access digital course reserves via Canvas or at the standalone OCRA site. Students access physical reserves in library buildings via Josiah.

Hydra Project: SAVE THE DATE: Hydra Connect 2015 – Monday, September 21st – Thursday, September 24th

planet code4lib - Fri, 2015-04-10 15:21

Hydra today announced the dates for Hydra Connect 2015:

Hydra Connect 2015
Minneapolis, Minnesota
Monday, September 21 – Thursday, September 24, 2015

Registration and lodging details will be available in early June 2015.

The four-day event will be structured as follows:

  • Mon 9/21 – Workshops and leader-facilitated training sessions
  • Tue 9/22 – Main Plenary Session
  • Wed 9/23 – Morning Plenary Session, afternoon Un-conference breakout sessions
  • Thu 9/24 – All-day Un-conference breakouts and workgroup sessions

We are also finalizing details for a poster session within the program and a conference dinner to be held on one of the main conference evenings.

Please mark your calendars and plan on joining us this September!

David Rosenthal: 3D Flash - not as cheap as chips

planet code4lib - Fri, 2015-04-10 15:00
Chris Mellor has an interesting piece at The Register pointing out that while 3D NAND flash may be dense, it's going to be expensive.

The reason is the enormous number of processing steps per wafer - between 96 and 144 deposition layers for the three leading 3D NAND flash technologies. Getting non-zero yields from that many steps involves huge investments in the fab:
Samsung, SanDisk/Toshiba, and Micron/Intel have already announced +$18bn investment for 3D NAND.
  • Samsung’s new Xi’an, China, 3D NAND fab involves a +$7bn total capex outlay
  • Micron has outlined a $4bn spend to expand its Singapore Fab 10
This compares with Seagate and Western Digital’s capex totalling ~$4.3 bn over the past three years.

Chris has this chart, from Gartner and Stifel, comparing the annual capital expenditure per TB of storage of NAND flash and hard disk. Each TB of flash contains at least 50 times as much capital as a TB of hard disk, which means it will be a lot more expensive to buy.

PS - "as cheap as chips" is a British usage.

Jonathan Rochkind: “Streamlining access to Scholarly Resources”

planet code4lib - Fri, 2015-04-10 14:14

A new Ithaka report, Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources [thanks to Robin Sinn for the pointer], makes some observations about researcher behavior that many of us probably know, but that most of our organizations haven’t successfully responded to yet:

  • Most researchers work from off campus.
  • Most researchers do not start from library web pages, but from Google, the open web, and occasionally licensed platform search pages.
  • More and more researcher use is on smaller screens: mobile, tablet, touch.

The problem posed by the first two points is the difficulty in getting access to licensed resources. If you start from the open web, from off campus, and wind up at a paywalled licensed platform, you will not be recognized as a licensed user. Because you started from the open web, you won’t be going through EZProxy. As the Ithaka report says, “The proxy is not the answer… the researcher must click through the proxy server before arriving at the licensed content resource. When a researcher arrives at a content platform in another way, as in the example above, it is therefore a dead-end.”
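To make the click-through concrete: a proxied link is typically the proxy's login address with the target address appended. The sketch below assumes an EZproxy-style starting-point URL of the form login?url=; the host names are hypothetical and a local installation may be configured differently.

```python
def proxied_url(proxy_base: str, target_url: str) -> str:
    """Prefix a target URL with an EZproxy-style starting-point URL.

    Assumption: the proxy accepts the 'login?url=' form with the target
    URL appended as-is. proxy_base is a hypothetical institutional host.
    """
    return f"{proxy_base.rstrip('/')}/login?url={target_url}"


link = proxied_url(
    "https://proxy.example.edu/",
    "https://platform.example.com/article/123",
)
```

A link found on the open web is the bare target address without this prefix, which is why the platform has no way of recognizing the researcher.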

Shibboleth and UI problems

Theoretically, Shibboleth federated login is an answer to some of that. You get to a licensed platform from the open web, you click on a ‘login’ link, and you have the choice to log in via your university (or other host organization), using your institutional login at your home organization, which can authenticate you via Shibboleth to the third party licensed platform.

The problem here, as the Ithaka report notes, is that these Shibboleth federated login interfaces at our licensed content providers are terrible.

Most of them even use the word “Shibboleth” as if our patrons have any idea what this means. As the Ithaka report notes, “This login page is a mystery to most researchers. They can be excused for wondering ‘what is Shibboleth?’ even if their institution is part of a Shibboleth federation that is working with the vendor, which can be determined on a case by case basis by pulling down the ‘Choose your institution’ menu.”

Ironically, this exact same issue was pointed out in the NISO “Establishing Suggested Practices Regarding Single Sign-on” (ESPReSSO) report from 2011. The ESPReSSO report goes on to not only identify the problem but suggest some specific UI practices that licensed content providers could take to improve things.

Four years later, almost none have. (One exception is JSTOR, which acted on the ESPReSSO report and as a result has an intelligible federated sign-on UI, which I suspect our users manage to figure out. It would have been nice if the Ithaka report had pointed out good examples, not just bad ones. edit: I just discovered JSTOR is currently owned by Ithaka; perhaps they didn’t want to toot their own horn.)

Four years from now, will the Ithaka report have had any more impact? What would make it so?

There is one more especially frustrating thing to me regarding Shibboleth that isn’t about UI. It’s that even vendors that say they support Shibboleth support it very unreliably. Here at my place of work we’ve been very aggressive at configuring Shibboleth with any vendor that supports it. And we’ve found that Shibboleth often simply stops working at various vendors. They don’t notice until we report it; Shibboleth is not widely used, apparently. Then maybe they’ll fix it, maybe they won’t. In another example, ProQuest’s Shibboleth login requires the browser to access a web page on a four-digit non-standard port, and even though we told them several years ago that a significant portion of our patrons are behind a firewall that does not allow access to such ports, they’ve been uninterested in fixing or changing it. After all, what are we going to do, cancel our license? As the several years since we first complained about this issue show, obviously not. Which brings us to the next issue…

Relying on Vendors

As the Ithaka report notes, library systems have been effectively disintermediated in our researchers’ workflows. Our researchers go directly to third-party licensed platforms. We pay for these platforms, but we have very little control of them.

If a platform does not work well on a small screen/mobile device, there’s nothing we can do but plead. If a platform’s authentication system UI is incomprehensible to our patrons, likewise.

The Ithaka report recognizes this, and basically recommends that… we get serious when we tell our vendors to improve their UIs:

Libraries need to develop a completely different approach to acquiring and licensing digital content, platforms, and services. They simply must move beyond the false choice that sees only the solutions currently available and instead push for a vision that is right for their researchers. They cannot celebrate content over interface and experience, when interface and experience are baseline requirements for a content platform just as much as a binding is for a book. Libraries need to build entirely new acquisitions processes for content and infrastructure alike that foreground these principles.

Sure. The problem is, this is completely, entirely, incredibly unrealistic.

If we were for real to stop “celebrating content over interface and experience”, and have that effected in our acquisitions process, what would that look like?

It might look like us refusing to license something with a terrible UX, even if it’s content our faculty need electronically. Can you imagine us telling faculty that? It’s not going to fly. The faculty wants the content even if it has a bad interface. And they want their pet database even if 90% of our patrons find it incomprehensible. And we are unable to tell them “no”.

Let’s imagine a situation that should be even easier. Let’s say we’re lucky enough to be able to get the same package of content from two different vendors with two different platforms. Let’s ignore the fact that “big deal” licensing makes this almost impossible (a problem which has only gotten worse since a D-Lib article pointed it out 14 years ago). Even in this fantasy land, where we say we could get the same content from two different platforms, let’s say one platform costs more but has a much better UX. In this continued time of library austerity budgets (which nobody sees ending anytime soon), could we possibly pick the more expensive one with the better UX? Will our stakeholders, funders, faculty, deans, ever let us do that? Again, we can’t say “no”.

edit: Is it any surprise, then, that our vendors find business success in not spending any resources on improving their UX? One exception again is JSTOR, which really has a pretty decent and sometimes outstanding UI. Is the fact that they are a non-profit endeavor relevant? But there are other non-profit content platform vendors which have UXs at the bottom of the heap.

Somehow we’ve gotten ourselves in a situation where we are completely unable to do anything to give our patrons what we know they need. Increasingly, to researchers, we are just a bank account for licensing electronic platforms. We perform the “valuable service” of being the entity you can blame for how much the vendors are charging, the entity you require to somehow keep licensing all this stuff on smaller budgets.

I don’t think the future of academic libraries is bright, and I don’t even see a way out. Any way out would take strategic leadership and risk-taking from library and university administrators… that, frankly, institutional pressures seem to make it impossible for us to ever get.

Is there anything we can do?

First, let’s make it even worse: there’s a ‘technical’ problem that the Ithaka report doesn’t even mention. If the user arrives at a paywall from the open web, even if they can figure out how to authenticate, they may find that our institution does not have a license from that particular vendor, but may very well have access to the same article on another platform. And we have no good way to get them to it.

Theoretically, the OpenURL standard is meant to address exactly this “appropriate copy” problem. OpenURL has been a very successful standard in some ways, but the ways it’s deployed simply stop working when users don’t start from library web pages, when they start from the open web, and every place they end up has no idea what institution they belong to or their appropriate institutional OpenURL link resolver.
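For readers who haven’t looked at one, an OpenURL is essentially citation metadata packed into a query string and pointed at an institution’s link resolver. Here is a minimal Python sketch using OpenURL 1.0 key/value pairs for a journal article; the resolver address and the citation are made up for illustration, and a real resolver may expect additional or different keys.

```python
from typing import Dict
from urllib.parse import urlencode


def build_openurl(resolver_base: str, citation: Dict[str, str]) -> str:
    """Build an OpenURL 1.0 (KEV) link for a journal article.

    resolver_base: hypothetical address of an institutional link resolver.
    citation:      referent fields such as atitle, jtitle, issn, volume,
                   spage, date; each is sent with the 'rft.' prefix.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    for key, value in citation.items():
        params[f"rft.{key}"] = value
    return f"{resolver_base}?{urlencode(params)}"


# Made-up citation details, e.g. as scraped from a platform page.
openurl = build_openurl(
    "https://resolver.example.edu/openurl",
    {
        "genre": "article",
        "atitle": "An Example Article",
        "jtitle": "Journal of Examples",
        "issn": "1234-5678",
        "volume": "12",
        "spage": "34",
        "date": "2015",
    },
)
```

The hard part is not building the link but knowing which resolver_base to use: a platform reached from the open web has no idea which institution the user belongs to.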

I think the only technical path we have (until/unless we can get vendors to improve their UIs, and I’m not holding my breath) is to intervene in the UI. What do I mean by intervene?

The LibX toolbar is one example: a toolbar you install in your browser that adds institutionally specific content and links to web pages, links that can help the user authenticate against a platform arrived at via the open web, even links that can scrape the citation details from a page and help the user get to another ‘appropriate copy’ with authentication.

The problem with LibX specifically is that browser toolbars seem to be a technical dead-end. It has proven pretty challenging to get a browser toolbar to keep working across browser versions. The LibX project seems more and more moribund: it may still be developed, but its documentation hasn’t kept pace, it’s unclear what it can do or how to configure it, and fewer browsers are supported. And especially as our users turn more and more to mobile (as the Ithaka report notes), they more and more often are using browsers in which plugins can’t be installed.

A “bookmarklet” approach might be worth considering, for targeting a wider range of browsers with less technical investment. Bookmarklets aren’t completely closed off in mobile browsers, although they are a pain in the neck for the user to add in many.

Zotero is another interesting example. Zotero, as well as its competitors including Mendeley, can successfully scrape citation details from many licensed platform pages. We’re used to thinking of Zotero as ‘bibliographic management’, but once it’s scraped those citation details, it can also send the user to the institutionally-appropriate link resolver with those citation details, which is what can get the user to the appropriate licensed copy, in an authenticated way. Here at my place of work we don’t officially support Zotero or Mendeley, and haven’t spent much time figuring out how to get the most out of even the bibliographic management packages we do officially support.

Perhaps we should spend more time with these, not just to support ‘bibliographic management’ needs, but as a method to get users from the open web to authenticated access to an appropriate copy. And perhaps we should do other R&D in ‘bookmarklets’; in machine learning for citation parsing so users can just paste a citation into a box (perhaps via bookmarklet) to get authenticated access to appropriate copy; in anything else we can think of to:

Get the user from the open web to licensed copies. To be able to provide some useful help for accessing scholarly resources to our patrons, instead of just serving as a checkbook. With some library branding, so they recognize us as doing something useful after all.
