planet code4lib

Planet Code4Lib - http://planet.code4lib.org
Updated: 18 hours 57 min ago

John Miedema: Orlando and Watson Part II. Pseudonym as a simple illustration of semantic search.

Mon, 2014-11-03 15:57

A common problem with searching for information is that a concept can have many different surface forms. It is difficult for a researcher to know all the forms, let alone type them in for every search.

Orlando is a digital literary resource, a structured “textbase” about British women writers. This resource can be utilized by IBM’s Watson Content Analytics to provide semantic search and analysis. Here is a simple illustration.

In Figure 1, suppose I know the interesting pseudonym, “Will Chip, a Carpenter.” Must be a male writer, yes? Not so fast. I select the pseudonym for a search. There are sixteen matching documents in this small sample.

In Figure 2, I switch to Documents view. The name “Hannah More”, a female writer, is highlighted in the documents. Hannah More is Will Chip, a Carpenter. It is her pseudonym. This link was provided by Orlando. Semantic links like this can be applied to every concept in IBM’s Watson Content Analytics, facilitating literary research across millions of documents.
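The idea of linking surface forms to one concept can be sketched in a few lines of Python. This is only an illustration, not how Watson Content Analytics or Orlando work internally: the alias table, the function names, and the sample documents are all hypothetical, and only the Hannah More / Will Chip pairing comes from the post.

```python
# Hypothetical alias table: canonical name -> known surface forms.
# Only the Hannah More / "Will Chip, a Carpenter" link is taken from the post.
ALIASES = {
    "Hannah More": ["Hannah More", "Will Chip, a Carpenter"],
}


def build_index(aliases):
    """Map each lower-cased surface form to its canonical name."""
    index = {}
    for canonical, forms in aliases.items():
        for form in forms:
            index[form.lower()] = canonical
    return index


def expand_query(term, aliases):
    """Return every surface form that shares a canonical entity with term."""
    canonical = build_index(aliases).get(term.lower())
    if canonical is None:
        return [term]  # unknown term: search for it as typed
    return list(aliases[canonical])


def search(term, documents, aliases):
    """Return the documents that mention any surface form of the queried concept."""
    forms = [form.lower() for form in expand_query(term, aliases)]
    return [doc for doc in documents if any(form in doc.lower() for form in forms)]


# Made-up sample documents, for illustration only.
DOCUMENTS = [
    "Hannah More wrote moral tracts.",
    "A ballad signed Will Chip, a Carpenter.",
    "An unrelated biography.",
]
matches = search("Will Chip, a Carpenter", DOCUMENTS, ALIASES)
```

Searching for the pseudonym also returns the document that mentions only the real name, which is the behaviour the two figures illustrate.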

DPLA: Open Marketing and Outreach Committee Call: Friday, November 7, 2:00 PM Eastern

Mon, 2014-11-03 14:32

The DPLA’s Marketing and Outreach Committee will lead an open committee call on Friday, November 7 at 2:00 PM Eastern. To register, complete the short registration form available via the link below.

Agenda
  1. Update on recent education focus groups and discussion of other educational uses for DPLA
  2. Open discussion of broad uses for DPLA
  3. Questions, comments, and open discussion

All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

Islandora: Islandora 100

Mon, 2014-11-03 14:05

As of the time of this posting, there are 86 Islandora sites on our map of Islandora worldwide:

We know there are far more sites out there, so I am issuing a challenge to the Islandora Community: 100 dots on the Islandora Map by 2015. If your repo is not on our map, then please send me the details (institution, repo link, and location) of your library, university, museum, community group, or other public repository, so I can put you on the map. When we reach 100, I will draw three names from those who have submitted sites since the challenge, and those three lucky Islandorians will be the first to receive our awesome Islandora Tuque Tuque.

Fourteen more sites in two months. We can do it!

 

Hydra Project: New Hydra Steering Group members

Mon, 2014-11-03 09:35

We are delighted to announce that Jon Dunn (Indiana University) and Mike Giarlo (Penn State University) have accepted invitations to join the Hydra Steering Group. Both are acknowledged leaders in the community with much experience in digital libraries; Jon additionally brings his background of key roles on the Avalon Project and at Indiana University and Mike his background of key roles in Sufia, RDF, HAWG and at Penn State.  We look forward to working with them more closely.

FOSS4Lib Recent Releases: Sufia - 4.1.0

Sun, 2014-11-02 19:47
Package: Sufia
Release Date: Friday, October 31, 2014

Last updated November 2, 2014. Created by Peter Murray on November 2, 2014.

The 4.1.0 release of Sufia includes functionality to support proxy deposits and transfers of ownership.

Patrick Hochstenbach: Homework assignment #8 Sketchbookskool

Sun, 2014-11-02 16:14
“Visit a pub and draw the body language of people you find there.” So I went to a pub... actually two pubs this afternoon. The first one was a very quiet pub near a public park. Older couples entered and

Patrick Hochstenbach: Homework assignment #7 Sketchbookskool

Sun, 2014-11-02 15:56
For our next assignment we had to create a book cover of our favorite book. This time I had no inspiration and drew David Cameron instead. I tried to do the coloring in water color, but this isn’t really my

Patrick Hochstenbach: Homework assignment #6 Sketchbookskool

Sun, 2014-11-02 15:51
The assignment was a little warmup: “I want you to draw a little character in less than 15 to 20 minutes”. I took my Copic markers and sat down at my desk. Using my imagination I drew this little story

John Miedema: Orlando and Watson demonstration. Analytics without metadata.

Sat, 2014-11-01 20:28

Orlando is a digital index of the lives and works of British women writers. I have the privilege of using the Orlando resource in collaboration with Susan Brown. For discussion in the context of NovelTM, I have put together a quick demo that integrates Orlando in IBM’s Watson Content Analytics.

Orlando is structured data, making associations between names, places and works. However, it is not precisely metadata. Metadata is “data about data”, and Orlando does not classify content directly. Not yet. I extracted a subset of the Orlando data and converted it into Natural Language Processing annotators. Annotators can be used to extract structure from unstructured content and make it analyzable. In this case, the content is a small set of about 300 biographical documents. The demo illustrates how analytics can be performed without the labour-intensive work of manual metadata classification.

Figure 1. The Orlando extract has been mapped to facets in Watson Content Analytics. For example, the “Author (Orlando)” facet lists Maria Abdy, Elizabeth Carter, Horace Walpole, and many others, with their associated frequency counts. Horace Walpole has 20 hits.

Figure 2. Switching to the Documents view, an analyst discovers documents federated from multiple sources. In this case, the 20 documents for Horace Walpole. The Documents view provides a single interface for in depth document analysis.

Figure 3. A number of visualizations provide a quick way to analyze documents. In this view, the Birth Region facet is paired up with the Religion facet, both showing values from Orlando. The red square highlights a strong correlation between the Midlothian birth region and the Free Church of Scotland. It’s a jumping point to filter documents and discover additional patterns.
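As a rough sketch of what pairing two facets means, the following Python counts how often each pair of facet values occurs together across documents. It is an assumption-laden stand-in: the product presumably uses a more refined correlation measure than raw co-occurrence counts, and the records, field names, and values below are invented toy data, not the Orlando extract.

```python
from collections import Counter


def pair_counts(records, facet_a, facet_b):
    """Count co-occurrences of (facet_a value, facet_b value) across records."""
    counts = Counter()
    for record in records:
        a = record.get(facet_a)
        b = record.get(facet_b)
        if a is not None and b is not None:
            counts[(a, b)] += 1
    return counts


# Invented toy records; the field names are hypothetical.
RECORDS = [
    {"birth_region": "Midlothian", "religion": "Free Church of Scotland"},
    {"birth_region": "Midlothian", "religion": "Free Church of Scotland"},
    {"birth_region": "Midlothian", "religion": "Anglican"},
    {"birth_region": "Kent", "religion": "Anglican"},
    {"birth_region": "Kent"},  # no religion annotated: skipped
]

counts = pair_counts(RECORDS, "birth_region", "religion")
strongest_pair, strongest_count = counts.most_common(1)[0]
```

The pair with the highest count is the kind of cell a heat-map view would highlight as a starting point for filtering.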

There’s so much more to show and tell.

FOSS4Lib Recent Releases: Koha - Maintenance releases v 3.14.11 and 3.16.4

Fri, 2014-10-31 21:17
Package: Koha
Release Date: Thursday, October 2, 2014

Last updated October 31, 2014. Created by David Nind on October 31, 2014.

Bug fix and maintenance releases for Koha. See the release announcements for the details:

FOSS4Lib Recent Releases: Islandora - 7.x-1.4

Fri, 2014-10-31 21:07
Package: Islandora
Release Date: Friday, October 31, 2014

Last updated October 31, 2014. Created by Peter Murray on October 31, 2014.

Release notes and download links are here along with updated documentation, and you can grab an updated VM (sandbox.islandora.ca will be updated soon).

State Library of Denmark: Sudden Solr performance drop

Fri, 2014-10-31 20:33

There we were, minding other assignments and keeping a quarter of an eye on our web archive indexer and searcher. The indexer finished its 900GB Solr shard number 22 and the searcher was updated to a total of 19TB / 6 billion documents. With a bit more than 100GB free for disk cache (or about 1/2 percent of total index size), things were relatively unchanged, compared to ~120GB free a few shards ago. We expected no problems. As part of the index update, an empty Solr was created as entry-point, with a maximum of 3 concurrent connections, to guard against excessive memory use.

But something was off. User-issued searches seemed slower. Quite a lot slower for some of them. Time for a routine performance test and comparison with old measurements.

256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index

As the graph shows very clearly, response times rose sharply with the number of hits in a search in our 19TB index. At first glance that seems natural, but as the article Ten times faster explains, this should be a bell curve, not an ever-upgoing hill. The bell curve can be seen for the old 12TB index. Besides, those new response times were horrible.

Investigating the logs showed that most of the time was spent resolving facet terms for fine-counting. There are hundreds of those for the larger searches and the log said it took 70ms for each, neatly explaining total response times of 10 or 20 seconds. Again, this would not have been surprising if we were not used to much better numbers. See Even sparse faceting is limited for details.
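The arithmetic behind that explanation is easy to check. The sketch below uses the 70ms per lookup from the logs; the term counts of 150 and 300 are assumptions picked to stand in for "hundreds", not measured values, and the function name is made up for this example.

```python
def fine_count_seconds(term_lookups, ms_per_lookup=70):
    """Total time spent on sequential facet-term lookups, in seconds."""
    return term_lookups * ms_per_lookup / 1000


# Assumed term counts; only the 70ms figure comes from the logs.
low_estimate = fine_count_seconds(150)
high_estimate = fine_count_seconds(300)
```

With these assumed counts the lookups alone land in the 10 to 20 second range, matching the observed response times.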

A Systems guy turned off swap, then cleared the disk cache, as disk cache clearing has helped us before in similar puzzling situations. That did not help this time: Even non-faceted searches had outliers above 10 seconds, which is 10 times worse than with the 12TB index.

Due to unrelated circumstances, we then raised the number of concurrent connections for the entry-point-Solr from 3 to 50 and restarted all Solr instances.

256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index, post Solr-restart

Welcome back, great performance! You were sorely missed. The spread as well as the average for the 19TB index are larger than for its 12TB counterpart, but not drastically so.

So what went wrong?

  • Did the limiting of concurrent searches at the entry-Solr introduce a half-deadlock? That seems unlikely as the low-level logs showed the unusually high 70ms/term lookup-time, which is done without contact to other Solrs.
  • Did the Solr-restart clean up OS-memory somehow, leading to better overall memory performance and/or disk caching?
  • Were the Solrs somehow locked in a state with bad performance? Maybe a lot of garbage collection? Their Xmx is 8GB, which has been fine since the beginning: As each shard runs in a dedicated tomcat, the new shards should not influence the memory requirements of the Solrs handling the old ones.

We don’t know what went wrong and which action fixed it. If performance starts slipping again, we’ll focus on trying one thing at a time.

Why did we think clearing the disk cache might help?

It is normally advisable to use Huge Pages when running a large Solr server. Whenever a program requests memory from the operating system, this is done as pages. If the page size is small and the system has a lot of memory, there will be a lot of bookkeeping. It makes sense to use larger pages and have less bookkeeping.

Our indexing machine has 256GB of RAM, a single 32GB Solr instance and constantly starts new Tika processes. Each Tika process takes up to 1GB of RAM and runs for an average of 3 minutes. 40 of these are running at all times, so at least 10GB of fresh memory is requested from the operating system each minute.
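The memory churn can be estimated from the numbers above. This is a back-of-the-envelope sketch, assuming a steady state where processes start as fast as they finish; the function name is made up, and the 0.75GB average footprint in the second call is an assumption, since the post only states an upper bound of 1GB per process.

```python
def fresh_memory_gb_per_minute(concurrent, lifetime_minutes, gb_per_process):
    """Memory requested per minute when processes are replaced as they finish."""
    starts_per_minute = concurrent / lifetime_minutes
    return starts_per_minute * gb_per_process


# 40 concurrent Tika processes, 3 minutes average lifetime.
upper_bound = fresh_memory_gb_per_minute(40, 3, 1.0)     # every process uses the full 1GB
assumed_typical = fresh_memory_gb_per_minute(40, 3, 0.75)  # 0.75GB is an assumed average
```

At the 1GB upper bound this gives roughly 13GB per minute, so "at least 10GB" holds as long as the average process uses about three quarters of its maximum.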

We observed that the indexing speed of the machine fell markedly after some time, down to 1/4th of the initial speed. We also observed that most of the processing time was spent in kernel space (the %sy in a Linux top). Systems theorized that we had a case of OS memory fragmentation due to the huge pages. They tried flushing the disk cache (echo 3 >/proc/sys/vm/drop_caches) to reset part of the memory and performance was restored.
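For readers who want to track the kernel-space share over time rather than watch top, here is a minimal Python sketch that computes it from two snapshots of /proc/stat on Linux. It is not what we ran; the function names are made up for this example, and the sample snapshots are invented numbers so the code can be run anywhere.

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def system_share(before, after):
    """Fraction of CPU time spent in kernel space between two snapshots."""
    start = parse_cpu_line(before)
    end = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(start, end)]
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user time, so it is left out of the total.
    total = sum(deltas[:8])
    return deltas[2] / total if total else 0.0


# Invented snapshots in /proc/stat format, for illustration only.
BEFORE = "cpu  100 0 50 850 0 0 0 0 0 0\n"
AFTER = "cpu  200 0 350 1450 0 0 0 0 0 0\n"
share = system_share(BEFORE, AFTER)

# On a Linux machine, take real snapshots a few seconds apart:
#   with open("/proc/stat") as handle:
#       snapshot = handle.read()
```

A share that keeps climbing while indexing throughput drops would match the pattern described above.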

A temporary fix of clearing the disk cache worked fine for the indexer, but the lasting solution for us was to disable the use of huge pages on that server.

The searcher got the same no-huge-pages treatment, which might have been a mistake. Contrary to the indexer, the searcher rarely allocates new memory and as such looks like an obvious candidate for using huge pages. Maybe our performance problems stemmed from too much bookkeeping of pages? Not fragmentation as such, but simply the size of the structures? But why would it help to free most of the memory and re-allocate it? Do the size and complexity of the page-tracking structures increase with use, rather than being constant? Seems like we need to level up in Linux memory management.

Note: I strongly advise against using repeated disk cache flushing as a solution. It cures the symptom, not the cause, and introduces erratic search performance. But it can be very useful as a poking stick when troubleshooting.

On the subject of performance…

The astute reader will have noticed that the performance-difference is strange at the 10³ mark. This is because the top of the bell curve moves to the right as the number of shards increases. See Even sparse faceting is limited for details.

In order to make the performance comparison apples-to-apples, the no_cache numbers were used. Between the 12TB and the 19TB mark, sparse facet caching was added, providing a slight speed-up to distributed faceting. Let’s add that to the chart:

256GB RAM, faceting on 6 fields, facet limit 25, unwarmed searches, 12TB and 19TB of index, post Solr-restart

 

Although the index size was increased by 50%, sparse facet caching kept performance at the same level or better. It seems that our initial half-dismissal of the effectiveness of sparse facet caching was not fair. Now we just need to come up with similar software improvements each month and we will never need to buy new hardware.

Do try this at home

If you want to try this on your own index, simply use sparse solr.war from GitHub.


LITA: Free Web Tools for Top-Notch Presentations

Fri, 2014-10-31 17:00

Visually appealing and energizing slideshows are the lifeblood of conference presentations. But using animated PowerPoints or zooming Prezis to dizzy audiences delivers little more appeal than packing slides with text on a low-contrast background. Key to winning hearts and minds are visual flair AND minimalism, humor, and innovative use of technology.

Memes

Delightfully whimsical, memes are a fantastic ice-breaker and laugh-inducer. My last two library conference presentations used variants of the crowd-pleasing “One does not simply…” Boromir meme above, which never fails to generate laughter and praise. Memes.com offers great selections, is free of annoying popup ads, and is less likely than other meme generators to be blocked by your workplace’s Internet filters for being “tasteless.” (Yes, I speak from personal experience…)

Keep Calm-o-matic 

Do you want your audience to chuckle and identify with you? Everyone who’s ever panicked or worked under a deadline will appreciate the Keep Calm-o-matic. As with memes, variations are almost infinite.

Recite This

Planning to include quotations on some of your slides? Simply copy and paste your text into Recite This, then select an aesthetically pleasing template in which the quote will appear. Save time, add value.

Wordle

This free web tool enables you to paste text or a URL to generate a groovy word cloud. Vary sizes, fonts, and color schemes too. Note that Wordle’s Java applet refuses to function smoothly in Chrome. There are other word cloud generators, but Wordle is still gold.

Dictation

This is the rare dictation tool that doesn’t garble what you say, at least not excessively. It’s free, online, and available as a Chrome app. Often when preparing presentations, I simply start talking and then read over what I said. This is a valuable exercise in prewriting and a way to generate zingers and lead-ins to substantive content.

Poll Everywhere

Conduct live polls of your audience using texting and Twitter! Ask open-ended or multiple-choice questions and then watch the live results appear on your PowerPoint slide or web browser.  Poll Everywhere and equivalents such as EverySlide engage audiences and heighten interest more than a mere show of hands, especially for larger audiences in which many members otherwise would not be able to contribute to the discussion. Use whenever appropriate.

Emaze

This online presentation software offers incredible visual appeal and versatility without inducing either vertigo or snoozes. Create your slides in the browser, customize a range of attractive templates, and access from any device with an Internet connection (major caveat, that). You must pay to go premium to download slideshows, but this reservation aside, the free version is an outstanding product.

DoNotLink

Ever attempted to show a website containing misinformation or hate speech as part of an information literacy session but didn’t want to drive traffic to the site? DoNotLink is your friend! Visit or link to shady sites without increasing their search engine ranking.

Serendip-o-matic

Simply paste some text, and this serendipity search tool will draw on the Digital Public Library of America (DPLA), Flickr, Europeana, and other open digital repositories to produce related photographs, art, and documents that are visually displayed. Serendip-o-matic reveals unexpected connections between diverse materials and offers good, nerdy fun to boot. “Let your sources surprise you!”

So . . . what free web tools do you use to jazz up your presentations?

Riley Childs: TTS Video

Fri, 2014-10-31 15:59

(Video is on its way; there is an issue with the camera on my laptop.)

Hello, I am Riley Childs, a 17-year-old student at Charlotte United Christian Academy. I am deeply involved there and am in charge/support of our network, *nix servers, viz servers, library stuff and of course end-user computers. One of my favorite things to do is work in the Library and administer IT awesomeness. I also work in the theater at CPCC as an Electrician. Another thing that I love to do is participate in a community called code4lib where I assist others and post about library technology. I also post to the Koha mailing list where I help out others who have issues with Koha. Overall I love technology and I believe in the freedom of information, and that is why I love librarians: they are all about distribution of information. In addition to all this indoor stuff I also enjoy a good day hike and like to go backpacking every once in a while.
Once again I am very sorry that this isn’t a video. I will try and post one soon (I kinda jumped the gun on submitting my app!).
Thanks
//Riley

The post TTS Video appeared first on Riley's blog at https://rileychilds.net.

OCLC Dev Network: Learn More About Software Development Practices at November Webinars

Fri, 2014-10-31 15:15

We're excited to announce two new webinars based on our recent popular blog series covering some of our favorite software development practices. Join Karen Coombs as she walks you through a collection of tools designed to close communication gaps throughout the development process. Registration for both 1-hour webinars is free and now open.

David Rosenthal: This is what an emulator should look like

Fri, 2014-10-31 15:00
Via hackaday, [Jörg]'s magnificently restored PDP10 console, connected via a lot of wiring to a BeagleBone running the SIMH PDP10 emulator. He did the same for a PDP11. Two computers that gave me hours of harmless fun back in the day!

Kids today have no idea what a computer should look like. But even they can run [Jörg]'s Java virtual PDP10 console!

Islandora: Islandora 7.x-1.4 Release Announcement

Fri, 2014-10-31 13:42

I am extremely pleased to announce the release of Islandora 7.x-1.4!

This is our second community release, and I couldn't be happier with how much we've grown and progressed as a community. This software has continued to improve because of you!

We have an absolutely amazing team to thank for this:

Adam Vessey
Alan Stanley
Dan Aiken
Donald Moses
Ernie Gillis
Gabriela Mircea
Jordan Dukart
Kelli Babcock
Kim Pham
Kirsta Stapelfeldt
Lingling Jiang
Mark Jordan
Melissa Anez
Nigel Banks
Paul Pound
Robin Dean
Sam Fritz
Sara Allain
Will Panting
 

Now for the release info!

Release notes and download links are here along with updated documentation, and you can grab an updated VM here (sandbox.islandora.ca will be updated soon).

I'd like to highlight a few things. This release includes 48 bug fixes since the last release, and 23 document improvements. Along with those improvements, we have two new modules: Islandora Videojs (an Islandora viewer module using Video.js) and Islandora Solr Views (exposes Islandora Solr search results in a Drupal view).

Our next release will be out in April. If you would like to be a part of the release team (you'll get an awesome t-shirt!!!), keep an eye on the list for a call for 7.x-1.5 volunteers. We'll need folks as component managers, testers, and documenters.

That's all I have for now.

cheers!

-nruest

Library of Congress: The Signal: An Online Event & Experimental Born Digital Collecting Project: #FolklifeHalloween2014

Fri, 2014-10-31 12:14

If you haven’t heard, as the title of the press release explains, the Library of Congress Seeks Halloween Photos For American Folklife Center Collection.  As of writing this morning, there are now 288 photos shared on Flickr with the #folklifehalloween2014 tag. If you browse through the results, you can see a range of ways folks are experiencing, seeing, and documenting Halloween and Dia de los Muertos. Everyone has until November 5th to participate. So send this, or some of the links in this post, along to a few other people to spread the word.

Svayambhunath Buddha O’Lantern, Shared by user birellsalsh on Flickr

Because of the nature of this event, you can follow along in real time and see how folks are responding to this in the photostream. See the American Folklife Center’s blog posts on this for a more in depth explanation and some additional context of this project and a set of step-by-step directions about how people can participate. As this is still a live and active event, I wanted to make sure we had a post up about it today for people to share these links with others.

Consider emailing a link to this to any shutterbug friends and colleagues you have. In particular, there is an explicit interest in photos that document the diverse range of communities’ experiences of the holiday. So if you are part of an often underrepresented community it would be great to see that point of view in the photo stream. With that noted, I also wanted to take this opportunity to highlight some of the things about this event that I think are relevant to the digital collecting and preservation focus of The Signal.

Rapid Response Collecting & a Participatory Online Event

Aside from the fun of this project (I mean, it’s people’s Halloween photos!) I am interested to see how it plays out as a potential mode of contemporary collecting. I think there is a potential for this kind of public event focused on documenting our contemporary world to fit in with ideas like “rapid response collecting” that the Victoria and Albert Museum has been forwarding as well as notions of shared historical authority and conceptions of public participation in collection development.

We can’t know how this will end up playing out over the next few days of the event. However, I can already see how something like this could serve cultural institutions as a means to work with communities to document, interpret and share our perspectives on themes and issues that cultural heritage organizations document and collect in.

Oh and just a note of thanks to Adriel Luis, who shared a bit of wisdom and lessons learned from his work at the Asian Pacific American Center on the Day in the Life of Asian Pacific American event.

So, consider helping to spread the word and sharing some photos!
