Planet Code4Lib

William Denton: He knew the past, which is sometimes much worse

Wed, 2015-09-16 23:52

I’m rereading the Three Musketeers saga by Alexandre Dumas—one of the greatest of all works of literature—and just re-met a quote I wrote down when I first read it. It’s from chapter twenty-seven of Twenty Years After (Vingt ans après), and is about Urbain Grandier, but I’ll leave out his name to strengthen it:

“[He] was not a sorcerer, but a learned man, which is quite another thing. He did not foretell the future. He knew the past, which is sometimes much worse.”

In the original, it reads:

“[Il] n'était pas un sorcier, c'était un savant, ce qui est tout autre chose. [Il] ne prédisait pas l'avenir. Il savait le passé, ce qui quelquefois est bien pis.”

Evergreen ILS: Evergreen 2.9.0 released

Wed, 2015-09-16 23:49

Thanks to the efforts of many contributors, the Evergreen community is pleased to announce the release of version 2.9.0 of the Evergreen open source integrated library system. Please visit the download page to get it!

New features and enhancements of note in Evergreen 2.9.0 include:

  • Evergreen now supports placing blanket orders, allowing staff to invoice an encumbered amount multiple times, paying off the charge over a period of time.
  • There is now better reporting of progress when a purchase order is activated.
  • The Acquisitions Administration menu in the staff client is now directly accessible from the main “Admin” menu.
  • There is now an action/trigger event definition for sending alerts to users before their accounts are scheduled to expire.
  • When registering a new user, the duplicate record search now includes inactive users.
  • Evergreen now offers more options for controlling whether and when users can carry negative balances on their account.
  • The web-based self-check interface now warns the user if their session is about to expire.
  • The “Manage Authorities” results list now displays the thesaurus associated with each authority record.
  • Item statistical categories can now be set during record import.
  • The web staff interface preview now includes cataloging functionality, including a new MARC editor, Z39.50 record import, and a new volume/copy editor.
  • The account expiration date is now displayed on the user’s “My Account” page in the public catalog.
  • Users can now sort their lists of items checked out, check out history, and holds when logged into the public catalog.
  • The bibliographic record source is now available for use by public catalog templates.
  • The public catalog can now cache Template Toolkit templates, improving its speed.
  • On the catalog’s record summary page, there is now a link to allow staff to forcibly clear the cache of added content for that record.
  • Google Analytics (if enabled at all) is now disabled in the staff client.
  • Several deprecated parts of the code have been removed, including script-based circulation policies, the open-ils.penalty service, the legacy self-check interface, and the old “JSPAC” public catalog interface.

For more information about what’s in the release, check out the release notes.


District Dispatch: Pew study affirms vital role of libraries

Wed, 2015-09-16 22:41

Libraries are transforming amidst the changing information landscape and a report released this week by the Pew Research Center, Libraries at the Crossroads, affirms the evolving role of public libraries within their communities as vital resources that advance education and digital empowerment.

“Public libraries are transforming beyond their traditional roles and providing more opportunities for community engagement and new services that connect closely with patrons’ needs,” ALA President Sari Feldman said. “The Pew Research Center report shows that public libraries are far from being just ‘nice to have,’ but serve as a lifeline for their users, with more than 65 percent of those surveyed indicating that closing their local public library would have a major impact on their community.

Photo credit: Francis W. Parker School (Chicago, IL).

“Libraries are not just about what we have for people, but what we do for and with people,” Feldman said. “Today’s survey found that three-quarters of the public say libraries have been effective at helping people learn how to use new technologies. This is buttressed by the ALA’s Digital Inclusion Survey, which finds that virtually all libraries provide free public access to computers and the Internet, Wi-Fi, technology training and robust digital content that supports education, employment, e-government access and more.

“Although the report affirms the value of public libraries, the ALA recognizes the need for greater public awareness of the transformation of library services, as the report shows library visits over the past three years have slightly decreased. In response, libraries of all types are preparing for the launch of a national public awareness campaign entitled ‘Libraries Transform.’

“Libraries from across the country will participate in the campaign and will work to change the perception that ‘libraries are just quiet places to do research, find a book, and read’ to ‘libraries are centers of their communities: places to learn, create and share, with the help of library staff and the resources they provide,’” she noted.

The report also reveals that 75 percent of the public say libraries have been effective at helping people learn how to use new technologies. This is buttressed by the ALA’s Digital Inclusion Survey, which finds that virtually all libraries provide free public access to computers and the Internet, Wi-Fi, technology training and robust digital content that supports education, employment, e-government access and more.
With their accessibility to the public in virtually every community around the country, libraries offer online educational tools for students, employment resources for job-seekers, computer access for those without it and innovation centers for entrepreneurs of all ages.

Other interesting findings in the report that point to the vital role of libraries in communities nationwide include:

  • 65 percent maintain that libraries contribute to helping people decide what information they can trust.
  • 75 percent say libraries have been effective at helping people learn how to use new technologies.
  • 78 percent believe that libraries are effective at promoting literacy and love of reading.

The post Pew study affirms vital role of libraries appeared first on District Dispatch.

SearchHub: Lucidworks Fusion 2.1 Now Available!

Wed, 2015-09-16 19:27
Today we’re releasing Fusion 2.1 LTS, our most recent version of Fusion offering Long-Term Support (LTS). Last month, we released version 2.0 which brought a slew of new features as well as a new user experience. With Fusion 2.1 LTS, we have polished these features, and tweaked the visual appearance and the interactions. With the refinements now in place, we’ll be providing support and maintenance releases on this version for at least the next 18 months. If you’ve already tried out Fusion 2.0, Fusion 2.1 won’t be revolutionary, but you’ll find that it works a little more smoothly and gracefully. Besides the improvements to the UI, we’ve made a few back end changes:
  • Aggregation jobs now run only using Spark. In previous versions, you could run them in Spark optionally, or natively in Fusion. We’ve found we’re happy enough with Spark to make it the only option now.
  • You can now send alerts to PagerDuty. Previously, you could send an email or a Slack message. PagerDuty was a fairly popular request (see the sketch after this list).
  • Several new options for crawling websites
  • Improvements to SSL when communicating between Fusion nodes
  • A reorganization of the Fusion directory structure to better isolate your site-specific data and config from version-specific Fusion binaries, for easier upgrades and maintenance releases
  • Better logging and debuggability
  • Incremental enhancements to document parsing
  • As always, some performance, reliability, and stability improvements
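For readers wondering what a PagerDuty alert involves under the hood, here is a minimal sketch (in Python, using the requests library) of a call to PagerDuty's public Events API v2. It illustrates the general mechanism only, not Fusion's own implementation or configuration; the routing key and message text are placeholders.

```python
# Minimal sketch: trigger a PagerDuty incident via the Events API v2.
# Illustrative only -- not Fusion's internal code; the routing key is a placeholder.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # from a PagerDuty service integration

def trigger_alert(summary, source, severity="warning"):
    """Send a single 'trigger' event to PagerDuty and return its response."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,      # short, human-readable description
            "source": source,        # what raised the alert, e.g. a node name
            "severity": severity,    # critical, error, warning or info
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    trigger_alert("Document count dropped below expected threshold", "fusion-node-1")
```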
Whether you’re new to Fusion, or have only seen Fusion 1.x, we think there’s a lot you’ll like, so go ahead, download and try it out today!  

The post Lucidworks Fusion 2.1 Now Available! appeared first on Lucidworks.

David Rosenthal: "The Prostate Cancer of Preservation" Re-examined

Wed, 2015-09-16 15:00
My third post to this blog, more than 8 years ago, was entitled Format Obsolescence: the Prostate Cancer of Preservation. In it I argued that format obsolescence for widely used formats, such as those on the Web, would be rare, and that if it ever happened it would be a very slow process, allowing plenty of time for preservation systems to respond.

Thus devoting a large proportion of the resources available for preservation to obsessively collecting metadata intended to ease eventual format migration was economically unjustifiable, for three reasons. First, the time value of money meant that paying the cost later would allow more content to be preserved. Second, the format might never suffer obsolescence, so the cost of preparing to migrate it would be wasted. Third, if the format ever did suffer obsolescence, the technology available to handle it when obsolescence occurred would be better than when it was ingested.
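To make the time-value-of-money argument concrete, here is a small back-of-the-envelope calculation in Python. The per-object cost, discount rate and time horizon are illustrative assumptions, not figures from the original post.

```python
# Back-of-the-envelope illustration of the "defer the cost" argument.
# All numbers are assumptions chosen for illustration.

def present_value(future_cost, annual_discount_rate, years):
    """What a cost paid `years` from now is worth in today's money."""
    return future_cost / ((1 + annual_discount_rate) ** years)

cost_per_object = 1.00        # assumed cost of format-migration prep per object
discount_rate = 0.05          # assumed 5% annual discount rate
years_until_needed = 20       # assumed delay before migration is actually required

pv = present_value(cost_per_object, discount_rate, years_until_needed)
print(f"$1.00 spent in {years_until_needed} years costs about ${pv:.2f} today.")
# Deferring the work frees budget to ingest more content now, and the cost is
# avoided entirely if the format never becomes obsolete.
```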

Below the fold, I ask how well these predictions have held up in the light of subsequent developments.

Research by Matt Holden at INA in 2012 showed that the vast majority of even 15-year-old audio-visual content was easily rendered with current tools. The audio-visual formats used in the early days of the Web would be among the most vulnerable to obsolescence. The UK Web Archive's Interject prototype's Web site claims that these formats are obsolete and require migration:
  • image/x-bitmap and  image/x-pixmap, both rendered in my standard Linux environment via Image Viewer.
  • x-world/x-vrml, versions 1 and 2, not rendered in my standard Linux environment, but migration tools available.
  • ZX Spectrum software, not suitable for migration.
These examples support the prediction that archives will contain very little content in formats that suffer obsolescence.

The prediction that technology for access to preserved content would improve is borne out by recent developments. Two and a half years ago the team from Freiburg University presented their emulation framework bwFLA which, like those from the Olive Project at CMU and the Internet Archive, is capable of delivering an emulated environment to the reader as a part of a normal Web page. An example of this is Rhizome's art piece from 2000 by Jan Robert Leegte, untitled[scrollbars]. To display the artist's original intent, it is necessary to view the piece using a contemporary Internet Explorer, which Rhizome does using bwFLA.

Viewed with Safari on OS X.

Increasingly, scrollbars are not permanent but pop up when needed. Viewing the piece with, for example, Safari on OS X is baffling because the scrollbars are not visible.

The prediction that if obsolescence were to happen to a widely used format it would happen very slowly is currently being validated, but not for the expected reason and not as a demonstration of the necessity of format migration. Adobe's Flash has been a very widely used Web format. It is not obsolete in the sense that it can no longer be rendered. It is becoming obsolete in the sense that browsers are following Steve Jobs' lead and deprecating its use, because it is regarded as too dangerous in today's Internet threat environment:
Five years ago, 28.9% of websites used Flash in some way, according to Matthias Gelbmann, managing director at web technology metrics firm W3Techs. As of August, Flash usage had fallen to 10.3%.

But larger websites have a longer way to go. Flash persists on 15.6% of the top 1,000 sites, Gelbmann says. That’s actually the opposite situation compared to a few years ago, when Flash was used on 22.2% of the largest sites, and 25.6% of sites overall.

If browsers won't support Flash because it poses an unacceptable risk to the underlying system, much of the currently preserved Web will become unusable. It is true that some of that preserved Web is Flash malware, thus simply asking the user to enable Flash in their browser is not a good idea. But if Web archives emulated a browser with Flash, either remotely or locally, the risk would be greatly reduced.

Even if the emulation fell victim to the malware, the underlying system would be at much less risk. If the goal of the malware was to use the compromised system as part of a botnet, the emulation's short life-cycle would render it ineffective. Users would have to be warned against inputting any sensitive information that the malware might intercept, but it seems unlikely that many users would send passwords or other credentials via a historical emulation. And, because the malware was captured before the emulation was created, the malware authors would be unable to update it to target the emulator itself rather than the system it was emulating.

So, how did my predictions hold up?
  • It is clear that obsolescence of widely used Web formats is rare. Flash is the only example in two decades, and it isn't obsolete in the sense that advocates of preemptive migration meant.
  • It is clear that if it occurs, obsolescence of widely used Web formats is a very slow process. For Flash, it has taken half a decade so far, and isn't nearly complete.
  • The technology for accessing preserved content has improved considerably. I'm not aware of any migration-based solution for safely accessing preserved Flash content. It seems very likely that a hypothetical technique for migrating Flash would migrate the malware as well, vitiating the reason for the migration.
Three out of three, not bad!

LITA: My Capacity: What Can I Do and What Can I Do Well?

Wed, 2015-09-16 14:00

I like to take on a lot of projects. I love seeing projects come to fruition, and I want to provide the best possible services for my campus community. I think the work we do as librarians is important work. As I’ve taken on more responsibilities in my current job, though, I’ve learned I can’t do everything. I have had to reevaluate the number of things I can accomplish and the projects I can support.

Photo by Darren Tunnicliff. Published under a CC BY-NC-ND 2.0 license.

Libraries come in all different shapes and sizes. I happen to work at a small library. We are a small staff—3 professional librarians including the director, 2 full-time staff, 1 part-time staff member, and around 10 student workers. I think we do amazing things at my place of employment, but I know we can’t do everything. I would love to be able to do some of the projects I see staff at larger universities working on, but I am learning that I have to be strategic about my projects. Time is a limited resource and I need to use my time wisely to support the campus in the best way possible.

This has been especially true for tech projects. The maintenance, updating, and support needed for technology can be a challenge. Now don’t get me wrong, I love tech and my library does great things with technology, but I have also had to be more strategic as I recognize my capacity does have limits. So with new projects I’ve been asking myself:

  • How does this align with our strategic plan? (I’ve always asked this with new projects, but it is always good to remember)
  • What are top campus community needs?
  • What is the estimated time commitment for a specific project?
  • Can I support this long term?

Some projects are so important that you are going to work on the project no matter what your answers are to these questions. There are also some projects that are not even worth the little bit of capacity they would require.  Figuring out where to focus time and what will be the most beneficial for your community is challenging, but worth it.

How do you decide priorities and time commitments?

Library of Congress: The Signal: Cultural Institutions Embrace Crowdsourcing

Wed, 2015-09-16 13:28

“Children planting in Thos. Jefferson Park, N.Y.C.” Created by Bain News Service, publisher, between ca. 1910 and ca. 1915. Medium: 1 negative : glass ; 5 x 7 in. or smaller.

Many cultural institutions have accelerated the development of their digital collections and data sets by allowing citizen volunteers to help with the millions of crucial tasks that archivists, scientists, librarians, and curators face. One of the ways institutions are addressing these challenges is through crowdsourcing.

In this post, I’ll look at a few sample crowdsourcing projects from libraries and archives in the U.S. and around the world. This is strictly a general overview. For more detailed information, follow the linked examples or search online for crowdsourcing platforms, tools, or infrastructures.

In general, volunteers help with:

  • Analyzing images, creating tags and metadata, and subtitling videos
  • Transcribing documents and correcting OCR text
  • Identifying geographic locations, aligning/rectifying historical maps with present locations, and adding geospatial coordinates
  • Classifying data, cross-referencing data, researching historic weather, and monitoring and tracking dynamic activities.

The Library of Congress utilizes public input for its Flickr project. Visitors analyze and comment on the images in the Library’s general Flickr collection of over 20,000 images and the Library’s Flickr “Civil War Faces” collection. “We make catalog corrections and enhancements based on comments that users contribute,” said Phil Michel, digital conversion coordinator at the Library.

In another type of image analysis, Cancer Research UK’s Cellslider project invites volunteers to analyze and categorize cancer cell cores. Volunteers are not required to have a background in biology or medicine for these simple tasks. They are shown what visual elements to look for and instructed on how to record in the webpage what they see. Cancer Research UK states on its website that, as of the publication of this story, 2,571,751 images have been analyzed.

“Three British soldiers in trench under fire during World War I,” created by Realistic Travels, c1916 Aug. 15. Medium: 1 photographic print on stereo card : stereograph.

Both of the examples above use descriptive metadata or tagging, which helps make the images more findable by means of the specific keywords associated with — and mapped to — the images.
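As a rough sketch of why that keyword-to-image mapping makes collections more findable, here is a tiny inverted index in Python; the image identifiers and tags are invented for illustration.

```python
from collections import defaultdict

# Invented examples of volunteer-contributed tags keyed by image identifier.
image_tags = {
    "img_0001": ["civil war", "portrait", "uniform"],
    "img_0002": ["civil war", "encampment", "tents"],
    "img_0003": ["children", "gardening", "park"],
}

# Invert the mapping: each keyword points to the images that carry it.
index = defaultdict(set)
for image_id, tags in image_tags.items():
    for tag in tags:
        index[tag.lower()].add(image_id)

# A keyword search becomes a dictionary lookup instead of a scan of every record.
print(sorted(index["civil war"]))   # -> ['img_0001', 'img_0002']
```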

The British National Archives runs a project titled “Operation War Diary,” in which volunteers help tag and categorize the diaries of World War I British soldiers. The tags are fixed in a controlled vocabulary list, a menu from which volunteers select keywords, which helps avoid the typographical variations and errors that may occur when a crowd of individuals freely types in their own text.
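A minimal sketch of how a controlled vocabulary heads off free-text variation (the terms and submissions below are invented, not Operation War Diary's actual list):

```python
# Invented controlled vocabulary: only these exact terms are accepted as tags.
CONTROLLED_VOCABULARY = {"artillery", "casualty report", "trench raid", "weather"}

def accept_tag(submitted):
    """Accept a tag only if it normalizes to a term on the controlled list."""
    tag = submitted.strip().lower()
    return tag if tag in CONTROLLED_VOCABULARY else None

# Free-text variants that a menu of fixed terms would have prevented:
for attempt in ["Trench raid", "trench  raid", "trnch raid"]:
    print(attempt, "->", accept_tag(attempt))
# Only the first normalizes to an accepted term; the others would need review.
```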

The New York Public Library’s “Community Oral History Project” makes oral history videos searchable by means of topic markers tagged into the slider bar by volunteers; the tags map to time codes in the video. So, for example, instead of sitting through a one-hour interview to find a specific topic, you can click on the tag — as you would select from a menu — and jump to that tagged topic in the video.
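A minimal sketch of how such topic markers can be stored and used to jump into a recording; the interview topics and time codes are invented for illustration:

```python
# Invented topic markers for one oral-history video: tag -> offset in seconds.
markers = {
    "growing up in the neighborhood": 142,
    "opening the family store": 1375,
    "the 1977 blackout": 2210,
}

def seek_position(topic):
    """Return a mm:ss playback offset for a tagged topic."""
    minutes, seconds = divmod(markers[topic], 60)
    return f"{minutes:02d}:{seconds:02d}"

print(seek_position("the 1977 blackout"))   # -> "36:50"
```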

The National Archives and Records Administration offers a range of crowdsourcing projects on its Citizen Archivist Dashboard.  Volunteers can tag records and subtitle videos to be used for closed captions; they can even translate and subtitle non-English videos into English subtitles. One NARA project enables volunteers to transcribe handwritten old ship’s logs that, among other things, contain weather information for each daily entry. Such historic weather data is an invaluable addition to the growing body of data in climate-change research.

Transcription is one of the most in-demand crowdsourcing tasks. In the Smithsonian’s Transcription Center, volunteers can select transcription projects from at least ten of the Smithsonian’s 19 museums and archives. The source material consists of handwritten field notes, diaries, botanical specimen sheets, sketches with handwritten notations and more. Transcribers read the handwriting and type into the web page what they think the handwriting says. The Smithsonian staff then runs the data through a quality control process before finally accepting it. In all, the process comprises three steps (sketched after the list):

  1. The volunteer types the transcription into the web page
  2. Another set of registered users compares the transcriptions with the handwritten scans
  3. Smithsonian staff or trained volunteers review and have final approval over the transcription.
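Here is a minimal sketch of that three-step pipeline as a simple state machine. The class, state names and data model are assumptions made for illustration; they are not the Transcription Center's actual implementation.

```python
# Illustrative three-stage review pipeline mirroring the steps above.
class TranscriptionPage:
    def __init__(self, page_id, text):
        self.page_id = page_id
        self.text = text
        self.state = "transcribed"        # step 1: a volunteer typed the text

    def peer_review(self, matches_scan):
        """Step 2: another registered user compares the text with the scan."""
        if self.state == "transcribed" and matches_scan:
            self.state = "reviewed"

    def staff_approve(self):
        """Step 3: staff or trained volunteers give final approval."""
        if self.state == "reviewed":
            self.state = "approved"
        return self.state

page = TranscriptionPage("field-notes-p12", "Collected three specimens near the creek.")
page.peer_review(matches_scan=True)
print(page.staff_approve())   # -> "approved"
```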

Notable transcription projects from other institutions are the British Library’s Card Catalogue project, Europeana’s World War I documents, the Massachusetts Historical Society’s “The Diaries of John Quincy Adams,” The University of Delaware’s, “Colored Conventions,” The University of Iowa’s “DIY History,” and the Australian Museum’s Atlas of Living Australia.

Excerpt from the Connecticut war record, May 1864, from OCLC.

Optical Character Recognition is the process of taking text that has been scanned into solid images (essentially a photograph of text) and machine-transforming that text image into text characters and words that can be searched. The process often generates incomplete or mangled text. OCR is often a “best guess” by the software and hardware, so institutions ask for help comparing the source text image with its OCR text-character results and hand-correcting the mistakes.
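To show what comparing an OCR result with the page image amounts to in practice, here is a small sketch using Python's difflib to report exactly what a volunteer's correction changed; the sample strings are invented.

```python
import difflib

# Invented example: raw OCR output versus a volunteer's hand-corrected version.
ocr_output = "Tlie Connecticut war rec0rd, Mav 1864"
corrected  = "The Connecticut war record, May 1864"

matcher = difflib.SequenceMatcher(None, ocr_output, corrected)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        # Show each span the correction changed, inserted or deleted.
        print(f"{op}: {ocr_output[i1:i2]!r} -> {corrected[j1:j2]!r}")
```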

Newspapers comprise much of the source material. The Library of Virginia, The Cambridge Public Library, and the California Digital Newspaper collection are a sampling of OCR-correction sites. Examples outside of the U.S. include the National Library of Australia and the National Library of Finland.

The New York Public Library was featured in the news a few years ago for the overwhelming number of people who volunteered to help with its “What’s on the Menu” crowdsourcing transcription project, where the NYPL asked volunteers to review a collection of scanned historic menus and type the menu contents into a browser form.

NYPL Labs has gotten even more creative with map-oriented projects. With “Building Inspector” (whose peppy motto is “Kill time. Make history.”), it reaches out to citizen cartographers to review scans of very old insurance maps and identify each building — lot by lot, block by block — by its construction material, its address and its spatial footprint; in an OCR-like twist, volunteers are also asked to note the name of the then-existent business handwritten on the old city map (e.g. MacNeil’s Blacksmith, The Derby Emporium). Given the population density of New York, and the propensity of most of its citizens to walk almost everywhere, there is the potential for millions of eyes to look for this information in their daily environment, then go home and record it in the NYPL databases.

“Black-necked stilt,” photo by Edgar Alexander Mearns, 1887. Medium: 1 photographic print on cabinet card.

Volunteers can also use the NYPL Map Warper to rectify the alignment differences between contemporary maps and digitized historic maps. The British Library has a similar map-rectification crowdsourcing project called Georeferencer. Volunteers are asked to rectify maps scanned from 17th-, 18th- and 19th-century European books. In the course of the project, maps get geospatially enabled and become accessible and searchable through Old Maps Online.
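Behind the scenes, rectifying a scanned map generally means estimating a transform from control points that volunteers place on both the historic map and a modern one. Here is a minimal sketch that fits an affine transform by least squares with NumPy; the control points are invented, and production georeferencers typically rely on tools such as GDAL rather than hand-rolled math.

```python
import numpy as np

# Invented ground control points: pixel (x, y) on the scanned map paired with
# (longitude, latitude) on a modern base map.
pixels = np.array([[120, 340], [860, 310], [150, 1020], [900, 980]], dtype=float)
geo = np.array([[-74.01, 40.71], [-73.97, 40.71], [-74.01, 40.68], [-73.97, 40.68]])

# Affine model: [lon, lat] = [px, py, 1] @ M, with M a 3x2 coefficient matrix.
A = np.hstack([pixels, np.ones((len(pixels), 1))])
M, _, _, _ = np.linalg.lstsq(A, geo, rcond=None)

def pixel_to_geo(px, py):
    """Map a pixel on the scanned map to approximate geographic coordinates."""
    return np.array([px, py, 1.0]) @ M

print(pixel_to_geo(500, 660))   # roughly the centre of the invented control points
```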

Citizen Science projects range from the cellular level to the astronomical level. The Audubon Society’s Christmas Bird Count asks volunteers to go outside and report on what birds they see. The data goes toward tracking the migratory patterns of bird species.

Geo-Wiki is an international platform that crowdsources monitoring of the earth’s environment. Volunteers give feedback about spatial information overlaid on satellite imagery or they can contribute new data.

Gamification makes a game out of potentially tedious tasks. Malariaspot, from the Universidad Politécnica de Madrid, makes a game of identifying the parasites that cause malaria. Their website states, “The analysis of all the games played will allow us to learn (a) how fast and accurate is the parasite counting of non-expert microscopy players, (b) how to combine the analysis of different players to obtain accurate results as good as the ones provided by expert microscopists.”

Carnegie Mellon and Stanford collaboratively developed EteRNA, a game where users play with puzzles to design RNA sequences that fold up into target shapes and contribute to a large-scale library of synthetic RNA designs. MIT’s “Eyewire” uses gamification to get players to help map the brain. MIT’s “NanoDoc” enables game players to design new nanoparticle strategies for the treatment of cancer. The University of Washington’s Center for Game Science offers “Nanocrafter,” a synthetic biology game that enables players to use pieces of DNA to create new inventions. “Purposeful Gaming,” from the Biodiversity Heritage Library, is a gamified method of cleaning up sloppy OCR. Harvard uses the data from its “Test My Brain” game to test scientific theories about the way the brain works.

Crowdsourcing enables institutions to tap vast resources of volunteer labor and to gather and process information faster than ever, despite the daunting volume of raw data and the limitations of in-house resources. Sometimes the volunteers’ work goes directly into a relational database that maps to the target digital objects, and sometimes it is held until a human can review it and accept or reject it. The process requires institutions to trust “outsiders” — average people, citizen archivists, historians, hobbyists. If a project is well structured and the user instructions are clear and simple, there is little reason for institutions not to ask the general public for help. It’s a collaborative partnership that benefits everyone.

Ed Summers: Seminar Week 3

Wed, 2015-09-16 04:00

In this week’s class we discussed three readings: (Bates, 1999), (Dillon & Norris, 2005) and (Manzari, 2013). This was our last general set of readings about information science before diving into some of the specialized areas. I wrote about my reaction to Bates over in Red Thread.

The Manzari piece continues a series of studies, started in 1985, surveying faculty about the most prestigious journals in the field of Library and Information Science. 827 full-time faculty in ALA-accredited programs were surveyed and only 232 (27%) responded. No attempt seems to have been made to check whether non-response bias could have had an effect – but I’m not entirely sure that was needed. I couldn’t help but wonder if this series of studies could end up reinforcing the very thing it is studying. If faculty read the article, won’t it influence their thinking about where they should be publishing?

We used this article more as a jumping-off point in class to discuss our advisors’ top-tier and second-tier conferences to present at and journals to publish in. There were quite a few up on the board, even with just four of us in the class. We subdivided them into four groups, distinguished by the competitiveness and level of peer review associated with them. It was pretty eye-opening to hear how strong a signal these conferences were for everyone, with a great deal of perceived prestige being associated with particular venues. I was surprised at how nuanced perceptions were.

I had asked Steven Jackson for his ideas about conferences since I’m hoping to continue his line of research about repair in my own initial work (more about that in another post shortly). I won’t detail his response here (since I didn’t ask permission to do that), but the one conference I learned about that I didn’t know previously was Computer-Supported Cooperative Work and Social Computing, which does look like it could be an interesting venue to tune into. Another is the Society for Social Studies of Science.

We ended the class with the Marshmallow Challenge. We broke up into groups and then attempted to build the tallest structure we could using spaghetti, tape and string – as long as a marshmallow could be perched on top. The take-home from the exercise was the importance of getting feedback from prototypes and testing ideas in an iterative fashion. This was a bit of a teaser for the next class, which is going to be focused on Users and Usability. The resulting structure also reminded me of more than one software project I’ve contributed to over the years :-)


Bates, M. (1999). The invisible substrate of information science. Journal of the American Society for Information Science, 50(12), 1043–1050.

Dillon, A., & Norris, A. (2005). Crying wolf: An examination and reconsideration of the perception of crisis in LIS education. Journal of Education for Library and Information Science.

Manzari, L. (2013). Library and information science journal prestige as assessed by library and information science faculty. Library Quarterly, 83, 42–60.

District Dispatch: ECPA overhaul way “Oprah”-due

Tue, 2015-09-15 20:07

Remember when a new talk show called “Oprah” debuted, a fourth TV network called “Fox” started broadcasting, and every other headline was about the return of Halley’s Comet?  No?  Don’t feel badly; 1986 was so long ago that nobody else does either.  That’s also how long it’s been since Congress passed the Electronic Communications Privacy Act (ECPA): the principal law that controls when the government needs a search warrant to obtain the full content of our electronic communications.

You read right. As previously reported here in District Dispatch, the law protecting all of our email, texts, Facebook pages, Instagram accounts and Dropbox files was written when there was no real Internet, almost nobody used email, and the smallest mobile phone was the size of a brick.  Under the circumstances, it’s profoundly disturbing — but hardly surprising — that law enforcement authorities, with few exceptions, don’t need a search warrant issued by a judge to compel any company or other institution that houses your “stuff” online to hand over your stored documents, photos, files and texts once they’re more than six months old. (This ACLU infographic lays it all out very well.)

Happily, legislation to finally update ECPA for the digital age has been building steam in Congress for several years and the legislation pending now in the House (H.R. 699) and Senate (S.356) has extraordinary support (including at this writing nearly a quarter of all Senators and almost 300 of the 435 Members of the House).  On September 16, the Senate Judiciary Committee will hold a hearing on ECPA – the first in quite some time.  While no vote on an ECPA reform bill may occur for a while yet, if your Senator is a member of the Senate Judiciary Committee, now would be a great time to endorse ECPA reform by sending them one of the timely tweets below:

  • My email deserves just as much privacy protection as snail mail. Pls reform #ECPA now.
  • I vote, but #ECPA’s older than I am; let’s overhaul it while it’s still in its 20s too.
  • If we can’t rewrite #ECPA for the internet age then please correct its name to the Electronic Communication Invasion Act.
  • If #ECPA were a dog it would be 197 years old now. In an internet age it might as well be.  Please overhaul it now.

Maybe, just maybe, 2016 — ECPA’s 30th anniversary — could be the year that this dangerously anachronistic law finally gets the overhaul it’s needed for decades. . .  with your help.

The post ECPA overhaul way “Oprah”-due appeared first on District Dispatch.

SearchHub: Tuning Apache Solr for Log Analysis

Tue, 2015-09-15 18:53
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Radu Gheorghe’s session on tuning Solr for analyzing logs.

Performance tuning is always nice for keeping your applications snappy and your costs down. This is especially the case for logs, social media and other stream-like data that can easily grow into terabyte territory. While you can always use SolrCloud to scale out of performance issues, this talk is about optimizing. First, we’ll talk about Solr settings by answering the following questions:
  • How often should you commit and merge?
  • How can you have one collection per day/month/year/etc.? (a rough sketch follows the session description below)
  • What are the performance trade-offs for these options?
Then, we’ll turn to hardware. We know SSDs are fast, especially on cold-cache searches, but are they worth the price? We’ll give you some numbers and let you decide what’s best for your use case.

The last part is about optimizing the infrastructure pushing logs to Solr. We’ll talk about tuning Apache Flume for handling large flows of logs and about overall design options that also apply to other shippers, like Logstash. As always, there are trade-offs, and we’ll discuss the pros and cons of each option.

Radu is a search consultant at Sematext, where he works with clients on Solr and Elasticsearch-based solutions. He is also passionate about the logging ecosystem (yes, that can be a passion!) and feeds this passion by working on Logsene, a log analytics SaaS. Naturally, at conferences such as Berlin Buzzwords, Monitorama, and of course Lucene Revolution, he speaks about indexing logs. Previous presentations were about designing logging infrastructures that provide functionality (e.g., parsing logs), performance and scalability. This time, the objective is to take a deeper dive on performance.

Tuning Solr for Logs: Presented by Radu Gheorghe, Sematext from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…
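As a concrete illustration of the one-collection-per-day pattern raised in the questions above, here is a rough sketch that creates a dated log collection through the SolrCloud Collections API. The Solr URL, config set name, and shard/replica counts are placeholder assumptions, not recommendations from the talk.

```python
# Rough sketch: create a per-day log collection via the SolrCloud Collections API.
# Host, config set, and shard/replica counts are placeholders.
import datetime
import requests

SOLR_URL = "http://localhost:8983/solr"

def create_daily_collection(prefix="logs", num_shards=2, replication_factor=1):
    name = f"{prefix}_{datetime.date.today():%Y_%m_%d}"   # e.g. logs_2015_10_13
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": "logs_config",   # assumed pre-uploaded config set
        "wt": "json",
    }
    response = requests.get(f"{SOLR_URL}/admin/collections", params=params, timeout=120)
    response.raise_for_status()
    return name

# Indexing then targets today's collection, and expired days can be dropped
# cheaply with action=DELETE instead of deleting documents one by one.
```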

The post Tuning Apache Solr for Log Analysis appeared first on Lucidworks.

Tim Ribaric: 3/4 of class in the bag (Sabbatical Part 7)

Tue, 2015-09-15 18:20

Now my coursework is just about done...


D-Lib: The Future of Institutional Repositories at Small Academic Institutions: Analysis and Insights

Tue, 2015-09-15 14:43
Article by Mary Wu, Roger Williams University

D-Lib: Taking Control: Identifying Motivations for Migrating Library Digital Asset Management Systems

Tue, 2015-09-15 14:43
Article by Ayla Stein, University of Illinois at Urbana-Champaign and Santi Thompson, University of Houston Libraries

D-Lib: Enhancing the LOCKSS Digital Preservation Technology

Tue, 2015-09-15 14:43
Article by David S. H. Rosenthal, Daniel L. Vargas, Tom A. Lipkis and Claire T. Griffin, LOCKSS Program, Stanford University Libraries

D-Lib: The Value of Flexibility on Long-term Value of Grant Funded Projects

Tue, 2015-09-15 14:43
Article by Lesley Parilla and Julia Blase, Smithsonian Institution

D-Lib: The Sixth Annual VIVO Conference

Tue, 2015-09-15 14:43
Article by Carol Minton Morris, Duraspace

D-Lib: Enduring Access to Rich Media Content: Understanding Use and Usability Requirements

Tue, 2015-09-15 14:43
Article by Madeleine Casad, Oya Y. Rieger and Desiree Alexander, Cornell University Library

D-Lib: Success Criteria for the Development and Sustainable Operation of Virtual Research Environments

Tue, 2015-09-15 14:43
Article by Stefan Buddenbohm, Goettingen State and University Library; Heike Neuroth, University of Applied Science Potsdam; Harry Enke and Jochen Klar, Leibniz Institute for Astrophysics Potsdam; Matthias Hofmann, Robotics Research Institute, TU Dortmund University