Planet Code4Lib

LITA: Creating campus-wide technology partnerships: Mission impossible?

Mon, 2015-07-27 13:00

Libraries have undergone significant changes in the last five years, shifting from repositories to learning spaces, from places to experiences. Much of this is due to our growing relationships with our IT, instructional technology, and research colleagues as the lines between technology and library-related work become continually more blurred.

But it’s not always easy to establish these types of partnerships, especially if there haven’t been any connections to build on. So how can you approach outreach to your IT campus departments and individuals?

There are typically two types of partnerships that you can initiate:

1. There is a program already established, and you would like the library to be involved where it wasn’t involved before

2. You are proposing something completely new

All you have to do is convince the coordinator or director of the project or department that having the library become part of that initiative is a good thing, especially if they don’t think you have anything to offer. Easier said than done, right? But what happens if that person is not responding to your painstakingly crafted email? If the person is a director or chair, chances are they have an assistant who is much more willing to communicate with you and can often make headway where you can’t.

Ask if you can attend a departmental meeting, or if they can help you set up a meeting with the person who can move things forward. Picking up the phone doesn’t hurt either; if someone is in their office, they might, just might, be inclined to talk with you, as opposed to responding to the email you sent days ago, which is by now buried under an avalanche of other messages and duly ignored.

Always try to send an agenda ahead of time so they know what you’re thinking; that additional time might be just the thing they need to consider your ideas instead of having to come up with something on the spot. Plus, if you’re nervous, the agenda will serve as your discussion blueprint and can keep you from rambling or going off on tangents. Remember, the person in front of you has many other things to think about, and like it or not, you have to make good use of their time!

After the meeting, along with your thank you, be sure to remind them of the action items that were discussed; that way, when you contact others within the department to move your initiative forward, they are not left wondering what’s going on and why you’re bugging them. Asking who might be the best person to help with the action items you identify will also help you avoid pestering the director later; there’s nothing worse than getting the green light and then having to backtrack or delay because you forgot to ask who to work with! From there on out, creating a system for communicating regularly with everyone involved is your priority. Make sure everyone who needs to be at the table receives an invitation and understands why they are there. Clarify who is in charge and what the expectations of the work are. Assume that they know nothing, and that the only thing their supervisor or colleague has told them is that they will be working with the library on a project.

You might also have to think outside the proverbial IT box when it comes to building partnerships. For example, creating a new Makerspace might not start with IT, but rather with a department that is interested in incorporating it into its curriculum. Of course IT will become part of the equation at some point, but that unit might not be the best route to creating this type of space, and an academic department may be willing to help split the cost because its students are getting the benefits.

Finally, IT nowadays comes in many forms, and where you once thought the campus supercomputing center had nothing to do with your work, finding out exactly what its mission is and what it does could come in handy. For example, you might discover that it can provide storage for large data sets and could use some help spreading the word to faculty about this. Bingo! You’ve just identified an opportunity for those in the library who are involved in this type of work to collaborate on a shared communication plan: you introduce what the library is doing to help faculty with their data management plans, and the center helps store that same data.

Bottom line, technology partnerships are vital if libraries are going to expand their reach and become even more integrated into the academic fabric of their institutions. But making those connections isn’t always easy, especially because some units might not see the immediate benefits of such collaborations. Getting to the table is often the hardest step in the process, but keeping these simple things in mind will (hopefully) smooth the way:

1. Look at all possible partners, not just the obvious IT connections

2. Be willing to try different modes of outreach if your preferred method isn’t having success

3. Be prepared to demonstrate what the library can bring to the table and follow through

DuraSpace News: NOW AVAILABLE: Lower per Terabyte Cost for Additional ArchivesDirect Storage

Mon, 2015-07-27 00:00

Winchester, MA – There will always be more, not less, data. That fact makes it likely that you will need more archival storage space than you originally planned for. Rapid, on-the-fly collection development, unexpected gifts of digital materials, and rich media often require additional storage. ArchivesDirect has lowered the per-terabyte cost of additional storage to make using the service more cost-effective for organizations and institutions seeking to ensure that their digital footprint is safe and accessible for future generations.

DuraSpace News: Telling VIVO Stories at the University of Colorado Boulder with Liz Tomich

Mon, 2015-07-27 00:00

“Telling VIVO Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing VIVO implementation details for the VIVO community and beyond. The following interview includes personal observations that may not represent the opinions and views of the University of Colorado Boulder or the VIVO Project.

Julia Trimmer, Duke University, talked with Liz Tomich at the University of Colorado Boulder to learn about their VIVO story.

ZBW German National Library of Economics: skos-history: New method for change tracking applied to STW Thesaurus for Economics

Sun, 2015-07-26 22:00

“What’s new?” and “What has changed?” are questions users of Knowledge Organization Systems (KOS), such as thesauri or classifications, ask when a new version is published. Much more so when a thesaurus that has existed since the 1990s has been completely revised, subject area by subject area. After four interim versions published in as many consecutive years, ZBW's STW Thesaurus for Economics has recently been re-launched as version 9.0. In total, 777 descriptors have been added; 1,052 (of about 6,000) have been deprecated, the vast majority of them merged into others. More subtle changes include modified preferred labels, and merges and splits of existing concepts.

Since STW was first published on the web in 2009, we have gone to great lengths to make change traceable: no concept and no web page has been deleted, and everything from prior versions is still available. Following a presentation at DC-2013 in Lisbon, I started the skos-history project, which aims to exploit published SKOS files of different versions for change tracking. A first beta implementation of Linked-Data-based change reports went live with STW 8.14, making use of SPARQL "live queries" (as described in a prior post). With the publication of STW 9.0, full reports of the changes are available. How do they work?

The basic idea is to exploit the power of SPARQL on named graphs of different versions of the thesaurus. After having loaded these versions into a "version store", we can compute deltas (version differences) and save them as named graphs, too. A combination of the dataset versioning ontology (dsv:) by Johan De Smedt, the skos-history ontology (sh:), SPARQL service description (sd:) and VoiD (void:) provides the necessary plumbing in a separate version history graph:


With that in place, we can query the version store for, e.g., the concepts added between two versions, like this:

PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX sd:   <http://www.w3.org/ns/sparql-service-description#>
PREFIX sh:   <http://purl.org/skos-history/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Identify concepts inserted with a certain version
SELECT DISTINCT ?concept ?prefLabel
WHERE {
 # query the version history graph to get a delta and via that the relevant graphs
 ?delta a sh:SchemeDelta ;
  sh:deltaFrom/dc:identifier "8.14" ;
  sh:deltaTo/dc:identifier "9.0" ;
  sh:deltaFrom/sh:usingNamedGraph/sd:name ?oldVersionGraph ;
  dct:hasPart ?insertions .
 ?insertions a sh:SchemeDeltaInsertions ;
  sh:usingNamedGraph/sd:name ?insertionsGraph .
 # for each inserted concept, a newly inserted prefLabel must exist ...
 GRAPH ?insertionsGraph {
  ?concept skos:prefLabel ?prefLabel .
 }
 # ... and the concept must not exist in the old version
 FILTER NOT EXISTS {
  GRAPH ?oldVersionGraph {
   ?concept ?p []
  }
 }
}

The resulting report, cached for better performance and availability, can be found in the change reports section of the STW site, together with reports on deprecation/replacement of concepts, changed preferred labels, hierarchy changes, and merges and splits of concepts (descriptors as well as the higher-level subject categories of STW). The queries used to create the reports are available on GitHub and linked from the report pages.

The methodology allows for aggregating changes over multiple versions and levels of the hierarchy of a concept scheme. That enabled us to gather information for the complete overhaul of STW, and to visualize it in change graphics:

The method applied here to STW is in no way specific to it. It does not rely on transaction logging by the internal thesaurus management system, nor on any other out-of-band knowledge, but solely on the published SKOS files. Thus, it can be applied to other knowledge organization systems, by their publishers as well as by interested users of the KOS. Experiments with TheSoz, Agrovoc and the Finnish YSO have already been conducted; example endpoints with multiple versions of these vocabularies (and of STW, of course) are provided by ZBW Labs.

At the Finnish National Library, as well as at the FAO, approaches are under way to explore the applicability of skos-history to the thesauri and maintenance workflows there. In the context of STW, the change reports are mostly optimized for human consumption. We hope to learn more about how people use them in automatic or semi-automatic processes – for example, to update changed preferred labels in systems working with prior versions of STW, to review indexed titles attached to split-up concepts, or to transfer changes to derived or mapped vocabularies. If you want to experiment, please fork on GitHub. Contributions in the issue queue as well as pull requests are highly welcome.

More detailed information can be found in a paper (Leveraging SKOS to trace the overhaul of the STW Thesaurus for Economics), which will be presented at DC-2015 in São Paulo.




Open Library Data Additions: Big list of ASINs (ASIN)

Sun, 2015-07-26 21:15

A list of ASINs (Amazon product identifiers) generated by extracting ASIN-shaped strings from the list of pages crawled by the Wayback Machine.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Metadata

Eric Hellman: Library Privacy and the Freedom Not To Read

Sun, 2015-07-26 16:40
One of the most difficult privacy conundrums facing libraries today is how to deal with the data that their patrons generate in the course of using digital services. Commercial information services typically track usage in detail, keep the data indefinitely, and regard the data as a valuable asset. Data is used to make many improvements, often to personalize the service to best meet the needs of the user. User data can also be monetized; as I've written here before, many companies make money by providing web services in exchange for the opportunity to track users and help advertisers target them.

A Maginot Line fortification. Photo from the US Army.

The downside to data collection is its impact on user privacy, something that libraries have a history of defending, even at the risk of imprisonment. Since the Patriot Act, many librarians have believed that the best way to defend user privacy against legally sanctioned intrusion is to avoid collecting any sensitive data. But as libraries move onto the web, that defense seems more and more like a Maginot Line: impregnable, but easy to get around. (I've written about an effort to shore up some weak points in library privacy defenses.)

At the same time, "big data" has clouded the picture of what constitutes sensitive data. The correlation of digital library use with web activity outside the library can impact privacy in ways that never would occur in a physical library. For example, I've found that many libraries unknowingly use Amazon cover images to enrich their online catalogs, so that even a user who is completely anonymous to the library ends up letting Amazon know what books they're searching for.

Recently, I've been serving on the Steering Committee of a NISO initiative to establish a set of principles that libraries, providers of services to libraries, and publishers can use to support patron privacy. We held an in-person meeting in San Francisco at the end of July. There was solid support from libraries, publishers and service companies for improving reader privacy, but some issues were harder than others. The issues around data collection and use attracted the widest divergence in opinion.

One approach that was discussed centered on classifying different types of data depending on the extent to which they impact user privacy. This is also the approach taken by most laws governing privacy of library records. They mostly apply only to "Personally Identifiable Information" (PII), which usually means a person's name, address, phone number, etc., but is sometimes defined to include the user's IP address. While it's important to protect this type of information, in practice this usually means that less personal information lacks any protection at all.

I find that the data classification approach is another Maginot privacy line. It encourages the assumption that collection of demographics data – age, gender, race, religion, education, profession, even sexual orientation – is fair game for libraries and participants in the library ecosystem. I raised some eyebrows when I suggested that demographic groups might deserve a level of privacy protection in libraries, just as individuals do.

OCLC's Andrew Pace gave an example that brought this home for us all. When he worked as a librarian at NC State, he tracked usage of the books and other materials in the collection. Every library needs to do this for many purposes. He noticed that materials placed on reserve for certain classes received little or no usage, and he thought that faculty shouldn't be putting so many things on reserve, effectively preventing students not taking the class from using these materials. And so he started providing usage reports to the faculty.

In retrospect, Andrew pointed out that, without thinking much about it, he might have violated the privacy of students by informing their teachers that they weren't reading the assigned materials. After all, if a library wants to protect a user's right to read, it also has to protect the right not to read. Nobody's personally identifiable information had been exposed, but the combination of library data – a list of books that hadn't circulated – with some non-library data – the list of students enrolled in a class and the list of assigned reading – had intersected in a way that exposed individual reading behavior.

What this example illustrates is that libraries MUST collect at least SOME data that impinges on reader privacy. If reader privacy is to be protected, a "privacy impact assessment" must be made on almost all uses of that data.  In today's environment, users expect that their data signals will be listened to and their expressed needs will be accommodated. Given these expectations, building privacy in libraries is going to require a lot of work and a lot of thought.

Terry Reese: Code4LibMW 2015 Write-up

Sat, 2015-07-25 02:17

Whew – it’s been a wonderfully exhausting few days here in Columbus, OH, as the Libraries played host to Code4LibMW. This has been something I’ve been looking forward to ever since making the move to The Ohio State University; the C4L community has always been one of my favorites, and while the annual conference continues to be one of the most important meetings on my calendar, it’s within these regional events that I’m always reminded why I enjoy being a part of this community.

I shared a story with the folks in Columbus this week. As one of the folks who attended the original C4L meeting in Corvallis back in 2006 (BTW, there were 3 other original attendees in Columbus this week), I remember a lot of things about that event quite fondly: pizza at American Dream, my first experience doing a lightning talk, the joy of a conference where people were writing code as they stood on stage waiting their turn to present, Roy Tennant pulling up the IRC channel while he was on stage so he could keep an eye on what we were all saying about him. It was just a lot of fun, and part of what made it fun was that everyone got involved. During that first event, there were around 80 attendees, and nearly every person made it onto the stage to talk about something that they were doing, something that they were passionate about, or something that they had been inspired to build during the course of the week. You still get this at times at the annual conference, but with its sheer size and weight, it’s become much harder to give everyone that opportunity to share the things that interest them, or to easily connect with other people who might have those same interests. And I think that’s the purpose that these regional events can serve.

By and large, the C4L regional events feel much more like those early days of the C4L annual conference. They are small, usually free to attend, with a schedule that shifts and changes throughout the day. They are also the place where we come together, meet local colleagues and learn about all the fantastic work being done at institutions of all sizes and all types. And that’s what the C4LMW meeting was for me this year. As the host, I wanted to make sure that the event had enough structure to keep things moving, but had a place for everyone to participate. For me, that was going to be the measure of success: did we not just put on a good program, but did this event help to make connections within our local community? And I think that in this, the event was successful. I was doing a little bit of math, and over the course of the two days, I think we had a participation rate close to 90%, and an opportunity for everyone that wanted to get up and just talk about something they found interesting. And to be sure, there is a lot of great work being done out here by my Midwest colleagues (yes, even those up in Michigan).

Over the next few days, I’ll be collecting links and making the slides available via the C4LMW 2015 home page as well as wrapping up a few of the last responsibilities of hosting an event, but I wanted to take a moment and again thank everyone that attended.  These types of events have never been driven by the presentations, the hosts, or the presenters – but have always been about the people that attend and the connections that we make with the people in the room.  And it was a privilege this year to have the opportunity to host you all here in Columbus. 



Karen G. Schneider: The well of studiousness

Sat, 2015-07-25 01:58

Pride 2015

My relative quiet is because my life has been divided for a while between work and studying for exams. But I share this photo by former PUBLIB colleague and retired librarian Bill Paullin from the 2015 Pride March in San Francisco, where I marched with my colleagues in what suddenly became an off-the-hook celebration of what one parade marshal drily called, “Thank you, our newly-discovered civil rights.”

I remember the march, but I also remember the  hours before our contingent started marching, chatting with dear colleagues about all the important things in life while around us nothing was happening. It was like ALA Council, except with sunscreen, disco music, and free coconut water.

Work is going very well. Team Library is made of professionals who enjoy what they do and commit to walking the walk. The People of the Library did great things this summer, including eight (yes eight) very successful “chat with a librarian” sessions for parent orientations, and a wonderful “Love Your Library” carnival for one student group. How did we get parents to these sessions? Schmoozing, coffee, and robots (as in, tours of our automated retrieval system). We had a competing event, but really — coffee and robots? It’s a no-brainer. Then I drive home to our pretty street in a cute part of a liveable city, and that is a no-brainer, too.

I work with such great people that clearly I did something right in a past life. Had some good budget news. Yes please! Every once in a while I think, I was somewhere else before I came here, and it was good; I reflect on our apartment in San Francisco, and my job at Holy Names. I can see myself on that drive to work, early in the morning, twisting down Upper Market as the sun lit up the Bay Bridge and the day beckoned, full of challenge and possibility. It was a good part of my life, and I record these moments in the intergalactic Book of Love.

And yet: “a ship in port is safe, but that’s not what ships are built for.” I think of so many good things I learned in my last job, not the least of which was the gift of radical hospitality. I take these things with me, and yet the lesson for me is that I was not done yet. It is interesting to me that in the last few months I learned that for my entire adult life I had misunderstood the word penultimate. It does not mean the final capper; it means the place you go before you go to that place. I do not recall what made me finally look up this term, except that when I did, I felt I was receiving a message.

Studying is going very well, except my brain is unhappy about ingesting huge amounts of data into short-term memory to be regurgitated on a closed-book test. Cue lame library joke: what am I, an institutional repository? Every once in a while I want to share a bon mot from my readings with several thousand of my closest friends, then remember that people who may be designing the questions I’ll be grappling with are on the self-same networks. So you see pictures of our Sunday house meetings and perhaps a random post or share, but the things that make me go “HA HA HA! Oh, that expert in […….redacted……..] gets off a good one!” stay with me and Samson, our ginger cat, who is in charge of supervising my studies, something he frequently does with his eyes closed.

We have landed well, even after navigating without instruments through a storm. Life is good, and after this winter, I have a renewed appreciation for what it means for life to be good. That second hand moves a wee faster every year, but there are nonetheless moments captured in amber, which we roll from palm to palm, marveling in their still beauty.


David Rosenthal: Amazon owns the cloud

Fri, 2015-07-24 19:21
Back in May I posted about Amazon's Q1 results, the first in which they broke out AWS, their cloud services, as a separate item. The bottom line was impressive:
AWS is very profitable: $265 million in profit on $1.57 billion in sales last quarter alone, for an impressive (for Amazon!) 17% net margin.

Again via Barry Ritholtz, Re/Code reports on Q2:
Amazon Web Services, ... grew its revenue by 81 percent year on year in the second quarter. It grew faster and with higher profit margins than any other aspect of Amazon’s business.

AWS, which offers leased computing services to businesses, posted revenue of $1.82 billion, up from $1 billion a year ago, as part of its second-quarter results.

By comparison, retail sales in North America grew only 26 percent to $13.8 billion from $11 billion a year ago.

The cloud computing business also posted operating income of $391 million — up an astonishing 407 percent from $77 million at this time last year — for an operating margin of 21 percent, making it Amazon’s most profitable business unit by far. The North American retail unit turned in an operating margin of only 5.1 percent.

Revenue growing at 81% year-on-year, at a 21% and growing margin, despite:

price competition from the likes of Google, Microsoft and IBM.

Amazon clearly dominates the market; the competition is having no effect on their business. As I wrote nearly a year ago, based on Benedict Evans’ analysis:

Amazon’s strategy is not to generate and distribute profits, but to re-invest their cash flow into starting and developing businesses. Starting each business absorbs cash, but as they develop they turn around and start generating cash that can be used to start the next one.

Unfortunately, S3 is part of AWS for reporting purposes, so we can’t see the margins for the storage business alone. But I’ve been predicting for years that if we could, we would find them to be very generous.

Harvard Library Innovation Lab: Link roundup July 24, 2015

Fri, 2015-07-24 16:56

A block of links sourced from the team. We’ve got Annie, Adam, dano, and Matt!

A Light Sculpture Is Harvesting San Francisco’s Secrets

The form reminds me of an alien planet’s shrubbery. Shrub with status updates.

Watch a Computer Attempt to Sing 90s Power Ballads—With Feeling

Soulful synth h/t @grok_

Street Artist and City Worker Have Year Long Exchange on a Red Wall in London «TwistedSifter

Street artist versus city worker

Toki Pona: A Language With a Hundred Words – The Atlantic

Now, to combine Toki Pona with emoji…

Swedish Puzzle Rooms Test Teams’ Wits and Strength | Mental Floss

Obstacle course puzzle rooms in an old department store. Why not in a library?

LITA: Outernet: A Digital Library in the Sky

Fri, 2015-07-24 13:00


To me, libraries have always represented a concentration of knowledge. Growing up I dreamt about how smart I’d be if I read all of the books in my hometown’s tiny local branch library.  I didn’t yet understand the subtle differences between libraries, archives and repositories, but I knew that the promise of the internet and digital content meant that, someday, I’d be able to access all of that knowledge as if I had a library inside my computer. The idea of aggregating all of humanity’s knowledge in a way that makes it freely accessible to everyone is what led me to library school, programming, and working with digital libraries/repositories, so whenever I find a project working towards that goal I get tingly. Outernet makes me feel very tingly.

In a nutshell, Outernet is a startup that got sponsored by a big nonprofit, and aims to use satellites to broadcast data down to Earth. By using satellites, they can avoid issues of internet connectivity, infrastructure, political censorship and local poverty. The data they plan to provide would be openly licensed educational materials specifically geared towards underprivileged populations such as local news, crop prices, emergency communications, open source applications, literature, textbooks and courseware, open access academic articles, and even the entirety of Wikipedia. Currently the only way to receive Outernet’s broadcasts is with a homemade receiver, but a low cost (~$100) solar-powered, weather-proof receiver with built in storage is in the works which could be mass produced and distributed to impoverished or disaster-stricken areas.

Outernet chooses the content to be added to its core archive with a piece of software called Whiteboard, which acts as a kind of Reddit for broadcast content: volunteers submit URLs pointing to content they believe Outernet should broadcast, and the community can upvote or downvote it, with the top-ranking content making it into the core archive, democratizing the process. A separate piece of software called Librarian acts as the interface to locally received content; current receivers act as a Wi-Fi hotspot which users can connect to and use Librarian to explore, copy or delete content, as well as to configure the data Librarian harvests. Public access points are being planned for places like schools, hospitals and public libraries where internet connectivity isn’t feasible, with a single person administering the receiver and its content but allowing read-only access to anyone.

While the core work is being done by Outernet Inc., much of the project relies on community members volunteering time to discuss ideas and test the system. You can find more about the community online, but the primary way to participate is to build a receiver yourself and report feedback, or to submit and vote on content using Whiteboard. While Outernet is still a long way off from achieving its goals, it’s still one of the most exciting and fun ideas I’ve heard about in a while and definitely something to keep an eye on.


pinboard: Welcome

Fri, 2015-07-24 01:18

Roy Tennant: Yet Another Metadata Zoo

Thu, 2015-07-23 23:51

I was talking with my old friend John Kunze a little while back and he described a project that he is involved with called “Yet Another Metadata Zoo.” In a world of more ontologies than you can shake a stick at, it aims to provide a simple, easy-to-use mechanism for defining and maintaining individual metadata terms and their definitions.

The project explains itself like this:

The YAMZ Metadictionary (metadata dictionary) prototype…is a proof-of-concept web-based software service acting as an open registry of metadata terms from all domains and from all parts of “metadata speech”. With no login required, anyone can search for and link to registry term definitions. Anyone can register to be able to login and create terms.

We aim for the metadictionary to become a high-quality cross-domain metadata vocabulary that is directly connected to evolving user needs. Change will be rapid and affordable, with no need for panels of experts to convene and arbitrate to improve it. We expect dramatic simplification compared to the situation today, in which there is an overwhelming number of vocabularies (ontologies) to choose from.

Our hope is that users will be able to find most of the terms they need in one place (one vocabulary namespace), namely, the Metadictionary. This should minimize the need for maintaining expensive crosswalks with other vocabularies and cluttering up expressed metadata with lots of namespace qualifiers. Although it is not our central goal, the vocabulary is shovel-ready for those wishing to create linked data applications.

If you have a Google ID, signing in is dead simple and you can begin creating and editing terms. You can also vote terms up or down, which can eventually take a term from “vernacular” status (the default for new terms) to “canonical” — terms that are considered stable and unchanging. A third status is “deprecated”.

You can browse terms to see what is there already.

I really like this project for several reasons:

  • It’s dead simple.
  • It’s fast and easy to gain value from it. 
  • Every term has an identifier, forever and always (deprecated terms keep their identifier).
  • Voting and commenting are a key part of the infrastructure, and provide easy mechanisms for it to get ever better over time.

What it needs now is more people involved, so it can gain the kind of input and participation that is necessary to make it a truly authoritative source of metadata element names and descriptions. I’ve already contributed to it; how about you?


District Dispatch: Libraries: Apply now for 2016 IMLS National Medals

Thu, 2015-07-23 21:56

Institute of Museum and Library Services logo

The application period is now open for the 2016 National Medal for Museum and Library Service, the nation’s highest honor. Each year, the Institute of Museum and Library Services (IMLS) recognizes libraries and museums that make significant and exceptional contributions in service to their communities. Nomination forms are due October 1, 2015.

Read more from IMLS:

All types of nonprofit libraries and library organizations, including academic, school, and special libraries, archives, library associations, and library consortia, are eligible to receive this honor. Public or private nonprofit museums of any discipline (including general, art, history, science and technology, children’s, and natural history and anthropology), as well as historic houses and sites, arboretums, nature centers, aquariums, zoos, botanical gardens, and planetariums are eligible.

Winners are honored at a ceremony in Washington, DC, host a two-day visit from StoryCorps to record community member stories, and receive positive media attention. Approximately thirty finalists are selected as part of the process and are featured by IMLS during a six-week social media and press campaign.

Anyone may nominate a museum or library for this honor, and institutions may self-nominate. For more information, reach out to one of the following contacts.

Program Contact for Museums:
Mark Feitl, Museum Program Specialist

Program Contact for Libraries:
Katie Murray, Staff Assistant

The Institute of Museum and Library Services is the primary source of federal support for the nation’s 123,000 libraries and 35,000 museums.

The post Libraries: Apply now for 2016 IMLS National Medals appeared first on District Dispatch.

District Dispatch: Senate cybersecurity bill attacks civil liberties

Thu, 2015-07-23 21:08

Credit – 22860 (Flickr)

It’s back to the barricades for librarians and our many civil liberties coalition allies. District Dispatch sounded the alarm a year ago about the return of privacy-hostile cybersecurity or information sharing legislation. Widely dubbed a “zombie” for its ability to rise from the legislative dead, the current version of the bill goes by the innocuous name of the “Cybersecurity Information Sharing Act” but “CISA” (S. 754) is anything but. The bill was approved in secret session by the Senate’s Intelligence Committee and has not received a single public hearing.  Unfortunately, Senate Leadership is pushing for a vote on S. 754 in the few legislative days remaining before its August recess.

CISA is touted by its supporters as a means of preventing future large-scale data breaches, like the massive one just suffered by the federal government’s Office of Personnel Management. CISA, however, will create legal responsibilities and incentives for both private companies and federal agencies to collect, widely disseminate and retain huge amounts of Americans’ personally identifiable information that will itself then be vulnerable to sophisticated hacking attacks.  In the process, the bill also creates massive exemptions for private companies under virtually every major consumer privacy protection law now on the books.

Worse yet, collected personal information would be shared almost instantly not just among federal cyber-threat agencies, but with law enforcement entities at every level of government.  The bill does not restrict how long the data can be retained and what kinds of non-cyber offenses the information may be used to prosecute.  If enacted, that would be an unprecedented and sweeping end run on the Fourth Amendment.

CISA also allows both the government and private companies to take rapid unilateral “counter-measures” to disable any computer network, including for example a library system’s or municipal government’s, that the company believes is the source of a cyber-attack … and companies get immunity from paying for any resulting damages even if their “belief” turns out to be wrong.

It’s time for librarians to rise again, too . . . to the challenge of stopping CISA in its tracks yet again.  For lots more information, and to contact the offices of both your U.S. Senators, please visit now!  It’s fast, easy and couldn’t be more timely!

The post Senate cybersecurity bill attacks civil liberties appeared first on District Dispatch.

Nicole Engard: Bookmarks for July 23, 2015

Thu, 2015-07-23 20:30

Today I found the following resources and bookmarked them on Delicious.

  • hylafax: The world’s most advanced open source fax server

Digest powered by RSS Digest

The post Bookmarks for July 23, 2015 appeared first on What I Learned Today....

Related posts:

  1. Faxing via the Web
  2. No more free faxing with
  3. MarkMail: Mailing List Search

Jonathan Rochkind: Virtual Shelf Browse

Thu, 2015-07-23 19:17

We know that some patrons like walking the physical stacks, to find books on a topic of interest to them through that kind of browsing of adjacently shelved items.

I like wandering stacks full of books too, and hope we can all continue to do so.

But in an effort to see if we can provide an online experience that fulfills some of the utility of this kind of browsing, we’ve introduced a Virtual Shelf Browse that lets you page through books online, in the order of their call numbers.

An online shelf browse can do a number of things you can’t do physically walking around the stacks:

  • You can do it from home, or anywhere you have a computer (or mobile device!)
  • It brings together books from various separate physical locations in one virtual stack. Including multiple libraries, locations within libraries, and our off-site storage.
  • It even includes checked-out books, and in some cases ebooks (if we have a call number on record for them)
  • It can place one item at multiple locations in the virtual shelf, if we have more than one call number on record for it. There’s always more than one way you could classify or characterize a work; a physical item can only be in one place at a time, but not so in a virtual display.

The UI is based on the open source stackview code released by the Harvard Library Innovation Lab. Thanks to Harvard for sharing their code, and to @anniejocaine for helping me understand the code, and accepting my pull requests with some bug fixes and tweaks.

This is to some extent an experiment, but we hope it opens up new avenues for browsing and serendipitous discovery for our patrons.

You can drop into one example place in the virtual shelf browse here, or drop into our catalog to do your own searches — the Virtual Shelf Browse is accessed by navigating to an individual item detail page, and then clicking the Virtual Shelf Browse button in the right sidebar.  It seemed like the best way to enter the Virtual Shelf was from an item of interest to you, to see what other items are shelved nearby.

Our Shelf Browse is based on ordering by Library of Congress Call Numbers. Not all of our items have LC call numbers, so not every item appears in the virtual shelf, or has a “Virtual Shelf Browse” button to provide an entry point to it. Some of our local collections are shelved locally with LC call numbers, and these are entirely present. For other collections —  which might be shelved under other systems or in closed stacks and not assigned local shelving call numbers — we can still place them in the virtual shelf if we can find a cataloger-suggested call number in the MARC bib 050 or similar fields. So for those collections, some items might appear in the Virtual Shelf, others not.

On Call Numbers, and Sorting

Library call number systems — from LC, to Dewey, to Sudocs, or even UDC — are a rather ingenious 19th century technology for organizing books in a constantly growing collection such that similar items are shelved nearby. Rather ingenious for the 19th century anyway.

It was fun to try bringing this technology — and the many hours of cataloger work that’s gone into constructing call numbers — into the 21st century to continue providing value in an online display.

It was also challenging in some ways. It turns out the ordering of Library of Congress call numbers in particular is difficult to implement in software: there are a bunch of odd cases where the proper ordering might be clear to a human (at least to a properly trained human, and different libraries might even order differently!), but it is hard to encode all of those cases in software.

The newly released Lcsort ruby gem does a pretty marvelous job of allowing sorting of LC call numbers that properly sorts a lot of them — I won’t say it gets every valid call number, let alone local practice variation, right, but it gets a lot of stuff right including such crowd-pleasing oddities as:

  • `KF 4558 15th .G6` sorts after `KF 4558 2nd .I6`
  • `Q11 .P6 vol. 12 no. 1` sorts after `Q11 .P6 vol. 4 no. 4`
  • Can handle suffixes after cutters as in popular local practice (and NLM call numbers), eg `R 179 .C79ab`
  • Variations in spacing or punctuation that should not matter for sorting, `R 169.B59.C39` vs `R169 B59C39 1990` `R169 .B59 .C39 1990` etc.

Lcsort is based on the cumulative knowledge of years of library programmer attempts to sort LC call numbers, including an original implementation based on much trial and error by Bill Dueber of the University of Michigan, a port to ruby by Nikitas Tampakis of Princeton University Library, advice and test cases based on much trial and error from Naomi Dushay of Stanford, and a bunch more code wrangling by me.

I do encourage you to check out Lcsort for any LC call number ordering needs, if you can do it in ruby — or even port it to another language if you can’t. I think it works as well as or better than anything our community of library technologists has done yet in the open.
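For instance, once each call number is reduced to a normalized key, ordering becomes plain string sorting. A minimal sketch, assuming Lcsort’s normalize API:

require 'lcsort'

calls = ['KF 4558 2nd .I6', 'KF 4558 15th .G6',
         'Q11 .P6 vol. 12 no. 1', 'Q11 .P6 vol. 4 no. 4']

# Lcsort.normalize returns a string that sorts correctly byte-wise,
# or nil when the input can't be parsed as an LC call number.
sorted = calls.sort_by { |c| Lcsort.normalize(c) || '' }
# 'KF 4558 2nd .I6' now sorts before 'KF 4558 15th .G6', and
# 'Q11 .P6 vol. 4 no. 4' before 'Q11 .P6 vol. 12 no. 1'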

Check out my code — rails_stackview

This project was possible only because of the work of so many that had gone before, and been willing to share their work, from Harvard’s stackview to all the work that went into figuring out how to sort LC call numbers.

So it only makes sense to try to share what I’ve done too, to integrate a stackview call number shelf browse in a Blacklight Rails app.  I have shared some components in a Rails engine at rails_stackview

In this case, I did not do what I’d have done in the past, and try to make a rock-solid, general-purpose, highly flexible and configurable tool that integrated as brainlessly as possible out of the box with a Blacklight app. I’ve had mixed success trying to do that before, and came to think it might have been over-engineering and YAGNI to try. Additionally, there are just too many ways to try to do this integration — and too many versions of Blacklight changes to keep track of — I just wasn’t really sure what was best and didn’t have the capacity for it.

So this is just the components I had to write for the way I chose to do it in the end, and for my use cases. I did try to make those components well-designed for reasonable flexibility, or at least future extension to more flexibility.

But it’s still just pieces that you’d have to assemble yourself into a solution, and integrate into your Rails app (no real Blacklight expectations, they’re just tools for a Rails app) with quite a bit of your own code.  The hardest part might be indexing your call numbers for retrieval suitable to this UI.

I’m curious to see if this approach of sharing my pieces, instead of a fully designed flexible solution, might still end up being useful to anyone, and perhaps encourage some more virtual shelf browse implementations.

On Indexing

Being a Blacklight app, all of our data was already in Solr. It would have been nice to use the existing Solr index as the back-end for the virtual shelf browse, especially if it allowed us to do things like a virtual shelf browse limited by existing Solr facets. But I did not end up doing so.

To support this kind of call-number-ordered virtual shelf browse, you need your data in a store of some kind that supports some basic retrieval operations: Give me N items in order by some field, starting at value X, either ascending or descending.

This seems simple enough; but the fact that we want a given single item in our existing index to be able to have multiple call numbers makes it a bit tricky. In fact, a Solr index isn’t really easily capable of doing what’s needed. There are various ways to work around it and get what you need from Solr: Naomi Dushay at Stanford has engaged in some truly heroic hacks to do it, involving creating a duplicate mirror indexing field where all the call numbers are reversed to sort backwards. And Naomi’s solution still doesn’t really allow you to limit by existing Solr facets or anything.

That’s not the solution I ended up using. Instead, I just de-normalize to another ‘index’ in a table in our existing application rdbms, with one row per call number instead of one row per item. After talking to the Princeton folks at a library meet-up in New Haven, and hearing this was their back-end store plan for supporting ‘browse’ functions, I realized — sure, why not, that’ll work.
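A minimal sketch of what that table and the core retrieval operation might look like, assuming a hypothetical CallNumber ActiveRecord model (illustrative, not the actual rails_stackview schema):

require 'active_record'

# One row per call number, not per item; sort_key holds the
# Lcsort-normalized form, so plain string comparison orders correctly.
class CallNumber < ActiveRecord::Base
  # columns: bib_id, call_number, sort_key (indexed)
end

# "Give me N entries in call number order, starting at value X":
def shelf_window(start_key, n, direction: :asc)
  if direction == :asc
    CallNumber.where('sort_key >= ?', start_key).order(sort_key: :asc).limit(n)
  else
    CallNumber.where('sort_key < ?', start_key).order(sort_key: :desc).limit(n)
  end
end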

So how do I get the call numbers indexed in an rdbms table? We use traject for indexing to Solr here, for Blacklight. Traject is pretty flexible, and it wasn’t too hard to modify our indexing configuration so that as the indexer goes through each input record, creating a Solr document for each one, it also, in the same stream, creates 0 to many rows in an RDBMS for each call number encountered.
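In outline, the traject side might look something like the following hypothetical sketch (the real, snapshotted configuration lives in the rails_stackview documentation, as mentioned below); here extract_call_numbers stands in for whatever pulls call numbers off the MARC record:

require 'lcsort'

# traject config sketch: alongside building each Solr document,
# also emit one database row per call number found on the record.
each_record do |record, context|
  extract_call_numbers(record).each do |call|   # hypothetical helper
    sort_key = Lcsort.normalize(call)
    next unless sort_key                        # skip call numbers Lcsort can't parse
    CallNumber.create!(
      bib_id:      context.output_hash['id']&.first,
      call_number: call,
      sort_key:    sort_key
    )
  end
end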

We don’t do any “incremental” indexing to Solr in the first place, we just do a bulk/mass index every night recreating everything from the current state of the canonical catalog. So the same strategy applies to building the call numbers table, it’s just recreated from scratch nightly.  After racking my brain to figure out how to do this without disturbing performance or data integrity in the rdbms table — I realized, hey, no problem, just index to a temporary table first, then when done swap it into place and delete the former one.
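The swap itself can be a simple rename dance once the build finishes. A sketch, assuming MySQL semantics and illustrative table names:

# Index into call_numbers_new, then swap it into place
# (RENAME TABLE renames both tables in one atomic statement in MySQL).
conn = ActiveRecord::Base.connection
conn.execute('RENAME TABLE call_numbers TO call_numbers_old, call_numbers_new TO call_numbers')
conn.execute('DROP TABLE call_numbers_old')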

I included a snapshotted, completely unsupported example of how we do our indexing with traject in the rails_stackview documentation. It ends up a bit hacky, and makes me wish traject let me re-use some of its code a little bit more concisely to do this kind of bifurcated indexing operation — but it still worked out pretty well, and leaves me pretty satisfied with traject as our indexing solution compared to the past tools we had used.

I had hoped that adding the call number indexing to our existing traject mass index process would not slow down the indexing at all. I think this hope was based on some poorly-conceived thought process like “Traject is parallel multi-core already, so, you know, magic!” It didn’t quite work out that way: the additional call number indexing adds about a 10% penalty to our indexing time, taking our slow mass indexing from a ~10 hour to an ~11 hour process. We run our indexing on a fairly slow VM with 3 cores assigned to it. It’s difficult to profile a parallel multi-threaded pipeline process like traject; I can’t completely wrap my head around it, but I think it’s possible that on a faster machine you’d have bottlenecks in different parts of the pipeline, and get less of a penalty.

On call numbers designed for local adjustment, used universally instead

Another notable feature of the 19th century technology of call numbers that I didn’t truly appreciate until this project — call number systems often, and LC certainly, are designed to require a certain amount of manual hand-fitting to a particular local collection. The end of the call number has ‘cutter numbers’ that are typically based on the author’s name, but which are meant to be hand-fitted by local catalogers to put the book in just the right spot in the context of what’s already been shelved in a particular local collection.

That ends up requiring a lot more hours of cataloger labor than if a book simply had one true call number, but it’s kind of how the system was designed. I wonder if it’s tenable in the modern era to put that much work into call number assignment, though, especially as print (unfortunately) gets less attention.

However, this project sort of serves as an experiment of what happens if you don’t do that local easing. To begin with, we’re combining call numbers that were originally assigned in entirely different local collections (different physical library locations), some of which were assigned before these different libraries even shared the same catalog, and were not assigned with regard to each other as context.  On top of that, we take ‘generic’ call numbers without local adjustment from MARC 050 for books that don’t have locally assigned call numbers (including ebooks where available), so these also haven’t been hand-fit into any local collection.

It does result in occasional oddities, such as different authors with similar last names writing on a subject being interfiled together. That offends my sensibilities, since I know the system, when used as designed, doesn’t do that. But… I think it will probably not be noticed by most people; it works out pretty well after all.


SearchHub: Preliminary Data Analysis with Fusion 2: Look, Leap, Repeat

Thu, 2015-07-23 18:10

Lucidworks Fusion is the platform for search and data engineering. In the article Search Basics for Data Engineers, I introduced the core features of Lucidworks Fusion 2 and used it to index some blog posts from the Lucidworks blog, resulting in a searchable collection. Here is the result of a search for blog posts about Fusion:

Bootstrapping a search app requires an initial indexing run over your data, followed by successive cycles of search and indexing until your application does what you want it to do and what search users expect it to do. The above search results required one iteration of this process. In this article, I walk through the indexing and query configurations used to produce this result.


Look

Indexing web data is challenging because what you see is not always what you get. That is, when you look at a web page in a browser, the layout and formatting guide your eye, making important information more prominent. Here is what a recent Lucidworks blog entry looks like in my browser:

There are navigational elements at the top of the page, but the most prominent element is the blog post title, followed by the elements below it: date, author, opening sentence, and the first visible section heading below that.

I want my search app to be able to distinguish which information comes from which element, and be able to tune my search accordingly. I could do some by-hand analysis of one or more blog posts, or I could use Fusion to index a whole bunch of them; I choose the latter option.

Leap: Pre-configured Default Pipelines

For the first iteration, I use the Fusion defaults for search and indexing. I create a collection "lw_blogs" and configure a datasource "lw_blogs_ds_default". Website access requires use of the Anda-web datasource, and this datasource is pre-configured to use the "Documents_Parsing" index pipeline.

I start the job, let it run for a few minutes, and then stop the web crawl. The search panel is pre-populated with a wildcard search using the default query pipeline. Running this search returns the following result:

At first glance, it looks like all the documents in the index contain the same text, despite having different titles. Closer inspection of the content of individual documents shows that this is not what’s going on. I use the "show all fields" control on the search results panel and examine the contents of field "content":

Reading further into this field shows that the content does indeed correspond to the blog post title, and that all text in the body of the HTML page is there. The Apache Tika parser stage extracted the text from all elements in the body of the HTML page and added it to the "content" field of the document, including all whitespace between and around nested elements, in the order in which they occur in the page. Because all the blog posts contain a banner announcement at the top and a set of common navigation elements, all of them have the same opening text:

\n\n \n \n \tSecure Solr with Fusion: Join us for a webinar to learn about the security and access control system that Lucidworks Fusion brings to Solr.\n \n\tRead More \n\n\n \n\n\n \n \n\n \n \n \n \n \t\n\tFusion\n ...

This first iteration shows me what’s going on with the data; however, it fails to meet the requirement of being able to distinguish which information comes from which element, resulting in poor search results.

Repeat: Custom Index Pipeline

Iteration one used the "Documents_Parsing" pipeline, which consists of the following stages:

  • Apache Tika Parser – recognizes and parses most common document formats, including HTML
  • Field Mapper – transforms field names to valid Solr field names, as needed
  • Language Detection – transforms text field names based on language of field contents
  • Solr Indexer – transforms Fusion index pipeline document into Solr document and adds (or updates) document to collection.

In order to capture the text from a particular HTML element, I need to add an HTML transform stage to my pipeline. I still need to have an Apache Tika parser stage as the first stage in my pipeline in order to transform the raw bytes sent across the wire by the web crawler via HTTP into an HTML document. But instead of using the Tika HTML parser to extract all text from the HTML body into a single field, I use the HTML transform stage to harvest elements of interest each into its own field. As a first cut at the data, I’ll use just two fields: one for the blog title and the other for the blog text.

I create a second collection "lw_blogs2", and configure another web datasource, "lw_blogs2_ds". When Fusion creates a collection, it also creates an indexing and query pipeline, using the naming convention collection name plus "-default" for both pipelines. I choose the index pipeline "lw_blogs2-default", and open the pipeline editor panel in order to customize this pipeline to process the Lucidworks blog posts:

The initial collection-specific pipeline is configured as a "Default_Data" pipeline: it consists of a Field Mapper stage followed by a Solr Indexer stage.

Adding new stages to an index pipeline pushes them onto the pipeline stage stack, so I first add an HTML Transform stage and then add an Apache Tika Parser stage, resulting in a pipeline which starts with an Apache Tika Parser stage followed by an HTML Transform stage. First I edit the Apache Tika Parser stage as follows:

When using an Apache Tika parser stage in conjunction with an HTML or XML Transform stage, the Tika stage must be configured as follows:

  • option "Add original document content (raw bytes)" setting: false
  • option "Return parsed content as XML or HTML" setting: true
  • option "Return original XML and HTML instead of Tika XML output" setting: true

With these settings, Tika transforms the raw bytes retrieved by the web crawler into an HTML document. The next stage is an HTML Transform stage, which extracts the elements of interest from the body of the HTML document:

An HTML transform stage requires the following configuration:

  • property "Record Selector", which specifies the HTML element that contains the document.
  • HTML Mappings, a set of rules which specify how different HTML elements are mapped to Solr document fields.

Here the "Record Selector" property "body" is the same as the default "Body Field Name" because each blog post contains is a single Solr document. Inspection of the raw HTML shows that the blog post title is in an "h1" element, therefore the mapping rule shown above specifies that the text contents of tag "h1" is mapped to the document field named "blog_title_txt". The post itself is inside a tag "article", so the second mapping rule, not shown, specifies:

  • Select Rule: article
  • Attribute: text
  • Field: blog_post_txt
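Conceptually, those two mapping rules amount to something like the following standalone Ruby/Nokogiri illustration (an analogy for what the stage extracts, not Fusion's actual implementation):

require 'nokogiri'

# raw_html: the page body fetched by the crawler and passed through Tika
doc = Nokogiri::HTML(raw_html)

fields = {
  'blog_title_txt' => doc.at_css('body h1')&.text&.strip,      # post title
  'blog_post_txt'  => doc.at_css('body article')&.text&.strip  # post body
}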

The web crawl also pulled back many pages which contain summaries of ten blog posts each but which don’t actually contain a blog post. These are not interesting, so I’d like to restrict indexing to documents which contain a blog post. To do this, I add a condition to the Solr Indexer stage:

I start the job, let it run for a few minutes, and then stop the web crawl. I run a wildcard search, and it all just works!
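To sanity-check the document count outside the UI, you can also query the embedded Solr instance directly (a sketch, assuming a local developer install with Solr on its default port):

curl "http://localhost:8983/solr/lw_blogs2/select?q=*:*&rows=0&wt=json"
# the numFound value in the response reports the total number of indexed documents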

Custom Query Pipeline

To test search, I do a query on the words "Fusion Spark". My first search returns no results. I know this is wrong because the articles pulled back by the wildcard search above mention both Fusion and Spark.

The reason for this apparent failure is that search is over document fields. The blog title and blog post content are now stored in document fields named "blog_title_txt" and "blog_post_txt". Therefore, I need to configure the "Search Fields" stage of the query pipeline to specify that these are search fields.

The left-hand collection home page control panel contains controls for both search and indexing. I click on the "Query Pipelines" control under the "Search" heading, and choose to edit the pipeline named "lw_blogs2-default". This is the query pipeline that was created automatically when the collection "lw_blogs2" was created. I edit the "Search Fields" stage and specify search over both fields. I also set a boost factor of 1.3 on the field "blog_title_txt", so that documents where there’s a match on the title are considered more relevant than documents where there’s a match in the blog post. As soon as I save this configuration, the search is re-run automatically:
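Under the hood, this stage sets the search fields and boosts on the outgoing Solr request. The raw Solr equivalent looks roughly like this (a sketch, assuming the edismax query parser and the embedded Solr on its default port):

curl -G "http://localhost:8983/solr/lw_blogs2/select" \
     --data-urlencode "q=Fusion Spark" \
     --data-urlencode "defType=edismax" \
     --data-urlencode "qf=blog_title_txt^1.3 blog_post_txt"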

The results look good!


As a data engineer, your mission, should you accept it, is to figure out how to build a search application which bridges the gap between the information in the raw search query and what you know about your data, in order to serve up the document(s) which should be at the top of the results list. Fusion’s default search and indexing pipelines are a quick and easy way to get the information you need about your data. Custom pipelines make this mission possible.

The post Preliminary Data Analysis with Fusion 2: Look, Leap, Repeat appeared first on Lucidworks.

SearchHub: Search Basics for Data Engineers

Thu, 2015-07-23 18:10

Lucidworks Fusion is a platform for data engineering, built on Solr/Lucene, the Apache open source search engine, which is fast, scalable, proven, and reliable. Fusion uses the Solr/Lucene engine to evaluate search requests and return results in the form of a ranked list of document ids. It gives you the ability to slice and dice your data and search results, which means that you can have Google-like search over your data, while maintaining control of both your data and the search results.

The difference between data science and data engineering is the difference between theory and practice. Data engineers build applications given a goal and constraints. For natural language search applications, the goal is to return relevant search results given an unstructured query. The constraints include: limited, noisy, and/or downright bad data and search queries, limited computing resources, and penalties for returning irrelevant or partial results.

As a data engineer, you need to understand your data and how Fusion uses it in search applications. The hard part is understanding your data. In this post, I cover the key building blocks of Fusion search.

Fusion Key Concepts

Fusion extends Solr/Lucene functionality via a REST-API and a UI built on top of that REST-API. The Fusion UI is organized around the following key concepts:

  • Collections store your data.
  • Documents are the things that are returned as search results.
  • Fields are the things that are actually stored in a collection.
  • Datasources are the conduits between your data repositories and Fusion.
  • Pipelines encapsulate a sequence of processing steps, called stages.
    • Indexing Pipelines process the raw data received from a datasource into fielded documents for indexing into a Fusion collection.
    • Query Pipelines process search requests and return an ordered list of matching documents.
  • Relevancy is the metric used to order search results. It is a non-negative real number which indicates the similarity between a search request and a document.
Lucene and Solr

Lucene started out as a search engine designed for the following information retrieval task: given a set of query terms and a set of documents, find the subset of documents which are relevant for that query. Lucene provides a rich query language which allows for writing complicated logical conditions. Lucene now encompasses much of the functionality of a traditional DBMS, both in the kinds of data it can handle and the transactional security it provides.
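For example, a query like the following (with hypothetical field names) matches documents whose title contains both terms, or whose body contains the two words within five positions of each other:

title:(fusion AND spark) OR body:"data engineering"~5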

Lucene maps discrete pieces of information, e.g., words, dates, numbers, to the documents in which they occur. This map is called an inverted index because the keys are document elements and the values are document ids, in contrast to other kinds of datastores where document ids are used as a key and the values are the document contents. This indexing strategy means that search requires just one lookup on an inverted index, as opposed to a document oriented search which would require a large number of lookups, one per document. Lucene treats a document as a list of named, typed fields. For each document field, Lucene builds an inverted index that maps field values to documents.
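Here is a toy sketch of the idea in Bash, over a hypothetical two-document corpus: each term maps to the ids of the documents containing it, so a lookup is a single read on the map.

#!/bin/bash
# Build an inverted index: term -> space-separated ids of docs containing it.
declare -A index
docs=("fusion uses solr" "solr wraps lucene")
for id in "${!docs[@]}"
do
    for term in ${docs[$id]}
    do
        index[$term]+="$id "
    done
done
# One lookup answers "which documents contain 'solr'?"
echo "solr -> ${index[solr]}"   # prints: solr -> 0 1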

Lucene itself is a search API. Solr wraps Lucene in a web platform: search and indexing are carried out via HTTP requests and responses. Solr generalizes the notion of a Lucene index to a Solr collection, a uniquely named, managed, and configured index which can be distributed (“sharded”) and replicated across servers, allowing for scalability and high availability.
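For instance, indexing and search are both plain HTTP calls (a sketch against a hypothetical local Solr on the default port; collection and field names are illustrative):

# Add a document to a collection...
curl -X POST "http://localhost:8983/solr/my_collection/update?commit=true" \
     -H "Content-Type: application/json" \
     -d '[{"id":"1","title_txt":"Search Basics for Data Engineers"}]'
# ...then search for it:
curl "http://localhost:8983/solr/my_collection/select?q=title_txt:basics"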

Fusion UI and Workflow

The following sections show how the key concepts above are realized in the Fusion UI.


Fusion collections are Solr collections which are managed by Fusion. Fusion can manage as many collections as you need, want, or both. On initial login, the Fusion UI prompts you to choose or create a collection. On subsequent logins, the Fusion UI displays an overview of your collections and system collections:

The above screenshot shows the Fusion collections page for an initial developer installation, just after initial login and creation of a new collection called “my_collection”, which is circled in yellow. Clicking on this circled name leads to the “my_collection” collection home page:

The collection home page contains controls for both search and indexing. As this collection doesn’t yet contain any documents, the search results panel is empty.

Indexing: Datasources and Pipelines

Bootstrapping a search app requires an initial indexing run over your data, followed by successive cycles of search and indexing until you have a search app that does what you want it to do and what search users expect it to do. The collections home page indexing toolset contains controls for defining and using datasources and pipelines.

Once you have created a collection, clicking on the “Datasource” control changes the left hand side control panel over to the datasource configuration panel. The first step in configuring a datasource is specifying the kind of data repository to connect to. Fusion connectors are a set of programs which do the work of connecting to and retrieving data from specific repository types. For example, to index a set of web pages, a datasource uses a web connector.

To configure the datasource, choose the “edit” control. The datasource configuration panel controls the choice of indexing pipeline. All datasources are pre-configured with a default indexing pipeline. The “Documents_Parsing” indexing pipeline is the default pipeline for use with a web connector. Beneath the pipeline configuration control is a control “Open <pipeline name> pipeline”. Clicking on this opens a pipeline editing panel next to the datasource configuration panel:

Once the datasource is configured, the indexing job is run by controls on the datasource panel:

The “Start” button, circled in yellow, when clicked, changes to “Stop” and “Abort” controls. Beneath this button is a “Show details”/”Hide details” control, shown in its open state.

Creating and maintaining a complete, up-to-date index over your data is necessary for good search. Much of this process consists of data munging. Connectors and pipelines make this chore manageable, repeatable, and testable. It can be automated using Fusion’s job scheduling and alerting mechanisms.

Search and Relevancy

Once a datasource has been configured and the indexing job is complete, the collection can be searched using the search results tool. A wildcard query of “*:*” will match all documents in the collection. The following screenshot shows the result of running this query via the search box at the top of the search results panel:

After running the datasource exactly once, the collection consists of 76 posts from the Lucidworks blog, as indicated by the “Last Job” report on the datasource panel, circled in yellow. This agrees with the “num found”, also circled in yellow, at the top of the search results page.

The search query “Fusion” returns the most relevant blog posts about Fusion:

There are 18 blog posts altogether which have the word “Fusion” either in the title or body of the post. In this screenshot we see the 10 most relevant posts, ranked in descending order.

A search application takes a user search query and returns search results which the user deems relevant. A well-tuned search application is one where both the user and the system agree on the set of relevant documents returned for a query and the order in which they are ranked. Fusion’s query pipelines allow you to tune your search, and the search results tool lets you test your changes.


Because this post is a brief and gentle introduction to Fusion, I omitted a few details and skipped over a few steps. Nonetheless, I hope that this introduction to the basics of Fusion has made you curious enough to try it for yourself.

The post Search Basics for Data Engineers appeared first on Lucidworks.

Villanova Library Technology Blog: Automatically updating locally customized files with Git and diff3

Thu, 2015-07-23 15:01

The Problem

VuFind follows a fairly common software design pattern: it discourages users from making changes to core files, and instead encourages them to copy files out of the core and modify them in a local location. This has several advantages, including putting all of your changes in one place (very useful when a newcomer needs to learn how you have customized a project) and easing upgrades (you can update the core files without worrying about which ones you have changed).

There is one significant disadvantage, however: when the core files change, your copies get out of sync. Keeping your local copies synched up with the core files requires a lot of time-consuming, error-prone manual effort.

Or does it?

The Solution

One argument against modifying files in a local directory is that, if you use a version control tool like Git, the advantages of the “local customization directory” approach are diminished, since Git provides a different mechanism for reviewing all local changes to a code base and for handling software updates. If you modify files in place, then “git merge” will help you deal with updates to the core code.
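In that workflow, an update is just a fetch and merge (a sketch; the upstream URL is VuFind’s public repository):

git remote add upstream https://github.com/vufind-org/vufind.git
git fetch upstream
git merge upstream/master   # three-way merges core changes into your branch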

Of course, the Git solution has its own drawbacks — and VuFind would lose some key functionality (the ability for a single instance to manage multiple configurations at different URLs) if we threw away our separation of local settings from core code.

Fortunately, you can have the best of both worlds. It’s just a matter of wrangling Git and a 3-way merge tool properly.

Three-Way Merges

To understand the solution to the problem, you need to understand what a three-way merge is. Essentially, this is an algorithm that takes three files: an “old” file, and two “new” files that each have applied different changes to the “old” file. The algorithm attempts to reconcile the changes in both of the “new” files so that they can be combined into a single output. In cases where each “new” file has made a different change in the same place, the algorithm inserts “conflict markers” so that a human can manually reconcile the situation.
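The diff3 tool performs exactly this operation (a sketch with hypothetical file names; the -m flag requests merged output):

# file.mine and file.yours are divergent copies of file.old:
diff3 -m file.mine file.old file.yours > file.merged
# An exit status of 1 means conflict markers were inserted for manual cleanup.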

Whenever you merge a branch in Git, it is doing a three-way merge. The “old” file is the nearest common ancestor version between your branch and the branch being merged in. The “new” files are the versions of the same file at the tips of the two branches.

If we could just do a custom three-way merge, where the “old” file was the common ancestor between our local version of the file and the core version of the file, with the local/core versions as the “new” files, then we could automate much of the work of updating our local files.

Fortunately, we can.

Lining Up the Pieces

Solving this problem assumes a particular environment (which happens to be the environment we use at Villanova to manage our VuFind instances): a Git repository forked from the main VuFind public repository, with a custom theme and a local settings directory added.

Assume that we have this repository in a state where all of our local files are perfectly synched up with the core files, but that the upstream public repository has changed. Here’s what we need to do:

1.) Merge the upstream master code so that the core files are updated.

2.) For each of our locally customized files, perform a three-way merge. The old file is the core file prior to the merge; the new files are the core file after the merge and the local file.

3.) Manually resolve any conflicts caused by the merging, and commit the local changes.

Obviously step 2 is the hard part… but it’s not actually that hard. If you do the local updates immediately after the merge commit, you can easily retrieve pre-merge versions of files using the “git show HEAD~1:path/to/file” command (the path after the colon is relative to the repository root). That means you have ready access to all three pieces you need for three-way merging, and the rest is just a matter of automation.
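For example (using one of VuFind’s core configuration files for illustration):

# Retrieve the pre-merge version of a core file; the path after the colon
# is relative to the repository root:
git show HEAD~1:config/vufind/searches.ini > /tmp/searches.ini.old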

The Script

The following Bash script is the one we use for updating our local instance of VuFind. The key piece is the merge_directory function definition, which accepts a local directory and the core equivalent as parameters. We use this to sync up various configuration files, JavaScript code, and templates. Note that for configurations, we merge local directories with core directories; for themes, we merge custom themes with their parents.

The actual logic is surprisingly simple. We use recursion to navigate through the local directory and look at all of the local files. For each file, we use string manipulation to figure out what the core version should be called. If the core version exists, we use the previously-mentioned Git magic to pull the old version into the /tmp directory. Then we use the diff3 three-way merge tool to do the heavy lifting, overwriting the local file with the new merged version. We echo out a few helpful messages along the way so users are aware of conflicts and skipped files.

#!/bin/bash

# Perform a three-way merge on every file in a local directory ($1),
# using the current and pre-merge versions of the core directory ($2).
function merge_directory
{
    echo merge_directory $1 $2
    local localDir=$1
    local localDirLength=${#localDir}
    local coreDir=$2
    for current in $localDir/*
    do
        # Derive the core path by swapping the local prefix for the core prefix:
        local coreEquivalent=$coreDir${current:$localDirLength}
        if [ -d "$current" ]
        then
            # Recurse into subdirectories:
            merge_directory "$current" "$coreEquivalent"
        else
            local oldFile="/tmp/tmp-merge-old-`basename \"$coreEquivalent\"`"
            local newFile="/tmp/tmp-merge-new-`basename \"$coreEquivalent\"`"
            if [ -f "$coreEquivalent" ]
            then
                # Fetch the pre-merge ("old") version of the core file...
                git show HEAD~1:$coreEquivalent > $oldFile
                # ...then three-way merge it with the local and current core copies:
                diff3 -m "$current" "$oldFile" "$coreEquivalent" > "$newFile"
                if [ $? == 1 ]
                then
                    echo "CONFLICT: $current"
                fi
                cp $newFile $current
            else
                echo "Skipping $current; no equivalent in core code."
            fi
        fi
    done
}

merge_directory local/harvest harvest
merge_directory local/import import
merge_directory local/config/vufind config/vufind
merge_directory themes/vuboot3/templates themes/bootstrap3/templates
merge_directory themes/villanova_mobile/templates themes/jquerymobile/templates
merge_directory themes/vuboot3/js themes/bootstrap3/js
merge_directory themes/villanova_mobile/js themes/jquerymobile/js


I’ve been frustrated by this problem for years, and yet the solution is surprisingly simple — I’m glad it finally came to me. Please feel free to use this for your own purposes, and let me know if you have any questions or problems!