Planet Code4Lib - http://planet.code4lib.org

pinboard: Welcome

Fri, 2015-07-24 01:18

Roy Tennant: Yet Another Metadata Zoo

Thu, 2015-07-23 23:51

I was talking with my old friend John Kunze a little while back and he described a project that he is involved with called “Yet Another Metadata Zoo” or yamz.net. In a world of more ontologies than you can shake a stick at, it aims to provide a simple, easy-to-use mechanism for defining and maintaining individual metadata terms and their definitions.

The project explains itself like this:

The YAMZ Metadictionary (metadata dictionary) prototype…is a proof-of-concept web-based software service acting as an open registry of metadata terms from all domains and from all parts of “metadata speech”. With no login required, anyone can search for and link to registry term definitions. Anyone can register to be able to login and create terms.

We aim for the metadictionary to become a high-quality cross-domain metadata vocabulary that is directly connected to evolving user needs. Change will be rapid and affordable, with no need for panels of experts to convene and arbitrate to improve it. We expect dramatic simplification compared to the situation today, in which there is an overwhelming number of vocabularies (ontologies) to choose from.

Our hope is that users will be able to find most of the terms they need in one place (one vocabulary namespace), namely, the Metadictionary. This should minimize the need for maintaining expensive crosswalks with other vocabularies and cluttering up expressed metadata with lots of namespace qualifiers. Although it is not our central goal, the vocabulary is shovel-ready for those wishing to create linked data applications.

If you have a Google ID, signing in is dead simple and you can begin creating and editing terms. You can also vote terms up or down, which can eventually take a term from “vernacular” status (the default for new terms) to “canonical” — terms that are considered stable and unchanging. A third status is “deprecated”.

You can browse terms to see what is there already.

I really like this project for several reasons:

  • It’s dead simple.
  • It’s fast and easy to gain value from it. 
  • Every term has an identifier, forever and always (deprecated terms keep their identifier).
  • Voting and commenting are a key part of the infrastructure, and provide easy mechanisms for it to get ever better over time.

What it needs now is more people involved, so it can gain the kind of input and participation that is necessary to make it a truly authoritative source of metadata element names and descriptions. I’ve already contributed to it; how about you?

 

District Dispatch: Libraries: Apply now for 2016 IMLS National Medals

Thu, 2015-07-23 21:56


The application period is now open for the 2016 National Medal for Museum and Library Service, the nation’s highest honor. Each year, the Institute of Museum and Library Services (IMLS) recognizes libraries and museums that make significant and exceptional contributions in service to their communities. Nomination forms are due October 1, 2015.

Read more from IMLS:

All types of nonprofit libraries and library organizations, including academic, school, and special libraries, archives, library associations, and library consortia, are eligible to receive this honor. Public or private nonprofit museums of any discipline (including general, art, history, science and technology, children’s, and natural history and anthropology), as well as historic houses and sites, arboretums, nature centers, aquariums, zoos, botanical gardens, and planetariums are eligible.

Winners are honored at a ceremony in Washington, DC, host a two-day visit from StoryCorps to record community member stories, and receive positive media attention. Approximately thirty finalists are selected as part of the process and are featured by IMLS during a six-week social media and press campaign.

Anyone may nominate a museum or library for this honor, and institutions may self-nominate. For more information, reach out to one of the following contacts.

Program Contact for Museums:
Mark Feitl, Museum Program Specialist
202-653-4635, mfeitl@imls.gov

Program Contact for Libraries:
Katie Murray, Staff Assistant
202-653-4644, kmurray@imls.gov

The Institute of Museum and Library Services is the primary source of federal support for the nation’s 123,000 libraries and 35,000 museums.

The post Libraries: Apply now for 2016 IMLS National Medals appeared first on District Dispatch.

District Dispatch: Senate cybersecurity bill attacks civil liberties

Thu, 2015-07-23 21:08

Credit – 22860 (Flickr)

It’s back to the barricades for librarians and our many civil liberties coalition allies. District Dispatch sounded the alarm a year ago about the return of privacy-hostile cybersecurity or information sharing legislation. Widely dubbed a “zombie” for its ability to rise from the legislative dead, the current version of the bill goes by the innocuous name of the “Cybersecurity Information Sharing Act” but “CISA” (S. 754) is anything but. The bill was approved in secret session by the Senate’s Intelligence Committee and has not received a single public hearing.  Unfortunately, Senate Leadership is pushing for a vote on S. 754 in the few legislative days remaining before its August recess.

CISA is touted by its supporters as a means of preventing future large-scale data breaches, like the massive one just suffered by the federal government’s Office of Personnel Management. CISA, however, will create legal responsibilities and incentives for both private companies and federal agencies to collect, widely disseminate and retain huge amounts of Americans’ personally identifiable information that will itself then be vulnerable to sophisticated hacking attacks.  In the process, the bill also creates massive exemptions for private companies under virtually every major consumer privacy protection law now on the books.

Worse yet, collected personal information would be shared almost instantly not just among federal cyber-threat agencies, but with law enforcement entities at every level of government.  The bill does not restrict how long the data can be retained and what kinds of non-cyber offenses the information may be used to prosecute.  If enacted, that would be an unprecedented and sweeping end run on the Fourth Amendment.

CISA also allows both the government and private companies to take rapid unilateral “counter-measures” to disable any computer network, including for example a library system’s or municipal government’s, that the company believes is the source of a cyber-attack … and companies get immunity from paying for any resulting damages even if their “belief” turns out to be wrong.

It’s time for librarians to rise again, too . . . to the challenge of stopping CISA in its tracks yet again.  For lots more information, and to contact the offices of both your U.S. Senators, please visit www.stopcyberspying.com now!  It’s fast, easy and couldn’t be more timely!

The post Senate cybersecurity bill attacks civil liberties appeared first on District Dispatch.

Nicole Engard: Bookmarks for July 23, 2015

Thu, 2015-07-23 20:30

Today I found the following resources and bookmarked them on Delicious.

  • hylafax The world’s most advanced open source fax server

Digest powered by RSS Digest

The post Bookmarks for July 23, 2015 appeared first on What I Learned Today....

Related posts:

  1. Faxing via the Web
  2. No more free faxing with drop.io
  3. MarkMail: Mailing List Search

Jonathan Rochkind: Virtual Shelf Browse

Thu, 2015-07-23 19:17

We know that some patrons like walking the physical stacks, to find books on a topic of interest to them through that kind of browsing of adjacently shelved items.

I like wandering stacks full of books too, and hope we can all continue to do so.

But in an effort to see if we can provide an online experience that fulfills some of the utility of this kind of browsing, we’ve introduced a Virtual Shelf Browse that lets you page through books online, in the order of their call numbers.

An online shelf browse can do a number of things you can’t do physically walking around the stacks:

  • You can do it from home, or anywhere you have a computer (or mobile device!)
  • It brings together books from various separate physical locations in one virtual stack. Including multiple libraries, locations within libraries, and our off-site storage.
  • It includes even checked out books, and in some cases even ebooks (if we have a call number on record for them)
  • It can place one item at multiple locations in a Virtual Shelf, if we have more than one call number on record for it. There’s always more than one way you could classify or characterize a work; a physical item can only be in one place at a time, but not so in a virtual display.

The UI is based on the open source stackview code released by the Harvard Library Innovation Lab. Thanks to Harvard for sharing their code, and to @anniejocaine for helping me understand the code, and accepting my pull requests with some bug fixes and tweaks.

This is to some extent an experiment, but we hope it opens up new avenues for browsing and serendipitous discovery for our patrons.

You can drop into one example place in the virtual shelf browse here, or drop into our catalog to do your own searches — the Virtual Shelf Browse is accessed by navigating to an individual item detail page, and then clicking the Virtual Shelf Browse button in the right sidebar.  It seemed like the best way to enter the Virtual Shelf was from an item of interest to you, to see what other items are shelved nearby.

Our Shelf Browse is based on ordering by Library of Congress Call Numbers. Not all of our items have LC call numbers, so not every item appears in the virtual shelf, or has a “Virtual Shelf Browse” button to provide an entry point to it. Some of our local collections are shelved locally with LC call numbers, and these are entirely present. For other collections —  which might be shelved under other systems or in closed stacks and not assigned local shelving call numbers — we can still place them in the virtual shelf if we can find a cataloger-suggested call number in the MARC bib 050 or similar fields. So for those collections, some items might appear in the Virtual Shelf, others not.

On Call Numbers, and Sorting

Library call number systems — from LC, to Dewey, to Sudocs, or even UDC — are a rather ingenious 19th century technology for organizing books in a constantly growing collection such that similar items are shelved nearby. Rather ingenious for the 19th century anyway.

It was fun to try bringing this technology — and the many hours of cataloger work that’s gone into constructing call numbers — into the 21st century to continue providing value in an online display.

It was also challenging in some ways. It turns out that the ordering of Library of Congress call numbers in particular is difficult to implement in computer software. There are a bunch of odd cases where the proper ordering might be clear to a human (at least to a properly trained human, and different libraries might even order things differently!), but it is difficult to encode all of those cases in software.

The newly released Lcsort ruby gem does a pretty marvelous job of sorting LC call numbers — I won’t say it gets every valid call number, let alone every local practice variation, right, but it gets a lot of stuff right, including such crowd-pleasing oddities as:

  • `KF 4558 15th .G6` sorts after `KF 4558 2nd .I6`
  • `Q11 .P6 vol. 12 no. 1` sorts after `Q11 .P6 vol. 4 no. 4`
  • Can handle suffixes after cutters as in popular local practice (and NLM call numbers), eg `R 179 .C79ab`
  • Variations in spacing or punctuation that should not matter for sorting: `R 169.B59.C39` vs `R169 B59C39 1990` vs `R169 .B59 .C39 1990`, etc.

Lcsort is based on the cumulative knowledge of years of library programmer attempts to sort LC calls, including an original implementation based on much trial and error by Bill Dueber of the University of Michigan, a port to ruby by Nikitas Tampakis of Princeton University Library, advice and test cases based on much trial and error from Naomi Dushay of Stanford, and a bunch more code wrangling by me.

I do encourage you to check out Lcsort for any LC call number ordering needs, if you can do it in ruby — or even port it to another language if you can’t. I think it works as well as or better than anything our community of library technologists has done yet in the open.
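For anyone curious what the gem looks like in use, here is a minimal sketch. It assumes the gem exposes an Lcsort.normalize method that returns a string suitable for plain string comparison (or nil when a call number cannot be parsed); check the gem’s README for the exact API.

# Minimal sketch, assuming Lcsort.normalize as described above.
require 'lcsort'

calls = [
  'KF 4558 15th .G6',
  'KF 4558 2nd .I6',
  'Q11 .P6 vol. 12 no. 1',
  'Q11 .P6 vol. 4 no. 4'
]

# Sort by the normalized form; unparseable numbers sort first here.
sorted = calls.sort_by { |c| Lcsort.normalize(c) || '' }
puts sorted
# Expected shelf order puts the "2nd" and "vol. 4" numbers before
# "15th" and "vol. 12", matching the oddities listed above.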

Check out my code — rails_stackview

This project was possible only because of the work of so many that had gone before, and been willing to share their work, from Harvard’s stackview to all the work that went into figuring out how to sort LC call numbers.

So it only makes sense to try to share what I’ve done too, to integrate a stackview call number shelf browse in a Blacklight Rails app.  I have shared some components in a Rails engine at rails_stackview

In this case, I did not do what I’d have done in the past, and try to make a rock-solid, general-purpose, highly flexible and configurable tool that integrated as brainlessly as possible out of the box with a Blacklight app. I’ve had mixed success trying to do that before, and came to think it might have been over-engineering and YAGNI to try. Additionally, there are just too many ways to try to do this integration — and too many versions of Blacklight changes to keep track of — I just wasn’t really sure what was best and didn’t have the capacity for it.

So this is just the components I had to write for the way I chose to do it in the end, and for my use cases. I did try to make those components well-designed for reasonable flexibility, or at least future extension to more flexibility.

But it’s still just pieces that you’d have to assemble yourself into a solution, and integrate into your Rails app (no real Blacklight expectations, they’re just tools for a Rails app) with quite a bit of your own code.  The hardest part might be indexing your call numbers for retrieval suitable to this UI.

I’m curious to see if this approach of sharing my pieces instead of a fully designed flexible solution might still end up being useful to anyone, and perhaps encourage some more virtual shelf browse implementations.

On Indexing

Being a Blacklight app, all of our data was already in Solr. It would have been nice to use the existing Solr index as the back-end for the virtual shelf browse, especially if it allowed us to do things like a virtual shelf browse limited by existing Solr facets. But I did not end up doing so.

To support this kind of call-number-ordered virtual shelf browse, you need your data in a store of some kind that supports some basic retrieval operations: Give me N items in order by some field, starting at value X, either ascending or descending.

This seems simple enough; but the fact that we want a given single item in our existing index to be able to have multiple call numbers makes it a bit tricky. In fact, a Solr index isn’t really easily capable of doing what’s needed. There are various ways to work around it and get what you need from Solr: Naomi Dushay at Stanford has engaged in some truly heroic hacks to do it, involving creating a duplicate mirror indexing field where all the call numbers are reversed to sort backwards. And Naomi’s solution still doesn’t really allow you to limit by existing Solr facets or anything.

That’s not the solution I ended up using. Instead, I just de-normalize to another ‘index’ in a table in our existing application rdbms, with one row per call number instead of one row per item.  After talking to the Princeton folks at a library meet-up in New Haven, and hearing this was their back-end store plan for supporting ‘browse’ functions, I realized — sure, why not, that’ll work.
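To make the retrieval requirement concrete, here is a rough ActiveRecord sketch of that kind of denormalized table and the window query it supports. The table, column, and model names are hypothetical, not the actual rails_stackview schema.

# Hypothetical sketch: one row per call number, not per item.
require 'active_record'

ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')
ActiveRecord::Schema.define do
  create_table :shelf_entries do |t|
    t.string :solr_doc_id
    t.string :call_number
    t.string :normalized_call_number
  end
  add_index :shelf_entries, :normalized_call_number
end

class ShelfEntry < ActiveRecord::Base; end

# "Give me N rows in call number order, starting at key X, ascending or descending."
def shelf_window(start_key, n, direction: :asc)
  cmp = direction == :asc ? '>=' : '<='
  ShelfEntry.where("normalized_call_number #{cmp} ?", start_key)
            .order(normalized_call_number: direction)
            .limit(n)
end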

So how do I get them indexed in an rdbms table? We use traject for indexing to Solr here, for Blacklight.  Traject is pretty flexible, and it wasn’t too hard to modify our indexing configuration so that as the indexer goes through each input record, creating a Solr Document for each one, it also, in the same stream, creates 0 to many rows in an RDBMS table for each call number encountered.

We don’t do any “incremental” indexing to Solr in the first place, we just do a bulk/mass index every night recreating everything from the current state of the canonical catalog. So the same strategy applies to building the call numbers table, it’s just recreated from scratch nightly.  After racking my brain to figure out how to do this without disturbing performance or data integrity in the rdbms table — I realized, hey, no problem, just index to a temporary table first, then when done swap it into place and delete the former one.

I included a snapshotted, completely unsupported, example of how we do our indexing with traject, in the rails_stackview documentation.  It ends up a bit hacky, and makes me wish traject let me re-use some of its code a little more concisely to do this kind of a bifurcated indexing operation — but it still worked out pretty well, and leaves me pretty satisfied with traject as our indexing solution over past tools we had used.
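As a rough illustration only (this is not the snapshotted example from the rails_stackview docs; the MARC tags, model name, and normalization call are assumptions carried over from the sketches above), a bifurcated traject configuration looks something like this:

# Sketch of a traject config that indexes to Solr while also writing
# one row per call number to a database table, in the same pass.
require 'lcsort'

to_field 'id', extract_marc('001', first: true)
to_field 'call_number_t', extract_marc('050ab:090ab')

each_record do |record, context|
  id    = Array(context.output_hash['id']).first
  calls = context.output_hash['call_number_t'] || []
  calls.each do |raw|
    normalized = Lcsort.normalize(raw)
    next unless normalized
    # In practice, write to a temporary table and swap it into place
    # once the nightly run completes.
    ShelfEntry.create!(solr_doc_id: id, call_number: raw,
                       normalized_call_number: normalized)
  end
end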

I had hoped that adding the call number indexing to our existing traject mass index process would not slow down the indexing at all. I think this hope was based on some poorly-conceived thought process like “Traject is parallel multi-core already, so, you know, magic!”  It didn’t quite work out that way: the additional call number indexing adds about a 10% penalty to our indexing time, taking our slow mass indexing from a ~10 hour to an ~11 hour process.  We run our indexing on a fairly slow VM with 3 cores assigned to it. It’s difficult to profile a parallel multi-threaded pipeline process like traject (I can’t completely wrap my head around it), but I think that on a faster machine the bottlenecks would fall in different parts of the pipeline and the penalty would be smaller.

On call numbers designed for local adjustment, used universally instead

Another notable feature of the 19th century technology of call numbers that I didn’t truly appreciate until this project — call number systems often, and LC certainly, are designed to require a certain amount of manual hand-fitting to a particular local collection.  The end of the call number has ‘cutter numbers’ that are typically based on the author’s name, but which are meant to be hand-fitted by local catalogers to put the book in just the right spot in the context of what’s already been shelved in a particular local collection.

That ends up requiring a lot more hours of cataloger labor than if a book simply had one true call number, but it’s kind of how the system was designed. I wonder if it’s tenable in the modern era to put that much work into call number assignment though, especially as print (unfortunately) gets less attention.

However, this project sort of serves as an experiment of what happens if you don’t do that local easing. To begin with, we’re combining call numbers that were originally assigned in entirely different local collections (different physical library locations), some of which were assigned before these different libraries even shared the same catalog, and were not assigned with regard to each other as context.  On top of that, we take ‘generic’ call numbers without local adjustment from MARC 050 for books that don’t have locally assigned call numbers (including ebooks where available), so these also haven’t been hand-fit into any local collection.

It does result in occasional oddities, such as different authors with similar last names writing on a subject being interfiled together. Which offends my sensibilities, since I know the system, when used as designed, doesn’t do that. But… I think it will probably not be noticed by most people; it works out pretty well after all.


Filed under: General

SearchHub: Preliminary Data Analysis with Fusion 2: Look, Leap, Repeat

Thu, 2015-07-23 18:10

Lucidworks Fusion is the platform for search and data engineering. In the article Search Basics for Data Engineers, I introduced the core features of Lucidworks Fusion 2 and used it to index some blog posts from the Lucidworks blog, resulting in a searchable collection. Here is the result of a search for blog posts about Fusion:

Bootstrapping a search app requires an initial indexing run over your data, followed by successive cycles of search and indexing until your application does what you want it to do and what search users expect it to do. The above search results required one iteration of this process. In this article, I walk through the indexing and query configurations used to produce this result.

Look

Indexing web data is challenging because what you see is not always what you get. That is, when you look at a web page in a browser, the layout and formatting is guiding your eye, making important information more prominent. Here is what a recent Lucidworks blog entry looks like in my browser:

There are navigational elements at the top of the page, but the most prominent element is the blog post title, followed by the elements below it: date, author, opening sentence, and the first visible section heading below that.

I want my search app to be able to distinguish which information comes from which element, and be able to tune my search accordingly. I could do some by-hand analysis of one or more blog posts, or I could use Fusion to index a whole bunch of them; I choose the latter option.

Leap: Pre-configured Default Pipelines

For the first iteration, I use the Fusion defaults for search and indexing. I create a collection "lw_blogs" and configure a datasource "lw_blogs_ds_default". Website access requires use of the Anda-web datasource, and this datasource is pre-configured to use the "Documents_Parsing" index pipeline.

I start the job, let it run for a few minutes, and then stop the web crawl. The search panel is pre-populated with a wildcard search using the default query pipeline. Running this search returns the following result:

At first glance, it looks like all the documents in the index contain the same text, despite having different titles. Closer inspection of the content of individual documents shows that this is not what’s going on. I use the "show all fields" control on the search results panel and examine the contents of field "content":

Reading further into this field shows that the content does indeed correspond to the blog post title, and that all text in the body of the HTML page is there. The Apache Tika parser stage extracted the text from all elements in the body of the HTML page and added it to the "content" field of the document, including all whitespace between and around nested elements, in the order in which they occur in the page. Because all the blog posts contain a banner announcement at the top and a set of common navigation elements, all of them have the same opening text:

\n\n \n \n \tSecure Solr with Fusion: Join us for a webinar to learn about the security and access control system that Lucidworks Fusion brings to Solr.\n \n\tRead More \n\n\n \n\n\n \n \n\n \n \n \n \n \t\n\tFusion\n ...

This first iteration shows me what’s going on with the data, however it fails to meet the requirement of being able to distinguish which information comes from which element, resulting in poor search results.

Repeat: Custom Index Pipeline

Iteration one used the "Documents_Parsing" pipeline, which consists of the following stages:

  • Apache Tika Parser – recognizes and parses most common document formats, including HTML
  • Field Mapper – transforms field names to valid Solr field names, as needed
  • Language Detection – transforms text field names based on language of field contents
  • Solr Indexer – transforms Fusion index pipeline document into Solr document and adds (or updates) document to collection.

In order to capture the text from a particular HTML element, I need to add an HTML transform stage to my pipeline. I still need to have an Apache Tika parser stage as the first stage in my pipeline in order to transform the raw bytes sent across the wire by the web crawler via HTTP into an HTML document. But instead of using the Tika HTML parser to extract all text from the HTML body into a single field, I use the HTML transform stage to harvest each element of interest into its own field. As a first cut at the data, I’ll use just two fields: one for the blog title and the other for the blog text.

I create a second collection "lw_blogs2", and configure another web datasource, "lw_blogs2_ds". When Fusion creates a collection, it also creates an indexing and query pipeline, using the naming convention collection name plus "-default" for both pipelines. I choose the index pipeline "lw_blogs2-default", and open the pipeline editor panel in order to customize this pipeline to process the Lucidworks blog posts:

The initial collection-specific pipeline is configured as a "Default_Data" pipeline: it consists of a Field Mapper stage followed by a Solr Indexer stage.

Adding new stages to an index pipeline pushes them onto the pipeline stages stack; therefore I first add an HTML Transform stage and then an Apache Tika parser stage, resulting in a pipeline which starts with an Apache Tika Parser stage followed by an HTML Transform stage. First I edit the Apache Tika Parser stage as follows:

When using an Apache Tika parser stage in conjunction with an HTML or XML Transform stage, the Tika stage must be configured as follows:

  • option "Add original document content (raw bytes)" setting: false
  • option "Return parsed content as XML or HTML" setting: true
  • option "Return original XML and HTML instead of Tika XML output" setting: true

With these settings, Tika transforms the raw bytes retrieved by the web crawler into an HTML document. The next stage is an HTML Transform stage which extracts the elements of interest from the body of the HTML document:

An HTML transform stage requires the following configuration:

  • property "Record Selector", which specifies the HTML element that contains the document.
  • HTML Mappings, a set of rules which specify how different HTML elements are mapped to Solr document fields.

Here the "Record Selector" property "body" is the same as the default "Body Field Name" because each blog post contains is a single Solr document. Inspection of the raw HTML shows that the blog post title is in an "h1" element, therefore the mapping rule shown above specifies that the text contents of tag "h1" is mapped to the document field named "blog_title_txt". The post itself is inside a tag "article", so the second mapping rule, not shown, specifies:

  • Select Rule: article
  • Attribute: text
  • Field: blog_post_txt
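The two mapping rules are doing something roughly equivalent to the following Nokogiri sketch (purely illustrative; it is not how Fusion implements the stage): pull the text of one element into one field and the text of another element into a second field.

# Illustrative only: extract two HTML elements into two document fields.
require 'nokogiri'

html = <<-HTML
  <html><body>
    <h1>Preliminary Data Analysis with Fusion 2</h1>
    <article>Lucidworks Fusion is the platform for search and data engineering...</article>
  </body></html>
HTML

doc = Nokogiri::HTML(html)
fields = {
  'blog_title_txt' => doc.at_css('h1').text,
  'blog_post_txt'  => doc.at_css('article').text
}
puts fields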

The web crawl also pulled back many pages which contain summaries of ten blog posts but which don’t actually contain a blog post. These are not interesting, therefore I’d like to restrict indexing to only documents which contain a blog post. To do this, I add a condition to the Solr Indexer stage:

I start the job, let it run for a few minutes, and then stop the web crawl. I run a wildcard search, and it all just works!

Custom Query Pipeline

To test search, I do a query on the words "Fusion Spark". My first search returns no results. I know this is wrong because the articles pulled back by the wildcard search above mention both Fusion and Spark.

The reason for this apparent failure is that search is over document fields. The blog title and blog post content are now stored in document fields named "blog_title_txt" and "blog_post_txt". Therefore, I need to configure the "Search Fields" stage of the query pipeline to specify that these are search fields.

The left-hand collection home page control panel contains controls for both search and indexing. I click on the "Query Pipelines" control under the "Search" heading, and choose to edit the pipeline named "lw_blogs2-default". This is the query pipeline that was created automatically when the collection "lw_blogs2" was created. I edit the "Search Fields" stage and specify search over both fields. I also set a boost factor of 1.3 on the field "blog_title_txt", so that documents where there’s a match on the title are considered more relevant than documents where there’s a match in the blog post. As soon as I save this configuration, the search is re-run automatically:

The results look good!
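For readers who want to see what the boosted search-field configuration amounts to at the Solr level, here is a purely illustrative sketch; it is not Fusion’s query API, and the localhost URL and collection name are assumptions.

# Roughly equivalent raw Solr request with weighted query fields (edismax).
require 'net/http'
require 'uri'

params = {
  q: 'Fusion Spark',
  defType: 'edismax',
  qf: 'blog_title_txt^1.3 blog_post_txt',  # title matches boosted 1.3x
  wt: 'json'
}
uri = URI('http://localhost:8983/solr/lw_blogs2/select')
uri.query = URI.encode_www_form(params)
puts Net::HTTP.get(uri)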

Conclusion

As a data engineer, your mission, should you accept it, is to figure out how to build a search application which bridges the gap between the information in the raw search query and what you know about your data, in order to serve up the document(s) which should be at the top of the results list. Fusion’s default search and indexing pipelines are a quick and easy way to get the information you need about your data. Custom pipelines make this mission possible.

The post Preliminary Data Analysis with Fusion 2: Look, Leap, Repeat appeared first on Lucidworks.

SearchHub: Search Basics for Data Engineers

Thu, 2015-07-23 18:10

Lucidworks Fusion is a platform for data engineering, built on Solr/Lucene, the Apache open source search engine, which is fast, scalable, proven, and reliable. Fusion uses the Solr/Lucene engine to evaluate search requests and return results in the form of a ranked list of document ids. It gives you the ability to slice and dice your data and search results, which means that you can have Google-like search over your data, while maintaining control of both your data and the search results.

The difference between data science and data engineering is the difference between theory and practice. Data engineers build applications given a goal and constraints. For natural language search applications, the goal is to return relevant search results given an unstructured query. The constraints include: limited, noisy, and/or downright bad data and search queries, limited computing resources, and penalties for returning irrelevant or partial results.

As a data engineer, you need to understand your data and how Fusion uses it in search applications. The hard part is understanding your data. In this post, I cover the key building blocks of Fusion search.

Fusion Key Concepts

Fusion extends Solr/Lucene functionality via a REST-API and a UI built on top of that REST-API. The Fusion UI is organized around the following key concepts:

  • Collections store your data.
  • Documents are the things that are returned as search results.
  • Fields are the things that are actually stored in a collection.
  • Datasources are the conduits between your data repository and Fusion.
  • Pipelines encapsulate a sequence of processing steps, called stages.
    • Indexing Pipelines process the raw data received from a datasource into fielded documents for indexing into a Fusion collection.
    • Query Pipelines process search requests and return an ordered list of matching documents.
  • Relevancy is the metric used to order search results. It is a non-negative real number which indicates the similarity between a search request and a document.
Lucene and Solr

Lucene started out as a search engine designed for the following information retrieval task: given a set of query terms and a set of documents, find the subset of documents which are relevant for that query. Lucene provides a rich query language which allows for writing complicated logical conditions. Lucene now encompasses much of the functionality of a traditional DBMS, both in the kinds of data it can handle and the transactional security it provides.

Lucene maps discrete pieces of information, e.g., words, dates, numbers, to the documents in which they occur. This map is called an inverted index because the keys are document elements and the values are document ids, in contrast to other kinds of datastores where document ids are used as a key and the values are the document contents. This indexing strategy means that search requires just one lookup on an inverted index, as opposed to a document oriented search which would require a large number of lookups, one per document. Lucene treats a document as a list of named, typed fields. For each document field, Lucene builds an inverted index that maps field values to documents.
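A toy sketch of the idea (not Lucene’s actual data structures): build a map from each term to the ids of the documents that contain it, so answering a query is one lookup per term rather than a scan of every document.

# Toy inverted index: term => list of document ids containing that term.
docs = {
  1 => 'fusion uses the solr lucene engine',
  2 => 'spark integration adds machine learning',
  3 => 'solr wraps lucene in a web platform'
}

inverted = Hash.new { |h, k| h[k] = [] }
docs.each do |id, text|
  text.split.uniq.each { |term| inverted[term] << id }
end

p inverted['solr']   # => [1, 3]
p inverted['lucene'] # => [1, 3]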

Lucene itself is a search API. Solr wraps Lucene in a web platform. Search and indexing are carried out via HTTP requests and responses. Solr generalizes the notion of a Lucene index to a Solr collection, a uniquely named, managed, and configured index which can be distributed (“sharded”) and replicated across servers, allowing for scalability and high availability.

Fusion UI and Workflow

The following sections show how these key concepts are realized in the Fusion UI.

Collections

Fusion collections are Solr collections which are managed by Fusion. Fusion can manage as many collections as you need, want, or both. On initial login, the Fusion UI prompts you to choose or create a collection. On subsequent logins, the Fusion UI displays an overview of your collections and system collections:

The above screenshot shows the Fusion collections page for an initial developer installation, just after initial login and creation of a new collection called “my_collection”, which is circled in yellow. Clicking on this circled name leads to the “my_collection” collection home page:

The collection home page contains controls for both search and indexing. As this collection doesn’t yet contain any documents, the search results panel is empty.

Indexing: Datasources and Pipelines

Bootstrapping a search app requires an initial indexing run over your data, followed by successive cycles of search and indexing until you have a search app that does what you want it to do and what search users expect it to do. The collections home page indexing toolset contains controls for defining and using datasources and pipelines.

Once you have created a collection, clicking on the “Datasource” control changes the left hand side control panel over to the datasource configuration panel. The first step in configuring a datasource is specifying the kind of data repository to connect to. Fusion connectors are a set of programs which do the work of connecting to and retrieving data from specific repository types. For example, to index a set of web pages, a datasource uses a web connector.

To configure the datasource, choose the “edit” control. The datasource configuration panel controls the choice of indexing pipeline. All datasources are pre-configured with a default indexing pipeline. The “Documents_Parsing” indexing pipeline is the default pipeline for use with a web connector. Beneath the pipeline configuration control is a control “Open <pipeline name> pipeline”. Clicking on this opens a pipeline editing panel next to the datasource configuration panel:

Once the datasource is configured, the indexing job is run by controls on the datasource panel:

The “Start” button, circled in yellow, when clicked, changes to “Stop” and “Abort” controls. Beneath this button is a “Show details”/”Hide details” control, shown in its open state.

Creating and maintaining a complete, up-to-date index over your data is necessary for good search. Much of this process consists of data munging. Connectors and pipelines make this chore manageable, repeatable, and testable. It can be automated using Fusion’s job scheduling and alerting mechanisms.

Search and Relevancy

Once a datasource has been configured and the indexing job is complete, the collection can be searched using the search results tool. A wildcard query of “*:*” will match all documents in the collection. The following screenshot shows the result of running this query via the search box at the top of the search results panel:

After running the datasource exactly once, the collection consists of 76 posts from the Lucidworks blog, as indicated by the “Last Job” report on the datasource panel, circled in yellow. This agrees with the “num found”, also circled in yellow, at the top of the search results page.

The search query “Fusion” returns the most relevant blog posts about Fusion:

There are 18 blog posts altogether which have the word “Fusion” either in the title or body of the post. In this screenshot we see the 10 most relevant posts, ranked in descending order.

A search application takes a user search query and returns search results which the user deems relevant. A well-tuned search application is one where both the user and the system agree on the set of relevant documents returned for a query and the order in which they are ranked. Fusion’s query pipelines allow you to tune your search, and the search results tool lets you test your changes.

Conclusion

Because this post is a brief and gentle introduction to Fusion, I omitted a few details and skipped over a few steps. Nonetheless, I hope that this introduction to the basics of Fusion has made you curious enough to try it for yourself.

The post Search Basics for Data Engineers appeared first on Lucidworks.

Villanova Library Technology Blog: Automatically updating locally customized files with Git and diff3

Thu, 2015-07-23 15:01

The Problem

VuFind follows a fairly common software design pattern: it discourages users from making changes to core files, and instead encourages them to copy files out of the core and modify them in a local location. This has several advantages, including putting all of your changes in one place (very useful when a newcomer needs to learn how you have customized a project) and easing upgrades (you can update the core files without worrying about which ones you have changed).

There is one significant disadvantage, however: when the core files change, your copies get out of sync. Keeping your local copies synched up with the core files requires a lot of time-consuming, error-prone manual effort.

Or does it?

The Solution

One argument against modifying files in a local directory is that, if you use a version control tool like Git, the advantages of the “local customization directory” approach are diminished, since Git provides a different mechanism for reviewing all local changes to a code base and for handling software updates. If you modify files in place, then “git merge” will help you deal with updates to the core code.

Of course, the Git solution has its own drawbacks — and VuFind would lose some key functionality (the ability for a single instance to manage multiple configurations at different URLs) if we threw away our separation of local settings from core code.

Fortunately, you can have the best of both worlds. It’s just a matter of wrangling Git and a 3-way merge tool properly.

Three Way Merges

To understand the solution to the problem, you need to understand what a three-way merge is. Essentially, this is an algorithm that takes three files: an “old” file, and two “new” files that each have applied different changes to the “old” file. The algorithm attempts to reconcile the changes in both of the “new” files so that they can be combined into a single output. In cases where each “new” file has made a different change in the same place, the algorithm inserts “conflict markers” so that a human can manually reconcile the situation.

Whenever you merge a branch in Git, it is doing a three-way merge. The “old” file is the nearest common ancestor version between your branch and the branch being merged in. The “new” files are the versions of the same file at the tips of the two branches.

If we could just do a custom three-way merge, where the “old” file was the common ancestor between our local version of the file and the core version of the file, with the local/core versions as the “new” files, then we could automate much of the work of updating our local files.

Fortunately, we can.

Lining Up the Pieces

Solving this problem assumes a particular environment (which happens to be the environment we use at Villanova to manage our VuFind instances): a Git repository forked from the main VuFind public repository, with a custom theme and a local settings directory added.

Assume that we have this repository in a state where all of our local files are perfectly synched up with the core files, but that the upstream public repository has changed. Here’s what we need to do:

1.) Merge the upstream master code so that the core files are updated.

2.) For each of our locally customized files, perform a three-way merge. The old file is the core file prior to the merge; the new files are the core file after the merge and the local file.

3.) Manually resolve any conflicts caused by the merging, and commit the local changes.

Obviously step 2 is the hard part… but it’s not actually that hard. If you do the local updates immediately after the merge commit, you can easily retrieve pre-merge versions of files using the “git show HEAD~1:/path/to/file” command. That means you have ready access to all three pieces you need for three-way merging, and the rest is just a matter of automation.

The Script

The following Bash script is the one we use for updating our local instance of VuFind. The key piece is the merge_directory function definition, which accepts a local directory and the core equivalent as parameters. We use this to sync up various configuration files, Javascript code and templates. Note that for configurations, we merge local directories with core directories; for themes, we merge custom themes with their parents.

The actual logic is surprisingly simple. We use recursion to navigate through the local directory and look at all of the local files. For each file, we use string manipulation to figure out what the core version should be called. If the core version exists, we use the previously-mentioned Git magic to pull the old version into the /tmp directory. Then we use the diff3 three-way merge tool to do the heavy lifting, overwriting the local file with the new merged version. We echo out a few helpful messages along the way so users are aware of conflicts and skipped files.

#!/bin/bash

# Merge upstream changes into locally customized copies using diff3.
# $1 = local customization directory, $2 = equivalent core directory.
function merge_directory {
    echo merge_directory $1 $2
    local localDir=$1
    local localDirLength=${#localDir}
    local coreDir=$2

    for current in $localDir/*
    do
        local coreEquivalent=$coreDir${current:$localDirLength}
        if [ -d "$current" ]
        then
            # Recurse into subdirectories.
            merge_directory "$current" "$coreEquivalent"
        else
            local oldFile="/tmp/tmp-merge-old-`basename \"$coreEquivalent\"`"
            local newFile="/tmp/tmp-merge-new-`basename \"$coreEquivalent\"`"
            if [ -f "$coreEquivalent" ]
            then
                # Pull the pre-merge ("old") version of the core file from Git,
                # then three-way merge: local copy + old core + new core.
                git show HEAD~1:$coreEquivalent > $oldFile
                diff3 -m "$current" "$oldFile" "$coreEquivalent" > "$newFile"
                if [ $? == 1 ]
                then
                    echo "CONFLICT: $current"
                fi
                cp $newFile $current
            else
                echo "Skipping $current; no equivalent in core code."
            fi
        fi
    done
}

merge_directory local/harvest harvest
merge_directory local/import import
merge_directory local/config/vufind config/vufind
merge_directory themes/vuboot3/templates themes/bootstrap3/templates
merge_directory themes/villanova_mobile/templates themes/jquerymobile/templates
merge_directory themes/vuboot3/js themes/bootstrap3/js
merge_directory themes/villanova_mobile/js themes/jquerymobile/js

Conclusion

I’ve been frustrated by this problem for years, and yet the solution is surprisingly simple — I’m glad it finally came to me. Please feel free to use this for your own purposes, and let me know if you have any questions or problems!



pinboard: MDC - Code4Lib

Thu, 2015-07-23 14:53
Just added this as a potential #code4lib mdc topic: JupyterHub for documenting code/data practices & instruction.

LITA: Jobs in Information Technology: July 22, 2015

Wed, 2015-07-22 19:02

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Library Director, Hingham Public Library, Hingham MA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Access Conference: Access is Looking for Next Year’s Hosts

Wed, 2015-07-22 12:27

Do you love Access? Are you an event planning whiz, thrilled by the prospect of planning Canada’s foremost library technology conference? Were you inspired by the Herculean efforts of past Access Organizing Committees in bringing passionate, intelligent humans together to discuss and troubleshoot some of the most significant issues in libraries? Do you have a burning desire to ensure that Access 2016 is in your neck of the woods? Excellent. We want to hear from you!

Like any forward thinking conference, Access is already looking for next year’s host. There are no restrictions placed on where the conference can be held or who organizes it (because we’re super accepting like that). As a general library technology conference, we encourage people from academic, public, or special libraries (or even some combination of all three if that’s more your style) to throw their hat in the ring.

For those who are sort of/possibly/definitely interested in hosting, you can check out the hosting guidelines on our site (http://accessconference.ca/about/hosting-guidelines/).

If the prospect of hosting Access tickles your fancy (and ideally, the fancy of some other humans you know in the same general geographic area), go ahead and submit a proposal to accesslibcon@gmail.com. Proposals must address all aspects of the hosting guidelines.

We’ll be accepting proposals until August 28th, 2015*.

 

*That’s really more than enough time to put together a solid team and a brilliant proposal, right? Right. We thought so too.

Library Tech Talk (U of Michigan): Transparency and our "Front Door" Process

Wed, 2015-07-22 00:00

In the Fall of 2014, the University of Michigan Library IT unit launched a new initiative called the “Front Door process.” The name resulted from our desire to create a centralized space or “Front Door” through which Library colleagues can submit project requests. With an eye towards increasing transparency, LIT developed this new process with three goals in mind: gather IT project requests into a centralized space, provide a space for a simplified IT project queue or workflow, and have both spaces accessible to everyone in the Library.

District Dispatch: Who should be the next Librarian of Congress?

Tue, 2015-07-21 20:41

Library of Congress

In the past few months, there has been considerable discussion among Washington information and technology leaders about who should lead the Library of Congress, now that James Billington has announced his retirement. In an op-ed published today in Roll Call, Alan Inouye, director of the American Library Association’s (ALA) Office for Information Technology Policy, asks leaders to re-imagine the future of the agency and its role in Washington. In the piece, Inouye writes:

We are at a pivotal point in the digital revolution. How Americans work, govern, live, learn and relax is changing in fundamental ways. Today’s digital technologies are helping spur economic and social shifts as significant as those brought on by the manufacturing and distribution innovations of the industrial revolution. As part of the digital revolution, the roles, capabilities and expectations of information institutions (e.g., mass media, Internet companies, universities and libraries) have transformed and must continue to evolve.

And, indeed, a number of the key players in today’s information ecosystem didn’t even exist when Billington became the librarian of Congress in 1987 — such as Google, Yahoo!, the Internet Archive, Facebook, Twitter, Amazon, Wikipedia, the Bill & Melinda Gates Foundation and the Digital Public Library of America.

In light of this we need to consider: What are the necessary roles of federal government institutions such as the Library of Congress in the digital revolution? How can such institutions best promote innovation and creativity, and not get in the way of it? Which public interest responsibilities in the digital society inherently fall into the bailiwick of the federal government? We have many questions but few answers, and only modest cross-agency discussion and a dearth of cross-sector discussion of these bigger questions.

Read the full op-ed

The post Who should be the next Librarian of Congress? appeared first on District Dispatch.

Nicole Engard: Bookmarks for July 21, 2015

Tue, 2015-07-21 20:30

Today I found the following resources and bookmarked them on Delicious.

Digest powered by RSS Digest

The post Bookmarks for July 21, 2015 appeared first on What I Learned Today....

Related posts:

  1. Learn to Code by Watching
  2. How To Get More Kids To Code
  3. Educate future female programmers

Bohyun Kim: What It Takes to Implement a Makerspace

Tue, 2015-07-21 20:15

I haven’t been able to blog much recently. But at the recent ALA Annual Conference at San Francisco, I presented on the topic of what it takes to implement a makerspace at an academic library. This was to share the work of my library’s Makerspace Task Force that I chaired and the lessons that we learned from the implementation process as well as after we opened the Innovation Space at University of Maryland, Baltimore, Health Sciences and Human Services Library back in April.

If you are planning on creating a makerspace, this may be useful to you. And here is a detailed guide on 3D printing and 3D scanning that I have created before the launch of our makerspace.

I had many great questions and interesting discussion with the audience. If you have any comments or things to share, please write in the comments section below! If you are curious about the makerspace implementation timeline, please see the poster “Preparing for the Makerspace Implementation at UMB HS/HSL” below, which my coworker Everly and I presented at the MLA (Medical Library Association) Meeting this spring.

(1) Slides for the program at ALA 2015 Conference, “Making a Makerspace Happen: A discussion of the current practices in library makerspaces and experimentation at University of Maryland, Baltimore.”

Making a Makerspace Happen

(2) Poster from the MLA 2015 Meeting: Preparing for the Makerspace Implementation at UMB HS/HSL

 

SearchHub: Fusion 2.0 Now Available – Now With Apache Spark!

Tue, 2015-07-21 17:21
I’m really excited today to announce that we’ve released Lucidworks Fusion 2.0! Of course, this is a new major release, and as such there are some big changes from previous versions.

New User Experience

The most visible change is a completely new user interface. In Fusion 2.0, we’ve re-thought and re-designed the UI workflows. It’s now easier to create and configure Fusion, especially for users who are not familiar with Fusion and Solr architecture. Along with the new workflows, there’s a redesigned look and several convenience and usability improvements.

Another part of the UI that has been rebuilt is the Silk dashboard system. While the previous Silk/Banana framework was based on Kibana 3, we’ve updated to a new Silk dashboard framework based on Kibana 4.

Hierarchical facet management is now included. This lets administrators or developers change how search queries return data, depending on where and how a user is browsing results. The most straightforward use of this is to select different facets and fields according to (for example) the category or section that the user is looking at.

And More

Along with this all-new front-end, the back-end of Fusion has been making less prominent but important incremental performance and stability improvements, especially around our integrations with external security systems like Kerberos, Active Directory, and SharePoint.

We’ve already been supporting Solr 5.x, and in Fusion 2.0 we’re also shipping with Solr 5.x built-in. (We continue to support Solr 4.x versions as before.) And again as always, we’re putting down foundations for more exciting features around Apache Spark processing, deep learning, and system operations. But that’s for another day.

While I’m very proud of this release and could go on at greater length about it, I’ll instead encourage you to try it out yourself. Read the release notes.

Coverage in Silicon Angle: “LucidWorks Inc. has added integration with the speedy data crunching framework in the new version of its flagship enterprise search engine that debuted this morning as part of an effort to catch up with the changing requirements of CIOs embarking on analytics projects. … That’s thanks mainly to its combination of speed and versatility, which Fusion 2.0 harnesses to try and provide more accurate results for queries run against unstructured data. More specifically, the software – a commercial implementation of Solr, one of the most popular open-source search engines for Hadoop, in turn the de facto storage backend for Spark – uses the project’s native machine learning component to help hone the relevance of results.”

Coverage in Software Development Times: Fusion 2.0’s Spark integration within its data-processing layer enables real-time analytics within the enterprise search platform, adding Spark to the Solr architecture to accelerate data retrieval and analysis. Developers using Fusion now also have access to Spark’s store of machine-learning libraries for data-driven analytics. “Spark allows users to leverage real-time streams of data, which can be accessed to drive the weighting of results in search applications,” said Lucidworks CEO Will Hayes. “In regards to enterprise search innovation, the ability to use an entirely different system architecture, Apache Spark, cannot be overstated. This is an entirely new approach for us, and one that our customers have been requesting for quite some time.”

Press release on MarketWired.

The post Fusion 2.0 Now Available – Now With Apache Spark! appeared first on Lucidworks.

District Dispatch: ConnectHome connects libraries too

Tue, 2015-07-21 17:10

The American Library Association (ALA) and libraries have a long and increasingly recognized commitment to addressing digital inclusion and digital readiness needs in the United States. I count my own engagement back more than a decade talking to reporters about the role libraries play in providing both a digital safety net and a launching pad for deeper technology use to advance education, employment and creativity.

In the wake of the National Broadband Plan and new opportunities at the Federal Communications Commission, the ALA Office for Information Technology Policy convened a digital literacy task force in 2011. Librarians from school, public and academic libraries delved more deeply into issues of effective practice, assessing impact, and building capacity to raise awareness of library work in this space and how to further advance it. These lessons and research from the Digital Inclusion Survey have continued to inform our work—from E-rate to Workforce Innovation and Opportunity Act advocacy.

From left: New York City Mayor Bill de Blasio, HUD Secretary Julián Castro, and New York City Council Member Melissa Mark-Viverito kick off the New York ConnectHome program at the Mott Haven Community Center in the Bronx

So, when staff at the Department of Housing and Urban Development (HUD) reached out to ALA, we were happy to talk about an emerging initiative to address the divide we both see clearly for low-income Americans. Not only is it the right thing to do, but we know America’s libraries already are doing this work day in and day out and are actively engaged in building locally relevant solutions to address community priorities. Committing to work with local libraries to deliver tailored, on-site digital literacy programming and resources to public housing residents is a “no-brainer.”

The launch of ConnectHome last week marked a milestone in this work, and we were excited to help kick off the program with a statement featuring ALA President Sari Feldman and Oklahoma State Librarian Susan McVey on-site representing libraries at the event with President Obama in Durant, Okla. I was pleased to join HUD Secretary Julián Castro in the Bronx with Metropolitan New York Library Council executive director Nate Hill and New York Public Library President Anthony Marx; and North Carolina State Librarian Cal Shepard met the Secretary later that day in Durham, N.C., along with Durham County Public Library Director Tammy Baggett and N.C. Chief Deputy Secretary of the Department of Cultural Resources Karin Cochran (see Cal’s blog here).

Nationally, the program will initially reach over 275,000 low-income households – and nearly 200,000 children – with the support needed to access the Internet at home. Internet service providers, non-profits and the private sector will offer broadband access, technical training, digital literacy programs, and devices for residents in assisted housing units.

As much as I enjoyed hearing from Secretary Castro, New York Mayor Bill de Blasio and others at the Mott Haven Community Center, meeting and brainstorming with ConnectHome collaborators PBS, New York Public Media and GitHub was even more fun. The options for digital content creation, collaboration and distribution to advance education and community engagement – at both the local and national levels – were dizzying. Of course, it helps that more libraries already are engaging with GitHub and that ALA and libraries have a long history of programming with PBS and public media through the National Endowment for the Humanities.

I expect the excitement and possibility of our conversation last week will be played out dozens of times in the coming months in the 28 communities where ConnectHome will pilot. Libraries will find both familiar community collaborators and new opportunities to serve public housing residents and explore new intersections. Building greater awareness of libraries’ roles in meeting community needs, creating and strengthening relationships with governmental, non-profit and commercial partners, and building mutually beneficial and impactful programs and policy responses are at the heart of the Policy Revolution! initiative and the future of libraries.

Like pretty much everyone outside the Obama administration, ALA learned which communities would be in the pilot last Wednesday and began reaching out to library directors in these communities:
Albany, GA; Atlanta, GA; Baltimore, MD; Baton Rouge, LA; Boston, MA; Camden, NJ; Choctaw Nation, OK; Cleveland, OH; Denver, CO; Durham, NC; Fresno, CA; Kansas City, MO; Little Rock, AR; Los Angeles, CA; Macon, GA; Memphis, TN; Meriden, CT; Nashville, TN; New Orleans, LA; New York, NY; Newark, NJ; Philadelphia, PA; Rockford, IL; San Antonio, TX; Seattle, WA; Springfield, MA; Tampa, FL; and Washington, DC.

Last week’s launch, of course, was just the beginning. ALA looks forward to amplifying the great work already underway in libraries in ConnectHome communities and coordinating directly with libraries to support their work individually and as a group, as well as develop and share relevant resources.

Does your library have a relationship with its Public Housing Authority already? We’d love to hear about it! Do you have other questions or suggestions about ConnectHome? Check out the FAQ or drop us a line at oitp@ala.org.

The post ConnectHome connects libraries too appeared first on District Dispatch.

District Dispatch: ConnectHome kicks off in Durham

Tue, 2015-07-21 16:41

Guest Post by Cal Shepard, State Librarian of North Carolina

On July 16 I made my way to a small recreation center in the middle of an urban neighborhood in Durham, N.C. Not my usual haunt on a summer afternoon, to be sure! I went to hear Department of Housing and Urban Development (HUD) Secretary Julián Castro announce the ConnectHome project as Durham is one of 27 pilot cities. The initiative will expand high-speed broadband by making it available and affordable to low-income families living in public housing.

Secretary Castro came to the T.A. Grady Recreation Center in Durham to talk about how children and adults alike need to be connected in order to do homework and learn digital literacy skills. One of the goals of this initiative is to build regional partnerships, and Secretary Castro made sure to recognize ALL partners during his remarks (including the American Library Association).

From left to right – Chief Deputy Secretary of the Department of Cultural Resources, Karin Cochran, Durham County Public Library Director Tammy Baggett, HUD Secretary Julián Castro, State Librarian of N.C., Cal Shepard

Librarians know that far too many Americans currently lack the technology access and skills to participate fully in education, employment and civic life. Broadband is essential, and I am pleased President Obama has made digital opportunity for all a top priority. In Durham County last year, public library users signed on to 360,000 sessions on library computers and were offered more than 350 free digital literacy courses, from basic computer classes to advanced Microsoft Office and graphic design training. And that is just within the walls of the library!

Later this year Durham County Public Library Director Tammy Baggett and her staff will begin offering technology-based courses in the Oxford Manor public housing community, including programs that focus on STEAM literacy for children and teens, and job readiness computer training for adults.

This is what libraries do every day — we connect with partners to move our communities forward. Last Thursday I met multiple people, and I can see working with any one of them for the benefit of libraries and our communities throughout North Carolina. I learned about organizations that I didn’t even know existed and swapped business cards like I was playing a card game! I’m sure that not all of these contacts will result in lasting relationships, but I am guessing that a few of them will—if not today then tomorrow or next year.

And THIS is why I’m excited about this project in Durham. While the ConnectHome idea may come from Washington, the execution will come from the local level. In the process, the library will make new friends and develop new partners. That is good for everybody and makes us all stronger.

By the way – after the event Secretary Castro was looking forward to his first taste of famous North Carolina barbecue. I wonder how he liked it?

The post ConnectHome kicks off in Durham appeared first on District Dispatch.

Terry Reese: Merge Record Changes

Tue, 2015-07-21 16:06

With the last update, I made a few significant modifications to the Merge Records tool, and I wanted to provide a bit more information around how these changes may or may not affect users.  The changes can be broken down into two groups:

  1. User Defined Merge Field Support
  2. Multiple Record merge support

Prior to MarcEdit 6.1, the merge records tool utilized 4 different algorithms for doing record merges.  These were broken down by field class and, as such, had specific functionality built around them, since the limited scope of the data being evaluated made that possible.  Two of these specific functions were the ability for users to change the value in a field group class (say, change control numbers from 001 to 907$b) and the ability for the tool to merge multiple records in a merge file into the source.

When I made the update to 6.1, I tossed out the 3 field-specific algorithms and standardized on a single processing algorithm – what I call the MARC21 option.  This is an algorithm that processes data from a wide range of fields and provides a high level of data evaluation – but in doing this, I fixed the set of fields that could be evaluated, and the function dropped the ability to merge multiple records into a single source file.  The effect of this was that:

  • Users could no longer change the fields/subfields used to evaluate data for merge outside of those fields set as part of the MARC21 option.
  • if a user had a file that looked like the following —
    sourcefile1 – record 1
    mergefile – record1 (matches source1)
    mergefile – record2
    mergefile – record3 (matches source1)

    Only data from the mergefile – record 1 would be merged.  The tool didn’t see the secondary data that might be in the merge file.  This has always been the case when working with the MARC21 merge option, but by making this the only option, I removed this functionality from the program (as the 3 custom field algorithms did make accommodations for merging data from multiple records into a single source).

With the last update, I’ve brought both of these elements back to the tool.  When a user utilizes the Merge Records tool, they can change the textbox with the field data and enter a new field/subfield combination for matching (at this point, it must be a field/subfield combination).  Secondly, the tool now handles the merging of multiple records if those data elements are matched via a title or control number.  Since MarcEdit treats user-defined fields as the same class as a standard number (ISBN, technically) for matching, users will now see that the tool can merge duplicate data into a single source file.
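
To illustrate the general idea (this is a rough sketch of the concept, not MarcEdit’s actual implementation), the following pymarc-based example matches records on a user-defined field/subfield and folds data from every matching record in the merge file into the source record.  The file names, the 907$b match point, and the choice of 856 as the field being copied are all hypothetical.

    from pymarc import MARCReader, MARCWriter

    MATCH_TAG, MATCH_SUB = "907", "b"   # user-defined match point (hypothetical)
    COPY_TAG = "856"                    # field pulled in from the merge file (hypothetical)

    def match_key(record):
        """Return the value of the user-defined match field/subfield, or None."""
        for field in record.get_fields(MATCH_TAG):
            value = field[MATCH_SUB]
            if value:
                return value.strip()
        return None

    # Index every record in the merge file by its match key.  A key may map to
    # several records -- that is exactly the multi-record case described above.
    merge_index = {}
    with open("mergefile.mrc", "rb") as fh:
        for rec in MARCReader(fh):
            key = match_key(rec)
            if key:
                merge_index.setdefault(key, []).append(rec)

    # Walk the source file and fold the chosen field from *every* matching
    # merge record into the source record before writing it back out.
    with open("sourcefile.mrc", "rb") as src, open("merged.mrc", "wb") as out:
        writer = MARCWriter(out)
        for rec in MARCReader(src):
            for match in merge_index.get(match_key(rec), []):
                for field in match.get_fields(COPY_TAG):
                    rec.add_field(field)
            writer.write(rec)
        writer.close()

The key point is the list-valued index: because a match key can map to more than one merge record, data from all of them ends up in the matching source record, which is the multi-record behavior described above.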

Questions about this – just let me know.

–tr
