Planet Code4Lib - http://planet.code4lib.org

Terry Reese: Building a better MarcEdit for Mac users

Sun, 2015-04-12 04:40

This all started with a conversation over twitter (https://twitter.com/_whitni/status/583603374320410626) about a week ago: a discussion about why the current version of MarcEdit is so fragile when run on a Mac.  The short answer has been that MarcEdit utilizes a cross-platform toolset when building the UI, which works well on Linux and Windows systems but tends to be less refined on Mac systems.  I’ve known this for a while, but to really do it right, I’d need to develop a version of MarcEdit that uses native Mac APIs, which would mean building a new version of MarcEdit for the Mac (at least the UI components).  And I’ve considered it – mapped out a road map – but what’s constantly stopped me has been a lack of interest from the MarcEdit community and a lack of a Mac system.  On the community side, I can count on two hands the number of times I’ve had someone request a version of MarcEdit specifically for a Mac.  And since I’ve been making a Mac App version of MarcEdit available, its use has been fairly low (though this could be due to the struggles noted above).  With an active community of over 20,000, I try to put my time where it will make the most impact, and up until a week ago, better support for Mac systems didn’t seem to be high on the list.  The second reason is that I don’t own a Mac.  My technology stack is made up of about a dozen Windows and Linux systems embedded around my house because they play surprisingly well together, whereas Apple’s walled garden just doesn’t thrive within my ecosystem.  So, I’ve been waiting and hoping that the cross-platform toolset would get better and that in time this problem would eventually go away.

I’m giving that background because apparently I’ve been misreading the MarcEdit community.  As I said, this all started with this conversation on twitter (https://twitter.com/_whitni/status/583603374320410626) – and out of that, two co-conspirators, Whitni Watkins and Francis Kayiwa, set out to see just how much interest there actually was in having a dedicated version of MarcEdit for the Mac.  The two worked to raise funds to acquire a Mac for this development and, indirectly, to demonstrate that there was actually a much larger slice of the community interested in seeing this work done.  And so, off they went – and I sat back and watched.  I made a conscious decision that if this was going to happen, it was going to be because the community wanted it, and in that, my voice wasn’t necessary.  And after 8 days, it’s done.  In all, 40 individuals contributed to the campaign, but more importantly to me, I heard directly from more than 200 individuals who were hopeful that this project would proceed. 

Development Roadmap

Now the hard work starts.  MarcEdit is a program that has been under constant development since 1999 – so even just rewriting the UI components of the application will be a significant undertaking.  I’m breaking this work up into chunks.  I figure it would take approximately 8-12 months to completely port the UI, which is a long time.  Too long…so I’m breaking the development into 3-month “sprints”.  The first sprint will target the 80%: the functionality that would make MarcEdit productive when doing MARC editing.  This means porting the functionality for all the resources found in the MARC Tools and much of the functionality found in the MarcEditor components.  My guess is these two components are the most important functional areas for catalogers – so finishing those would allow the tool to be immediately useful for production cataloging and editing.  After that, I’ll be able to evaluate the remainder of the program and begin working on functional parity between all versions of the application. 

But I’ll admit, at this point the road map is still somewhat cloudy even to me.  See, I’ve written up the following document (http://1drv.ms/1ake4gO), shared it with Whitni, and asked her to work with other Mac users to refine the list and let me know what falls into that 80%.  So, I’ll be interested to see where their list differs from my own.  In the meantime, I’ll be starting work on the port – creating wireframes and spending time over the next week hitting the books and familiarizing myself with Apple’s API docs and UI best practices (though I will be trying to keep the program looking very familiar to the current application – best practices be damned).  Coding on the new UI will start in earnest around May 1 – and by August 1, 2015, I hope to have the first version built specifically for a Mac available.  For those interested in following the development process, I’ll be creating a build page on the MarcEdit website (http://marcedit.reeset.net) and will be posting regular builds as new areas of the application are ported so that folks can try them and give feedback. 

So, that’s where this stands at this point.  For those interested in providing feedback, feel free to contact me directly at reeset@gmail.com.  And for those of you who reached out or participated in the campaign to make this happen, my sincere thanks. 

–TR

Open Library Data Additions: Amazon Crawl: part bu

Sun, 2015-04-12 03:20

Part bu of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

John Miedema: The cognitive computing features of Lila. Candidate technologies, limitations, and future plans.

Sat, 2015-04-11 14:47

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. In the previous post I summarized the characteristics of a cognitive system. This post maps the characteristics to Lila features, along with candidate technology to deliver them. Limitations and future plans are also listed.

Cognitive characteristic 1: Life-world data

Lila features: Lila operates on unstructured data from multiple sources. Unstructured data includes author notes, digital articles and books. Data is collected from many sources, including smart phone notes, email, web pages, documents and PDFs. Lila operates on rapidly changing data, as is expected when writing a work; Lila’s functions can be re-calculated on demand. Data volume is expected to be the size of an average non-fiction work (about 100,000 words), up to 1000 full length articles, and about 100 full length books.

Candidate technology: There are existing tools for gathering content from different sources. Evernote, for example, is a candidate technology for a first version of Lila. Lila’s cognitive functions can operate on data exported from Evernote.

Limitations and future plans: English only. Digital text only. Text must be analyzable as text, i.e., no locked formats. Table content can be analyzed, but no table look-up operations. Image analysis is limited to associated text labels.

Cognitive characteristic 2: Natural questions

Lila features: Lila analyzes author notes, treating them as questions to be asked of other notes and unread articles and books. The following features combine to build meaningful queries on the content:

  • The finite size of the note itself helps capture the author’s meaning.
  • Lila uses author-suggested categories, tags and markup to understand what the author considers important.
  • Lila develops a model of the author’s work, used to better understand the author’s intent.

Candidate technology: New Lila technology will be built and used to create more meaningful structured queries. Structured queries will be performed using existing technology, Apache Solr.

Limitations and future plans: Questions are constructed implicitly from author notes, not from a voice or text question box. No direct dialog interface is provided, but see 6 and 7.

Cognitive characteristic 3: Reading and understanding

Lila features: Lila uses natural language processing (NLP) to read author notes and unread content. Language dictionaries provide an understanding of synonyms and parts of speech; this knowledge of language is an advance over simple keyword matching. Entity identification is performed automatically using machine learning and includes person names, organizations and locations; Lila can be extended to include custom entity identification models. Lila uses additional input from the author to build a model of the author’s work. This model is used to better understand the author’s meaning when questioning content. See 6 and 7.

Candidate technology: Existing NLP technologies, e.g., OpenNLP (a short illustrative sketch follows below). New Lila technology for the model.

Limitations and future plans: English only. Lila does not perform deep parsing of syntax.

Cognitive characteristic 4: Analytics

Lila features: Lila calculates a correlation between author notes, and between author notes and unread content. Lila also calculates a suggested order for notes.

Candidate technology: The open source R tool can be used for statistical calculations. Language resources such as the MRC psycholinguistic database will be used to create new Lila technology for ordering notes.

Limitations and future plans: The calculations for suggesting order are experimental. It is likely that this function will need development over time.

Cognitive characteristic 5: Answers are structured and ordered

Lila features: Lila provides two visualizations:

  • A connections view to visualize correlations between notes and unread content.
  • A suggested order for notes, a visual hierarchy or a table of contents.

Candidate technology: New Lila technology for the visualizations. Web-based. Lila will use open source add-ins to generate visualizations.

Cognitive characteristics 6 and 7: Taught rather than just programmed; Learn from human interaction

Lila features: Lila’s user interface provides the author with a simple and natural way to:

  • Classify content with categories and tags.
  • Mark up entities, concepts and relations inline.

These inputs create the model used to question content and create correlations. The author can manually edit the model with improvements. The connections view will allow the author to “pin” correct relationships and delete incorrect relationships.

Candidate technology: There are existing technologies for classifying content. Evernote, for example, is a candidate technology for a first version of Lila. Lila’s cognitive functions can operate on data exported from Evernote. New Lila technology for the model.

Limitations and future plans: The Evernote interface for collecting and editing notes has limitations. In the future, Lila will need its own interface to allow for advanced functions, e.g., inline markup and sorting of notes without numbered labels. In the future, Lila may use author ordering of notes as a suggestion toward its calculated order.
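
To make the entity identification described under characteristic 3 concrete, here is a minimal sketch using OpenNLP’s pre-trained English tokenizer and person-name finder. It is only an illustration of the kind of processing Lila could use; the model file paths and the sample note are placeholders, not part of Lila.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class NoteEntityFinder {
        public static void main(String[] args) throws Exception {
            // Pre-trained OpenNLP models, downloaded separately; the paths are placeholders.
            try (InputStream tokenModel = new FileInputStream("en-token.bin");
                 InputStream nerModel = new FileInputStream("en-ner-person.bin")) {
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(tokenModel));
                NameFinderME personFinder = new NameFinderME(new TokenNameFinderModel(nerModel));

                // A hypothetical author note.
                String note = "Met Jane Austen scholars in Toronto to discuss the draft chapter.";
                String[] tokens = tokenizer.tokenize(note);

                // Spans mark token ranges that the model identifies as person names.
                Span[] spans = personFinder.find(tokens);
                for (String name : Span.spansToStrings(spans, tokens)) {
                    System.out.println("Person: " + name);
                }
            }
        }
    }

The same pattern applies to the pre-trained organization and location models, and custom models can be trained for domain-specific entities.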

Roy Tennant: Challenging the Open Source Religious Viewpoint

Sat, 2015-04-11 04:45

I’ve been involved with open source software projects since at least the 1990s. I even saved a Unix application from certain death that I still use today. But that doesn’t mean I’m all rosy-eyed about all open source software projects. They are not all created equal.

To be clear, there are “open source” projects that are neither all that open nor all that successful. 

But let me parse my terms before you get all hot and bothered. “Open” can be as little as dropping the code out on a repository somewhere, which is the level at which many projects currently sit. Or, it could mean that the code is actively managed under an open governance model. Most fall somewhere in between, and a number die a quiet death from neglect. I’ve also seen projects that claim the open source label long before releasing any code. And, as we’ve seen with Kuali, there is no guarantee that open source software will remain that way.

Meanwhile, someone like Terry Reese, who has programmed and maintained the amazing MarcEdit application for many years, is criticized for not open sourcing his software. If he refused to also make it better and add capabilities, then maybe there would be reason for concern. But it has been tirelessly maintained and improved. Managing an open source community is not easy. I can certainly understand why Terry may want to simplify his job by vastly reducing the number of variables involved.

All things being equal, open source is better than closed source. But things are rarely equal. And it doesn’t follow that software must be open source to be useful and valued. Nor does it mean that someone such as Terry won’t choose to open source the software when he no longer wishes to maintain it. So let’s stop beating up people and projects that wish to control their own code. There should be many options for software development, not just one.

Now go ahead and give me hell, people, because I know you want to.

Picture by J. Albert Bowden II, https://www.flickr.com/photos/jalbertbowdenii/, Creative Commons License CC BY 2.0.

Open Library Data Additions: OL.120301.meta.mrc

Fri, 2015-04-10 23:51

OL.120301.meta.mrc 6003 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121201.meta.mrc

Fri, 2015-04-10 23:51

OL.121201.meta.mrc 7088 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121001.meta.mrc

Fri, 2015-04-10 23:51

OL.121001.meta.mrc 5235 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120601.meta.mrc

Fri, 2015-04-10 23:51

OL.120601.meta.mrc 6018 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120501.meta.mrc

Fri, 2015-04-10 23:51

OL.120501.meta.mrc 4685 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120401.meta.mrc

Fri, 2015-04-10 23:51

OL.120401.meta.mrc 3851 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120201.meta.mrc

Fri, 2015-04-10 23:51

OL.120201.meta.mrc 6433 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120101.meta.mrc

Fri, 2015-04-10 23:51

OL.120101.meta.mrc 5284 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.121101.meta.mrc

Fri, 2015-04-10 23:51

OL.121101.meta.mrc 6896 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120901.meta.mrc

Fri, 2015-04-10 23:51

OL.120901.meta.mrc 6035 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120801.meta.mrc

Fri, 2015-04-10 23:51

OL.120801.meta.mrc 5760 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

Open Library Data Additions: OL.120701.meta.mrc

Fri, 2015-04-10 23:51

OL.120701.meta.mrc 5421 records.

This item belongs to: data/ol_data.

This item has files of the following types: Archive BitTorrent, Metadata, Unknown

State Library of Denmark: Facet filtering

Fri, 2015-04-10 21:24

In generation 2 of our net archive search we plan to experiment with real time graphs: We would like to visualize links between resources and locate points of interest based on popularity. Our plan is to use faceting with Solr on 500M+ unique links per shard, which is a bit of a challenge in itself. To make matters worse, plain faceting does not fully meet the needs of the researchers. Some sample use cases for graph building are:

  1. The most popular resources that pages about gardening link to overall
  2. The most popular resources that pages on a given site link to externally
  3. The most popular images that pages on a given site link to internally
  4. The most popular non-Danish resources that Danish pages link to
  5. The most popular JavaScripts that all pages from a given year link to

Unfortunately, only the first one can be solved with plain faceting.

Blacklists & whitelists with regular expressions

The idea is to filter all viable term candidates through a series of blacklists and whitelists to check whether they should be part of the facet result or not. One flexible way of expressing conditions on Strings is with regular expressions. The main problem with that approach is that all the Strings for the candidates must be resolved, instead of only the ones specified by facet.limit.

Consider the whitelist condition .*wasp.*, which matches all links containing the word wasp. That is a pretty rare word overall, so if a match-all query is issued and the top 100 links with the wasp requirement are requested, chances are that millions of terms must be resolved to Strings and checked before the top 100 allowed ones have been found. On the other hand, a search for gardening would likely have a much higher share of wasp-related links and would thus require far fewer resolutions.
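
Here is a minimal sketch of that filtering loop, assuming the facet code can walk the term candidates in descending count order and resolve each ordinal to its String. It shows the general shape of the idea, not the actual Sparse Faceting implementation:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    /** Conceptual sketch of blacklist/whitelist facet filtering; not the real implementation. */
    public class FacetFilterSketch {

        /** Resolves a term ordinal to its String, a potentially expensive lookup. */
        interface TermResolver {
            String resolve(long ordinal);
        }

        static List<String> topAllowedTerms(long[] candidatesByCount, TermResolver resolver,
                                            Pattern whitelist, Pattern blacklist, int facetLimit) {
            List<String> allowed = new ArrayList<>();
            for (long ordinal : candidatesByCount) {      // candidates in descending count order
                String term = resolver.resolve(ordinal);  // the expensive resolution step
                if (whitelist != null && !whitelist.matcher(term).matches()) {
                    continue;                             // not matched by the whitelist
                }
                if (blacklist != null && blacklist.matcher(term).matches()) {
                    continue;                             // explicitly excluded
                }
                allowed.add(term);
                if (allowed.size() == facetLimit) {
                    break;                                // stop once facet.limit allowed terms are found
                }
            }
            return allowed;
        }
    }

With a rare whitelist pattern such as .*wasp.*, the break is reached late and many terms must be resolved; with a common pattern, only slightly more than facet.limit terms need resolving.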

An extremely experimental (written today) implementation of facet filtering has been added to the pack branch of Sparse Faceting for Solr. Correctness testing has been performed, where testing means “tried it a few times and the results looked plausible”. Looking back at the cases in the previous section, facet filtering could be used to support them:

  1. The most popular resources that pages about gardening link to overall
    q=gardening
  2. The most popular resources that pages on a given site link to externally
    q=domain:example.com&facet.sparse.blacklist=https?://[^/]*example\.com
  3. The most popular images that pages on a given site link to internally
    q=domain:example.com&facet.sparse.whitelist=https?://[^/]*example\.com/.*\.(gif|jpg|jpeg|png)$
  4. The most popular non-Danish resources that Danish pages link to
    q=domain_suffix:dk&facet.sparse.blacklist=https?://[^/]*\.dk
  5. The most popular JavaScripts that all pages from a given year link to
    q=harvest_year:2015&facet.sparse.whitelist=.*\.js$
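
Assuming a Solr core named netarchive with outgoing links indexed in a links field, and the experimental facet.sparse.* parameters from the pack branch, a client could issue the request for case 2 as in the sketch below (the URL, core name and field name are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class ExternalLinksFacetRequest {
        public static void main(String[] args) throws Exception {
            String utf8 = StandardCharsets.UTF_8.name();
            // Case 2: popular external links from example.com, with internal links blacklisted.
            String url = "http://localhost:8983/solr/netarchive/select"
                    + "?q=" + URLEncoder.encode("domain:example.com", utf8)
                    + "&rows=0&facet=true&facet.limit=25"
                    + "&facet.field=links"                             // assumed field holding outgoing links
                    + "&facet.sparse.blacklist="
                    + URLEncoder.encode("https?://[^/]*example\\.com", utf8)
                    + "&wt=json";

            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                reader.lines().forEach(System.out::println);           // raw JSON with the facet counts
            }
        }
    }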

Some questions like “The most popular resources larger than 2MB in size linked to from pages about video” cannot be answered directly with this solution as they rely on the resources at the end of the links, not just the links themselves.

Always with the performance testing

Two things of interest here:

  1. Faceting on 500M+ unique values (5 billion+ DocValues references) on a 900GB single-shard index with 200M+ documents
  2. Doing the trick with regular expressions on top

Note the single-shard thing! The measurements should not be taken as representative for the final speed of the fully indexed net archive, which will be 50 times larger. As we get more generation 2 shards, the tests will hopefully be re-run.

As always, Sparse Faceting is helping tremendously with the smaller result sets. This means that averaging the measurements to a single number is highly non-descriptive: Response times vary from < 100ms for a few thousand hits to 5 minutes for a match-all query.

Performance testing used a single thread to issue queries with random words from a Danish dictionary. The Solr server was a 24 core Intel i7 machine (only 1 active core due to the unfortunate single-threaded nature of faceting) with 256GB of RAM (200GB free for disk cache) and SSDs. All tests were with previously unused queries. 5 different types of requests were issued:

  1. no_facet: as the name implies, just a plain search with no faceting
  2. sparse: Sparse Faceting on the single links-field with facet limit 25
  3. regexp_easy: Sparse Faceting with whitelist regular expression .*htm.* which is fairly common in links
  4. regexp_evil: Sparse Faceting with whitelist regular expression .*nevermatches.* effectively forcing all terms in the full potential result set to be resolved and checked
  5. solr: Vanilla Solr faceting

[Chart: response times for the five request types – 900GB, 200M+ docs, 500M+ unique values, 5 billion+ references]

Observations
  • Sparse Faceting without regular expressions (purple) performs just as well with 500M+ values as it did with previous tests of 200M+ values.
  • Using a regular expression that allows common terms (green) has moderate impact on performance.
  • The worst possible regular expression (orange) has noticeable impact at 10,000 hits and beyond. At the very far end at match-all, the processing time was 10 minutes (versus 5 minutes for non-regular expression faceting). This is likely to be highly influenced by storage speed and be slower with more shards on the same machine.
  • The constant 2 second overhead of vanilla Solr faceting (yellow) is highly visible.
Conclusion

Worst-case processing times have always been a known weakness of our net archive search. Facet filtering exacerbates this. As the processing time is tightly correlated with the result set size, which is fast to calculate, adding a warning such as “This query is likely to take minutes to process” could be a usable bandage.

With that caveat out of the way, the data looks encouraging so far; the overhead for regular expressions was less than feared. Real-time graphs, or at least fill-the-coffee-cup-time graphs, seem doable, at the cost of 2GB of extra heap per shard to run the faceting request.

Additional notes (updated 2015-04-11)

@maxxkrakoa noted: “@TokeEskildsen you wrote Solr facet is 1 thread. facet.threads can change that – but each field will only be able to use one thread each.” He is right, and it does help significantly for our 6-field faceting. For single-field faceting, support for real multi-threaded counting would be needed.

The simple way of doing multi-threaded counting is to update multiple copies of the counter structure and merge them at the end. For a 500M+ value field, that is likely to be prohibitive with regard to both memory and speed: the time used for merging the multiple counters would likely nullify the faster counter update phase. Some sort of clever synchronization or splitting of the counter space would be needed. No plans yet for that part, but it has been added to the “things to ponder when sleep is not coming” list.
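
For illustration, here is a minimal sketch of that simple approach: one full counter array per thread and a merge at the end. For a 500M+ value field, every extra int[] copy costs roughly 2GB of heap and the merge touches each slot once per thread, which is why the approach is unattractive here:

    import java.util.ArrayList;
    import java.util.List;

    /** Conceptual sketch: per-thread facet counters merged at the end (not the real implementation). */
    public class MergedCounterSketch {

        static int[] countInParallel(int uniqueValues, List<long[]> ordinalChunks) throws InterruptedException {
            // One full-size counter per thread: ~4 bytes * uniqueValues each, e.g. ~2GB for 500M values.
            int threads = ordinalChunks.size();
            int[][] perThread = new int[threads][uniqueValues];
            List<Thread> workers = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                Thread worker = new Thread(() -> {
                    for (long ordinal : ordinalChunks.get(id)) {
                        perThread[id][(int) ordinal]++;   // cheap, contention-free update phase
                    }
                });
                workers.add(worker);
                worker.start();
            }
            for (Thread worker : workers) {
                worker.join();
            }
            // The merge touches every counter slot once per thread, which can cost as much as the counting.
            int[] merged = new int[uniqueValues];
            for (int[] counters : perThread) {
                for (int i = 0; i < uniqueValues; i++) {
                    merged[i] += counters[i];
                }
            }
            return merged;
        }
    }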


John Miedema: Cognitive computing. Computers already know how to do math, now they can read and understand.

Fri, 2015-04-10 20:06

Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. It is characterized by the following:

  1. Life-world data. Operates on data that is large, varied, complex and dynamic, the stuff of daily human life.
  2. Natural questions. A question is more than a keyword query. A question embodies unstructured meaning. It may be asked in natural language. A dialog allows for refinement of questions.
  3. Reading and understanding. Computers already know how to do math. Cognitive computing provides the ability to read. Reading includes understanding context, nuance, and meaning.
  4. Analytics. Understanding is extended with statistics and reasoning. The system finds patterns and structures. It considers alternatives and chooses a best answer.
  5. Answers are structured and ordered. An answer is an “assembly,” a wiki-type summary, or a visualization such as a knowledge graph. It often includes references to additional information.

Cognitive computing is not artificial intelligence. Solutions are characterized by a partnership with humans:

  1. Taught rather than just programmed. Cognitive systems “borrow” from human intelligence. Computers use resources compiled from human knowledge and language.
  2. Learn from human interaction. A knowledge base is improved by feedback from humans. Feedback is ideally implicit in an interaction, or it may be explicit, e.g., thumbs up or down.

DPLA: DPLAfest 2015: Special Indy Activities and Attractions

Fri, 2015-04-10 19:29

There’s lots to do in the Indianapolis area during DPLAfest 2015, just a week away! Here is a sampling of some of the excellent options.

Check out the Indiana Historical Society’s array of exhibitions (free for fest attendees!)

Did you know that you can get free admission to the Indiana Historical Society with your DPLAfest name badge? Simply show your DPLAfest name badge at the Historical Society welcome center and you’ll receive a wristband to explore the wonderful exhibits and activities inside:

  • Step into three-dimensional re-creations of historic photographs, complete with characters come to life, in the You Are There exhibits featuring That Ayres Look, 1939: Healing Bodies, Changing Minds, and 1904: Picture This.
  • Let the latest technology take you back in time on virtual journeys throughout the state in Destination Indiana.
  • Pull up a stool at a cabaret and immerse yourself in the music of Hoosier legend Cole Porter in the Cole Porter Room.
  • In the W. Brooks and Wanda Y. Fortune History Lab, get a behind-the-scenes, hands-on look at conservation and the detective work involved in history research.
  • Check out Lilly Hall, home of the Mezzanine Gallery, INvestigation Stations and Hoosier Legends.

Take a walking tour of the Indiana State Library’s stunning architecture

Join Indiana State Library staffer Steven Schmidt for a guided tour of the historic Indiana State Library. Please meet at least 5 minutes before the tour is slated to begin; see the DPLAfest schedule for more details.

Learn about the history and development of Indianapolis

If you’re interested in history, architecture, and anything in between,  be sure to check out Indianapolis Historical Development: A Quick Overview on  Saturday, 4/18 at 1:15 PM in IUPUI UL Room 1116. Led by William L. Selm, IUPUI Adjunct Faculty of the School of Engineering & Technology, this presentation will provide an “overview of the development of Indianapolis since its founding in 1821 as the capital city of Indiana in the center of the state. The factors that shaped the city will be presented as well as the buildings and monuments that are the products of these factors and forces.” Find out more about this session here.

Get inspired at the Indianapolis Museum of Art:

  • At the Innovative Museum Leaders Speaker Series on April 16, hear from Mar Dixon, the Founder of MuseuoMix UK, Museum Camp and Ask a Curator Day, London.
  • Peruse a number of interesting art exhibitions. This includes a Monuments Men-inspired look at the provenance research and ownership discussion surrounding one of the IMA’s European pieces, “Interior of Antwerp Cathedral.”

Learn more at The Eiteljorg Museum of American Indians and Western Art:

  • With a mission to inspire an appreciation and understanding of the cultures of the indigenous peoples of North America, the Eiteljorg Museum has a number of interesting offerings. See art from the American West, as well as an exhibit about the gold rush. 

Show off your sports side:

For other child (or child-at-heart) friendly options:

  • Visit the world’s largest children’s museum, the Children’s Museum of Indianapolis. And, yes, that’s Optimus Prime you’ll see parked out front – it’s part of a new Transformers exhibit, one of many fun options at the museum.
  • Explore at the Indianapolis Zoo, which now has a new immersive “Butterfly Kaleidoscope” conservatory, with 40 different butterfly species.

Don’t forget to take a second look at the DPLAfest schedule to make sure you don’t miss any of the exciting fest sessions! See you soon in Indianapolis!

Brown University Library Digital Technologies Projects: What is ORCID?

Fri, 2015-04-10 15:46

ORCID is an open, non-profit initiative founded by academic institutions, professional bodies, funding agencies, and publishers to resolve authorship confusion in scholarly work.  The ORCID repository of unique scholar identification numbers will reliably identify and link scholars in all disciplines with their work, analogous to the way ISBN and DOI identify books and articles.

Brown is a member of ORCID, which allows the University, among other things, to create ORCID records on behalf of faculty, students, and affiliated individuals; integrate authenticated ORCID identifiers into grant application processes; ingest ORCID data to maintain internal systems such as institutional repositories; and link ORCID identifiers to other IDs and registry systems.  ORCID identifiers will facilitate the gathering of publication, grant, and other data for use in Researchers@Brown profiles.  The library, with long experience in authority control, is coordinating this effort.
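
For a sense of what ingesting ORCID data can involve, here is a minimal sketch that fetches a public ORCID record as JSON from ORCID’s public API. The API version in the URL and the example iD (ORCID’s fictitious test researcher, Josiah Carberry) are assumptions for illustration, not details from this post.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class OrcidRecordFetch {
        public static void main(String[] args) throws Exception {
            // Example iD (ORCID's fictitious researcher Josiah Carberry) and API version are assumptions.
            String orcid = "0000-0002-1825-0097";
            URL url = new URL("https://pub.orcid.org/v3.0/" + orcid + "/record");

            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestProperty("Accept", "application/json");  // request JSON instead of XML

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                reader.lines().forEach(System.out::println);  // raw record: names, works, employments, etc.
            }
        }
    }

A real integration would parse the returned JSON and map works, employment and external identifiers into local systems such as Researchers@Brown.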
