Feed aggregator

Journal of Web Librarianship: “Snow Fall”-ing Special Collections and Archives

planet code4lib - Tue, 2015-08-25 06:37
Jason Paul Michel

District Dispatch: Court cases shaping the fair use landscape

planet code4lib - Mon, 2015-08-24 20:28

U.S. Supreme Court Building. From Wikimedia Commons.

Join us on CopyTalk in September to hear about the leading legal cases affecting Fair Use and our ability to access, archive and foster our common culture. Our presenter on this topic will be Corynne McSherry, Legal Director at the Electronic Frontier Foundation.

CopyTalk will take place on Thursday, September 3rd at 11am Pacific/2pm Eastern time. After a brief introduction, Corynne will present for 50 minutes, and we will end with a Q&A session (questions will be collected during the presentation).

Please join us at

We are limited to 100 concurrent viewers, so we ask you to watch with others at your institution if at all possible.  The presentations are recorded and will be available online  soon after the presentation. Audio is provided online via the webinar software only, so you will need speakers for your computer; there is no call-in number for audio.

The post Court cases shaping the fair use landscape appeared first on District Dispatch.

Jonathan Rochkind: blacklight_cql plugin

planet code4lib - Mon, 2015-08-24 18:13

I’ve updated the blacklight_cql plugin so that it runs without deprecation warnings on Blacklight 5.14.

I wrote this plugin way back in BL 2.x days, but I think many don't know about it, and I don't think anyone but me is using it, so I thought I'd take the opportunity, having updated it, to advertise it.

blacklight_cql gives your BL app the ability to take CQL queries as input. CQL is a query language for writing boolean expressions; I don't personally consider it suitable for end-users to enter manually, and I don't expose it that way in my BL app.

But I do use it as an API for other internal software to make complex boolean queries against my BL app, like "format = 'Journal' AND (ISSN = X OR ISSN = Y OR ISBN = Z)". Paired with the BL Atom response, it's a pretty powerful query API against a BL app.
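
For illustration, here is a rough sketch of what issuing such a query from another application might look like. This is a hypothetical example, not code from the plugin: the host name, the "cql" search_field value, and the Atom format suffix are assumptions, and the index names in the CQL string depend on how your Blacklight search fields are configured.

from urllib.parse import urlencode
from urllib.request import urlopen

# A CQL boolean expression of the kind described above; the index names
# (format, issn, isbn) and values are placeholders for whatever your app exposes.
cql = 'format = "Journal" and (issn = "1234-5678" or issn = "8765-4321" or isbn = "0000000000")'

# Hypothetical Blacklight catalog URL; the query is assumed to be routed through
# a "cql" search field, with the Atom response format requested.
params = urlencode({"search_field": "cql", "q": cql})
url = "https://catalog.example.edu/catalog.atom?" + params

with urlopen(url) as response:
    atom_feed = response.read().decode("utf-8")  # the BL Atom response

print(atom_feed[:500])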

Both direct Solr fields, and search_fields you’ve configured in Blacklight are available in CQL; they can even be mixed and matched in a single query.

The blacklight_cql plug-in also provides an SRU/ZeeRex EXPLAIN handler, for a machine-readable description of what search fields are supported via CQL.  Here’s “EXPLAIN” on my server:

The plug-in does NOT provide a full SRU/SRW implementation — but as it does provide some of the hardest parts of an SRW implementation, it would probably not be too hard to write a bit more glue code to get a full implementation.  I considered doing that to make my BL app a target of various federated search products that speak SRW, but never wound up having a business case for it here.  (Also, it may or may not actually work out, as SRW tends to vary enough that even if it’s a legal-to-spec SRW implementation, that’s no guarantee it will work with a given client).

Even though the blacklight_cql plugin has been around for a while, it’s perhaps still somewhat immature software (or maybe it’s that it’s “legacy” software now?). It’s worked out quite well for me, but I’m not sure anyone else has used it, so it may have edge case bugs I’m not running into, or bugs that are triggered by use cases other than mine. It’s also, I’m afraid, not very well covered by automated tests. But I think what it does is pretty cool, and if you have a use for what it does, starting with blacklight_cql should be a lot easier than starting from scratch.

Feel free to let me know if you have questions or run into problems.

Filed under: General

Islandora: Meet Your Developer: Jared Whiklo

planet code4lib - Mon, 2015-08-24 14:26

The Islandora community has seen a lot of growth since the Islandora Foundation got its start in 2013. The growth of our user and institutional community has been easy to see, but there has been another layer of growth in a vital part of the community that isn't always as visible: Islandora developers. Modules, bug fixes, and other commits to the Islandora codebase are coming from a much wider variety of sources than in the early days of Islandora.

Today, we are going to learn more about one of those community developers. Jared Whiklo is an Applications Developer at the University of Manitoba. He has also been an integral part of the Islandora 7.x-2.x development team and will be co-leading Islandora's first Community Sprint at the end of the month. Jared has authored some handy Islandora tools of his own, including Islandora Custom Solr, which replaces SPARQL queries with Solr queries where possible for speed improvements. You can learn more about how he runs Islandora from the University of Manitoba's entry in the Islandora Deployments Repo.

Please tell us a little about yourself. What do you do when you’re not at work?

I am a self-taught programmer from days past (like Turbo Pascal on 14 disks, past). I am married with two young kids. I like to build, fix things, camp (in a tent), bike, skate and run the occasional marathon.


How long have you been working with Islandora? How did you get started?

Over the past 3 years in my current position I have slowly gotten deeper and more involved in Islandora. Our institution had invested early in the Islandora project; we liked the flexibility as we were moving away from about 3 different legacy products.


Sum up your area of expertise in three words:

Master of none


What are you working on right now?

We are migrating content from various systems into our Islandora instance, as well as bringing other groups on campus on board to store their data.


What contribution to Islandora are you most proud of?

I am proud of each little contribution. Every little bit helps to move the community forward.


What new feature or improvement would you most like to see?

Islandora 7.x-2.x!!


What’s the one tool/software/resource you cannot live without?

Git. When you swing between work for different interests, it's vital.


If you could leave the community with one message from reading this interview, what would it be?

Don't get discouraged.

Casey Bisson: Compact camera recommendations

planet code4lib - Mon, 2015-08-24 03:55

A friend asked the internet:

Can anyone recommend a mirrorless camera? I have some travel coming up and I’m hesitant to lug my DSLR around.

Of course I had an opinion:

I go back and forth on this question myself. My current travel camera is a Sony RX100 mark 3 (the mark 4 was recently released). Some of my photos with that camera are on Flickr. If I decide to get a replacement for my bigger cameras, I'll probably go with a full frame Sony A7 of some sort. The Fuji X system APS-C, and Olympus and Panasonic Micro 4/3 cameras look great, but they don't offer enough improvement over the RX100 to excite me much.

One of the biggest issues for me is sensor size. The smallest camera with the largest sensor is usually the winner for me. Other compact cameras I like include the Panasonic LUMIX LX100 and Canon PowerShot G1 X Mark II. Both have bigger sensors for shallower depth of field. If the Panasonic supported remote shutter release I would definitely have picked that instead of the Sony (I have a predecessor to the LX100, the LX3, that I loved). If you don't care to do timelapse like I do, then remote shutter release might not be a requirement for you.

Back to my RX100: it's my go-to digital. I shoot raw, sometimes with auto-bracketing, to maximize dynamic range. Even without bracketing, the raw files have great dynamic range–much more than my Canon bodies. The only reason I've used my Canon bodies recently is when I needed a hot shoe for strobist work (which I'd like to do more of).

To give context to my rambling: I offered my camera history up to mid-2014 previously. After that, I got deep into film, including instant and celluloid. My darling wife agreed to let me buy a Hasselblad in March if I promised not to say a word about buying another camera for a full year. That lasted about a month, but at least (most) film cameras are cheap. I'm easy to find on Flickr and Instagram.

Terry Reese: MarcEdit Validate Headings: Part 2

planet code4lib - Mon, 2015-08-24 02:16

Last week, I posted an update that included the early implementation of the Validate Headings tool.  After a week of testing, feedback and refinement, I think that the tool now functions in a way that will be helpful to users.  So, let me describe how the tool works and what you can expect when the tool is run.


The Validate Headings tool was added as a new report to the MarcEditor to enable users to take a set of records and get back a report detailing how many records had corresponding Library of Congress authority headings.  The tool was designed to validate data in the 1xx, 6xx, and 7xx fields.  The tool has been set to only query headings and subjects that utilize the LC authorities.  At some point, I’ll look to expand to other vocabularies.

How does it work

Presently, this tool must be run from within the MarcEditor – though at some point in the future, I'll extract it from the MarcEditor and provide a stand-alone function and an integration with the command-line tool. Right now, to use the function, you open the MarcEditor and select the Reports/Validate Headings menu.

Selecting this option will open the following window:

Options – you'll notice 3 options available to you. The tool allows users to decide which values they would like to have validated. They can select names (1xx, 600, 610, 611, 7xx) or subjects (6xx). Please note, when you select names, the tool does look up the 600, 610, and 611 as part of the process because the validation of these subjects occurs within the name authority file. The last option deals with the local cache. As MarcEdit pulls data from the Library of Congress, it caches the data that it receives so that it can use it on subsequent headings validation checks. The cache will be used until it expires in 30 days; however, a user can at any time check this option and MarcEdit will delete the existing cache and rebuild it during the current data run.

A couple of things you'll also note on this screen: there is an Extract button, and it's not enabled. Once the Validate report has been run, this button will become enabled if any records are identified as having headings that could not be validated against the service.

Running the Tool:

A couple of notes about running the tool. When you run the tool, what you are asking MarcEdit to do is process your data file and query the Library of Congress for information related to the authorized terms in your records. As part of this process, MarcEdit sends a lot of data back and forth to the Library of Congress utilizing the service. The tool attempts to use a light touch, only pulling down headings for a specific request – but do realize that a lot of data requests are generated through this function. You can estimate approximately how many requests might be made on a specific file by using the following formula: (number of records x 2) + (number of records), assuming that most records will have 1 name to authorize and 2 subjects per record. So a file with 2500 records would generate ~7500 requests to the Library of Congress. Now, this is just a guess; in my tests, I've had some sets generate as many as 12,000 requests for 2500 records and as few as 4000 requests for 2500 records – but 7500 tended to be within 500 requests in most test files.
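
As a quick sanity check, the rule of thumb above works out to roughly three lookups per record. Here is a trivial sketch of that estimate; the function name and default assumptions are mine, not MarcEdit's:

def estimate_lc_requests(num_records, names_per_record=1, subjects_per_record=2):
    # (number of records x 2) + (number of records) from the rule of thumb above,
    # generalized so the per-record assumptions can be adjusted.
    return num_records * (names_per_record + subjects_per_record)

print(estimate_lc_requests(2500))  # 7500 -- actual test files ranged from ~4000 to ~12000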

So why do we care? Well, this report has the potential to generate a lot of requests to the Library of Congress's identifier service – and while I've been told that there shouldn't be any issues with this, I think that question won't really be answered until people start using it. At the same time, this function won't come as a surprise to the folks at the Library of Congress, as we've spoken a number of times during development. At this point, we are all kind of waiting to see how popular this function might be, and whether MarcEdit usage will create any noticeable uptick in the service's usage.
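
To give a concrete sense of the kind of traffic involved, here is a rough sketch of a known-label lookup against the id.loc.gov service. This is not MarcEdit's actual code; the endpoint path and behavior (a redirect to the authority record for a known label, a 404 for an unknown one) are my understanding of the service and should be verified against current LC documentation.

import urllib.error
import urllib.parse
import urllib.request

def lookup_lc_heading(label, dataset="authorities/names"):
    # Known-label lookup: a recognized label redirects to its authority record;
    # an unrecognized label returns HTTP 404.
    url = "https://id.loc.gov/%s/label/%s" % (dataset, urllib.parse.quote(label))
    try:
        with urllib.request.urlopen(url) as response:
            return response.geturl()  # URL of the authority record the label resolved to
    except urllib.error.HTTPError:
        return None  # heading not found

print(lookup_lc_heading("Performing arts--Management--Congresses", dataset="authorities/subjects"))
print(lookup_lc_heading("Crawford, Robert W"))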

Validation Results:

When you run the validation tool, the program will go through each record, making the necessary validation requests of the LC ID service. When the process has completed, the user will receive a report with the following information:

Validation Results:
Process completed in: 121.546001431667 minutes.
Average Response Time from LC: 0.847667984420415
Total Records: 2500
Records with Invalid Headings: 1464
**************************************************************
1xx Headings Found: 1403
6xx Headings Found: 4106
7xx Headings Found: 1434
**************************************************************
1xx Headings Not Found: 521
6xx Headings Not Found: 1538
7xx Headings Not Found: 624
**************************************************************
1xx Variants Found: 6
6xx Variants Found: 1
7xx Variants Found: 3
**************************************************************
Total Unique Headings Queried: 8604
Found in Local Cache: 1001
**************************************************************

This represents the header of the report.  I wanted users to be able to quickly, at a glance, see what the Validator determined during the course of the process.  From here, I can see a couple of things:

  1. The tool queried a total of 2500 records
  2. Of those 2500 records, 1464 had at least one heading that was not found
  3. Within those 2500 records, 8604 unique headers were queried
  4. Within those 2500 records, there were 1001 duplicate headings across records (these were not duplicate headings within the same record, but for example, multiple records with the same author, subject, etc.)
  5. We can see how many Headings were found by the LC ID service within the 1xx, 6xx, and 7xx blocks
  6. Likewise, we can see how many headings were not found by the LC ID service within the 1xx, 6xx, and 7xx blocks.
  7. We can see number of Variants as well.  Variants are defined as names that resolved, but that the preferred name returned by the Library of Congress didn’t match what was in the record.  Variants will be extracted as part of the records that need further evaluation.

After this summary of information, the Validation report returns information related to the record # (record number count starts at zero) and the headings that were not found.  For example:

Record #0
Heading not found for: Performing arts--Management--Congresses
Heading not found for: Crawford, Robert W
Record #5
Heading not found for: Social service--Teamwork--Great Britain
Record #7
Heading not found for: Morris, A. J
Record #9
Heading not found for: Sambul, Nathan J
Record #13
Heading not found for: Opera--Social aspects--United States
Heading not found for: Opera--Production and direction--United States

The current report format includes specific information about the heading that was not found.  If the value is a variant, it shows up in the report as:

Record #612
Term in Record: bible.--criticism, interpretation, etc., jewish
LC Preferred Term: Bible. Old Testament--Criticism, interpretation, etc., Jewish
URL:
Heading not found for: Bible.--Criticism, interpretation, etc

Here you see that the report returns the record number, the normalized form of the term as queried, the current LC preferred term, and the URL of the term that was found.

The report can be copied and placed into a different program for viewing or can be printed (see buttons).

To extract the records that need work, minimize or close this window and go back to the Validate Headings Window.  You will now see two new options:

First, you’ll see that the Extract button has been enabled.  Click this button, and all the records that have been identified as having headings in need of work will be exported to the MarcEditor.  You can now save this file and work on the records. 

Second, you’ll see the new link – save delimited.  Click on this link, and the program will save a tab delimited copy of the validation report.  The report will have the following format:

Record ID [tab] 1xx [tab] 6xx [tab] 7xx [new line]

Within each column, multiple headings are delimited by a colon; so if two 1xx headings appear in a record, the current process creates a single 1xx column with the headings separated by a colon, like heading 1:heading 2.
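
As a rough sketch, reading that delimited report back out might look like the following. The file name is hypothetical; the column layout (Record ID, then 1xx, 6xx, and 7xx columns, with multiple headings inside a column separated by a colon) follows the description above.

import csv

with open("validation_report.txt", newline="", encoding="utf-8") as report:
    for row in csv.reader(report, delimiter="\t"):
        record_id, headings_1xx, headings_6xx, headings_7xx = row
        print(record_id)
        for group, cell in (("1xx", headings_1xx), ("6xx", headings_6xx), ("7xx", headings_7xx)):
            for heading in filter(None, cell.split(":")):
                print("  %s: %s" % (group, heading))

Note that a heading which itself contains a colon (a subtitle, for instance) would make this split ambiguous, which is worth keeping in mind if you script against this output.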

Future Work:

This function required making a number of improvements to the linked data components – and because of that, the linking tool should work better and faster now. Additionally, because of the variant work I've done, I'll soon be adding code that will give the user the option to update headings for variants as this report or the linking tool is running – and I think that is pretty cool. If you have other ideas or find that this is missing a key piece of functionality, let me know.


DuraSpace News: Welcome Jared Whiklo, University of Manitoba, to the Fedora Committers

planet code4lib - Mon, 2015-08-24 00:00

From Andrew Woods, on behalf of the Fedora Committers and Leadership Team

Winchester, MA  The Fedora Committers and Leadership Teams are pleased to welcome Jared Whiklo, Web Application Developer at the University of Manitoba, to the Fedora Committers team.

DuraSpace News: Second meeting of the DSpace UI Working Group, 8/25

planet code4lib - Mon, 2015-08-24 00:00

From Tim Donohue, DSpace Tech Lead, DuraSpace

Winchester, MA  A reminder that the second meeting of the DSpace UI Working Group is TOMORROW (Tues, Aug 25) at 15:00 UTC (11:00am EDT).  Connection information is below.

Anyone is welcome to attend and join this new working group. A working group charter, with deliverables, is available at

Meeting Agenda:

Nicole Engard: Bookmarks for August 23, 2015

planet code4lib - Sun, 2015-08-23 20:30

Today I found the following resources and bookmarked them on Delicious.

  • MediaGoblin MediaGoblin is a free software media publishing platform that anyone can run. You can think of it as a decentralized alternative to Flickr, YouTube, SoundCloud, etc.
  • The Architecture of Open Source Applications
  • A web whiteboard A Web Whiteboard is a touch-friendly online whiteboard app that lets you use your computer, tablet or smartphone to easily draw sketches, collaborate with others and share them with the world.

Digest powered by RSS Digest

The post Bookmarks for August 23, 2015 appeared first on What I Learned Today....

Related posts:

  1. Create Android Apps
  2. Why pay for copies?
  3. Another big name online office suite

Karen G. Schneider: Come together, right now

planet code4lib - Sun, 2015-08-23 18:24

“Golden Eagle in flight – 5” by Tony Hisgett from Birmingham, UK. Licensed under CC BY 2.0

Tomorrow is my first convocation at my new university. For my international readers, a convocation in this part of the world is usually a ceremony in the autumn where faculty, students, and the schools that serve them are welcomed into the new academic year. (Although sometimes “convocation” is a graduation, which I suppose makes it a contronym, and it is also the collective noun for eagles).

At Holy Names, convocation was a student-centered event, and began with the university community, dressed in its finest, climbing up the 100-plus stairs to the dining hall for speeches and a lunch. I do not know entirely what to expect from tomorrow's event (except there is no lunch, and it is held in the largest theater on campus, and relatively few students will be present), but I know that it will be different and that in its difference I will learn new meanings, symbols, and ways of being.

All weekend I have had the last four lines of Yeats’ “A prayer for my daughter” running through my mind:

How but in custom and in ceremony
Are innocence and beauty born?
Ceremony’s a name for the rich horn,
And custom for the spreading laurel tree.

There is a saying on the Internet, “do not read the comments,” and when it comes to major poems, I extend this to “do not read the commentary.” I made the mistake of browsing discussions of this poem, only to discover that rather than the sky-wide reflection on chaos versus order I know it to be, it is actually, among other flaws, a poem advocating the oppression of women. The idea that the poem is a product of its time, or that a father would want to be protective of his daughter, or that there is something to be said for the sanity of a well-ordered home life, is pushed aside in favor of squeezing this poem through a highly specific modern sensibility, then finding it wanting.

Higher education has been described as irrelevant, in a crisis, in need of great change, overpriced, stodgy, out of touch with the world, a waste of effort, and most of all, in need of disruption. And yet every fall universities around the country unite the stewards of academia in a ceremony that is anything but disruptive (convocation: convene, come together) and reminds us that the past, however conflicted and flawed, is the inevitable set of struts for building the future. Convocation reminds us that the work of summer is done, and now it is time for students to matriculate, spend a few days having fun and learning the campus culture, then settle down to work. The clock is wound, and begins to tick:  professors teaching, administrators administrating, and librarians librarying and otherwise being their bad (as in good) information-professional selves.

When I think about the harsh words tossed at higher education, I am reminded not only of the dishonoring of great poems by forcing them through a chemist’s retort of present-day sensibility, but also how some leaders–and I have been guilty of this myself–are in such a rush to embrace new ideas (particularly our own new ideas) and express our pride in our forward-looking stance that we forget that many times, things were the way they were for a good reason that made sense at the time; and we also forget that in a decade or two our own ideas will be found ill-suited for the way things are done in that new era. When we do that we hurt feelings and body-block the gradual changing of minds, and for what purpose? We can and should continue the hard work of making higher education better, but we should also honor and embrace the past. Give the past its due, because for all of its failings, it birthed the present.

I see now that part of the thrill of convocation for me is how it fills a necessary void: the honoring of my own conflicted past (and all human pasts are conflicted), as well as my commitment to movement into the future. We have events honoring our own birth and also the calendar year, but too many cultures lack a Yom Kippur or Ramadan to help us reset and recommit. Lent comes close, but it is now nearly ruined by Secular Easter and muddy symbolism; as Sandy observes, it is strange behavior to celebrate the Lamb of God, and then roast him for Easter dinner. I am also impressed by how many clueless people schedule ordinary events for Good Friday, which is the religious observance that makes Easter Easter.

So onward into the academic year. The spreading laurel tree of academic custom, framed by convocation in early autumn and graduation in spring, gives my life well-framed pauses for introspection and inventory, pausing the slipstream of dailiness, stirring memories, reflection, atonement, and even where warranted, a little quiet praise. Births and deaths, broken friendships and promises, things (to borrow from the Book of Common Prayer) done and left undone, achievements big and small, harsh words and kind actions, frustrations and triumphs, times of fear and times of fearlessness, critical moments of thoughtlessness and those of careful consideration: tomorrow morning, dressed as one does for signature moments, I will tag along behind librarians as they wend their way to a place I have never visited and yet will come to know well, and learn a new way of coming together, in this autumn that closes one book and starts another.


District Dispatch: Archived webinar on university copyright services now available

planet code4lib - Fri, 2015-08-21 18:38

By trophygeek

An archived copy of the CopyTalk webinar “University Copyright Services” is now available. Originally webcasted on August 6th by the Office for Information Technology Policy’s Copyright Education Subcommittee, presenters were Sandra Enimil, Program Director, University Libraries Copyright Resources Center from the Ohio State University, Pia Hunter, Visiting Assistant Professor and Copyright and Reserve Librarian from the University of Illinois at Chicago, and Cindy Kristof, Head of Copyright and Document Services from Kent State University. They described the copyright services they offer to faculty, staff, and students at their respective institutions.

Plan ahead! One hour CopyTalk webinars occur on the first Thursday of every month at 11am Pacific/2 pm Eastern Time. It’s free!

The post Archived webinar on university copyright services now available appeared first on District Dispatch.

FOSS4Lib Upcoming Events: Islandora Camp FL 2016

planet code4lib - Fri, 2015-08-21 13:34
Date: Wednesday, March 9, 2016 - 08:00 to Friday, March 11, 2016 - 17:00
Supports: Islandora, Fedora Repository

Last updated August 21, 2015. Created by Peter Murray on August 21, 2015.

From the announcement:

Islandora is going to Florida! March 9 - 11, 2016, join us on the campus of Florida Gulf Coast University in Fort Myers, Florida.

FOSS4Lib Upcoming Events: Islandora Camp CT 2015

planet code4lib - Fri, 2015-08-21 13:31
Date: Tuesday, October 20, 2015 - 08:00 to Friday, October 23, 2015 - 17:00
Supports: Islandora, Fedora Repository

Last updated August 21, 2015. Created by Peter Murray on August 21, 2015.

From the announcement:

Islandora is heading to Connecticut! October 20 - 23, join us on the beautiful campus of the University of Connecticut Graduate Business Learning Center, in downtown Hartford.

DuraSpace News: INVITATION: DSpace User Group Meeting in Tübingen, Germany

planet code4lib - Fri, 2015-08-21 00:00

From Pascal-Nicolas Becker, University Library Tübingen

Winchester, MA  The German DSpace User Group was reconstituted in October of 2014. The DSpace German-speaking community now has the opportunity to talk about DSpace topics face-to-face.

SearchHub: Solr as an Apache Spark SQL DataSource

planet code4lib - Thu, 2015-08-20 20:55
Join us for our upcoming webinar, Solr & Spark for Real-Time Big Data Analytics. You'll learn more about how to use Solr as an Apache Spark SQL DataSource and how to combine data from Solr with data from other enterprise systems to perform advanced analysis tasks at scale. Full details and registration…

Part 1 of 2: Read Solr results as a DataFrame

This post is the first in a two-part series where I introduce an open source toolkit created by Lucidworks that exposes Solr as a Spark SQL DataSource. The DataSource API provides a clean abstraction layer for Spark developers to read and write structured data from/to an external data source. In this first post, I cover how to read data from Solr into Spark. In the next post, I'll cover how to write structured data from Spark into Solr.

To begin, you'll need to clone the project from github and build it using Maven:

git clone
cd spark-solr
mvn clean package -DskipTests

After building, run the twitter-to-solr example to populate Solr with some tweets. You'll need your own Twitter API keys, which can be created by following the steps documented here.

Start Solr running in Cloud mode and create a collection named "socialdata" partitioned into two shards:

bin/solr -c && bin/solr create -c socialdata -shards 2

The remaining sections in this document assume Solr is running in cloud mode on port 8983 with embedded ZooKeeper listening on localhost:9983. Also, to ensure you can see tweets as they are indexed in near real-time, you should enable auto soft-commits using Solr's Config API. Specifically, for this exercise, we'll commit tweets every 2 seconds.

curl -XPOST http://localhost:8983/solr/socialdata/config \
  -d '{"set-property":{"updateHandler.autoSoftCommit.maxTime":"2000"}}'

Now, let's populate Solr with tweets using Spark streaming:

$SPARK_HOME/bin/spark-submit --master $SPARK_MASTER \
  --conf "spark.executor.extraJavaOptions=-Dtwitter4j.oauth.consumerKey=? -Dtwitter4j.oauth.consumerSecret=? -Dtwitter4j.oauth.accessToken=? -Dtwitter4j.oauth.accessTokenSecret=?" \
  --class com.lucidworks.spark.SparkApp \
  ./target/spark-solr-1.0-SNAPSHOT-shaded.jar \
  twitter-to-solr -zkHost localhost:9983 -collection socialdata

Replace $SPARK_MASTER with the URL of your Spark master server. If you don't have access to a Spark cluster, you can run the Spark job in local mode by passing:

--master local[2]

However, when running in local mode, there is no executor, so you'll need to pass the Twitter credentials in the spark.driver.extraJavaOptions parameter instead of spark.executor.extraJavaOptions.

Tweets will start flowing into Solr; be sure to let the streaming job run for a few minutes to build up a few thousand tweets in your socialdata collection. You can kill the job using ctrl-C.

Next, let's start up the Spark Scala REPL shell to do some interactive data exploration with our indexed tweets:

cd $SPARK_HOME
ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT-shaded.jar bin/spark-shell

$PROJECT_HOME is the location where you cloned the spark-solr project.

Next, let's load the socialdata collection into Spark by executing the following Scala code in the shell:

val tweets = sqlContext.load("solr",
  Map("zkHost" -> "localhost:9983", "collection" -> "socialdata")
).filter("provider_s='twitter'")

On line 1, we use the sqlContext object loaded into the shell automatically by Spark to load a DataSource named "solr". Behind the scenes, Spark locates the solr.DefaultSource class in the project JAR file we added to the shell using the ADD_JARS environment variable.
On line 2, we pass configuration parameters needed by the Solr DataSource to connect to Solr using a Scala Map. At a minimum, we need to pass the ZooKeeper connection string (zkHost) and collection name. By default, the DataSource matches all documents in the collection, but you can pass a Solr query to the DataSource using the optional "query" parameter. This allows you to restrict the documents seen by the DataSource using a Solr query. On line 3, we use a filter to only select documents that come from Twitter (provider_s='twitter').

At this point, we have a Spark SQL DataFrame object that can read tweets from Solr. In Spark, a DataFrame is a distributed collection of data organized into named columns. Conceptually, DataFrames are similar to tables in a relational database except they are partitioned across multiple nodes in a Spark cluster. The following diagram depicts how a DataFrame is constructed by querying our two-shard socialdata collection in Solr using the DataSource API:

It's important to understand that Spark does not actually load the socialdata collection into memory at this point. We're only setting up to perform some analysis on that data; the actual data isn't loaded into Spark until it is needed to perform some calculation later in the job. This allows Spark to perform the necessary column and partition pruning operations to optimize data access into Solr.

Every DataFrame has a schema. You can use the printSchema() function to get information about the fields available for the tweets DataFrame:

tweets.printSchema()

Behind the scenes, our DataSource implementation uses Solr's Schema API to determine the fields and field types for the collection automatically.

scala> tweets.printSchema()
root
 |-- _indexed_at_tdt: timestamp (nullable = true)
 |-- _version_: long (nullable = true)
 |-- accessLevel_i: integer (nullable = true)
 |-- author_s: string (nullable = true)
 |-- createdAt_tdt: timestamp (nullable = true)
 |-- currentUserRetweetId_l: long (nullable = true)
 |-- favorited_b: boolean (nullable = true)
 |-- id: string (nullable = true)
 |-- id_l: long (nullable = true)
 ...

Next, let's register the tweets DataFrame as a temp table so that we can use it in SQL queries:

tweets.registerTempTable("tweets")

For example, we can count the number of retweets by doing:

sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'").show()

If you check your Solr log, you'll see the following query was generated by the Solr DataSource to process the SQL statement (note I added the newlines between parameters to make it easier to read the query):

q=*:*&
fq=provider_s:twitter&
fq=type_s:echo&
distrib=false&
fl=type_s,provider_s&
cursorMark=*&
start=0&
sort=id+asc&
collection=socialdata&
rows=1000

There are a couple of interesting aspects of this query. First, notice that the provider_s field filter we used when we declared the DataFrame translated into a Solr filter query parameter (fq=provider_s:twitter). Solr will cache an efficient data structure for this filter that can be reused across queries, which improves performance when reading data from Solr to Spark. In addition, the SQL statement included a WHERE clause that also translated into an additional filter query (fq=type_s:echo). Our DataSource implementation handles the translation of SQL clauses to Solr-specific query constructs. On the backend, Spark handles the distribution and optimization of the logical plan to execute a job that accesses data sources.
Even though there are many fields available for each tweet in our collection, Spark ensures that only the fields needed to satisfy the query are retrieved from the data source, which in this case is only type_s and provider_s. In general, it's a good idea to only request the specific fields you need access to when reading data in Spark.

The query also uses deep-paging cursors to efficiently read documents deep into the result set. If you're curious how deep paging cursors work in Solr, please read:

Also, matching documents are streamed back from Solr, which improves performance because the client side (Spark task) does not have to wait for a full page of documents (1000) to be constructed on the Solr side before receiving data. In other words, documents are streamed back from Solr as soon as the first hit is identified.

The last interesting aspect of this query is the distrib=false parameter. Behind the scenes, the Solr DataSource will read data from all shards in a collection in parallel from different Spark tasks. In other words, if you have a collection with ten shards, then the Solr DataSource implementation will use 10 Spark tasks to read from each shard in parallel. The distrib=false parameter ensures that each shard will only execute the query locally instead of distributing it to other shards. However, reading from all shards in parallel does not work for Top N type use cases where you need to read documents from Solr in ranked order across all shards. You can disable the parallelization feature by setting the parallel_shards parameter to false. When set to false, the Solr DataSource will execute a standard distributed query. Consequently, you should use caution when disabling this feature, especially when reading very large result sets from Solr.

Not only SQL

Beyond SQL, the Spark API exposes a number of functional operations you can perform on a DataFrame. For example, if we wanted to determine the top authors based on the number of posts, we could use the following SQL:

sqlContext.sql("select author_s, COUNT(author_s) num_posts from tweets where type_s='post' group by author_s order by num_posts desc limit 10").show()

Or, you can use the DataFrame API to achieve the same:

tweets.filter("type_s='post'").groupBy("author_s").count()
  .orderBy(desc("count")).limit(10).show()

Another subtle aspect of working with DataFrames is that you as a developer need to decide when to cache the DataFrame based on how expensive it was to create it. For instance, if you load tens of millions of rows from Solr and then perform some costly transformation that trims your DataFrame down to 10,000 rows, then it would be wise to cache the smaller DataFrame so that you won't have to re-read millions of rows from Solr again. On the other hand, caching the original millions of rows pulled from Solr is probably not very useful, as that will consume too much memory. The general advice I follow is to cache DataFrames when you need to reuse them for additional computation and they require some computation to generate.

Wrap-up

Of course, you don't need the power of Spark to perform a simple count operation as I did in my example. However, the key takeaway is that the Spark SQL DataSource API makes it very easy to expose the results of any Solr query as a DataFrame. Among other things, this allows you to combine data from Solr with data from other enterprise systems, such as Hive or Postgres, to perform advanced data analysis tasks at scale.
Another advantage of the DataSource API is that it allows developers to interact with a data source using any language supported by Spark. For instance, there is no native R interface to Solr, but using Spark SQL, a data scientist can pull data from Solr into an R job seamlessly. In the next post, I'll cover how to write a DataFrame to Solr using the DataSource API.

Join Tim for our upcoming webinar, Solr & Spark for Real-Time Big Data Analytics. You'll learn more about how to use Solr as a Spark SQL DataSource and how to combine data from Solr with data from other enterprise systems to perform advanced analysis tasks at scale. Full details and registration…

The post Solr as an Apache Spark SQL DataSource appeared first on Lucidworks.

District Dispatch: Start-ups, start your engines

planet code4lib - Thu, 2015-08-20 17:25

From Flickr

Yesterday marked the 12th annual Start-Up Day Across America – a “holiday” dedicated to raising awareness of the importance of local entrepreneurship. Organized under the auspices of the Congressional Caucus on Innovation and Entrepreneurship, National Start-Up Day represents an opportunity for owners of local businesses to meet directly with their elected federal representatives and share their ideas and concerns about the direction of the innovation economy.

The American Library Association believes strongly in the foundational purpose of National Start-Up Day: The advancement of our economy through the encouragement of the entrepreneurial spirit. In fact, you might say that every day is National Start-Up day for America’s libraries. Libraries of all types provide a host of services and resources that can help entrepreneurs and aspiring entrepreneurs at every stage of their efforts to bring an innovative idea to fruition.

According to the ALA/University of Maryland Digital Inclusion Survey, most public libraries (over 99%) report providing economic/workforce services. Of those, about 48% report providing entrepreneurship and small business development services. These services range from providing programming and informational resources on financial analysis, customer relations, supply chain management and marketing, to working directly with actors in the financial sector to help patrons gain access to seed capital and consulting services.

For example, New York Public Library’s Science, Industry and Business Library offers one-on-one small business counseling through SCORE (a small business non-profit associated with the Small Business Administration), and the Brooklyn and Houston Public Libraries are partners in business plan competitions that offer seed capital to local entrepreneurs. These are just a few examples of how libraries are doing their part to create synergies that grow our economy by fostering innovation.

Furthermore, libraries are not just places to launch a business – they’re also places to grow a business. Last Spring, Larra Clark of the ALA Office for Information Technology Policy organized and participated in a program highlighting the growth of co-working areas in libraries at the annual South by Southwest (SXSW) festival. To accommodate the increasing number of self-employed, temp and freelance workers in our communities, numerous libraries offer dedicated work spaces. Startups use the spaces to build and launch new businesses and enterprises using robust digital technologies and resources. Jonathan Marino – one of Larra’s collaborators at SXSW – runs MapStory, an interactive and collaborative platform for mapping change over time, out of the co-working space at D.C. Public Library. Jonathan is just one of numerous entrepreneurs who rely on library resources to operate their respective ventures on a daily basis.

Even if you’re not ready to monetize or market your product, you can come to the library to bring your product into the physical world for the first time. As makerspaces sprout up in libraries across the country, people of all ages with nothing more than a budding idea and a library card are becoming engineers; they’re using their library’s 3D printer, laser cutter and/or CNC router to build a prototype of an item they hope may one day galvanize consumers.

The point of all of this is not just that libraries do lots of stuff to help entrepreneurs (although we do, and we’re proud of that). It’s also that the library community doesn’t have a single, narrow vision for advancing the innovation economy – we help individuals in all areas, of all ages and backgrounds advance their own visions. In short, librarians encourage the diverse communities we serve to ignite the innovation economy in a diversity of ways.

The ALA hopes that National Start-Up Day Across America enriches the discourse on small business policy and encourages entrepreneurs in every part of our country to continue to drive our economy forward.

The post Start-ups, start your engines appeared first on District Dispatch.

