Feed aggregator

Mita Williams: The Hashtag Syllabus: Part Three

planet code4lib - Tue, 2016-08-16 19:25

In The Future of the Library: From Electric Media to Digital Media by Robert K. Logan and Marshall McLuhan, you can find this passage from Chapter 9: The Compact Library and Human Scale:

As an undergraduate at the University of Cambridge, I (McLuhan) encountered a library in the English Department that had immense advantages. I never have seen one like it since. It consisted of no more than 1,500 or 2,000 books. These books, however, were chosen from many fields of history and aesthetics, philosophy, anthropology, mathematics, and the sciences in general. The one criterion, which determined the presence of any book in this collection, was the immediate and top relevance for twentieth-century awareness.  The shelf-browser could tell at a glance exactly which poets, novelists, critics, painters, and which of their individual writings were indispensable for knowing “where it’s at….”

… The library of which I spoke existed in a corner of the English Faculty Library at Cambridge, but it enabled hundreds of students to share all the relevant poets, painters, critics, musicians, and scientists of that time as a basis for an ongoing dialog. Would it not be possible to have similar libraries created by other departments in the university? Could not the History Department indicate those areas of anthropology and sociology that were indispensable to the most advanced historical studies of the hour? Could not the Department of Philosophy pool its awareness of many fields in order to create a composite image of all the relevant speculation and discovery of our time? Only now have I begun to realize that this unique library represented the meeting of both a written and oral tradition at an ancient university. It is this figure-ground pattern of the written and the oral that completes the meaning of the book and the library.

McLuhan isn’t the first scholar to recognize that something feels fundamentally different between a library collection of material selected by librarians and a working collection of material selected by practitioners. While the ideal academic library is close at hand and contains a vast amount of material relevant to one’s interests, the ideal working library is compact and at ‘human scale.’

It is as if there are two kinds of power at hand.

From the chapter “The Model” [pdf] in Karen Coyle’s FRBR Before and After:

Patrick Wilson’s Two Kinds of Power, published in 1968, and introduced in chapter 1, is a book that is often mentioned in library literature but whose message does not seem to have disseminated through library and cataloging thinking. If it had, our catalogs today might have a very different character. A professor of Library Science at the University of California at Berkeley, Wilson’s background was in philosophy, and his book took a distinctly philosophical approach to the question he posed, which most likely limited its effect on the practical world of librarianship. Because he approached his argument from all points of view, argued for and against, and did not derive any conclusions that could be implemented, there would need to be a rather long road from Wilson’s philosophy to actual cataloging code.

Wilson takes up the question of the goals of what he calls “bibliography,” albeit applied to the bibliographical function of the library catalog. The message in the book, as I read it, is fairly straightforward once all of Wilson’s points and counterpoints are contemplated. He begins by stating something that seems obvious but is also generally missing from cataloging theory, which is that people read for a purpose, and that they come to the library looking for the best text (Wilson limits his argument to texts) for their purpose. This user need was not included in Cutter’s description of the catalog as an “efficient instrument.” By Wilson’s definition, Cutter (and the international principles that followed) dealt only with one catalog function: “bibliographic control.” Wilson suggests that in fact there are two such functions, which he calls “powers”: the first is the evaluatively neutral description of books, which was first defined by Cutter and is the role of descriptive cataloging, called “bibliographic control”; the second is the appraisal of texts, which facilitates the exploitation of the texts by the reader. This has traditionally been limited to the realm of scholarly bibliography or of “recommender” services.

This definition pits the library catalog against the tradition of bibliography, the latter being an analysis of the resources on a topic, organized in terms of the potential exploitation of the text: general works, foundational works, or works organized by school of thought. These address what he sees as the user’s goal, which is “the ability to make the best use of a body of writings.” The second power is, in Wilson’s view, the superior capability. He describes descriptive control somewhat sarcastically as “an ability to line up a population of writings in any arbitrary order, and make the population march to one’s command” (Wilson 1968)

Karen goes on to write…

If one accepts Wilson’s statement that users wish to find the text that best suits their need, it would be hard to argue that libraries should not be trying to present the best texts to users. This, however, goes counter to the stated goal of the library catalog as that of bibliographic control, and when the topic of “best” is broached, one finds an element of neutrality fundamentalism that pervades some library thinking. This is of course irreconcilable with the fact that some of these same institutions pride themselves on their “readers’ services” that help readers find exactly the right book for them. The popularity of the readers’ advisory books of Nancy Pearl and social networks like Goodreads, where users share their evaluations of texts, show that there is a great interest on the part of library users and other readers to be pointed to “good books.” How users or reference librarians are supposed to identify the right books for them in a catalog that treats all resources neutrally is not addressed by cataloging theory.

I’m going to copy and paste that last sentence again for re-emphasis:

How users or reference librarians are supposed to identify the right books for them in a catalog that treats all resources neutrally is not addressed by cataloging theory.

As you can probably tell from my more recent posts and readings, I’ve been delving deeper into the relationship between libraries and readers. To explain why this is necessary, I’ll end with another quotation from McLuhan:

The content of a library, paradoxically is not its books but its users, as a recent study of the use of campus libraries by university faculty revealed. It was found that the dominant criterion for selection of a library was the geographical proximity of the library to the professor’s office. The depth of the collection in the researcher’s field was not as important a criterion as convenience (Dougherty & Blomquist, 1971, pp. 64-65). The researcher was able to convert the nearest library into a research facility that met his needs. In other words, the content of this conveniently located facility was its user. Any library can be converted from the facility it was designed to be, into the facility the user wishes it to become. A library designed for research can be used for entertainment, and vice-versa. As we move into greater use of electronic media, the user of the library will change even more. As the user changes, so will the library’s content or the use to which the content of the library will be subjected. In other words, as the ground in which the library exists changes, so will the figure of the library. The nineteenth-century notion of the library storing basically twentieth-century material will have to cope with the needs of twenty-first century users.

This is the third part of a series called The Hashtag Syllabus. Part One is a brief examination of the recent phenomenon of generating and capturing crowdsourced syllabi on Twitter and Part Two is a technical description of how to use Zotero to collect and re-use bibliographies online.

District Dispatch: Video series makes the case for libraries

planet code4lib - Tue, 2016-08-16 18:23

U.S. libraries—120,000 strong—represent a robust national infrastructure for advancing economic and educational opportunity for all. From pre-K early learning to computer coding to advanced research, our nation’s libraries develop and deliver impactful programs and services that meet community needs and advance national policy goals.

This message is one that our Washington Office staff bring to federal policymakers and legislators every day, and we know it’s one that library directors and trustees also are hitting home in communities across the country. With Library Card Sign-up Month almost upon us, a new series of short videos (1-2 minutes) can help make the case for libraries, including one featuring school principal Gwen Abraham highlighting the important role of public libraries in supporting education. “Keep the innovation coming. Our kids benefit from it, this will affect their futures, and this is really what we need to make sure our kids are prepared with 21st century skills.”

As the nation considers our vision for the future this election year and begins to plot actionable steps to achieve that vision, we offer The E’s of Libraries® as part of the solution. Education, Employment, Entrepreneurship, Empowerment and Engagement are hallmarks of America’s libraries—but may not be as obvious to decision makers, influencers, and potential partners.

“Cleveland Public Library, like many of our colleagues, is using video more and more to share our services with more people in an increasingly visual world,” said Public Library Association (PLA) President Felton Thomas. “I know this is a catalog we need to build, and I hope these diverse videos will be used in our social media, public presentations and outreach to better reflect today’s library services and resources.”

For Employment: “The library was not a place I thought of right away, but it turned out to be the best place for my job search,” says Mike Munoz about how library programs helped him secure a job in a new city after only four months.

For Entrepreneurship: “Before I walked into the public library, I knew nothing about 3D printing,” says brewery owner John Fuduric, who used library resources to print unique beer taps for his business. “The library is a great resource, but with the technology, the possibilities are endless.”

And Kristin Warzocha, CEO of the Cleveland Food Bank, speaks to the power of partnerships to address community needs: “Hunger is everywhere, and families across our country are struggling. Libraries are ideal partners because libraries are everywhere, too. Being able to partner with libraries…is a wonderful win-win situation for us.” In dozens of communities nationwide, libraries are partnering to address food security concerns for youth as part of summer learning programs. In Cleveland, this partnership has expanded to afterschool programming and even “checking out” groceries at the library.

All of the videos are freely available from the PLA YouTube page, and I’d love to hear how you’re using the videos—or even developing videos of your own. I’m at lclark@alawash.org. Thanks!

The post Video series makes the case for libraries appeared first on District Dispatch.

Jez Cope: What happened to the original Software Carpentry?

planet code4lib - Tue, 2016-08-16 17:03

“Software Carpentry was originally a competition to design new software tools, not a training course. The fact that you didn’t know that tells you how well it worked.”

When I read this in a recent post on Greg Wilson’s blog, I took it as a challenge. I actually do remember the competition, although looking at the dates it was long over by the time I found it.

I believe it did have impact; in fact, I still occasionally use one of the tools it produced, so Greg’s comment got me thinking: what happened to the other competition entries?

Working out what happened will need a bit of digging, as most of the relevant information is now only available on the Internet Archive. It certainly seems that by November 2008 the domain name had been allowed to lapse and had been replaced with a holding page by the registrar.
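
For anyone who wants to do similar digging, the Internet Archive's CDX API is a handy starting point. Here is a minimal sketch; the domain and date range are my own guesses, not details from Greg's post:

// Sketch: list archived snapshots of the old Software Carpentry site via the
// Internet Archive's CDX API. The domain and date range are assumptions.
import scala.io.Source

val cdxUrl = "http://web.archive.org/cdx/search/cdx" +
  "?url=software-carpentry.org&from=2004&to=2008&limit=20"
// Each response line: urlkey, timestamp, original URL, mimetype, status, digest, length
Source.fromURL(cdxUrl, "UTF-8").getLines().foreach(println)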

There were four categories in the competition, each representing a category of tool that the organisers thought could be improved:

  • SC Build: a build tool to replace make
  • SC Conf: a configuration management tool to replace autoconf and automake
  • SC Track: a bug tracking tool
  • SC Test: an easy-to-use testing framework

I’m hoping to be able to show that this work had a lot more impact than Greg is admitting here. I’ll keep you posted on what I find!

Islandora: Islandora 7.x-1.8 Release Team: You're Up!

planet code4lib - Tue, 2016-08-16 15:42

It's that time again. Islandora has a twice-yearly release schedule, shooting to get a new version out at the end of April and October. We are now looking for volunteers to join the team for the October release of Islandora 7.x-1.8, under the guidance of Release Manager Danny Lamb.

Given how fortunate we have been to have so many volunteers on our last few releases, we are changing things up a little bit to improve the experience, both by consolidating our documentation and by adding a few new roles to the release:

  • Communications Manager - Works with the Release Manager to announce release timeline milestones to the community. Reminds volunteers of upcoming deadlines and unfinished tasks. Reports to the Release Manager.
  • Testing Manager - Oversees testing of the release and reports back to the Release Manager. Advises Testers on how to complete their tasks. Monitors testing status and reminds Testers to complete their tasks on time. Helps the Release Manager to assign testing tickets to Testers during the release.
  • Documentation Manager - Oversees documentation of the release and reports back to the Release Manager. Advises Documenters on how to complete their tasks. Monitors documentation status and reminds Documenters to complete their tasks on time.
  • Auditing Manager - Oversees the audit of the release and reports back to the Release Manager. Advises Auditors on how to complete their tasks. Monitors auditing status and reminds Auditors to complete their tasks on time.

If you have been a Tester, Documenter, or Auditor for a previous Islandora Release, please consider taking on a little more responsibility and being a mentor to new volunteers by managing a role!

These are in addition to our existing release roles:

  • Component Manager - Component Managers take responsibility for a single module or collection of modules, reviewing open pull requests and referring the results to the Release Manager. Outside of a release cycle, Component Managers act as Maintainer on their modules until the next release. Components with no Component Manager will not be included in the release.
  • Tester - Installing and running the latest Islandora release candidate and testing for bugs. No programming experience required! We are looking for people with a general familiarity with Islandora to try out the latest releases and put them through their paces to look for bugs and make suggestions. Any JIRA tickets marked “Ready for Test” for a given component will also be assigned to the designated tester for a component, along with instructions on how to test.
  • Documenter - Checking modules readme files and updating the Islandora Documentation Wiki to reflect new releases.
  • Auditor - Each release we audit our README and LICENSE files. Auditors will be responsible for auditing a given component by verifying that these documents are current and fit into their proper templates.

All of these roles are outlined here and details on exactly how to Audit, Test, and Document an Islandora release are listed here.

SIGN UP HERE

Why join the 7.x-1.8 Release Team?
  • Give back to Islandora. This project survives because of our volunteers. If you've been using Islandora and want to contribute back to the project, being a part of a Release Team is one of the most helpful commitments you can make.
  • There's a commitment to fit your skills and time. Do you have a strong grasp of the inner workings of a module and want to make sure bugs, improvements, and features are properly managed in its newest version? Be a Component Manager. Do you work with a module a lot as an end user and think you can break it? Be a Tester! Do you want to learn more about a module and need an excuse to take a deep dive? Be a Documenter! Do you have a busy few months coming up and can't give a lot of time to the Islandora release?  Be an Auditor (small time commitment - big help!). You can take on a single module or sign up for several. 
  • Credit. Part of my job as inaugural Communication Manager is to create Release Team pages on our documentation so that future users can know who helped to make the release a reality.
  • T-Shirts. Each member of an Islandora Release Team gets a t-shirt unique to that release. They really are quite nifty:

Tentative schedule for the release:

  • Code Freeze: Tuesday, September 5, 2016
  • Release Candidate: Monday, September 19, 2016
  • Release: Monday October 31, 2016

SearchHub: Solr as SparkSQL DataSource, Part II

planet code4lib - Tue, 2016-08-16 15:13
Solr as a SparkSQL DataSource Part II

Co-authored with Kiran Chitturi, Lucidworks Data Engineer

Last August, we introduced you to Lucidworks’ spark-solr open source project for integrating Apache Spark and Apache Solr, see: Part I. To recap, we introduced Solr as a SparkSQL Data Source and focused mainly on read / query operations. In this posting, we show how to write data to Solr using the SparkSQL DataFrame API and also dig a little deeper into the advanced features of the library, such as data locality and streaming expression support.

Writing Data to Solr

For this posting, we’re going to use the Movielens 100K dataset found at: http://grouplens.org/datasets/movielens/. After downloading the zip file, extract it locally and take note of the directory, such as /tmp/ml-100k.

Setup Solr and Spark

Download Solr 6.x (6.1 is the latest at this time) and extract the archive to a directory, referred to as $SOLR_INSTALL hereafter. Start it in cloud mode by doing:

cd $SOLR_INSTALL
bin/solr start -cloud

Create some collections to host our movielens data:

bin/solr create -c movielens_ratings
bin/solr create -c movielens_movies
bin/solr create -c movielens_users

Also, make sure you’ve installed Apache Spark 1.6.2; see the Spark documentation for getting-started instructions and more details.

Load Data using spark-shell

Start the spark-shell with the spark-solr JAR added to the classpath:

cd $SPARK_HOME
./bin/spark-shell --packages "com.lucidworks.spark:spark-solr:2.1.0"

Let’s load the movielens data into Solr using SparkSQL’s built-in support for reading CSV files. We provide the bulk of the loading code you need below, but you’ll need to specify a few environmental specific variables first. Specifically, declare the path to the directory where you extracted the movielens data, such as:

val dataDir = "/tmp/ml-100k"

Also, verify the zkhost val is set to the correct value for your Solr server.

val zkhost = "localhost:9983"

Next, type :paste into the spark shell so that you can paste in the following block of Scala:

sqlContext.udf.register("toInt", (str: String) => str.toInt)

var userDF = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter","|").option("header", "false").load(s"${dataDir}/u.user")
userDF.registerTempTable("user")
userDF = sqlContext.sql("select C0 as user_id,toInt(C1) as age,C2 as gender,C3 as occupation,C4 as zip_code from user")

var writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_users")
userDF.write.format("solr").options(writeToSolrOpts).save

var itemDF = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter","|").option("header", "false").load(s"${dataDir}/u.item")
itemDF.registerTempTable("item")

val selectMoviesSQL = """
 | SELECT C0 as movie_id, C1 as title, C1 as title_txt_en,
 | C2 as release_date, C3 as video_release_date, C4 as imdb_url,
 | C5 as genre_unknown, C6 as genre_action, C7 as genre_adventure,
 | C8 as genre_animation, C9 as genre_children, C10 as genre_comedy,
 | C11 as genre_crime, C12 as genre_documentary, C13 as genre_drama,
 | C14 as genre_fantasy, C15 as genre_filmnoir, C16 as genre_horror,
 | C17 as genre_musical, C18 as genre_mystery, C19 as genre_romance,
 | C20 as genre_scifi, C21 as genre_thriller, C22 as genre_war,
 | C23 as genre_western
 | FROM item
""".stripMargin
itemDF = sqlContext.sql(selectMoviesSQL)
itemDF.registerTempTable("item")

val concatGenreListSQL = """
 | SELECT *,
 | concat(genre_unknown,genre_action,genre_adventure,genre_animation,
 |   genre_children,genre_comedy,genre_crime,genre_documentary,
 |   genre_drama,genre_fantasy,genre_filmnoir,genre_horror,
 |   genre_musical,genre_mystery,genre_romance,genre_scifi,
 |   genre_thriller,genre_war,genre_western) as genre_list
 | FROM item
""".stripMargin
itemDF = sqlContext.sql(concatGenreListSQL)

// build a multi-valued string field of genres for each movie
sqlContext.udf.register("genres", (genres: String) => {
  var list = scala.collection.mutable.ListBuffer.empty[String]
  var arr = genres.toCharArray
  val g = List("unknown","action","adventure","animation","children",
               "comedy","crime","documentary","drama","fantasy",
               "filmnoir","horror","musical","mystery","romance",
               "scifi","thriller","war","western")
  for (i <- arr.indices) {
    if (arr(i) == '1') list += g(i)
  }
  list
})
itemDF.registerTempTable("item")
itemDF = sqlContext.sql("select *, genres(genre_list) as genre from item")
itemDF = itemDF.drop("genre_list")

writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_movies")
itemDF.write.format("solr").options(writeToSolrOpts).save

sqlContext.udf.register("secs2ts", (secs: Long) => new java.sql.Timestamp(secs*1000))

var ratingDF = sqlContext.read.format("com.databricks.spark.csv")
  .option("delimiter","\t").option("header", "false").load(s"${dataDir}/u.data")
ratingDF.registerTempTable("rating")
ratingDF = sqlContext.sql("select C0 as user_id, C1 as movie_id, toInt(C2) as rating, secs2ts(C3) as rating_timestamp from rating")
ratingDF.printSchema

writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_ratings")
ratingDF.write.format("solr").options(writeToSolrOpts).save

Hit ctrl-d to execute the Scala code in the paste block. There are a couple of interesting aspects of this code to notice. First, I’m using SQL to select and name the fields I want to insert into Solr from the DataFrames created from the CSV files. Moreover, I can use common SQL functions, such as CONCAT, to perform basic transformations on the data before inserting it into Solr. I also use user-defined functions (UDFs) to perform custom transformations, such as a UDF named “genres” that collapses the genre fields into a multi-valued string field more appropriate for faceting. In a nutshell, you have the full power of Scala and SQL to prepare data for indexing.

Also notice that I’m saving the data into three separate collections and not de-normalizing all this data into a single collection on the Solr side, as is common practice when building a search index. With SparkSQL and streaming expressions in Solr, we can quickly join across multiple collections, so we don’t have to de-normalize to support analytical questions we want to answer with this data set. Of course, it may still make sense to de-normalize to support fast Top-N queries where you can’t afford to perform joins in real-time, but for this blog post, it’s not needed. The key take-away here is that you now have more flexibility in joining across collections in Solr, as well as joining with other data sources using SparkSQL.

Notice how we’re writing the resulting DataFrames to Solr using code such as:

var writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_users")
userDF.write.format("solr").options(writeToSolrOpts).save

Behind the scenes, the spark-solr project uses the schema of the source DataFrame to define fields in Solr using the Schema API. Of course, if you have special needs for specific fields (e.g., custom text analysis), then you’ll need to predefine them before using Spark to insert rows into Solr.
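
For example, if one of your fields needs English text analysis, you could predefine it with a quick Schema API call from the spark-shell before indexing. This is just a sketch: the field name below is hypothetical, and the text_en type assumes the default data-driven configset.

// Sketch: predefine a field via the Solr Schema API before indexing.
// The field name "description_en" is hypothetical; "text_en" ships with the
// default configset in Solr 6.x.
import java.net.{HttpURLConnection, URL}

val schemaUrl = new URL("http://localhost:8983/solr/movielens_movies/schema")
val conn = schemaUrl.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
val body = """{"add-field": {"name": "description_en", "type": "text_en", "indexed": true, "stored": true}}"""
val os = conn.getOutputStream
os.write(body.getBytes("UTF-8"))
os.close()
println(s"Schema API response code: ${conn.getResponseCode}")  // expect 200 on success
conn.disconnect()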

This also assumes that auto soft-commits are enabled for your Solr collections. If auto soft-commits are not enabled, you can do that using the Solr Config API or just include the soft_commit_secs option when writing to Solr, such as:

var writeToSolrOpts = Map("zkhost" -> zkhost, "collection" -> "movielens_users", "soft_commit_secs" -> "10")

One caveat: if the schema of the DataFrame you’re indexing is not correct, then the spark-solr code will create the field in Solr with an incorrect field type. For instance, I didn’t convert the rating field into a numeric type on my first iteration, which resulted in it getting indexed as a string. As a result, I was not able to perform any aggregations on the Solr side, such as computing the average rating of action movies for female reviewers in Boston. Even after correcting the issue on the Spark side, the field was still incorrectly defined in Solr, so I had to use the Solr Schema API to drop and re-create the field definition with the correct data type. The key take-away here is that seemingly minor data type issues in the source data can lead to confusing issues when working with the data in Solr.
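
An explicit cast achieves the same thing as the toInt UDF and is a convenient way to double-check the schema before the first write. Here is a minimal sketch, reusing the ratingDF and zkhost values from the loading example above:

// Sketch: cast suspect columns to the intended type before the first write so
// that spark-solr asks the Schema API for the right field type.
import org.apache.spark.sql.functions.col

val typedRatingDF = ratingDF.withColumn("rating", col("rating").cast("int"))
typedRatingDF.printSchema  // confirm rating now shows up as an integer, not a string

typedRatingDF.write.format("solr")
  .options(Map("zkhost" -> zkhost, "collection" -> "movielens_ratings"))
  .save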

In this example, we’re using Spark’s CSV DataSource, but you can write any DataFrame to Solr. This means that you can read data from any SparkSQL DataSource, such as Cassandra or MongoDB, and write to Solr using the same approach as what is shown here. You can even use SparkSQL as a more performant replacement of Solr’s Data Import Handler (DIH) for indexing data from an RDBMS; we show an example of this in the Performance section below.
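
As a rough illustration of that JDBC-to-Solr flow, here is a sketch; the connection URL, table name, partition bounds, and collection name are placeholders rather than details from the Performance section below:

// Sketch: read a table over JDBC and index it straight into Solr.
// All connection details here are placeholders; the JDBC driver jar (e.g. the
// Postgres driver) must also be on the spark-shell classpath.
val jdbcOpts = Map(
  "url" -> "jdbc:postgresql://DB_HOST:5432/DB_NAME?user=DB_USER&password=DB_PASS",
  "dbtable" -> "some_table",
  "partitionColumn" -> "id",   // a numeric column used to split the read
  "lowerBound" -> "1",
  "upperBound" -> "1000000",
  "numPartitions" -> "24"
)
val jdbcDF = sqlContext.read.format("jdbc").options(jdbcOpts).load

jdbcDF.write.format("solr")
  .options(Map("zkhost" -> zkhost, "collection" -> "some_collection"))
  .save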

Ok, so now you have some data loaded into Solr and everything setup correctly to query from Spark. Now let’s dig into some of the additional features of the spark-solr library that we didn’t cover in the previous blog post.

Analyzing Solr Data with Spark

Before you can analyze data in Solr, you need to load it into Spark as a DataFrame, which was covered in the first blog post in this series. Run the following code in the spark-shell to read the movielens data from Solr:

var ratings = sqlContext.read.format("solr").options(Map("zkhost" -> zkhost, "collection" -> "movielens_ratings")).load
ratings.printSchema
ratings.registerTempTable("ratings")

var users = sqlContext.read.format("solr").options(Map("zkhost" -> zkhost, "collection" -> "movielens_users")).load
users.printSchema
users.registerTempTable("users")
sqlContext.cacheTable("users")

var movies = sqlContext.read.format("solr").options(Map("zkhost" -> zkhost, "collection" -> "movielens_movies")).load
movies.printSchema
movies.registerTempTable("movies")
sqlContext.cacheTable("movies")

Joining Solr Data with SQL

Here is an example query you can send to Solr from the spark-shell to explore the dataset:

sqlContext.sql("""
 | SELECT u.gender as gender, COUNT(*) as num_ratings, avg(r.rating) as avg_rating
 | FROM ratings r, users u, movies m
 | WHERE m.movie_id = r.movie_id
 | AND r.user_id = u.user_id
 | AND m.genre='romance' AND u.age > 30
 | GROUP BY gender
 | ORDER BY num_ratings desc
""".stripMargin).show

NOTE: You may notice a slight delay in executing this query as Spark needs to distribute the spark-solr library to the executor process(es).

In this query, we’re joining data from three different Solr collections and performing an aggregation on the result. To be clear, we’re loading the rows of all three Solr collections into Spark and then relying on Spark to perform the join and aggregation on the raw rows.

Solr 6.x also has the ability to execute basic SQL. But at the time of this writing, it doesn’t support a broad enough feature set to be generally useful as an analytics tool. However, you should think of SparkSQL and Solr’s parallel SQL engine as complementary technologies in that it is usually more efficient to push aggregation requests down into the engine where the data lives, especially when the aggregation can be computed using Solr’s faceting engine. For instance, consider the following SQL query that performs a join on the results of a sub-query that returns aggregated rows.

SELECT m.title as title, solr.aggCount as aggCount
FROM movies m
INNER JOIN (SELECT movie_id, COUNT(*) as aggCount
            FROM ratings
            WHERE rating >= 4
            GROUP BY movie_id
            ORDER BY aggCount desc LIMIT 10) as solr
ON solr.movie_id = m.movie_id
ORDER BY aggCount DESC

It turns out that the sub-query aliased here as “solr” can be evaluated on the Solr side using the facet engine, which as we all know is one of the most powerful and mature features in Solr. The sub-query:

SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating='[4 TO *]' GROUP BY movie_id ORDER BY aggCount desc LIMIT 10

Is effectively the same as doing:

/select?q=*:*
  &fq=rating_i:[4 TO *]
  &facet=true
  &facet.limit=10
  &facet.mincount=1
  &facet.field=movie_id

Consequently, it would be nice if the spark-solr library could detect when aggregations can be pushed down into Solr to avoid loading the raw rows into Spark. Unfortunately, this functionality is not yet supported by Spark (see SPARK-12449). As that feature set evolves in Spark, we’ll add it to spark-solr. However, we’re also investigating using some of Spark’s experimental APIs to weave push-down optimizations into the query planning process, so stay tuned for updates on this soon. In the meantime, you can perform this optimization in your client application by detecting when sub-queries can be pushed down into Solr’s parallel SQL engine and then re-writing your queries to use the results of the push-down operation. We’ll leave that as an exercise for the reader for now and move on to using streaming expressions from Spark.
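
As a starting point for that exercise, a hand-rolled version might look something like the following sketch. It leans on the facet expression support described in the next section, and two things are assumptions you would need to verify against your own setup: the Solr field name used in the filter (rating) and the name of the aggregate column that comes back.

// Sketch: push the "top 10 movies rated 4 or higher" aggregation into Solr via a
// facet streaming expression, then join the small result to movies in SparkSQL.
val topRatedExpr = """
facet(movielens_ratings,
      q="rating:[4 TO *]",
      buckets="movie_id",
      bucketSorts="count(*) desc",
      bucketSizeLimit=10,
      count(*))
"""
val topRated = sqlContext.read.format("solr")
  .options(Map("zkhost" -> zkhost, "collection" -> "movielens_ratings", "expr" -> topRatedExpr))
  .load
topRated.printSchema  // check the exact name of the aggregate column

// Assumption: the aggregate column comes back named "count(*)"; rename it for SQL use.
val topRatedRenamed = topRated.withColumnRenamed("count(*)", "aggCount")
topRatedRenamed.registerTempTable("solr_agg")

sqlContext.sql("""
  SELECT m.title as title, s.aggCount as aggCount
  FROM movies m INNER JOIN solr_agg s ON s.movie_id = m.movie_id
  ORDER BY aggCount DESC
""").show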

Streaming Expressions

Streaming expressions are one of the more exciting features in Solr 6.x. We’ll refer you to the Solr Reference Guide for details about streaming expressions, but let’s take a look at an example showing how to use streaming expressions with Spark:

val streamingExpr = """
parallel(movielens_ratings,
  hashJoin(
    search(movielens_ratings,
           q="*:*",
           fl="movie_id,user_id,rating",
           sort="movie_id asc",
           qt="/export",
           partitionKeys="movie_id"),
    hashed=search(movielens_movies,
                  q="*:*",
                  fl="movie_id,title",
                  sort="movie_id asc",
                  qt="/export",
                  partitionKeys="movie_id"),
    on="movie_id"
  ),
  workers="1",
  sort="movie_id asc"
)
"""
var opts = Map(
  "zkhost" -> zkhost,
  "collection" -> "movielens_ratings",
  "expr" -> streamingExpr
)
var ratings = sqlContext.read.format("solr").options(opts).load
ratings.printSchema
ratings.show

Notice that instead of just reading all rows from the movielens_ratings collection, we’re asking the spark-solr framework to execute a streaming expression and then expose the results as a DataFrame. Specifically in this case, we’re asking Solr to perform a hashJoin of the movies collection with the ratings collection to generate a new relation that includes movie_id, title, user_id, and rating. Recall that a DataFrame is an RDD[Row] and a schema. The spark-solr framework handles turning a streaming expression into a SparkSQL schema automatically. Here’s another example that uses Solr’s facet/stats engine, this time to count the number of movies per genre:

val facetExpr = """
facet(movielens_movies,
      q="*:*",
      buckets="genre",
      bucketSorts="count(*) desc",
      bucketSizeLimit=100,
      count(*))
"""
val opts = Map(
  "zkhost" -> zkhost,
  "collection" -> "movielens_movies",
  "expr" -> facetExpr
)
var genres = sqlContext.read.format("solr").options(opts).load
genres.printSchema
genres.show

Unlike the previous SQL example, the aggregation is pushed down into Solr’s aggregation engine and only a small set of aggregated rows are returned to Spark. Smaller RDDs can be cached and broadcast around the Spark cluster to perform in-memory computations, such as joining to a larger dataset.

There are a few caveats to be aware of when using streaming expressions and spark-solr. First, until Solr 6.2 is released, you cannot use the export handler to retrieve timestamp or boolean fields, see SOLR-9187. In addition, we don’t currently support the gatherNodes stream source as it’s unclear how to map the graph-oriented results into a DataFrame, but we’re always interested in use cases where gatherNodes might be useful.

So now you have the full power of Solr’s query, facet, and streaming expression engines available to Spark. Next, let’s look at one more cool feature that opens up analytics on your Solr data to any JDBC compliant BI / dashboard tool.

Accessing Solr from Spark’s distributed SQL Engine and JDBC

Spark provides a thrift-based distributed SQL engine (built on HiveServer2) to allow client applications to execute SQL against Spark using JDBC. Since the spark-solr framework exposes Solr as a SparkSQL data source, you can easily execute queries using JDBC against Solr. Of course we’re aware that Solr provides its own JDBC driver now, but it’s based on the Solr SQL implementation, which as we’ve discussed is still maturing and does not provide the data type and analytic support needed by most applications.

First, you’ll need to start the thrift server with the --jars option to add the spark-solr shaded JAR to the classpath. In addition, we recommend running the thrift server with the following configuration option to allow multiple JDBC connections (such as those served from a connection pool) to share cached data and temporary tables:

--conf spark.sql.hive.thriftServer.singleSession=true

For example, here’s how I started the thrift server on my Mac.

sbin/start-thriftserver.sh --master local[4] \
  --jars spark-solr/target/spark-solr-2.1.0-shaded.jar \
  --executor-memory 2g \
  --conf spark.sql.hive.thriftServer.singleSession=true \
  --conf spark.driver.extraJavaOptions="-Dsolr.zkhost=localhost:2181/solr610"

Notice I’m also using the spark.driver.extraJavaOptions config property to set the zkhost as a Java system property for the thrift server. This alleviates the need for client applications to pass in the zkhost as part of the options when loading the Solr data source.

Use the following SQL command to initialize the Solr data source to query the movielens_ratings collection:

CREATE TEMPORARY TABLE ratings USING solr OPTIONS ( collection "movielens_ratings" )

Note that the required zkhost property will be resolved from the Java System property I set when starting the thrift server above. We feel this is a better design in that your client application only needs to know the JDBC URL and not the Solr ZooKeeper connection string. Now you have a temporary table backed by the movielens_ratings collection in Solr that you can execute SQL statements against using Spark’s JDBC driver. Here’s some Java code that uses the JDBC API to connect to Spark’s distributed SQL engine and execute the same query we ran above from the spark-shell:

import java.sql.*;

public class SparkJdbc {
  public static void main(String[] args) throws Exception {
    String driverName = "org.apache.hive.jdbc.HiveDriver";
    String jdbcUrl = "jdbc:hive2://localhost:10000/default";
    String jdbcUser = "???";
    String jdbcPass = "???";
    Class.forName(driverName);
    Connection conn = DriverManager.getConnection(jdbcUrl, jdbcUser, jdbcPass);
    Statement stmt = null;
    ResultSet rs = null;
    try {
      stmt = conn.createStatement();
      stmt.execute("CREATE TEMPORARY TABLE movies USING solr OPTIONS (collection \"movielens_movies\")");
      stmt.execute("CREATE TEMPORARY TABLE users USING solr OPTIONS (collection \"movielens_users\")");
      stmt.execute("CREATE TEMPORARY TABLE ratings USING solr OPTIONS (collection \"movielens_ratings\")");
      rs = stmt.executeQuery("SELECT u.gender as gender, COUNT(*) as num_ratings, avg(r.rating) as avg_rating "+
        "FROM ratings r, users u, movies m WHERE m.movie_id = r.movie_id AND r.user_id = u.user_id AND m.genre='romance' "+
        " AND u.age > 30 GROUP BY gender ORDER BY num_ratings desc");
      int rows = 0;
      while (rs.next()) {
        ++rows;
        // TODO: do something with each row
      }
    } finally {
      if (rs != null) rs.close();
      if (stmt != null) stmt.close();
      if (conn != null) conn.close();
    }
  }
}

Data Locality

If the Spark executor and Solr replica live on the same physical host, SolrRDD provides faster query execution time using the Data Locality feature. During the partition creation, SolrRDD provides the placement preference option of running on the same node where the replica exists. This saves the overhead of sending the data across the network between different nodes.

Performance

Before we wrap up this blog post, we wanted to share our results from running a performance experiment to see how well this solution scales. Specifically, we wanted to measure the time taken to index data from Spark to Solr and also the time taken to query Solr from Spark, using the NYC green taxi trip dataset covering 2013-2015. The data was loaded into a Postgres RDS instance in AWS. We used the Solr scale toolkit (solr-scale-tk) to deploy a 3-node Lucidworks Fusion cluster, which includes Apache Spark and Solr. More details are available at https://gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181

Setup
  • 3 EC2 nodes of r3.2xlarge instances running Amazon Linux and deployed in the same placement group
  • Solr nodes and Spark worker processes are co-located together on the same host
  • Solr collection ‘nyc-taxi’ created with 6 shards (no replication)
  • Total number of rows ‘91748362’ in the database
Writing to Solr

The documents are loaded from the RDS instance and indexed to Solr using the spark-shell script. 91.49M rows are indexed to Solr in 49 minutes.

  • Docs per second: 31.1K
  • JDBC batch size: 5000
  • Solr indexing batch size: 50000
  • Partitions: 200
Reading from Solr

The full collection dump from Solr to Spark is performed in two ways. To be able to test the streaming expressions, we chose a simple query that only uses fields with docValues. The result set includes all the documents present in the ‘nyc-taxi’ collection (91.49M).

Deep paging with split queries using Cursor Marks
  • Docs per second (per task): 6350
  • Total time taken: 20 minutes
  • Partitions: 120
Streaming using the export handler
  • Docs per second (per task): 108.9k
  • Total time taken: 2.3 minutes
  • Partitions: 6

Full data dumps into Solr via Spark and its JDBC data source are faster than the traditional DIH approach. Streaming using the export handler is ~10 times faster than traditional deep paging; using docValues gives us this performance benefit.

We hope this post gave you some insights into using Apache Spark and Apache Solr for analytics. There are a number of other interesting features of the library that we could not cover in this post, so we encourage you to explore the code base in more detail: github.com/lucidworks/spark-solr. Also, if you’re interested in learning more about how to use Spark and Solr to address your big data needs, please attend Timothy Potter’s talk at this year’s Lucene Revolution: Your Big Data Stack is Too Big.

The post Solr as SparkSQL DataSource, Part II appeared first on Lucidworks.com.

Tara Robertson: update on On Our Backs and Reveal Digital

planet code4lib - Tue, 2016-08-16 03:18

In March I wrote a post outlining the ethical issues of Reveal Digital digitizing On Our Backs, a lesbian porn magazine. Last week I spoke at code4lib NYS and shared examples of where libraries have digitized materials where they really shouldn’t have. My slides are online, and here’s a PDF of the slides and notes. Also: Jenna Freedman and I co-hosted a #critlib discussion on digitization ethics.

Susie Bright’s papers in Cornell’s Rare Book and Manuscripts Collection

A couple of weeks before code4lib NYS, I learned that Cornell has Susie Bright’s papers, which include some of the administrative records for On Our Backs. When I was at Cornell I visited the Rare Book and Manuscripts Collection and looked through this amazing collection. The first book of erotica I ever bought was Herotica, edited by Susie Bright, so it was especially amazing to see her papers. It was so exciting to see photo negatives or photos of images that became iconic for lesbians either in On Our Backs, or on the covers of other books. While the wave of nostalgia was fun, the purpose of my visit was to see if the contracts with the contributors were in the administrative papers.

I hit the jackpot when I found a thin folder labelled Contributors Agreements. Not all of them were there, but there were many contracts where the content creators did not sign over all rights to the magazine. Here are three examples.

This contributor contract from 1991 is for “one-time rights only”.

This contributor contract from 1988 is for “1st time N.A. serial rights”. In this context N.A. means North American. 

This contributor’s contract from 1985 is “for the period of one year, beginning 1.1.86”. 

Copyright and digitizing On Our Backs

Initially I thought that Reveal Digital had proper copyright clearances to put this content online. In addition to the above contributors contract examples, I talked to someone who modeled for On Our Backs (see slides 9 to 11 for model quotes) who said there was an agreement with the editor that the photo shoot would never appear online. These things make me wonder if the perceived current rights holder of this defunct magazine actually had the rights to grant to Reveal Digital to put this content online.

I’m still puzzled by Reveal Digital’s choice of a Creative Commons Attribution (CC-BY) license. One of the former models describes how inappropriate this license is and, more worrisome, the lack of her consent in making this content available online.

People can cut up my body and make a collage. My professional and personal life can be high jacked. These are uses I never intended and still don’t want.

Response from Reveal Digital

Last week I spoke with Peggy Glahn, Program Director and part of the leadership team at Reveal Digital. She updated me on some of Reveal Digital’s responses to my critiques.

Takedown policy and procedures

Peggy informed me that they had received a takedown request and will be redacting some content; with their workflow, it takes about three weeks to make those changes. She also said that they’ll be posting their takedown policy and process on their website, but that there are technical challenges with their digital collections platform. It shouldn’t be difficult to link to an HTML page with the takedown policy, procedures, and contact information, so I’m not sure why this is a technical challenge. In the meantime, people can email Tech.Support@revealdigital.com with takedown requests. Reveal Digital will “assess each request on a case-by-case basis”.

Not removing this collection

I am really disappointed to hear that Reveal Digital does not have plans to take down this entire collection. Peggy spoke about a need to balance the rights of people accessing this collection with individual people’s right to privacy. It was nice to hear that they recognize that lesbian porn from the 80s and 90s differs from historical newspapers, both in content and in relative age. However, putting both types of collections on the web in the same way suggests a shallow understanding of those differences.

Peggy mentioned that Reveal Digital had consulted the community and made the decision to leave this collection online. I asked who the community was in this case and she answered that the community was the libraries who are funding this initiative. This is an overly narrow definition of community, which is basically the fiscal stakeholders (thanks Christina Harlow for this phrase). If you work at one of these institutions, I’d love to hear what the consultation process looked like.

Community consultation is critical

As this is porn from the lesbian community in the 80s and 90s it is important that these people are consulted about their wishes and desires. Like most communities, I don’t think the lesbian and queer women’s community has ever agreed on anything, but it’s important that this consultation takes place. It’s also important to centre the voices of the queer women whose asses are literally on the page and respect their right to keep this content offline. I don’t have quick or simple solutions on how this can happen, but this is the responsibility that one takes on when you do a digitization project like this.

Learning from the best practices of digitizing traditional knowledge

The smart folks behind the Mukurtu project and Local Contexts (including the Traditional Knowledge labels) are leading the way in digitizing content in culturally appropriate and ethical ways. Reveal Digital could look at the thoughtful work that’s going on around the ethics of digitizing traditional knowledge as a blueprint for providing the right kind of access to the right people. The New Zealand Electronic Text Centre has also put out a thoughtful paper outlining the consultation process and project outcomes for how they digitized the historic text Moko; or Maori tattooing.

After talking to several models who appeared in On Our Backs, a common thread was that they did not consent to having their bodies online and that this posed a risk to their careers. Keeping this collection online is an act of institutional violence against the queer women who do not want this extremely personal information about themselves to be so easily accessible online.

Librarians–we need to do better.

DuraSpace News: PROGRAMS available: DLF’s Liberal Arts Colleges Pre-Conf, 2016 DLF Forum, and NDSA’s Digital Preservation 2016

planet code4lib - Tue, 2016-08-16 00:00

From Bethany Nowviskie, Director of the Digital Library Federation (DLF) at CLIR, on behalf of organizing committees and local hosts for Liberal Arts Colleges Pre-Conference, 2016 DLF Forum, and NDSA Digital Preservation 2016

DuraSpace News: Fedora Project in Australia this Fall

planet code4lib - Tue, 2016-08-16 00:00

Austin, TX  Traveling to Melbourne, Australia for the eResearch Australasia Conference 2016?

Mita Williams: The Hashtag Syllabus: Part Two

planet code4lib - Mon, 2016-08-15 19:33

Last week I finally uploaded a bibliography of just under 150 items from the Leddy Library that could be found in the #BlackLivesCDNSyllabus that has been circulating on Twitter since July 5th. In this post, I will go into some technical detail about why it took me so long to do this.

For the most part, the work took time simply because there were lots of items from the original list, collected by Monique Woroniak in a Storify collection, that needed to be imported into Zotero. I’m not exactly sure how many items are in that list, but my original Zotero library of materials contains 220 items.

Because I’ve made this library public, you can open Zotero while on the page and download all or just some of the citations that I’ve collected.

I transferred the citations into Zotero because I wanted to showcase how citations could be repurposed using its API as well as through its other features. I’m a firm believer in learning by doing because sometimes you only notice the low beam once you’ve hit your head. In this case, it was only when I tried to reformat my bibliography using Zotero’s API that I learned the API has a limit of 150 records.

(This is why I decided to showcase primarily the scholarly works in the “Leddy Library” version of the #BlackLivesCDNSyllabus and cut down the list to below 150 by excluding websites, videos, and musical artists.)

One of the most underappreciated features of Zotero is its API.

To demonstrate its simple power: here’s the link to the Leddy Library #BlackLivesCDNSyllabus using the API in which I’ve set the records to be formatted using the MLA Style: https://api.zotero.org/groups/609764/collections/V7E2UPJP/items/top?format=bib&style=mla [documentation]

You can embed this code into a website using jQuery like so:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Leddy Library #BlackLivesCDNSyllabus</title>
  <style>
  body {
    font-size: 12px;
    font-family: Arial;
  }
  </style>
  <script src="https://code.jquery.com/jquery-1.10.2.js"></script>
</head>
<body>
  <h1>Leddy Library #BlackLivesCDNSyllabus</h1>
  <p>
  <div id="a"></div>
  <script>
  $( "#a" ).load("https://api.zotero.org/groups/609764/collections/V7E2UPJP/items/top?format=bib&style=mla" );
  </script>
</body>
</html>

The upshot of using the API is that when you need to update the bibliography, any additions to your Zotero group will automatically be reflected through the API: you don’t need to update the website manually.

For my purposes, I didn’t want to use Zotero to generate just a bibliography: I wanted it to generate a list of titles and links so that a user could travel directly from a bibliographic entry to the Leddy Library catalogue to see if and where a book was waiting on a shelf in the Leddy Library.

Now, I know that’s not the purpose of a bibliography – a bibliography presents identifying information about a work and it doesn’t have to tell you where it is located (unless, of course, that item is available online, then, why wouldn’t you?).  Generally you don’t want to embed particular information such as links to your local library catalogue into your bibliography precisely because that information makes your bibliography less useful to everyone else who isn’t local.

The reason why I wanted to include direct links to material is largely because I believe our library catalogue’s OpenURL resolver has been realized so poorly that it is actually harmful to the user experience. You see, if you use our resolver while using Google Scholar to find an article, the resolver works as it should.

But if the reader is looking for a book, the resolver states that there is No full text available — even when the library currently has the book on the shelf (this information is under the holdings tab).

In order to ensure that book material would be found without ambiguity, our library’s co-op student and I manually added URLs pointing directly to the respective catalogue record for each of the 150 or so Zotero entries in our #BlackLivesCDNSyllabus collection. This took some time.

Now all I had to do was create a blog entry that included the bibliography…

I will now explain two ways you can re-purpose the display of Zotero records for your own use.

The first method I investigated was the creation of my own Zotero Citation Style. Essentially, I took an existing citation style and then added the option to include the call number and the URL field using the Visual Citation Style Editor, a project that was the result of a collaboration between Columbia University Libraries and Mendeley some years ago.


I took my now-customized citation style, uploaded it to a server, and now I can use it as my own style whenever I need it:

https://api.zotero.org/groups/609764/collections/V7E2UPJP/items/top?format=bib&style=http://librarian.aedileworks.com/leddylibrarybibliography.csl


I can now copy this text and paste it into my library’s website ‘blog form’, and all the URLs will automatically turn into active links.

There’s another method to achieve the same ends but in an even easier way. Zotero has an option called Reports that allows you to generate a printer-friendly report of a collection of citations.

Unfortunately, the default view of the report is to show you every single field that has information in it. Luckily there is the Zotero Reports Customizer which allows one to limit what’s shown in the report:

There’s only one more hack left to mention. While the Zotero Report Customizer was invaluable, it doesn’t allow you to remove the link from each item’s title. The only option seemed to be removing the almost 150 links by hand…

Luckily the text editor Sublime Text has an amazing power: Quick Find All — which allows the user to select all the matching text at once.

Then, after I had the beginning of all the links selected, I used the ‘Expand selection to quotes’ option that you can add to Sublime Text via Package Control and then removed the offending links. MAGIC!

The resulting HTML was dropped into my library’s Drupal-driven blog form and results in a report that looks like this:

Creating and sharing bibliographies and lists of works from our library catalogues should not be this hard.

It should not be so hard for people to share their recommendations of books, poets, and creative works with each other.

It all brings to mind this passage by Paul Ford from his essay The Sixth Stage of Grief Is Retro-computing:

Technology is What We Share

Technology is what we share. I don’t mean “we share the experience of technology.” I mean: By my lights, people very often share technologies with each other when they talk. Strategies. Ideas for living our lives. We do it all the time. Parenting email lists share strategies about breastfeeding and bedtime. Quotes from the Dalai Lama. We talk neckties, etiquette, and Minecraft, and tell stories that give us guidance as to how to live. A tremendous part of daily life regards the exchange of technologies. We are good at it. It’s so simple as to be invisible. Can I borrow your scissors? Do you want tickets? I know guacamole is extra. The world of technology isn’t separate from regular life. It’s made to seem that way because of, well…capitalism. Tribal dynamics. Territoriality. Because there is a need to sell technology, to package it, to recoup the terrible investment. So it becomes this thing that is separate from culture. A product.

Let’s not make sharing just another product that we have to buy from a library vendor. Let’s remember that sharing is not separate from culture.

This is the second part of a series called The Hashtag Syllabus. Part One is a brief examination of the recent phenomenon of generating and capturing crowdsourced syllabi on Twitter and Part Three looks to Marshall McLuhan and Patrick Wilson for comment on the differences between a library and a bibliography.

LibUX: On ethics in the digital divide

planet code4lib - Mon, 2016-08-15 17:02

People without a good understanding of the tech ecosystem are vulnerable to people who want to sell them things and can’t properly evaluate what they are being sold. — Jessamyn West

SearchHub: Parallel Computing in SolrCloud

planet code4lib - Mon, 2016-08-15 16:30

As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Joel Bernstein’s session about Parallel Computing in SolrCloud.

This presentation provides a deep dive into SolrCloud’s parallel computing capabilities – breaking down the framework into four main areas: shuffling, worker collections, the Streaming API, and Streaming Expressions. The talk describes how each of these technologies work individually and how they interact with each other to provide a general purpose parallel computing framework.

Also included is a discussion of some of the core use cases for the parallel computing framework. Use cases involving real-time map reduce, parallel relational algebra, and streaming analytics will be covered.

Joel Bernstein is a Solr committer and search engineer for the open source ECM company Alfresco.

Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco from Lucidworks

Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Parallel Computing in SolrCloud appeared first on Lucidworks.com.

Islandora: Meet Your New Technical Lead

planet code4lib - Mon, 2016-08-15 13:23

Hi, I'm Danny, and I'm the newest hire of the Islandora Foundation. My role within the Foundation is to serve as Technical Lead, and I want to take some time to introduce myself and inform everyone of just exactly what I'll be doing for them.

I guess for starters, I should delve a bit into my background. Since a very young age, I've always considered myself to be pretty nerdy. As soon as I learned how to read, my father had me in front of a 386 and taught me DOS. In high school, I discovered Linux and was pretty well hooked. It was around that time I started working with HTML, and eventually javascript and Flash. I graduated with honors from East Tennessee State University with a B.Sc. in Mathematics and a Physics minor, and was exposed to a lot of C++ and FORTRAN. I specialized in Graph Theory, which I didn't think at the time would lead to a career as a programmer, since I had decided to be an actuary after completing school. Fast forward a few years, and I have a couple actuarial exams under my belt and have become well versed in VBA programming and Microsoft Office. But I didn't really like it, and wanted to do more than spreadsheet automation. So I moved to Canada and went back to school for Computer Science, but quickly found my way into the workforce for pragmatic reasons (read: I had bills to pay). I managed to score a job in the small video game industry that's evolved on PEI. I did more Flash (sadly) but was also exposed to web frameworks like Ruby on Rails and Django. A lot of my time was spent writing servers for Facebook games, and tackled everything from game logic to payment systems. But that career eventually burned me out, as it eventually does to most folks, and I applied for a position at a small company named discoverygarden that I heard was a great place to work.

And that's how I first met Islandora. I was still pretty green for the transition from Drupal 6 to 7, but shortly after the port I was able to take on more meaningful tasks. After learning tons about the stack while working with discoverygarden, I was given the opportunity to work on what would eventually become CLAW. And now I'm fortunate enough to have the opportunity to see that work through as an employee of the Islandora Foundation. So before I start explaining my duties as Tech Lead, I'd really like to thank the Islandora Foundation for hiring me, and discoverygarden for helping me gain the skills I needed to grow into this position.

Now, as is tradition in Islandora, I'll describe my roles as hats.  I'm essentially wearing three of them:

  • Hat #1: 7.x-1.x guy 
    1. We have increasingly well-defined processes and workflows, and I'm committed to making sure those play out the way they should. If, for whatever reason, a pull request has sat for too long and the community hasn't responded, I will make sure it is addressed: I will try to find a community member who has the time and interest to look at it, and if that's not possible, I will review it myself.
    2. I will take part in and help chair committers' calls every other Thursday.
    3. I will attend a handful of Interest Group meetings.  There are too many for me to attend them all, but I'm starting with the IR and Security interest groups.
    4. Lastly, I will be serving as Release Manager for the next release, and will be working towards continuing to document and standardize the process to the best of my abilities, so that it's easier for other community members to take part in and lead that process from here on out.

  • Hat #2: CLAW guy
    1. We're currently in the process of pivoting from a Drupal 7 to a Drupal 8 codebase, and I'm going to be shepherding that process as transparently as possible.  This means I will be working with community members to develop a plan for the minimum viable product (or MVP for short).  This will help defend against scope creep, and force ourselves as a community to navigate what all these changes mean.  Between Fedora 4, PCDM, and Drupal 8, there's a lot that's different, and we need to all be on the same page.  For everyone's sake, this work will be conducted as much as possible by conversations through the mailing lists, instead of solely at the tech calls.  In the Apache world, if it doesn't happen on the list, it never happened.  And I think that's a good approach to making sure people can at least track down the rationale for why we're doing certain things in the codebase.
    2. Using the MVP, I will be breaking down the work into the smallest consumable pieces possible.  In the past few years I've learned a lot about working with volunteer developers, and I fully understand that people have day jobs with other priorities.  By making the units of work as small as possible, we have a better chance of receiving more contributions from interested community members.  In practice, I think this means I will be writing a lot of project templates to decrease ramp-up time for people, filling in boilerplate, and ideally even providing tests beforehand.
    3. I will be heavily involved in Sprint planning, and will be running community sprints for CLAW.
    4. I will be chairing and running CLAW tech calls, along with Nick Ruest, the CLAW Project Director.

  • Hat #3: UBC project guy
    • As part of a grant, the Foundation is working with the University of British Columbia Library and Press to integrate Fedora 4 with a front-end called Scalar. They will also be using CLAW as a means of ingesting multi-pathway books. So I will be overseeing contractor work for the integration with Scalar, while also pushing CLAW towards full book support.

I hope that's enough for everyone to understand what I'll be doing for them, and how I can be of help if anyone needs it.  If you need to get in touch with me, I can be found on the lists, in #islandora on IRC as dhlamb, or at dlamb@islandora.ca.  I look forward to working with everyone in the future to help continue all the fantastic work that's been done by everyone out there.


LibUX: Content Style Guide – University of Illinois Library

planet code4lib - Mon, 2016-08-15 00:36

The University of Illinois Library has made their content style guide available under a Creative Commons license. I feel like I point to Suzanne Chapman's work more than anyone else's. I saw her credited in the site's footer and thought, "oh, yeah – of course." She's pretty great.

One of the best ways to ensure that our website is user-friendly is to follow industry best practices, keep the content focused on key user tasks, and keep our content up-to-date at all times.

Also, walk through their 9 Principles for Quality Content.

We have many different users with many different needs. They are novice and expert users, desktop and mobile users, people with visual, hearing, motor, or cognitive impairments, non-native English speakers, and users with different cultural expectations. Following these guidelines will help ensure a better experience for all our users. They will also help us create a more sustainable website.

  1. Content is in the right place
  2. Necessary, needed, useful, and focused on patron needs
  3. Unique
  4. Correct and complete
  5. Consistent, clear, and concise
  6. Structured
  7. Discoverable and makes sense out of context
  8. Sustainable (future-friendly)
  9. Accessible

All of these are elaborated and link out to a rabbit-hole of further reading.

Jonathan Rochkind: UC Berkeley Data Science intro to programming textbook online for free

planet code4lib - Sat, 2016-08-13 22:36

Looks like a good resource for library/information professionals who don't know how to program but want to learn a little bit of programming along with (more importantly) computational and inferential thinking, to understand the technological world we work in, as well as for those who want to learn 'data science'!

http://www.inferentialthinking.com/

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study—whether it’s the Earth’s weather, the world’s markets, political polls, or the human mind—data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them—a programming language.



Jenny Rose Halperin: Hello world!

planet code4lib - Fri, 2016-08-12 22:30

Welcome to WordPress. This is your first post. Edit or delete it, then start writing!

LITA: Using Text Editors in Everyday Work

planet code4lib - Fri, 2016-08-12 14:35

In the LITA Blog Transmission featuring yours truly, I fumbled in trying to explain a time-logging feature in Notepad, the native Windows text editor. You can see in the screenshot above that the syntax is .LOG, placed at the top of the file. Then every time you open the file, Notepad adds a time stamp at the end and places the cursor there so you can begin writing. This specific file is where I keep track of my continuing education work. Every time I finish a webinar or class, I open this file and write it down (I'll be honest, I've missed a few). At the end of the year I'll have a nice document with dates and times that I can use to write my annual report.
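If you want to try it, a continuing-education log built this way ends up looking something like the sketch below; the exact time-stamp format follows your Windows regional settings, and the entries here are invented examples.

.LOG
1:05 PM 6/14/2016
Finished LITA webinar on universal design.
9:30 AM 8/12/2016
Completed week two of the online metadata course.

Each time stamp was added automatically by Notepad when the file was opened; only the line underneath it was typed by hand.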

I use Microsoft Notepad several times a day. In addition to its cool logging function I find it a dependable, simple text editor. Notepad is great for distraction-free, no-frills writing. If I have to copy and paste something from the web into a document—or from one document to another—and I don’t want to drag in all the formatting from the source I put the text in Notepad first. It cleans all the formatting off the text and lets me use it as I need. I use it as a quick way to create to-do lists or jot down notes while working on something else. It launches quickly and lets me get to work right away. Plus, if you delete all the text you’ve written—after you’re done working of course—you can close Notepad and there’s no dialog box asking if you want to save the file.

Prior to becoming a librarian I worked as a programmer writing code. Every single coder I worked with used Notepad to create, revise, and edit code. Sure, you can work in the program you’re writing and your office’s text editor—and you often do; we used something like the vi text editor—but sometimes you need to think through your code and you can’t do that in an executable. I used to have several Notepad files of handy code so that I could reference it quickly without needing to search through source code for it.

I’ve been thinking about Notepad more and more as I prepare for a coding program at my library. A good text editor is essential to writing code. Once you start using one you’ll find yourself reaching for it all the time. But it isn’t all Notepad all the time. If I actually have to troubleshoot code—which these days is mostly things in WordPress—I use Notepad++:

You can see the color highlighting that Notepad++ uses, which is a great visual way to spot problems in your code without even reading it. It also features a document map, a high-level view of your entire document on the right-hand side of the screen that highlights where you are in the code, and a function list that shows all the functions called in the file. Notepad++ has some other cool text editor features like multi-editing (editing in several places in the file at the same time) and column mode editing (where you can select a column of text to edit instead of entire lines of code). It's a very handy tool when you're working on code.

These are not the only text editors out there. A quick search for lists of text editors gives you more choices than you need. Notepad++ is at the top of several lists and I have to say that I like it better than others I’ve tried. The best thing is most of these text editors are free so they’re easy to try out and see what works for you. They all have very similar feature sets so it often comes down to the user interface. While these two options are Windows operating system only, there are plenty of good text editors for Mac users, too.

Text editors won't be the starting point for my coding program. We'll focus on some non-tech coding exercises, some online tools like Scratch or Tynker, and some physical items like Sphero or LEGO Mindstorms. While these are geared towards children, they are great for adults who have never interacted with code. (Sphero and Mindstorms do have a cost associated with them.) When I get to the point in our coding program where I want to talk about text editors, I'll focus on Notepad and Notepad++ but let people know there are other options. If I know my patrons, they'll have suggestions for me.

Do you have any cool tips for your favorite text editor or perhaps just a recommendation?

SearchHub: Pivoting to the Query: Using Pivot Facets to build a Multi-Field Suggester

planet code4lib - Fri, 2016-08-12 13:43

Suggesters, also known as autocomplete, typeahead or "predictive search," are powerful ways to accelerate the conversation between user and search application. Querying a search application is a little like a guessing game – the user formulates a query that they hope will bring back what they want – but sometimes there is an element of "I don't know what I don't know" – so the initial query may be a bit vague or ambiguous. Subsequent interactions with the search application are sometimes needed to "drill in" to the desired information. Faceted navigation and query suggestions are two ways to ameliorate this situation. Facets generally work after the fact – after an initial attempt has been made – whereas suggesters seek to provide feedback in the act of composing the initial query, to improve its precision from the start. Facets also provide a contextual multi-dimensional visualization of the result set that can be very useful in the "discovery" mode of search.

A basic tenet of suggester implementations is to never suggest queries that will not bring back results. To do otherwise is pointless (it also does not inspire confidence in your search application!). Suggestions can come from a number of sources – previous queries that were found to be popular, suggestions that are intended to drive specific business goals and suggestions that are based on the content that has been indexed into the search collection. There are also a number of implementations that are available in Solr/Lucene out-of-the-box.

My focus here is on providing suggestions that go beyond the single term query – that provide more detail on the desired results and combine the benefits of multi-dimensional facets with typeahead. Suggestions derived from query logs can have this context but these are not controlled in terms of their structure. Suggestions from indexed terms or field values can also be used but these only work with one field at a time. Another focus of this and my previous blogs is to inject some semantic intelligence into the search process – the more the better. One way to do that is to formulate suggestions that make grammatical sense – constructed from several metadata fields – that create query phrases that clearly indicate what will be returned.

So what do I mean by “suggestions that make grammatical sense”? Just that we can think of the metadata that we may have in our search index (and if we don’t have, we should try to get it!) as attributes or properties of some items or concepts represented by indexed documents. There are potentially a large number of permutations of these attribute values, most of which make no sense from a grammatical perspective. Some attributes describe the type of thing involved (type attributes), and others describe the properties of the thing. In a linguistic sense, we can think of these as noun and adjective properties respectively.

To provide an example of what I mean, suppose that I have a search index about people and places. We would typically have fields like first_name, last_name, profession, city and state. We would normally think of these fields in this order, or maybe as last_name, first_name – city, state – profession, as in:

Jones, Bob – Cincinnati, Ohio – Accountant

Or

Bob Jones, Accountant, Cincinnati, Ohio

But we would generally not use:

Cincinnati Accountant Jones Ohio Bob

Even though this is a valid mathematical permutation of field value ordering. So if we think of all of the possible ways to order a set of attributes, only some of these “make sense” to us as “human-readable” renderings of the data.

Turning Pivot Facets “Around” – Using Facets to generate query suggestions

While facet values by themselves are a good source of query suggestions because they encapsulate a record's "aboutness", they can only do so one attribute at a time. This level of suggestion is already available out-of-the-box with Solr/Lucene Suggester implementations, which use the same field value data that facets do in the form of a so-called uninverted index (aka the Lucene FieldCache or indexed Doc Values). But what if we want to combine facet fields as above? Solr pivot facets (see "Pivot Facets Inside And Out" for background) provide one way of combining an arbitrary set of fields to produce cascading or nested sets of field values. Think of it as a way of generating a facet value "taxonomy" – on the fly. How does this help us? Well, we can use pivot facets (at index time) to find all of the permutations for a compound phrase "template" composed of a sequence of field names – i.e. to build what I will call "facet phrases". Huh? Maybe an example will help.

Suppose that I have a music index, which has records for things like songs, albums, musical genres and the musicians, bands or orchestras that performed them as well as the composers, lyricists and songwriters that wrote them. I would like to search for things like “Jazz drummers”, “Classical violinists”, “progressive rock bands”, “Rolling Stones albums” or “Blues songs” and so on. Each of these phrases is composed of values from two different index fields – for example “drummer”, “violinist” and “band” are musician or performer types. “Rolling Stones” are a band which as a group is a performer (we are dealing with entities here which can be single individuals or groups like the Stones). “Jazz”, “Classical”, “Progressive Rock” and “Blues” are genres and “albums” and “songs” are recording types (“song” is also a composition type). All of these things can be treated as facets. So if I create some phrase patterns for these types of queries like “musician_type, recording_type” or “genre, musician_type” or “performer, recording_type” and submit these as pivot facet queries, I can construct many examples of the above phrases from the returned facet values. So for example, the pivot pattern “genre, musician_type” would return things like, “jazz pianist”, “rock guitarist”, “classical violinist”, “country singer” and so on – as long as I have records in the collection for each of these category combinations.
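To make the mechanics concrete, a pivot pattern like "genre, musician_type" translates into a single facet request along these lines (the collection name here is an invented placeholder, and URL encoding is omitted for readability):

/solr/music/select?q=*:*&rows=0&facet=true&facet.limit=-1&facet.pivot=genre,musician_type

The response nests musician_type counts under each genre value, so every non-empty leaf pair ("Jazz" with "drummer", for example) yields a candidate phrase ("Jazz drummer") that is guaranteed to match at least one record.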

Once I have these phrases, I can use them as query suggestions by indexing them into a collection that I use for this purpose. It would also be nice if the precision that I am building into my query suggestions was honored at search time. This can be done in several ways. When I build my suggester collection using these pivot patterns, I can capture the source fields and send them back with the suggestions. This would enable precise filter or boost queries to be used when they are submitted by the search front end. One potential problem here is if the user types the exact same query that was suggested – i.e. does not select from the typeahead dropdown list. In this case, they wouldn’t get the feedback from the suggester but we want to ensure that the results would be exactly the same.

The query autofiltering technique that I have been developing and blogging about is another solution for matching the precision of the response with the added precision of these multi-field queries. It would work whether or not the user clicked on a suggestion or typed in the phrase themselves and hit "enter". Some recent enhancements to this code, which enable it to respond to verbs, prepositions, and adjectives and to adjust the "context" of the generated filter or boost query, provide another layer of precision that we can use in our suggestions. That is, suggestions can be built from templates or patterns in which we can add "filler" terms such as the verbs, prepositions and adjectives that the query autofilter now supports.

Once again, an example may help to clear up the confusion. In my music ontology, I have attributes for “performer” and “composer” on documents about songs or recordings of songs. Many artists whom we refer to as “singer-songwriters” for example, occur as both composers and performers. So if I want to search for all of their songs regardless of whether they wrote or performed them, I can search for something like:

Jimi Hendrix songs

If I want to just see the songs that Jimi Hendrix wrote, I would like to search for

“songs Jimi Hendrix wrote” or “songs written by Jimi Hendrix”

which should return titles like “Purple Haze”, “Foxy Lady” and “The Wind Cries Mary”

In contrast the query:

“songs Jimi Hendrix performed”

should include covers like “All Along the Watchtower” (for your listening pleasure, here’s a link), “Hey Joe” and “Sgt Peppers Lonely Hearts Club Band”

and

“songs Jimi Hendrix covered”

would not include his original compositions.

In this case, the verb phrases “wrote” or “written by”, “performed” or “covered” are not field values in the index but they tell us that the user wants to constrain the results either to compositions or to performances. The new features in the query autofilter can handle these things now but what if we want to make suggestions like this?

To do this, we write pivot template patterns like these:

${composition_type} ${composer} wrote

${composition_type} written by ${composer}

${composition_type} ${performer} performed

Code to do Pivot Facet Mining

The source code to build multi-field suggestions using pivot facets is available on github. The code works as a Java Main client that builds a suggester collection in Solr.

The design of the suggester builder includes one or more “query collectors” that feed query suggestions to a central “suggester builder” that a) validates the suggestions against the target content collection and b) can obtain context information from the content collection using facet queries (see below). One of the implementations of query collector is the PivotFacetQueryCollector. Other implementations can get suggestions from query logs, files, Fusion signals and so on.

The github distribution includes the music ontology dataset that was used for this blog article and a sample configuration file to build a set of suggestions on the music data. The ontology itself is also on github as a set of XML files that can be used to create a Solr collection but note that some preprocessing of the ontology was done to generate these files. The manipulations that I did on the ontology to ‘denormalize’ or flatten it will be the subject of a future blog post as it relates to techniques that can be used to ‘surface’ interesting relationships and make them searchable without the need for complex graph queries.
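As a minimal sketch of the same idea (this is not the actual github code, and the collection name, field names, and class name are placeholders), mining one pivot pattern into candidate phrases with SolrJ looks roughly like this:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.PivotField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;
import java.util.ArrayList;
import java.util.List;

public class PivotPhraseMiner {
  public static void main(String[] args) throws Exception {
    // Placeholder content collection; swap in your own Solr URL and collection name.
    HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/music").build();

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                                  // we only want the facets, not documents
    q.setFacet(true);
    q.setFacetLimit(-1);                           // all values, not just the top N
    q.addFacetPivotField("genre,musician_type");   // the "pivot pattern"

    QueryResponse rsp = solr.query(q);
    NamedList<List<PivotField>> pivots = rsp.getFacetPivot();

    List<String> phrases = new ArrayList<>();
    for (int i = 0; i < pivots.size(); i++) {
      for (PivotField genre : pivots.getVal(i)) {        // e.g. "Jazz"
        if (genre.getPivot() == null) continue;
        for (PivotField type : genre.getPivot()) {       // e.g. "drummer"
          phrases.add(genre.getValue() + " " + type.getValue());  // "Jazz drummer"
        }
      }
    }
    phrases.forEach(System.out::println);                // these would be indexed into the suggester collection
    solr.close();
  }
}

Every phrase produced this way is backed by at least one document in the content collection, which satisfies the "never suggest queries that will not bring back results" rule from the start.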

Using facets to obtain more context about the suggestions

The notion of “aboutness” introduced above can be very powerful. Once we commit to building a special Solr collection (also known as a ‘sidecar’ collection) just for typeahead, there are other powerful search features that we now have to work with. One of them is contextual metadata. We can get this by applying facets to the query that the suggester builder uses to validate the suggestion against the content collection. One application of this is to generate security trimming ACL values for a suggestion by getting the set of ACLs for all of the documents that a query suggestion would hit on – using facets. Once we have this, we can use the same security trimming filter query on the suggester collection that we use on the content collection. That way we never suggest a query to a user that cannot return any results for them – in this case because they don’t have access to any of the documents that the query would return. Another thing we can do when we build the suggester collection is to use facets to obtain context about various suggestions. As discussed in the next section, we can use this context to boost suggestions that share contextual metadata with recently executed queries. 
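As a hedged illustration of such a validation-plus-context query (the acl_ss field is hypothetical; genres_ss and composer_ss are fields that appear later in this post; URL encoding is omitted for readability):

/solr/music/select?q="jazz drummer"&rows=0&facet=true&facet.field=genres_ss&facet.field=composer_ss&facet.field=acl_ss

A numFound of zero tells the builder to drop the suggestion, while the facet counts supply both the contextual metadata and the ACL values to store on the suggester record.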

Dynamic or On-The-Fly Predictive Analytics

One of the very powerful and extremely user-friendly things that you can do with typeahead is to make it sensitive to recently issued queries. Typeahead is one of those use cases where getting good relevance is critical because the user can only see a few results and can’t use facets or paging to see more. Relevance is often dynamic in a search session meaning that what the user is looking for can change – even in the course of a single session. Since typeahead starts to work with only a few characters entered, the queries start at a high level of ambiguity. If we can make relevance sensitive to recently searched things we can save the user a lot of a) work and b) grief. Google seems to do just this. When I was building the sample Music Ontology, I was using Google and Wikipedia (yes, I did contribute!) to lookup songs and artists and to learn or verify things like who was/were the songwriter(s) etc. I found that if I was concentrating on a single artist or genre, after a few searches, Google would start suggesting long song titles with just two or three characters entered!! It felt as if it “knew” what my search agenda was! Honestly, it was kinda spooky but very satisfying.

So how can we get a little of Google's secret sauce in our own typeahead implementations? Well the key here is context. If we can know some things about what the user is looking for we can do a better job of boosting things with similar properties. And we can get this context using facets when we build the suggestion collection! In a nutshell, we can use facet field values to build boost queries to use in future queries in a user session. The basic data flow is shown below:

[Figure: data flow between the content collection (source of facet-based context metadata), the suggester builder, the suggester collection, and the typeahead front end, which caches metadata from selected suggestions and sends it back as boost queries.]

This requires some coordination between the suggester builder and the front-end (typically Javascript based) search application. The suggester builder extracts context metadata for each query suggestion using facets obtained from the source or content collection and stores these values with the query suggestions in the suggester collection. To demonstrate how this contextual metadata can be used in a typeahead app, I have written a simple Angular JS application that uses this facet-based metadata in the suggester collection to boost suggestions that are similar to recently executed queries. When a query is selected from a typeahead list, the metadata associated with that query is cached and used to construct a boost query on subsequent typeahead actions.

So, for example if I type in the letter ‘J’ into the typeahead app, I get

Jai Johnny Johanson Bands
Jai Johnny Johanson Groups
J.J. Johnson
Jai Johnny Johanson
Juke Joint Jezebel
Juke Joint Jimmy
Juke Joint Johnny

But if I have just searched for ‘Paul McCartney’, typing in ‘J’ now brings back:

John Lennon
John Lennon Songs
John Lennon Songs Covered
James P Johnson Songs
John Lennon Originals
Hey Jude

The app has learned something about my search agenda! To make this work, the front end application caches the returned metadata for previously executed suggester results and stores this in a circular queue on the client side. It then uses the most recently cached sets of metadata to construct a boost query for each typeahead submission. So when I executed the search for “Paul McCartney”, the returned metadata was:

genres_ss:Rock,Rock & Roll,Soft Rock,Pop Rock
hasPerformer_ss:Beatles,Paul McCartney,José Feliciano,Jimi Hendrix,Joe Cocker,Aretha Franklin,Bon Jovi,Elvis Presley ( … and many more)
composer_ss:Paul McCartney,John Lennon,Ringo Starr,George Harrison,George Jackson,Michael Jackson,Sonny Bono
memberOfGroup_ss:Beatles,Wings

From this returned metadata – taking the top results, the cached boost query was:

genres_ss:”Rock”^50 genres_ss:”Rock & Roll”^50 genres_ss:”Soft Rock”^50 genres_ss:”Pop Rock”^50
hasPerformer_ss:”Beatles”^50 hasPerformer_ss:”Paul McCartney”^50 hasPerformer_ss:”José Feliciano”^50 hasPerformer_ss:”Jimi Hendrix”^50
composer_ss:”Paul McCartney”^50 composer_ss:”John Lennon”^50 composer_ss:”Ringo Starr”^50 composer_ss:”George Harrison”^50
memberOfGroup_ss:”Beatles”^50 memberOfGroup_ss:”Wings”^50

And since John Lennon is both a composer and a member of the Beatles, records with John Lennon are boosted twice which is why these records now top the typeahead list. (not sure why James P Johnson snuck in there except that there are two ‘J’s in his name).

This demonstrates how powerful the use of context can be. In this case, the context is based on the user's current search patterns. Another take-home here is that using facets beyond their traditional role as a UI navigation aid is a powerful way to build context into a search application. In this case, they were used in several ways – to create the pivot patterns for the suggester, to associate contextual metadata with suggester records, and finally to use this context in a typeahead app to boost records that are relevant to the user's most recent search goals. (The source code for the Angular JS app is also included in the github repository.)

We miss you Jimi – thanks for all the great tunes! (You are correct, I listened to some Hendrix – Beatles too – while writing this blog – is it that obvious?)

 

The post Pivoting to the Query: Using Pivot Facets to build a Multi-Field Suggester appeared first on Lucidworks.com.

LibUX: How to Talk about User Experience — The Webinar!

planet code4lib - Fri, 2016-08-12 04:28

Hey there. My writeup (“How to Talk about User Experience“) is now a 90-minute LITA webinar. I have pretty strong ideas about treating the “user experience” as a metric and I am super grateful to my friends at LITA for another opportunity to make the case.

Details

The explosion of new library user experience roles, named and unnamed, the community growing around it, the talks, conferences, and corresponding literature signal a major shift. But the status of library user experience design as a professional field is impacted by the absence of a single consistent definition of the area. While we can workshop card sorts and pick apart library redesigns, even user experience librarians can barely agree about what it is they do – let alone why it’s important. How we talk about the user experience matters. So, in this 90 minute talk, we’ll fix that.

  • September 7, 2016
  • 1 – 2:30 p.m. (Eastern)

Costs

  • $45 – LITA Members
  • $105 – Non-members
  • $196 – Groups

Register

