
Feed aggregator

Patrick Hochstenbach: Trying out caricature styles

planet code4lib - Sun, 2014-09-14 11:08
Filed under: Doodles Tagged: art, caricature, cartoon, copic, depardieu, doodle, marker

Patrick Hochstenbach: Doodling at JCDL2014

planet code4lib - Sat, 2014-09-13 17:22
Visited JCDL2014 in London last week. Some talks ended up as cartoons. Filed under: Doodles Tagged: adobe, copic, dhcl, digital library, jcdl, london, marker

Dan Scott: My small contribution to schema.org this week

planet code4lib - Sat, 2014-09-13 07:27

Version 1.91 of the http://schema.org vocabulary was released a few days ago, and I once again had a small part to play in it.

With the addition of the workExample and exampleOfWork properties, we (Richard Wallis, Dan Brickley, and I) realized that examples of these CreativeWork properties were desperately needed to help clarify their appropriate usage. I had developed one for the blog post that accompanied the launch of those properties, but the question was: where should those examples live in the official schema.org docs? CreativeWork has so many children, and the properties are so broadly applicable, that the example could have been added to dozens of type pages.

It turns out that an until-now unused feature of the schema.org infrastructure is that examples can live on property pages; even Dan Brickley didn't think this was working. However, a quick test in my sandbox showed that it _was_ in perfect working order, so we could locate the examples on their most relevant documentation pages... Huzzah!

I was then able to put together a nice, juicy example showing relationships between a Tolkien novel (The Fellowship of the Ring), subsequent editions of that novel published by different companies in different locations at different times, and movies based on that novel. From this librarian's perspective, it's pretty cool to be able to do this; it's a realization of a desire to express relationships that, in most library systems, are hard or impossible to accurately specify. (Should be interesting to try and get this expressed in Evergreen and Koha...)
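
To make the shape of those relationships concrete, here is a rough JSON-LD sketch (expressed as a Python snippet so it can be printed and tweaked). It is not the exact example that landed in the schema.org docs, and the edition details and ISBN are placeholders; only the workExample/exampleOfWork properties themselves come from the vocabulary.

    import json

    # Sketch: an abstract novel linked to a concrete edition and a derived film
    # via schema.org's workExample property; each example points back with
    # exampleOfWork. Edition details and ISBN below are placeholders.
    work = {
        "@context": "http://schema.org",
        "@type": "Book",
        "name": "The Fellowship of the Ring",
        "author": {"@type": "Person", "name": "J. R. R. Tolkien"},
        "workExample": [
            {
                "@type": "Book",
                "bookEdition": "Trade paperback",      # placeholder edition
                "isbn": "0000000000",                  # placeholder ISBN
                "datePublished": "2005",
            },
            {
                "@type": "Movie",
                "name": "The Lord of the Rings: The Fellowship of the Ring",
                "datePublished": "2001",
            },
        ],
    }

    # The inverse property lets each example point back to the abstract work.
    for example in work["workExample"]:
        example["exampleOfWork"] = {"@type": "Book", "name": work["name"]}

    print(json.dumps(work, indent=2))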

In an ensuing conversation on public-vocabs about the appropriateness of this approach to work relationships, I was pleased to hear Jeff Young say "+1 for using exampleOfWork / workExample as many times as necessary to move vaguely up or down the bibliographic abstraction layers."... To me, that's a solid endorsement of this pragmatic approach to what is inherently messy bibliographic stuff.

Kudos to Richard for having championed these properties in the first place; sometimes we're a little slow to catch on!

FOSS4Lib Recent Releases: OpenWayback - 2.0.0

planet code4lib - Fri, 2014-09-12 16:03

Last updated September 12, 2014. Created by Peter Murray on September 12, 2014.

Package: OpenWayback
Release Date: Friday, September 12, 2014

FOSS4Lib Updated Packages: OpenWayback

planet code4lib - Fri, 2014-09-12 15:08

Last updated September 12, 2014. Created by Peter Murray on September 12, 2014.

OpenWayback is an open source Java application designed to query and access archived web material. It was first released by the Internet Archive in September 2005, based on the (then) Perl-based Internet Archive Wayback Machine, to enable public distribution of the application and increase its maintainability and extensibility. Since then, the Open Source Wayback Machine (OSWM) has been widely used by members of the International Internet Preservation Consortium (IIPC) and has become the de facto rendering software for web archives.

Package Links: Releases for OpenWayback | Upcoming Events for the OpenWayback Package
Package Type: Data Preservation and Management
License: Apache 2.0
Development Status: Production/Stable
Operating System: Browser/Cross-Platform
Technologies Used: Solr, Tomcat
Programming Language: Java

Library of Congress: The Signal: Teaching Integrity in Empirical Research: An Interview with Richard Ball and Norm Medeiros

planet code4lib - Fri, 2014-09-12 14:26

Richard Ball (Associate Professor of Economics, Haverford College) and Norm Medeiros (Associate Librarian, Haverford Libraries)

This post is the latest in our NDSA Innovation Working Group’s ongoing Insights Interview series. Chelcie Rowell (Digital Initiatives Librarian, Wake Forest University) interviews Richard Ball (Associate Professor of Economics, Haverford College) and Norm Medeiros (Associate Librarian, Haverford Libraries) about Teaching Integrity in Empirical Research, or Project Tier.

Chelcie: Can you briefly describe Teaching Integrity in Empirical Research, or Project TIER, and its purpose?

Richard: For close to a decade, we have been teaching our students how to assemble comprehensive documentation of the data management and analysis they do in the course of writing an original empirical research paper. Project TIER is an effort to reach out to instructors of undergraduate and graduate statistical methods classes in all the social sciences to share with them lessons we have learned from this experience.

When Norm and I started this work, our goal was simply to help our students learn to do good empirical research; we had no idea it would turn into a “project.” Over a number of years of teaching an introductory statistics class in which students collaborated in small groups to write original research papers, we discovered that it was very useful to have students not only turn in a final printed paper reporting their analysis and results, but also submit documentation of exactly what they did with their data to obtain those results.

We gradually developed detailed instructions describing all the components that should be included in the documentation and how they should be formatted and organized. We now refer to these instructions as the TIER documentation protocol. The protocol specifies a set of electronic files (including data, computer code and supporting information) that would be sufficient to allow an independent researcher to reproduce–easily and exactly–all the statistical results reported in the paper. The protocol is and will probably always be an evolving work in progress, but after several years of trial and error, we have developed a set of instructions that our students are able to follow with a high rate of success.

Even for students who do not go on to professional research careers, the exercise of carefully documenting the work they do with their data has important pedagogical benefits. When students know from the outset that they will be required to turn in documentation showing how they arrive at the results they report in their papers, they approach their projects in a much more organized way and keep much better track of their work at every phase of the research. Their understanding of what they are doing is therefore substantially enhanced, and I in turn am able to offer much more effective guidance when they come to me for help.

Research Data Management by user jannekestaaks on Flickr

Despite these benefits, methods of responsible research documentation are virtually, if not entirely, absent from the curricula of all the social sciences. Through Project TIER, we are engaging in a variety of activities that we hope will help change that situation. The major events of the last year were two faculty development workshops that we conducted on the Haverford campus. A total of 20 social science faculty and research librarians from institutions around the US attended these workshops, at which we described our experiences teaching our students good research documentation practices, explained the nuts and bolts of the TIER documentation protocol, and discussed with workshop participants the ways in which they might integrate the protocol into their teaching and research supervision. We have also been spreading the word about Project TIER by speaking at conferences and workshops around the country, and by writing articles for publications that we hope will attract the attention of social science faculty who might be interested in joining this effort.

We are encouraged that faculty at a number of institutions are already drawing on Project TIER and teaching their students and research advisees responsible methods of documenting their empirical research. Our ultimate goal is to see a day when the idea of a student turning in an empirical research paper without documentation of the underlying data management and analysis is considered as aberrant as the idea of a student turning in a research paper for a history class without footnotes or a reference list.

Chelcie: How did TIER and your 10-year collaboration (so far!) get started?

Norm: When I came to the Haverford Libraries in 2000, I was assigned responsibility for the Economics Department. Soon thereafter I began providing assistance to Richard’s introductory statistics students, both in locating relevant literature and in acquiring data for statistical analysis. I provided similar, albeit more specialized, assistance to seniors in the context of their theses. Richard invited me to his classes and advised students to make appointments with me. Through regular communication, I came to understand the outcomes he sought from his students’ research assignments, and tailored my approach to meet these expectations. A strong working relationship ensued.

Meanwhile, in 2006 the Haverford Libraries in conjunction with Bryn Mawr and Swarthmore Colleges implemented DSpace, the widely-deployed open source repository system. The primary collection Haverford migrated into DSpace was its senior thesis archive, which had existed for the previous five years in a less-robust system. Based on the experience I had accrued to that point working with Richard and his students, I thought it would be helpful to future generations of students if empirical theses coexisted with the data from which the results were generated.

The DSpace platform provided a means of storing such digital objects and making them available to the public. I mentioned this idea to Richard, who suggested that not only should we post the data, but also all the documentation (the computer command files, data files and supporting information) specified by our documentation protocol. We didn’t know it at the time, but the seeds of Project TIER were planted then. The first thesis with complete documentation was archived on DSpace in 2007, and several more have been added every year since then.

Chelcie: You call TIER a “soup-to-nuts protocol for documenting data management and analysis.” Can you walk us through the main steps of that protocol?

The data by ken fager on Flickr

Richard: The term “soup-to-nuts” refers to the fact that the TIER protocol entails documenting every step of data management and analysis, from the very beginning to the very end of a research project. In economics, the very beginning of the empirical work is typically the point at which the author first obtains the data to be used in the study, either from an existing source such as a data archive, or by conducting a survey or experiment; the very end is the point at which the final paper reporting the results of the study is made public.

The TIER protocol specifies that the documentation should contain the original data files the author obtained at the very beginning of the study, as well as computer code that executes all the processing of the data necessary to prepare them for analysis–including, for example, combining files, creating new variables, and dropping cases or observations–and finally generating the results reported in the paper. The protocol also specifies several kinds of additional information that should be included in the documentation, such as metadata for the original data files, a data appendix that serves as a codebook for the processed data used in the analysis and a read-me file that serves as a users’ guide to everything included in the documentation.
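
For instructors curious what such a documentation package looks like on disk, here is a minimal sketch that sets up a template of folders along these lines. The folder and file names are illustrative guesses, not the protocol's official ones; the complete TIER instructions mentioned below give the authoritative layout.

    from pathlib import Path

    # A minimal sketch of a TIER-style documentation template. Folder and file
    # names are illustrative, not the protocol's official ones.
    FOLDERS = [
        "Original-Data",            # untouched source data files
        "Original-Data/Metadata",   # metadata describing each original file
        "Command-Files",            # code that processes data and generates results
        "Analysis-Data",            # the processed dataset(s) used in the analysis
        "Documents",                # data appendix, read-me, final paper
    ]

    def create_tier_template(project_root: str) -> None:
        root = Path(project_root)
        for folder in FOLDERS:
            (root / folder).mkdir(parents=True, exist_ok=True)
        # The read-me serves as the users' guide to everything in the documentation.
        (root / "Documents" / "read-me.txt").touch()

    if __name__ == "__main__":
        create_tier_template("my-empirical-paper")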

This “soup-to-nuts” standard contrasts sharply with the policies of academic journals in economics and other social sciences. Some of these journals require authors of empirical papers to submit documentation along with their manuscripts, but the typical policy requires only the processed data file used in the analysis and the computer code that uses this processed data to generate the results. These policies do not require authors to include copies of the original data files or the computer code that processes the original data to prepare them for analysis. In our view, this standard, sometimes called “partial replicability,” is insufficient. Even in the simplest cases, construction of the processed dataset used in the analysis involves many decisions, and documentation that allows only partial replication provides no record of the decisions that were made.

Complete instructions for the TIER protocol are available online. The instructions are presented in a series of web pages, and they are also available for download in a single .pdf document.

Chelcie: You’ve taught the TIER protocol in two main curricular contexts: introductory statistics courses and empirical senior thesis projects. What is similar or different about teaching TIER in these two contexts?

Richard: The main difference is that in the statistics courses students do their research projects in groups made up of 3-5 members. It is always a challenge for students to coordinate work they do in groups, and the challenge is especially great when the work involves managing several datasets and composing several computer command files. Fortunately, there are some web-based platforms that can facilitate cooperation among students working on this kind of project. We have found two platforms to be particularly useful: Dataverse, hosted by the Harvard Institute for Quantitative Social Science, and the Open Science Framework, hosted by the Center for Open Science.

Another difference is that when seniors write their theses, they have already had the experience of using the protocol to document the group project they worked on in their introductory statistics class. Thanks to that experience, senior theses tend to go very smoothly.

Chelcie: Can you elaborate a little bit about the Haverford Dataverse you’ve implemented for depositing the data underlying senior theses?

Norm: In 2013 Richard and I were awarded a Sloan/ICPSR challenge grant with which to promote Project TIER and solicit participants. As we considered this initiative, it was clear to us that a platform for hosting files would be needed, both for instructors who perhaps didn’t have a local repository system in place and for fostering cross-institutional collaboration, whereby students learning the protocol at one participating institution could run replications against finished projects at another institution.

We imagined such a platform would need an interactive component, such that one could comment on the exactness of the replication. DSpace is a strong platform in many ways, but it is not designed for these purposes, so Richard and I began investigating available options. We came across Dataverse, which has many of the features we desired. Although we have uploaded some senior theses as examples of the protocol’s application, it was really the introductory classes for which we sought to leverage Dataverse. Our Project TIER Dataverse is available online.

In fall 2013, we experimented with using Dataverse directly with students. We sought to leverage the platform as a means of facilitating file management and communication among the various groups. We built Dataverses for each of the six groups in Richard’s introductory statistics course. We configured templates that helped students understand where to load their data and associated files. The process of building these Dataverses was time-consuming, and at points we needed to jury-rig the system to meet our needs. Although Dataverse is a robust system, we found its interface too complex for our needs. This fall we plan to use the Open Science Framework system to see if it can serve our students slightly better. Down the road, we can envision complementary roles for Dataverse and OSF as they relate to Project TIER.

Chelcie: After learning the TIER protocol, do students’ perceptions of the value of data management change?

Richard: Students’ perceptions change dramatically. I see this every semester. For the first few weeks, students have to do a few things to prepare to do what is required by the protocol, like setting up a template of folders in which to store the documentation as they work on the project throughout the semester, and establishing a system that allows all the students in the group to access and work on the files in those folders. There are always a few wrinkles to work out, and sometimes there is a bit of grumbling, but as soon as students start working seriously with their data they see how useful it was to do that up-front preparation. They realize quickly that organizing their work as prescribed by the protocol increases their efficiency dramatically, and by the end of the semester they are totally sold–they can’t imagine doing it any other way.

Chelcie: Have you experienced any tensions between developing step-by-step documentation for a particular workflow and technology stack versus developing more generic documentation?

Richard: The issue of whether the TIER protocol should be written in generic terms or tailored to a particular platform and/or a particular kind of software is an important one, but for the most part has not been the source of any tensions. All of the students in our introductory statistics class and most of our senior thesis advisees use Stata, on either a Windows or Mac operating system. The earliest versions of the protocol were therefore written particularly for Stata users, which meant, for example, we used the term “do-file” instead of “command file,” and instead of saying something like “a data file saved in the proprietary format of the software you are using” we would say “a data file saved in Stata’s .dta format.”

But fundamentally there is nothing Stata-specific about the protocol. Everything that we teach students to do using Stata works just fine with any of the other major statistical packages, like SPSS, R and SAS. So we are working on two ways of making it as easy as possible for users of different software to learn and teach the protocol. First, we have written a completely software-neutral version. And second, with the help of colleagues with expertise in other kinds of software, we are developing versions for R and SPSS, and we hope to create a SAS version soon. We will post each of these versions on the Project TIER website as it is completed.

The one program we have come across for which the TIER protocol is not well suited is Microsoft Excel. The problem is that Excel is an exclusively interactive program; it is difficult or impossible to write an editable program that executes a sequence of commands. Executable command files are the heart and soul of the TIER protocol; they are the tool that makes it possible literally to replicate statistical results. So Excel cannot be the principal program used for a project for which the TIER documentation protocol is being followed.

Chelcie: What have you found to be the biggest takeaways from your experience introducing a data management protocol to undergraduates?

Richard: In the response to the first question in this interview, I described some of the tangible pedagogical benefits of teaching students to document their empirical research carefully. But there is a broader benefit that I believe is more fundamental. Requiring students to document the statistical results they present in their papers reinforces the idea that whenever they want to claim something is true or advocate a position, they have an intellectual responsibility to be able to substantiate and justify all the steps of the argument that led them to their conclusion. I believe this idea should underlie almost every aspect of an undergraduate education, and Project TIER helps students internalize it.

Chelcie: Thanks to funding from the Sloan Foundation and ICPSR at the University of Michigan, you’ve hosted a series of workshops focused on teaching good practices in documenting data management and analysis. What have you learned from “training the trainers”?

Richard: Our experience with faculty from other institutions has reinforced our belief that the time is right for initiatives that, like Project TIER, aim to increase the quality and credibility of empirical research in the social sciences. Instructors frequently tell us that they have thought for a long time that they really ought to include something about documentation and replicability in their statistics classes, but never got around to figuring out just how to do that. We hope that our efforts on Project TIER, by providing a protocol that can be adopted as-is or modified for use in particular circumstances, will make it easier for others to begin teaching these skills to their students.

We have also been reminded of the fact that faculty everywhere face many competing demands on their time and attention, and that promoting the TIER protocol will be hard if it is perceived to be difficult or time-consuming for either faculty or students. In our experience, the net costs of adopting the protocol, in terms of time and attention, are small: the protocol complements and facilitates many aspects of a statistics class, and the resulting efficiencies largely offset the start-up costs. But it is not enough for us to believe this: we need to formulate and present the protocol in such a way that potential adopters can see this for themselves. So as we continue to tinker with and revise the protocol on an ongoing basis, we try to be vigilant about keeping it simple and easy.

Chelcie: What do you think performing data management outreach to undergraduate, or more specifically TIER as a project, will contribute to the broader context of data management outreach?

Richard: Project TIER is one of a growing number of efforts that are bubbling up in several fields that share the broad goal of enhancing the transparency and credibility of research in the social sciences. In Sociology, Scott Long of Indiana University is a leader in the development of best practices in responsible data management and documentation. The Center for Open Science, led by psychologists Brian Nosek and Jeffrey Spies of the University of Virginia, is developing a web-based platform to facilitate pre-registration of experiments as well as replication studies. And economist Ted Miguel at UC Berkeley has launched the Berkeley Initiative for Transparency in the Social Sciences (BITSS), which focuses its efforts on strengthening professional norms of research transparency by reaching out to early-career social scientists. The Inter-university Consortium for Political and Social Research (ICPSR), which for over 50 years has served as a preeminent archive for social science research data, is also making important contributions to responsible data stewardship and research credibility. The efforts of all these groups and individuals are highly complementary, and many fruitful collaborations and interactions are underway among them. Each has a unique focus, but all are committed to the common goal of improving norms and practices with respect to transparency and credibility in social science research.

These bottom-up efforts also align well with several federal initiatives. Since 2011, the NSF has required all proposals to include a “data management plan” outlining procedures that will be followed to support the dissemination and sharing of research results. Similarly, the NIH requires all investigator-initiated applications with direct costs greater than $500,000 in any single year to address data sharing in the application. More recently, in 2013 the White House Office of Science and Technology Policy issued a policy memorandum titled “Increasing Access to the Results of Federally Funded Scientific Research,” directing all federal agencies with more than $100 million in research and development expenditures to establish guidelines for the sharing of data from federally funded research.

Like Project TIER, many of these initiatives have been launched just within the past year or two. It is not clear why so many related efforts have popped up independently at about the same time, but it appears that momentum is building that could lead to substantial changes in the conduct of social science research.

Chelcie: Do you think the challenges and problems of data management outreach to students will be different in 5 years or 10 years?

Richard: As technology changes, best practices in all aspects of data stewardship, including the procedures specified by the TIER protocol, will necessarily change as well. But the principles underlying the protocol–replicability, transparency, integrity–will remain the same. So we expect the methods of implementing Project TIER will continually be evolving, but the aim will always be to serve those principles.

Chelcie: Based on your work with TIER, what kinds of challenges would you like for the digital preservation and stewardship community to grapple with?

Norm: We’re glad to know that research data are specifically identified in the National Agenda for Digital Stewardship. There is an ever-growing array of non-profit and commercial data repositories for the storage and provision of research data; ensuring the long-term availability of these is critical. Although our protocol relies on a platform for file storage, Project TIER is focused on teaching techniques that promote transparency of empirical work, rather than on digital object management per se. This said, we’d ask that the NDSA partners consider the importance of accommodating supplemental files, such as statistical code, within their repositories, as these are necessary for the computational reproducibility advocated by the TIER protocol. We are encouraged by and grateful to the Library of Congress and other forward-looking institutions for advancing this ambitious Agenda.

FOSS4Lib Upcoming Events: Sharing Images of Global Cultural Heritage

planet code4lib - Fri, 2014-09-12 13:27
Date: Monday, October 20, 2014 - 09:00 to 17:00
Supports: IIPImage, Loris, OpenSeadragon

Last updated September 12, 2014. Created by Peter Murray on September 12, 2014.

The International Image Interoperability Framework community (http://iiif.io/) is hosting a one day information sharing event about the use of images in and across Cultural Heritage institutions. The day will focus on how museums, galleries, libraries and archives, or any online image service, can take advantage of a powerful technical framework for interoperability between image repositories.
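
As a rough illustration of what that interoperability means in practice: every IIIF-compliant image server, whether IIPImage or Loris, answers requests built from the same URL pattern, so a client such as OpenSeadragon can work against any of them. The sketch below assumes a made-up server base URL and identifier; only the region/size/rotation/quality.format pattern comes from the IIIF Image API.

    # Building IIIF Image API request URLs. The base URL and identifier are
    # hypothetical; the {region}/{size}/{rotation}/{quality}.{format} pattern
    # is the one defined by the IIIF Image API (older versions used "native"
    # instead of "default" for the quality segment).
    BASE = "https://images.example.org/iiif"
    IDENTIFIER = "manuscript-0042"

    def iiif_image_url(region="full", size="full", rotation=0,
                       quality="default", fmt="jpg"):
        return f"{BASE}/{IDENTIFIER}/{region}/{size}/{rotation}/{quality}.{fmt}"

    print(iiif_image_url())                        # the full image
    print(iiif_image_url(region="0,0,512,512",     # a 512x512 region from the top left
                         size="256,"))             # scaled to 256 pixels wide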

FOSS4Lib Recent Releases: Loris - 2.0.0-alpha2

planet code4lib - Fri, 2014-09-12 13:23

Last updated September 12, 2014. Created by Peter Murray on September 12, 2014.

Package: Loris
Release Date: Tuesday, September 9, 2014

LITA: 2014 LITA Forum Student Registration Rate Available

planet code4lib - Thu, 2014-09-11 21:23

LITA is offering a special student registration rate for the 2014 LITA National Forum to a limited number of graduate students enrolled in ALA-accredited programs. The Forum will be held November 5-8, 2014, at the Hotel Albuquerque in Albuquerque, NM. Learn more about the Forum here.

In exchange for a discounted registration, students will assist the LITA organizers and the Forum presenters with on-site operations.  This year’s theme is “Transformation: From Node to Network.”  We are anticipating an attendance of 300 decision makers and implementers of new information technologies in libraries.

The selected students will be expected to attend the full LITA National Forum, Thursday noon through Saturday noon.  This does not include the pre-conferences on Thursday and Friday.  You will be assigned a variety of duties, but you will be able to attend the Forum programs, which include 3 keynote sessions, 30 concurrent sessions, and a dozen poster presentations.

The special student rate is $180 – half the regular registration rate for LITA members. This rate includes a Friday night reception at the hotel, continental breakfasts, and Saturday lunch. To get this rate, you must apply and be accepted as described below.

To apply for the student registration rate, please provide the following information:

  1. Complete contact information including email address,
  2. The name of the school you are attending, and
  3. A statement of 150 words or fewer on why you want to attend the 2014 LITA Forum

Please send this information no later than September 30, 2014 to lita@ala.org, with “2014 LITA Forum Student Registration Request” in the subject line.

Those selected for the student rate will be notified no later than October 3, 2014.

Open Knowledge Foundation: Matchmakers in Action – Help Wanted

planet code4lib - Thu, 2014-09-11 19:30

Do you have a skill to share? Want to host an online discussion/debate about an Open Knowledge-like topic? Have an idea for a skillshare or discussion, but need help making it happen? Some of you hosted or attended sessions at OKFest. Why not host one online? At OKFestival, we had an Open Matchmaker wall to connect learning and sharing. This is a little experiment to see if we can replicate that spirit online. We’d love to collaborate with you to make this possible.

How to help with Online Community Sessions:

We’ve set up a Community Trello board where you can add ideas, sign up to host or vote for existing ideas. Trello, a task management tool, has fairly simple instructions.

The Community Sessions Trello Board is live. Start with the Read me First card.

Hosting or leading a Community Session is fairly easy. You can host it via video, or even as an editathon or an IRC chat.

  • For video, we have been using G+. We can help you get started on this.
  • For editathons, you could schedule one, share it on your favourite communications channel, and then use a shared document like a Google Doc or an Etherpad.
  • For an IRC chat, simply set up a topic, a time and a Trello card to start planning.

We highly encourage you to do the sessions in your own language.

Upcoming Community Sessions

We have a number of timeslots open for September – October 2014. We will help you get started and even co-host a session online. As a global community, we are somewhat timezone agnostic. Please suggest a time that works for you and that might work with others in the community.

In early October, we will be joined by Nika Aleksejeva of Infogr.am to do a Data Viz 101 skillshare. She makes it super easy for beginners to use data to tell stories.

The Data Viz 101 session is October 8, 2014. Register here.

Community Session Conversation – September 10, 2014

In this 40 minute community conversation, we brainstormed some ideas and talked about some upcoming community activities:

Some of the ideas shared included global inclusiveness and how to fundraise. Remember to vote or share your ideas. Or, if you are super keen, we would love it if you would lead an online session.

(photo by Gregor Fischer)

LITA: Voice your ideas on LITA’s strategic goals

planet code4lib - Thu, 2014-09-11 17:45

As mentioned in a previous post, LITA is beginning a series of informal discussions to let members voice their thoughts about the current strategic goals of LITA. These “kitchen table talks” will be led by President Rachel Vacek and Vice-President Thomas Dowling.

The kitchen table talks will discuss LITA’s strategic goals – collaboration and networking; education and sharing of expertise; advocacy; and infrastructure – and how meeting those goals will help LITA better serve you. The talks also align with ALA’s strategic planning process and efforts to communicate the association’s overarching goals of professional development, information policy, and advocacy.

When
  • ONLINE: Friday, September 19, 2014, 1:30-2:30 pm EDT
  • ONLINE: Tuesday, October 14, 2014, 12:00-1:00 pm EDT
  • IN-PERSON: Friday, November 7, 2014, 6:45-9:00 pm MDT at the LITA Forum in Albuquerque, NM
How to join the online conversations

On the day and time of the online events, join in on the conversation in this Google Hangout.

We look forward to the conversations!

State Library of Denmark: Even sparse faceting is limited

planet code4lib - Thu, 2014-09-11 15:08

Recently, Andy Jackson from UK Web Archive discovered a ginormous Pit Of Pain with Solr distributed faceting, where some response times reached 10 minutes. The culprit is facet.limit=100 (the number of returned values for each facet is 100), as the secondary fine-counting of facet terms triggers a mini-search for each term that has to be checked. With the 9 facets UK Web Archive uses, that's 9*100 searches in the worst case. Andy has done a great write-up on their setup and his experiments: Historical UKWA Solr Query Performance Benchmarking.
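
For reference, the kind of request being benchmarked looks roughly like the sketch below. The collection URL and facet field names are made up; only the standard Solr facet parameters (facet, facet.field, facet.limit, facet.mincount) are real.

    import requests

    # A sketch of a faceted Solr query with a high facet.limit. URL and field
    # names are hypothetical; the facet parameters are standard Solr.
    SOLR = "http://localhost:8983/solr/webarchive/select"

    params = {
        "q": "example query",
        "rows": 0,                 # only the facet counts matter here
        "facet": "true",
        "facet.field": ["domain", "content_type", "crawl_year"],  # made-up fields
        "facet.limit": 100,        # top-100 values per facet -- the culprit above
        "facet.mincount": 1,
        "wt": "json",
    }

    response = requests.get(SOLR, params=params).json()
    print(response["facet_counts"]["facet_fields"])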

Pit Of Pain by Andy Jackson, UK Web Archive

The shape of the pit can be explained by the probability of the need for fine-counts: When there are fewer than 1K hits, chances are that all shards have delivered all matching terms with count > 0 and thus need not be queried again (clever merging). When there are more than 1M hits, chances are that the top-100 terms in each facet are nearly the same for all shards, so that only a few of the terms need fine-counting. Between those two numbers, chances are that a lot of the terms are not present in all initial shard results and thus require fine-counting.

While the indexes at Statsbiblioteket and UK Web Archive are quite comparable (12TB vs. 16TB, built with nearly the same analysis chain), the setups differ with regard to hardware as well as facet setup. Still, it would be interesting to see if we can reproduce the Pit Of Pain™ with standard Solr faceting on our 6 facet fields and facet.limit=100.

12TB index / 4.2B docs / 2565GB RAM, Solr fc faceting on 6 fields, facet limit 100

Sorta, kinda? We do not see the low response times below 100 hits, and 10 minutes of testing only gave 63 searches, but with the right squinting of eyes, the Pit Of Pain (visualized as a hill to trick the enemy) is visible from ~1K to 1M hits. As for the high response times below 100 hits, they are due to a bad programming decision on my side – expect yet another blog post. As for the pit itself, let's see how it changes when the limit goes down.

12TB index / 4.2B docs / 2565GB RAM, Solr fc faceting on 6 fields, facet limit 100, 50 & 5

Getting a little crowded with all those dots, so here’s a quartile plot instead.

12TB index / 4.2B docs / 2565GB RAM, Solr fc faceting on 6 fields, facet limit 100, 50 & 5

Again, please ignore results below 100 hits. I will fix it! Promise! But other than that, it seems pretty straightforward: high limits carry a severe performance penalty, which seems to be more or less linear in the limit requested (hand-waving here).

The burning question is of course how it looks with sparse faceting. Technically, distributed sparse faceting avoids the mini-searches in the fine-counting phase, but it still requires each term to be looked up in order to resolve its ordinal (which is used as an index into the internal sparse faceting counter structure). Such a lookup does take time, something like 0.5ms on average on our current setup, so sparse faceting is not immune to large facet limits. Let's keep the y-axis max of 20 seconds for comparison with standard Solr.

12TB index / 4.2B docs / 2565GB RAM, sparse faceting on 6 fields, facet limit 100, 50 & 5

There does appear to be a pit too! Switching to quartiles and zooming in:

12TB index / 4.2B docs / 2565GB RAM, sparse faceting on 6 fields, facet limit 100, 50 & 5


This could use another round of tests, but it seems that the pit is present from 10K to 1M hits, fairly analogous to Solr fc faceting. The performance penalty of high limits also matches, just an order of magnitude lower. With a worst case of 6*100 fine-counts per shard (at ~10^5 hits) and an average lookup time of 0.5ms, a mean total response time around 1000ms seems reasonable. Everything checks out and we are happy.

Update 20140912

The limit for each test was increased to 1 hour or 1000 searches, whichever came first, and the tests were repeated with facet.limits of 1K, 10K and 100K. The party stopped early with an OutOfMemoryError at 10K, and since raising the JVM heap size would skew all previous results, what we got is what we have.

12TB index / 4.2B docs / 2565GB RAM, Solr fc faceting on 6 fields, facet limit 1000

Quite similar to the Solr fc faceting test with facet.limit=100 at the beginning of this post, but with the Pit Of Pain moved a bit to the right and a worst case of 3 minutes. Together with the other tested limits, shown as quartiles, we have

12TB index / 4.2B docs / 2565GB RAM, Solr fc faceting on 6 fields, facet limit 1000, 100, 50 & 5

Looking at the Pit Of Pain in isolation, we have the median response times in milliseconds:

facet.limit | 10^4 hits | 10^5 hits | 10^6 hits | 10^7 hits
       1000 |     24559 |     70061 |    141660 |     95792
        100 |      9498 |     16615 |     12876 |     11582
         50 |      9569 |      9057 |      7668 |      6892
          5 |      2469 |      2337 |      2249 |      2168

Without cooking the numbers too much, we can see that the worst increase when switching from limit 50 to 100 is for 10^5 hits: 9057ms -> 16615ms or 1.83 times, with the expected increase being 2 (50 -> 100). Likewise, the worst increase from limit 100 to 1000 is for 10^6 hits: 12876ms -> 141660ms or 11.0 times, with the expected increase being 10 (100 -> 1000). In other words: worst-case median response times (if such a thing makes sense) for distributed fc faceting with Solr scale linearly with facet.limit.

Repeating with sparse faceting and skipping right to the quartile plot (note that the y-axis dropped by a factor 10):

12TB index / 4.2B docs / 2565GB RAM, sparse faceting on 6 fields, facet limit 1000, 100, 50 & 5

Looking at the Pit Of Pain in isolation, we have the median response times in milliseconds:

facet.limit | 10^4 hits | 10^5 hits | 10^6 hits | 10^7 hits
       1000 |       512 |      2397 |      3311 |      2189
        100 |       609 |       960 |       698 |       939
         50 |       571 |       635 |       395 |       654
          5 |       447 |       215 |       248 |       588

The worst increase when switching from limit 50 to 100 is for 10^6 hits: 395ms -> 698ms or 1.76 times, with the expected increase being 2. Likewise, the worst increase from limit 100 to 1000 is also for 10^6 hits: 698ms -> 3311ms or 4.7 times, with the expected increase being 10. In other words: worst-case median response times for distributed sparse faceting appear to scale better than linearly with facet.limit.

Re-thinking this, it becomes apparent that there are multiple parts to facet fine-counting: a base overhead and an overhead for each term. Assuming the base overhead is the same (since the number of hits is the same), we calculate the base to be 408ms and the overhead per term to be 0.48ms for sparse (remember we have 6 facets, so facet.limit=1000 means a worst case of fine-counting 6000 terms). If that holds, setting facet.limit=10K would give a worst-case median response time of around 30 seconds.
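
For the curious, here is the back-of-the-envelope fit spelled out, using the sparse medians at 10^6 hits from the table above (a sketch of the reasoning, not code from the benchmark itself):

    # Two-part cost model: total = base + per_term * (facets * facet.limit),
    # fitted to the sparse medians at 10^6 hits for facet.limit 100 and 1000.
    facets = 6
    t_100, terms_100 = 698, facets * 100        # ms, worst-case terms to fine-count
    t_1000, terms_1000 = 3311, facets * 1000

    per_term = (t_1000 - t_100) / (terms_1000 - terms_100)  # ~0.48 ms per term
    base = t_100 - per_term * terms_100                     # ~408 ms

    predicted_10k = base + per_term * facets * 10_000       # facet.limit = 10K
    print(f"base={base:.0f}ms, per_term={per_term:.2f}ms, "
          f"facet.limit=10K -> {predicted_10k / 1000:.1f}s")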


OCLC Dev Network: Software Development Practices: Getting Specific with Acceptance Criteria

planet code4lib - Thu, 2014-09-11 13:30

If you’ve been following our product development practices series, you know how to think about identifying problems and articulating those problems as user stories. But even the best user story can’t encompass all of the details of the user experience that need to be considered in the development process.  This week’s post explains the important role of acceptance criteria.

Karen Coyle: Philosophical Musings: The Work

planet code4lib - Thu, 2014-09-11 11:30
We can't deny the idea of work - opera, oeuvre - as a cultural product, a meaningful bit of human-created stuff. The concept exists, the word exists. I question, however, whether we will ever have, or should ever have, precision in how works are bounded; whether we'll ever be able to say clearly that the film version of Pride and Prejudice is or is not the same work as the book. I'm not even sure that we can say that the text of Pride and Prejudice is a single work. Is it the same work when read today that it was when first published? Is it the same work each time that one re-reads it? The reading experience varies based on so many different factors - the cultural context of the reader; the person's understanding of the author's language; the age and life experience of the reader.

The notion of work encompasses all of the complications of human communication and its consequent meaning. The work is a mystery, a range of possibilities and of possible disappointments. It has emotional and, at its best, transformational value. It exists in time and in space. Time is the more canny element here because it means that works intersect our lives and live on in our memories, yet as such they are but mere ghosts of themselves.

Take a book, say, Moby Dick; hundreds of pages, hundreds of thousands of words. We read each word, but we do not remember the words -- we remember the book as inner thoughts that we had while reading. Those could be sights and smells, feelings of fear, love, excitement, disgust. The words, external, and the thoughts, internal, are transformations of each other; from the author's ideas to words, and from the words to the reader's thoughts. How much is lost or gained during this process is unknown. All that we do know is that, for some people at least, the experience is a vivid one. The story takes on some meaning in the mind of the reader, if one can even invoke the vague concept of mind without torpedoing the argument altogether.

Brain scientists work to find the place in the maze of neuronic connections that can register the idea of "red" or "cold" while outside of the laboratory we subject that same organ to the White Whale, or the Prince of Denmark, or the ever elusive Molly Bloom. We task that organ to taste Proust's madeleine; to feel the rage of Ahab's loss; to become a neighbor in one of Borges' villages. If what scientists know about thought is likened to a simple plastic ping-pong ball, plain, round, regular, white, then a work is akin to a rainforest of diversity and discovery, never fully mastered, almost unrecognizable from one moment to the next.

As we move from textual works to musical ones, or on to the visual arts, the transformation from the work to the experience of the work becomes even more mysterious. Who hasn't passed quickly by an unappealing painting hanging on the wall of a museum, before which stands another person rapt with attention? If the painting doesn't speak to us, then we have no possible way of understanding what it is saying to someone else.

Libraries are struggling to define the work as an abstract but well-bounded, nameable thing within the mass of the resources of the library. But a definition of work would have to be as rich and complex as the work itself. It would have to include the unknown and unknowable effect that the work will have on those who encounter it; who transform it into their own thoughts and experiences. This is obviously impractical. It would also be unbelievably arrogant (as well as impossible) for libraries to claim to have some concrete measure of "workness" for now and for all time. One has to be reductionist to the point of absurdity to claim to define the boundaries between one work and another, unless they are so far apart in their meaning that there could be no shared messages or ideas or cultural markers between them. You would have to have a way to quantify all of the thoughts and impressions and meanings therein and show that they are not the same, when "same" is a target that moves with every second that passes, every synapse that is fired.

Does this mean that we should not try to surface workness for our users? Hardly. It means that it is too complex and too rich to be given a one-dimensional existence within the current library system. This is, indeed, one of the great challenges that libraries present to their users: a universe of knowledge organized by a single principle as if that were the beginning and end of the story. If the library universe and the library user's universe find few or no points of connection, then communication between them fails. At best, like the user of a badly designed computer interface, if any communication is to take place it is the user who must adapt. This in itself should be taken as evidence of superior intelligence on the part of the user, as compared to the inflexibility of the mechanistic library system.

Those of us in knowledge organization are obsessed with neatness, although few as much as the man who nearly single-handedly defined our profession in the late 19th century; the man who kept diaries in which he entered the menu of every meal he ate; whose wedding vows included a mutual promise never to waste a minute; the man enthralled with the idea that every library be ordered by the simple mathematical concept of the decimal.

To give Dewey due credit, he did realize that his Decimal Classification had to bend reality to practicality. As the editions grew, choices had to be made on where to locate particular concepts in relation to others, and in early editions, as the Decimal Classification was used in more libraries and as subject experts weighed in, topics were relocated after sometimes heated debate. He was not seeking a platonic ideal or even a bibliographic ideal; his goal was closer to the late 19th century concept of efficiency. It was a place for everything, and everything in its place, for the least time and money.

Dewey's constraints of an analog catalog, physical books on physical shelves, and a classification and index printed in book form forced the limited solution of just one place in the universe of knowledge for each book. Such a solution can hardly be expected to do justice to the complexity of the Works on those shelves. Today we have available to us technology that can analyze complex patterns, can find connections in datasets that are of a size way beyond human scale for analysis, and can provide visualizations of the findings.

Now that we have the technological means, we should give up the idea that there is an immutable thing that is the work for every creative expression. The solution then is to see work as a piece of information about a resource, a quality, and to allow a resource to be described with as many qualities of work as might be useful. Any resource can have the quality of the work as basic content, a story, a theme. It can be a work of fiction, a triumphal work, a romantic work. It can be always or sometimes part of a larger work, it can complement a work, or refute it. It can represent the philosophical thoughts of someone, or a scientific discovery. In FRBR, the work has authorship and intellectual content. That is precisely what I have described here. But what I have described is not based on a single set of rules, but is an open-ended description that can grow and change as time changes the emotional and informational context as the work is experienced.

I write this because we risk the petrification of the library if we embrace what I have heard called the "FRBR fundamentalist" view. In that view, there is only one definition of work (and of each other FRBR entity). Such a choice might have been necessary 50 or even 30 years ago. It definitely would have been necessary in Dewey's time. Today we can allow ourselves greater flexibility because the technology exists that can give us different views of the same data. Using the same data elements we can present as many interpretations of Work as we find useful. As we have seen recently with analyses of audio-visual materials, we cannot define work for non-book materials identically to that of books or other texts. [1] [2] Some types of materials, such as works of art, defy any separation between the abstraction and the item. Just where the line will fall between Work and everything else, as well as between Works themselves, is not something that we can pre-determine. Actually, we can, I suppose, and some would like to "make that so", but I defy such thinkers to explain just how such an uncreative approach will further new knowledge.

[1] Kara Van Malssen. BIBFRAME A-V modeling study
[2] Kelley McGrath. FRBR and Moving Images

Peter Murray: Thursday Threads: Sakai Reverberations, Ada Initiative Fundraising, Cost of Bandwidth

planet code4lib - Thu, 2014-09-11 10:43

Welcome to the latest edition of Thursday Threads. This week’s post has a continuation of the commentary about the Kuali Board’s decisions from last month. Next, news of a fundraising campaign by the Ada Initiative in support of women in technology fields. Lastly, an article that looks at the relative bulk bandwidth costs around the world.

Feel free to send this to others you think might be interested in the topics. If you find these threads interesting and useful, you might want to add the Thursday Threads RSS Feed to your feed reader or subscribe to e-mail delivery using the form to the right. If you would like a more raw and immediate version of these types of stories, watch my Pinboard bookmarks (or subscribe to its feed in your feed reader). Items posted there are also sent out as tweets; you can follow me on Twitter. Comments and tips, as always, are welcome.

Discussion about Sakai’s Shift Continues

The Kuali mission continues into its second decade. Technology is evolving to favor cloud-scale software platforms in an era of greater network bandwidth via fast Internet2 connections and shifting economics for higher education. The addition of a Professional Open Source organization that is funded with patient capital from university interests is again an innovation that blends elements to help create options for the success of colleges and universities.

- The more things change, the more they look the same… with additions, by Brad Wheeler, Kuali Blog, 27-Aug-2014

Yet many of the true believers in higher education’s Open Source Community, which seeks to reduce software costs and provide better e-Learning and administrative IT applications for colleges and universities, may feel that they have little reason to celebrate the tenth anniversaries of Sakai, an Open Source Learning Management System, and Kuali, a suite of mission-critical Open Source administrative applications, both of which launched in 2004. Indeed, for some Open Source evangelists and purists, this was probably a summer marked by major “disturbances in the force” of Open Source.

- Kuali Goes For Profits by Kenneth C. Green, 9-Sep-2014, Digital Tweed blog at Inside Higher Ed

The decision by the Kuali Foundation Board to fork the Kuali code under a different open source license and to use Kuali capital reserves to form a for-profit corporation continues to reverberate. (This was covered in last week’s DLTJ Thursday Threads and earlier in a separate DLTJ post.) In addition to the two articles above, I would encourage readers to look at Charles Severance’s “How to Achieve Vendor Lock-in with a Legit Open Source License – Affero GPL”. Kuali is forking its code, moving from the Educational Community License to the Affero GPL license, which it has the right to do. It also comes with some significant changes, as Kenneth Green points out. There is still more to this story, so expect it to be covered in additional Thursday Threads posts.

Ada Initiative, Supporting Women in Open Technology and Culture, Focuses Library Attention with a Fundraising Campaign

The Ada Initiative has my back. In the past several years they have been a transformative force in the open source software community and in the lives of women I know and care about. To show our support, Andromeda Yelton, Chris Bourg, Mark Matienzo and I have pledged to match up to $5120 of donations to the Ada Initiative made through this link before Tuesday September 16. That seems like a lot of money, right? Well, here’s my story about how the Ada Initiative helped me when I needed it most.

- The Ada Initiative Has My Back, by Bess Sadler, Solvitur Ambulando blog, 9-Sep-2014

The Ada Initiative does a lot to support women in open technology and culture communities; in the library technology community alone, many women have been affected by physical and emotional violence. (See the bottom of the campaign update blog post from Ada Initiative for links to the stories.) I believe it is only decent to enable anyone to participate in our communities without fear for their physical and psychic space, and that our communities are only as strong as they can be when the barriers to participation are low. The Ada Initiative is making a difference, and I’m proud to have supported them with a financial contribution as well as being an ally and an amplifier for the voices of women in technology.

The Relative Cost of Bandwidth Around the World

The chart above shows the relative cost of bandwidth, assuming a benchmark transit cost of $10 per Megabit per second (Mbps) per month in North America and Europe (which we know is higher than actual pricing; it’s just a benchmark). From CloudFlare

Over the last few months, there’s been increased attention on networks and how they interconnect. CloudFlare runs a large network that interconnects with many others around the world. From our vantage point, we have incredible visibility into global network operations. Given our unique situation, we thought it might be useful to explain how networks operate, and the relative costs of Internet connectivity in different parts of the world.

- The Relative Cost of Bandwidth Around the World, by Matthew Prince, CloudFlare Blog, 26-Aug-2014

Bandwidth is cheapest in Europe and most expensive in Australia? Who knew? CloudFlare published this piece showing their costs on most of the world’s continents, with some interesting thoughts about the role competition plays in the cost of bandwidth.


Code4Lib: code4libBC: November 27 and 28, 2014

planet code4lib - Thu, 2014-09-11 00:36

What: It’s a 2-day unconference in Vancouver, BC! A participant-driven meeting featuring lightning talks in the mornings and breakout sessions in the afternoons, with coffee, tea and snacks provided. Lightning talks are brief presentations, typically 5-10 minutes in length (15 minutes is the maximum), on topics related to library technologies: current projects, tips and tricks, or hacks in the works. Breakout sessions are an opportunity to bring participants together in an ad hoc fashion for a short yet sustained period of problem solving, software development and fun. In advance of the event, we will gather project ideas in a form available through our wiki and registration pages. Each afternoon the code4libBC participants will review and discuss the proposals, break into groups, and work on some of the projects.

When: November 27 and 28, 2014

Where: SFU Harbour Centre, Vancouver, BC map

Cost: $20

Accommodations: Info coming soon.

Register: here

Who: A diverse and open community of library developers and non-developers engaging in effective, collaborative problem-solving through technology. Anyone from the library community who is interested in library technologies is welcome to join and participate, regardless of their department or background: systems and IT, public services, circulation, cataloguing and technical services, archives, digitization and preservation. All are welcome to help set the agenda, define the outcomes and develop the deliverables!

Why: Why not? code4libBC is a group of dynamic library technology practitioners throughout the province who want to build new relationships as much as develop new software solutions to problems.

Tag d'hash: #c4lbc

More information

DuraSpace News: Contribute, and Learn More About the New Features in DSpace 5

planet code4lib - Thu, 2014-09-11 00:00
Peter Dietz, Longsight, on behalf of the DSpace 5.0 Release team

M. Ryan Hess: Is Apple Pay Really Private?

planet code4lib - Wed, 2014-09-10 20:03

Apple Pay, the new payment system unveiled by Apple yesterday, is an intriguing alternative to using debit and credit cards. But how private, and how secure, is this new payment system really going to be?

Tim Cook, Apple CEO, made it very clear that Apple intends to never collect data on you or what you purchase via Apple Pay. The service, in fact, adds a few new layers of security to transactions. But you have to wonder.

A typical pattern for data-collection business models is to promise robust privacy assurances in service agreements and marketing, even though the long-term strategy is to leverage that data for profit. Anyone who was with Facebook early on knows how quickly these terms can change.

So, when we’re assured that our purchases will remain wholly private and marketing firms will never have access to them, how can we really be confident that this will always remain the case? We can’t. So, as users, we should approach such services with skepticism.

As with anything related to personal data, we should assume that enterprising hackers or government agents can and will figure out a way to access and exploit our information. Just last week, celebrities using Apple’s iCloud had their accounts compromised and embarrassing photos were made public. And while Apple has done a pretty good job at securing Apple Pay, it’s still possible someone could figure out a way in…and then you’re not just dealing with incriminating photos, you’ve got your financial history exposed.

So ask yourself:

  1. Can you think of things you buy that could prove embarrassing or might give people with malign intent a way to blackmail you or do you financial damage?
  2. If my most embarrassing purchases were to become permanently public, can I live with that?
  3. How would such public exposure impact my reputation, professionally and personally?
  4. Does the convenience of purchasing something with my phone outweigh the risks to my financial security?

Depending on how you answer this, you may want to stick with your credit card.

Or just go the analog route and use the most anonymous medium of exchange: cash.


FOSS4Lib Upcoming Events: VIVO Project Hackathon at Cornell University

planet code4lib - Wed, 2014-09-10 18:57
Date: Monday, October 13, 2014 - 08:00 to Wednesday, October 15, 2014 - 17:00
Supports: VIVO

Last updated September 10, 2014. Created by Peter Murray on September 10, 2014.

The VIVO Project is hosting a hackathon event on the Cornell University campus in Ithaca, New York from October 13-15. This event builds on the March 2014 hackathon held in conjunction with the VIVO I-Fest at Duke University, and is open to anyone interested in actively participating in improving some aspect of the VIVO software, ontology, documentation, testing, or related applications and tools.
