Blogs and feeds of interest to the Code4Lib community, aggregated.


April 23, 2014

In the Library, With the Lead Pipe

Librarian, Heal Thyself: A Scholarly Communication Analysis of LIS Journals

In Brief

This article presents an analysis of 111 Library and Information Science journals based on measurements of “openness” including copyright policies, open access self-archiving policies and open access publishing options. We propose a new metric to rank journals, the J.O.I. Factor (Journal Openness Index), based on measures of openness rather than perceived rank or citation impact. Finally, the article calls for librarians and researchers in LIS to examine our scholarly literature and hold it to the principles and standards that we are asking of other disciplines. [Also available as an EPUB for reading on mobile devices, or as a PDF.]

konkiel tweet

Introduction

January 2014 saw the launch of the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3), the first major disciplinary or field-specific shift toward open access. A considerable number of journals and publishers are moving to embrace open access, exploring a variety of business models, but SCOAP3 represents a significant new partnership between libraries, publishers and researchers.1 Simply put, 10 journals under the SCOAP3 program were converted to open access overnight and are being supported financially by libraries paying article processing charges through a consortium rather than purchasing subscriptions. The physics field has been at the forefront of open access for more than 20 years, beginning with the foundation of arXiv.org and followed by the field’s premier society, the American Physical Society (APS), actively evolving its publications to provide efficient open access options for authors. There has yet to be any such movement in the professional literature of Library and Information Science (LIS), despite the fact that the library world is inextricably linked to “open access” both in principle and in practice. The authors note this disciplinary discrepancy, and through an analysis of LIS journals and professional literature hope to inspire those researching and publishing in the LIS field to take control of our professional research practices. We conducted this analysis by grading 111 select LIS journals using a metric we propose to call the “J.O.I Factor” (Journal Openness Index), judging “How Open Is It?” based on a simplified version of the open access spectrum proposed by the Public Library of Science (PLOS), the Scholarly Publishing and Academic Resources Coalition (SPARC), and the Open Access Scholarly Publishers Association (OASPA). It is our hope that doing so will lead to the shifts in the scholarly communication system that libraries are necessarily pursuing.2

Background

Scholarly publishing is evolving in many ways, as anyone connected to academia knows. Discussions about publishing often center on the potential that digital technology offers to disseminate the results of scholarly research, a role traditionally filled by scholarly associations, societies, university presses, and commercial publishers. Scholars and researchers at institutions ranging from Ivy League universities to state colleges are raising questions about how non-traditional “digital” scholarship will be evaluated, what criteria and credence should be given to new, openly accessible online journals, and what role open access repositories have in disseminating and preserving the scholarly record. Reaching even into public policy, the Office of Science and Technology Policy (OSTP) convened a Scholarly Publishing Roundtable in 2009. That group’s final report recommended that each federal research agency (National Science Foundation, National Endowment for the Humanities, etc.) should expeditiously develop and implement public access policies, offering free access to the results of federally funded research. In 2013, OSTP revisited that recommendation and, in response to an overwhelming petition, issued a directive to all federal funding agencies with more than $100 million in R&D funding to develop and implement public access policies, similar to the National Institutes of Health’s Public Access Policy, in effect since May 2, 2005.

Popular media are also taking up the question of how scholarly publishing will evolve. The Guardian regularly features pieces in its Higher Education Network calling for a redefinition of the publishing cycle that earns large publishing companies significant financial gains from the gift economy of intellectual content and peer review in which faculty participate. One opinion piece went so far as to say, “Academic publishers make Murdoch look like a socialist… down with the knowledge monopoly racketeers.” The Economist coined the term “Academic Spring” in a February 2012 piece, referring to faculty’s rising discontent with the current system. It cites the example of Timothy Gowers, an award-winning Cambridge mathematician who called for a boycott of Elsevier, a large STEM publisher, over its unsatisfactory business practices. As of April 22, 2014, that boycott, thecostofknowledge.com, had 14,602 signatories. Finally, US News and World Report published a piece in July 2012 that opened with Harvard University’s Faculty Advisory Council stating that “many large journal publishers have made the scholarly communication environment financially unsustainable and academically restrictive.”

Responding to these “tectonic shifts in publishing,” university libraries and academic librarians are undergirding a system that is shaky at best. Budgets remain flat while subscription costs continue to rise; all the while, many libraries are investing in staff and infrastructure in the area of scholarly communication, supporting open access initiatives, or moving directly into publishing themselves.3 While the primary push for changing this system has been to work through disciplinary faculty to shift research culture, academic librarians are slowly engaging with the idea that publishing practices within our own journals and professional writing could be an effective way to mold the future of academic publishing. The scope of this article is to engage our own community, librarians who publish in professional or academic literature, and to target pressure points in our subset of academic publishing that could be capitalized upon to push the whole system forward. We are approaching this topic with the goal of plainly sketching out what LIS publishing currently looks like in terms of scholarly communication practices: copyright assignment, journal policies for open access self-archiving, and open access publishing options.

Literature Review

Studies of this magnitude have been conducted in the recent past, although they have primarily focused on the attitudes of individual librarian authors toward publishing practices more than analyzing the publishing practices and policies of the journals themselves. Elaine Peterson, in 2006, produced an exploration of “Librarian Publishing Preferences and Open-Access Electronic Journals”, in which she conducts a brief survey. The results show that academic librarians often consider open access journals as a means of sharing their research but hold the same reservations about them as many other disciplines, i.e. concerns about peer review and valuation by administration in terms of promotion and tenure.4 This line of thought is continued in Snyder, Imre and Carter’s 2007 study, which focused more specifically on intellectual property concerns of academic librarian authors and allowable self-archiving practices. They quote Peter Suber, author of Open Access and director of Harvard’s Open Access Project, writing, “‘There is a serious problem [serials pricing and permission crisis], known best to librarians, and a beautiful solution [open access] within the reach of scholars.’ One can draw the conclusion from Suber’s statement that librarians as authors should be the most prominent supporters of open access and that, as scholars, they would practice self-archiving.”5  This study in particular lays an unsettling foundation: 50% of respondents cared mostly about publication without considering the copyright policies of the journals in which they published, and only 16% had exercised the right to self-archive in an institutional repository (ibid.). These and other similar studies highlight the simple fact that concerns about changing publishing habits are the same within librarianship as they are in many other disciplines.

College and Research Libraries (C&RL), a well-regarded journal for academic librarianship, published four articles between 2009 and 2013 that studied the publishing practices of academic librarians through surveys.6 Each has contributed valuable insights while reaching very similar conclusions across the board. Palmer, Dill and Christie conclude that in attitude, “Librarians are in favor of seeing their profession take some actions toward open access [...] yet this survey found that agreement with various open access–related concepts does not constitute actual action.”7 Mercer, focusing on the publishing and archiving behaviors of academic librarians rather than their attitudes, highlights the substantial differences between the two halves of the dual role many academic librarians inhabit: library professional first and academic researcher second. She writes, “…librarians may be risk takers in their professional roles, where they are actively encouraging changes in the system of scholarly communication and adoption of new technologies but are risk-averse as faculty in their roles as researchers and authors.”8 Taken together, the research could lead one to think that academic librarians are invested in changes to the scholarly publishing system about as little as disciplinary faculty and are just as cautious about evolving their own publishing habits.

Many academic authors write and publish out of passion for their research and to contribute to the progression of knowledge in society. Unfortunately, because of the system of measurement in which academia is mired, credentials, merit and perception can also play a substantial role in the publishing decisions of faculty. Without delving too deep into the discussion of tenure for librarians, the expectations for publishing in certain journals, or at all, are slightly different for librarians than other university faculty. Both the h-index and journal impact factor are measurements of supposed “impact,” based on the citations an article receives, which have in turn been equated with quality.9 The h-index is an impact measure for an individual, whereas impact factor applies to the journal level. Two recent studies follow Mercer’s line of argument and look at the journals in the LIS field, rather than the authors, using these two traditional measures of “impact.”
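For readers less familiar with these measures: an author’s h-index is the largest number h such that h of their papers have been cited at least h times each, while a journal’s impact factor averages recent citations to the journal’s articles. As a purely illustrative sketch (the citation counts below are invented, not drawn from any study discussed here), the h-index is easy to compute in R:

# Illustrative only: h-index from a vector of per-paper citation counts.
h_index <- function(citations) {
  citations <- sort(citations, decreasing = TRUE)
  sum(citations >= seq_along(citations))
}

h_index(c(10, 8, 5, 4, 3))  # 4: four papers have at least 4 citations each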

Jingfeng Xia conducted a fascinating study proposing that the h-index of authors published in a journal, as opposed to that journal’s impact factor, could provide an efficient method of ranking LIS journals, especially those that are open access and not listed in Journal Citation Reports. Xia’s article also underscores some of the complications that arise when lumping together all journals in the Library and Information Science field; Library and Information Science Research (LISR), a researcher-focused journal published by Elsevier (h-index = 21, impact factor = 1.4, not open access), is judged alongside D-Lib Magazine, published by the Corporation for National Research Initiatives (h-index = 33, impact factor = 0.7, open access), a journal aimed at the practice of digital librarianship. LISR’s impact factor (1.4) is high for LIS journals (median 0.74), but when compared to the h-index of D-Lib’s authors LISR seems to have less “impact.” Xia’s employment of the h-index as a measurement, illustrated in this example, shows the breadth and depth that alternative metrics may introduce, the complications of judging journal quality based on citations, and the potential inversion of perceived impact depending on how one looks at it.

Expanding on the idea that acknowledging the perceived quality of journals is a valuable practice within librarianship, Judith Nixon’s “Core Journals in Library and Information Science: Developing a Methodology for Ranking LIS Journals” was published in 2013 by C&RL. She proposes, based on successful practices at Purdue University Libraries, that “Top LIS journals can be identified and ranked into tiers by compiling journals that are peer-reviewed and highly rated by the experts, have low acceptance rates and high circulation rates, are journals that local faculty publish in, and have strong citation ratings as indicated by an ISI impact factor and a high h-index using Google Scholar data.”10 The production of a ranked list like this aligns perfectly with the type of study we performed, and our conclusions will highlight some similarities and differences between Nixon’s list and our findings, pitting the Journal Openness Index (J.O.I) Factor against the Top Tier journals she presents.

Whereas some of these studies in LIS publishing focused on the “people” angle, studying librarians and their attitudes and practices around publishing, we chose to follow more recent research and widen the lens to look at the journals in which librarians might publish. A challenge presents itself when broadening to this scale: there is the ever-present blurred line between the publishing habits of working librarians and those of teaching/research faculty in library schools and academic departments — Library and Information Science Research vs. D-Lib Magazine for example. There are obvious differences between these groups, so pairing analysis on the specific journals where professional librarians typically publish with the more specific studies on that same group’s publishing habits will present the most accurate portrait of the scholarly communication landscape as it has been studied to date. We leave the extension of this research for future study.

Methodology

Our live dataset is viewable on this Google Spreadsheet. Downloadable and citable data are accessible on figshare.

Journal Selection

The journals that we began with came from an internal list compiled as part of a professional development initiative at Florida State University Libraries. A student worker in the Assessment department compiled the original list of 74 journals, and then the co-authors of this piece expanded that list to 111 after consulting the LIS Publications Wiki. The journals were ingested into a spreadsheet with columns for impact factor, scope, instructions for authors, indexing information and other common details. Our first task was to add columns for copyright policy, open access archiving policy, and open access publishing options. Our journal list includes an extraordinarily broad range of journals, including research-focused journals and those in subfields of librarianship like archives and technical services. This decision was made so as to gather data from the broadest possible representation of LIS scholarship.

Data Collection

After compiling and organizing the journal list, we collected each journal’s standard policies on copyright assignment, open access self-archiving (“green open access”), and open access publishing (“gold open access”). We began gathering these data by searching the SHERPA/RoMEO database for commercial journals and the Directory of Open Access Journals (DOAJ) for open access journals. After searching these databases, we double-checked policies and open access options on the journal and/or publisher’s website using the following workflow:

1. Locate the policies section of the website, commonly labeled “Policies,” “Policies and Guidelines,” “Author’s Rights,” or “Author’s Guidelines.”
2. Identify the copyright policy of the journal.
3. Identify the open access self-archiving policy, or “green open access” options, that the journal permits.
4. Identify the open access publishing or “gold open access” options of the journal, which may be listed in the policies section or a specific “Open Access Options” section.
5. View the copyright transfer agreement or other author agreement, if available.

All details were entered into the spreadsheet and coded for consistency.

J.O.I Factor (Journal Openness Index)

Grading journals based on how “open” they are, as opposed to citation impact or h-index, is a novel approach, and one that had not been applied to LIS literature to our knowledge. In fact, it is not clear that this measurement has been used extensively in any field or practice aside from the production of the spectrum and some supporting documentation by PLOS, SPARC, and OASPA. Potentially then, as further research is done using the J.O.I Factor, the grades we apply to journals herein may be different based on how many measures of openness are used and how they are counted. Our proposed enumeration of the J.O.I Factor is indicated on the image below, superimposed over the “How Open Is It?” scale produced by SPARC/PLOS. The application of J.O.I Factors to specific journals is contained to our Conclusion section for purposes of clarity and emphasis.

Journal Openness Index

Our proposed Journal Openness Index, adding numerical values to PLOS/SPARC’s How Open Is It spectrum.

The original spectrum breaks openness down into six categories, three of which overlap neatly with the criteria we used in our analysis: 1) Copyrights, 2) Reuse Rights, and 3) Author Posting Rights. The remaining categories (Reader Rights, Automatic Posting, and Machine Readability) were mostly ancillary to our focus, and so the J.O.I Factor numbers that we apply account only for the three criteria we researched. The “Reader Rights” category does include some details about embargoes, but it typically refers to embargoes the publisher places on the final published PDF, which is released only after that term expires. Our use of the embargo data point was in terms of Author Posting Rights, so we chose not to include Reader Rights as a category in our J.O.I Factor calculations.

Also, the spectrum lumps open access publishing options, another of our data points, in with Reader Rights as “immediate access to some, but not all, articles (including the ‘hybrid’ model)” — “hybrid” meaning the business model where articles can be made open access on a one-by-one basis for a fee. We decided to add a “-” for journals that offer open access publishing for a fee, illustrating the negative connotation that might have for authors. Journals that are fully open access without any publishing fees have a J.O.I number and a “+” illustrating positive connotations. Information Technology and Libraries, for example, published by the Library and Information Technology Association/ALA, would have a J.O.I Factor of 12+: four points for author retention of copyrights, four points for broad reuse rights (CC-BY), four points for the author being allowed to post any version of the article in a repository, and a “+” for the journal being fully open access without imposing any publication fees.
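For readers who want to apply the index to our data, the arithmetic is simple enough to script. The following R sketch is purely illustrative and uses hypothetical argument names (three scores of 0 to 4 for copyright, reuse and posting rights, plus a suffix of “+”, “-”, or nothing for the journal’s open access publishing model); it is not part of our published dataset or methodology.

# Illustrative sketch: combine three 0-4 openness scores and an OA suffix
# into a J.O.I Factor string. Argument names are hypothetical.
joi_factor <- function(copyright_score, reuse_score, posting_score, oa_suffix = "") {
  paste0(copyright_score + reuse_score + posting_score, oa_suffix)
}

joi_factor(4, 4, 4, "+")  # "12+" e.g. a fee-free, fully open access journal like ITAL
joi_factor(1, 0, 1, "-")  # "2-"  copyright transfer, no reuse rights, paid hybrid option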

We hope that the application of the J.O.I Factor in this article serves merely as a proof of concept, and we invite colleagues to use our data, apply J.O.I Factors to all the journals we listed there, and extend this work to account for the full range of possible factors of openness.

Data Analysis

The most common major publishers in our sample were Taylor & Francis (25 journals), Emerald (12 journals), and Elsevier (8 journals). Society and association publishers followed closely with 23 journals, and universities and university presses had 18. The remainder were either unknown, other types of organizations, smaller publishing houses, or “self-affiliated.” The three clearly self-affiliated journals (First Monday, Code4Lib Journal, and In the Library with the Lead Pipe) are all fully open access but have a range of difference in their copyright policies, illustrating the variety of publishing options within the LIS field.11

Each journal was assigned a corresponding code for its copyright, open access archiving, and open access publication policies. These codes were used primarily for organizing the information in our spreadsheet and should not be conflated with our proposed J.O.I Factors, which were applied after all data were collected, organized, and analyzed. The codes represent the range of possible options under each category, based on the variety of options we identified in the journals we reviewed. For example, the Copyright field could range from (1) required full transfer of copyright to (4) copyright jointly shared between author and publisher. Self-archiving policies ranged from Not Permitted (0) to allowing the final published PDF (6), with a range of embargo periods for the categories in between. (See Table 1 for all codes.)

Table 1: Data Codes

Table 1: Journal policy codes as applied to our data.

Copyright Policies

Despite librarianship’s ongoing waltz with copyright complications, 43 of the LIS journals we reviewed still require the author to transfer all copyrights to the publisher, “during the full term of copyright and any extensions or renewals… including but not limited to the right to publish, republish, transmit, sell, distribute and otherwise use the [article] in whole or in part… in derivative works throughout the world, in all languages and in all media of expression now known or later developed” (emphasis our own).12 However, leaning toward a more expansive rights agreement, 61 journals allow the author to retain copyright, 38 of which require that a License to Publish be granted to the publisher.13 Of those 38, 21 are Taylor & Francis journals, which fall under the publisher’s new author rights for LIS journals. Taylor & Francis shows leadership in adapting its rights agreements for LIS journals, although one co-author of this article sought to push them further, with success. The remaining 23 journals that allow the author to retain copyright also offer to publish the article under a Creative Commons (CC) license, ranging from Attribution-NonCommercial-NoDerivatives (Collaborative Librarianship) to Public Domain (First Monday). The boldest and most progressive copyright policy goes to First Monday, which offers total author choice, from copyright transfer (©), through every possible Creative Commons license, to releasing the work into the public domain (CC0).

Open Access Self-Archiving Policies (Green open access)

This category provided the broadest range of possibilities, mostly due to the fact that different publishers assign different terms of embargo for self-archiving. Assuming that well-informed LIS authors who submit to these journals desire the simplest and broadest open access options, 24 journals allow the pre-print (submitted version), post-print (accepted version) and final published PDF to be archived in an open access institutional repository, with no stated embargoes. 22 of these 24 are fully open access journals, and they are all published by societies, associations, universities or self-affiliated groups. Common thought in academic publishing tends to say that society/association publishers lose the most when going open access; it is heartening to see this is absolutely untrue in LIS literature. The strictest embargo on self-archiving in an institutional repository is 18 months, for 10 of the Taylor & Francis journals. University of Texas Press and University of Chicago Press both allow archiving after 12 months, while, ironically given the topic of the journal, the Journal of Scholarly Publishing, published by University of Toronto Press, only allows archiving of the pre-print, with no stated policy for post-prints.

An important point to consider when discussing self-archiving policies is the farce that they truly are. Kevin Smith, Duke University’s Scholarly Communication Officer, stated it most plainly in his February 5 blog post titled It’s the Content, Not the Version! He writes,

…this notion of versions is, at least in part, an artificial construction that publishers use to assert control while also giving the appearance of generosity in their licensing back to authors of very limited rights to use earlier versions.  The versions are artificially based on steps in the publication permission process (before submission, peer-review, submission, publication), not on anything intrinsic to the work itself that would justify a change in copyright status.14

The practice of self-archiving is totally dependent on copyright transfer agreements, and based on the representative sample of LIS journals we reviewed, all but 8% had direct or implied policies regarding what the author is allowed to do with specific versions of the same work. The author’s false sense of control over their work, and the publisher’s exploitation of that sense, deserves a study of its own. Suffice it to say that if the field of Library and Information Studies considers a green open access policy a good deal, there is much work to be done.

Open Access Publishing Policies (Gold open access)

A common misconception about achieving open access is that it always requires a fee on the part of the author. While this is mostly true for traditional commercial publishers attempting to retain their income stream while “acquiescing” to the desires of their authors, it is broadly false, as our analysis shows. 56 journals offer open access on an article-by-article basis and require an article processing charge (APC) ranging from $300 to $3,000. 52 of these are published by commercial publishers (Elsevier, Sage, Springer, Wiley, Taylor & Francis, and Emerald). In stark contrast, 35 journals on our list are fully open access and publish all articles without a fee. A significant number, 20 journals, either do not offer a “gold open access” publication option or do not publicize it. Of those 20 journals, 6 are published by university presses and 7 by associations or societies.

Open Access LIS Journals

As noted above, within these LIS journals there is considerable diversity in policies. We wanted to further explore that depth of difference by looking specifically at the fully open access journals in our sample. This section reiterates some of the analyses from previous sections, but we thought it still important to enumerate the complexities of publishing within this subset of a subset. 38 of the 111 journals that we looked at are open access, and only two (The International Journal of Library Science and IFLA Journal) have a publication charge, $300 and $1,500 respectively. While two of the 38 open access journals require a full copyright transfer (International Journal of Library Science and Student Research Journal), a little more than half of them (21) allow the author to keep copyright AND attach a Creative Commons license to the work.15 27 of these fully open access journals allow the author to deposit the final published PDF in a repository, meaning that 11 fully open access journals either place some restrictions on the reuse of open access content or have poorly defined reuse policies.

Even though these are open access journals, the data suggest that what qualifies as “open access” even within our own field is still loosely defined, a point we attempt to illustrate by applying the J.O.I. Factor at the close of this article. Some might argue that any restriction of authors’ rights (copyright) and readers’ rights (reuse via licenses) falls short of pure open access. Emily Drabinski, a reviewer of this article, made the salient point that the policies we discuss as needing to change are under the purview of journal editorial boards, who are often in the complicated position of sitting between authors (colleagues) and publishers. To that end, we encourage journal editors as well as authors to lead by taking action. Regardless, as the measures of openness are more effectively discussed within our communities of practice, the LIS field is making slow progress toward public access (readability) and open access (re-usability), a trend we expect to broaden and deepen.

Conclusions

This article illustrates something with which every researcher in the field of Library and Information Studies must contend. A significant percentage of our professional literature is still owned and controlled by commercial publishers whose role in scholarly communication is to maintain “the scholarly record,” yes, but also to generate profits at the expense of library budgets by selling our intellectual property back to us. Conversely, there is much to be proud of, including the many association, society and university-sponsored journals that are well-respected and proving important points about the viability of open access as a business model, a dissemination mechanism, and a principle to which librarians hold — our “free to all” heritage. It is our hope that this article inspires the activism whose absence the articles in our literature review identified as a disturbing discrepancy in our professional practice. Simply, this is our call for librarians to practice what we preach, regardless of, or even in the face of, tenure and promotion “requirements,” long-held professional norms, and the unnecessary fear, uncertainty and doubt that control academic publishing. We already have models for activism on the collections side of our work; we call our colleagues to echo those impulses on the production side of scholarship, as editors, authors, bloggers, library publishers, and consumers of research.

There are three practical means of seeding this change: 1) exercise the right to self-archive every piece of scholarship published in LIS journals, or better yet never give those rights away in the first place; 2) move the “prestige” to open access, meaning offering the best work to journals that are invested in a more benevolent scholarly communication system; and 3) as editors, work diligently to adapt the policies and procedures for the journals we control to align with our professional principles of access, expansive understanding of copyrights, fair use, and broad reusability.

Returning to “Nixon’s list,” which proposed a possible ranking system for LIS journals, it is interesting to grade her list in terms of the “openness” criteria we’ve employed in this article, and in light of the practical actions we propose. Nixon’s findings present 18 journals that were determined to be the “Tier One” journals, based on the criteria she and her colleagues developed.16 11 of those 18 were also identified as top LIS journals from her literature review. Table 2 shows those 11 “prestige” journals, as graded by our applied J.O.I Factor.

Table 2: JOI of Tier 1 LIS Journals

Table 2: Tier One LIS Journals graded by J.O.I Factor

The results are striking. College and Research Libraries, widely regarded as a top journal for practicing librarians, received a J.O.I Factor of 9+, whereas Information Technology and Libraries (ITAL) measures at 12+, all because of ITAL’s generous Reuse Rights policy (CC-BY). JASIST is tied for last place (J.O.I Factor 2-) with Elsevier and Emerald journals because of copyright transfer requirements, no reuse rights and middling author posting allowances. Library Trends and Library Quarterly (university press journals) sit solidly in the middle, entirely due to their author posting policies which allow posting the Publisher’s PDF.

Based on this, in closing, we submit these final questions to the LIS research community: are these the journals we want on a top tier list, and what measure of openness will we define as acceptable for our prestigious journals? Further, how long will we tolerate measurements like impact factor and h-index guiding our criteria for advancement, while accounting for very little that matters to the principles we hold for ourselves and our work? Finally, has the time come and gone for LIS to lead the shifts in scholarly communication? It is our hope that this article prompts furious and fair debate, but mostly that it produces real, substantive evolution within our profession: how we research, how we assign value to scholarship, and how we share the products of our intellectual work.

Our thanks and gratitude go to Emily Drabinski for her thoughtful, helpful and engaging comments as the external reviewer of this article. Thanks also to Lead Pipe colleagues and editors, Ellie, Erin, and Hugh, for challenging our ideas, correcting our bad grammar and making this lump of coal into a diamond. Most of all, thanks to Brett for proposing the term “Journal Openness Index” to replace our uncreative and weird-sounding original concept.

Bibliography

Peterson, E. (2006) Librarian Publishing Preferences and Open-Access Electronic Journals. Electronic Journal of Academic and Special Librarianship, 7(2). Accessible at http://southernlibrarianship.icaap.org/content/v07n02/peterson_e01.htm

Carter, H., Carolyn Snyder, and Andrea Imre. (2007) “Library Faculty Publishing and Self-Archiving: A Survey of Attitudes and Awareness.” portal: Libraries and the Academy, 7(1). Open access version at http://opensiuc.lib.siu.edu/morris_articles/1/

Palmer, K., Emily Dill, and Charlene Christie. (2009) “Where There’s a Will There’s a Way?: Survey of Academic Librarian Attitudes about Open Access.” College and Research Libraries, 70. Accessible at http://crl.acrl.org/content/70/4/315.full.pdf+html

Mercer, H. (2011) Almost Halfway There: An Analysis of the Open Access Behaviors of Academic Librarians. College and Research Libraries, 72. Accessible at http://crl.acrl.org/content/72/5/443.full.pdf+html

Nixon, J. (2014) Core Journals in Library and Information Science: Developing a Methodology for Ranking LIS Journals. College and Research Libraries, 75. Accessible at http://crl.acrl.org/content/75/1/66.full.pdf+html

Smith, K. (2014) It’s the Content, Not the Version! Scholarly Communications @ Duke [blog], posted on February 5. Accessible at http://blogs.library.duke.edu/scholcomm/2014/02/05/its-the-content-not-the-version/

Data

Vandegrift, Micah; Bowley, Chealsye (2014): LIS Journals measured for “openness.” http://dx.doi.org/10.6084/m9.figshare.994258

 

Other readings

Malenfant, K. J. (2010) Leading Change in the System of Scholarly Communication: A Case Study of Engaging Liaison Librarians for Outreach to Faculty. College & Research Libraries, 71. Accessible at http://crl.acrl.org/content/71/1/63.full.pdf+html

Sugimoto, C. R., Tsou, A., Naslund, S., Hauser, A., Brandon, M., Winter, D., … Finlay, S. C. (2012) Beyond gatekeepers of knowledge: Scholarly communication practices of academic librarians and archivists at ARL institutions. College & Research Libraries, 75. Accessible at http://crl.acrl.org/content/75/2/145.full.pdf+html

Xia, J. (2012) Positioning Open Access Journals in a LIS Journal Ranking. College & Research Libraries, 73. Accessible at http://crl.acrl.org/content/73/2/134.full.pdf+html

Henry, D. and Tina M. Neville. (2004)  Research, Publication, and Service Patterns of Florida Academic Librarians. The Journal of Academic Librarianship, 30. Open access version at http://hdl.handle.net/10806/200. Published version at http://dx.doi.org/10.1016/j.acalib.2004.07.006

Joswick, K. (1999) Article Publication Patterns of Academic Librarians: An Illinois Case Study. College & Research Libraries, 60. Accessible at http://crl.acrl.org/content/60/4/340.full.pdf+html

Hart, R. (1999) Scholarly Publication by University Librarians: A Study at Penn State. College & Research Libraries, 60. Accessible at http://crl.acrl.org/content/60/5/454.full.pdf+html

Wiberley, S., Jr., Julie M. Hurd, and Ann C. Weller (2006) Publication Patterns of U.S. Academic Librarians from 1998 to 2002. College & Research Libraries, 67. Accessible at http://crl.acrl.org/content/67/3/205.full.pdf+html

Harley, D.; Acord, Sophia Krzys; Earl-Novell, Sarah; Lawrence, Shannon; & King, C. Judson. (2010). Assessing the Future Landscape of Scholarly Communication: An Exploration of Faculty Values and Needs in Seven Disciplines. UC Berkeley: Center for Studies in Higher Education. Accessible at  http://www.escholarship.org/uc/item/15x7385g

Frass, W. Jo Cross, and Victoria Gardener (2013) Taylor and Francis Open Access Survey – Supplement 1-8 Data Breakdown by Subject Area. Accessible at http://www.tandfonline.com/page/openaccess/opensurvey

Priego, E. (2012) Fieldwork: Mentions of Library Science Journals Online. Accessible at http://www.altmetric.com/blog/fieldwork-mentions-library-journals-online/

The Price of Information (Feb. 2012) http://www.economist.com/node/21545974

A (free) roundup of content on the Academic Spring (April 2012) http://www.guardian.co.uk/higher-education-network/blog/2012/apr/12/blogs-on-the-academic-spring

Academic Publishers make Murdoch look like a Socialist (Aug. 2011) http://www.guardian.co.uk/commentisfree/2011/aug/29/academic-publishers-murdoch-socialist

Is the Academic Publishing Industry on the Verge of Disruption? (July 2012) http://www.usnews.com/news/articles/2012/07/23/is-the-academic-publishing-industry-on-the-verge-of-disruption

 

  1. See Open Access Directory “Journals that converted from Toll Access to Open Access.” Accessible at http://oad.simmons.edu/oadwiki/Journals_that_converted_from_TA_to_OA
  2. Bolick, J. (2014). “We need a scale to measure the #scholcomm friendliness of a journal: based on @SPARC_NA and @PLOS #howopenisit: HOII factor?” Tweet available at https://twitter.com/joshbolick/status/453586422004744193
  3. Information pulled from Library Journal’s annual Periodicals Price Survey, accessible at http://lj.libraryjournal.com/2013/04/publishing/the-winds-of-change-periodicals-price-survey-2013/
  4. Peterson, E. (2006) Librarian Publishing Preferences and Open-Access Electronic Journals. Electronic Journal of Academic and Special Librarianship, 7(2). Accessible at http://southernlibrarianship.icaap.org/content/v07n02/peterson_e01.htm
  5. Carter, H., Carolyn Snyder, and Andrea Imre. (2007) “Library Faculty Publishing and Self-Archiving: A Survey of Attitudes and Awareness.” portal: Libraries and the Academy, 7(1). Open access version at http://opensiuc.lib.siu.edu/morris_articles/1/
  6. Additionally, its partner newsletter, College and Research Libraries News, has run a column dedicated to scholarly communication since 2000.
  7. Palmer, K., Emily Dill, and Charlene Christie. (2009) “Where There’s a Will There’s a Way?: Survey of Academic Librarian Attitudes about Open Access.” College and Research Libraries, 70. Accessible at http://crl.acrl.org/content/70/4/315.full.pdf+html
  8. Mercer, H. (2011) Almost Halfway There: An Analysis of the Open Access Behaviors of Academic Librarians. College and Research Libraries, 72. Accessible at http://crl.acrl.org/content/72/5/443.full.pdf+html
  9. For more on the issues associated with impact factor, see this editorial from Nature: http://blogs.nature.com/news/2013/05/scientists-join-journal-editors-to-fight-impact-factor-abuse.html. A response to these issues is the recent push toward altmetrics (alternative metrics), measuring many other forms of impact beyond simply citations – http://altmetrics.org/manifesto/.
  10. Nixon, J. (2014) Core Journals in Library and Information Science: Developing a Methodology for Ranking LIS Journals. College and Research Libraries, 75. Accessible at http://crl.acrl.org/content/75/1/66.full.pdf+html
  11. Our use of “fully open access” throughout this article means published online, freely accessible to anyone with an internet connection, with broad copyright and reuse options for authors and readers respectively.
  12. Language pulled from a standard Wiley agreement, although only one LIS journal we reviewed is published by Wiley. Elsevier would be a better example agreement, but curiously Elsevier’s copyright transfer agreements are nearly impossible to find on the web anymore.
  13. Further study should be dedicated to determining if these “licenses to publish” are exclusive, non-exclusive or have other clauses that render them less effective than they seem.
  14. Smith, K. (2014) It’s the Content, Not the Version! Scholarly Communications @ Duke [blog], posted on February 5. Accessible at http://blogs.library.duke.edu/scholcomm/2014/02/05/its-the-content-not-the-version/
  15. CC-BY is the most common license for these 21 open access journals.
  16. “Top LIS journals can be identified and ranked into tiers by compiling journals that are peer-reviewed and highly rated by the experts, have low acceptance rates and high circulation rates, are journals that local faculty publish in, and have strong citation ratings as indicated by an ISI impact factor and a high h-index using Google Scholar data.” Nixon, J. (2013), http://crl.acrl.org/content/early/2012/07/23/crl12-387.full.pdf

by Micah Vandegrift at April 23, 2014 10:00 AM

April 22, 2014

Denton, William

Better ways of using R on LibStats (2): durations

(In the previous post, Better ways of using R on LibStats (1), I explain the background for this reference desk statistics analysis with R, and I set up the data I use. This follows on, showing another example of how I figured out how to do something more cleanly and quickly.)

In Ref desk 4: Calculating hours of interactions (from almost exactly two years ago) I explained in laborious detail how I calculated the total hours of interaction at the reference desks. I quote myself:

Another fact we record about each reference desk interaction is its duration, which in our libstats data frame is in the time.spent column. As I explained in Ref Desk 1: LibStats these are the options:

We can use this information to estimate the total amount of time we spend working with people at the desk: it’s just a matter of multiplying the number of interactions by their duration. Except we don’t know the exact length of each duration, we only know it with some error bars: if we say an interaction took 5-10 minutes then it could have taken 5, 6, 7, 8, 9, or 10 minutes. 10 is 100% more than 5: relatively that’s a pretty big range. (Of course, mathematically it makes no sense to have a 5-10 minute range and a 10-20 minute range, because if something took exactly 10 minutes it could go in either category.)

Let’s make some generous estimates about a single number we can assign to the duration of reference desk interactions.

Duration Estimate
NA 0 minutes
0-1 minute 1 minute
1-5 minutes 5 minutes
5-10 minutes 10 minutes
10-20 minutes 15 minutes
20-30 minutes 25 minutes
30-60 minutes 40 minutes
60+ minutes 65 minutes

This means that if we have 10 transactions of duration 1-5 minutes we’ll call it 10 * 5 = 50 minutes total. If we have 10 transactions of duration 20-30 minutes we’ll call it a 10 * 25 = 250 minutes total. These estimates are arguable but I think they’re good enough. They’re on the generous side for the shorter durations, which make up most of the interactions.

To do all those calculations I made a function, then a data frame of sums, then I loop through all the library branches, build up a new data frame for each by applying the function to the sums, then put all those data frames together into a new one. Ugly! And bad!

When I went back to the problem and tackled it with dplyr I realized I’d made a mistake right off the bat back then: I shouldn’t have added up the number of “20-30 minute” durations (e.g. 10) and then multiplied by 25 to get 250 minutes total. It’s much easier to use the time.spent column in the big data frame to generate a new column of estimated durations and then add those up. For example, in each row that has a time.spent of “20-30 minutes” put 25 in the est.duration column, then later add up all those 25s. Doing it this way means only ever having to deal with vectors, and R is great at that.

Here’s the data I’m interested in. I want to have a new est.duration column with numbers in it.

> head(subset(l, select=c("day", "question.type", "time.spent", "library.name")))
         day                  question.type    time.spent library.name
1 2011-02-01              4. Strategy-Based  5-10 minutes        Scott
2 2011-02-01              4. Strategy-Based 10-20 minutes        Scott
3 2011-02-01              4. Strategy-Based  5-10 minutes        Scott
4 2011-02-01  3. Skill-Based: Non-Technical  5-10 minutes        Scott
5 2011-02-01              4. Strategy-Based  5-10 minutes        Scott
6 2011-02-01              4. Strategy-Based  5-10 minutes        Scott

I’ll do it with these two vectors and the match command, which the documentation says “returns a vector of the positions of (first) matches of its first argument in its second.” Here I set them up and show an example of using them to convert the words to an estimated number.

> possible.durations <- c("0-1 minute", "1-5 minutes", "5-10 minutes", "10-20 minutes", "20-30 minutes", "30-60 minutes", "60+ minutes")
> duration.times <- c(1, 4, 8, 15, 25, 40, 65)
> match("20-30 minutes", possible.durations)
[1] 5
> duration.times[5]
[1] 25
> duration.times[match("20-30 minutes", possible.durations)]
[1] 25

That’s how to do it for one line, and thanks to the way R works, if we say we want this to be done on a column, it will do the right thing.

> l$est.duration <- duration.times[match(l$time.spent, possible.durations)]
> head(subset(l, select=c("day", "question.type", "time.spent", "library.name", "est.duration")))
         day                  question.type    time.spent library.name est.duration
1 2011-02-01              4. Strategy-Based  5-10 minutes        Scott            8
2 2011-02-01              4. Strategy-Based 10-20 minutes        Scott           15
3 2011-02-01              4. Strategy-Based  5-10 minutes        Scott            8
4 2011-02-01  3. Skill-Based: Non-Technical  5-10 minutes        Scott            8
5 2011-02-01              4. Strategy-Based  5-10 minutes        Scott            8
6 2011-02-01              4. Strategy-Based  5-10 minutes        Scott            8

Now with dplyr it’s easy to make a new data frame that lists, for each month, how many ref desk interactions happened and an estimate of their total duration. First I’ll take a fresh sample so I can use the est.duration column.

> l.sample <- l[sample(nrow(l), 10000),]
> sample.durations.pm <- l.sample %.% group_by(library.name, month) %.% summarise(minutes = sum(est.duration, na.rm =TRUE), count=n())
> sample.durations.pm
Source: local data frame [274 x 4]
Groups: library.name

   library.name      month minutes count
1           ASC 2011-09-01      77     7
2           ASC 2011-10-01      66     2
3           ASC 2011-11-01      13     7
4           ASC 2012-01-01      41     3
5           ASC 2012-02-01      11     5
6           ASC 2012-03-01       1     1
7           ASC 2012-04-01       4     1
8           ASC 2012-05-01      23     3
9           ASC 2012-06-01       8     2
10          ASC 2012-07-01       4     1
..          ...        ...     ...   ...
> ggplot(sample.durations.pm, aes(x=month, y=minutes/60)) + geom_bar(stat="identity") + facet_grid(library.name ~ .) + labs(x="", y="Hours", title="Estimated total interaction time (based on a small sample only)")

The count column is made the same way as last time, and the minutes column uses the sum function to add up all the durations in each grouping of the data. (na.rm = TRUE removes any NA values before adding; without that R would say 5 + NA = NA.)
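To see that NA behaviour at the console:

> sum(c(5, NA))
[1] NA
> sum(c(5, NA), na.rm = TRUE)
[1] 5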

First chart of durations

So easy compared to all the confusing stuff I was doing before.

Finally, finding the average duration is just a matter of dividing (mutate is from dplyr):

> sample.durations.pm <- mutate(sample.durations.pm, average.length = minutes/count)
> sample.durations.pm
Source: local data frame [274 x 5]
Groups: library.name

   library.name      month minutes count average.length
1           ASC 2011-09-01      77     7      11.000000
2           ASC 2011-10-01      66     2      33.000000
3           ASC 2011-11-01      13     7       1.857143
4           ASC 2012-01-01      41     3      13.666667
5           ASC 2012-02-01      11     5       2.200000
6           ASC 2012-03-01       1     1       1.000000
7           ASC 2012-04-01       4     1       4.000000
8           ASC 2012-05-01      23     3       7.666667
9           ASC 2012-06-01       8     2       4.000000
10          ASC 2012-07-01       4     1       4.000000
..          ...        ...     ...   ...            ...
> ggplot(sample.durations.pm, aes(x=month, y=average.length)) + geom_bar(stat="identity") + facet_grid(library.name ~ .) + labs(x="", y="Minutes", title="Estimated average interaction time (based on a small sample only)")

Second chart of durations

Don’t take those numbers as reflecting the actual real activity going on at YUL. It’s just a sample, and it conflates all kinds of questions, from directional (“where’s the bathroom”), which take 0-1 minutes, to specialized (generally the deep and time-consuming upper-year, grad and faculty questions, or ones requiring specialized subject knowledge), which can take hours. Include the usual warnings about data gathering, analysis, visualization, interpretation, problem(at)ization, etc.

April 22, 2014 07:36 PM

Better ways of using R on LibStats (1)

A couple of years ago I wrote some R scripts to analyze the reference desk statistics that we keep at York University Libraries with LibStats. I wrote five posts here about what I found; the last one, Ref desk 5: Fifteen minutes for under one per cent, links to the other four.

Those scripts did their job, but they were ugly, and there were some more things I wanted to do. Because of my recent Ubuntu upgrade, I’m running R version 3.0.2 now, which means I can use the new dplyr package by R wizard Hadley Wickham and others. (It doesn’t work on 3.0.1.) The vignette for dplyr has lots of examples, and I’ve been seeing great posts about it, and I was eager to try it. So I’m going back to the old work and refreshing it and figuring out how to do what I wanted to do in 2012—or couldn’t because we only had one year of data; now that we have four, year-to-year comparisons are interesting.

This first post is about how I used to do things in an ugly and slow way, and how to do them faster and better.

I begin with a CSV file containing a slightly munged and cleaned dump of all the information from LibStats.

$ head libstats.csv
timestamp,question.type,question.format,time.spent,library.name,location.name,initials
02/01/2011 09:20:11 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 09:43:09 AM,4. Strategy-Based,In-person,10-20 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:00:56 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:05:05 AM,3. Skill-Based: Non-Technical,Phone,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:17:20 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:30:07 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 10:54:41 AM,4. Strategy-Based,In-person,5-10 minutes,Scott,Drop-in Desk,AA
02/01/2011 11:08:00 AM,4. Strategy-Based,In-person,10-20 minutes,Scott,Drop-in Desk,AA
02/01/2011 11:32:00 AM,3. Skill-Based: Non-Technical,In-person,10-20 minutes,Scott,Drop-in Desk,AA

I read the CSV file into a data frame, then fix a couple of things. The date is a string and needs to be turned into a Date, and I use a nice function from lubridate to find the floor of the date, which aggregates everything to the month it’s in.

> l <- read.csv("libstats.csv")
> library(lubridate)
> l$day <- as.Date(l$timestamp, format="%m/%d/%Y %r")
> l$month <- floor_date(l$day, "month")
> str(l)
'data.frame': 187944 obs. of  9 variables:
 $ timestamp      : chr  "02/01/2011 09:20:11 AM" "02/01/2011 09:43:09 AM" "02/01/2011 10:00:56 AM" "02/01/2011 10:05:05 AM" ...
 $ question.type  : chr  "4. Strategy-Based" "4. Strategy-Based" "4. Strategy-Based" "3. Skill-Based: Non-Technical" ...
 $ question.format: chr  "In-person" "In-person" "In-person" "Phone" ...
 $ time.spent     : chr  "5-10 minutes" "10-20 minutes" "5-10 minutes" "5-10 minutes" ...
 $ library.name   : chr  "Scott" "Scott" "Scott" "Scott" ...
 $ location.name  : chr  "Drop-in Desk" "Drop-in Desk" "Drop-in Desk" "Drop-in Desk" ...
 $ initials       : chr  "AA" "AA" "AA" "AA" ...
 $ day            : Date, format: "2011-02-01" "2011-02-01" "2011-02-01" "2011-02-01" ...
 $ month          : Date, format: "2011-02-01" "2011-02-01" "2011-02-01" "2011-02-01" ...
> head(l)
               timestamp                 question.type question.format    time.spent library.name location.name initials
1 02/01/2011 09:20:11 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
2 02/01/2011 09:43:09 AM             4. Strategy-Based       In-person 10-20 minutes        Scott  Drop-in Desk       AA
3 02/01/2011 10:00:56 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
4 02/01/2011 10:05:05 AM 3. Skill-Based: Non-Technical           Phone  5-10 minutes        Scott  Drop-in Desk       AA
5 02/01/2011 10:17:20 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA
6 02/01/2011 10:30:07 AM             4. Strategy-Based       In-person  5-10 minutes        Scott  Drop-in Desk       AA

The columns are timestamp, question.type, question.format, time.spent, library.name, location.name and initials, plus the day and month columns I just added.

Now I have these fields in the data frame that I will use:

> head(subset(l, select=c("day", "month", "question.type", "time.spent", "library.name")))
         day      month                 question.type    time.spent library.name
1 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
2 2011-02-01 2011-02-01             4. Strategy-Based 10-20 minutes        Scott
3 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
4 2011-02-01 2011-02-01 3. Skill-Based: Non-Technical  5-10 minutes        Scott
5 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott
6 2011-02-01 2011-02-01             4. Strategy-Based  5-10 minutes        Scott

But I’m going to take just a sample of all of this data, because this is for illustrative purposes, not real analysis. Let’s grab 10,000 random entries from this data frame and put that into l.sample.

> l.sample <- l[sample(nrow(l), 10000),]

An easy thing to ask first is: How many questions are asked each month in each library?

Here’s how I did it before. I’ll run the command and show the resulting data frame. I used the plyr package, which is (was) great, and its ddply function, which applies a function to a data frame and gives a data frame back. Here I have it collapse the sample data frame along the two columns specified (month and library.name) and use nrow to count how many rows result. Then I check how long it would take to perform that operation on the entire data set.

> library(plyr)
> sample.allquestions.pm <- ddply(l.sample, .(month, library.name), nrow)
> head(sample.allquestions.pm)
       month      library.name  V1
1 2011-02-01          Bronfman  63
2 2011-02-01             Scott  60
3 2011-02-01 Scott Information 183
4 2011-02-01              SMIL  57
5 2011-02-01           Steacie  57
6 2011-03-01          Bronfman  46
> system.time(allquestions.pm <- ddply(l, .(month, library.name), nrow))
   user  system elapsed
  2.812   0.518   3.359

The system.time line there shows how long the previous command takes to run on the entire data frame: almost 3.5 seconds! That is slow. Do a few of those, chopping and slicing the data in various ways, and it will really add up.

This is a bad way of doing it. It works! But it’s slow and I wasn’t thinking about the problem the right way. Using ddply and nrow was wrong: I should have been using count (also from plyr), which I wrote up a while back, with some examples. That’s a much faster and more sensible way of counting up the number of rows in a data set.
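For the record, the count version is a one-liner, something like this (untimed here); the tally comes back in a freq column instead of the V1 column that ddply produced:

> # count the rows in each month/library.name grouping
> sample.allquestions.pm.counts <- count(l.sample, vars = c("month", "library.name"))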

But now that I can use dplyr, I can approach the problem in a whole new way.

First, I’ll clear plyr out of the way, then load dplyr. Doing it this way means no function names collide.

> search()
 [1] ".GlobalEnv"        "package:plyr"      "package:lubridate" "package:ggplot2"   "ESSR"              "package:stats"     "package:graphics"  "package:grDevices"
 [9] "package:utils"     "package:datasets"  "package:methods"   "Autoloads"         "package:base"
> detach("package:plyr")
> library(dplyr)

See how nicely you can construct and chain operations with dplyr:

> l.sample %.% group_by(month, library.name) %.% summarise(count=n())
Source: local data frame [277 x 3]
Groups: month

        month      library.name count
1  2011-02-01          Bronfman    63
2  2011-02-01              SMIL    57
3  2011-02-01             Scott    60
4  2011-02-01 Scott Information   183
5  2011-02-01           Steacie    57
6  2011-03-01          Bronfman    46
7  2011-03-01              SMIL    59
8  2011-03-01             Scott    71
9  2011-03-01 Scott Information   220
10 2011-03-01           Steacie    61
..        ...               ...   ...

The %.% operator lets you chain together different operations, and just for the sake of clarity of reading, I like to arrange things so first I specify the data frame on its own and then walk through the things I do to it. First, group_by breaks down the data frame by columns and does some magic. Then summarise collapses the different chunks of resulting data into one line each, and I use count=n() to make a new column, count, which contains the count of how many rows there were in each chunk, calculated with the n() function. In English I’m saying, “take the l data frame, group it by month and library.name, and count how many rows are in each grouping.” (Also, notice I didn’t need to use the head command to stop it running off the screen; the output is nicely readable on its own.)

It’s easier to think about, it’s easier to read, it’s easier to play with … and it’s much faster. How long would this take to run on the entire data set?

> system.time(l %.% group_by(month, library.name) %.% summarise(count=n()))
   user  system elapsed
  0.032   0.000   0.033

0.033 seconds elapsed! That is about 1% of the 3.36 seconds the old way took.

Graphing it is easy, using Hadley Wickham’s marvellous ggplot2 package.

> library(ggplot2)
> sample.allquestions.pm <- l.sample %.% group_by(month, library.name) %.% summarise(count=n())
> ggplot(sample.allquestions.pm, aes(x=month, y=count)) + geom_bar(stat="identity") + facet_grid(library.name ~ .) + labs(x="", y="", title="All questions")
> ggsave(filename="20140422-all-questions-1.png", width=8.33, dpi=72, units="in")

First chart of desk activity

You can see the ebb and flow of the academic year: September, October and November are very busy, then things quiet down in December, then January, February and March are busy again, then it cools off in April and through the summer. (Students don’t ask a lot of questions close to and during exam time—they’re studying, and their assignments are finished.)

What about comparing year to year? Here’s a nice way of doing that.

First, pick out the numbers of the months and years. The format command knows all about how to handle dates and times. See the man page for strptime or your favourite language’s date manipulation commands for all the options possible. Here I use %m to find the month number and %Y to find the four-digit year. Two examples, then the commands:

> format(as.Date("2014-04-22"), "%m")
[1] "04"
> format(as.Date("2014-04-22"), "%Y")
[1] "2014"
> sample.allquestions.pm$mon  <- format(as.Date(sample.allquestions.pm$month), "%m")
> sample.allquestions.pm$year <- format(as.Date(sample.allquestions.pm$month), "%Y")
> head(sample.allquestions.pm)
Source: local data frame [6 x 5]
Groups: month

       month      library.name count mon year
1 2011-02-01          Bronfman    63  02 2011
2 2011-02-01              SMIL    57  02 2011
3 2011-02-01             Scott    60  02 2011
4 2011-02-01 Scott Information   183  02 2011
5 2011-02-01           Steacie    57  02 2011
6 2011-03-01          Bronfman    46  03 2011
> ggplot(sample.allquestions.pm, aes(x=year, y=count)) + geom_bar(stat="identity") + facet_grid(library.name ~ mon) + labs(x="", y="", title="All questions")
> ggsave(filename="20140422-all-questions-2.png", width=8.33, dpi=72, units="in")

This plot changes the x-axis to the year, and facets along two variables, breaking the chart up vertically by library and horizontally by month. It’s easy now to see how months compare to each other across years.

Second chart of desk activity

With a little more work we can rotate the x-axis labels so they’re readable, and put month names along the top. The month function from lubridate makes this easy.

> sample.allquestions.pm$month.name <- month(sample.allquestions.pm$month, label = TRUE)
> head(sample.allquestions.pm)
Source: local data frame [6 x 6]
Groups: month

       month      library.name count mon year month.name
1 2011-02-01          Bronfman    63  02 2011        Feb
2 2011-02-01              SMIL    57  02 2011        Feb
3 2011-02-01             Scott    60  02 2011        Feb
4 2011-02-01 Scott Information   183  02 2011        Feb
5 2011-02-01           Steacie    57  02 2011        Feb
6 2011-03-01          Bronfman    46  03 2011        Mar
> ggplot(sample.allquestions.pm, aes(x=year, y=count)) + geom_bar(stat="identity") + facet_grid(library.name ~ month.name) + labs(x="", y="", title="All questions") + theme(axis.text.x = element_text(angle = 90))
> ggsave(filename="20140422-all-questions-3.png", width=8.33, dpi=72, units="in")

Third chart of desk activity

April 22, 2014 05:18 PM

Tennant, Roy

The Only Preservation Strategy is Commitment

By August I will have published the current awareness newsletter Current Cites every month for twenty-four years — with all but the first of those years (1990-1991) freely available on the Internet. My children, now in college, aren’t even that old. In fact, my only absence from its publication was the period shortly after their birth. Time well spent, I have to say.

Although the publication was born at UC Berkeley, it outgrew its host and has long been hosted elsewhere and no longer has any contributors from Berkeley. From its first day it was written by volunteers — first by employees who volunteered to be a part of the entity that gave it birth, then by people who truly had no compensation for keeping it alive except the love of doing it. When I’m ready to pass it on, or should I die suddenly, I’m sure someone who loves it like I do will step up and keep it going. That’s what commitment is made of.

Meanwhile, I have witnessed almost every other Internet-born publication go down to dust — whether sponsored by an organization or not. The only Internet-based open access publication I can think of that equals or exceeds (I’m not arguing) our longevity is TidBITS, by Adam Engst. And guess what? We share something in common. Commitment. Adam has been just as committed or more to publishing TidBITS as I have been to Current Cites.

So here’s the thing: the only effective strategy for preserving things for the future is commitment. I don’t mean to suggest it must be the commitment of an individual — far from it. There are many examples of institutional commitment. But simply because an individual or an organization is involved does not by itself signal sufficient commitment for long-term preservation. I have personally saved web sites from certain neglect or destruction by moving them from an institutional host to either another institutional host or my personal server.

Therefore, I’ve long thought that what we really need for digital preservation is a digital preservation marketplace. For example, let’s say my doctor has said that I have roughly six months to live. After picking myself up off the floor and drying my eyes, eventually I would get around to finding someone to carry the loves of my web life forward. I would need a commitment marketplace. Someplace where I could go to say “I have this. It consists of X. It requires Y to keep it going. You must love it, like you would love a rescue dog.” And individuals or organizations could apply to take it over.

Either it has value or it doesn’t, and the digital preservation marketplace would decide. But without true commitment there is no technology, no metadata standard, no prayer, that will save it. Believe me, I’ve lived — and am living — it. Just call it a commitment. Do not look to technology to save anything. Look only to your heart. It is the only thing that has ever saved anything worth saving or ever will.

Photo by Hector Alejandro, Creative Commons Attribution 2.0 Generic License.

by Roy Tennant at April 22, 2014 04:32 PM

ALA Equitable Access to Electronic Content

Ky. library advocate receives WHCLIST Award

Advocate Mary Lynn Collins. Photo by State Journal.

Today, the American Library Association (ALA) named Mary Lynn Collins, a library trustee from Frankfort, Ky., the winner of the 2014 White House Conference on Library and Information Services (WHCLIST) Award. The award, which is given to a non-librarian participating in National Library Legislative Day, covers hotel fees in addition to a $300 stipend to reduce the cost of attending the event.

During this year’s National Library Legislative Day, to be held May 5–6, 2014, hundreds of librarians and library supporters from across the country will gather in the nation’s capital to meet with members of Congress to discuss key library issues. As a champion for libraries, Collins incorporates her first-hand knowledge of the Kentucky legislature into her advocacy strategies. Before Collins became a founding member and current president of the Friends of Kentucky Libraries, she served for nearly 30 years on the staff of the Kentucky legislature as a legislative analyst.

Collins has used her legislative experience to gain support for Kentucky libraries that have faced harmful lawsuits in the past few years. In the future, she plans to lead her library group to increase advocacy efforts with congressional representatives.

“As a member of the Friends of Kentucky Libraries, I have seen advocacy at the state and local level become more important each year,” said Collins. “We have in the last three sessions of our state legislature seen legislation that was deemed detrimental to libraries and through the advocacy of library professionals, trustees and friends, and we have been able to defeat those efforts.”

The White House Conference on Library and Information Services—an effective force in library advocacy nationally, statewide and locally—turned its assets over to the ALA Washington Office after the last conference was held in 1991 in order to transmit the spirit of committed, passionate library support to a new generation of advocates. Leading up to National Library Legislative Day each year, the ALA seeks nominations for the award. Representatives of WHCLIST and the ALA Washington office choose the recipient.

Read the full press statement

The post Ky. library advocate receives WHCLIST Award appeared first on District Dispatch.

by Jazzy Wright at April 22, 2014 04:12 PM

Open Knowledge Foundation

Open Knowledge Festival Call for Volunteers Opens Today!

The OKFestival team is launching our call for volunteers today, and we are excited to bring on board amazing members of our community who will help us to make this festival the huge success we are anticipating. Apply now!

Volunteers are integral to our ability to run OKFestival – without you, we wouldn’t have enough hands to get everything done over the days of the festival!

Join Us!

If you want to come to Berlin this July 15th-17th and help us to create the best Open festival there has ever been, please apply today at the link above, and then spread the word to ensure others know about the festival too!

There is no hard deadline on applying, but the sooner you apply the better your chance of being selected to come and make Open history with us at this year’s OKFestival. We can’t wait to see you there!

by Beatrice Martini at April 22, 2014 02:44 PM

April 21, 2014

Morgan, Eric Lease

LiAM Guidebook: Executive summary

Linked data is a process for embedding the descriptive information of archives into the very fabric of the Web. By transforming archival description into linked data, an archivist will enable other people as well as computers to read and use their archival description, even if the others are not a part of the archival community. The process goes both ways. Linked data also empowers archivists to use and incorporate the information of other linked data providers into their local description. This enables archivists to make their descriptions more thorough, more complete, and more value-added. For example, archival collections could be automatically supplemented with geographic coordinates in order to make maps, images of people or additional biographic descriptions to make collections come alive, or bibliographies for further reading.

Publishing and using linked data does not represent a change in the definition of archival description, but it does represent an evolution of how archival description is accomplished. For example, linked data is not about generating a document such as an EAD file. Instead it is about asserting sets of statements about an archival thing, and then allowing those statements to be brought together in any number of ways for any number of purposes. A finding aid is one such purpose. Indexing is another purpose. Use by a digital humanist is another purpose. While EAD files are encoded as XML documents and therefore very computer readable, the reader must know the structure of EAD in order to make the most out of the data. EAD is archives-centric. The way data is manifested in linked data is domain-agnostic.

The objectives of archives include collection, organization, preservation, description, and oftentimes access to unique materials. Linked data is about description and access. By taking advantage of linked data principles, archives will be able to improve their descriptions and increase access. This will require a shift in the way things get done but not in what gets done. The goal remains the same.

Many tools already exist for transforming data in existing formats into linked data. This data can reside in Excel spreadsheets, database applications, MARC records, or EAD files. There are tiers of linked data publishing, so one does not have to do everything all at once. But to transform existing information or to maintain information over the long haul requires the skills of many people: archivists & content specialists, administrators & managers, metadata specialists & catalogers, computer programmers & systems administrators.

Moving forward with linked data is a lot like traveling to Rome. There are many ways to get there, and there are many things to do once you arrive, but the result will undoubtedly improve your ability to participate in the discussion of the human condition on a worldwide scale.

by LiAM: Linked Archival Metadata at April 21, 2014 08:59 PM

ALA Equitable Access to Electronic Content

Libraries: Our story, our future

Author Anthony Chow enjoying Washington, D.C. with his daughter during the 2013 National Library Legislative Day.

The article below was written by library advocate Anthony Chow, Ph.D., an assistant professor in the Department of Library and Information Studies at the University of North Carolina at Greensboro and co-chair of the North Carolina Library Association’s Legislative and Advocacy Committee.

Knowledge is power. I have always believed that. As a professional educator and father of three, the gift of literacy is a gift for the future. My wife and I read to all three of our kids every day for years until one day our youngest, Emma, said she did not want to be read to anymore. She wanted to do it on her own now, and she could. Emma and her brother and sister were empowered with the gift of reading—a door to endless possibilities, a pathway towards knowledge about whatever they wanted and needed. This is a wonderful feeling for any parent or educator. This is freedom and independence personified.

Both school and public libraries have played a pivotal role in helping build the joy and love of reading in our children. For this, my wife and I will be forever grateful. I am a library advocate and wish the same feeling of joy and empowerment for all Americans. I want to give back what they have given to me.

I am also a professor at The University of North Carolina at Greensboro’s Library and Information Studies Department. My job is to prepare future librarians, and a significant part of my teaching philosophy is to lead by example and be extremely active in service as part of my own pathway of life-long learning. This is how I became involved in North Carolina’s library advocacy efforts five years ago.

My passion for libraries and library advocacy derives from my personal and professional conviction that they are indeed an essential part of the American story—past, present, and future. As a member of the North Carolina delegation attending National Library Legislative Day (NLLD) for the past four years, I had the honor and privilege of meeting with our state’s legislators to tell them my story and let them know unequivocally that libraries are a fundamental part of our life and the lives of many North Carolinians and Americans across the country.

The author’s 11 year-old daughter Emma with their Morkie Ellie. Emma is already an avid reader.

Now a grizzled veteran, having learned under accomplished mentors Carol Walters, retired director of the Sandhills Regional Library System, and Brandy Hamilton, Regional Library Manager of the East Regional Library in Wake County, I was asked to help lead our 2014 delegation.

As we planned for this year’s NLLD we had two primary goals: 1) Allow our youth a voice to speak directly to legislators about how important libraries are to them personally, and 2) Find unique ways to make a splash and have people pay attention to us and our message of strong libraries for everyone.

The North Carolina Library Association (NCLA) created the NCLA student ambassador program and this year we are bringing 20 K-12 students to personally meet with their legislators and tell them first-hand how important libraries are to them. The creativity, energy, and diversity of their winning entries were refreshing and breathtaking in their depth and breadth. The youth are our future and their support of libraries could not be more authentically stated.

Coinciding with our focus on youth was the emergence of Pharrell Williams’s Academy Award-nominated song “Happy” and the spread of “Happy Dances” across the world on YouTube. We decided that doing a “Happy Dance” as part of our advocacy efforts made perfect sense, and that dancing was the perfect, positive, and fun way of expressing our support for libraries.

Like any large event, an idea must be supported by passionate, talented, and brave people willing to dedicate their time and expertise, and to put their pride on the line by doing something different. One of our new faculty members, Dr. Rebecca Morris, was a majorette at Pennsylvania State University, and it was her brilliant idea to choose the song “Happy” for our flash mob dance; her willingness to take the lead on the choreography and the instructional video allowed the idea to become a reality. Our idea was quickly supported by North Carolina’s State Librarian Cal Shepard, and our movement was born and off and running.

Our initial NCLA video also prompted several schools in North Carolina, West Wilkes Middle School and Smithfield-Selma High School, to film their own videos, as did the Charlotte-Mecklenburg Public Library System.

In reserving the location for the flash mob Happy Dance, I was told by the U.S. Capitol Police that just dancing was not a clear enough expression of our First Amendment rights. So, in collaboration with the American Library Association (ALA), our dance turned into a full-blown rally, which will take place from 2:30-4:00 p.m. right in front of the U.S. Capitol on Site 10 (across from the Library of Congress at the intersection of Independence and First Street). The flash mob will start promptly at 3:00 p.m., led by Dr. Morris, myself, and the majority of the North Carolina delegation, including our State Librarian and many of our Student Ambassadors.

This is not really my story but our story and our future. Library advocacy is one clear-cut way for me to give back in some small way what they have given to me, my family, North Carolina, and our nation. Knowledge is power. Literacy is a gift that keeps on giving. Libraries also do so much more for other people—youth programming, access to technology, work force development, a place for the community to meet, and books—lots of books—in all different formats.

When I asked our State Librarian what was the overarching message she wanted to convey this year, she told me in no uncertain terms that this year is a celebration as North Carolina libraries are booming and we need the continuing support of our legislators to help us keep growing and providing vital services to our communities. Libraries make me so happy that I will dance for libraries. Do libraries make you happy? I sincerely hope you will join us.

The post Libraries: Our story, our future appeared first on District Dispatch.

by Jazzy Wright at April 21, 2014 07:43 PM

Rosenthal, David

Skeptical about emulation?

If you're skeptical about two trends I've been pointing to, the rapid rise of emulation technology, and the evolution of the Web's language from HTML to Javascript, you need to watch Gary Bernhardt's video that fell through a time-warp from 2035.

Also, at the recent EverCloud workshop Mahadev Satyanarayanan, my colleague from the long-gone days of the Andrew Project, gave an impressive live demo of CMU's Olive emulation technology. The most impressive part was that the emulations started almost instantly, despite the fact that they were demand-paging over the hotel's not-super-fast Internet.

by David. (noreply@blogger.com) at April 21, 2014 09:00 AM

April 20, 2014

Dempsey, Lorcan

The decentered library network presence

Think of two trends in the development of the library's network presence. These have emerged successively and continue to operate together.

  1. A centripetal trend producing a library network presence centered on the institutional website, as the library wants to offer an integrated service.
  2. A centrifugal trend, unbundling functionality and placing it in a variety of decentered network presences, as the library wants to be in the flow of its users (think of how communication has been unbundled to social networking sites for example, or of how metadata may be shared with various aggregation sites, or of how a resolver may be configured for use with a third party site).


The decentered library network presence is an important component of library service although it still appears to be an emergent interest in strategic or organizational terms.

In the centripetal trend, the focus was on integration around a singular, 'centered' network presence: the library website. The library website was the principal de facto network manifestation of the library, and the integration of library services in the website was a goal. Early examples of this were discussions around 'portals', one-stop-shops and metasearch. Later, this trend continued with more consolidated approaches which overcame some of the cost and inefficiencies of integration. Included here were the use of unified search systems often deploying integrated discovery layer products, the use of resource guides to manage resources with a consistent approach, and the adoption of content management systems. Service consolidation, a stronger focus on providing a coherent user experience, a move to cloudsourced applications (discovery and resource guides for example), and an emerging emphasis on full library discovery help create a more unified experience at the library website.

In the centrifugal trend, the network library presence is decentered, unbundled or decoupled to an evolving ecosystem of services, each with a particular focus or scope. Think for example of how aspects of user engagement have been unbundled to various social networking sites (Facebook, Twitter, Pinterest, Flickr, ...), or of how parts of the discovery experience have been unbundled to Google Scholar or PubMed or to a cloud-based discovery layer, or of how some library services are atomized and delivered as mobile apps, toolbar applications, or 'widgets' in learning management systems and other web environments external to the library's own. The library website is now a part, albeit an important part, of this evolving network presence. In this way, the library network presence has been decentered, subject to a centrifugal trend to multiple network locations potentially closer to user workflows. There are two important drivers here. One is the desire to reach into user workflows, acknowledging that potential library users may not always come to the library website. A second is the desire to make institutional resources (digitized special collections, research and learning materials, for example) available to external audiences in more effective ways. This is an aspect of the 'inside-out library'.

Here are some strands of the 'decentered' library network presence.

Now, despite the fact that there is quite a bit of activity supporting what I am calling here the decentered network presence, it has not crystallized as a service or organizational category for the library. It is an area of emergent interest. There seem to be at least three factors at play.

Clearly, there are different dynamics at play in the components of the decentered network presence of the library. However, we can expect a more holistic view to emerge in coming years.

by dempseyl@oclc.org (Lorcan Dempsey) at April 20, 2014 10:59 AM

April 19, 2014

Denton, William

Ubuntu 14.04 and grub

Another note to my future self about upgrading Ubuntu.

Ubuntu 14.04 was released yesterday. I have two laptops that run it. I did the unimportant one first, and everything went fine. Then I did the important one, the one where I do all my work, and after restarting it came up with a boot error:

error: symbol 'grub_term_highlight_color' not found

I had two reactions. First, boot errors are solvable. The boot stuff is on one part of my hard drive, and my real stuff is on another part, and it’s fine where it is; I just need to fix the boot stuff. Besides, I have backups. So with a bit of fiddling, I’ll be able to fix it. Second, cripes, what the hell? I’ve been using this laptop for six months or a year or more since a major upgrade, and now it’s telling me there’s some problem with how it boots up? That is a load of shite.

Searching turned up evidence other people had the same problem, and they were being blamed for having an improper boot sector or some such business. For a few minutes I felt like non-geeks feel when presented with problems like this: despair … annoyance … frustration … the first pangs of hate.

But such is life. When upgrading a system we must be prepared for possible problems. We cannot expect it to always go smoothly. Even in the face of such technical problems we must try to remain tranquil.

It’s solvable, I remembered. So I downloaded a Boot-Repair Disk image—this is a very useful tool, and it works even though it’s a year old—and put it on a USB drive with startup disk creator, then booted up, ran sudo boot-repair, used all the default answers, let it do its work, and everything was all right. Phew.

Aside from that, everything about the upgrade went perfectly fine. This time I did it at the command line with sudo do-release-upgrade. It took a while to download all the upgraded packages, but the actual update went quickly and smoothly. My thanks to everyone involved with Debian, Ubuntu, GNU/Linux, and everything else.

(However, I’m glad I had another machine available where I could do the download and set up the boot disk. Without it, I would have been in trouble. I don’t know if a similar problem might have arisen when Windows or MacOS users do an upgrade.)

April 19, 2014 04:14 AM

April 18, 2014

Leggott, Mark

Return to Blogging

It seems to me I have done this once (or twice) before, but I feel like it is time to continue blogging on Loomware. My Loomware blog started in February of 2004, so I guess I can call this the 10th anniversary and just get on with it!

One of the drivers for me was the incredible interest in the Islandora digital asset management system, which had its genesis in 2007 just after I joined UPEI. In the last 7 years Islandora has seen adoption in countries all over the world, and for a wide range of functions. I will start the posts next week with a series on the coming version of Islandora - 7.x-1.3, which is our way of saying the 3rd release of Islandora for Drupal 7 and Fedora 3. This new series will describe all the awesome goodness in the upcoming release, solution pack by solution pack, module by module, and include some shoutouts to friends and colleagues who are giving their time and expertise to build a great open source ecosystem!

by mleggott at April 18, 2014 10:08 PM

ALA Equitable Access to Electronic Content

Library broadband takes center stage at IMLS hearing

Larra Clark of ALA’s Office for Information Technology Policy speaks on panel.

The ongoing digital revolution continues to create new opportunities for education, entrepreneurship, job skills training and more. Those of us with home broadband, smartphones or both can easily take advantage of these opportunities. However, for millions of Americans currently living without personal access to high-capacity internet or who lack digital literacy skills, libraries serve as the on-ramp to the digital world. With a growing number of people turning to libraries to avail themselves of broadband-enabled technologies, library networks are being strained more than ever before. Yesterday, the Institute for Library and Museum Services (IMLS) held a public hearing to discuss the importance of high-speed connectivity in libraries and outline strategies for helping libraries expand bandwidth to accommodate growing network use.

Federal Communications Commission (FCC) Chairman Thomas Wheeler’s opening remarks set the tone for the day: “Andrew Carnegie built 2,500 libraries in a public-private partnership, defining information access for millions of people for more than a century,” he said. “We stand on the precipice of being able to have the same kind of seminal impact on the flow of information and ideas in the 21st century…That’s why reform of the E-rate program is so essential. The library has always been the on-ramp to the world of information and ideas, and now that on-ramp is at gigabit speeds.”

The hearing convened three expert panels, each of which discussed a different dimension of library connectivity. The first panel propounded strategies for helping libraries procure the resources they need to build network capacity. Chris Jowaisas of the Gates Foundation urged libraries to underscore the ways in which their activities advance the goals of top giving foundations. “[Libraries should]…package their services to meet foundation needs,” Jowaisas said. “With a robust and reliable broadband connection, libraries and communities can move into more areas of exploration and innovation. The foundation hopes the network of supporters of this vision grows because we have seen and learned first-hand from investments in public libraries that they are key organizations for growing opportunity.”


Following his remarks, Clarence Anthony of the National League of Cities stressed the need for the library community to ramp up its efforts to make government leaders aware of the extent to which urban communities rely on libraries for broadband access.

The second panel analyzed current library connectivity data and identified areas where the data falls short in assessing broadband capacity. Larra Clark of ALA’s Office for Information Technology Policy drew on 20 years of research to illustrate that the progress libraries have made in expanding bandwidth—while meaningful—has generally not proven sufficient to accommodate the growing needs of users. About 9 percent of public libraries reported speeds of 100 Mbps or greater in the 2012 Public Library Funding & Technology Access Study, and the forthcoming Digital Inclusion Survey shows this number has only climbed to 12 percent. More than 66 percent of public libraries report they would like to increase their broadband connectivity speeds. “Libraries aren’t standing still, but too many are falling behind,” Clark said.

Researcher John Horrigan also gave the audience a preview of forthcoming research looking at Americans’ levels of digital readiness, which finds significant variations in digital skills even among people who are highly connected to digital tools. Of the 80 percent of Americans with home broadband or a smartphone, nearly one-fifth (or 34 million adults) has a low level of digital skills. “(Libraries) are the vanguard in the forces we bring to bear to bolster digital readiness,” Horrigan noted. “Libraries will have more demands placed upon them, which makes the case for ensuring they have the resources to meet these demands compelling.”

The final panel built on the capacity-building strategies offered by Jowaisas and Anthony by providing real-world examples of successful efforts to expand library bandwidth. Gary Wasdin of the Omaha Public Library System discussed ways in which his libraries are leveraging federal dollars to engage private funders in efforts to build broadband capacity, and Eric Frederick of Connect Michigan described how public-private synergies are improving library connectivity in his state. The final panelist was Linda Lord, Maine state librarian and chair of ALA’s E-rate Task Force. Lord discussed ALA’s efforts to inform the FCC’s ongoing E-rate modernization proceeding. “ALA envisions that all libraries will be at a gig (1 Gbps) by 2018,” Lord said. The E-rate program provides schools and libraries with telecommunications services at discounted rates. Lord went on to clearly articulate ALA’s commitment to updating the program to help libraries address 21st-century challenges.

Libraries and library users can add to the IMLS hearing record on the urgency and impact of library broadband by submitting comments by April 24. View C-SPAN coverage of the hearing.

The post Library broadband takes center stage at IMLS hearing appeared first on District Dispatch.

by Charles Wapner at April 18, 2014 10:03 PM

code4lib

Code4Lib 2014 Trip Report - Zahra Ashktorab

Zahra Ashktorab
March 28, 2014

Code4Lib Report

I was recently selected by the Code4Lib community to receive a diversity scholarship to attend the Code4Lib conference in Raleigh, North Carolina. The Code4Lib conference was the perfect place to make new connections with people who aim to make information more accessible through technology. As someone who works in close proximity to technology and usability, I was interested in the new strides taking place in this area. At this conference, I made new contacts for future collaboration and attended talks ranging from Linked Open Data to Google Analytics.

read more

by bohyunkim at April 18, 2014 06:25 PM

Code4Lib 2014 Trip Report- Nabil Kashyap

Diversity Scholarship Trip Report
Code4Lib, 2014
Raleigh, NC
Nabil Kashyap
4/1/14

Coming to my first Code4Lib was significant because when I first began connecting with the group and its resources, I was a freshly-minted graduate in the middle of a career change. By the time I landed in Raleigh, three months into a new job, I was an information professional--more or less.

After graduating last May from library school, I admit to using the Code4Lib website obsessively during my quest for employment; I quickly found the site, wiki, listserv and journal invaluable. There was a level of energy and involvement by users that made it stand out from other, more conventional professional organizations. Plus, the job postings often described exactly the kinds of emerging, interdisciplinary positions I was most interested in. Code4Lib was a network I wanted to be a part of. Miraculously, my search worked out: I was offered a position, though I had not yet started when I finally applied for the diversity scholarship.

read more

by bohyunkim at April 18, 2014 06:24 PM

Code4Lib 2014 Trip Report - Junior Tidal

As a recipient of a Diversity Scholarship for the 9th annual Code4Lib conference in Raleigh, North Carolina, I had an enlightening and incredible experience. I learned a great deal of information that revolved around library system usability, emerging coding frameworks, and applying social justice to user-centered design. Throughout the conference, I asked myself how I could use these concepts and coding techniques in my daily work at my institution. As a “one-man shop” I have limited support for implementing many of these technologies. However, because I have networked with the diverse members of the code4lib community, I know that it will be a bit easier to experiment with these techniques.

My time at the conference revealed that many libraries are passionately striving to make end-user systems usable, accessible, and transparent. There were numerous presentations that revolved around these ideas, such as using APIs to create data visualizations for displaying library statistics, real-time interactive discovery systems and interfaces, moving away from “list” type listings of holdings to network-node maps, web accessibility for differently abled patrons, and much more. The numerous lightning talks also provided a great wealth of information (all within 5 minutes!)

read more

by bohyunkim at April 18, 2014 06:23 PM

Code4Lib 2014 Trip Report - Jennifer Maiko Kishi

Jennifer Maiko Kishi
Code4Lib 2014 Conference Report
1 April 2014

As a new professional in the field, a lone digital archivist, and a first-timer at the Code4Lib Conference, my experience was incredibly inspiring and enriching. I value Code4Lib’s collective mission of teaching and learning through community, collaboration, and a free exchange of ideas. The conference was unique and unlike any other library or archives conference I have attended. I appreciate the thoughtfulness of planning events to specifically welcome new attendees. The newcomer dinner was not only a great way to meet fellow newbies (and oldtimers) on the evening before the conference, but also provided familiar faces to say hello to the following day. Moreover, Code4Lib resolved my session-selection anxieties, where I always feel like I’ve missed out on yet another important session. The conference is set up so that all attendees have equal opportunities to view the sessions together in a continuous fashion, in addition to live streams made available to those unable to attend. The conference was jam-packed with back-to-back presentations, lightning talks, and breakout sessions. There was a good balance of interesting topics by insightful speakers, mixed in with scheduled breaks with copious coffee and tea to stay alert and focused throughout the day.

read more

by bohyunkim at April 18, 2014 06:22 PM

Code4Lib 2014 Trip Report - J. (Jenny) Gubernick

Code4Lib 2014: Conference Review
J. (Jenny) Gubernick

Intro

I was fortunate to receive a diversity scholarship to help defray the costs of attending Code4Lib 2014 in Raleigh, NC. Although I am still processing the somewhat overwhelming amount of information I absorbed, I suspect that I will look back at this past week as a transformative experience. I pivoted from thinking of myself as “not a real programmer,” “lucky to have any job,” and “maybe someday I can do something cool,” to thinking of myself as being in a position of great empowerment to learn and do, and being ready to apply my skills to more complex work. I look forward to continuing to be part of this community in months and years to come.
Takeaways

read more

by bohyunkim at April 18, 2014 06:12 PM

Code4Lib 2014 Trip Report - Emily Reynolds

Emily Reynolds
Code4Lib trip report
31 March 2014

As a diversity scholarship recipient, I was afforded the opportunity to attend the 2014 Code4Lib conference in Raleigh, NC. The conference consisted of two and a half days of presentations and one day of preconference workshops. Looking back on the experience, I am impressed by the content of the presentations, the openness of the community, and the overall sense of curiosity and exploration. I learned a great deal and am looking forward to applying the inspiration and motivation that I took away from the conference in my daily work.

Pre-conference

Prior to the start of the conference itself, I attended the “Archival Discovery and Use” pre-conference session. True to its name, Code4Lib has historically been more library-focused, but this session covered topics like the modern relevance of archival finding aids, archival crowdsourcing, and presentation methods for digitized materials. Because librarians and archivists have so many intertwined concerns, I was glad to see the archival community represented.

read more

by bohyunkim at April 18, 2014 06:11 PM

Code4Lib 2014 Trip Report - Coral Sheldon Hess

Coral Sheldon Hess
From: http://www.sheldon-hess.org/coral/2014/04/code4lib-2014-write-up/

I had an enjoyable and educational time at Code4Lib 2014. It was my first time attending any Code4Lib event, and I am grateful to have had the opportunity to be there, thanks to the Diversity Scholarship sponsored by the Council on Library and Information Resources/Digital Library Federation, EBSCO, ProQuest, and Sumana Harihareswara. Thank you to the sponsors, the scholarship and organizing committees, and everyone else involved with the conference for this amazing learning experience!

Things that went well

read more

by bohyunkim at April 18, 2014 06:10 PM

Code4Lib 2014 Trip Report - Christina Harlow

A Newbie, Troublesome Cataloger at Code4Lib

Christina Harlow

In March 2014, I attended my first (and definitely not only) Code4Lib National Conference. I had been following the Code4Lib group via their website, journal, wiki and local NYC chapter for some time; but being a metadata/cataloging person, I was hesitant to jump into a meeting of programmers, coders, systems librarians, and others. I am immensely glad that I did not let this hesitation hold me back this year, as the 2014 Code4Lib Conference was the best and most inviting conference that I have ever attended.

read more

by bohyunkim at April 18, 2014 06:08 PM

Ng, Cynthia

LibTechConf 2014: A Reflection

After all the conferences and the craziness at work, LibTechConf seems like ages ago, and though it’s been a little while, I wanted to write the usual reflection that I do. I wish I had done it sooner now, but I’m finally getting to it. Great Keynotes: I normally prefer getting keynote speakers from outside […]

by Cynthia at April 18, 2014 02:54 AM

Morgan, Eric Lease

Rome in three days, an archivist’s introduction to linked data publishing

If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice.

Linked data in archival practice is not new. Others have been here previously. You can benefit from their experience and begin publishing linked data right now using tools with which you are probably already familiar. For example, you probably have EAD files, sets of MARC records, or metadata saved in database applications. Using existing tools, you can transform this content into RDF and put the result on the Web, thus publishing your information as linked data.

EAD

If you have used EAD to describe your collections, then you can easily make your descriptions available as valid linked data, but the result will be less than optimal. This is true not for a lack of technology but rather because of the inherent purpose and structure of EAD files.

A few years ago an organisation in the United Kingdom called the Archives Hub was funded by a granting agency called JISC to explore the publishing of archival descriptions as linked data. The project was called LOCAH. One of the outcomes of this effort was the creation of an XSL stylesheet (ead2rdf) transforming EAD into RDF/XML. The terms used in the stylesheet originate from quite a number of standardized, widely accepted ontologies, and with only the tiniest bit of configuration / customization the stylesheet can transform a generic EAD file into valid RDF/XML for use by anybody. The resulting XML files can then be made available on a Web server or incorporated into a triple store. This goes a long way to publishing archival descriptions as linked data. The only additional things needed are a transformation of EAD into HTML and the configuration of a Web server to do content negotiation between the XML and HTML.

For the smaller archive with only a few hundred EAD files whose content does not change very quickly, this is a simple, feasible, and practical solution to publishing archival descriptions as linked data. With the exception of doing some content negotiation, this solution does not require any computer technology that is not already being used in archives, and it only requires a few small tweaks to a given workflow (a small sketch of step #3 follows the list):

  1. implement a content negotiation solution

  2. create and maintain EAD files
  3. transform EAD into RDF/XML

  4. transform EAD into HTML

  5. save the resulting XML and HTML files on a Web server

  6. go to step #2
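
As an illustration of step #3, here is a minimal sketch in R using the xml2 and xslt packages, assuming an XSLT 1.0 version of the ead2rdf stylesheet and hypothetical file names; any other XSLT processor would do the job just as well:

> library(xml2)   # read and write XML documents
> library(xslt)   # apply XSL stylesheets from within R
> ead   <- read_xml("collection-001.xml")   # a hypothetical EAD finding aid
> style <- read_xml("ead2rdf.xsl")          # local copy of the transforming stylesheet
> rdf   <- xml_xslt(ead, style)             # transform the EAD into RDF/XML
> write_xml(rdf, "collection-001.rdf")      # save the result for the Web server

From there, content negotiation on the Web server decides whether to return the RDF/XML or an HTML rendering of the same description.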

EAD is a combination of narrative description and a hierarchical inventory list, and this data structure does not lend itself very well to the triples of linked data. For example, EAD headers are full of controlled vocabulary terms, but there is no way to link these terms with specific inventory items. This is because the vocabulary terms are expected to describe the collection as a whole, not individual things. This problem could be overcome if each individual component of the EAD were associated with controlled vocabulary terms, but this would significantly increase the amount of work needed to create the EAD files in the first place.

The common practice of using literals to denote the names of people, places, and things in EAD files would also need to be changed in order to fully realize the vision of linked data. Specifically, it would be necessary for archivists to supplement their EAD files with commonly used URIs denoting subject headings and named authorities. These URIs could be inserted into id attributes throughout an EAD file, and the resulting RDF would be more linkable, but the labor to do so would increase, especially since many of the named items will not exist in standardized authority lists.

Despite these shortcomings, transforming EAD files into some sort of serialized RDF goes a long way towards publishing archival descriptions as linked data. This particular process is a good beginning and outputs valid information, just information that is not as linkable as possible. This process lends itself to iterative improvements, and outputting something is better than outputting nothing. But this particular process is not for everybody. The archive whose content changes quickly, the archive with copious numbers of collections, or the archive wishing to publish the most complete linked data possible will probably not want to use EAD files as the root of their publishing system. Instead some sort of database application is probably the best solution.

MARC

In some ways MARC lends itself very well to being published via linked data, but in the long run it is not really a feasible data structure.

Converting MARC into serialized RDF through XSLT is at least a two-step process. The first step is to convert MARC into MARCXML and then MARCXML into MODS. This can be done with any number of scripting languages and toolboxes. The second step is to use a stylesheet, such as the one created by Stefano Mazzocchi (mods2rdf.xsl), to transform the MODS into RDF/XML. From there a person could save the resulting XML files on a Web server, enhance access via content negotiation, and call it linked data.

Unfortunately, this particular approach has a number of drawbacks. First and foremost, the MARC format has no place to denote URIs; MARC records are made up almost entirely of literals. Sure, URIs can be constructed from various control numbers, but things like authors, titles, subject headings, and added entries will most certainly be literals (“Mark Twain”, “Adventures of Huckleberry Finn”, “Bildungsroman”, or “Samuel Clemens”), not URIs. This issue can be overcome if the MARCXML were first converted into MODS and URIs were inserted into id or xlink attributes of bibliographic elements, but this is extra work. If an archive were to take this approach, then it would also behoove them to use MODS as their data structure of choice, not MARC. Continually converting from MARC to MARCXML to MODS would be expensive in terms of time. Moreover, with each new conversion the URIs from previous iterations would need to be re-created.

EAC-CPF

Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF) goes a long way to implementing a named authority database that could be linked to from archival descriptions. These XML files could easily be transformed into serialized RDF and therefore linked data. The resulting URIs could then be incorporated into archival descriptions, making the descriptions richer and more complete. For example, the FindAndConnect site in Australia uses EAC-CPF under the hood to disseminate information about people in its collection. Similarly, “SNAC aims to not only make the [EAC-CPF] records more easily discovered and accessed but also, and at the same time, build an unprecedented resource that provides access to the socio-historical contexts (which includes people, families, and corporate bodies) in which the records were created.” More than a thousand EAC-CPF records are available from the RAMP project.

METS, MODS, OAI-PMH service providers, and perhaps more

If you have archival descriptions in either the METS or MODS format, then transforming them into RDF is as far away as your XSLT processor and a content negotiation implementation. As of this writing there do not seem to be any METS-to-RDF stylesheets, but there are a couple of stylesheets for MODS. The biggest issue with these sorts of implementations is the URIs. It will be necessary for archivists to include URIs in as many MODS id or xlink attributes as possible. The same thing holds true for METS files, except the id attribute is not designed to hold pointers to external sites.

Some archives and libraries use a content management system called ContentDM. Whether they know it or not, ContentDM comes complete with an OAI-PMH (Open Archives Initiative – Protocol for Metadata Harvesting) interface. This means you can send a REST-ful URL to ContentDM, and you will get back an XML stream of metadata describing digital objects. Some of the digital objects in ContentDM (or any other OAI-PMH service provider) may be worth exposing as linked data, and this can easily be done with a system called oai2lod. It is a particular implementation of D2RQ, described below, and works quite well. Download the application, feed oai2lod the “home page” of the OAI-PMH service provider, and oai2lod will publish the OAI-PMH metadata as linked open data. This is another quick & dirty way to get started with linked data.
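
To give a sense of what such a request looks like from the client side, here is a minimal sketch in R using the xml2 package against a hypothetical OAI-PMH base URL; the verb and metadataPrefix parameters are standard OAI-PMH, and oai2lod itself only needs the base URL:

> library(xml2)
> ## Hypothetical base URL; ContentDM and other OAI-PMH providers answer the same verbs
> base <- "http://example.org/oai"
> records <- read_xml(paste0(base, "?verb=ListRecords&metadataPrefix=oai_dc"))
> ## Pull the Dublin Core titles out of the response, ignoring namespaces for brevity
> titles <- xml_text(xml_find_all(records, "//*[local-name()='title']"))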

Databases

Publishing linked data through XML transformation is functional but not optimal. Publishing linked data from a database comes closer to the ideal but requires a greater amount of technical computer infrastructure and expertise.

Databases — specifically, relational databases — are the current best practice for organizing data. As you may or may not know, relational databases are made up of many tables of data joined together with keys. For example, a book may be assigned a unique identifier. The book has many characteristics such as a title, number of pages, size, descriptive note, etc. Some of the characteristics are shared by other books, like authors and subjects. In a relational database these shared characteristics would be saved in additional tables, and they would be joined to a specific book through the use of unique identifiers (keys). Given this sort of data structure, reports can be created from the database describing its content. Similarly, queries can be applied against the database to uncover relationships that may not be apparent at first glance or that are buried in reports. The power of relational databases lies in the use of keys to make relationships between rows in one table and rows in other tables. The downside of relational databases as a data model is the infinite variety of field/table combinations, which makes them difficult to share across the Web.

Not coincidentally, relational database technology is very much the way linked data is expected to be implemented. In the linked data world, the subjects of triples are URIs (think database keys). Each URI is associated with one or more predicates (think the characteristics in the book example). Each triple then has an object, and these objects take the form of literals or other URIs. In the book example, the object could be “Adventures Of Huckleberry Finn” or a URI pointing to Mark Twain. The reports of relational databases are analogous to RDF serializations, and SQL (the relational database query language) is analogous to SPARQL, the query language of RDF triple stores. Because of the close similarity between well-designed relational databases and linked data principles, the publishing of linked data directly from relational databases makes a whole lot of sense, but the process requires the combined time and skills of a number of different people: content specialists, database designers, and computer programmers. Consequently, the process of publishing linked data from relational databases may be optimal, but it is more expensive.
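
The book example can be made concrete in a few lines of R. This is a minimal sketch, assuming the DBI and RSQLite packages and made-up tables; the SPARQL in the comment is only a rough analogue of the SQL query, not something executed here:

> library(DBI)
> con <- dbConnect(RSQLite::SQLite(), ":memory:")
> dbWriteTable(con, "authors", data.frame(id = 1, name = "Mark Twain"))
> dbWriteTable(con, "books", data.frame(id = 1, title = "Adventures of Huckleberry Finn", author_id = 1))
> ## A "report" joining the two tables on their keys...
> dbGetQuery(con, "SELECT b.title, a.name FROM books b JOIN authors a ON b.author_id = a.id")
> ## ...is roughly analogous to a SPARQL query over triples, e.g.
> ##   SELECT ?title ?name WHERE { ?book dc:title ?title ; dc:creator ?author . ?author foaf:name ?name }
> dbDisconnect(con)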

Thankfully, many archivists probably use some sort of behind-the-scenes database to manage their collections and create their finding aids. Moreover, archivists probably use one of three or four tools for this purpose: Archivist’s Toolkit, Archon, ArchivesSpace, or PastPerfect. Each of these systems has a relational database at its heart. Reports could be written against the underlying databases to generate serialized RDF and thus begin the process of publishing linked data. Doing this from scratch would be difficult, as well as inefficient, because many people would be starting out with the same database structure but creating a multitude of varying outputs. Consequently, there are two alternatives. The first is to use a generic database-to-RDF publishing platform called D2RQ. The second is for the community to join together and create a holistic RDF publishing system based on the database(s) used in archives.

D2RQ is a very powerful software system. It is supported, well-documented, executable on just about any computing platform, open source, focused, functional, and at the same time does not try to be all things to all people. Using D2RQ it is more than possible to quickly and easily publish a well-designed relational database as RDF. The process is relatively simple.

The downside of D2RQ is its generic nature. It will create an RDF ontology whose terms correspond to the names of database fields. These field names do not map to widely accepted ontologies & vocabularies and therefore will not interact well with communities outside the ones using a specific database structure. Still, the use of D2RQ is quick, easy, and accurate.

If you are going to be in Rome for only a few days, you will want to see the major sites, and you will want to adventure out & about a bit, but at the same time it will be a wise idea to follow the lead of somebody who has been there previously. Take the advice of these people. It is an efficient way to see some of the sights.

by LiAM: Linked Archival Metadata at April 18, 2014 12:43 AM

April 17, 2014

Rosenthal, David

Henry Newman on HD vs. SSD Economics

Henry Newman has an excellent post entitled SSD vs. HDD Pricing: Seven Myths That Need Correcting. His seven myths are:

Henry is looking at the market for performance storage, not for long-term storage, but given that limitation I agree with nearly everything he writes. However, I think there is a simpler argument that ends up at the same place that Henry did:
QED.

by David. (noreply@blogger.com) at April 17, 2014 09:00 PM

Rochkind, Jonathan

Large collections in JS in the browser

Developers from the New York Times have released some open source software meant for displaying and managing large digital content collections, and doing so client-side, in the browser with JS.

Developed for journalism, this has some obvious potential relevance to the business of libraries too, right?  Large collections (increasingly digital), that’s what we’re all about, ain’t it?

Pourover and Tamper

Today we’re open-sourcing two internal projects from The Times:

Collections are important to developers, especially news developers. We are handed hundreds of user submitted snapshots, thousands of archive items, or millions of medical records. Filtering, faceting, paging, and sorting through these sets are the shortest paths to interactivity, direct routes to experiences which would have been time-consuming, dull or impossible with paper, shelves, indices, and appendices….

…The genesis of PourOver is found in the 2012 London Olympics. Editors wanted a fast, online way to manage the half a million photos we would be collecting from staff photographers, freelancers, and wire services. Editing just hundreds of photos can be difficult with the mostly-unimproved, offline solutions standard in most newsrooms. Editing hundreds of thousands of photos in real-time is almost impossible.

Yep, those sorts of tasks sound like things libraries are involved in, or would like to be involved in, right?

The actual JS does some neat things, like figuring out how to incrementally and just-in-time send deltas of data, and it includes some good UI tools. Look at the page for more.

I am increasingly interested in what ‘digital journalism’ is up to these days. They are an enterprise with some similarities to libraries, in that they are an information-focused business which is having to deal with a lot of internet-era ‘disruption’.    Journalistic enterprises are generally for-profit (unlike most of the libraries we work in), but still with a certain public service ethos.  And some of the technical problems they deal with overlap heavily with our area of focus.

It may be that the grass is always greener, but I think the journalism industry is rising to the challenges somewhat better than ours is, or at any rate is putting more resources into technical innovation. When was the last time something that probably took as many developer-hours as this stuff, and is of potential interest outside the specific industry, came out of libraries?


Filed under: General

by jrochkind at April 17, 2014 06:08 PM

PeerLibrary

Presentation of PeerLibrary at iAnnotate 2014 conference in San...



Presentation of PeerLibrary at iAnnotate 2014 conference in San Francisco. Including the demo of current version in development, 0.2.

by mitarm at April 17, 2014 05:42 PM

Rochkind, Jonathan

“You build it, you run it”

I have seen several different approaches to division of labor in developing, deploying, and maintaining web apps.

The one that seems to work best to me is when the same team responsible for developing an app is the team responsible for deploying it and keeping it up, as well as for maintaining it. The same team — and ideally the same individual people (at least at first; job roles and employment changes over time, of course).

If the people responsible for writing the app in the first place are also responsible for deploying it with good uptime stats, then they have incentive to create software that can be easily deployed and can stay up reliably. If it isn’t at first, then the people who receive the pain of this are the same people best placed to improve the software to deploy better, because they are most familiar with its structure and how it might be altered.

Software is always a living organism; it’s never simply “done”. It is going to need modifications in response to what you learn from how its users use it, as well as changing contexts and environments. Software is always under development; the first time it becomes public is just one marker in its development lifecycle, not a clear boundary between “development” and “deployment”.

Compare this to other divisions of labor, where maybe one team does “R&D” on a nice prototype, then hands their code over to another team to turn it into a production service, or to figure out how to get it deployed and keep it deployed reliably and respond to trouble tickets.  Sometimes these teams may be in entirely different parts of the organization.  If the code doesn’t deploy as easily or reliably as the ‘operations’ people would like, do they need to convince the ‘development’ people that this is legit and something should be done? And when it needs additional enhancements or functional changes, maybe it’s the crack team of R&Ders who do it, even though they’re on to newer and shinier things; or maybe it’s the operations people who are expected to do it, even though they’re not familiar with the code since they didn’t write it; or maybe there’s nobody to do it at all, because the organization is operating on the mistaken assumption that developing software is like constructing a building: when it’s done, it’s done.[1]

I just don’t find that these other arrangements work well for creating robust, reliable software which can evolve to meet changing requirements.

 

Recently I ran into a quote from an interview with Werner Vogels, Chief Technology Officer at Amazon, expressing these benefits of “You build it, you run it”:

There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.

I was originally directed to that quote by this blog post on the need for shared dev and ops responsibility, which I recommend too.

In this world of silos, development threw releases at the ops or release team to run in production.

The ops team makes sure everything works, everything’s monitored, everything’s continuing to run smoothly.

When something breaks at night, the ops engineer can hope that enough documentation is in place for them to figure out the dial and knobs in the application to isolate and fix the problem. If it isn’t, tough luck.

Putting developers in charge of not just building an app, but also running it in production, benefits everyone in the company, and it benefits the developer too.

It fosters thinking about the environment your code runs in and how you can make sure that when something breaks, the right dials and knobs, metrics and logs, are in place so that you yourself can investigate an issue late at night.

As Werner Vogels put it on how Amazon works: “You build it, you run it.”

The responsibility to maintaining your own code in production should encourage any developer to make sure that it breaks as little as possible, and that when it breaks you know what to do and where to look.

That’s a good thing.

None of this means you can’t have people who focus on ops and other people who focus on dev; but I think it means they should be situated organizationally close to each other, on the same teams, and that the dev people have to share some ops responsibilities, so they feel some pain from products that are hard to deploy, hard to keep running reliably, or hard to maintain or change.

[1] Note some people think even constructing a building shouldn’t be “when it’s done it’s done”, but that buildings too should be constructed in such a way that allows continual modification by those who inhabit them, in response to changing needs or understandings of needs.


Filed under: General

by jrochkind at April 17, 2014 04:45 PM

Open Knowledge Foundation

Building an archaeological project repository II: Where are the research data repositories?

This is a guest post by Anthony Beck, Honorary fellow, and Dave Harrison, Research fellow, at the University of Leeds School of Computing


Data repository as research tool

In a previous post, we examined why Open Science is necessary to take advantage of the huge corpus of data generated by modern science. In our project Detection of Archaeological residues using Remote sensing Techniques, or DART, we adopted Open Science principles and made all the project’s extensive data available through a purpose-built data repository built on the open-source CKAN platform. But with so many academic repositories, why did we need to roll our own? A final post will look at how the portal was implemented.

DART: data-driven archaeology

DART’s overall aim is to develop analytical methods to differentiate archaeological sediments from non-archaeological strata, on the basis of remotely detected phenomena (e.g. resistivity, apparent dielectric permittivity, crop growth, thermal properties etc). DART is a data rich project: over a 14 month period, in-situ soil moisture, soil temperature and weather data were collected at least once an hour; ground based geophysical surveys and spectro-radiometry transects were conducted at least monthly; aerial surveys collecting hyperspectral, LiDAR and traditional oblique and vertical photographs were taken throughout the year, and laboratory analyses and tests were conducted on both soil and plant samples. The data archive itself is in the order of terabytes.

Analysis of this archive is ongoing; meanwhile, this data and other resources are made available through open access mechanisms under liberal licences and are thus accessible to a wide audience. To achieve this we used the open-source CKAN platform to build a data repository, DARTPortal, which includes a publicly queryable spatio-temporal database (on the same host), and can support access to individual data as well as mining or analysis of integrated data.

This means we can share the data analysis and transformation processes and demonstrate how we transform data into information and synthesise this information into knowledge (see, for example, this Ipython notebook which dynamically exploits the database connection). This is the essence of Open Science: exposing the data and processes that allow others to replicate and more effectively build on our science.
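Because CKAN exposes a JSON “action” API, the datasets in a repository like this can also be discovered and retrieved programmatically. The snippet below is a minimal sketch using Python against the standard CKAN API; the portal address and search term are placeholders rather than details taken from the project.

    import requests

    # Placeholder CKAN host; substitute the address of the portal being queried
    CKAN = "https://ckan.example.org"

    # Search the catalogue for datasets mentioning "soil moisture"
    r = requests.get(
        f"{CKAN}/api/3/action/package_search",
        params={"q": "soil moisture", "rows": 5},
    )
    r.raise_for_status()

    for dataset in r.json()["result"]["results"]:
        print(dataset["name"])
        # Each dataset lists its downloadable resources (files, tables, etc.)
        for resource in dataset.get("resources", []):
            print("  ", resource.get("format"), resource.get("url"))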

Lack of existing infrastructure

Pleased though we are with our data repository, it would have been nice not to have to build it! Individual research projects should not bear the burden of implementing their own data repository framework. This is much better suited to local or national institutions where the economies of scale come into their own. Yet in 2010 the provision of research data infrastructure that supported what DART did was either non-existent or poorly advertised. Where individual universities provided institutional repositories, these were focused on publications (the currency of prestige and career advancement) and not on data. Irrespective of other environments, none of the DART collaborating partners provided such a data infrastructure.

Data sharing sites like Figshare did not exist – and once Figshare did exist, the size of our hyperspectral data, in particular, was quite rightly a worry. This situation is slowly changing, but it is still far from ideal. The positions taken by Research Councils UK and the Engineering and Physical Sciences Research Council (EPSRC) on improving access to data are key catalysts for change. The EPSRC statement is particularly succinct:

Two of the principles are of particular importance: firstly, that publicly funded research data should generally be made as widely and freely available as possible in a timely and responsible manner; and, secondly, that the research process should not be damaged by the inappropriate release of such data.

This has produced a simple economic issue – if research institutions cannot demonstrate that they can manage research data in the manner required by the funding councils, then they will become ineligible to receive grant funding from those councils. The impact is that the majority of universities are now developing their own data repositories, or collaborating on communal ones.

But what about formal data deposition environments?

DART was generously funded through the Science and Heritage Programme supported by the UK Arts and Humanities Research Council (AHRC) and the EPSRC. This means that these research councils will pay for data archiving in the appropriate domain repository, in this case the Archaeology Data Service (ADS). So why produce our own repository?

Deposition to the ADS would only have occurred after the project had finished. With DART, the emphasis has been on re-use and collaboration rather than primarily on archiving. These goals are not mutually exclusive: the methods adopted by DART mean that we produced data that is directly suitable for archiving (well documented ASCII formats, rich supporting description and discovery metadata, etc) whilst also allowing more rapid exposure and access to the ‘full’ archive. This resulted in DART generating much richer resource discovery and description metadata than would have been the case if the data was simply deposited into the ADS.

The point of the DART repository was to produce an environment which would facilitate good data management practice and collaboration during the lifetime of the project. This is representative of a crucial shift in thinking, where projects and data collectors consider re-use, discovery, licences and metadata at a much earlier stage in the project life cycle: in effect, to create dynamic and accessible repositories that have impact across the broad stakeholder community rather than focussing solely on the academic community. The same underpinning philosophy of encouraging re-use is seen at both Figshare and the DataHub. Whilst formal archiving of data is to be encouraged, if the archived data is not re-usable, or more importantly easily re-usable, within orchestrated scientific workflow frameworks, then what is the point?

In addition, it is unlikely that the ADS will take the full DART archive. It has been said that archaeological archives can produce lots of extraneous or redundant ‘stuff’. This can be exacerbated by the unfettered use of digital technologies – how many digital images are really required for the same trench? Whilst we have sympathy with this argument, there is a difference between ‘data’ and ‘pretty pictures’: as data analysts, we consider that a digital photograph is normally a data resource and rarely a pretty picture. Hence, every image has value.

This is compounded when advances in technology mean that new data can be extracted from ‘redundant’ resources. For example, Structure from Motion (SfM) is a Computer Vision technique that extracts 3D information from 2D objects. From a series of overlapping photographs, SfM techniques can be used to extract 3D point clouds and generate orthophotographs from which accurate measurements can be taken. In the case of SfM there is no such thing as redundancy, as each image becomes part of a ‘bundle’ and the statistical characteristics of the bundle determine the accuracy of the resultant model. However, one does need to be pragmatic, and it is currently impractical for organisations like the ADS to accept unconstrained archives. That said, it is an area that needs review: if a research object is important enough to have detailed metadata created about it, then it should be important enough to be archived.

For DART, this means that the ADS is hosting a subset of the archive in long-term re-use formats, which will be available in perpetuity (which formally equates to a maximum of 25 years), while the DART repository will hold the full archive in long-term re-use formats until we run out of server money. We are in discussion with Leeds University about migrating all the data objects over to the new institutional repository, with sparkling new DOIs, and we can transfer the metadata held in CKAN over to Open Knowledge’s public repository, the DataHub. In theory nothing should be lost.

How long is forever?

The point on perpetuity is interesting. Collins Dictionary defines perpetuity as ‘eternity’. However, the ADS defines ‘digital’ perpetuity as 25 years. This raises the question: is it more effective in the long term to deposit in ‘formal’ environments (with an intrinsic focus on preservation format over re-use), or in ‘informal’ environments with a focus on re-use and engagement over preservation (Flickr, Wikimedia Commons, the DART repository based on CKAN, etc.)? Both Flickr and Wikimedia Commons have been around for over a decade. Distributed peer-to-peer sharing, as used in Git, produces more robust and resilient environments which are equally suited to longer-term preservation. Whilst the authors appreciate that the situation is much more nuanced, particularly with the introduction of platforms that facilitate collaborative workflow development, this does have an impact on long-term deployment.

Choosing our licences

Licences are fundamental to the successful re-use of content. Licences describe who can use a resource, what they can do with this resource and how they should reference any resource (if at all).

Two lead organisations have developed legal frameworks for content licensing, Creative Commons (CC) and Open Data Commons (ODC). Until the release of CC version 4, published in November 2013, the CC licence did not cover data. Between them, CC and ODC licences can cover all forms of digital work.

At the top level the licences are permissive public domain licences (CC0 and PDDL respectively) that impose no restrictions on the licensee’s use of the resource. ‘Anything goes’ in a public domain licence: the licensee can take the resource and adapt it, translate it, transform it, improve upon it (or not!), package it, market it, sell it, etc. Constraints can be added to the top-level licence by employing the following clauses:

  • BY (attribution) – any re-use must credit the source
  • SA (share-alike) – any derivative must carry the same licence
  • NC (non-commercial) – commercial re-use is not permitted
  • ND (no-derivatives) – the resource may not be adapted or transformed

Each of these clauses decreases the ‘open-ness’ of the resource. In fact, the NC and ND clauses are not intrinsically open (they restrict both who can use the resource and what can be done with it). These restrictive clauses have the potential to produce licence incompatibilities which may introduce profound problems in the medium to long term. This is particularly relevant to the SA clause. Share-alike means that any derived output must be licensed under the same conditions as the source content. If content is combined (or mashed up) – which is essential when one is building up a corpus of heritage resources – then content created under an SA clause cannot be combined with content carrying a restrictive clause (BY, NC or ND) that is not in the source licence. This licence incompatibility has a significant impact on the nature of the data commons. It has the potential to fragment the data landscape, creating pockets of knowledge which are rarely used in mainstream analysis, research or policy making. This will be further exacerbated when automated data aggregation and analysis systems become the norm. A permissive licence without clauses like Non-commercial, Share-alike or No-derivatives removes such licence and downstream re-user fragmentation issues.

For completeness, specific licences have been created for Open Government Data. The UK Government Data Licence for public sector information is essentially an open licence with a BY attribution clause.

At DART we have followed the guidelines of the Open Data Institute and separated out creative content (illustrations, text, etc.) from data content. Hence, the DART content is either CC-BY or ODC-BY respectively. In the future we believe it would be useful to drop the BY (attribution) clause. This would stop attribution stacking (if the resource you are using is a derivative of a derivative of a derivative, at what stage do you stop attributing?), and anything which requires bureaucracy, such as attributing an image in a PowerPoint presentation, inhibits re-use (one should always assume that people are intrinsically lazy). There is a post advocating ccZero+ by Dan Cohen. However, impact tracking may mean that the BY clause becomes a default for academic deposition.

The ADS uses a more restrictive bespoke default licence which does not map to national or international licence schemes (they also don’t recognise non-CC licences). Resources under this licence can only be used for teaching, learning, and research purposes. Of particular concern is their use of the NC clause and possible use of the ND clause (depending on how you interpret the licence). Interestingly, policy changes mean that the use of data under the bespoke ADS licence becomes problematic if university teaching activities are determined to be commercial. It is arguable that the payment of tuition fees represents a commercial activity. If this is true, then resources released under the ADS licence cannot be used within university teaching that is part of a commercial activity. Hence, the policy change in student tuition and university funding has an impact on the commercial nature of university teaching, which has a subsequent impact on what data or resources universities are licensed to use. Whilst it may never have been the intention of the ADS to produce a licence with this potential paradox, it is a problem when bespoke licences are developed, even if they were originally perceived to be relatively permissive. To remove this ambiguity, it is recommended that submissions to the ADS are provided under a CC licence, which renders the bespoke ADS licence void.

In the case of DART, these licence variations with the ADS should not be a problem. Our licences are permissive (by attribution is the only clause we have included). This means the ADS can do anything they want with our resources as long as they cite the source. In our case this would be the individual resource objects or collections on the DART portal. This is a good thing, as the metadata on the DART portal is much richer than the metadata held by the ADS.

Concerns about opening up data, and responses which have proved effective

Christopher Gutteridge (University of Southampton) and Alexander Dutton (University of Oxford) have collated a Google doc entitled ‘Concerns about opening up data, and responses which have proved effective’. This document describes a number of concerns commonly raised by academic colleagues about increasing access to data. For DART, two issues became problematic that were not covered by this document: whether open data undermines research novelty, and journal policies against publishing papers based on already-published datasets.

The former point is interesting – does the process of undertaking open science, or at least providing open data, undermine the novelty of the resultant scientific process? With open science it could be difficult to directly attribute the contribution, or novelty, of a single PhD student to an openly collaborative research process. However, if online versioning tools like Git are used, then it is clear who has contributed what to a piece of code or a workflow (the benefits of the BY clause). This argument is less solid when we are talking solely about open data. Whilst it is true that other researchers (or anybody else for that matter) have access to the data, it is highly unlikely that multiple researchers will use the same data to answer exactly the same question. If they do ask the same question (and making the optimistic assumption that they reach the same conclusion), it is still highly unlikely that they will have done so by the same methods; and even if they do, their implementations will be different. If multiple methods using the same source data reach the same conclusion, then there is an increased likelihood that the conclusion is correct and that the science is even more certain. The underlying point here is that 21st-century scientific practice will substantially benefit from people showing their working. Exposure of the actual process of scientific enquiry (the algorithms, code, etc.) will make the steps between data collection and publication more transparent, reproducible and peer-reviewable – or, quite simply, more scientific. Hence, we would argue that open data and research novelty are only a problem if plagiarism is a problem.

The journal publication point is equally interesting. Publications are the primary metric for academic career progression and kudos. In this instance it was the policy of the ‘leading journal in this field’ that it would not publish a paper based on a dataset that had already been published. No credible reasons were provided for this clause, which seems draconian in the extreme. It does indicate that no one-size-fits-all approach will work in the academic landscape. It will also be interesting to see how this journal, which publishes work that is mainly funded by the EPSRC, responds to the EPSRC guidelines on open data.

This is also a clear demonstration that the academic community needs to develop new metrics that are more suited to 21st-century research and scholarship, directly linking academic career progression to sources of impact that go beyond publications. Furthermore, academia needs some high-profile exemplars that demonstrate clearly how to deal with such change. The policy shift and ongoing debate concerning ‘Open access’ publications in the UK is changing the relationship between funders, universities, researchers, journals and the public – a similar debate needs to occur about open data and open science.

The altmetrics community is developing new metrics for “analyzing, and informing scholarship” and has described its ethos in its manifesto. The Research Councils and Governments have taken a much greater interest in the impact of publicly funded research. Importantly, public, social and industry impact are as important as academic impact. It is incumbent on universities to respond to this by directly linking academic career progression to impact and by encouraging improved access to the underlying data and processing outputs of the research process through data repositories and workflow environments.

by Guest at April 17, 2014 02:59 PM

Hellman, Eric

Is the Kindle Direct Program MFN Legal?

If you sell an ebook through Amazon's Kindle Direct program, Amazon doesn't want you to offer it for less somewhere else. It's easy to understand why; if you're a consumer, you hate to pay $10 for an ebook on Amazon and then find that you can get it direct from the author for $5. But is it legal for Amazon to enjoin a publisher from offering better prices in other channels? In other words, is Amazon allowed to insist on a "Most Favored Nation" (MFN) provision?


Here's the provision in Amazon's Kindle Direct Program that constitutes an MFN:
4. Setting Your List Price
You must set your Digital Book's List Price (and change it from time-to-time if necessary) so that it is no higher than the list price in any sales channel for any digital or physical edition of the Digital Book.
But if you choose the 70% Royalty Option, you must further set and adjust your List Price so that it is at least 20% below the list price in any sales channel for any physical edition of the Digital Book. 
I really don't know the answer, but I do know that Apple's MFN provision was a focus of the Department of Justice's successful prosecution of Apple and 5 colluding publishers for violations of the Sherman Antitrust Act. If Apple couldn't have a MFN, then how can Amazon insist on it, given their dominant market position in ebooks?

The Winston & Strawn law firm has a nice discussion of MFN clauses in the light of Judge Cote's decision in the US vs. Apple case. Here's the highlight:
Although the judge found that the MFN clause in this instance was critical to Apple’s ability to orchestrate the unlawful conspiracy, Judge Cote explicitly held that MFN clauses are not, in and of themselves, “inherently illegal.” Judge Cote explained that “entirely lawful contracts may contain an MFN …. The issue is not whether an entity … used an MFN, but whether it conspired to raise prices.” This determination, she stated, must be based on consideration of the “totality of the evidence,” rather than on the language of the agency agreement or MFN alone. Examining the facts in this particular case, Judge Cote found that Apple’s use of the MFN clause to facilitate the e-book conspiracy with the publishers constituted a “per se” violation of the antitrust laws.
Martin Coleman writes in Mondaq:
depending on the economic and commercial circumstances, MFN clauses have on occasion caused concern to competition authorities. In particular:
  • They can act as a disincentive to price cutting. If a supplier knows that, by offering a discount to any third-party customer, the supplier must also offer the customer benefiting from the MFN clause a discount to ensure that the latter enjoys the most favourable price, that is a "double cost" to price cutting, and therefore could have the effect of deterring price cuts and keeping prices higher than they might otherwise be.
In the European Union, Amazon has run into problems with a similar "Price Parity" provision for the Amazon Marketplace. After inquiries by European Union regulatory agencies, Amazon agreed NOT to enforce Price Parity, an agreement that has been in effect since August 31, 2013. The Bookseller reported on the effect of this agreement in the (print) book market.

In the U.S., there's further confusion about distribution channel pricing because of the Robinson-Patman Act, which prevents publishers from pricing print books to favor one distributor over another. But according to the Federal Trade Commission, "The Act applies to commodities, but not to services, and to purchases, but not to leases." Since ebooks are licensed, not sold, it seems to this non-lawyer that Robinson-Patman shouldn't apply to ebooks.

The particular situation that has drawn my attention is the case of authors and publishers that make their ebooks available under Creative Commons licenses. Many of these authors also make their ebooks available via the Kindle Direct Publishing Program. There's nothing at all wrong with that - many readers prefer to get these ebooks onto their Kindles via Amazon, and are happy to know that some money ends up with the creators of the ebook. Amazon offers convenience, reliable customer service and wireless delivery.  

At Unglue.it, we're starting to offer Creative Commons creators the ability to ask people who download their ebooks for support (the program officially launches on April 30). The top concern these authors have expressed to us about this program is the "setting your list price" clause for their Kindle Direct channel. If they participate in our "Thanks for Ungluing" program they worry that Amazon will kick them out of the KDP program and the corresponding revenue stream.

We've done a few things to address this concern. Creators can set a "list price" in Unglue.it; it's the suggested contribution for the pay-what-you-want download. And that's the price we report in our schema.org metadata.
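For readers unfamiliar with schema.org, the kind of metadata being described looks roughly like the following. This is a minimal, hypothetical sketch of an Offer record rendered from Python, not Unglue.it's actual markup; the title and price are invented.

    import json

    # Hypothetical JSON-LD for a pay-what-you-want ebook offer;
    # property names come from the schema.org vocabulary, values are invented
    offer = {
        "@context": "https://schema.org",
        "@type": "Offer",
        "itemOffered": {"@type": "Book", "name": "An Example Unglued Ebook"},
        "price": "3.99",  # the suggested contribution, reported as the list price
        "priceCurrency": "USD",
    }

    print(json.dumps(offer, indent=2))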

But what if Amazon sees Unglue.it offering free downloads of books they're offering for $3.99 on the Kindle? Would they delist the book from the Kindle platform and kill that revenue stream? Or maybe delist the publisher entirely?

It seems to me that if Amazon did this, it could be running afoul of Judge Cote's guidelines for MFN provisions. Enforcing the MFN would amount to retaliation against creators who offer lower prices (including zero) in other channels. Amazon doesn't even let you set your price to zero.

What do you think?

by Eric (noreply@blogger.com) at April 17, 2014 03:02 AM

Morgan, Eric Lease

Rome in a day, the archivist on a linked data pilgrimage way

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra.

Linked data is not a fad. It is not a trend. It makes a lot of computing sense, and it is a modern way of fulfilling some of the goals of archival practice. Just like Rome, it is not going away. An understanding of what linked data has to offer is akin to experiencing Rome first hand. Both will ultimately broaden your perspective. Consequently, it is a good idea to make a concerted effort to learn about linked data, as well as to visit Rome at least once. Once you have returned from your trip, discuss what you learned with your friends, neighbors, and colleagues. The result will be enlightening for everybody.

The previous sections of this book described what linked data is and why it is important. The balance of the book describes more of the how’s of linked data. For example, there is a glossary to help reinforce your knowledge of the jargon. You can learn about HTTP “content negotiation” to understand how actionable URIs can return HTML or RDF depending on the way you instruct remote HTTP servers. RDF stands for “Resource Description Framework”, and the “resources” are represented by URIs. A later section of the book describes ways to design the URIs of your resources. Learn how you can transform existing metadata records like MARC or EAD into RDF/XML, and then learn how to put the RDF/XML on the Web. Learn how to exploit your existing databases (such as the ones under Archon, Archivist’s Toolkit, or ArchivesSpace) to generate RDF. If you are the Do It Yourself type, then play with and explore the guidebook’s tool section. Get the gentlest of introductions to searching RDF using a query language called SPARQL. Learn how to read and evaluate ontologies & vocabularies. They are manifested as XML files, and they are easily readable and visualizable using a number of programs. Read about and explore applications using RDF as the underlying data model; there are a growing number of them. The book includes a complete publishing system written in Perl, and if you approach the code of the publishing system as if it were a theatrical play, then the “scripts” read like scenes. (Think of the scripts as if they were a type of poetry, and they will come to life. Most of the “scenes” are less than a page long. The poetry even includes a number of refrains. Think of the publishing system as if it were a one-act play.) If you want to read more, and you desire a vetted list of books and articles, then a later section offers a set of further readings.
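As a small illustration of what content negotiation means in practice, here is a minimal sketch, assuming a server that honours Accept headers; the DBpedia URI is used purely as a well-known public example, not something drawn from the guidebook.

    import requests

    # A well-known linked data URI, used purely as an example
    uri = "http://dbpedia.org/resource/Mark_Twain"

    # Ask for an RDF serialization (Turtle); a client asking for HTML
    # would be sent to a human-readable page instead
    rdf = requests.get(uri, headers={"Accept": "text/turtle"})
    print(rdf.headers.get("Content-Type"))

    # The same URI, negotiated as HTML
    html = requests.get(uri, headers={"Accept": "text/html"})
    print(html.headers.get("Content-Type"))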

After you have spent some time learning a bit more about linked data, discuss what you have learned with your colleagues. There are many different aspects of linked data publishing, such as but not limited to:

In archival practice, each of these things would be done by different sets of people: archivists & content specialists, administrators & managers, computer programmers & systems administrators, metadata experts & catalogers. Each of these sets of people has a piece of the publishing puzzle and something significant to contribute to the work. Read about linked data. Learn about linked data. Bring these sets of people together to discuss what you have learned. At the very least you will have a better collective understanding of the possibilities. If you don’t plan to “go to Rome” right away, you might decide to reconsider the “vacation” at another time.

Even Michelangelo, when he painted the Sistine Chapel, worked with a team of people, each possessing a complementary set of skills. Each had something different to offer, and the discussion among them was key to their success.

by LiAM: Linked Archival Metadata at April 17, 2014 01:33 AM

Journal, Code4Lib

Editorial Introduction: Seeking a Diversity of Voices

Making the Journal the best that it can be.

by Ron Peterson at April 17, 2014 12:46 AM

EgoSystem: Where are our Alumni?

Comprehensive social search on the Internet remains an unsolved problem. Social networking sites tend to be isolated from each other, and the information they contain is often not fully searchable outside the confines of the site. EgoSystem, developed at Los Alamos National Laboratories (LANL), explores the problems associated with automated discovery of public online identities for people, and the aggregation of the social, institutional, conceptual, and artifact data connected to these identities. EgoSystem starts with basic demographic information about former employees and uses that information to locate person identities in various popular online systems. Once identified, their respective social networks, institutional affiliations, artifacts, and associated concepts are retrieved and linked into a graph containing other found identities. This graph is stored in a Titan graph database and can be explored using the Gremlin graph query/traversal language and with the EgoSystem Web interface.

by James Powell, Harihar Shankar, Marko Rodriguez, Herbert Van de Sompel at April 17, 2014 12:46 AM

Enhancing Descriptive Metadata Records with Freely-Available APIs

This article describes how the University of North Texas Libraries' Digital Projects Unit used simple, freely-available APIs to add place names to metadata records for over 8,000 maps in two digital collections. These textual place names enable users to easily find maps by place name and to find other maps that feature the same place, thus increasing the accessibility and usage of the collections. This project demonstrates how targeted large-scale, automated metadata enhancement can have a significant impact with a relatively small commitment of time and staff resources.

by Mark Phillips and Hannah Tarver at April 17, 2014 12:46 AM

Using Open Source Tools to Create a Mobile Optimized, Crowdsourced Translation Tool

In late 2012, OSU Libraries and Press partnered with Maria’s Libraries, an NGO in Rural Kenya, to provide users the ability to crowdsource translations of folk tales and existing children's books into a variety of African languages, sub-languages, and dialects. Together, these two organizations have been creating a mobile optimized platform using open source libraries such as Wink Toolkit (a library which provides mobile-friendly interaction from a website) and Globalize3 to allow for multiple translations of database entries in a Ruby on Rails application. Research regarding successes of similar tools has been utilized in providing a consistent user interface. The OSU Libraries & Press team delivered a proof-of-concept tool that has the opportunity to promote technology exploration, improve early childhood literacy, change the way we approach foreign language learning, and to provide opportunities for cost-effective, multi-language publishing.

by Evviva Weinraub Lajoie, Trey Terrell, Susan McEvoy, Eva Kaplan, Ariel Schwartz, and Esther Ajambo at April 17, 2014 12:46 AM

EPUB as Publication Format in Open Access Journals: Tools and Workflow

In this article, we present a case study of how the main publishing format of an Open Access journal was changed from PDF to EPUB by designing a new workflow using JATS as the basic XML source format. We state the reasons and discuss advantages for doing this, how we did it, and the costs of changing an established Microsoft Word workflow. As an example, we use one typical sociology article with tables, illustrations and references. We then follow the article from JATS markup through different transformations resulting in XHTML, EPUB and MOBI versions. In the end, we put everything together in an automated XProc pipeline. The process has been developed on free and open source tools, and we describe and evaluate these tools in the article. The workflow is suitable for non-professional publishers, and all code is attached and free for reuse by others.

by Trude Eikebrokk, Tor Arne Dahl, and Siri Kessel at April 17, 2014 12:46 AM

Customizing Android Tablets for a Shared Environment

The Valley Library at Oregon State University Libraries & Press supports access to technology by lending laptops and e-readers. As a newcomer to tablet lending, The Valley Library chose to implement its service using Google Nexus tablets and an open source custom firmware solution, CyanogenMod, a free, community-built Android distribution. They created a custom build of CyanogenMod featuring wireless updates, website shortcuts, and the ability to quickly and easily wipe devices between patron uses. This article shares code that simplifies Android tablet maintenance and addresses Android application licensing issues for shared devices.

by Jane Nichols, Uta Hussong-Christian and Ryan Ordway at April 17, 2014 12:46 AM

An Introduction to Optical Media Preservation

As the archival horizon moves forward, optical media will become increasingly significant and prevalent in collections. This paper sets out to provide a broad overview of optical media in the context of archival migration. We begin by introducing the logical structure of compact discs, providing the context and language necessary to discuss the medium. The article then explores the most common data formats for optical media: Compact Disc Digital Audio, ISO 9660, the Joliet and HFS extensions, and the Universal Data Format (with an eye towards DVD-Video). Each format is viewed in the context of preservation needs and what archivists need to be aware of when handling said formats. Following this, we discuss preservation workflows and concerns for successfully migrating data away from optical media, as well as directions for future research.

by Alexander Duryee at April 17, 2014 12:46 AM

Review of DigitalSignage.com

Digital signage has been used in the commercial sector for decades. As display and networking technologies become more advanced and less expensive, it is surprisingly easy to implement a digital signage program at a minimal cost. In the fall of 2011, the University of Florida (UF), Health Sciences Center Library (HSCL) initiated the use of digital signage inside and outside its Gainesville, Florida facility. This article details UF HSCL’s use and evaluation of DigitalSignage.com signage software to organize and display its digital content.

by Clifford Richmond and Matthew Daley at April 17, 2014 12:46 AM

April 16, 2014

LITA

Jobs in Information Technology: April 16

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.


New This Week

Senior Network Administrator,  Douglas County Libraries, Castle Rock, CO

Visit the LITA Job Site for more available jobs and for information on submitting a  job posting.

by vedmonds at April 16, 2014 05:11 PM

OCLC Dev Network

Announcing Release of OCLC Python Authentication Library

We are happy to announce the v1.0 release of the OCLC Python 2.7 Authentication Library via Github. This code library is the fourth implementation that the OCLC Developer Network is releasing to assist developers working with our web services protected by our API key system.

by George Campbell at April 16, 2014 12:00 PM

April 15, 2014

Engard, Nicole

Bookmarks for April 15, 2014

Today I found the following resources and bookmarked them.

Digest powered by RSS Digest

The post Bookmarks for April 15, 2014 appeared first on What I Learned Today....

by Nicole C. Engard at April 15, 2014 08:30 PM

Miedema, John

“Dr. Lanyon sat alone over his wine.” Tokenizing content using OpenNLP.

Before Whatson accepts a search query it will first ingest, analyze and index documents so that searches don’t take forever. I have shown how Whatson will use Apache Tika to extract metadata and convert different content types into plain text. After that, the plain text will be split up into words, called tokens, so that queries can later be matched up to documents. Here is a simple example:

The tokenizer analyzes whitespace and punctuation to produce a list of tokens. Partial example, with pipes inserted by me:

Dr. | Lanyon | sat | alone | over | his | wine | . | This | was | a | hearty | , | healthy …

The tokenizer was smart enough to keep the period with “Dr.” but to separate it out where it marks the end of a sentence. This is why you don’t want to build this from scratch.
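OpenNLP itself is a Java library; as an analogous illustration in Python (a deliberate substitution, not what Whatson uses), NLTK's pre-trained tokenizers show the same abbreviation-aware behaviour. A minimal sketch, assuming the "punkt" model can be downloaded; the second sentence is invented for the example.

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    # One-time download of the pre-trained sentence/word tokenizer models
    nltk.download("punkt", quiet=True)

    text = "Dr. Lanyon sat alone over his wine. He poured another glass."

    # Sentence detection: the period after "Dr." does not end a sentence,
    # while the one after "wine" does
    print(sent_tokenize(text))

    # Word tokenization: the period stays attached to the abbreviation "Dr."
    print(word_tokenize(text))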

References:

The Apache Software Foundation. Apache OpenNLP Developer Documentation.
dpdearing. Getting started with OpenNLP 1.5.0 – Sentence Detection and Tokenizing.

by johnmiedema at April 15, 2014 06:01 PM

Denton, William

The heart of the university

The library is the heart of the University. From it, the lifeblood of scholarship flows to all parts of the University; to it the resources of scholarship flow to enrich the academic body. With a mediocre library, even a distinguished faculty cannot attain its highest potential; with a distinguished library, even a less than brilliant faculty may fulfill its mission. For the scientist, the library provides an indispensable backstop to his laboratory and field work. For the humanist, the library is not only his reference centre; it is indeed his laboratory and the field of his explorations. What he studies, researches and writes is the product of his reading in the library. For these reasons, the University library must be one of the primary concerns for those responsible for the development and welfare of the institution. At the same time, the enormous cost of acquisitions, the growing scarcity of older books, the problem of storage and cataloguing make the library one of the most painful headaches of the University administrator.

From the Report to the Committee on University Affairs and the Committee of Presidents of Provincially-Assisted Universities, by the Commission to Study the Development of Graduate Programmes in Ontario Universities, chaired by Gustave O. Arlt, F. Kenneth Hare, and J.W.T. Spinks, published in 1966. I think I found this in Evolution of the Heart: A History of the University of Toronto Library Up to 1981 by Robert Blackburn, which is a fine book, and very interesting. Blackburn was the chief librarian there for about 25 years.

April 15, 2014 12:55 PM

ALA Equitable Access to Electronic Content

Indexing the internet and searching for “free”

The Botnet

“If we can put a man on the moon and we can transplant a heart, we surely can say when something shows up ‘free’ and do something about that.” Rep. Tom Marino (R-PA).

In March, the U.S. House Judiciary Subcommittee on Courts, Intellectual Property and the Internet held a hearing on Section 512, the provision that provides protection for internet service providers from liability for the infringing actions of network users. The Library Copyright Alliance (LCA) submitted comments (pdf) in support of no changes to the existing law, holding that this provision helps libraries provide online services in good faith without liability for the potentially illegal actions of a third party.

Though libraries were not specifically represented in the hearing, one line of questioning directed at both Google and Automattic Inc.—owner of WordPress—stands out as relevant to both present and future methods of delivering content and services to library patrons: “free” as the opposite of “legal” or “legitimate.”

Several representatives focused on witnesses Katherine Oyama, senior copyright policy counsel for Google, and Paul Sieminski, general counsel for Automattic Inc., expressing significant confusion about how Google creates and modifies indexing and search algorithms, as well as the nuances of copyright protection on a blogging platform. “Free” was the watchword, and many subcommittee members expressed the same basic concerns.

Rep. Judy Chu (D-CA) asked about autocomplete results in Google that include “free” and “watch online,” saying that such results “induce infringement” on the part of searchers. Rep. Cedric Richmond (D-LA) further echoed worries that unsophisticated Internet users like his grandmother would be “induced to infringe” by seeing an autocomplete result for “watch 12 Years a Slave free online.”

But the most colorful exchange began with Rep. Tom Marino (R-PA) expressing disbelief that Google could not simply ban or remove terms such as “watch X movie online for free” from the engine.

Oyama rightly pointed out that “we are not going to ban the word ‘free’ from search…there are many legitimate sources for music and films that are available for free.” She also promoted YouTube’s ContentID software as an effective answer to alleged infringement, though there are certainly reasons to remain wary of the “software savior” in addressing takedown notices (more on ContentID coming soon).

As libraries begin exploring ways to deliver legally obtained and responsibly monitored content to patrons, we will have to offer a counterpoint to the concept of “free” as the automatic enemy of rights holders. While we know that it is anything but free to provide these services (no-fee or no-charge is perhaps a better description), the public often perceives it as such, and simply banning phrases like “read for free” or “watch for free” from the world’s largest Internet index will not reduce infringement. Instead, it removes a responsible and reliable source from top page results, which is the exact opposite of what the lawmakers above support.

The post Indexing the internet and searching for “free” appeared first on District Dispatch.

by Abby Lull at April 15, 2014 06:15 AM

Morgan, Eric Lease

Four “itineraries” for putting linked data into practice for the archivist

If you go to Rome for a day, then walk to the Colosseum and Vatican City. Everything you see along the way will be extra. If you go to Rome for a few days, do everything you would do in a single day, eat and drink in a few cafes, see a few fountains, and go to a museum of your choice. For a week, do everything you would do in a few days, and make one or two day-trips outside Rome in order to get a flavor of the wider community. If you can afford two weeks, then do everything you would do in a week, and in addition befriend somebody in the hopes of establishing a life-long relationship.

When you read a guidebook on Rome — or any travel guidebook — there are simply too many listed things to see & do. Nobody can see all the sites, visit all the museums, walk all the tours, nor eat at all the restaurants. It is literally impossible to experience everything a place like Rome has to offer. So it is with linked data. Despite this fact, if you were to do everything linked data has to offer, then you would do all of the things on the following list, starting at the first item, going all the way down to evaluation, and repeating the process over and over:

Given that it is quite possible you do not plan to immediately dive head-first into linked data, you might begin by getting your feet wet or dabbling in a bit of experimentation. That being the case, here are a number of different “itineraries” for linked data implementation. Think of them as strategies. They are ordered from least costly and most modest to most expensive and most complete:

  1. Rome in a day – Maybe you can’t afford to do anything right now, but if you have gotten this far in the guidebook, then you know something about linked data. Discuss (evaluate) linked data with your colleagues, and consider revisiting the topic in a year.
  2. Rome in three days – If you want something relatively quick and easy, but with the understanding that your implementation will not be complete, begin migrating your existing data to RDF. Use XSLT to transform your MARC or EAD files into RDF serializations, and publish them on the Web (a sketch of this migration step appears after this list). Use something like OAI2RDF to make your OAI repositories (if you have them) available as linked data. Use something like D2RQ to make your archival descriptions stored in databases accessible as linked data. Create a triple store and implement a SPARQL endpoint. As before, discuss linked data with your colleagues.
  3. Rome in a week – Begin publishing RDF, but at the same time think hard about and document the structure of your future RDF URIs as well as the ontologies & vocabularies you are going to use. Discuss it with your colleagues. Migrate and re-publish your existing data as RDF using the documentation as a guide. Re-implement your SPARQL endpoint. Discuss linked data not only with your colleagues but with people outside archival practice.
  4. Rome in two weeks – First, do everything you would do in one week. Second, supplement your triple store with the RDF of others. Third, write an application against the triple store that goes beyond search. In short, tell stories, and you will be discussing linked data with the world, literally.
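As promised above, here is a minimal sketch of the migration step from itinerary 2, using Python's pymarc and rdflib libraries instead of the XSLT route mentioned there. The input file name and URI pattern are hypothetical, and only the 245$a title field is mapped, to Dublin Core.

    from pymarc import MARCReader
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    g = Graph()

    # Hypothetical file of binary MARC records
    with open("records.mrc", "rb") as fh:
        for i, record in enumerate(MARCReader(fh)):
            # Hypothetical URI pattern for the published resources
            uri = URIRef(f"http://example.org/record/{i}")
            for field in record.get_fields("245"):
                for title in field.get_subfields("a"):
                    g.add((uri, DC.title, Literal(title)))

    # Publish the serialization on the Web alongside your finding aids
    g.serialize(destination="records.ttl", format="turtle")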

by LiAM: Linked Archival Metadata at April 15, 2014 02:36 AM

April 14, 2014

ALA Equitable Access to Electronic Content

Appeals court decision undermines free speech, misinterprets copyright law

Last week, the American Library Association (ALA) joined an amicus brief calling for reconsideration of a 9th Circuit court decision in Garcia v. Google, a case where actress Cindy Sue Garcia sued Google for not removing a YouTube video in which she appears. Garcia appears for five seconds in “Innocence of Muslims,” the radical anti-Islamic video that fueled the attack on the American embassy in Benghazi. The video was uploaded to YouTube, exposing Garcia to threats and hate mail. Garcia did not know that her five-second performance would be used in a controversial video.

Garcia turned to the copyright law for redress, arguing that her five second performance was protected by copyright, and therefore, as a rights holder she could ask that the video be removed from YouTube. While we empathize with Garcia’s situation, the copyright law does not protect performances in film—instead these performances are works-for-hire. This ruling, if taken to its extreme, would hold that anyone who worked on a film—from the editor to the gaffer—could claim rights, creating a copyright permissions nightmare.

On appeal, the judge agreed that the copyright argument was weak, but nonetheless ruled for Garcia. The video currently is not available for public review. This decision needs to be reheard en banc—the copyright ruling is mistaken, and perhaps more importantly, the copyright law cannot be used to restrain speech. While the facts of this case are not at all appealing, we agree that rules of law need to be upheld. Fundamental values of librarianship—including intellectual freedom, fair use, and preservation of the cultural record—are in serious conflict with the existing court ruling.

Read more on the case.

The post Appeals court decision undermines free speech, misinterprets copyright law appeared first on District Dispatch.

by Carrie Russell at April 14, 2014 09:43 PM

Prom, Chris

Thematic Theme Structure, Hooks, and Filters

One of the most useful resources I found when developing a child theme in the WordPress Thematic theme framework was the theme structure document formerly found on the bluemandala.com website.

With permission from Deryk, I am reproducing it here:  http://e-records.chrisprom.com/manualuploads/thematic-structure.html.

And, here is another great resource for development using Thematic:  http://visualizing.thematic4you.com/

by Chris Prom at April 14, 2014 08:24 PM

Denton, William

Stuff, Standards and Sites

On 26 March 2014 I gave a short talk at the March 2014 AR Standards Community Meeting in Arlington, Virginia. The talk was called “Stuff, Standards and Sites: Libraries and Archives in AR.” My slides and the text of what I said are online:

I struggled with how best to talk to non-library people, all experts in different aspects of augmented reality, about how our work can fit with theirs. The stuff/standards/sites components gave me something to hang the talk on, but it didn’t all come together as well as I’d hoped and in the heat of actually speaking I forgot to mention a couple of important things. Ah well.

I made the slides a new way. They are done with reveal.js, but I wrote them in Emacs with Org and then used org-reveal to export them. It worked beautifully! The diagrams in the slides are done in text in Org with ditaa and turned into images on export.

What I write in Org looks like this (here I turned image display off, but one keystroke makes them show):

Screenshot of editing two slides of content

When turned into slides, that looks like this:

Slide 08

Slide 09

Working this way was a delight. No more nonsense about dragging boxes and stuff around like in PowerPoint. I get to work with pure text, in my favourite editor, and generate great-looking slides, all with free software.

To turn all the slides into little screenshots, I used a script I found in a GitHub gist: PhantomJS script to capture/render screenshots of the slides of a Reveal.js powered slideshow. I had to install PhantomJS first, but on my Ubuntu system that was just a simple sudo apt-get install phantomjs.

April 14, 2014 07:55 PM

LITA

A LITA Guide to responsive web design

Responsive Web Design for Libraries: A LITA Guide

For Immediate Release
Fri, 04/11/2014

Contact:

Valerie Edmonds-Merritt
Program Coordinator
LITA Division
312-280-4269

vedmonds@ala.org

CHICAGO — Tablets, desktops, smartphones, laptops, minis: we live in a world of screens, all of different sizes. Library websites need to work on all of them, but maintaining separate sites or content management systems is resource intensive and still unlikely to address all the variations. By using responsive Web design, libraries can build one site for all devices—now and in the future. In “Responsive Web Design for Libraries: A LITA Guide,” published by ALA TechSource, experienced responsive Web developer Matthew Reidsma, named “a web librarian to watch” by ACRL’s TechConnect blog, shares proven methods for delivering the same content to all users using HTML and CSS. His practical guidance will enable Web developers to save valuable time and resources by working with a library’s existing design to add responsive Web design features. With both clarity and thoroughness, and firmly addressing the expectations of library website users, this book:

  • shows why responsive Web design is so important, and how its flexibility can meet the needs of both today’s users and tomorrow’s technology;
  • provides in-depth coverage of implementing responsive Web design on an existing site, steps for taking traditional desktop CSS and adding breakpoints for site responsiveness, and ways to use grids to achieve a visual layout that’s adaptable to different devices;
  • includes valuable tips and techniques from Web developers and designers, such as how to do more with fewer resources and how to improve performance by designing a site that sends fewer bytes over fewer connections;
  • offers advice for making vendor sites responsive;
  • features an abundance of screen captures, associated code samples and links to additional resources.

Reidsma is Web services librarian at Grand Valley State University, in Allendale, Mich. He is the cofounder and editor in chief of Weave: Journal of Library User Experience, a peer-reviewed, open-access journal for library user experience professionals. He speaks frequently about library websites, user experience and responsive design around the world. Library Journal named him a “Mover & Shaker” in 2013. He writes about libraries and technology at Matthew Reidsma.

The Library and Information Technology Association (LITA), a division of ALA, educates, serves and reaches out to its members, other ALA members and divisions, and the entire library and information community through its publications, programs and other activities designed to promote, develop, and aid in the implementation of library and information technology.

ALA Store purchases fund advocacy, awareness and accreditation programs for library professionals worldwide. Contact us at (800) 545-2433 ext. 5418 or editionsmarketing@ala.org.

by vedmonds at April 14, 2014 03:53 PM

Morgan, Eric Lease

Italian Lectures on Semantic Web and Linked Data

Koha Gruppo Italiano has organized the following free event, which may be of interest to linked data aficionados in cultural heritage institutions:

Italian Lectures on Semantic Web and Linked Data: Practical Examples for Libraries, Wednesday May 7, 2014 at The American University of Rome – Auriana Auditorium (Via Pietro Roselli, 16 – Rome, Italy)

Please RSVP to f.wallner at aur.edu by May 5.

This event is generously sponsored by regesta.exe, Bucap Document Imaging SpA, and SOS Archivi e Biblioteche.

by LiAM: Linked Archival Metadata at April 14, 2014 02:04 PM

Miedema, John

“Put down the marker, step away from the whiteboard.” Architecture diagram for Whatson.

“Put down the marker, step away from the whiteboard.” I joked that once in a design session. A picture can represent a rich array of information in a single frame — that is its strength and weakness. “A picture paints a thousand words. Stop all the talking!” It can take a while to assimilate all the information in a diagram. Here is my first cut at an architecture diagram for Whatson, my home basement attempt at building Watson using public knowledge and open source technology. I will detail the components in future posts as the build proceeds.

whatson architecture

1. Data Source to Index. In order for Whatson to answer questions in a timely fashion, data sources must be pre-processed: crawled and indexed. The index is the structured target for searches.

1.1. Data Sources. I will download data sources including public domain literature and Wikipedia. Other sources may be added. The more data sources, the smarter Whatson will be.

1.2. Crawl. In a previous post, I showed how I can use Apache Tika to convert different content types (e.g., html, pdf) into plain text and extract metadata. This is the crawl stage. The common plain text format makes further processing much easier.
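
Apache Tika is a Java library; purely as an illustrative sketch of this crawl step (not the code from that earlier post), here is roughly how it looks through the community tika bindings for Python, which hand parsing off to a Tika server. The file name is hypothetical.

```python
# Illustrative sketch only: the crawl step described above uses Apache Tika
# (a Java library); here the community-maintained `tika` Python bindings
# stand in for it. They talk to a local Tika server under the hood.
# pip install tika
from tika import parser

def crawl(path):
    """Convert one document (html, pdf, ...) to plain text plus its metadata."""
    parsed = parser.from_file(path)                 # returns a dict
    text = (parsed.get("content") or "").strip()    # plain-text body
    metadata = parsed.get("metadata", {})           # e.g. title, author, Content-Type
    return text, metadata

if __name__ == "__main__":
    # Hypothetical file name, just to show the call shape.
    text, meta = crawl("moby-dick.html")
    print(meta.get("Content-Type"), len(text), "characters of plain text")
```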

1.3. Index. Using OpenNLP, I will process text along a UIMA pipeline. UIMA is an open, industry-standard architecture for Natural Language Processing (NLP) stages. A UIMA pipeline is a series of text-processing steps, including parsing document content into individual words or tokens; analyzing the tokens into parts of speech like nouns and verbs; and identifying entities like people, locations and organizations.
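
OpenNLP and UIMA are likewise Java tools; as a rough stand-in that only illustrates the three stages named above (tokenize, tag parts of speech, find entities), here is a sketch using NLTK instead. It is not the planned UIMA pipeline.

```python
# Rough illustration of the three pipeline stages described above, using NLTK
# as a stand-in for OpenNLP/UIMA (which are Java). Requires the usual NLTK
# models: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words.
import nltk

def annotate(text):
    tokens = nltk.word_tokenize(text)     # 1. split the text into tokens
    tagged = nltk.pos_tag(tokens)         # 2. parts of speech (nouns, verbs, ...)
    tree = nltk.ne_chunk(tagged)          # 3. named entities (PERSON, GPE, ...)
    entities = [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
                for subtree in tree.subtrees()
                if subtree.label() in ("PERSON", "GPE", "ORGANIZATION")]
    return tokens, tagged, entities

tokens, tagged, entities = annotate("Mark Twain lived in Hartford, Connecticut.")
print(entities)   # something like [('Mark Twain', 'PERSON'), ('Hartford', 'GPE'), ...]
```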

2. Question to Answer. Once the data sources have been crawled and indexed, a question may be submitted to Whatson. The output must be Whatson’s single best answer.

2.1. Question. A user interface will accept a question in natural language.

2.2. Cognitive Analysis. Whatson will analyze the question text. The analysis first submits the text to the UIMA pipeline built for step 1. The pipeline outputs are used here to make the question easier to analyze for the next step, deciding the question type. Is the question seeking a person or a place? Is the context literal or figurative? Current or historical? Based on the question type, modules will be enlisted to answer the question. This modular approach simulates the human brain, with different modules dedicated to different kinds of knowledge and cognitive processing. The modules use domain-specific logic to search for answers in the index prepared in step 1. For example, a literature module will have domain-specific rules for analyzing literature. This approach keeps Whatson from going on wild goose chases and speeds up processing. The output of the cognitive analysis is a candidate answer and confidence level from each enlisted module.
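
As a sketch of that dispatch idea, with every name invented for illustration: classify the question, enlist only the modules registered for that question type, and collect an (answer, confidence) pair from each.

```python
# Hypothetical sketch of the module-dispatch idea described above; the module,
# question types and answers are invented for illustration.
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, float]                 # (answer, confidence between 0 and 1)
Module = Callable[[str], Candidate]

MODULES: Dict[str, List[Module]] = {}         # question type -> enlisted modules

def register(question_type: str):
    def wrap(module: Module) -> Module:
        MODULES.setdefault(question_type, []).append(module)
        return module
    return wrap

@register("person")
def literature_module(question: str) -> Candidate:
    # Domain-specific rules plus a search of the index built in step 1 go here.
    return ("Herman Melville", 0.62)

def classify(question: str) -> str:
    # Stand-in for the question typing driven by the step-1 pipeline outputs.
    return "person" if question.lower().startswith("who") else "thing"

def cognitive_analysis(question: str) -> List[Candidate]:
    return [module(question) for module in MODULES.get(classify(question), [])]

print(cognitive_analysis("Who wrote Moby-Dick?"))   # [('Herman Melville', 0.62)]
```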

2.3 Dialog. Whatson needs to decide which answer from the cognitive analysis is best. If the stakes are low, it will simply select the answer with the highest confidence level. The questioner can respond whether the answer is satisfactory. A dialog may continue with additional questions. If Whatson is used in a context that has penalties, like playing Jeopardy, it might not risk giving its best answer if the confidence level is low. If the context permits, Whatson could ask for hints or prompt for a restatement of the question.
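
And a sketch of the selection logic just described, with an invented confidence threshold standing in for the real stakes-sensitive behaviour:

```python
# Sketch of the answer-selection logic described above; the 0.5 threshold is
# invented, standing in for whatever the stakes would really require.
def choose_answer(candidates, high_stakes=False, threshold=0.5):
    """candidates: list of (answer, confidence) pairs from the enlisted modules."""
    if not candidates:
        return None, "I have no answer. Could you restate the question?"
    answer, confidence = max(candidates, key=lambda pair: pair[1])
    if high_stakes and confidence < threshold:
        # In a penalty context (e.g. Jeopardy) don't risk a weak guess.
        return None, "I'm not confident enough. Can you give me a hint?"
    return answer, f"My best answer is {answer} (confidence {confidence:.2f})."

print(choose_answer([("Herman Melville", 0.62), ("Mark Twain", 0.35)]))
```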

References:

Baker. Final Jeopardy: Man vs. Machine and the Quest to Know Everything.
Ingersoll, Morton & Farris. Taming Text.

by johnmiedema at April 14, 2014 12:20 PM

Open Knowledge Foundation

The Tragic Consequences of Secret Contracts

The following post is by Seember Nyager, CEO of the Public and Private Development Centre in Nigeria, one of our campaign partners in the Stop Secret Contracts campaign

Every day, secret contracts carried out within public institutions confirm that the public interest is not being served. A few days ago, young Nigerians in Abuja were arrested for protesting against the reckless conduct of the recruitment exercise at the Nigerian Immigration Service (NIS) that led to the death of 19 applicants.

Although the protesters were later released, the irony still stings: while no one has been held accountable for the deaths resulting from the recklessly conducted recruitment, the young voices protesting this grave misconduct are being silenced by security forces. Most heartbreaking is that the deadly outcome could have been avoided with more conscientious planning, adherence to due process, and diligence in the selection of consultants to carry out the exercise.

A report released by Premium Times indicates that the recruitment exercise was conducted exclusively by the Minister of Interior, who hand-picked the consultant that carried out the recruitment at the NIS. The Ministry’s failure to provide civic organizations, including BudgIT and PPDC, with requested details of the process through which the consultant was selected gives credence to reports that due process was flouted.

The non-competitive process through which the consultant was selected is in sharp breach of the public procurement law, and its results have undermined the concept of value for money in the award of contracts for public services. Although a recruitment website was built and deployed by the hired consultant, the information gathered by the website does not seem to have informed the plan for conducting the recruitment exercise across the country, which left Nigerians dead in its wake. While the legality of the revenue generated from over 710,000 applicants is questioned, it is appalling that these resources were not used to ensure a better organized recruitment exercise.

This is not the first time that public institutions in Nigeria have displayed reckless conduct in the supposed administration of public services to the detriment of Nigerians. The recklessness with which the Ministry of Aviation took a loan to buy highly inflated vehicles, the difficulty faced by BudgIT and PPDC in tracking the exact amount of SURE-P funds spent, and the $20 billion unaccounted for by the NNPC are a few of the cases in which nation building and development are undermined by public institutions.

In the instance of the NIS recruitment conducted three weeks ago, some of the consequences have been immediate and fatal, yet there is foot dragging in apportioning liability and correcting the injustice that has been dealt to Nigerians. On the same issue, public resources have been speedily deployed to silence protesters.

It is time that our laws requiring due process and diligence are fully enforced. Peaceful protests should not be clamped down on; Nigerians are justified in being outraged by any form of institutional recklessness. The Nigerian Immigration Service recruitment exercise painfully illustrates that the outcomes of secret contracts can be deadly, and such behaviour cannot be allowed to continue. We must stop institutional recklessness, and we must stop secret contracts.

Ms. Seember Nyager coordinates procurement monitoring in Nigeria. Follow Nigerian Procurement Monitors at @Nig_procmonitor.

by Theodora Middleton at April 14, 2014 10:13 AM

ALA Equitable Access to Electronic Content

Takedown notice ≠ infringement

Amidst a flurry of congressional hearings and treaty negotiations, it is important to remember that statistics often tell only half of the story. As I catch up on recent U.S. House subcommittee hearings, I continue to marvel at how often both committee members and witnesses conflate the total number of takedown notices with actual cases of infringement. This is not a new problem; the “Chilling Effect” is a well-documented (pdf) result of widespread abuse of Section 512 takedown notices. In 2009, Google reported that over a third of DMCA takedown notices were invalid:

Google notes that more than half (57%) of the takedown notices it has received under the US Digital Millennium Copyright Act 1998, were sent by business targeting competitors and over one third (37%) of notices were not valid copyright claims.

And that doesn’t even include YouTube or Blogger takedown statistics! The numbers aren’t much better today. Google’s latest Transparency Report shows over 27 million removal requests over the past three years, with nearly a million of those requests denied (cited as “improper” or “abusive”) in 2011 alone. Many rights holders will continue to point to takedown notice numbers as evidence of widespread infringement, but this simply reinforces a landscape in which everybody is guilty of violating copyright until proven innocent.

The post Takedown notice ≠ infringement appeared first on District Dispatch.

by Abby Lull at April 14, 2014 06:33 AM

April 12, 2014

Morgan, Eric Lease

Linked Archival Metadata: A Guidebook

A new but still “pre-published” version of the Linked Archival Metadata: A Guidebook is available. From the introduction:

The purpose of this guidebook is to describe in detail what linked data is, why it is important, how you can publish it to the Web, how you can take advantage of your linked data, and how you can exploit the linked data of others. For the archivist, linked data is about universally making accessible and repurposing sets of facts about you and your collections. As you publish these facts you will be able to maintain a more flexible Web presence as well as a Web presence that is richer, more complete, and better integrated with complementary collections.

And from the table of contents:

There are a number of versions:

Feedback desired and hoped for.

by LiAM: Linked Archival Metadata at April 12, 2014 12:41 PM

April 11, 2014

Hess, M Ryan

Google NSA skin using Stylish Browser Plugin

A few weeks back, I dropped Google search in favor of DuckDuckGo, an alternative search engine that does not log your searches. Today, I’m here to report on that experience and suggest two even better secure search tools: StartPage and Ixquick.

The problem with DuckDuckGo

As I outlined in my initial blog post, DuckDuckGo falls down probably as a consequence of its emphasis on privacy. Google results are based on an array of personal variables that tie specific result sets to your social graph: a complex web of data points collected on you through your Chrome browser, Android apps, browser cookies, location data, and possibly even the contents of your documents and emails stored on Google’s servers (that’s a guess, but well within the scope of reason). Going without all of that is a considerable handicap for DuckDuckGo.

Moreover, Google’s algorithm remains superior to everything else out there.

The benefits of using DuckDuckGo, of course, are that you are far more anonymous, especially if you are searching in private browser mode, accessing the Internet through a VPN or Tor, etc.

Again, given the explosive revelations about aggressive NSA data collection, government programs that hack such social graphs, and the potential leaking of that data to even worse parties, many people may decide that, on balance, they are better off dealing with poor search precision rather than setting themselves up for a cataclysmic breach of their data.

I’m one such person, but to be quite honest, I was constantly turning back to Google because DuckDuckGo just wouldn’t get me what I knew was out there.

Fortunately, I found something better: StartPage and Ixquick.

Google but without all the evil

StartPage is a US version of the Dutch-based search engine Ixquick.

There are two important things to understand about StartPage and Ixquick:

  1. Like DuckDuckGo, StartPage and Ixquick are totally private. They don’t collect any data on you, don’t share any data with third parties and don’t use cookies. They also use HTTPS (and no Heartbleed vulnerabilities) for all transactions.
  2. Both StartPage and Ixquick use proxy services to query other search engines. In the case of Ixquick, they query multiple search engines and then return the results with the highest average rank (a rough sketch of that rank-averaging idea follows below). StartPage only queries Google, but via the proxy service, making your search private and free of the data mining intrigue that plagues the major search engines.
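
Here is a minimal sketch of that rank-averaging idea, assuming each engine simply returns an ordered list of URLs; the engines and results are invented, and this is not Ixquick’s actual algorithm.

```python
# Minimal sketch of "highest average rank" metasearch aggregation; the engines
# and their results are invented, and a real metasearch engine would fetch
# them through proxied queries as described above.
from collections import defaultdict

def average_rank(result_lists):
    """result_lists: one ordered list of URLs per search engine."""
    all_urls = {url for results in result_lists for url in results}
    ranks = defaultdict(list)
    for results in result_lists:
        positions = {url: pos for pos, url in enumerate(results, start=1)}
        penalty = len(results) + 1          # rank given to URLs an engine missed
        for url in all_urls:
            ranks[url].append(positions.get(url, penalty))
    # Lower average rank is better; ties broken alphabetically for stability.
    return sorted(all_urls, key=lambda url: (sum(ranks[url]) / len(ranks[url]), url))

engine_a = ["https://example.org/privacy", "https://example.com/search-tips"]
engine_b = ["https://example.org/privacy", "https://example.net/other-result"]
print(average_rank([engine_a, engine_b]))   # example.org/privacy comes out on top
```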

Still some shortcomings remain

But, like DuckDuckGo, neither Ixquick nor StartPage is able to source your social graph, so they will never get results as closely tailored to you as Google’s. By design, they are not looking at your cookies or building their own database about you, so they won’t be able to guess your location or political views, and therefore will never skew results around those variables. On the other hand, your results will be more broadly relevant and serendipitous, saving you from the personal echo chamber that you may have found in Google.

Happily private

It’s been over a month since I switched from DuckDuckGo to StartPage and so far it’s been quite good. StartPage even has a passable image and video search. I almost never go to Google anymore. In fact, I’ve used a browser plugin called Stylish to re-skin Google’s search interface with the NSA logo just as a humorous reminder that every search is being collected by multiple parties.

For that matter, I’ve used the same plugin to re-skin StartPage, since while they get high marks for privacy and search results, their interface design needs major work…but I’m just picky that way.

So, with my current setup, I’ve got StartPage as my default search engine, set in my omnibar in Firefox. Works like a charm!


by mryanhess at April 11, 2014 11:43 PM

Ng, Cynthia

Guerrilla Usability: Choosing the Front Page with Mockups

As part of the redesign for the new site, the main thing that I really wanted to change in terms of the look was the front page. Based on my experience and discussions with staff about what our users look for when they arrive at the site, I had an idea of what information should be […]

by Cynthia at April 11, 2014 09:25 PM

Summers, Ed

Flickr Commons LAMs

After the last post, Seb got me wondering whether there were any differences between libraries, archives and museums when looking at upload and comment activity in Aaron’s snapshot of the Flickr Commons metadata.

First I had to get a list of Flickr Commons organizations and classify them as either a library, museum or archive. It wasn’t always easy to pick, but you can see the result here. I lumped galleries in with museums. I also lumped historical societies in with archives. Then I wrote a script that walked around in the Redis database I already had from loading Aaron’s data.

In doing this I noticed there were some Flickr Commons organizations that were missing from Aaron’s snapshot:

Update: Aaron quickly fixed this.

I didn’t do any research to see if these organizations had significant activity. Also, since there were close to a million files, I didn’t load the British Library activity yet. If there’s interest in adding them into the mix I’ll splurge for the larger EC2 instance.

Anyhow, below are the results. You can find the spreadsheet for these graphs in Google Docs.

This was all done rather quickly, so if you notice anything odd or that looks amiss please let me know. Initially it seemed a bit strange to me that libraries, archives and museums trended so similarly in each graph, even if the volume was different.

by ed at April 11, 2014 08:58 PM

Denton, William

Dying Every Day

I was in New York a couple of weeks ago, and I went to the Strand Bookstore, that multistory heaven of used and new books. I wandered around a while and got some things I’d been wanting. I wanted to read something set in New York so I looked first at Lawrence Block’s books and got The Burglar in the Closet, which opens with Bernie Rhodenbarr sitting in Gramercy Park, which I’d just passed by on the walk down, and then at Donald E. Westlake and got Get Real, the last of the Dortmunder series, and mostly set in the Lower East Side. Welcome to New York.

While I was standing near a table in the main aisle on the ground floor an older woman carrying some bags passed behind me and accidentally knocked some books to the floor. “Oh, I’m sorry, did I do that?” she said in a thick local accent. A young woman and I both leaned over to pick up the books. I was confused for a moment, because it looked like the cover had ripped, but it hadn’t, the rip was printed.

Cover of Dying Every Day

Then I saw what the book was: Dying Every Day: Seneca at the Court of Nero, by James Romm. A new book about Seneca, the Roman senator and Stoic philosopher! Fate had actually put this book in my hand. “It is destined to be,” I thought, and immediately bought it.

It’s a fine book, a gripping history and biography, covering in full something I only knew a tiny bit about. Seneca wrote a good amount of philosophy, including the Epistles, a series of letters full of Stoic advice to a younger friend, but the editions of his philosophy (or his tragedies) don’t go much into the details of Seneca’s life. They might mention he was a senator and advisor to Nero, and rich (as rich as a billionaire today), but then they get on to analyzing the subtleties of his thoughts on nature or equanimity.

Seneca led an incredible life: he was a senator in Rome, he was banished by the emperor Claudius on trumped-up charges of an affair with Caligula’s sister, but was later called back to Rome at the behest of Agrippina, Nero’s mother, to act as an advisor and tutor to the young man. Five years later, Agrippina poisoned Claudius, and Nero became emperor.

Seneca was very close to Nero and stayed as his advisor for years. It worked fairly well at first, but Nero was Nero. This is the main matter of the book: how Seneca, the wise Stoic, stayed close to Nero, who gradually went out of control: wild behaviour, crimes, killings, and eventually the murder of his mother Agrippina. An attempt to kill her on a boat failed, and then:

None of Seneca’s meditations on morality, Virtue, Reason, and the good life could have prepared him for this. Before him, as he entered Nero’s room, stood a frightened and enraged youth of twenty-three, his student and protégé for the past ten years. For the past five, he had allied with the princeps against his dangerous mother. Now the path he had first opened for Nero, by supporting his dalliance with Acte, had led to a botched murder and a political debacle of the first magnitude. It was too late for Seneca to detach himself. The path had to be followed to its end.

Every word Seneca wrote, every treatise he published, must be read against his presence in this room at this moment. He stood in silence for a long time, as though contemplating the choices before him. There were no good ones. When he finally spoke, it was to pass the buck to Burrus. Seneca asked whether Burrus could dispatch his Praetorians to take Agrippina’s life.

Seneca supported Nero’s matricide.

It’s impossible to match that, and other things Seneca did, with his Stoic writings, but it was all the same man. It’s a remarkable and paradoxical life.

Romm’s done a great job of writing this history. It’s full of detail (especially drawing on Tacitus), with lots of people and events to follow, but it’s all presented clearly and with a strong narrative. If you liked I, Claudius you’ll like this, and I see similar comments about House of Cards and Game of Thrones.

I especially recommend this to anyone interested in Stoicism. Thrasea Paetus is a minor figure in the book, another senator and also a Stoic, but one who acted as a Stoic should have, by opposing Nero. He was new to me. Seneca’s Stoic nephew Lucan, who wrote the epic poem The Civil War, also appears. He was friends with Nero but later took part in a conspiracy to kill the emperor. It failed, and Lucan had to commit suicide, as did Seneca, who wasn’t part of the plot.

There’s a nice chain of philosophers at the end of the book. After Nero’s death, Thrasea’s Stoic son-in-law Helvidius Priscus returns to Rome, as does the great Stoic Musonius Rufus and Demetrius the Cynic. The emperor Vespasian later banished philosophers from Rome (an action that seems very puzzling these days; I’m not sure what the modern equivalent would be), but for some reason let Musonius Rufus stay. One of his students was Epictetus, who had been a slave belonging to Epaphroditus, who in turn had been Nero’s assistant and had been with him when Nero, on the run, committed suicide—in fact, Epaphroditus helped his master by opening up the cut in his throat.

Later the Stoics were banished from Rome again, and Epictetus went to Greece and taught there. He never wrote anything himself, but one of his students, Arrian, wrote down what he said, which is why we now have the very powerful Discourses. And years later this was read by Marcus Aurelius, the Stoic emperor, a real philosopher king.

For a good introduction to the book, listen to this interview with James Romm on WNYC in late March. It’s just twenty minutes.

April 11, 2014 02:58 PM

Open Knowledge Foundation

Why secret contracts matter in aid transparency

The following guest post is by Nicole Valentinuzzi, from our Stop Secret Contracts campaign partner Publish What You Fund.

A new campaign to Stop Secret Contracts, supported by the Open Knowledge Foundation, Sunlight Foundation and many other international NGOs, aims to make sure that all public contracts are made available in order to stop corruption before it starts.

As transparency campaigners ourselves, Publish What You Fund is pleased to be a supporter of this new campaign. We felt it was important to lend our voice to the call for transparency as an approach that underpins all government activity.

We campaign for more and better information about aid, because we believe that by opening development flows, we can increase the effectiveness and accountability of aid. We also believe that governments have a duty to act transparently, as they are ultimately responsible to their citizens.

This includes publishing all public contracts that governments put out for tender, from school books to sanitation systems. These publicly tendered contracts are estimated at nearly US$9.5 trillion each year globally, yet many are agreed behind closed doors.

These secret contracts often lead to corruption, fraud and unaccountable outsourcing. If the basic facts about a contract aren’t made publicly available – for how much and to whom to deliver what – then it is not possible to make sure that corruption and abuses don’t happen.

But what do secret contracts have to do with aid transparency, which is what we campaign for at Publish What You Fund? Well, consider the recent finding by the campaign that each year Africa loses nearly a quarter of its GDP to corruption…then consider what that money could have been spent on instead – things like schools, hospitals and roads.

This is money that in many cases is intended to be spent on development. It should be published – through the International Aid Transparency Initiative (IATI), for example – so that citizens can follow the money and hold governments accountable for how it is spent.

But corruption isn’t just a problem in Africa – the Stop Secret Contracts campaign estimates that Europe loses €120 billion to corruption every year.

At Publish What You Fund, we tell the world’s biggest providers of development cooperation that they must publish their aid information to IATI because it is the only internationally-agreed, open data standard. Information published to IATI is available to a wide range of stakeholders for their own needs – whether people want to know about procurement, contracts, tenders or budgets. More than that, this is information that partner countries have asked for.

Governments use tax-payer money to award contracts to private companies in every sector, including development. We believe that any companies that receive public money must be subject to the same transparency requirements as governments when it comes to the goods and services they deliver.

Greater transparency and clearer understanding of the funds that are being disbursed by governments or corporates to deliver public services can only be helpful in building trust and supporting accountability to citizens. Whether it is open aid or open contracts, we need to get the information out of the hands of governments and into the hands of citizens.

Ultimately for us, the question remains how transparency will improve aid – and open contracts are another piece of the aid effectiveness puzzle. Giving citizens full and open access to public contracts is a crucial first step in increasing global transparency. Sign the petition now to call on world leaders to make this happen.

stopsecretcontracts logo

by Nicole Valentinuzzi at April 11, 2014 09:53 AM

April 10, 2014

ALA Equitable Access to Electronic Content

ALA to participate in IMLS hearing on libraries and broadband

FCC Chairman Tom Wheeler will speak at the IMLS hearing.

On Thursday, April 17, 2014, from 9:30–11:30 a.m., leaders from the American Library Association (ALA) will participate in “Libraries and Broadband: Urgency and Impact,” a public hearing hosted by the Institute for Museum and Library Services (IMLS) that will explore the need for high-speed broadband in American libraries. Larra Clark, director of the ALA Program on Networks, and Linda Lord, ALA E-rate Task Force Chair and Maine State Librarian, will present on two panels.

The hearing, which takes place during National Library Week (April 13–19, 2014), will explore innovative library practices, partnerships and strategies for serving our communities; share available research on library broadband connections and services; and discuss solutions for improving library connectivity to drive education, community and economic development. During her discussion, Clark will share findings from relevant library research managed by the ALA Office for Research & Statistics, including the IMLS-funded Digital Inclusion Survey and the Public Libraries Funding Technology Access Study, funded by the Bill & Melinda Gates Foundation. Lord will discuss ALA e-rate policy recommendations for boosting libraries toward gigabit broadband speeds.

Federal Communications Commission Chairman Thomas Wheeler will make opening remarks at the hearing, and expert panelists from across the library, technology and public policy spectrum will explore the issue of high-speed broadband in America’s libraries. IMLS Director Susan H. Hildreth will chair the hearing along with members of the National Museum and Library Services Board, including Christie Pearson Brandau of Iowa, Charles Benton of Illinois, Winston Tabb of Maryland, and Carla Hayden, also of Maryland.

Interested participants may register to attend the event in person at D.C.’s Martin Luther King Jr. Memorial Library. Alternatively, participants can tune into the event virtually, as IMLS will stream the hearing live on YouTube or Google+. Library staff may also participate by submitting written comments sharing their successes, challenges or other input related to library broadband access and use into the hearing record on or before April 24, 2014. Each comment must include the author’s name and organizational affiliation, if any, and be sent to comments@imls.gov. Guidance for submitting testimony is available here (pdf).

 

The post ALA to participate in IMLS hearing on libraries and broadband appeared first on District Dispatch.

by Jazzy Wright at April 10, 2014 02:32 PM

Hochstenbach, Patrick

History of book scanning at UGent library

Cartoons made for the Fiesole conference in Cambridge, UK.

by hochstenbach at April 10, 2014 01:50 PM

April 09, 2014

Denton, William

Don't give up your library card number

There was good news Monday from the Toronto Public Library (TPL): Toronto Public Library Introduces Online Music and Video. It seemed good at the start, anyway.

Toronto Public Library has introduced a new service that allows customers to download or stream a wide variety of music and video content. With a library card, customers can access music albums from a wide variety of genres, movies, educational television and documentaries. More information is available at tpl.ca/hoopla.

“We’re happy to now offer customers a great selection of music and videos that they can easily stream or download. E-content is our fastest area of growth, with customers borrowing more than 2 million ebooks, eaudio-books and emagazines in 2013. We expect we’ll see even more growth this year with the introduction of online music and video,” said Vickery Bowles, Director of Collections Management at Toronto Public Library.

With just a library card, customers can listen to a wide selection of music albums and watch a variety of video content. Content may be borrowed via a browser, smartphone or tablet and instantly streamed or downloaded with no waiting lists or late fees. Customers may borrow up to five items per month.

Here’s a CBC news report about it: Hoopla comes to Toronto: Toronto’s libraries are introducing a new Netflix-like service.

Seems like a very nice service. I’m happy to see my local library system working to get more streaming media to people in Toronto. I’m unhappy with the privacy implications of this, however. (As is Kate Johnson, a professor at the library school at the University of Western Ontario, who’s interviewed in that video clip: she raises the privacy question, but the reporter completely drops the issue). Here are my speculations based on a brief examination of what I see.

The TPL’s page about the new service explains how it works. It says you need an “account at hoopladigital.com (library card and email address required to create)” and “because Hoopla requires a separate account to be created, you may wish to review their privacy policy.” The privacy policy is, oddly, a PDF hosted at an unmemorable Cloudfront URL, and not the official privacy policy on Hoopla’s web site. They are different. For example, the web site version says, “As you use the hoopla service, we record how you use our application, including the materials you borrow. This information is reported to your library, content providers, and licensing agencies. When it is reported, it is always reported in aggregate with other patrons. It is never reported in a manner that associates your account with specific content or activities.” (Update at 19:25: that privacy policy link has been corrected to go to Hoopla’s site.)

None of that bothered me particularly, so I went to sign up for an account to try it out. This is the third step in the process:

Hoopla asks for my library card number

“Enter your library card number,” it says. “If your library gave you a PIN to use with your library card, please enter it.” I have a PIN, but I stopped here. (I don’t know what happens to people without a password; I’d guess they’re asked to set one up.)

So Hoopla wants my library card number. I posted a comment on Twitter about that and got a number of responses, including three from Michelle Leung (@mishiechau), who said, we review 3rd prty privcy polcies 2 portect cust + we suggest cust. do the same… and we haven’t given hoopla a dump of card #s in advance their systms chk w/ours@ acnt creation time 2 c if user valid..

Certainly Hoopla needs to be sure that anyone claiming to be a Toronto Public Library user actually is. But it looks like they’re doing it by asking the user for their library card number and password and then asking TPL if that is a valid account.

This is not right. There’s no need for any third party to know my library card number. OAuth would be a better way to do it: as it says, it’s “an open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications.” This is what they say to anyone offering services online: “If you’re storing protected data on your users' behalf, they shouldn’t be spreading their passwords around the web to get access to it. Use OAuth to give your users access to their data while protecting their account credentials.”
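
To make concrete why OAuth keeps the card number and PIN out of a vendor’s hands, here is a hedged sketch of an OAuth 2.0 authorization-code exchange. The endpoints, client credentials and scope are entirely hypothetical (neither TPL nor Hoopla publishes such an API); the point is that the patron signs in on the library’s own site, and the vendor only ever receives a short-lived token.

```python
# Hypothetical sketch of an OAuth 2.0 authorization-code exchange. Endpoints,
# client id/secret and scope are invented for illustration: the patron signs
# in at the library's own site, and the third party only ever sees a
# short-lived token, never the card number or PIN.
import requests

AUTHORIZE_URL = "https://library.example.org/oauth/authorize"   # hypothetical
TOKEN_URL = "https://library.example.org/oauth/token"           # hypothetical
CLIENT_ID = "streaming-vendor-example"
CLIENT_SECRET = "example-secret-issued-by-the-library"
REDIRECT_URI = "https://vendor.example.com/callback"

# 1. The vendor sends the patron to the library's own login page.
login_link = (f"{AUTHORIZE_URL}?response_type=code&client_id={CLIENT_ID}"
              f"&redirect_uri={REDIRECT_URI}&scope=verify-patron")

# 2. After the patron signs in at the library, the library redirects back to
#    the vendor with a one-time code, which is swapped for an access token.
def exchange_code(code: str) -> str:
    response = requests.post(TOKEN_URL, data={
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    })
    response.raise_for_status()
    return response.json()["access_token"]   # proves "this is a valid patron"
```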

Who’s behind Hoopla, anyway? It’s a service run by Midwest Tape, who on their Twitter account say “Midwest Tape is a full service DVD, Blu-ray, music CD, audiobook, and Playaway distributor, conducting business exclusively with public libraries since 1989.” They’re run out of Holland, Ohio, in the United States.

I suspect this means the Toronto Public Library is offering a service that requires users to give their library card number and password to an American company that will store it on American servers, which means the data is available to the US government through the PATRIOT Act. (Of course, we also need to assume that all library data can be accessed by our spy agencies, but we need to do what we can.)

I may be wrong. I’ll ask Hoopla and TPL and update this with what I find.

April 09, 2014 11:25 PM

Sefton, Peter

Notes on ownCloud robustness

I’m on my way to a meeting at Intersect about the next phase of the Cr8it data packaging and publishing project. Cr8it is an ownCloud plugin, and ownCloud is widely regarded as THE open source dropbox-like service, but it is not without its problems.

Dropbox has been a huge hit, a killer app with what I call powers to "Share, Sync & See". Syncing between devices, including mobile (where it’s not really syncing) is what made Dropbox so pervasive, giving us a distributed file-system with almost frictionless sharing via emailed requests, with easy signup for new users. The see part refers to the fact that you can look at your stuff via the web too. And there is a growing ecosystem of apps that can use Dropbox as an underlying distributed filesystem.

ownCloud is (amongst other things) an open source alternative to Dropbox.com’s file-sync service. A number of institutions and service providers in the academic world are now looking at it because it promises some of the killer-app qualities of dropbox in an open source form, meaning that, if all goes well it can be used to manage research data, on local or cloud infrastructure, at scale, with the ease of use and virality of dropbox. If all goes well.

There are a few reasons dropbox and other commercial services are not great for a university:

But ownCloud has some problems. The ownCloud forum is full of people saying, "tried this out for my company/workgroup/school. Showed promise but there’s too many bugs. Bye." At UWS eResearch we have been using it more or less successfully for several months, and have experienced some fairly major issues to do with case-sensitivity and other incompatibilities between the various file systems on Windows, OS X and Linux.

From my point of view as an eResearch manager, I’d like to see the emphasis at ownCloud be on getting the core share-sync-see stuff working, and then on getting a framework in place to support plugins in a robust way.

What I don’t want to see is more of this:

Last week, the first version of OwnCloud Documents was released as a part of OwnCloud 6. This incorporates a subset of editing features from the upstream WebODF project that is considered stable and well-tested enough for collaborative editing.

We tried this editor at eResearch UWS as a shared scratchpad in a strategy session and it was a complete disaster: our browsers kept losing contact with the document, and when we tried to copy and paste the text to safety, it turned out that copying text is not supported. In the end we had to rescue our content by copying HTML out of the browser and stripping out the tags.

In my opinion, ownCloud is not going to reach its potential while the focus remains on getting shiny new stuff out all the time; far from making ownCloud shine, every broken app like this editor tarnishes its reputation substantially. By all means release these things for people to play with, but the ownCloud team needs to have a very hard think about what they mean by "stable and well tested".

Along with others I’ve talked to in eResearch, I’d like to see work at owncloud.com focus on:

Creative Commons License
Notes on ownCloud robustness by Peter Sefton is licensed under a Creative Commons Attribution 4.0 International License

by ptsefton at April 09, 2014 10:58 PM

Hickey, Thom

FRBRizing WorldCat

Several of us here at OCLC have spent considerable time over the last decade trying to pull bibliographic records into work clusters.  Lately we've been making considerable progress along these lines and thought it would be worth sharing some of the results.

Probably our biggest accomplishment is that work we have done to refine the worksets is now visible in WorldCat.org (as well as in an experimental view of the works as RDF). This is a big step for us, involving a number of people in research, development and production.  In addition to making the new work clusters visible in WorldCat, this gives us in Research the opportunity to use the same work IDs in other services such as Classify.  We also expect to move the production work IDs into services such as WorldCat Identities.

One of the numbers we keep track of is the ratio of records to works. When we first started, the record to work ratio was something like 1.2:1, that is, every work cluster averaged 1.2 records.   The ratio is now close to 1.6:1, and for the first time the majority of records in WorldCat are now in work clusters with other records, primarily because of better matching.

Of records that have at least one match, we find the average workset size is 3.9 records. In terms of holdings, we have 10.6 holdings per workset and over 43 holdings per non-singleton workset (worksets with more than one record). Another way to look at this is that 84% of WorldCat's holdings are in non-singleton worksets, and over 1.5 billion of WorldCat's 2.1 billion holdings are in worksets of 3 or more records, so collecting them together has a big impact on many displays.
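
As a toy illustration of how those workset measures are computed (the clusters and holding counts below are invented, not WorldCat data):

```python
# Toy illustration of the workset measures quoted above; the clusters and
# per-record holding counts are invented, not WorldCat data.
worksets = {                      # work id -> holdings count for each record
    "work-1": [40, 3, 2],         # a 3-record workset
    "work-2": [5, 1],             # a 2-record (non-singleton) workset
    "work-3": [4],                # a singleton
}

records = sum(len(counts) for counts in worksets.values())                 # 6
holdings = sum(sum(counts) for counts in worksets.values())                # 55
non_singletons = [counts for counts in worksets.values() if len(counts) > 1]

print("records per work:", records / len(worksets))                       # 2.0
print("holdings per workset:", holdings / len(worksets))                  # ~18.3
print("share of holdings in non-singleton worksets:",
      sum(sum(counts) for counts in non_singletons) / holdings)           # ~0.93
```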

As the worksets become larger and more reliable we are finding many uses for them, not the least in improving the work-level clustering itself.  We find the clustering helps find variations in names, which in turn helps find title variations. We are also learning how to connect our manifestation and expression level clustering with our work-level algorithms, improving both.  The Multilingual WorldCat work reported here is also an exciting development growing out of this.

There is still more to do of course.  One of our latest approaches is to build on the Multilingual WorldCat work by creating new authority records in the background that can be used to guide the automated creation of authority records from WorldCat, which in turn help generate better clusters.  We are applying this technique first to problem works such as Twain's Adventures of Huckleberry Finn and his Adventures of Tom Sawyer, which are published together so often and cataloged in so many ways that it is difficult to separate the two.  These generated title authority records are starting to show up in VIAF as 'xR' records.

So, we've been working on this off and on for a decade, but WorldCat and our computational capabilities have changed dramatically and it still seems like a fresh problem to us as we pull in VIAF to help and use matching techniques that just would not have been feasible a decade ago.

While many of us, both in and out of OCLC Research, have worked on this over the years, no one has done more than Jenny Toves who both designs and implements the matching code.

--Th

by Thom at April 09, 2014 06:46 PM