Please follow and support us on our recently-opened Facebook page: https://www.facebook.com/PeerLibrary.
Semifinalists for the Knight News Challenge will be chosen tomorrow and the refinement period will begin. This is your last chance to show your support for our submission before the next stage of the competition. The Knight Foundation is asking “How might we leverage libraries as a platform to build more knowledgeable communities?” We believe that PeerLibrary closely parallels the theme of the challenge and provides an answer to the foundation’s question. By facilitating a community of independent learners and promoting collaborative reading and discussion of academic resources, PeerLibrary is modernizing the concept of a library in order to educate and enrich the global community. Please help us improve our proposal, give us feedback, and wish PeerLibrary good luck in the next stage of the Knight News Challenge.
LYRASIS has published three open source software case studies on FOSS4LIB.org as part of its continuation of support and services for libraries and other cultural heritage organizations interested in learning about, evaluating, adopting, and using open source software systems.
With support from a grant from The Andrew W. Mellon Foundation, LYRASIS asked academic and public libraries to share their experiences with open source systems, such as content repositories, integrated library systems, and websites. Of the submitted proposals, LYRASIS selected three concepts for development into case studies from Crawford County Federated Library System (Koha), Fenway Libraries Online (Coral), and the University of Chicago Library (Kuali OLE). The three selected organizations then prepared narrative descriptions of their experience and learning, to provide models, advice, and ideas for others.
Each case study details how the organization handled the evaluation, selection, adoption, conversion, and implementation of the open source system. They also include the rationale for going with an open source solution. The case studies all provide valuable information and insights, including:
- Actual experiences, both good and bad
- Steps, decision points, and processes used in evaluation, selection, and implementation
- Factors that led to selection of an open source system
- Organization-wide involvement of and impact to staffs and patrons
- Useful tools created or applied to enhance the open source system and/or expand its functionality, usefulness, or benefit
- Plans for ongoing support and future enhancement
- Key takeaways from the process, including what worked well, what didn’t work as planned, and what the organization might do differently in the future
The goal of freely offering these case studies to the public is to help cultural heritage organizations use firsthand experience with open source to inform their evaluation and decision-making process, the same objective of FOSS4LIB.org. While open source software is typically available at no cost, these case studies provide tangible examples of the associated costs, time, energy, commitment and resources required to effectively leverage open source software and participate in the community.
“These three organizations expertly outline the in-depth process of selecting and implementing open source software with insight, humor, candor and clarity. LYRASIS is honored to work with these organizations to share this invaluable information with the larger community,” stated Kate Nevins, Executive Director of LYRASIS. “The case studies exemplify the importance of understanding the options and experiences necessary to fully utilize open source software solutions.”Link to this post!
Ahhhh! It’s done!
This project took over 7 years and went through a few big iterations. I was just finishing library school when it started and learned a lot from the other advisory board members. I appreciate how the much more experienced folks on the advisory board helped bring me up to speed on issues I was less familiar with, and how they treated me, even though I was just a student.
It was published this spring but my copy just arrived in the mail. Here’s the page about the book on the Library Juice Press site, and here’s where you can order a copy on Amazon.
At the Gender and Sexuality in Information Studies Colloquium the program session I was the most excited about was Porn in the library. There were 3 presentations in this panel exploring this theme.
First, Joan Beaudoin and Elaine Ménard presented The P Project: Scope Notes and Literary Warrant Required! Their study looked at 22 websites that are aggregators of free porn clips. Most of these sites were in English, but a few were in French. Ménard acknowledged that it is risky and sometimes uncomfortable to study porn in the academy. They looked at the terminology used to describe porn videos, specifically the categories available to access porn videos. They described their coding manual which outlined various metadata facets (activity, age, cinematography, company/producers, age, ethnicity, gender, genre, illustration/cartoon, individual/stars, instruction, number of individuals, objects, physical characteristics, role, setting, sexual orientation). I learned that xhamster has scope notes for their various categories (mouseover the lightbulb icon to see).
While I appreciate that Beaudoin and Ménard are taking a risk to look at porn, I think they made the mistake of using very clinical language to legitimize and sanitize their work. I’m curious why they are so interested in porn, but realize that it might be too risky for them to situate themselves in their research.
It didn’t seem like they understood the difference between production company websites and free aggregator sites. Production company sites have very robust and high quality metadata and excellent information architecture. Free aggregator sites that have variable quality metadata and likely have a business model that is based on ads or referring users to the main production company websites. Porn is, after all, a content business, and most porn companies are invested in making their content findable, and making it easy for the user to find more content with the same performers, same genre, or by the same director.
Beaudoin and Ménard expressed disappointment that porn companies didn’t want to participate in their study. As these two researchers don’t seem to understand the porn industry or have relationships with individuals I don’t think it’s surprising at all. For them to successfully build on this line of inquiry I think they need to have some skin in the game and clearly articulate what they offer their research subjects in exchange for building their own academic capital.
It was awesome to have a quick Twitter conversation with Jiz Lee and Chris Lowrance, the web manager for feminist porn company Pink and White productions, about how sometimes the terms a consumer might be looking for is prioritized over the performers’ own gender identity.
Jiz Lee is genderqueer porn performer and uses the pronouns they/them and is sometimes misgendreed by mainstream porn and by feminist porn. I am a huge fan of their work.
I think this is the same issue that Amber Billy, Emily Drabinski and K.R. Roberto raise in their paper What’s gender got to do with it? A critique of RDA rule 9.7. They argue that it is regressive for a cataloguer to assign a binary gender value to an author. In both these cases someone (porn company or consumer, or cataloguer) is assigning gender to someone else (porn performer or content creator). This process can be disrespectful, offensive, inaccurate and highlights a power dynamic where the consumer’s (porn viewer or researcher/student/librarian) desires/politics/needs/worldview is put above someone’s own identity.
Next, Lisa Sloniowski and Bobby Noble. presented Fisting the Library: Feminist Porn and Academic Libraries (which is the best paper title ever). I’ve been really excited their SSHRC funded porn archive research. This research project has become more of a conceptional project, rather than building a brick and mortar porn archive. Bobby talked about the challenging process of getting his porn studies class going at York University. Lisa talked they initially hoped to start a porn collection as part of York University Library’s main collection, not as a reading room or a marginal collection. Lisa spoke about the challenges of drafting a collection development policy and some of the labour issues, presumably about staff who were uncomfortable with porn having to order, catalogue, process and circulate porn. They also talked about the Feminist Porn Awards and second feminist porn conference that took place before the Feminist Porn Awards last year.
Finally, Emily Lawrence and Richard Fry presented Pornography, Bomb Building and Good Intentions: What would it take for an internet filter to work? They presented a philosophical argument against internet filters. They argued that for a filter to not overblock and underblock it would need to be mind reading and fortune telling. A filter would need to be able to read an individual’s mind and note factors like the person viewing, their values, their mood, etc and be fortune telling by knowing exactly what information that the user was seeking before they looked at it. I’ve been thinking about internet filtering a lot lately, because of Vancouver Public Library’s recent policy change that forbids “sexually explicit images”. I was hoping to get a new or deeper understanding on filtering but was disappointed.
This colloquium was really exciting for me. The conversations that people on the porn in the library panel were having are discussions I haven’t heard elsewhere in librarianship. I look forward to talking about porn in the library more.
From Jon Ippolito, Professor of New Media,Director, Digital Curation graduate program, The University of Maine
Orono, ME Digital conservator Dragan Espenschied and the crew at Rhizome, one of the leading platforms for new media art, have created a tool for archiving social media such as Instagram and Facebook.
This week is Open Access Week all around the world, and from Open Knowledge’s side we are following up on last year’s tradition by putting together a blog post series to highlight great Open Access projects and activities in communities around the world. Every day this week will feature a new writer and activity.
Open Access Week, a global event now entering its eighth year, is an opportunity for the academic and research community to continue to learn about the potential benefits of Open Access, to share what they’ve learned, and to help inspire wider participation in helping to make Open Access a new norm in scholarship and research.
This past year has seen lots in great progress and with the Open Knowledge blog we want to help amplify this amazing work done in communities around the world:
- Tuesday, Jonathan Gray from Open Knowledge: “Open Knowledge work on Open Access in humanities and social sciences”
- Wednesday, David Carroll from Open Access Button: “Launching the New Open Access Button”
- Thursday, Alma Swan from SPARC Europe: “Open Access and the humanities: on our travels round the UK”
- Friday, Jenny Molloy from Open Science working group: “OK activities in open access to science”
- Saturday, Kshitiz Khanal from Open Knowledge Nepal: “Combining Open Science, Open Access, and Collaborative Research”
- Sunday, Denis Parfenov from Open Knowledge Ireland: “Open Access: Case of Ireland”
We’re hoping that this series can inspire even more work around Open Access in the year to come and that our community will use this week to get involved both locally and globally. A good first step is to sign up at http://www.openaccessweek.org for access to a plethora of support resources, and to connect with the worldwide Open Access Week community. Another way to connect is to join the Open Access working group.
Open Access Week is an invaluable chance to connect the global momentum toward open sharing with the advancement of policy changes on the local level. Universities, colleges, research institutes, funding agencies, libraries, and think tanks use Open Access Week as a platform to host faculty votes on campus open-access policies, to issue reports on the societal and economic benefits of Open Access, to commit new funds in support of open-access publication, and more. Let’s add to their brilliant work this week!
On Friday, the U.S. Court of Appeals for the 11th Circuit handed down an important decision in Cambridge University Press et al. v. Carl V. Patton et al. concerning the permissible “fair use” of copyrighted works in electronic reserves for academic courses. Although publisher’s sought to bar the uncompensated excerpting of copyrighted material for “e-reserves,” the court rejected all such arguments and provided new guidance in the Eleventh Circuit for how “fair use” determinations by educators and librarians should best be made. Remanding to the lower court for further proceedings, the court ruled that fair use decisions should be based on a flexible, case-by-case analysis of the four factors of fair use rather than rigid “checklists” or “percentage-based” formulae.
Courtney Young, president of the American Library Association (ALA), responded to the ruling by issuing a statement.
The appellate court’s decision emphasizes what ALA and other library associations have always supported—thoughtful analysis of fair use and a rejection of highly restrictive fair use guidelines promoted by many publishers. Critically, this decision confirms the importance of flexible limitations on publisher’s rights, such as fair use. Additionally, the appeals court’s decision offers important guidance for reevaluating the lower courts’ ruling. The court agreed that the non-profit educational nature of the e-reserves service is inherently fair, and that that teachers’ and students’ needs should be the real measure of any limits on fair use, not any rigid mathematical model. Importantly, the court also acknowledged that educators’ use of copyrighted material would be unlikely to harm publishers financially when schools aren’t offered the chance to license excerpts of copyrighted work.
Moving forward, educational institutions can continue to operate their e-reserve services because the appeals court rejected the publishers’ efforts to undermine those e-reserve services. Nonetheless, institutions inside and outside the appeals court’s jurisdiction—which includes Georgia, Florida and Alabama—may wish to evaluate and ultimately fine tune their services to align with the appeals court’s guidance. In addition, institutions that employ checklists should ensure that the checklists are not applied mechanically.
In 2008, publishers Cambridge, Oxford University Press, and SAGE Publishers sued Georgia State University for copyright infringement. The publishers argued that the university’s use of copyright-protected materials in course e-reserves without a license was a violation of the copyright law. Previously, in May 2012, Judge Orinda Evans of the U.S. District Court ruled in favor of the university in a lengthy 350-page decision that reviewed the 99 alleged infringements, finding all but five infringements to be fair uses.
The post ALA encouraged by “fair use” decision in Georgia State case appeared first on District Dispatch.
This is the third of three posts about the workshop.
Following the presentations, attendees divided into breakout groups. There were a variety of suggested topics, but the discussions took on lives of their own. The breakout discussions surfaced many themes that may merit further attention:
Support for researchers
It may be the institution’s responsibility to provide infrastructure to support compliance with mandates, but it is certainly the library’s role to assist researchers in depositing their content somewhere and to ensure that deposits are discoverable. We should establish trust by offering our expertise and familiarity with reliable external repositories, deposit, compliance with mandates, selection, description … and support the needs of researchers during and after their projects. Access to research outcomes involves both helping researchers find and access information they need as inputs to their work and helping them to ensure that their outputs are discovered and accessible by others. We should also find ways to ensure portability of research outputs throughout a researcher’s career. We need to partner with faculty and help them take the long view. We cannot do this by making things harder for the researcher, but by making it seamless, building on the ways they prefer to work.
Adapting to the challenge
We need to retool and reskill to add new expertise: ensuring that processes are retained along with data, promoting licenses that allow reusability, thinking about what repositories can deliver back to research, and adding developers to our teams. When we extend beyond existing library standards, we need to look elsewhere to see what we can adopt rather than create. We need to leverage and retain the trust in libraries, but need resources to do the work. While business models don’t exist yet, we need to find ways to rebalance resources and contain costs. One of the ways we might do that is to build library expertise and funding into the grant proposal process, becoming an integral part of the process from inception to dissemination of results.
Academic libraries should first collect, preserve, and provide access to materials created by those at their institution. How do libraries put a value on assets (to the institution, to researchers, to the wider public)? Not just outputs but also the evidence-base and surrounding commentary. What should proactively be captured from active research projects? How many versions should be retained? What role should user-driven demand play? What is needed to ensure we have evidence for verification and retain results of failed experiments? What need not be saved (locally or at all)? When is sampling called for? What about deselection? While we can involve researchers in identifying resources for preservation, in some cases we may have to be proactive and hunt them down and harvest them ourselves.
Competitiveness (regarding tenure, reputation, IP, and scooping) can inhibit sharing. Timing of data sharing can be important, sometimes requiring an embargo. Privacy issues regarding research subjects must be considered. Researchers may be sensitive about sharing “personal” scientific notes – or sharing data before their research is published. Different disciplines have different traditions about sharing.
Collaboration with others in the university
Policy and financial drivers (mandates, ROI expectations, reputation and assessment) will motivate a variety of institutional stakeholders in various ways. How can expertise be optimized and duplication be minimized? Libraries can’t change faculty behaviors, so need to join together with those with more influence. When Deans see that libraries can address parts of the challenge, they will welcome involvement. When multiple units are employing different systems and services, IT departments and libraries may become key players. There are limits to institutional capacity, so cooperating with other institutions is also necessary.
Collaboration with other stakeholders in a distributed archive across publishers, subjects, nations
We need to understand various solutions for fixity, versioning, and citation. We need to accommodate persistent object identifiers and multiple researcher name identifiers. We need to explore ways to link the various research materials related to the same project. We need to coordinate metadata in objects (e.g., an instrument’s self-generated metadata) with metadata about the objects and metadata about the context). Embedded links need to be maintained. Campus systems may need to interoperate with external systems (such as SHARE). We should help find efficient metrics for assessing researcher impact and enhancing institutional reputation. We should consider collaborating on processes to capture content from social media. In doing these things we should be contributing to developing standards, best practices, and tools.
Attendees of the workshop feel that stewardship efforts will evolve from informal to more formal. Mandates, cost-savings, and scale will motivate this evolution. It is a common good to have demonstrable historical record to document what is known, to protect against fraud, and for future research to build upon. Failure to act is a risk for libraries, for research, and for the scholarly record.
Future Evolving Scholarly Record workshops will expand the discussion and contribute to identifying topics for further investigation. The next scheduled workshops will be in Washington DC on December 10, 2014 and in San Francisco on June 4, 2015. Watch for more details and for announcements of other workshops on the OCLC Research events page.About Ricky Erway
Ricky Erway, Senior Program Officer at OCLC Research, works with staff from the OCLC Research Library Partnership on projects ranging from managing born digital archives to research data curation.Mail | Web | Twitter | LinkedIn | More Posts (36)
Acharya et al:
attempt to answer two questions. First, what fraction of the top-cited articles are published in non-elite journals and how has this changed over time. Second, what fraction of the total citations are to non-elite journals and how has this changed over time. For the first question they observe that:
The number of top-1000 papers published in non-elite journals for the representative subject category went from 149 in 1995 to 245 in 2013, a growth of 64%. Looking at broad research areas, 4 out of 9 areas saw at least one-third of the top-cited articles published in non-elite journals in 2013. For 6 out of 9 areas, the fraction of top-cited papers published in non-elite journals for the representative subject category grew by 45% or more. and for the second that:
Considering citations to all articles, the percentage of citations to articles in non-elite journals went from 27% in 1995 to 47% in 2013. Six out of nine broad areas had at least 50% of citations going to articles published in non-elite journals in 2013.They summarize their method as:
We studied citations to articles published in 1995-2013. We computed the 10 most-cited journals and the 1000 most-cited articles each year for all 261 subject categories in Scholar Metrics. We marked the 10 most-cited journals in a category as the elite journals for the category and the rest as non-elite. In a post to liblicense, Ann Okerson asks:
- Any thoughts about the validity of the findings? Google has access to high-quality data, so it is unlikely that they are significantly mis-characterizing journals or papers.They examine the questions separately in each of their 261 subject categories, and re-evaluate the top-ranked papers and journals each year.
- Do they take into account the overall growth of article publishing in the time frame examined? Their method excludes all but the most-cited 1000 papers in each year, so they consider a decreasing fraction of the total output each year:
- The first question asks what fraction of the top-ranked papers appear in top-ranked journals, so the total volume of papers is irrelevant.
- The second question asks what fraction of all citations (from all journals, not just the top 1000) are to top-ranked journals. Increasing the number of articles published doesn't affect the proportion of them in a given year that cite top-ranked journals.
- What's really going on here? Across all fields, the top-ranked 10 journals in their respective fields contain a gradually but significantly decreasing fraction of the papers subsequently cited. Across all fields, a gradually but significantly decreasing fraction of citations are to the top-ranked 10 journals in their respective fields. This means that authors of cite-worthy papers are decreasingly likely to publish in, read from, and cite papers in their field's top-ranked journals. In other words, whatever value that top-ranked journals add to the papers they publish is decreasingly significant to authors.
- Life Sciences & Earth Sciences (general): Nature, Science, PNAS
- Health & Medical Sciences (general): NEJM, Lancet, PNAS
- Cell Biology: Cell
- Molecular Biology: Cell
- Oncology: Journal of Clinical Oncology
- Chemical & Material Sciences (general): Chemical Reviews, Journal of the American Chemical Society
- Physics & Mathematics (general): Physical Review Letters
Lets look at this another way. No matter how well their work is regarded by others in their field, researchers in the vast majority of fields have no prospect of ever publishing in a global top-10 journal because those journals effectively don't publish papers in those fields. And if they ever did, the paper is likely to be junk, as illustrated by my favorite example, because the global top-10 journal's stable of reviewers don't work in that field. The global top-10 journals are important to librarians, because they look at scholarly communication from the top down, to publishers, because they are important to librarians so they anchor the "big deals", and to researchers in a small number of important fields. To every one else, they may be interesting but they are not important.
Acharya et al conclude:
First, the fraction of top-cited articles published in non-elite journals increased steadily over 1995-2013. While the elite journals still publish a substantial fraction of high-impact articles, many more authors of well-regarded papers in diverse research fields are choosing other venues.
Second, now that finding and reading relevant articles in non-elite journals is about as easy as finding and reading articles in elite journals, researchers are increasingly building on and citing work published everywhere. Both seem right to me, which reinforces the message that, even on a per-field basis, highly rated journals are not adding as much value as they did in the past (which was much less than commonly thought). Authors of other papers are the ultimate judge of the value of a paper (they are increasingly awarding citations to papers published elsewhere), and of the value of a journal (they are increasingly publishing work that other authors value elsewhere).
On October 27, 2014, the American Library Association (ALA) will host “$2.2 Billion Reasons to Pay Attention to WIOA,” an interactive webinar that will explore ways that public and community college libraries can receive funding for employment skills training and job search assistance from the recently-passed Workforce Innovation and Opportunity Act. The no-cost webinar, which includes speakers from the U.S. Departments of Education and Labor, takes place Oct 27, 2014, from 2:00–3:00 p.m. EDT.
The Workforce Innovation and Opportunity Act allows public and community college libraries to be considered additional One-Stop partners and authorizes adult education and literacy activities provided by public and community college libraries as an allowable statewide employment and training activity. Additionally, the law defines digital literacy skills as a workforce preparation activity.
- Moderator: Sari Feldman, president-elect, American Library Association and executive director, Cuyahoga County Public Library
- Susan Hildreth, director, Institute of Museum and Library Services
- Heidi Silver-Pacuilla, team leader, Applied Innovation and Improvement, Office of Career, Technical, and Adult Education, U.S. Department of Education
- Kimberly Vitelli, chief of Division of National Programs, Employment and Training Administration, U.S. Department of Labor
The post ALA, Depts. of Ed. and Labor to Host Webinar on Workforce Funding appeared first on District Dispatch.
It was a hot, dusty day in Moab, Utah. I drove into town from my beautiful campsite overlooking the La Sal Mountains, where I’d been cycling and exploring the beautiful country. I was taking a few days off from work, and even though I was relaxing, I had a phone call I didn’t want to reschedule. So back to town I went, straight to—naturally—the public library. I had fond memories of the library from a previous visit a few years back: a beautiful building with reliable Wi-Fi. Aside from not being allowed to bring coffee inside, it would be a great place to check email and take a call on the bench outside.
As I entered the library, I decided that transitioning from adventure mode to work mode required, at least, washing some of Moab’s ample sand and dust off of my hands. I washed my hands and what happened next I did automatically, without consideration or contemplation: I cupped my hands and splashed some water on my face. Refreshing! I then wet a paper towel to wipe the sunscreen off of the back of my neck.
It was at about this point that I realized just what was going on; I was the guy bathing in the library restroom!
Half shocked, half amused by my actions, I quickly made sure I didn’t drip anywhere and sully the otherwise very clean and pleasant basin.Contextually appropriate
I can’t say I’m proud of my mindless act, but it did get me thinking about the very sensitive issue of appropriate behavior in libraries.
I’m not going on a campaign encouraging libraries to offer showers to their patrons, but not because I think the idea is ridiculous. I actually think it is a legitimate potential service offering. That such a service would likely be useful for only a very small segment of library users is one reason why it isn’t worth pursuing.
But as a theoretical concept, I find nothing inherently wrong or illogical with the idea of a library offering showers. It is simply an idea that hasn’t found many appropriate contexts.
Even so, with the smallest amount of imagination I can think of contexts in which this could work. What about a multiuse facility that houses a restaurant, a gym, a coworking space, and a library? Seems like an amazing place. And don’t forget that the new central library in Helsinki, Finland—to be completed in 2017—will feature sauna facilities. These will be contextually and culturally appropriate.Challenging assumptions
This is about more than showers and saunas. It is about our long-held assumptions and how we react to new ideas. When we’re closed off to concepts without examining them fully, or without exploring the frameworks in which they exist, we’re unlikely truly to innovate or create any radically meaningful experiences. When evaluating new initiatives, we should consider the library less and our communities more. Without this sort of thinking, we’d have never realized libraries with popular materials, web access, and instructional classes, let alone cafés, gaming nights, and public health nurses.
Learning about our contexts—our communities—takes more than facilitating surveys and leading focus groups. After all, those techniques put less emphasis on people and more on their opinions. Even though extra work is required, the techniques aren’t mysterious. There are well-established methods we can use to learn about the individuals in our areas and then design contextually appropriate programs and services.
To the Grand County Public Library in Moab, my apologies for the slight transgression. I did leave the restroom in the same shape as I found it. To everyone else, if you’re in Moab, visit the library. But if you need a place to clean up in that city, try the aquatic center. It has nice pools and clean showers.
As archives increasingly process born-digital collections one thing is clear; processing digital collections often involves working with tons of email. There is already some great work exploring how to deal with email, but given that it is such a significant problem area it is great to see work focused on developing tools to make sense of this material. Of particular concern is how email is simultaneously so ubiquitous and so messy. I’ve heard cases of repositories needing to deal with hundreds of millions of email objects in a single collection. Beyond that, in actual practice people use email for just about everything, so email records are often a messy mixture of public, private, personal and professional material.
To this end, the ePADD project at Stanford, with the help of an NHPRC grant, is working to produce an open-source tool that will allow repositories and individuals to interact with email archives before and after they have been transferred to a repository. I was lucky enough to sit in on a presentation from the projects technical advisor, Dr. Sudheendra Hangal, on the status of the project and am thrilled to have this opportunity to discuss work on it with him and his colleagues, Glynn Edwards and Peter Chan as part of our Insights Interview series. Glynn is the Head of Technical Services in the Manuscripts division in Stanford Libraries and the Manager of the Born-Digital Program and Peter is a digital archivist at Stanford Libraries.
Trevor: Could you briefly describe the scope and objective of the ePADD project? Specifically, what problem are you working to solve and how are you going about solving it?
Glynn: The ePADD project grew out of earlier experimentation during the Mellon-funded AIMS grant. One of the collections contained 50,000 unique email messages. Peter, who was our new digital archivist, experimented with Gephi (exporting header information to create social network graphs) and Forensic Toolkit. Neither option managed to provide a suitable tool for processing or to facilitate discovery. FTK did not allow us to flag individual messages that contained personal identifying information for restriction and neither provided a view of the entities within the corpus nor expose them to remote researchers.
Most of the previous email projects and tools we researched were focused specifically on acquisition and preservation. They did not address other core functions of stewardship – appraisal, processing and access (discovery & delivery).
During this experimental period, Peter discovered MUSE (Memories USing Email), a research project in the Mobisocial group of Stanford’s Computer Science Dept. Using NLP and a built-in lexicon, it allowed us to extract entities, view by correspondent or a graphical visualization of sentiments based on the lexical terms. This was a step in the right direction and we began a multiyear collaboration with Sudheendra Hangal, MUSE’s creator.
The objectives are to create an open-source Java-based software program built on MUSE that supports different activities aligned with core functions of the digital curation lifecycle: appraisal, accessioning, processing, discovery and delivery. In effect it would allow a user, anyone from the creator, donor, curator, archivist or researcher, to use the collection both before and after transfer to a repository.
Stanford, along with our collaborating partners (NYU, Smithsonian, Columbia, and Bodleian @ Oxford), created and prioritized a set of specifications for the initial development cycle, funded by NHPRC. We also developed and published a beta site to demonstrate our concept for exporting entities and correspondents to facilitate discovery. We have been steadily receiving more email collections. Our most recent acquisition contains over 600,000 unique messages. The grant states that the program will handle at least 250,000 messages – so this latest archive will be more than adequate as a stress test!
Trevor: Could you tell us a bit about the design of the workflow in the tool? How are you envisioning donors and processing archivists working with it?
Peter: The workflow is designed as follows:
Creators of email archives use the appraisal module to scan their archives and identify messages they don’t want to transfer. They can also flag messages as “restricted” and enter annotations to specify the terms of the restriction. The files exported from ePADD will NOT contain the messages flagged as “Do Not Transfer.”
After receiving the files from donors, processing archivists will then identify messages to be restricted according to the policy of their institutions and communication with the donor. Depending on the resources available, processing archivists may want to confirm the email addresses of correspondents suggested by ePADD. Archivists may also want to reconcile the correspondents/person entities extracted with authority records suggested by ePADD. After they finish processing, archivists will output two versions of the archive from ePADD. Neither set contains any restricted messages.
The first set is designed for the discovery module, with all messages redacted barring identified entities (people, places, organizations) and email address with a masked domain name. This version will be stored in the web server to be used for the discovery module. Public researchers with internet access can browse and search the archives using the discovery module. They will only see a redacted version of the original messages containing extracted entities, but this is still useful to them to get a sense of the entities present in the archive, without being able to see what is said about them.
The second version, designed for the delivery module, will be stored in the reading room computer designated for email delivery. Researchers using the designated computer in the reading room will be able to browse and search the archives. The messages, when displayed, will be the full messages without redaction. Researchers can define their own lexicon to analyze the collection. They may request copies by flagging the messages they need. Public service archivists/librarians can then give the researchers the files according to the policy of their institutions.
Glynn: I would only add that the appraisal module is meant to make it possible for a creator/donor to review their email archives, to create their own lexicon if desired, and prepare the files for export and final transfer to a repository. During this process they may take actions on specific messages (individually) or sets of messages (bulk by topic or correspondent) as restricted or elected not to transfer. We felt this functionality was important to offer a donor for two reasons. First, in the hope that they weed out irrelevant messages or spam! Second, there may be individuals they correspond with that do not want their messages archived – this is a case for one of our collections.
Trevor: How do you imagine archivists using this tool? Further, how do you see it fitting in with the ecosystem of other open-source tools and platforms that act as digital repository platforms and other tools for processing and working born digital archival materials like BitCurator and Archivematica?
Peter: I consider “processing” of born-digital materials to include both identifying restricted materials AND arranging/describing the intellectual content of the materials. My understanding of Bitcurator and Archivematica is that neither offers tools to arrange/describe the intellectual content of the materials. ePADD offers four tools to arrange/describe the intellectual content of email archives. First, it uses a natural language processing library to extract personal names, organizational names and locations in email archives to give researchers a sense of people, organizations and locations in the archives. Second, it gathers all image files in one place for researchers to browse and if necessary go to the messages containing the images.
Third, it offers user-definable lexicons which contain categories of words the system will use to search against the emails so that researchers/archivists can browse emails according to the lexicons they defined. Finally, ePADD reconciles the correspondents and personal names mentioned in messages with the FAST (Faceted Application of Subject Terminology) dataset which is derived from the Library of Congress Subject Headings. Archivists can then give their confirmation to the suggested matches by ePADD. If none of the suggestions are correct, they can enter their own links to the authority records.
I can see people using ePADD to appraise, process, discover and deliver emails and sending the files generated for delivery and discovery to systems using Archivematica for long term preservation.
Trevor: In Sudheendra’s presentation I saw some really interesting things happening with approaches to identifying different distinct email addresses that are associated with the same individual over time in a collection, and some interesting approaches to associating the names of individuals with canonical data for names of people. I think he also illustrated ways that the content of the messages could be identified and associated with subjects. Could you tell us a bit about how this works and how you are thinking about the possibilities and impact on things like archival description that these approaches could have?
Glynn: With email archives – or any born-digital materials – archivists need automated methods to get through large amounts of data. ePADD incorporates several methods of automation to assist with processing of email. Here are three:
1. Correspondents & name resolution
During ingest, ePADD gathers all correspondents and recipients from email headers and performs basic name resolution tasks. When your cursor rolls over a name, different versions that were aggregated appear in a pop-up window. The archivist can go into the back end and override or edit the addresses that are associated with a specific name.
I would direct you to the wonderful documentation on processing and using email archives on the MUSE website. Regarding the resolution of correspondents, Sudheendra states (PDF) in the “MUSE: Reviving Memories Using Email Archives” report that “MUSE performs entity resolution by unifying names and email addresses in email headers when either the name or email address (as specified in the RFC-822 email header) is equivalent. This is essential since email addresses and even name spellings for a person are likely to change in a long-term archive.”
ePADD performs this process during ingest and allows the donor (appraisal module) or archivist (processing module) to correct or edit the email aliases that are automatically bundled together by ePADD at ingestion.
2. Entity extraction & disambiguation
ePADD extracts entities from the email corpus using Apache’s openNLP library and checks them against OCLC’s FAST database to identify authorities. In the case of multiple hits on a name, it shows all the matching records and can read data from DBpedia to automatically rank the likelihood of each record being the correct one. The archivist finally confirms which authority record is correct.
Algorithms are also used to help the archivist or researcher understand context while reading a message. For example, suppose a conversation mentions Bob, which could refer to any number of Bobs present in the archive. ePADD analyzes the occurrences of Bob throughout the archive with respect to the text and headers of this message, and thinks: “Hmm…when the name Bob is used with the people copied on this email, and when these other names appear in the message, its more likely to be Bob Creeley than other Bobs in the archive like Dylan or Woodward.” It displays a popup with the ranked list of possibilities (see image).
The colored bar underneath each full name indicates the likelihood of that association. This feature can be used by an archivist during processing or by researchers in the delivery module to understand the archive’s contents better. If you think about it, we humans do this kind of context-based disambiguation all the time; ePADD is helping us along by trying to automate some of it.
3. Lexical searches & review
The archivist can use the built-in lexicons or create one in order to tease out the subjects or topics in the archive. MUSE came with a “sentiment” lexicon and ePADD will include another default lexicon based on searching for Personally Identifiable Information and sensitive material. This will include the ability to identify regular expressions – such as credit card or social security numbers as well as material that may be governed by FERPA or HIPAA. These lexicons are editable or one could start from scratch and create a specialized one. The beauty of this is that once the terms are indexed by ePADD the user can view the messages individually or in a visualization graph.
Trevor: As a follow-up to that question, how is the project conceptualizing the role of the archivist engaging with some of these automated processes for description? Sudheendra showed how an archivist could intervene and accept/reject or tweak the resulting bundling of email addresses and associate them with named entities. With that said, I imagine it would be a huge undertaking, and one that seems inconsistent with an MPLP approach, to have an archivist review all of this metadata. To that end, are there ways the project can enable some level of review of particularly important figures and still communicate which part is automated and which part has been reviewed? Or are there other ways the team is thinking about this kind of issue?
Peter: In view of the large number of correspondents and personal names mentioned in an email archive, reviewing ALL name entities is usually not feasible. Depending on resources we have for each archive, we can review, say the top 1000 most mentioned names in an archive.
Glynn: Agreed. This is similar to processing the analog or paper correspondence in a collection. The archivist usually selects correspondents that are either well known, that have substantive letters – either in form or extent. Not all correspondents in a collection make it into the finding aid as added entries, into folder-level description, or even into a detailed index. With ePADD the top 50 or 100 correspondents (in extent) are easily and automatically identified.
However, because researchers may be interested in entities/correspondents that we do not “process,” we are considering allowing them much of the same functionality in the full text access module in the reading room. One example, would be allowing the researcher to create a new lexicon and search by their terms.
To identify what’s been processed is a work in process. We still need to build in some administrative features – such as scope and content notes – to let the researcher know the types/depth of actions performed.
Trevor: How are you thinking about authenticity of records in the context of this project? That is, what constitutes the original and authentic format of these records and how does the project work to ensure the integrity of those records over time. Similarly, how are you thinking about documenting decisions and actions taken in the appraisal process on the records?
Peter: According to “ISO 15489-1, Information and documentation–Records Management,” an authentic (electronic) record is one that can be proven:
a) to be what it purports to be,
b) to have been created or sent by the person purported to have created or sent it, and
c) to have been created or sent at the time purported.
Format is not part of the requirements for an authentic electronic record. One of the reasons is that electronically produced documents actually are not objects at all but rather, by their nature, products that have to be processed each time they are used. There is no transfer, no reading without a re-creation of the information. Furthermore, electronic records are at risk because of technical obsolescence as newer formats replace older ones.
ePADD does not address the issue of authenticity in this round of funding. This issue is definitely important and complicated and I would like to address it in the future.
Trevor: What lessons has the team learned so far about working with email archives? Are there any assumptions or thoughts you had about working with email as records that have evolved or changed while working on the project?
Peter: Conversion of old archived emails can be tricky. Even though normalization is not within the scope of ePADD, still people need to convert emails to MBox format before ePADD can work on them. One of our partners found missing headers from emails when looking at them using ePADD. The emails came from old Groupwise emails that were migrated into Outlook and then converted to Mbox. Is this a conversion error when converting Groupwise emails to Outlook? Or when converting the Outlook emails to mbox? Or when ePADD parses the emails?
Attachment files come in diverse file formats. The ability to view files in attachments is an important feature for a system like ePADD. Apache OpenOffice.org can read files in ~50 file formats. On the other hand, QuickView Plus can view files in ~500 file formats. Should we integrate a commercial software in ePADD in order to view files in the 450 file formats which Apache OpenOffice is not capable of? If yes, ePADD will not be an open-source project anymore. If no, ePADD users have to face the fact that there are files they are not able to view.
Glynn: The sheer volume of data to review can be very daunting. The more specific the terms in the lexicon to perform automated indexing to messages the better. You want to discover messages that should be restricted but not have too many false positives to wade through during review.
The ability to process in bulk cannot be stressed too much. When performing actions on a set of messages – either from a lexical result, correspondent or a user-defined search. ePADD allows to you apply any action to that entire subset. You can also apply actions to original folders. For example, if messages are organized into a folder marked “human resources,” the archivist or donor may choose to flag all the messages in that folder as “restricted until 2050.”
Trevor: What are the next steps for the project? What sorts of things are you exploring for the future?
Peter: I would like to look at the topics/concepts exchanged in emails (and match them against the Library of Congress Subject Heading – Topical). It would be interesting to know what books and movies were mentioned in emails. Publishing extracted entities as linked open data is definitely one thing I would like to do as well. However, it all depends on funding.
Glynn: This is the fun part – envisioning what else is needed or desired in future iterations. It is, however, reliant on funding and collaboration. Input is needed across different types of institutions – museums, government, academic, corporate to name a few. While many of the use cases would be similar, there are unique aspects or goals for different institutions.
Over the past few weeks, we’ve taken part in the NDSA-sponsored meeting (see Chris Prom’s blog post) and held ePADD’s first Advisory Group meeting. These sparked some wonderful discussions and ideas about next steps for greater discovery, delivery and collaboration.
There is a definite need in the profession to begin defining and documenting use cases, to analyze and document life cycles of email archives and existing tools in order to evaluate gaps and future needs, to further discovery through exporting correspondents and extracted entities from ePADD and publishing them with a dynamic search interface across archives. Other avenues we would like to explore are the ability to process and deliver other document types (beyond email), including social media.
The final delivery or access module is intended for reading room access, and we hope to provide more robust tools to allow user interaction with the archives. Additionally, we would like to offer data dumps for text mining/analysis or extractions of header information for social network analysis. Currently these are managed by correspondence through Special Collections.
One suggestion from our Advisory Group was to broaden use of ePADD before final release in the summer (2015). By allowing other repositories to use ePADD for processing we would expose more email collections for researcher use and hopefully get more feedback for the development and specification teams. This will be a better demonstration and test of the program. To this end we plan to release ePADD beyond our grant collaborators to other institutions that have already expressed strong interest.
Sudheendra: Glynn and Peter have answered your questions wonderfully, so I’ll just jump in with a little bit of speculation. In the last couple of decades, we’re seeing that a lot of our lives are reflected in our online activities, be it email, blogs, Facebook, Twitter or any other medium. A small example: a cousin of mine spent almost a year organizing a major dance performance for her daughter. She was reflecting on the effort, and exclaimed to me: “That was so much work. You know, I should save all those emails!” I think that is very telling. All of us have wonderful stories in our lives. There are moments of joy and exasperation, love and sorrow, accomplishment and failure, and they are often captured in our electronic communications. We should be able to preserve them, reflect on them, and hand them over to future generations. We already do this with photographs, which are wonderful. However, text-based communications are complementary to images because they capture thoughts, feelings and intentions in a way that images do not.
Unfortunately, the misuse of personal data for commercial or surveillance reasons is causing many people to be wary of preserving their own records, and even to go out of their way to delete them. This is a pity, because there is so much value buried in archives, if only users could keep their data under their own control, and have good tools with which to make sense of it. So in the next decade, I predict that individuals and families will routinely use tools like ePADD to preserve history important to them. We’re all archivists in that sense.
Online education has extended its presence to public libraries. Online learning and career training, by services such as Ed2Go and Lynda, are usually offered complimentary to college and university students. Similar services such as Gale Courses, Universal Class and Treehouse are geared toward public library use.
Gale Courses is a subscription service of Cengage Learning. It is a hybrid of Ed2Go, offering courses that range from GED preparation to PC Security. Courses are six weeks in length and are instructor led.
Universal Class offers hundreds of courses on a variety of topics, including dog obedience training, to patrons of diverse interests. Courses are self-paced and users can begin a course at anytime.
Treehouse is uniquely geared toward web design, development and programming for personal computers and mobile device applications. Users can select self-paced educational Tracks that are focused on a specific development area.
An alternative to MOOCs
A considerable population of the general public cannot afford to pursue a formal education. Extending the services of the library into web-based learning, online courses provide access to continuing education for the general public. The mention of free online education is not complete without a nod to massive open online courses (MOOC). MOOCs can be non-profit or commercial. They offer free or affordable online education, of varying course structure, to students around the world. Though MOOCs and open courseware are comparable alternatives, library-hosted continuing education offers additional incentives from those of most freely available online courses.
Education as a service
One advantage to using a service provided by the public library is that patrons can use the computers available on site. For patrons lacking home computer access, they can incorporate another library service into their education. Continuing education courses are free to library card holders at participating libraries. If your regional library does not offer the service, you can always purchase a library card from a participating library. Considering that each course can range from $50 to the mid $100s, the benefit of access to hundreds of courses outweighs the cost of purchasing a library card. Patrons will receive a certificate of completion for each completed course and in the case of Universal Class they will receive continuing education units that are approved by the International Association for Continuing Education and Training (IACET). Treehouse opts for using a point-based system and Badges, digital awards, which signify a user’s progress. Online education also helps to highlight the public library as an evolving source of public information.
All three continuing education providers, offer free trials and demo courses for anyone interested in their services.
My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.
I’m in the middle of updating my app that uses multi-threaded concurrency in unusual ways to Rails4. The good news is that the significant bugs I ran into in Rails 3.1 etc, reported in the earlier post have been fixed.
However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.Background: The ActiveRecord Concurrency Model
Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.
Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.
You can check out a connection explicitly using `checkout` and `checkin` methods. Or, better yet use the `with_connection` method to wrap database use. So far so good.
But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.
And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)
And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before. A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.
And after the request had been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back connections, if any, that were checked out.The danger of leaked connections
So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).
But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do. If you proceed blithly to use AR like you are used to in Rails, but have created Threads yourself — then connections will be automatically checked out to you when needed…. and never checked back in.
The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.
And if the thread then dies, the connection will become orphaned or leaked, and in fact there is no way in Rails4 to recover it. If you leak one connection like this, that’s one less connection available in the ConnectionPool. If you leak all the connections in the ConnectionPool, then there’s no more connections available, and next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it in your database.yml to something else) trying to get a connection, and then it’ll give up and throw a ConnectionTimeout. No more database access for you.
In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed. You could call this method from time to time yourself to try and clean up after yourself.
And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.
But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually. As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.
So this makes it pretty important to avoid leaking connections.
(Note: There is still a method `clear_stale_cached_connections` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup. That it uses the same method name, I think, is based on misunderstanding by Rails devs of what it’s doing. See Fear the Reaper below. )Monkey-patch AR to avoid leaked connections
I understand where Rails is coming from with the ‘implicit checkout’ thing. For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection, they want it to happen automatically. (In no previous version of Rails, back from when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0-2.1, has the developer had to manually check out a connection in a standard Rails action method).
So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.
The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection, that never gets checked back in, gets leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.
That API contract of “implicitly checkout a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own `Thread.new` and using ActiveRecord in it, we really want to disable that entirely, and so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).
So, here, in a gist, is a couple dozen line monkey patch to ActiveRecord that let’s you, on a thread-by-thread basis, disable the “implicit checkout”. Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:Thread.new do ActiveRecord::Base.forbid_implicit_checkout_for_thread! # stuff end
Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.
If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.
This way you can enforce your code to only use `with_connection` like it should.
Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.DO fear the Reaper
In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections. In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”
The problem is, as far as I can tell by reading the code, it simply does not do this.
A leaked connection hasn’t necessarily dropped it’s network connection. That really depends on the database and it’s settings — most databases will drop unused connections after a certain idle timeout, by default often hours long. A leaked connection probably hasn’t yet had it’s network connection closed, and a properly checked out not-leaked connection can have it’s network connection closed (say, there’s been a network interruption or error; or a very short idle timeout on the database).
The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network, not checked out but never checked in leaked connections). Dropped network is a legit problem you want to be handled gracefullly; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, I have not put it through it’s paces myself). But it’s got nothing to do with leaked connections.
Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)
So, yeah, Rails4 has redefined the existing `clear_stale_active_connections!` method to do something entirely different than it did in Rails3, it’s triggered in entirely different circumstance. Yeah, kind of confusing.Oh, maybe fear ruby 1.9.3 too
When I was working on upgrading the app, I’m working on, I was occasionally getting a mysterious deadlock exception:ThreadError: deadlock; recursive locking:
In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, that my errors resulted in that exception rather than a more meaningful one, maybe possibly have been a bug in ruby 1.9.3 that’s fixed in ruby 2.0.
If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.Can you use an already loaded AR model without a connection?
Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?
Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.
Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.
But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).
Didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection, it doesn’t really consider that one way or another.
Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.Concurrency Patterns to Avoid in ActiveRecord?
Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).
So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.
If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.
I think you’re probably fairly safe doing that too, and is the way background task pools are often set up.
That’s not what my app does. I wouldn’t necessarily design my app the same way today if I was starting from scratch (the app was originally written for Rails 1.0, gives you a sense of how old some of it’s design choices are; although the concurrency related stuff really only dates from relatively recent rails 2.1 (!)).
My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote API’s, is why I wanted to do them in concurrent threads (huge wall time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output (I tried to avoid concurrency headaches by making all inter-thread communications through the database; this is not a low-latency-requirement situation; I’m not sure how much headache I’ve avoided though!)
So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine, just wrap all the AR access in a `with_connection`. But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and just parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.
I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point, I’d recommend avoiding using ActiveRecord concurrency this way.What to do?
What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.
At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.” DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).
Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).
I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel). I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.
So what would I do different? I’d try to have my worker threads not actually use AR at all. Instead of passing in an AR model as input, I’d fetch the AR model in some other safer main thread, convert it to a pure business object without any AR, and pass that in my worker threads. Instead of having my worker threads write their output out directly using AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for it’s entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to the dedicated threadpool of writers.
That would have seemed like huge over-engineering to me at some point in the past, but at the moment it’s sounding like just the right amount of engineering if it lets me avoid using ActiveRecord in the concurrency patterns I am, that while it officially supports, it isn’t very happy about.
Filed under: General
Jason Ronallo: A Plugin For Mediaelement.js For Preview Thumbnails on Hover Over the Time Rail Using WebVTT
The time rail or progress bar on video players gives the viewer some indication of how much of the video they’ve watched, what portion of the video remains to be viewed, and how much of the video is buffered. The time rail can also be clicked on to jump to a particular time within the video. But figuring out where in the video you want to go can feel kind of random. You can usually hover over the time rail and move from side to side and see the time that you’d jump to if you clicked, but who knows what you might see when you get there.
Some video players have begun to use the time rail to show video thumbnails on hover in a tooltip. For most videos these thumbnails give a much better idea of what you’ll see when you click to jump to that time. I’ll show you how you can create your own thumbnail previews using HTML5 video.
TL;DR Use the time rail thumbnails plugin for Mediaelement.js.Archival Use Case
We usually follow agile practices in our archival processing. This style of processing became popularized by the article More Product, Less Process: Revamping Traditional Archival Processing by Mark A. Greene and Dennis Meissner. For instance, we don’t read every page of every folder in every box of every collection in order to describe it well enough for us to make the collection accessible to researchers. Over time we may decide to make the materials for a particular collection or parts of a collection more discoverable by doing the work to look closer and add more metadata to our description of the contents. But we try not to allow the perfect from being the enemy of the good enough. Our goal is to make the materials accessible to researchers and not hidden in some box no one knows about.
Some of our collections of videos are highly curated like for video oral histories. We’ve created transcripts for the whole video. We extract out the most interesting or on topic clips. For each of these video clips we create a WebVTT caption file and an interface to navigate within the video from the transcript.
At NCSU Libraries we have begun digitizing more archival videos. And for these videos we’re much more likely to treat them like other archival materials. We’re never going to watch every minute of every video about cucumbers or agricultural machinery in order to fully describe the contents. Digitization gives us some opportunities to automate the summarization that would be manually done with physical materials. Many of these videos don’t even have dialogue, so even when automated video transcription is more accurate and cheaper we’ll still be left with only the images. In any case, the visual component is a good place to start.Video Thumbnail Previews
When you hover over the time rail on some video viewers, you see a thumbnail image from the video at that time. YouTube does this for many of its videos. I first saw that this would be possible with HTML5 video when I saw the JW Player page on Adding Preview Thumbnails. From there I took the idea to use an image sprite and a WebVTT file to structure which media fragments from the sprite to use in the thumbnail preview. I’ve implemented this as a plugin for Mediaelement.js. You can see detailed instructions there on how to use the plugin, but I’ll give the summary here.1. Create an Image Sprite from the Video
This uses ffmpeg to take a snapshot every 5 seconds in the video and then uses montage (from ImageMagick) to stitch them together into a sprite. This means that only one file needs to be downloaded before you can show the preview thumbnail.ffmpeg -i "video-name.mp4" -f image2 -vf fps=fps=1/5 video-name-%05d.jpg montage video-name*jpg -tile 5x -geometry 150x video-name-sprite.jpg 2. Create a WebVTT metadata file
This is just a standard WebVTT file except the cue text is metadata instead of captions. The URL is to an image and uses a spatial Media Fragment for what part of the sprite to display in the tooltip.WEBVTT 00:00:00.000 --> 00:00:05.000 http://example.com/video-name-sprite.jpg#xywh=0,0,150,100 00:00:05.000 --> 00:00:10.000 http://example.com/video-name-sprite.jpg#xywh=150,0,150,100 00:00:10.000 --> 00:00:15.000 http://example.com/video-name-sprite.jpg#xywh=300,0,150,100 00:00:15.000 --> 00:00:20.000 http://example.com/video-name-sprite.jpg#xywh=450,0,150,100 00:00:20.000 --> 00:00:25.000 http://example.com/video-name-sprite.jpg#xywh=600,0,150,100 00:00:25.000 --> 00:00:30.000 http://example.com/video-name-sprite.jpg#xywh=0,100,150,100 3. Add the Video Thumbnail Preview Track
Put the following within the <video> element.<track kind="metadata" class="time-rail-thumbnails" src="http://example.com/video-name-sprite.vtt"></track> 4. Initialize the Plugin
See Bug Sprays and Pets with sound.Installation
One of the DOM API features I hadn’t used before is MutationObserver. One thing the thumbnail preview plugin needs to do is know what time is being hovered over on the time rail. I could have calculated this myself, but I wanted to rely on MediaElement.js to provide the information. Maybe there’s a callback in MediaElement.js for when this is updated, but I couldn’t find it. Instead I use a MutationObserver to watch for when MediaElement.js changes the DOM for the default display of a timestamp on hover. Looking at the time code there then allows the plugin to pick the correct cue text to use for the media fragment. MutationObserver is more performant than the now deprecated MutationEvents. I’ve experienced very little latency using a MutationObserver which allows it to trigger lots of events quickly.
The plugin currently only works in the browsers that support MutationObserver, which is most current browsers. In browsers that do not support MutationObserver the plugin will do nothing at all and just show the default timestamp on hover. I’d be interested in other ideas on how to solve this kind of problem, though it is nice to know that plugins that rely on another library have tools like MutationObserver around.Other Caveats
This plugin is brand new and works for me, but there are some caveats. All the images in the sprite must have the same dimensions. The durations for each thumbnail must be consistent. The timestamps currently aren’t really used to determine which thumbnail to display, but is instead faked relying on the consistent durations. The plugin just does some simple addition and plucks out the correct thumbnail from the array of cues. Hopefully in future versions I can address some of these issues.Discoveries
Having this feature be available for our digitized video, we’ve already found things in our collection that we wouldn’t have seen before. You can see how a “Profession with a Future” evidently involves shortening your life by smoking (at about 9:05). I found a spinning spherical display of Soy-O and synthetic meat (at about 2:12). Some videos switch between black & white and color which you wouldn’t know just from the poster image. And there are some videos, like talking heads, that appear from the thumbnails to have no surprises at all. But maybe you like watching boiling water for almost 13 minutes.
OK, this isn’t really a discovery in itself, but it is fun to watch a head banging JFK as you go back and forth over the time rail. He really likes milk. And Eisenhower had a different speaking style.
You can see this in action for all of our videos on the NCSU Libraries’ Rare & Unique Digital Collections site and make your own discoveries. Let me know if you find anything interesting.Preview Thumbnail Sprite Reuse
Since we already had the sprite images for the time rail hover preview, I created another interface to allow a user to jump through a video. Under the video player is a control button that shows a modal with the thumbnail sprite. The sprite alone provides a nice overview of the video that allows you to see very quickly what might be of interest. I used an image map so that the rather large sprite images would only have to be in memory once. (Yes, image maps are still valid in HTML5 and have their legitimate uses.) jQuery RWD Image Maps allows the map area coordinates to scale up and down across devices. Hovering over a single thumb will show the timestamp for that frame. Clicking a thumbnail will set the current time for the video to be the start time of that section of the video. One advantage of this feature is that it doesn’t require the kind of fine motor skill necessary to hover over the video player time rail and move back and forth to show each of the thumbnails.
This feature has just been added this week and deployed to production this week, so I’m looking for feedback on whether folks find this useful, how to improve it, and any bugs that are encountered.Summarization Services
I expect that automated summarization services will become increasingly important for researchers as archives do more large-scale digitization of physical collections and collect more born digital resources in bulk. We’re already seeing projects like fondz which autogenerates archival description by extracting the contents of born digital resources. At NCSU Libraries we’re working on other ways to summarize the metadata we create as we ingest born digital collections. As we learn more what summarization services and interfaces are useful for researchers, I hope to see more work done in this area. And this is just the beginning of what we can do with summarizing archival video.
Most of the conferences I go to are technology ones that are focused on practical applications and knowledge sharing on how we have solved specific technical problems or figured out new, more efficient ways to do old things. It’s been a long time since I’ve been to a conference that’s about broader ideas and a much longer time since I’ve been to an academic conference. This was outside my comfort zone and it was an extremely worthwhile experience.
I was unbelievably excited to see the program for the first Gender and Sexuality in Information Studies colloquium. Also, as Emily Drabinski and Lisa Sloniowski were involved, so I knew it was going to be great.
There were 100 attendees. I’d estimate that library and information studies professors and PhD students made up 50%, library school grad students made up 25%, and the other 25% of us were practioners, who work almost exclusively in academic settings. The conference participants had the best selection of glasses, and I was inspired to document some of them.
The program was great and I had a very hard time picking which of the 3 streams I wanted to attend. A few people scampered between rooms to catch papers in different streams. Program highlights for me was the panel on porn in the library and the panel on gender and content. My thoughts on the porn in the library panel became a bit long, so I’ll post those tomorrow.
In my opinion it was a shame that most of the presenters defaulted to a traditional academic style of conference presentation, that is, they stood at the front of the room and read their papers to the audience without making much eye contact. For me the language was sometimes unnecessarily dense and that many of the theoretical concepts discussed would’ve been more successful if expressed in plain English.
I was also disappointed that there wasn’t a plan to post the papers online. Lisa explained to me that for those librarians and scholars in a university environment publications are important to tenure and promotion. Conference presentations count, but not as much as peer reviewed publications, which don’t count as much as book publications. I know there’s a plan in the works for a edition of Library Trends that will be published in 2 years. Also, I know from the interest on Twitter that there are many people who weren’t able to travel to Toronto and attend in person who are very hungry to read these papers. For the technology conferences I go to it is standard to share as much as possible: to livestream the conference, to archive the Twitter stream, and to post presentations online and made code public too. I hope that most of the presenters will figure out a way to share their work openly without it costing them in academic prestige. There’s got to be a way to do this.
There was a really magical feeling at this first colloquium on gender and sexuality in LIS. Everyone brought their smarts, ideas and generous spirits. I think a lot of us have been starved for this kind of environment, engagement and community.
My brain, heart and sinuses are full. I’m exhausted and heading home to Vancouver. This one day of connections and ideas will keep me going for another year. Kudos to the organizers Emily Drabinski, Patrick Keilty and Litwin Books for organizing this. I’m hungry for more.