Hi there, future text miners. Before we head down the coal chute together, I’ll begin by saying this, and I hope it will reassure you: no matter your level of expertise or your experience in writing code or conducting data analysis, you can find an online tool to help you text mine.
The internet is a wild and beautiful place sometimes.
But before we go there, you may be wondering: what’s this Brave New Workplace business all about? Brave New Workplace is my monthly discussion of tech tools and skill sets which can help you adapt to and get to know a new workplace. In our previous two installments I’ve discussed my own techniques and approaches to learning about your coworkers’ needs and common goals. Today I’m going to talk about text mining the results of your survey, but also text mining generally.
Now three months into my new position, I have found that text mining my survey results was only the first step to developing additional awareness of where I could best apply my expertise to library needs and goals. I went so far as to text mine three years of eresource Help Desk tickets and five years of meeting notes. All of it was fun, helpful, and revealing.
Text mining can assist you in information gathering in a variety of ways, but I tend to think it’s helpful to keep in mind the big three.
1. Seeing the big picture (clustering)
2. Finding answers to very specific questions (question answering)
3. Hypothesis generation (concept linkages)
For the purpose of this post, I will focus on tools for clustering your data set. As with any data project, I encourage you to categorize your inputs and vigorously review and pre-process your data. Exclude documents or texts that do not pertain to the subject of your inquiry. You want your data set to be big and deep, not big and shallow.
I will divide my tool suggestions into two categories: beginner and intermediate. Beginners just getting started will not need any programming language; the intermediate tools do require one.
Start yourself off easy and use WordClouds.com. This simple site will make you a pretty word cloud, and also provide you with a comprehensive word frequencies list. Those frequencies point to concept clusters, and you can begin to see trends and needs in your new coworkers and your workplace goals. This is a pretty cool and VERY user-friendly way to get started text mining.
WordClouds eliminates frequently used words, like articles, and gets you to the meat of your texts. You can copy and paste text or upload text files. You can also scan a site URL for text, which is what I’ve elected to do as an example here, examining my library’s home page. The best output of WordClouds is not the word cloud. It’s the easily exportable list of frequently occurring words.
WordCloud Frequency List
To be honest, I often use this WordClouds function in advance of getting into other data tools. It can be a way to better figure out categories of needs, a great first data mining step which requires almost zero effort. With your frequencies list in hand you can do some immediate (and perhaps more useful) data visualization in a simple tool of your choice, for instance Excel.
Excel Graphs for Visualization
Depending on your preferred programming language, many options are available to you. While I have traditionally worked in SPSS for data analysis, I have recently been working in R. The good news about R versus SPSS: R is free and there’s a ton of community collaboration. If you have a question (I often do) it’s easy to find an answer.
Getting started in R with text mining is simple. You’ll need to install the packages necessary if you are text mining for the first time.
Then save your text files in a folder titled “texts” and load them into R. Once they are in, you’ll need to pre-process your text to remove common words and punctuation. This guide is excellent in taking you through the steps to process your data and analyze it.
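For readers who would rather follow along in Python than R, the loading and pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the workflow from the guide: the function names (`preprocess`, `load_texts`, `term_frequencies`) and the very short stop word list are assumptions made up for this example, and it uses only the Python standard library.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}


def preprocess(text):
    """Lowercase the text, keep only runs of letters, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def load_texts(folder="texts"):
    """Read every .txt file in the folder into a {filename: text} dictionary."""
    paths = sorted(Path(folder).glob("*.txt"))
    return {path.name: path.read_text(encoding="utf-8") for path in paths}


def term_frequencies(documents):
    """Count how often each remaining term occurs across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(preprocess(text))
    return counts


if __name__ == "__main__":
    texts = load_texts("texts")
    # Print the twenty most frequent terms, one per line.
    for term, count in term_frequencies(texts.values()).most_common(20):
        print(f"{term}\t{count}")
```

Keeping only runs of letters is a blunt way to strip punctuation (it splits contractions, for instance), so treat the output as a first look rather than a finished analysis.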
Just as with WordClouds, you can use R to discover term frequencies and visualize them. Beyond this, working in R or SPSS or Python can allow you to cluster terms further. You can find relationships between words and examine them with a dendrogram (hierarchical clustering) or with k-means, both of which show how terms group into clusters.
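To make the dendrogram idea concrete, here is a small Python sketch of hierarchical clustering over terms. It is an illustration under stated assumptions rather than a recommended pipeline: the choice of cosine distance and single linkage is mine, the function names are hypothetical, and the input is assumed to be documents already split into lists of words. The list of merges it returns is the information a dendrogram draws; in practice you would likely hand this job to a statistics package instead.

```python
import math
from collections import Counter
from itertools import combinations


def term_vectors(docs, vocabulary):
    """Map each term to its vector of counts across the tokenized documents."""
    counts = [Counter(doc) for doc in docs]
    return {term: [c[term] for c in counts] for term in vocabulary}


def cosine_distance(a, b):
    """Near 0.0 for terms used in the same documents, 1.0 for terms that never co-occur."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def linkage(left, right, vectors):
    """Single linkage: the distance between the closest pair of terms in two clusters."""
    return min(cosine_distance(vectors[s], vectors[t]) for s in left for t in right)


def agglomerate(vectors):
    """Repeatedly merge the two closest clusters and record each merge in order."""
    clusters = [(term,) for term in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2),
                          key=lambda pair: linkage(*pair, vectors))
        merges.append((left, right, round(linkage(left, right, vectors), 3)))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges


if __name__ == "__main__":
    # Made-up example documents, already tokenized.
    docs = [["ebook", "access"], ["ebook", "access", "login"],
            ["printer", "paper"], ["printer", "paper", "jam"]]
    vectors = term_vectors(docs, ["ebook", "access", "printer", "paper"])
    for left, right, distance in agglomerate(vectors):
        print(left, "+", right, "at distance", distance)
```

Terms that merge early, at small distances, tend to appear in the same documents; merges that happen late join groups of terms that rarely share a document.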
Ultimately, the more you text mine, the more familiar you will become with the tools and analysis valuable in approaching a specific text dataset. Get out there and text mine, kids. It’s a great way to acculturate to a new workplace or just learn more about what’s happening in your library.
Now that we’ve text mined the results of our survey, it’s time to move on to building a Customer Relationship Management system (CRM) for keeping our collaborators and projects straight. Come back for Brave New Workplace: Your Homegrown CRM on December 21st.
A quick pointer to Automated Scanning of Firefox Extensions is Security Theater (And Here’s Code to Prove It) by Dan Stillman, lead developer on Zotero, about how extension signing (meant to make Firefox more secure) could cause serious problems for Zotero.
For the last few months, we’ve been asking on the Mozilla add-ons mailing list that Zotero be whitelisted for extension signing. If you haven’t been following that discussion, 1) lucky you, and 2) you can read my initial post about it, which gives some context. The upshot is that, if changes aren’t made to the signing process, we’ll have no choice but to discontinue Zotero for Firefox when Firefox 43 comes out, because, due to Zotero’s size and complexity, we’ll be stuck in manual review forever and unable to release timely updates to our users, who rely on Zotero for time-sensitive work and trust us to fix issues quickly. (Zotero users could continue to use our standalone app and one of our lightweight browser extensions, but many people prefer the tighter browser integration that the original Firefox version provides.)
Mozilla should give Zotero the special treatment it deserves. It’s a very important tool, and a crucial part of ongoing research all over the world. Mozilla needs to support it.
This morning I received an email asking me to peer review a book proposal for Chandos Publishing, the Library and Information Studies imprint of Elsevier. Initially I thought it was spam because of some sloppy punctuation and the “Dr. Robertson” salutation.
When other people pointed out that this likely wasn’t spam my ego was flattered for a few minutes and I considered it. I was momentarily confused–would participating in Elsevier’s book publishing process be evil? Isn’t it different from their predatory pricing models with libraries and roadblocks to sharing research more broadly? I have a lot to learn about scholarly publishing, but decided that I’m not going to contribute my labour to a company that are jerks to librarians, researchers and libraries.
Here’s some links I found useful:
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
New This Week:
Visit the LITA Job Site for more available jobs and for information on submitting a job posting.
Winchester, MA For a quick round-up of current news and information about events and achievements happening around the digital preservation and access ecosystem visit DuraSpace Today: http://duraspace.org/duraspace-today. Follow DuraSpace on Twitter by clicking the link at the top of the page.
From Surina Khan, Open Repository
From Michele Mennielli, Cineca
Bologna, Italy In the last two months Cineca attended two very important IT events focused on support for Higher Education. At both events the Italian Consortium presented its research ecosystem related activities, focusing on DSpace and DSpace-CRIS.
From Tiago Ferreira
Petrópolis, Rio de Janeiro, Brazil Provider IT Neki Technologies, the Brazilian Duraspace Registered Service Provider, has undergone a major change during the past few months and is now Neki IT. Located in Petrópolis, Rio de Janeiro, Neki IT has left the Provider Group and is again running its own structure.
But Emigrant City is a bit different from the other projects we’ve released in one very important way: this one is built on top of a totally new framework called Scribe, built in collaboration with Zooniverse and funded by a grant from the NEH Office of Digital Humanities along with funds from the University of Minnesota. Scribe is the codebase working behind the scenes to support this project.
What is Scribe?
Scribe is a highly configurable, open-source framework for setting up community transcription projects around handwritten or OCR-resistant texts. Scribe provides the foundation of code for a developer to configure and launch a project far more easily than if starting from scratch.
NYPL Labs R&D has built a few community transcription apps over the years. In general, these applications are custom built to suit the material. But Scribe prototypes a way to describe the essential work happening at the center of those projects. With Scribe, we propose a rough grammar for describing materials, workflows, tasks, and consensus. It’s not our last word on the topic, but we think it’s a fine first pass proposal for supporting the fundamental work shared by many community transcription projects.
So, what’s happening in all of these projects?
Our previous community transcription projects run the gamut from requesting very simple, nearly binary input like “Is this a valid polygon?” (as in the case of Building Inspector) to more complex prompts like “Identify every production staff member in this multi-page playbill” (as in the case of Ensemble). Common tasks include:
- Identify a point/region in a digitized document or image
- Answer a question about all or part of an image
- Flag an image as invalid (meaning it’s blank or does not include any pertinent information)
- Flag others’ contributions as valid/invalid
- Flag a page or group of pages as “done”
There are many more project-specific concerns, but we think the features above form the core work.
How does Scribe approach the problem?
Scribe reduces the problem space to “subjects” and “classifications.” In Scribe, everything is either a subject or a classification: Subjects are the things to be acted upon, classifications are created when you act. Creating a classification has the potential to generate a new subject, which in turn can be classified, which in turn may generate a subject, and so on.
This simplification allows us to reduce complex document transcription to a series of smaller decisions that can be tackled individually. We think breaking the work into smaller, more atomic tasks makes projects less daunting for volunteers to begin and easier to continue. This simplification doesn’t come at the expense of quality, however, as projects can be configured to require multiple rounds of review.
The final subjects produced by this chain of workflows represent the work of several people carrying an initial identification all the way through to final vetted data. The journey comprises a chain of subjects linked by classifications connected by project-specific rules governing exposure and consensus. Every region annotated is eventually either deleted by consensus or further annotated with data entered by several hands and, potentially, approved by several reviewers. The final subjects that emerge represent singular assertions about the data contained in a document validated by between three and 25 people.
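The chain of subjects and classifications described above can be pictured with a toy model. The sketch below is not Scribe’s code or API. The class names, fields, the `classify` and `consensus` helpers, and the 75% agreement threshold are all hypothetical, chosen only to show how acting on one subject can generate the next, and how several contributors’ classifications can be reduced to a single assertion.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Classification:
    contributor: str
    workflow: str      # hypothetical workflow label, e.g. "mark" or "transcribe"
    annotation: dict


@dataclass
class Subject:
    kind: str                              # e.g. "document" or "mark"
    data: dict
    parent: Optional["Subject"] = None     # the subject this one was generated from
    classifications: list = field(default_factory=list)


def classify(subject, contributor, workflow, annotation, generates=None):
    """Record a classification on a subject; optionally generate a child subject."""
    subject.classifications.append(Classification(contributor, workflow, annotation))
    if generates is None:
        return None
    return Subject(kind=generates, data=dict(annotation), parent=subject)


def consensus(subject, key, threshold=0.75):
    """Return the most common value for `key` if enough contributors agree, else None."""
    values = [c.annotation[key] for c in subject.classifications if key in c.annotation]
    if not values:
        return None
    value, votes = Counter(values).most_common(1)[0]
    return value if votes / len(values) >= threshold else None


if __name__ == "__main__":
    page = Subject("document", {"image": "page-1.jpg"})
    # Marking a region on the page generates a new "mark" subject.
    mark = classify(page, "ann", "mark", {"x": 10, "y": 20, "field": "surname"},
                    generates="mark")
    # Several contributors then transcribe that mark.
    for person, text in [("bob", "Smith"), ("cat", "Smith"), ("dee", "Smith"), ("eli", "Smyth")]:
        classify(mark, person, "transcribe", {"text": text})
    print(consensus(mark, "text"))
```

When `consensus` returns None the contributors disagree too much, which is the point at which a real project would route the subject to a further round of review.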
In the case of Emigrant City specifically, individual bond records are represented as subjects. When participants mark those records up, they produce “mark” subjects, which appear in Transcribe. In the Transcribe workflow, other contributors transcribe the text they see, and their transcriptions are combined with others’ as “transcribe” subjects. If there’s any disagreement among the transcriptions, those transcribe subjects appear in Verify, where additional classifications are added by other contributors as votes for one or another transcription. But this is just the configuration that made sense for Emigrant City. Scribe lays the groundwork to support other configurations.
Is it working?
I sure hope so! In any case, the classifications are mounting for Emigrant City. At the time of writing we’ve gathered 227,638 classifications comprising marks, transcriptions, and verifications from nearly 3,000 contributors. That’s about 76 classifications each, on average, which is certainly encouraging as we assess the stickiness of the interface.
We’ve had to adjust a few things here and there. Bugs have surfaced that weren’t apparent before testing at scale. Most issues have been patched and data seems to be flowing in the right directions from one workflow to the next. We’ve already collected complete, verified data for several documents.
Reviewing each of these documents, I’ve been heartened by the willingness of a dozen strangers spread between the US, Europe, and Australia to meditate on some scribbles in a 120-year-old mortgage record. I see them plugging away when I’m up at 2 a.m. looking for a safe time to deploy fixes.
What’s next?
As touched on above, Scribe is primarily a prototype of a grammar for describing community transcription projects in general. The concepts underlying Scribe formed over a several-month collaboration between remote teams. We built the things we needed as we needed them. The codebase is thus a little confusing in areas, reflecting several mental right turns when we found the way forward required an additional configuration item or chain of communication. So one thing I’d like to tackle is reining in some of the areas that have drifted from the initial elegance of the model. The notion that subjects and workflows could be rearranged and chained in any configuration has been a driving idea, but in practice the system obliges only a few arrangements.
An increasingly pressing desire, however, is developing an interface to explore and vet the data assembled by the system. We spent a lot of time developing the parts that gather data, but perhaps not enough on interfaces to analyze it. Because we’ve reduced document transcription to several disconnected tasks, the process to reassemble the resultant data into a single cohesive whole is complicated. That complexity requires a sophisticated interface to understand how we arrived at a document’s final set of assertions from the chain of contributions that produced it. Luckily we now have a lot of contributions around which to build that interface.
Most importantly, the code is now out in the wild, along with live projects that rely on it. We’re already grateful for the tens of thousands of contributions people have made on the transcription and verification front, and we’d likewise be immensely grateful for any thoughts or comments on the framework itself. Let us know in the comments, or directly via GitHub, and thanks for helping us get this far.
Also, check out the other community transcription efforts built on Scribe. Measuring the Anzacs collects first-hand accounts from New Zealanders in WWI. Coming soon, “Old Weather: Whaling” gathers Arctic ships’ logs from the late 19th and early 20th centuries.
In his opening remarks at the November 17 Re:Create conference, Public Knowledge President & CEO Gene Kimmelman shared his thoughts about fair use as a platform for today’s creative revolution and as a key to how knowledge is shared in today’s society. That set the tone for the dynamic discussion of copyright policy and law that followed, the unifying focus of the Re:Create coalition.
“Yes, it’s important for creators to have a level of protection for their work,” Eli Lehrer, president of the R Street Institute, said, “but that doesn’t mean government should have free rein. The founding fathers wanted copyright to be limited but they also wanted it to support the growth of science and the arts.” He went on to decry how copyright has been “taken over by special interests and crony capitalism. We need a vibrant public domain to support true creation,” he said, “and our outdated copyright law is stifling the advancement of knowledge and new creators in the digital economy.”
Three panels of experts brought together by the Re:Create Coalition then proceeded to critique pretty much every angle of copyright law and the role of the copyright office. They also discussed the potential for modernization of the U.S. Copyright Office, whether the office should stay within the Library of Congress or move, and the prospects for reform of the copyright law. The November 17 morning program was graciously hosted by Washington, D.C.’s Martin Luther King, Jr. Memorial Library.
Cory Doctorow, author and advisor to the Electronic Frontier Foundation, believes audiences should have the opportunity to interact with artists/creators. He pointed to Star Wars as an example. Because fans and audiences have interacted and carried the theme and impact forward, Star Wars continues to be a big cultural phenomenon, despite long pauses between new parts in the series. As Michael Weinberg, general counsel and intellectual property (IP) expert at Shapeways, noted, there are certain financial benefits in “losing control,” i.e. the value of the brand is being augmented by audience interaction, thus adding value to the product. Doctorow added that we’ve allowed copyright law to become entertainment copyright law, thus “fans get marginalized by the heavyweight producers.”
On the future of the copyright office panel, moderator Michael Petricone, senior vice president, government affairs, Consumer Technology Association (CTA), said we need a quick and efficient copyright system, and “instead of fighting over how to slice up more pieces of the pie, let’s focus on how to make the pie bigger.”
Jonathan Band, counsel to the Library Copyright Alliance (LCA), said the Copyright Office used to just manage the registration process, but then 1) volume multiplied, 2) some people registered while others did not, so not everyone is using the system, and 3) the office didn’t have the resources to keep up with the huge volume of things being created. This “perfect storm,” he said, is not going to improve without important changes, such as modernizing its outdated and cumbersome record-keeping, but the Office also needs additional resources to address the “enormous black hole of rights.” Laura Moy, senior policy counsel at the Open Technology Institute (OTI), agreed that this is a big problem, because many new creators don’t have the resources or the legal counsel to help them pursue copyright searches and registration.
All the panelists were in agreement that it makes no sense to move the Copyright Office out of the Library of Congress, as has been proposed by a few. Matt Schruers, vice president for law & policy, Computer & Communications Industry Association (CCIA), agreed, urging more robust record-keeping, incentives to get people to register and steps to mitigate costs. He said “we need to look at what the problems are, and fix them where they are. A lot of modernization can be done in the Office where it is, instead of all the cost of moving it and setting it up elsewhere.” Band strongly agreed. “Moving it elsewhere wouldn’t solve the issue/cost of taking everything digital. Moving the Office just doesn’t make sense.” Moy suggested there are also some new skills and expertise that are needed, such as someone with knowledge in IP and its impacts on social justice.
Later in the program, panelists further batted around the topic of fair use. For Casey Rae, CEO of the Future of Music Coalition, fair use is often a grey area of copyright law because it depends on how fair use is interpreted by the courts. In the case of remixes, for example, the court, after a lengthy battle, ruled in favor of 2 Live Crew’s remix of Pretty Woman, establishing that commercial parodies can qualify for fair use status. Lateef Mtima, professor of law, Howard University, and founder and director of the Institute for Intellectual Property and Social Justice, cited the Lenz v. Universal case that not only ruled in favor of the mom who had posted video on YouTube of her baby dancing to Prince’s Let’s Go Crazy, but established that fair use is a right, warning those who consider issuing a takedown notice to “ignore it at your own peril.”
When determining fair use, Greta Peisch, international trade counsel, Senate Finance Committee, said “Who do you trust more to best interpret what is in the best interests of society, the courts or Congress?” The audience response clearly placed greater confidence in the courts on that question. And Engine Executive Director Julie Samuels concluded that “fair use is the most important piece of copyright law—absolutely crucial.”
In discussing the future for copyright reform, Rae said there’s actually very little data on how revising the laws will impact the creative community and their revenue streams. He said legislation can easily be created based on assumptions without the data to back it up, so he urged more research. But he also implied that the music industry (sound recording and music studios) needs to do a better job of explaining its narrative, i.e. go to policymakers with data in hand and real-life stories to share.
Mtima is optimistic that society is making progress in better understanding how the digital age has opened up the world for sharing knowledge and expanding literacy (what he called society’s Gutenberg moment). At first, he says, there was resistance to change. But as content users have made more and more demands for access to content, big content providers are recognizing the need to move away from the old model of controlling and “monetizing copies.” New models are developing and there’s recognition that opening access is often expanding the value of the brand.
Re:Create’s ability to focus on such an important area of public policy as copyright is the reason the coalition has attracted a broad and varied membership. It remains an important forum for discourse among creators, advocates, thinkers and consumers.