planet code4lib

Planet Code4Lib - http://planet.code4lib.org

DPLA: New Uses for Old Advertising

Wed, 2014-11-19 18:00

3 Feeds One Cent, International Stock Food Company, Minneapolis, Minnesota, ca.1905. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

Digitization efforts in the US have, to date, been overwhelmingly dominated by academic libraries, but public libraries are increasingly finding a niche by looking to their local collections as sources for original content. The Hennepin County Library has partnered with the Minnesota Digital Library (MDL)—and now the Digital Public Library of America—to bring thousands of items to the digital realm from its extensive holdings in the James K. Hosmer Special Collections Department. These items include maps, atlases, programs, annual reports, photographs, diaries, advertisements, and trade catalogs.

Our partnership with MDL has not only provided far greater access to these hidden parts of our collections, it has also made patrons much more aware of the significance of our collections and the large number of materials that we could be digitizing. The link to DPLA has further increased our awareness of the potential reach of our collections: DPLA is already the second largest source of referrals to our digital content on MDL. All this has motivated us to increase our digitization activities and place greater emphasis on the role of digital content in our services.

Recently, we have been contributing hundreds of items related to local businesses in the form of large advertising posters, trade catalogs, and over 300 business trade cards from Minneapolis companies. These vividly illustrated materials provide a fascinating view of advertising techniques, local businesses, consumer and industrial goods, social mores and popular culture from the late 19th and early 20th centuries.

Hennepin County Library is committed to serving as Hennepin County’s partner in lifelong learning with programs for babies to seniors, new immigrants, small business owners and students of all ages. It comprises 41 libraries, and has holdings of more than five million books, CDs, and DVDs in 40 world languages. It manages around 1,750 public computers, has 11 library board members, and is one great system serving 1.1 million residents of Hennepin County.

Featured image credit: Detail of 1893 Minneapolis Industrial Exposition Catalog, Minneapolis, Minnesota. Courtesy of Hennepin County Library’s James K. Hosmer Special Collections Library via the Minnesota Digital Library.

All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.

Eric Lease Morgan: My second Python script, dispersion.py

Wed, 2014-11-19 17:54

This is my second Python script, dispersion.py, and it illustrates where common words appear in a text.

#!/usr/bin/env python2

# dispersion.py - illustrate where common words appear in a text
#
# usage: ./dispersion.py <file>

# Eric Lease Morgan <emorgan@nd.edu>
# November 19, 2014 - my second real python script; "Thanks for the idioms, Don!"

# configure
MAXIMUM = 25
POS = 'NN'

# require
import nltk
import operator
import sys

# sanity check
if len( sys.argv ) != 2 :
    print "Usage:", sys.argv[ 0 ], "<file>"
    quit()

# get input
file = sys.argv[ 1 ]

# initialize
with open( file, 'r' ) as handle :
    text = handle.read()
sentences = nltk.sent_tokenize( text )
pos = {}

# process each sentence
for sentence in sentences :

    # POS the sentence and then process each of the resulting words
    for word in nltk.pos_tag( nltk.word_tokenize( sentence ) ) :

        # check for configured POS, and increment the dictionary accordingly
        if word[ 1 ] == POS :
            pos[ word[ 0 ] ] = pos.get( word[ 0 ], 0 ) + 1

# sort the dictionary
pos = sorted( pos.items(), key = operator.itemgetter( 1 ), reverse = True )

# do the work; create a dispersion chart of the MAXIMUM most frequent pos words
text = nltk.Text( nltk.word_tokenize( text ) )
text.dispersion_plot( [ p[ 0 ] for p in pos[ : MAXIMUM ] ] )

# done
quit()

I used the program to analyze two works: 1) Thoreau’s Walden, and 2) Emerson’s Representative Men. From the dispersion plots displayed below, we can conclude a few things:

  • The words “man”, “life”, “day”, and “world” are common between both works.
  • Thoreau discusses water, ponds, shores, and surfaces together.
  • While Emerson seemingly discussed man and nature in the same breath, none of his core concepts are discussed as densely as Thoreau’s.

Thoreau’s Walden

Emerson’s Representative Men

Python’s Natural Language Toolkit (NLTK) is a good library for digital humanists to get started with. I have to learn more, though. The jury is still out regarding which is better, Perl or Python. So far, they have more in common than differences.

OCLC Dev Network: Now Playing: Finding a Common Language

Wed, 2014-11-19 17:00

If you missed our recent webinar on Finding a Common Language you can now view the full recording.

Harvard Library Innovation Lab: Link roundup November 19, 2014

Wed, 2014-11-19 16:34

Yipes cripes we’ve got our winter coats on today. Sit down with a hot beverage and enjoy these internet finds.

The FES Watch Is an E-Ink Chameleon – Design Milk

An E-Ink watch. Why isn’t E-Ink used in more places?

The Ingenuity and Beauty of Creative Parchment Repair in Medieval Books | Colossal

Acknowledge the imperfect object. Could be some creative ways to repair damaged children’s books.

A Brief History of Failure

I love the idea that failed tech can loop back around. Who knows what we’ve tossed in the trash bin.

Lost At The Museum? This Ingenious 3-D Map Makes Navigation A Cinch

This would be a killer map of the stacks.

Letterpress Printers Are Running Out Of @ Symbols And Hashtags

The boom of the @ sign.

John Miedema: Genre, gender and agency analysis using Parts of Speech in Watson Content Analytics. A simple demonstration.

Wed, 2014-11-19 16:03

Genre is often applied as a static classification: fiction, non-fiction, mystery, romance, biography, and so on. But the edges of genre are “blurry” (Underwood). The classification of genre can change over time and situation. Ideally, genre and all classifications could be modeled dynamically during content analysis. How can IBM’s Watson Content Analytics (WCA) help analyze genre? Here is a simple demonstration.

In WCA I created a collection of 1368 public domain novels from Open Library. For this demonstration, I obtained author metadata and expressed it as a WCA facet. I did not obtain existing genre metadata. I will demonstrate that I can use author gender to dynamically classify genre for a specific analytical question. In particular, I follow the research of Matthew Jockers and the Nebraska Literary Lab. Can genre be distinguished by the gender of the author? How are action and agency treated differently in male and female genres? This simple demonstration does not answer these questions, but shows how WCA can be used to give insight into literature.

In Figure 1, the WCA Author facet is used to filter the collection to ten male authors: Walter Scott, Robert Louis Stevenson, and others. The idea is to dynamically generate a male genre by the selection of male authors. (Simple, but note that a complex array of facets could be used to quickly define a male genre.)


In Figure 2, the WCA Parts-of-Speech analysis lists frequently used verbs in the collection subset, the male genre: tempt, condemn, struggle. Some values might be considered action verbs, but further analysis is required.

 

In Figure 3, the verb “struggle” is seen in the context of its source, the Waverley novels: “the Bohemian struggled to detain Quentin”, “to struggle with the sea”. This view can be used to determine the gender of characters and the actions they are performing, and to interpret agency.

 

In Figure 4, a new search is performed, this time filtering for female authors: Jane Austen, Maria Edgeworth, Susan Ferrier, and others. In this case, the idea is to dynamically generate a female genre by selecting female authors.

 

In Figure 5, the WCA Parts-of-Speech analysis lists frequently used verbs in the female genre: mix, soothe, furnish. At a glance, there is an obvious qualitative difference from the verbs in the male genre.


Finally in Figure 6, the verb “furnish” is seen in the context of its source in Jane Austen’s Letters, “Catherine and Lydia … a walk to Meryton was necessary to amuse their morning hours and furnish conversation.” In this case, furnish does not refer to the literal furnishing of a house, but to the facilitation of dialog. As before, detailed content inspection is needed to analyze and interpret agency.

HangingTogether: Libraries & Research: Supporting change in the university

Wed, 2014-11-19 14:43

[This is the third in a short series on our 2014 OCLC Research Library Partnership meeting, Libraries and Research: Supporting Change/Changing Support. You can read the first and second posts and also refer to the event webpage that contains links to slides, videos, photos, and a Storify summary.]

[Driek Heesakkers, Paolo Manghi, Micah Altman, Paul Wouters, and John Scally]

As if changes in research are not enough, changes are also coming at the university level and at the national level. The new imperatives of higher education around Open Access, Open Data and Research Assessment are impacting the roles of libraries in managing and providing access to e-research outputs, in helping define the university’s data management policies, and demonstrating value in terms of research impact. This session explored these issues and more!

John MacColl (University Librarian at University of St Andrews) [link to video] opened the session, speaking briefly about the UK context to illustrate how libraries are taking up new roles within academia. John presented this terse analysis of the landscape (and I thank him for providing notes!):

  • Professionally, we live increasingly in an inside-out environment. But our academic colleagues still require certification and fixity, and their reputation is based on a necessarily conservative world view (tied up with traditional modes of publishing and tenure)
  • Business models are in transition. The first phase of transition was from publisher print to publisher digital. We are now in a phase which he terms deconstructive, based on a reassessment of the values of scholarly publishing, driven by the high cost of journals.
  • There are several reasons for this: among the main ones are the high costs of publisher content, and our responsibility as librarians for the sustainability of the scholarly record; another is the emergence of public accountability arguments – the public has paid for this scholarship, they have the right to access outputs.
  • What these three new areas of research library activity have in common is the intervention of research funders into the administration of research within universities, although the specifics vary considerably in different nations.

John Scally (Director of Library and University Collections, University of Edinburgh) [link to video] added to the conversation, speaking about the role of the research library in research data management (RDM) at the University of Edinburgh. From John’s perspective, the library is a natural place for RDM work to happen because the library has been in the business of managing and curating stuff for a long time and services are at the core of the library. Naturally, making content available in different ways is a core responsibility of the library. Starting research data conversations around policy and regulatory compliance is difficult — it’s easier to frame RDM as a problem of storage, discovery and reuse of data. At Edinburgh they tried to frame discussions around how the library can help researchers be more competitive and do better research. If a researcher comes to the web page about data management plans (say at midnight, the night before a grant proposal is due), that webpage should do something useful at the time of need, not direct researchers to come to the library during the day. Key takeaways: blend RDM into core services, not a side business. Make sure everyone knows who is leading. Make sure the money is there, and you know who is responsible. Institutional policy is a baby step along the way; implementation is most important. RDM and open access are ways of testing (and stressing) your systems and procedures – don’t ignore fissures and gaps. There is an interesting correlation between RDM and the open access repository: since RDM was implemented at Edinburgh, deposits of papers have increased.

Driek Heesakkers (Project Manager at the University of Amsterdam Library) [link to video] told us about RDM at the University of Amsterdam and in the Netherlands. The Dutch landscape differs from others; he characterized it as “bland” – not a lot of differences between institutions in terms of research outputs. There is a rather complicated array of institutions for humanities, social science, health science, etc., all trying to define their roles in RDM. For organizations that are mandated to capture data, it’s vital that they not just show up at the end of the process to scoop up data, but that they be embedded in the environment where the work is happening, where tools are being used. Policy and infrastructure need to be rolled out together. Don’t reinvent the wheel – if there are commercial partners or cloud services that do the work well, that’s all for the good. What’s the role of the library? We are not in the lead with policy but we help to interpret and implement — similarly with technology. The big opportunity is in the support – if you have faculty liaisons, you should be using them for data support. Storage is boring but necessary. The market for commercial solutions is developing, which is good news – he’d prefer to buy, not build, when appropriate. This is a time for action — we can’t be wary or cautious.

Switching gears away from RDM, Paul Wouters (Director of the Centre for Science and Technology Studies at the University of Leiden) [link to video] spoke about the role of libraries in research assessment. His organization combines fundamental research and services for institutions and individual researchers. With research becoming increasingly international and interdisciplinary, it’s vital that we develop methods of monitoring novel indicators. Some researchers have become, ironically and paradoxically, fond of assessment (this may be tied up with the move towards the quantified self). However, self-assessment can be nerve-wracking and may not return useful information. Managers are also interested in individual assessment because it may help them give feedback. Altmetrics do not correlate closely to citation metrics, and can vary considerably across disciplines. It’s important to think about the meaning of various ways of measuring impact. As an example of other ways of measuring, Paul presented the ACUMEN (Academic Careers Understood through Measurement and Norms) project, which allows researchers to take the lead and tell a story given evidence from their portfolios. An ACUMEN profile includes a career narrative supported by expertise, outputs, and influence. Giving a stronger voice to researchers is more positive than researchers not being involved in or misunderstanding (and resenting) indicators.

Micah Altman (Director of Research, Massachusetts Institute of Technology Libraries) [link to video] discussed the importance of researcher identification and the need to uniquely identify researchers in order to manage the scholarly record and to support assessment. Micah spoke in part as a member of a group that OCLC Research colleague Karen Smith-Yoshimura led, the Registering Researchers Task Group (their report, Registering Researchers in Authority Files, is now available). It explored motivations, state of the practice, observations and recommendations. The problem is that there is more stuff, more digital content, and more people (the average number of authors on journal articles has gone up, in some cases way up). To put it mildly, disambiguating names is not a small problem. A researcher may have one or more identifiers, which may not link to one another and may come from different sources. The task group looked at the problem not only from the perspective of the library, but also from the perspective of various stakeholders (publishers, universities, researchers, etc.). Approaches to managing name identifiers result in some very complicated (and not terribly efficient) workflows. Normalizing and regularizing this data has big potential payoffs in terms of reducing errors in analytics, and creating a broad range of new (and more accurate) measures. Fortunately, with a recognition of the benefits, interoperability between identifier systems is increasing, as is the practice of assigning identifiers to researchers. One of the missing pieces is not only identifying researchers but also their roles in a given piece of work (this is a project that Micah is working on with other collaborators). What are steps that libraries can take? Prepare to engage! Work across stakeholder communities; demand more than PDFs from publishers. And prepare for more (and different) types of measurement.

Paolo Manghi (Researcher at the Institute of Information Science and Technologies “A. Faedo” (ISTI), Italian National Research Council) [link to video] talked about the data infrastructures that support access to the evolving scholarly record and the requirements needed for different data sources (repositories, CRIS systems, data archives, software archives, etc.) to interoperate. Paolo spoke as a researcher, but also as the technical manager of the EU-funded OpenAIRE project. This project started in 2009 out of a strong open access push from the European Commission. The project initially collected metadata and information about access to research outputs. The scope was expanded to include not only articles but also other research outputs. The work is done through human input as well as technical infrastructure. They rely on input from repositories, and also use software developed elsewhere. Information is funneled via 32 national open access desks. They have developed numerous guidelines (for metadata, for data repositories, and for CRIS managers to export data to be compatible with OpenAIRE). The project fills three roles — a help desk for national agencies, a portal (linking publications to research data and information about researchers) and a repository for data and articles that are otherwise homeless (Zenodo). Collecting all this information into one place allows for some advanced processes like deduplication, identifying relationships, and demonstrating productivity, compliance, and geographic distribution. OpenAIRE interacts with other repository networks, such as SHARE (US) and ANDS (Australia). The forthcoming Horizon 2020 framework will pose some significant challenges for researchers and service providers because it puts a larger emphasis on access to non-published outputs.

The session was followed by a panel discussion.

I’ll conclude tomorrow with a final posting, wrapping up this series.


LITA: Cataloging Board Games

Wed, 2014-11-19 13:00

Since September, I have been immersed in the world of games and learning.  I co-wrote a successful grant application to create a library-based Center for Games and Learning.

The project is being  funded through a Sparks Ignition! Grant from the Institute of Museum and Library Services.

One of our first challenges has been to decide how to catalog the games.  I located this presentation on SlideShare.  We have decided to catalog the games as Three Dimensional Objects (Artifact) and use the following MARC fields:

  • MARC 245  Title Statement
  • MARC 260  Publication, Distribution, Etc.
  • MARC 300  Physical Description
  • MARC 500  General Note
  • MARC 508  Creation/Production Credits
  • MARC 520  Summary, Etc.
  • MARC 521  Target Audience
  • MARC 650  Topical Term
  • MARC 655  Index Term—Genre/Form

There are many other fields that we could use, but we decided to keep it as simple as possible.  We decided not to interfile the games and, instead, to create a separate collection for the Center for Games and Learning.  Due to this, we will not be assigning a Library of Congress Classification to them, but will instead be shelving the games in alphabetical order.  We also created a material type of “board games.”
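
To make that field set concrete, here is a minimal sketch in plain Python of what a board-game record limited to those fields might look like. The game, publisher, and every field value below are invented for illustration, and the dictionary layout is just a convenient representation rather than any particular system’s import format.

# Hypothetical board-game record using only the MARC fields listed above.
# Keys are MARC tags; values are (indicators, {subfield code: value}).
# All bibliographic details are invented for illustration.
board_game_record = {
    "245": ("00", {"a": "Frontier settlers :", "b": "a trading and building game for 3-4 players."}),
    "260": ("  ", {"a": "Minneapolis, MN :", "b": "Example Games,", "c": "2013."}),
    "300": ("  ", {"a": "1 game board, 95 cards, 120 playing pieces, 2 dice, 1 rule book ;", "c": "in box 30 x 30 x 8 cm."}),
    "500": ("  ", {"a": "Playing time: 60-90 minutes."}),
    "508": ("  ", {"a": "Game design by A. Designer."}),
    "520": ("  ", {"a": "Players compete to settle a frontier by trading resources and building roads and towns."}),
    "521": ("  ", {"a": "Ages 10 and up."}),
    "650": (" 0", {"a": "Board games."}),
    "655": (" 7", {"a": "Board games.", "2": "lcgft"}),
}

# Print the record in a rough, human-readable MARC-like layout.
for tag in sorted(board_game_record):
    indicators, subfields = board_game_record[tag]
    print(tag, indicators, " ".join("$%s%s" % (code, value) for code, value in subfields.items()))

A real cataloging module would add fixed fields and holdings information on top of this, but the sketch shows how little is needed when a collection is shelved alphabetically rather than classified.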

For the Center for Games and Learning we are also working on a website that will be live in the next few months.  The project is still in its infancy and I will be sharing more about this project in upcoming blog posts.

Do any LITA blog readers have board games in your libraries? If so, what MARC fields do you use to catalog the games?

State Library of Denmark: SB IT Preservation at ApacheCon Europe 2014 in Budapest

Wed, 2014-11-19 12:56

OK, actually only two of us are here. It would be great to have the whole department at the conference; then we could cover more tracks and start discussing what we will be using next week ;-)

The first keynote was mostly an introduction to The Apache Software Foundation along with some key numbers. The second keynote (in direct extension of the first) was an interview with best-selling author Hugh Howey, who self-published ‘Wool’ in 2011. A very inspiring interview! Maybe I could be an author too – with a little help from you? One of the things he talked about was how he thinks

“… the future looks more and more like the past”

in the sense that storytelling in the past was collaborative storytelling around the camp fire. Today open source software projects are collaborative, and maybe authors should try it too? Hugh Howey’s book has grown with help from fans and fan fiction.

The coffee breaks and lunches have been great! And the cake has been plentiful!

So the Apache Software Foundation’s 15th birthday is indeed being celebrated!

Did someone say that Hungary was known for cakes?

And yes, there has also been lots and lots of interesting presentations of lots and lots of interesting Apache tools. Where to start? There is one that I want to start using on Monday: Apache Tez. The presentation was by Hitesh Shah from Hortonworks and the slides are available online.

There are quite a few that I want to look into a bit more and experiment with, such as Spark and Cascading, and I think my colleague can add a few more. There are some that we will tell our colleagues at home about, and hope that they have time to experiment… And now I’ll go and hear about Quadrupling your Elephants!

Note: most of the slides are online. Just look at http://events.linuxfoundation.org/events/apachecon-europe/program/slides.


Open Knowledge Foundation: The Public Domain Review brings out its first book

Wed, 2014-11-19 12:44

Open Knowledge project The Public Domain Review is very proud to announce the launch of its very first book! Released through the newly born spin-off project the PDR Press, the book is a selection of weird and wonderful essays from the project’s first three years, and shall be (we hope) the first of an annual series showcasing in print form essays from the year gone by. Given that there’s three years to catch up on, the inaugural incarnation is a special bumper edition, coming in at a healthy 346 pages, and jam-packed with 146 illustrations, more than half of which are newly sourced especially for the book.

Spread across six themed chapters – Animals, Bodies, Words, Worlds, Encounters and Networks – there is a total of thirty-four essays from a stellar line up of contributors, including Jack Zipes, Frank Delaney, Colin Dickey, George Prochnik, Noga Arikha, and Julian Barnes.

What’s inside? Volcanoes, coffee, talking trees, pigs on trial, painted smiles, lost Edens, the social life of geometry, a cat called Jeoffry, lepidopterous spying, monkey-eating poets, imaginary museums, a woman pregnant with rabbits, an invented language drowning in umlauts, a disgruntled Proust, frustrated Flaubert… and much much more.

Order by 26th November to benefit from a special reduced price and delivery in time for Christmas.

If you want to get the book in time for Christmas (and we do think it is a fine addition to any Christmas list!), then please make sure to order before midnight (PST) on 26th November. Orders placed before this date will also benefit from a special reduced price!

Please visit the dedicated page on The Public Domain Review site to learn more and also buy the book!

FOSS4Lib Recent Releases: PERICLES Extraction Tool - 1.0

Wed, 2014-11-19 01:02

Last updated November 18, 2014. Created by Peter Murray on November 18, 2014.

Package: PERICLES Extraction Tool
Release Date: Thursday, October 30, 2014

FOSS4Lib Updated Packages: PERICLES Extraction Tool

Wed, 2014-11-19 01:01

Last updated November 18, 2014. Created by Peter Murray on November 18, 2014.

The PERICLES Extraction Tool (PET) is open source (Apache 2 licensed) Java software for the extraction of significant information from the environment where digital objects are created and modified. This information supports object use and reuse, e.g. for better long-term preservation of data. The tool was developed entirely for the PERICLES EU project (http://www.pericles-project.eu/) by Fabio Corubolo, University of Liverpool, and Anna Eggers, Göttingen State and University Library.

  • Package Type: Data Preservation and Management
  • License: Apache 2.0
  • Development Status: In Development
  • Operating System: Linux, Mac, Windows
  • Programming Language: Java, Perl
  • Open Hub Link: https://www.openhub.net/p/pericles-pet

Library Tech Talk (U of Michigan): Quick Links and Search Frequency

Wed, 2014-11-19 00:00
Does adding links to popular databases change user searching behavior? An October 2013 change to the University of Michigan Library’s front page gave us the opportunity to conduct an empirical study, which shows that user behavior has changed since the new front page design was launched.

DuraSpace News: DSpace User Group Meeting Held in Berlin

Wed, 2014-11-19 00:00

Winchester, MA  In October, 26 participants from 16 institutions attended a German DSpace User Group Meeting hosted by the University Library of the Technische Universität Berlin.

DuraSpace News: REGISTER: ADVANCED DSPACE TRAINING RESCHEDULED

Wed, 2014-11-19 00:00
Winchester, MA  We are happy to announce the re-scheduled dates for the in-person, 3-day Advanced DSpace Course in Austin, March 17-19, 2015. The total cost of the course is being underwritten with generous support from the Texas Digital Library and DuraSpace. As a result, the registration fee for the course is only $250 for DuraSpace Members and $500 for Non-Members (meals and lodging not included). Seating will be limited to 20 participants.

DuraSpace News: Hot Topics Webinar Recording Available

Wed, 2014-11-19 00:00

Winchester, MA   Hot Topics: The DuraSpace Community Webinar Series presents big-picture strategic issues by matching community experts with current topics of interest.  Each webinar is recorded and made available at http://duraspace.org/hot-topics

DuraSpace News: Yaffle: Memorial University’s VIVO-Based Solution to Support Knowledge Mobilization in Newfoundland and Labrador

Wed, 2014-11-19 00:00

One particular VIVO project that demonstrates the spirit of open access principles is Yaffle. Many VIVO implementations provide value to their host institutions, ranging from front-end access to authoritative organizational information to highlights of works created in the social sciences and arts and humanities. Yaffle extends beyond its host institution and provides a cohesive link between Memorial University and citizens from Newfoundland and Labrador. The prospects for launching Yaffle in other parts of Canada will be realized in the near future. 

Ed Summers: On Forgetting

Tue, 2014-11-18 21:17

After writing about the Ferguson Twitter archive a few months ago three people have emailed me out of the blue asking for access to the data. One was a principal at a small, scaryish defense contracting company, and the other two were from a prestigious university. I’ve also had a handful of people interested where I work at the University of Maryland.

I ignored the defense contractor. Maybe that was mean, but I don’t want to be part of that. I’m sure they can go buy the data if they really need it. My response to the external academic researchers wasn’t much more helpful since I mostly pointed them to Twitter’s Terms of Service which says:

If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and/or User IDs.

You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweets and/or User Objects per user of your Service, per day.

Any Content provided to third parties via non-automated file download remains subject to this Policy.

It’s my understanding that I can share the data with others at the University of Maryland, but I am not able to give it to the external parties. What I can do is give them the Tweet IDs. But there are 13,480,000 of them.

So that’s what I’m doing today: publishing the tweet ids. You can download them from the Internet Archive:

https://archive.org/details/ferguson-tweet-ids

I’m making it available using the CC-BY license.

Hydration

On the one hand, it seems unfair that this portion of the public record is unshareable in its most information rich form. The barrier to entry to using the data seems set artificially high in order to protect Twitter’s business interests. These messages were posted to the public Web, where I was able to collect them. Why are we prevented from re-publishing them since they are already on the Web? Why can’t we have lots of copies to keep stuff safe? More on this in a moment.

Twitter limits users to 180 requests every 15 minutes. A user is effectively a unique access token. Each request can hydrate up to 100 Tweet IDs using the statuses/lookup REST API call.

180 requests * 100 tweets = 18,000 tweets/15 min = 72,000 tweets/hour

So to hydrate all of the 13,480,000 tweets will take about 7.8 days. This is a bit of a pain, but realistically it’s not so bad. I’m sure people doing research have plenty of work to do before running any kind of analysis on the full data set. And they can use a portion of it for testing as it is downloading. But how do you download it?

Gnip, who were recently acquired by Twitter, offer a rehydration API. Their API is limited to tweets from the last 30 days, and similar to Twitter’s API you can fetch up to 100 tweets at a time. Unlike the Twitter API you can issue a request every second. So this means you could download the results in about 1.5 days. But these Ferguson tweets are more than 30 days old. And a Gnip account costs some indeterminate amount of money, starting at $500…

I suspect there are other hydration services out there. But I adapted twarc, the tool I used to collect the data, which already handled rate-limiting, to also do hydration. Once you have the tweet IDs in a file, you just need to install twarc and run it. Here’s how you would do that on an Ubuntu instance:

sudo apt-get install python-pip
sudo pip install twarc
twarc.py --hydrate ids.txt > tweets.json

After a week or so, you’ll have the full JSON for each of the tweets.
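
For anyone curious what a hydration client is doing during that week, below is a rough sketch of the batching and rate-limiting pattern described above. It is not twarc’s actual code; fetch_batch is a hypothetical stand-in for an authenticated call to the statuses/lookup endpoint, and the constants simply mirror the 100-IDs-per-request, 180-requests-per-15-minutes limits mentioned earlier.

import json
import time

BATCH_SIZE = 100            # statuses/lookup accepts up to 100 tweet IDs per request
REQUESTS_PER_WINDOW = 180   # per-token rate limit
WINDOW_SECONDS = 15 * 60    # length of the rate-limit window

def fetch_batch(tweet_ids):
    # Hypothetical helper: call statuses/lookup with an authenticated client
    # and return a list of tweet dictionaries. Deleted and protected tweets
    # simply do not come back.
    raise NotImplementedError

def hydrate(id_file, out_file):
    with open(id_file) as ids, open(out_file, "w") as out:
        batch = []
        requests_made = 0
        window_start = time.time()
        for line in ids:
            batch.append(line.strip())
            if len(batch) < BATCH_SIZE:
                continue
            if requests_made == REQUESTS_PER_WINDOW:
                # sleep out the remainder of the 15-minute window
                time.sleep(max(0, WINDOW_SECONDS - (time.time() - window_start)))
                requests_made = 0
                window_start = time.time()
            for tweet in fetch_batch(batch):
                out.write(json.dumps(tweet) + "\n")
            requests_made += 1
            batch = []
        if batch:
            for tweet in fetch_batch(batch):
                out.write(json.dumps(tweet) + "\n")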

Archive Fever

Well, not really. You will have most of them. But you won’t have the ones that have been deleted. If a user decided to remove a Tweet they made, or decided to remove their account entirely, you won’t be able to get their Tweets back from Twitter using their API. I think it’s interesting to consider Twitter’s Terms of Service as what Katie Shilton would call a value lever.

The metadata-rich JSON data (which often includes geolocation and other behavioral data) wasn’t exactly posted to the Web in the typical way. It was made available through a Web API designed to be used directly by automated agents, not people. Sure, a tweet appears on the Web, but it’s in with the other half a trillion Tweets out on the Web, all the way back to the first one. Requiring researchers to go back to the Twitter API to get this data and not allowing it to circulate freely in bulk means that users have an opportunity to remove their content. Sure, it has already been collected by other people, and it’s pretty unlikely that the NSA are deleting their tweets. But in a way Twitter is taking an ethical position that allows their publishers to remove their data. To exercise their right to be forgotten. Removing a teensy bit of informational toxic waste.

As any archivist will tell you, forgetting is an essential and unavoidable part of the archive. Forgetting is the why of an archive. Negotiating what is to be remembered and by whom is the principal concern of the archive. Ironically it seems it’s the people who deserve it the least, those in positions of power, who are often most able to exercise their right to be forgotten. Maybe putting a value lever back in the hands of the people isn’t such a bad thing. If I were Twitter I’d highlight this in the API documentation. I think we are still learning how the contours of the Web fit into the archive. I know I am.

If you are interested in learning more about value levers you can download a pre-print of Shilton’s Value Levers: Building Ethics into Design.

Dan Scott: Social networking for researchers: ResearchGate and their ilk

Tue, 2014-11-18 20:48

The Centre for Research in Occupational Safety and Health asked me to give a lunch'n'learn presentation on ResearchGate today, which was a challenge I was happy to take on... but I took the liberty of stretching the scope of the discussion to focus on social networking in the context of research and academics in general, recognizing four high-level goals:

  1. Promotion (increasing citations, finding work positions)
  2. Finding potential collaborators
  3. Getting advice from experts in your field
  4. Accessing others' work

I'm a librarian, so naturally my take veered quickly into the waters of copyright concerns and the burden (to the point of indemnification) that ResearchGate, Academia.edu, Mendeley, and other such services put on their users to ensure that they are in compliance with copyright and the researchers' agreements with publishers... all while heartily encouraging their users to upload their work with a single click. I also dove into the darker waters of r/scholar, LibGen, and SciHub, pointing out the direct consequences that our university has suffered due to the abuse of institutional accounts at the library proxy.

Happily, the audience opened up the subject of publishing in open access journals--not just from a "covering our own butts" perspective, but also from the position of the ethical responsibility to share knowledge as broadly as possible. We briefly discussed the open access mandates that some granting agencies have put in place, particularly in the States, as well as similar Canadian initiatives that have occurred or are still emerging with respect to public funds (SSHRC and the Tri-Council). And I was overjoyed to hear a suggestion that, perhaps, research funded by the Laurentian University Research Fund should be required to publish in an open access venue.

I'm hoping to take this message back to our library and, building on Kurt de Belder's vision of the library as a Partner in Knowledge, help drive our library's mission towards assisting researchers in not only accessing knowledge, but also most effectively sharing and promoting the knowledge they create.

That leaves lots of work to do, based on one little presentation.

Karen Coyle: Classes in RDF

Tue, 2014-11-18 18:39
RDF allows one to define class relationships for things and concepts. The RDFS1.1 primer describes classes succinctly as:
Resources may be divided into groups called classes. The members of a class are known as instances of the class. Classes are themselves resources. They are often identified by IRIs and may be described using RDF properties. The rdf:type property may be used to state that a resource is an instance of a class.

This seems simple, but it is in fact one of the primary areas of confusion about RDF.

If you are not a programmer, you probably think of classes in terms of taxonomies -- genus, species, sub-species, etc. If you are a librarian you might think of classes in terms of classification, like Library of Congress or the Dewey Decimal System. In these, the class defines certain characteristics of the members of the class. Thus, with two classes, Pets and Veterinary science, you can have:
Pets
- dogs
- cats

Veterinary science
- dogs
- cats

In each of those, dogs and cats have different meanings because the class provides a context: either as pets, or as information about them as treated in veterinary science.

For those familiar with XML, it has similar functionality because it makes use of nesting of data elements. In XML you can create something like this:
<drink>
    <lemonade>
        <price>$2.50</price>
        <amount>20</amount>
    </lemonade>
    <pop>
        <price>$1.50</price>
        <amount>10</amount>
    </pop>
</drink>

and it is clear which price goes with which type of drink, and that the bits directly under the <drink> level are all drinks, because that's what <drink> tells you.

Now you have to forget all of this in order to understand RDF, because RDF classes do not work like this at all. In RDF, the "classness" is not expressed hierarchically, with a class defining the elements that are subordinate to it. Instead it works in the opposite way: the descriptive elements in RDF (called "properties") are the ones that define the class of the thing being described. Properties carry the class information through a characteristic called the "domain" of the property. The domain of the property is a class, and when you use that property to describe something, you are saying that the "something" is an instance of that class. It's like building the taxonomy from the bottom up.

This only makes sense through examples. Here are a few:
1. "has child" is of domain "Parent".

If I say "X - has child - 'Fred'" then I have also said that X is a Parent because every thing that has a child is a Parent.

2. "has Worktitle" is of domain "Work"

If I say "Y - has Worktitle - 'Der Zauberberg'" then I have also said that Y is a Work because every thing that has a Worktitle is a Work.
In essence, X or Y is an identifier for something that is of unknown characteristics until it is described. What you say about X or Y is what defines it, and the classes put it in context. This may seem odd, but if you think of it in terms of descriptive metadata, your metadata describes the "thing in hand"; the "thing in hand" doesn't describe your metadata. 
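
As a small illustration of that bottom-up logic, here is a sketch using Python's rdflib. The vocabulary states that the made-up property ex:hasChild has domain ex:Parent; the application then looks the domain up and records that X is a Parent, just as described above. The ex: namespace and everything in it are invented for this example.

from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# vocabulary: "has child" is of domain "Parent"
g.add((EX.hasChild, RDFS.domain, EX.Parent))

# instance data: X - has child - "Fred"
g.add((EX.X, EX.hasChild, Literal("Fred")))

# the application looks up each property's domain and types the subject accordingly
for s, p, o in list(g):
    for domain in list(g.objects(p, RDFS.domain)):
        g.add((s, RDF.type, domain))

print((EX.X, RDF.type, EX.Parent) in g)   # True: X is a Parent

Running that same loop over richer data is essentially what an RDFS-aware application does when it "discerns" classes from the properties used in a description.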

Like in real life, any "thing" can have more than one context and therefore more than one class. X, the Parent, can also be an Employee (in the context of her work), a Driver (to the Department of Motor Vehicles), a Patient (to her doctor's office). The same identified entity can be an instance of any number of classes.
"has child" has domain "Parent"
"has licence" has domain "Driver"
"has doctor" has domain "Patient"

X - has child - "Fred"  = X is a Parent 
X - has license - "234566"  = X is a Driver
X - has doctor - URI:765876 = X is a Patient

Classes are defined in your RDF vocabulary, as are the domains of properties. The above statements require an application to look at the definition of the property in the vocabulary to determine whether it has a domain, and then to treat the subject, X, as an instance of the class described as the domain of the property. There is another way to provide the class as context in RDF - you can declare it explicitly in your instance data, rather than, or in addition to, having the class characteristics inherent in your descriptive properties when you create your metadata. The term used for this, based on the RDF standard, is "type," in that you are assigning a type to the "thing." For example, you could say:
X - is type - Parent
X - has child - "Fred"

This can be the same class as you would discern from the properties, or it could be an additional class. It is often used to simplify the programming needs of those working in RDF because it means the program does not have to query the vocabulary to determine the class of X. You see this, for example, in BIBFRAME data. The second line in this example gives two classes for this entity:
<http://bibframe.org/resources/FkP1398705387/8929207instance22>
a bf:Instance, bf:Monograph .
One thing that classes do not do, however, is to prevent your "thing" from being assigned the "wrong class." You can, however, define your vocabulary to make "wrong classes" apparent. To do this you define certain classes as disjoint, for example a class of "dead" would logically be disjoint from a class of "alive." Disjoint means that the same thing cannot be of both classes, either through the direct declaration of "type" or through the assignment of properties. Let's do an example:
"residence" has domain "Alive"
"cemetery plot location" has domain "Dead"
"Alive" is disjoint "Dead" (you can't be both alive and dead)

X - is type - "Alive"                                         (X is of class "Alive")
X - cemetery plot location - URI:9494747      (X is of class "Dead")

Nothing stops you from creating this contradiction, but some applications that try to use the data will be stumped because you've created something that, in RDF-speak, is logically inconsistent. What happens next is determined by how your application has been programmed to deal with such things. In some cases, the inconsistency will mean that you cannot fulfill the task the application was attempting. If you reach a decision point where "if Alive do A, if Dead do B" then your application may be stumped and unable to go on.
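
Here is a sketch of how an application might surface that inconsistency, again with rdflib and invented ex: terms: the vocabulary declares ex:Alive and ex:Dead disjoint with owl:disjointWith, the domain rule from the earlier sketch types X as both, and a hand-rolled check (not a full OWL reasoner) flags anything that ends up in two disjoint classes.

from rdflib import Graph, Namespace, RDF, RDFS, URIRef
from rdflib.namespace import OWL

EX = Namespace("http://example.org/")
g = Graph()

# vocabulary: "cemetery plot location" has domain "Dead"; "Alive" is disjoint with "Dead"
g.add((EX.cemeteryPlotLocation, RDFS.domain, EX.Dead))
g.add((EX.Alive, OWL.disjointWith, EX.Dead))

# instance data: X is declared Alive, yet has a cemetery plot location
g.add((EX.X, RDF.type, EX.Alive))
g.add((EX.X, EX.cemeteryPlotLocation, URIRef("http://example.org/plot/9494747")))

# apply the same domain rule as before
for s, p, o in list(g):
    for domain in list(g.objects(p, RDFS.domain)):
        g.add((s, RDF.type, domain))

# flag anything typed with two classes that the vocabulary declares disjoint
for c1, _, c2 in g.triples((None, OWL.disjointWith, None)):
    for thing in set(g.subjects(RDF.type, c1)) & set(g.subjects(RDF.type, c2)):
        print("Logically inconsistent:", thing, "is both", c1, "and", c2)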

All of this is to be kept in mind for the next blog post, which talks about the effect of class definitions on bibliographic data in RDF.
