Those who have been paying attention to the cutting edge of digital libraries no doubt know about the Hydra project headed up by Stanford. Hydra is a digital repository system that is built using Ruby and is designed to accept the full range of digital object types that a large research library must manage. It is built on top of Fedora and Solr, with Blacklight as the default front-end, and one doesn’t normally associate ease of installation with a stack like that. Heck, you could spend a week just getting all of the dependencies installed, configured, and up and running.
So color me surprised when the Digital Public Library of America, Stanford University, and the DuraSpace organization announced that IMLS had awarded them a $2 million National Leadership Grant to develop “Hydra-in-a-Box”. Just as it sounds, the goal is to “build, bundle, and promote a feature-complete, robust digital repository that is easy to install, configure, and maintain—in short, a next-generation digital repository that will work for institutions large and small, and is capable of running as a hosted service.”
That is no small goal, and a laudable one at that. But…gosh. What a distance there is to travel to get there. The project has it pegged at 30 months, so two and a half years. That sounds about right, and so far Tom Cramer has built one of the most broad-based coalitions I’ve seen in academic libraries around Hydra, so you won’t find me betting against him. Especially since he just landed $2 million to help him build out his pet project. So as much as it pains this Cal Bear to say it, Go Stanford!
Summer is right around the corner, and a long-held tradition in the public library community is the summer reading program. Though synonymous with youth and young adult services, summer reading is worth a revisit by adults.
Science fiction is a gateway
I believe there is a positive correlation between reading science fiction novels and genuine interest in emerging technology. When I was younger, I loved science fiction and fantasy. My interests ranged from A Princess of Mars to The Hitchhiker’s Guide to the Galaxy. The Twilight Zone was a mark of my childhood. What I read and watched informed my psyche and furthered my interests in futuristic technology that modern humans could only dream of. The bottom line is that these books sparked an interest. Almost all tech heads I know love science fiction and fantasy. Not everyone is into books, but most science fiction films are based on alternate worlds created by authors like Isaac Asimov and Philip K. Dick. Authors of science fiction and fantasy push the envelope on physics, technology, psychology and history. These novels take place in the “future”, a fictional past or serve as social commentary. They can be cautionary tales or an impetus for the reader to become proactive in current affairs. I’m sure no one wants to live in a world similar to Pat Frank’s Alas, Babylon.
A few suggestions for your reading list
In 2011 NPR published a fan-selected list of the top 100 science-fiction and fantasy books for summer reading. While selecting the best science fiction/fantasy book of all time may be a point of contention amongst staunch fans, settling on a single winner is impractical.
I went ahead and selected my favorites from NPR’s list as suggestions for summer reading. There are a few that are on my personal reading wish list and many are on my re-read wish list. Which eager reader doesn’t have a wish list?
If you went to high school in the United States, you were probably forced to read these. You probably had to analyze the themes, tone, characters, etc. As a result the mere mention of them is trite, but they more than deserve their place on this list.
1984 by George Orwell
Fahrenheit 451 by Ray Bradbury
Brave New World by Aldous Huxley
Slaughterhouse-Five by Kurt Vonnegut
Frankenstein by Mary Shelley
Some of the best science-fiction/fantasy books are set in universes so vast that they require reader commitment and the ability to lift a ten-pound book. Though your eyes may be weary, you won’t be at a loss for the possibilities that are illuminated through the text.
The Lord of the Rings by J.R.R. Tolkien
Dune by Frank Herbert
Foundation by Isaac Asimov
A Game of Thrones by George R.R. Martin
The Giver by Lois Lowry (not on NPR’s list)
A Princess of Mars by Edgar Rice Burroughs (not on NPR’s list)
Do Androids Dream of Electric Sheep? by Philip K. Dick
The Andromeda Strain by Michael Crichton
The Gunslinger (The Dark Tower Series) by Stephen King
Outlander by Diana Gabaldon
1632 by Eric Flint
The Body Snatchers by Jack Finney
Now that I’ve performed my reader’s advisory, what’s on your summer reading list? If you have any recommendations, reply to this post to share with others.
Here's a contribution from Jeff Young, who manages the RDF aspects of VIAF:
Since Wikidata’s introduction to the Linked Data Web in 2014 and subsequent integration of Freebase, it has become a premier example of how to publish and manage Linked Data. Like VIAF, Wikidata uses Schema.org as its core RDF vocabulary and both datasets publish using Linked Data best practices. This consistency should allow applications to treat both datasets as complementary. The main difference will be in the coverage of entities/information, based on their respective sources.
The VIAF RDF changes outlined on the Developer Network blog are intended to further enrich and align the common purpose. Some of the VIAF changes provide additional information to help disambiguate entities, such as schema:location and schema:description. Where possible, schema:names are now language tagged, which should make it easier for applications to select a language-appropriate label for display.
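To illustrate how an application might use language-tagged names, here is a minimal sketch in Python. It assumes the names have already been parsed out of the RDF into simple (label, language tag) pairs. The sample values and the `pick_label` helper are hypothetical and are not taken from VIAF or from any particular library.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical sample data standing in for language-tagged schema:name
# values; an empty tag means the literal carried no language tag.
NAMES: List[Tuple[str, str]] = [
    ("Mark Twain", "en"),
    ("Марк Твен", "ru"),
    ("マーク・トウェイン", "ja"),
    ("Twain, Mark", ""),
]


def pick_label(names: Sequence[Tuple[str, str]],
               preferred: Sequence[str]) -> Optional[str]:
    """Pick a display label for the first preferred language available.

    Falls back to an untagged label, then to any label at all, and
    returns None only when there are no names.
    """
    for lang in preferred:
        wanted = lang.lower()
        for label, tag in names:
            tag = tag.lower()
            # Accept an exact match ("en") or a regional variant ("en-us").
            if tag == wanted or tag.startswith(wanted + "-"):
                return label
    for label, tag in names:
        if not tag:
            return label
    return names[0][0] if names else None


if __name__ == "__main__":
    print(pick_label(NAMES, ["fr", "ru"]))
    print(pick_label(NAMES, ["de"]))
```

The fallback order (preferred language, then untagged, then anything) is one reasonable choice among several; a real application might prefer a fixed default language instead.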
The biggest change, though, is in the “shape of the data” that gets returned via Linked Data requests. Previously, this was a record-oriented view rather than a concise description of the entity. Like Wikidata, the new response will focus on the entity itself and depend on the related entities to describe themselves.
Alignment with Wikidata is a major step in the evolution of VIAF, which started with RDF/XML representations of name authority clusters in 2009 and transitioned to “primary entities” in 2011. The introduction of VIAF as Schema.org in 2014 extends the audience and integration with Wikidata further strengthens industry standard practices. These steps should help ensure that VIAF remains an authoritative source of entity identifiers and information in the linked web of data.
Note: We expect these RDF changes to be visible on viaf.org April 16, 2015. The bulk distribution will follow shortly after that.
Boston, MA – The Digital Public Library of America (DPLA), Stanford University, and the DuraSpace organization are pleased to announce that their joint initiative has been awarded a $2M National Leadership Grant from the Institute of Museum and Library Services (IMLS). Nicknamed Hydra-in-a-Box, the project aims to foster a new, national, library network through a community-based repository system, enabling discovery, interoperability and reuse of digital resources by people from this country and around the world.
This transformative network is based on advanced repositories that not only empower local institutions with new asset management capabilities, but also interconnect their data and collections through a shared platform.
“At the core of the Digital Public Library of America is our national network of hubs, and they need the systems envisioned by this project,” said Dan Cohen, DPLA’s executive director. “By combining contemporary technologies for aggregating, storing, enhancing, and serving cultural heritage content, we expect this new stack will be a huge boon to DPLA and to the broader digital library community. In addition, I’m thrilled that the project brings together the expertise of DuraSpace, Stanford, and DPLA.”
Each of the partners will fulfill specific roles in the joint initiative. Stanford will use its existing leadership in the Hydra Project to develop core components, in concert with the broader Hydra community. DPLA will focus on the connective tissue between hubs, mapping, and crosswalks to DPLA’s metadata application profile, and infrastructure to support metadata enhancement and remediation. DuraSpace will use its expertise in building and serving repositories, and doing so at scale, to construct the back-end systems for Hydra hosting.
“DuraSpace is excited to provide the infrastructure for this project,” said Debra Hanken Kurtz, DuraSpace CEO. “It aligns perfectly with our mission to steward the scholarly and cultural heritage records and make them accessible for current and future generations. We look forward to working with DPLA and Stanford to support their work and that of the community to ensure a robust and sustainable future for Hydra-in-a-Box.”
Over the project’s 30-month time frame, the partners will engage with libraries, archives, and museums nationwide, especially current and prospective DPLA hubs and the Hydra community, to systematically capture the needs for a next-generation, open source, digital repository. They will collaboratively extend the existing Hydra project codebase to build, bundle, and promote a feature-complete, robust digital repository that is easy to install, configure, and maintain—in short, a next-generation digital repository that will work for institutions large and small, and is capable of running as a hosted service. Finally, starting with DPLA’s own metadata aggregation services, the partners will work to ensure that these repositories have the necessary affordances to support networked aggregation, discovery, management and access to these resources, producing a shared, sustainable, nationwide platform.
“The Hydra Project has already demonstrated enormous traction and value as a best-in-class digital repository for institutions like Stanford,” said Tom Cramer, Chief Technology Strategist at the Stanford University Libraries. “And yet there is so much more to do. This grant will provide the means to rapidly accelerate Hydra’s rate of development and adoption–expanding its community, features and value all at once.”
To find out more about the Hydra-in-a-Box initiative contact Dan Cohen (email@example.com), Tom Cramer (firstname.lastname@example.org) or Debra Hanken Kurtz (email@example.com). An information page is available here: https://wiki.duraspace.org/display/hydra/Hydra+in+a+Box.
The Digital Public Library of America (http://dp.la) strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated over 8.5 million items from over 1,700 institutions. The DPLA is a registered 501(c)(3) non-profit.
DuraSpace (http://duraspace.org) is an independent 501(c)(3) not-for-profit organization providing leadership and innovation for open technologies that promote durable, persistent access to digital data. We collaborate with academic, scientific, cultural, and technology communities by supporting projects (DSpace, Fedora, VIVO) and creating services (DuraCloud, DSpaceDirect, ArchivesDirect) to help ensure that current and future generations have access to our collective digital heritage. Our values are expressed in our organizational byline, “Committed to our digital future.”
About Stanford University Libraries
The Stanford University Libraries (http://library.stanford.edu) is internationally recognized as a leader among research libraries, and in leveraging digital technology to support scholarship in the age of information. It is a founder of both the Hydra Project and the Fedora 4 repository effort, and a leading institution in the International Image Interoperability Framework (IIIF) (http://iiif.io).
About the Hydra Project
The Hydra Project (http://projecthydra.org) is both an open source community and a suite of software that provides a flexible and robust framework for managing, preserving, and providing access to digital assets. The project motto, “One body, many heads,” speaks to the flexibility provided by Hydra’s modern, modular architecture, and the power of combining a robust repository backend (the “body”) with flexible, tailored, user interfaces (“heads”). Co-designed and developed in concert with Fedora 4, the extensible, durable, and widely used repository software, the Hydra/Fedora stack is the centerpiece of a thriving and rapidly expanding open source community poised to provide an easy-to-implement solution.
Code4Lib Journal: Finding and Supporting New Voices: Code4Lib Journal’s Issue 28 on Diversity in Library Technology
Code4Lib Journal: Transforming Knowledge Creation: An Action Framework for Library Technology Diversity
Code4Lib Journal: “What If I Break It?”: Project Management for Intergenerational Library Teams Creating Non-MARC Metadata
In the years leading up to WWII, the French built the Maginot Line as an impregnable barrier against a German invasion:
While the fortification system did prevent a direct attack, it was strategically ineffective, as the Germans invaded through Belgium, going around the Maginot Line.
Copyright maximalists, such as the major academic publishers, are in a similar position. The more effective and thus intrusive the mechanisms they implement to prevent unauthorized access, the more they incentivize "guerilla open access".
Some copyright owners are coming to terms with this phenomenon. Today, Hugh Pickens reports that the first 4 of the 10 episodes of the new season of Game of Thrones have leaked:
The episodes have already been downloaded almost 800,000 times, and that figure was expected to blow past a million downloads by the season 5 premiere. Game of Thrones has consistently set records for piracy, which has almost been a point of pride for HBO. "Our experience is [piracy] leads to more penetration, more paying subs, more health for HBO, less reliance on having to do paid advertising. If you go around the world, I think you're right, Game of Thrones is the most pirated show in the world. Well, you know, that's better than an Emmy."
LG shows the massive scale on which "guerilla open access" is happening in the field of academic journals. As of the study, Library Genesis hosted nearly 23M articles identified by DOI, 15TB of data. The distribution was heavily skewed to the major publishers, representing 77% of Elsevier's DOIs, 73% of Wiley's and 53% of Springer's, although only 36% of all DOIs. To give some idea of the scale, this is about 60% of Ontario's Scholar's Portal, which has 38M.
Although some open access DOIs are included, the motivation to upload them is much less. A recent estimate by Khabsa and Lee Giles is that 24% of all articles are openly accessible on the Web; their methodology excluded most content from Library Genesis. Not all DOIs from major publishers are paywalled; they publish some open access journals and allow Gold open access (author pays) in some cases. Despite these elements of double counting, it appears likely that at least a majority of all articles, and significantly more than a majority of major publisher articles, can be accessed without passing through a paywall.
Although the bulk of the Library Genesis content arrived via a small number of large uploads, the median upload rate is 2720 new articles/day. Among the sources for them are:
- The Scholar subreddit, which LG estimates sees about 45 requests/day for articles to be shared via Library Genesis.
- Sci-Hub, a service using proxies running on networks with subscriptions to paywalled publishers that allows users to enter a DOI. If it is not available from Library Genesis, the service tries proxies at random until one is found that can access the paper, which is both served to the user and added to Library Genesis.
LG doesn't have an estimate of the Sci-Hub traffic, but unless it is very large there must be other mechanisms filling the large gap between the Scholar subreddit and #icanhazpdf rates and the Library Genesis median upload rate.
Admittedly, it takes time for newly published articles to appear outside their paywalls. Some publishers operate "moving walls", so their articles become open access after an embargo period. It takes time for the various mechanisms driving Library Genesis to locate and upload articles. LG shows that their most recent year (2013) has only about half as many articles as the previous year, so the average delay is similar to the moving wall.
Paying to pass through paywalls thus delivers some value, not just access to a minority of the content but also more timely access to some of the majority. Nevertheless, the multi-billion dollar profits of the major publishers, let alone the other multiple billions that represent their costs in supplying their services, are hard to justify. We have already seen that their peer review process fails in its assigned role of ensuring the quality of the papers they publish. Now we see that the majority of the content for which they charge these enormous sums is available without payment.
My previous posts on scholarly communication.
I hate one-dimensional characters in movies and TV. I love complex characters who have good qualities and bad. I like that “The Good Wife” actually isn’t really such a paragon of moral virtue at all. That she has made questionable decisions and struggles with things, just like we all do. I like how many of the “villains” on that show do monstrous things, but still have likable qualities and people they love and who love them in turn. I’m glad we’re seeing more and more shows like that, where characters are as flawed and three-dimensional as we all are.
Yet there seems to be something in us that likes to simplify things when it comes to judging real people. Someone is either good or bad. On the side of right or on the side of evil. And there’s a tendency to either vilify people or put them on a pedestal. But the world is not so black-and-white.
I think few things have made that tendency to simplify as clear to me as the whole Joe Murphy vs. #Teamharpy lawsuit and social media debacle. It seemed like the dominant narrative either had to be that Lisa Rabey and nina de jesus were heroes and saints and Joe Murphy was a monster, or that Joe Murphy was a saint and poor innocent victim and Lisa Rabey and nina de jesus were monsters. I personally don’t believe either is true. Joe Murphy is not a saint, but he has had his reputation damaged (maybe fatally in our profession) for something there may be no evidence of him having done. Calling someone a sexual predator without first-hand knowledge or evidence that they are one (and I’m not saying that victims need to have evidence) seems like a shitty thing to do. But, given the number of negative things I’d heard about Joe from other librarians prior to all this, I’m assuming (and hoping) that Lisa thought she was doing something good in warning people about him.
I’m writing this knowing that I will probably be trolled by someone for it, but c’est la vie. I’m disturbed by the fact that, after all of the petitions, and Facebook drama, and blog posts, and tweets about this no one seems to be talking about this (other than right-wing feminist-hating nut-jobs) since the lawsuit was settled and Lisa and nina published retractions. We shouldn’t let right-wing feminist-hating nut-jobs control the narrative. And we also should be willing to admit when we were wrong and/or stand up for our beliefs if we feel we are right.
When I first wrote a post about all this, social media had been relatively quiet about it. I think there had been a couple of blog posts and the Team Harpy WordPress site was up, but nothing with a lot of vitriol had come out. Most of the rhetoric seemed focused generally on how common sexual harassment is — even in our female-dominated profession — and how important it is that there are whistleblowers who speak out about that behavior. There were posts about the importance of believing victims and supporting whistleblowers. I’d say that people were generally supportive of Lisa and nina, but were not necessarily assuming that Joe was what they said he was.
Soon after, the discussion took a turn for the bizarre, at least to me. The conversation around Joe on Facebook and Twitter became intensely vitriolic, with plenty of people arguing his guilt as if they had inside information. Respected library administrators who have never met Joe were calling him a “douchebag” on Twitter. There was a change.org petition asking him to drop his lawsuit, apologize to nina and Lisa, and compensate them. It was signed by over 1,000 people, including many people I like and respect. I did not sign it. I found it really odd that no one was considering the fact that he might be the victim in this. Instead, Lisa and nina were treated like victims when, if they did harm his career without any evidence of a crime, they were very much in the wrong. I find it difficult to believe that over 1,000 other people knew for a fact that he actually was a sexual predator.
It seemed more like people thought he was wrong to have sued them. If someone publicly accused me of a terrible crime with no evidence and damaged my career, wouldn’t I be the injured party and shouldn’t I be able to seek damages in a court of law? The idea that he was squashing their free speech rights was ridiculous. If it’s not true that Joe is a sexual predator, it is slander. It’s one thing to say Joe Murphy is a jerk. That is opinion. But stating that someone is factually something that they don’t know is true is not protected speech. Destroying someone’s reputation is a tremendous and personal violation of another human being. But maybe he deserved it because he was a player and a flirt? How is that any different than “slut-shaming?” I found it disturbing that none of the people I like and respect seemed to be acknowledging this. But maybe everyone but me knew for a fact that it was true?
I don’t like Joe Murphy. I still feel about him exactly the way I did when I wrote my first post. But, as I mentioned then, I think the fact that he was disliked by so many people made it easy for folks to believe him to have done it (and he might consider why so many people were saying awful things about him behind his back, because it’s not just “haters gonna hate”). We’ve all seen the delight people feel when someone powerful (or someone who is perceived as being privileged) is taken down. I’ve been reading a lot about Jon Ronson’s new book So You’ve Been Publicly Shamed and am looking forward to reading it and learning more about this strange and all-too-common social phenomenon.
In addition to the fact that plenty of people wanted to see him taken down a peg, this was happening at a time when things like gamergate and the recent conversations, articles, and presentations about sexual harassment in librarianship were shining a pretty bright light on this issue. I think people wanted to show their support for women who have been the victims of sexual harassment and this lawsuit gave our community an opportunity to come together to do that.
But let’s remember something here: nina and Lisa were not sexually harassed by Joe Murphy. That was never what anyone was claiming. But many people behaved like Joe was suing the victims of harassment. No. He was suing people who were reporting something they said they’d heard. This wasn’t about believing the victims of sexual harassment. They may have believed they were doing the right thing, but they weren’t harassed by Joe prior to posting what they did.
Now the tide has shifted and the trolls are attacking nina, Lisa, and their supporters (including me, though I wasn’t actually a supporter). I can’t even blame Joe much for engaging in a bit of schadenfreude now (I’ve seen him favoriting some of the trolling tweets his lawyers have been shooting out to me and others). I can’t fathom the suffering he must have endured through all this. I can’t imagine how demoralizing it must have been to have more than 1,000 people in our profession signing a change.org petition against him. But sadly, because he’s put on the mantle of the innocent victim and good-guy, I doubt very much that he is going to examine the behavior that got him here (and I don’t mean the lawsuit).
And that’s the rub. How do we call people like Joe on their shit in a way that might actually create change? Calling them a sexual predator on Twitter without evidence is clearly not it. I believe in the power of social media for good, but I haven’t seen a lot of good come out of it when it comes to calling out powerful men for bad behavior, because many then just position themselves as victims. Has public shaming really ever worked to meaningfully change people’s behavior (again, gotta read Ronson’s book)? But the “whisper network” doesn’t work either. People were saying lots of things about Joe, but the information wasn’t getting to people in power or maybe even Joe himself. Maybe he didn’t know how a lot of people felt about him. I have no idea.
Still, the greatest tragedy here, in my opinion, is that so many women suffer sexual harassment and most of the time the perpetrators get away with it. And this whole sordid affair did little to help the cause of encouraging women to come forward. I’ve been sexually harassed and stalked and never reported any of it. But it was when a faculty member at a former job who used to stand too close to me and would put his arm around my waist sometimes later escalated to grabbing a colleague’s breasts that I realized my silence was hurting other women. Because men who do things like this don’t just do it once. If they get away with something that you consider too minor to report, they may escalate to doing something much worse to someone else. We have to find more ways to help women feel safe reporting harassment. I’m happy that more conferences now have codes of conduct and discernible methods of reporting inappropriate behavior, and that will help, but it’s not enough.
I don’t have anything positive to end with here, so I’ll close with an excerpt from an interview with Jon Ronson where he talks about a situation where a guy at a conference was shamed on social media after a woman tweeted about an off-color joke he made, and then she was horribly trolled after he said he lost his job because of it. See any parallels?
The strange thing is the impulse to shame often comes from a good place. Like the desire to confront sexism, say. A good example is the tech conference incident: Hank whispers a naff joke about ‘big dongles’ to his friend, Adria hears it and takes offence, posts something on Twitter and the whole thing snowballs.
Ronson: Yeah, everybody involved in that shaming is doing it for social justice reasons. So Adria feels that in calling out Hank she’s calling out a greater truth: that privileged white men don’t know the effect they have on other people. The trolls think they’re doing the right thing because they feel Adria robbed Hank of his employment – so they wanted to get back at her. Everybody involved in that story feels the urge to be a good person – and it’s carnage all round. Everyone is broken by the experience; especially Adria, she has it worse than anybody. I mean, I’m on Hank’s side. Nobody wants to live in a world where you can’t make a dongle joke! But by the end of the story, Hank’s okay, he’s got a new job, but Adria’s unemployed and subjected to death threats. So Adria’s view of the world feels vindicated.
Archives are full of silences. Archivists try to surface these silences by making appraisal decisions about what to collect and what not to collect. Even after they are accessioned, records can be silenced by culling, weeding and purging. We do our best to document these activities, to leave a trail of these decisions, but they are inevitably deeply contingent. The context for the records and our decisions about them unravels endlessly.
At some point we must accept that the archival record is not perfect, and that it’s a bit of a miracle that it exists at all. But in all these cases it is the archivist who has agency: the deliberate or subliminal decisions that determine what comprises the archival record are enacted by an archivist. In addition the record creator has agency, in their decision to give their records to an archive.
Perhaps I’m over-simplifying a bit, but I think there is a curious new dynamic at play in social media archives, specifically archives of Twitter data. I wrote in a previous post about how Twitter’s Terms of Service prevent distribution of Twitter data retrieved from their API, but do allow for the distribution of Tweet IDs and relatively small amounts of derivative data (spreadsheets, etc).
Tweet IDs can then be hydrated, or turned back into raw original data, by going back to the Twitter API. If a tweet has been deleted you cannot get it back from the API. The net effect this has is of cleaning, or purging, the archival record as it is made available on the Web. But the decision of what to purge is made by the record creator (the creator of the tweet) or by Twitter themselves in cases where tweets or users are deleted.
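The hydration step can be sketched roughly as follows. This is a simplified illustration and not how twarc is actually implemented. It assumes a lookup function that accepts a batch of tweet IDs and returns only the tweets that still exist; the `lookup` callable, the `fake_lookup` stand-in, and the batch size of 100 are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterator, List, Tuple

BATCH_SIZE = 100  # assumed maximum number of IDs per lookup request

Tweet = Dict[str, object]
Lookup = Callable[[List[int]], Dict[int, Tweet]]


def batches(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hydrate(ids: List[int], lookup: Lookup) -> Tuple[Dict[int, Tweet], List[int]]:
    """Turn tweet IDs back into tweets and report the IDs that are gone.

    `lookup` is a hypothetical stand-in for an API call: it returns a
    mapping of id -> tweet for the IDs that are still available.
    """
    found: Dict[int, Tweet] = {}
    for batch in batches(ids, BATCH_SIZE):
        found.update(lookup(batch))
    missing = [tweet_id for tweet_id in ids if tweet_id not in found]
    return found, missing


if __name__ == "__main__":
    deleted = {5, 150, 250}  # pretend these tweets were deleted

    def fake_lookup(batch: List[int]) -> Dict[int, Tweet]:
        # Simulates the API: deleted tweets are silently left out.
        return {i: {"id": i, "text": f"tweet {i}"} for i in batch if i not in deleted}

    found, missing = hydrate(list(range(1, 251)), fake_lookup)
    print(len(found), "hydrated;", len(missing), "missing:", missing)
```

The important property is that deleted tweets simply never come back, so the only trace of them in the hydrated dataset is the list of IDs that went unanswered.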
For example, let’s look at the collection of Twitter data that Nick Ruest has assembled in the wake of the attack on the offices of Charlie Hebdo earlier this year. Nick collected 13 million tweets mentioning four hashtags related to the attacks, for the period of January 9th to January 28th, 2015. He has made the tweet IDs available as a dataset for researchers to use (a separate file for each hashtag). I was interested in replicating the dataset for potential researchers at the University of Maryland, but also in seeing how many of the tweets had been deleted.
So on February 20th (42 days after Nick started his collecting) I began hydrating the IDs. It took 4 days for twarc to finish. When it did I counted up the number of tweets that I was able to retrieve. The results are somewhat interesting:
| hashtag | archived tweets | hydrated | deletes | percent deleted |
| --- | --- | --- | --- | --- |
| #JeSuisJuif | 96,518 | 89,584 | 6,934 | 7.18% |
| #JeSuisAhmed | 264,097 | 237,674 | 26,423 | 10.01% |
| #JeSuisCharlie | 6,503,425 | 5,955,278 | 548,147 | 8.43% |
| #CharlieHebdo | 7,104,253 | 6,554,231 | 550,022 | 7.74% |
| Total | 13,968,293 | 12,836,767 | 1,131,526 | 8.10% |
It looks like 1.1 million tweets out of the 13.9 million tweet dataset have been deleted. That’s about 8.1%. I suspect now even more have been deleted. While the dataset itself is significantly smaller, the share of deletes for #JeSuisAhmed (10.01%) is quite a bit higher than for #JeSuisCharlie and #CharlieHebdo, whereas #JeSuisJuif (7.18%) is actually the lowest of the four. Could this be that users were concerned about how their tweets would be interpreted by parties analyzing the data?
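For anyone who wants to check the arithmetic, the deletion counts and percentages can be recomputed from the archived and hydrated counts reported above. This is just a small sketch; the function names are made up for the example.

```python
from typing import Dict, Tuple

# (archived, hydrated) counts per hashtag, as reported above.
COUNTS: Dict[str, Tuple[int, int]] = {
    "#JeSuisJuif": (96_518, 89_584),
    "#JeSuisAhmed": (264_097, 237_674),
    "#JeSuisCharlie": (6_503_425, 5_955_278),
    "#CharlieHebdo": (7_104_253, 6_554_231),
}


def deletion_stats(archived: int, hydrated: int) -> Tuple[int, float]:
    """Return (number deleted, percent deleted) for one dataset."""
    deleted = archived - hydrated
    return deleted, 100 * deleted / archived


def summarize(counts: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[int, float]]:
    """Compute deletion stats per hashtag plus an overall 'Total' row."""
    rows = {tag: deletion_stats(a, h) for tag, (a, h) in counts.items()}
    total_archived = sum(a for a, _ in counts.values())
    total_hydrated = sum(h for _, h in counts.values())
    rows["Total"] = deletion_stats(total_archived, total_hydrated)
    return rows


if __name__ == "__main__":
    for tag, (deleted, pct) in summarize(COUNTS).items():
        print(f"{tag:<15} {deleted:>10,} {pct:6.2f}%")
```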
Of course, it’s very hard for me to say since I don’t have the deleted tweets. I don’t even know who sent them. A researcher interested in these questions would presumably need to travel to York University to work with the dataset. In a way this seems to be how archives usually work. But if you add the Web as a global, public access layer into the mix it complicates things a bit.