Code4Lib Journal: “What If I Break It?”: Project Management for Intergenerational Library Teams Creating Non-MARC Metadata
In the years leading up to WWII, the French built the Maginot Line as an impregnable barrier against a German invasion:
While the fortification system did prevent a direct attack, it was strategically ineffective, as the Germans invaded through Belgium, going around the Maginot Line.
Copyright maximalists, such as the major academic publishers, are in a similar position. The more effective and thus intrusive the mechanisms they implement to prevent unauthorized access, the more they incentivize "guerilla open access".
Some copyright owners are coming to terms with this phenomenon. Today, Hugh Pickens reports that the first 4 of the 10 episodes of the new season of Game of Thrones have leaked:
The episodes have already been downloaded almost 800,000 times, and that figure was expected to blow past a million downloads by the season 5 premiere. Game of Thrones has consistently set records for piracy, which has almost been a point of pride for HBO. "Our experience is [piracy] leads to more penetration, more paying subs, more health for HBO, less reliance on having to do paid advertising. If you go around the world, I think you're right, Game of Thrones is the most pirated show in the world. Well, you know, that's better than an Emmy."
LG shows the massive scale on which "guerilla open access" is happening in the field of academic journals. As of the study, Library Genesis hosted nearly 23M articles identified by DOI, 15TB of data. The distribution was heavily skewed to the major publishers, representing 77% of Elsevier's DOIs, 73% of Wiley's and 53% of Springer's, although only 36% of all DOIs. To give some idea of the scale, this is about 60% of Ontario's Scholar's Portal, which has 38M.
Although some open access DOIs are included, the motivation to upload them is much less. A recent estimate by Khabsa and Lee Giles is that 24% of all articles are openly accessible on the Web; their methodology excluded most content from Library Genesis. Not all DOIs from major publishers are paywalled; they publish some open access journals and allow Gold open access (author pays) in some cases. Despite these elements of double counting, it appears likely that at least a majority of all articles, and significantly more than a majority of major publisher articles, can be accessed without passing through a paywall.
Although the bulk of the Library Genesis content arrived via a small number of large uploads, the median upload rate is 2720 new articles/day. Among the sources for them are:
- The Scholar subreddit, which LG estimates sees about 45 requests/day for articles to be shared via Library Genesis.
- Sci-Hub, a service using proxies running on networks with subscriptions to paywalled publishers that allows users to enter a DOI. If it is not available from Library Genesis, the service tries proxies at random until one is found that can access the paper, which is both served to the user and added to Library Genesis.
LG doesn't have an estimate of the Sci-Hub traffic, but unless it is very large there must be other mechanisms filling the large gap between the Scholar subreddit and #icanhazpdf rates and the Library Genesis median upload rate.
Admittedly, it takes time for newly published articles to appear outside their paywalls. Some publishers operate "moving walls", so their articles become open access after an embargo period. It takes time for the various mechanisms driving Library Genesis to locate and upload articles. LG shows that their most recent year (2013) has only about half as many articles as the previous year, so the average delay is similar to the moving wall.
Paying to pass through paywalls thus delivers some value, not just access to a minority of the content but also more timely access to some of the majority. Nevertheless, the multi-billion dollar profits of the major publishers, let alone the other multiple billions that represent their costs in supplying their services, are hard to justify. We have already seen that their peer review process fails in its assigned role of ensuring the quality of the papers they publish. Now we see that the majority of the content for which they charge these enormous sums is available without payment.
My previous posts on scholarly communication.
I hate one-dimensional characters in movies and TV. I love complex characters who have good qualities and bad. I like that “The Good Wife” actually isn’t really such a paragon of moral virtue at all. That she has made questionable decisions and struggles with things, just like we all do. I like how many of the “villains” on that show do monstrous things, but still have likable qualities and people they love and who love them in turn. I’m glad we’re seeing more and more shows like that, where characters are as flawed and three-dimensional as we all are.
Yet there seems to be something in us that likes to simplify things when it comes to judging real people. Someone is either good or bad. On the side of right or on the side of evil. And there’s a tendency to either vilify people or put them on a pedestal. But the world is not so black-and-white.
I think few things have made that tendency to simplify as clear to me as the whole Joe Murphy vs. #Teamharpy lawsuit and social media debacle. It seemed like the dominant narrative either had to be that Lisa Rabey and nina de jesus were heroes and saints and Joe Murphy was a monster, or that Joe Murphy was a saint and poor innocent victim and Lisa Rabey and nina de jesus were monsters. I personally don’t believe either is true. Joe Murphy is not a saint, but he has had his reputation damaged (maybe fatally in our profession) for something there may be no evidence of him having done. Calling someone a sexual predator without first-hand knowledge or evidence that they are one (and I’m not saying that victims need to have evidence) seems like a shitty thing to do. But, given the number of negative things I’d heard about Joe from other librarians prior to all this, I’m assuming (and hoping) that Lisa thought she was doing something good in warning people about him.
I’m writing this knowing that I will probably be trolled by someone for it, but c’est la vie. I’m disturbed by the fact that, after all of the petitions, and Facebook drama, and blog posts, and tweets about this, no one seems to be talking about it (other than right-wing feminist-hating nut-jobs) since the lawsuit was settled and Lisa and nina published retractions. We shouldn’t let right-wing feminist-hating nut-jobs control the narrative. And we also should be willing to admit when we were wrong and/or stand up for our beliefs if we feel we are right.
When I first wrote a post about all this, social media had been relatively quiet about it. I think there had been a couple of blog posts and the Team Harpy WordPress site was up, but nothing with a lot of vitriol had come out. Most of the rhetoric seemed focused generally on how common sexual harassment is — even in our female-dominated profession — and how important it is that there are whistleblowers who speak out about that behavior. There were posts about the importance of believing victims and supporting whistleblowers. I’d say that people were generally supportive of Lisa and nina, but were not necessarily assuming that Joe was what they said he was.
Soon after, the discussion took a turn for the bizarre, at least to me. The conversation around Joe on Facebook and Twitter became intensely vitriolic, with plenty of people arguing his guilt as if they had inside information. Respected library administrators who have never met Joe were calling him a “douchebag” on Twitter. There was a change.org petition asking him to drop his lawsuit, apologize to nina and Lisa, and compensate them. It was signed by over 1,000 people, including many people I like and respect. I did not sign it. I found it really odd that no one was considering the fact that he might be the victim in this. Instead, Lisa and nina were treated like victims, which, if they did harm his career without any evidence of a crime, they were very much in the wrong. I find it difficult to believe that over 1,000 other people knew for a fact that he actually was a sexual predator.
It seemed more like people thought he was wrong to have sued them. If someone publicly accused me of a terrible crime with no evidence and damaged my career, wouldn’t I be the injured party and shouldn’t I be able to seek damages in a court of law? The idea that he was squashing their free speech rights was ridiculous. If it’s not true that Joe is a sexual predator, it is slander. It’s one thing to say Joe Murphy is a jerk. That is opinion. But stating that someone is factually something that they don’t know is true is not protected speech. Destroying someone’s reputation is a tremendous and personal violation of another human being. But maybe he deserved it because he was a player and a flirt? How is that any different than “slut-shaming?” I found it disturbing that none of the people I like and respect seemed to be acknowledging this. But maybe everyone but me knew for a fact that it was true?
I don’t like Joe Murphy. I still feel about him exactly the way I did when I wrote my first post. But, as I mentioned then, I think the fact that he was disliked by so many people made it easy for folks to believe him to have done it (and he might consider why so many people were saying awful things about him behind his back, because it’s not just “haters gonna hate”). We’ve all seen the delight people feel when someone powerful (or someone who is perceived of as being privileged) is taken down. I’ve been reading a lot about Jon Ronson’s new book So You’ve Been Publicly Shamed and am looking forward to reading it and learning more about this strange and all-too-common social phenomenon.
In addition to the fact that plenty of people wanted to see him taken down a peg, this was happening at a time when things like gamergate and the recent conversations, articles, and presentations about sexual harassment in librarianship were shining a pretty bright light on this issue. I think people wanted to show their support for women who have been the victims of sexual harassment and this lawsuit gave our community an opportunity to come together to do that.
But let’s remember something here: nina and Lisa were not sexually harassed by Joe Murphy. That was never what anyone was claiming. But many people behaved like Joe was suing the victims of harassment. No. He was suing people who were reporting something they said they’d heard. This wasn’t about believing the victims of sexual harassment. They may have believed they were doing the right thing, but they weren’t harassed by Joe prior to posting what they did.
Now the tide has shifted and the trolls are attacking nina, Lisa, and their supporters (including me, though I wasn’t actually a supporter). I can’t even blame Joe much for engaging in a bit of schadenfreude now (I’ve seen him favoriting some of the trolling tweets his lawyers have been shooting out to me and others). I can’t fathom the suffering he must have endured through all this. I can’t imagine how demoralizing it must have been to have more than 1,000 people in our profession signing a change.org petition against him. But sadly, because he’s put on the mantle of the innocent victim and good-guy, I doubt very much that he is going to examine the behavior that got him here (and I don’t mean the lawsuit).
And that’s the rub. How do we call people like Joe on their shit in a way that might actually create change? Calling them a sexual predator on Twitter without evidence is clearly not it. I believe in the power of social media for good, but I haven’t seen a lot of good come out of it when it comes to calling out powerful men for bad behavior, because many then just position themselves as victims. Has public shaming really ever worked to meaningfully change people’s behavior (again, gotta read Ronson’s book)? But the “whisper network” doesn’t work either. People were saying lots of things about Joe, but the information wasn’t getting to people in power or maybe even Joe himself. Maybe he didn’t know how a lot of people felt about him. I have no idea.
Still, the greatest tragedy here, in my opinion, is that so many women suffer sexual harassment and most of the time the perpetrators get away with it. And this whole sordid affair did little to help the cause of encouraging women to come forward. I’ve been sexually harassed and stalked and never reported any of it. But it was when a faculty member at a former job who used to stand too close to me and would put his arm around my waist sometimes later escalated to grabbing a colleague’s breasts that I realized my silence was hurting other women. Because men who do things like this don’t just do it once. If they get away with something that you consider too minor to report, they may escalate to doing something much worse to someone else. We have to find more ways to help women feel safe reporting harassment. I’m happy that more conferences now have codes of conduct and discernible methods of reporting inappropriate behavior, and that will help, but it’s not enough.
I don’t have anything positive to end with here, so I’ll close with an excerpt from an interview with Jon Ronson where he talks about a situation where a guy at a conference was social media shamed after a woman tweeted about an off-color joke he made and then she was horribly trolled after he said he lost his job because of it. See any parallels?
The strange thing is the impulse to shame often comes from a good place. Like the desire to confront sexism, say. A good example is the tech conference incident: Hank whispers a naff joke about ‘big dongles’ to his friend, Adria hears it and takes offence, posts something on Twitter and the whole thing snowballs.
Ronson: Yeah, everybody involved in that shaming is doing it for social justice reasons. So Adria feels that in calling out Hank she’s calling out a greater truth: that privileged white men don’t know the effect they have on other people. The trolls think they’re doing the right thing because they feel Adria robbed Hank of his employment – so they wanted to get back at her. Everybody involved in that story feels the urge to be a good person – and it’s carnage all round. Everyone is broken by the experience; especially Adria, she has it worse than anybody. I mean, I’m on Hank’s side. Nobody wants to live in a world where you can’t make a dongle joke! But by the end of the story, Hank’s okay, he’s got a new job, but Adria’s unemployed and subjected to death threats. So Adria’s view of the world feels vindicated.
Archives are full of silences. Archivists try to surface these silences by making appraisal decisions about what to collect and what not to collect. Even after they are accessioned, records can be silenced by culling, weeding and purging. We do our best to document these activities, to leave a trail of these decisions, but they are inevitably deeply contingent. The context for the records and our decisions about them unravels endlessly.
At some point we must accept that the archival record is not perfect, and that it’s a bit of a miracle that it exists at all. But in all these cases it is the archivist who has agency: the deliberate or subliminal decisions that determine what comprises the archival record are enacted by an archivist. In addition the record creator has agency, in their decision to give their records to an archive.
Perhaps I’m over-simplifying a bit, but I think there is a curious new dynamic at play in social media archives, specifically archives of Twitter data. I wrote in a previous post about how Twitter’s Terms of Service prevent distribution of Twitter data retrieved from their API, but do allow for the distribution of Tweet IDs and relatively small amounts of derivative data (spreadsheets, etc).
Tweet IDs can then be hydrated, or turned back into raw original data, by going back to the Twitter API. If a tweet has been deleted you cannot get it back from the API. The net effect this has is of cleaning, or purging, the archival record as it is made available on the Web. But the decision of what to purge is made by the record creator (the creator of the tweet) or by Twitter themselves in cases where tweets or users are deleted.
For example, let’s look at the collection of Twitter data that Nick Ruest has assembled in the wake of the attack on the offices of Charlie Hebdo earlier this year. Nick collected 13 million tweets mentioning four hashtags related to the attacks, for the period of January 9th to January 28th, 2015. He has made the tweet IDs available as a dataset for researchers to use (a separate file for each hashtag). I was interested in replicating the dataset for potential researchers at the University of Maryland, but also in seeing how many of the tweets had been deleted.
So on February 20th (42 days after Nick started his collecting) I began hydrating the IDs. It took 4 days for twarc to finish. When it did I counted up the number of tweets that I was able to retrieve. The results are somewhat interesting:
| hashtag | archived tweets | hydrated | deletes | percent deleted |
|---|---|---|---|---|
| #JeSuisJuif | 96,518 | 89,584 | 6,934 | 7.18% |
| #JeSuisAhmed | 264,097 | 237,674 | 26,423 | 10.01% |
| #JeSuisCharlie | 6,503,425 | 5,955,278 | 548,147 | 8.43% |
| #CharlieHebdo | 7,104,253 | 6,554,231 | 550,022 | 7.74% |
| Total | 13,968,293 | 12,836,767 | 1,131,526 | 8.10% |
It looks like 1.1 million tweets out of the 13.9 million tweet dataset have been deleted. That’s about 8.1%. I suspect now even more have been deleted. While the dataset itself is significantly smaller, the deletion rate for #JeSuisAhmed (10.01%) is quite a bit higher than for #JeSuisCharlie (8.43%) and #CharlieHebdo (7.74%); #JeSuisJuif, at 7.18%, is the lowest of the four. Could this be that users were concerned about how their tweets would be interpreted by parties analyzing the data?
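The percentages in the table can be recomputed from the archived and hydrated counts. The following is a minimal sketch in Python; the helper name `deletion_stats` is made up for illustration and is not part of twarc.

```python
# Counts copied from the table above: (tweet IDs archived, tweets hydrated).
counts = {
    "#JeSuisJuif": (96518, 89584),
    "#JeSuisAhmed": (264097, 237674),
    "#JeSuisCharlie": (6503425, 5955278),
    "#CharlieHebdo": (7104253, 6554231),
}


def deletion_stats(archived, hydrated):
    """Return (deleted count, percent deleted) for one set of tweet IDs."""
    deleted = archived - hydrated
    return deleted, 100.0 * deleted / archived


# Per-hashtag figures.
for tag, (archived, hydrated) in counts.items():
    deleted, pct = deletion_stats(archived, hydrated)
    print("{0}: {1:,} deleted ({2:.2f}%)".format(tag, deleted, pct))

# Totals across all four hashtags.
total_archived = sum(archived for archived, _ in counts.values())
total_hydrated = sum(hydrated for _, hydrated in counts.values())
total_deleted, total_pct = deletion_stats(total_archived, total_hydrated)
print("Total: {0:,} deleted ({1:.2f}%)".format(total_deleted, total_pct))
```

A tweet counts as deleted here simply because its ID could not be hydrated, so the figure also includes tweets from accounts that were suspended or made private.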
Of course, it’s very hard for me to say since I don’t have the deleted tweets. I don’t even know who sent them. A researcher interested in these questions would presumably need to travel to York University to work with the dataset. In a way this seems to be how archives usually work. But if you add the Web as a global, public access layer into the mix it complicates things a bit.
Last updated April 14, 2015. Created by wooble on April 14, 2015.
pycounter makes working with COUNTER usage statistics in Python easy, including fetching statistics with NISO SUSHI.
Developed by the Health Sciences Library System of the University of Pittsburgh to support importing usage data into our in-house Electronic Resources Management (ERM) system.
Licensed under the MIT license. See the file LICENSE for details.
pycounter is tested on Python 2.6, 2.7, 3.3, 3.4, 3.5, and pypy2.
Package Type: Electronic Resource Management
License: MIT License
Package Links
- pycounter - 0.5a2 6-Apr-2015
Drupal for the library website is under development.
This is the second in a series of posts about the Extended Date Time Format (EDTF) and its use in the Digital Public Library of America. For more background on this topic take a look at the first post in this series.
EDTF Use by Hub
In the previous post I looked at the overall usage of the EDTF in the DPLA and found that it didn’t matter that much if a Hub was classified as a Content Hub or a Service Hub when it came to looking at the availability of dates in item records in the system. Content Hubs had 83% of their records with some date value and Service Hubs had 81% of their records with date values.
Looking overall at the dates that were present, there were 51% that were valid EDTF strings and 49% that were not valid EDTF strings.
One of the things that should be noted is that there are many common date formats that we use throughout our work that are valid EDTF date strings that may not have been created with the idea of supporting EDTF. For example 1999, and 2000-04-03 are both valid (and highly used date_patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.
I wanted to look at how the EDTF was distributed across the Hubs in the dataset and was able to create the following table.

| Hub Name | Items With Date | % of total items with date present | Valid EDTF | Valid EDTF % | Not Valid EDTF | Not Valid EDTF % |
|---|---|---|---|---|---|---|
| ARTstor | 49,908 | 88.6% | 26,757 | 53.6% | 23,151 | 46.4% |
| Biodiversity Heritage Library | 29,000 | 21.0% | 22,734 | 78.4% | 6,266 | 21.6% |
| David Rumsey | 48,132 | 100.0% | 48,132 | 100.0% | 0 | 0.0% |
| Digital Commonwealth | 118,672 | 95.1% | 14,731 | 12.4% | 103,941 | 87.6% |
| Digital Library of Georgia | 236,961 | 91.3% | 188,263 | 79.4% | 48,687 | 20.5% |
| Harvard Library | 6,957 | 65.8% | 6,910 | 99.3% | 47 | 0.7% |
| HathiTrust | 1,881,588 | 98.2% | 1,295,986 | 68.9% | 585,598 | 31.1% |
| Internet Archive | 194,454 | 93.1% | 185,328 | 95.3% | 9,126 | 4.7% |
| J. Paul Getty Trust | 92,494 | 99.8% | 6,319 | 6.8% | 86,175 | 93.2% |
| Kentucky Digital Library | 87,061 | 68.1% | 87,061 | 100.0% | 0 | 0.0% |
| Minnesota Digital Library | 39,708 | 98.0% | 33,201 | 83.6% | 6,507 | 16.4% |
| Missouri Hub | 34,742 | 83.6% | 32,192 | 92.7% | 2,550 | 7.3% |
| Mountain West Digital Library | 634,571 | 73.1% | 545,663 | 86.0% | 88,908 | 14.0% |
| National Archives and Records Administration | 553,348 | 78.9% | 10,218 | 1.8% | 543,130 | 98.2% |
| North Carolina Digital Heritage Center | 214,134 | 82.1% | 163,030 | 76.1% | 51,104 | 23.9% |
| Smithsonian Institution | 675,648 | 75.3% | 44,860 | 6.6% | 630,788 | 93.4% |
| South Carolina Digital Library | 52,328 | 68.9% | 42,128 | 80.5% | 10,200 | 19.5% |
| The New York Public Library | 791,912 | 67.7% | 47,257 | 6.0% | 744,655 | 94.0% |
| The Portal to Texas History | 424,342 | 88.8% | 416,835 | 98.2% | 7,505 | 1.8% |
| United States Government Printing Office (GPO) | 148,548 | 99.9% | 17,894 | 12.0% | 130,654 | 88.0% |
| University of Illinois at Urbana-Champaign | 14,273 | 78.8% | 11,304 | 79.2% | 2,969 | 20.8% |
| University of Southern California. Libraries | 269,880 | 89.6% | 114,293 | 42.3% | 155,573 | 57.6% |
| University of Virginia Library | 26,072 | 86.4% | 21,798 | 83.6% | 4,274 | 16.4% |
Turning this into a graph helps things show up a bit better.
There are a number of things that can be teased out of here. First, a few Hubs have 100% or nearly 100% of their dates conforming to EDTF already, notably David Rumsey’s Hub and the Kentucky Digital Library, both at 100%. Harvard at 99% and the Portal to Texas History at 98% are also notable. On the other end of the spectrum we have the National Archives and Records Administration with 98% of their dates being Not Valid, New York Public Library with 94%, and the J. Paul Getty Trust at 93%.
Use of EDTF Level Features
The EDTF has the notion of feature levels, which include Level 0, Level 1, and Level 2. Level 0 covers the basic date features such as date, date and time, and intervals. Level 1 adds features like uncertain/approximate dates, unspecified dates, extended intervals, years exceeding four digits, and seasons to the mix. Level 2 adds to the feature set with partial uncertain/approximate dates, partial unspecified, sets, multiple dates, masked precision and extensions of the extended interval and years exceeding four digits. Finally, Level 2 lets you qualify seasons. For a full list of the features please take a look at the draft specification at the Library of Congress.
When I was preparing the dataset I also tested the dates to see which feature level they matched to. After starting the analysis I noticed a few bugs in my testing code and added them as issues to the GitHub site for the ExtendedDateTimeFormat Python module available here. Even with the bugs, which falsely identified one feature as both a Level0 and a Level1 feature, and another feature as both Level1 and Level2, I was able to come up with usable data for further analysis. Because of these bugs, a few Hubs in the table below differ slightly in their number of valid EDTF items from the table presented in the first part of this post.

| Hub Name | valid EDTF items | valid-level0 | % Level0 | valid-level1 | % Level1 | valid-level2 | % Level2 |
|---|---|---|---|---|---|---|---|
| ARTstor | 26,757 | 26,726 | 99.9% | 31 | 0.1% | 0 | 0.0% |
| Biodiversity Heritage Library | 22,734 | 22,702 | 99.9% | 32 | 0.1% | 0 | 0.0% |
| David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Digital Commonwealth | 14,731 | 14,731 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Digital Library of Georgia | 188,274 | 188,274 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Harvard Library | 6,910 | 6,822 | 98.7% | 83 | 1.2% | 5 | 0.1% |
| HathiTrust | 1,295,990 | 1,292,079 | 99.7% | 3,662 | 0.3% | 249 | 0.0% |
| Internet Archive | 185,328 | 185,115 | 99.9% | 212 | 0.1% | 1 | 0.0% |
| J. Paul Getty Trust | 6,319 | 6,308 | 99.8% | 11 | 0.2% | 0 | 0.0% |
| Kentucky Digital Library | 87,061 | 87,061 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| Minnesota Digital Library | 33,201 | 26,055 | 78.5% | 7,146 | 21.5% | 0 | 0.0% |
| Missouri Hub | 32,192 | 32,190 | 100.0% | 2 | 0.0% | 0 | 0.0% |
| Mountain West Digital Library | 545,663 | 542,388 | 99.4% | 3,274 | 0.6% | 1 | 0.0% |
| National Archives and Records Administration | 10,218 | 10,003 | 97.9% | 215 | 2.1% | 0 | 0.0% |
| North Carolina Digital Heritage Center | 163,030 | 162,958 | 100.0% | 72 | 0.0% | 0 | 0.0% |
| Smithsonian Institution | 44,860 | 44,642 | 99.5% | 218 | 0.5% | 0 | 0.0% |
| South Carolina Digital Library | 42,128 | 42,079 | 99.9% | 49 | 0.1% | 0 | 0.0% |
| The New York Public Library | 47,257 | 47,251 | 100.0% | 6 | 0.0% | 0 | 0.0% |
| The Portal to Texas History | 416,838 | 402,845 | 96.6% | 6,302 | 1.5% | 7,691 | 1.8% |
| United States Government Printing Office (GPO) | 17,894 | 16,165 | 90.3% | 875 | 4.9% | 854 | 4.8% |
| University of Illinois at Urbana-Champaign | 11,304 | 11,275 | 99.7% | 29 | 0.3% | 0 | 0.0% |
| University of Southern California. Libraries | 114,307 | 114,307 | 100.0% | 0 | 0.0% | 0 | 0.0% |
| University of Virginia Library | 21,798 | 21,558 | 98.9% | 236 | 1.1% | 4 | 0.0% |
Looking at the top 25% of the data, you get the following.
Obviously the majority of dates in the DPLA that are valid EDTF comply with Level0, which includes standard dates like years (1900), year and month (1900-03), year, month and day (1900-03-03), full date and time (2014-03-03T13:23:50), and intervals with any of the dates (yyyy, yyyy-mm, yyyy-mm-dd) in the format of 2004-02/2014-03-23.
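To make the Level0 patterns listed above concrete, here is a rough sketch of a Level0 check written with regular expressions. It is a simplification for illustration only, not the ExtendedDateTimeFormat module's code: the name `is_level0` is made up, and the pattern checks shape and value ranges but not calendar validity (it would accept 1900-02-31).

```python
import re

# Building blocks for the Level0 shapes described above.
MONTH = r"(?:0[1-9]|1[0-2])"
DAY = r"(?:0[1-9]|[12]\d|3[01])"
# yyyy, yyyy-mm or yyyy-mm-dd
DATE = r"\d{4}(?:-" + MONTH + r"(?:-" + DAY + r")?)?"
# A time is only allowed after a complete yyyy-mm-dd date.
FULL_DATE = r"\d{4}-" + MONTH + "-" + DAY
# hh:mm:ss with an optional "Z" or +hh:mm / -hh:mm offset.
TIME = r"(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d(?:Z|[+-]\d{2}:\d{2})?"

# Either a date and time, or a date optionally followed by "/" and an end date.
LEVEL0 = re.compile(
    "(?:" + FULL_DATE + "T" + TIME + "|" + DATE + "(?:/" + DATE + ")?)"
)


def is_level0(value):
    """Return True if the string has the shape of an EDTF Level0 value."""
    return LEVEL0.fullmatch(value) is not None


for example in ["1900", "1900-03", "1900-03-03", "2014-03-03T13:23:50",
                "2004-02/2014-03-23", "1999?", "March 1900"]:
    print(example, is_level0(example))
```

The last two examples are rejected: `1999?` uses the Level 1 uncertainty marker, and `March 1900` is not an EDTF string at all.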
There are a number of Hubs that are making use of Level 1 and Level 2 features, with the most notable being the Minnesota Digital Library, which makes use of Level 1 features in 21.5% of their item records. The Portal to Texas History and the Government Printing Office both make use of Level 2 features as well, with the Portal having them present in 7,691 item records (1.8% of their total) and GPO in 854 of their item records (4.8%).
I have one more post in this series that will take a closer look at which of the EDTF features are being used in the DPLA as a whole and then for each of the Hubs.
Feel free to contact me via Twitter if you have questions or comments.
The ORCID site says:
ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized.
I will post this so I can present – and come back later to expand, and clean up typos, so this post will evolve a bit.
This event is launching a document:
The ‘Joint Statement of Principle: ORCID – Connecting Researchers and Research‘ [PDF 297KB] proposes that Australia’s research sector broadly embrace the use of ORCID (Open Researcher and Contributor ID) as a common researcher identifier. The statement was drafted by a small working group coordinated by the Australian National Data Service (ANDS) comprised of representatives from Universities Australia (UA), the Council of Australian University Librarians (CAUL) and the Australasian Research Management Society (ARMS). Representatives of the Australian Research Council and the National Health and Medical Research Council also provided input through the working group.
In this presentation I talk about some of the details of how to implement the ORCID. Just how do you use an ORCID ID in an institutional repository?
This is not that simple, as most of our systems are set up to expect string-values for names, not IDs.
This talk is not all about ORCIDs…
… it’s about implementing linked data principles
This talk is about why ORCIDs are important, as part of the linked-data web. I will give examples of some of the work that’s going on at UTS and other institutions on linked-data approaches to research data management and research data publishing and conclude with some comments about the kinds of services I think ORCID needs to offer.
Modern metadata should be linked-data
Thou Shalt Have No Data Without Metadata
RDF is best practice for Metadata
Use Metadata Standards where they exist
Use URIs rather than Scalars (eg Strings) as names
Name all data and metadata ASAP
And while it’s easy enough to say “RDF is best practice for Metadata”, entering RDF metadata is non-trivial for humans. So I wanted to show you some of the work we’ve been doing to make it possible to build research data systems that are compliant with the above principles.
1. ReDBOX
Screenshot from the UTS Stash data catalogue showing a party lookup, to get a URI that identifies a person.
The ANDS-funded ReDBOX project embraced linked metadata principles from the very beginning. It has a name-authority service, Mint, which is a clearing house for sources of truth about people, organisational units, subject codes, grant codes etc.
But the ReDBOX/Mint partnership is a very close one; there’s no general way to look up other name authorities without loading them into Mint.
In 2014 I asked, what if there were a general way to do this, so that we could use URIs from a wide range of sources, and a team of developers from NZ and the UK responded as part of the Developer Challenge Event at that year’s Open Repositories conference in Helsinki, supported by Rob Peters from ORCID, who is at this meeting in Canberra.
- Adam Field: iSolutions, University of Southampton
- Claire Knowles: Library Digital Development Manager, University of Edinburgh
- Kim Shepherd: Digital Development, University of Auckland Library
- Jared Watts: Digital Development, University of Auckland Library
- Jiadi Yao: EPrints Services, University of Southampton
See their git repo.
This modest GitHub repository might not look like much, but as far as I know, it’s the first example of an attempt to create an easy-to-use protocol for web developers to make lookup services.
Fill My List enabled auto-complete lookup to multiple sources of truth including ORCID, so a user can find the particular Lee or Smith they want to assert is a creator of a work, or specify which kind of Python they mean for the subject of a work, and get a URI. The FML team did prototype implementations for the EPrints and DSpace software.
Looking up the Schools Online Thesaurus (ScOT) for the URI for “Billabongs”.
The above screenshot shows a prototype lookup service which shows auto-complete hints as you type.
Note that typing “Oxb” finds the same URI: Billabongs are also known as ‘oxbow lakes’.
Note that in the screenshot you can see one of the important changes we made to the Omeka repository software to support linked data, as part of the Ozmeka project.
Instead of just a string field for the subject there is a URI as well. So, even though some records might say “Billabongs” and some might say “Oxbow lakes” both would have the same URI.
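The idea of several labels resolving to one URI can be sketched in a few lines of Python. This is only an illustration of the principle, not the Fill My List protocol or the Ozmeka code: the URI, the label table and the function names below are all hypothetical.

```python
# Hypothetical URI standing in for the real ScOT concept URI.
BILLABONG_URI = "http://example.org/vocab/billabongs"

# Every label, preferred or alternate, points at the same concept URI.
LABEL_TO_URI = {
    "billabongs": BILLABONG_URI,
    "oxbow lakes": BILLABONG_URI,
}


def suggest(prefix):
    """Return sorted (label, URI) pairs whose label starts with the typed text."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    return sorted(
        (label, uri)
        for label, uri in LABEL_TO_URI.items()
        if label.startswith(needle)
    )


def same_subject(record_a, record_b):
    """Compare two records by subject URI rather than by display string."""
    return record_a["subject_uri"] == record_b["subject_uri"]


print(suggest("Oxb"))
print(suggest("Bill"))

record_1 = {"subject": "Billabongs", "subject_uri": BILLABONG_URI}
record_2 = {"subject": "Oxbow lakes", "subject_uri": BILLABONG_URI}
print(same_subject(record_1, record_2))
```

Because each record keeps the display string alongside the URI, the two records can show different labels while still being recognised as having the same subject.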
Note that to make this work we had to hack the Omeka software we’re using because, like most repository software, it didn’t have good support for using URIs as metadata values.
So, why am I telling you all this?
The raw, machine-readable Fill My List protocol in action, looking up an ORCID index.
When we refer to researchers on the web, we should use their ORCID ID, in the form of a URI. But to be able to do this we often have to update repository software (as my team at UTS are doing with Omeka).
In conclusion
The ORCID API (machine to machine interface) provides pretty good but not perfect open lookup services so…
Repository developers can make their repositories linked-data compliant
But it’s a lot of work and it will involve a community effort to update our repository systems, many of which are open source.
We have made a lot of progress on improving the quality of metadata in the Australian research sector, and a lot of that has been community driven. For example, the ReDBOX project’s insistence on using URIs led to the first URIs for Australian grants being minted by developers from Griffith and USQ, because it was the right thing to do.
Now, a few years later, the government is making its own URIs and the Australian National Data Service is providing vocab services.
ORCID does have a public API that allows us to build Fill My List type lookup services by querying on name-parts. It would be better if it included bibliographic information, which might help someone entering metadata choose between two people with the same name.
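As a rough illustration of “ORCID ID, in the form of a URI”, here is a sketch in Python that checks the final character of an iD (ORCID documents it as an ISO 7064 MOD 11-2 checksum) and then builds the URI. The function names are hypothetical, and the `https://orcid.org/` prefix is assumed to be the canonical form; this is not code from Fill My List or Omeka.

```python
import re


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def orcid_uri(orcid_id):
    """Return the URI form of an ORCID iD, or raise ValueError if it is malformed."""
    compact = orcid_id.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        raise ValueError(f"not a 16-character ORCID iD: {orcid_id!r}")
    if orcid_check_character(compact[:15]) != compact[15]:
        raise ValueError(f"checksum mismatch: {orcid_id!r}")
    groups = [compact[i:i + 4] for i in range(0, 16, 4)]
    return "https://orcid.org/" + "-".join(groups)
```

A repository could run a check like this before storing the URI, so that a mistyped iD is rejected at data entry rather than discovered later.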
ACRL was ridiculously amazing this year. I feel energized, affirmed, and hopeful (and completely exhausted and sick since it ended). The programming was so high-quality and relevant that, in most cases, I had at least four options in every time slot on my planner that I wanted to attend. Luckily, ACRL records all the sessions and will be putting them online in the next few weeks; there are so many I want to listen to! It’s really nice to go to a conference when you feel like you’re actually in a position to implement some of the things you’ve heard about.
I have such warm feelings for my amazing colleagues at PCC, but I also have such warm feelings for the community college library community. Everyone has been so welcoming and positive about my move. It feels like marrying into a family, but only in the good ways. I feel so very lucky. I think community college librarianship is the best kept secret in our profession and I’m just happy I got the chance to figure it out.
Secret seems to be the problem though. Two years ago, I heard a good deal of complaining on Twitter that there wasn’t much programming for or by community college librarians at ACRL 2013. This was definitely not the case this year, where there were more sessions by community college librarians than any one person could attend. What was interesting this year was that, in many of those sessions, only community college librarians attended. I went to a great session presenting projects from the Assessment in Action program that were done by community colleges, and the session was attended almost entirely by community college librarians. And yet there was so much any academic librarian interested in assessment would have gotten out of their very realistic and honest (warts and all) descriptions of their assessment projects. We had about 80 people at our talk on what it takes to build a culture of assessment (where we compared our results from community colleges to those from the first study), but only a handful of attendees were not community college librarians. I totally understand that these might have seemed to be only relevant to community college librarians, but I also think there is a tendency to believe that community college libraries have more to learn from university libraries than vice versa. Maybe that’s true when it comes to data management, scholarly communications, and home-grown technologies, but the singular focus on student success makes community colleges a fantastic source of learning about instruction, outreach, and assessment.
Portland Community College looks to Portland State University a great deal (especially in the library) for ideas and to create a consistent experience for students moving from one school to the other. But, in my time at Portland State, most of my colleagues were not interested in the community colleges that provided around 2/3 of their student body. I’ll admit I was guilty of it as well until I was contacted by a wonderful librarian at PCC (who is now my colleague) who saw a presentation Amy Hofer and I gave at the 2013 Oregon Library Association/Washington Library Association Conference and wanted to meet. Amy and I started to meet with fantastic librarians from Portland and Mount Hood Community Colleges to talk about our learning outcomes and instructional practice. During the worst summer of my professional life, they were a ray of sunshine and hope. When I first saw the brilliant work my now colleague, Pam Kessinger, did around curriculum mapping, I became ashamed of the fact that I originally thought we at PSU had more to teach than we had to learn from the community colleges. I was so wrong. Interestingly, Amy and I are now both working for community colleges.
I was so happy to see the presentation at ACRL about how Appalachian State and their local community colleges met to collaborate and discuss learning outcomes development. Similar work was done years ago in Oregon and that work blossomed into a group, ILAGO (Information Literacy Articulation Group of Oregon), that connects community college, university, and K-12 librarians around our shared information literacy and advocacy goals.
There is so much Portland State could learn from Portland Community College about how to build a culture of assessment right. Having served on the Assessment Council at PSU, I can say that there was only lip service paid at the administrative level to the importance of assessment and no real support for assessment was provided in the years I was at the University. You can’t even find learning outcomes for courses (if they exist at all), and while the departments were required to draft program-level outcomes a few years ago, they were not published anywhere on the websites of most departments (I had to ask for a copy from Institutional Research). A couple of departments were doing really great assessment work, but they were the exception rather than the norm.
The College is still on the road to building an assessment culture, but they’re doing it in such a smart way. Every department is required to do assessment, but the faculty in each department are empowered to decide what they want to assess and how to approach it. And they are given support, in the form of faculty mentors associated with the Learning Assessment Council. And the people to whom we have to report our assessment progress each year and who give us feedback on it are our peers on the Learning Assessment Council. The faculty are driving the bus. The departments I’ve seen doing assessment seem to be really focused on doing it to improve their programs for their students. It’s very inspiring. And so nice, as a new librarian learning about her liaison areas, to be able to see the course-level outcomes for every course listed prominently on the College website. I’m not saying every community college is doing amazing assessment work, but, according to our research, they seem to be doing more than BA, MA, and PhD-granting institutions.
Take a look at the results Lisa Hinchliffe and I shared at ACRL from our study of the factors that facilitate and hinder librarians in building an assessment culture, and you’ll see that community college librarians are ahead of the game in terms of assessment practices. [Sorry the formatting got a little messed up in slideshare, but the content is all there.]
Community college libraries have longer been scrutinized by outside entities and so have longer had to play the accountability game. Their more singular focus on student success and learning encourages a focus on assessment for and about learning. And I’d argue that their long history of being resource-constrained (by-and-large) has led in many cases to real creativity (I think this is also sometimes helped by having leaner organizations with less bureaucracy). There’s a lot we could learn from the approaches community colleges have taken to engaging in assessment practice.
This is starting to feel like a guilt trip for university librarians, but I think community college librarians share the blame if they’re not sharing the great work they do. When you look at the literature, a lot less publishing is happening in community college libraries. Seeing how much busier I am in my current job with reference, instruction, and library-wide projects than I ever was in previous positions, I totally understand why, but I don’t think we can expect people to be interested in our work if we are not out there sharing it. Lisa and I exhorted our audience to share their assessment work. Whether you publish it in a journal, at a conference, on a listserv, or a blog, what matters most is that you’re sharing it (though I’d love to see more people publishing in places that provide open access to their work). Librarians of every type could benefit so much from knowing the great work that goes on in community college libraries.
I also think it would be helpful to not use the term “community college library” in the title of articles and presentations, unless something is really only relevant to community college libraries. I think it may make people from other types of academic libraries think it isn’t for them. The work my incredible colleague at PCC, Sara Seely, presented in our ACRL presentation on teaching sources and source evaluation for lifelong information literacy was from a community college context, but was totally relevant to any librarian teaching information literacy. I understand the desire to have community college-specific programming, but I think having the speakers be from a community college is good enough and would expose more people to our great work. So much of what we do and struggle with is not unique to community colleges.
Let’s share the great work we do and break down the barriers between community college librarians and academic librarians in other contexts. There is so much we can learn from each other!
Image credit: 99u
I’ve got a new series of posts that I’ve been wanting to do for a while now to try and get a better understanding of the utilization of the Extended Date Time Format (EDTF) in cultural heritage organizations and more specifically if those date formats are making their way into the Digital Public Library of America. Before I get started with the analysis of the over eight million records in the DPLA, I wanted to give a little background of the EDTF format itself and some of the work that has happened in this area in the past.
A Bitter Harvest
One of the things that I remember about my first year as a professional librarian was starting to read some of the work that Roy Tennant and others were doing at the California Digital Library on metadata harvesting. One text that I remember in particular was his “Bitter Harvest: Problems & Suggested Solutions for OAI-PMH Data & Service Providers”, which talked about many of the issues that they ran into in trying to deal with dates from a variety of service providers. This same challenge was also echoed by projects like the National Science Digital Library (NSDL), OAIster, and other aggregations of metadata over the years.
One thing that came out of many of these aggregation projects, and something that many of us are dealing with today, is the fact that “dates are hard”.
Extended Date Time Format
A few years ago an initiative was started to address some of the challenges that we have in representing dates for the types of things that we work with in cultural heritage organizations. This initiative was named the Extended Date Time Format (EDTF) which has a site at the Library of Congress and which resulted in a draft specification for a profile or extension to ISO 8601.
An example of what this documented was how to represent some of the following date concepts in a machine readable way.
Commonly Used Dates

| Date Feature | Example Item | Format | Example Date |
|---|---|---|---|
| Year | Book with publication year | YYYY | 1902 |
| Month | Monthly journal issue | YYYY-MM | 1893-05 |
| Day | Letter | YYYY-MM-DD | 1924-03-03 |
| Time | Born-digital photo | YYYY-MM-DDTHH:MM:SS | 2003-12-27T11:09:08 |
| Interval | Compiled court documents | YYYY/YYYY | 1887/1889 |
| Season | Seasonal magazine issue | YYYY-SS | 1957-23 |
| Decade | WWII poster | YYYu | 194u |
| Approximate | Map “circa 1886” | YYYY~ | 1886~ |

Some Complex Dates

| Example Item | Kind of Date | Format | Example Date |
|---|---|---|---|
| Photo taken at some point during an event August 6-9, 1992 | One of a Set | [YYYY..YYYY] | [1992-08-06..1992-08-09] |
| Hand-carved object, “circa 1870s” | Extended Interval (L1) | YYYY~/YYYY~ | 1870~/1879~ |
| Envelope with a partially-legible postmark | Unspecified | “u” in place of digit(s) | 18uu-08-1u |
| Map possibly created in 1607 or 1630 | One of a Set, Uncertain | [YYYY, YYYY] | [1607?, 1630?] |
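To make the simpler forms concrete, here is a small Python sketch that recognizes a handful of the “commonly used” patterns with regular expressions. It is only an illustration under stated assumptions: the form names and the `classify` helper are hypothetical, it covers a small slice of the draft specification, and it is not the UNT Libraries’ validation module.

```python
import re

# Regexes for a few of the "commonly used" forms. This is a sketch, not a
# full EDTF validator: it ignores most features of the draft specification
# and does not check that days or clock times are in range.
EDTF_FORMS = [
    ("year", re.compile(r"[0-9]{4}")),
    ("month", re.compile(r"[0-9]{4}-(0[1-9]|1[0-2])")),
    ("season", re.compile(r"[0-9]{4}-2[1-4]")),
    ("day", re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")),
    ("time", re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}")),
    ("interval", re.compile(r"[0-9]{4}/[0-9]{4}")),
    ("decade", re.compile(r"[0-9]{3}u")),
    ("approximate", re.compile(r"[0-9]{4}~")),
]


def classify(value):
    """Return the name of the first form matching the whole string, else None."""
    for name, pattern in EDTF_FORMS:
        if pattern.fullmatch(value):
            return name
    return None
```

A free-text value such as “circa 1886” matches none of these forms, which is exactly the kind of date that has to be rewritten (here as 1886~) before it conforms.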
The UNT Libraries made an effort to adopt the EDTF for its digital collections in 2012 and started the long process of identifying and adjusting dates that did not conform with the EDTF specification (sadly we still aren’t finished).
Hannah Tarver and I authored a paper titled “Lessons Learned in Implementing the Extended Date/Time Format in a Large Digital Library” for the 2013 Dublin Core Metadata Initiative (DCMI) Conference that discussed our findings after analyzing the 460,000 metadata records in the UNT Libraries’ Digital Collections at the time. As a followup to that presentation Hannah created a wonderful cheatsheet to help people with the EDTF for many of the standard dates we encounter in describing items in our digital libraries.
EDTF use in the DPLA
When the DPLA introduced its Metadata Application Profile (MAP) I noticed that there was mention of the EDTF as one of the ways that dates could be expressed. In the 3.1 profile it was mentioned in both the dpla:SourceResource.date property syntax schema as well as the edm:TimeSpan class for all of its properties. In the 4.0 profile this changed a bit: the EDTF was removed as a syntax schema from the dpla:SourceResource.date property and from the edm:TimeSpan “Original Source Date”, but kept for the edm:TimeSpan “Begin” and “End” properties.
Because of this mention, and the knowledge that the Portal to Texas History, which is a Service-Hub, is contributing records in the EDTF, I had the following questions in mind when I started the analysis presented in this post and a few that will follow.
- How many date values in the DPLA are valid EDTF values?
- How are these valid EDTF values distributed across the Hubs?
- What feature levels (Level 0, Level 1, and Level 2) are used by various Hubs?
- What are the most common date format patterns used in the DPLA?
With these questions in mind I started the analysis.
Preparing the Dataset
I used the same dataset that I had created for some previous work that consisted of a data download from the DPLA of their Bulk Metadata Download in February 2015. This download contained 8,xxx,xxx records that I used for the analysis.
I used the UNT Libraries’ ExtendedDateTimeFormat validation module (available on Github) to classify each date present in each record as either valid EDTF or not valid. Additionally I tested which level of EDTF each value conformed to. Finally I identified the pattern of each of the date fields by converting all digits in the string to 0, all alpha characters to x and leaving all non alpha-numeric characters.
This resulted in the following fields being indexed for each date:

| Field | Value |
|---|---|
| date | 2014-04-04 |
| date_valid_edtf | true |
| date_level0_feature | true |
| date_level1_feature | false |
| date_level2_feature | false |
| date_pattern | 0000-00-00 |
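The date_pattern step is simple enough to sketch in a few lines of Python. This is an illustration of the conversion as described (digits become 0, letters become x, everything else is kept), not the code that was actually run; the function names are hypothetical.

```python
from collections import Counter


def date_pattern(value):
    """Replace every digit with 0 and every letter with x; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("0")
        elif ch.isalpha():
            out.append("x")
        else:
            out.append(ch)
    return "".join(out)


def pattern_counts(values):
    """Tally how often each pattern occurs in an iterable of date strings."""
    return Counter(date_pattern(v) for v in values)
```

Collapsing values this way means that 1999 and 2000 fall into the same bucket (0000), so the most common date shapes in a large dataset can be counted directly.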
For those interested in trying some EDTF dates you can check out the EDTF Validation Service offered by the UNT Libraries.
After several hours of indexing these values into Solr, I was able to start answering some of the questions mentioned above.
Date usage in the DPLA
The first thing that I looked at was how many of the records in the DPLA dataset had dates vs the records that were missing dates. Of the 8,012,390 items in my copy of the DPLA dataset, 6,624,767 (83%) had a date value with 1,387,623 (17%) missing any date information.
I was impressed with the number of records that have dates in the DPLA dataset as a whole and was then curious about how that mapped to the various Hubs.

| Hub Name | Items | Items With Date | Items With Date % | Items Missing Date | Items Missing Date % |
|---|---|---|---|---|---|
| ARTstor | 56,342 | 49,908 | 88.6% | 6,434 | 11.4% |
| Biodiversity Heritage Library | 138,288 | 29,000 | 21.0% | 109,288 | 79.0% |
| David Rumsey | 48,132 | 48,132 | 100.0% | 0 | 0.0% |
| Digital Commonwealth | 124,804 | 118,672 | 95.1% | 6,132 | 4.9% |
| Digital Library of Georgia | 259,640 | 236,961 | 91.3% | 22,679 | 8.7% |
| Harvard Library | 10,568 | 6,957 | 65.8% | 3,611 | 34.2% |
| HathiTrust | 1,915,159 | 1,881,588 | 98.2% | 33,571 | 1.8% |
| Internet Archive | 208,953 | 194,454 | 93.1% | 14,499 | 6.9% |
| J. Paul Getty Trust | 92,681 | 92,494 | 99.8% | 187 | 0.2% |
| Kentucky Digital Library | 127,755 | 87,061 | 68.1% | 40,694 | 31.9% |
| Minnesota Digital Library | 40,533 | 39,708 | 98.0% | 825 | 2.0% |
| Missouri Hub | 41,557 | 34,742 | 83.6% | 6,815 | 16.4% |
| Mountain West Digital Library | 867,538 | 634,571 | 73.1% | 232,967 | 26.9% |
| National Archives and Records Administration | 700,952 | 553,348 | 78.9% | 147,604 | 21.1% |
| North Carolina Digital Heritage Center | 260,709 | 214,134 | 82.1% | 46,575 | 17.9% |
| Smithsonian Institution | 897,196 | 675,648 | 75.3% | 221,548 | 24.7% |
| South Carolina Digital Library | 76,001 | 52,328 | 68.9% | 23,673 | 31.1% |
| The New York Public Library | 1,169,576 | 791,912 | 67.7% | 377,664 | 32.3% |
| The Portal to Texas History | 477,639 | 424,342 | 88.8% | 53,297 | 11.2% |
| United States Government Printing Office (GPO) | 148,715 | 148,548 | 99.9% | 167 | 0.1% |
| University of Illinois at Urbana-Champaign | 18,103 | 14,273 | 78.8% | 3,830 | 21.2% |
| University of Southern California. Libraries | 301,325 | 269,880 | 89.6% | 31,445 | 10.4% |
| University of Virginia Library | 30,188 | 26,072 | 86.4% | 4,116 | 13.6% |

I was surprised by the high percentage of dates in records for many of the Hubs in the DPLA. The only Hub that had more records without dates than with dates was the Biodiversity Heritage Library. There were some Hubs, notably David Rumsey, HathiTrust, J. Paul Getty Trust, and the Government Printing Office, that have dates for more than 98% of their items in the DPLA. This is most likely because of the kinds of data they are providing or the fact that dates are required to identify which items can be shared (HathiTrust).
When you look at Content-Hubs vs Service-Hubs you see the following.

| Hub Type | Items | Items With Date | Items With Date % | Items Missing Date | Items Missing Date % |
|---|---|---|---|---|---|
| Content-Hub | 5,736,178 | 4,782,214 | 83.4% | 953,964 | 16.6% |
| Service-Hub | 2,276,176 | 1,842,519 | 80.9% | 433,657 | 19.1% |

It looks like things are pretty evenly matched between the two types of Hubs when it comes to the presence of dates in their records.
Valid EDTF Dates
I took a look at the 6,624,767 items that had date values present in order to see if their dates were valid based on the EDTF specification. It turns out that 3,382,825 (51%) of these values are valid and 3,241,811 (49%) are not valid EDTF date strings.
So the split is pretty close.
One of the things that should be mentioned is that there are many common date formats that we use throughout our work that are valid EDTF date strings that may not have been created with the idea of supporting EDTF. For example 1999, and 2000-04-03 are both valid (and highly used date_patterns) that are normal to run across in our collections. Many of the “valid EDTF” dates in the DPLA fall into this category.
In the next posts I want to take a look at how EDTF dates are distributed across the different Hubs and also at some of the EDTF features used by Hubs in the DPLA.
As always feel free to contact me via Twitter if you have questions or comments.
The recent publication of Monica Berger and Jill Cirasella’s piece in College and Research Libraries News “Beyond Beall’s List: Better understanding predatory publishers” is a reminder that the issue of “predatory publishers” continues to require focus for those working in scholarly communication. Berger and Cirasella have done an exemplary job of laying out some of the issues with Beall’s list, and called on librarians to be able “to describe the beast, its implications, and its limitations—neither understating nor overstating its size and danger.”
At my institution academic deans have identified “predatory” journals as an area of concern, and I am sure similar conversations are happening at other institutions. Here’s how I’ve “described the beast” at my institution, and models for services we all can provide, whether subject librarian or scholarly communication librarian.
What is a Predatory Publisher? And Why Does the Dean Care?
The concept of predatory publishers became much more widely known in 2013 with the publication of an open access sting by John Bohannon in Science, which I covered in this post. As a recap, Bohannon created a fake but initially believable poor quality scientific article, and submitted it to open access journals. He found that the majority of journals accepted the poor quality paper, 45% of which were included in the Directory of Open Access Journals. At the time of publication in October 2013 the response to this article was explosive in the scholarly communications world. It seems that more than a year later the reaction continues to spread. Late in the fall semester of 2014, library administration asked me to prepare a guide about predatory publishers, due to concern among the deans that unscrupulous publishers might be taking advantage of faculty. This was a topic I’d been educating faculty about on an ad hoc basis for years, but I never realized we needed to address it more systematically. That all has changed, with senior library administration now doing regular presentations about predatory publishers to faculty.
If we are to be advocates of open access, we need to focus on the positive impact that open access has rather than dwell for too long on the bad sides of it. We also need faculty to be clear on their own goals for making their work open access so that they may make more informed choices. Librarians have limited faculty bandwidth on the topic, and so focusing on education about self-archiving articles (otherwise known as green open access) or choosing no-fee (also known as gold) open access journals is a better way to achieve advocacy goals than suggesting faculty choose only a certain set of gold open access journals. Unless we are offering money for paying article fees, we also don’t have much say about where faculty choose to publish. Education about how to choose a journal and a license responsibly is what we should focus on, even if it diverges from certain ideals (see Meredith Farkas on choosing creative commons licenses).
Understanding the Needs and Preparing the Material
As I mentioned, my library administration asked for a guide that they could use in presentations and share with faculty. In preparing this guide, I worked with our library’s Scholarly Communications committee (of which I am co-chair) to determine the format and content.
We decided that adding this material to our existing Open Access research guide would be the best move, since it was already up and we shared the URL widely already. We have a robust series of Open Access Week events (which I wrote about last fall) and this seemed the ideal place to continue engaging people. That said, we determined that the guide needed an overhaul to make it more clear that open access was an on-going area of concern, not a once a year event. Since faculty are not always immediately thinking of making work open access but of the mechanics of publishing, I preferred to start with the title “Publishing Your Own Work”.
To describe its features a bit more, I wanted to start from the mindset of self-archiving work to make it open access with a description of our repository and Peter Suber’s useful guide to making one’s own work open access. I then continued with an explanation of article publication fees, since I often get questions along those lines. They are not unique to open access journals, and don’t imply any fee to accept for publication, which was a fear that I heard more than once during Open Access Week last year. I only then discussed the concept of predatory journals, with the hope that a basic understanding of the process would allay fears. I then present a list of steps to research a journal. I thought these steps were more common sense than anything, but after conversations with faculty and administration, I realized that my intuition about what type of journal I am dealing with is obvious because I have daily practice and experience. For people new to the topic I tried to break down research into easy steps that help them to figure out where a journal is on the continuum from outright scams to legitimate but new or unusual journals. It was also important to me to emphasize self-archiving as a strategy no matter the journal publication model.
Lastly, while most academic libraries have a model of liaison librarians engaging in scholarly communications activities, the person who spends every day working on these issues is likely to be more versed in emerging trends. So it is important to work with liaisons to help them research journals and to identify quality open access journals in their disciplines. We plan to add this information to the guide in a future version.
Taking it on the Road
We felt that in-person instruction on these matters with faculty was a crucial next step, particularly for people who publish in traditional journals but want to make their work available. Traditional journals’ copyright transfer agreements can be predatory, even if we don’t think about it in those terms. Taking inspiration from the ACRL Scholarly Communications Roadshow I attended a few years ago, I decided to take the curriculum from that program and offer it to faculty and graduate students. We read through three publication agreements as a group, and then discussed how open the publishers were to reuse of material, or whether they mentioned it at all. We then included a section on addenda to contracts for negotiation about additional rights.
The first workshop received modest attendance, but included some thoughtful conversations, and we have promised to run it again. Some people may never have read their agreements closely, and never realized they were doing something illegal or not specifically allowed by, for instance, sharing an article they wrote with their students. That concrete realization is more likely to spur action than more abstract arguments about the benefits of open access.
Escaping the Predator Metaphor
If I could go back, I would get rid of the concept of “predator” attached to open access journals. Let’s call it instead unscrupulous entrants into an emerging business model. That’s not as catchy, but it explains why this has happened. I would argue, personally, that the hybrid gold journals by large publishers are just as predatory, as they capitalize on funding requirements to make articles open access with high fees. They too are trying new business models, and those may not be tenable either. As I said above, choosing a journal with eyes wide open and understanding all the ramifications of different publication models is the only way forward. To suggest that faculty are innocently waiting to be pounced on by predators is to deny their agency and their ability to make choices about their own work. There may be days where that metaphor seems apt, but I think overall this is a damaging mentality to librarians interested in promoting new models of scholarly communication. I hope we can provide better resources and programming to escape this, as well as to help administration to understand how to choose to fund open access initiatives.
In the comments I’d like to hear more suggestions about how to escape the “predator” metaphor, as well as your own techniques for educating faculty on your campus.
The Fedora 4 upgration is coming into its third month with a big focus on migration. Notes from the last project meeting are available here. Some highlights:
Jared Whiklo, Web Application Developer at University of Manitoba, has joined the project team and is working with Danny and Nick on code tasks.
Recent work has focused on a couple of areas. The first was collaborating with Mike Durbin (University of Virginia) on fcrepo4-labs/migration-utils, which will support an upgration in this order:
- traversing the objectStore file system
- archive export format
- migrate export format
In order to provide a large set of test fixtures for use with this tool, Nick updated YUDL’s Fedora to 3.8.1-SNAPSHOT.
The second focus was on data modelling. Specifically, mapping fcrepo3 object properties to fcrepo4 container properties, fcrepo3 datastream properties to fcrepo4 file properties, mapping RELS-EXT properties, identifying standard audit trail events, and working towards bringing Islandora into compliance with the Portland Common Data Model. This work was shared with the community via a Large Image Solution Pack example object modelled in Fedora 4.
Related to the migration, work has also been done around Audit Service design in Fedora 4. Nick participated in all of the Audit Service design meetings, and led a discussion around PROV and PREMIS ontology usage in the service. A code sprint led by Esme Coles and devoted to the Audit Service began on March 30th. That work is outlined here.
The migration work will most likely continue through April. If you want to attend future meetings and keep an eye on development, please join the Islandora Fedora 4 Interest Group. Your ideas and use cases are also very welcome as issues in Islandora Labs. For anyone planning to attend the Open Repositories conference in Indianapolis this June, the upgration team will be giving a presentation on the upgration project called Islandora and Fedora 4: The Atonement.
Today I found the following resources and bookmarked them on Delicious.
- Messenger for Mac Facebook Messenger for the desktop
The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. The hope is that by combining both of these processes, greater comprehension and understanding of a corpus can be gained than by using close or distant reading alone. (This text might also be republished at http://dh.crc.nd.edu/sandbox/thatcamp-2015/ as well as http://nd2015.thatcamp.org/2015/04/07/close-and-distant/.)
To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. As the glue is drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, with the results made available here. Both the traditional reading as well as the text mining are aimed towards answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

| Feature | The Prince | Representative Men |
|---|---|---|
| Author | Niccolò di Bernardo dei Machiavelli (1469 – 1527) | Ralph Waldo Emerson (1803 – 1882) |
| Title | The Prince | Representative Men |
| Date | 1532 | 1850 |
| Fulltext | plain text, HTML, PDF, TEI/XML | plain text, HTML, PDF, TEI/XML |
| Length | 31,179 words | 59,600 words |
| Fog score | 23.1 | 14.6 |
| Flesch score | 33.5 | 52.9 |
| Kincaid score | 19.7 | 11.5 |
| Frequencies | unigrams, bigrams, trigrams, quadgrams, quintgrams | unigrams, bigrams, trigrams, quadgrams, quintgrams |
| Parts-of-speech | nouns, pronouns, adjectives, verbs, adverbs | nouns, pronouns, adjectives, verbs, adverbs |
I observe this project to be a qualified success.
First, I was able to print and bind my book, and while the glue is still trying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) if I am able to identify answers to my research questions, above.
Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long and the first. Readability score-wise, Machiavelli is almost certainly written for the more educated and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized into histograms or word clouds. The verbs ought to lemmatized. The frequencies ought to be depicted as ratios compared to the texts. Other measures could be created as well. For example, my Great Books Coefficient could be employed.
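The “frequencies as ratios” idea can be sketched as follows. This is a hypothetical Python illustration, not the project’s own code (which is written in Perl and Python scripts not reproduced here), and the tokenization rule (lowercase runs of letters and apostrophes) is an assumption.

```python
import re
from collections import Counter


def ngrams(text, n=1):
    """Split text into lowercase word tokens and return its n-grams as strings."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(text, n=1):
    """Map each n-gram to its share of all n-grams in the text."""
    grams = ngrams(text, n)
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    return {gram: count / total for gram, count in counts.items()}
```

Because each value is a share of the whole text, the figure for “man” in a 31,179-word book can be compared directly with the figure for a 59,600-word book, which raw counts do not allow.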
How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And the network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:
“man” in The Prince
“man” in Representative Men
I think I’m going to have to read the books to find the answer. Really.

Code
Bunches o’ code was written to produce the reports:
- concordance.cgi – the simple search engine
- fathom.pl – used to compute the readability scores
- file2pos.py – create a parts-of-speech file for later use
- network.cgi – used to display words used “in the same breath” as a given word
- ngrams.pl – compute ngrams
- pos.py – count and tabulate parts-of-speech from a previously created file
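For readers curious what the readability scores in the comparison measure, here is a minimal Python sketch of the standard published formulas for Flesch Reading Ease, Flesch-Kincaid grade level, and Gunning Fog. It is not a port of fathom.pl, and whether that script computes its scores in exactly this way is an assumption. The function names are hypothetical, and the word, sentence, syllable, and complex-word counts are taken as inputs because counting syllables reliably is a separate problem.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier reading."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate years of schooling needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex_words are words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Made-up counts for illustration only, not taken from either book.
    print(flesch_reading_ease(words=100, sentences=5, syllables=150))
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
    print(gunning_fog(words=100, sentences=5, complex_words=10))
```

All three scores combine average sentence length with a measure of word difficulty, which is why long sentences and long words push the Fog and Kincaid scores up and the Flesch score down.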
You can download this entire project — code and all — from http://dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz or http://infomotions.com/blog/wp-content/uploads/2015/04/thatcamp-2015.tar.gz.
This all started with a conversation over twitter (https://twitter.com/_whitni/status/583603374320410626) about a week ago. The discussion was about why the current version of MarcEdit is so fragile when being run on a Mac. The short answer has been that MarcEdit utilizes a cross-platform toolset when building the UI which works well on Linux and Windows systems, but tends to be less refined on Mac systems. I’ve known this for a while, but to really do it right, I’d need to develop a version of MarcEdit that uses native Mac APIs, which would mean building a new version of MarcEdit for the Mac (at least, the UI components). And I’ve considered it – mapped out a road map – but what’s constantly stopped me has been a lack of interest from the MarcEdit community and a lack of a Mac system. On the community side, I can count on two hands the number of times I’ve had someone request a version of MarcEdit specifically for a Mac. And since I’ve been making a Mac App version of MarcEdit available, its use has been fairly low (though this could be due to the struggles noted above). With an active community of over 20,000, I try to put my time where it will make the most impact, and up until a week ago, better support for Mac systems didn’t seem to be high on the list. The second reason is I don’t own a Mac. My technology stack is made up of about a dozen Windows and Linux systems embedded around my house because they play surprisingly well together, whereas Apple’s walled garden just doesn’t thrive within my ecosystem. So, I’ve been waiting and hoping that the cross-platform toolset would get better and that in time, this problem would eventually go away.
I’m giving that background because apparently I’ve been misreading the MarcEdit community. As I said, this all started with this conversation on twitter (https://twitter.com/_whitni/status/583603374320410626) – and out of that, two co-conspirators, Whitni Watkins and Francis Kayiwa, set out to see just how much interest there actually was in having a dedicated version of MarcEdit for the Mac. The two set out to see if they could raise funds to acquire a Mac to do this development and, indirectly, demonstrate that there was actually a much larger slice of the community interested in seeing this work done. And, so, off they went – and I sat back and watched. I made a conscious decision that if this was going to happen, it was going to be because the community wanted it and in that, my voice wasn’t necessary. And after 8 days, it’s done. In all, 40 individuals contributed to the campaign, but more importantly to me, I heard directly from 200+ individuals who were hopeful that this project would proceed.
Now the hard work starts. MarcEdit is a program that has been under constant development since 1999 – so even just rewriting the UI components of the application will be a significant undertaking. So, I’m breaking up this work into chunks. I figure it would take approximately 8-12 months to completely port the UI, which is a long time. Too long…so I’m breaking the development into 3 month “sprints”. The first sprint will target the 80%, the functionality that would make MarcEdit productive when doing MARC editing. This means porting the functionality for all the resources found in the MARC Tools and much of the functionality found in the MarcEditor components. My guess is these two components are the most important functional areas for catalogers – so finishing those would allow the tool to be immediately useful for doing production cataloging and editing. After that – I’ll be able to evaluate the remainder of the program and begin working on functional parity between all versions of the application.
But I’ll admit, at this point, the road map is somewhat cloudy even to me. See, I’ve written up the following document (http://1drv.ms/1ake4gO) and shared it with Whitni and asked her to work with other Mac users to refine the list and let me know what falls into that 80%. So, I’ll be interested to see where their list differs from my own. In the meantime, I’ll be starting work on the port – creating wireframes and spending time over the next week hitting the books and familiarizing myself with Apple’s API docs and the UI best practices (though, I will be trying to keep the program looking very familiar to the current application – best practices be damned). Coding on the new UI will start in earnest around May 1 – and by August 1, 2015, I hope to have the first version built specifically for a Mac available. For those interested in following the development process – I’ll be creating a build page on the MarcEdit website (http://marcedit.reeset.net) and will be posting regular builds as new areas of the application are ported so that folks can try them and give feedback.
So, that’s where this stands at this point. For those interested in providing feedback, feel free to contact me directly at email@example.com. And for those of you that reached out or participated in the campaign to make this happen, my sincere thanks.
Part bu of Amazon crawl..
This item belongs to: data/ol_data.
This item has files of the following types: Data, Data, Metadata, Text
John Miedema: The cognitive computing features of Lila. Candidate technologies, limitations, and future plans.
Cognitive computing extends the range of knowledge tasks that can be performed by computers and humans. In the previous post I summarized the characteristics of a cognitive system. This post maps the characteristics to Lila features, along with candidate technology to deliver them. Limitations and future plans are also listed.

1. Life-world data
Lila Features:
Lila operates on unstructured data from multiple sources. Unstructured data includes author notes, digital articles and books. Data is collected from many sources, including smart phone notes, email, web pages, documents, PDFs.
Lila operates on rapidly changing data, as is expected when writing a work. Lila’s functions can be re-calculated on demand.
Data volume is expected to be the size of an average non-fiction work (about 100,000 words), up to 1000 full length articles, and about 100 full length books.
Candidate Technology:
There are existing tools for gathering content from different sources. Evernote, for example, is a candidate technology for a first version of Lila. Lila’s cognitive functions can operate on data exported from Evernote.
Limitations and Future Plans:
English only.
Digital text only.
Text must be text analyzable, i.e., no locked formats.
Table content can be analyzed, but no table look-up operations.
Image analysis is limited to associated text labels.

2. Natural questions
Lila Features:
Lila analyzes author notes, treating them as questions to be asked of other notes and unread articles and books. The following features combine to build meaningful queries on the content.
- The finite size of the note itself helps capture the author’s meaning.
- Lila uses author-suggested categories, tags and markup to understand what the author considers important.
- Lila develops a model of the author’s work, used to better understand the author’s intent.
Candidate Technology:
Structured queries will be performed using existing technology, Apache Solr.
Limitations and Future Plans:
Questions are constructed implicitly from author notes, not from a voice or text question box.
No direct dialog interface is provided, but see 6&7.

3. Reading and understanding
Lila Features:
Lila uses natural language processing (NLP) to read author notes and unread content.
Language dictionaries provide an understanding of synonyms and parts-of-speech. This knowledge of language is an advance over simple keyword matching.
Entity identification is performed automatically using machine learning. Identification includes person names, organizations and locations. Lila can be extended to include custom entity identification models.
Lila uses additional input from the author to build a model of the author’s work. This model is used to better understand the author’s meaning when questioning content. See 6&7.
Candidate Technology:
Existing NLP technologies, e.g., OpenNLP.
New Lila technology for the model.
Limitations and Future Plans:
English only.
Lila does not perform deep parsing of syntax.

4. Analytics
Lila Features:
Lila calculates a correlation between author notes, and between author notes and unread content. Lila also calculates a suggested order for notes.
Candidate Technology:
The open source R tool can be used for statistical calculations.
Language resources such as the MRC psycholinguistic database will be used to create new Lila technology for ordering notes.
Limitations and Future Plans:
The calculations for suggesting order are experimental. It is likely that this function will need development over time.

5. Answers are structured and ordered
Lila Features:
Lila provides two visualizations:
- A connections view to visualize correlations between notes and unread content.
- A suggested order for notes, a visual hierarchy or a table of contents.
- Classify content with categories and tags.
- Inline markup of entities, concepts and relations.
These inputs create the model used to question content and create correlations. The author can manually edit the model with improvements.
The connections view will allow the author to “pin” correct relationships and delete incorrect relationships.
Candidate Technology:
There are existing technologies for classifying content. Evernote, for example, is a candidate technology for a first version of Lila. Lila’s cognitive functions can operate on data exported from Evernote.
New Lila technology for the model.
Limitations and Future Plans:
The Evernote interface for collecting and editing notes has limitations. In the future, Lila will need its own interface to allow for advanced functions, e.g., inline markup, sorting of notes without numbered labels.
In the future, Lila may use author ordering of notes as a suggestion toward its calculated order.
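The post does not say how the correlation between notes would be calculated. As a purely illustrative stand-in, one common way to score how related two short texts are is cosine similarity over their word counts. The Python sketch below assumes that approach. The names `term_vector` and `cosine_similarity` are hypothetical, and nothing here should be read as Lila’s actual method.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase word occurrences in a note (naive tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(note_a, note_b):
    """Score two notes from 0.0 (no shared words) to about 1.0 (same word mix)."""
    va, vb = term_vector(note_a), term_vector(note_b)
    # Dot product over the words the two notes have in common.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(n * n for n in va.values())) * math.sqrt(
        sum(n * n for n in vb.values())
    )
    # An empty note has zero length, so treat it as unrelated to everything.
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cosine_similarity("notes on quality and value", "quality in writing"))
```

A score like this only compares surface vocabulary. The synonym dictionaries and entity identification described above would presumably feed a richer comparison than raw word overlap.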
I’ve been involved with open source software projects since at least the 1990s. I even saved a Unix application from certain death that I still use today. But that doesn’t mean I’m all rosy-eyed about all open source software projects. They are not all created equal.
To be clear, there are “open source” projects that are neither all that open nor all that successful.
But let me parse my terms before you get all hot and bothered. “Open” can be as little as dropping the code out on a repository somewhere, which is the level at which many projects currently sit. Or, it could mean that the code is actively managed under an open governance model. Most fall somewhere in between, and a number die a quiet death from neglect. I’ve also seen projects that claim the open source label long before releasing any code. And, as we’ve seen with Kuali, there is no guarantee that open source software will remain that way.
Meanwhile, someone like Terry Reese, who has programmed and maintained the amazing MarcEdit application for many years, is criticized for not open sourcing his software. If he also refused to make it better and add capabilities, then maybe there would be reason for concern. But it has been tirelessly maintained and improved. Managing an open source community is not easy. I can certainly understand why Terry may want to simplify his job by vastly reducing the number of variables involved.
All things being equal, open source is better than closed source. But things are rarely equal. And it doesn’t follow that software must be open source to be useful and valued. It also doesn’t mean that someone such as Terry may choose to open source the software when he no longer wishes to maintain it. So let’s stop beating up people and projects that wish to control their own code. There should be many options for software development, not just one.
Now go ahead and give me hell, people, because I know you want to.
Picture by J. Albert Bowden II, https://www.flickr.com/photos/jalbertbowdenii/, Creative Commons License CC BY 2.0.