Feed aggregator

NYPL Labs: Together We Listen: Make Hundreds of NYC Stories Accessible—One Word at a Time

planet code4lib - Tue, 2016-04-05 16:05

The following blog post is co-authored by Willa Armstrong (NYPL Labs) and Alex Kelly (Adult Programming and Outreach Services).

Are you familiar with the NYPL Community Oral History Project? Take a few moments to listen to some highlights or just dive right into our full collection of stories.

The NYPL Community Oral History Project is truly the people’s project. It’s powered by the public, as hundreds of engaged community members come together to gather oral histories from each other in order to preserve the rich and constantly changing history of New York City. The project began in 2013 at Jefferson Market Library in Greenwich Village and has built momentum since, with oral histories collected in six additional neighborhoods. Visible Lives, an oral history project on the disability experience, is another large-scale collection effort, based out of the Andrew Heiskell Braille and Talking Book Library. Read more about our growing collection at

To date, the Community Oral History collection contains over 1,000 stories, with more on the way! This is very exciting, but here’s the issue: We’re faced with the challenge of making this large corpus of audio accessible and searchable to the public. It’s a challenge faced by many organizations and institutions with audio-based collections and archives.

Transcripts are important for making audio accessible because they make audio content searchable online and accessible to people with hearing disabilities. Recent advances in speech-to-text technologies have made great progress in opening audio to the web, but the transcripts they produce are still error-prone and can only be considered first drafts. Though they’re a good start, careful human editing is required to polish these computer-generated drafts and ensure accurate, high-quality transcripts.

So, how are we going to transcribe these hundreds of audio hours quickly and cost-effectively? People and computers need to collaborate.

And this brings us to our big announcement: NYPL Labs has built a brand new Open Transcript Editor to engage the public in helping to make our oral history collection accessible—one word at a time. The Open Transcript Editor is an interactive transcript editor allowing multiple people to perform the final layer of polish and proofreading on computer-generated transcripts. It’s a big undertaking and we’re inviting the public to pitch in and help correct computer-generated transcripts from our NYPL Community Oral History Project.

Visit to get started and help make this public treasure trove of NYC stories accessible.

Since we’re not the only ones tackling this challenge, the NYPL has teamed up with The Moth, a live storytelling organization with its own growing audio archive, for Together We Listen, a community program that invites our respective audiences to correct transcripts online using this new tool. To get our initial, computer-generated transcripts, both partners have sent their stories through Pop Up Archive, a speech-to-text service that works extensively with the public media and cultural heritage sectors. This project was made possible with generous support provided by the Knight Foundation Prototype Fund, an initiative of the John S. and James L. Knight Foundation. The Open Transcript Editor codebase itself is open source and is available to be used and further developed around other audio archives.

Anyone can contribute to this effort online and, for the first time ever, NYPL is also organizing in-person events at various library locations to encourage people to get together and help transcribe the oral histories for their own communities. We hope you can join us at an event in your neighborhood!

By editing transcripts, you're helping to create truly accurate transcripts in order to share hundreds of stories from the ongoing Community Oral History Project. Once they have been edited with agreement from enough contributors, completed transcripts will be available to read and download at, along with the audio recordings.

Pitch in now: Tune in and transcribe!   Join the Together We Listen project!

David Rosenthal: The Curious Case of the Outsourced CA

planet code4lib - Tue, 2016-04-05 15:00
I took part in the Digital Preservation of Federal Information Summit, a pre-meeting of the CNI Spring Membership Meeting. Preservation of government information is a topic that the LOCKSS Program has been concerned with for a long time; my first post on the topic was nine years ago. In the second part of the discussion I had to retract a proposal I made in the first part that had seemed obvious. The reasons why the obvious was in fact wrong are interesting. The explanation is below the fold.

One major difficulty in preserving the Federal Web presence is finding it. The idea that all Federal websites live under .gov or .mil, or even that somewhere in the Federal government is a complete, accurate and up-to-date list of them is just wrong. How does a Web crawler know that some random site in .com, .org or .us is actually part of the Federal Web presence?

(Image: Connecting to GeoTrust)

In a praiseworthy attempt to protect Federal websites from the all-powerful Chinese hackers and the dreaded terrorists, a decree has gone forth that, come December 31st this year, all of them must be HTTPS-only. An HTTPS website must have a certificate carrying a chain of signatures terminating in one from a root Certificate Authority (CA) that browsers trust. Here, for example, is the website of GeoTrust, a commercial CA that browsers trust. The certificate chain is:
  • is certified by:
  • GeoTrust Extended Validation SHA256 SSL CA, which is certified by:
  • GeoTrust Primary Certification Authority G3, which is a CA browsers trust.
The list of CAs that my Firefox trusts is here, all 188 of them. Note that because GeoTrust is in the list, it can certify its own website. I wrote about the issues around CAs back in 2013, notably that:
  • The browser trusts all of them equally.
  • The browser trusts CAs that the CAs on the list delegate trust to. Back in 2010, the EFF found more than 650 organizations that Internet Explorer and Firefox trusted.
  • Commercial CAs on the list, and CAs they delegate to, have regularly been found to be issuing false or insecure certificates.
Among the CAs on the list are agencies of many governments, such as the Dutch, Chinese, Hong Kong, and Japanese governments.

I assumed that the US government would be on the list too. My obvious idea was that government websites outside .gov and .mil could be found by crawling other domains looking for HTTPS sites whose certificate's signature chain ended at the US government root CA. This would solve a big problem for collecting and preserving Federal government information. Alas, I under-estimated the mania for outsourcing government functions to for-profit companies.
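The crawling idea can be sketched in a few lines. This is only a toy model, not a TLS client: each certificate is reduced to a subject name and an issuer name, the leaf name `site.example` and the root name `Hypothetical US Government Root CA` are placeholders, and `chain_root` and `ends_at_government_root` are illustrative function names. The intermediate and root names are the GeoTrust ones from the chain above.

```python
def chain_root(leaf, issued_by):
    """Follow issuer links from a leaf certificate up to its root.

    `issued_by` maps a certificate's subject name to its issuer's name.
    A name with no entry (or one that names itself) is treated as a root.
    """
    chain = [leaf]
    current = leaf
    while issued_by.get(current, current) != current:
        current = issued_by[current]
        if current in chain:
            raise ValueError("issuer loop in certificate chain")
        chain.append(current)
    return current, chain


def ends_at_government_root(leaf, issued_by, government_roots):
    """True when the chain for `leaf` terminates at one of the given roots."""
    root, _ = chain_root(leaf, issued_by)
    return root in government_roots


# Placeholder leaf name; intermediate and root names as listed in the post.
ISSUED_BY = {
    "site.example": "GeoTrust Extended Validation SHA256 SSL CA",
    "GeoTrust Extended Validation SHA256 SSL CA":
        "GeoTrust Primary Certification Authority G3",
}
# Hypothetical root name, used only to illustrate the membership test.
GOVERNMENT_ROOTS = {"Hypothetical US Government Root CA"}

root, chain = chain_root("site.example", ISSUED_BY)
print(" -> ".join(chain))
print(ends_at_government_root("site.example", ISSUED_BY, GOVERNMENT_ROOTS))
```

A crawler built this way only finds government sites if their chains really do end at a government root, which is the assumption that turns out to be false.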

(Image: Connecting to the LoC website)

As an example, visit Your browser will be redirected to, the home page of the Library of Congress website. It will display a green padlock icon, showing that the connection is secure and the browser has verified the certificate upon which the connection's security depends. So far so good. Now click on the green padlock and reveal the details of this verification. The certificate chain looks like:
  • * is certified by:
  • Entrust Certification Authority - L1K, which is certified by:
  • Entrust Root Certification Authority G2, which is on the browser's trusted CA list.
What this means is that the Library of Congress is paying a commercial CA to reassure citizens that their website is what it claims to be, and is secure.

(Image: Connecting to the DHS website)

If you visit you will be redirected to but, unlike the Library of Congress, you won't get the reassuring green padlock. There are two reasons:
  • The images in the page are delivered via HTTP, so:
    Your connection to this site is private, but someone on the network might be able to change the look of the page.
  • The browser doesn't like the HTTPS connection because:
    Your connection to is encrypted using an obsolete cipher suite.
But the browser has verified the signature chain. It looks like:
  • is certified by:
  • GeoTrust SSL CA - G3, which is certified by:
  • GeoTrust Global CA, which is on the browser's trusted CA list.
So the Department of Homeland Security is paying a different commercial CA to reassure citizens that their website is what it claims to be, and that it is not very secure.

Why is this? Is it because the Library of Congress believes that Entrust is more trustworthy than the US Government? I hope not; Entrust is one of the CAs whose delegated CAs have been caught issuing bogus certificates. It is because, as far as I can tell, the list of 188 CAs that browsers trust contains no US Government controlled CA.

So, your browser trusts the government of the People's Republic of China but not the government of the United States of America!

It isn't that the Federal government doesn't trust itself to run a secure root CA. There is a Federal root CA, the Common Policy Root CA, which is clearly regarded as secure since it is used to control access to Federal systems. But it isn't in the browser's list of trusted CAs, so it isn't any use for outward-facing services such as websites. If it were, Federal websites could be Federally certified, just as GeoTrust websites are GeoTrust certified.

In what world does this make sense? One in which there's money to be made selling services to the Federal government. By failing to follow the example of other governments by putting a root CA that they control into the list, the government arranges for funds to flow to for-profit companies who can protect the cash flow by lobbying, and arranging a warm welcome on the other side of the revolving door for the decision makers. All the for-profit CAs need to do is to make sure they stay in GSA's list of approved vendors, like DigiCert.

So, apart from the waste of taxpayer money, and the failure of my idea for finding government websites, what is the downside of this situation? CAs sometimes misbehave, as DigiNotar and StartSSL did. The result is a dispute between the guilty CAs and the browser vendors, resolved by removing them from the list, as StartSSL was, or by explicitly distrusting their root certificates, as DigiNotar's were. If this happened to one of the CAs Federal websites use, the dispute to which the Feds were not a party would result in the websites using the guilty CA becoming unavailable until the affected certificates could be replaced with new ones from a different CA in GSA's list. The browser vendors control the trusted CA list, so among other things they control citizens' access to government information. Since they're all based in the US, there would be good reasons why they'd be reluctant to remove a US government CA from the list.

I'm naturally reluctant to trust the Federal government, but I'm a whole lot more reluctant to trust for-profit CAs. It looks like I'm out of luck; the policy about public access to the Federal root CA is up on the Web:
Does the US government operate a publicly trusted certificate authority?

No, not as of early 2016, and this is unlikely to change in the near future.

The Federal PKI root is trusted by some browsers and operating systems, but is not contained in the Mozilla Trusted Root Program. The Mozilla Trusted Root Program is used by Firefox, as well as a wide variety of devices and operating systems. This means that the Federal PKI is not able to issue certificates for use in TLS/HTTPS that are trusted widely enough to secure a web service used by the general public.

The Federal PKI has an open application to the Mozilla Trusted Root Program. However, even if the Federal PKI’s application is accepted, it will take a significant amount of time for the Federal PKI’s root certificate to actually be shipped onto devices and propagate widely around the world.

DPLA: DPLA staff attend LDCX at Stanford University

planet code4lib - Tue, 2016-04-05 14:00

Four DPLA staff members (Mark Breedlove, Tom Johnson, Mark Matienzo, and I) recently attended LDCX at Stanford University. The annual conference is a chance for those in the library, archive, and museum (LAM) communities who work with technology to collaborate on solutions to common problems. Since most of the staff attending live on the East Coast and have been weathering the last dregs of winter, the trip injected a much needed dose of sunshine too!

LDCX is an “unconference” meaning the agenda is decided on by the attendees on the morning of the first day. Topics are ad hoc discussions and geared towards coming up with actual solutions using shared standards and infrastructures. This year’s conference saw work on emerging standards like the Portland Common Data Model (PCDM), the International Image Interoperability Framework (IIIF), along with more general and perennial topics like mentoring and training and the use of standard technologies like Ruby on Rails.

DPLA staff in particular were heavily involved in discussions on data quality and methods for analysis, the use of the DPLA’s Metadata Application Profile (MAP) and how it fits into the broader scheme of data models in the cultural heritage community, such as Europeana’s Data Model (EDM) as well as PCDM. Further discussions of DPLA MAP explored the possibilities of further extensions and adoption of DPLA MAP in other venues. (For more information about DPLA’s MAP, please see

In general, one of the best things about LDCX is the chance to meet face to face with colleagues and collaborators in the LAM technology community to discuss our challenges and work together on our shared opportunities.

Several other staff members joined the group starting on Wednesday for a series of meetings related to the Hydra open source repository system. Dan Cohen, Rachel Frick, and Audrey Altman are all part of the Hydra-in-a-Box project and came out to participate in team meetings. More information about the outcomes of that work will be posted soon to the Hydra-in-a-Box blog. Not only was the “Hybox” team in high-gear, but Thursday and Friday also saw meetings of two Hydra-related groups: the Hydra Developers Congress and the Hydra Power Steering Meeting. The Developers Congress is a regular event in the Hydra community, held in different locations around the country, that provides a chance for developers in the Hydra Open Source community to work together on code for Hydra and its components. The Hydra Power Steering meeting is an annual meeting of the Hydra Steering group dedicated to strategic planning for the community.

It was a productive and fun week for those DPLA-ers that attended — a chance to recharge our creative batteries for some, work through vital and important topics with trusted colleagues for others, and maybe to chase away the winter blues for a week as well!

Conal Tuohy: Visualizing Government Archives through Linked Data

planet code4lib - Tue, 2016-04-05 13:41

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written consists of a web application which I wrote using a programming language for data pipelines called XProc. The software itself is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.
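The shape of that transformation can be sketched in plain Python. This is not the XProc implementation, and it uses neither RDF nor SPARQL: the record fields (`id`, `type`, `name`, `related`) and the relation name `succeeded_by` are hypothetical stand-ins, not RIF-CS elements or CIDOC-CRM properties. It only illustrates re-expressing nested records as flat statements, pooling them into one graph, and deriving the two CSV tables from that pool.

```python
import csv
import io

# Hypothetical, simplified records; the field names are not RIF-CS elements.
RECORDS = [
    {"id": "agency-1", "type": "agency", "name": "Example Department",
     "related": [{"relation": "succeeded_by", "target": "agency-2"}]},
    {"id": "agency-2", "type": "agency", "name": "Example Ministry",
     "related": []},
]


def to_triples(record):
    """Re-express one nested record as flat (subject, predicate, object) statements."""
    subject = record["id"]
    yield (subject, "type", record["type"])
    yield (subject, "name", record["name"])
    for link in record["related"]:
        yield (subject, link["relation"], link["target"])


def to_csv(header, rows):
    """Serialize a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Pool every record's statements into one graph, then "query" it twice.
graph = [triple for record in RECORDS for triple in to_triples(record)]
subjects = {s for s, _, _ in graph}
names = {s: o for s, p, o in graph if p == "name"}
types = {s: o for s, p, o in graph if p == "type"}

# Table 1: one row per agency or function.
nodes_csv = to_csv(["id", "type", "name"],
                   [(s, types[s], names[s]) for s in sorted(subjects)])
# Table 2: an edge is any statement whose object is itself a node in the graph.
edges_csv = to_csv(["source", "relation", "target"],
                   [(s, p, o) for s, p, o in graph if o in subjects])
print(nodes_csv)
print(edges_csv)
```

In the real pipeline the pooling and querying are done by the SPARQL Graph Store rather than by list comprehensions; the sketch only shows why a flat set of statements is convenient to aggregate and query.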

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.
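For readers unfamiliar with OAI-PMH, the requests a harvester makes look roughly like the sketch below. The server address `https://oai.example.org/oai`, the metadata prefix `rif`, and the token value are assumptions for illustration, and `list_records_url` is a hypothetical helper, not part of PROV-RIF-SPARQL. The one protocol detail it encodes is that a follow-up request carries only the resumption token.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
    request carries only the verb and the token, not the metadataPrefix.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    elif metadata_prefix is not None:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    else:
        raise ValueError("need a metadata_prefix or a resumption_token")
    return base_url + "?" + urlencode(params)


# Placeholder server, prefix, and token values.
first = list_records_url("https://oai.example.org/oai", metadata_prefix="rif")
later = list_records_url("https://oai.example.org/oai", resumption_token="token-123")
print(first)
print(later)
```

A harvest repeats the second form, using the token returned with each page of records, until the server stops returning one.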

At this stage of the project, the RDF graph is only used internally to PROV, where it functions purely as an intermediate between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But later I have no doubt that the RDF data will be published directly as Linked Open Data, opening it up, and allowing it to be connected into a world-wide web of data.

Open Library Data Additions: Amazon Crawl: part bn

planet code4lib - Tue, 2016-04-05 13:39

Part bn of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Tod Robbins: Log 4213 by todrobbins

planet code4lib - Tue, 2016-04-05 03:37

Trying to fix #homebrew #tmate #askpass and feeling swallowed by my lack of #unix know-how. Ha! See:

District Dispatch: Learn and network at ALA’s National Policy Convening

planet code4lib - Mon, 2016-04-04 21:49

ALA will hold a National Policy Convening on April 12th and 13th in Washington, D.C.

Come join us for ALA’s first-ever National Policy Convening in Washington, D.C. on April 12-13. Given that a new Administration and Congress will be coming to town, it is timely to discuss and debate information policy and the public interest:

  • How can we advance creative and innovative learning for our children?
  • What can be done to advance entrepreneurship and small business in local communities?
  • What are the big issues for the digital age and how could the Library of Congress be best leveraged for the public interest?

Come join us for some answers! ALA’s president, Sari Feldman, will chair this National Policy Convening with featured policy players from the public, private and non-profit sectors, including:

Youth Engagement with Technology

  • Senator Angus King, I-Maine
  • Alan Fishel, Partner, Arent Fox, LLP
  • Tiffany Moore, Vice President, Congressional Affairs, Consumer Technology Association
  • Stephan Turnipseed, Executive Vice President & Chief Strategy Officer, Destination Imagination

Advancing Economic Opportunity in Communities

  • Maureen Conway, Vice President, Aspen Institute
  • Darryl L. DePriest, Chief Counsel, Office of Advocacy, U.S. Small Business Administration
  • Sari Feldman, President, American Library Association
  • Russell D. Greiff, Managing Director and General Partner, 1776 Ventures
  • Emily Robbins, Principal Associate, National League of Cities
  • Moderated by Larra Clark, Deputy Director, Office for Information Technology Policy, American Library Association

Future Directions for the Library of Congress

  • Robert Darnton, Carl H. Pforzheimer University Professor and University Librarian, Emeritus, Harvard University
  • Katie Oyama, Senior Policy Counsel, Google, Inc.
  • Sascha Meinrath, Palmer Chair in Telecommunications, The Pennsylvania State University
  • Moderated by Alan S. Inouye, Director, Office for Information Technology Policy, American Library Association.

Convening Schedule

Tuesday, April 12, 2016:

  • 5:00 – 5:30 pm           Registration and Check-In
  • 5:30 – 6:30 pm           Youth Engagement with Technology
  • 6:30 – 8:00 pm           Reception

Wednesday, April 13, 2016:

ALA appreciates in-kind and financial support from Arent Fox LLP, the Bill & Melinda Gates Foundation, and Google, Inc.

Register Now for the convening!

The post Learn and network at ALA’s National Policy Convening appeared first on District Dispatch.

Tod Robbins: Log 4209 by todrobbins

planet code4lib - Mon, 2016-04-04 21:14

Thinking through packaging #Adobe #Premiere and #AfterEffects projects at the end of a post-production phase.

  • What kinds of file checks (#MD5)?
  • Various ways of grouping media assets
  • Would sidecar #metadata files be helpful?
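One possible answer to the file-check question, as a minimal sketch: walk the project folder and write an MD5 manifest next to the media. The manifest name `manifest-md5.txt`, its line format, and the function names are assumptions for illustration, not an established packaging convention.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files are never read into memory whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_name="manifest-md5.txt"):
    """Write '<md5>  <relative path>' lines for every file under root."""
    lines = []
    for folder, _, files in sorted(os.walk(root)):
        for name in sorted(files):
            if name == manifest_name:
                continue  # do not checksum the manifest itself
            path = os.path.join(folder, name)
            lines.append(f"{file_md5(path)}  {os.path.relpath(path, root)}")
    manifest_path = os.path.join(root, manifest_name)
    with open(manifest_path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    return manifest_path


# Demonstration on a throwaway folder containing a single small file.
with tempfile.TemporaryDirectory() as project:
    with open(os.path.join(project, "clip.txt"), "wb") as sample:
        sample.write(b"hello")
    manifest = write_manifest(project)
    with open(manifest, encoding="utf-8") as result:
        manifest_text = result.read()
print(manifest_text)
```

A single manifest per project keeps verification simple; per-file sidecars would use the same `file_md5` helper but write one small file beside each asset.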

Nicole Engard: Bookmarks for April 4, 2016

planet code4lib - Mon, 2016-04-04 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Mattermost Mattermost is an open source, self-hosted Slack-alternative
  • mBlock Program your app, Arduino projects and robots by dragging & dropping
  • Fidus Writer Fidus Writer is an online collaborative editor especially made for academics who need to use citations and/or formulas.
  • Beek Social network for booklovers
  • Open eBooks Open eBooks is a partnership between Digital Public Library of America, The New York Public Library, and First Book, with content support from digital books distributor Baker & Taylor.

The post Bookmarks for April 4, 2016 appeared first on What I Learned Today....

Tod Robbins: Log 4208 by todrobbins

planet code4lib - Mon, 2016-04-04 18:42

Estimating transfer size/time for project archives. Needs to be 10-12TB to fit on three 4TB drives. #digipres
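The arithmetic behind that estimate can be sketched as follows. The sustained copy rate of 100 MB/s and the 0.9 usable fraction are assumptions chosen only to show the calculation; real drives and filesystems will differ.

```python
TB = 10**12  # decimal terabyte, as drive vendors count it
MB = 10**6


def fits(archive_bytes, drive_count, drive_bytes, usable_fraction=1.0):
    """True when the archive fits on the drives after allowing for overhead."""
    return archive_bytes <= drive_count * drive_bytes * usable_fraction


def transfer_hours(archive_bytes, bytes_per_second):
    """Hours needed to copy the archive at a constant rate."""
    return archive_bytes / bytes_per_second / 3600


# 12 TB on three 4 TB drives, with and without an assumed 10% overhead.
exact_fit = fits(12 * TB, 3, 4 * TB)
fit_with_overhead = fits(12 * TB, 3, 4 * TB, usable_fraction=0.9)
# Assumed sustained copy rate of 100 MB/s.
hours = transfer_hours(12 * TB, 100 * MB)
print(exact_fit, fit_with_overhead, round(hours, 1))
```

The top of the 10-12TB range only fits if every byte of the drives is usable, which is why trimming toward the lower end leaves a safer margin.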

Harvard Library Innovation Lab: How We’re Freeing the Law, Part 1: Books

planet code4lib - Mon, 2016-04-04 18:29

Adi Kamdar is a 1L at Harvard Law School and our embedded reporter on the Free the Law project. In this first post, he tracks the progress of a casebook through our scanning process from start to finish.

Harvard Law Library is one of the few collections with nearly every law reporter—roughly 40,000 books in total. The Free the Law project’s goal is to put the court decisions inside these volumes online, so anyone can access the precedents that shape the American legal system. Right now, the project is about halfway through, and within the next couple years they’ll have completed this monumental task.

But how exactly does a book become a byte? And what happens to these physical texts after they’ve been digitized?

Harvard Depository

The project begins each week with a book order—a 600 book order, to be exact, for law reporters that chronicle U.S. legal history since the country’s inception.

The law reporters are held in a sprawling warehouse 30 miles away from the law school—the Harvard Depository. With over 200,000 square feet of storage space, the climate-controlled Depository’s mission is pure efficiency: each book—and there are over nine million—is sorted and stored by size, rather than by name or author, in order to maximize space.

But it turns out law reporters are the packing peanuts of the Harvard Depository. When the reporters were first sent over to the warehouse, staff kept the volumes in the packaging room instead of storing them normally. Whenever they filled a cardboard box with other books for storage, they would throw in a reporter or two if there was any extra space that needed to be filled. No one thought the print reporters would be that useful anymore, so making them easily available in bulk was a low priority. Plus, the library had decided to cancel print runs of reporters in 2010, saving valuable shelf space, especially when digital copies were easily available online.

Because of this tactic, law reporters are spread all throughout the Depository. Asking for, say, Michigan’s volumes isn’t as simple as pulling out a handful of boxes—it’s a hunt.

Langdell Library

Every Wednesday, the team receives the 600 volumes of case reporters. They line the hallway of the ground floor of Langdell, filling shelf after shelf. One by one, each book is examined before it can be taken apart. (Some books—for example, volumes with marginalia—are flagged for archival purposes.) Each volume is then catalogued and given a unique barcode so it can be tracked throughout the whole process.


The books are then taken to the Prep Room where, ironically, they’re repaired before they’re chopped up. Damaged pages are taped together, book bindings are cut off by hand, and the remaining sheets are taken over to a guillotine. Once aligned, the operator has to press two separate buttons underneath the cutting table at the same time to make sure her hands aren’t under the blade. The result? Cleanly cut pages.


Next, the bundle of pages is hauled over to the Scanning Room. Here, six employees work overlapping shifts to ensure that pages are being scanned every day, 14 hours a day. Roughly 200 documents per minute are fed through the machine, which has a camera on top and bottom to image both sides of the page.


Now that the books are chopped and scanned, what happens to the physical pages? After all, the purpose of this project is to digitize the law. Plus, according to circulation records, very few people were reading the old reporters anyway. Rebinding them and keeping them in the library would be a waste of space, time, and money. But just in case anyone questions the authenticity of the scans, Harvard decided it would be valuable to have the physical copies accessible. So the project decided to vacuum seal the pages. Once the pages are jogged together (using a state-of-the-art paper-jogging machine) and placed back inside their book jacket, the volumes are taken over to one last room—where they will be put inside a meat packing device. Yes, it turns out that the meat industry unwittingly stumbled across the best way to preserve books. The machine shrink wraps the pages, maintaining the integrity of the volume while handily adding an extra layer of protection from mold, humidity, and bugs.

The re-bound volumes are then re-shelved, where they await being shipped off to…


Louisville, Kentucky

Because of the Harvard Law Library’s limited shelf capacity, the newly packaged pages will soon be loaded onto trucks and shipped down to Louisville.

Why Kentucky? Well, because of Underground Vaults & Storage, a company that has been storing all manner of things in Louisville’s old limestone mines. The sealed books will be stored there (where they will “fear no tornado, wildfire, flood or other natural disaster”) until the rare instance that they need to be recalled.

And that’s the story of these legal volumes—from one massive depository to another, by way of a guillotine, a scanner, and a meat packer. In our next post, we’ll explore what happens after they become digital images, and how Free the Law is building the largest free database of legal opinions in the world.

FOSS4Lib Recent Releases: veraPDF - 0.12.4

planet code4lib - Mon, 2016-04-04 14:12

Last updated April 4, 2016. Created by Peter Murray on April 4, 2016.

Package: veraPDF
Release Date: Thursday, March 31, 2016

Open Library Data Additions: Amazon Crawl: part 17

planet code4lib - Mon, 2016-04-04 12:45

Part 17 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 16

planet code4lib - Mon, 2016-04-04 06:13

Part 16 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part ch

planet code4lib - Mon, 2016-04-04 00:42

Part ch of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

LibUX: Activity Impact Score

planet code4lib - Sun, 2016-04-03 16:29

We get that page speed matters, but practically optimizing a website for performance is hardly so straightforward as “let’s make this shit blazing fast.” There’s more to it. We know, for instance, that the order in which elements load may matter more than just the total page load time – but even this can be pretty hard. These efforts accrue real technical debt, which means they cost real money. For folks where budgets and talents and times are constrained, we need to be able to determine where cranking that speedometer has the most bang for its buck, where speed matters most, and where it doesn’t (gasp).

The Activity Impact Score, introduced by Tammy Everts for Soasta, measures what impact page speed has on the length of time people spend on your site. It compares a performance metric [1] like load time in milliseconds with session length, because session length can be a useful indicator that people are consuming content: longer sessions mean likelier discovery of new events, new and old services, cool repos, archives, all the myriad things, let’s say, that libraries do and their patrons forget.

Pages are grouped into content types [2] (lists, events, searches, landing pages), and the proportion of the overall requests associated with each group is used with the Spearman rank correlation between the group’s load times and the user’s session length to calculate an activity impact score [3] on a scale from -1 (low impact) to 1 (high impact).
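To make the calculation a little more concrete, here is a minimal Python sketch. It is not Soasta's actual formula, which this post does not spell out. The Spearman rank correlation itself is standard, but the way it is combined with the request share here (negating the correlation so that "slower pages, shorter sessions" comes out positive, then multiplying by the group's share of requests) is purely an assumption for illustration. The function names and the sample numbers are made up.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation in one series, so no measurable relationship
    return cov / (var_x * var_y) ** 0.5


def activity_impact(load_ms, session_s, request_share):
    # ASSUMPTION: the weighting below is hypothetical, not the published method.
    # Negated so that slower loads paired with shorter sessions score positive.
    return -spearman(load_ms, session_s) * request_share


# Made-up sessions for one hypothetical page group.
load_ms = [800, 1200, 2500, 4000]    # page load time per session
session_s = [300, 240, 90, 60]       # session length in seconds
score = activity_impact(load_ms, session_s, request_share=0.5)
print(score)
```

Because it works on ranks rather than raw values, Spearman's correlation only asks whether slower loads tend to go with shorter sessions, not whether the relationship is linear, which suits skewed data like load times.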

The bar chart represents relative activity impact scores and the line represents load time in milliseconds.

Higher scores (the homepage, search, and subject guides) demonstrate greater correlation between page speed and session length. So we can use the example above to determine that our “about” page group (informational pages where I threw in parking, policies, and the like) has a relatively low activity impact score despite fast load times, so these kinds of pages don’t benefit all that much from really cranking it up.

And although we might hear the carrion call of those databases, baking pitifully in the lag desert, the score of our homepage proves its speed has way more impact on the length of time people are hanging around. Our time, then, is better off doting there, leaving poorer scorers to choke in the dust a little bit longer.

  1. I wrote a thing for Weave: Journal of Library User Experience about Meaningfully Judging Performance in Terms of User Experience.
  2. With some headscratching I managed to group pages with Google Analytics, but I couldn’t say whether a tool like mPulse (by the folks who brought you the Activity Impact Score) wouldn’t be easier.
  3. The activity impact score uses a method similar to that of the conversion impact score, which Tammy explains better.

The post Activity Impact Score appeared first on LibUX.

Patrick Hochstenbach: Brush Inking Exercise

planet code4lib - Sun, 2016-04-03 11:00
Filed under: portaits, Sketchbook Tagged: brush, girl, ink, pen, portrait, ski, sktchy

Open Library Data Additions: Amazon Crawl: part ib

planet code4lib - Sun, 2016-04-03 07:16

Part ib of Amazon crawl.

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Patrick Hochstenbach: Brush Inking Exercise

planet code4lib - Sat, 2016-04-02 07:13
Filed under: portaits, Sketchbook Tagged: art, brush, illustration, ink, portrait, sktchy

Ed Summers: Follow

planet code4lib - Sat, 2016-04-02 04:00

I was just testing a new release of twarc and was alerted to a test failure from Travis that seems to point to a change in Twitter’s API. I thought I would quickly write it up because it seemed like an interesting bit of Twitter API arcana, as well as an unexpected use of black box testing to detect changes in social media platforms like Twitter. It also highlights (for me) the benefit of sometimes being lazy and not using mock objects in automated tests.

The specific test that is failing is test_follow which uses Twitter’s follow streaming API request to watch for tweets from a handful of major news organizations like @guardian, @nytimes, @cnnbreak, etc. Once the test gets a tweet it examines the JSON and simply verifies that it came from one of the followed accounts. I added the test to make sure twarc was doing things correctly more than to test Twitter’s API.

The weird thing is that the test has recently started to fail because sometimes it gets back tweets that don’t appear to be sent from any of the followed accounts. The tweets don’t even appear to be retweets or quotes of those accounts either. As you can see from the JSON included at the bottom of the test failure, the tweet that is returned is a retweet from @Margaritin22 (971990761) of a tweet by @nytimesbusiness (1754641), but the test wasn’t following either of those users…but it was following @nytimes (807095).
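The check described above can be sketched roughly as follows. This is not twarc's actual test_follow code. It is a small standalone illustration that assumes tweets arrive as Twitter v1.1 JSON parsed into Python dicts, and the helper names are hypothetical. The sample tweet is a heavily trimmed stand-in built only from the user ids mentioned above, not the real payload from the test failure.

```python
def related_user_ids(tweet):
    """Collect the ids of every account a tweet directly involves."""
    ids = {tweet["user"]["id_str"]}
    # The original author of a retweeted or quoted tweet.
    for key in ("retweeted_status", "quoted_status"):
        inner = tweet.get(key)
        if inner:
            ids.add(inner["user"]["id_str"])
    # The account being replied to, if any.
    reply_to = tweet.get("in_reply_to_user_id_str")
    if reply_to:
        ids.add(reply_to)
    # Accounts mentioned in the text.
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        ids.add(mention["id_str"])
    return ids


def involves_followed(tweet, followed_ids):
    """True if the tweet involves at least one followed account."""
    return bool(related_user_ids(tweet) & set(followed_ids))


# Hypothetical, trimmed stand-in for the tweet in the test failure.
tweet = {
    "user": {"id_str": "971990761", "screen_name": "Margaritin22"},
    "retweeted_status": {
        "user": {"id_str": "1754641", "screen_name": "nytimesbusiness"}
    },
    "entities": {
        "user_mentions": [{"id_str": "1754641", "screen_name": "nytimesbusiness"}]
    },
}

followed = ["807095"]  # @nytimes; the real test follows several more accounts
print(involves_followed(tweet, followed))
```

With only @nytimes (807095) in the followed list, none of the ids found in this tweet match, which is the kind of mismatch that trips the test.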

So it looks as if Twitter’s follow API now has some (new?) logic that provides @nytimesbusiness tweets because I am following tweets from @nytimes. Perhaps this is some new marketing feature that allows media outlets like the New York Times to pay Twitter to bundle accounts together for promotional purposes? Or perhaps there is an algorithm at play that assumes that since I am following @nytimes I would be interested in @nytimesbusiness?

The test has only been in use since December 2015, which really isn’t too long ago. Plus the tests normally only run when I’m actively developing twarc or when I push to GitHub. So perhaps this behavior isn’t new and it is just luck that the test hasn’t failed until now?

But this follow behavior feels a bit like some other changes that Twitter has made to the timeline where tweets are suggested based on popularity. If anyone has any insight into this please let me know.

For the moment the test has stopped following @nytimes and it appears to be working again…at least for now. Here’s a musical interlude since you made it this far:

