news aggregator

Summers, Ed: Paris Review Interviews and Wikipedia

planet code4lib - Mon, 2014-02-03 03:29

I was recently reading an amusing piece by David Dobbs about William Faulkner being a tough interview. Dobbs has been working through the Paris Review archive of interviews which are available on the Web. The list of authors is really astonishing, and the interviews are great examples of longform writing on the Web.

The 1965 interview with William S. Burroughs really blew me away. So much so that I got to wondering how many Wikipedia articles reference these interviews.

A few years ago, I experimented with a site called Linkypedia for visualizing how a particular website is referenced on Wikipedia. It’s actually pretty easy to write a script to see what Wikipedia articles point at a website, and I’ve done it enough times that it was convenient to wrap it up in a little Python module.

from wplinks import extlinks

# extlinks yields (Wikipedia article URL, external link URL) pairs
for src, target in extlinks('http://www.theparisreview.org/interviews'):
    print src, target

But I wanted to get a picture not only of what Wikipedia articles pointed at the Paris Review, but also Paris Review interviews which were not referenced in Wikipedia. So I wrote a little crawler that collected all the Paris Review interviews, and then figured out which ones were pointed at by English Wikipedia.
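
For the Wikipedia half of that check, the MediaWiki API's exturlusage list is one way to ask which English Wikipedia articles link out to a given site. The following is a minimal sketch along those lines, not the crawler used for the post; it assumes the third-party requests library, and the exact query string and pagination handling are guesses that may need adjusting:

import requests

API = "https://en.wikipedia.org/w/api.php"

def articles_linking_to(fragment):
    # list=exturlusage searches Wikipedia's external-links table for URLs
    # containing the given fragment; following the API's continuation
    # mechanism for result sets larger than 500 is omitted from this sketch
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": fragment,
        "eulimit": "500",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    for hit in data["query"]["exturlusage"]:
        yield hit["title"], hit["url"]

for title, url in articles_linking_to("theparisreview.org/interviews"):
    print(title + " -> " + url)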

This was also an excuse to learn about JSON-LD, which became a W3C Recommendation a few weeks ago. I wanted to use JSON-LD to serialize the results of my crawling as an RDF graph so I could visualize the connections between authors, their interviews, and each other (via influence links that can be found on dbpedia) using D3’s Force Layout. Here’s a little portion of the larger graph, which you can find by clicking on it.

As you can see it’s a bit of a hairball. If you want to have a go at visualizing the data, the JSON-LD can be found here. The blue nodes are Wikipedia articles, the white and red nodes are Paris Review interviews. The red ones are interviews that are not yet linked to from Wikipedia. 322 of the 362 interviews are already linked to Wikipedia. Here is the list of 40 that still need to be linked, in the unlikely event that you are a Wikipedian looking for something to do:

I ran into my friend Dan over coffee who sketched out a better way to visualize the relationships between the writers, the interviews and the time periods. Might be a good excuse to get a bit more familiar with D3 …
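
If you would like to experiment with a D3 rendering yourself, here is a rough sketch (an illustration added here, not code from the post) that flattens a simple JSON-LD @graph into the nodes-and-links shape D3's force layout expects; the input filename is a placeholder, and real data may call for a proper JSON-LD processor rather than this naive walk:

import json

def to_force_layout(doc):
    ids = {}            # @id -> node index
    nodes, links = [], []

    def node(ref):
        if ref not in ids:
            ids[ref] = len(nodes)
            nodes.append({"id": ref})
        return ids[ref]

    for resource in doc.get("@graph", []):
        source = node(resource["@id"])
        for prop, value in resource.items():
            if prop.startswith("@"):
                continue
            values = value if isinstance(value, list) else [value]
            for v in values:
                if isinstance(v, dict) and "@id" in v:
                    # every object reference becomes an edge labeled
                    # with the predicate connecting the two nodes
                    links.append({"source": source,
                                  "target": node(v["@id"]),
                                  "predicate": prop})
    return {"nodes": nodes, "links": links}

# "paris-review.jsonld" is a hypothetical local copy of the published data
with open("paris-review.jsonld") as fh:
    print(json.dumps(to_force_layout(json.load(fh)), indent=2))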

Hochstenbach, Patrick: The Conference Call

planet code4lib - Sun, 2014-02-02 20:21
Filed under: Comics Tagged: comic, Illustrator, inking, Penguin, Photoshop, webcomic

Brantley, Peter: Story as code: Books in Browsers IV

planet code4lib - Sun, 2014-02-02 02:03

Sometimes you are in a position to create something and give it a name without too much consideration; before long, the name doesn’t mean what you thought it once did. Yet sometimes, with luck and grace, the name becomes more fitting than you could have ever realized.

For several years running, I’ve organized a small conference called “Books in Browsers” (BiB), now in concert with the Frankfurt Book Fair, that is focused on the design and development of next generation books and other publications. Its premise has always been that digital technology lowers barriers for entry into new forms of publishing, enabling a range of experiences in digital contexts that were not previously possible.

Although that’s a neat encapsulation, in practice the conference’s agenda has changed markedly over the first four years it has run. In the first couple of years, BiB concerned itself with “hacking publishing,” which was best reflected in the cluster of publishing startup firms then emerging. These new companies were attempting to engage with traditional companies, developing tools that took advantage of network affordances to enable more efficient online selling, discovery, and early forms of social reading. They were trying to engage the existing industry; by and large, they failed.

They failed for a variety of reasons: misunderstandings between technology-centered firms running head-on against large companies’ practices created to-be-expected impedances. As a consequence, by BiB’s third year, it was evident that the era of engagement with traditional trade publishing had come and gone, and there was a palpable sense that the design and technology communities interested in publishing were on the threshold of new ways of creating literature. BiB 2012 was a conference awash in the excitement of edgy hacks – for example, merging voice recognition with Google Docs, using a git software repository on the back end, demonstrating the feasibility of new forms of composing literature that possessed built-in revision control.

With BiB IV in October of 2013, that era of a new engagement with literature seems to have arrived. This is not to state that what emerged out of the conference now defines, or even will necessarily define in the future, what we know of as literature. I certainly do not expect the zeitgeist of BiB IV to be manifest in an Amazon shopping experience anytime soon. But, in a way, BiB IV pointed to a future of literature that extends beyond Amazon – it makes it obvious, in other words, that there can be literature that defines itself in terms far different than what we understand today.

Digital craft.

BiB in 2013 brought a wide range of people together – artists, novelists, technologists, and those straddling these boundaries – and what the gathering demonstrated more than anything else was people creating new publishing systems with web technologies without citation towards traditional book publishing. People demonstrated new forms for creating and discussing literature, divorced from a historical methodology for creating books.

A straightforward example of this was the Booksprint work from Adam Hyde. Booksprints bring together a small number of people to produce a print and digital book in a very limited number of days – a week or less – usually in a niche or market that could not be addressed in traditional publishing. Thus, Adam Hyde coordinated the production of an open access book on negotiating oil contracts with the input of developing nations dependent on extractive industry, as well as legal experts; a work that has been translated into many languages, and is used as part of the induction process for new employees at one of the largest oil companies in the world. A guide to understanding and negotiating mining contracts was produced later in the year, using the same process, within five days. Traditional publishing does not have a place for these pamphlets.

To the extent that there was a dominant theme at BiB in 2013, it was one of craft. There were many discussions of the forms of fine control that book designers exercised in the past, in both the pre-digital and early digital workstation eras using QuarkXPress and Adobe InDesign. This was made tangible in the compelling presentations by the European designers who attended BiB, such as Etienne Mineur, whose volumique highlights alluring mergers of physical and digital interfaces, where a game piece, for instance, controls a digital environment or story. This carefree breaking of boundaries between physical and digital, conducted without the pretense of novelty, demonstrated maturity in design that has not previously existed.

Craft invokes the importance of art and aesthetics as part of the message of literature. In a digital era, the challenge is to integrate that sense into new forms of production using digital technology: not via traditional manual hand-work, but by building and working with digital tools to effect an individual expression of artistic intent. Ultimately, digital craft represents a new manner of engaging the reader as a participant, directly and actively.

Gaming literature.

Digital design and expression does not imply, however, that new forms of literature need to be “gamified” – in other words, for elements of compulsory interaction to be embedded within stories. The new design ethos does not require readers to help produce a narrative by constructing a path through a creative work that is critical to the definition of that work.

What is changing with digital design is that increasingly content is perceived to be procedural, in a programming language or computer markup sense. It has to be, inherently, because it is being developed with computer tools. Artists, writers, and engineers are beginning to think of their output as retaining that programmatic nature, not just being produced with it. Computer code is recognized as the driver for shaping and delivering content.

When content is recognized from conception as code, it becomes straightforward, indeed, elemental, for a reader to have the ability to experience a story in different contexts and in different forms – on a desktop computer, in one presentation; or as a mobile experience within a city, on a phone; or as a game through a console. The reader can choose how they wish to consume, engage, and potentially – participate – in literature.

Literature is an ongoing project.

We live in a world where the literature of our past inevitably becomes a context for the literature of the future. How we produce literature is an ongoing project of our human society, and that exploration is evolving alongside our own understanding of ourselves.

As the digital network becomes more pervasive in our lives, there is an innate change in how we understand what our engagement with that network should be. As my colleague and friend James Bridle, an artist in London, observes, there is a growing dialogue between us and the network that we have constructed. If we attune ourselves, we can appreciate that the network is talking back to us, speaking to us in the language we have given it: our stories increasingly can tell themselves. The network, as it attaches itself to more and more of our lived experience, alters our expectations of the environment, shaping and conditioning our own artistic expressions.

We have become something different than we used to be. We are no longer simply artists using digital tools, but artists in a conversation with a world that our network is itself transforming, writing back to us through sensors, software bots, and an increasingly subtle mingling of digital and physical interactions. We take the network for granted; it brings a presence into our world that we assume is part of how we are experiencing our life. You have that experience when you walk around with a mobile phone. Whether you are conscious of it or not, you are connected within an environment that extends far beyond yourself, and are able to reach into it and intervene with it.

This is a new way of perceiving the world, and it must change how we think of literature. It changes how we understand time. Interaction with the omnipresent network introduces a temporal element that I think we need to consider. In other words, we must think about how we live within time’s fabric. This is part of the story that each of us tells in the world by being part of it: we must grasp the nexus between our sense of time and living in a world of interconnected devices and people, and how that punctuates the stories we tell and that we are part of. That is the humanist project of the 21st Century, and understanding it is the task we must complete to emerge beyond it.

The authoritative voice.

We have been living through a couple of centuries of human history where we have understood the authorial voice to be located in the role of the producer of a narrative. Yet, surely our understanding of that is shifting away from the author or authors as a unitary point of creating a reality about the world. Readers are merging into the authoritative voice of the narrative, growing into the story. As readers, we are increasingly choosing the stories we want to encounter by choosing the story elements that are presented to us.

At the start of our preparations for Books in Browsers in 2013, my colleague Kat Meyer suggested a tagline carrying a meme from The Fellowship of the Ring, “One doesn’t simply walk into Mordor,” therefore, “One does not simply put a book in a browser.” I protested, “Isn’t that exactly the point, that one does simply put the book in the browser?” She noted that the best thing that Books in Browsers has demonstrated is how much design craft is required to deliver storytelling in a browser on the network. There was evident truth in this, and it became our theme for BiB IV. “Books in Browsers,” as a conference name, has aged better than I feared it might.

We are realizing that beyond the book, the reader is moving onto the network as well. Our understanding of the world has changed as the network has grown to the point where we can be in a conversation with ourselves and the world the network has created. In his talk, James Bridle observed that perhaps this is the network that we had to create – that we were compelled to build – because it is what we need. Our literature is an expression of that conversation about ourselves, and we will begin to see how our understanding of our stories has changed as we learn to perceive ourselves within them in greater clarity.

PeerLibrary: Version 0.1 released

planet code4lib - Sat, 2014-02-01 23:25

We released a 0.1 version of PeerLibrary. Check it out live. It is very much a beta version, with many planned features still missing, but it should give you a taste of what is coming. Please give us feedback.

Hellman, Eric: Crowd-Frauding: Why the Internet is Fake

planet code4lib - Sat, 2014-02-01 22:35
Power in human societies derives from the ability to get people to act together. Armies, religions, governments, and businesses have dominated societies using weapons, beliefs, laws and money to exert collective effort. In modern societies, mass media have emerged as a similar organizing power.

A new kind of collective organization, mediated by the internet, code and connections, is emerging as another avenue of power. It's no longer ridiculous to think that social networks, crowd-sourcing and crowd-funding could achieve the social consensus, action and compulsion that were the province of governments, armies and religions. As the founder of a crowd funding site for ebooks, I'm naturally optimistic that the crowd, connected by the social internet, will be an immensely powerful force for good.

I'm also continually reminded that the bad guys will use the crowd, too. And it won't be pretty.

In December, our site, Unglue.it, began to get a surge of new users. But there was something fishy. The registrations were all from hotmail, outlook, and various dodgy-sounding email hosts. The names being registered were things like Linette, Ophelia, Rhys, Deanne, Agueda, Harvey, Darcy, Eleanore and Margene. Nothing against the Harveys of the world, but those didn't look like user handles. Of course it was registration bots coming from many different IP addresses.

But they were stupid registration bots – they never complete the registration, so the fake accounts can't leave comments or anything. It wasn't causing us any harm except that it was inflating our user numbers. It was mystifying. So I started studying why bots around the world might start making inactive accounts on our site.

And that's how I learned about the dark side of the crowd-force. The best example of this is a program called Jingling, also known as FlowSpirit. It's been around for 5 years or so.

Jingling is an example of a "cooperative traffic generation" tool. It's software-organized crime. Crowd-frauding, if you will.

It works like magic. You download the Jingling software and install it on your computer. You then enter the URLs for four webpages that you want to promote. (or more, if you have a good internet connection.)  Although the user interface is in Chinese, you can get annotated instructions in English on YouTube or websites like theBot.net. Once you've activated Jingling, the webpages you want to promote start getting hundreds of visitors from around the world. The visitors look real, they click around your page, they click on the advertisements, they register accounts on websites, they click "like" buttons and follow you on Twitter.

Meanwhile, your computer starts running a website-visiting, ad-clicking daemon. It visits websites specified by other Jingling users. It clicks ads, registers on sites, watches videos, makes spam comments and plays games. In short, your computer has become part of a botnet. You get paid for your participation with web traffic. What you thought was something innocuous to increase your Alexa ranking has turned you into a foot-soldier in a software-organized crime syndicate. If you forgot to run it in a sandbox, you might be running other programs as well. And who knows what else.

The thing that makes cooperative traffic generation so difficult to detect is that the advertising is really being advertised. The only problem for advertisers is that they're paying to be advertised to robots, and robots do everything except buy stuff. The internet ad networks work hard to battle this sort of click fraud, but they have incentives to do a middling job of it. Ad networks get a cut of those ad dollars, after all.

The crowd wants to make money and organizes via the internet to shake down the merchants who think they're sponsoring content. Turns out, content isn't king, content is cattle.

Jingling is by no means alone; there are all sorts of bots you can acquire for "free". Diversity of bots is enforced because the click fraud countermeasures only attack the most popular bots; new bots are being constantly developed and improved.

What does this mean for advertising, ad-supported websites, and the internet in general?

It means that the internet rich will get richer and power will concentrate. Let me explain.

I used to run my own mail server. It was a small process on one of my old machines. I was a small independent internet entity. The NSA couldn't scan my emails and I could control my mail service. But as spammers cranked up their assault, it became more and more complicated to run a mail server. At first, I could block some bad IP addresses. But when dictionary attacks could be run by script kiddies, running my own email server got to be a real drag. And because I was nobody, other people running mail servers started blocking the email I tried to send. So I gave up and now I let big brother Google run my email. And Google gets to decide whether email reaches me or gets blocked as spam. That doesn't make me happy, and I still get a fair amount of spam. (Somehow the SEO and traffic generation scammers still get through!)

It's probably the same way that kingdoms and countries arose. Farmers farmed and hunters hunted until some bad guys started making trouble. People accepted these kings and armies because it was too much trouble for farmers and hunters to deal with the bad guys on their own. But sooner or later the bad guys figured out how to be the kings. Power concentrated, the rich got richer.

So with the crowd-frauders attacking advertising, the small advertiser will shy away from most publishers except for the least evil ones – Google or maybe Facebook. Ad networks will become less and less efficient because of the expense of dealing with click-fraud. The rest of the internet will become fake as collateral damage. Do you think you know how many users you have? Think again, because half of them are already robots, and soon it will be 90%. Do you think you know how many visitors you have? Sorry, 60% of them are already robots.

Sooner or later the VCs will catch on to the fact that companies they've funded are chasing after bot-clicks and bot views. They'll start demanding real business models; those of us older than forty may remember those from college. And maybe reality will have a renaissance. But more likely, the absolute power of Google, Amazon, Apple and the rest will corrupt them absolutely and we'll suffer through internet centuries of dark ages (5 solar years at least) before the arrival of an internet enlightenment.

Until then, let's not give in to the dark side of the force, OK?

Notes:
  1. Last year, I wrote about some strange Twitter bots. I now think it's likely that the encoded messages I saw are part of a cooperative traffic generation scheme. If you're trying to orchestrate a vast network of click-bots, what better way to communicate with them than twitter?
  2. There are now disposable email hosts that will autoclick confirmation links. These email hosts are the registration spammer's best friends. A list is here.

ALA Equitable Access to Electronic Content: Save the date for the 2014 national freedom of information day

planet code4lib - Sat, 2014-02-01 06:59

ALA Immediate Past President Maureen Sullivan speaking at the 2013 Freedom of Information Day.

Mark your calendars: the 16th annual National Freedom of Information Day conference will be held Friday, March 14, 2014, at the Knight Conference Center at the Newseum in Washington, D.C. The Newseum Institute’s annual conference brings together librarians, nonprofits, government officials, lawyers, journalists and educators to discuss freedom of information and open records.  As part of the conference, the American Library Association will announce the James Madison Award recipient, an award presented to individuals or groups that have championed, protected and promoted public access to government information and the public’s right to know.
Hosted annually to commemorate the March 16th birth date of James Madison, the conference is conducted in partnership with the American Library Association, the Project on Government Oversight, and the Reporters Committee for Freedom of the Press. Last year, the American Library Association posthumously awarded activist Aaron Swartz the Madison Award for his dedication to promoting and protecting public access to research and government information.

The post Save the date for the 2014 national freedom of information day appeared first on District Dispatch.

Farkas, Meredith: Don’t go it alone. On the benefits of collaboration.

planet code4lib - Fri, 2014-01-31 22:14

I don’t have all the answers. There, I said it! I’m a pretty smart person who did well in school and has been relatively successful in her career, but I don’t consider myself an “expert” in anything. However, when you teach, write a column for a major magazine in your profession, or even express yourself on a blog (like this), people come to see you as a person to go to to get answers. And while I have answers, mine are rarely definitive. I’m a huge believer in information sharing, collaboration, and querying the hive mind. I believe we all hold little bits of “expertise” based on our experiences, and that any project benefits from other sets of eyes, ears, hands, and ideas. This blog, the Library Success Wiki, and the Oregon Library Association (OLA) Mentoring Program I helped build and now manage are testaments to my belief that knowledge-sharing makes us all better librarians.

I remember seven years ago, when I was told I would have to share an office with the not-yet-hired Electronic Resources Librarian, I was less than thrilled. I thought it would kill my concentration and make web design and video tutorial-making projects more difficult. I laugh when I think back to my initial attitude, because the two years that I shared an office with Josh Petrusa were among the most exciting and productive of my time at Norwich. We were constantly bouncing ideas off each other and making our work better because of it. Together, we were so much more effective than individually, because we each brought our own unique ideas to the table. Whatever productivity we lost through chatting was more than gained back in the improved quality of our work. 

This attitude can also be applied to library instruction. Contrary to what students might think, we librarians don’t have all the answers. We do not have the level of expertise in other subjects that the disciplinary faculty do, or even some of the students. Yesterday, I was working with senior anthropology majors and asked them to spend a few minutes brainstorming keywords on their research topics. I then asked them to pass that list on to the person next to them to see if they could come up with additional terms. In every case, they were able to add additional terms. Sometimes when you’re close to something, it’s easy to miss obvious search terms. I then had them do a jigsaw-type exercise where they searched in groups for scholarly sources on one of their topics in different databases. They then had to come up and demonstrate what they did, with me, their instructor, and their classmates making additional suggestions. In doing that, they first learned tips and tricks from their small group, and then from the rest of the class when they demoed their searching. Sure, I could have lectured about brainstorming keywords and how to use Anthropology Plus, JSTOR, and the library catalog, but they came up with a lot of tips I wouldn’t have thought of, not being immersed in archaeology myself. Letting them bring their own pieces of expertise to the fore benefitted everyone. 

I also think collaboration and talking about our teaching can make us better instructors. In academe, teaching is often thought of as a solitary thing — you create your own syllabus, reading list, lectures, and assignments. You’re supposed to be able to do this on your own. And I think that idea has trickled down to many libraries as well. The idea of running your teaching ideas by someone else or having someone observe your teaching is antithetical in some places. While there’s always someone watching us teach, it’s rarely our colleagues. In my last year at Norwich, I instituted a peer-observation program where each of us watched two other people teach and then met afterwards to debrief. I learned so much from seeing how other people approached similar teaching situations! Last year, while I was still Head of Instruction, I created a reflective peer coaching program where librarians paired up to discuss their plans for an instruction session and then met after the session to debrief (based on the model of a brilliant fellow Oregonian, Dale Vidmar). The majority of the instruction librarians participated in the voluntary program and really enjoyed it. We often are so busy that we don’t even have the time to reflect individually on how an instruction session went, but we’re never going to improve without that reflective practice. I’d suggest that we’d improve even more if we tied reflective practice in with group or paired discussions. We all can learn so much from the ideas and approaches of our colleagues.

Back in early 2012, I had an idea for an online library self-help system that would help students get to just the small piece of instructional content they needed to move forward in their research. It wouldn’t be like a typical research tutorial because it would be focused on students who are already doing their research and have a specific need, much like they do at the reference desk. It was a good idea, sure, but it was made better by every person I collaborated with to make it a reality. From the instructional designer who took my description and gave me some mockups of home page designs to look at, to the web programmer who shifted my thinking about some of the behind-the-scenes stuff, to my instructional design team who helped create the information architecture and content, to my colleague who brilliantly suggested a Site Map after it had all been built, every bit of feedback made Library DIY a better product in the end. I don’t think it would have been terrible had I done it largely alone, but it certainly wouldn’t have been as awesome as it is now.

I’m not going to say collaboration is always comfortable. It makes you vulnerable. It requires compromise. Criticisms of your ideas or approaches can sting. It can be physically painful for those of us who tend towards being control freaks (and there are a lot of us in this profession). But it’s worth the pain to have a better product in the end. I’d also argue that in addition to making me a better librarian, collaboration has made me a better person. 

Libraries will be more successful and better places to work if they can build structures that support collaboration, team-building, and a learning culture. This requires creating an environment that encourages experimentation and risk-taking and doesn’t penalize failure. People need to feel comfortable putting ideas out there. They need to know their colleagues aren’t going to judge them and are going to offer them positive and constructive advice. Iris Jastram wrote about people who fixate on flaws and only offer negativity. Those people are death to a learning culture. Folks can’t feel like they’re going up to a firing squad every time they need to get feedback on something. I’ve got this little pocket of wonderful in my current job, with four other librarians whose offices we are constantly going back-and-forth to to share ideas and get feedback. They make me so much better at my job.

May you all have such pockets of wonderful.

 

Photo credit: Honey Bees in Hive by Bob Gutowski on Flickr (cc license: Attribution Non-Commercial ShareAlike 2.0)

Morgan, Eric Lease: RDF serializations

planet code4lib - Fri, 2014-01-31 18:07

RDF can be expressed in many different formats, called “serializations”.

RDF (Resource Description Framework) is a conceptual data model made up of “sentences” called triples — subjects, predicates, and objects. Subjects are expected to be URIs. Objects are expected to be URIs or string literals (think words, phrases, or numbers). Predicates are “verbs” establishing relationships between the subjects and the objects. Each triple is intended to denote a specific fact.
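
As a concrete illustration (an addition here, not part of the original essay), the same data model can be built up in code before worrying about any particular serialization. This sketch uses Python's rdflib to assert the statements that all of the serializations below encode:

from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
work = URIRef("http://en.wikipedia.org/wiki/Declaration_of_Independence")
author = URIRef("http://id.loc.gov/authorities/names/n79089957")

# each add() records one subject-predicate-object "sentence"
g.add((work, DCTERMS.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.gender, Literal("male")))

for triple in g:
    print(triple)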

When the idea of the Semantic Web was first articulated, XML was the predominant data structure of the time. It was seen as a way to encapsulate data that was readable by both humans and computers. Like any data structure, XML has both its advantages as well as disadvantages. On one hand it is easy to determine whether or not XML files are well-formed, meaning they are syntactically correct. Given a DTD, or better yet, an XML schema, it is also easy to determine whether or not an XML file is valid — meaning does it contain the necessary XML elements and attributes, and are they arranged and used in the agreed-upon manner. XML also lends itself to transformations into other plain text documents through the generic, platform-independent XSLT (Extensible Stylesheet Language Transformation) process. Consequently, RDF was originally manifested — made real and “serialized” — through the use of RDF/XML. The example of RDF at the beginning of the Guidebook was an RDF/XML serialization:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
    <dcterms:creator>
      <foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
        <foaf:gender>male</foaf:gender>
      </foaf:Person>
    </dcterms:creator>
  </rdf:Description>
</rdf:RDF>

This RDF can be illustrated, quite literally, with the graph below:

On the other hand, XML is, almost by definition, verbose. Element names are expected to be human-readable and meaningful, not obtuse or opaque. The need to escape special characters (&, <, >, “, and ‘) and to use entities only adds to the difficulty of actually reading XML. Consequently, almost from the very beginning, people thought RDF/XML was not the best way to express RDF, and since then a number of other syntaxes — other data structures — have manifested themselves.

Below is the same RDF serialized in a format called Notation 3 (N3), which is very human-readable but less strictly structured for computer processing. It builds on a line-based data structure called N-Triples to denote the triples themselves:

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix dcterms: <http://purl.org/dc/terms/>.

<http://en.wikipedia.org/wiki/Declaration_of_Independence> dcterms:creator <http://id.loc.gov/authorities/names/n79089957>.

<http://id.loc.gov/authorities/names/n79089957> a foaf:Person;
    foaf:gender "male".

JSON (JavaScript Object Notation) is a popular data structure inherent to the use of JavaScript and Web browsers, and RDF can be expressed in a JSON format as well:

{ "http://en.wikipedia.org/wiki/Declaration_of_Independence": { "http://purl.org/dc/terms/creator": [ { "type": "uri", "value": "http://id.loc.gov/authorities/names/n79089957" } ] }, "http://id.loc.gov/authorities/names/n79089957": { "http://xmlns.com/foaf/0.1/gender": [ { "type": "literal", "value": "male" } ], "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [ { "type": "uri", "value": "http://xmlns.com/foaf/0.1/Person" } ] } }

Just about the newest RDF serialization is an embellishment of JSON called JSON-LD. Compare and contrast the serialization below with the one above:

{ "@graph": [ { "@id": "http://en.wikipedia.org/wiki/Declaration_of_Independence", "http://purl.org/dc/terms/creator": { "@id": "http://id.loc.gov/authorities/names/n79089957" } }, { "@id": "http://id.loc.gov/authorities/names/n79089957", "@type": "http://xmlns.com/foaf/0.1/Person", "http://xmlns.com/foaf/0.1/gender": "male" } ] }

RDFa represents a way of expressing RDF embedded in HTML, and here is such an expression:

<div xmlns="http://www.w3.org/1999/xhtml"
     prefix="
       foaf: http://xmlns.com/foaf/0.1/
       rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
       dcterms: http://purl.org/dc/terms/
       rdfs: http://www.w3.org/2000/01/rdf-schema#">
  <div typeof="rdfs:Resource" about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
    <div rel="dcterms:creator">
      <div typeof="foaf:Person" about="http://id.loc.gov/authorities/names/n79089957">
        <div property="foaf:gender" content="male"></div>
      </div>
    </div>
  </div>
</div>

The purpose of publishing linked data is to make RDF triples easily accessible. This does not necessarily mean the transformation of EAD or MARC into RDF/XML, but rather making the statements of RDF accessible within the context of the reader. In this case, the reader may be a human or some sort of computer program. Each serialization has its own strengths and weaknesses. Ideally the archive would figure out ways to exploit each of these RDF serializations for specific publishing purposes.

For a good time, play with the RDF Translator, which will convert one RDF serialization into another.
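
The same round-tripping can also be scripted. Below is a small sketch, again my own and assuming Python's rdflib library, that parses the RDF/XML above and re-serializes it as Turtle and as N-Triples:

from rdflib import Graph

rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Declaration_of_Independence">
    <dcterms:creator>
      <foaf:Person rdf:about="http://id.loc.gov/authorities/names/n79089957">
        <foaf:gender>male</foaf:gender>
      </foaf:Person>
    </dcterms:creator>
  </rdf:Description>
</rdf:RDF>"""

g = Graph()
g.parse(data=rdf_xml, format="xml")    # read one serialization ...
print(g.serialize(format="turtle"))    # ... and write out another
print(g.serialize(format="nt"))        # N-Triples, one triple per line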

The RDF serialization process also highlights how data structures are moving away from document-centric models toward statement-centric models. This too has consequences for the way cultural heritage institutions, like archives, think about exposing their metadata, but that is the topic of another essay.

LITA: A Report on LibHack

planet code4lib - Fri, 2014-01-31 14:42
What happened at LibHack? Wait, what was LibHack? LibHack was a library hackathon held during ALA Midwinter on Friday, January 24 at the University of Pennsylvania’s Kislak Center for Special Collections. Organized by the LITA/ALCTS Code Year Interest Group and sponsored by OCLC and the Digital Public Library of America (DPLA), the event featured two separate tracks — one specifically catered to beginners that worked on the OCLC WorldCat Search API, and another track open to beginners and advanced hackers that worked on the DPLA API.

OCLC Track

Of the 55 attendees, there was a 50-50 split between the two tracks. The OCLC track was led by Steve Meyer, Technical Platform Project Manager at OCLC, and several other OCLC staff members were on hand to lend support. Since the track was designed to meet the needs of beginner programmers, Steve led a workshop that used the WorldCat Search API to introduce participants to some of the basics of programming. For example, Steve provided a walkthrough of PHP and XML using lesson files, making sure people understood how the code connected back to the API output.

The OCLC track filled a need within the ALA community for introductory-level programming at ALA conferences. Based on the success of the Intro to Python preconference at the 2013 ALA Annual conference in Chicago and data gathered from an initial planning survey conducted by the LibHack organizers (Zach Coble, Emily Flynn, and Chris Strauber) in August 2013, it was clear that many librarians were interested in a more structured learning opportunity. LibHack’s old-fashioned, synchronous, face-to-face environment contributed to the OCLC track’s success in teaching participants the basics and helping them become more comfortable with the challenges of programming.

DPLA Track

The DPLA track, on the other hand, was more loosely organized and was open to all levels of hackers. As with the OCLC track, we were fortunate to have four DPLA staff members on hand to provide guidance and technical assistance. At the beginning of the day, people pitched ideas for projects, and groups quickly formed around those ideas. Some of the projects that were worked on include:

  • WikipeDPLA, by Jake Orlowitz and Eric Phetteplace, a userscript that finds DPLA content and posts relevant links at the top of Wikipedia pages.
  • @HistoricalCats, by Adam Malantonio, a DPLA Twitter bot that retrieves cat-related items in DPLA.
  • #askDPLA Twitter bot, by Tessa Fallon, Simon Hieu Mai, and Coral Sheldon-Hess, replies to tweets using the #askDPLA hashtag with DPLA items.
  • Exhibit Master 2000 (https://github.com/chadfennell/exhibitmaster2000), by Chad Fennell, Nabil Kashyap, and Chad Nelson, aims to create a simple way for users to create exhibits based on DPLA search terms.
  • Thomas Dukleth, Jenn (Yana) Miller, Allie Verbovetskaya, and Roxanne Shirazi investigated how to systematically apply rights statements to DPLA items. While DPLA’s metadata is CC0, the items themselves have a spectrum of copyright and reuse rights tied to them, all recorded in free-form metadata fields.
  • VuDL DPLA extension, by Chris Hallberg and David Lacy, created a DPLA extension to VuDL, the open-source digital library package Villanova University develops. The extension can be seen at http://digital.library.villanova.edu/
  • Other projects included Francis Kayiwa, who started work to apply LCSH subject terms to DPLA items, and Dot Porter and Doug Emery, who worked on integrating DPLA’s medieval holdings with the Medieval Electronic Scholarly Alliance’s federated search.
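
Most of these projects start from the same place: a query against the DPLA items API and some processing of the JSON that comes back. As a rough, hypothetical illustration (the endpoint and field names follow DPLA's public API documentation as I understand it, and you would need to request your own api_key), a minimal search could look like this:

import requests

API_KEY = "YOUR_DPLA_API_KEY"  # placeholder -- request a key from DPLA

def search_dpla(query, page_size=10):
    """Return (title, item URL) pairs for a DPLA keyword search."""
    response = requests.get(
        "https://api.dp.la/v2/items",
        params={"q": query, "page_size": page_size, "api_key": API_KEY},
    )
    response.raise_for_status()
    docs = response.json().get("docs", [])
    return [
        (doc.get("sourceResource", {}).get("title"), doc.get("isShownAt"))
        for doc in docs
    ]

for title, url in search_dpla("cats"):
    print(title, url)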

Since LibHack was a one-day event, many projects were not completed, although some groups made plans to continue working. Chad Fennell and Chad Nelson’s project Exhibit Master 2000 was continued at last weekend’s GLAM Hack Philly. And the project investigating copyright and reuse rights is a long-term DPLA project that will take many more hackathons to complete!

Future Plans

Given the overall success of the event, the Code Year Interest Group is exploring the idea of hosting another LibHack, possibly at the 2014 ALA Annual conference in Las Vegas. If you are interested in organizing or sponsoring, contact libraryhackathon@gmail.com.

Bisson, Casey: Rumors

planet code4lib - Fri, 2014-01-31 14:38

Subcomandante Marcos, by Jose Villa, from Wikipedia

It started at the coffee shop. Somebody pointed and made the claim, then everybody was laughing. “He looks just like him!” one said. “How would you know, he wore a mask!” exclaimed another.

I looked him up. I could be accused of being a less interesting figure.

State Library of Denmark: kaarefc

planet code4lib - Fri, 2014-01-31 12:39

Our large-scale digitization project of newspapers from microfilm is just on the verge of going into production. Being technical lead on the process of ingesting and controlling quality of the digitized data has been a roller coaster of excitement, disillusionment, humility, admiration, despair, confidence, sleeplessness, and triumph.

I would like to record some of the more important lessons learned, as seen from my role in the project. I hope this will help other organizations doing similar projects. I wish we had had these insights at the beginning of our project, but I also recognize that some of these lessons can only be learned by experience.

Lesson 1: For a good end result, you need people in your organization that understand all parts of the process of digitization

At the beginning of our project, we had people in our organization who knew a lot about our source material (newspapers and microfilm) and people who knew a lot about digital preservation.

We did not have people who knew much about microfilm scanning.

We assumed this would be no problem, because we would hire a contractor, and they would know about the issues concerning microfilm scanning.

We were not wrong as such. The contractor we chose had great expertise in microfilm scanning. And yet it still turned out we ended up with a gap in required knowledge.

The reason is that most scanning companies do not scan for preservation. They scan for presentation. The two scenarios entail two different sets of requirements. Our major requirement was to have a digital copy that resembled our physical copy as closely as possible. The usual set of requirements a scanning company gets from its customers is to get the most legible files for the lowest cost possible. These two sets of requirements are not always compatible.

One example was image compression. We had decided on losslessly compressed images (in JPEG2000), which is more expensive than lossy compression but avoids the graphic artifacts that lossy compression always leaves, artifacts that can be a hassle in any post-processing or migration of the images. Using lossless image formats is an expensive choice when it comes to storage, but since we were scanning to replace originals we opted for the costly but unedited files.

Once we got our first images, though, inspection of the images showed definite signs of lossy compression artifacts. The files themselves were in a lossless format as expected, but the compression artifacts were there all the same. Somewhere along the path to our lossless JPEG2000 images, a lossy compression had taken place. The contractor assured us that they used no lossy compression. Not until we visited the contractor and saw the scanning stations did we find the culprit. It was the scanners themselves! It turned out that the scanner, when transferring the images from the scanner to the scanner processing software, used JPEG as an intermediary format. So in the end we got the costly lossless image format, but the artifacts from lossy compression as well. It was a pure lose/lose situation. And even worse, there was no obvious way to turn it off! We finally managed to resolve it, though, with three-way communication between us, the scanner manufacturer, and the contractor. Luckily, there was a non-obvious way to avoid the JPEG transfer format. The way to turn it off was to change the color profile from “gray-scale” to “gray-scale (lossless)”.

As another example, we had in our tender the requirement that the images should not be post-processed in any way. No sharpening, no brightening, no enhancement. We wanted the original scans from the microfilm scanner. The reason for this was that we can always do post-processing on the images for presentation purposes, but once you post-process an image, you lose information that cannot be regained – you can’t ”unsharpen” an image and get the original back. We had assumed this would be one of our more easily met requirements. After all, we were asking contractors to not do a task, not to perform one.

However, ensuring that images are not post-processed was a difficult task on its own. First there is the problem of communicating it at all. Scanner companies have great expertise in adjusting images for the best possible experience, and now we asked them not to do that. It was at first completely disruptive to communication, because our starting points were so completely different. Then there was the problem that some of the post-processing was done by the scanner software, and the contractor had no idea how to turn it off. Once again, it took three-way communication between the scanner manufacturer, our organization, and the contractor before we found a way to get the scanner to deliver the original images without post-processing.

The crucial point in both these examples is that we would not even have noticed all of this, if we hadn’t had a competent, dedicated expert in our organization, analyzing the images and finding the artifacts of lossy compression and post processing. And in our case we only had that by sheer luck. We had not scheduled any time for this analysis or dedicated an expert to the job. We had drawn on expertise like this when writing the tender, so the requirements were clear and documented, and we had expected the contractor to honor these requirements as written. It was no one’s job to make sure they did.

However, one employee stepped up on his own initiative. He is an autodidact image expert who originally was not assigned to the project at all. He took a look at the images and started pointing out the various signs of post processing. He wrote analysis tools and went out of his way to communicate to the rest of the digitization project how artifacts could be seen and how histograms could expose the signs of post processing. It is uncertain whether we would ever have had the quality of images we are getting from this project, if it had not been for his initiative.
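
The histogram idea is simple enough to sketch. The heuristic below is my own illustration, not his actual tooling, and the file names and cut-off are invented; it counts empty grey-level bins, the “comb” pattern that contrast stretching tends to leave behind in an 8-bit image:

import numpy as np
from PIL import Image

def empty_histogram_bins(path):
    """Count empty grey-level bins in an 8-bit rendering of a scan.

    A raw scan normally fills most of the 256 bins; a comb of empty
    bins hints that the tonal range was stretched or otherwise
    post-processed somewhere along the way.
    """
    image = Image.open(path).convert("L")          # force 8-bit greyscale
    histogram, _ = np.histogram(np.asarray(image), bins=256, range=(0, 256))
    return int(np.sum(histogram == 0))

for page in ["page-0001.tif", "page-0002.tif"]:    # hypothetical file names
    gaps = empty_histogram_bins(page)
    verdict = "suspicious" if gaps > 40 else "ok"  # cut-off chosen arbitrarily
    print(page, gaps, verdict)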

Lesson 2: Your requirements are never as clear as you think they are

This one is really a no-brainer and did not come as a surprise for us, but it bears repeating.

Assuming you can write something in a tender and then have it delivered as described is an illusion. You really need to discuss and explain each requirement to your contractor, if you want a common understanding. And even then you should expect to have to clarify at any point during the process.

Also, in a large-scale digitization project, your source material is not as well-known as you think it is. You will find exceptions, faults, and human errors that cause the source material to vary from the specifications.

Make sure you keep communication open with the contractor to clarify such issues. And make sure you have resources available to handle that communication.

Examples can be trivial – we had cases where metadata documents were delivered with the example text from the templates in our specifications, instead of with the actual values they should contain. But they can also be much more complex – for instance, we asked our contractors to record the section title of our newspapers in metadata. But how do you tell an Indian operator where to find a section title in a Danish newspaper?

Examples can also go the other way round. Sometimes your requirements propose a poorer solution than what the contractor can provide. We had our contractors suggest a better solution for recording metadata for target test scans. Be open to suggestions from your contractor; in some cases they know the game better than you do.

Lesson 3: Do not underestimate the resources required for QA

Doing a large-scale digitization project probably means you don’t have time to look at all the output you get from your contractor. The solution is fairly obvious when you work in an IT department: Let the computer do it for you. We planned a pretty elaborate automatic QA system, which would check data and metadata for conformity to specifications and internal consistency. We also put into our contract that this automatic QA system should be run by the contractor as well to check their data before delivery.
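
The details of such a tool depend entirely on the delivery specification, but its general shape is simple. A stripped-down sketch (hypothetical file layout and manifest format, not our actual specification) that verifies files against a checksum manifest and checks that every metadata file is at least well-formed XML could look like this:

import hashlib
import sys
from pathlib import Path
from xml.etree import ElementTree

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_batch(batch_dir):
    """Yield human-readable problems found in a delivered batch."""
    batch = Path(batch_dir)
    # Assumed manifest format: one "<md5>  <relative path>" per line
    for line in (batch / "checksums.md5").read_text().splitlines():
        if not line.strip():
            continue
        expected, relpath = line.split(maxsplit=1)
        f = batch / relpath
        if not f.exists():
            yield "missing file: " + relpath
        elif md5sum(f) != expected:
            yield "checksum mismatch: " + relpath
    # Every metadata file must at least be well-formed XML
    for xml_file in batch.rglob("*.xml"):
        try:
            ElementTree.parse(xml_file)
        except ElementTree.ParseError as e:
            yield "bad XML in " + xml_file.name + ": " + str(e)

if __name__ == "__main__":
    problems = list(check_batch(sys.argv[1]))
    print("\n".join(problems) or "batch OK")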

This turned out to be a much larger task than we had anticipated. While the requirements are simple enough, there is simply so much grunt work to do that it takes a lot of resources to make a complete check of the entire specification. Communicating with the contractor about getting the tool to run and interpreting the results is an important part of getting value from the automatic QA tool. We have found that assumptions about technical platforms, input and output, and even communicating the output of failed automatic QA are things that should not be underestimated.

However the value of this has been very high. It has served to clarify requirements in both our own organization and with our contractor, and it has given us a sound basis for accepting the data from our contractor.

In other, smaller digitization projects, we have sometimes tried to avoid doing a thorough automatic QA check. Our experience, in these cases, is that this has simply postponed the discovery of mistakes that could have been detected automatically until our manual QA spot checks. The net effect is that the time spent on manual QA and on requesting redeliveries has greatly increased. So our recommendation is to do thorough automatic QA, but also to expect it to be a substantial task.

Even when you have done thorough automatic QA, it does not replace the need for a manual QA process, but since you don’t have time to check every file manually, you will need to do a spot check. Our strategy in this case has been twofold: First, we take a random sample of images to check, giving us a statistical model that allows us to make statements about the probability of undiscovered mistakes. Second, we augment this list of images to check with images that an automatic analysis tool marks as suspicious – for instance very dark images, unexpected (that is: possibly wrong) metadata, very low OCR success rates, etc.
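
As a sketch of that twofold strategy (illustrative only; the field names and thresholds are invented), the set of pages for manual inspection can be built from a fixed-size random sample plus everything the automatic analysis flagged:

import random

def build_spot_check(pages, sample_size=200, seed=42):
    """Combine a random sample with automatically flagged pages.

    `pages` is assumed to be a list of dicts with 'id', 'mean_brightness'
    and 'ocr_confidence' fields produced by an earlier analysis step.
    """
    rng = random.Random(seed)                    # fixed seed, reproducible sample
    sample = rng.sample(pages, min(sample_size, len(pages)))
    suspicious = [
        p for p in pages
        if p["mean_brightness"] < 30             # very dark scan
        or p["ocr_confidence"] < 0.5             # poor OCR result
    ]
    # De-duplicate while keeping order
    seen, result = set(), []
    for p in sample + suspicious:
        if p["id"] not in seen:
            seen.add(p["id"])
            result.append(p)
    return result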

We have had our contractor build a workflow system for doing the actual manual QA process for us. So given the input of random and suspect pages, they are presented in an interface, where a QA tester can approve images, or reject them with a reason. A supervisor will then use the output from the testers to confirm the reception of the data, or request a redelivery with some data corrected.

Even though the contractor builds our manual QA interface, we still need to integrate with this system, and the resources required for this should not be underestimated. We opted to have the tool installed in our location, to ensure the data checked in the manual QA spot check was in fact the data that was delivered to us. If the manual QA spot check had been done at the contractor, in theory the data could have been altered after the manual QA spot check and before delivery. Communication concerning installation of the system and providing access to images for manual QA spot check also turned out to be time consuming.

 

In conclusion, in a large-scale digitization project, QA is a substantial part of the project, and must be expected to require considerable resources.

Lesson 4: Expect a lot of time to elapse before first approved batch

This lesson may be a corollary of the previous three, but it seems to be one that needs to be learned time and time again.

When doing time-lines for a digitization project, you always have a tendency to expect everything to go smoothly. We had made that assumption once again in this project, and as we should have expected, it didn’t happen.

Nothing went wrong as such, but during planning we simply didn’t take into account the time it takes to communicate about requirements. So when we received the first pilot batch, our time-line said we would go into production soon after. This, of course, did not happen. What happened was that the communication process about what needed to be changed (in the specification or in the data) started. And then, after this communication process had been completed, it took a while before new data could be delivered. And then the cycle started again.

Our newly revised plan has no final deadline. Instead it has deadlines on each cycle, until we approve the first batch. We expect this to take some time. The plan allows three weeks for the first cycle; then, when problems seem less substantial, we reduce the cycle to two weeks. Finally we go to one-week cycles for more trivial issues. And once we have finally approved the batch, we can go into production. Obviously, this pushes our original deadline back months, but this is really how our plan should have been designed from the very beginning. So make sure your plans allow time to work out the kinks and approve the workflow before you plan on going into production.

Lesson 5: Everything and nothing you learned from small-scale digitization projects still applies

Running small-scale digitization projects is a good way to prepare you for handling a large-scale digitization project. You learn a lot about writing tenders, communicating with contractors, what doing QA and ingest entails, how you evaluate scan results etc. It is definitely recommended to do several small-scale digitization projects before you go large-scale.

But a lot of the things you learned in small-scale digitization projects turn out not to apply when you go large-scale.

We are digitizing 32 million newspaper pages in three years. That means that every single day, we need to be able to receive 30,000 pages. With each page being roughly 15 MB, that’s close to half a terabyte a day. Suddenly a lot of resources usually taken for granted need to be re-evaluated. Do we have enough storage space just to receive the data? Do we have enough bandwidth? Can our in-house routers keep up with managing the data? How long will it actually take to run characterization of all these files? Can we keep up? What happens if we are delayed a day or even a week?
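
The back-of-the-envelope arithmetic behind those figures is worth writing down (rounded numbers, my own calculation):

pages_total = 32_000_000       # pages to digitize
days = 3 * 365                 # the three-year delivery window, in days
mb_per_page = 15               # rough size of one page image

pages_per_day = pages_total / days
gb_per_day = pages_per_day * mb_per_page / 1024
print(f"{pages_per_day:,.0f} pages/day, about {gb_per_day:,.0f} GB/day")
# roughly 29,000 pages and 430 GB (close to half a terabyte) per day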

Also, in small digitization projects, manually handling minor issues is feasible. Even doing a manual check of the delivered data is feasible. In this case, if you wanted to check everything, a full-time employee would only be able to spend about two thirds of a second per page to keep up. So you really need to accept that you cannot do anything manually across the whole project. Spot checks and automation are crucial.

This also means that the only ones who will ever see every page of your digitization project will probably be your contractor. Plan carefully what you want from them, because you probably only have this one chance to get it. If you want anyone to read anything printed on the newspaper page, now is the time to specify it. If you want anything recorded about the visual quality of the page, now is the time.

Another point is that you need to be very careful about what you accept as okay. Accepting something sub-optimal because it can always be rectified “later” will probably mean “never” rather than “later”. This needs to be taken into account every time a decision is made that affects the entire digitization output.

Conclusion

Every kind of project has its own gotchas and kinks. Large-scale digitization projects are no exception.

Listed above are some of our most important lessons learned so far and seen from the perspective of a technical lead primarily working with receiving the files. This is only one small part of the mass digitization project, and other lessons are learned in different parts of the organization.

I hope these lessons will be of use to other people – and even to ourselves next time we embark on a mass digitization adventure. It has been an exhilarating ride so far, and it has taken a lot of effort to get a good result. Next step is the process of giving our users access to all this newspaper gold. Let the new challenges begin!

