Feed aggregator

William Denton: Thinks and Burns

planet code4lib - Sat, 2015-09-12 14:39

Yesterday I stumbled on the Thinks and Burns with Fink and Byrne podcast. I have no idea where or when (indeed if) it was announced, but I was happy to find it. It’s John Fink (@adr) and Gillian Byrne (@redgirl13) talking about library matters. I’m acquainted with both of them and we all work near each other, so it’s of more interest to me than if it were two complete strangers rambling on, but if you know either of them, or are a Canadian librarian, or like podcasts where two librarians talk like you’re hanging out down the pub and one keeps hitting the table for emphasis, it’s worth a listen. They’ve done three episodes, and I hope they keep it going, even if irregularly.

Ed Summers: Zombie Information Science

planet code4lib - Sat, 2015-09-12 04:00

One of the readings for INST800 this week was (Bates, 2007). It’s a short piece that outlines how she and Mary Maack organized their work on the third edition of the Encyclopedia of Library and Information Sciences. When creating an encyclopedia it becomes immediately necessary to define the scope, so you have some hope of finishing the work. As she points out, information science is contested territory now because of all the money and power that is aggregated in Silicon Valley. Everyone wants a piece of it now, whereas it struggled to be a discipline before people started billion-dollar companies in their garages:

We have been treated as the astrologers and phrenologists of modern science— assumed to be desperately trying to cobble together the look of scholarship in what are surely trivial and nearly content-free disciplines.

It’s fun to pause for a moment to consider how much of what comes out of Silicon Valley feels like astrology or phrenology in its marketing and futurism.

So, in large part Bates and Maack needed to define what information science is before they could get to work. But Bates seems to feel like this theory transcended its use to scope the work, and really did define the field. Or perhaps it’s more likely that her previous work in the field (she is a giant) informed the scope chosen for the encyclopedia. What better way to establish the foundations of a field (ideology) than write an encyclopedia about it?

A few things struck me about this article. The small addition of “s” to the end of the title of the encyclopedia seemed like an important detail. It recognizes the pluralistic nature of information, its inter-disciplinarity – how information science has grown out of many disciplines across the humanities, social sciences and physical sciences. But she goes on to point out that in fact all these disciplines do have something in common:

… we engage in living and working our daily lives, and these vast numbers of human activities give off or throw off a remarkably extensive body of documentation of one sort or another: Business records, family histories, scholarly books, scientific and technical journals, websites, listservs and blogs for groups with common interests of a work or avocational nature, religious texts, educational curricula, x-rays, case records in law, medicine, and criminal justice, architectural drawings and purchase orders from construction sites, and on and on and on. The universe of living throws off documentary products, which then form the universe of documentation.

This description of different universes is compelling for me because it seems to recognize the processes and systems that information is a part of. She also includes a quirky mindmap-like diagram of these two universes in interaction which helps illustrate what she sees going on:

The Universe of Living and the Universe of Documentation

Now I would be sold if things stopped here, but of course they don’t. Bates goes on to criticize Briet’s idea of documents (anything that can be used as evidence) in order to say:

I argue that a document, above all, contains “recorded information,” that is, “communicatory or memorial information preserved in a durable medium”. (Bates, 2006).

For Bates, the antelope in the zoo is a specimen, not a document. Now I should probably dig down into (Bates, 2006) to understand fully what’s going on here, but on the surface this definition seems specific yet raises a few questions. Who or what is in communication? Is understanding required for it to be communication? Does this communication need to be intentional? What does durable mean, and over what time scales? Maybe I’m just getting defensive because I’ve always been a bit partial to Briet’s definition.

Bates goes on to use this distinction as an argument for not including the study of living things as information in the encyclopedia, which seems like a perfectly fine definition of scope for an encyclopedia, but not a definition of what is and is not information science:

In the definition of scope of the Encyclopedia of Library and Information Sciences, the first two branches of the collections sciences, all working with collections of non-living but highly informative objects, are being included in the coverage of the encyclopedia, while the collectors of live specimens— the branch most remote from document collecting— are not included at this time.

Can we really say information is dead, or rather, that it has never been alive? Is information separable from living things (notably us) and still meaningful as an object of study? Where do you draw the line between the living and the unliving? I suspect she herself would agree that this is not a sustainable view of the information sciences.


Bates, M. J. (2006). Fundamental forms of information. Journal of the American Society for Information Science and Technology, 57(8), 1033–1045.

Bates, M. J. (2007). Defining the information disciplines in encyclopedia development. Information Research, 12.

Ed Summers: Red Thread

planet code4lib - Sat, 2015-09-12 04:00

Bates, M. (1999). The invisible substrate of information science. Journal of the American Society for Information Science, 50(12), 1043–1050.

Of all the introductions to information science we’ve read so far I’m quite partial to this one. It does have a few moments of “just drink the kool aid already”, but the general thrust is to help familiarize the growing number of people working with information about the field of information science. So its purpose is largely educational, not theoretical. Bates wrote this in 1999, and I think there is still a real need to broaden the conversation about what the purpose of information science is today, although perhaps we know it more in the context of human-computer interaction. I also suspect this article helped define the field for those who were already working in the middle of this highly inter-disciplinary space and trying to find their way.

The reason why it appealed to me so much is because it speaks to the particularly strange way information science permeates, but is not contained by other disciplines. Information science is distinguished by the way its practitioners:

… are always looking for the red thread of information in the social texture of people’s lives. When we study people, we do so with the purpose of understanding information creation, seeking, and use. We do not just study people in general. (Bates, 1999, p. 1048)

Her definitions are centered on people and these weird artifacts they create bearing information, which are only noticed with a particular kind of attention, which you learn when you study information science. I was reminded a bit of (Star, 1999), written in the same year. I can hear echoes of STS in Bates’ exhortation to follow “the red thread” of information–which reminds me of Bruno Latour more than it does Woodward and Bernstein.


Bates, M. (1999). The invisible substrate of information science. Journal of the American Society for Information Science, 50(12), 1043–1050.

Star, S. L. (1999). The ethnography of infrastructure. American Behavioral Scientist, 43(3), 377–391.

LibUX: Does the best library web design eliminate choice?

planet code4lib - Fri, 2015-09-11 21:38

There is research to suggest that libraries’ commitment to privacy may be their ace up the sleeve, as constant tracking and creepy Facebook ads repulse growing numbers of users. We can use libraries’ killer track record to our advantage, insulating our patrons’ trust and raising our esteem. The only conniption I have when we are talking shop is how often privacy is at odds with personalization – this is a real shame.

Does the best library web design eliminate choice?
This writeup is also on medium.

The root of the “Library User Experience Problem” is not design. No – design is just a tool. What gets in the way of good UX is that there is just too much. These websites suffer from too many menu items, too many images, too many hands in the pot, too much text on the page, too many services, too many options.
The solution is less:

  • fewer menu items increase navigability
  • fewer images increase site-speed, which increases conversion
  • fewer hands in the pot increase consistency and credibility
  • less text on the page increases readability, navigability
  • fewer options decrease the interaction cost.

Interaction cost describes the amount of effort users expend to interact with a service. Tasks that require particularly careful study to navigate, validate, complete – whether answering crazy-long forms or clicking through a ton of content – are emotionally and mentally taxing. High-cost websites churn. To increase usability, reduce this cost.

Decision fatigue

John Tierney, in the New York Times in 2011, called this “decision fatigue.”

Decision fatigue helps explain why ordinarily sensible people get angry at colleagues and families, splurge on clothes, buy junk food at the supermarket and can’t resist the dealer’s offer to rustproof their new car.

These well-founded negative repercussions of cognitive load inspired Aaron Shapiro to write, a couple years later, that “choice is overrated.” As Orwellian as that sounds, I think I agree. Functional design is design that gets out of your way. It facilitates – when you need it. It is the unwritten servant in Jane Austen who takes care of such trivial concerns as cooking, fetching, cleaning, delivering, and dressing, so that our heroines can hobnob and gossip.

Already we engage with nascent services anticipating our choices, and these will mature. In the next couple of years, when I schedule a flight in my calendar it will go ahead and reserve the Uber and inform the driver of my destination (the airport).

It is not that these choices were eliminated for me, but context and past behavior spared me from dealing with the nitty gritty.

Anticipatory design is a way to think about using context and user behavior as well as personal data – if and when ethically available – to craft a “user experience for one” to reduce the interaction cost, the decision fatigue, or – in the Brad Frost way of doing things – cut out the bullshit.

Context, behavior, and personal data

Here are the pillars for developing an anticipatory service. The last one, personal data, is what makes librarians – who, you should know if you’re not one, care more about your privacy than you probably do – pretty uneasy.

The context can be inferred from neutral information such as the time of day, business hours, weather, events, holidays. If the user opts in, then information such as device orientation, location or location in relation to another, and motion from the accelerometer can build a vague story about that patron and make educated guesses about which content to present.
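To make that concrete, here is a minimal JavaScript sketch of what this kind of context inference could look like. The opening hours and banner messages are invented for the example, not taken from any real library site.

// A rough sketch of context inference from "neutral" signals.
// The hours and banner messages below are hypothetical.
var HOURS = { open: 8, close: 23 }; // assumed building hours

function inferContext(now) {
  var hour = now.getHours();
  return {
    isOpen: hour >= HOURS.open && hour < HOURS.close,
    isEvening: hour >= 18,
    isFinalsWeek: false // could be set from an academic calendar feed
  };
}

// Choose homepage content based on the inferred context.
function chooseBanner(context) {
  if (!context.isOpen) return 'Chat with a librarian online';
  if (context.isEvening) return 'Late-night study spaces are available tonight';
  return 'Today\'s events and workshops';
}

console.log(chooseBanner(inferContext(new Date())));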

Behavior comes in two flavors: general analytics-driven user behavior of your site or service on the web or in house, and individual user behavior such as browsing history. I distinguish the latter from personal data because I consider this information available to the browser without need for actually retrieving information from a database. General analytics reveals a lot about how a site functions, what’s most sought after, at what times, by which devices. Specific user behavior, which can be gleaned through cookies, can then narrow the focus of analytics-driven design.
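As a small illustration of the second flavor (behavior kept in the browser rather than in a database), here is a hedged sketch using localStorage; a cookie would work the same way, and the storage key and subject labels are made up.

// Sketch: record and summarize individual behavior entirely client-side.
// 'recentSubjects' is a hypothetical storage key, not a real library convention.
function recordSubjectVisit(subject) {
  var recent = JSON.parse(localStorage.getItem('recentSubjects') || '[]');
  recent.unshift({ subject: subject, at: Date.now() });
  localStorage.setItem('recentSubjects', JSON.stringify(recent.slice(0, 20)));
}

function topSubjects() {
  var recent = JSON.parse(localStorage.getItem('recentSubjects') || '[]');
  var counts = {};
  recent.forEach(function(visit) {
    counts[visit.subject] = (counts[visit.subject] || 0) + 1;
  });
  return Object.keys(counts).sort(function(a, b) { return counts[b] - counts[a]; });
}

These per-user signals can then narrow the analytics-driven defaults rather than replace them.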

It can only be a user experience for one when personal data can expose real preferences — Michael loves science fiction, westerns, prefers Overdrive audiobooks to other vendors and formats — to automatically skip the hunt-and-peck and curate a landing page unique to the patron. Jane is a second-year student at the College of Psychology, she reserves study rooms every Thursday, it’s finals week, she’s in the grind and it’s evening: when she visits the library homepage, we should ensure that she can reserve her favorite study room, give her the heads up that the first-floor cafe is open late if she needs a pick-me-up, and give her the databases she most frequents.
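Putting the three pillars together, a homepage "for one" might be assembled with a rule like the sketch below. Every field name here is hypothetical, and the profile object would only exist if the patron had opted in.

// Hypothetical sketch: combine context, behavior, and an opted-in profile
// to pick homepage modules. Field names do not come from any real system.
function buildHomepage(context, behavior, profile) {
  var modules = [];
  if (profile && profile.frequentRoom) {
    modules.push({ type: 'room-booking', room: profile.frequentRoom });
  }
  if (context.isFinalsWeek && context.isEvening) {
    modules.push({ type: 'notice', text: 'The first-floor cafe is open late tonight' });
  }
  (behavior.topDatabases || []).slice(0, 3).forEach(function(db) {
    modules.push({ type: 'database-link', name: db });
  });
  modules.push({ type: 'search-box' }); // the default experience is always available
  return modules;
}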

We just have to reach out and take the ring

Libraries have access to all the data we could want, but as Anne Quito wrote over on Quartz, anticipatory design requires a system of trust.

This means relinquishing personal information – passwords, credit card numbers, activity tracking data, browsing histories, calendars – so the system can make and execute informed decision on your behalf.

This would never fly.

Anticipatory design presents new ethical checkpoints for designers and programmers behind the automation, as well as for consumers. Can we trust a system to safeguard our personal data from hackers and marketers – or does privacy become a moot concern?

I do not believe that privacy and personalization are mutually exclusive, but I am skeptical of libraries’ present ability to safeguard this data. As I told Amanda in our podcast about anticipatory design, I do not trust the third-party vendors with which libraries do business not to exploit personal information, nor do I trust libraries without seasoned DevOps to deeply personalize the experience without leaving it vulnerable.

Few libraries benefit from the latter. So …, bummer. The shame to which I alluded above is that users not only benefit from the convenience; libraries, by so drastically improving usability, drastically improve the likelihood of mission success. These things matter. Investing in a net-positive user experience matters, because libraries thrive on and rely on the good vibes of their patron base – especially during voting season.

The low-fat flavor of “anticipatory design” without the personal-data part has also been referred to as context-driven design, which I think is a compelling strategy. It doesn’t require libraries to store and safeguard more information than is necessary for basic function. Context inferred from device or browser information is usually opt-in by default, and this would do most of the heavy lifting without crossing that deep, deep line in the sand, or crossing into the invasive valley.

The post Does the best library web design eliminate choice? appeared first on LibUX.

William Denton: Now with schema.org and COinS structured metadata

planet code4lib - Fri, 2015-09-11 21:19

Thanks to a combination of Access excitement, a talk by Dan Scott, a talk by Mita Williams, and wanting to learn more, I added schema.org and COinS metadata to this site. It validates, though I’m not sure if the semantic structure is correct. Here’s what I’ve got so far.

My Jekyll setup

I build this site with Jekyll. It uses static HTML templates in which you can place content as needed or do a little bit of simple scripting (inside double braces, which here I’ve spaced out: { { something } }). My main template is _layouts/miskatonic.html, which (leaving out the side, the footer, CSS and much else) looks like this:

<!DOCTYPE html>
<html itemscope itemtype="http://schema.org/CreativeWork">
<head>
  <meta charset="utf-8">
  <meta name="referrer" content="origin-when-cross-origin" />
  <title>{ {page.title} } | { { } }</title>
  <meta itemprop="creator" content="William Denton">
  <meta itemprop="name" content="Miskatonic University Press">
  <link rel="icon" type="image/x-icon" href="/images/favicon.ico">
  <link rel="alternate" type="application/rss+xml" title="Miskatonic University Press RSS" href="" />
</head>
<body>
  <article>
    { { content } }
  </article>
  <aside></aside>
</body>
</html>

It declares that the web site is a CreativeWork, what its name is, and who owns it.

I have two types of pages: posts and pages. Posts are blog posts like this, and pages are things like Fictional Footnotes and Indexes.

My page template, _layouts/page.html, sets out that the page is, in the schema.org sense, an Article:

---
layout: miskatonic
---
<div itemscope itemtype="http://schema.org/Article">
  { % include coins.html % }
  <meta itemprop="creator" content="William Denton">
  <meta itemprop="license" content="">
  <meta itemprop="name" content="{ { page.title } }">
  <meta itemprop="headline" content="{ { page.title } }">
  { % if page.description % }
  <meta itemprop="description" content="{ { page.description } }">
  { % endif % }
  <img itemprop="image" src="/images/dentograph-400px-400px.png" alt="" style="display: none;">
  <h1>{ { page.title } }</h1>
  <p>
    <time itemprop="datePublished" datetime="{ { | date_to_xmlschema } }">
      { { | date_to_long_string } }
    </time>
    <span class="tags">
      { % for tag in page.tags % }
      <a href="/posts/tags.html#{ { tag } }"><span itemprop="keywords">{ { tag } }</span></a>
      { % endfor % }
    </span>
  </p>
  <div class="post" itemprop="articleBody">
    { { content } }
  </div>
</div>

Those meta tags declare some properties of the Article. Every Article is required to have a headline and an image, which doesn’t really suit my needs and shows the commercial nature of the system. For the headline, I just use the title of the page. For the image, I use a generic image that will repeat on every page, and what’s more I style it with CSS so it’s not visible. I may come back to this later and make it work better.

The layout: miskatonic at the top means that this content gets inserted into that layout where the { { content } } line is.

The _layouts/post.html template looks like this:

---
layout: miskatonic
---
<div itemscope itemtype="http://schema.org/BlogPosting">
  { % include coins.html % }
  <meta itemprop="creator" content="William Denton">
  <meta itemprop="license" content="">
  <h1 itemprop="name">{ { page.title } }</h1>
  <p>
    <time itemprop="datePublished" datetime="{ { | date_to_xmlschema } }">
      { { | date_to_long_string } }
    </time>
    <span class="tags">
      { % for tag in page.tags % }
      <a href="/posts/tags.html#{ { tag } }"><span itemprop="keywords">{ { tag } }</span></a>
      { % endfor % }
    </span>
  </p>
  <div class="post" itemprop="articleBody">
    { { content } }
  </div>
</div>

Every blog post is a BlogPosting. The same kind of properties are given about it as for a page, and the same image trick. I usually include images with blog posts and maybe there’s a simple way to make Jekyll find the first one and bung it in there. I don’t think I want to get into listing all the images I use in the YAML header … that’s too much work.

When I write a blog post, like this one, I start it with

---
layout: post
title: Now with schema.org and COinS structured metadata
tags: jekyll metadata
date: 2015-09-11 16:24:17 -0400
---

That defines that this is a post, so the Markdown is processed and inserted into the post layout, which is processed and put into the miskatonic layout, which is processed, and that’s turned into static HTML and dumped to disk. (Or something along those lines.)

Proper semantics?

This all validates, but I’m not sure if the semantics are correct. Google’s Structured Data Testing Tool says this about a recent blog post:

CreativeWork (my site) and the BlogPosting (the post) are at the same level. I’m not sure if the BlogPosting should be a child of the CreativeWork. It is in the schema, but I don’t know if that should apply to this structure here.


While I was at all this, I decided to add COinS metadata to everything so Zotero could make sense of it. Adapting Matthew Lincoln’s COinS for your Jekyll blog, I created _includes/coins.html, which looks like this, though if you want to use it, reformat it to remove all the newlines and spaced braces, and change the name:

<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Adc
&amp;rft.title={ { page.title | cgi_escape } }
&amp;rft.aulast=Denton&amp;rft.aufirst=William
&amp;rft.source={ { | cgi_escape } }
&amp;{ { | date_to_xmlschema } }
&amp;rft.type=blogPost&amp;rft.format=text
&amp;rft.identifier={ { site.url | cgi_escape } }{ { page.url | cgi_escape } }
&amp;rft.language=English"></span>

I just noticed that this says the thing is a blog post, and I’m using this COinS snippet on both my pages and posts, so Zotero thinks the pages are posts, but I’ll let that ride for now. Zotero users, if you ever cite one of my pages, watch out.

COinS is over ten years old now! There must be a more modern way to do this. Or is there?

Bibliographic information in schema.org

Now that I’ve done this, search engines like Google can make better sense of the content of the site, which is nice enough, though I hardly ever use Google (I’m a DuckDuckGo man—it’s not as good, but it’s better). I would like to mark up my talks and publications so all of that citation information is machine-readable, but the bibliographic markup I’d need hasn’t been formally approved yet, from what I can see. And look at all of the markup going on for something like a Chapter!

Blecch. I don’t want to type all that kind of cruft every time I want to describe a chapter or article. There are things like jekyll-scholar that would let me turn a set of BibTeX citations into HTML, but it doesn’t do microformats. Maybe that would be something to hack on. Or maybe I’ll just leave it all for now and come back to it next time I feel like doing some fiddling with this site. That’s enough template hacking for one week!

Corrections welcome

If anyone who happens to read this sees any errors in what I’ve done, please let me know. I don’t really care if my headlines could be better, but if there’s something semantically wrong with what I’ve described here, I’d like to get it right.

Nicole Engard: Bookmarks for September 11, 2015

planet code4lib - Fri, 2015-09-11 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Roundcube: Free and Open Source Webmail Software
  • Bolt: Bolt is an open source Content Management Tool, which strives to be as simple and straightforward as possible. It is quick to set up, easy to configure, uses elegant templates, and above all: It’s a joy to use.

Digest powered by RSS Digest

The post Bookmarks for September 11, 2015 appeared first on What I Learned Today....


Alf Eaton, Alf: What Aaron understood

planet code4lib - Fri, 2015-09-11 18:52

I didn’t know Aaron, personally, but I’d been reading his blog as he wrote it for 10 years. When it turned out that he wasn’t going to be writing any more, I spent some time trying to work out why. I didn’t find out why the writing had stopped, exactly, but I did get some insight into why it might have started.

Philip Greenspun, founder of ArsDigita, had written extensively about the school system, and Aaron felt similarly, documenting his frustrations with school, leaving formal education and teaching himself.

In 2000, Aaron entered the competition for the ArsDigita Prize and won, with his entry The Info Network — a public-editable database of information about topics. (Jimmy Wales & Larry Sanger were building Nupedia at around the same time, which became Wikipedia. Later, Aaron lost a bid to be on the Wikimedia Foundation’s Board of Directors, in an election).

Aaron’s friends and family added information on their specialist subjects to the wiki, but Aaron knew that a centralised resource could lead to censorship (he created zpedia, for alternative views that would not survive on Wikipedia). Also, some people might add high-quality information, but others might not know what they’re talking about. If everyone had their own wiki, and you could choose which trusted sources to subscribe to, you’d be able to collect just the information that you trusted, augment it yourself, and then broadcast it back out to others.

In order to pull information in from other people’s databases, you needed a standard way of subscribing to a source, and a standard way of representing information.

RSS feeds (with Aaron’s help) became a standard for subscribing to information, and RDF (with Aaron’s help) became a standard for describing objects.

I find — and have noticed others saying the same — that to thoroughly understand a topic requires access to the whole range of items that can be part of that topic — to see their commonalities, variances and range. To teach yourself about a topic, you need to be a collector, which means you need access to the objects.

Aaron created Open Library: a single page for every book. It could contain metadata for each item (allowable up to a point - Aaron was good at pushing the limits of what information was actually copyrightable), but some books remained in copyright. This was frustrating, so Aaron tried to reform copyright law.

He saw that many of the people incarcerated in America were there for breaking drug-related laws, so he tried to do something about that, as well.

He found that it was difficult to make political change when politicians were highly funded by interested parties, so he tried to do something about that. He also saw that this would require politicians being open about their dealings (but became sceptical about the possibility of making everything open by choice; he did, however, create a secure drop-box for people to send information anonymously to reporters).

To return to information, though: having a single page for every resource allows you to make statements about those resources, referring to each resource by its URL.

Aaron had read Tim Berners-Lee’s Weaving The Web, and said that Tim was the only other person who understood that, by themselves, the nodes and edges of a “semantic web” had no meaning. Each resource and property was only defined in terms of other nodes and properties, like a dictionary defines words in terms of other words. In other words, it’s ontologies all the way down.

To be able to understand this information, a reader would need to know which information was correct and reliable (using a trust network?).

He wanted people to be able to understand scientific research, and to base their decisions on reliable information, so he founded Science That Matters to report on scientific findings. (After this launched, I suggested that Aaron should be invited to SciFoo, which he attended; he ran a session on open access to scientific literature).

He had the same motivations as many LessWrong participants: a) trying to do as little harm as possible, and b) ensuring that information is available, correct, and in the right hands, for the sake of a “good AI”.

As Alan Turing said (even though Aaron spotted that the “Turing test” is a red herring), machines can think, and machines will think based on the information they’re given. If an AI is given misleading information it could make wrong decisions, and if an AI is not given access to the information it needs it could also make wrong decisions, and either of those could be calamitous.

Aaron chose not to work at Google because he wanted to make sure that reliable information was as available as possible in the places where it was needed, rather than being collected by a single entity, and to ensure that the AI which prevails will be as informed as possible, for everyone’s benefit.

District Dispatch: (Bene)tech as a leveler

planet code4lib - Fri, 2015-09-11 16:58

Disability issues are a third rail in our public discourse. To “de-electrify” this rail is no simple task. It requires us to be critical of our own predilections and instincts. Here’s the problem: We’re human. Humans are reflexively disconcerted by what we perceive as an aberration from the norm. For this reason, we celebrate difference in the abstract, but are often paralyzed by it in practice; we exalt people and things we perceive as different, but devote too little time to truly understanding them. How can we have robust conversations about addressing the unique challenges facing people with disabilities if we’re afraid to broach the subject of disability in the first place? We can’t. To make real headway on these challenges, we have to check those parts of our nature and our milieu that compel us to clam up in the face of “otherness.” We have to bridge the gap between our best of intentions and our actions in the world.

3D printer in action

Thankfully, there are social advocacy organizations that realize this. An example par excellence: the Silicon Valley-based non-profit, Benetech. The men and women of Benetech realize that one of the greatest opportunities for progress on disability issues lies at the confluence of education, technology, science and public policy. They encourage individuals from across these fields – both with and without disabilities – to work together to develop solutions to accessibility challenges. Benetech’s latest effort on this front: A convening of library, museum and school professionals from across the country to discuss strategies for using 3D printing technology to improve the quality of education for students with disabilities. I was honored to be given the chance to attend and share my perspective on 3D printing as a policy professional.

The convening’s discussions and workshops highlighted a number of ways in which 3D printers can level the playing field for students with disabilities. 3D printers can render molecules and mathematical models coated with braille to bring STEM learning to life for individuals with print disabilities; they can provide a boost of confidence to a child whose motor skills are compromised by cerebral palsy by helping him or her create an intricately shaped object; and they can energize a student with a learning disability by illustrating a practical application of a subject with which he or she may struggle. Participants discussed how libraries, schools and museums can collaborate to help disabled students everywhere enjoy these and more “leveling” applications of 3D printing technology.

As fruitful as Benetech’s San Jose convening was, its participants all agreed that it should represent the beginning of a broader conversation on the need to use technology to address the myriad of learning challenges facing disabled students; one that must include not just professionals from the library, museum and school communities, but also government decision makers, academics and the public. The more people we involve in the conversation, the closer we will come to de-electrifying the third rail of disability issues in the education, tech and policy arenas.

Benetech is already taking steps to broaden the conversation. Last month, Benetech staff and participants in its June convening on 3D printing held a webinar in which they highlighted several projects they’ve spearheaded to raise awareness of the capacity of 3D printing to put students with disabilities on a level playing field with their peers. These include the development of draft technical standards aimed at making 3D printing technology accessible, and the creation of 3D printing curricula that encourage teachers to use 3D printers and 3D printed objects to create new learning opportunities for students with disabilities.

The ALA Washington Office would like to thank Lisa Wadors Verne, Robin Seaman, Anh Bui, Julie Noblitt and the rest of the Benetech team for the opportunity to participate in its June convening on 3D printing. ALA Washington looks forward to continued engagement with Benetech – we have already begun discussions with Lisa, Robin, Anh, Julie and others about how libraries can promote their work to improve access to digital content and empower all people to enjoy the transformative power of technology.

You can read about a program on 3D printing and educational equity that Benetech has proposed for the 2016 SXSWedu conference here.

The post (Bene)tech as a leveler appeared first on District Dispatch.

DPLA: Announcing DPLA Workshops and Groups

planet code4lib - Fri, 2015-09-11 16:45

We’re pleased to share some news regarding the DPLA open committee calls, a series of conference calls that we’ve hosted for the past two years, and that helped us transition from DPLA’s planning phase into its implementation and existence as a nonprofit. Based on helpful feedback from our board, committees, and community, we have come to understand that these calls, which have been rather broad-based (on “Content,” “Tech,” etc.), have in many ways been superseded by our more focused open discussions, e.g., around our work on Hydra, and of course our critical ongoing work with our hubs and national network.

Those working areas, interest groups, and related virtual conversations will expand, and along with our continued open quarterly board calls, replace the open committee calls starting this fall. We are also in the process of starting up new groups to explore and better understand topics that are important to our community, including education, ebooks, archival description, Hydra-in-a-box, and others. Some of this work is taking place in special working groups like the Ebooks Working Group, the Archival Description Working Group, and the Education Advisory Committee. This initial slate of groups is just a starting point, and over time we will add new groups to reflect current priorities and interests. We expect that those who have been participating in the open committee calls will find one or more of these groups helpful, and we encourage you to get involved by signing up for the public discussion lists associated with these groups if you are interested.

As an additional replacement for the open calls, we’re excited to announce a new series of public workshops that will begin in October. These new online conversations will highlight subjects central to our community, such as education, metadata, and copyright, and will also include time for general DPLA updates and questions. We will be sure to rotate the topics so that areas formerly covered by our former content, tech, legal, and outreach calls are included, but the new format, piloted last spring, will allow us to get into much greater depth than we have been able to do on the more diffuse open calls.

We’re looking forward to instituting these changes over the next couple of months. Below you will find an initial schedule for these new public workshops, including links to register where appropriate. We will be adding more discussions to the list below as the year progresses.

Upcoming DPLA Workshops

Using DPLA for Teaching and Learning (November 3, 2015, 7:00pm Eastern)
In this workshop, DPLA staff and members of the DPLA’s Education Advisory Committee will demonstrate how participants can use DPLA’s search capabilities, primary source sets, and exhibitions in instruction. Questions? Email us.

What is an API? (Winter 2016)
Details forthcoming! Sign up for our mailing list to receive updates.


Copyright Basics (1 of 2) (Spring 2016)
Details forthcoming! Sign up for our mailing list to receive updates.


How to Create Good Rights and Access Statements (2 of 2) (Spring 2016)
Details forthcoming! Sign up for our mailing list to receive updates.


DPLA Groups

Archival Description Working Group
The Archival Description Working Group is composed of representatives from DPLA Partner institutions as well as national-level experts in digital object description and discovery. The group will explore solutions to support both item-level and aggregate-level approaches to digital object description and access and will develop recommendations, data models, and tools as appropriate.

eBooks Work Group
The purpose of this group is to hold focused conversations about developing a framework for a national ebook strategy. Topics include: User stories; Scoping tools/Challenges relating to content types; Defining the national marketplace; Licensing Best Practices Workgroup; and Demonstration Projects. Sign up for the discussion list to get involved in the conversation.

The purpose of this group is to share news and announcements about DPLA’s education projects.

Hydra-in-a-Box
DPLA, Stanford University and DuraSpace are partnering to extend the existing Hydra project codebase and its vibrant and growing community to build, bundle, and promote a feature-rich, robust, flexible digital repository that is easy to install, configure, and maintain. This next-generation repository solution — “Hydra in a Box” — will work for institutions large and small, incorporating the capabilities and affordances to support networked resources and services in a shared, sustainable, national platform. The overall intent is to develop a digital collections platform that is not just “on the web,” but “of the web.”

Open eBooks
Open eBooks is an app containing thousands of popular and award-winning titles that are free for unlimited use by low-income students. These eBooks can be read in an unlimited fashion, without checkouts or holds. Children from low-income areas can access these eBooks, which include some of the most popular works of the present and past, and read as many as they like without incurring any costs. We believe access to these books will serve as a gateway to even more reading, whether at libraries, bookstores, or through other ebook reading apps. Open eBooks is a partnership between Digital Public Library of America, New York Public Library, and First Book, with support from Baker & Taylor. Sign up for the discussion list to get involved in the conversation.


Upcoming DPLA Board Calls
  • Wednesday, September 16, 2015 at 3:00 PM Eastern
  • [Governance Committee] Wednesday, November 18, 2015, 1:00 PM Eastern
  • Tuesday, December 15, 2015 at 3:00 PM Eastern

Harvard Library Innovation Lab: Link roundup September 11, 2015

planet code4lib - Fri, 2015-09-11 16:12


The New York Times wrestled with many dimensions of video to visualize the making of a hit » Nieman Journalism Lab

NYTimes creates multiple versions of a video to tell the story of a Justin Bieber/Diplo/Skrillex hit

A Roving ‘Batmobile’ Is Helping Map Alaska’s Bats

Citizen scientists check out equipment from the library to collect data on bats for the Alaska Fish and Game Dept

Classic book jackets come to life – in pictures | Books | The Guardian

Living cover

This Tokyo Book Store Only Carries One Book at a Time | Mental Floss

A book store selling only one title per week

Backpack Makers Rethink a Student Staple

To build a better backpack, get out and see how people use backpacks.

Library of Congress: The Signal: Describing Records Before They Arrive: An NDSR Project Update

planet code4lib - Fri, 2015-09-11 15:45

The following is a guest post by Valerie Collins, National Digital Stewardship Resident at The American Institute of Architects.

At the American Institute of Architects, the AIA Archives is building a digital repository for permanent born-digital records that capture the intellectual capital of the AIA, or have continuing value to the practice of architecture. In a story that is probably familiar to many readers, the AIA has important digital records that are not currently stored in a central repository, and that are subject to accidental deletions or movement on the AIA’s shared drive. The challenge for our team in the archives is to find and identify these records and provide the AIA with a repository system that will do all of the following: preserve the records, be flexible enough to support the AIA’s changing information needs, and be simple for AIA staff to use in order to deposit their own digital records into the repository.

The project includes interviewing departments to determine the permanent digital records that they produce, choosing a repository system, and implementing the chosen system based on our requirements and the information we’ve gathered from our interviews. We’re in the process of department interviews and settling on our final system requirements, so this blog post will focus on one of the design principles behind how we plan to implement our repository. The main concept driving the design of our repository is “describing records before they arrive.”

Our team at the AIA believes that successful digital curation begins at the time of record creation. The more time that passes between when an important digital record is finalized and when it is moved to the archive, the greater the likelihood that the file or important information about it will be lost or forgotten. In an effort to collapse the danger area between creation and deposit, the model that the AIA is using flips the traditional method of describing records after they’ve arrived in the Archives by describing them before they arrive.

This method is not ideal for most traditional archives, but the AIA Archives has several characteristics, described below, that make this feasible. Other business or government archives may find this method useful as well.

  1. The AIA Archives is not a collecting archives, and has a clearly limited scope of materials it accepts
  2. The designated community of users is the same as contributors (staff members)
  3. The AIA has clearly definable programs. Programs are stable and records created for these programs are mostly the same year after year (which means they can be anticipated by the archivist)
  4. The AIA is a relatively small organization, and the Archives already works closely with many departments

One of the main features of this approach is that we are not structuring our repository around departments – we know they change, and change frequently. So instead of creating departmental collections of records in our repository with programs as subcategories, we are making the programs the main unit of organization – and we are treating departments as the “author” of that program’s records, which will connect related programs together if a user needs to see everything a department has produced. For our project, a “program” is a specific activity, product, or function of the AIA that produces records.

A traditional “fonds” approach to archives is to organize everything around the creating department. In an institutional repository for a university, for example, you may see a list of department collections to browse:

An example of a common academic Institutional Repository collections list.

If you browse through any of these collections, you’ll probably find papers produced from those faculties, maybe broken down further into collections centered on certain initiatives. This makes sense in the academic realm, where faculties, schools, and departments are slower to be renamed and reorganized, but at the AIA and in other organizations, there is far more fluidity in department organization and names.

When I first started this project in June, I was repeatedly told that “the programs are consistent over time; but departments change frequently.” I was skeptical about the longevity of programs (turns out some of these programs go back to the 1940s, and have produced basically the same kind of records since then). Neither did I understand how quickly departments could change – until we started meeting with current departments and discussing their permanent digital records. These department meetings have been crucial for understanding the records that each department produces, as well as the way in which departments interact with records.

Frequently, a department changes name, major programs are split between departments, or different aspects of a program are managed by different departments. Each record that is submitted to the repository should have metadata that connects it to the name of the department as it existed when the record was submitted, but the record should also be discoverable using the name of the current iteration of that department.

Currently on the AIA shared drive, staff members navigate to their content by clicking through folders that are modeled after the current organizational structure. One of the side effects of organizational restructuring is the challenge of maintaining a continual administrative history for each record and program, while also keeping the repository reflective of the current organizational structure. I mentioned earlier that we plan to treat departments as “authors” of records. Functionally, this would work in a similar manner to an author page (or a faculty profile page in a university IR), where additional information about the author can be found, with links to and from each book associated with that author. This department entity record would hold all of the administrative history of that department, similar to EAC-CPF records that pair with EAD finding aids, one describing the record creator and the other describing the collection.

The idea behind this is that a single update to that entity record as departments change will serve to keep track of the overall department function and role over time. For example, a department currently called “Communities by Design” used to be “Livable Communities” in the 1990s, but the programs the department is responsible for haven’t changed (except for one, which was moved to another department this year), and neither have the actual records produced for these programs changed much in the intervening decades. The repository should be able to track that a report on a community produced in 1999 was produced by Livable Communities, and the same kind of community report produced in 2015 came from Communities by Design, but that conceptually, these two department names represent the same function in the AIA.

For a more detailed look at how we’re planning to relate programs, records, and departments (and some of the metadata we’ll need to capture at each level) we’ve put together the graphic below:

Graphic detailing AIA’s planned records descriptions.
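To make those relationships a little more concrete, here is a rough sketch of how a record, a program, and a department entity might point at one another. The field names are invented for illustration and are not the AIA's actual metadata schema.

// Hypothetical sketch of the program/record/department-entity relationships.
var departmentEntity = {
  id: 'dept-communities-by-design',
  currentName: 'Communities by Design',
  formerNames: [ { name: 'Livable Communities', until: '2000s' } ],
  adminHistory: 'Administrative history of the department, similar to an EAC-CPF record.'
};

var program = {
  id: 'program-community-reports',
  name: 'Community design assistance reports',
  author: 'dept-communities-by-design' // the department treated as the "author"
};

var record = {
  title: 'Community report, 1999',
  program: 'program-community-reports',
  departmentAtSubmission: 'Livable Communities',   // name as it existed when submitted
  departmentEntity: 'dept-communities-by-design'   // resolves to the current name
};

A single update to departmentEntity.currentName after a reorganization would keep every linked record discoverable under the new name without rewriting the records themselves.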

You’ll notice that we’re using language more likely to be found in records management than in archives – for the most part, we are doing records management for the permanent digital records of the AIA. These records need to be saved in a manner so that staff can find and use them and they need to be preserved adequately for the future. Once we’ve managed this with current records, we’ll work backwards to organize older digital records or retrieve files off removable media.

This is just a brief description of one part of our approach to developing a digital repository to preserve the AIA’s permanent born digital records. Another similar approach is the Australian Series System (PDF). Our approach has definitely been influenced by the nature of departments and records here at the AIA, but there are pieces of it that I think could be a useful approach at other institutions. But the next step for the AIA Digital Repository team is to start implementing our ideas, putting the knowledge we’ve gained from our department meetings to work, and start describing those records before they arrive.

For more information about the National Digital Stewardship Residency program in Washington, DC, see the program website here.


David Rosenthal: Prediction: "Security will be an on-going challenge"

planet code4lib - Fri, 2015-09-11 15:00
The Library of Congress' Storage Architectures workshop gave a group of us 3 minutes each to respond to a set of predictions for 2015 and questions accumulated at previous instances of this fascinating workshop. Below the fold, the brief talk in which I addressed one of the predictions. At the last minute, we were given 2 minutes more, so I made one of my own.

One of the 2012 Predictions was "Security will be an on-going challenge".

It might seem that this prediction was about as risky as predicting "in three years time, water will still be wet". But I want to argue that "an on-going challenge" is not an adequate description of the problems we now understand that security poses for digital preservation. The 2012 meeting was about 9 months before Edward Snowden opened our eyes to how vulnerable everything connected to the Internet was to surveillance and subversion.

Events since have greatly reinforced this message. The US Office of Personnel Management is incapable of keeping the personal information of people with security clearances secure from leakage or tampering. Sony Pictures and Ashley Madison could not keep their most embarrassing secrets from leaking. Cisco and even computer security heavyweight Kaspersky could not keep the bad guys out of their networks. Just over two years before the meeting, Stuxnet showed that even systems air-gapped from the Internet were vulnerable. Much more sophisticated attacks have been discovered since, including malware hiding inside disk drive controllers.

Dan Kaminsky was interviewed in the wake of the compromise at the Bundestag:
No one should be surprised if a cyber attack succeeds somewhere. Everything can be hacked. ... All great technological developments have been unsafe in the beginning, just think of rail, automobiles and aircraft. The most important thing in the beginning is that they work, after that they get safer. We have been working on the security of the Internet and the computer systems for the last 15 years.

Yes, automobiles and aircraft are safer but they are not safe. Cars kill 1.3M and injure 20-50M people/year, being the 9th leading cause of death. And that is before their software starts being exploited.

For a less optimistic view, read A View From The Front Lines, the 2015 report from Mandiant, a company whose job is to clean up after compromises such as the 2013 one that meant Stanford had to build a new network from scratch and abandon the old one. The sub-head of Mandiant's report is:
For years, we have argued that there is no such thing as perfect security. The events of 2014 should put any lingering doubts to rest.

The technology for making systems secure does not exist. Even if it did, it would not be feasible for organizations to deploy only secure systems. Given that the system vendors bear no liability for the security of even systems intended to create security, this situation is unlikely to change in the foreseeable future. Until it is at least possible for organizations to deploy a software and hardware stack that is secure from the BIOS to the user interface, and until there is liability on the organization for not doing so, we have to assume that our systems will be compromised, the only questions being when, and how badly.

Our digital preservation systems are very vulnerable, but we don't hear reports of them being compromised. There are two possibilities. Either they have been and we haven't noticed, or it hasn't yet been worth anyone's time to do it.

In this environment the key to avoiding loss of digital assets is diversity, so that a single attack can't take out all replicas. Copies must exist in diverse media, in diverse hardware running diverse software under diverse administration. But this diversity is very expensive. Research has shown that the resources we have to work with suffice to preserve less than half the material that should be preserved. Making the stuff we have preserved safer means preserving less stuff.

To be fair, I should make a prediction of my own. If we're currently preserving less than half of the material we should, how much will we be preserving in 2020? Two observations drive my answer. The digital objects being created now are much harder and more expensive to preserve than those created in the past. Libraries and archives are increasingly suffering budget cuts. So my prediction is:
If the experiments to measure the proportion of material being preserved are repeated in 2020, the results will not be less than a half, but less than a third.

LITA: LITA Fall Online Continuing Education

planet code4lib - Fri, 2015-09-11 13:00
Immediate Registration Available Now for any of 4 webinars, or the web course

Check out all of the empowering learning opportunities at the line up page, with registration details and links on each of the sessions pages.

The offerings include 4 fast-paced, one-shot webinars:

Teaching Patrons about Privacy in a World of Pervasive Surveillance: Lessons from the Library Freedom Project, with Alison Macrina

Offered: October 6, 2015, 1:30 pm Central Time
In the wake of Edward Snowden’s revelations about NSA and FBI dragnet surveillance, Alison Macrina started the Library Freedom Project as a way to teach other librarians about surveillance, privacy rights, and technology tools that protect privacy. In this 90 minute webinar, she’ll talk about the landscape of surveillance, the work of the LFP, and some strategies you can use to protect yourself and your patrons online.

Creative Commons Crash Course, with Carli Spina
Offered: October 7, 2015, 1:30 pm Central Time
Since the first versions were released in 2002, Creative Commons licenses have become an important part of the copyright landscape, particularly for organizations that are interested in freely sharing information and materials. Participants in this 90 minute webinar will learn about the current Creative Commons licenses and how they relate to copyright law. This webinar will follow up on Carli Spina’s highly popular Ignite Session at the 2015 ALA Mid Winter conference.

Digital Privacy Toolkit for Librarians, with Alison Macrina
Offered: October 20, 2015
This 90 minute webinar will include a discussion and demonstration of practical tools for online privacy that can be implemented in library PC environments or taught to patrons in classes/one-on-one tech sessions, including browsers for privacy and anonymity, tools for secure deletion of cookies, cache, and internet history, tools to prevent online tracking, and encryption for online communications.

Top Technologies Every Librarian Needs to Know – 2, with Steven Bowers, A.J. Million, Elliot Polak and Ken Varnum
Offered: November 2, 2015, 1:00 pm Central Time
We’re all awash in technological innovation. It can be a challenge to know what new tools are likely to have staying power and what that might mean for libraries. The 2014 LITA Guide, Top Technologies Every Librarian Needs to Know, highlights a selected set of technologies that are just starting to emerge and describes how libraries might adapt them in the next few years. In this fast-paced, hour-long webinar, join the authors of three chapters as they talk about their technologies and what they mean for libraries. The chapters covered will be:

  • Impetus to Innovate: Convergence and Library Trends, with A.J. Million
  • The Future of Cloud-Based Library Systems with Elliot Polak & Steven Bowers
  • Library Discovery: From Ponds to Streams with Ken Varnum
Plus the 4-week deep-dive web course:

Personal Digital Archiving for Librarians, with Melody Condron
Offered: October 6 – November 3, 2015
Most of us are leading very digital lives. Bank statements, interactions with friends, and photos of your dog are all digital. Even as librarians who value preservation, few of us organize our digital personal lives, let alone back them up or make plans for them. Participants in this 4-week online class will learn how to organize and manage their digital selves. Further, as librarians, participants can use what they learn to advocate for better personal data management by others. “Train-the-trainer” resources will be available so that librarians can share these tools and practices with students and patrons in their own libraries after taking this course.

Sign up for any and all of these great sessions today.

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4269 or Mark Beatty,


Alf Eaton, Alf: Distributed Asynchronous Composable Resources

planet code4lib - Fri, 2015-09-11 07:44

Imagine a data table, where each row is an item and each column is a property.

It might look like this:

url     name        published
        Item One    2015-09-10
        Item Two    2015-09-11

The table is a representation of a collection of objects, each with several properties.

Using JavaScript notation, they would look like this:

[
  {
    url: '',
    name: 'Item One',
    published: '2015-09-10',
  },
  {
    url: '',
    name: 'Item Two',
    published: '2015-09-11',
  }
]

An abstract definition of the object, using Polymer’s notation, would look like this:

{
  properties: {
    url: { type: URL },
    name: { type: String },
    published: { type: String }
  }
}

You might notice that the published property is represented as a String, when it would be easier to use as a Date object. To convert it, you could add a “computed property”: a function that takes one or more existing properties as input, and outputs a new property:

{
  properties: {
    url: { type: URL },
    name: { type: String },
    published: { type: String },
    publishedDate: {
      type: Date,
      computed: function(published) {
        return new Date(published)
      }
    }
  }
}

From this definition, you can see that the publishedDate property has a dependency on the published property: any computed properties should be updated when any of its dependencies are updated. In this case, when the published property is updated, the publishedDate property is also updated.
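To make the dependency relationship concrete, here is a minimal sketch of that behaviour in plain JavaScript. It isn’t Polymer’s implementation (Polymer derives dependencies from the computed function’s signature); the explicit dependencies array and the createItem and set names are assumptions made for illustration:

// A minimal sketch of dependency-driven recomputation (illustrative only:
// Polymer infers dependencies from the computed function's signature, while
// here they are listed in an explicit, hypothetical `dependencies` array).
function createItem(definition, values) {
  var item = Object.assign({}, values);
  var props = definition.properties;

  // setting a property recomputes every computed property that depends on it
  item.set = function (name, value) {
    item[name] = value;
    Object.keys(props).forEach(function (key) {
      var prop = props[key];
      if (prop.computed && prop.dependencies.indexOf(name) !== -1) {
        var args = prop.dependencies.map(function (dep) { return item[dep]; });
        item[key] = prop.computed.apply(null, args);
      }
    });
  };

  return item;
}

var item = createItem({
  properties: {
    published: { type: String },
    publishedDate: {
      type: Date,
      dependencies: ['published'],
      computed: function (published) { return new Date(published); }
    }
  }
}, {});

item.set('published', '2015-09-10');
// item.publishedDate is now a Date object for 2015-09-10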

This is fine when the dependencies are all stored locally, but it’s also possible to imagine data that’s stored elsewhere. For example, this object might have associated metrics data counting how many times it’s been viewed:

{
  properties: {
    url: URL,
    name: String,
    published: String,
    publishedDate: {
      type: Date,
      computed: function(published) {
        return new Date(published)
      }
    },
    viewCount: {
      type: Number,
      computed: function(url) {
        return Resource(url).get('json').then(function(data) {
          return data.views
        });
      }
    }
  }
}

The Resource object used above is a Web Resource, part of a library I built to make it easier to fetch and parse remote resources. If it helps, an alternative using the standard Fetch API would look like this:

return fetch(url).then(function(response) {
  return response.json()
}).then(function(data) {
  return data.views
});

In either of those cases, the data is being fetched asynchronously, and a Promise is returned. Once the Promise is resolved, the viewCount property is updated. If this property was bound to the original table, you would see the new values being filled in as the data arrives!
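Extending the same sketch, a computed function that returns a Promise can have its value filled in when the Promise resolves. This is only an illustration of the idea, not any particular library’s behaviour:

// Continuing the sketch above: if a computed function returns a Promise,
// assign the value once it resolves; anything bound to the item (such as a
// table cell) would then update when the data arrives.
function computeProperty(item, key, prop) {
  var args = prop.dependencies.map(function (dep) { return item[dep]; });
  var result = prop.computed.apply(null, args);

  if (result && typeof result.then === 'function') {
    // asynchronous: the cell is filled in later, when the Promise resolves
    result.then(function (value) { item[key] = value; });
  } else {
    // synchronous: the cell is filled in immediately
    item[key] = result;
  }
}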

url     name        published     views
        Item One    2015-09-10    1000
        Item Two    2015-09-11    2000

Implementations

I talked about this kind of thing at XTech in 2008, illustrating the object as a Katamari Damacy-style “ball of stuff”, passed around various services and accumulating properties as it goes.

vege-table is an implementation of asynchronous composable resources: it’s an interface for adding computed properties to a collection of items and fetching the data asynchronously into a table.

Talis’ data platform had a similar feature, where results from a SPARQL query could be augmented by passing each result through another data store, matching on identifiers and adding selected properties each time.

The SERVICE feature of Wikidata’s SPARQL endpoint is also similar: it takes an object in each result and passes it to a specific service, assigning the resulting data to a specified property.

In a Google Sheet, adding a column/property using the IMPORT* functions will fetch data from a remote resource, inserting the value into the table when it arrives.

In OpenRefine, remote data can be fetched from web services and added to each item in the background.

Are there any others?

Alf Eaton, Alf: What colour is a tree?

planet code4lib - Fri, 2015-09-11 06:58

As interesting as individual, composable objects are, the real understanding comes when a collection of items is analysed as a whole (or as a filtered part).

There’s more to a collection of items than is immediately obvious - it’s not just a [1, 2, 3] list, with "array" methods for filtering and iteration: the Collection itself is an object with its own set of observable properties - many of which are summaries, in some way, of the properties in the items in the collection.

These summaries describe some aggregate quality of the collection, and - ideally - an indication of the variance, or confidence intervals, for that value within the collection.

For example, consider this question:
What colour is a tree?

When you look at a tree, what you’re really seeing is a collection: the same tree observed over time. Your eye analyses the light arriving from the tree, and your brain tries to summarise the wavelengths it’s seeing. The colours might cycle as day and night pass, and they might cycle over longer periods as the seasons pass.

If you look around, you’ll see trees with differently coloured leaves, depending on their genotype and phenotype. This is also a collection of trees, but one distributed over space rather than time. The further away you look, the greater the likelihood that a tree’s colour will differ from that of the trees closest to you - the variance within the collection increases.

So: observed properties of a collection can vary over time, or over space, depending on the conditions in which they’re found and the conditions of observation.

The observed colour of a tree - or a collection of trees - is a function with many inputs and one output: the wavelength(s) of light that leave the tree and enter your eye (or some other detector).

For any collection of items, a function can be written that describes one of their properties under certain conditions.

For example, the value(s) that this function outputs might be the mean (average) and standard deviation of a series of measurements over time, or it may group those values into buckets (the sort of data that might be displayed as a bar chart).
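As a rough illustration, such a summary function might look like this in JavaScript (the summarise name, the bucket width and the sample wavelengths are invented for the example):

// One way to summarise a collection of numeric measurements: the mean,
// the standard deviation, and a grouping into fixed-width buckets.
function summarise(values, bucketSize) {
  var mean = values.reduce(function (sum, v) { return sum + v; }, 0) / values.length;
  var variance = values.reduce(function (sum, v) {
    return sum + Math.pow(v - mean, 2);
  }, 0) / values.length;

  // count how many values fall into each bucket, e.g. for a bar chart
  var buckets = {};
  values.forEach(function (v) {
    var bucket = Math.floor(v / bucketSize) * bucketSize;
    buckets[bucket] = (buckets[bucket] || 0) + 1;
  });

  return { mean: mean, sd: Math.sqrt(variance), buckets: buckets };
}

// e.g. wavelengths (in nm) measured from a collection of leaves
summarise([520, 530, 545, 560, 610], 50);
// => { mean: 553, sd: ~31.6, buckets: { '500': 3, '550': 1, '600': 1 } }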

This is how we understand the world, and why we like collecting and classifying objects. To be able to understand the shared properties of items in a group, and differences from items in a different group, is to begin to understand them. Once we group items together, we can start to predict how they might behave.

SearchHub: How Cloudera Secures Solr with Apache Sentry

planet code4lib - Thu, 2015-09-10 22:30
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Cloudera’s Gregory Chanan’s session on securing Solr with Apache Sentry.

Apache Solr, unlike other enterprise Big Data applications that it is increasingly deployed alongside, provides minimal security features out of the box. This limitation makes it significantly more burdensome for organizations to deploy Solr than solutions that have built-in support for standard authentication and authorization mechanisms. Apache Sentry is a project in the Apache Incubator designed to address these concerns. Sentry augments Solr with support for Kerberos authentication as well as collection- and document-level access control. In this talk, we’ll discuss the ACL models and features of Sentry’s security mechanisms. We will also present implementation details on Sentry’s integration with Solr. Finally, we will present performance measurements in order to characterize the impact of integrating Sentry with Solr.

Gregory Chanan is a Software Engineer at Cloudera working on Search, where he leads the security integration efforts around Apache Solr. He is a committer on the Apache HBase and Apache Sentry (incubating) projects and a contributor to various other Apache projects. Prior to Cloudera, he worked as a Software Engineer for the distributed computing software startup Optumsoft.

Secure Search – Using Apache Sentry to Add Authentication and Authorization Support to Solr: Presented by Gregory Chanan, Cloudera from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Cloudera Secures Solr with Apache Sentry appeared first on Lucidworks.

Nicole Engard: Bookmarks for September 10, 2015

planet code4lib - Thu, 2015-09-10 20:30

Today I found the following resources and bookmarked them on Delicious.

  • MadEye: a collaborative web editor backed by your filesystem.

Digest powered by RSS Digest

The post Bookmarks for September 10, 2015 appeared first on What I Learned Today....

Related posts:

  1. September Workshops
  2. Software Freedom Day in September
  3. Getting started with a manageable OS project

Alf Eaton, Alf: Fetching Web Resources

planet code4lib - Thu, 2015-09-10 19:47

Imagine someone new to writing code for the web. They’ve read Tim Berners-Lee’s books, and understand that there are Resources out there, with URLs that can be used to fetch them. What code do they need to write to fetch and use those Resources?

  1. Use jQuery.ajax.

    Question: What’s “ajax”…?

    Answer: It’s an acronym (AJAX). It stands for

    Asynchronous (fair enough)

    JavaScript (ok)

    And (er…)

    XML (oh.)

  2. Use XMLHttpRequest.


If you’re working with JSON or HTML (which is probably the case), these interface names make no sense. And that’s before you get into the jQuery.ajax option names (data for the query parameters, dataType for the response type, etc).

As is apparently the way with all DOM APIs, XMLHttpRequest wasn’t designed to be used directly. It also doesn’t return a Promise, though there’s an onload event that gets called when the request finishes. Additionally, query strings are treated as just plain strings, when they’re actually a serialisation of a set of key/value pairs.
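To make that concrete, fetching JSON with bare XMLHttpRequest means serialising the query string by hand and wiring up an event handler instead of consuming a Promise (the url variable and the parameter names here are placeholders):

// Fetching JSON with bare XMLHttpRequest: the query string is serialised by
// hand, and the result arrives via an event rather than a Promise.
var params = { q: 'trees', page: 2 };
var query = Object.keys(params).map(function (key) {
  return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
}).join('&');

var request = new XMLHttpRequest();
request.open('GET', url + '?' + query);
request.responseType = 'json'; // ask the browser to parse the response body
request.onload = function () {
  console.log(request.response); // the parsed data
};
request.send();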

The Fetch API is an attempt to improve this situation, but it’s still quite unwieldy (being a low-level interface):

fetch(url).then(function(response) {
  return response.json();
}).then(function(data) {
  // do something with the data
});

What’s really going on, and what should the interface look like?

  1. There’s a Resource on the web, with a URL:

    var resource = new Resource(url)
  2. The URL may have query parameters (filters, essentially):

    var resource = new Resource(url, params)
  3. When an action (get, put, delete) is performed on a Resource, a Request is made to the URL of the resource. This is usually a HTTP request.

  4. The Resource is available in multiple formats:


    resource.get('json') (sets the Accept header to ‘application/json’, and parses the response as JSON)

    resource.get('html') (sets the Accept header to ‘text/html’, and parses the response as HTML)
  5. The Resource may be contained in a data wrapper. If the response is JSON, HTML or XML, the browser will parse it into the appropriate data object or DOM document:

     return resource.get('json').then(function(data) {
       return data.item;
     });

     return resource.get('html').then(function(doc) {
       return doc.querySelector('#item');
     });
  6. Allow the selector(s) for extraction to be specified declaratively, avoiding the use of querySelector directly:

    return resource.get('html', { select: '#item', });
  7. To reduce the amount of code, allow a new instance of a Resource object to be created in a single line:

    return Resource(url).get('json').then(function(data) { return data.item; });
  8. A Collection is a paginated Resource. Given a page of items, it needs to know how to find a) each item in the page and b) the URL of the next/previous page, if there is one:

     Collection(url).get('json', {
       // select the array of items
       items: function(data) { return data.artists.items; },
       // select the URL of the next chunk
       next: function(data) { return data.artists.next; }
     }).then(function(items) {
       // do something with the items
     });
  9. Instead of sending hundreds of requests to the same domain at once, send them one at a time: each Request is added to a per-domain Queue. When one request finishes, the next request in the queue is sent (a minimal sketch of such a queue follows this list).
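
Here is a minimal sketch of such a queue. It’s an illustration of the idea rather than the actual web-resource implementation; the queues map and the enqueue function are invented names:

// A minimal per-domain queue: requests to the same domain are chained so
// that each one starts only when the previous one has settled.
var queues = {};

function enqueue(url) {
  var domain = new URL(url).host;
  var queue = queues[domain] || Promise.resolve();

  var request = queue.then(function () {
    return fetch(url);
  });

  // the next request for this domain waits for this one, success or failure
  queues[domain] = request.catch(function () {});

  return request;
}

// same-domain requests run in series; different domains run in parallel
enqueue('https://example.org/one');
enqueue('https://example.org/two');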

x-ray is a really nice implementation of a scraper for extracting collections of data from HTML web pages. It doesn’t extend to other data formats, though.

web-resource is my JavaScript library that implements the Resource and Collection interfaces described above.

Alf Eaton, Alf: It's a shame about Google Plus

planet code4lib - Thu, 2015-09-10 19:40

Google Plus was formed around one observation: most of the people on the web don't have URLs.

When people don’t have URLs, it’s difficult to make assertions about them.

For example, to show you which restaurants people you trust* have recommended in an area you’re visiting, a recommendation system needs to have a latitude + longitude for the area, a URL for each restaurant (solved by Google Places) and a URL for each person (solved, ostensibly, by Google Plus).

People might be leaving reviews in TripAdvisor, or Yelp, and there’s no obvious way to tie all those people together into any kind of coherent social graph. Even with Gmail, there's no way to say that the person you email is the same person who's left a review, unless they have a URL (i.e. a Google Plus account) that connects the email address and the reviewer account together.

One URL to rule them all

Google Plus has an extremely clever way of linking together all those accounts, which involves starting with one trusted URL (Google Plus account), linking to another URL (GitHub, say), then linking back from that URL to your Google Plus account to prove that you own the GitHub account and can write to it. Now that both of those URLs are trusted, either of them can be used as the basis of a new trusted connection: linking from the trusted GitHub URL to a Flickr URL, and then from the Flickr URL to the trusted Google Plus URL (or any other trusted profile URL), is enough to prove that you also own the Flickr account and can write to it.

The problem is (and the question “why” is an interesting one), even after people had their Google Plus account, they didn’t use it to post reviews. It’s mystifying why most businesses don’t have thousands of reviews. The problem, I think, is what each URL represents. Each of my online profiles on different sites is literally a different “profile”, and I only choose to link some of them together. I don’t consider all of them to be equivalent. Someone’s TripAdvisor persona - the one they present to hotels and B&B’s - is not necessarily the one they’d use for LinkedIn, or for writing a peer review of an academic article.

When Google tried to connect YouTube accounts to Google Plus accounts, and failed, it was because people felt that those personas were distinct, and wanted the freedom to do certain things on YouTube without having it show up on their “personal record” in Google Plus.

This also perhaps explains why people are wary of using Google Plus authentication to sign in to an untrusted site - they’re not so much worried about Google knowing where their accounts are, but also that the untrusted site might create a public profile for them without asking, and link it to their Google Plus profile.

Anyway, Google Plus is going away as a social network, and maybe even as a public profile, but the data’s still going to be connected together behind the scenes - perhaps using fuzzier, less explicit connections as a basis for recommendations and decision-making.

* Note that this doesn’t necessarily mean “friends”, or even “people you know” - it’s quite common to trust the recommendations of a large group of strangers more than a small number of friends (particularly when it’s a recommendation of what that person likes rather than what they think you’d like - perhaps people consider themselves to be closer to “average” than each of their friends individually?)

