Feed aggregator

David Rosenthal: Vint Cerf's talk at AAAS

planet code4lib - Tue, 2015-02-17 16:00
Vint Cerf gave a talk entitled Digital Vellum at the AAAS meeting last Friday that has received a lot of attention in the media, including follow-up pieces by other writers, and even drew the attention of Dave Farber's famed IP list. I have some doubts about how accurately the press has reported his talk, which isn't available via the AAAS meeting website. I am commenting on the reports, not the talk. But, as The Register points out, Cerf has been making similar points for some time. I did find a TEDx talk he titled Bit Rot on YouTube, uploaded a year ago. Below the fold is my take.

Cerf's talk was the first in a session devoted to Information-Centric Networks:
Vinton Cerf’s talk discusses the desirable properties of a "Digital Vellum" - a system that is capable of preserving the meaning of the digital objects we create over periods of hundreds to thousands of years. This is not about preserving bits; it is about preserving meaning, much like the Rosetta Stone. Information Centric Networking may provide an essential element to implement a Digital Vellum. This long-term thinking will serve as a foundation and context for exploring ICNs in more detail.

ICN is a generalization of the Content-Centric Networks about which I blogged two years ago. I agree with Cerf that these concepts are probably very important for long-term digital preservation, but I disagree about why. ICNs make it easy for Lots Of Copies to Keep Stuff Safe, and thus make preserving bits easier, but I don't see that they affect the interpretation of the bits.

There's more to disagree with Cerf about. What he calls "bit rot" is not what those in the digital preservation field call it. In his 1995 Scientific American article Jeff Rothenberg analyzed the reasons digital information might not reach future readers:
  • Media Obsolescence - you might not be able to read bits from the storage medium, for example because a reader for that medium might no longer be available.
  • Bit Rot - you might be able to read bits from the medium, but they might be corrupt.
  • Format Obsolescence - you might be able to read the correct bits from the medium but they might no longer be useful because software to render them into an intelligible form might no longer be available.
Media Obsolescence was a big problem, but as Patrick Gilmore pointed out on Farber's list it is much less of a problem now that most data is on-line and thus easily copied to replacement media.

Bit Rot (not in Cerf's sense) is an inescapable problem - no real-world storage system can be perfectly reliable. In the TEDx talk Cerf simply assumes it away.

Format Obsolescence is what Cerf was discussing. There is no doubt that it is a real problem, and that in the days before the Web it was rampant. However, the advent of the Web forced a change. Pre-Web, most formats were the property of the application that both wrote and read the data. In the Web world, these two are different and unrelated.

Google is famously data-driven, and there is data about the incidence of format obsolescence - for example the Institut National de l'Audiovisuel surveyed their collection of audiovisual content from the early Web, which would be expected to have been very vulnerable to format obsolescence. They found an insignificant amount. I predicted this finding on twofold theoretical grounds three years before their study:
  • The Web is a publishing medium. The effect is that formats in the Web world are effectively network protocols - the writer has no control over the reader. Experience shows protocols are the hardest things to change in incompatible ways (cf. Postel's Law, "no flag day on the Internet", IPv6, etc.).
  • Almost all widely used formats have open source renderers, preserved in source code repositories. It is very difficult to construct a plausible scenario by which a format with an open source renderer could become uninterpretable.
Even The Guardian's Samuel Gibbs is skeptical of Cerf's worry about format obsolescence:
That is the danger of closed, proprietary formats and something consumers should be aware of. However, it is much less of an issue for most people because the majority of the content they collect as they move through life will be documented in widely supported, more open formats.

While format obsolescence is a problem, it is neither significant nor pressing for most digital resources.

However, there is a problem that is both significant and pressing that affects the majority of digital resources. By far the most important reason that digital information will fail to reach future readers is not technical, or even the very real legal issues that Cerf points to. It is economic. Every study of the proportion of content that is being preserved comes up with numbers of 50% or less. The institutions tasked with preserving our digital heritage, the Internet Archive and national libraries and archives, have nowhere close to the budget they would need to get that number even up to 90%.

Note that increasingly people's and society's digital heritage is in the custody of a small number of powerful companies, Google prominent among them. All the examples from the TEDx talk are of this kind. Experience shows that the major cause of lost data in this case is the company shutting the service down, as Google does routinely. Jason Scott's heroic Archive Team has tried to handle many such cases.

These days, responsibility for ensuring that the bits survive and can be interpreted rests primarily on Cerf's own company and its peers.

Access Conference: AccessYYZ Preconference Poll

planet code4lib - Tue, 2015-02-17 15:49

We’re deep into planning AccessYYZ and we want to know what we can do to make this conference perfect for you! On Tuesday, September 8 we’ll be throwing pre-conference events and we want your input on what we should organize. Should pre-con programming be kept more in the vein of other Access cons with just the Hackfest? Or should we switch it up to include a different option for those of you interested in gaining some tangible tech skills? Who better to decide this than you, the future attendees? Exactly what we thought.

The survey below includes the following options:

The Hackfest
Hackfest is an informal event that gives interested people the chance to work together in small groups to tackle interesting projects in a low-stress environment. It’s open to attendees of all abilities and backgrounds, not just programmers or systems librarians. Hackfest is the place to roll up your sleeves and experiment with that new service idea or pick other people’s brains on a tricky problem. Just to be clear, the Hackfest is definitely occurring, regardless of the results of this poll.

Software Carpentry Workshop
Software Carpentry teaches basic computing skills in a peer-led environment. Originally developed for scientists, the organization has recently developed sessions specifically targeted at library workers. Generally, the workshop covers topics such as:

  • the Unix shell (and how to automate repetitive tasks)
  • Python or R (and how to grow a program in a modular, testable way)
  • Git and GitHub (and how to track and share work efficiently)
  • SQL (and the difference between structured and unstructured data)

You can learn more about Software Carpentry – both the workshops and the organization behind them – here.

Something else
Have a great idea for a 1-day something-or-other that we can organize for relatively little money? Let us know in the survey below!

 

What Should Our Pre-Conference Activities (on Tuesday) Look Like?

I'd love to attend...
  • The Hackfest
  • A Software Carpentry Workshop
  • Nothing! I don't plan to attend preconference events.
  • Got a better idea? Let us know!

DPLA: CLIR Hidden Collections and DPLA

planet code4lib - Tue, 2015-02-17 15:45

Recently, the Council on Library and Information Resources (CLIR) announced a national competition to digitize and provide access to collections of rare or unique content in cultural heritage institutions, generously funded by the Andrew W. Mellon Foundation. This new iteration of the popular Hidden Collections program will enhance the emerging global digital research environment in ways that support expanded access and new forms of research for the long term. Its aim is to ensure that the full wealth of resources held by institutions of cultural heritage becomes integrated with the open web.

DPLA is excited to see among the program’s core values, the inclusion of three that are close to our hearts: Openness, Sustainability, and Collaboration.

We applaud CLIR in supporting DPLA by requiring all metadata created by the program to be explicitly dedicated to the public domain through a Creative Commons Public Domain Declaration License.

It is also admirable that the program states institutions may not claim additional rights or impose additional access fees or restrictions to the digital files created through the project, beyond those already required by law or existing agreements. Materials that are in the public domain in analog form must continue to be in the public domain once they have been digitized.

Here at DPLA we recognize that one key to sustainability is the use of standards. CLIR’s Digital Library Federation program has developed an insightful wiki that is useful not only to program applicants but to all those interested in how to manage digitization workflows.

Collaboration at DPLA not only happens among our contributing cultural heritage institutions, but we are also actively seeking ways for DPLA to partner with like-minded organizations. Working with CLIR and their Hidden Collections program is just one way we are connecting efforts, and we look forward to an even wider array of materials being made available to the public.

LITA: Diagrams Made Easy with LucidChart

planet code4lib - Tue, 2015-02-17 13:00

Editor’s note: This is a guest post by Marlon Hernandez 

For the past year, across four different classes and countless bars, I have worked on an idea that is quickly becoming my go-to project for any Master of Information Science assignment: the Archivist Beer Vault (ABV) database. At first it was easy to explain the contents: BEER! After incorporating more than one entity the explanation grew a bit murky:

ME: So remember my beer database? Well now it includes information on the brewery, style AND contains fictional store transactions
WIFE: Good for you honey.
ME: Yeah unfortunately that means I need to add a few transitive prop… I lost your attention after beer, didn’t I?

Which is a fair reaction, since describing the intricacies of abstract ideas such as entity relationship diagrams requires clear-cut visuals. However, drawing these diagrams usually requires either expensive programs like Microsoft Visio (student rate $269) or underwhelming freeware. Enter Lucidchart, an easy-to-use and relatively inexpensive diagram solution.

The website starts off users with a few templates to modify from 16 categories, such as Flowchart and Entity Relationship (ERD),  or you can opt for a blank canvas. I prefer selecting the Blank (Name of Diagram) option as it clears the field of any unneeded shapes and preselects useful shapes.

Preselected shapes for a Blank Flowchart document

While these shapes should be more than enough for standard diagrams, you are also free to mix and match shapes, such as using flowchart shapes for your wireframe diagram. This is especially helpful when creating high fidelity wireframes that require end product level of detail.

Once you have selected your template it is easy to begin your drawing by dragging the desired shapes onto the canvas. Manipulating shapes and adding text overlays is straightforward: you merely click the edge of the shape you want and adjust its size, either manually or by setting a specific pixel size. Using the program is akin to having access to Photoshop’s powerful image manipulation tools but in a streamlined, user-friendly UI. Most users can get by with just the basic options, but for advanced users there are settings to adjust page size and orientation, add layers, view revision history, set theme colors, adjust image size, and control advanced text options. The frequently updated UI adds user-requested features and contains tutorials within the diagram menu.

Adjust shapes by clicking on corners or select Metrics to adjust to specific size.

It also contains intuitive features such as converting lines that connect entities into cardinality notations with pulldown options to switch to  the desired notation style. This feature is not only practical but can also help with development. Getting back to the ABV, as I drew the entity structures and their cardinalities I realized I needed to add a few more transitive entities and normalize some of the relationships as I had a highly undesirable many-to-many relationship between my purchase table and items. As you can see below, the ABV’s ERD makes the complex relationships much more accessible to new users.

BEHOLD! BEER!
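For readers who have not met the many-to-many problem mentioned above, here is a minimal sketch of the usual fix: a junction table sitting between purchases and items. The post does not list the ABV schema, so every table and column name below is hypothetical, and SQLite (via Python's standard library) is used only because it needs no setup.

```python
import sqlite3

# Hypothetical schema: the real ABV tables and columns are not given in the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY, purchased_on TEXT);
    CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT);
    -- Junction table: one row per (purchase, item) pair, so the
    -- many-to-many link becomes two one-to-many links.
    CREATE TABLE purchase_item (
        purchase_id INTEGER REFERENCES purchase(purchase_id),
        item_id     INTEGER REFERENCES item(item_id),
        quantity    INTEGER NOT NULL,
        PRIMARY KEY (purchase_id, item_id)
    );
""")
conn.execute("INSERT INTO purchase VALUES (1, '2015-02-01')")
conn.executemany("INSERT INTO item VALUES (?, ?)", [(1, "Stout"), (2, "Pale Ale")])
conn.executemany("INSERT INTO purchase_item VALUES (?, ?, ?)", [(1, 1, 2), (1, 2, 6)])

# List the items on purchase 1 by joining through the junction table.
rows = conn.execute("""
    SELECT i.name, pi.quantity
    FROM purchase_item AS pi
    JOIN item AS i ON i.item_id = pi.item_id
    WHERE pi.purchase_id = ?
    ORDER BY i.name
""", (1,)).fetchall()
print(rows)
```

Each purchase can now hold many items and each item can appear on many purchases, without repeating purchase or item details in either table.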

It was easy to move tables around as LucidChart kept the connections on a nice grid pattern, which I could also easily override if need be. This powerful flexibility led to a clean deliverable for my term project. The positive experience I had creating this ERD led me to try out the program for a more complex task, creating wireframes for a website redesign project in my Information Architecture class.

Tasked with redesigning a website that uses dated menu and page structures, our project required the creation of low, medium, and high fidelity wireframes. These wireframes present a vision for the website redesign with each type adding another layer of detail. In low fidelity wireframes, image placeholders are used and the only visible text is the high-level menu items, while dummy text fills the rest. Thankfully LucidChart’s wireframe shapes contained the exact shapes we needed. Text options are limited but it did contain one of the fonts from our CSS font family property. Once we reached the high fidelity phase it was easy to import our custom images and seamlessly add them to our diagram.

Low, Medium, and High fidelity wireframes of a redesign project.

Once again LucidChart provided a high quality deliverable that impressed my peers and professor. With these wireframes I was able to design the finished product. With LucidChart’s focus on IT/engineering, product management & design, and business, you can find a vast amount of shapes and templates for most of your diagram needs such as Android mockups, flowcharts,  Venn diagrams and even circuit diagrams. There are a few more perks about LucidChart and a few lows.

PERKS
  • Free… sort of: For single users there are three levels of pricing: Free, Basic $3.33/month (paid annually), and Pro $8.33/month (paid annually). Each level adds just a bit more functionality than the last. The free account will get you up and running with basic shapes but is limited to 60 per document. Not too bad if you are creating simple ERDs. Require more than 60 objects or an active line to their support? Consider upgrading to Basic. Need to create wireframes? Well, you’ll need a Pro account for that. Thankfully, they are actively seeking to convert Visio users by offering promotional pricing for certain users. For instance, university students and faculty can follow the instructions on this page to request a free upgrade to a Pro account. Other promotions include 50% off for nonprofits and free upgrades for teachers. Check out this page to see if you qualify for a free or discounted Pro account. I can only speak for the Education account, which adds not only the Pro features but also the Google Apps integration normally found under Team accounts.
  • Easy collaboration… for a price: As seen in the figure below, users can reply, resolve or reassign comments on any aspect of the diagram.

    Comments example

    All account levels include these basic functions. However, a revision history that tracks edits made by collaborators requires a Pro account. Moreover, sharing custom templates and shapes is reserved for Team account users; Team accounts start at $21/month for 5 users.
    One final note: each collaborator is tied to their own account limitations which means free account users may only use 60 shapes even if they are working on a diagram created by a Pro account.

  • Chrome app: The Chrome app converts the website into a nice desktop application that is available offline. Once you are back online the application instantly syncs to their cloud servers. The app is fully featured and responds quicker than working on the website. Using the app is a much more immersive experience than the website.
LOWS
  • Pricing for non-students: As you can see by now LucidChart has an aggressive pricing plan. The Free account is enough for most users to decide if they want to create diagrams that involve more than 60 shapes. It is a bit disappointing to see that the Basic account only adds unlimited shapes and email support. Furthermore, wireframes and mockups are locked up behind the Pro level. Most of these Pro features should really fall under Basic. Still, the $99 annual price for a LucidChart Pro account is far less than Visio, which starts at $299 for non-students.
  • Chrome app stability: For the most part the website has been a flawless experience; the same cannot be said for their Chrome app. There have been times when the application crashed to desktop (the constant syncing did save all of my work) or some shapes became unresponsive. There is also an ongoing bug that keeps showing me deleted documents, which do not appear on the website.

    Icons in grey were deleted months ago but still show up in the Chrome App

    None of these knocks against the app have prevented me from using it but it is worth mentioning that the app is a work in progress and can feel like a lower priority for the company.

Don’t just take my word for it: you can try out a demo on their website that contains most of the Pro features. Are there any projects you can see yourself using LucidChart for? Have a Visio alternative to share? I’d love to hear about any experiences other users have had.

Marlon Hernandez is an Information Science Technician at the Jet Propulsion Laboratory Library where he helps run the single-service desk, updates websites and deals with the temperamental 3D printer. He is currently in the final year of the MS-IS program at the University of North Texas.  You can find him posting running reviews, library projects and beer pictures on his website: mr-hernandez.com.

Peter Sefton: A quick open letter to eResearch@UTS

planet code4lib - Tue, 2015-02-17 02:43

2015-02-17, from a tent at the University of Melbourne

Hi team,

Thanks for a great first week last week and thanks for the lunch Peter Gale – I think I counted 12 of us around the table. I thought the week went well, and I actually got to help out with a couple of things, but you’ll all be carrying most of the load for a little while yet while I figure out where the toilets are, read through those delightful directives, policies and procedures that are listed in the induction pack, and try to catch up with all the work that’s already going on and the systems you have in place. All of you, be sure to let me know if there’s something I should be doing to start pulling my weight.

As you know, I have immediately nicked-off to Melbourne for a few days. Thought I might explain what that’s about.

I am at the Research Bazaar conference, #Resbaz.

What’s a resbaz?

The site says:

The Research Bazaar Conference (#ResBaz) aims to kick-start a training programme in Australia assuring the next generation of researchers are equipped with the digital skills and tools to make their research better.

This event builds on the successful Doctoral Training programmes by research councils in the UK [1] and funding agencies in the USA [2]. We are also looking to borrow the chillaxed vibe of events like the O’Reilly Science ‘Foo Events’ [3].

So what exactly is ResBaz?

ResBaz is two kinds of events in one:

  1. ResBaz is an academic training conference (i.e. think of this event as a giant Genius Bar at an Apple store), where research students and early career researchers can come to acquire the digital skills (e.g. computer programming, data analysis, etc.) that underpin modern research. Some of this training will be delivered in the ‘hands-on’ workshop style of Mozilla’s global ‘Software Carpentry’ bootcamps.

You can get hands-on support like at an Apple Store’s Genius Bar!

  2. ResBaz is a social event where researchers can come together to network, make new friends, and form collaborations. We’re even trying to provide a camping site on campus for those researchers who are coming on their own penny or just like camping (dorm rooms at one of the Colleges will be a backup)! We have some really fun activities planned around the event, from food trucks to tent BoFs and watching movies al fresco!

It’s also an ongoing research-training / eResearch rollout program at Melbourne Uni.

But what are you doing there Petie?

On Monday I did three main things apart from the usual conference networking, meeting people stuff.

Soaked up the atmosphere, observed how the thing is run, and talked to people about how to run eResearch training programs

David Flanders wants us to run a similar event in Sydney, and I think that’s a good idea. He and I talked about how to get this kind of program funded internally and what resources you need to make it happen.

Arna from Swinburne told me about a Resbaz-like model at Berkeley where they use part-time postdocs to drive eResearch uptake. This is a bit different from the Melbourne uni approach of working with postgrads:

@ptsefton Data-driven discovery (project driven). What we’d like to do at Swinburne: http://t.co/Q6t80txxkm Also check out @NYUDataScience

– Arna Karick (@drarnakarick) February 16, 2015

Attended the NLTK training session

This involves working through a series of text-processing exercises in an online Python shell, iPython. I’m really interested in this one, not just ‘cos of my extremely rusty PhD in something resembling computational linguistics, but because of the number of different researchers from different disciplines who will be able to use this for text-mining, text processing and text characterisation.

Jeff, can you please let the Intersect snap-deploy team know about DIT4C – which lets you create a kind of virtualised computer lab for workshops, and, I guess, for real work, via some Docker voodoo. (Jeff Christiansen is the UTS eResearch Analyst, supplied by our eResearch partner Intersect.)

Met with Shuttleworth Fellow Peter Murray-Rust and the head of Mozilla’s science lab Kaitlin Thaney

We wanted to talk about Scholarly HTML. How can we get scholarship to be of the web, in rich content-minable semantic markup, rather than just barely on the web? Even just simple things like linking authors’ names to their identifiers would be a huge improvement over the current identity guessing games we play with PDFs and low-quality bibliographic metadata.

Kaitlin asked PMR and me where we should start with this: where would the benefits be most apparent and the uptake most enthusiastic? It’s sad, but the obvious benefits of HTML (like, say, being able to read an article on a mobile phone) are not enough to change the scholarly publishing machine.

We’ve been working on this for a long time, and we know that getting mainstream publisher uptake is almost impossible – but we think it’s worth visiting the Open Educational Resources movement and looking at textbooks and course materials, where the audience want interactive eBooks and rich materials (even if they’re packaged as apps, HTML is still the way to build them). There’s also a lot of opportunity with NGO and university reports, where impact and reach are important, and with the reproducible-research crowd who want to do things the right way.

I think there are some great opportunities for UTS in this space, as we have Australia’s biggest stable of Open Access journals, a great basis on which to explore new publishing models and delivery mechanisms.

I put an idea to Kaitlin which might result in a really useful new tool. She’s got the influence at Mozilla and can mobilise an army of coders. I hope there’s more to report on that.

Kaitlin also knows how to do flattery:

Talking about scholarly HTML and the future of authoring with two #openscience greats: @ptsefton + @petermurrayrust pic.twitter.com/8zvaEB9SGH

– Kaitlin Thaney (@kaythaney) February 16, 2015

TODO
  • Need to talk to Deb Verhoeven from Deakin about the new Ozmeka project, an open collaboration to adapt the humanities-focussed Omeka repository software for working-data repositories for a variety of research disciplines. So far we have UWS and UTS contributing to the project, but we’d love other Australian and global collaborators.

  • Find out how to use NLTK to do named-entity recognition / semantic tagging on stuff like species and common-names for animals, specifically fish, for a project we have running at UTS.

    This project takes a thematic approach to building a data collection, selecting data from UTS research relating to water to build a ‘Data Hub of Australian Research into Marine and Aquatic Ecocultures’ (Dharmae). UTS produces a range of research involving water across multiple disciplines: concerning water as a resource, habitat, environment, or cultural and migratory medium. The concept of ‘ecocultures’ will guide collection development which acknowledges the interdependence of nature and culture, and recognises that a multi-disciplinary approach is required to produce transformational research. Rather than privilege a particular discipline or knowledge system (e.g. science, history, traditional indigenous knowledge, etc), Dharmae will be an open knowledge arena for research data from all disciplines, with the aim of supporting multi-disciplinary enquiry and provoking cross-disciplinary research questions.

    Dharmae will be seeded with two significant data collections, a large oral history project concerning the Murray Darling Basin, and social science research examining how NSW coastal residents value the coast. These collections will be linked to related external research data collections such as those on TERN, AODN, and, thanks to the generous participation of indigenous Australians in both studies, to the State Library of NSW indigenous data collections. Dharmae will continue to develop beyond the term of this project.

  • Make sure Steve from Melbourne meets people who can help him solve his RAM problem by showing him how to access the NeCTAR cloud and HPC services.

SearchHub: Introducing Query Autofiltering

planet code4lib - Tue, 2015-02-17 01:45
Query Autofiltering is autotagging of the incoming query where the knowledge source is the search index itself.

What does this mean and why should we care?

Content tagging processes are traditionally done at index time, either manually or automatically by machine learning or knowledge-based (taxonomy/ontology) approaches. To ‘tag’ a piece of content means to attach a piece of metadata that defines some attribute of that content (such as product type, color, price, date and so on). We use this now for faceted search – if I search for ‘shirts’, the search engine will bring back all records that have the token ‘shirts’ or the singular form ‘shirt’ (using a technique called stemming). At the same time, it will display all of the values of the various tags that we added to the content at index time under the field name or “category” of these tags. We call these things facets. When the user clicks on a facet link, say color = red, we then generate a Solr filter query with the name/value pair <field name> = <facet value> and add that to the original query. What this does is narrow the search result set to all records that have ‘shirt’ or ‘shirts’ and the ‘color’ facet value of ‘red’. Another benefit of faceting is that the user can see all of the colors that shirts come in, so they can also find blue shirts in the same way.

But what if they are impatient and type ‘blue shirts’ into the search box? The way things work now, the search engine will return records that contain the word ‘shirt’ or ‘shirts’ OR the word ‘blue’. This will be partially successful in that blue shirts will be part of the result set, but so will red, green, orange and yellow shirts because they all have the term ‘shirt’ in common. (This will happen if the product description is like – “This is a really nice shirt. It comes in red, blue, green, orange and yellow.”) Worse, we will also get other blue things like sweaters, pants, socks, hats, etc. because they all have the word ‘blue’ in their description.

Ah, you say, but you can then use faceting to get what you really want: click on the shirt product type facet and the color facet for blue. But why should we make the user do this? It’s annoying to first see a bunch of stuff that they didn’t ask for and then have to click things to get what they want. They wanted to make things easier for us by specifying what color they want up front, and we responded by making things worse for them. But we don’t have to – like Dorothy in the Wizard of Oz, we already have the information that we need to “do the right thing”; we just don’t use it. Hence query autofiltering.

There is another twist here that we should consider. Traditionally, tagging operations, whether manual or automated, are applied at index time for “content enrichment”. What we are effectively doing is adding knowledge to the content – a tag tells the search engine that this content has a particular attribute or property. But what if we do this to the incoming query? We can, because like content, queries are just text and there is nothing stopping us from applying the same techniques to them (here it must be autotagging, though, if we want the results in under a second). However, we generally don’t do this because we don’t want to change the query in such a way as to misdirect the search engine so that it returns the ‘wrong’ results – i.e., we don’t want to screw it up, so we leave it alone.

We know that if we use OR by default, then the correct results should be in there “somewhere” and the user can then use facets to “drill in” to find what they are looking for. We also use OR by default because we are afraid to use AND – which can lead to the dreaded ZERO results. This seems like failure to us: we haven’t returned ANYTHING, ZIP, NADA – what’s wrong with us? But is this correct? What if the thing that the user actually wants to find doesn’t exist in the search collection?
Should we return ZERO results or not? Right now we don’t do that. We return a bunch of results, and when the user tries to drill in they keep hitting a dead end. Suppose that we don’t carry purple socks. If a user searches for this, we will return socks that are not purple and purple things that are not socks. If having to drill in to find what they want after telling us that in the query is frustrating, having to drill in and coming up empty-handed again and again is VERY frustrating. Why can’t we just admit up front that we don’t sell that thing? We are not going to get their business on that item anyway because we don’t sell it, but why piss them off in the process of not finding it? We don’t because we are not confident in the accuracy of our search results, i.e. we don’t know either, so we don’t want to lose business by telling them that we don’t carry something when in fact we do. So we use faceting as a net to catch the stray fish.

But sometimes we are confident enough that we know what the user wants that we can override the search engine and give them the right answer. One way is best bets, also known as landing pages or spotlighting. This requires that a human determine what to bring back for a given set of query phrases – a very labor-intensive way of doing it, but it’s the only way we know. We trust this because a person has looked at the queries and figured out what is being searched for. Another way-cool example of this is getting the local weather forecast on Google or Bing.

I call this technique “inferential search” – the process of inferring what the user is looking for and then returning just that (or putting that at the top). This works by introspecting the query and then redirecting or augmenting the search output based on that up-front analysis. Getting back to autotagging, this is what I was talking about earlier – why don’t we do this at query time?
We have already built up this knowledge base that we call our search index by using tagging (manual and automated) processes at index time – why don’t we just use that knowledge at query time? We already know how to use it with faceting to get the user to the right answer, but why make the user wait or do more work when they don’t have to? Now do you see what query autofiltering is good for? We can use it to short-circuit this frustrating process so that when the user searches for “blue shirts” all that we show them are blue shirts!

We can do this because we told the search engine that ‘blue’ is a ‘color’ by adding this as one of the values of the color facet – i.e. the search index already “knows” this linguistic truth because we told it that when we tagged stuff. So the confidence that we built into our faceting engine can be used at query time to do the same thing – we see ‘blue’ in the query, we pull it out and make it a filter query as if the user had first searched for ‘shirts’ and then clicked on the blue color facet.

It turns out that Lucene-Solr makes this really easy to do: the Lucene index contains a data structure called the FieldCache that contains all of the values that were indexed for a particular field. This will be renamed to “UninvertedIndex” or something in Lucene 5.0 – a kind of search-wonkish way of saying that it’s a forward index – not the normal “inverted index” that search engines use. An inverted index allows us to ask – give me all of the documents that contain this term value in this field. An uninverted index allows us to ask – what are all the term values that were indexed in this field?
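To make the “pull it out and make it a filter query” step concrete, here is a minimal sketch in Python. It is not the Solr Search Component mentioned in this post: the `autofilter` function, the sample `docs`, and the `color` field are made up for illustration, and a plain set of indexed values stands in for what the FieldCache would supply.

```python
def autofilter(query, field, known_values):
    """Split a query into free-text terms and filter queries for one field.

    known_values is the set of values indexed for `field` (a stand-in for
    what an uninverted index would tell us).
    """
    remaining, filters = [], []
    for token in query.split():
        if token.lower() in known_values:
            # The token is a known facet value: turn it into a filter.
            filters.append(f'{field}:"{token.lower()}"')
        else:
            # Anything else is passed through as ordinary query text.
            remaining.append(token)
    # If every token became a filter, fall back to a match-all query.
    q = " ".join(remaining) if remaining else "*:*"
    return {"q": q, "fq": filters}


# Hypothetical tagged content; the distinct values of "color" play the
# role of the uninverted index for that field.
docs = [
    {"name": "oxford shirt", "color": "blue"},
    {"name": "wool socks", "color": "red"},
]
color_values = {doc["color"] for doc in docs}

params = autofilter("blue shirts", "color", color_values)
```

Here `params` holds `shirts` as the main query and a single `color:"blue"` filter, which is the same request the user would have produced by searching for ‘shirts’ and then clicking the blue color facet. A query such as “purple socks” produces no filter at all under this sketch, because ‘purple’ was never indexed as a color.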
So once we have tagged things, we can then at query time determine if any of these tag values are in the query and thanks to Lucene, we can do this very quickly (with other search engines, we may not be able to do this at all because faceting is calculated at index time and there may be no equivalent of the FieldCache – but you can still use white lists).

But getting back to ZERO results, we can now confidently say that “No, we don’t carry purple socks” or orange rugs because we now know that there is no sock or rug that we ever tagged that way. We should also make the “No Results” message friendlier rather than suggesting that the user did something wrong, as we do now. (Shamefully, we never admitted that we might have been the real culprit – i.e. we take the Fifth.) Query autofiltering is therefore a very powerful technique because it leverages all of the knowledge that we put into our search index and allows us to have much more confidence that we are “doing the right thing” when we mess around with the original query.

A simple implementation of this idea is available for download on GitHub. One of the advantages of this approach is that it uses the same index that the query is targeted at via the FieldCache – and it’s very fast. A disadvantage is that it requires ‘exact-match’ semantics so it can’t deal with things like synonyms – i.e. what if the user asked to see “red couches” rather than “red sofas” and we wanted to autofilter on product type – or “navy shirts”? (The solution here would be to have a multi-value facet field that would be used for autofiltering.) Because of the exact-match semantics, we also have to ensure that things like case or stemming don’t affect us. Because of this, we may have to do a bit of preprocessing of the query so that it matches what we indexed, while being careful not to distort the pieces of the original query that we will pass through as-is to Solr.
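The preprocessing described above might look something like the following Python sketch. Everything in it is an assumption for illustration: the `canonical_by_surface` lookup table stands in for a multi-value facet field of synonyms, and `normalize_for_lookup` is a deliberately crude substitute for real case folding and stemming. Only the copy of each token used for the lookup is normalized; tokens that do not match are passed through untouched.

```python
def normalize_for_lookup(token):
    """Normalize a token for matching only; the original is left alone."""
    return token.strip(".,!?").lower()


def autofilter_with_synonyms(query, field, canonical_by_surface):
    """Autofilter one field, mapping surface forms to canonical facet values."""
    filters, passthrough = [], []
    for token in query.split():
        canonical = canonical_by_surface.get(normalize_for_lookup(token))
        if canonical is None:
            # Not a known value or synonym: keep the token exactly as typed.
            passthrough.append(token)
        else:
            filters.append(f'{field}:"{canonical}"')
    return {"q": " ".join(passthrough) or "*:*", "fq": filters}


# Hypothetical synonym table: every surface form points at the value
# that was actually indexed in the product_type facet.
product_types = {
    "sofa": "sofa",
    "sofas": "sofa",
    "couch": "sofa",
    "couches": "sofa",
}

result = autofilter_with_synonyms("Red Couches", "product_type", product_types)
```

With this table, “Red Couches” yields a `product_type:"sofa"` filter while “Red” goes through with its original capitalization, so the rest of the query is not distorted on its way to the search engine.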
Another disadvantage is that the query autofiltering Search Component works with a single field at a time, so to do multiple fields, we need multiple component stages. A good use case is in classic cars where you want to detect year, make and model as in “1967 Ford Mustang convertible”.

Another approach that was suggested by Erik Hatcher is to have a separate collection that is specialized as a knowledge store and query it to get the categories with which to autofilter on the content collection. This is a less “brute-force” (knee-jerk?) method which makes use of a collection that contains “facet knowledge”. We can pre-query this collection to learn about synonyms and categories, etc. using all of the cool tricks that we have built into search – fuzzy search to handle misspellings, proximity, multi-term fixes such as autophrasing and more (pivot facets and stats – Oh My!) – the possibilities are literally endless here. The results of this pre-query phase can then drive autofiltering or a less aggressive strategy such as spotlighting or query suggestion (like you do with spell correction). Unlike the FieldCache approach, it can be externalized into a Query Pipeline stage such as featured by our Lucidworks Fusion product.

The key is that in both cases, we are using the search index itself as a knowledge source that we can use for intelligent query introspection and thus powerful inferential search!!

Thanks again to Erik Hatcher for suggesting and collaborating with me on this idea. The “red sofa” problem was discussed in my post “The Well Tempered Search Application – Fugue”. I had originally thought to use a white list to solve this problem. Erik suggested that we use the field values that are in the index and this suggestion “closed the loop” for me – i.e. major lightbulb moment. Erik is a genius. He is well known for his books on Ant and Lucene and is also an extremely nice guy.
One of my greatest joys in coming to Lucidworks was being able to work with people like Erik, Grant Ingersoll, Tim Potter and Chris Hostetter – better known as Hossman or simply Hoss.  These guys are luminaries in the Lucene-Solr world but the list of my extraordinary colleagues at Lucidworks goes on and on.  

The post Introducing Query Autofiltering appeared first on Lucidworks.

Andromeda Yelton: #c4l15 keynote transcript

planet code4lib - Mon, 2015-02-16 21:09

Following up on my last post, I made a transcript of my keynote at code4lib; here you go! In case you missed it:

  • video (I start around 7 minutes in);
  • my last post, which has links for various sites and ideas I mention throughout the talk.

There are numerous small audio glitches which I’ve filled in with my best guesses of what I was saying, where possible.

Architect for wanderlust: the web and open things

Nine years. Nine years ago we were in a much smaller building. Who was there? [pause for raised hands] Who wasn’t? [pause for many more raised hands] Who saw this coming? I knew there’d be one wiseass in the crowd. Hey, Mark. Nine years ago, except maybe for a token wiseass, no one knew we’d still be here for a tenth conference all this time later. That we’d have an IRC channel, a mailing list with three-thousand-some people, a journal — all these things we do. No one knew that it would evolve to be all this, but you built an open thing. You built a thing where lots of people can get write access and can build it all together.

Twenty-four years ago, someone else built another open thing. This may not be the first web page, but it is the oldest known one, and you can still see it today at info.cern.ch. The top post in my blog right now has all the links I’m going to be referencing during this talk, and I tweeted that out so you can find all this stuff for your clicky clicky pleasure. And this is the oldest web page we’ve got.

And why did Tim Berners-Lee make this? Well, physicists needed to share data. They had a bunch of different research stuff, but they didn’t have shared servers, they didn’t have shared presentation software, so they needed a thing. And he made this thing, and it seems to have gone well. So…why did it work?

Well. One big reason it worked was a determined agnosticism about formats. He didn’t care what format your data was in, what software you’d used to create it. The internet at the time had a collection of protocols. I remember gopher — I sort of have a soft spot for gopher. But it had a whole bunch of different protocols and formats and he built his protocol so you didn’t have to care. So that it was a generalized idea of information connection that could hold all the things, without being prescriptive as to their content or nature. And in fact he considered — I was reading the wikipedia page on the history of the world wide web, which is a great way to lose, like, three hours — he considered what should he name this thing, and went through ideas like “The Information Mine”. Go ahead and think about if he had named that “The Information Mine”, and we still had to call that today. “The Mine of Information”. But he settled on “the World Wide Web”, and what that says to me is that the important thing about what he was building out of all the experiments and hypertext he’d been doing in the past, the important thing wasn’t the information, it was the interconnection. The important thing wasn’t the information you put in it; it was the way it enabled people to connect to information and each other. So he didn’t tell them what to do with this architecture he’d created. But —

— he told them how to do it. If you read his original proposal to CERN for the money to support this thing, building the prototype and so forth, it says in there one of the conditions of the work is that he wants “to provide the software for the above free of charge to anyone”. If he hadn’t said that, we wouldn’t be here today. We would literally not be here today. But he wrote into this proposal — he used the tools of bureaumancy — to make sure this was a thing that anyone — who admittedly met a pretty high barrier for technical connectivity and knowledge — could use. And he wrote a ton of documentation. He told you exactly what you needed to do to download this and set up your own web server. And documentation is a brand of hospitality. And that made it possible for this thing he’d built to spread, and become a thing that everyone, and no one, owned, and everyone could build together, and as a result of that we today, twenty-four years later, have –

lolcats

the Arab spring

our childhoods [delayed-reaction laughter as the audience reads the slide]

art and culture

and each other.

Let me tell you a bit about my origin story, and how the things that evolve don’t necessarily have anything to do with what we predict. This, also, is twenty-four years ago. I was at nerd camp, and a bunch of friends wanted to put on a scene for the talent show. They wanted to put on the scene from Monty Python where one is debating whether someone is a witch, and it becomes important to compare her to — a duck. And my then-boyfriend, now-husband, speaking of things you can’t predict twenty-four years ago, went to the mall and found this remarkably charismatic little duck. Which cost an outrageous amount, but it’s got this really cute houndstooth hat, and he bought it to be a prop in the talent show. And this talent show skit never happened. Because they weren’t just performing the scene from Monty Python; they were also performing the logician’s analysis of the scene which shows up in a BBC radio play, and which is so profane that no one is ever letting a bunch of fourteen-year-old boys do it on a stage. So the talent show act never happened, but we still had this duck. And I will get back to it.

But first, I want to say some stories about the rest of you. This is where we come from. This is where the organizing committees for this code4lib come from. And it’s kind of a lot of places. And this is just the committees — this isn’t counting all the people I’ve met here. If I counted them there’d be a whole lot more Canada on that map; there’d be a bunch more states; I’d have to zoom this out to include Japan and New Zealand. We come from all over.

But we come from all over metaphorically, too, and that’s one of my favorite things about librarianship, is we are people of wanderlust who found a home, here. You talk to people and you ask, “what did you major in in college?”, and you hear English, and history, and math, and religion, and philosophy, and musical theater. You ask people what they did before librarianship, because just about everyone had a “before librarianship”, and there’s teachers and publishers and marketers and designers and computer scientists and there’s all kinds of disciplinary backgrounds and perspectives and toolkits that we bring to librarianship, and that we can use to inform our work. The wanderlust that brought us here is a thing that lets us all enrich one another with our different perspectives.

And that is important for everyone’s stories. This quote, “a disciplined empathy”, is from Sumana Harihareswara’s keynote last year, which I loved, which you should read or watch if you were not here. And one of the things that she said is that user experience needs to be a first-class responsibility. And how do you get there, in software, is you have that empathy that comes from being able to see things from a variety of perspectives, and also the discipline to make yourself actually do it. To observe people, to do the user studies, to talk to people, to do the ethnography or the reading or what-have-you. Everyone has so many stories, and part of what we do with library technology is try to find ways for them to interact with their own and others’ stories, for them to make their stories legible and to find stories that are legible to them.

So let’s get back to mine. This is my friend Allegra, the one transformed by joy, and my friend Sam. Allegra just graduated from the University of Chicago last year, Sam is a senior at MIT, and that piece of paper that Allegra is holding, I wrote in 1991, twenty-four years ago. And it had the Legend of the Duck, in the most bombastic way that teenagers can find to write a thing. I had the best handwriting, so that’s why it was me. And I told the story that I told you earlier, except with fancier language. But at the end I said that every year the Holder of the Duck would pass it on to a new Holder of the Duck, from now into forever. And I wrote a bunch of lines for people to sign their names. Not because I thought that would ever work, just because that’s how much paper I had. So Grant signed the first one, and he gave it to Meggin, and Meggin gave it to me, and I gave it to Frank, and so on and so forth, and Max gave it to Allegra, and Allegra gave it to Sam, and it came to be the most important thing at this nerd camp even today. And there’s a Wednesday in August when a hundred and fifty incredibly excited teenagers will gather in a room to see who gets this next. Because being the Holder of the Duck is the most important thing they can imagine. We had no idea it would do this, but we built an open thing that people could be part of, that people could inscribe their names upon, and make into their story, and facilitate their wanderlust.

Of course open things don’t always go so well. This Icelandic pony is not by Kathy Sierra, who takes these unbelievably ethereal photos of her ponies, and it’s not one of hers because even though they’re the best ones out there I’ve ever seen, she doesn’t have clear license terms on them, and I didn’t want to be just another random person from the internet in her inbox, because she’s had way too many of those. If you’re not familiar with Kathy Sierra’s story, she basically got chased off the internet years ago by someone who’s widely held to be a nerd hero in some circles. She was threatened and her children were threatened and it became altogether not worthwhile. She came back recently, pretty much got chased off again. This is not Kathy Sierra’s work. This is not Randi Harper’s work, or Zoe Quinn’s, or Brianna Wu’s, or Anita Sarkeesian’s. There are many people who are severely threatened by open and ungardened things. Because the things that grow in open places can be vicious indeed.

And even for the people who don’t face that kind of peril, there’s a million quieter ways that open things — the openness of neglect — can be threatening or scary or overwhelming. If you have ever tried to teach yourself to code or take on the mantle of technologist, and you’re not a nineteen-year-old white man in a hoodie, you may have looked into technology and had a lot of trouble seeing yourself there. And that’s one of the things that comes up over and over when I teach people to code. It’s not just about, how do variables work or functions, although that’s challenging, but the more challenging questions are the questions of identity. When people look in the mirror, do they see someone who looks like themselves? When people look in the mirror and try to figure out how to piece together the disparate fragments of their identity and one of them is ‘technologist’, can they fit it together into a coherent whole with all the other things that they also are, and aren’t going to give up? The openness of neglect is a way of not noticing the barriers that don’t affect you personally, but that’s not the same as the barriers not being there.

That’s why this mattered to me so much. This is the commit history, part of it, from the CodeOfConduct4Lib. This is how you operationalize a disciplined empathy. I angsted for like a year about joining code4lib. I spent a solid year wondering if I was, like, cool enough, or smart enough, or technologically skilled enough to be part of code4lib. And I finally got over myself, and I joined the mailing list, and I started hanging out in the IRC channel, and it was fun, actually, and I liked the people I met, and I was having a good time. But I didn’t realize that I had been spending that entire time waiting for the other shoe to drop until Bess Sadler displayed the remarkable political courage to ask us to do this thing. And in the ensuing discussion — which, of course, had a variety of perspectives, which you would expect — but the thing that stood out to me is so many people in this community who have political capital, who have influence, who matter, were willing to put that status and capital on the line to be part of this thing, to draft it and to sign their names to it. And that was when I knew that it’s not just some weird freaky coincidence that nobody has yet been, like, a horrible misogynist to me. That that’s actually who you are. You’re nice to me because you’re nice. And I, I didn’t know that until people took this explicit step.

And it’s not just me, of course, right? There’s a lot of first-timers in this room, and I was a first-timer in this room two years ago, and now I’m on this stage, and none of us knew that would happen. And there are first-timers in this room who a year or two from now will be on this stage, or who will be writing the software that is the new hotness that all of you want to use, and you don’t know who they are. Hospitality matters.

It also matters because we don’t always do it as well as we did right here. This picture that I included on an earlier slide, I wanted something that was a story sculpture. I wanted something that showed people interacting in a really tangible way with a book that was larger than life. And this is what I found, and that was great, and I didn’t realize until much later that everyone in this picture is white. Everyone in this picture is young. Probably everyone in this picture is able-bodied. And I looked at the exif data that Flickr so graciously exposes and realized this is the Grounds for Sculpture in Trenton, New Jersey, which I know because my in-laws live just up the road and they always keep saying we should go there and we never get around to it. But because of that I know that everyone here had at least $10 and the free time to spend on this afternoon. Probably everyone here speaks English, probably natively. There’s a good chance they’re all really highly educated. My in-laws live up the road in Princeton. There’s a lot of really highly educated people who live in this area. Part of architecting for wanderlust is thinking about whose wanderlust you are architecting for. Whose stories are actually tellable in the systems that you create? Whose stories are recognized? Whose stories are writeable in the systems we write? It’s not just about not consciously erecting barriers. It’s about going out of your way to notice what barriers might have been erected and doing something to take them away.

Switch gears for a bit and ask, what is a library? This is a library. They don’t know it. My hometown, Somerville, has Artisan’s Asylum, which is one of the best-known makerspaces in the country, and that’s great, and honestly I pretty much never go there. I go here. This is Parts and Crafts. It is the kiddie, unschooling version of a makerspace about a mile down the street. That’s my kid. And pretty much every Saturday they have an open shop, and people come in and you can just kind of do whatever you want in a really self-directed way. There’s not really rules; the grownups don’t tell you what to do. But every so often, one of the Parts and Crafts staff will kind of wander over and say something like, “[dramatic voice] You know what? That thing you’re doing, you could do it better with a hot glue gun. Do you know how hot glue guns work? Do you want to?” And then they go find the hot glue gun, and suddenly it’s part of your eight-year-old’s repertoire, and next time she shows up she just goes over to the hot glue gun and starts gluing stuff to other stuff.

They’ve got a lot of stuff besides hot glue guns as well. Here’s a close-up of one of the shelves, which have this wonderful mishmash of this, and yarn, and tongue depressors, and motors, and this-that-and-the-other. All kinds of stuff. And what makes this a library to me is that it is about self-directed exploration. It’s about transforming yourself through access to information in ways that matter to you. And it’s supported by this remarkable collection, and a staff who don’t tell you what to do with it, but who are intensely knowledgeable in what they have, and able to recognize when their collection is relevant to your interests. So to me, this is a library, and it’s one of the best libraries I know.

What is library software? This is not library software. [applause] This is a screenshot from my local public library’s OPAC, and this is what happens when you do a keyword search that returns no hits. It dumps you on this gigantic page of search delimiters that probably not even a librarian could love, and let me tell you, if I didn’t get any original hits from my search, limiting it to large print Albanian will not help. [laughter] And this page outright angers me, because there are so many options it could have that would allow you to continue wandering. This could have an Ask-a-Librarian feature. This could have something that told me that ILL existed, which I didn’t know until library school, and as you recall I already had another master’s degree. This could be something that did a WorldCat search of other libraries for this thing. This could be any number of things that let me take another step that had a chance of success, but instead, it gives me a baffling array of ways to get large print Albanian hits for a book that doesn’t exist.

Other things that are not library software include API keys you can’t get, documentation and examples behind paywalls, twelve-billion-step ebook checkout processes. These are not library software. These do not facilitate wanderlust. These do not let people transform themselves through access to information and one another.

This is library software. This lets you — not just wander among hyperlinks — but write your own things. Write your own things in text, write your own things in code. This lets you build and generate.

This is library software. Partly maintained by Misty De Meo, who’s sitting right there. [applause] For those of you who are not familiar with Homebrew, it’s a package manager, and if you’ve ever tried to install things by, like, going to the web site, and trying to find the thing, and then realizing you don’t have the dependency, and finding another thing, and then, like, your whole house is full of yak hair, this does that for you. You type ‘brew install the thing’, and it goes and finds the thing, and all the things that the thing needs, and it just does it for you. And this is library software because it lets us make the things and learn the things we wanted to make and learn with fewer impediments, and totally nonjudgmentally.

This is library software. This is something I wrote during a Harvard Libraries hackathon a little while back, intersectional librarycloud. The Harvard Libraries has an API that returns the kind of collection data you would expect, but it also returns this thing Stackscore, which is a weighted 0-100 average of various popularity measures. So what intersectional librarycloud does — it’s one of the things you can link to, you can try it out — it lets you search for subject terms, and it brings back the most popular things in the Harvard collection that match that subject term. And it also examines their subject metadata to see if they have any terms consistent with women’s studies, or African-American studies, or LGBT studies. And I wrote this because I wanted to see, when students and scholars are forming their mental models, their understanding of how the world works, at one of the most eminent universities on earth, are these perspectives included by default? Because the consequences of them not being there is you get out into the world and you have to have all these stupid arguments about misogyny or what-have-you because people think it’s not a thing because hey, they didn’t study it in school, right? It’s not a part of the mental model they built when learning about history. This is a search for ‘history’, and as you note, looking at those grey and grey and grey blocks in the columns, when Harvard thinks ‘history’ it doesn’t think women’s history, or African-American history, or LGBT history.

And I also wanted to see, you know, if I search for something that really is, like, ‘women’s studies’, do I get hits in any of these other columns, too? Are the perspectives we see intersectional or is it all just kind of separated? Is it like, oh, well, if you’re studying women, you’re clearly not studying gay people, or whatever. That’s bunk. Unfortunately, that’s how it works, as far as the Harvard library usage data are concerned. Chris Bourg talked about this in a great keynote she gave at the Ontario Library Association just a couple weeks ago. She was talking about how our cataloging systems can reinscribe prejudices and hide things from us. And she mentioned the book ‘Conduct Unbecoming’, which is really the foremost history of gays in the military. And, if you were looking for it at her library, you would find it shelved between gay porn. Not that any of us have a problem with gay porn, but you don’t find it shelved in the military history section. So if you were looking, if you were browsing the shelves, if you’re looking at a subject search, right there, for the history of the military, the history of gay people in the military is not a thing. So I wanted to interrogate whose wanderlust we’re supporting.

This is library software. The New York Public Library does these amazing things with the vast pile of cultural heritage data they’re sitting on top of, and their remarkable software resources. And they use that place as a cultural institution to create software that connects people to the world around them and to their own cultural heritage in a way that is creative and inspiring and moving. And so this, for instance, there’s so many things you could look at. They just got a Knight grant to do this Space/Time thing, which is basically like Google Maps mashed up with a historical slider, so you can look at stuff at different points in history [] address change, and you can look up addresses that no longer exist, and it’s not really built yet, but they got money to do it, so it’ll be awesome. But this is their menus page, which you can look at right now. And they digitized a whole bunch of menus from different times in the city’s history, and so you can see what were people eating, what sort of things did people aspire to eat, what counted as high-class or low-class in people’s brains at the time during all of these different decades. This is the 1920s and it’s mostly things you wouldn’t see in restaurants today. But it’s fun! It’s neat. And it also… one of the great things that they do is they have APIs. So Chad Nelson, who may be somewhere in this room unless he had an early flight, he wrote this adorable little Twitter bot called @_badtaste_ that mashes up dishes from this with truly distressing words to create…vomitous menus, actually. It’s pretty funny. Don’t browse it right before lunchtime. But it’s funny. And so the fact that they had an API makes it even more library software, right, because it made it possible for Chad to have fun with this, and to build his own thing that let him explore this cultural heritage data in his own way, and connect to it in his own way.

This is library software. Probably a lot of you have seen Ed Summers’ @congressedits bot. It checks out anonymous Wikipedia entries that are edited from Congress and posts them to Twitter, and has gotten actually kind of a lot of media coverage. But the thing about that that really took it to the next level was he put the code on GitHub and he documented it, and that made it possible for people to fork it and make their own bots that checked up on South Africa, or Israel, or Switzerland, or whatever parliament or what-have-you they wanted to look at, so this became a worldwide phenomenon. And this is library software not just because it’s creative and playful that way, but because it’s something that lets us be more engaged with the world around us, more connected civically. It’s code that lets us find and tell stories that matter to people. And it’s a platform. It lets people tell more and more.

This is library software. This is something we created all together. It’s got little bits of many, many, many of us, and it’s changing every day, and it’s our little thing. Zoia was kind of transformative for me, also, in becoming part of code4lib, because the first time I wrote a plugin for zoia, which was a thing that totally intimidated the heck out of me — but then I did! And people helped me deploy it, and then I saw people using it. And within, like, thirty seconds, I saw people mashing it up with the outputs of other zoia plugins, and making it their own thing. And then I wrote documentation of how I did it, and subsequently I saw people use that documentation and make even more of their own things. Again, documentation is a form of hospitality. It’s a way that we bring more people into the community and let them write code that matters to them in their own ways. So this is library software.

Year ten. We’ve come an hour and a half up the road, and nine years. We’ve spent the last nine years inventing code4lib. I totally had a fencepost error in the first version of this talk, by the way, but this is the tenth conference, ten minus nine is one — anyway. We’ve spent the last nine years inventing code4lib. And I want to think about what we spend the next nine years inventing, and how we spend the next nine years inventing code that is deeply informed by library values. That is library code.

I want us to spend the next nine years inventing, building, library software. Building systems that question our own assumptions. That intentionally remove barriers and make space for all kinds of people, from all kinds of backgrounds, to tell their own stories, to build their own technology, to use in their own ways, that transform themselves in ways that matter to them. I want us to decenter ourselves so the systems we build aren’t things we own but things we give, and can then evolve in ways that we can’t predict. I want us to build library software. Architect for wanderlust.

District Dispatch: They get it all . . . e-mails, tweets, posts, pix, files . . . until we make it stop.

planet code4lib - Mon, 2015-02-16 16:56

Illustration by Nick Anderson

Ever have the feeling when it comes to reform of the nation’s privacy and surveillance laws that you might as well cancel your online news subscription and just put this year’s date on that copy of last year’s story you saved in the cloud? You know the file we mean – it’s the one – along with all of your emails, texts, tweets, photos or cloud-stored info – that the government doesn’t need a warrant to get without your permission if it’s more than six months old? (This ACLU infographic lays it out well.)

Yup. You read that right, and you may have read about it last right here in “Warrant? Who Needs a Warrant??!!??,” which reported on the then latest wrinkle in the multi-year fight to update the 1986 Electronic Communications Privacy Act (ECPA) to finally bring it – and all of our Fourth Amendment rights – out of the Bronze and into the Digital Age.

ALA and other national privacy advocates had high hopes last year for House passage of Reps. Kevin Yoder’s (R-KS) and Jared Polis’ (D-CO) “Email Privacy Act” given that it had been co-sponsored by well over half of all Members of the House, including a majority of Republicans. Parallel legislation by Sens. Patrick Leahy (D-VT) and Mike Lee (R-UT) also was advanced in the Senate. Without the backing of House Judiciary Committee Chairman Bob Goodlatte, however, the bill never made it to the House floor and it evaporated with the 113th Congress at the end of 2014.

As of early this month, both bills are back and – not to rest on his laurels – Rep. Yoder at this writing already has racked up an amazing 240 cosponsors (154 of them fellow Republicans) for the 114th Congress’ version of the Email Privacy Act, H.R. 699. On the same day, Sens. Lee and Leahy also “dropped” their Electronic Communications Privacy Act Amendments Act of 2015, S. 356, which now has 11 cosponsors.

The American Library Association (ALA), and we hope you too, will be pushing hard in this Congress to finally reform ECPA. If you haven’t already, now’s the time to sign up for the District Dispatch so that, when the time comes, you too can help pull the plug on that giant sucking sound you may hear every time you text, tweet, email or click “save.”

The post They get it all . . . e-mails, tweets, posts, pix, files . . . until we make it stop. appeared first on District Dispatch.

flickr: what to wear?

planet code4lib - Mon, 2015-02-16 14:16

wind and esmé posted a photo:

flickr: archivesspace house

planet code4lib - Mon, 2015-02-16 14:16

wind and esmé posted a photo:

flickr: misty

planet code4lib - Mon, 2015-02-16 14:16

wind and esmé posted a photo:

Ed Summers: Human Twist

planet code4lib - Mon, 2015-02-16 14:05

Human motives sharpen all our questions, human satisfactions lurk in all our answers, all our formulas have a human twist.

William James in Pragmatism and Humanism.

John Miedema: “Actually, these last two piles, JUNK and TOUGH, were the piles that gave him the most concern.”

planet code4lib - Mon, 2015-02-16 13:03

Phaedrus is the philosopher-protagonist in the well-known book, Zen and the Art of Motorcycle Maintenance by Robert Pirsig. Phaedrus is Robert Pirsig, the author, and his books represent a serious metaphysical inquiry. Lila is the lesser-known sequel in which Phaedrus refines and organizes his thought. It is the organizational elements that inspired my current software project. In the following quote, Phaedrus describes the information architecture of his project. It is elegant and complete, found in better organized folder systems, reflecting the natural development of thought.

In addition to the topic categories, five other categories had emerged. Phaedrus felt these were of great importance:

The first was UNASSIMILATED. This contained new ideas that interrupted what he was doing. They came in on the spur of the moment while he was organizing the other slips or sailing or working on the boat or doing something else that didn’t want to be disturbed. Normally your mind says to these ideas, ‘Go away, I’m busy,’ but that attitude is deadly to Quality. The UNASSIMILATED pile helped solve the problem. He just stuck the slips there on hold until he had the time and desire to get to them.

The next non-topical category was called PROGRAM. PROGRAM slips were instructions for what to do with the rest of the slips. They kept track of the forest while he was busy thinking about individual trees. With more than ten-thousand trees that kept wanting to expand to one-hundred thousand, the PROGRAM slips were absolutely necessary to keep from getting lost.

What made them so powerful was that they too were on slips, one slip for each instruction. This meant the PROGRAM slips were random access too and could be changed and resequenced as the need arose without any difficulty. He remembered reading that John Von Neumann, an inventor of the computer, had said the single thing that makes a computer so powerful is that the program is data and can be treated like any other data. That seemed a little obscure when Phaedrus had read it but now it was making sense.

The next slips were the CRIT slips. These were for days when he woke up in a foul mood and could find nothing but fault everywhere. He knew from experience that if he threw stuff away on these days he would regret it later, so instead he satisfied his anger by just describing all the stuff he wanted to destroy and the reasons for destroying it. The CRIT slips would then wait for days or sometimes months for a calmer period when he could make a more dispassionate judgment.

The next to the last group was the TOUGH category. This contained slips that seemed to say something of importance but didn’t fit into any topic he could think of. It prevented getting stuck on some slip whose place might become obvious later on.

The final category was JUNK. These were slips that seemed of high value when he wrote them down but which now seemed awful. Sometimes it included duplicates of slips he had forgotten he’d written. These duplicates were thrown away but nothing else was discarded. He’d found over and over again that the junk pile is a working category. Most slips died there but some reincarnated, and some of these reincarnated slips were the most important ones he had.

Actually, these last two piles, JUNK and TOUGH, were the piles that gave him the most concern. The whole thrust of the organizing effort was to have as few of these as possible. When they appeared he had to fight the tendency to slight them, shove them under the carpet, throw them out the window, belittle them, and forget them. These were the underdogs, the outsiders, the pariahs, the sinners of his system. But the reason he was so concerned about them was that he felt the quality and strength of his entire system of organization depended on how he treated them. If he treated the pariahs well he would have a good system. If he treated them badly he would have a weak one. They could not be allowed to destroy all efforts at organization but he couldn’t allow himself to forget them either. They just stood there, accusing, and he had to listen.

Pirsig, Robert M. (1991). Lila: An Inquiry into Morals. Pg. 25-26.

Alf Eaton, Alf: Visualising political donations

planet code4lib - Sun, 2015-02-15 17:35

Earlier this week I attended a “Big Data Investigation Workshop” run by British Library Labs as part of the International Digital Curation Conference.

The workshop was an introduction to working with tools for cleaning, analysing and visualising collections of data: OpenRefine (which is great but showing its age), Tableau (which is ridiculously impressive) and Gephi (which has fast graph layout but lacks usability).

As the workshop was co-organised by the International Crime Fiction Research Group, the theme of the data was “Crime Fiction”. However, for our project, we decided to look at “Crime Fact”. In particular, we looked at a recent news story in The Independent, which stated that “three senior figures at scandal-hit [HSBC] bank donated £875,000” to the Conservative Party in recent years.

Although the news story didn’t link to any source data, it almost certainly came from the Electoral Commission’s register of donations to political parties.

Running a basic search of the Electoral Commission’s register, with no filters, produced a CSV file containing all registered donations since 2001, which we then loaded into Tableau Public (Tableau’s limited, free desktop application for data visualisation).

Total donations per party

The first visualisation was a simple bar chart of the total donations to each party, including only “political party” recipients, coloured according to the type of donation.

Total donations per individual

The next visualisation was a summary of the donations from the individuals named in the news story. We added a filter on the donor name, searched for their surname and selected those names which matched (there were several variations on each donor’s name in the database), then used Tableau’s grouping to group together the name variations. Pleasingly the totals almost exactly matched those given in the news story, for the three named donors.
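The grouping step was done by hand in Tableau, but the idea can be sketched in a few lines of Ruby. Everything here is illustrative: the rows, the column names and the alias table are made-up placeholders, not the register's actual data or schema.

```ruby
# Made-up rows standing in for the register's CSV export; the column names
# and values are placeholders, not real Electoral Commission data.
rows = [
  { donor: "Mr A Example", party: "Party X", value: 1000 },
  { donor: "Alex Example", party: "Party X", value: 2500 },
  { donor: "Ms B Sample",  party: "Party Y", value: 400 }
]

# Hand-built alias table, playing the role of Tableau's manual grouping:
# each name variation maps to one canonical donor name.
aliases = {
  "Mr A Example" => "Alex Example"
}

# Sum the donation values per canonical donor.
totals = Hash.new(0)
rows.each do |row|
  canonical = aliases.fetch(row[:donor], row[:donor])
  totals[canonical] += row[:value]
end

totals.each { |donor, total| puts "#{donor}: #{total}" }
```

Keeping the alias table explicit mirrors the manual selection described above: nothing is merged unless someone has decided that two spellings refer to the same person.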

Location of the donors

Getting Tableau to recognise UK postcodes is a bit tricky, as it doesn’t recognise the full postcode - we had to write a function to separate out only the first part of the postcode. Once this was done, Tableau easily mapped the location of each donor, to produce the final visualisation: a map of each donation to a political party, coloured according to the recipient party and sized according to the value of the donation.
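The function itself isn't shown in the post. As a rough sketch of the idea, written in Ruby rather than as a Tableau calculation, the first part of a postcode (the outward code) can be split off by treating the final digit and two letters as the second part. The helper name `outward_code` and the pattern are assumptions for illustration and do not cover every special-case postcode.

```ruby
# Hypothetical helper: returns the outward code (the part before the space)
# of a UK postcode, or nil if the input does not look like a full postcode.
def outward_code(postcode)
  # Normalise case and drop whitespace so "m1 1ae" and "M11AE" behave alike.
  compact = postcode.to_s.upcase.gsub(/\s+/, "")
  # The inward code is one digit followed by two letters; whatever precedes
  # it (one or two letters, a digit, then an optional letter or digit) is
  # the outward code.
  match = compact.match(/\A([A-Z]{1,2}\d[A-Z\d]?)(\d[A-Z]{2})\z/)
  match && match[1]
end

puts outward_code("SW1A 1AA")
puts outward_code("m1 1ae")
```

Returning `nil` for anything that doesn't match lets badly formed rows be filtered out before mapping rather than being placed in the wrong area.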

Alf Eaton, Alf: Force-directed tag clouds

planet code4lib - Sun, 2015-02-15 15:59

I’d been making graphs of Spotify’s “Related Artists” network, but was finding that pieces of the graph often remained disconnected.

To connect these disparate parts of the network, I queried last.fm for the top tags that had been attached to each artist, and added those to the graph.

This brought the network together nicely, so I applied it to a larger data set: all the unique artists that had ever been played on a particular BBC 6 Music radio show.

Dark matter

The full graph of artists and their tags was interesting, but to get a clearer overview of the show’s musical themes, the artist nodes were hidden after the graph had been laid out (using Gephi's "ForceAtlas 2" algorithm).

This left just the tags, laid out in two dimensions, where the most similar tags are closest together and the most frequently used are largest.
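The data behind a graph like this can be sketched as an edge list linking artists to tags, plus a count of how often each tag is used. The Ruby below is a minimal illustration with placeholder artists and tags; it is not the code used for these graphs and makes no calls to last.fm or Gephi.

```ruby
# Placeholder input: each artist mapped to the tags attached to them.
artist_tags = {
  "Artist A" => ["ambient", "electronic"],
  "Artist B" => ["electronic", "techno"],
  "Artist C" => ["ambient"]
}

# One edge per (artist, tag) pair. Tags are never linked to each other
# directly; shared artists are what pull similar tags together in a layout.
edges = artist_tags.flat_map do |artist, tags|
  tags.map { |tag| [artist, tag] }
end

# Number of artists carrying each tag, which would drive the label size.
tag_counts = Hash.new(0)
edges.each { |_artist, tag| tag_counts[tag] += 1 }

tag_counts.each { |tag, count| puts "#{tag}: #{count}" }
```

An edge list in this shape could then be exported for a layout tool, with the artist nodes hidden after layout so that only the sized tag labels remain.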

As some of the labels were overlapping, I used Gephi’s "Label Adjust" layout algorithm to shift their positions enough that most of the overlapping was avoided.

Here are some examples - I think they summarise the shows' content rather well:

Stuart Maconie’s Freakier Zone

Gilles Peterson

Marc Riley

Unique identifiers

One problem was that when several artists shared the same name, irrelevant tags would be attached to an artist. To avoid this, only the artists that had been given MusicBrainz IDs in the BBC data were included, and these MBIDs were used to query last.fm for tags.

Discussion

In a sense, the artists are the “dark matter” of the graph: they pull the tags together and organise their macroscopic structure, but remain invisible in the final, visible map.

It may be that a highly-concentrated cluster of artists (as well as one or two very loosely-connected artists) pushes some tags further apart than they deserve to be.

These word clouds were generated with Gephi, as it handles thousands of nodes easily. I'd like to be able to do the same thing in D3, as Gephi is quite awkward to use, and has cropped the node labels when exporting the above images (it seems to only take the nodes into account when cropping the output, and not their labels).

Here's the (working, but unoptimised) code for building the artists + tags graph data.

Open Library: Digital PML uses BookReader to enhance access to local collections

planet code4lib - Sat, 2015-02-14 19:53

“We’re writing to let you know that we are proud and grateful users of the Internet Archive BookReader software on our new repository of digitized materials, Digital PML.”

So great! Other institutions that would like to use the BookReader can read through this documentation to get started.

Manage Metadata (Diane Hillmann and Jon Phipps): The Jane-athon Report

planet code4lib - Sat, 2015-02-14 19:43

I’ve been back from Chicago for just over a week now, but still reflecting on a very successful Jane-athon pre-conference the Friday before Midwinter. And the good news is that our participant survey responses agree with the “successful” part, plus contain a lot of food for thought going forward. More about that later …

There was a lot of buzz in the Jane-athon room that day, primarily from the enthusiastic participants, working together at tables, definitely having the fun we promised. Afterwards, the buzz came from those who wished they’d been there (many on Twitter @Janeathon) and others that wanted us to promise to do it again. Rest assured–we’re planning on another one in San Francisco at ALA Annual, but it will probably be somewhat different because by then we’ll have a better support infrastructure and will be able to be more concrete about the question of ‘what do you do with the data once you have it?’ If you’re particularly interested in that question, keep an eye on the rballs.info site, where new resources and improvements will be announced.

Rballs? What the heck are those? Originally they were meant to be ‘RIMMF-balls’, but then we started talking about ‘resource-balls’, and other such wanderings. The ‘ball’ part was suggested by ‘tar-balls’ and ‘mudballs’ (mudball was a term of derision in the old MARBI days, but Jon and I started using it more generally when we were working on aggregated records in NSDL).

So, how did we come up with such a crazy idea as a Jane-athon anyway? The idea came from Deborah Fritz, who’d been teaching about RDA for some time, plus working with her husband Richard on the RIMMF (RDA In Many Metadata Formats) tool, which is designed to allow creation of RDA data and export to RDF. The tool was upgraded to version 3 for the Jane-athon, and Deborah added some tutorials so that Jane-athon participants could get some practice with RIMMF beforehand (she also did online sessions for team leaders and coaches).

Deborah and I had discussed many times the frustration we shared with the ‘sage on the stage’ model of training, which left attendees to such events unhappy with the limitations of that model. They wanted something concrete–they usually said–something they could get their teeth into. Something that would help them visualize RDA out of the context of MARC. The Jane-athon idea promised to do just that.

I had done a prototype session of the Jane-athon with some librarians from the University of Hawaii. (Nancy Sack did a great job organizing everything, even though a dodgy plane made me a day late to the party!) We got some very useful evaluations from that group, and those contributed to the success of the official Chicago debut.

So a crazy idea, bolstered by a lot of work and a whole lot of organizational effort, actually happened, and was even better than we’d dared to hope. There was a certain chaos on the day, which most people accepted with equanimity, and an awful lot of learning of the best kind. The event couldn’t have happened without Deborah and Richard Fritz, Gordon Dunsire, and Jon Phipps, each of whom had a part to play. Jamie Hennelly from ALA Publishing was instrumental in making the event happen, despite his reservations about herding the organizer cats.

And, as the cherry on top: After the five organizers finished their celebratory dinner later in the evening after the Jane-athon, we were all out on the sidewalk looking for cabs. A long black limousine pulled up, and asked us if we wanted a ride. Needless to say, we did, and soon pulled up in style in front of the Hyatt Regency on Wacker. Sadly, there was no one we knew at the front of the hotel, but many looked askance at the somewhat scruffy mob who piled out of the limo, no doubt wondering who the heck we were.

What’s up next? We think we’re on the path of a new data sharing paradigm, and we’ll run with that for the next few months, and maybe riff on that in San Francisco. Stay tuned! And do download a copy of RIMMF and play–there are rballs to look at and use for your purposes.

P.S. A report of the evaluation survey will be on RDA-L sometime next week.

William Denton: Disquiet Junto 0163

planet code4lib - Sat, 2015-02-14 15:28

I follow Marc Weidenbaum’s collaborative musical project the Disquiet Junto to see what the projects are, and sometimes listen to the work people create. I’ve never contributed before, but the current project, Disquiet Junto Project 0163: Layering Minutes After Midnight, was something I could tackle easily with Sonic Pi, so I had a go.

The instructions for this project are:

Step 1: Revisit project #0160 from January 22, 2015, in which field recordings were made of the sound one minute past midnight:

http://disquiet.com/0160/

Step 2: Locate segments that are especially quiet and meditative — and confirm that they are available for creative reuse. Many should have a Creative Commons license stating such, and if you’re not sure just check with the responsible Junto participant.

Step 3: Using segments from three different tracks from the January 22 project, create a new work of sound that layers the pre-existing material into something new, something nocturnal. Keep the length of your final piece to one minute.

Step 4: Upload the finished track to the Disquiet Junto group on SoundCloud.

Step 5: Be sure to include link/mentions regarding the source tracks.

Step 6: Then listen to and comment on tracks uploaded by your fellow Disquiet Junto participants.

What I did was this. First, I downloaded all of the downloadable WAV files in the project. (Sonic Pi can only sample them and FLAC, and for some reason the FLAC file I got didn’t work.)

Next I wrote a script that would choose three different WAV files at random and for each one a random starting time within the first 20 seconds of the track. (Assuming all tracks are exactly 60 seconds, this means choosing a random number between 0 and 1/3, because for Sonic Pi the start of a sample is at 0 and the end is at 1.)

    # Seed the generator from the clock so each run picks different tracks.
    use_random_seed Time.now.to_i

    soundfiles = Dir.glob("*wav")
    STDERR.puts soundfiles.size

    tracks = []
    start = []

    3.times do |i|
      # rrand_i includes both endpoints, so stop at the last valid index.
      index = rrand_i(0, soundfiles.size - 1)
      tracks[i] = soundfiles[index]
      # Remove the chosen file so it cannot be picked twice.
      soundfiles.delete_at(index)
      start[i] = rrand(0, 0.3333)
    end

    # Each fragment covers two thirds of its track; starts are 10 seconds apart.
    sample tracks[0], start: start[0], finish: start[0] + 0.6666, attack: 5, release: 5, amp: 0.7
    sleep 10
    sample tracks[1], start: start[1], finish: start[1] + 0.6666, attack: 5, release: 5, amp: 0.7
    sleep 10
    sample tracks[2], start: start[2], finish: start[2] + 0.6666, attack: 5, release: 5, amp: 0.7

Emacs, split window, editing on the left and Sonic Pi output on the right, dark Solarized theme

It plays the fragment of the first track, then 10 seconds later starts the fragment of the second track, then 10 seconds later starts the fragment of the third track. Since each is 40 seconds long, for 20 seconds all three are on top of each other, then the first ends, then the second, and for the last 10 seconds only the third track is playing. The attack and release settings mean each track takes 5 seconds to fade in and 5 seconds to fade out.
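The timing described above can be checked with a little arithmetic. This is plain Ruby rather than Sonic Pi code, and it assumes, as the script does, that every source track is exactly 60 seconds long; the variable names are only for illustration.

```ruby
# Assumption: every source track is exactly 60 seconds long.
track_length = 60.0
fragment     = 0.6666 * track_length   # about 40 seconds per fragment
offsets      = [0, 10, 20]             # seconds at which each fragment starts

# Start and end time of each fragment on the shared timeline.
intervals = offsets.map { |t| [t, t + fragment] }

# All three overlap from the latest start until the earliest end.
all_three = intervals.map(&:last).min - intervals.map(&:first).max
# The piece ends when the last fragment ends.
total = intervals.map(&:last).max

puts "fragment length: #{fragment.round(1)} s"
puts "all three overlap for: #{all_three.round(1)} s"
puts "total length: #{total.round(1)} s"
```

On these assumptions the three fragments should fit within one minute, which is the length the project instructions ask for.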

I was doing all this in Emacs, in sonic-pi-mode. After some testing I ran M-x sonic-pi-start-recording, ran the script, then ran M-x sonic-pi-stop-recording and saved the file.

These are the three tracks it chose:

  1. Spin Cycle-disquiet160-oneminutepastmidnight by High Tunnels
  2. archway road midnight (disquiet160-oneminutepastmidnight) by Zedkah
  3. Can you hear the boredom? (Disquiet0160-Oneminutepastmidnight) by moduS ponY.

All have Creative Commons licenses, which I checked before going further.

The result was 72 seconds long (a few seconds were added while I ran the start/stop but I don’t see how that added up to 12) so I used Audacity to change the length to 60s without changing the pitch. I went back later to edit out the start/stop dead time but accidentally overwrote my original file, so I left it as is.
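For anyone repeating the length change, the amount of speed-up needed follows from the two durations. The sketch below is plain Ruby and only does the arithmetic; it assumes a tool that takes a percentage tempo change and says nothing about which Audacity effect or settings were actually used.

```ruby
original_length = 72.0   # seconds, as recorded
target_length   = 60.0   # seconds, the length the project asks for

# Playing 72 seconds of audio in 60 seconds means running this many times faster.
speed_ratio = original_length / target_length

# The same figure expressed as a percentage increase in tempo.
percent_change = (speed_ratio - 1) * 100

puts "speed ratio: #{speed_ratio.round(2)}"
puts "tempo change: #{percent_change.round(1)}%"
```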

The result is “Waves Upon Waves” (embedded from SoundCloud):
