Feed aggregator

Information Technology and Libraries: Data Center Consolidation at the University at Albany

planet code4lib - Tue, 2015-12-22 05:14

This paper describes the experience of the University at Albany (UAlbany) Libraries’ migration to a centralized University data center. Following an introduction to the environment at UAlbany, the authors discuss the advantages of data center consolidation. Lessons learned from the project include the need to participate in the planning process, review migration schedules carefully, clarify costs of centralization, agree on a service level agreement, communicate plans to customers, and leverage economies of scale.

Information Technology and Libraries: Digital Collections Are a Sprint, Not a Marathon: Adapting Scrum Project Management Techniques to Library Digital Initiatives

planet code4lib - Tue, 2015-12-22 05:14

This article describes a case study in which a small team from the digital initiatives group and metadata services department at the University of Colorado Boulder (CU-Boulder) Libraries conducted a pilot of the Scrum project management framework. The pilot team organized digital initiatives work into short, fixed intervals called sprints—a key component of Scrum. Over a year of working in the modified framework yielded significant improvements to digital collection work, including increased production of digital objects and surrogate records, accelerated publication of digital collections, and an increase in the number of concurrent projects. Adoption of sprints has improved communication and cooperation among participants, reinforced teamwork, and enhanced their ability to adapt to shifting priorities.

Information Technology and Libraries: Static vs. Dynamic Tutorials: Applying Usability Principles to Evaluate Online Point-of-Need Instruction.

planet code4lib - Tue, 2015-12-22 05:14

This study had a two-fold purpose: to discover through the implementation of usability testing which mode of tutorial was more effective: screencasts containing audio/video directions (dynamic) or text-and-image tutorials (static); and to determine if online point-of-need tutorials were effective in helping undergraduate students use library resources. To this end, the authors conducted two rounds of usability tests consisting of three groups each, in which participants were asked to complete a database-searching task after viewing a text-and-image tutorial, audio/video tutorial, or no tutorial. The authors found that web usability testing was a useful tutorial-testing tool while discovering that participants learned most effectively from text-and-image tutorials, since both rounds of participants completed tasks more accurately and more quickly than those who received audio/video instruction or no instruction.

DuraSpace News: @mire Challenges Google Books Viewer

planet code4lib - Tue, 2015-12-22 00:00

From Bram Luyten, @mire With the new Document Streaming Module, Atmire is taking in-browser document access to the next level. This new DSpace add-on module can run completely on your own server infrastructure. It enables users to start reading documents instantly on a wide variety of devices and screen sizes. We closely examined viewers such as Google Books and decided we wanted to do better for DSpace. Discover the key features of the module:

Start reading documents immediately.

DuraSpace News: DSpace Use Grows at Top Global Universities

planet code4lib - Tue, 2015-12-22 00:00

From Bram Luyten, @mire  25 of the top 50 institutions in the 2015 Academic Ranking of World Universities run the Open Source DSpace software. This uptake is a testimony to the continued success of the platform, originally developed by MIT and HP in 2002. Governed by DSpace members of the not-for-profit organization DuraSpace, and supported by an active open source ecosystem of service providers, the community is bigger than ever.

DuraSpace News: Center for Astronomy & Physics Education Research (CAPER) Chooses Open Repository for Hosted Repository Platform

planet code4lib - Tue, 2015-12-22 00:00

From James Evans, Open Repository. Open Repository is pleased to announce the Center for Astronomy & Physics Education Research (CAPER) as a new client on Open Repository’s hosted repository platform. CAPER is a non-profit organisation based in the United States and internationally, whose mission is “Improving Teaching and Learning in Astronomy and the Earth Sciences, through research, curriculum resources and professional development.”

Villanova Library Technology Blog: JOB: 5 year Fellowship (and tenure track) on Ethics and Data Analytics / Big Data – University of Leeds, UK

planet code4lib - Mon, 2015-12-21 19:37

The University of Leeds is seeking to appoint an outstanding research leader to the position of University Academic Fellow, to be based in Inter-Disciplinary Ethics Applied (Faculty of Arts), to work on the Ethics of Data Analytics (“Big Data”).

Applicants are expected to have demonstrated research excellence, and teaching ability, in the broadly-defined field of ethics or practical philosophy, have a commitment to inter-disciplinary work, and a skill-set for inquiry into the specific ethical issues raised in the field of ‘big data’. This might include a research track-record in neighbouring areas of philosophical inquiry such as privacy, justice, consent, risk, the public interest, trust, procedural justice.

In outline, the prestigious position involves a 5 year period heavily weighted towards research and external engagement related to that research. Successful completion of this 5 year ‘probationary’ period will lead to an appointment as Associate Professor at grade 9, and a transition into that role at the end of the initial 5 year period.

Potential applicants are strongly encouraged to make informal enquiries about the role. Please contact Professor Chris Megone, tel: +44 (0)113 343 7888, or Dr Jamie Dow, tel: +44 (0)113 343 7887.

Closing Date:      Monday 18 January 2016

Jamie Dow

Lecturer, IDEA Centre (Inter-Disciplinary Ethics Applied), University of Leeds

Tel: +44 113 343 7887


Web: (staff profile, Philosophy) (White Rose Ancient Philosophy in Yorkshire) (Leading Minds Research Project – the Ethics of Persuasion in Ancient Philosophy and Contemporary Leadership).


Open Knowledge Foundation: Africa Open Data Collaboration Fund Winners Announced

planet code4lib - Mon, 2015-12-21 18:52

Open Knowledge International and the Open Data for Development program are pleased to announce the seven projects that have been shortlisted to receive support from the Africa Open Data Collaboration Fund (AODC Fund)*. The AODC Fund is a partnership with the organisers of the First Africa Open Data Conference and it was designed to provide seed support to empower Africa’s emerging open data civic entrepreneurs to build their data communities, to improve the delivery of services to citizens, and to achieve sustainable development global goals.

The Africa Open Data Collaboration Fund intends to provide between USD 10,000 and USD 15,000 to each of the initiatives. In this final, non-competitive phase, these initiatives will further develop their work plans and explore how to leverage the OD4D network to advance their work.

The grants will be coordinated by Open Knowledge as part of the Open Data for Development program, a global partnership funded by the International Development Research Centre, the World Bank, UK’s Department for International Development (DFID) and Global Affairs Canada. OD4D brings together a network of leading open data partners working together to harness the potential of open data initiatives to enhance transparency and accountability as well as facilitate public service delivery and citizen participation in developing countries. The overall goal of the OD4D program is to scale innovative approaches that have been proven to work, and strengthen coordination amongst open data initiatives to ensure they benefit citizens in developing countries.

The shortlisted initiatives are:

  • CoST Tanzania, Tanzania: Tanzania is implementing programmes to improve service delivery in a wide variety of sectors such as education, health, water and transportation, with a number of programmes involving the construction of facilities and other forms of infrastructure. However, stakeholders complain about the poor quality of infrastructure, often alluding to corruption and lack of accountability. Through the AODC Fund, CoST TZ will work to address this challenge by advocating for transparency in procurement of infrastructure projects through the disclosure of contracts information, by building the capacity of Procuring Entities (PEs) to disclose information and by sensitising key stakeholders, especially CSOs, to use disclosed data to hold authorities accountable.
  • AfroLeadership, Cameroon: This project seeks to fight corruption, improve local accountability and ensure effective service delivery by collecting and publishing approved budgets and accounts for all local authorities in Cameroon on the OpenSpending Cameroon platform. Furthermore, in order to ensure that the data made available is taken up and used to hold government to account, AfroLeadership will work to strengthen the capacity of journalists and civil society actors to understand budget data by providing a number of offline trainings and developing online resources and courses, all in collaboration with School of Data and the international OpenSpending team at Open Knowledge International.
  • Outbox, Uganda: Kampala, the rapidly expanding capital of Uganda, is experiencing some obvious growing pains, and while the current administration is doing its best to improve the living conditions of the inhabitants through beautification efforts, road and transportation projects, etc., it remains obvious that Kampala’s environment is suffering. This project aims to make these challenges visible through data and allow the Kampala Capital City Authority (KCCA) and other stakeholders to manage the impact of an ever-growing population more effectively through data-driven policies and projects. Outbox, in collaboration with the Things Network in Amsterdam, will develop an environment station to measure different environmental indicators such as temperature, air pressure, humidity, sound, ultrasound and light, install three stations and publish their measurements online in real time. Outbox will then take the data and advocate with it directly at Kampala City Council and with other stakeholders, showing them how they can use the published data in their daily work.
  • The Association of Freelance Journalists, Kenya: While sectors such as education, health, business and the environment are of critical importance to the people of Kenya, and there is a substantial amount of data available that could be used to tell compelling stories about the challenges Kenya is facing in these sectors, at present journalists still lack the capacity to use the data effectively for storytelling. As such, the AFJ will lead a project to unpack data journalism in Kenya by building the capacity of Kenyan journalists to use data and by assisting journalists in producing data-driven stories.
  • HeHe Labs, Rwanda: HeHe is a mobile technologies research and innovation lab with the vision to transform Africa into a knowledge society by connecting its people to relevant information and services. Their mission is to leverage the growth in ICTs by building meaningful mobile technology solutions for the African continent. Through this project, HeHe Labs will work to improve service delivery in all sectors through a six month training programme designed to train young people how to use technology to solve local challenges.
  • Women Environmental Programme, Nigeria: Many problems have bedevilled local government budgeting in Nigeria. This project will specifically evaluate the causes of failure in the local government areas and how effective budget control can bring about efficient governance in the local government systems.
  • Press Union of Liberia, Liberia: The data that ordinary people can access in Liberia is severely limited. Through this project, a select team of journalists will be trained in data journalism, and a smaller number of community radio stations in rural Liberia will be provided with computers, cameras and digital recorders to expand their programs through the use of social media. The journalists will learn to access links to a variety of data and to upload local data sources to a web portal operated by the Press Union of Liberia.

The Open Data for Development program is looking forward to collaborating with all the grant recipients in 2016!

*The grants are pending the expected renewal of the partnership between the International Development Research Centre and the World Bank Development Grant Facility for the third year of the OD4D Program.

DPLA: DPLA at ALA Midwinter 2016

planet code4lib - Mon, 2015-12-21 18:45

The American Library Association’s (ALA) midwinter conference starts in a few weeks, so we’ve compiled a short schedule of talks, panels, and presentations that feature members of our staff, Board, and Hubs. Sessions involving DPLA staff are marked [S], while sessions involving Board or Hub members are marked with [A], for ‘affiliate’.


[S] New Developments at the Digital Public Library of America
1:00 PM – 2:30 PM / Boston Convention and Exhibition Center, Room 253 A

This session will begin with an overview by DPLA’s Executive Director Dan Cohen of these major new developments, including the formulation and dissemination of standardized rights statements, being done in partnership with Europeana and Creative Commons, as well as others internationally; the Open Ebooks Initiative, which aims to provide free access to contemporary ebooks for youth in low-income areas, a partnership with the New York Public Library and First Book, with guidance from the Institute of Museum and Library Services and the leading inspiration of the White House and ConnectED; and Hydra-in-a-box, which is producing a next-generation digital repository that DPLA’s hubs and others can use, being developed along with DuraSpace and the Stanford University Libraries. Cohen will also give an update on the growth of DPLA’s national network, including expansion to new states. Following this brief presentation, there will be ample time for questions from the audience, and following that, more informal time for attendees to mingle with DPLA staff and other members of the DPLA community who are attending ALA Midwinter.


[S][A] Using Digital Libraries to Discover Primary Sources for Teaching and Learning (part of the Massachusetts School Library Association’s Digital Learning Day @ WGBH)
10:00 AM – 10:50 AM [Repeated at 11:00] / WGBH Production Services, 1 Guest Street, Boston, MA

This presentation will demonstrate how teachers and students can use two free digital libraries to discover primary source research materials: Digital Commonwealth, a Massachusetts collaboration of libraries, archives, and museums, and Digital Public Library of America, its national partner and counterpart. These projects offer students rich local, state, and national history and culture resources vetted by cultural heritage professionals. DPLA participant: Franky Abbott, Project Manager.

About this external event: The Massachusetts School Library Association is partnering with WGBH and PBS LearningMedia to bring you an exciting day of learning about the rich, online curricular and instructional content provided by WGBH and other local digital content providers, a keynote from the producer of WGBH’s American Experience, tours of the WGBH Studios, a Digital Playground, and more! To learn more, visit the MSLA’s page for this event.

[S] Consortial Ebooks Interest Group Meeting
2:30 PM – 4:00 PM / Boston Convention and Exhibition Center, Room 152

The Consortial eBooks Interest Group Meeting will discuss trends and relevant topics that pertain to consortial ebooks. ASCLA believes that consortia represent a large segment of libraries and that by acting as consortia, ASCLA can be influential with publishers and vendors to benefit libraries and library users as the e-book landscape evolves. ASCLA welcomes any type of library or library agency as well as consortia. DPLA participant: Rachel Frick, Business Development Director.


[S] Codex Hackathon (external event)
January 8 – 10 / MIT Media Lab

DPLA is participating in a literary/publishing/library/books hackathon at the MIT Media Lab on January 8-10, 2016. CODEX is a community of folks who want to imagine the future of books and reading. Programmers, designers, writers, librarians, publishers, readers: all are welcome. DPLA staffers will be giving an introduction to the DPLA API. DPLA participants: Audrey Altman, Technical Specialist; Gretchen Gueguen, Data Services Coordinator.


[S] An International Framework for Standardized Rights Statements for Cultural Heritage (part of the ALCTS Metadata Interest Group meeting)
8:30 AM – 10:00 AM / Boston Convention and Exhibition Center, Room 107AB

Join the ALCTS Metadata Interest Group in Boston at the Midwinter Meeting for the following program in the Boston Convention Center, room 107AB, on Sunday, January 10th. There will be a business meeting at 8:30, followed by a number of exciting programs at 9:00, including one from DPLA Director for Content Emily Gore.

About this presentation: An international committee with representatives from the Digital Public Library of America, Europeana and Creative Commons has developed a set of standardized rights statements for digital cultural heritage materials, to be available as URIs along with additional technical requirements for implementation. The work of the committee, the recommended statements and the upcoming launch of the site will be discussed. It is anticipated that a first-look preview of the site will be demonstrated for the audience. The site is expected to launch in early 2016.

Islandora: Islandora CLAW Community Sprint 002 - Complete!

planet code4lib - Mon, 2015-12-21 17:39

The Islandora community just wrapped up its second volunteer sprint on Islandora CLAW. The first sprint, back in November, was a broad introduction with a focus on documentation and knowledge transfer. This time around we switched gears and put the emphasis on writing code, with a smaller team made up of developers and most of the Islandora CLAW Committers, working on tasks relating to porting the existing Java based services to PHP services. The team for this sprint was:

  • Nick Ruest
  • Jared Whiklo
  • Diego Pino
  • Nigel Banks

Their numbers were small but their accomplishments were great. Diego takes the MVP award for this sprint, tackling the resource PHP service and working through API improvements for Chullo, the CLAW equivalent of Tuque. Jared did some updates for our vagrant installer. Nick coordinated the sprint and put up updated Islandora PCDM diagrams. Nigel worked on integration testing and came away with a plan to redo the CLAW build scripts in Ansible. The whole crew worked together throughout the sprint, strategizing and reviewing each other's code.

The next sprint will be in the new year, running January 18 - 29. It will be another focussed, developer-driven sprint, so if you are interested in digging into the code behind CLAW and making a contribution, please sign up! If you have any questions about contributing to a sprint in the future, please do not hesitate to contact CLAW Project Director, Nick Ruest.

District Dispatch: Re:Create video available

planet code4lib - Mon, 2015-12-21 17:39

Earlier this year, ALA became a founding member of a new copyright coalition called Re:Create. As Congress contemplates legislative change, the U.S. Copyright Office solicits public comments on software embedded in products and the 1201 rulemaking (soon), and as the federal government negotiates trade deals, Re:Create and its members engage with the message that copyright law should reflect how the public uses information and creates in the digital environment. We want a law that makes sense to people, and that supports fair use, free expression and an open Internet. We recently produced a video. Check it out. I like the man with the guitar.

The post Re:Create video available appeared first on District Dispatch.

Library of Congress: The Signal: Authenticity Amidst Change: The Preservation and Access Framework for Digital Art Objects

planet code4lib - Mon, 2015-12-21 16:28

The following is a guest post by Chelcie Juliet Rowell, Digital Initiatives Librarian, Z. Smith Reynolds Library, Wake Forest University.

In this edition of the Insights Interview series for the NDSA Innovation Working Group, I was excited to talk with collaborators on Cornell University Library’s Preservation & Access Framework for Digital Art Objects project:

  • Madeline (Mickey) Casad, Curator for Digital Scholarship and Associate Curator of the Rose Goldsen Archive
  • Dianne Dietrich, Digital Forensic Analyst and Physics & Astronomy Librarian
  • Jason Kovari, Head of Metadata Services
  • Michelle A. Paolillo, Digital Curation Services Lead

Chelcie: Tell us about the Preservation & Access Framework for Digital Art Objects (PAFDAO) project and the wicked problem it tackles.

Jason: PAFDAO is a recently completed NEH-funded research and development project to preserve CD-ROM based new media art objects from the Rose Goldsen Archive of New Media Art. Many of these items were created in the early 1990s and are at significant risk of data loss. On top of that, the vast majority of the works in question could not be accessed by researchers at all without the use of legacy hardware and software, which can be difficult and costly to sustain for public use. The project developed preservation and access solutions for these highly interactive and complex born-digital artworks stored on fragile optical media. Our primary aim was to create a solution that worked for these artworks while being generalizable for the rest of the Archive and, most importantly, the community as a whole.

Chelcie: The PAFDAO project’s test collection included more than 300 interactive born-digital artworks created for optical disc and web distribution, many of which date back to the early 1990s. Using one particular artwork as an exemplar, describe for us the aesthetic experiences it was intended to elicit, as well as its numerous digital objects and dependencies.

Shock in the Ear by Norie Neumark with visual concepts by Maria Miranda and music by Richard Vella. Used with permission.

Mickey: An excellent example of the kind of work we focused on can be found in Shock in the Ear, a 1997 CD-ROM artwork by the Australian artist Norie Neumark.

Visually, this work presents the user with screens that look like painterly collages, with embedded images and handwritten text rendered in deep, saturated colors. It’s also an intricate work of immersive, abstract sound art, with sounds sometimes sharp and unsettling, sometimes deep and enveloping, sometimes alarming. The sounds often fade in and out of aural focus, change in volume, or blend into one another as the user moves a cursor around the screen. By the same token, new images might be revealed, fade in and out of focus, or change color subtly in concert with the changing soundscape of the work, as the user explores rollover areas of the interactive screen.

As the user moves through the artwork, he or she hears personal accounts of experiences of shock or trauma: for example, the story of an accident, an experience of electroshock, or the shock of cultural displacement. These are distinct narratives, and they are related by distinctive voice characters, but there’s not a consistent match of voice to story. You hear only fragments of these stories at a time. Clicking on the screen might take the user deeper into one story, or it might cause the thread to abruptly switch to another, randomly selected fragment of another narrative. The user can also intermittently be trapped in the program’s own timeline, forced to wait while looking at a clock face, before new screens and new sections load.

All in all, this adds up to a sophisticated aesthetic meditation on the experience of shock and its aftermath, where the elements of randomness, multimedia content, and interactivity are all absolutely essential to the work’s impact. It’s an incredibly refined realization of the artistic potential of the CD-ROM medium.

The work consists of hundreds of image files, sound files, and programs to coordinate the interactive collaging of all this media content in the rollover responsive screens as well as the structural progression of the user’s movement from screen to screen.

Dianne: As with Neumark’s Shock in the Ear, many artists used, or collaborated with programmers using, software called Macromedia Director to create their artworks. Director allowed them to embed all sorts of multimedia files into their artwork, including audio files, video files, and still images. I remember once seeing a book on Macromedia Director that boasted that its goal was to help users publish professional quality, standalone CD-ROMs to distribute their interactive content. We found that many of the works still had very specific dependencies: they only ran on a Mac, or they needed a particular version of QuickTime, or a certain Netscape browser plugin, and so forth. We did find patterns in the collection as a whole, which helped us predict what kinds of works would have certain dependencies.

Chelcie: The project team conducted a survey of users of media archives in hopes of better understanding the research questions animating their work. What sorts of questions are researchers asking? How do these questions reveal a more nuanced understanding of the digital preservation concept of “authenticity”?

Mickey: We developed the survey to be a fairly broad and qualitative investigation. Initially, we had hoped that this would help us identify distinctive profiles of media art researchers with specific kinds of needs — for example, what percentage of researchers are likely to want access to historical hardware. But the responses we received didn’t really coalesce like this. In part this is because the community of researchers for this kind of artwork is still relatively small and disparate, though we expect this to change as the cultural significance of this work becomes more and more apparent, and as the technological and institutional impediments to archiving it become less daunting.

We were surprised to see less emphasis on code and hardware than we might have expected, but, again, as an archiving institution, we expect the field of new media research to evolve rapidly in the years ahead, as archives begin to address the technological challenges of providing access to new media collections, and as researchers become more and more technologically sophisticated and also more accustomed to the idea of digital artifacts as objects of cultural study.

“Authenticity” is a tricky question. Dianne has written elsewhere about distinctions between forensic authenticity, archival authenticity, and cultural authenticity. Forensic authenticity and archival authenticity refer to different ways of ensuring that an object is what it claims to be, and has not been falsified or corrupted. Cultural authenticity is a much more complex issue that needs to be taken into account, especially with art collections. It has a lot to do with the patron’s faith in the archiving institution, and this is one of the things that came out of our researcher survey. Respondents wanted to know that a work would be presented in a way that gave due respect to its original context, and to the artistic vision of its creator, and we needed to adapt our preservation strategies to better address this need.
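One common way archives check that a digital object "has not been falsified or corrupted" is a fixity check: record a cryptographic digest of each file at ingest and recompute it later. The interview does not say which tools the project used, so the following is only a minimal Python sketch under that assumption. The function names `sha256_of` and `verify_fixity` and the manifest layout (relative file name mapped to recorded digest) are hypothetical and chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large disc images need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest: dict[str, str], root: Path) -> dict[str, bool]:
    """Map each file named in the manifest to True if its digest still matches.

    `manifest` is a hypothetical layout: relative file name -> recorded digest.
    A missing file is reported as False rather than raising an error.
    """
    results = {}
    for relative_name, recorded_digest in manifest.items():
        candidate = root / relative_name
        results[relative_name] = (
            candidate.is_file() and sha256_of(candidate) == recorded_digest
        )
    return results
```

A matching digest only shows that the bits are unchanged since the digest was recorded. It says nothing about the cultural authenticity discussed here, which depends on how the work is presented.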

Dianne: It is important to point out that “authenticity” wasn’t so cut and dried even when these artworks were being created. We have countless examples of works where the system requirements say something like, “Compatible with either Mac or PC”, and so, even back in the day, artists knew that people were going to experience these works on a variety of personal computer setups.

Mickey: There’s a lot of “variability” built into these artworks, just in their very nature. And anything iterative or random or built to run on multiple platforms offers its own subtle defiance of the idea of singular artistic uniqueness, which might be part of a layperson’s definition of an artwork’s authenticity. So there’s a lot of grey area, and a lot of work about this grey area being done in the arts archiving community. We knew we would need to get this right, or as right as we could, working at scale with a large research collection.

Chelcie: The project team also developed a questionnaire for guiding interviews with artists about what they view as the most significant properties of their artworks. How is the questionnaire intended to inform the preservation and access strategy for a particular artwork?

Mickey: The artist questionnaire, which can be found in Appendix C of the white paper, is simple and flexible. It’s based on questionnaires and user interview processes developed by other media arts archiving organizations (for example, the questionnaire or the Variable Media Questionnaire), but very much tailored to our preservation workflow, our need to work at large scale, and the requirements of a research archive. It allows us to open a conversation with artists about the most significant properties of their artworks.

In the case of the artwork described above, Shock in the Ear, artist Norie Neumark confirmed for us that sound quality was of vital importance in any rendering of the artwork. She also gave us important information about the work’s compiling environment and production history, and ultimately sent us working files and an updated version that we will archive alongside the 1997 version for future researchers. We were able to disclose some of the rendering shortcomings that might be associated with our use of emulation as an access strategy, and she agreed to this with the important caveats about rendering sound quality.

Dianne: From my perspective, I think the artist interview can also serve as an important tool to prompt for more technical details. For instance, we can ask whether the artist still has working documentation or source code for the artworks, or whether they’ve upgraded the work to a newer platform. It’s also a helpful way to probe what the artist views as important to preserve about their work, knowing that any access strategy is likely to alter the user experience in some way. For instance, we might notice that when a work is viewed on an LCD monitor, the colors are different than they would appear on a CRT. We might spend a lot of time trying to mediate that problem before realizing, say, through an artist interview, that it’s not their priority. Perhaps rendering the speed of the work faithfully is most important to them. I wouldn’t say that the artist interview dramatically changes a preservation or access strategy for a particular artwork, but rather clarifies and prioritizes which metadata is most important for the ongoing preservation of an artwork.

Chelcie: The project’s objective was to provide the “best feasible” access to artworks, and the project team was somewhat surprised to conclude that emulation presented the “best feasible” preservation strategy. What interpretive possibilities are lost in a “feasible” implementation? On the other hand, what interpretive possibilities are preserved?

Dianne: Many of these works consisted of a collection of multimedia files, and it is not impossible to navigate through the files contained in each CD-ROM and still be able to view many (but not all) of them on current computers. Of course, the most important part of the artwork was usually a hardware-specific executable file that ran a program that then responded to interactions from the user. And so what emulation preserves is the experience of interacting with the artworks and seeing how all of the individual assets on a CD-ROM relate to one another.

Adriene Jenik, Mauve Desert, 1997, Shifting Horizons Productions. Screenshot by Dianne Dietrich, Mac OS 9 in SheepShaver.

Emulation can be a challenging concept because it can radically alter the user experience. In our example, we’ve migrated data off of physical media, so users no longer have to load an actual CD-ROM into a drive to start an artwork. There are other more subtle changes, too: the colors are different and the speed of the work is usually faster. However, one of the ways I looked at these artworks — and digital files more generally — is that they’ve always been subject to a variable user experience. Many of the works could run on a number of hardware/software combinations, and that was built in, I think, to the artists’ expectations. Everybody had a different monitor with a unique calibration, or a different mouse, or keyboard. Given the range of personal computing environments that has always existed, it is helpful to see emulation as an extension of the variance in experience that was always the case for these materials. We have taken the time to document some of the changes that emulation introduces, including speed, color rendering, the effect of new input devices (like trackpads) and so forth.

Chelcie: If any component of a new media artwork fails, the entire artwork can become unreadable. Metadata supports the rendering of the work, as well as its interpretation. How does the framework define the appropriate metadata to describe interactive born-digital artworks?

Jason: The framework includes a combination of descriptive (MARCXML), technical (DFXML) and preservation (PREMIS) metadata for each artwork as well as a classifications document to help guide rendering and restoration decisions on a “class”-level. The root of the metadata requirements stemmed from an assessment of the nature of what we’re preserving: disk images; we were careful to not over-describe objects but still gathered metadata about the disk images as well as the files on the disk. Further, we considered data derived from the user survey to determine whether the metadata were rich enough to support user needs. If rendering of an artwork fails, we believe we’ve captured a good depth of description (both artistic and technical) to allow staff to begin identifying methods to make the artwork renderable.

Dianne: One thing we learned during the project was that, because there are so many utilities for characterizing digital objects, there was so much metadata and description that we could have created, and so we had to select what we thought was essential for our future colleagues.

Chelcie: What objects does the framework recommend should constitute the package (SIP or AIP) stewarded by a collecting institution?

Michelle: The works affected by the grant have item-level representation in our catalog, so for each conceptual work, we created an aggregate named after its bibliographic identifier and described by the MARCXML of that record. In this container, we placed other aggregates, named “disk_images” and “coverscans”, and optionally “derivative_disk_images”. “disk_images” contain the bit-faithful copies of the disks we made, along with their technical and PREMIS metadata, and any informational files relevant to that particular work (for example, notes describing issues with technical playback). “coverscans” are raster images that document the physical packaging of the original disks, which are often creative artworks in and of themselves. Optionally, if there were opportunities to make derivatives, most often, to create operating environment changes that led to improved playback, those would be included in “derivative_disk_images”. In addition, all configured emulators, environments, and hardware ROMs were included in the deposit. (The complete deposit structure is described in Appendix A of the white paper.) We have additional narrative documentation that explains how to relate artwork system dependencies with emulation environments that complements the structured metadata we created. The overall deposit reflects the complexity inherent in these objects, so there is a lot of cross referencing to assist our future colleagues in matching any given work with the appropriate emulator and ROM to play it back.
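As a rough illustration of the container layout Michelle describes, the sketch below builds an empty skeleton for one work. It is not the project's tooling and does not reproduce Appendix A of the white paper: the function name, the example identifier `b1234567`, and the MARCXML filename are assumptions made for illustration only.

```python
from pathlib import Path


def build_deposit_skeleton(root, bib_id, with_derivatives=False):
    """Create an empty aggregate for one conceptual work and list its contents."""
    work = Path(root) / bib_id  # aggregate named after the bibliographic identifier
    subdirs = ["disk_images", "coverscans"]
    if with_derivatives:
        # Optional aggregate for derivative images made to improve playback
        subdirs.append("derivative_disk_images")
    for name in subdirs:
        (work / name).mkdir(parents=True, exist_ok=True)
    # Placeholder for the MARCXML record describing the aggregate
    # (the filename convention here is hypothetical)
    (work / f"{bib_id}.marcxml").touch()
    return sorted(p.name for p in work.iterdir())
```

Calling `build_deposit_skeleton("/tmp/deposit", "b1234567")` would create the two required aggregates plus the record placeholder; the technical and PREMIS metadata, emulators, and ROMs mentioned in the text are left out of this sketch.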

It is worth noting that the deposit of any given work is driven by its affiliation to a specific Archives collection, and PAFDAO involved 12 of these, as the Rose Goldsen Archive of New Media Art is a collecting area encompassing several collections. The intellectual value of these assets drove the deposit arrangement. While their associated technical analysis and description are important, at the end of the day, these assets are valued for their properties as exemplary works of a time, and assets within their collections. Our mission as a research institution guides us to preserve these objects within the context of their collections.

Chelcie: What project outputs do you anticipate will prove the most useful to institutions other than the Goldsen Archive that are preparing to meet the preservation challenges of similar materials?

Dianne: We wrote up a classifications document outlining the various (and often overlapping) characteristics of the works in the testbed collection. For each classification, we included a short summary, implications for access, and projected restoration issues — that is, what we expect it would take to get a work running in a current system.

Since we focused so much on emulation in this project, I think the most valuable part of that document is the explanations of the issues we encountered when trying to provide access (using emulation) for each type of artwork. As a companion, we also wrote up a fairly comprehensive how-to document for locally configuring the various emulation software we tested, which includes how we approached determining a compatible emulation environment for various artworks.

While we used a combination of file type analysis and manual review to classify artworks and determine suitable emulation environments, I’m personally pleased to see work done elsewhere in developing frameworks for classifying digital materials to support automatic detection of compatible emulation environments. Similarly, as Emulation as a Service develops into a viable solution for many archives, I still think there’s value in the detail we provided in our emulation documentation, since we used the same emulators that EaaS is running in the backend of their system.

Chelcie: What opportunities for conversation or collaboration do you foresee, either among cultural institutions or between cultural institutions, games communities, and hobbyists?

Dianne: I really like that we can tap into a kind of collective nostalgia for our older technology in order to help preserve these important artifacts. We were so fortunate that emulators exist for the exact environments we needed for these artworks — and really, that’s the result of enthusiasts who wanted to preserve the computing environments of their youth. As more people get involved in this space, there’s a greater awareness of not only the technical, but social and historical implications for this kind of work. Ultimately, there’s so much potential for synergy here. It’s a really great time to be working in this space.

To learn more about the PAFDAO project, check out the project team’s white paper, Preserving and Emulating Digital Art Objects.

David Rosenthal: DRM in the IoT

planet code4lib - Mon, 2015-12-21 16:00
This week, Philips pushed out a firmware upgrade to their "smart" lighting system that prevented third-party lights that used to interoperate with it from continuing to do so. An insightful Anonymous Coward at Techdirt wrote:
And yet people still wonder why many people are hesitant to allow any sort of software update to install. Philips isn't just turning their product into a wall garden. They're teaching more people that "software update"="things stop working like they did".

Below the fold, some commentary.

One major problem with the IoT is that in most cases software updates aren't available. But even when they are, consumers are very slow to install them.  This is an important reason why, and I can sympathise with them. An iOS upgrade effectively bricked my second iPhone. After the interminable download, it took the iPhone 20-30 seconds to echo a keypress! I'm now very slow to upgrade iOS, waiting until there are numerous reports of successful upgrades to my particular hardware before trying it.

Selling a product that is clearly labelled as being intentionally defective because it is DRM-ed is one thing, but subsequently and without notice rendering a product defective that wasn't when purchased is quite another. Philips got so much flak about this boffo idea that they restored interoperability after a couple of days.

But I'm sure others will not learn from this, and we'll see more attempts to cripple paid-for functionality. TKnarr at Techdirt proposed an interesting way to fight back:
Their controllers say Zigbee Light link protocol 1.0 certified. If the firmware update renders the controllers incompatible with Zigbee Light link protocol 1.0 (ie. will not interoperate with bulbs using that protocol), that's a manufacturing defect. I'd simply return the defective controllers to where you bought them and request a refund (a replacement isn't acceptable since Philips has made it clear all of their controllers are or will be rendered defective). Sorting out the defective merchandise with the manufacturer is the store's problem.

The store will probably balk at refunding your money. Your state Attorney General's office would probably appreciate reports of stores refusing to accept returns of defective merchandise, seeing as various warranty and consumer-protection laws require them to.

What would have happened if I'd tried to return my iPhone as defective?

District Dispatch: Spending bill a win for library funding, loss on cybersecurity

planet code4lib - Fri, 2015-12-18 20:57

Library Funding: Libraries Receive Welcome Funding Increases:

Congress moved quickly, and surprisingly easily, to pass the $1.1 trillion “omnibus” spending package Friday, which the President is expected to sign promptly. As reported earlier this week, libraries received welcome funding increases for both Library Services and Technology Act (LSTA) and Innovative Approaches to Literacy (IAL). The House passed the measure in a 316-113 vote while, a few hours later, the Senate voted 65-33 to send the bill to the President.

Library Services and Technology Act:

Funding for LSTA will be increased in FY16 to $183 million, an increase over the FY15 level of $181 million. Grants to States will receive an FY16 boost to $155.8 million ($154.8 million in FY15). Funding for Native American Library Services has been raised slightly to $4.1 million ($3.9 million in FY15). National Leadership Grants for Libraries grows to $13.1 million ($12.2 million in FY15). Laura Bush 21st Century Librarian funding will stay level at $10 million. Overall funding for the Institute of Museum and Library Services will bump to $230 million, up slightly from $227.8 million in FY15.

Innovative Approaches to Literacy:

Funding for school libraries received an increase of $2 million, raising total IAL program funding in FY16 to $27 million. At least half of such funding is dedicated to school libraries.

Much of the overarching appropriations discussions around the omnibus bill focused less on funding levels and more on policy “riders” addressing controversial issues such as abortion, refugees, energy, and gun control and research. Among the policy issues followed by ALA were:

  • Net Neutrality: A threatened policy rider – opposed strongly by ALA – that would have prohibited the FCC from implementing its Open Internet Order, failed to overcome strong opposition and was not included in the final spending package.
  • E-Rate: Once again, funding for E-rate will not be delayed as Congress extended the Anti-Deficiency Act exemption through 2017. ALA urged Congress to include this exemption.
  • Cybersecurity: In a far less auspicious development, the omnibus also included the Cybersecurity Act of 2015: language secretly negotiated by the leadership of the House and Senate Intelligence Committees (with belated input from the House Committee on Homeland Security) and inserted by Speaker Ryan into the 2000+ page omnibus on the eve of its final approval. Passage of the Act, which is hostile to personal privacy in many fundamental ways, ends (at least for now) a fight conducted by ALA and many coalition partners in industry and the civil liberties community across multiple Congresses for many years. Most recently, ALA President Sari Feldman spoke out publicly against the bill on December 16, and the Washington Office mounted a spirited letter and Twitter grassroots campaign against inserting cybersecurity legislation of any kind into the omnibus. ALA also joined its coalition partners in calling in writing for the Speaker to strip the bill from the omnibus before bringing it to a vote.  More details about the Cybersecurity Act and its impact from our allies at the ACLU are available here.
  • Privacy: More positively, the omnibus also contains language authored by privacy champion Rep. Kevin Yoder (R-KS3) reaffirming that federal financial services agencies, like the Securities and Exchange Commission, must get a judicially approved search warrant based on probable cause in order to compel the release of the content of any individual’s electronic communications of any kind. While that language doesn’t change current law, it does strengthen the hand of Rep. Yoder and the more than 300 other Members of the House who have cosponsored the Email Privacy Act (H.R. 699), which would change the dangerously antiquated Electronic Communications Privacy Act to make “warrant for content” the law for all electronic communications from the moment that they are created. As recently reported in District Dispatch, progress may well have been made toward that end at a recent House Judiciary Committee hearing, but much more work – helped along by the symbolic omnibus action – remains to be done.

The post Spending bill a win for library funding, loss on cybersecurity appeared first on District Dispatch.

District Dispatch: Sparks! Ignition grants available to libraries, apply now!

planet code4lib - Fri, 2015-12-18 16:25

Application deadline is February 1, 2016

IMLS is accepting applications now for its 2016 Sparks! Ignition Grants for Libraries.

The Institute of Museum and Library Services (IMLS) is accepting applications for Sparks! Ignition Grants for Libraries. This is a great opportunity for libraries to secure small grants that encourage libraries and archives to prototype and evaluate innovations that result in new tools, products, services, or organizational practices. The grants enable grantees to undertake activities that involve risk and require them to share project results, whether they succeed or fail, to provide valuable information to the library field and help improve the ways that libraries serve their communities.

The funding range is from $10,000 to $25,000, and there are no matching requirements. Projects must begin on October 1, November 1, or December 1, 2016. Click here for program guidelines and more information about the funding opportunity.

Make sure you identify a specific problem or need that is relevant to many libraries and/or archives and propose a testable and measurable solution. And you must explain how the proposed activity differs from current practices or takes advantage of an unexplored opportunity, and the potential benefit to be gained by this innovation, among other requirements.

Getting Your Questions Answered; Two Webinars in January

IMLS staff members are available by phone and e-mail to discuss general issues relating to the Sparks! Ignition Grants for Libraries program. You are also invited to participate in one of two IMLS webinars to learn more about the program, ask questions, and listen to the questions and comments of other participants.

The webinars are scheduled for:

  • Thursday, January 7, 2016, at 3:00 p.m. EST; and
  • Wednesday, January 13, 2016, at 2 p.m. EST.

See grant program guidelines for additional webinar details.

IMLS is using the Blackboard Collaborate system (version 12.5). If you are a first-time user of Blackboard, please click here to check your system compatibility and configure your settings.

The post Sparks! Ignition grants available to libraries, apply now! appeared first on District Dispatch.

Ed Summers: Graph Stories

planet code4lib - Fri, 2015-12-18 05:00

This semester I had the opportunity to help out in a few sessions of Matt Kirschenbaum’s Digital Studies graduate seminar. Matt wanted to include some hands-on exercises collecting data from the Twitter API, to serve as a counterpoint to some of the readings on networks such as Galloway and Thacker’s The Exploit. I don’t think the syllabus is online, but if it becomes available I’ll be sure to link it here.

Now ENG 668K wasn’t billed as a programming class, so getting into Python or R or <insert your favorite programming language here> just wouldn’t have been fair or appropriate. But discussing the nature of networks and social media platforms really requires students to get a feel for the protocols and data formats that underpin and constrain these platforms. So we wanted to give students some hands-on experience with what the Twitter API did and, perhaps more importantly, didn’t do. We also wanted to get students thinking about the valuable perspectives that humanists bring to data.

The Graph From 10,000 Feet

I started with a general introduction to Twitter and the graph as a data model. Much of this time was spent doing a close read of a particular tweet as it appears in the Web browser. We looked at the components of a tweet, such as who sent it, when it was sent, hashtags, embedded media, replies, retweets and favorites, and how these properties and behaviors are instantiated in the Twitter user interface.

We then took a look at the same tweet as JSON data, and briefly discussed the how and why of the Twitter API. We wrapped up with a quick look at the history of graph theory, which was kind of impossible and laughable in a way. But I think it was important to at least gesture at how networks and graphs have a long history, have accumulated a lot of theory, but in essence are quite simple.

As an exercise we all created paper graphs of some topic. There’s no need to get lost in Gephi to start thinking about the graph data model. I showed them a paper graph that modeled a small subset of MITH people and projects as an example:

I wish I had taken pictures of the ones all the students came up with. They were amazing and made my example seem super dull. MITH is in the middle of working on a viewer for all our projects over the past 15 years, so we have a dataset of the people and projects available. I created a hairball graph view of the full dataset just to balm my damaged pride.

Tags and TAGS

But this was all just a prelude to where we spent the majority of our time in the subsequent classes, which was in using the Twitter Archiving Google Sheet or TAGS. The thing that TAGS did well was get students thinking about the various representations of a tweet, and how those representations fit different uses. The representation as HTML is clearly good for the human user on the Web. The JSON representation was designed for API access by clients on mobile devices, and other services. And one of those other services is TAGS, which takes the JSON data and puts it into a familiar tabular layout for analysis. This complemented an earlier set of sessions in the semester where Raff Viglianti and Neil Fraistat talked about reimagining the archive, with a dive into encoding a Shelley manuscript using TEI.
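To make the move from the JSON representation to the tabular one concrete, here is a minimal Python sketch that flattens a single tweet into a spreadsheet-style row. It is not how TAGS itself is implemented. The JSON field names follow the Twitter REST API of the time, while the column names and the example tweet are assumptions chosen for illustration.

```python
import json


def tweet_to_row(tweet):
    """Flatten one tweet (already parsed from JSON) into a spreadsheet-style row."""
    screen_name = tweet["user"]["screen_name"]
    return {
        "id_str": tweet["id_str"],
        "from_user": screen_name,
        "text": tweet["text"],
        "created_at": tweet["created_at"],
        # Link back to the HTML representation of the same tweet
        "status_url": f"https://twitter.com/{screen_name}/status/{tweet['id_str']}",
    }


# Hypothetical example tweet, trimmed to the fields used above
example = json.loads(
    '{"id_str": "1", "text": "hello", '
    '"created_at": "Fri Dec 18 05:00:00 +0000 2015", '
    '"user": {"screen_name": "example"}}'
)
row = tweet_to_row(example)
```

The nested `user` object collapses into a single `from_user` cell, which is a small instance of the trade-off discussed above: the table is easier to sort and pivot, but it carries less context than the JSON or the Web page.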

We did run into some problems using TAGS, mostly centered on the Twitter API Keys. To use TAGS you need a Twitter account, and a Google account. This may seem like a no-brainer but I was impressed to see there were some students who assiduously wanted neither. Respect. Once you have a Twitter account you need to register an app, and get the keys for the app. In order to create an application you need to attach a mobile number to your user account. This is understandable given Twitter’s ongoing fight against spam, but it presents another privacy/security conundrum to students. But once the keys are in hand, it is mostly straightforward to enter a search query in TAGS and get some data.

The wrap-up exercise for the hands-on TAGS experimentation was to do some data collection and then write a very basic data story about that data. I felt like I appropriated the phrase data story a little bit. It’s a new term that I never really contextualized fully because of time constraints. Data storytelling could be a whole class in itself. In fact Nick Diakopoulos taught one this semester at UMD. But we simply wanted the students to try to collect some data and write up what they did, and what (if anything) they found in the data. To get them started I provided an example data story.

My Hastily Composed Data Story

For my data story I assembled a Twitter dataset for the poet Patricia Lockwood. I collected 1,592 of her tweets for the period from April 20, 2015 to November 22, 2015. To do this I configured TAGS settings to fetch using the API type statuses/user_timeline for the user TriciaLockwood. Lockwood has tweeted 12,658 times since May 2011, so I wasn’t able to get all of her tweets because of the limitations of Twitter’s API.

I thought it might be interesting to see who Lockwood corresponds with the most for the time period I was able to collect in, so I created a pivot table (the Correspondence sheet in the Google Sheets document) that listed the people she directed tweets at the most. The top 5 users she corresponded with were:

    1. @cat_beltane - Gregory Erskine
    2. @DVSblast - Dimitri Stathas (Rapper from New York)
    3. @accomodatingly - Stephen Burt (Harvard Professor)
    4. @PKhakpour - Porochista Khakpour (Iranian-American writer)
    5. @McKelvie - Jamie McKelvie (British Cartoonist)
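The pivot table above was built in Google Sheets. The same kind of count can be sketched in a few lines of Python, assuming that a tweet "directed at" someone is one whose text begins with an @mention. That definition, the function name, and the sample input are assumptions for illustration, not a description of the Correspondence sheet.

```python
import re
from collections import Counter


def top_correspondents(tweets, n=5):
    """Count the accounts that tweets are directed at (a leading @mention)."""
    counts = Counter()
    for text in tweets:
        # re.match only matches at the start of the text, so mentions
        # that appear later in a tweet are deliberately ignored
        match = re.match(r"@(\w+)", text)
        if match:
            counts[match.group(1).lower()] += 1
    return counts.most_common(n)
```

Given a list of tweet texts, for example the `text` column of a TAGS archive, `top_correspondents(texts)` returns `(screen_name, count)` pairs ordered from most to least frequent.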

I thought it was interesting that the top person she corresponded with didn’t seem to be easily googlable even though he has a somewhat distinctive name. Erskine’s profile says he lives in Louisville, Kentucky which isn’t terribly far from St Louis and Cincinnati, both of which are places Lockwood has lived in the past. According to the New York Times she seems to be currently living in Lawrence, Kansas. Perhaps they knew each other in the past and are keeping in touch via Twitter?

The website Erskine has linked from his Twitter profile was registered in 2014 using an address in St Louis. So perhaps they knew each other there. It’s possible to search and scan all their public correspondence, which goes back to September 16, 2011.

Ok, this is starting to feel creepy now.


The resulting data stories were really quite amazing. As I said I barely gave any context for data storytelling, but the students ran with it and well surpassed my example. I can’t include them here because I didn’t ask for their permission. But if any students decide to put them on the Web I will link to them.

Matt did ask me to write up a general response to these data stories, with particular attention to where TAGS breaks down. I’ve included it here, but I’ve changed the students’ names, since they didn’t write them thinking they would be on the Web with their names attached.

Here’s what I wrote…

I really enjoyed reading the data stories that everyone was able to put together. One phrase that will stick with me for a while is Julie’s hashtag salad as a term (or neologism really) for tweets that are mostly made up of a coordination of hashtags. I’m going to use it from now on!

I was surprised to find that the stories that established a strong personal voice were particularly effective in drawing out a narrative in the data. Perhaps I shouldn’t have been, but I was :) I guess it makes sense that a narrative about data would need to have a narrator, and that the narrator needs a voice. So it stands to reason that attention to this voice is important for communicating the perspective that the data story is taking. I think this is one reason (among many) that humanists are needed in the STEM dominated field of data science. Voice and perspective matter.

Another important aspect to the stories that worked well was an attention to the data. For example Courtney’s analysis of the device used (mobile vs desktop) and time of day was quite interesting. The treatment is suggestive of the way the data is intertwined with events in the physical world, and the degree to which social media can and cannot mediate those events. Samuel’s analysis of the use of language (English, Spanish and Portuguese) by a particular individual in different contexts was another example of this type of treatment.

But there were many interesting things that were discussed, so it’s probably not fair for me to start highlighting them. The stories contained several comments about the limitations of collecting data from Twitter using TAGS. For example:


Not only does TAGS give us a very brief snapshot of a giant conversation, but we Twitter users must figure out what we need before too much time passes and the data starts to become very hard to draw out.


This is one limitation of TAGS, as it removes the physical layout of Twitter, requiring users to cut and paste hyperlinks to read the original content of tagged Tweets.

The first was that it still takes a human to connect all of the patterns together. Google spreadsheets do not contain a function to sort text by theme, so the user still needs to look at the text to see if there is a pattern to the Tweets.

The second is in the limitations of TAGS. Vince and I were both initially surprised with the seemingly low volume of results we received. After a little research, we realized that TAGS only collects seven to nine days of Tweets.


Of course, it is impossible to tell just from the data how much @Farmerboy or the other’s meant to, or not, to refer to the Little House on the Prairie franchise.


For me, temporality posed the largest problem in that it was important to time when to run the script to capture certain kinds of data.

Not only is the process capped by a maximum number of tweets archived at a specific moment, it is also difficult to explore a topic outside of the “now” because of its relegation to the last seven days, rather than allowing for a specified date range. Additionally, the more tweets I attempted to archive at one time, the greater likelihood the script would fail and yield zero results.

I thought I remembered mentioning that the search API was limited to the last 7-9 days, but even if I did I clearly didn’t emphasize it enough. The search API does restrict access mostly for business reasons since Twitter have a service called Gnip which allows people to purchase access to historical data in bulk. So, if you are interested in a topic, and don’t want to pay Twitter thousands of dollars for data, it is important to collect continuously over a period of time.
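Collecting continuously means rerunning the same search, and consecutive runs will usually overlap. A minimal sketch of merging such runs, assuming each row is a dictionary with an `id_str` key holding the tweet's id (the row shape and function name are assumptions, and this says nothing about how TAGS handles the problem internally):

```python
def merge_runs(*runs):
    """Merge rows from repeated collection runs, keeping one row per tweet id."""
    merged = {}
    for rows in runs:
        for row in rows:
            # Keep the first copy seen of each tweet and skip later duplicates
            merged.setdefault(row["id_str"], row)
    # Tweet ids grow over time, so sorting by the numeric id gives a
    # roughly chronological order
    return sorted(merged.values(), key=lambda r: int(r["id_str"]))
```

Merging by id removes duplicates, but it cannot reveal tweets that fell into a gap between runs. That still depends on running the collection often enough.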

TAGS tries to do this for you by allowing you to schedule your search to be rerun, but there are limits to the size of a Google Sheet: 2,000,000 cells, or 111,111 TAGS rows. It also isn’t clear to me how TAGS deals with duplicate data, or how it ensures that it doesn’t have gaps in time. At any rate these observations about the limits of TAGS and the underlying Twitter API are great examples of getting insight into Twitter as a platform, in Tarleton Gillespie’s use of the term. If this sort of thing is of interest there is an emerging literature that analyzes Twitter as a platform, such as González-Bailón, Wang, Rivero, Borge-Holthoefer, & Moreno (2014), Driscoll & Walker (2014) and Proferes (2015).
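The two figures for the sheet limit are consistent with each other if each archived tweet takes up 18 cells. That column count is inferred here from the numbers themselves, not taken from TAGS documentation. A quick check of the arithmetic, with a hypothetical helper for estimating how long a scheduled collection could run:

```python
CELL_LIMIT = 2_000_000   # Google Sheets cell limit cited in the text
COLUMNS_PER_ROW = 18     # assumption: the column count implied by 2,000,000 / 111,111

max_rows = CELL_LIMIT // COLUMNS_PER_ROW   # 111111


def days_until_full(tweets_per_day):
    """Whole days of collection before the sheet runs out of rows."""
    return max_rows // tweets_per_day
```

At a hypothetical rate of 1,000 matching tweets a day, the sheet would fill in about 111 days, so a busy hashtag would need to be split across several sheets.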

Just as an aside Twitter’s web search isn’t limited to the last 7-9 days like the API. For example you can do a search for the tweets mentioning the word twttr (Twitter’s original name) before March 22, 2006 which will show you some of the first day of tweets from Twitter’s founders.
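A search like this can be written with the `until:` operator that Twitter's web search accepts. The sketch below only builds the search URL; the function name is made up for illustration, and the URL pattern is an assumption about the web interface, not part of the API.

```python
from urllib.parse import urlencode


def search_url(term, until):
    """Build a Twitter web search URL for tweets matching term before a date."""
    query = f"{term} until:{until}"  # date in YYYY-MM-DD form
    return "https://twitter.com/search?" + urlencode({"q": query})


url = search_url("twttr", "2006-03-22")
```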

The comments also point to another limitation of TAGS as a tool. The spreadsheet has the text of the tweet, but it is extremely data centric. To see embedded media, the user’s profile information, the responses and the full presentation of the tweet it is necessary to visit the tweet on the Web using the URL located in the status_url column. This can prove to be quite a barrier, when you are attempting to decode the intent or intended meaning of a message by simply browsing the spreadsheet. The additional context found in the human readable presentation of the Web page makes it much easier to get at the intent or meaning of a tweet. But how do you do this sort of analysis with thousands of messages? This raises good questions about distant reading, which is also an area where a DH perspective has a lot to offer to the data science profession.


It will be interesting to hear what the students made of the class in the reviews. But all in all I was kind of surprised at how low-tech instruments like paper graphs, spreadsheets and data stories could yield valuable thinking, discussion and analysis of social networks. Of course I expect most of that was due to the readings and lectures that went on outside of these exercises. I’m looking forward to hopefully being able to iterate on some of these techniques and ideas in the future.


Driscoll, K., & Walker, S. (2014). Working within a black box: Transparency in the collection and production of big twitter data. International Journal of Communication, 8, 1745–1764.

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (2014). Assessing the bias in samples of large online networks. Social Networks, 38, 16–27. Retrieved from

Proferes, N. J. (2015). Informational power on twitter: A mixed-methods exploration of user knowledge and technological discourse about information flows (PhD thesis). University of Wisconsin, Milwaukee. Retrieved from

LITA: Register for “Makerspaces: Inspiration and Action” at ALA Midwinter

planet code4lib - Thu, 2015-12-17 20:52

How do you feel about 40,000 square feet full of laser cutters, acetylene torches, screen presses, and sewing machines? Or community-based STEAM programming for kids? Or lightsabers?

If these sound great, you should register for the LITA “Makerspaces: Inspiration and Action” tour at Midwinter! We’ll whisk you off to Somerville for tours, nuts and bolts information on running makerspace programs for kids and adults, Q&A, and hands-on activities at two great makerspaces.

A workspace at Artisan’s. (“HoaT2012: Boston, July-2012” by Mitch Altman; CC BY-SA)

Artisan’s Asylum is one of the country’s premier makerspaces. In addition to the laser cutters, sewing machines, and numerous other tools, they rent workspaces to artists, offer a diverse and extensive set of public classes, and are familiar with the growing importance of makerspaces to librarians.

My kid made her fabulous Halloween costume at Parts & Crafts this year and I am definitely not at all biased. (Photo by the author.)

Parts & Crafts is a neighborhood gem: a makerspace for kids that runs camp, afterschool, weekend, and homeschooling programs. With a knowledgeable staff, a great collection of STEAM supplies, and a philosophy of supporting self-directed creativity and learning, they do work that’s instantly applicable to libraries everywhere. We’ll tour their spaces, learn the nuts and bolts of maker programming for kids and adults, and maybe even build some lightsabers.

What tools can you use? (“Parts and Crafts, kids makerspace” by Nick Normal; CC BY-NC-ND)

Parts & Crafts is also home to the Somerville Tool Library (as seen on BoingBoing). Want to circulate bike tools or belt sanders, hedge trimmers or hand trucks? They’ll be on hand to tell you how they do it.

I’ll be there; I hope you will be, too!

Let’s all fly to Boston! (Untitled photograph by Clarence Risher; CC BY-SA)

SearchHub: Open Source Hadoop Connectors for Solr

planet code4lib - Thu, 2015-12-17 19:45

Lucidworks is happy to announce that several of our connectors for indexing content from Hadoop to Solr are now open source.

We have six of them, with support for Spark, Hive, Pig, HBase, Storm and HDFS, all available on GitHub. All of them work with Solr 5.x and include options for Kerberos-secured environments if required.

HDFS for Solr

This is a job jar for Hadoop which uses MapReduce to prepare content for indexing and push documents to Solr. It supports Solr running in standalone mode or SolrCloud mode.
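As a rough illustration of what "pushing documents to Solr" means at the HTTP level for a standalone instance, here is a minimal Python sketch that builds a request for Solr's JSON update handler. It is not the connector's code: the helper name `build_update_request`, the host, and the collection name `hadoop_docs` are assumptions made for the example, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_update_request(base_url, collection, docs, commit=True):
    """Build (but do not send) a POST request for Solr's JSON update handler."""
    url = f"{base_url}/{collection}/update"
    if commit:
        url += "?commit=true"
    # Solr accepts a JSON array of documents as the request body.
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and collection name, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr",
    "hadoop_docs",
    [{"id": "doc-1", "title": "hello"}],
)
# urllib.request.urlopen(req) would send it to a running Solr instance.
```

In SolrCloud mode, clients usually locate nodes through ZooKeeper rather than a fixed URL, so this sketch only reflects the standalone case.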

It can connect to standard Hadoop HDFS or MapR’s MapR-FS.

A key feature of this connector is the ingest mapper, which converts content from various original formats to Solr-ready documents. CSV files, ZIP archives, SequenceFiles, and WARC files are supported. Grok and regular expressions can also be used to parse content. If there are other formats you’d like to see, let us know!
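To make the idea of an ingest mapper more concrete, here is a small Python sketch of the same kind of transformation: turning CSV rows, or lines matched by a regular expression, into flat field/value documents of the sort Solr indexes. This is only a conceptual illustration, not the connector's implementation or API; the function names (`csv_to_docs`, `regex_to_doc`), the log pattern, and the sample data are all hypothetical.

```python
import csv
import io
import re


def csv_to_docs(text, id_field="id"):
    """Turn CSV text with a header row into a list of field/value dicts."""
    docs = []
    for n, row in enumerate(csv.DictReader(io.StringIO(text))):
        # Keep only named columns that actually have a value.
        doc = {k: v for k, v in row.items() if k is not None and v}
        # Every document needs a unique key; fall back to the row number.
        doc.setdefault(id_field, f"row-{n}")
        docs.append(doc)
    return docs


# Hypothetical pattern for a simple access-log line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)


def regex_to_doc(line, doc_id):
    """Parse one log line into a document, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["id"] = doc_id
    return doc


csv_docs = csv_to_docs("id,title,author\n1,Solr in Action,\n,Hadoop Guide,White\n")
log_doc = regex_to_doc('10.0.0.1 [17/Dec/2015:19:45:00] "GET /solr HTTP/1.1" 200', "log-0")
```

Named groups in the regular expression become field names directly, which is roughly the role Grok patterns play as well.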

Repo address:

Hive for Solr

This is a Hive SerDe which can index content from a Hive table to Solr or read content from Solr to populate a Hive table. 

Repo address:

Pig for Solr

These are Pig Functions which can output the result of a Pig script to Solr (standalone or SolrCloud). 

Repo address:

HBase Indexer

The hbase-indexer is a service which uses the HBase replication feature to intercept content streaming to HBase and replicate it to a Solr index.
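The replication idea can be pictured with a short Python sketch: each cell change arriving from the data store is mirrored into a document keyed by the row key. This is a conceptual model only, not the hbase-indexer's code; the `Mutation` class, the `apply_mutation` function, and the in-memory dictionary standing in for a Solr index are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Mutation:
    """Hypothetical stand-in for one replicated cell change."""
    row_key: str
    column: str            # "family:qualifier"
    value: Optional[str]   # None means the cell was deleted


def apply_mutation(index: Dict[str, Dict[str, str]], m: Mutation) -> None:
    """Mirror one cell change into an in-memory 'index' keyed by row key."""
    doc = index.setdefault(m.row_key, {"id": m.row_key})
    field = m.column.replace(":", "_")  # flatten family:qualifier into a field name
    if m.value is None:
        doc.pop(field, None)
    else:
        doc[field] = m.value
    if set(doc) == {"id"}:
        del index[m.row_key]  # no fields left, so drop the document


index: Dict[str, Dict[str, str]] = {}
for event in [
    Mutation("user1", "info:name", "Ada"),
    Mutation("user1", "info:email", "ada@example.org"),
    Mutation("user2", "info:name", "Grace"),
    Mutation("user1", "info:email", None),
]:
    apply_mutation(index, event)
```

After the four events above, `user1` keeps only its name field, because the final event deletes the email cell.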

Our work is a fork of an NGDATA project, updated for Solr 5.x and HBase 1.1. It also supports HBase 0.98 with Solr 5.x. (Note: HBase versions earlier than 0.98 have not been tested with our changes.)

We’re going to contribute this back, but while we get that patch together, you can use our code with Solr 5.x.

Repo address:

Storm for Solr

My colleague Tim Potter developed this integration, and discussed it back in May 2015 in the blog post Integrating Storm and Solr. This is an SDK to develop Storm topologies that index content to Solr.

As an SDK, it includes a test framework and tools to help you prepare your topology for use in a production cluster. The README has a nice example using Twitter which can be adapted for your own use case.

Repo address:

Spark for Solr

This is another Tim Potter project, which we released in August 2015 and discussed in the blog post Solr as an Apache Spark SQL DataSource. Again, this is an SDK for developing Spark applications, including a test framework and a detailed example that uses Twitter.

Repo address:


Image from book cover for Jean de Brunhoff’s “Babar and Father Christmas”.


Casey Bisson: On disfluencies

planet code4lib - Thu, 2015-12-17 15:56

Your Speech Is Packed With Misunderstood, Unconscious Messages, by Julie Sedivy:

Since disfluencies show that a speaker is thinking carefully about what she is about to say, they provide useful information to listeners, cueing them to focus attention on upcoming content that’s likely to be meaty. […]


Experiments with ums or uhs spliced in or out of speech show that when words are preceded by disfluencies, listeners recognize them faster and remember them more accurately. In some cases, disfluencies allow listeners to make useful predictions about what they’re about to hear. In one study, for example, listeners correctly inferred that speakers’ stumbles meant that they were describing complicated conglomerations of shapes rather than to simple single shapes.


Disfluencies can also improve our comprehension of longer pieces of content. Psychologists Scott Fraundorf and Duane Watson tinkered with recordings of a speaker’s retellings of passages from Alice’s Adventures in Wonderland and compared how well listeners remembered versions that were purged of all disfluencies as opposed to ones that contained an average number of ums and uhs (about two instances out of every 100 words). They found that hearers remembered plot points better after listening to the disfluent versions, with enhanced memory apparent even for plot points that weren’t preceded by a disfluency. Stripping a speech of ums and uhs, as Toastmasters are intent on doing, appears to be doing listeners no favors.

