
Feed aggregator

Hydra Project: Hydra Connect #2 – reports

planet code4lib - Thu, 2014-10-02 00:24

170 or so people are gathered together in Cleveland, Ohio, for Hydra Connect #2 – the second annual Hydra get-together. If you weren’t able to come (and even if you were) you’ll find increasing numbers of presentations and meeting notes hanging off the program page at https://wiki.duraspace.org/display/hydra/Hydra+Connect+2+Program

Enjoy!

HangingTogether: A Year of Living Dangerously For Archives (and you)

planet code4lib - Wed, 2014-10-01 22:51

[Female acrobats on trapezes at circus | Library of Congress ]

[This post is in honor of American Archives Month, which starts today!]

This year at the Society of American Archivists annual meeting, incoming SAA president Kathleen Roe kicked off her initiative “A Year of Living Dangerously for Archives.” You can read about the initiative on the SAA website, but I can also boil this down for you. Those of us who work in cultural heritage institutions get it — archives are important. We spend a lot of time telling one another about our wonderful collections and about the good work we do. However, despite our passion and conviction, we don’t spend nearly enough time making the case outside our own buildings for how important archives are.

I like this formulation: Archives change lives. Tell people about it.

I’m eager to hear all the stories that come out of this Year of Living Dangerously (YOLDA, as I’m dubbing it, which goes nicely with YOLO, don’t you think?). I urge you to participate in YOLDA by sharing your stories on the SAA website, but also by pointing us to your work in the comments. Let’s use this year to inspire one another. I think it’s more dangerous to not take action than to find ways to advocate for ourselves, but if it makes you happy to think of yourself as an action hero, then go for it!

About Merrilee Proffitt


Karen Coyle: This is what sexism looks like

planet code4lib - Wed, 2014-10-01 22:18
[Note to readers: sick and tired of it all, I am going to report these "incidents" publicly because I just can't hack it anymore.]

I was in a meeting yesterday about RDF and application profiles, in which I made some comments, and was told by the co-chair: "we don't have time for that now", and the meeting went on.

Today, a man who was not in the meeting but who listened to the audio sent an email that said:
"I agree with Karen, if I correctly understood her point, that this is "dangerous territory".  On the call, that discussion was postponed for a later date, but I look forward to having that discussion as soon as possible because I think it is fundamental."And he went on to talk about the issue, how important it is, and at one point referred to it as "The requirement is that a constraint language not replace (or "hijack") the original semantics of properties used in the data."

The co-chair (I am the other co-chair, although reconsidering, as you may imagine) replied:
"The requirement of not hijacking existing formal specification languages for expressing constraints that rely on different semantics has not been raised yet." "Has not been raised?!" The email quoting me stated that I had raised it the very day before. But an important issue is "not raised" until a man brings it up. This in spite of the fact that the email quoting me made it clear that my statement during the meeting had indeed raised this issue.

Later, this co-chair posted a link to a W3C document in an email to me (on list) and stated:
"I'm going on holidays so won't have time to explain you, but I could, in theory (I've been trained to understand that formal stuff, a while ago)"That is so f*cking condescending. This happened after I quoted from W3C documents to support my argument, and I believe I had a good point.

So, in case you haven't experienced it, or haven't recognized it happening around you, this is what sexism looks like. It looks like dismissing what women say, but taking the same argument seriously if a man says it, and it looks like purposely demeaning a woman by suggesting that she can't understand things without the help of a man.

I can't tell you how many times I have been subjected to this kind of behavior, and I'm sure that some of you know how weary I am of not being treated as an equal no matter how equal I really am.

Quiet no more, friends. Quiet no more.

Cynthia Ng: Access 2014 Day 2: Afternoon Notes

planet code4lib - Wed, 2014-10-01 21:49
We continue with the afternoon of day 2 of Access 2014. On the program is linked data and some lightning talks. Linked Data is People: Using Linked Data to Reshape the Library Staff Directory Jason A. Clark, Head, Library Informatics & Computing, Montana State University Scott W.H. Young, Digital Initiatives Librarian, Montana State University Linked […]

CrossRef: CrossRef Newsletter - October 2014 Edition

planet code4lib - Wed, 2014-10-01 21:33

The latest edition of the CrossRef Newsletter has been posted.

The October 2014 edition contains news and updates on CrossRef Text and Data Mining, CrossMark, FundRef and more. The Tech Corner has updates on new technical developments such as the Notification Callback Service. We will be attending various meetings this Fall, including exhibiting at the Frankfurt Book Fair next week. The CrossRef Annual Meeting in London is coming up in November, and there's a story on that, as well as important updates on the Board of Directors election, billing, and more.

Open Knowledge Foundation: Connect and Help Build the Global Open Data Index

planet code4lib - Wed, 2014-10-01 21:11

Earlier this week we announced that October is dedicated to the Global Open Data Index. Already people have added details about open data in Argentina, Colombia, and Chile! You can see all the collaborative work in our change tracker. Each of you can make a difference: hold governments accountable for their open data commitments, and create an easy way for civic technologists to analyze the state of open data around the world, hopefully with some shiny new data viz. Our goal at Open Knowledge is to help you shape the story of Open Data. We are hosting a number of community activities this month to help you learn and connect with each other. Most of all, it is our hope that you can help spread the word in your local language.

Choose your own adventure for the Global Open Data Index

We’ve added a number of ways that you can get involved to the OKFN Wiki. But, here are some more ways to learn and share:

Community Sessions – Let’s Learn Together

Join the Open Knowledge Team and Open Data Index Mentors for a session all about the Global Open Data Index. Our goal is to show the state of open data around the world. We need your help to add data from your region and to reach new people who can add details about their country.

We will share some best practices on finding and adding open dataset content to the Open Data Index. And, we’ll answer questions about the use of the Index. There are timeslots to help people connect globally.

These sessions will be recorded, but we encourage you to join us live on G+/YouTube and bring your ideas and questions. Stay tuned, as we may add more online sessions.

Community Office Hours

Searching for datasets and using the Global Open Data Index tool is all the better with a little help from mentors and fellow community members. If you are a mentor, it would be great if you could join us on a Community Session or host some local office hours. Simply add your name and schedule here.

Mailing Lists and Twitter

The Open Data Index mailing list is the main communication channel for folks who have questions or want to get in touch: https://lists.okfn.org/mailman/listinfo/open-data-census On Twitter, keep an eye on updates via #openindex2014

Translation Help

What better way to help others get involved than to share in your own language. We could use your help. We have some folks translating content into Spanish. Other priority languages are Yours!, Arabic, Portuguese, French and Swahili. Here are some ways to help translate:

Learn on your own

We know that you have limited time to contribute. We’ve created some FAQs and tips to help you add datasets on your own time. I personally like to think of it as a data expedition to check the quality of open data in many countries. Happy hunting and gathering! Last year I had fun reviewing data from around the world. But, what matters is that you have local context to review the language and data for your country. Here’s a quick screenshot of how to contribute:

Thanks again for making Open Data Matter in your part of the world!

(Photo by Marieke Guy, cc by license (cropped))

Cynthia Ng: Access 2014: Day 2 Morning Notes

planet code4lib - Wed, 2014-10-01 20:08
Good morning Calgary. Day 2 of Access 2014. My presentation was first up, and is fully written up in a separate blog post, We’re All Disabled! Part 2: Building Accessible (Web) Services with Universal Design. When Campus IT Comes Knocking: A New Model for UBC Library IT in the 21st Century Paul Joseph Systems Librarian, […]

OCLC Dev Network: Introducing the WorldCat Discovery API

planet code4lib - Wed, 2014-10-01 19:15

We are very excited to announce the beta release of the WorldCat Discovery API. This API is a full-featured, modern discovery API that allows you to search across WorldCat and OCLC’s central index. The WorldCat Discovery API is currently available as a beta and is not yet in general release or available for use by commercial partners. Libraries using WorldCat Discovery Services can request to participate in the beta.

FOSS4Lib Recent Releases: Service-Proxy - 0.38

planet code4lib - Wed, 2014-10-01 19:03
Package: Service-Proxy
Release Date: Monday, September 29, 2014

Last updated October 1, 2014. Created by Peter Murray on October 1, 2014.

Service Proxy version 0.38, Mon Sep 29 16:27:12 UTC 2014

- allows empty un/pw on perconfig authentication, MKSP-125
- statistics plug-in can optionally make its own bytarget request for per-target stats, MKSP-130
- bug-fixes and optimizations to bootstrapping of search before record command, MKSP-129
- encodes pazpar2 parameter names, i.e. to support names of the form pz:param[target-id]
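As an aside on that last item: a name like pz:param[target-id] contains characters (':', '[', ']') that are not safe in a URL query string unless they are percent-encoded. The following is only a rough Python illustration of that encoding idea, not the Service Proxy's own code:

    from urllib.parse import quote

    # placeholder parameter name of the form the release note describes
    name = 'pz:param[target-id]'
    print(quote(name, safe=''))   # -> pz%3Aparam%5Btarget-id%5D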

District Dispatch: A quiet woo hoo moment at ALA’s Washington Office

planet code4lib - Wed, 2014-10-01 18:34

I did not double check, but I think it’s safe to say that most of the last E-rate posts have mentioned somewhere “over the last year” or “about a year ago” or “beginning last summer.” So… Monday, we saw an inkling of the potential payoff for which we have been holding our collective breath for over a year, since the E-rate modernization proceeding began.

On Monday, while we were putting the final touches on our reply comments (pdf) to the E-rate Further Notice of Proposed Rulemaking, Federal Communications Commission (FCC) Chairman Tom Wheeler delivered remarks at the 2014 Education Technology Summit. The Chairman’s remarks clearly articulated what we have been hoping to hear since the adoption of the changes in the July Order and its Wi-Fi focus. For those of you following along closely, you know we have been advocating strongly for increasing the number of libraries that can report scalable, affordable, high-capacity broadband to the building. While our strategy evolved in response to the changing dynamics in D.C., as well as through input from numerous emails, calls, and meetings with ALA’s E-rate Task Force and other library organizations, our goal remains unchanged. We know from more than a decade of research that the fundamental barriers libraries face in increasing broadband capacity are availability and affordability.

On Monday, the Chairman clearly articulated that addressing these barriers is the focus of this next phase of the E-rate modernization efforts:

We have updated the program to close the Wi-Fi gap. Next, we must close the Rural Fiber Gap. So, today, I would like to visit about the next steps in the evolution of the E-rate program. In particular I want to talk about two related issues that remain squarely before the Commission as we consider next steps in the E-rate modernization process: 1) closing the Rural Fiber Gap for schools and libraries, and 2) tackling the affordability challenge.

We know that with the majority of libraries still reporting speeds less than 10 Mbps, there is a long way to go before we can report the majority of libraries are closing in on the gigabit goal set by the Commission in July. And, we know that for most libraries the key to getting there is via a fiber connection regardless of locale.

Our comments also stress the “affordability gap” and we call on the Commission to address both simultaneously, knowing that for many libraries fiber (or other technology) may be in the vicinity of the library, but the monthly cost of service is more than the library can afford so the library ends up saying, “no thank you.” Whether it’s a library struggling along at 3 Mbps to provide video conferencing and distance education services in a rural community; an urban library maxing out every day at 3:00 when school lets out and patrons on their own devices or at the library computers are feeling the stress on the library’s network; or a suburban library planning a multi-media lab and holding work skills classes, we know that two thirds of all libraries want to upgrade to higher speeds. The Commission has clearly opened the door to see that these upgrades can be done through the E-rate program—and that the recurring costs are subsequently affordable.

I would say that over the last year (and leading up to this current proceeding from our earlier work related to the National Broadband Plan in 2010) we worked hard to turn the national emphasis on broadband access and adoption in favor of libraries. With regards to E-rate, we repeatedly asked the question, how should the E-rate program look in the 21st century so that it can best meet the needs of 21st century libraries? Ensuring libraries have the broadband capacity they need is one critical way to shape the program.

A long-standing issue for ALA has been to see the program adequately funded. Our comments ask the Commission to take up the funding challenge, knowing that upgrades will both require immediate investment and likely incur greater monthly costs for service. The data gathering by the Commission and by stakeholders (in addition to the careful review of current program spending, fine-tuning of eligible services, and encouraging economies of scale) will work as guide posts for determining future funding needs of the program. Chairman Wheeler clearly opened the funding door as well and we are confident that “right sizing” the fund for the long haul is firmly on the agenda.

All told, I think we are slowly exhaling. In part because we submitted the reply comments well before the midnight deadline, but really because while we have made some subtle gains over the course of the year’s work (and some not so subtle, perhaps), what we heard from the Chairman on Monday can be read as the Commission making good on its commitment to addressing the “to the library” issue.

There is quite a bit of distance between remarks made in a speech and a Commission order, but the Chairman set an agenda for the E-rate review and modernization and to date, has accomplished much of that agenda. Going from rulemaking to order is an example of the art of compromise and we look forward to helping shape the process. In June in Las Vegas, the E-rate stakes were pretty high. This October in D.C. they will be even higher, but before we deal the last hand, we can step back briefly and quietly say “woo hoo.”

The post A quiet woo hoo moment at ALA’s Washington Office appeared first on District Dispatch.

LITA: Jobs in Information Technology: Oct 1

planet code4lib - Wed, 2014-10-01 17:23

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Dean of the Library, California Maritime Academy, Vallejo, CA

Library Systems and Applications Specialist, Cleveland Public Library, Cleveland, OH

Manager, Digital Services, Florida Virtual Campus, Gainesville, FL

Senior Software Developer, University of Maryland, College Park – Libraries, College Park, MD

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Cynthia Ng: Access 2014: We’re All Disabled! Part 2: Building Accessible (Web) Services with Universal Design

planet code4lib - Wed, 2014-10-01 15:39
This was presented at Access 2014 in a half hour time slot, so I was pretty tight on time. Recording of the stream should be available on the Access Conference Website at some point. Slides Presentation Slides Introduction Morning everyone. I hope yesterday’s panel on usability and accessibility got you thinking about how you might […]

Open Knowledge Foundation: Support Diego Gomez, Join the Global Open Access Movement

planet code4lib - Wed, 2014-10-01 12:38

This post draws on great contributions from the blogs of the Electronic Frontier Foundation (Adi Kamdar & Maira Sutton), Creative Commons (Timothy Vollmer) and the Open Access Button project (David Carroll).

Join the global Open Access movement!

In July the Electronic Frontier Foundation (EFF) wrote about the predicament that Colombian student Diego Gomez found himself in after he shared a research article online. Gomez is a graduate student in conservation and wildlife management at a small university. He has generally poor access to many of the resources and databases that would help him conduct his research. Paltry access to useful materials combined with a natural culture of sharing amongst researchers prompted Gomez to share a paper on Scribd so that he and others could access it for their work. The practice of learning and sharing under less-than-ideal circumstances could land Diego in prison.

Facing 4-8 years in prison for sharing an article

The EFF reports that upon learning of this unauthorized sharing, the author of the research article filed a criminal complaint against Gomez. The charges lodged against Diego could put him in prison for 4-8 years. The trial has started, and the court will need to take into account several factors, including whether there was any malicious intent behind the action and whether there was any actual harm to the economic rights of the author.

Academics and students send and post articles online like this every day—it is simply the norm in scholarly communication. And yet inflexible digital policies, paired with senseless and outdated practices, have led to such extreme cases like Diego’s. People who experience massive access barriers to existing research—most often hefty paywalls—often have no choice but to find and share relevant papers through colleagues in their network. The Internet has certainly enabled this kind of information sharing at an unprecedented speed and scale, but we are still far from reaching its full capacity.

If open access were the default for scholarly communication, cases like Diego’s would become obsolete. Let’s stand together to support Diego Gomez and promote Open Access worldwide.

Help Diego Gomez and join academics and users in fighting outdated laws and practices that keep valuable research locked up for no good reason. With open access as the default for scholarly communication, academic research would be free to access and available under an open license that legally enables the kind of sharing that is so crucial to scientific progress.

We at Open Knowledge have joined as signees of the petition in support of Diego alongside prominent organisations such as the Electronic Frontier Foundation, Creative Commons, Open Access Button, Internet Archive, Public Knowledge, and the Right to Research Coalition. Sign your support for Diego to express your support for open access as the default for scientific and scholarly publishing, so researchers like Diego don’t risk severe penalties for helping colleagues access the research they need:

[Click here to sign the petition]

Sign-on statement: “Scientific and scholarly progress relies upon the exchange of ideas and research. We all benefit when research is shared widely, freely, and openly. I support an Open Access system for academic publishing that makes research free for anyone to read and re-use; one that is inclusive of all and doesn’t force researchers like Diego Gomez to risk severe penalties for helping colleagues access the research they need.”

LITA: Cataloging a world of languages

planet code4lib - Wed, 2014-10-01 12:00

My university has a mandate to increase our international reach through research collaborations, courses offered, and support for international students.

From the technical services side, this means our catalogers must provide metadata for resources in unfamiliar languages, including some that don’t use the Roman alphabet. A few of the challenges we face include:

  • Identifying the language of an item (is that Spanish or Catalan?)
  • Cataloging an item in a language you don’t speak or read (what is this book even about?)
  • Transliterating from non-Roman alphabets (e.g. Cyrillic, Chinese, Thai)
  • Diacritic codes in copy cataloging that don’t match your system’s encoding scheme

I’d like to share a few free tools that our catalogers have found helpful. I’ve used some of these in other areas of librarianship as well, including acquisitions and reference.

Language identifiers

Sometimes I open a book or article and have no idea where to start, because the language isn’t anything I’ve seen before.

I turn to the Open Xerox Language Identifier, which covers over 80 different languages. Type or paste in text of the mysterious language, and give it a try. The more text you provide, the more accurate it is.

Language translators

Web translation tools aren’t perfect, but they’re a great way to get the gist of a piece of writing (don’t use them for sending sensitive emails to bilingual coworkers, however).

Google Translate includes over 75 languages, and also a language identification tool. Enter the title, a few chapter names, or back cover blurb, and you’ll get the general idea of the content.

Transliteration tables

If you catalog in Roman script and you wind up with a resource in Cyrillic or Chinese, how do you render it so the record is searchable in your ILS? Transliteration tables match up characters between scripts.

The ALA-LC Romanization Tables for non-Roman scripts are approved by the American Library Association and the Library of Congress. They cover over 70 different scripts.

Bibliographic dictionaries

We’re fortunate that librarians love to share: there are quite a few sites produced by libraries that look at common bibliographic terms you’d find on title pages: numbers, dates, editions, statements of responsibility, price, etc.

To share two Canadian examples, Memorial University maintains a Glossary of Bibliographic Information by Language and Queen’s University has a page of Foreign Language Equivalents for Bibliographic Terms.

If you’ve ever seen the phrase “bibliographic knowledge of [language]” in a job posting, this is what it’s referring to—when you’ve cataloged enough material in a language to know these terms, but can’t carry on a conversation about daily life. I have bibliographic knowledge of Spanish, Italian, and German, but don’t ask me to go to a restaurant in Hamburg and order a hamburger.

Subject-specific glossaries

Similar to bibliographic dictionaries, these are for terms common to specific subjects.

My university has significant music and map collections, so I often consult the language tools at Music Cataloging at Yale (…and I once  thought music was the universal language) and the European Environment Agency’s Terminology and Discovery Service.

Diacritic charts

In order to ensure that accented characters and special symbols display properly in the catalog, it’s important to have the correct diacritic code.

Our system uses Unicode, and we often rely on the Unicode Character Code Chart or Unicode Character Table.  Which interface you use is personal preference.

It may also be worth coming up with a cheat sheet of the codes you use most frequently – for example, common French accents if you’re cataloging Canadian government documents, which are bilingual.

Many Integrated Library Systems also have diacritic charts built in, where you can select the symbol you need and click it to place it in the record.
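If you already have the character in hand (pasted from a record, say) and just need its code point or official name, a scripting language can spare you a trip through the charts. Here is a minimal Python sketch using the standard unicodedata module; the characters below are only examples:

    import unicodedata

    for char in ('é', 'ç', 'ō'):
        # official Unicode name plus the hexadecimal code point, e.g. U+00E9
        print(char, unicodedata.name(char), 'U+%04X' % ord(char))

    # and the reverse: look a character up by its Unicode name
    print(unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE'))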

Diacritic guessers

Diacritic charts can be long and involved (the Unicode example above is a bit of a nightmare), so if you’re working with a new language, browsing through them searching for a specific code can be time-consuming. You can see the symbol in front of you, but have no idea what it’s called.

This is where Shapecatcher comes in.  This utility allows you to draw a character using your mouse or tablet. It identifies possible matches for the symbol and gives you the symbol’s name and Unicode number.

Have you encountered issues handling different languages when cataloguing? Is there a free language tool you’d like to share? Tell us about it in the comments!

__

Credits: Image of Pieter Bruegel the Elder’s painting The Tower of Babel courtesy of the Google Art Project. Many thanks also to my colleagues Judy Harris and Vivian Zhang for sharing their language challenges and tools.

Ed Summers: A Ferguson Twitter Archive

planet code4lib - Wed, 2014-10-01 02:05

Much has been written about the significance of Twitter as the recent events in Ferguson echoed round the Web, the country, and the world. I happened to be at the Society of American Archivists meeting 5 days after Michael Brown was killed. During our panel discussion someone asked about the role that archivists should play in documenting the event.

There was wide agreement that Ferguson was a painful reminder of the type of event that archivists working to “interrogate the role of power, ethics, and regulation in information systems” should be documenting. But what to do? Unfortunately we didn’t have time to really discuss exactly how this agreement translated into action.

Fortunately the very next day the Archive-It service run by the Internet Archive announced that they were collecting seed URLs for a Web archive related to Ferguson. It was only then, after also having finally read Zeynep Tufekci‘s terrific Medium post, that I slapped myself on the forehead … of course, we should try to archive the tweets. Ideally there would be a “we” but the reality was it was just “me”. Still, it seemed worth seeing how much I could get done.

twarc

I had some previous experience archiving tweets related to Aaron Swartz using Twitter’s search API. (Full disclosure: I also worked on the Twitter archiving project at the Library of Congress, but did not use any of that code or data then, or now.) I wrote a small Python command line program named twarc (a portmanteau for Twitter Archive), to help manage the archiving.

You give twarc a search query term, and it will plod through the search results in reverse chronological order (the order in which they are returned), while handling quota limits and writing out line-oriented JSON, where each line is a complete tweet. It worked quite well to collect 630,000 tweets mentioning “aaronsw”, but I was starting late out of the gate, 6 days after the events in Ferguson began. One downside to twarc is that it is completely dependent on Twitter’s search API, which only returns results for the past week or so. You can search back further in Twitter’s Web app, but that seems to be a privileged client. I can’t seem to convince the API to keep going back in time past a week or so.

So time was of the essence. I started up twarc searching for all tweets that mention ferguson, but quickly realized that the volume of tweets, and the order of the search results, meant that I wouldn’t be able to retrieve the earliest tweets. So I tried to guesstimate a Twitter ID far enough back in time to use with twarc’s --max_id parameter to limit the initial query to tweets before that point in time. Doing this I was able to get back to 2014-08-10 22:44:43 — most of August 9th and 10th had slipped out of the window. I used a similar technique of guessing an ID further in the future in combination with the --since_id parameter to start collecting from where that snapshot left off. This resulted in a bit of a fragmented record, which you can see visualized (sort of) below:

In the end I collected 13,480,000 tweets (63G of JSON) between August 10th and August 27th. There were some gaps because of mismanagement of twarc, and the data just moving too fast for me to recover from them: most of August 13th is missing, as well as part of August 22nd. I’ll know better next time how to manage this higher volume collection.

Apart from the data, a nice side effect of this work is that I fixed a socket timeout error in twarc that I hadn’t noticed before. I also refactored it a bit so I could use it programmatically like a library instead of only as a command line tool. This allowed me to write a program to archive the tweets, incrementing the max_id and since_id values automatically. The longer continuous crawls near the end are the result of using twarc more as a library from another program.
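For the record, the guesstimate trick works because standard Twitter "snowflake" IDs embed a millisecond timestamp, so an approximate ID for a given moment can be derived directly. A rough Python sketch of that idea (the date below is just an example):

    from datetime import datetime, timezone

    TWITTER_EPOCH_MS = 1288834974657  # start of the snowflake ID scheme

    def approx_tweet_id(dt):
        # shift the millisecond offset into the timestamp bits of a snowflake ID
        ms = int(dt.timestamp() * 1000) - TWITTER_EPOCH_MS
        return ms << 22

    # an ID from around the start of August 10th, 2014 (UTC), usable as --max_id
    print(approx_tweet_id(datetime(2014, 8, 10, tzinfo=timezone.utc)))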

Bag of Tweets

To try to arrange/package the data a bit I decided to order all the tweets by tweet id, and split them up into gzipped files of 1 million tweets each. Sorting 13 million tweets was pretty easy using leveldb. I first loaded all of the tweets into the db, using the tweet id as the key and the JSON string as the value.

    import json
    import leveldb
    import fileinput

    db = leveldb.LevelDB('./tweets.db')

    # key each tweet by its id; leveldb keeps keys sorted, so iterating
    # over the db later yields the tweets in id order
    for line in fileinput.input():
        tweet = json.loads(line)
        db.Put(tweet['id_str'], line)

This took almost 2 hours on a medium ec2 instance. Then I walked the leveldb index, writing out the JSON as I went, which took 35 minutes:

    import leveldb

    db = leveldb.LevelDB('./tweets.db')
    # walk the keys in sorted (tweet id) order and emit the stored JSON lines
    for k, v in db.RangeIter(None, include_value=True):
        print v,

After splitting them into million-line files with cut and gzipping them, I put them in a Bag and uploaded it to S3 (8.5G).
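The splitting was done with cut and gzip on the command line; purely for illustration, a roughly equivalent Python sketch (the file naming is mine) that chunks a sorted stream of tweets into gzipped million-line files:

    import gzip
    import sys

    CHUNK = 1000000
    out = None

    for count, line in enumerate(sys.stdin):
        if count % CHUNK == 0:
            if out:
                out.close()
            # tweets-001.json.gz, tweets-002.json.gz, ...
            out = gzip.open('tweets-%03d.json.gz' % (count // CHUNK + 1), 'wt')
        out.write(line)

    if out:
        out.close()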

I am planning on trying to extract URLs from the tweets to try to come up with a list of seed URLs for the Archive-It crawl. If you have ideas of how to use it definitely get in touch. I haven’t decided yet if/where to host the data publicly. If you have ideas please get in touch about that too!
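As a possible starting point for that URL extraction: standard v1.1 tweet JSON records links under entities.urls, with expanded_url holding the unshortened form when Twitter provides it. A rough sketch over the line-oriented JSON files:

    import json
    import fileinput
    from collections import Counter

    urls = Counter()

    for line in fileinput.input():
        tweet = json.loads(line)
        for u in tweet.get('entities', {}).get('urls', []):
            urls[u.get('expanded_url') or u.get('url')] += 1

    # most frequently shared links first, as candidate seed URLs
    for url, count in urls.most_common(100):
        print(count, url)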

Library Tech Talk (U of Michigan): Old Wine in New Bottles: Our Efforts Migrating Legacy Materials to HathiTrust

planet code4lib - Wed, 2014-10-01 00:00
(by Kat Hagedorn, Christina Powell, Lance Stuchell and John Weise) The one constant in digital preservation over the past couple of decades has been change. Digitization standards have changed as equipment has improved and become more affordable, formats have come and gone, and tools have been developed to help with automated format creation and validation. The progress made on this front has been great, but how do we reconcile older content with current digitization and preservation standards?

Library of Congress: The Signal: QCTools: Open Source Toolset to Bring Quality Control for Video within Reach

planet code4lib - Tue, 2014-09-30 12:01

In this interview, part of the Insights Interview series, FADGI talks with Dave Rice and Devon Landes about the QCTools project.

In a previous blog post, I interviewed Hannah Frost and Jenny Brice about the AV Artifact Atlas, one of the components of Quality Control Tools for Video Preservation, an NEH-funded project which seeks to design and make available community-oriented products that reduce the time and effort it takes to perform high-quality video preservation. The less “eyes on” time routine QC work takes, the more time can be redirected toward quality control and assessment of the digitized content most deserving of attention.

QCTools’ Devon Landes

In this blog post, I interview archivists and software developers Dave Rice and Devon Landes about the latest release version of the QCTools, an open source software toolset to facilitate accurate and efficient assessment of media integrity throughout the archival digitization process.

Kate:  How did the QCTools project come about?

Devon:  There was a recognized need for accessible & affordable tools out there to help archivists, curators, preservationists, etc. in this space. As you mention above, manual quality control work is extremely labor and resource intensive but a necessary part of the preservation process. While there are tools out there, they tend to be geared toward (and priced for) the broadcast television industry, making them out of reach for most non-profit organizations. Additionally, quality control work requires a certain skill set and expertise. Our aim was twofold: to build a tool that was free/open source, but also one that could be used by specialists and non-specialists alike.

QCTools’ Dave Rice

Dave:  Over the last few years a lot of the building blocks for this project were falling into place. Bay Area Video Coalition had been researching and gathering samples of digitization issues through the A/V Artifact Atlas project, and meanwhile FFmpeg had made substantial developments in their audiovisual filtering library. Additionally, open source technology for archival and preservation applications has been finding more development, application, and funding. Lastly, the urgency of the obsolescence issues surrounding analog video, combined with lower costs for digital video management, meant that more organizations were starting their own analog video preservation projects, creating a greater need for an open source response to quality control issues. In 2013, the National Endowment for the Humanities awarded BAVC with a Preservation and Access Research and Development grant to develop QCTools.

Kate: Tell us what’s new in this release. Are you pretty much sticking to the plan or have you made adjustments based on user feedback that you didn’t foresee? How has the pilot testing influenced the products?

QCTools provides many playback filters. Here the left window shows a frame with the two fields presented separately (revealing the lack of chroma data in field 2). The right window shows the V plane of the video per field, to reveal what data the deck is providing.

Devon:  The users’ perspective is really important to us and being responsive to their feedback is something we’ve tried to prioritize. We’ve had several user-focused training sessions and workshops which have helped guide and inform our development process. Certain processing filters were added or removed in response to user feedback; obviously UI and navigability issues were informed by our testers. We’ve also established a GitHub issue tracker to capture user feedback which has been pretty active since the latest release and has been really illuminating in terms of what people are finding useful or problematic, etc.

The newest release has quite a few optimizations to improve speed and responsiveness, some additional playback and viewing options, better documentation, and support for creating an XML-format report.

Dave:  The most substantial example of going ‘off plan’ was the incorporation of video playback. Initially the grant application focused on QCTools as a purely analytical tool which would assess and present quantifications of video metrics via graphs and data visualization. Initial work delved deeply into identifying the right metrics for picking out what could be unnatural in digitized analog video (such as pixels too dissimilar from their temporal neighbors, the near-exact repetition of pixel rows, or discrepancies in the rate of change over time between the two video fields). When presenting the earliest prototypes of QCTools to users, a recurring question was “How can I see the video?” We redesigned the project so that QCTools would present the video alongside the metrics, with various scopes, meters and visual tools, so that it now has both a visual and an analytic side.

Kate:   I love that the Project Scope for QCTools quotes both the Library of Congress’s Sustainability of Digital Formats and the Federal Agencies Digitization Guidelines Initiative as influential resources which encourage best practices and standards in audiovisual digitization of analog material for users. I might be more than a little biased but I agree completely. Tell me about some of the other resources and communities that you and the rest of the project team are looking at.

Here the QCTools vectorscope shows a burst of illegal color values. With the QCTools display of plotted graphs this corresponds to a spike in the maximum saturation (SATMAX).

Devon: Bay Area Video Coalition connected us with a group of testers from various backgrounds and professional environments, so we’ve been able to tap into a pretty varied community in that sense. Their A/V Artifact Atlas has also been an important resource for us and was really the starting point from which QCTools was born.

Dave:  This project would not at all be feasible without the existing work of FFmpeg. QCTools utilizes FFmpeg for all decoding, playback, metadata expression and visual analytics. The QCTools data format is an expression of FFmpeg’s ffprobe schema, which appeared to be one of the only audiovisual file format standards that could efficiently store masses of frame-based metadata.
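For a concrete sense of what those frame-based metrics look like, a rough sketch (not QCTools' own code) that shells out to ffprobe with the signalstats filter and prints each frame's timestamp and maximum saturation, the SATMAX value mentioned in the caption above. It assumes an FFmpeg build with lavfi and signalstats available; input.mov is a placeholder filename:

    import subprocess

    cmd = [
        'ffprobe', '-f', 'lavfi',
        '-i', 'movie=input.mov,signalstats',
        '-show_entries', 'frame=pkt_pts_time:frame_tags=lavfi.signalstats.SATMAX',
    ]
    output = subprocess.check_output(cmd).decode('utf-8', 'replace')

    # the default writer emits one [FRAME] block per frame with lines such as
    #   pkt_pts_time=0.033367
    #   TAG:lavfi.signalstats.SATMAX=135.2
    for line in output.splitlines():
        if line.startswith(('pkt_pts_time=', 'TAG:lavfi.signalstats.SATMAX=')):
            print(line)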

Kate:   What are the plans for training and documentation on how to use the product(s)?

Devon:  We want the documentation to speak to a wide range of backgrounds and expertise, but it is a challenge to do that and as such it is an ongoing process. We had a really helpful session during one of our tester retreats where users directly and collaboratively made comments and suggestions to the documentation; because of the breadth of their experience it really helped to illuminate gaps and areas for improvement on our end. We hope to continue that kind of engagement with users and also offer them a place to interact more directly with each other via a discussion page or wiki. We’ve also talked about the possibility of recording some training videos and hope to better incorporate the A/V Artifact Atlas as a source of reference in the next release.

Kate:   What’s next for QCTools?

Dave:   We’re presenting the next release of QCTools at the Association of Moving Image Archivists Annual Meeting on October 9th, for which we anticipate supporting better summarization of digitization issues per file in a comparative manner. After AMIA, we’ll focus on audio and the incorporation of audio metrics via FFmpeg’s EBUr128 filter. QCTools has been integrated into workflows at BAVC, Dance Heritage Coalition, MoMA, Anthology Film Archives and Die Österreichische Mediathek, so the QCTools issue tracker has been filling up with suggestions which we’ll be tackling in the upcoming months.

Open Knowledge Foundation: Why the Open Definition Matters for Open Data: Quality, Compatibility and Simplicity

planet code4lib - Tue, 2014-09-30 10:55

The Open Definition performs an essential function as a “standard”, ensuring that when you say “open data” and I say “open data” we both mean the same thing. This standardization, in turn, ensures the quality, compatibility and simplicity essential to realizing one of the main practical benefits of “openness”: the greatly increased ability to combine different datasets together to drive innovation, insight and change.

This post explores in more detail why it’s important to have a clear standard in the form of the Open Definition for what open means for data.

Three Reasons

There are three main reasons why the Open Definition matters for open data:

Quality: open data should mean the freedom for anyone to access, modify and share that data. However, without a well-defined standard detailing what that means we could quickly see “open” being diluted as lots of people claim their data is “open” without actually providing the essential freedoms (for example, claiming data is open but actually requiring payment for commercial use). In this sense the Open Definition is about “quality control”.

Compatibility: without an agreed definition it becomes impossible to know if your “open” is the same as my “open”. This means we cannot know whether it’s OK to connect your open data and my open data together since the terms of use may, in fact, be incompatible (at the very least I’ll have to start consulting lawyers just to find out!). The Open Definition helps guarantee compatibility and thus the free ability to mix and combine different open datasets which is one of the key benefits that open data offers.

Simplicity: a big promise of open data is simplicity and ease of use. This is not just in the sense of not having to pay for the data itself; it’s about not having to hire a lawyer to read the license or contract, not having to think about what you can and can’t do and what it means for, say, your business or for your research. A clear, agreed definition ensures that you do not have to worry about complex limitations on how you can use and share open data.

Let’s flesh these out in a bit more detail:

Quality Control (avoiding “open-washing” and “dilution” of open)

A key promise of open data is that it can be freely accessed and used. Without a clear definition of what exactly that means (e.g. used by whom, for what purpose) there is a risk of dilution, especially as “open data” is an attractive label. For example, you could quickly find people putting out what they call “open data” that only non-commercial organizations can access freely.

Thus, without good quality control we risk devaluing open data as a term and concept, as well as excluding key participants and fracturing the community (as we end up with competing and incompatible sets of “open” data).

Compatibility

A single piece of data on its own is rarely useful. Instead data becomes useful when connected or intermixed with other data. If I want to know about the risk of my home getting flooded I need to have geographic data about where my house is located relative to the river and I need to know how often the river floods (and how much).

That’s why “open data”, as defined by the Open Definition, isn’t just about the freedom to access a piece of data, but also about the freedom connect or intermix that dataset with others.

Unfortunately, we cannot take compatibility for granted. Without a standard like the Open Definition it becomes impossible to know if your “open” is the same as my “open”. This means, in turn, that we cannot know whether it’s OK to connect (or mix) your open data and my open data together (without consulting lawyers!) – and it may turn out that we can’t because your open data license is incompatible with my open data license.

Think of power sockets around the world. Imagine if every electrical device had a different plug and needed a different power socket. When I came over to your house I’d need to bring an adapter! Thanks to standardization, at least within a given country power sockets are almost always the same – so I can bring my laptop over to your house without a problem. However, when you travel abroad you may have to take an adapter with you. What drives this is standardization (or its lack): within your own country everyone has standardized on the same socket type, but different countries may not share a standard and hence you need to get an adapter (or run out of power!).

For open data, the risk of incompatibility is growing as more open data is released and more and more open data publishers such as governments write their own “open data licenses” (with the potential for these different licenses to be mutually incompatible).

The Open Definition helps prevent incompatibility by:

Evergreen ILS: Evergreen to Participate in Outreach Program for Women

planet code4lib - Tue, 2014-09-30 02:59

The Evergreen project will participate in the Outreach Program for Women, a program organized through the GNOME Foundation to improve gender diversity in Free and Open Source Software projects.

The Executive Oversight Board voted last month to fund one internship through the program. The intern will work on a project for the community from December 9, 2014 to March 9, 2015. The Evergreen community has identified five possible projects for the internship: three are software development projects, one is a documentation project, and one is a user experience project.

Candidates for the program have started asking questions in IRC and on the mailing list as they prepare to submit their applications, which are due on October 22, 2014. They will also be looking for feedback on their ideas. Please take the opportunity to share your thoughts with them on these ideas since it will help strengthen their application.

If you are an OPW candidate trying to decide on a project, take some time to stop into the #evergreen IRC channel to learn about our project and to get to know the people responsible for the care and feeding of Evergreen. We are an active and welcoming community that includes not only developers, but the sys admins and librarians who use Evergreen on a daily basis.

To get started, read through the Learning About Evergreen section of our OPW page. Try Evergreen out on one of our community demo servers, read through the documentation, and sign up for our mailing lists to learn more about the community. If you are planning to apply for a coding project, take some time to download and install Evergreen. Each project also has an application requirement that you should complete before submitting your application. Please take time to review that requirement and find some way you can contribute to the project.

We look forward to working with you on the project!

District Dispatch: Free webinar: Making the election connection

planet code4lib - Mon, 2014-09-29 21:41

From federal funding to support for school librarians to net neutrality, 2015 will be a critical year for federal policies that impact libraries. We need to be working now to build the political relationships necessary to make sure these decisions benefit our community. Fortunately, the November elections provide a great opportunity to do so.

In a new free webinar hosted by the American Library Association (ALA) and Advocacy Guru Stephanie Vance, leaders will discuss how all types of library supporters can legally engage during an election season, as well as what types of activities will have the most impact. Webinar participants will learn 10 quick and easy tactics, from social media to candidate forums, that will help you take action right away. If you want to help protect our library resources in 2015 and beyond, then this is the session for you. Register now, as space is limited.

Webinar: Making the Election Connection
Date: Monday, October 6, 2014
Time: 2:00–2:30 p.m. EDT

The archived webinar will be emailed to District Dispatch subscribers.

The post Free webinar: Making the election connection appeared first on District Dispatch.
