Earlier this week we announced that October is Global Open Data Index month. Already people have added details about open data in Argentina, Colombia, and Chile! You can see all the collaborative work here in our change tracker. Each of you can make a difference by holding governments accountable for their open data commitments and by creating an easy way for civic technologists to analyze the state of open data around the world, hopefully with some shiny new data viz. Our goal at Open Knowledge is to help you shape the story of Open Data. We are hosting a number of community activities this month to help you learn and connect with each other. Most of all, it is our hope that you can help spread the word in your local language.
We’ve added a number of ways that you can get involved to the OKFN Wiki. But here are some more ways to learn and share:

Community Sessions – Let’s Learn Together
Join the Open Knowledge Team and Open Data Index Mentors for a session all about the Global Open Data Index. Our goal is to show the state of open data around the world. We need your help to add data from your region and to reach new people who can add details about their country.
We will share some best practices on finding and adding open dataset content to the Open Data Index. And, we’ll answer questions about the use of the Index. There are timeslots to help people connect globally.
- Thursday, October 9, 2014: You can build the Global Open Data Index! (Times: 16:00 BST / 11:00 EST / 17:00 CEST / 18:00 EAT)
- Monday, October 13, 2014: Help Create the Open Data Index (Times: 10:00 BST / 11:00 CEST / 16:00 WIB / 17:00 HKT)
These sessions will be recorded, but we encourage you to join us on Google+/YouTube and bring your ideas and questions. Stay tuned as we may add more online sessions.

Community Office Hours
Searching for datasets and using the Global Open Data Index tool is all the better with a little help from mentors and fellow community members. If you are a mentor, it would be great if you could join us on a Community Session or host some local office hours. Simply add your name and schedule here.

Mailing Lists and Twitter
The Open Data Index mailing list is the main communication channel for folks who have questions or want to get in touch: https://lists.okfn.org/mailman/listinfo/open-data-census On Twitter, keep an eye on updates via #openindex2014.

Translation Help
What better way to help others get involved than to share in your own language? We could use your help. We have some folks translating content into Spanish. Other priority languages include Arabic, Portuguese, French, and Swahili (and yours!). Here are some ways to help translate:
- Small tasks – Translate and share Tweets/Facebook posts – (Add this to etherpad)
- Small tasks – Translate the Open Data index blog post into your language – send to index at OKFN dot Org
- Medium task – Make a copy of the Open Data Index Tutorial and translate it into your language
We know that you have limited time to contribute. We’ve created some FAQs and tips to help you add datasets on your own time. I personally like to think of it as a data expedition to check the quality of open data in many countries. Happy hunting and gathering! Last year I had fun reviewing data from around the world. But, what matters is that you have local context to review the language and data for your country. Here’s a quick screenshot of how to contribute:
Thanks again for making Open Data Matter in your part of the world!
We are very excited to announce the beta release of the WorldCat Discovery API. This API is a full-featured, modern discovery API that allows you to search across WorldCat and OCLC’s central index. The WorldCat Discovery API is currently available as a beta and is not yet in general release or available for use by commercial partners. Libraries using WorldCat Discovery Services can request to participate in the beta.
Last updated October 1, 2014. Created by Peter Murray on October 1, 2014.
Service Proxy version 0.38, Mon Sep 29 16:27:12 UTC 2014
- allows empty un/pw on perconfig authentication, MKSP-125
- statistics plug-in can optionally make its own bytarget request for per-target stats, MKSP-130
- bug-fixes and optimizations to bootstrapping of search before record
- encodes pazpar2 parameter names, i.e. to support names of the form
I did not double check, but I think it’s safe to say that most of our recent E-rate posts have mentioned somewhere “over the last year” or “about a year ago” or “beginning last summer.” So… on Monday, we saw an inkling of the potential payoff for which we have been holding our collective breath for over a year, since the E-rate modernization proceeding began.
On Monday, while we were putting the final touches on our reply comments (pdf) to the E-rate Further Notice of Proposed Rulemaking, Federal Communications Commission (FCC) Chairman Tom Wheeler delivered remarks at the 2014 Education Technology Summit. The Chairman’s remarks clearly articulated what we have been hoping to hear since the adoption of the changes in the July Order and its Wi-Fi focus. For those of you following along closely, you know we have been advocating strongly for increasing the number of libraries that can report scalable, affordable high-capacity broadband to the building. While our strategy evolved in response to the changing dynamics in D.C. as well as through input from the numerous emails and calls and meetings with ALA’s E-rate Task Force and other library organizations, our goal remains unchanged. We know from more than a decade of research that the fundamental barriers libraries face in increasing broadband capacity are availability and affordability.
On Monday, the Chairman clearly articulated that addressing these barriers is the focus of this next phase of the E-rate modernization efforts:
We have updated the program to close the Wi-Fi gap. Next, we must close the Rural Fiber Gap. So, today, I would like to visit about the next steps in the evolution of the E-rate program. In particular I want to talk about two related issues that remain squarely before the Commission as we consider next steps in the E-rate modernization process: 1) closing the Rural Fiber Gap for schools and libraries, and 2) tackling the affordability challenge.
We know that with the majority of libraries still reporting speeds less than 10 Mbps, there is a long way to go before we can report that the majority of libraries are closing in on the gigabit goal set by the Commission in July. And we know that for most libraries the key to getting there is a fiber connection, regardless of locale.
Our comments also stress the “affordability gap” and we call on the Commission to address both simultaneously, knowing that for many libraries fiber (or other technology) may be in the vicinity of the library, but the monthly cost of service is more than the library can afford so the library ends up saying, “no thank you.” Whether it’s a library struggling along at 3 Mbps to provide video conferencing and distance education services in a rural community; an urban library maxing out every day at 3:00 when school lets out and patrons on their own devices or at the library computers are feeling the stress on the library’s network; or a suburban library planning a multi-media lab and holding work skills classes, we know that two thirds of all libraries want to upgrade to higher speeds. The Commission has clearly opened the door to see that these upgrades can be done through the E-rate program—and that the recurring costs are subsequently affordable.
I would say that over the last year (and leading up to this current proceeding from our earlier work related to the National Broadband Plan in 2010) we worked hard to turn the national emphasis on broadband access and adoption in favor of libraries. With regards to E-rate, we repeatedly asked the question, how should the E-rate program look in the 21st century so that it can best meet the needs of 21st century libraries? Ensuring libraries have the broadband capacity they need is one critical way to shape the program.
A long-standing issue for ALA has been to see the program adequately funded. Our comments ask the Commission to take up the funding challenge, knowing that upgrades will both require immediate investment and likely incur greater monthly costs for service. The data gathering by the Commission and by stakeholders (in addition to the careful review of current program spending, fine-tuning of eligible services, and encouraging economies of scale) will work as guide posts for determining future funding needs of the program. Chairman Wheeler clearly opened the funding door as well and we are confident that “right sizing” the fund for the long haul is firmly on the agenda.
All told, I think we are slowly exhaling. In part because we submitted the reply comments well before the midnight deadline, but really because while we have made some subtle gains over the course of the year’s work (and some not so subtle, perhaps), what we heard from the Chairman on Monday can be read as the Commission making good on its commitment to addressing the “to the library” issue.
There is quite a bit of distance between remarks made in a speech and a Commission order, but the Chairman set an agenda for the E-rate review and modernization and to date, has accomplished much of that agenda. Going from rulemaking to order is an example of the art of compromise and we look forward to helping shape the process. In June in Las Vegas, the E-rate stakes were pretty high. This October in D.C. they will be even higher, but before we deal the last hand, we can step back briefly and quietly say “woo hoo.”
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week
Visit the LITA Job Site for more available jobs and for information on submitting a job posting.
Cynthia Ng: Access 2014: We’re All Disabled! Part 2: Building Accessible (Web) Services with Universal Design
This is a post put together based on great contributions on the blogs of the Electronic Frontier Foundation (Adi Kamdar & Maira Sutton), Creative Commons (Timothy Vollmer) and the Open Access Button project (David Carroll).
Join the global Open Access movement!
In July the Electronic Frontier Foundation (EFF) wrote about the predicament that Colombian student Diego Gomez found himself in after he shared a research article online. Gomez is a graduate student in conservation and wildlife management at a small university. He has generally poor access to many of the resources and databases that would help him conduct his research. Paltry access to useful materials combined with a natural culture of sharing amongst researchers prompted Gomez to share a paper on Scribd so that he and others could access it for their work. The practice of learning and sharing under less-than-ideal circumstances could land Diego in prison.

Facing 4-8 years in prison for sharing an article
The EFF reports that upon learning of this unauthorized sharing, the author of the research article filed a criminal complaint against Gomez. The charges lodged against Diego could put him in prison for 4-8 years. The trial has started, and the court will need to take several factors into account, including whether there was any malicious intent behind the action and whether there was any actual harm to the economic rights of the author.
Academics and students send and post articles online like this every day—it is simply the norm in scholarly communication. And yet inflexible digital policies, paired with senseless and outdated practices, have led to extreme cases like Diego’s. People who face massive barriers to accessing existing research—most often hefty paywalls—often have no choice but to find and share relevant papers through colleagues in their network. The Internet has certainly enabled this kind of information sharing at unprecedented speed and scale, but we are still far from reaching its full capacity. Let’s stand together to support Diego Gomez and promote Open Access worldwide.
Help Diego Gomez and join academics and users in fighting outdated laws and practices that keep valuable research locked up for no good reason. If open access were the default for scholarly communication, cases like Diego’s would become obsolete. Academic research would be free to access and available under an open license that would legally enable the kind of sharing that is so crucial for enabling scientific progress.
We at Open Knowledge have joined as signatories of the petition in support of Diego, alongside prominent organisations such as the Electronic Frontier Foundation, Creative Commons, Open Access Button, Internet Archive, Public Knowledge, and the Right to Research Coalition. Sign the petition to express your support for open access as the default for scientific and scholarly publishing, so researchers like Diego don’t risk severe penalties for helping colleagues access the research they need:

[Click here to sign the petition]
Sign-on statement: “Scientific and scholarly progress relies upon the exchange of ideas and research. We all benefit when research is shared widely, freely, and openly. I support an Open Access system for academic publishing that makes research free for anyone to read and re-use; one that is inclusive of all and doesn’t force researchers like Diego Gomez to risk severe penalties for helping colleagues access the research they need.”
From the technical services side, this means our catalogers must provide metadata for resources in unfamiliar languages, including some that don’t use the Roman alphabet. A few of the challenges we face include:
- Identifying the language of an item (is that Spanish or Catalan?)
- Cataloging an item in a language you don’t speak or read (what is this book even about?)
- Transliterating from non-Roman alphabets (e.g. Cyrillic, Chinese, Thai)
- Diacritic codes in copy cataloging that don’t match your system’s encoding scheme
I’d like to share a few free tools that our catalogers have found helpful. I’ve used some of these in other areas of librarianship as well, including acquisitions and reference.
Sometimes I open a book or article and have no idea where to start, because the language isn’t anything I’ve seen before.
I turn to the Open Xerox Language Identifier, which covers over 80 different languages. Type or paste in text of the mysterious language, and give it a try. The more text you provide, the more accurate it is.
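Under the hood, identifiers like this score text against per-language statistics. As a toy illustration of the idea (nothing like the Xerox service's actual character n-gram models, and with made-up miniature stopword lists), a sketch might look like:

```python
# Naive language guesser: counts hits against tiny stopword lists.
# Illustrative only -- real identifiers use statistical models trained
# over large corpora in dozens of languages.
STOPWORDS = {
    "spanish": {"el", "la", "los", "las", "de", "que", "y", "en"},
    "catalan": {"el", "la", "els", "les", "de", "que", "i", "amb"},
    "french":  {"le", "la", "les", "de", "que", "et", "dans", "avec"},
}

def guess_language(text):
    words = set(text.lower().split())
    scores = {lang: len(words & stops) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_language("el gato y el perro en la casa"))
```

Note how the Spanish/Catalan distinction hinges on just a few function words ("y" vs. "i"), which is exactly why longer samples give more accurate results.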
Web translation tools aren’t perfect, but they’re a great way to get the gist of a piece of writing (don’t use them for sending sensitive emails to bilingual coworkers, however).
Google Translate includes over 75 languages, and also a language identification tool. Enter the title, a few chapter names, or back cover blurb, and you’ll get the general idea of the content.
If you catalog in Roman script, and you wind up with a resource in Cyrillic or Chinese, how do you translate that so the record is searchable in your ILS? Transliteration tables match up characters between scripts.
The ALA-LC Romanization Tables for non-Roman scripts are approved by the American Library Association and the Library of Congress. They cover over 70 different scripts.
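To get a feel for how such a table works, here is a deliberately tiny Python sketch covering a subset of the Russian Cyrillic alphabet. The real ALA-LC tables include many more letters, ligature marks, and special rules, so always consult the official documents for actual cataloging:

```python
# A small illustrative slice of Cyrillic-to-Roman mapping in the spirit
# of the ALA-LC romanization tables (not a complete or authoritative table).
CYRILLIC_TO_ROMAN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "е": "e", "ж": "zh", "з": "z", "и": "i", "й": "ĭ",
    "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh",
}

def romanize(text):
    # Characters outside the table (spaces, punctuation) pass through as-is.
    return "".join(CYRILLIC_TO_ROMAN.get(ch.lower(), ch) for ch in text)

print(romanize("пушкин"))  # pushkin
```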
We’re fortunate that librarians love to share: there are quite a few sites produced by libraries that look at common bibliographic terms you’d find on title pages: numbers, dates, editions, statements of responsibility, price, etc.
To share two Canadian examples, Memorial University maintains a Glossary of Bibliographic Information by Language and Queen’s University has a page of Foreign Language Equivalents for Bibliographic Terms.
If you’ve ever seen the phrase “bibliographic knowledge of [language]” in a job posting, this is what it’s referring to—when you’ve cataloged enough material in a language to know these terms, but can’t carry on a conversation about daily life. I have bibliographic knowledge of Spanish, Italian, and German, but don’t ask me to go to a restaurant in Hamburg and order a hamburger.
Similar to bibliographic dictionaries, these are for terms common to specific subjects.
My university has significant music and map collections, so I often consult the language tools at Music Cataloging at Yale (…and I once thought music was the universal language) and the European Environment Agency’s Terminology and Discovery Service.
In order to ensure that accented characters and special symbols display properly in the catalog, it’s important to have the correct diacritic code.
It may also be worth coming up with a cheat sheet of the codes you use most frequently – for example, common French accents if you’re cataloging Canadian government documents, which are bilingual.
Many Integrated Library Systems also have diacritic charts built in, where you can select the symbol you need and click it to place it in the record.
Diacritic charts can be long and involved (the Unicode example above is a bit of a nightmare), so if you’re working with a new language, browsing through them searching for a specific code can be time-consuming. You can see the symbol in front of you, but have no idea what it’s called.
This is where Shapecatcher comes in. This utility allows you to draw a character using your mouse or tablet. It identifies possible matches for the symbol and gives you the symbol’s name and Unicode number.
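If the character is already in digital form (pasted from a record rather than seen on a printed page), Python's standard unicodedata module can do a similar lookup, returning the official Unicode name and code point:

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNKNOWN"))

print(describe("é"))  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(describe("ħ"))
# decomposition() shows how a precomposed letter splits into a base
# character plus a combining mark -- handy when diagnosing records
# whose encoding scheme doesn't match your system's:
print(unicodedata.decomposition("é"))  # 0065 0301
```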
Have you encountered issues handling different languages when cataloguing? Is there a free language tool you’d like to share? Tell us about it in the comments!
Credits: Image of Pieter Bruegel the Elder’s painting The Tower of Babel courtesy of the Google Art Project. Many thanks also to my colleagues Judy Harris and Vivian Zhang for sharing their language challenges and tools.
Much has been written about the significance of Twitter as the recent events in Ferguson echoed round the Web, the country, and the world. I happened to be at the Society of American Archivists meeting 5 days after Michael Brown was killed. During our panel discussion someone asked about the role that archivists should play in documenting the event.
There was wide agreement that Ferguson was a painful reminder of the type of event that archivists working to “interrogate the role of power, ethics, and regulation in information systems” should be documenting. But what to do? Unfortunately we didn’t have time to really discuss exactly how this agreement translated into action.
Fortunately the very next day the Archive-It service run by the Internet Archive announced that they were collecting seed URLs for a Web archive related to Ferguson. It was only then, after also having finally read Zeynep Tufekci‘s terrific Medium post, that I slapped myself on the forehead … of course, we should try to archive the tweets. Ideally there would be a “we” but the reality was it was just “me”. Still, it seemed worth seeing how much I could get done.

twarc
I had some previous experience archiving tweets related to Aaron Swartz using Twitter’s search API. (Full disclosure: I also worked on the Twitter archiving project at the Library of Congress, but did not use any of that code or data then, or now.) I wrote a small Python command line program named twarc (a portmanteau for Twitter Archive), to help manage the archiving.
You give twarc a search query term, and it will plod through the search results, in reverse chronological order (the order in which they are returned), while handling quota limits and writing out line-oriented JSON, where each line is a complete tweet. It worked quite well to collect 630,000 tweets mentioning “aaronsw”, but I was starting late out of the gate, 6 days after the events in Ferguson began. One downside to twarc is that it is completely dependent on Twitter’s search API, which only returns results for the past week or so. You can search back further in Twitter’s Web app, but that seems to be a privileged client; I can’t convince the API to keep going back in time past a week or so.
So time was of the essence. I started up twarc searching for all tweets that mention ferguson, but quickly realized that the volume of tweets, and the order of the search results, meant that I wouldn’t be able to retrieve the earliest tweets. So I tried to guesstimate a Twitter ID far enough back in time to use with twarc’s --max_id parameter, limiting the initial query to tweets before that point in time. Doing this I was able to get back to 2014-08-10 22:44:43 — most of August 9th and 10th had slipped out of the window. I used a similar technique of guessing an ID further in the future, in combination with the --since_id parameter, to start collecting from where that snapshot left off. This resulted in a bit of a fragmented record, which you can see visualized (sort of) below:
In the end I collected 13,480,000 tweets (63G of JSON) between August 10th and August 27th. There were some gaps because of mismanagement of twarc, and the data just moving too fast for me to recover from them: most of August 13th is missing, as well as part of August 22nd. I’ll know better next time how to manage this higher volume collection.
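One trick that can take some of the guesswork out of choosing a --max_id: Twitter's post-2010 "snowflake" IDs embed a millisecond timestamp in their high bits, so an approximate cutoff ID can be computed for any moment in time. A sketch, assuming the documented snowflake layout (custom epoch 1288834974657, timestamp shifted left 22 bits above the worker/sequence fields):

```python
import datetime

TWITTER_EPOCH_MS = 1288834974657  # Nov 2010, when the snowflake scheme began

def approx_tweet_id(dt):
    """Approximate the lowest snowflake ID for a given UTC datetime."""
    ms = int(dt.timestamp() * 1000) - TWITTER_EPOCH_MS
    return ms << 22  # timestamp sits above the 22 worker/sequence bits

# e.g. a --max_id cutoff for the start of August 10, 2014 (UTC):
cutoff = datetime.datetime(2014, 8, 10, tzinfo=datetime.timezone.utc)
print(approx_tweet_id(cutoff))
```

This only works for tweets created after the snowflake rollout; earlier IDs were sequential and carry no timestamp.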
Apart from the data, a nice side effect of this work is that I fixed a socket timeout error in twarc that I hadn’t noticed before. I also refactored it a bit so I could use it programmatically like a library instead of only as a command line tool. This allowed me to write a program to archive the tweets, incrementing the max_id and since_id values automatically. The longer continuous crawls near the end are the result of using twarc more as a library from another program.

Bag of Tweets
To try to arrange/package the data a bit I decided to order all the tweets by tweet id, and split them up into gzipped files of 1 million tweets each. Sorting 13 million tweets was pretty easy using leveldb. I first loaded all 13 million tweets into the db, using the tweet id as the key and the JSON string as the value:

```python
import json
import leveldb
import fileinput

db = leveldb.LevelDB('./tweets.db')

for line in fileinput.input():
    tweet = json.loads(line)
    db.Put(tweet['id_str'], line)
```
This took almost 2 hours on a medium ec2 instance. Then I walked the leveldb index, writing out the JSON as I went, which took 35 minutes:

```python
import leveldb

db = leveldb.LevelDB('./tweets.db')

for k, v in db.RangeIter(None, include_value=True):
    print v,
```
I am planning on trying to extract URLs from the tweets to try to come up with a list of seed URLs for the Archive-It crawl. If you have ideas of how to use it definitely get in touch. I haven’t decided yet if/where to host the data publicly. If you have ideas please get in touch about that too!
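For the URL extraction, a minimal sketch over the line-oriented JSON might look like this (it assumes the standard v1.1 tweet payload, where each tweet may carry an entities.urls list whose items have an expanded_url field):

```python
import json
from collections import Counter

def count_urls(lines):
    """Tally expanded URLs found in line-oriented tweet JSON."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        # entities.urls holds Twitter's own expansion of t.co short links
        for url in tweet.get("entities", {}).get("urls", []):
            if url.get("expanded_url"):
                counts[url["expanded_url"]] += 1
    return counts

# A tiny fabricated example line, just to show the shape of the data:
sample = ['{"id_str": "1", "entities": {"urls": '
          '[{"expanded_url": "http://example.com/a"}]}}']
print(count_urls(sample).most_common(5))
```

Ranking the tally with most_common() would surface the most-shared links as candidate seed URLs for the Archive-It crawl.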
Library Tech Talk (U of Michigan): Old Wine in New Bottles: Our Efforts Migrating Legacy Materials to HathiTrust
Library of Congress: The Signal: QCTools: Open Source Toolset to Bring Quality Control for Video within Reach
In this interview, part of the Insights Interview series, FADGI talks with Dave Rice and Devon Landes about the QCTools project.
In a previous blog post, I interviewed Hannah Frost and Jenny Brice about the AV Artifact Atlas, one of the components of Quality Control Tools for Video Preservation, an NEH-funded project which seeks to design and make available community oriented products to reduce the time and effort it takes to perform high-quality video preservation. The less “eyes on” time it takes to do QC work, the more time can be redirected towards quality control and assessment of video on the digitized content most deserving of attention.
In this blog post, I interview archivists and software developers Dave Rice and Devon Landes about the latest release version of the QCTools, an open source software toolset to facilitate accurate and efficient assessment of media integrity throughout the archival digitization process.
Kate: How did the QCTools project come about?
Devon: There was a recognized need for accessible & affordable tools out there to help archivists, curators, preservationists, etc. in this space. As you mention above, manual quality control work is extremely labor and resource intensive but a necessary part of the preservation process. While there are tools out there, they tend to be geared toward (and priced for) the broadcast television industry, making them out of reach for most non-profit organizations. Additionally, quality control work requires a certain skill set and expertise. Our aim was twofold: to build a tool that was free/open source, but also one that could be used by specialists and non-specialists alike.
Dave: Over the last few years a lot of building blocks for this project were coming in place. Bay Area Video Coalition had been researching and gathering samples of digitization issues through the A/V Artifact Atlas project and meanwhile FFmpeg had made substantial developments in their audiovisual filtering library. Additionally, open source technology for archival and preservation applications has been finding more development, application, and funding. Lastly, the urgency related to the obsolescence issues surrounding analog video and lower costs for digital video management meant that more organizations were starting their own preservation projects for analog video and creating a greater need for an open source response to quality control issues. In 2013, the National Endowment for the Humanities awarded BAVC with a Preservation and Access Research and Development grant to develop QCTools.
Kate: Tell us what’s new in this release. Are you pretty much sticking to the plan or have you made adjustments based on user feedback that you didn’t foresee? How has the pilot testing influenced the products?
Devon: The users’ perspective is really important to us and being responsive to their feedback is something we’ve tried to prioritize. We’ve had several user-focused training sessions and workshops which have helped guide and inform our development process. Certain processing filters were added or removed in response to user feedback; obviously UI and navigability issues were informed by our testers. We’ve also established a GitHub issue tracker to capture user feedback which has been pretty active since the latest release and has been really illuminating in terms of what people are finding useful or problematic, etc.
The newest release has quite a few optimizations to improve speed and responsiveness, some additional playback and viewing options, better documentation, and support for the creation of an XML-format report.
Dave: The most substantial example of going ‘off plan’ was the incorporation of video playback. Initially the grant application focused on QCTools as a purely analytical tool which would assess and present quantifications of video metrics via graphs and data visualization. Initial work delved deeply into identifying methodology to use to pick out the right metrics to find what could be unnatural to digitized analog video (such as pixels too dissimilar from their temporal neighbors, or the near-exact repetition of pixel rows, or discrepancies in the rate of change over time between the two video fields). When presenting the earliest prototypes of QCTools to users a recurring question was “How can I see the video?” We redesigned the project so that QCTools would present the video alongside the metrics along with various scopes, meters and visual tools so that now it has a visual and an analytic side.
Kate: I love that the Project Scope for QCTools quotes both the Library of Congress’s Sustainability of Digital Formats and the Federal Agencies Digitization Guidelines Initiative as influential resources which encourage best practices and standards in audiovisual digitization of analog material for users. I might be more than a little biased but I agree completely. Tell me about some of the other resources and communities that you and the rest of the project team are looking at.
Devon: Bay Area Video Coalition connected us with a group of testers from various backgrounds and professional environments so we’ve been able to tap into a pretty varied community in that sense. Also, their A/V Artifact Atlas has also been an important resource for us and was really the starting point from which QCTools was born.
Dave: This project would not at all be feasible without the existing work of FFmpeg. QCTools utilizes FFmpeg for all decoding, playback, metadata expression and visual analytics. The QCTools data format is an expression of FFmpeg’s ffprobe schema, which appeared to be one of the only audiovisual file format standards that could efficiently store masses of frame-based metadata.
Kate: What are the plans for training and documentation on how to use the product(s)?
Devon: We want the documentation to speak to a wide range of backgrounds and expertise, but it is a challenge to do that and as such it is an ongoing process. We had a really helpful session during one of our tester retreats where users directly and collaboratively made comments and suggestions to the documentation; because of the breadth of their experience it really helped to illuminate gaps and areas for improvement on our end. We hope to continue that kind of engagement with users and also offer them a place to interact more directly with each other via a discussion page or wiki. We’ve also talked about the possibility of recording some training videos and hope to better incorporate the A/V Artifact Atlas as a source of reference in the next release.
Kate: What’s next for QCTools?
Dave: We’re presenting the next release of QCTools at the Association of Moving Image Archivists Annual Meeting on October 9th, where we anticipate supporting better summarization of digitization issues per file in a comparative manner. After AMIA, we’ll focus on audio and the incorporation of audio metrics via FFmpeg’s EBUr128 filter. QCTools has been integrated into workflows at BAVC, Dance Heritage Coalition, MoMA, Anthology Film Archives and Die Österreichische Mediathek, so the QCTools issue tracker has been filling up with suggestions which we’ll be tackling in the upcoming months.
Open Knowledge Foundation: Why the Open Definition Matters for Open Data: Quality, Compatibility and Simplicity
The Open Definition performs an essential function as a “standard”, ensuring that when you say “open data” and I say “open data” we both mean the same thing. This standardization, in turn, ensures the quality, compatibility and simplicity essential to realizing one of the main practical benefits of “openness”: the greatly increased ability to combine different datasets together to drive innovation, insight and change.
This post explores in more detail why it’s important to have a clear standard in the form of the Open Definition for what open means for data.

Three Reasons
Quality: open data should mean the freedom for anyone to access, modify and share that data. However, without a well-defined standard detailing what that means we could quickly see “open” being diluted as lots of people claim their data is “open” without actually providing the essential freedoms (for example, claiming data is open but actually requiring payment for commercial use). In this sense the Open Definition is about “quality control”.
Compatibility: a shared standard ensures that your “open” is the same as my “open”, so that different open datasets can be connected and intermixed without their licenses conflicting.
Simplicity: a big promise of open data is simplicity and ease of use. This is not just in the sense of not having to pay for the data itself; it’s about not having to hire a lawyer to read the license or contract, not having to think about what you can and can’t do and what it means for, say, your business or your research. A clear, agreed definition ensures that you do not have to worry about complex limitations on how you can use and share open data.
Let’s flesh these out in a bit more detail:

Quality Control (avoiding “open-washing” and “dilution” of open)
A key promise of open data is that it can be freely accessed and used. Without a clear definition of what exactly that means (e.g. used by whom, for what purpose), there is a risk of dilution, especially as open data becomes more attractive to data users. For example, you could quickly find people putting out what they call “open data” that only non-commercial organizations can access freely.
Thus, without good quality control we risk devaluing open data as a term and concept, as well as excluding key participants and fracturing the community (as we end up with competing and incompatible sets of “open” data).
Compatibility
A single piece of data on its own is rarely useful. Instead data becomes useful when connected or intermixed with other data. If I want to know about the risk of my home getting flooded I need to have geographic data about where my house is located relative to the river and I need to know how often the river floods (and how much).
That’s why “open data”, as defined by the Open Definition, isn’t just about the freedom to access a piece of data, but also about the freedom to connect or intermix that dataset with others.
Unfortunately, we cannot take compatibility for granted. Without a standard like the Open Definition it becomes impossible to know if your “open” is the same as my “open”. This means, in turn, that we cannot know whether it’s OK to connect (or mix) your open data and my open data together (without consulting lawyers!) – and it may turn out that we can’t because your open data license is incompatible with my open data license.
Think of power sockets around the world. Imagine if every electrical device had a different plug and needed a different power socket: when I came over to your house I’d need to bring an adapter! Thanks to standardization, at least within a given country, power sockets are almost always the same, so I can bring my laptop over to your house without a problem. When you travel abroad, however, you may have to take an adapter with you. What drives this is standardization (or its lack): within your own country everyone has standardized on the same socket type, but different countries may not share a standard, and hence you need an adapter (or run out of power!).
For open data, the risk of incompatibility is growing as more open data is released and more and more open data publishers such as governments write their own “open data licenses” (with the potential for these different licenses to be mutually incompatible).
The Open Definition helps prevent incompatibility by:
- Setting out a set of clear principles that every open data license should conform to (not by mandating one single license – or even specific license terms)
- Running a dedicated process for reviewing and determining whether a license is conformant with the Open Definition’s principles
The Evergreen project will participate in the Outreach Program for Women, a program organized through the GNOME Foundation to improve gender diversity in Free and Open Source Software projects.
The Executive Oversight Board voted last month to fund one internship through the program. The intern will work on a project for the community from December 9, 2014 to March 9, 2015. The Evergreen community has identified five possible projects for the internship: three are software development projects, one is a documentation project, and one is a user experience project.
Candidates for the program have started asking questions in IRC and on the mailing list as they prepare to submit their applications, which are due on October 22, 2014. They will also be looking for feedback on their ideas. Please take the opportunity to share your thoughts with them on these ideas since it will help strengthen their application.
If you are an OPW candidate trying to decide on a project, take some time to stop into the #evergreen IRC channel to learn about our project and to get to know the people responsible for the care and feeding of Evergreen. We are an active and welcoming community that includes not only developers, but the sys admins and librarians who use Evergreen on a daily basis.
To get started, read through the Learning About Evergreen section of our OPW page. Try Evergreen out on one of our community demo servers, read through the documentation, and sign up for our mailing lists to learn more about the community. If you are planning to apply for a coding project, take some time to download and install Evergreen. Each project has an application requirement that you should complete before submitting your application. Please take time to review that requirement and find a way you can contribute to the project.
We look forward to working with you on the project!
From federal funding to support for school librarians to net neutrality, 2015 will be a critical year for federal policies that impact libraries. We need to be working now to build the political relationships necessary to make sure these decisions benefit our community. Fortunately, the November elections provide a great opportunity to do so.
In a new free webinar hosted by the American Library Association (ALA) and Advocacy Guru Stephanie Vance, leaders will discuss how all types of library supporters can legally engage during an election season, as well as what types of activities will have the most impact. Webinar participants will learn 10 quick and easy tactics, from social media to candidate forums, that will help you take action right away. If you want to help protect our library resources in 2015 and beyond, then this is the session for you. Register now as space is limited.
Webinar: Making the Election Connection
Date: Monday, October 6, 2014
Time: 2:00–2:30 p.m. EDT
The special student registration rate to the 2014 LITA National Forum has been extended through Monday October 6th, 2014. The Forum will be held November 5-8, 2014 at the Hotel Albuquerque in Albuquerque, NM. Learn more about the Forum here.
This special rate is intended for a limited number of graduate students enrolled in ALA accredited programs. In exchange for a discounted registration, students will assist the LITA organizers and the Forum presenters with on-site operations. This year’s theme is “Transformation: From Node to Network.” We are anticipating an attendance of 300 decision makers and implementers of new information technologies in libraries.
The selected students will be expected to attend the full LITA National Forum, Thursday noon through Saturday noon. This does not include the pre-conferences on Thursday and Friday. You will be assigned a variety of duties, but you will be able to attend the Forum programs, which include 3 keynote sessions, 30 concurrent sessions, and a dozen poster presentations.
The special student rate is $180 – half the regular registration rate for LITA members. This rate includes a Friday night reception at the hotel, continental breakfasts, and Saturday lunch. To get this rate you must apply and be accepted per below.
To apply for the student registration rate, please provide the following information:
- Complete contact information including email address,
- The name of the school you are attending, and
- 150 word (or less) statement on why you want to attend the 2014 LITA Forum
Please send this information no later than October 6, 2014 to email@example.com, with “2014 LITA Forum Student Registration Request” in the subject line.
Those selected for the student rate will be notified no later than October 10, 2014.
Library of Congress: The Signal: Beyond Us and Them: Designing Storage Architectures for Digital Collections 2014
The following post was authored by Erin Engle, Michelle Gallinger, Butch Lazorchak, Jane Mandelbaum and Trevor Owens from the Library of Congress.
The Library of Congress held the 10th annual Designing Storage Architectures for Digital Collections meeting September 22-23, 2014. This meeting is an annual opportunity for invited technical industry experts, IT professionals, digital collections and strategic planning staff and digital preservation practitioners to discuss the challenges of digital storage and to help inform decision-making in the future. Participants come from a variety of government agencies, cultural heritage institutions and academic and research organizations.
Throughout the two days of the meeting the speakers took the participants back in time and then forward again. The meeting kicked off with a review of the origins of the DSA meeting. It started ten years ago with a gathering of Library of Congress and external experts who discussed requirements for digital storage architectures for the Library’s Packard Campus of the National Audio-Visual Conservation Center. Now, ten years later, the speakers included representatives from Facebook and Amazon Web Services, both of which manage significant amounts of content and neither of which existed in 2004 when the DSA meeting started.
The theme of time passing continued with presentations by strategic technical experts from the storage industry who began with an overview of the capacity and cost trends in storage media over the past years. Two of the storage media being tracked weren’t on anyone’s radar in 2004, but loom large for the future – flash memory and Blu-ray disks. Moving from the past quickly to the future, the experts then offered predictions, with the caveat that predictions beyond a few years are predictably unpredictable in the storage world.
Another facet of time – “back to the future” – came up in a series of discussions on the emergence of object storage in up-and-coming hardware and software products. With object storage, hardware and software can deal with data objects (like files), rather than physical blocks of data. This is a concept familiar to those in the digital curation world, and it turns out that it was also familiar to long-time experts in the computer architecture world, because the original design for this was done ten years ago. Here are some of the key meeting presentations on object storage:
- Henry Newman – Instrumental, Inc., “Object Storage Developments” (PDF)
- David Anderson – Seagate, “Introduction to Kinetic Key Value Storage” (PDF)
- Sage Weil – Ceph, “Digital Preservation with Open Source” (PDF)
- Chris MacGown – Piston Cloud, “Object Storage: An update in brief” (PDF)
Several speakers talked about the impact of the passage of time on existing digital storage collections in their institutions and the need to perform migrations of content from one set of hardware or software to another as time passes. The lessons of this were made particularly vivid by one speaker’s analogy, which compared the process to the travails of someone trying to manage the physical contents of a car over one’s lifetime.
Even more vivid was the “Cost of Inaction” calculator, which provides black-and-white evidence of the cost of not preserving analog media over time, starting from the undeniable premise that you must pick an actual future “doomsday” date when all your analog media will be unreadable.
Several persistent time-related themes engaged the participants in lively interactive discussions during the meeting. One topic was the practical methods for checking the data integrity of content in digital collections. This concept, called fixity, has been a common topic of interest in the digital preservation community. Similarly, a thread of discussion on predicting and dealing with failure and data loss over time touched on a number of interesting concepts, including “anti-entropy,” a type of computer “gossip” protocol designed to query, detect and correct damaged distributed digital files. Participants agreed it would be useful to find a practical approach to identifying and quantifying types of failures. Are the failures relatively regular but small enough that the content can be reconstructed? Or are the data failures highly irregular but catastrophic in nature?
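The fixity checks described above boil down to a simple technique: record a cryptographic checksum when content is ingested, then periodically recompute and compare it. The sketch below is a generic illustration of that idea (the file contents are hypothetical, and this is not a method presented at the meeting):

```ruby
require "digest"

# At ingest time: record a checksum alongside the archived content.
content = "archived file contents"           # stands in for File.read of a real file
stored_checksum = Digest::SHA256.hexdigest(content)

# Later, during a scheduled fixity audit: recompute and compare.
# A mismatch signals silent corruption ("bit rot") or tampering.
current_checksum = Digest::SHA256.hexdigest(content)
if current_checksum == stored_checksum
  puts "fixity intact"
else
  puts "fixity failure: content has changed"
end
```

In a real repository the checksums would be stored in preservation metadata and the audit run across distributed copies, which is where anti-entropy protocols come in.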
Another common theme that arose is how to test and predict the lifetime of storage media. For example, how would one test the lifetime of media projected to last 1000 years without having a time-travel machine available? Participants agreed to continue the discussions of these themes over the next year with the goal of developing practical requirements for communication with storage and service providers.
The meeting closed with presentations from vendors working on the cutting edge of new archival media technologies. One speaker dealt with questions about the lifetime of media by serenading the group with accompaniment from a 32-year-old audio CD copy of Pink Floyd’s “Dark Side of the Moon.” The song “Us and Them” underscored how the DSA meeting strives to bridge the boundaries placed between IT conceptions of storage systems and architectures and the practices, perspectives and values of storage and preservation in the cultural heritage sector. The song playing back from three decade old media on a contemporary device was a fitting symbol of the objectives of the meeting.
Background reading (PDF) was circulated prior to the meeting and the meeting agenda and copies of the presentations are available at http://www.digitalpreservation.gov/meetings/storage14.html.
In 2012, Open Knowledge launched the Global Open Data Index to help track the state of open data around the world. We’re now in the process of collecting submissions for the 2014 Open Data Index and we want your help!
The main thing you can do is become a Contributor and add information about the state of open data in your country to the Open Data Index Survey. More details and quickstart guide to contributing here »
We also have other ways you can help:
Become a Mentor: Mentors support the Index in a variety of ways, from engaging new contributors and mentoring them to generally promoting the Index in their communities. Activities can include running short virtual “office hours” to support and advise other contributors, and promoting the Index with civil society organizations through blogging, tweeting, etc. To apply to be a Mentor, please fill in this form.
Become a Reviewer: Reviewers are specially selected experts who review submissions and check them to ensure information is accurate and up-to-date and that the Index is generally of high quality. To apply to be a Reviewer, fill in this form.
Mailing Lists and Twitter
The Open Data Index mailing list is the main communication channel for folks who have questions or want to get in touch: https://lists.okfn.org/mailman/listinfo/open-data-census
For twitter, keep an eye on updates via #openindex2014
Key dates for your calendar
We will kick off on September 30th, in Mexico City with a virtual and in-situ event at Abre LATAM and ConDatos (including LATAM regional skillshare meeting!). Keep an eye on Twitter to find out more details at #openindex14 and tune into these regional sprints:
- Europe / MENA / Africa (October 8-10) – with a regional Google Hangout on 9/10.
- Asia / Pacific (October 13-15) – with a regional Google Hangout on 13/10.
- All day virtual event to wrap-up (October 17)
More on this to follow shortly; keep an eye on this space.
Why the Open Data Index?
The last few years have seen an explosion of activity around open data and especially open government data. Following initiatives like data.gov and data.gov.uk, numerous local, regional and national bodies have started open government data initiatives and created open data portals (from a handful three years ago there are now nearly 400 open data portals worldwide).
But simply putting a few spreadsheets online under an open license is obviously not enough. Doing open government data well depends on releasing key datasets in the right way.
Moreover, with the proliferation of sites it has become increasingly hard to track what is happening: which countries, or municipalities, are actually releasing open data and which aren’t? Which countries are releasing data that matters? Which countries are releasing data in the right way and in a timely way?
The Global Open Data Index was created to answer these sorts of questions, providing an up-to-date and reliable guide to the state of global open data for policy-makers, researchers, journalists, activists and citizens.
The first initiative of its kind, the Global Open Data Index is regularly updated and provides the most comprehensive snapshot available of the global state of open data. The Index is underpinned by a detailed annual survey of the state of open data run by Open Knowledge in collaboration with open data experts and communities around the world.
The American Library Association (ALA) today announced the launch of “Progress in the Making,” (pdf) a new educational campaign that will explore the public policy opportunities and challenges of 3D printer adoption by libraries. Today, the association released “Progress in the Making: An Introduction to 3D Printing and Public Policy,” a tip sheet that provides an overview of 3D printing, describes a number of ways libraries are currently using 3D printers, outlines the legal implications of providing the technology, and details ways that libraries can implement simple yet protective 3D printing policies in their own libraries.
“As the percentage of the nation’s libraries helping their patrons create new objects and structures with 3D printers continues to increase, the legal implications for offering the high-tech service in the copyright, patent, design and trade realms continues to grow as well,” said Alan S. Inouye, director of the ALA Office for Information Technology Policy. “We have reached a point in the evolution of 3D printing services where libraries need to consider developing user policies that support the library mission to make information available to the public. If the library community promotes practices that are smart and encourage creativity, it has a real chance to guide the direction of the public policy that takes shape around 3D printing in the coming years.”
Over the coming months, ALA will release a white paper and a series of tip sheets that will help the library community better understand and adapt to the growth of 3D printers, specifically as the new technology relates to intellectual property law and individual liberties.
This tip sheet is the product of a collaboration between the Public Library Association (PLA), the ALA Office for Information Technology Policy (OITP) and United for Libraries, coordinated by OITP Information Policy Analyst Charlie Wapner. View the tip sheet (pdf).
The post ALA launches educational 3D printing policy campaign appeared first on District Dispatch.
If you are writing configuration to take a pattern to match against files in a file system…
You probably want Dir.glob, not regexes. Dir.glob is built into Ruby. Dir.glob’s unix-shell-style patterns are less expressive than regexes, but probably expressive enough for anything you need in this use case, and much simpler to work with for the common patterns in this use case.
…I don’t even feel like thinking about how to express, as a regexp, that you want files directly in a directory but not in its child directories.
Dir.glob will find matches from within a directory on the local file system. But if you have a file path in a string that you want to test for a match against a glob pattern, you can easily do that too with File.fnmatch (or Pathname#fnmatch), which does not even require the path to exist on the local file system.
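A minimal sketch of the two approaches, assuming a hypothetical config/ directory. One gotcha worth knowing: by default File.fnmatch’s * will match across “/” separators, unlike a shell glob, so pass File::FNM_PATHNAME to get Dir.glob-like behavior:

```ruby
# Dir.glob searches the local file system. "config/*.rb" matches Ruby
# files directly inside config/ (not in subdirectories); "config/**/*.rb"
# would descend into child directories as well.
matches = Dir.glob("config/*.rb")

# File.fnmatch tests a path *string* against a glob pattern; the file
# need not exist on disk.
File.fnmatch("config/*.rb", "config/settings.rb")         # => true

# Gotcha: with no flags, "*" matches "/" too, so this is true even
# though a shell glob would not match it.
File.fnmatch("config/*.rb", "config/locales/en.rb")       # => true

# File::FNM_PATHNAME makes "*" stop at "/", matching glob semantics.
File.fnmatch("config/*.rb", "config/locales/en.rb", File::FNM_PATHNAME)  # => false
```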
Some more info and examples from Shane da Silva, who points out some annoying inconsistent gotchas to be aware of.