Open Knowledge Foundation: An unprecedented Public-Commons partnership for the French National Address Database
This is a guest post, originally published in French on the Open Knowledge Foundation France blog
Nowadays, being able to place an address on a map is essential. In France, where address data was still unavailable for reuse, the OpenStreetMap community decided to create its own National Address Database and release it as open data. The project rapidly gained attention from the government, leading to the signing last week of an unprecedented Public-Commons partnership between the National Institute of Geographic and Forestry Information (IGN), Groupe La Poste, the new Chief Data Officer and the OpenStreetMap France community.
In August, before the partnership was signed, we met with Christian Quest, coordinator of the project for OpenStreetMap France. He explained the project and its implications to us.
Here is a summary of the interview, previously published in French on the Open Knowledge Foundation France blog.

Why did OpenStreetMap (OSM) France decide to create an Open National Address Database?
The idea of creating an Open National Address Database came about one year ago, after discussions with the Association for Geographic Information in France (AFIGEO). An address register had been the topic of many reports, but these reports came and went without any follow-up, while more and more people were asking for address data on OSM.
Address data are indeed extremely useful. They can be used for route calculations or, more generally, to locate any point with an address on a map. They are also essential for emergency services – ambulances, fire-fighters and police forces are very interested in the initiative.
These data also help the OSM project itself, as they enrich the map and are used to improve the quality of existing data. Creating a register with so many entries required a collaborative effort, both to scale up and to be maintained, so the OSM France community naturally took it on. There was also a technical opportunity: OSM France had previously developed a tool to collect information from the French cadastre website, which enabled them to seed the register with a significant amount of data.

Was there no National Address Register project in France already?
It existed on paper and in slides, but nobody ever saw it get off the ground. It is, nevertheless, a relatively old project, launched in 2002 following the publication of a report on addresses by the CNIG. This report is quite interesting and most of its points are still valid today, but not much has been done since then.
IGN and La Poste were tasked with creating this National Address Register, but their commercial interests (selling data) have so far blocked this 12-year-old project. French address datasets did exist, but they were built for specific purposes rather than as a reference dataset for French addresses. For instance, La Poste uses three different address databases: one for mail, one for parcels, and one for advertisements.

Technically, how do you collect the data? Do you reuse existing datasets?
We currently use three main data sources: OSM, which gathers a bit more than two million addresses; the address datasets already available as open data (see list here); and, when necessary, the address data collected from the cadastre website. We also use FANTOIR data from the DGFIP, which lists all the street names and lieux-dits known to the Tax Office. This dataset is also available as open data.
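As a rough illustration of how overlapping sources like these can be consolidated, here is a minimal Python sketch of a merge-by-priority with deduplication on the (number, street, city) key. The field names, source labels and priority order are illustrative assumptions, not the actual BANO scripts:

```python
import csv

# Assumed priority: OSM first, since it is the only collaboratively
# editable source; open-data files and cadastre extracts fill the gaps.
SOURCE_PRIORITY = ["osm", "opendata", "cadastre"]

def merge_addresses(records):
    """Deduplicate address records coming from several sources.

    Each record is a dict with (hypothetical) keys:
    'source', 'number', 'street', 'city', 'lat', 'lon'.
    The first source in SOURCE_PRIORITY wins for each address.
    """
    merged = {}
    for rec in sorted(records, key=lambda r: SOURCE_PRIORITY.index(r["source"])):
        key = (rec["number"].lower(), rec["street"].lower(), rec["city"].lower())
        merged.setdefault(key, rec)  # keep the highest-priority record
    return list(merged.values())

def export_csv(records, path):
    """Package the harmonised records as a CSV export."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["number", "street", "city", "lat", "lon", "source"])
        writer.writeheader()
        writer.writerows(records)
```

A real pipeline would of course also handle fuzzy street-name matching and coordinate conflicts, but the priority-then-deduplicate shape is the core idea.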
These different sources are gathered in a common database. We then process the data to complete entries and remove duplicates, and finally package the whole thing for export. The aim is to provide harmonised content that brings together information from various sources, without redundancy. The process runs automatically every night, with the exception of manual corrections made by OSM contributors. The data are then made available as CSV files, shapefiles and in RDF format for semantic reuse. A CSV version is published on GitHub so that everyone can follow the updates. We also produce an overlay map which allows contributors to improve the data more easily. OSM is used in priority because it is the only source whose data we can edit collaboratively: if we need to add missing addresses, or correct them, we use OSM tools.

Is your aim to build the reference address dataset for the country?
This is a tricky question. What is a reference dataset? When more and more public services use OSM data, does that make it a reference dataset?
According to the definition of the French National Mapping Council (CNIG), a geographic reference must enable every reuser to georeference their own data. This definition does not favour any particular reuse; rather, its aim is to enable as much information as possible to be linked to the geographic reference. For the National Address Database to become a reference dataset, the data must become more exhaustive. Currently, there are 15 million reusable addresses (August 2014) out of an estimated total of about 20 million. We have more in our cumulative database, but our export scripts enforce a minimum of quality and coherency, and release data only after the necessary checks have been made. We are also working on the lieux-dits, which are not address points but are still used in many rural areas in France.
Beyond the question of the reference dataset, you can also see the work of OSM as complementary to that of public entities. IGN aims for homogeneous exhaustivity of its information, because its mission is to ensure equal treatment of territories. We do not have such a constraint: for OSM, the density of data on a territory depends largely on the density of contributors. This is why we can offer a level of detail that is sometimes superior, in particular in the main cities, but it is also why we are still missing data for some départements.
Finally, we think we are well prepared for the semantic web: we already publish our data in RDF format, using a W3C ontology close to the European INSPIRE model for address description.

The agreement includes a dual-licensing framework: you can reuse the data for free under the ODbL license, or pay a fee for a license without the share-alike clause. Is the share-alike clause an obstacle for the private sector?
I don't think so, because the ODbL license does not prevent commercial reuse. It only requires you to mention the source and to share any improvement of the data under the same license. For geographical data that aim to describe our territory, this share-alike clause is essential to keep the common dataset up to date. The land changes constantly, so data improvements and updates must be continuous, and the more people contribute, the more efficient this process is.
I see it as a win-win situation compared to the previous one, where multiple address datasets were maintained in closed silos, none of which was of acceptable quality for a key register, because it is difficult to stay up to date on your own.
However, for some companies share-alike is incompatible with their business model, and a dual-licensing scheme is a very good solution: instead of taking part in improving and updating the data, they pay a fee which will be used to improve and update the data.

And now, what is next for the National Address Database?
We now need to put in place tools to facilitate contribution and data reuse. On the contribution side, we want to set up a one-stop-shop application/API, separate from the OSM contribution tools, to enable everyone to report errors, add corrections or upload data. This kind of tool would make it easy to integrate partners into the project. On the reuse side, we should develop an API for geocoding and address autocompletion, because not everybody will want to manipulate millions of addresses!

As a last word, OSM is celebrating its tenth anniversary. What does that inspire in you?
First, the success and the power of OpenStreetMap lie in its community, much more than in its data. Our challenge is therefore to maintain and develop this community. This is what enables us to carry out projects such as the National Address Database, but also to be more reactive than traditional actors when needed, for instance during the current Ebola outbreak. Centralised and systematic approaches to cartography have reached their limits. If we want better and more up-to-date map data, we will need a more decentralised way of doing things, with more contributors on the ground. Here's to ten more years of the OpenStreetMap community!
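The geocoding and address-autocompletion API mentioned earlier in the interview could, at its simplest, be a prefix lookup over a sorted address index. The sketch below is a toy illustration with assumed names and an in-memory design; a real service would also normalise accents, rank results by relevance and return coordinates:

```python
import bisect

class AddressAutocomplete:
    """Minimal prefix autocompletion over a sorted address index (a sketch)."""

    def __init__(self, addresses):
        # addresses: iterable of strings such as "10 Rue de la Paix, Paris"
        self._orig = {a.lower(): a for a in addresses}
        self._index = sorted(self._orig)

    def complete(self, prefix, limit=5):
        """Return up to `limit` addresses starting with `prefix` (case-insensitive)."""
        prefix = prefix.lower()
        # Binary search for the first index entry >= prefix...
        start = bisect.bisect_left(self._index, prefix)
        results = []
        # ...then walk forward while entries still share the prefix.
        for entry in self._index[start:start + limit]:
            if not entry.startswith(prefix):
                break
            results.append(self._orig[entry])
        return results
```

For example, `AddressAutocomplete(["10 Rue de la Paix, Paris"]).complete("10 rue")` returns the full address; the sorted index plus binary search keeps each lookup fast even over millions of entries.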
Today, Federal Communications Commission (FCC) Chairman Tom Wheeler held a press call to preview the draft E-rate order that will be circulated at the Commission later this week. The FCC invited Marijke Visser, assistant director of the American Library Association’s (ALA) Program on Networks, to participate in the call. ALA President Courtney Young released a statement in response to the FCC activity, applauding the momentum:
ALA has worked extremely hard on this proceeding to move the broadband bar for libraries so that communities across the nation can more fully benefit from the E’s of Libraries™. That is, as Chairman Wheeler recognizes, libraries provide critical services to our communities across the nation relating to Education, Employment, Entrepreneurship, Engagement and Empowerment.
Of course, the extent to which communities benefit from these services depends on the broadband capacity our libraries have. Unfortunately, for all too many libraries, the bandwidth needed is either not available at all or it is prohibitively expensive.
But what Chairman Wheeler described today will go a long way towards changing the broadband dynamic. With support and guidance from our Senior Counsel, Alan Fishel, ALA stood fast behind our recommendations through many difficult rounds of discussions. After today we have every indication that ALA’s unwavering advocacy and determination over the past year and a half will add up to a series of changes for the E-rate program that will provide desperately needed increased broadband capacity for urban, suburban, and rural libraries across the country.
ALA applauds Chairman Wheeler for his strong leadership throughout the modernization proceeding in identifying a clear path to closing the broadband gap for libraries and schools and ensuring a sustainable E-rate program. The critical increase in permanent funding that the Chairman described during today's press call will help ensure that libraries can maintain the broadband upgrades we know the vast majority of our libraries are anxious to make. Moreover, the program changes referenced today—on top of those the Commission adopted in July—coupled with more funding, are without a doubt a win-win for libraries and, most importantly, for the people in the communities they serve.
Larry Neal, president of the Public Library Association, a division of ALA, and director of the Clinton-Macomb Public Library (MI), also commented on the FCC draft E-rate order.
“The well-connected library opens up literally thousands of opportunities for the people who walk through the doors of their local library,” said Neal. “Libraries are with you from the earliest years with family apps for literacy, through the school years with STEM learning labs, to collaborative workspaces and information resources for small businesses, entrepreneurs, and the next generation of innovators. This should be the story for every library and could be if they had the capacity they needed.”
The post ALA applauds strong finish to the E-rate proceeding appeared first on District Dispatch.
Among his observations are:
- "by some measures the US spends almost 50% more in telecom services than it does for electricity."
- Content is not king; "net of what they pay to content providers, US cable networks appear to be getting more revenue out of Internet access and voice services than out of carrying subscription video, and all on a far smaller slice of their transport capacity".
- True streaming video, with its tight timing constraints, is not a significant part of the traffic. Video is a large part, "but it is almost exclusively transmitted as faster-than-real-time progressive downloads". Doing so allows for buffering to lift the timing constraints.
- "The main function of data networks is to cater to human impatience." Thus "Overprovisioning is not a bug but a feature, as it is indispensable to provide low transmission latency". "Once you have overengineered your network, it becomes clearer that pricing by volume is not particularly appropriate, as it is the size and availability of the connection that creates most of the value."
- "it seems safe to estimate worldwide telecom revenues for 2011 as being close to $2 trillion. About half the revenue ... comes from wireless."
- "with practically all [wireline] costs coming from ... installing the wire to the end user, the marginal costs of carrying extra traffic are negligible. Hence charging according to the volume of traffic cannot easily be justified on the basis of costs."
- "a modern telecom infrastructure for the US, with fiber to almost every premise, would not cost more than $450 billion, well under one year's annual revenue. But there is no sign of willingness to spend that kind of money ... Hence we can indeed conclude that modern telecom is less about high capital investment and far more a game of territorial control, strategic alliances, services and marketing, than of building a fixed infrastructure."
- "Yet another puzzle is the claim that building out fiber networks to the home is impossibly expensive. Yet at the cost of $1,500 per household (in excess of the $1,200 estimate ... for the Google project in Kansas City, were it to reach every household), and at a cost of capital of 8% ..., this would cost only $10 per house per month. The problem is that managers and their shareholders expect much higher rates of return than 8% per year. One of the paradoxes is that the same observers who claim that pension funds cannot hope to earn 8% annually are also predicting continuation of much higher corporate profit rates."
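The $10-per-house figure in that last quote is easy to verify: amortising $1,500 per household interest-only at the quoted 8% cost of capital gives $120 a year, or $10 a month. A short Python check of the arithmetic:

```python
# Back-of-the-envelope check of the fiber build-out claim quoted above:
# $1,500 per household at an 8% annual cost of capital.
cost_per_household = 1500      # dollars
cost_of_capital = 0.08         # 8% per year

annual_cost = cost_per_household * cost_of_capital  # $120 per year
monthly_cost = annual_cost / 12                     # $10 per month
print(monthly_cost)  # 10.0
```

The point of the quote follows directly: at realistic costs of capital the build is cheap per subscriber, and the obstacle is the much higher rate of return managers and shareholders expect.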
Of the articles that were most frequently downloaded [from First Monday] in 1999, 6 of the top 10 were published in previous years! This supports the thesis that easy online access leads to much wider usage of older materials. [Section 9]

After an initial period, frequency of access does not vary with the age of the article, and stays pretty constant with time (after discounting for general growth in usage). [Section 10]

Now the Google Scholar team have followed their Rise of the Rest paper, which I blogged about here, with a validation of Odlyzko's prediction. Their new paper, On the Shoulders of Giants: The Growing Impact of Older Articles, takes another look at the effect that the dramatic changes as scholarly communication migrated to the Web have had on the behavior of authors. The two major changes have been:
- The greater accessibility of the literature, caused by digitization of back content, born-digital journals and pre-print archives, and relevance ranking by search engines.
- The great increase in the volume of publication, caused by the greatly reduced cost of on-line publication and the reduction of competition for space.
It's been a while since we last Met a Developer, but we're getting back into it with recent Islandora Camp CO instructor and discoverygarden, Inc. Team Lead Daniel Lamb. Most of Danny's contributions to Islandora's code have come to us by way of dgi's commitment to open source, but he did recently take on the Herculean task of coming up with the perfect one-line documentation to sum up the behavior of a tetchy delete button. Here's Danny in his own words:
Please tell us a little about yourself. What do you do when you're not at work?

When I'm not at work, I'm spending time with my wonderful family. I have a beautiful wife and an amazing two-year-old son, and they're what keeps me going when times are tough. I love cooking, and am very passionate about what I eat and how I prepare it. I also regularly exercise, and really enjoy lifting weights. I've got a great life going and I want to keep it for as long as possible! Academically, my background is in Mathematics and Physics, not Computer Science. But close enough, right? I've held jobs processing data for astronomers, crunching numbers as an actuary, and even making crappy Facebook games before landing at discoverygarden.

How long have you been working with Islandora? How did you get started?

I've been working with Islandora for about two years. I started because of my job with discoverygarden, which was kind enough to take me in after I was abused by the video game industry. The first thing I developed for Islandora was the testing code, which is how I got to learn the stack.

Sum up your area of expertise in three words:

Asynchronous distributed processing

What are you working on right now?

I've got my finger in a lot of pies right now. I'm managing my first project for discoverygarden, as well as finishing up the code for one of the longest-running projects in the company's history. It's for an enterprise client, and I've had to make a lot of innovations that I hope can eventually find their way back into the core software. I'm also working on a statistical model to help management with scoping and allocation. On top of all that, I'm researching frameworks and technologies for integrating with Fedora 4, which I hope to play a role in when the time finally comes.

What contribution to Islandora are you most proud of?

Most of the awesome stuff I've done has been for our enterprise client, so I can't talk about it. Well, I could, but then I'd have to kill you :P I guess as far as impact on the software in general, I'm most proud of the lowly IslandoraWebTestCase, which is working in every module out there to help keep our development head as stable as possible.

What new feature or improvement would you most like to see?

Asynchronous distributed processing :D When we make the move to Fedora 4 and Drupal 8, this concept should be at the core of the software. It's what will allow us to split the stack apart onto multiple machines to keep things running smoothly when we have to scale up and out.

What's the one tool/software/resource you cannot live without?

ZOMG I could never live without Vim! It's the greatest text editor ever! Put me in Eclipse or Netbeans and I'll litter :w's all over the place and hit escape a bunch of times unnecessarily. Vim commands have been burned into my lizard brain.

If you could leave the community with one message from reading this interview, what would it be?

You CAN contribute. I know the learning curve is steep, but you don't need a background in Computer Science to contribute. Pick up something small, and work with it until you feel comfortable. And if you're afraid to try your hand as a developer, there's always something to do *cough documentation cough*.