Some stuff we liked on the web this week…
Our fearless leader @zittrain talks OPM data breach, right to be forgotten and the future of libraries
Pens: Like Awesome Box, but for museum collections
Always wanted a drum machine? Now you can have one, in your browser (h/t @maxogden)
Cool story about the inspiring work by USDS – “DARPA meets the Peace Corps meets SEAL Team Six!”
Thoughts on innovating inside an institution, featuring the Chicago Public Library
If you have heard about linked data, but you're not quite sure what it means, look no further. Find out what linked data is, why it is important and how it will transform the web.
From Tim Donohue, DSpace Tech Lead, DuraSpace
Winchester, MA If you were unable to attend the DSpace Interest Group sessions at either OR15 (in Indianapolis last week) or OAI9 (in Geneva yesterday), you may have heard some talk on Twitter and mailing lists about the DSpace Technology Roadmap (for 2015-2016) which was presented at those conferences.
- The video of this Roadmap talk (25 minutes) is available on YouTube:
DuraSpace News: NOW AVAILABLE: Fedora 3.8.1 for Institutions Running Fedora 3 as a Production Service
Winchester, MA On June 11, 2015 Fedora 3.8.1 was released with bug fixes and improvements that include performance improvements, code cleanup and reorganization, and upgrades of library dependencies to the most modern versions possible for stability. It should be noted that Fedora 3.8.1 is the FINAL release of the Fedora 3.x branch. Fedora 3.8.1 is primarily intended for institutions running Fedora 3 as a production service.
Last week was full of broadband action…and it’s worth keeping your eyes peeled for more to come as the FCC winds up for its next Universal Service Fund program modernization proceeding and Congress considers a new option for stymying network neutrality rules.
Here’s a quick rundown of recent activity and coming attractions:
1) Network neutrality rules take effect. You may not have noticed the government taking over your Internet, as asserted by many net neutrality foes, but the FCC’s Open Internet rules officially took hold Friday, June 12. While the first court challenge calling for an immediate stay of the rules was decided in the FCC’s favor, it will likely be years before we know the final fate of the rules. Unless, of course, Congress finds a way to defund FCC implementation of the rules. The ALA and Association of Research Libraries signed on to a joint letter opposing this move in advance of a House Appropriations hearing on June 17.
In the meantime, the FCC is moving ahead, yesterday naming Parul P. Desai to serve as the Open Internet ombudsperson, the public’s primary point of contact within the agency for formal and informal questions and complaints related to the Open Internet rules. Desai comes to the FCC Consumer and Government Affairs Bureau from the Consumers Union, where she served as policy counsel for media, telecommunications and technology policy.
2) FCC seeks to expand low-income telephone subsidy to broadband. As with the E-rate, the FCC has signaled (here and here, for instance) it will modernize its Lifeline program for 21st Century communications. The item is on the agenda for the FCC’s meeting Thursday, June 18. ALA joined dozens of other public interest and consumer advocates in laying out principles (pdf) for the Lifeline proceeding, including universality, excellence, consumer choice, innovation and transparency. ALA also previously met with FCC staff to discuss the role of libraries in supporting home broadband adoption and closing the “homework gap” for families with school-age children that lack home broadband access. Look for a Second Further Notice of Proposed Rulemaking to be approved on Thursday that will ask a series of questions related to how the Commission can best address the affordability barrier for low-income Americans. As new analysis from the Pew Research Center shows, only about 60% of families with incomes at or below $25,000 have high-speed connections at home.
3) Broadband Opportunity Council shares responses to its Request for Comments. The ALA also submitted comments related to how the federal government can promote broadband deployment, adoption and competition. In alignment with our Policy Revolution! initiative, we focus first on how federal agencies may leverage our nation’s libraries in support of national purposes. America’s libraries—well over 100,000 strong—are a critical national infrastructure with a long history of connecting people with each other and with diverse physical and increasingly digital resources. Why not leverage a nationwide trusted infrastructure already in place for which new services often may be implemented for only modest incremental costs? In addition, we recommend the council:
- Develop comprehensive solutions to the three “A’s” of broadband adoption challenges—access, affordability, and ability;
- Reduce or eliminate any barriers to competitive broadband providers. Competition is vital to creating affordable, future-proof broadband opportunity;
- Develop specific strategies to address the needs of rural and tribal communities;
- Enable smart transitions for e-government services; and
- Improve relevant data collection and sharing.
President Obama established the Broadband Opportunity Council in March to address regulatory barriers and encourage investment and training. IMLS is one of the federal agencies represented on the Council, which is charged with providing recommendations of actions that each of their agencies could take to identify and address regulatory barriers, incentivize investment, promote best practices, align funding decisions, and otherwise support wired broadband deployment and adoption.
For example, let's hypothesise that Tony Soprano were to start a bitcoin loan-sharking operation. The bitcoin network would have no way of distinguishing between bitcoins being transferred from his account with conditions attached (such as repayment in x amount of days, with x amount of points of interest, or else you and your family get yourselves some concrete boots) and those being transferred as legitimate and final settlement for the procurement of baked cannoli goods.
Now say you've lost all the bitcoin you owe to Tony Soprano on the gambling website Satoshi Dice. What are the chances that Tony forgets all about it and offers you a clean slate? Not high. Tony, in all likelihood, will pursue his claim with you.

She reports work by George K. Fogg at Perkins Coie on the legal status of Tony's claim:
Indeed, given the high volume of fraud and default in the bitcoin network, chances are most bitcoins have competing claims over them by now. Put another way, there are probably more people with legitimate claims over bitcoins than there are bitcoins. And if they can prove the trail, they can make a legal case for reclamation.
This contrasts considerably with government cash. In the eyes of the UCC code, cash doesn't take its claim history with it upon transfer. To the contrary, anyone who acquires cash starts off with a clean slate as far as previous claims are concerned. ... According to Fogg there is currently only one way to mitigate this sort of outstanding bitcoin claim risk in the eyes of US law. ... investors could transform bitcoins into financial assets in line with Article 8 of the UCC. By doing this bitcoins would be absolved from their cumbersome claim history.
The catch: the only way to do that is to deposit the bitcoin in a formal (a.k.a. licensed) custodial or broker-dealer agent account.

In other words, to avoid the lien problem you have to submit to government regulation, which is what Bitcoin was supposed to escape from. Government-regulated money comes with a government-regulated dispute resolution system. Bitcoin's lack of a dispute resolution system is seen in the problems Ross Ulbricht ran into.
Below the fold, I start from some of Kaminska's more recent work and look at another attempt to use the blockchain as a Solution to Everything.
Last month Kaminska, the author of the excellent From the annals of disruptive digital currencies past, reported on the emergence from stealth mode of a startup called 21 Inc., whose plan is to put Bitcoin mining hardware into every device in the Internet of Things. As she explained in a second post entitled 21 Inc. and the plan to kill the free Internet, the underlying goal is to replace the current "free" advertising-supported model with one based on making every Internet-connected device mine Bitcoin to provide the funds:
This isn't about disrupting fiat money, central banks or the existing financial rentier system. It's about making the internet much more like the financialised real world. Namely, by adding an energy and scarcity cost to digital transfers on the web so that information can't be as easily exploited as it is today.
Up for grabs, notably, is the marketshare of Google, Facebook and Twitter and their ilk, due to their dependency on free consumer data to drive their advertising-based revenue.

Now, I agree with the quote from this interview with Monica Chew that I used in Preserving the Ads?:
[Chew] further argues that advertising “does not make content free” but “merely externalizes the costs in a way that incentivizes malicious or incompetent players.” She cites unsafe plugins and malware examples that result in sites requiring more resources to load, which in turn translates to costs in bandwidth, power, and stability. “It will take a major force to disrupt this ecosystem and motivate alternative revenue models,” she added. “I hope that Mozilla can be that force.”

A replacement business model that lacked incentives for bad behavior would be a good thing. Perhaps Bitcoin mining in every device could be the major force, and would lack incentives for bad behavior, but I am doubtful.
Let's start by noting that the "free Internet" can't be killed because it doesn't exist. Advertising supports much of the content of the Web, but attempts to provide advertising-supported "free Internet" connectivity have never succeeded at scale. We pay for our connectivity at home, Stanford pays for our connectivity at work, Starbucks pays for the connectivity underlying their "free WiFi".
Perhaps 21 Inc. has the idea that an ISP could provide a home router that paid for its connectivity by mining Bitcoin. To pay for my home connectivity it would need, assuming current Bitcoin prices are stable, to mine one Bitcoin every 4.5 months for as long as I owned it.
Let's assume that 21 Inc. could initially deliver hardware with insignificant cost and power consumption that could mine at that rate. One problem has been that, driven by Moore's Law, the hash rate of the Bitcoin network was increasing rapidly, which meant that the useful life of Bitcoin mining hardware was very short, perhaps only 6 months. Equipment in the IoT, such as routers, has a long service life. No ISP is going to replace all its customers' routers every 6 months, so on 21 Inc's model they would indeed be providing most of their customers with "free Internet" and thus losing money.
To be viable, the router would have to mine enough Bitcoin in its first six months to pay for connectivity for, say, 6 years, or about 0.6 BTC/week. A current mining ASIC, the AntMiner S5, mines about 0.1 Bitcoin/week. So we would need 6 times the performance of state-of-the-art chips. One S5 consumes 590W and costs $370. Over 3.5 kW and $2,200 make for a rather expensive home router. It would take some serious chip magic to get this level of performance out of something that an ISP could afford to put in a router, and 21 Inc. would make more money using their magic chips to mine on their own behalf than they could extract from the ISP market.
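As a sanity check on the arithmetic above, here is a short Python sketch using the post's mid-2015 figures (connectivity costing about one Bitcoin per 4.5 months, a 6-month competitive life for mining hardware, and an AntMiner S5 mining roughly 0.1 BTC/week at 590W and $370). All the constants are rough assumptions from the text, not measurements:

```python
# Rough check of the router-mining arithmetic, using the post's
# mid-2015 figures. All constants are assumptions from the text.

MONTHS_PER_BTC = 4.5      # connectivity costs ~1 BTC per 4.5 months
SERVICE_YEARS = 6         # desired router service life
USEFUL_MONTHS = 6         # mining hardware stays competitive this long

# BTC the router must mine while still competitive to fund its whole life
btc_needed = SERVICE_YEARS * 12 / MONTHS_PER_BTC        # 16 BTC
useful_weeks = USEFUL_MONTHS * 52 / 12                  # ~26 weeks
weekly_rate = btc_needed / useful_weeks                 # ~0.6 BTC/week

# Compare against a state-of-the-art ASIC (AntMiner S5)
S5_BTC_PER_WEEK, S5_WATTS, S5_COST = 0.1, 590, 370
multiple = weekly_rate / S5_BTC_PER_WEEK                # ~6 S5-equivalents
print(f"{weekly_rate:.2f} BTC/week = {multiple:.1f}x an S5")
print(f"-> roughly {multiple * S5_WATTS / 1000:.1f} kW and ${multiple * S5_COST:,.0f}")
```

The post rounds to 6 S5-equivalents; either way the power and cost budget is orders of magnitude beyond what fits in a home router.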
For the last six months the hash rate has not been growing; it has been between 300 and 400 PHash/s. Doesn't this invalidate the argument that mining hardware has a short life? No, because as the introduction of the S5 shows, the development of new, better hardware has continued. Miners who can afford to buy state-of-the-art hardware mine more Bitcoin for a given energy input. The stable hash rate is not the result of big miners stopping getting bigger, it is the result of smaller miners with less efficient hardware being forced out of the market. The value of mining hardware (including 21 Inc's) is still dropping rapidly.
21 Inc. believes, as they say in this image, that they have finally made micropayments feasible. A satoshi is 10⁻⁸ BTC, so a chip that can mine a million satoshi a year at current prices generates an income of $2.25/yr. We are truly talking about micropayments. Not even close to enough to pay an ISP for connectivity. Not enough for three $0.99 songs. At my rate of $0.096/kWh, if the mining hardware drew 2.7W it would generate no net income.
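The same back-of-the-envelope check works for the micropayment numbers, again using the post's assumed figures (BTC at roughly $225 and electricity at $0.096/kWh):

```python
# Check the micropayment income and the power draw at which the
# electricity bill consumes it. Prices are mid-2015 assumptions.

BTC_USD = 225.0               # assumed Bitcoin price
SATOSHI_BTC = 1e-8            # 1 satoshi = 10^-8 BTC

# Income from mining a million satoshi per year
income_usd_per_year = 1_000_000 * SATOSHI_BTC * BTC_USD   # $2.25/yr

# Power draw at which electricity costs equal that income
USD_PER_KWH = 0.096
hours_per_year = 365 * 24
breakeven_watts = income_usd_per_year / (USD_PER_KWH * hours_per_year) * 1000

print(f"income: ${income_usd_per_year:.2f}/yr")
print(f"break-even draw: {breakeven_watts:.1f} W")        # ~2.7 W
```

So a mining chip drawing under 3 W already nets nothing at these prices, before counting the cost of the chip itself.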
The claim that micropayments are now feasible has been made many times in the last quarter-century, but micropayments have never succeeded. Emin Gun Sirer points to a remarkable, decade-old paper by Nick Szabo, an early participant in Bitcoin development, called Micropayments and Mental Transaction Costs. Szabo writes:
We have seen how customer mental transaction costs can derive from at least three sources: uncertain cash flows, incomplete and costly observation of product attributes, and incomplete and costly decision making. These costs will increasingly dominate the technological costs of payment systems, setting a limit on the granularity of bundling and pricing. Prices don't come for free.

Notice that 21 Inc's chips haven't addressed any of the issues Szabo raises. The owner of a device will have only a rough idea of the value it will generate. Because the chips are so feeble the chance that the owner's chip will ever actually mine a Bitcoin is effectively zero. They make sense only as components of a huge mining pool, which the device owner has to trust. The value of the chip's mining will depend not merely on the market for Bitcoin but also on how large a proportion of the total mining power the pool represents. Vendors offering services to chip owners, such as the ISP in the example above, will not know what their revenue will be worth. The chips add huge amounts of risk to everyone.
If we're going to replace the business model behind both the current Internet and the current Web the replacement needs to provide not just lower risk but also better security. The business model underlying connectivity is quite secure; ISPs don't seem to suffer significant fraud losses. That isn't true of the Web advertising business. Although there have been recent advances in some areas of blockchain-related security, such as in providing two-factor authentication for the wallets in which the bitcoins must be stored, fundamental problems remain.
In a longish but must-read piece entitled A Machine For Keeping Secrets? Vinay Gupta, who is working on a blockchain startup called Ethereum, points out the most fundamental problem with basing commerce, the Internet, or anything else important on the blockchain. It is the bane of all "reliable" distributed systems - inadequate diversity of the underlying technology:
There is more at risk than individual users being compromised and having their contracts spoofed. In a distributed system, there is a monoculture risk. If we have individual users being hacked because their laptops slip a version behind bleeding-edge security patches, that's bad enough. We have all heard tales of enormous numbers of bitcoins evaporating into some thief's pockets. But if we have only three major operating systems, run by >99% of our users, consider the risk that a zero-day exploit could be used to compromise the entire network's integrity by attacking the underlying consensus algorithms. If enough computers on the network say 2+2 = 5, the nature of blockchains is that 2+2 not only equals 5, but it always will equal five.

The blockchain is a brittle system. How realistic a problem is this? An attacker with zero-day exploits for each of the three major operating systems on which blockchain software runs could use them to take over the blockchain. There is a market for zero-day exploits, so we know how much it would cost to take over the blockchain. Good operating system zero-days are reputed to sell for $250-500K each, so it would cost about $1.5M to control the Bitcoin blockchain, currently representing nearly $3.3B in capital. That's 220,000% leverage! Goldman Sachs, eat your heart out.
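The leverage figure is simple arithmetic; under the stated assumptions (three dominant operating systems, roughly $500K per good zero-day, a Bitcoin market capitalization near $3.3B) it works out like this:

```python
# "Leverage" here is just capital controlled divided by attack cost,
# using the post's assumed figures.

ZERO_DAY_USD = 500_000        # assumed price of a good OS zero-day
N_OPERATING_SYSTEMS = 3       # the three dominant platforms
MARKET_CAP_USD = 3.3e9        # approximate Bitcoin capitalization

attack_cost = N_OPERATING_SYSTEMS * ZERO_DAY_USD          # $1.5M
leverage_pct = MARKET_CAP_USD / attack_cost * 100

print(f"${attack_cost:,} buys control of ${MARKET_CAP_USD:,.0f}")
print(f"leverage: {leverage_pct:,.0f}%")                  # 220,000%
```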
Brian Arthur's work two decades ago showed that technology markets with increasing returns to scale would be dominated by one or a small number of players. Events since then (CPUs, GPUs, search, social networks, Bitcoin mining pools, ...) have amply confirmed this. I agree with Gupta that work on capability systems such as Capsicum has the potential to greatly improve the security of the systems we depend on. But the underlying economics of the markets that provide us with these systems make the monoculture risk he describes unavoidable.
The forces squeezing diversity out are the same ones I discussed in Economies of Scale in Peer-to-Peer Networks, using Ittay Eyal and Emin Gun Sirer's analysis that Bitcoin and similar protocols need to be incentive-compatible:
the best strategy of a rational minority pool is to be honest, and a minority of colluding miners cannot earn disproportionate benefits by deviating from the protocol

This is a game-theoretic concept. In a recent series of posts Arvind Narayanan has been pursuing a game-theoretic analysis of Bitcoin:
- Consensus in Bitcoin: One system, many models "The branch of math that studies the behavior of interacting participants who follow their incentives is called game theory. This is the other main set of models that's been applied to Bitcoin. In this view, we don't classify nodes as honest and malicious. Instead, we assume that each node picks a (randomized) strategy to maximize its payoff, taking into account other nodes' potential strategies. If the protocol and incentives are designed well, then most nodes will follow the rules most of the time."
- Bitcoin and game theory: we're still scratching the surface shows using very simple arguments that the "block withholding" attack is profitable for the attacker, and thus that the Bitcoin protocol is not incentive-compatible.
- Bitcoin is a game within a game analyzes Bitcoin as a two-level game, the lower-level game played among the miner software implementations, and the upper-level game played at much lower speed among the humans running the miners.
- Does incentive-compatibility imply income linear in contribution? The answer is that anything faster than linear is not incentive-compatible.
- If not, are there incentive-compatible ways to deter large contributions? In principle, yes, but in practice networks that used them would not succeed.
- The simplistic version is, in effect, a static view of the network. Are there dynamic effects also in play? Yes, and they favor large miners.
It isn't clear to me that 21 Inc. has abolished economies of scale. What they are trying to do is to externalize some of the costs of mining (capital, power, communication) to device owners by burying them in the costs of running the device. Externalizing costs in this way has already been tried. Malware that infects home routers and PCs with mining software is fairly common. The malware suppliers' costs are low, so even the small amount of Bitcoin they can mine is profitable. The cost base for 21 Inc's "hardware malware" is higher.
The end of Kaminska's post loses me when she writes:
As it stands today, nobody really has an idea of what their data is worth because there’s no mechanism by which processed or unprocessed data bundles can be valued.
Yet we know for a fact that personal data, especially when processed in aggregate form, does have a value in the open market. If it didn’t, Facebook and Google wouldn’t be the multi-billion dollar organisations that they are today.
But the data market trades much more like highly illiquid OTC commodities than anything akin to an open market exchange. Deals seem to be done bilaterally, on a bespoke and opaque basis. There is no “processed data”-value index.

She goes on to suggest that:
While it’s true that the processed data in question is light on both information (due to so much of it being pseudonymous in nature), there’s no denying the energy it took to create it.
It’s this base energy cost that can now be used as a public benchmark to price more information-intensive data against.
This makes us think the key objective of the high-order bitcoin enthusiasts (as opposed to the financial speculators) is mostly about giving consumers a choice. On the one hand to pay for specially designed web services with spent processing time (and energy) that helps support the public digital commons (which acts as a glue that links up all sorts of different datasets). Or, on the other hand, to pay for open web services directly with personal data on a trust basis.
Either way, once a transparent price is established for the former, it stands to reason a price for the latter can also be equally derived, opening the door to a meritocratic paid-for internet and a market where all personal data has a clear market price.

Unless I'm totally misunderstanding her argument, it seems to be full of misconceptions. The value of "processed" data isn't linear in its size, or in the energy expended in "processing" it. It is true that the Bitcoin market sets a value on a very specific, but otherwise completely useless, computation that is related by the current state of chip technology to the lowest cost of electricity for data centers across the world. But apart from being highly variable, the energy costs of the Bitcoin computation have nothing to do with the energy costs of the computations that, for example, Google does to add value to the raw data they collect.
In 21 Inc's world, the idea is that end users generate Satoshi to pay sites not to collect their data (and how do the users know the site isn't?), not that sites pay end-users to be able to acquire data. The amount is set by 21 Inc, presumably uniformly, whereas in the real world the value of personal data depends strongly on the person. Personal information about, say, Warren Buffet or Barack Obama is worth a lot more than personal information about me.
Cathy O'Neill points out another problem with this concept. Current end user licenses allow the vendor to change the terms of the license at will:
As everyone knows, nobody reads their user agreements when they sign up for apps or services. Even if they did, it wouldn’t matter, because most of them stipulate that they can change at any moment.

The customer's only recourse is to stop using the service, and switch to a competitor, like that other eBay just down the street. But increasing returns to scale means that the competitors will be much smaller, less useful, or even non-existent. So even if your hardware is generating Satoshi to pay for the services you are using, the power imbalance between eBay and you, or in this case your home router, hasn't changed.
The 2015 NDSR residents have arrived! The launching of this next class of the NDSR Washington, D.C. residency program (the inaugural class was in 2013-14) began with a week-long orientation for the residents. The centerpiece event was the Opening Conference on June 10th, which took place in one of the historic rooms of the Jefferson Building, providing an auspicious backdrop for the official start of the residency.
The Opening Conference featured remarks from prior program participants as well as speeches by invited experts from the field. George Coulbourne, Program Officer at the Library of Congress, began the event by introducing the 2015 residents and their host organizations:
- John Caldwell, U.S. Senate, Historical Office
- Valerie Collins, American Institute of Architects
- Nicole Contaxis, National Library of Medicine
- Jaime Mears, D.C. Public Library
- Jessica Tieman, Government Publishing Office
Library of Congress Chief of Staff Robert Newlen gave the official Library welcome, noting the impressive qualifications of all the residents. Trevor Owens, formerly a Digital Archivist at the Library, is now a Senior Program Officer at the Institute of Museum and Library Services (IMLS) and gave some remarks on that agency’s behalf. Owens pointed out one of the big advantages of the NDSR – all the former residents now have jobs. His advice for the current residents was to connect with these former residents and continue building a community of practice. On a more practical note, he also advised them to figure out how to best navigate through their organization in order to get things done.
The first keynote speaker was Dan Russell, Google’s Uber Tech Lead for Search Quality and User Happiness, or as he described it, “I am a cyber-tribal-techno-cognitive-anthropologist.” In his talk, “The Future of Asking and Answering Questions” he stated that he is most interested in how people use tools and information to best understand their world, and that “we can build the future by understanding the past.” Russell stressed the importance of being able to find and recognize credible information, as well as cultivating an understanding of the “new language” (such as acronyms, emoticons) and multi-national information. He noted all these things have “changed the informational landscape.”
His parting thoughts were, “learn to ask the right questions, and learn what tools exist to answer those questions.”
Allison Druin, Chief Futurist at the University of Maryland, was the next keynote speaker with her talk, “Information 2020: The Future of Information.” She began with a demonstration – she took a selfie, and posed the question, “What is this all about? It’s about the now”. She wanted to emphasize the “near” future, and noted that her role as Futurist is “not to predict the future, but to help you prepare for the future.” She emphasized that “our information future HAS to include the needs of people”. Noting some key informational challenges, she then asked “how can we turn these into opportunities?” Her answers included mentorship and being resourceful with limited funding, which leads to creative partnership programs – such as the NDSR.
The next speaker was Jaime McCurry, former NDSR resident from the inaugural class, and now employed as Digital Assets Librarian at Hillwood Estate and Museum in Washington, D.C. She said “it’s surreal to stand here before the next group of residents.” In reminiscing about her time as a resident for that inaugural class, she said there is lots of “newness” to the experience; working with new people, organizations, projects and even perhaps a new city. Addressing the residents directly, she offered some advice from her perspective, including the need to be integrated into their organization as much as possible. “You may be surprised at how much the digital world touches every part of the organization’s activities,” she said. Noting her favorite parts of the residency, she said the best one was networking and sharing experiences with the rest of her cohort.
Prue Adler, Associate Executive Director for the Association of Research Libraries (ARL) then provided remarks from a former host institution perspective, from the inaugural year of the program. At ARL they encouraged complete participation in all aspects of the project – attendance at a wide range of meetings, and networking with as many people as possible. In describing the program overall she noted that a strong cohort is key to the program being a huge success. Many talented applicants resulted in very highly qualified residents. Adler emphasized that the success of the NDSR has a direct bearing on the growth of the profession as a whole.
In the closing remarks, Kris Nelson, Program Management Specialist at The Library of Congress, noted some additional advantages of the program. He discussed some of the positive outcomes of the pilot NDSR program, as highlighted in the professional assessment. Specifically, he noted how previous residents had accomplished what he referred to as “silo-busting” within each of their organizations. That is, residents had the ability to work across organizational programs, and break down perceived organizational barriers.
This opening event was followed by a brainstorming session for the residents led by Dan and Allison later that same day, followed by two days of intensive digital preservation workshops. The workshops were led by Library of Congress staff Abbey Potter, Phil Michel, Kathleen O’Neill, Andrew Cassidy-Amstutz, Erin Engle and Barrie Howard, as well as Lynne Thomas from Northern Illinois University. Five topic areas were chosen to provide some initial training and preparation for the organizational challenges ahead:
- The Digital Lifecycle
- Levels of Digital Preservation
- The POWRR project and tools discussion
- Organizational and Sustainability Issues
- Project Management
Each of these subject areas included a presentation, suggested advanced reading, and a specific activity designed to encourage a deeper experience with the subject matter. At the end of each of these sections, there was further discussion among the group at large. All of this was designed to create an interactive experience, encourage further thought into these areas, and in the end, help John, Valerie, Nicole, Jaime and Jessica to be prepared for their residency journey.
Starting this week, residents are beginning the real work on their projects at their designated host institutions. Stay tuned to The Signal over the coming months, as project updates will be posted from each of the residents.
For another perspective on the Opening Conference, see the recent post on the subject by the University of Maryland, Future of Information Alliance.
We are very excited to welcome nine new regular writers to the LITA blog!
Or: Anything goes. What are we thinking? An impression of ELAG 2015
This year’s ELAG conference in Stockholm was one of many questions. Not only the usual questions following each presentation (always elicited in the form of yet another question: “Any questions?”). But also philosophical ones (Why? What?). And practical ones (What time? Where? How? How much?). And there were some answers too, fortunately. This is my rather personal impression of the event. For a detailed report on all sessions see Christina Harlow’s conference notes.
The theme of the ELAG 2015 conference was: “DATA”. This immediately leads to the first question: “What is data?”. Or rather: “What do we mean by data?”. And of course: “Who is ‘we’?”.
In the current information professional and library perception ‘we’ typically distinguish data created and used for describing stuff (usually referred to as ‘metadata’), data originating from institutions, processes and events (known as ‘usage data’, ‘big data’), and a special case of the latter: data resulting from scholarly research (indeed: ‘research data’). All of these three types were discussed at ELAG.
It is safe to say however, that the majority of the presentations, bootcamps and workshops focused on the ‘descriptive data’ type. I try to avoid the use of the term ‘metadata’, because it is confusing, and superfluous. Just use ‘data’, meaning ‘artificial elements of information about stuff’. To be perfectly clear, ‘metadata’ is NOT ‘data about data’ as many people argue. It’s information about virtual entities, physical objects, information contained in these objects (or ‘content’), events, concepts, people, etc. We could only rightfully speak of ‘data about data’ in the special case of data describing (research) datasets. For this case ‘we’ have invented the job of ‘data librarian’, which is a completely nonsensical term, because this job is concerned with the storage, discoverability and obtainability of only one single object or entity type: research datasets. Maybe we should start using the job title ‘dataset librarian’ for this activity. But this seems a bit odd, right? On the other hand, should we replace the term ‘metadata librarian’ with ‘data librarian’? Also a bit odd. Data is at this moment in time what libraries and other information and knowledge institutions use to make their content findable and usable to the public. Let’s leave it at that.
This brings us to the two fundamental questions of our library ecosystem: “What are we describing?” and the mother of all data questions: “Why are we describing?”, which were at the core of what was, in my eyes, this year’s key presentation (not keynote!) by Niklas Lindström of the Swedish Royal Library/LIBRIS. I needed some time to digest the core assertions of Niklas’ philosophical talk, but I am convinced that ‘we’ should all be aware of the essential ‘truths’ of his exposition.
First of all: “Why are we describing?”. The objective of having a library in the first place is to provide access in any way possible to the objects in our collections, which in turn may provide access to information and knowledge. So in general we should be describing in order to enable our intended audience to obtain what they need from the collection. Or should that be in terms of knowledge? In real life ‘we’ describe for a number of reasons: because we follow our profession, because we have always done this, because we are instructed to do so, because we need guidance in our workflows, because the library is indispensable, because of financial and political reasons. In any case we should be clear about what our purposes are, because the purpose influences what we describe and how we do it.
Secondly: “What are we describing?”. Physical objects? Semi-tangible objects, like digital publications? Only outputs of processes, or also the processes themselves? Entities? Concepts? Representations? Relationships? Abstractions? Events? Again, we should be clear about this.
Thirdly (a Monty Python Spanish Inquisition moment ;-): “How are we describing?”. We use models, standards, formats, syntax, vocabularies in order to make maps (simplified representations of real world things) for reconciling differences between perceptions, bridging gaps between abstractions and guiding people to obtain the stuff they need. In doing so, Niklas says, we must adhere to Postel’s law, or the Robustness Principle, which states: “Be liberal in what you accept; be conservative in what you send”.
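Postel’s law translates directly into everyday data handling. Here is a minimal sketch in Python (my own illustration, not taken from any of the talks): a routine that is liberal in the date notations it accepts on input, but conservative in always emitting one canonical form on output.

```python
from datetime import datetime

def normalise_date(value: str) -> str:
    """Be liberal in what you accept: try several common date notations.
    Be conservative in what you send: always emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date notation: {value!r}")

print(normalise_date("14 June 2015"))  # -> 2015-06-14
print(normalise_date("2015-06-14"))    # -> 2015-06-14
```

The same shape (tolerant parsing in, strict serialisation out) applies to any of the maps and mappings discussed here, whatever the formats involved.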
Back to the technology, and the day-to-day implementation of all this. ‘We’ use data to describe entities and relationships of whatever nature. We use systems to collect and store the data in domain-, service- and time-dependent record formats in system-dependent datastores. And we create flows and transformations of data between these systems in order to fulfill our goals. For all this there are multiple standards.
Basically, my own presentation, “Datamazed – Analysing library dataflows, data manipulations and data redundancies”, targeted this fragmented data environment, describing the Library of the University of Amsterdam’s Dataflow Inventory project, which is leading to a Dataflow Repository that effectively functions as a map of mappings. “System of Systems (SoS)” was also the topic of the workshop I participated in, “What is metadata management in net-centric systems?” by Jing Wang.
Making sense of entities and relationships was the focus of a number of talks, especially the one by Thom Hickey about extending work, expression and author entities by way of data mining in WorldCat and VIAF, and the presentation by Jane Stevenson on the Jisc/Archives Hub project “Exploring British Design”, which entailed shifting focus from documents to people and organizations as connected entities. Some interesting observations about the latter project: the project team started by identifying what the target audience actually wanted and how they would go about getting the desired information (“Why are we describing?”) in order to arrive at the entity based model (“What are we describing?”). This means that any entity identified in the system can become a focus, or starting point, for pathways. A problem that became apparent is that the usual format standards for collection descriptions didn’t allow for events to be described.
Here we arrive at the critique of standards formulated by Rurik Greenall in his talk about the Oslo Public Library ILS migration project, where they are migrating from a traditional ILS to RDF based cataloguing. The starting point here is: know what you need in order to support your actual users, not some idealised standard user, and work with a number of simple use cases (“Why are we describing?”). Use standards appropriate for your users and use cases. Don’t be rigid; adapt. Use just enough RDF to support the use cases. Use only the part of the open source ILS Koha that handles specific use cases well (users and holdings). Users and holdings are a closed world, which can be dealt with using part of an existing system. Bibliographic information is an open world, which can be taken care of with RDF. The data model again corresponds to the use cases that are identified, and it grows organically as needed. Standards are only needed for communicating with the outside world, and we must not let them infect our data model (here we see Postel’s Law again).
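The open-world flavour of RDF-style bibliographic data can be illustrated with a toy sketch (my own, not the Oslo project’s actual model; the identifiers and predicates are invented): facts live as subject–predicate–object triples, and a description of an entity is simply whatever is currently known about it.

```python
# Hypothetical identifiers and predicates, for illustration only.
triples = {
    ("work:1", "dc:title", "Sult"),
    ("work:1", "dc:creator", "person:hamsun"),
    ("person:hamsun", "foaf:name", "Knut Hamsun"),
}

def describe(subject):
    """Gather everything currently known about one entity. Open world:
    a missing triple means 'unknown', not 'false', and the set can
    grow organically as new use cases demand new predicates."""
    return {p: o for s, p, o in triples if s == subject}

print(describe("work:1"))
```

Contrast this with a closed-world record format, where every record must fit one predefined schema; here a new predicate is just a new triple.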
A striking parallel can be seen in the Stockholm University Library project for integrating the open source ILS Koha, the Swedish LIBRIS union catalogue and a locally developed logistics and ILL system. Again, only one part of Koha is used, for specific functions, mainly because commercial ILSes do not allow purchasing individual modules. Integrated library systems, which seemed a good idea in the 1980s, just cannot cope with fragmented open world data environments.
Dedicated systems, like ILSes, whether commercial or open source, tend to force certain standards upon you. These standards apply not only to data storage (record formats etc.) but also to system structure (integrated systems, data silos). This became quite clear in the presentation about the CERN Open Data Portal, where the standard digital library system Invenio imposed the MARC bibliographic format, and describing research datasets for high energy physics in MARC turned out to be difficult if not impossible. Currently they are moving towards JSON (yet another data standard), because the system apparently supports that too.
With Open Source systems it is easier to adapt the standards to your needs than with proprietary commercial vendor systems. An example of this was given by the University of Groningen Library project where the Open Source Publication Repository software EPrints was tweaked to enable storage and description of research datasets focused on archeological findings, which require very specific information.
As the two ILS migration projects already demonstrated, deviating from standards of any kind can often be implemented quite easily. That is not always necessary, though. The locally developed Swedish Royal Library system for the legal deposit of electronic documents supports the available suitable metadata standards, like OAI, METS, MODS and PREMIS.
For the OER World Map project, presented by Felix Ostrowski, we can safely say that no standards were followed whatsoever, except using the discovery data format schema.org for storing data, which is basically also an adaptation of a standard. Furthermore, the original objective of the project was organically extended by using the data hub for all kinds of other end user services and visualisations beyond just the original world map of the locations of Open Educational Resources.
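To give an idea of what a schema.org-shaped record in such a data hub might look like, here is a rough sketch with entirely invented values (the OER World Map’s actual data model may well differ); the record is built as a plain Python dict and serialised as JSON-LD-style JSON.

```python
import json

# Invented example record using schema.org vocabulary terms.
oer = {
    "@context": "http://schema.org/",
    "@type": "Service",
    "name": "Example Open Courseware Portal",
    "location": {
        "@type": "Place",
        "geo": {"@type": "GeoCoordinates", "latitude": 59.33, "longitude": 18.07},
    },
}
print(json.dumps(oer, indent=2))
```

A record like this can feed the world map (via the coordinates) and any number of other end user services from the same store, which is presumably part of the appeal of using a generic vocabulary as the storage format.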
It should be clear that every adaptation of standards generates the need for additional mappings and transformations, on top of the ones already needed in a fragmented systems and data infrastructure for moving data around to various places for different services. Mapping and transformation of data can be done in two ways: manually, in the case of explicit, known items, and by mining, in the case of implicit, unknown items.
Manual mapping and transformation is of course done with dedicated software. The manual part consists of people selecting specific source data elements to be transformed into target data elements. This procedure is known as ETL (Extract, Transform, Load), and it implies the copying of data between systems and datastores, which always entails some form of data redundancy. Tuesday afternoon was dedicated to this practice, with three presentations: Catmandu and Linked Data Fragments by Ruben Verborgh and Patrick Hochstenbach; COMSODE by Jindrich Mynarz; and d:swarm by Thomas Gängler. Of these three, the first focused more on efficiently exposing data as Linked Open Data using the Linked Data Fragments protocol. An important aspect of tools like these is that we can move our accumulated knowledge of, and investment in, data transformation away from proprietary formats and systems to system and vendor independent platforms and formats.
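The ‘manual’ core of ETL, people deciding which source elements map onto which target elements, can be sketched in a few lines (the field tags and values below are hypothetical; real tools like Catmandu and d:swarm provide far richer transformation languages):

```python
# Source records in a flat, MARC-like notation (hypothetical field tags).
source_records = [
    {"245a": "Datamazed", "100a": "Koster, Lukas", "260c": "2015"},
]

# The human-made mapping: which source element becomes which target element.
FIELD_MAP = {"245a": "title", "100a": "creator", "260c": "date"}

def transform(record):
    """The 'T' in ETL: apply the element-level mapping."""
    return {tgt: record[src] for src, tgt in FIELD_MAP.items() if src in record}

def load(records, store):
    """The 'L' in ETL: copy into the target datastore. This copy is
    exactly where the data redundancy mentioned above comes from."""
    store.extend(records)

target_store = []
load([transform(r) for r in source_records], target_store)
print(target_store)
```

The point of the sketch is that `FIELD_MAP` is the human decision; everything around it is mechanical, and every run duplicates the data into yet another store.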
Data mining and text mining are used in the case of non-explicit data about entities and relationships: bodies of data and text are analyzed using dedicated algorithms in order to find implicit entities and relationships and make them explicit. This was already mentioned for Thom Hickey’s WorldCat entity extension work. It is also used in the InFoLiS2 project, where data and text mining is used to find relationships between research projects and scholarly publications.
Another example was provided by Rob Koopman and Shenghui Wang of OCLC Research, who analyzed keywords, authors and journal titles in the ArticleFirst database to generate proximity visualizations for greater serendipity.
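Proximity visualizations like these typically rest on some vector-similarity measure. A minimal sketch (my own illustration, not OCLC’s actual method): cosine similarity over term-count vectors, where two documents are ‘close’ when their term distributions point in the same direction.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors: 1.0 means identical
    direction, 0.0 means no terms in common."""
    dot = sum(a[term] * b[term] for term in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = Counter("linked data fragments linked data".split())
doc2 = Counter("linked open data cloud".split())
print(round(cosine(doc1, doc2), 2))  # -> 0.67
```

Computing such similarities pairwise over keywords, authors and journal titles yields the distances that a proximity visualization then lays out in two dimensions.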
As long as ‘we’ don’t or can’t describe these types of relationships explicitly, we have to use techniques like mining to extract meaningful entities and relationships and generate data. Even if we do create and maintain explicit descriptions, we will remain a closed world if we don’t connect our data to the rest of the world. Connections can be made by targeted mapping, as in the case of the Finnish FINTO Library Ontology service for the public sector, or by adopting an open world strategy making use of semantic web and linked open data instruments.
Furthermore, ‘we’ as libraries should continuously ask ourselves the questions “Why and what are we describing?”, but also “Why are we here?”. Should we stick to managing descriptive data, or should we venture out into making sense of big data and research data, and provide new information services to our audience?
Finally, I thank the local organizers, the presenters and all other participants for making ELAG2015 a smooth, sociable and stimulating experience.
Tonight at the supermarket the point-of-sale system wasn’t working—it gave two sad bongs when anyone tried to scan a debit or credit card. The cashier rebooted it (with the classic method of yanking out a cable and sticking it back in) and as it started up I was surprised to see it ran Linux.
Nice penguin there. Next it showed some COM settings like I haven’t seen in yonks. Terrible photos, I know, but I was in a Loblaws. Don’t judge.
Power cycling it didn’t help, but they rang me in at the next cash.
Wikipedia informs me that VeriFone “is an American multinational corporation headquartered in San Jose, California that provides technology for electronic payment transactions and value-added services at the point-of-sale” and that what I think is this line of point-of-sale systems “run Embedded Linux and use FST FancyPants and the Opera (browser) for their GUI platform.”
Earlier in the day I bought a bottle of pop from a machine and spent a while musing on the fact that I’d just spent $2.25 for a robot arm to deliver me a bottle of flavoured chemical-water. That’s two bucks too much for the chemicals, but robot arms being deployed widely, that’s something. I wonder what OS the arm runs.
The Digital Public Library of America is looking for excellent educators for its new Education Advisory Committee. We recently announced a new grant from the Whiting Foundation that funds the creation of new primary source-based education resources for student use with teacher guidance.
We are currently recruiting a small group of enthusiastic humanities educators in grades 6-14* to collaborate with us on this project. Members of this group will:
- build and review primary source sets (curated collections of primary sources about people, places, events, or ideas) and related teacher guides
- give feedback on the tools students and teachers will use to generate their own sets on DPLA’s website
- help DPLA develop and revise its strategy for education resource development and promotion in 2015-2016
If selected, participants are committing to:
- attend a 2-day in-person meeting on July 29–30, 2015 (arriving the night of July 28) in Boston, Massachusetts
- attend three virtual meetings (September 2015, November 2015, and January 2016)
- attend a 2-day in-person meeting in March 2016 in Boston, Massachusetts (dates to be selected in consultation with participants)
Participants will receive a $1,500 stipend for participation as well as full reimbursement for travel costs.
To learn more about the DPLA’s recent work with education, please read our research findings.
For questions, please contact email@example.com.
*Grades 13 and 14 are the first two years of college.
About the Digital Public Library of America
The Digital Public Library of America is a free, online library that provides access to millions of primary and secondary sources (books, photographs, audiovisual materials, maps, etc.) from libraries, archives, and museums across the United States. Since launching in April 2013, it has aggregated over 10 million items from over 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.
The Unix philosophy as a design heuristic has taken on a life of its own outside of Unix and its descendants, and it seems to be taking root in the community of library-focused open source software. The repository software community in particular seems to be moving away from monolithic all-in-one solutions toward modular systems in which many different pieces of software talk to each other and work together to achieve a common goal. Developers are taking successful non-library-focused open source software and co-opting it into their stacks. This “don’t reinvent the wheel” approach increases the overall quality of the system AND reduces developer time, since the chosen software already exists, works well and has its own development community. Consider the Solr search server and the Fuseki triplestore used by the Fedora Commons community. Had the Fedora developers tried to implement their own built-in search and triplestore capabilities, it would have taken far more developer time to reimplement something that already exists. By using externally developed software like Solr, the Fedora community can rely on a mature project that has its own development trajectory and community of bug squashers.
The Unix philosophy need not be tied just to software, either. I’ve found myself applying it in a managerial sense lately while working on certain projects, by creating small, focused teams that do one thing well and pass the work amongst themselves (as opposed to everyone just doing whatever). Another way of looking at it is understanding and capitalizing on the strengths of your coworkers: I’m okay at managing projects but our project manager is much better, and our project manager is okay at coding but I’m much better. It doesn’t make much sense for our project manager to spend time learning to code when he has me, just like it doesn’t make sense for me to focus on project management when I have him (of course there are benefits to cross-training, but in practice it’s a much smaller return on investment). We both serve the organization better by focusing on and developing our strengths, and leaning on others’ expertise for our weaknesses. This lets us work faster and increases the quality of our output, and in this day and age time and effort are as valuable and limited as the system memory and disk space of a 1970s mainframe.
This is the second episode Amanda and I – Michael – recorded, but the haphazard way we harvested the audio from our conversation left it wrecked. So it’s been lost, unheard. Way back in April 2014, we read Jim Shamlin’s “Satisfaction, Delight, Disappointment, and Shock.” We talk about inspiring delight through customer service and novelty. Other gems: we like Treehouse, jingles, book lists, WordPress, and free dessert.
Information Technology and Libraries: President's Column: Making an Impact in the Time That is Given to Us
Information Technology and Libraries: Exploratory Subject Searching in Library Catalogs: Reclaiming the Vision
Information Technology and Libraries: User Authentication in the Public Areas of Academic Libraries in North Carolina
The clash of principles between protecting privacy and protecting security can create an impasse between libraries, campus IT departments, and academic administration over authentication issues with the public area PCs in the library. This research takes an in-depth look at the state of authentication practices within a specific region (i.e. all the academic libraries in North Carolina) in an attempt to create a profile of those libraries that choose to authenticate or not. The researchers reviewed an extensive amount of data to identify the factors involved with this decision.