For most searches in our Net Archive, we have acceptable response times, due to the use of sparse faceting with Solr. Unfortunately, as well as expectedly, some of the searches are slow: minutes slow, in the worst case. Response time is tied to the number of hits: Getting the top-25 most popular links from pages about hedgehogs will take a few hundred milliseconds. Getting the top-25 links from all pages from 2010 takes minutes. Visualised, the response times look like this:
Everything beyond 1M hits is slow, everything beyond 10M hits is coffee time. Okay for batch analysis, but we’re aiming for interactive use.
Get the probably correct top-X terms by sampling
Getting the top-X terms for a given facet can be achieved by sampling: Instead of processing all hits in the result set, some of them are skipped. The result set iterator conveniently provides an efficient advance-method, making this very easy. As we will only use sampling with larger result sets, there should be enough data to be quite sure that the top-25 terms are the correct ones, although their counts are somewhat off.
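A minimal sketch of the sampling idea, not the Solr implementation: the result set is modelled as an in-memory sequence where each hit is a list of facet terms, and the iterator's advance method is imitated by jumping the position forward. The function name `sampled_top_terms`, the `take` and `skip` parameters, and the scaling of the sampled counts are all assumptions made for illustration.

```python
from collections import Counter


def sampled_top_terms(hits, top_x=25, take=100, skip=500):
    """Estimate the top-X facet terms by counting only part of the result set.

    `hits` is a hypothetical stand-in for the result set: one sequence of
    facet terms per matching document. Blocks of `take` documents are
    counted, then `skip` documents are jumped over.
    """
    if take < 1 or skip < 0:
        raise ValueError("take must be >= 1 and skip must be >= 0")
    counts = Counter()
    position = 0
    while position < len(hits):
        # Count the terms of the next block of documents.
        for terms in hits[position:position + take]:
            counts.update(terms)
        # Jump past the counted block and the skipped block.
        position += take + skip
    # Assumption: scale the sampled counts so they approximate full counts.
    scale = (take + skip) / take
    return [(term, round(count * scale))
            for term, count in counts.most_common(top_x)]
```

With `skip=0` every hit is processed and the counts are exact. With larger `skip` values the returned counts are only estimates, which is why a later fine count can still be useful.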
This of course all depends on how high X is in top-X, the concrete corpus, etc. The biggest danger is clusters of content in the corpus, which might be skipped. Maybe the skipping could be made in small steps? Process 100 documents, skip 500, process the next 100…? Tests will have to be made.
Fine count the top-X terms
With the correct terms isolated, precisely those terms can be fine counted. This is nearly the same as vanilla distributed faceting, with the exception that all shards must fine count all the top-X terms, instead of only the terms they had not already processed earlier.
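The fine-counting step could be sketched roughly as follows. This is an illustration under stated assumptions, not Solr's distributed faceting code: each shard is modelled as a plain list of hits, each hit as a list of facet terms in which a term appears at most once, and `fine_count` is a hypothetical name.

```python
from collections import Counter


def fine_count(shards, candidate_terms):
    """Count exactly, on every shard, only the candidate terms.

    `shards` is a hypothetical list of per-shard result sets; every shard
    counts all candidates, and the per-shard counts are summed.
    """
    wanted = set(candidate_terms)
    # Start every candidate at zero so terms with no hits are still reported.
    totals = Counter({term: 0 for term in wanted})
    for shard_hits in shards:
        for terms in shard_hits:
            for term in terms:
                if term in wanted:
                    totals[term] += 1
    return totals.most_common()
```

Because only the candidate terms are tracked, the counting structure stays small no matter how many distinct terms the result set contains.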
Of course the fine counting could be skipped altogether, which would be faster and potentially very usable for interactive exploratory use, where the exact counts do not really matter.
But there’s no guarantee?
No. Do remember that vanilla Solr distributed faceting is also best-effort, with the same guarantee as above: The terms are not guaranteed to be the correct ones, but their counts are.
Seems simple enough
Ticket #38 for sparse faceting has been opened and we could really use this in the Danish Net Archive Search. No promises though.
Note 2015-05-30
Knut Anton Bøckman mentioned on Twitter that Primo has a faceting mechanism that looks similar to my proposal. It seems that Primo uses the top-200 hits to select the facets (or rather terms?), then does a fine count on those.
It might work well to base the term selection on the top hits, rather than sampling randomly through all the hits, but I am afraid that 200 is so small a sample that some of the terms will differ from the right ones. I understand the need for a small number though: Getting the top-million hits or just top-hundred-thousand is costly.
All the following quotes are from the article:
In his keynote address, Lloyd Minor, MD, dean of the School of Medicine, defined a term, “precision health,” as “the next generation of precision medicine.” Precision health, he said, is the application of precision medicine to prevent or forestall disease before it occurs. “Whereas precision medicine is inherently reactive, precision health is prospective,” he said. “Precision medicine focuses on diagnosing and treating people who are sick, while precision health focuses on keeping people healthy.”
The fuel that powers precision health, Minor said, is big data: the merging of genomics and other ways of measuring what’s going on inside people at the molecular level, as well as the environmental, nutritional and lifestyle factors they’re exposed to, as captured by both electronic medical records and mobile-health devices.
This isn't just what would normally be thought of as medical data:
Precision health requires looking beyond medical data to behavioral data, several speakers said. This is especially true in a modern society where it is behavior, not infectious disease, that’s increasingly the cause of disability and mortality, noted Laura Carstensen, PhD, professor of psychology and founding director of the Stanford Center on Longevity.
But not to worry, we can now collect all sorts of useful data from people's smartphones:
That’s where mobile devices for monitoring everyday behavior can be useful in ways electronic health records can’t. Several speakers touched on the potential for using mobile-health devices to survey behavior and chronic disease and, perhaps, provide insights that could be used to support better behavior.
By monitoring 24/7 which room of one’s home one is in at any given minute over a 100-day period, you can detect key changes in behavior — changes in sleep-wake rhythms, for instance — that can indicate or even predict the onset of a health problem.
An expert in analyzing conversations, [Intel fellow Eric] Dishman recounted how he’d learned, for example, that “understanding the opening patterns of a phone conversation can tell you a lot,” including giving clues that a person is entering the initial stages of Alzheimer’s disease. Alternatively, “the structure of laughter in a couple’s conversation can predict marital trouble months before it emerges.”
If only we could get rid of these pesky privacy requirements:
“Medical facilities won’t share DNA information, because they feel compelled to protect patients’ privacy. There are legitimate security and privacy issues. But sharing this information is vital. We’ll never cure rare DNA diseases until we can compare data on large numbers of people. And at the level of DNA, every disease is a rare disease: Every disease from A to Z potentially has a genomic component that can be addressed if we share our genomes.”
The potential benefits of having this data widely shared across the medical profession are speculative, but plausible. But it's not speculative at all to state that the data will also be shared with governments, police, insurance companies, lawyers, advertisers and most of all with criminals. Anyone who has been paying the slightest attention to the news over the last few years cannot possibly believe that these vast amounts of extremely valuable data being widely shared among researchers will never leak, or be subpoenaed. Only if you believe "it's only metadata, there's nothing to worry about" can you believe that the data, the whole point of which is that it is highly specific to an individual, can be effectively anonymized. Saying "There are legitimate security and privacy issues. But ..." is simply a way of ignoring those issues, because actually addressing them would reveal that the downsides vastly outweigh the upsides.
Once again, we have an entire conference of techno-optimists, none of whom can be bothered to ask themselves "what could possibly go wrong?". In fact, in this case what they ought to be asking themselves is "what's the worst that could happen?", because the way they're going the worst is what is going to happen.
These ideas are potentially beneficial and in a world where data could be perfectly anonymized and kept perfectly secure for long periods of time despite being widely shared they should certainly be pursued. But this is not that world, and to behave as if it is violates the precept "First, do no harm" which, while strictly not part of the Hippocratic Oath, I believe is part of the canon of medical ethics.
MarcEdit provides lots of different ways for users to edit their data. However, one use case that comes up often is the ability to perform an action on a field or fields based on the presence of data within another field. While you can currently do this in MarcEdit by using tools to isolate the specific records to edit, and then working on just those items — more could be done to make this process easier. So, to that end, I’ve updated the Replace Function to include a new conditional element that will allow MarcEdit to presort using an in-string or regular expression query, prior to evaluating data for replacement. Here’s how it will work…
When you first open the Replace Window:
Notice that the conditional string text has been replaced. This was confusing to folks – because maybe that didn’t reflect exactly what was being done. Rather, this is an option that allows a user to run an in-string or regular expression search across the entire record before the Find/Replace is run. The search options grouped below *only* affect the Find/Replace textboxes. They do not affect the options that are enabled when the Perform Find/Replace If… box is checked. Those data fields have their own toggles for in-string (has) or regular expression (regex) matching.
If you check the box, the following information will be displayed:
Again – the If [Textbox] [REGEX] is a search that is performed and must evaluate as true in order for the paired find and replace to run. The use cases for this function are things like:
- I want to modify the field x but only if foobar is found in field y.
There are other ways to do this, such as extracting data from files and creating lots of different files for processing, or writing a script – but this will give users a great deal more flexibility when they want to perform an operation, but only if specific data is found within a field.
A simple example would be below:
This is not a real-world example, but it shows how the function works. A user wants to change the 050 field to an 090 field, but only if the data in the 945$a falls in the range m-z. That’s what the new option allows. By checking the Perform Find/Replace If option, I’m allowed to provide a pre-search that will then filter the data sets that I’m going to actually perform the primary Find/Replace pair on. Make sense? I hope so.
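The behaviour described above can be approximated outside MarcEdit with a short sketch. This is not MarcEdit's code: it assumes a record is available as a list of mnemonic-style field lines, the function name `conditional_replace` is hypothetical, and the example pattern for "945$a in the range m-z" is one possible reading of the condition.

```python
import re


def conditional_replace(record, condition, find, replace):
    """Run a regex find/replace on a record only if `condition` matches.

    `record` is assumed to be a list of field lines such as
    r"=050  \4$aQA76.73". The condition is tested against every field;
    the replacement is applied only if at least one field matches.
    """
    cond = re.compile(condition)
    if not any(cond.search(field) for field in record):
        # Pre-search failed: return the record unchanged.
        return list(record)
    pattern = re.compile(find)
    return [pattern.sub(replace, field) for field in record]


# Hypothetical record: retag 050 as 090 only when 945$a starts with m-z.
record = [r"=050  \4$aQA76.73", r"=945  \\$am"]
updated = conditional_replace(record, r"^=945.*\$a[m-z]", r"^=050", "=090")
print(updated)
```

Testing the condition first and touching the fields only afterwards mirrors the pre-search then Find/Replace order described in the post.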
Finally – I’ve updated the code around the task wizard so that this information can be utilized within tasks. This enhancement will be in the next available update.
You know you always wanted to be part of the cool gang, well now is your big chance. Be a part of creating the LITA AV club. Help make videos of important LITA conference presentations like the Top Tech Trends panel and LITA Forum Keynotes. Create the recordings to share these exciting and informative presentations with your LITA colleagues who weren’t able to attend. Earn the undying gratitude of all LITA members.
Sound right up your alley? We’ll need a couple of chief wranglers plus a bunch of hands-on folks. The group can organize via email now, and meet up in San Francisco, say sometime Friday or early Saturday (June 26th and 27th). Details galore to be worked on by all. If you have enough fun you can always turn the Club into a LITA Interest Group and achieve immortality, fame and fortune, or more likely the admiration of your fellow LITAns.
To get started email Mark Beatty at: firstname.lastname@example.org
I’ll gather names and contacts and create a “space” for you all to play.
Thanks. We can tell you are even cooler now than you were before you read this post.
The original ICP dates from 1961 and reads like a very condensed set of cataloging rules. [Note: As T Berger points out, this document was entitled "Paris Principles", not ICP.] It was limited to choice and form of entries (personal and corporate authors, titles). It also stated clearly that it applied to alphabetically sequenced catalogs:
The principles here stated apply only to the choice and form of headings and entry words -- i.e. to the principal elements determining the order of entries -- in catalogues of printed books in which entries under authors' names and, where these are inappropriate or insufficient, under the titles of works are combined in one alphabetical sequence.
The basic statement of principles was not particularly different from those stated by Charles Ammi Cutter in 1875.
Note that the ICP does not include subject access, which was included in Cutter's objectives for the catalog. Somewhere between 1875 and 1961, cataloging became descriptive cataloging only. Cutter's rules did include a fair amount of detail about subject cataloging (in 13 pages, as compared to 23 pages on authors).
The next version of the principles was issued in 2009. This version is intended to be "applicable to online catalogs and beyond." This is a post-FRBR set of principles, and the objectives of the catalog are given in points with headings find, identify, select, obtain and navigate. Of course, the first four are the FRBR user tasks. The fifth one, navigate, as I recall was suggested by Elaine Svenonius and obviously was looked on favorably even though it hasn't been added to the FRBR document, as far as I know.
The statement of functions of the catalog in this 2009 draft is rather long, but the "find" function gives an idea of how the goals of the catalog have changed:
It's worth pointing out a couple of key changes. The first is the statement "as the result of a search..." The 1961 principles were designed for an alphabetically arranged catalog; this set of principles recognizes that there are searches and search results in online catalogs, and it never mentions alphabetical arrangement. The second is that there is specific reference to relationships, and that these are expected to be searchable along with attributes of the resource. The third is that there is something called "secondary limiting of a search result." This latter appears to reflect the use of facets in search interfaces.
The differences between the 2015 draft of the ICP and this 2009 version are relatively minor. The big jump in thinking takes place between the 1961 version and the 2009 version. My comments (pdf) to the committee are as much about the 2009 version as the 2015 one. I make three points:
Although the ICP talks about "find," etc., it doesn't relate those actions to the form of the "authorized access points." There is no recognition that searching today is primarily on keyword, not on left-anchored strings.
2. Some catalog functions are provided by the catalog but not by cataloging
The 2015 ICP includes among its principles that of accessibility of the catalog for all users. Accessibility, however, is primarily a function of the catalog technology, not the content of the catalog data. It also recommends (to my great pleasure) that the catalog data be made available for open access. This is another principle that is not content-based. Equally important is the idea, which is expressed in the 2015 principles under "navigate" as: "... beyond the catalogue, to other catalogues and in non-library contexts." This is clearly a function of the catalog, with the support of the catalog data, but what data serves this function is not mentioned.
3. Authority control must be extended to all elements that have recognized value for retrieval
This mainly refers to the inclusion of the elements that serve as limiting facets on retrieved sets. None of the elements listed here are included in the ICP's instructions on "authorized access points," yet these are, indeed, access points. Uncontrolled forms of dates, places, content, carrier, etc., are simply not usable as limits. Yet nowhere in the document is the form of these access points addressed.
There is undoubtedly much more that could be said about the principles, but this is what seemed to me to be appropriate to the request for comment on this draft.
Peter Murray: Setting the Right Environment: Remote Staff, Service Provider Participants, and Big-Tent Open Source Communities
I was asked recently to prepare a 15-minute presentation on lessons learned working with a remote team hosting open source applications. The text of that presentation is below with links added to more information. Photographs are from DPLA and Flickr, and are used under Public Domain or Creative Commons derivatives-okay licenses. Photographs link to their sources.
Thank you for the opportunity to talk with you today. This is a description of a long-running project at LYRASIS to host open source software on behalf of our members and others in the cultural heritage community. The genesis of this project is member research done at the formation of LYRASIS from SOLINET, PALINET and NELINET. Our membership told us that they wanted the advantages of open source software but did not have the resources within their organization to host it themselves. Our goals were — and still are — to create sustainable technical infrastructure for open source hosting, to provide top-notch support for clients adopting that hosted open source, and to be a conduit through which clients engage in the open source community.
In the past couple of years, this work has focused on three software packages: the Islandora digital asset system, the ArchivesSpace system for archival finding aids, and most recently the CollectionSpace system for museum objects. Each of these, in sequence, involved new learning and new skills. First was Islandora. For those who are not familiar with Islandora, it is a digital asset system built atop Drupal and the Fedora Commons repository system. It is a powerful stack with a lot of moving parts, and that makes it difficult for organizations to set up. One needs experience in PHP and Drupal, Java servlet engines, SOLR, and Fedora Commons among other components. In our internal team those skills were distributed among several staff members, and we are spread out all over the country: I’m in central Ohio, there is a developer in California, a data specialist in Baltimore, a sysadmin in Buffalo, two support and training staff in Atlanta, and servers in the cloud all over North America.
Importance of Internal Communication
That first goal I mentioned earlier — creating a sustainable technical architecture — took a lot of work and experimentation for us. All of us had worked in library IT in earlier jobs. Except for our sysadmin, though, none of us had built a hosted service to scale. It was a fast-moving time, with lots of small successes and failures, swapping of infrastructure components, and on-the-fly procedures. It was hard to keep up. We took a page from the Scrum practice and instituted a daily standup meeting. The meeting started at 10:30am eastern time, which got our west coast person up just a little early, and — since we were spread out all over the country — used a group Skype video conference.
The morning standups usually took longer than the typical 15 minutes. In addition to everyone’s reports, we shared information about activities with the broader LYRASIS organization as well as informal things about our personal lives — what our kids were doing, our vacation plans, or laughing about the latest internet meme. It was the sort of sharing that would happen naturally when meeting someone at the building entrance or popping a head over a cubicle wall, and that helped cement our social bonds. We kept the Skype window open throughout the day and used the text chat function to post status updates, ask questions of each other, and share links to funny cat pictures. Our use of this internal communication channel has evolved over the years. We no longer have the synchronous video call every morning for our standup; we post our morning reports as chat messages. If we were to hire a new team member, I would make a suggestion to the team that we restart the video calls at least for a brief period to acclimate the new person to the group. We’ve also moved from Skype to Slack — a better tool for capturing, organizing, searching, and integrating our activity chatter. What started out as a suggestion by one of our team members to switch to Slack for the seven of us has grown organically to include about a third of the LYRASIS staff.
In their book “Remote: Office Not Required” the founders of 37Signals describe the “virtual water cooler”. They say that the idea is to have a single, permanent chat room where everyone hangs out all day to shoot the breeze, post funny pictures, and generally goof around. They acknowledge that it can also be used to answer questions about work, but its primary function is to provide social cohesion. With a distributed team, initiating communication with someone is an intentional act. It doesn’t happen serendipitously by meeting at a physical water cooler. The solution is to lower the barrier of initiating that communication while still respecting the boundaries people need to get work done.
How does your core team communicate among itself? How aware are they of what each other are doing? Do they know each other’s strengths and feel comfortable enough to call on each other for help? Do the members share a sense of forward accomplishment with the project as a whole?
Clear Demarcation of Responsibilities between Hosting Company and Organizational Home
One of the unique things about the open source hosting activity at LYRASIS is that for two of the projects we are also closely paired with organizational homes. Both ArchivesSpace and CollectionSpace have separate staff within LYRASIS that report to their own community boards and have their own financial structure. LYRASIS provides human resource and fiscal administrative services to the organizational homes, and we share resources and expertise among the groups. From the perspective of a client to our services, though, it can seem like the hosting group and the organizational home are one entity. We run into confusion about why we in the hosting group cannot add new features or address bugs in the software. We gently remind our clients that the open source software is bigger than our hosting of it: there is an organizational home that is advancing the software for all users and not just our hosted clients.
Roles between open source organizations and hosting companies should be clearly defined as well, and the open source organization must help hosting providers make this distinction clear to the provider’s clients as well as self-hosting institutions. For instance, registered service provider agreements could include details for how questions about software functionality are handed off between the hosting provider and the organizational home. I would also include a statement from the registered service provider about the default expectations for when code and documentation will be contributed back to the community’s effort. This would be done in such a way as to give a service provider an avenue to distinguish itself from others while also strengthening the core community values of the project. While there is significant overlap, there are members of ArchivesSpace that are not hosted by LYRASIS and there are clients hosted by LYRASIS that are not members of ArchivesSpace.
How does your project divide responsibilities between the community and the commercial affiliates? What are the expectations that hosted adopters should have about the roles of support, functionality enhancement, and project governance?
Empowering the Community
Lastly, one of the clear benefits of developing software as open source is the shared goals of the community participants. Whether someone is a project developer, a self-hosted user of the software, a service provider, or a client hosted by a service provider, everyone wants to see the software thrive and grow. While the LYRASIS hosting service does provide a way for clients to use the functionality of open source software, what we are really aiming to offer is a path for clients to get engaged in the project’s community by removing the technology barriers to hosting. We are selling a service, but sometimes I think the service that we are selling is not necessarily the one that the client is initially looking for. What clients come to us seeking is a way to make use of the functions that they see in the open source software. What we want them to know is how adopting open source software is different. As early as the first inquiry about hosting, we let clients know that the organizational home exists and offer to make an introduction to the project’s community organizer. When a project announces important information, we reflect that information on to our client mailing list. When a client files an issue in the LYRASIS hosting ticket system for an enhancement request, we forward that request to the project but we also urge the client to send the description of their use case through the community channels.
Maintaining good client support while also gently directing the client into the community’s established channels is a tough balancing act. Some clients get it right away, and become active participants in the community. Others are unable or unwilling to take that leap to participation in the project’s greater community. As a hosting provider we’ve learned to be flexible and supportive wherever the client is on its journey in adopting open source. Open source communities need to be looking for ways a hosted client’s staff, no matter what the level of technical expertise, can participate in the work of the community.
Do you have low barriers of entry for comments, corrections, and enhancements to the documentation? Is there a pipeline in place for triaging issue reports and comments that both helps the initiator and funnels good information into the project’s teams? And is that triaging work valued on par with code contributions? Can you develop a mentoring program that aids new adopters into the project’s mainstream activities?
Conclusion
As you can probably tell, I’m a big believer in the open source method of developing the systems and services that our patrons and staff need. I have worked on the adopter side of open source, using DSpace and Fedora Commons and other library-oriented software…to say nothing of more mainstream open source projects. I have worked on the service provider side of open source, making Islandora, ArchivesSpace and CollectionSpace available to organizations that cannot host the software themselves and empowering them to join the community. Through this experience I’ve learned a great deal about how many software projects think, what adopters look for in projects, and what other service providers need to be successful community participants. Balancing the needs of the project, the needs of self-hosted adopters, and the needs of service providers is delicate, but the results are worth it.
I also believe that by using new technologies and strategies, distributed professionals can build a hosting service that is attractive to clients. We may not be able to stand around a water cooler or conference table, but we can replicate the essence of those environments with tools, policies, and a collaborative attitude. In doing so we have more freedom to hire the staff that make the right fit for our organization, no matter where they are located.
While writing my last post about what I wish I had known upon graduating (from library school), I decided that I wanted to write a companion piece about what I was glad I didn’t know. Perhaps the reason for this will soon become clear. So here we go:
- You know nothing. No, seriously, you don’t. You know all that time you just spent learning how to connect to a Dialog database via a 300-baud acoustic coupler modem? That’s worth 3-5 years, tops. Then it’s toast. What comes next YOU HAVE NO IDEA. So just stop with the anguish, and meet it like a woman.
- I mean, seriously, YOU KNOW NOTHING. All that stuff you wrote papers about? Gone, too late. Something else is on the horizon, about to mow you down like so much new grass.
- If you can’t learn constantly, like ALL THE TIME, then you are toast. BITNET, UUNET, bulletin boards, WAIS, Gopher, Hytelnet, Veronica, ALL of these would come and go within a small number of years. You will learn them, forget them, and bury them all within a decade. Have a nice life.
- You should have taken more cataloging courses. I was more of a public services kind of person, but frankly, public services was completely transformed during the period when we still had the MARC record. So more cataloging courses regarding MARC would have had more staying power than how to give bibliographic instruction talks. I’m sorry to say that I actually would wheel in a book truck with the Reader’s Guide, Biography Index, and other such print volumes to pitch to students who within a year would be lining up to search INFOTRAC instead. Talk about futility.
- You are being disintermediated. None of us at the time could possibly have predicted Google. I mean not even close. We’re talking about a period before AltaVista, which I would argue was the first web search engine that actually worked well. I lived through the era where we switched from people having to come to us to us meeting people where they are (via mobile, out in the community, etc.). This isn’t a bad thing, but I can assure you it wasn’t what we expected when we graduated in 1986.
- There are fewer jobs than you think there are. I graduated into a job situation where there were plenty of professionals who graduated ahead of me who already had jobs and weren’t going to give them up for decades. We still have this issue. Only in the last few years have we begun to see the beginning of the tidal wave of Baby Boomer retirements that will open up positions for new librarians.
- The future jobs will be more technical or more community oriented than you may expect. If you look at today’s jobs, which really are only a trend that started back close to when I graduated, you will see a distinct shift toward two directions: technical positions, whether it be a software developer or a digital archivist, or toward engaging a community like a university-based “embedded librarian” or someone who serves teens at a public library. The point is that either you are public-facing or you are highly skilled in new technical requirements.
- Personal connections are much more important than a degree. When you obtain your degree you may think that you are done, that you have punched your card and you are good. In reality, it is only just the beginning. What you really need are connections to others. You need to know people who also know you. That is why I am so focused on mentoring young librarians. Young librarians need to focus on building networks of both peers and potential mentors. Who can help you be successful? Who can give you a needed recommendation? Seek these people out and make a connection.
Finally, on a personal note, there is one last thing that I am glad I didn’t know upon graduation. But first I must explain. To get through library school I did the following:
- Worked 30 hours a week as an Evening/Weekend Circulation Assistant as UC Berkeley staff.
- Drove 5 1/2 hours each way nearly every weekend to visit my wife working in Northern California (Arcata).
- Took a one-calendar year full course load MLIS program at UC Berkeley.
Doing this meant that I would leave on Friday for Arcata with an aching back and barely able to stay awake, having gotten perhaps 5 hours of sleep for five nights straight.
As it turns out, this was all simply good training for having twins. It’s really remarkable how life turns out sometimes.
The Access 2015 Organizing Committee is thrilled to announce that our keynote speaker is Amy Buckland!
Amy recently moved to the University of Chicago where she is the Institutional Repository Manager. Previously she was Coordinator of Scholarly Communications at McGill University Library, where she was responsible for open access initiatives, publishing support, copyright, digital humanities, and research data services. She has a bachelor of arts degree from Concordia University (Montreal) where she studied political science and women’s studies, and an MLIS from McGill University. Prior to joining the library world, she worked in publishing for 14 years, and thinks academic libraryland is ripe for a revolution.
Given Amy’s long-standing participation at and with Access over the years and her involvement in Canadian – and now American – library tech communities, the Organizing Committee was unanimous in its opinion that Amy would be a perfect choice to anchor the conference. For more information, check out her website or connect with her on Twitter at @jambina.
Keep an eye out for the announcement of our 2015 David Binkley Memorial Lecture speaker, coming soon!
Is it time to organize a digital constitutional convention on the future of the internet? In a thought-provoking op-ed published in The Hill, Alan S. Inouye, director of the American Library Association’s (ALA) Office for Information Technology Policy (OITP), calls on the nation’s leaders in government, philanthropy, and the not-for-profit sector to convene a digital constitutional convention for the future of the internet.
Today we stand at the crossroads of establishing digital society for generations to come. By now, it is clear to everyone—not just network engineers and policy wonks—that the Internet is at the same time a huge mechanism for opportunity and for control. Though the advent of the Internet is propelling a true revolution in society, we’re not ready for it. Not even close.
For one thing, we are so politically polarized at the national level. The latest evidence: the net neutrality debate. Except it wasn’t. For the most part, it was characterized by those who favor assertive regulatory change for net neutrality stating their position, restating their position, then yelling out their position. Those arguing for the status quo policy did likewise. As the battle lines were drawn, there was little room to pragmatically consider a compromise advanced by some stakeholders.
The current state of digital privacy seems along these lines, as well. With copyright, it is even worse, as a decades-long “debate” has those favoring the strongest copyright protection possible dominating the discourse.
Another problem, as the preceding discussion suggests, is that issues clearly related to each other—such as telecommunications, privacy, and copyright—are debated mostly in their own silos. We need a radically different approach to address these foundational concerns that will have ramifications for decades to come. We need something akin to the Constitutional Convention that took place 228 years ago. We need today’s equivalents of George Washington, James Madison, and Benjamin Franklin to come together and apply their intellectual and political smarts—and work together for the good of all, to lay out the framework for many years to follow.
Most people in the country are in the middle of the political spectrum. They (We) are reasonable people. We want to do as we please, but realize that we don’t have the right to impinge on others’ freedom unduly. We’re Main Street USA.
This sounds so simple and matter-of-fact, but we inside-the-beltway people understand how hard this is to achieve in national policy. We just look outside our windows and see the U.S. Capitol and White House to remind ourselves of the challenge of achieving common-sense compromise in a harsh political climate.
Of course this is not easy. Some of us think of the challenge we need to address as access to information in the digital society, but really we’re talking about the allocation of power—so the stakes are even higher than some may think.
In a number of respects, power is more distributed in digital society. Obviously, laws, regulation, and related public policy remain important. Large traditional telecommunications and media companies remain influential. But now the national information industry includes Google, Apple, Facebook, Microsoft, and other major corporate players who also effectively make “public policy” through product decisions. Similarly, the continuing de facto devolvement from copyright law to a licensing regime (with the rapid growth of ebooks as the latest major casualty) also is shifting power from government to corporations. In some respects, individuals also have more power thanks to the proliferation of digital information and the internet that enable capabilities that previously only organizations could muster (e.g., publishing, national advocacy).
The post It’s time that we had a digital Constitutional Convention appeared first on District Dispatch.
The following guest post is a collaboration by Joanna DiPasquale (Vassar College), Amy Bocko (Wheaton College), Rachel Appel (Bryn Mawr College) and Sarah Walden (Amherst College), based on their panel presentation at the recent Personal Digital Archiving 2015 conference. I will write a detailed post about the conference, which the Library of Congress helped organize, in a few weeks.
When is the personal the professional? For faculty and students, spending countless hours researching, writing, and developing new ideas, the answer (only sometimes tongue-in-cheek) is “always:” digital archiving of their personal materials quickly turns into the creation of collections that can span multiple years, formats, subjects, and versions. In the library, we know well that “save everything” and “curate everything” are very different. What role, then, could the liberal arts college library play in helping our faculty and students curate their digital research materials and the scholarly communication objects that they create with an eye towards sustainability?
At Vassar, Wheaton, Bryn Mawr, and Amherst Colleges, we designed Personal Digital Archiving Days (PDAD) events to push the boundaries of outreach and archiving, learn more about our communities’ needs, and connect users to the right services needed to achieve their archiving goals. In Fall 2014, we held sessions across each of our campuses (some for the first time, some as part of an ongoing PDAD series), using the Library of Congress personal digital archiving resources as a model for our programming. Though our audiences and outcomes varied, we shared common goals: to provide outreach for the work we do, make the campus community aware of the services available to them, and impart best practices on attendees that will have lasting effects for their digital information management.
Joanna DiPasquale, digital initiatives librarian at Vassar, learned about personal digital archiving days from the Library of Congress’ resources and how they worked for public or community libraries. She saw these resources as an opportunity to communicate to campus about the library’s new Digital Initiatives Group and how each part of the group complemented other services on campus (such as media services, computing and preservation). Her workshop was geared toward faculty and faculty-developed digital projects and scholarship. Vassar began the workshops in 2012, and faculty continued to request them each year. By 2014, the event featured a case study from a faculty member (and past attendee) about the new strategies he employed for his own work.
Amy Bocko, Digital Asset Curator at Wheaton, saw PDAD’s success during her time as a Vassar employee. Now at Wheaton, Amy wanted to publicize her brand-new position on campus and ability to offer new digitally-focused services in Library and Information Services, and her Personal Digital Archiving Day brought together a diverse group of faculty members to work on common issues. The reactions were favorable and the attendees were grateful for the help they needed to manage their digital scholarship.
Approaching everything as a whole could have been overwhelming, so Amy boiled it down to “what step could you take today that would improve your digital collection?”, which led to iterative, more effective results. Common responses included “investing in an external hard drive”, “adhering to a naming structure for digital files” and “taking inventory of what I have”. Amy made herself available after her workshop to address the specific concerns of faculty members in relation to their materials. She spoke at length with a printmaking professor who had an extensive collection of both analog slides and digital images with little metadata. They discussed starting small, creating a naming schema that would help her take steps towards becoming organized. The faculty member remarked how just a brief conversation, and knowing that the library was taking steps to help their faculty in managing their digital scholarship, put her mind at ease.
Rachel Appel, digital collections librarian at Bryn Mawr, wanted to focus on student life. Rachel worked directly with Bryn Mawr’s Self-Government Association to work specifically with student clubs to bring awareness about their records, help them get organized and think ahead to filling archival silences in the College Archives. As at the other institutions, PDAD provided a great avenue to introduce her work to campus. The students were also very interested in the concept of institutional memory and creating documented legacies between each generation of students. Rachel was able to hold the workshop again for different groups of attendees and focus on basic personal digital file management.
Sarah Walden, digital projects librarian at Amherst, focused on student thesis writers for PDAD. Sarah worked with Criss Guy, a post-bac at Amherst, and they developed the workshop together. Their goal was to expose students to immediate preservation concerns surrounding a large research project like a thesis (backups, organization, versioning), as well as to give them some exposure to the idea of longer-term preservation. They offered two versions of their workshop. In the fall, they gave an overview of file identification, prioritization, organization, and backup. The second version of the workshop in January added a hands-on activity in which the students organized a set of sample files using the organizing-software program, Hazel.
Although our workshops had varying audiences and goals, they empowered attendees to become more aware of their digital data management and the records continuum. They also provided an outreach opportunity for the digital library to address issues of sustainability in digital scholarship.
This benefits both the scholar and the library. The potential for sustainable digital scholarship (whether sustained by the library, the scholar or both) increases when we can bring our own best practices to our constituents. We believe that PDAD events like ours provide an opportunity for college libraries to meet our scholars in multiple project phases:
- While they are potentially worried about their past digital materials
- While they are actively creating (and curating) their current materials
- When they move beyond our campus services (particularly for students).
While we dispense good advice, we also raise awareness of our digital-preservation skills, our services and our best practices, and we only see that need growing as digital scholarship flourishes. On the college campus, the personal heavily overlaps with the professional. We anticipate that we will be holding more targeted workshops for specific groups of attendees and would like to hear experiences from other institutions on how their PDADs evolved.
Back in February, Stephen Balkam's Guardian article What will happen when the internet of things becomes artificially intelligent? sparked some discussion on Dave Farber's IP list, including this wonderfully apposite Philip K. Dick citation from Ian Stedman via David Pollak. It roused Mike O'Dell to respond with Internet of Obnoxious Things, a really important insight into the fundamental problems underlying the Internet of Things. Just go read it. Mike starts:
The PKDick excerpt cited about a shakedown by a door lock is, I fear, more prescient than it first appears.
I very much doubt that any "Internet of Things" will become Artificially Impudent because long before that happens, all the devices will be co-opted by The Bad Guys who will proceed to pursue shakedowns, extortion, and "protection" rackets on a coherent global scale.
Whether it is even possible to "secure" such a collection of devices empowered with such direct control over physical reality is a profound and, I believe, completely open theoretical question. (We don't even have a strong definition of what that would mean.)
Even if it is theoretically possible, it has been demonstrated in the most compelling possible terms that it will not be done for a host of reasons. The most benign fall under the rubric of "Never ascribe to malice what is adequately explained by stupidity" while others will be aggressively malicious. ...
A close second, however, is a definition of "security" that reads, approximately, "Do what I should have meant." Eg, the rate of technology churn cannot be reduced just because we haven't figured out what we need it to do (or not do) - we'll just "iterate" every time Something Bad(tm) happens.
Charlie goes further, and follows Philip K. Dick more closely, by pointing out that the causes of Something Bad(tm) are not just stupidity and malice, but also greed:
The evil business plan of evil (and misery) posits the existence of smart municipality-provided household recycling bins. ... The bin has a PV powered microcontroller that can talk to a base station in the nearest wifi-enabled street lamp, and thence to the city government's waste department. The householder sorts their waste into the various recycling bins, and when the bins are full they're added to a pickup list for the waste truck on the nearest routing—so that rather than being collected at a set interval, they're only collected when they're full.
But that's not all.
Householders are lazy or otherwise noncompliant and sometimes dump stuff in the wrong bin, just as drivers sometimes disobey the speed limit.
The overt value proposition for the municipality (who we are selling these bins and their support infrastructure to) is that the bins can sense the presence of the wrong kind of waste. This increases management costs by requiring hand-sorting, so the individual homeowner can be surcharged (or fined). More reasonably, households can be charged a high annual waste recycling and sorting fee, and given a discount for pre-sorting everything properly, before collection—which they forfeit if they screw up too often.
The covert value proposition ... local town governments are under increasing pressure to cut their operating budgets. But by implementing increasingly elaborate waste-sorting requirements and imposing direct fines on households for non-compliance, they can turn the smart recycling bins into a new revenue enhancement channel, ... Churn the recycling criteria just a little bit and rely on tired and over-engaged citizens to accidentally toss a piece of plastic in the metal bin, or some food waste in the packaging bin: it'll make a fine contribution to your city's revenue!
Charlie sets out the basic requirements for business models like this:
Some aspects of modern life look like necessary evils at first, until you realize that some asshole has managed to (a) make it compulsory, and (b) use it for rent-seeking. The goal of this business is to identify a niche that is already mandatory, and where a supply chain exists (that is: someone provides goods or service, and as many people as possible have to use them), then figure out a way to colonize it as a monopolistic intermediary with rent-raising power and the force of law behind it.
and goes on to use speed cameras as an example. What he doesn't go into is what the IoT brings to this class of business models: reduced cost of detection, reduced possibility of contest, reduced cost of punishment. A trifecta that means profit! But Charlie brilliantly goes on to incorporate:
the innovative business model that Yves Smith has dubbed "crapification". A business that can reduce customer choice sufficiently then has a profit opportunity; it can make its product so awful that customers will pay for a slightly less awful version. He suggests:
Sell householders a deluxe bin with multiple compartments and a sorter in the top: they can put their rubbish in, and the bin itself will sort which section it belongs in. Over a year or three the householder will save themselves the price of the deluxe bin in avoided fines—but we don't care, we're not the municipal waste authority, we're the speed camera/radar detector vendor!
Cory Doctorow just weighed in, again, on the looming IoT disaster. This time he points out that although it is a problem that Roomba's limited on-board intelligence means poor obstacle avoidance, solving the problem by equipping them with cameras and an Internet connection to an obstacle-recognition service is an awesomely bad idea:
Roombas are pretty useful devices. I own two of them. They do have real trouble with obstacles, though. Putting a camera on them so that they can use the smarts of the network to navigate our homes and offices is a plausible solution to this problem.
But a camera-equipped networked robot that free-ranges around your home is a fucking disaster if it isn't secure. It's a gift to everyone who wants to use cameras to attack you, from voyeur sextortionist creeps to burglars to foreign spies and dirty cops.
Looking back through the notes on my October post, we see that Google is no longer patching known vulnerabilities in Android before 4.4. There are only about 930 million devices running such software. More details on why nearly a billion users are being left to the mercy of the bad guys are here.
The Internet of Things With Wheels That Kill People has featured extensively. First, Progressive Insurance's gizmo that tracks its customers' driving habits has a few security issues:
"The firmware running on the dongle is minimal and insecure," Thuen told Forbes.
"It does no validation or signing of firmware updates, no secure boot, no cellular authentication, no secure communications or encryption, no data execution prevention or attack mitigation technologies ... basically it uses no security technologies whatsoever."
What's the worst that can happen? The device gives access to the CAN bus.
"The CAN bus had been the target of much previous hacking research. The latest dongle similar to the SnapShot device to be hacked was the Zubie device which examined for mechanical problems and allowed drivers to observe and share their habits."
"Argus Cyber Security researchers Ron Ofir and Ofer Kapota went further and gained control of acceleration, braking and steering through an exploit."
Second, a vulnerability in BMWs, Minis and Rolls-Royces:
"BMW has plugged a hole that could allow remote attackers to open windows and doors for 2.2 million cars."
..."Attackers could set up fake wireless networks to intercept and transmit the clear-text data to the cars but could not have impacted vehicle acceleration or braking systems."
BMW's patch also updated its patch distribution system to use HTTPS."
What were they thinking?
Third, Senator Ed Markey has been asking auto makers questions and the answers are not reassuring. No wonder he was asking questions. At an industry-sponsored hackathon last July a 14-year-old with $15 in parts from Radio Shack showed how easy it was:
"Windshield wipers turned on and off. Doors locked and unlocked. The remote start feature engaged. The student even got the car's lights to flash on and off, set to the beat from songs on his iPhone."
Key to an Internet of Things that we could live with is, as Vint Cerf pointed out, a secure firmware update mechanism. The consequences of not having one can be seen in Kaspersky's revelations of the "Equation group" compromising hard drive firmware. Here's an example of how easy it can be. To be fair, Seagate at least has deployed a secure firmware update mechanism, initially to self-encrypting drives but now, I'm told, to all their current drives.
Cooper Quintin at the EFF's DeepLinks blog weighed in with a typically clear overview of the issue entitled Are Your Devices Hardwired For Betrayal?. The three principles:
- Firmware must be properly audited.
- Firmware updates must be signed.
- We need a mechanism for verifying installed firmware.
"None of these things are inherently difficult from a technological standpoint. The hard problems to overcome will be inertia, complacency, politics, incentives, and costs on the part of the hardware companies."
Among the Things in the Internet are computers with vulnerable BIOSes:
"Though there's been long suspicion that spy agencies have exotic means of remotely compromising computer BIOS, these remote exploits were considered rare and difficult to attain.
Legbacore founders Corey Kallenberg and Xeno Kovah's Cansecwest presentation ... automates the process of discovering these vulnerabilities. Kallenberg and Kovah are confident that they can find many more BIOS vulnerabilities; they will also demonstrate many new BIOS attacks that require physical access."
GCHQ has the legal authority to exploit these BIOS vulnerabilities, and any others it can find, against computers, phones and any other Things on the Internet wherever they are. It's likely that most security services have similar authority.
Useful reports appeared, including this two-part report from Xipiter, this from Veracode on insecurities, this from DDoS-protection company Incapsula on the now multiple botnets running on home routers, and this from the SEC Consult Vulnerability Lab about yet another catastrophic vulnerability in home routers. This last report, unlike the industry happy-talk, understands the economics of IoT devices:
"the (consumer) embedded systems industry is always keen on keeping development costs as low as possible and is therefore using vulnerability-ridden code provided by chipset manufacturers (e.g. Realtek CVE-2014-8361 - detailed summary by HP, Broadcom) or outdated versions of included open-source software (e.g. libupnp, MiniUPnPd) in their products."
And just as I was finishing this rant, Ars Technica posted details of yet another botnet running on home routers, this one called Linux/Moose. It collects social network credentials.
That's all until the next rant. Have fun with your Internet-enabled gizmos!
In the previous post I walked through some of the different ways that we could normalize a subject string and took a look at what effects these normalizations had on the subjects in the entire DPLA metadata dataset that I have been using.
In this post I want to continue along those lines and take a look at what happens when you apply these normalizations to the subjects in the dataset, but this time focus on the Hub level instead of working with the whole dataset.
I applied the normalizations mentioned in the previous post to the subjects from each of the Hubs in the DPLA dataset. This included total values, unique but un-normalized values, case folded, lower cased, NACO, Porter stemmed, and fingerprint. I applied each normalization to the output of the previous one, as a series. Here is what the normalization chain looked like at each step:
- total
- total > unique
- total > unique > case folded
- total > unique > case folded > lowercased
- total > unique > case folded > lowercased > NACO
- total > unique > case folded > lowercased > NACO > Porter
- total > unique > case folded > lowercased > NACO > Porter > fingerprint
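The chaining can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used for the numbers in this post: `naco_like` is a rough stand-in for NACO normalization, `str.casefold` is assumed for the case folding step, the Porter step is left out because it needs a third-party stemmer, and all function names are hypothetical.

```python
import re
import unicodedata


def naco_like(value):
    """Rough stand-in for NACO normalization (an assumption, not the real rules):
    strip diacritics, turn punctuation into spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_marks)
    return " ".join(no_punct.split())


def fingerprint(value):
    """Order-independent key: unique whitespace-separated tokens, sorted."""
    return " ".join(sorted(set(value.split())))


# Each step runs on the output of the previous one (Porter stemming omitted).
STEPS = [
    ("case folded", str.casefold),
    ("lowercased", str.lower),
    ("NACO", naco_like),
    ("fingerprint", fingerprint),
]


def chain_counts(subjects):
    """Return (step name, number of distinct values) for every step in the chain."""
    counts = [("total", len(subjects))]
    current = set(subjects)
    counts.append(("unique", len(current)))
    for name, func in STEPS:
        current = {func(value) for value in current}
        counts.append((name, len(current)))
    return counts


sample = ["Dogs", "dogs", "Dogs.", "Smith, John", "John Smith", "Dogs"]
for name, count in chain_counts(sample):
    print(name, count)
```

Because every step only ever merges values, the counts can stay flat or shrink from one step to the next but never grow, which is the pattern visible in the per-Hub numbers.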
The number of subjects after each normalization is presented in the first table below.

| Hub Name | Total Subjects | Unique Subjects | Folded | Lowercase | NACO | Porter | Fingerprint |
|---|---|---|---|---|---|---|---|
| ARTstor | 194,883 | 9,560 | 9,559 | 9,514 | 9,483 | 8,319 | 8,278 |
| Biodiversity_Heritage_Library | 451,999 | 22,004 | 22,003 | 22,002 | 21,865 | 21,482 | 21,384 |
| David_Rumsey | 22,976 | 123 | 123 | 122 | 121 | 121 | 121 |
| Digital_Commonwealth | 295,778 | 41,704 | 41,694 | 41,419 | 40,998 | 40,095 | 39,950 |
| Digital_Library_of_Georgia | 1,151,351 | 132,160 | 132,157 | 131,656 | 131,171 | 130,289 | 129,724 |
| Harvard_Library | 26,641 | 9,257 | 9,251 | 9,248 | 9,236 | 9,229 | 9,059 |
| HathiTrust | 2,608,567 | 685,733 | 682,188 | 676,739 | 671,203 | 667,025 | 653,973 |
| Internet_Archive | 363,634 | 56,910 | 56,815 | 56,291 | 55,954 | 55,401 | 54,700 |
| J_Paul_Getty_Trust | 32,949 | 2,777 | 2,774 | 2,760 | 2,741 | 2,710 | 2,640 |
| Kentucky_Digital_Library | 26,008 | 1,972 | 1,972 | 1,959 | 1,900 | 1,898 | 1,892 |
| Minnesota_Digital_Library | 202,456 | 24,472 | 24,470 | 23,834 | 23,680 | 22,453 | 22,282 |
| Missouri_Hub | 97,111 | 6,893 | 6,893 | 6,850 | 6,792 | 6,724 | 6,696 |
| Mountain_West_Digital_Library | 2,636,219 | 227,755 | 227,705 | 223,500 | 220,784 | 214,197 | 210,771 |
| National_Archives_and_Records_Administration | 231,513 | 7,086 | 7,086 | 7,085 | 7,085 | 7,050 | 7,045 |
| North_Carolina_Digital_Heritage_Center | 866,697 | 99,258 | 99,254 | 99,020 | 98,486 | 97,993 | 97,297 |
| Smithsonian_Institution | 5,689,135 | 348,302 | 348,043 | 347,595 | 346,499 | 344,018 | 337,209 |
| South_Carolina_Digital_Library | 231,267 | 23,842 | 23,838 | 23,656 | 23,291 | 23,101 | 22,993 |
| The_New_York_Public_Library | 1,995,817 | 69,210 | 69,185 | 69,165 | 69,091 | 68,767 | 68,566 |
| The_Portal_to_Texas_History | 5,255,588 | 104,566 | 104,526 | 103,208 | 102,195 | 98,591 | 97,589 |
| United_States_Government_Printing_Office_(GPO) | 456,363 | 174,067 | 174,063 | 173,554 | 173,353 | 172,761 | 170,103 |
| University_of_Illinois_at_Urbana-Champaign | 67,954 | 6,183 | 6,182 | 6,150 | 6,134 | 6,026 | 6,010 |
| University_of_Southern_California_Libraries | 859,868 | 65,958 | 65,882 | 65,470 | 64,714 | 62,092 | 61,553 |
| University_of_Virginia_Library | 93,378 | 3,736 | 3,736 | 3,672 | 3,660 | 3,625 | 3,618 |

Here is a table that shows the percentage reduction after each field is normalized with a specific algorithm. The percent reduction makes it a little easier to interpret.

| Hub Name | Folded Normalization | Lowercase Normalization | NACO Normalization | Porter Normalization | Fingerprint Normalization |
|---|---|---|---|---|---|
| ARTstor | 0.0% | 0.5% | 0.8% | 13.0% | 13.4% |
| Biodiversity_Heritage_Library | 0.0% | 0.0% | 0.6% | 2.4% | 2.8% |
| David_Rumsey | 0.0% | 0.8% | 1.6% | 1.6% | 1.6% |
| Digital_Commonwealth | 0.0% | 0.7% | 1.7% | 3.9% | 4.2% |
| Digital_Library_of_Georgia | 0.0% | 0.4% | 0.7% | 1.4% | 1.8% |
| Harvard_Library | 0.1% | 0.1% | 0.2% | 0.3% | 2.1% |
| HathiTrust | 0.5% | 1.3% | 2.1% | 2.7% | 4.6% |
| Internet_Archive | 0.2% | 1.1% | 1.7% | 2.7% | 3.9% |
| J_Paul_Getty_Trust | 0.1% | 0.6% | 1.3% | 2.4% | 4.9% |
| Kentucky_Digital_Library | 0.0% | 0.7% | 3.7% | 3.8% | 4.1% |
| Minnesota_Digital_Library | 0.0% | 2.6% | 3.2% | 8.3% | 8.9% |
| Missouri_Hub | 0.0% | 0.6% | 1.5% | 2.5% | 2.9% |
| Mountain_West_Digital_Library | 0.0% | 1.9% | 3.1% | 6.0% | 7.5% |
| National_Archives_and_Records_Administration | 0.0% | 0.0% | 0.0% | 0.5% | 0.6% |
| North_Carolina_Digital_Heritage_Center | 0.0% | 0.2% | 0.8% | 1.3% | 2.0% |
| Smithsonian_Institution | 0.1% | 0.2% | 0.5% | 1.2% | 3.2% |
| South_Carolina_Digital_Library | 0.0% | 0.8% | 2.3% | 3.1% | 3.6% |
| The_New_York_Public_Library | 0.0% | 0.1% | 0.2% | 0.6% | 0.9% |
| The_Portal_to_Texas_History | 0.0% | 1.3% | 2.3% | 5.7% | 6.7% |
| United_States_Government_Printing_Office_(GPO) | 0.0% | 0.3% | 0.4% | 0.8% | 2.3% |
| University_of_Illinois_at_Urbana-Champaign | 0.0% | 0.5% | 0.8% | 2.5% | 2.8% |
| University_of_Southern_California_Libraries | 0.1% | 0.7% | 1.9% | 5.9% | 6.7% |
| University_of_Virginia_Library | 0.0% | 1.7% | 2.0% | 3.0% | 3.2% |

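Judging from the ARTstor row, the percentages look like cumulative reductions measured against the unique (un-normalized) count, rather than against the previous step. A minimal sketch of that arithmetic, using the ARTstor counts from the first table (the function name and dictionary layout are just for illustration):

```python
def percent_reduction(unique_count, normalized_count):
    """Percentage of unique values that disappeared after normalization,
    measured against the un-normalized unique count."""
    return round(100.0 * (unique_count - normalized_count) / unique_count, 1)


# ARTstor counts taken from the first table.
artstor_unique = 9560
artstor_steps = {
    "Folded": 9559,
    "Lowercase": 9514,
    "NACO": 9483,
    "Porter": 8319,
    "Fingerprint": 8278,
}

for step, count in artstor_steps.items():
    print(step, percent_reduction(artstor_unique, count))
```

For example, the Porter value works out as (9,560 - 8,319) / 9,560, or about 13.0%.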
Here is that data presented as a graph, which I think shows it even better.
You can see that for many of the Hubs the biggest reductions happen when applying the Porter normalization and the Fingerprint normalization. A hub of note is ARTstor, which had the highest percentage reduction of all the hubs. This was primarily caused by the Porter normalization, which means that a large percentage of subjects stemmed to the same stem; often these are plural and singular versions of the same subject. This may be completely valid given how ARTstor chose to create metadata, but it is still interesting.
Another hub I found interesting in this data was Harvard, where the biggest reduction happened with the Fingerprint normalization. This might suggest that there are a number of values that are the same but with their words in a different order, for example names that occur in both inverted and non-inverted form.
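One way to check that hunch would be to group the values by fingerprint and list the groups that contain more than one original string. The sketch below assumes an OpenRefine-style fingerprint (lowercase, punctuation removed, unique tokens sorted); the function names and the sample values are made up for illustration and are not taken from the Harvard data.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Lowercase, replace punctuation with spaces, then sort the unique tokens."""
    cleaned = re.sub(r"[^\w\s]", " ", value.lower())
    return " ".join(sorted(set(cleaned.split())))


def collisions(values):
    """Map each fingerprint to the distinct original values that share it,
    keeping only fingerprints shared by more than one value."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


sample = ["Twain, Mark", "Mark Twain", "Harvard University"]
print(collisions(sample))
```

Scanning the resulting groups by eye would show whether the reduction really comes from inverted names or from something else, such as repeated words.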
In the end I’m not sure how helpful this is as an indicator of quality within a field. There are fields that would benefit from this sort of normalization more than others. For example subjects, creator, contributor, publisher will normalize very differently than a field like title or description.
Let me know what you think via Twitter if you have questions or comments.
We’re delighted to be able to tell you that detailed planning is now underway for an exciting program at Hydra Connect 2015 in Minneapolis this Fall. The program committee would love to hear from those of you who have suggestions for items that should be included. These might be workshops or demonstrations for the Monday, or they might be for 5, 10 or 20 minute presentations, discussion groups or another format you’d like to suggest during the conference proper. It may be that you will offer to facilitate or present the item yourself or it may be that you’d like the committee to commission the slot from someone else – you could maybe suggest a name. As in the past, we shall be attempting to serve the needs of attendees from a wide range of experience and background (potential adopters, new adopters, “old hands”; developers, managers, sysops etc) and, if it isn’t obvious, it may be helpful if you tell us who would be the target audience. Those of you going to Open Repositories 2015 might take the opportunity to talk to others about the possibility of joint workshops, presentations, etc.?
Advance warning that, as in past years, we shall ask all attendees who are working with Hydra to bring a poster for the popular “poster show and tell” session. This is a great opportunity to share with colleague Hydranauts what your institution is doing and to forge connections around the work. Details later…
FYI: we plan on opening booking in the next ten days or so and we hope to see you in Minneapolis for what promises to be another great Hydra Connect meeting!
Peter Binkley, Matt Critchlow, Karen Estlund, Erin Fahy and Anna Headley (the Hydra Connect 2015 Program Committee)
I wrote about Amazon Echo a few months back. At the time, I did not have it, but was looking forward to using it. Now that I have had Echo for a while, I have a better idea of its strengths and weaknesses.
It doesn’t pick up every word I say, but its voice recognition is much better than I anticipated. The app works nicely on my phone and iPad and I found it easy to link Pandora, my music, and to indicate what news channels I want to hear from. I enjoy getting the weather report, listening to a flash news briefing, adding items to my shopping list, listening to music, and being informed of the best route to work to avoid traffic.
My favorite feature is that it is hands-free. I’m constantly running around my house juggling a lot of things. Often I need an answer to a question, I need to add something to a shopping list as I’m cooking, or I want to hear a certain song as I’m elbow-deep in a project. Having the ability to just “say the words” is wonderful. Now if it just worked for everything…
I hope updates will come soon though as I’d like to see increased functionality in its ability to answer questions and provide traffic information for different locations other than the one location I can program into the app. I also want to be able to make calls and send text messages using Echo.
In my first post about Amazon Echo, I stated I was really interested in the device as an information retrieval tool. Currently, Echo doesn’t work as well as I was expecting for retrieving information, but with software updates I still see it (and similar tools) having an impact on our research.
Overall, I see it as a device that has amazing potential, but it is still in its infancy.
Has anyone else used Echo? I’d love to hear your thoughts on the device.
Last updated May 28, 2015. Created by David Nind on May 28, 2015.
Monthly maintenance release for Koha v 3.18.7. See the release announcements for the details:
- Koha 3.18.7 - http://koha-community.org/koha-3-18-7-released/ (26 May 2015 - maintenance release)