Feed aggregator

Karen Coyle: Catalog and Context, Part I

planet code4lib - Tue, 2016-06-21 18:51
This multi-part post is based on a talk I gave in June, 2016 at ELAG in Copenhagen.

Imagine that you do a search in your GPS system and are given the exact point of the address, but nothing more.

Without some context showing where on the planet the point exists, having the exact location, while accurate, is not useful.



In essence, this is what we provide to users of our catalogs. They do a search and we reply with bibliographic items that meet the letter of that search, but with no context about where those items fit into any knowledge map.

Because we present the catalog as a retrieval tool for unrelated items, users have come to see the library catalog as nothing more than a tool for known item searching. They do not see it as a place to explore topics or to find related works. The catalog wasn't always just a known item finding tool, however. To understand how it came to be one, we need a short visit to Catalogs Past.

Catalogs Past
We can't really compare the library catalog of today to the early book catalogs, since the problem that they had to solve was quite different to what we have today. However, those catalogs can show us what a library catalog was originally meant to be.

A book catalog was a compendium of entry points, mainly authors but in some cases also titles and subjects. The bibliographic data was kept quite brief as every character in the catalog was a cost in terms of type-setting and page real estate. The headings dominated the catalog, and it was only through headings that a user could approach the bibliographic holdings of the library. An alphabetical author list is not much "knowledge organization", but the headings provided an ordered layer over the library's holdings, and were also the only access mechanism to them.

Some of the early card catalogs had separate cards for headings and for bibliographic data. If entries in the catalog had to be hand-written (or later typed) onto cards, the easiest thing was to slot the cards into the catalog behind the appropriate heading without adding heading data to the card itself.

Often there was only one card with a full bibliographic description, and that was the "main entry" card. All other cards were references to a point in the catalog, for example the author's name, where more information could be found.
Again, all bibliographic data was subordinate to a layer of headings that made up the catalog. We can debate how intellectually accurate or useful that heading layer was, but there is no doubt that it was the only entry to the content of the library.

The Printed Card
In 1902 the Library of Congress began printing cards that could be purchased by libraries. The idea was genius. For each item cataloged by LC, a card was printed in as many copies as needed. Libraries could buy the number of catalog card "blanks" they required to create all of the entries in their catalogs. The libraries would use as many of the printed cards as needed and type (or write) the desired headings onto the top of each card. Each of these would have the full bibliographic information - an advantage for users, who would no longer need to follow "see" references from headings to the one full entry card in the catalog.


These cards introduced something else that was new: the card would have at the bottom a tracing of the headings that LC was using in its own catalog. This was a savings for the libraries as they could copy LC's practice without incurring their own catalogers' time. This card, for the first time, combined both bibliographic information and heading tracings in a single "record", with the bibliographic information on the card being an entry point to the headings.

Machine-Readable Card Printing
The MAchine Readable Cataloging (MARC) project of the Library of Congress was a major upgrade to card printing technology. By including all of the information needed for card printing in a computer-processable record, LC could take advantage of new technology to streamline its card production process, and even move to a kind of "print on demand" model. The MARC record was designed to hold all of the information needed to print the set of cards for a book: author, title, subjects, and added entries were all included in the record, as well as some additional information that could be used to generate reports such as "new acquisitions" lists.

Here again the bibliographic information and the heading information were together in a single unit, and the record even followed the card printing convention for the order of the entries, with the bibliographic description at the top, followed by headings. With the MARC record, it was possible not only to print sets of cards, but to actually print the headings onto the cards, so that when libraries received a set the cards were ready to go into the catalog at their respective places.
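To make the mechanism concrete, here is a minimal sketch (in Python, with invented data, and not real MARC syntax): one machine-readable record carries the description plus every heading traced for it, and the card set is generated by stamping each heading above the same full description.

    # A simplified illustration only - real MARC records and card production
    # were far more elaborate. One record holds the bibliographic description
    # and its heading tracings; one card is produced per heading.
    record = {
        "description": "Smith, Jane. A history of cats. New York: Example Press, 1965.",
        "headings": [
            ("author", "Smith, Jane"),
            ("subject", "Cats -- History"),
            ("added entry", "Example Press"),
        ],
    }

    def print_card_set(rec):
        """Generate one card per traced heading, each carrying the full description."""
        for kind, heading in rec["headings"]:
            print(heading)
            print("    " + rec["description"])
            print("    [" + kind + " card]")
            print("-" * 40)

    print_card_set(record)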

Next, we'll look at the conversion from printed cards to catalogs using database technology.

-> Part II

Cynthia Ng: A Letter of Thanks

planet code4lib - Tue, 2016-06-21 17:09
I have often thought that I have been fortunate to meet a lot of great people during my time in library school and since then in the working world. While I have thanked many of them in writing and in person, I wanted to reflect on how the combination of people and their support has … Continue reading A Letter of Thanks

DPLA: Catch up with DPLA at ALA Annual in Orlando

planet code4lib - Tue, 2016-06-21 14:00

The American Library Association’s Annual Conference kicks off later this week in Orlando, Florida, and DPLA staffers are excited to hit the road, connect with a fantastic community of librarians, and show our support for the city of Orlando. Here’s your guide to when and where to catch up with DPLA’s staff and community members at ALA Annual. If you’ll be following the conference from afar, connect with us on Twitter and follow the conference at #alaac16.

[S] = DPLA Staff Participating, [K] = Knight Foundation Sponsored Panel, [H] = DPLA Hub and/or Contributing Institution represented

FRIDAY, June 24, 2016 12:00pm – 2:00pm: Ebook Working Group Project Update [S]
Location: Networking Uncommons, Orange County Convention Center

This meeting is open to all librarians curious about current issues, ongoing projects, and ways to get involved. Attendees will learn how the Ebook Working Group fits in with other library ebook groups, and explore the projects we currently work on, including the Library E-content Access Project (LEAP), SimplyE/Open eBooks, SimplyE for Consortia, Readers First and other library-created ebook projects. Current members of the working groups will have the opportunity to meet and share updates, and connect with potential new members.

DPLA Staff Presenting: Michelle Bickert, Ebook Program Manager, and Rachel Frick, Business Development Director

SATURDAY, June 25, 2016 8:30am – 10:00am: Linked Data – Globally Connecting Libraries, Archives, and Museums

In recent years, libraries have embraced their role as global participants in the Semantic Web. Developments in library metadata frameworks such as BIBFRAME and RDA, built on standard data models and ontologies including RDF, SKOS and OWL, highlight the importance of linking data in an increasingly global environment. What is the status of linked data projects in libraries and other memory institutions internationally? Come hear our speakers address current projects, including RightsStatements.org, along with opportunities and challenges.

Panelists: Gordon Dunsire, Chair, RDA Steering Committee, Edinburgh, United Kingdom; Reinhold Heuvelmann, Senior Information Standards Specialist, German National Library; Richard Urban, Asst. Professor, School of Information, Florida State University

1:00pm – 2:30pm: Library Consortia, E-books and the Power of Libraries: Innovative Shared E-book Delivery Models from a Library Consortium near You [S]

This program will include an interactive panel discussion of the major trends in e-books and how library consortia are at the forefront of elevating libraries as a major player in the e-book market. Panelists will present leading models from library consortia that showcase innovation and advocacy, including shared collections using open source, commercial and hybrid platforms, and the investigation of a national e-book platform for local content from self-published authors and independent publishers.

Panelists: Michelle Bickert, Digital Public Library of America; Veronda Pitchford, Director of Membership Development and Resource Sharing, Reaching Across Illinois Library System; Valerie Horton, Executive Director, Minitex; Greg Pronevitz, Executive Director, Massachusetts Library System

1:00pm – 2:30pm:  Transforming Libraries: Knight News Challenge Winners Announced [K]

For their latest Knight News Challenge, the Knight Foundation asked applicants to submit their best idea answering the question: “How might libraries meet 21st century information needs?” This program will include a presentation of the newest winners of the challenge and a panel discussion on transformational change in the library field.

Panelists: Lisa Peet, Associate News Editor at Library Journal; Francesca Rodriquez, Foundation Officer at Madison Public Library Foundation; Matthew Phillips, Manager, Technology Development Team at Harvard University Library

3:00pm – 4:00pm: Can I Use It? New Tools for Determining Rights and (Re)Use Status for Our Digital Collections [S] [K] [H]

Two innovative approaches help libraries address rights and reuse status for growing digital collections. RightsStatements.org addresses the need for standardized rights statements through international collaboration around a shared framework implemented by the Digital Public Library of America, New York Public Library, and other institutions. The Copyright Review Management System provides a toolkit for determining copyright, building off the copyright status work for materials in HathiTrust.

Panelists: Emily Gore, Director for Content, Digital Public Library of America; Greg Cram, Associate Director, Copyright and Information Policy, The New York Public Library; Rick Adler, DPLA Service Hub Coordinator at University of Michigan, School of Information

SUNDAY, June 26, 2016 10:30am – 11:30am:  From the Macro to the Micro: How Small-Scale Digitization Can Make a Big Difference [K] [H]

Digitization programs can be resource rich, even when institutions may be resource poor. Developing a program for the digitization of cultural heritage materials benefits from planning at the macro level, with organizational buy-in and strategic considerations addressed. Once this foundation is in place, an organization can successfully implement a digitization service aligned with its mission that benefits important known stakeholders and the wider community. This panel will focus on digitization programs from these two perspectives, with emphasis on the creation of a mobile digitization service and how this can be replicated to sustain small-scale digitization programs that can have a huge and positive impact – not only for the institutions but for the communities they serve.

Panelists: Caroline Catchpole, Mobile Digitization Specialist at Metropolitan New York Library Council; Natalie Milbrodt, Associate Coordinator at Metadata Services at Queens Library; Jolie O. Graybill, Assistant Director at Minitex; Molly Huber, Outreach Coordinator at Minnesota Digital Library

Additional Knight Foundation Sponsored Panels:  See you in Orlando!

“Greetings from Orlando, The City Beautiful” postcard c. 1930-1945 from the collection of Boston Public Library via Digital Commonwealth.

Karen Coyle: Catalog and Context, Part II

planet code4lib - Tue, 2016-06-21 04:13
In the previous post, I talked about book and card catalogs, and how they existed as a heading layer over the bibliographic description representing library holdings. In this post, I will talk about what changed when that same data was stored in database management systems and delivered to users on a computer screen.

Taking a very simple example, in the card catalog a single library holding with author, title and one subject becomes three separate entries, one for each heading. These are filed alphabetically in their respective places in the catalog.

In this sense, the catalog is composed of cards for headings that have attached to them the related bibliographic description. Most items in the library are represented more than once in the library catalog. The catalog is a catalog of headings.

In most computer-based catalogs, the relationship between headings and bibliographic data is reversed: the record, with both bibliographic and heading data, is stored once; access points, analogous to the headings of the card catalog, are extracted into indexes that all point to the single record.
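As a rough sketch of that reversal (Python, with made-up records, and not how any particular catalog system is actually built): the record is stored once, and its access points are extracted into indexes that point back at it.

    # A toy model of a database-era catalog: one stored record per item,
    # with heading-like access points extracted into indexes.
    from collections import defaultdict

    records = {
        1: {"author": "Smith, Jane",
            "title": "A history of cats",
            "subjects": ["Cats -- History"]},
    }

    author_index = defaultdict(set)
    subject_index = defaultdict(set)

    for rec_id, rec in records.items():
        author_index[rec["author"]].add(rec_id)
        for subject in rec["subjects"]:
            subject_index[subject].add(rec_id)

    # A search consults the (invisible) index and returns stored records;
    # the user never sees the ordered list of headings around the match.
    def search_subject(heading):
        return [records[i] for i in subject_index.get(heading, set())]

    print(search_subject("Cats -- History"))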

This in itself could be just a minor change in the mechanism of the catalog, but in fact it turns out to be more than that.

First, the indexes of the database system are not visible to the user. This is the opposite of the card catalog where the entry points were what the user saw and navigated through. Those entry points, at their best, served as a knowledge organization system that gave the user a context for the headings. Those headings suggest topics to users once the user finds a starting point in the catalog.

When this system works well for the user, she has some understanding of where she is in the virtual library that the catalog creates. This context could be a subject area, or it could be a bibliographic context such as the editions of a work.

Most, if not all, online catalogs do not present the catalog as a linear, alphabetically ordered list of headings. Database management technology encourages searching rather than linear browsing. Even if one searches the headings as a left-anchored string of characters, the search results in a retrieved set of matching entries, not a point in an alphabetical list, and there is no way to navigate to nearby entries. The bibliographic data is therefore provided in neither the context nor the order of the catalog. After a search on "cat breeds" the user sees a screen-full of bibliographic records that lack context, because most default displays do not show the user the headings or text that caused the items to be retrieved.
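The difference can be illustrated with a small sketch (Python, invented headings): browsing lands the user at a point in the ordered heading list, with neighbors visible, while a left-anchored search simply returns the set of matches.

    import bisect

    headings = sorted([
        "Cat breeds",
        "Cat breeds - Handbooks, manuals, etc.",
        "Cat breeds - History",
        "Cat breeds - Thailand",
        "Cat shows",
        "Caterpillars",
    ])

    def browse(heading, window=2):
        """Card-catalog style: find the position in the ordered list, show neighbors."""
        pos = bisect.bisect_left(headings, heading)
        return headings[max(0, pos - window): pos + window]

    def retrieve(prefix):
        """Database style: a left-anchored search returns only the matching set."""
        return [h for h in headings if h.lower().startswith(prefix.lower())]

    print(browse("Cat breeds - History"))   # a point in the list, context preserved
    print(retrieve("Cat breeds"))           # a set of matches, context lost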

Although each of these items has a subject heading containing the words "Cat breeds," the order of the entries is not the subject order. The subject headings in the first few records read, in order:

  1. Cat breed
  2. Cat breeds
  3. Cat breeds - History
  4. Cat breeds - Handbooks, manuals, etc.
  5. Cat breeds
  6. Cat breeds - Thailand
  7. Cat breeds

Even if the catalog uses a visible and logical order, like alphabetical by author and title, or most recent by date, there is no way from the displayed list for the user to get the sense of "where am I?" that was provided by the catalog of headings.

In the early 1980's, when I was working on the University of California's first online catalog, the catalogers immediately noted this as a problem. They would have wanted the retrieved set to be displayed as:

(Note how much this resembles the book catalog shown in Part I.) At the time, and perhaps still today, there were technical barriers to such a display, mainly because of limitations on the sorting of large retrieved sets. (Large, at that time, was anything over a few hundred items.) Another issue was that any bibliographic record could be retrieved more than once in a single retrieved set, and presenting the records more than once in the display, given the database design, would be tricky. I don't know if starting afresh today some of these features would be easier to produce, but the pattern of search and display seems not to have progressed greatly from those first catalogs.
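Something like the display the catalogers asked for could be approximated by re-grouping the retrieved set under its matching headings, with a record appearing once under each heading that matched; a rough sketch (Python, invented data) follows.

    # A rough approximation of a heading-grouped display of a retrieved set.
    # Titles and data are invented; a record appears once under each matching heading.
    from collections import defaultdict

    retrieved = [
        {"title": "Legacy of the cat", "subjects": ["Cat breeds"]},
        {"title": "Siamese cats", "subjects": ["Cat breeds - Thailand"]},
        {"title": "The complete cat book",
         "subjects": ["Cat breeds", "Cat breeds - Handbooks, manuals, etc."]},
    ]

    def display_by_heading(records, query):
        grouped = defaultdict(list)
        for rec in records:
            for heading in rec["subjects"]:
                if query.lower() in heading.lower():
                    grouped[heading].append(rec["title"])
        for heading in sorted(grouped):       # headings in alphabetical order
            print(heading)
            for title in grouped[heading]:
                print("    " + title)

    display_by_heading(retrieved, "cat breeds")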

In addition, it is in any case questionable whether a set of bibliographic items retrieved from a database on some query would reproduce the presumably coherent context of the catalog. This is especially true because of the third major difference between the card catalog and the computer catalog: the ability to search on individual words in the bibliographic record rather than being limited to seeking on full left-anchored headings. The move to keyword searching was both a boon and a bane because it was a major factor in the loss of context in the library catalog.

Keyword searching will be the main topic of Part III of this series.


M. Ryan Hess: Virtual Reality is Getting Real in the Library

planet code4lib - Mon, 2016-06-20 23:41

My library just received three Samsung S7 devices with Gear VR goggles. We put them to work right away.

The first thought I had was: Wow, this will change everything. My second thought was: Wow, I can’t wait for Apple to make a VR device!

The Samsung Gear VR experience is grainy and fraught with limitations, but you can see the potential right away. The virtual reality is, after all, working off a smartphone. There is no high-end graphics card working under the hood. Really, the goggles are just a plastic case holding the phone up to your eyes. But still, despite all this, it’s amazing.

Within twenty-four hours, I’d surfed beside the world’s top surfers on giant waves off Hawaii, hung out with the Masai in Africa and shared an intimate moment with a pianist and his dog in their (New York?) apartment. It was all beautiful.

We’ve Been Here Before

Remember when the Internet came online? If you’re old enough, you’ll recall the crude attempts to chat on digital bulletin board systems (BBS) or, much later, the publication of the first colorful (often jarringly so) HTML pages.

It’s the Hello World! moment for VR now. People are just getting started. You can tell the content currently available is just scratching the surface of potentialities for this medium. But once you try VR and consider the ways it can be used, you start to realize nothing will be the same again.

The Internet Will Disappear

So said Google executive chairman Eric Schmidt in 2015. He was talking about the rise of AI, wearable tech and many other emerging technologies that will transform how we access data. For Schmidt, the Internet will simply fade into these technologies to the point that it will be unrecognizable.

I agree. But being primarily a web librarian, I’m mostly concerned with how new technologies will translate in the library context. What will VR mean for library websites, online catalogs, eBooks, databases and the social networking aspects of libraries?

So after trying out VR, I was already thinking about all this. Here are some brief thoughts:

  • Visiting the library stacks in VR could transform the online catalog experience
  • Library programming could break out of the physical world (virtual speakers, virtual locations)
  • VR book discussions could incorporate virtual tours of topics/locations touched on in books
  • Collections of VR experiences could become a new source for local collections
  • VR maker spaces and tools for creatives to create VR experiences/objects
Year Zero?

Still, VR makes your eyes tired. It’s not perfect. It has a long way to go.

But based on my experience sharing this technology with others, it’s addictive. People love trying it. They can’t stop talking about it afterward.

So, while it may be some time before the VR revolution disrupts the Internet (and virtual library services with it), it sure feels imminent.


District Dispatch: Re:create event draws crowd

planet code4lib - Mon, 2016-06-20 21:08

Photo credit: Jimmy Emerson, via Flickr.

Re:create, the copyright coalition that includes members from industry and library associations, public policy think tanks, public interest groups, and creators, presented a program – How it works: understanding copyright law in the new creative economy – to a packed audience at the US Capitol Visitor Center. Speakers included Alex Feerst, Corporate Counsel from Medium; Katie Oyama, Senior Policy Counsel for Google; Becky “Boop” Prince, YouTube CeWEBrity and Internet News Analyst; and Betsy Rosenblatt, Legal Director for the Organization for Transformative Works. The panel was moderated by Joshua Lamel, Executive Director of Re:create. Discussion focused on new creators and commercial businesses made possible by the Internet, fair use, and freedom of expression.

We live in a time of creative resurgence; more creative content is produced and distributed now than at any time in history. Some creators have successfully built profit-making businesses by “doing something they love,” whether it’s quilting, storytelling, applying makeup, or riffing on their favorite TV shows. What I thought was most interesting (because sometimes I get tired of talking about copyright) was hearing the stories of new creators – in particular, how they established a sense of self by communicating with people across the globe who have like-minded interests. People who found a way to express themselves through fan fiction, for example, found the process of creating and sharing with others so edifying that their lives were changed. Regardless of whether they made money or not, being able to express themselves to a diverse audience was worth the effort.

One story involved a quilter from Hamilton, Missouri who started conducting quilting tutorials on YouTube. Her popularity grew to such an extent that she and her family – facing a tough economic time – bought an old warehouse and built a quilting store selling pre-cut fabrics. Their quilting store became so popular that fans from as far away as Australia travel to see it. And those visitors spend money in Hamilton. In four years, the Missouri Star Quilting Company became the biggest employer in the entire county, employing over 150 people, including single moms, retirees and students.

But enough about crafts. The panel also shared their thoughts on proposals to change “notice and take down” to “notice and stay down,” a position advocated by the content community in their comments on Section 512.  This provision is supposed to help rights holders limit alleged infringement and provide a safe harbor for intermediaries – like libraries that offer open internet service – from third party liability.  Unfortunately, the provision has been used to censor speech that someone does not like, whether or not copyright infringement is implicated. A timely example is Axl Rose, who wanted an unflattering photo of himself taken down even though he is not the rights holder of the photo. The speakers, however, did favor keeping Section 512 as it is.  They noted that without the liability provision, it is likely they would not continue their creative work, because of the risk involved in copyright litigation.

All in all, a very inspiring group of people with powerful stories to tell about creativity and free expression, and the importance of fair use.

The post Re:create event draws crowd appeared first on District Dispatch.

Access Conference: Ignite Access 2016

planet code4lib - Mon, 2016-06-20 20:28

Wishing you’d submitted a proposal for Access? As part of this year’s program, we’re assembling a 60-minute Ignite event, and we’re looking for a few more brave souls to take up the challenge!

What’s Ignite, you ask? An Ignite talk is a fast-paced, focused, and entertaining presentation format that challenges presenters to “enlighten us, but make it quick.”

Ignite talks are short-form presentations that follow a simple format: presenters have 5 minutes, and must use 20 slides which are set to automatically advance every 15 seconds. Tell us about a project or experience, whether it’s a brilliant success story, or a dismal failure! Issue a challenge, raise awareness, or share an opportunity! An Ignite talk can focus on anything and can be provocative, tangential, or just plain fun. Best of all, it’s a great challenge to hone your presentation skills in front of one of the best audiences you’ll find anywhere!

Ignite talks have deep roots in the tech community, but have grown to reach an incredibly broad range of presenters and audiences; even schools and businesses are using Ignite talks to teach the value of crafting effective, powerful messages in a minimalist frame. Ignite Talk’s video archive provides hundreds of great examples by presenters from around the world. Better yet, check out author Scott Berkun’s Why and How to Give an Ignite Talk.

Interested? Send your submissions to drross@unb.ca, and tell us in 200 words or fewer what you’d like to enlighten us about. We’ll continue to accept Ignite proposals until July 15th, and accepted submitters will be notified by July 20.

Questions? Contact your friendly neighbourhood program chair at drross@unb.ca!

LITA: Top Strategies to Win that Technology Grant: Part 3

planet code4lib - Mon, 2016-06-20 18:08

As mentioned in my last posts, conducting a needs assessment and/or producing quantitative and/or qualitative data about the communities you serve is key to a successfully funded proposal. Once you have an idea for a project that connects with your patrons, your research for RFPs (Requests for Proposals) begins.

Here are some RFP research items to keep in mind:

Open your opportunities for funding.  Our first choice may be to look at “technology grants” only, but thinking of other avenues to broaden your search may be helpful. As MacKellar mentions in her book Writing Successful Technology Grant Proposals, “Rule #15: Use grant resources that focus on the goal or purpose of your project or on your target population.  Do not limit your research to resources that include only grants for technology” (p.71).

Build a comprehensive list of keywords that describes your project in order to conduct strong searches.

Keep in mind throughout the whole process: grants are about people, not about owning the latest devices or tools. Also, what may work for one library may not work for another; each library has its own unique vibe. This is another reason why a needs assessment is essential.

Know how you will evaluate your project during and after project completion.

Sharpen your project management skills by working on a multi-step project such as a grant. It takes proper planning and key players to get the project moving and keep it afloat. It is helpful to slice the project into pieces, foster patience, and develop comfort in working on multi-year projects.

It is helpful to have leadership that supports and aids in all phases of the grant project. Try to find support from administration or from community/department partnerships.  Find a mentor or someone seasoned in writing and overseeing grants in or outside of your organization.

Read the RFP carefully and contact funder with well-thought out questions if needed. It is important to have your questions and comments written down to lessen multiple emails or calls.  Asking the right questions informs you if the proposal is right for a particular RFP.

Build a strong team that is invested in the project and the communities served. It is wonderful to share aspects of the project in order to avoid burnout.

Find sources for funding:

David Rosenthal: Glyn Moody on Open Access

planet code4lib - Mon, 2016-06-20 15:00
At Ars Technica, Glyn Moody writes Open access: All human knowledge is there—so why can’t everybody access it?, a long (9 "page") piece examining this question:

What's stopping us? That's the central question that the "open access" movement has been asking, and trying to answer, for the last two decades. Although tremendous progress has been made, with more knowledge freely available now than ever before, there are signs that open access is at a critical point in its development, which could determine whether it will ever succeed.

It is a really impressive, accurate, detailed and well-linked history of how we got into the mess we're in, and a must-read despite the length. Below the fold, a couple of comments.

Moody writes:
In addition, academics are often asked to work on editorial boards of academic journals, helping to set the overall objectives, and to deal with any issues that arise requiring their specialised knowledge. ... these activities require both time and rare skills, and academics generally receive no remuneration for supplying them.

and:

The skewed nature of power in this industry is demonstrated by the fact that the scientific publishing divisions of leading players like Elsevier and Springer consistently achieve profit margins between 30 percent and 40 percent—levels that are rare in any other industry. The sums involved are large: annual revenues generated from English-language scientific, technical, and medical journal publishing worldwide were about $9.4bn (£6.4bn) in 2011.

This understates the problem. It is rumored that the internal margins on many journals at the big publishers are around 90%, and that the biggest cost for these journals is the care and feeding of the editorial board. Co-opting senior researchers onto editorial boards that provide boondoggles is a way of aligning their interests with those of the publisher. This matters because senior researchers do not pay for the journals, but in effect control the resource allocation decisions of University librarians, who are the publisher's actual customers. The Library Loon understands the importance of this:

The key aspect of Elsevier’s business model that it will do its level best to retain in any acquisitions or service launches is the disconnect between service users and service purchasers.

Moody does understand the economic problems of hybrid journals:

the Council of the European Union has just issued a call for full open access to scientific research by 2020. In its statement it "welcomes open access to scientific publications as the option by default for publishing the results of publicly funded [EU] research," but also says this move should "be based on common principles such as transparency, research integrity, sustainability, fair pricing and economic viability." Although potentially a big win for open access in the EU, the risk is this might simply lead to more use of the costly hybrid open access, as has happened in the UK.

but doesn't note the degraded form of open access they provide. Articles can be "free to read" while being protected by "registration required", restrictive copyright licenses, and robots.txt. Even if gold open access eventually became the norm, the record of science leading up to that point would still be behind a paywall.

Compliance with open access mandates has been poor, for example the Wellcome Trust reported that:
Elsevier and Wiley have been singled out as regularly failing to put papers in the right open access repository and properly attribute them with a creative commons licence.
This was a particular problem with so-called hybrid journals, which contain a mixture of open access and subscription-based articles.
More than half of articles published in Wiley hybrid journals were found to be “non-compliant” with depositing and licensing requirements, an analysis of 2014-15 papers funded by Wellcome and five other medical research bodies found.
For Elsevier the non-compliance figure was 31 per cent for hybrid journals and 26 per cent for full open access.

An entire organization called CHORUS has had to be set up to monitor compliance, especially with the US OSTP mandate. Note that it is paid for and controlled by the publishers, kind of like the fox guarding the hen-house.

David Rosenthal: Glyn Moody on Open Access

planet code4lib - Mon, 2016-06-20 15:00
At Ars Technica, Glyn Moody writes Open access: All human knowledge is there—so why can’t everybody access it? , a long (9 "page") piece examining this question:
What's stopping us? That's the central question that the "open access" movement has been asking, and trying to answer, for the last two decades. Although tremendous progress has been made, with more knowledge freely available now than ever before, there are signs that open access is at a critical point in its development, which could determine whether it will ever succeedIt is a really impressive, accurate, detailed and well-linked history of how we got into the mess we're in, and a must-read despite the length. Below the fold, a couple of comments.

Moody writes:
In addition, academics are often asked to work on editorial boards of academic journals, helping to set the overall objectives, and to deal with any issues that arise requiring their specialised knowledge. ... these activities require both time and rare skills, and academics generally receive no remuneration for supplying them.and:
The skewed nature of power in this industry is demonstrated by the fact that the scientific publishing divisions of leading players like Elsevier and Springer consistently achieve profit margins between 30 percent and 40 percent—levels that are rare in any other industry. The sums involved are large: annual revenues generated from English-language scientific, technical, and medical journal publishing worldwide were about $9.4bn (£6.4bn) in 2011.This understates the problem. It is rumored that the internal margins on many journals at the big publishers are around 90%, and that the biggest cost for these journals is the care and feeding of the editorial board. Co-opting senior researchers onto editorial boards that provide boondoggles is a way of aligning their interests with those of the publisher. This matters because senior researchers do not pay for the journals, but in effect control the resource allocation decisions of University librarians, who are the publisher's actual customers. The Library Loon understands the importance of this:
The key aspect of Elsevier’s business model that it will do its level best to retain in any acquisitions or service launches is the disconnect between service users and service purchasers.Moody does understand the economic problems of hybrid journals:
the Council of the European Union has just issued a call for full open access to scientific research by 2020. In its statement it "welcomes open access to scientific publications as the option by default for publishing the results of publicly funded [EU] research," but also says this move should "be based on common principles such as transparency, research integrity, sustainability, fair pricing and economic viability." Although potentially a big win for open access in the EU, the risk is this might simply lead to more use of the costly hybrid open access, as has happened in the UK.but doesn't note the degraded form of open access they provide. Articles can be "free to read" while being protected by "registration required", restrictive copyright licenses, and robots.txt. Even if gold open access eventually became the norm, the record of science leading up to that point would still be behind a paywall.

Compliance with open access mandates has been poor, for example the Wellcome Trust reported that:
Elsevier and Wiley have been singled out as regularly failing to put papers in the right open access repository and properly attribute them with a creative commons licence.
This was a particular problem with so-called hybrid journals, which contain a mixture of open access and subscription-based articles.
More than half of articles published in Wiley hybrid journals were found to be “non-compliant” with depositing and licensing requirements, an analysis of 2014-15 papers funded by Wellcome and five other medical research bodies found.
For Elsevier the non-compliance figure was 31 per cent for hybrid journals and 26 per cent for full open access.

An entire organization called CHORUS has had to be set up to monitor compliance, especially with the US OSTP mandate. Note that it is paid for and controlled by the publishers, kind of like the fox guarding the hen-house.

District Dispatch: OITP releases entrepreneurship white paper

planet code4lib - Mon, 2016-06-20 14:40

Photo by Reynermedia via Flickr

In recent District Dispatch posts on National Start-Up Day Across America, the George Washington University Entrepreneurship Research and Policy Conference and National Small Business Week, I’ve made the case that libraries are indispensable contributors to the entrepreneurship ecosystem. These posts offer broad-brush descriptions of the library community’s value and potential in the entrepreneurship space. Nowhere, however – on the District Dispatch or elsewhere – has the multifaceted support libraries offer entrepreneurs been comprehensively surveyed and elucidated… Until now.

Today, OITP released “The People’s Incubator: Libraries Propel Entrepreneurship” (.pdf), a 21-page white paper that describes libraries as critical actors in the innovation economy and urges decision makers to work more closely with the library community to boost American enterprise. The paper is rife with examples of library programming, activities and collaborations from across the country, including:

  • classes, mentoring and networking opportunities developed and hosted by libraries;
  • dedicated spaces and tools (including 3D printers and digital media suites) for entrepreneurs;
  • collaborations with the U.S. Small Business Administration (SBA), SCORE and more;
  • access to and assistance using specialized business databases;
  • business plan competitions;
  • guidance navigating copyright, patent and trademark resources; and
  • programs that engage youth in coding and STEM activities.

One goal for this paper is to stimulate new opportunities for libraries, library professionals and library patrons to drive the innovation economy forward. Are you aware of exemplary entrepreneurial programs in libraries that were not captured in this report? If so, please let us know by commenting on this post or writing to me.

As part of our national public policy work, we want to raise awareness of the ways modern libraries are transforming their communities for the better through the technologies, collections and expertise they offer all comers.

This report also will be used as background research in our policy work in preparation for the new Presidential Administration.  In fact, look for a shorter, policy-focused supplement to The People’s Incubator to be released this summer.

ALA and OITP thank all those who contributed to the report, and congratulate the many libraries across the country that provide robust entrepreneurship support services.

The post OITP releases entrepreneurship white paper appeared first on District Dispatch.

Mark E. Phillips: Comparing Web Archives: EOT2008 and EOT2012 – Curator Intent

planet code4lib - Mon, 2016-06-20 14:30

This is another post in a series that I’ve been doing to compare the End of Term Web Archives from 2008 and 2012.  If you look back a few posts in this blog you will see some other analysis that I’ve done with the datasets so far.

One thing that I am interested in understanding is how well the group that conducted the EOT crawls did in relation to what I'm calling "curator intent". For both EOT archives, suggested seeds were collected using instances of the URL Nomination Tool hosted by the UNT Libraries. Bulk lists of seed URLs collected by various institutions and individuals were combined with individual nominations made by users of the nomination tool. The resulting lists were used as seed lists for the crawlers that harvested the EOT archives. In 2008 four institutions crawled content: the Internet Archive (IA), the Library of Congress (LOC), the California Digital Library (CDL), and the UNT Libraries (UNT). In 2012 CDL was not able to do any crawling, so just IA, LOC and UNT crawled. UNT and LOC had limited scope in what they were interested in crawling, while CDL and IA took the entire seed list and used that to feed their crawlers. The crawls were scoped very wide so that they would capture as much content as possible: the nominated seeds were used as starting places, and the crawlers were allowed to go to all subdomains and paths on those sites as well as to areas that the sites linked to on other domains.

During the capture period there wasn't consistent quality control performed for the crawls; we accepted what we could get and went on with our business.

Looking back at the crawling that we did, I was curious about two things:

  1. How many of the domain names from the nomination tool were not present in the EOT archive?
  2. How many domains from .gov and .mil were captured but not explicitly nominated?
EOT2008 Nominated vs Captured Domains.

In the 2008 nominated URL list from the URL Nomination Tool there were a total of 1,252 domains, with 1,194 being either .gov or .mil. In the EOT2008 archive there were a total of 87,889 domains, and 1,647 of those were either .gov or .mil.

There are 943 domains that are present in both the 2008 nomination list and the EOT2008 archive.  There are 251 .gov or .mil domains from the nomination list that were not present in the EOT2008 archive. There are 704 .gov or .mil domains that are present in the EOT2008 archive but that aren’t present in the 2008 nomination list.
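
The comparison itself comes down to set operations over two domain lists. Here is a minimal sketch in Python of how counts like those above could be reproduced; the file names and the last-two-labels shortcut for reducing hosts to registered domains are my own assumptions, not part of the EOT tooling:

    from urllib.parse import urlparse

    def registered_domain(host):
        # Reduce a hostname to its registered domain (www.gao.gov -> gao.gov).
        # Keeping the last two labels is a simplification that is good enough
        # for .gov and .mil hosts.
        return ".".join(host.split(".")[-2:])

    def gov_mil_domains(path, urls=True):
        # Read one seed URL (or bare hostname) per line and return the set of
        # .gov/.mil registered domains it contains.
        domains = set()
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                host = (urlparse(line).netloc if urls else line).lower().rstrip(".")
                if host.endswith((".gov", ".mil")):
                    domains.add(registered_domain(host))
        return domains

    # Placeholder file names: a dump of the 2008 nomination-tool seeds and a
    # list of domains extracted from the EOT2008 indexes.
    nominated = gov_mil_domains("eot2008_nominations.txt")
    captured = gov_mil_domains("eot2008_domains.txt", urls=False)

    print("nominated and captured:   ", len(nominated & captured))
    print("nominated, never captured:", len(nominated - captured))
    print("captured, never nominated:", len(captured - nominated))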

Below is a chart showing nominated vs. captured counts for the .gov and .mil domains.

2008 .gov and .mil Nominated and Archived

Of those 704 domains that were captured but never nominated, here are the thirty most prolific.

Domain                     URLs
womenshealth.gov           168,559
dccourts.gov               161,289
acquisition.gov            102,568
america.gov                89,610
cfo.gov                    83,846
kingcounty.gov             61,069
pa.gov                     42,955
dc.gov                     28,839
inl.gov                    23,881
nationalservice.gov        22,096
defenseimagery.mil         21,922
recovery.gov               17,601
wa.gov                     14,259
louisiana.gov              12,942
mo.gov                     12,570
ky.gov                     11,668
delaware.gov               10,124
michigan.gov               9,322
invasivespeciesinfo.gov    8,566
virginia.gov               8,520
alabama.gov                6,709
ct.gov                     6,498
idaho.gov                  6,046
ri.gov                     5,810
kansas.gov                 5,672
vermont.gov                5,504
arkansas.gov               5,424
wi.gov                     4,938
illinois.gov               4,322
maine.gov                  3,956

I see quite a few state and local governments with .gov domains, which were out of scope for the EOT project, but there are also a number of legitimate in-scope domains in the list that were never nominated.
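
The per-domain URL counts in the table above can be produced with a similar pass over a one-URL-per-line listing of the archive, tallying with a Counter and keeping only domains that never appeared in the nomination list. Again the file name is a placeholder, and the sketch reuses the nominated set built above:

    from collections import Counter
    from urllib.parse import urlparse

    def top_unnominated(url_list_path, nominated, n=30):
        # Tally captured URLs by registered .gov/.mil domain and return the n
        # most prolific domains that never appeared in the nomination list.
        counts = Counter()
        with open(url_list_path) as f:
            for line in f:
                host = urlparse(line.strip()).netloc.lower()
                if not host.endswith((".gov", ".mil")):
                    continue
                domain = ".".join(host.split(".")[-2:])
                if domain not in nominated:
                    counts[domain] += 1
        return counts.most_common(n)

    # "eot2008_urls.txt" stands in for a one-URL-per-line dump of the archive.
    for domain, url_count in top_unnominated("eot2008_urls.txt", nominated):
        print("{}\t{:,}".format(domain, url_count))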

EOT2012 Nominated vs Captured Domains.

In the 2012 nominated URL list from the URL Nomination Tool there were a total of 1,674 domains, with 1,551 of those being .gov or .mil domains. In the EOT2012 archive there were a total of 186,214 domains, and 1,944 of those were either .gov or .mil.

There are 1,343 domains that are present in both the 2012 nomination list and the EOT2012 archive.  There are 208 .gov or .mil domains from the nomination list that were not present in the EOT2012 archive. There are 601 .gov or .mil domains that are present in the EOT2012 archive but that aren't present in the 2012 nomination list.

Below is a chart showing nominated vs. captured counts for the .gov and .mil domains.

2012 .gov and .mil Domains Nominated and Archived

Of those 601 domains that were captured but never nominated, here are the thirty most prolific.

Domain                     URLs
gao.gov                    952,654
vaccines.mil               856,188
esgr.mil                   212,741
fdlp.gov                   156,499
copyright.gov              70,281
congress.gov               40,338
openworld.gov              31,929
americaslibrary.gov        18,415
digitalpreservation.gov    17,327
majorityleader.gov         15,931
sanjoseca.gov              10,830
utah.gov                   9,387
dc.gov                     9,063
nyc.gov                    8,707
ng.mil                     8,199
ny.gov                     8,185
wa.gov                     8,126
in.gov                     8,011
vermont.gov                7,683
maryland.gov               7,612
medicalmuseum.mil          7,135
usbg.gov                   6,724
virginia.gov               6,437
wv.gov                     6,188
compliance.gov             6,181
mo.gov                     6,030
idaho.gov                  5,880
nv.gov                     5,709
ct.gov                     5,628
ne.gov                     5,414

Again there are a number of state and local government domains present in the list, but up at the top we see quite a few URLs harvested from domains that are federal in nature and would fit within the collection scope for the EOT project.

How did we do?

The way that seeds were collected for the EOT2008 and EOT2012 nomination lists introduced a bit of dirty data. We would need to look a little deeper to see what the issues were with these. Some things that come to mind: we got seeds from domains that existed prior to 2008 or 2012 but that no longer existed when we were harvesting, and there could have been typos in the nominated URLs, so we never grabbed the suggested content. We might want to introduce a validation process for the nomination tool that lets us know what the status of a URL in a project is at a given point in time, so that we at least have some sort of record.
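
A rough sketch of what such a validation step might look like (this is not a feature of the URL Nomination Tool; the requests library and the example seeds are just stand-ins) would record each seed's HTTP status with a timestamp:

    import csv
    import datetime

    import requests  # third-party package, assumed to be installed

    def record_seed_status(seed_urls, out_path="seed_status.csv"):
        # Check each nominated URL and record its HTTP status at a point in
        # time, so dead, moved, or mistyped seeds can be flagged before (and
        # after) a crawl.
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["url", "checked_at", "status"])
            for url in seed_urls:
                checked_at = datetime.datetime.utcnow().isoformat()
                try:
                    response = requests.head(url, allow_redirects=True, timeout=10)
                    status = response.status_code
                except requests.RequestException as exc:
                    status = type(exc).__name__  # e.g. ConnectionError for a vanished domain
                writer.writerow([url, checked_at, status])

    # Example seeds; the second is an intentional typo to show how it is recorded.
    record_seed_status(["https://www.gao.gov/", "http://wwww.example.gov/"])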
