Open Library Data Additions: Amazon Crawl: part il

planet code4lib - Tue, 2016-05-03 14:05

Part il of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 2-ac

planet code4lib - Tue, 2016-05-03 14:02

Part 2-ac of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 2-ah

planet code4lib - Tue, 2016-05-03 13:55

Part 2-ah of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 18

planet code4lib - Tue, 2016-05-03 13:54

Part 18 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 2-ak

planet code4lib - Tue, 2016-05-03 13:48

Part 2-ak of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part cr

planet code4lib - Tue, 2016-05-03 13:46

Part cr of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 11

planet code4lib - Tue, 2016-05-03 13:39

Part 11 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part o-3

planet code4lib - Tue, 2016-05-03 13:35

Part o-3 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 15

planet code4lib - Tue, 2016-05-03 13:34

Part 15 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 10

planet code4lib - Tue, 2016-05-03 13:28

Part 10 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part o-6

planet code4lib - Tue, 2016-05-03 13:23

Part o-6 of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Data, Metadata, Text

Open Library Data Additions: Amazon Crawl: part 2-aa

planet code4lib - Tue, 2016-05-03 13:21

Part 2-aa of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

Library of Congress: The Signal: The Harvard Library Digital Repository Service

planet code4lib - Tue, 2016-05-03 13:19

This is a guest post by Julie Siefert.

The Charles River between Boston and Cambridge. Photo by Julie Siefert.

As part of the National Digital Stewardship Residency, I am assessing the Harvard Library Digital Repository Service, comparing it to the ISO 16363 standard for trusted digital repositories (which is similar to TRAC). The standard is made up of over 100 individual metrics that address various aspects of a repository, everything from financial planning to ingest workflows.

The Harvard Digital Repository Service (DRS) provides long-term preservation of and access to materials from over fifty libraries, archives and museums at Harvard. It has been in production for about fifteen years. The next generation of the DRS, with increased preservation capabilities, was recently launched, so this is an ideal time to evaluate the DRS and consider how it might be improved in the future. I hope to identify areas needing new policies and/or documentation and, in doing so, help the DRS improve its services. The DRS staff also hope to eventually seek certification as a trusted digital repository, and this project will prepare them for that process.

When I started the project, my first step was to become familiar with the ISO 16363 standard. I read through it several times and tried to parse out the meaning of the metrics. Sometimes this was straightforward and I found a metric easy to understand. For others, I had to read through a few times before I fully understood what the metric was asking for. I also found it helpful to write notes on what each metric meant, putting it in my own words. I read about other people’s experiences performing audits, which was very helpful and gave me some ideas about how to go about the process. In particular, I found David Rosenthal’s blog posts about the CLOCKSS self-audit helpful, as that audit used the same standard, ISO 16363.

Inspired by the CLOCKSS audit, I created a Wiki with a different page for each metric. On these pages, I copied the text from the standard and included space for my notes. I also created an Excel sheet to help track my findings. In the Excel sheet, I gave each metric its own row and, in that row, a column about documentation and a column linking to the Wiki. (I blogged more about the organization process.)

I reviewed the DRS documentation, interviewed staff members about the metrics, and asked them to point me to relevant documentation. I realized that many of the actions required by the metrics were being performed at Harvard, but these actions and policies weren’t documented. Everyone in the organization knew that they happened, but sometimes no one had written them down. In my notes, I indicated when something was being done but not documented versus when something was not being done at all. In the Excel sheet I used a green/yellow/red color scheme for the different metrics, with yellow indicating things that were done but not documented.

The assessment was the most time-consuming part. In thinking about how best to summarize and report on my findings, I am looking for commonalities among the gap areas. It’s possible that many of the gaps are similar and that several could be filled with a single piece of documentation. For example, many of the “yellow” areas have to do with ingest workflows, so perhaps a single document about that workflow could fill all of these gaps at once. I hope that finding the commonalities among the gaps will help the DRS fill them effectively and efficiently.

Open Library Data Additions: Amazon Crawl: part 2-ag

planet code4lib - Tue, 2016-05-03 13:14

Part 2-ag of Amazon crawl..

This item belongs to: data/ol_data.

This item has files of the following types: Data, Archive BitTorrent, Data, Metadata, Text

DuraSpace News: NOW AVAILABLE: Fedora 4.5.1 Release

planet code4lib - Tue, 2016-05-03 00:00

From David Wilcox, Fedora Product Manager, on behalf of the Fedora team.

Austin, TX – The Fedora team is proud to announce that Fedora 4.5.1 was released on April 29, 2016. Full release notes are included below and are also available on the wiki: https://wiki.duraspace.org/display/FF/Fedora+4.5.1+Release+Notes.

M. Ryan Hess: AI First

planet code4lib - Mon, 2016-05-02 22:54

Looking to the future, the next big step will be for the very concept of the “device” to fade away. Over time, the computer itself—whatever its form factor—will be an intelligent assistant helping you through your day. We will move from mobile first to an AI first world.

Google Founder’s Letter, April 2016

My Library recently finalized a Vision Document for our virtual library presence. Happily, our vision was aligned with the long-term direction of technology as understood by movers and shakers like Google.

As I’ve written previously, the Library Website will disappear. But this is because the Internet (as we currently understand it) will also disappear.

In its place, a new mode of information retrieval and creation will move us away from the paper-based metaphor of web pages. Information will be more ubiquitous. It will be more free-form, more adaptable, more contextualized, more interactive.

Part of this is already underway. For example, people are becoming data sets, and apps are learning about you and changing how they work based on who you are. Your personal data set contains location data, patterns in your speech and movement around the world, consumer history, keywords particular to your interests, associations based on your social networks, and so on.

AI Emerging

All of this information makes it possible for emerging AI systems like Siri and Cortana to better serve you. Soon, it will allow AI to control the flow of information based on your mood and other factors to help you be more productive. And like a good friend that knows you very, very well, AI will even be able to alert you to serendipitous events or inconveniences so that you can navigate life more happily.

People’s expectations are already being set for this kind of experience. Perhaps you’ve noticed yourself getting annoyed when your personal assistant just fetches a Wikipedia article when you ask it something. You’re left wanting: what we want is the kernel of gold we asked about, but what we get right now is something too general to be useful.

But soon, that will all change. Nascent AI will be able to provide exactly the piece of information you really want rather than a generalized web page. This is what Google means when it makes statements like “AI First” or “the Web will die.” They’re talking about a world where information is no longer presented only as article-like web pages but is broken down into actual kernels of information that are discrete yet interconnected.

AI First in the Library

Library discussions often focus on building better web pages, navigation menus, or responsive websites. But the conversation we need to have is about pulling our data out of siloed systems and websites and making it available to every mode of access: AI, apps and basic data harvesters.

You hear this conversation in bits and pieces. The ongoing linked data project is part of this long-term strategy. So too with next-gen OPACs. But on the ground, in our local strategy meetings, we need to tie every big project we do to this emerging reality where web browsers are increasingly no longer relevant.
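To make that concrete, here is a minimal sketch of what freeing a record from its silo can look like: a single catalog record expressed as schema.org JSON-LD, the kind of discrete but interconnected kernel an AI assistant or data harvester could consume directly. Every field, identifier and URL in it is invented for the example.

```python
import json

# A hypothetical catalog record expressed as schema.org JSON-LD.
# Every field and identifier here is invented for illustration.
record = {
    "@context": "http://schema.org",
    "@type": "Book",
    "@id": "http://library.example.org/catalog/12345",
    "name": "An Example Title",
    "author": {"@type": "Person", "name": "Jane Example"},
    "datePublished": "1999",
    "isbn": "9780000000002",
    # sameAs links the local record into the wider graph of data.
    "sameAs": "http://www.worldcat.org/oclc/000000000",
}

# Served directly (or embedded in a <script type="application/ld+json"> tag),
# this is data a harvester can use without scraping a human-oriented page.
print(json.dumps(record, indent=2))
```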

We need to think AI First.


LITA: LITA ALA Annual Precon: Technology Tools and Transforming Librarianship

planet code4lib - Mon, 2016-05-02 20:19

Sign up for this fun, informative, and hands-on ALA Annual preconference.

Technology Tools and Transforming Librarianship
Friday June 24, 2016, 1:00 – 4:00 pm
Presenters: Lola Bradley, Reference Librarian, Upstate University; Breanne Kirsch, Coordinator of Emerging Technologies, Upstate University; Jonathan Kirsch, Librarian, Spartanburg County Public Library; Rod Franco, Librarian, Richland Library; Thomas Lide, Learning Engagement Librarian, Richland Library

Register for ALA Annual and Discover Ticketed Events

Technology envelops every aspect of librarianship, so it is important to keep up with new technology tools and find ways to use them to improve services and better help patrons. This hands-on, interactive preconference will teach six to eight technology tools in detail and show attendees the resources to find out about 50 free technology tools that can be used in all libraries. There will be plenty of time for exploration of the tools, so please BYOD! You may also want to bring headphones or earbuds.

Lola Bradley is a Public Services Librarian at the University of South Carolina Upstate Library. Her professional interests include instructional design, educational technology, and information literacy for all ages.

Breanne Kirsch is a Public Services Librarian at the University of South Carolina Upstate Library. She is the Coordinator of Emerging Technologies at Upstate and the founder and current Co-Chair of LITA’s Game Making Interest Group.

Jonathan Kirsch is the Head Librarian at the Pacolet Library Branch of the Spartanburg County Public Libraries. His professional interests include emerging technology, digital collections, e-books, publishing, and programming for libraries.

Rod Franco is a Librarian at Richland Library in Columbia, South Carolina. Technology has always been at the forefront of his library-related endeavors.

Thomas Lide is the Learning Engagement Librarian at Richland Library in Columbia, South Carolina. He helps to pave a parallel path of learning for community members and colleagues.

More LITA Preconferences at ALA Annual
Friday June 24, 2016, 1:00 – 4:00 pm

  • Digital Privacy and Security: Keeping You And Your Library Safe and Secure In A Post-Snowden World
  • Islandora for Managers: Open Source Digital Repository Training

Cost:

LITA Member: $205
ALA Member: $270
Non Member: $335

Registration Information

Register for the 2016 ALA Annual Conference in Orlando FL

Discover Ticketed Events

Questions or Comments?

For all other questions or comments related to the preconference, contact LITA at (312) 280-4269 or Mark Beatty, mbeatty@ala.org.

District Dispatch: ALA, Harry Potter Alliance make it easy to advocate

planet code4lib - Mon, 2016-05-02 16:31

The American Library Association (ALA) joined the Harry Potter Alliance in launching “Spark,” an eight-part video series developed to support and guide first-time advocates who are interested in advocating at the federal level for issues that matter to them. The series, targeted to viewers aged 13–22, will be hosted on the YouTube page of the Harry Potter Alliance, while librarians and educators are encouraged to use the videos to engage young people or first time advocates. The video series was launched today during the 42nd annual National Library Legislative Day in Washington, D.C.

The video series provides supporting information for inexperienced grassroots advocates, covering everything from setting up in-person legislator meetings to the process of constructing a campaign. By breaking down oft-intimidating “inside the Beltway” language, Spark provides an accessible set of tools that can activate and motivate young advocates for the rest of their lives. The video series also includes information on writing press releases, staging social media campaigns, using library resources for research or holding events, and best practices for contacting elected officials.

“We are pleased to launch Spark, a series of interactive advocacy videos. We hope that young or new advocates will be inspired to start their own campaigns, and that librarians and educators will be able to use the series to engage young people and get them involved in advocacy efforts,” said Emily Sheketoff, executive director of the American Library Association’s Washington Office.

Janae Phillips, Chapters Director for the Harry Potter Alliance, added, “I’ve worked with youth for many years now, and I’ve never met a young person who just really didn’t want to get involved – they just weren’t sure how! I think this is true for adults who have never been involved in civic engagement before, too. I hope that Spark will be a resource for people who have heard a lot about getting engaged in the political process but have never been sure where to start, and hopefully—dare I say—spark some new ideas and action.”

The post ALA, Harry Potter Alliance make it easy to advocate appeared first on District Dispatch.

Access Conference: Review Process for Proposals Now Underway

planet code4lib - Mon, 2016-05-02 15:31

The Call for Proposals closed last week. A big thank you to all the eager participants.

The review and selection process is now underway. The committee has its work cut out for it, as there are many great submissions. We also have a few interesting ideas up our sleeves.

It is shaping up to be an excellent conference!

Mark E. Phillips: DPLA Description Fields: More statistics (so many graphs)

planet code4lib - Mon, 2016-05-02 14:30

In the past few posts we looked at the length of the description fields in the DPLA dataset as a whole and at the provider/hub level.

The length of the description field isn’t the only value that was indexed for this work. In fact, I indexed a variety of different values for each of the descriptions in the dataset.

Below are the fields I am currently working with, with example values from a single description.

Field Indexed            Value Example
dpla_id                  11fb82a0f458b69cf2e7658d8269f179
id                       11fb82a0f458b69cf2e7658d8269f179_01
provider_s               usc
desc_order_i             1
description_t            A corner view of the Santa Monica City Hall.; Streetscape. Horizontal photography.
desc_length_i            82
tokens_ss                “A”, “corner”, “view”, “of”, “the”, “Santa”, “Monica”, “City”, “Hall”, “Streetscape”, “Horizontal”, “photography”
token_count_i            12
average_token_length_f   5.5833335
percent_int_f            0
percent_punct_f          0.048780486
percent_letters_f        0.81707317
percent_printable_f      1
percent_special_char_f   0
token_capitalized_f      0.5833333
token_lowercased_f       0.41666666
percent_1000_f           0.5
non_1000_words_ss        “santa”, “monica”, “hall”, “streetscape”, “horizontal”, “photography”
percent_5000_f           0.6666667
non_5000_words_ss        “santa”, “monica”, “streetscape”, “horizontal”
percent_en_dict_f        0.8333333
non_english_words_ss     “monica”, “streetscape”
percent_stopwords_f      0.25
has_url_b                FALSE
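The field names follow Solr-style dynamic-field suffixes (_i integer, _f float, _ss multi-valued string, _t text, _b boolean). As a rough sketch of how a record like this could be computed and indexed, assuming a local Solr core and the pysolr client (the actual pipeline isn’t shown in the post, so treat this as a reconstruction):

```python
import re
import string

import pysolr  # assumed client library; the actual indexing code isn't shown

def index_fields(dpla_id, order, provider, desc):
    """Compute a few of the per-description metrics from the table above."""
    tokens = re.findall(r"\w+", desc)
    length = len(desc)
    return {
        "id": "%s_%02d" % (dpla_id, order),
        "dpla_id": dpla_id,
        "provider_s": provider,
        "desc_order_i": order,
        "description_t": desc,
        "desc_length_i": length,
        "tokens_ss": tokens,
        "token_count_i": len(tokens),
        "average_token_length_f":
            sum(len(t) for t in tokens) / float(len(tokens)) if tokens else 0.0,
        "percent_punct_f":
            sum(c in string.punctuation for c in desc) / float(length) if length else 0.0,
        "has_url_b": "http://" in desc or "https://" in desc,
    }

# Hypothetical local core URL, for illustration only.
solr = pysolr.Solr("http://localhost:8983/solr/dpla_descriptions")
solr.add([index_fields("11fb82a0f458b69cf2e7658d8269f179", 1, "usc",
                       "A corner view of the Santa Monica City Hall.; "
                       "Streetscape. Horizontal photography.")])
```

Run against the example description, this reproduces the values in the table above (82 characters, 12 tokens, an average token length of about 5.58, and a punctuation percentage of 4/82 ≈ 0.0488).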

This post will try to pull together some of the data from the fields listed above and present it in a way that we can hopefully use to derive some meaning.

More Description Length Discussion

In the previous posts I’ve primarily focused on the length of the description fields. Two other indexed fields are related to length: the number of tokens in a description and the average token length.

I’ve included those values below, with two mean values for each field: one for all of the descriptions in the dataset (17,884,946 descriptions) and one for just the descriptions that are at least one character in length (13,771,105 descriptions).

Field                    Mean – Total   Mean – 1+ length
desc_length_i            83.321         108.211
token_count_i            13.346         17.333
average_token_length_f   3.866          5.020

The graphs below are based only on the descriptions that are at least one character in length.

This first graph, reused from a previous post, shows the average description length by provider/hub. David Rumsey and the Getty are the two that average over 250 characters per description.

Average Description Length by Hub

It shouldn’t surprise you that David Rumsey and the Getty are two of the providers/hubs with the highest average token counts, since longer descriptions generally produce more tokens. A few cases don’t match this pattern, though: USC, which averages just over 50 characters of description length, comes in as the third highest in average token count at over 40 tokens per description. A few other providers/hubs also look a bit different than their average description length would suggest.

Average Token Count by Provider

Below is a graph of the average token length by provider. The lower the number, the shorter the average token. The mean for the entire DPLA dataset, for descriptions of length 1+, is just over 5 characters.

Average Token Length by Provider

That’s all I have to say about the various length-related statistics for this post. I swear! Next we move on to some of the other metrics that I calculated when indexing things.

Other Metrics for the Description Field

Throughout this analysis I kept running into the question of how to take into account the millions of records in the dataset that have no description present. I couldn’t just throw that fact away, but I didn’t know exactly what to do with those records either. So below I present statistics for many of the fields I indexed, both as the mean over all of the descriptions and as the mean over just the descriptions that are one or more characters in length. The graphs that follow the table are all based on the subset of descriptions that are at least one character long.

Field                    Mean – Total   Mean – 1+ length
percent_int_f            12.368%        16.063%
percent_punct_f          4.420%         5.741%
percent_letters_f        50.730%        65.885%
percent_printable_f      76.869%        99.832%
percent_special_char_f   0.129%         0.168%
token_capitalized_f      26.603%        34.550%
token_lowercased_f       32.112%        41.705%
percent_1000_f           19.516%        25.345%
percent_5000_f           31.591%        41.028%
percent_en_dict_f        49.539%        64.338%
percent_stopwords_f      12.749%        16.557%

Stopwords

Stopwords are words that occur very commonly in natural language. I used a list of 127 stopwords for this work to help understand what percentage of a description (based on tokens) is made up of stopwords. While stopwords generally carry little meaning, they are a good indicator of natural language, so providers/hubs with a higher percentage of stopwords probably have more descriptions that resemble natural language.
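As a rough sketch of the calculation (my reconstruction; the exact 127-word list isn’t given in the post, so the abbreviated list below is an assumption, in the spirit of NLTK’s English stopwords):

```python
import re

# Abbreviated stand-in for a 127-word English stopword list (e.g. NLTK's);
# the exact list used in the analysis is an assumption here.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "has", "in", "is", "it", "of", "on", "that", "the", "to", "with"}

def percent_stopwords(description):
    """Fraction of tokens that are stopwords (0.0 for empty descriptions)."""
    tokens = [t.lower() for t in re.findall(r"\w+", description)]
    if not tokens:
        return 0.0
    return sum(t in STOPWORDS for t in tokens) / float(len(tokens))

# 3 of 9 tokens ("a", "of", "the") are stopwords -> ~0.33
print(percent_stopwords("A corner view of the Santa Monica City Hall."))
```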

Percent Stopwords by Provider

Punctuation

I was curious how much punctuation is present in a description on average. I used the following characters as my set of “punctuation characters”:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

I counted the number of characters in a description that were drawn from this set and divided that count by the total description length to get the percentage of the description that is punctuation.
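In Python, that 32-character set is exactly string.punctuation, so a minimal sketch of the calculation (a reconstruction, not the post’s actual code) looks like this; the same helper with string.digits or string.ascii_letters gives the integer and letter percentages discussed below:

```python
import string

def percent_class(description, charset):
    """Fraction of characters in description that belong to charset."""
    if not description:
        return 0.0
    return sum(c in charset for c in description) / float(len(description))

# string.punctuation is exactly the 32-character set listed above.
desc = ("A corner view of the Santa Monica City Hall.; "
        "Streetscape. Horizontal photography.")
print(percent_class(desc, string.punctuation))  # 4 of 82 chars -> ~0.0488
```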

Percent Punctuation by Provider

Punctuation is common in natural language, but it occurs relatively infrequently. For example, that last sentence was eighty characters long and only one of them was punctuation (the period at the end), which comes to a percent_punctuation of only 1.25%. In the graph above you will see that the bhl provider/hub has over 50% of its descriptions at 25-49% punctuation. That’s very high compared to the other hubs and to the roughly 5% average across the DPLA dataset. Digital Commonwealth has a percentage of descriptions that are 50-74% punctuation, which is pretty interesting as well.

Integers

Next up in our list of things to look at is the percentage of the description field that consists of integers. For review, integers are digits, like the following:

0123456789

I used the same process for the percent integer as I did for the percent punctuation mentioned above.

Percent Integer by Provider

You can see that a couple of providers/hubs have quite a high percent integer for their descriptions: bhl and the smithsonian. The smithsonian has over 70% of its descriptions with a percent integer of over 70%.

Letters

Once we’ve looked at punctuation and integers, that leaves really just letters of the alphabet to make up the rest of a description field.

That’s exactly what we will look at next. For this I used the following characters to define letters:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

I didn’t perform any diacritic folding, so letters with diacritics aren’t counted as letters in this analysis, but we will look at those a little bit later.

Percent Letter by Provider

For percent letters you would expect a very high percentage of the descriptions to themselves contain a high percentage of letters. Generally this appears to be true, but there are some odd providers/hubs again, mainly bhl and the smithsonian, though nypl, kdl and gpo also seem to have a different distribution of letters than the others in the dataset.

Special Characters

The next thing to look at was the percentage of “special characters” used in a description. I used the following definition: if a character is not present in the list below (which also includes whitespace characters), then it is considered a “special character”.

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 
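Using the same approach (again a reconstruction), a special character is anything outside that allowed set, which in Python terms is the union of string.digits, string.ascii_letters, string.punctuation and string.whitespace:

```python
import string

# ASCII alphanumerics, punctuation and whitespace; anything else
# (accented letters, curly quotes, etc.) counts as "special".
ALLOWED = set(string.digits + string.ascii_letters +
              string.punctuation + string.whitespace)

def percent_special(description):
    """Fraction of characters falling outside the allowed set."""
    if not description:
        return 0.0
    return sum(c not in ALLOWED for c in description) / float(len(description))

print(percent_special("Café exterior, São Paulo"))  # 2 of 24 chars -> ~0.083
```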

Percent Special Character by Provider

A note on reading the graph above: keep in mind that the y-axis only spans 95-100%, so while USC looks different here, only 3% of its descriptions are 50-100% special characters. These are most likely descriptions whose metadata was created in a non-English language.

URLs

The final graph I want to look at in this post shows the percentage of descriptions for each provider/hub that have a URL present. I used the presence of either http:// or https:// in the description to decide whether it does or doesn’t have a URL.
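The check itself is a simple substring test; a sketch of it:

```python
def has_url(description):
    """True when the description appears to contain a URL."""
    return "http://" in description or "https://" in description

print(has_url("Finding aid: http://example.org/ead/123"))          # True
print(has_url("A corner view of the Santa Monica City Hall."))     # False
```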

Percent URL by Provider

The majority of providers/hubs don’t have URLs in their descriptions, with a few obvious exceptions. The providers/hubs washington, mwdl, harvard, gpo and david_rumsey do have a reasonable number of descriptions with URLs, with washington leading: almost 20% of its descriptions have a URL present.

Again, this analysis only looks at what high-level information about the descriptions can tell us. The only metric we’ve looked at that actually goes into the content of the description field to pull out a little bit of meaning is percent stopwords. I have one more post in this series before we wrap things up, and then we will leave descriptions in the DPLA alone for a bit.

If you have questions or comments about this post, please let me know via Twitter.
