Planet Code4Lib - http://planet.code4lib.org
Updated: 13 hours 10 min ago

District Dispatch: Join March 6 free webinar on mapping inclusion: Public library technology and community needs

Mon, 2015-03-02 20:15

As economic, education, health and other disparities grow, equitable access to and participation in the online environment is essential for success. And yet, communities and individuals find themselves at differing levels of readiness in their ability to access and use the Internet, engage a range of digital technologies and get and create digital content.

The Digital Inclusion Survey examines the efforts of public libraries to address these readiness gaps by providing free access to broadband, public access technologies, digital content, digital literacy training and a range of programming that helps build digitally inclusive communities. A new interactive mapping tool places these library resources in a community context, including unemployment and education rates.

Join researchers and data visualization experts at a free webinar on March 6, 1-2 p.m. EST, to explore the intersections of public access technologies and education, employment, health & wellness, digital literacy, e-government and inclusion. Speakers will share new tools and demonstrate how to locate and interpret national and state-level results from the survey for planning and advocacy purposes, as well as present cases for the interactive mapping tool, with suggestions for creating a digital inclusion snapshot of your public library.

The survey is funded by the Institute of Museum and Library Services (IMLS) and conducted by the ALA Office for Research & Statistics and the Information Policy & Access Center (iPAC) at the University of Maryland. The International City/County Management Association and the ALA Office for Information Technology Policy are grant partners.

Learn more about the webinar and speakers from iPAC, Community Attributes, IMLS and OITP here.

The post Join March 6 free webinar on mapping inclusion: Public library technology and community needs appeared first on District Dispatch.

District Dispatch: Rising to the newest (Knight) challenge

Mon, 2015-03-02 20:02

DC Public Library in Washington, D.C. Photo by Maxine Schnitzer Photography.

It has been said that “libraries are the cornerstone of our democracy” so the newest Knight News Challenge on Elections should be right up our alley. From candidate forums to community conversations, about half of all public libraries report to the Digital Inclusion Survey that they host community engagement events. What is your library doing that you might want to expand or what new innovative idea would you like to seed? Knight is inviting all kinds of ideas: “We see democratic engagement as more than just the act of voting. It should be embedded in every part of civic life…”

So—what’s your best idea for: How might we better inform voters and increase civic participation before, during and after elections?

There are several ways you can participate and learn more:

  1. Check out and comment on the growing number of applications. Which of these could best help address issues you see and hear in your community and your library? On a quick scan, I could definitely see a library or libraries as partners for the Knowledge Swap Market, or a similar project, for instance. Also—how might an application be made stronger and more useful? You don’t have to be an applicant to contribute to the conversation, and comments are accepted through April 13.
  2. BUT—you should definitely consider applying! With more than $3 million available, a wide-open invitation to interpret the question as you see fit, encouragement to partner with others, and the opportunity to get feedback from others to improve your application, there’s a lot to be gained in participating.
  3. Learn more about the whole process at “virtual office hours” open Tuesday, March 3, from 1-2 p.m. Eastern Time and on Tuesday, March 17, from 1-2 p.m. ET. Information about these virtual office hours and in-person events in cities across the country can be accessed here. I attended the event in D.C., and it was a great opportunity to meet people and make connections for possible collaboration.

The challenge is a collaboration between the John S. and James L. Knight Foundation, a leading funder of news and media innovation, and three other foundations: the Democracy Fund, the Rita Allen Foundation and the William & Flora Hewlett Foundation. Winners will receive a share of more than $3 million, which includes up to $250,000 from the Democracy Fund.

This news challenge and the recent NetGain challenge are great opportunities to gain visibility and support for library projects working to address community needs and challenges in innovative ways. These invitations to engage with other community and national stakeholders also resonate with the emerging national policy agenda for libraries and the Aspen Institute report (pdf) on re-envisioning public libraries.

I hope you’ll consider joining the conversation. If so, please leave a note here in comments, so others can look for your proposal.

The post Rising to the newest (Knight) challenge appeared first on District Dispatch.

Islandora: Community Contributor Kudos: Diego Pino

Mon, 2015-03-02 16:12

It has been a while since we have done a Community Contributor Kudos post, but if anyone is worthy of reviving the feature, it is this week's subject: Diego Pino.

Diego is a freelance developer who specializes in addressing the needs of the scientific community with open source solutions. Right now he is also working as an IT Project Manager for a project that aims to build a national biodiversity network, funded by the Chilean government. If you have gone to the listserv with a question in the past several months, you will also recognize him as one of the most helpful troubleshooters in the Islandora community - pretty remarkable given that he only started using the software about a year ago:

Islandora is still new for me and still amazes me. All started about a year ago. I was given the task to find a way of storing and sharing Biodiversity occurrence records, and thus build a federated network that could help scientists to collaborate and share research data. The primary need was to move data to GBIF for storage, described with Darwin Core metadata, so I started researching what was going on in terms of preserving digital content for science. Until then I thought everything could be solved using a relational database and some custom coding (how wrong I was!)

He started by exploring eSciDoc (created by Matthias Razum), a project based on Fedora 3.x. It was designed to address a need that Diego had been working on for some time: how to involve researchers and scientists directly in the process of sharing and curating their own data. This, and the project’s own documentation, sold Diego on Fedora 3.x, but he wanted more - not only the ability to ingest and preserve digital content, but a fully working framework/API that would allow him to focus on the user experience.

And then I found the Islandora’s google forum and it was exactly what I needed: A big and nice community of human beings, with problems similar to mine, and with an incredible piece of software, a.k.a. Islandora. I must admit the learning curve was hard; some needed things were not developed and I had to add to my new knowledge Drupal, Solr and Web Semantics (my favourite subject right now), but the community was great and helpful, and meeting Giancarlo Birello was an inspiration to keep working and also to help other users on the forum. I have received so much; giving a little back is a must.

Currently Diego is developing and managing a four-repo configuration, with each running a stock Islandora 7.x-1.4/Fedora 3.7.1, using an external Tomcat and other goodies, but sharing a common Solr Cloud index. As Diego describes it, "one collection, many shards, many replicas." He had to fine-tune the way objects were indexed to avoid duplicated PIDs and to be able to distinguish during search which repo an object lives in. The repos are also running his Redbiodiversidad Solution Pack, which handles Darwin Core-based objects, maps, EML, and GBIF DC archives, and the Ontologies Solution Pack, which allows objects to be related by multiple overlapping ontologies and which Diego is particularly proud of.

My favourite thing about this configuration is that I can search across all existing repos and their collections, use existing solution packs like PDF or Scholar to describe publications and people, relate local objects to remote ones, and build nice linked data graphs. These expand the notion of plain, independent metadata records encapsulated in objects, to a fully new dimension for us (maybe exaggerating here!) that is helping local scientists to understand their data in a more ample context: in my opinion the needed transition from information to knowledge.

A very simple and trivial example. A Chilean scientist can now discover what other biological occurrences (associated species) are found near a place where they made a discovery; who found them, when, under which method, and filter by many parameters in a few steps or clicks, thanks to the Solr search module + linked data. They can expand their knowledge, collaborate, and manage their own research data in ways their previous workflows (excel?) did not allow. And my favourite part: if something is not working as expected I can fix it using Islandora’s API. There are so many nice hooks available and more to come.

As for projects coming down the pipeline, Diego is working on a new visual workflow to ingest and manage relationships between objects, reusing the way the Ontologies SP currently displays a linked object graph. The end goal is to allow people to interactively add new objects, connect them using rules present in multiple OWLs, and finally save this new "knowledge" representation as a whole. Essentially, every ontology becomes a programmable workflow. Using this system will maintain a consistent network of repositories with well-related objects, while still giving users control of their data. He has promised the community an OCR editor, which remains high on his TODO list. As an active member of the Fedora 4 Interest Group, Diego is also involved in planning and developing the next generation of Islandora (and taking a stand for those who don't want to see XML Forms vanish into the night).

Diego does all of this amazing work from his home office in a little village named Pucón in southern Chile, nestled next to an active volcano and a lake. He credits this environment with giving him the peace to code - that, and his small herd of dogs:

Lastly, none of this work using Islandora could have been done without the great support of the community and the also very important support and patience of my wife and my 4 Dogs, who by this time already hate ontologies.

His Red Biodiversidad repo is still in development, but a beta site is online, showing Solr results from their cloud, fetched from the real repos' collections. And here is one of those collections, full of biological data and growing all the time. You can find more of Diego's work on his GitHub page, and you can usually find him making the Islandora community better one solution at a time on our listserv (it's quite remarkable how many search results for 'diego' in our Google Group turn up some variation of the phrase "thanks, Diego").

Someone in Diego's family is a remarkable photographer, so when I asked him to send along a photo I could use with this blog so the community could put a face to all of those awesome listserv posts, it was difficult to choose. I leave it to the community to decide which image best suits Diego Pino: Programmer on a Mountain or Man Hugs Dog:

        

LibUX: WordPress for Libraries

Mon, 2015-03-02 14:44

Amanda and Michael are teaching simultaneous online classes on WordPress for Libraries – at least sixty hours' worth of tutorials for beginners and developers. Back to back, these classes take you from using WordPress out of the box to create and manage a library website through the custom development of an event management plugin.

Using WordPress to Build Library Websites

WordPress is an open-source content management system that helps you create, design, and maintain a website. Its intuitive interface means that there's no need to learn complex programming languages — and because it's free, you can do away with purchasing expensive web development software. This course will guide you in applying WordPress tools and functionality to library content. You will learn the nuts and bolts of building a library website that is both user friendly and easy to maintain. Info

Advanced WordPress

WordPress is an incredible out-of-the-box tool, but libraries with ambitious web services will find it needs to be customized to meet their unique needs. This course is built around a single project: the ground-up development of an event management plugin, which will provide a thorough understanding of WordPress as a framework–hooks, actions, methods–that can be used to address pressing and ancillary issues like content silos and the need to C.O.P.E. – create once, publish everywhere. Info

Format

American Library Association eCourses are asynchronous, with mixed-media materials available online and at no additional cost. So, you don't have to get a textbook. You can usually proceed at your own pace and submit material through the forums, unless the facilitator changes it up — and we probably won't, unless it makes sense to keep the class proceeding together. Both of our courses are six weeks, beginning March 16, 2015 – but we want you to squeeze as much as you can out of these classes, so we are available to explain, walk through, and answer questions for as long as you need. We really want you to walk away with real-world applicable skills.

The post WordPress for Libraries appeared first on LibUX.

Ed Summers: Repetition

Mon, 2015-03-02 14:00

To be satisfied with repeating, with traversing the ruts which in other conditions led to good, is the surest way of creating carelessness about present and actual good.

John Dewey in Human Nature and Conduct (p. 67).

Mark E. Phillips: DPLA Metadata Analysis: Part 4 – Normalized Subjects

Mon, 2015-03-02 13:43

This is yet another post in the series DPLA Metadata Analysis that already has three parts, here are links to part one, two and three.

This post looks at what is the effect of basic normalization of subjects on various metrics mentioned in the previous posts.

Background

One of the things that happens in library land is that subject headings are often constructed by connecting various broader pieces into a single subject string that becomes more specific. For example, the heading “Children–Texas.” is constructed from two different pieces, “Children” and “Texas”. If we had a record that was about children in Oklahoma, it could be represented as “Children–Oklahoma.”.

The analysis I did earlier took the subject exactly as it occurred in the dataset and used that for the analysis. I was asked what would happen if we normalized the subjects before analyzing them, effectively turning the unique string “Children–Texas.” into the two subject pieces “Children” and “Texas”, and then applied the previous analysis to the new data. The specific normalization consists of stripping trailing periods and then splitting on double hyphens.

Note: Because this conversion can introduce quite a bit of duplication into the subjects within a record, I am making the normalized subjects unique before adding them to the index. I also apply this same method to the un-normalized subjects. In doing so I noticed that the item that previously had the most subjects, at 1,476, was reduced to 1,084 because 347 values appeared in the subject list more than once. Because of this, the numbers in the resulting tables will be slightly different from those in the first three posts when it comes to average subjects and total subjects; each of these values should go down.
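To make the transformation concrete, here is a minimal sketch of the normalization just described (in Python; this is an illustration, not the actual analysis code): strip any trailing period, split on the double-hyphen delimiter, and keep only the unique pieces for each record.

```python
# Minimal sketch of the subject normalization described above (illustrative only).
def normalize_subjects(subjects):
    """Strip trailing periods, split on '--', and de-duplicate the pieces."""
    normalized = set()
    for subject in subjects:
        cleaned = subject.strip().rstrip(".")   # "Children--Texas." -> "Children--Texas"
        for piece in cleaned.split("--"):       # -> ["Children", "Texas"]
            piece = piece.strip()
            if piece:
                normalized.add(piece)           # the set keeps each piece only once
    return sorted(normalized)

print(normalize_subjects(["Children--Texas.", "Children--Oklahoma."]))
# ['Children', 'Oklahoma', 'Texas']
```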

Predictions

My predictions before the analysis are that we will see an increase in the number of unique subjects,  a drop in the number of unique subjects per Hub for some Hubs, and an increase in the number of shared subjects across Hubs.

Results

With the normalization of subjects, the number of unique subject headings fell from 1,871,884 before normalization to 1,162,491 after, a reduction of 38%.

In addition to the 38% reduction in the total number of unique subject headings noted above, the distribution of subjects across the Hubs changed significantly, in one case increasing by 443%. The table below displays these numbers before and after normalization as well as the percentage change.

# of Hubs with Subject | # of Subjects | # of Normalized Subjects | % Change
1 | 1,717,512 | 1,055,561 | -39%
2 | 114,047 | 60,981 | -47%
3 | 21,126 | 20,172 | -5%
4 | 8,013 | 9,483 | 18%
5 | 3,905 | 5,130 | 31%
6 | 2,187 | 3,094 | 41%
7 | 1,330 | 2,024 | 52%
8 | 970 | 1,481 | 53%
9 | 689 | 1,080 | 57%
10 | 494 | 765 | 55%
11 | 405 | 571 | 41%
12 | 302 | 453 | 50%
13 | 245 | 413 | 69%
14 | 199 | 340 | 71%
15 | 152 | 261 | 72%
16 | 117 | 205 | 75%
17 | 63 | 152 | 141%
18 | 62 | 130 | 110%
19 | 32 | 77 | 141%
20 | 20 | 55 | 175%
21 | 7 | 38 | 443%
22 | 7 | 23 | 229%
23 | 0 | 2 | N/A

The two subjects that are shared across 23 of the Hubs once normalized are “Education” and “United States”.

The high level stats for all 8,012,390 records are available in the following table.

Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change
8,012,390 | 23,860,080 | 28,644,188 | 2.98 | 3.57 | 20.05%

You can see the total number of subjects went up 20% after they were normalized, and the number of subjects per record increased from just under three per record to a little over three and a half normalized subjects per record.

Results by Hub

The table below presents data for each hub in the DPLA.  The columns are the number of records, total subjects, total normalized subjects, the average number of subjects per record, the average number of normalized subjects per record, and finally the percent of change that is represented.

Hub | Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change
ARTstor | 56,342 | 194,883 | 202,220 | 3.46 | 3.59 | 3.76
Biodiversity Heritage Library | 138,288 | 453,843 | 452,007 | 3.28 | 3.27 | -0.40
David Rumsey | 48,132 | 22,976 | 22,976 | 0.48 | 0.48 | 0
Digital Commonwealth | 124,804 | 295,778 | 336,935 | 2.37 | 2.7 | 13.91
Digital Library of Georgia | 259,640 | 1,151,351 | 1,783,884 | 4.43 | 6.87 | 54.94
Harvard Library | 10,568 | 26,641 | 36,511 | 2.52 | 3.45 | 37.05
HathiTrust | 1,915,159 | 2,608,567 | 4,154,244 | 1.36 | 2.17 | 59.25
Internet Archive | 208,953 | 363,634 | 412,640 | 1.74 | 1.97 | 13.48
J. Paul Getty Trust | 92,681 | 32,949 | 43,590 | 0.36 | 0.47 | 32.30
Kentucky Digital Library | 127,755 | 26,008 | 27,561 | 0.2 | 0.22 | 5.97
Minnesota Digital Library | 40,533 | 202,456 | 211,539 | 4.99 | 5.22 | 4.49
Missouri Hub | 41,557 | 97,111 | 117,933 | 2.34 | 2.84 | 21.44
Mountain West Digital Library | 867,538 | 2,636,219 | 3,552,268 | 3.04 | 4.09 | 34.75
National Archives and Records Administration | 700,952 | 231,513 | 231,513 | 0.33 | 0.33 | 0
North Carolina Digital Heritage Center | 260,709 | 866,697 | 1,207,488 | 3.32 | 4.63 | 39.32
Smithsonian Institution | 897,196 | 5,689,135 | 5,686,107 | 6.34 | 6.34 | -0.05
South Carolina Digital Library | 76,001 | 231,267 | 355,504 | 3.04 | 4.68 | 53.72
The New York Public Library | 1,169,576 | 1,995,817 | 2,515,252 | 1.71 | 2.15 | 26.03
The Portal to Texas History | 477,639 | 5,255,588 | 5,410,963 | 11 | 11.33 | 2.96
United States Government Printing Office (GPO) | 148,715 | 456,363 | 768,830 | 3.07 | 5.17 | 68.47
University of Illinois at Urbana-Champaign | 18,103 | 67,954 | 85,263 | 3.75 | 4.71 | 25.47
University of Southern California. Libraries | 301,325 | 859,868 | 905,465 | 2.85 | 3 | 5.30
University of Virginia Library | 30,188 | 93,378 | 123,405 | 3.09 | 4.09 | 32.16

The number of unique subjects before and after subject normalization is presented in the table below.  The percent of change is also included in the final column.

Hub | Unique Subjects | Unique Normalized Subjects | % Change
ARTstor | 9,560 | 9,546 | -0.15
Biodiversity Heritage Library | 22,004 | 22,005 | 0
David Rumsey | 123 | 123 | 0
Digital Commonwealth | 41,704 | 39,557 | -5.15
Digital Library of Georgia | 132,160 | 88,200 | -33.26
Harvard Library | 9,257 | 6,210 | -32.92
HathiTrust | 685,733 | 272,340 | -60.28
Internet Archive | 56,911 | 49,117 | -13.70
J. Paul Getty Trust | 2,777 | 2,560 | -7.81
Kentucky Digital Library | 1,972 | 1,831 | -7.15
Minnesota Digital Library | 24,472 | 24,325 | -0.60
Missouri Hub | 6,893 | 6,757 | -1.97
Mountain West Digital Library | 227,755 | 172,663 | -24.19
National Archives and Records Administration | 7,086 | 7,086 | 0
North Carolina Digital Heritage Center | 99,258 | 79,353 | -20.05
Smithsonian Institution | 348,302 | 346,096 | -0.63
South Carolina Digital Library | 23,842 | 17,516 | -26.53
The New York Public Library | 69,210 | 36,709 | -46.96
The Portal to Texas History | 104,566 | 97,441 | -6.81
United States Government Printing Office (GPO) | 174,067 | 48,537 | -72.12
University of Illinois at Urbana-Champaign | 6,183 | 5,724 | -7.42
University of Southern California. Libraries | 65,958 | 64,021 | -2.94
University of Virginia Library | 3,736 | 3,664 | -1.93

The number and percentage of subjects and normalized subjects that are unique and also unique to a given hub is presented in the table below.

Hub | Subjects Unique to Hub | Normalized Subjects Unique to Hub | % Subjects Unique to Hub | % Normalized Subjects Unique to Hub | % Change
ARTstor | 4,941 | 4,806 | 52 | 50 | -4
Biodiversity Heritage Library | 9,136 | 6,929 | 42 | 31 | -26
David Rumsey | 30 | 28 | 24 | 23 | -4
Digital Commonwealth | 31,094 | 27,712 | 75 | 70 | -7
Digital Library of Georgia | 114,689 | 67,768 | 87 | 77 | -11
Harvard Library | 7,204 | 3,238 | 78 | 52 | -33
HathiTrust | 570,292 | 200,652 | 83 | 74 | -11
Internet Archive | 28,978 | 23,387 | 51 | 48 | -6
J. Paul Getty Trust | 1,852 | 1,337 | 67 | 52 | -22
Kentucky Digital Library | 1,337 | 1,111 | 68 | 61 | -10
Minnesota Digital Library | 17,545 | 17,145 | 72 | 70 | -3
Missouri Hub | 4,338 | 3,783 | 63 | 56 | -11
Mountain West Digital Library | 192,501 | 134,870 | 85 | 78 | -8
National Archives and Records Administration | 3,589 | 3,399 | 51 | 48 | -6
North Carolina Digital Heritage Center | 84,203 | 62,406 | 85 | 79 | -7
Smithsonian Institution | 325,878 | 322,945 | 94 | 93 | -1
South Carolina Digital Library | 18,110 | 9,767 | 76 | 56 | -26
The New York Public Library | 52,002 | 18,075 | 75 | 49 | -35
The Portal to Texas History | 87,076 | 78,153 | 83 | 80 | -4
United States Government Printing Office (GPO) | 105,389 | 15,702 | 61 | 32 | -48
University of Illinois at Urbana-Champaign | 3,076 | 2,322 | 50 | 41 | -18
University of Southern California. Libraries | 51,822 | 48,889 | 79 | 76 | -4
University of Virginia Library | 2,425 | 1,134 | 65 | 31 | -52

Conclusion

Overall there was an increase (20%) in the total occurrences of subject strings in the dataset when subject normalization was applied, while the total number of unique subjects decreased significantly (38%). It is easy to identify Hubs that are heavy users of LCSH subject headings, because the percent change in their number of unique subjects before and after normalization is quite high; examples include the HathiTrust and the Government Printing Office. For many of the Hubs, normalization of subjects significantly reduced the number and percentage of subjects that were unique to that hub.

I hope you found this post interesting; if you want to chat about the topic, hit me up on Twitter.

Alf Eaton, Alf: Organising, building and deploying static web sites/applications

Mon, 2015-03-02 09:34
Build remotely

At the simplest end of the scale is GitHub Pages, which uses Jekyll to build the app on GitHub’s servers:

  • The config files and source code are in the root directory of a gh-pages branch.

  • Jekyll builds the source HTML/MD, CSS/SASS and JS/CS files to a _site directory - this is where the app is served from.

  • For third-party libraries, you can either download production-ready code manually to a lib folder and include them, or install with Bower to a bower_components folder and include them directly from there.

The benefit of this approach is that you can edit the source files through GitHub’s web interface, and the site will update without needing to do any local building or deployment.

Jekyll will build all CSS/SASS files (including those pulled in from bower_components) into a single CSS file. However, it doesn’t yet have something similar for JS/CoffeeScript. If this was available it would be ideal, as then the bower_components folder could be left out of the built app.

Directory structure of a Jekyll GitHub Pages app

Build locally, deploy the built app as a separate branch

If the app is being built locally, there are several steps that can be taken to improve the process:

  • Keep the config files in the root folder, but move the app’s source files into an app folder.

  • Use Gulp to build the Bower-managed third-party libraries alongside the app’s own styles and scripts.

  • While keeping the source files in the master branch, use Gulp to deploy the built app in a separate gh-pages branch.
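As a rough illustration of what that last deploy step amounts to without Gulp, here is a short Python sketch that publishes a built folder to a gh-pages branch with plain git; the dist folder name follows the Polymer example below, and this is not the actual Gulp task.

```python
# Rough sketch: publish an already-built and committed dist/ folder as gh-pages.
import subprocess

# git subtree publishes just the dist/ folder as the remote gh-pages branch;
# this mirrors what a "gulp deploy" task does for a project like this.
subprocess.run(
    ["git", "subtree", "push", "--prefix", "dist", "origin", "gh-pages"],
    check=True,
)
```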

A good example of this is the way that the Yeoman generator for scaffolding a Polymer app structures a project (other Yeoman generators are similar):

  • In the master branch, install/build-related files are in the root folder (run npm install and bower install to fetch third-party components, use bower link for any independent local components).

  • The actual app source files (index.html, app styles, app-specific elements) are in the app folder.

  • gulp builds all the HTML, CSS/SASS and JS source files to the dist folder; gulp serve makes the built files available over HTTP and reloads on changes; gulp deploy pushes the dist folder to a remote gh-pages branch.

Directory structure of a Polymer app

Tools

District Dispatch: Reminder: Last chance to apply for Google summer fellowship

Mon, 2015-03-02 06:43

Google Policy Fellows

The American Library Association’s Washington Office is calling for graduate students, especially those in library and information science-related academic programs, to apply for the 2015 Google Policy Fellows program. Applications are due by March 12, 2015.

For the summer of 2015, the selected fellow will spend 10 weeks in residence at the ALA policy office in Washington, D.C., to learn about national policy and complete a major project. Google provides the $7,500 stipend for the summer, but the work agenda is determined by the ALA and the selected fellow. Throughout the summer, Google’s Washington office will provide an educational program for all of the fellows, such as lunchtime talks and interactions with Google Washington staff.

The fellows work in diverse areas of information policy that may include digital copyright, e-book licenses and access, future of reading, international copyright policy, broadband deployment, telecommunications policy (including e-rate and network neutrality), digital divide, access to information, free expression, digital literacy, online privacy, the future of libraries generally, and many other topics.

Margaret Kavaras, a recent graduate from the George Washington University, served as the 2014 ALA Google Policy Fellow. Kavaras was later appointed as an OITP Research Associate shortly after participating in the Google Fellowship program.

Further information about the program and host organizations is available at the Google Public Policy Fellowship website.

The post Reminder: Last chance to apply for Google summer fellowship appeared first on District Dispatch.

James Cook University, Library Tech: Readings & Past Exams/Reserver Online/Masterfile access issues

Sun, 2015-03-01 23:33
Not of interest to anyone outside of JCU, just using my blog to list the workarounds for a local issue:


David Rosenthal: Don't Panic

Sat, 2015-02-28 17:51
I was one of the crowd of people who reacted to Wednesday's news that Argonne National Labs would shut down the NEWTON Ask A Scientist service, online since 1991, this Sunday by alerting Jason Scott's ArchiveTeam. Jason did what I should have done before flashing the bat-signal. He fed the URL into the Internet Archive's Save Page Now, to be told "relax, we're all over it". The site has been captured since 1996, and the most recent capture before the announcement was February 7th. Jason arranged for further captures on Thursday and today.

As you can see by these examples, the Wayback Machine has a pretty good copy of the final state of the service and, as the use of Memento spreads, it will even remain accessible via its original URL.
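For anyone who wants to trigger the same kind of capture programmatically, a single request to the Wayback Machine's Save Page Now endpoint is enough. A minimal sketch (the example URL is a placeholder, and error handling is omitted):

```python
# Minimal sketch: ask the Wayback Machine's Save Page Now to capture a URL.
import requests

def save_page_now(url):
    response = requests.get("https://web.archive.org/save/" + url, timeout=60)
    response.raise_for_status()
    # When present, Content-Location points at the path of the new capture.
    return response.headers.get("Content-Location")

print(save_page_now("http://example.org/"))
```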

Hydra Project: Hydra Europe events Spring 2015

Sat, 2015-02-28 10:02

Registration is now open for two Hydra events in Europe this spring:

Hydra Europe Symposium – a free event for digital collection managers, collection owners and their software developers that will provide insights into how Hydra can serve your needs

  • Thursday 23rd April – Friday 24th April 2015 | LSE, London

Hydra Camp London – a training event enabling technical staff to learn about the Hydra technology stack so they can establish their own implementation

  • Monday 20th April – lunchtime Thursday 23rd April 2015 | LSE, London

Full details and booking arrangements for both events can be found here.

Mark E. Phillips: DPLA Metadata Analysis: Part 3 – Where to go from here.

Fri, 2015-02-27 23:31

This is the last of three posts about working with the Digital Public Library of America’s (DPLA) metadata to demonstrate some of the analysis that can be done using Solr and a little bit of time and patience. Here are links to the first and second post in the series.

What I wanted to talk about in this post is how we can use this data to help improve access to our digital resources in the DPLA, and also how we can measure that we've in fact improved when we spend resources, both time and money, on metadata work.

The first thing I think we need to do is to make an assumption to frame this conversation. For now let's say that the presence of subjects in a metadata record is a positive indicator of quality, and that for the most part a record with three or more subjects (controlled, keywords, whatever) improves access to resources in metadata aggregation systems like the DPLA, which don't have the benefit of full text for searching.

So out of the numbers we've looked at so far, which ones should we pay the most attention to?

Zero Subjects

For me it is the number of records that are already online but have zero subject headings. Going from 0 to 1 subject headings is much more of an improvement for access than going from 1-2, 2-3, 3-4, 4-8, or 8-15 subjects per record. So once we have all records with at least one subject we can move on. We can measure this directly with the metric for how many records have zero subjects that I introduced in the last post.

There are currently 1,827,276 records in the DPLA that have no subjects or keywords. This accounts for 23% of the DPLA dataset analyzed for these blog posts. I think this is a pretty straightforward area to work on related to metadata improvement.

Dead end subjects

One area we could work to improve is subjects that are used only once, either in the DPLA as a whole or within a single Hub. Reducing this number would allow for more avenues of navigation between records by connecting them via subject when available. There isn't anything bad about unique subject headings within a community, but if a record doesn't have a way to get you to like records (assuming there are like records within a collection), then it isn't as useful as one that connects you to more, similar items. There are of course many legitimate reasons that there is only one instance of a subject in a dataset, and I don't think we should strive to remove them completely, but reducing the number overall would be an indicator of improvement in my book.
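As a rough illustration of how both of these metrics can be pulled straight out of a Solr index like the one used for this series (the core and field names here are assumptions, not the actual schema), one query counts the records with no subjects at all and a facet query surfaces the subjects that occur only once:

```python
# Hypothetical sketch: core name and "subject" field are assumptions, not the real index.
import requests

SOLR = "http://localhost:8983/solr/dpla/select"

# Records with no subject field at all (the "zero subjects" metric).
zero = requests.get(SOLR, params={
    "q": "-subject:[* TO *]",   # documents missing the subject field entirely
    "rows": 0,
    "wt": "json",
}).json()
print("records with zero subjects:", zero["response"]["numFound"])

# Subjects used exactly once (the "dead end" subjects), via faceting.
counts = requests.get(SOLR, params={
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "subject",
    "facet.limit": -1,
    "facet.mincount": 1,
    "wt": "json",
}).json()["facet_counts"]["facet_fields"]["subject"]

# Solr returns facets as a flat [term, count, term, count, ...] list.
singletons = [counts[i] for i in range(0, len(counts), 2) if counts[i + 1] == 1]
print("subjects that appear only once:", len(singletons))
```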

In the last post I had a table that had the number of unique subjects and the number of subjects that were unique to a single Hub.  I was curious about the percentage of subjects from a Hub that were unique to just that Hub based on the number of unique subjects.  Here is that table.

Hub Name | Records | Unique Subjects | # of subjects unique to hub | % of subjects that are unique to hub
ARTstor | 56,342 | 9,560 | 4,941 | 52%
Biodiversity Heritage Library | 138,288 | 22,004 | 9,136 | 42%
David Rumsey | 48,132 | 123 | 30 | 24%
Digital Commonwealth | 124,804 | 41,704 | 31,094 | 75%
Digital Library of Georgia | 259,640 | 132,160 | 114,689 | 87%
Harvard Library | 10,568 | 9,257 | 7,204 | 78%
HathiTrust | 1,915,159 | 685,733 | 570,292 | 83%
Internet Archive | 208,953 | 56,911 | 28,978 | 51%
J. Paul Getty Trust | 92,681 | 2,777 | 1,852 | 67%
Kentucky Digital Library | 127,755 | 1,972 | 1,337 | 68%
Minnesota Digital Library | 40,533 | 24,472 | 17,545 | 72%
Missouri Hub | 41,557 | 6,893 | 4,338 | 63%
Mountain West Digital Library | 867,538 | 227,755 | 192,501 | 85%
National Archives and Records Administration | 700,952 | 7,086 | 3,589 | 51%
North Carolina Digital Heritage Center | 260,709 | 99,258 | 84,203 | 85%
Smithsonian Institution | 897,196 | 348,302 | 325,878 | 94%
South Carolina Digital Library | 76,001 | 23,842 | 18,110 | 76%
The New York Public Library | 1,169,576 | 69,210 | 52,002 | 75%
The Portal to Texas History | 477,639 | 104,566 | 87,076 | 83%
United States Government Printing Office (GPO) | 148,715 | 174,067 | 105,389 | 61%
University of Illinois at Urbana-Champaign | 18,103 | 6,183 | 3,076 | 50%
University of Southern California. Libraries | 301,325 | 65,958 | 51,822 | 79%
University of Virginia Library | 30,188 | 3,736 | 2,425 | 65%

Here is the breakdown when grouped by type of Hub, either Service Hub or Content Hub.

Hub Type | Records | Unique Subjects | Subjects unique to Hub Type | % of Subjects unique to Hub Type
Content Hubs | 5,736,178 | 1,311,830 | 1,253,769 | 96%
Service Hubs | 2,276,176 | 618,081 | 560,049 | 91%

Or another way to look at how the subjects are shared between the different types of Hubs is the following graph.

Subjects unique to and shared between Hub Types.

It appears that only a small number (3%) of subjects are shared between Hub types. Would increasing this number improve users' ability to discover resources from multiple Hubs?

More, More, More

Once we've looked at the ways mentioned above, I think we should work to raise the number of subjects per record within a given Hub. I don't think there is a magic number for everyone, but at UNT we try to have three subjects for each record whenever possible, so that's what we are shooting for. We can easily see improvement by looking at the mean and seeing if it goes up (even ever so slightly).

Next Steps

I think there is some work we could do to identify which records need specific kinds of subject work, based on more involved processing of the input records, but I'm going to leave that for another post and probably another flight somewhere to work on.

Hope you enjoyed these three posts and hope they resonate at least a bit with you.

Feel free to send me a note on Twitter if you have questions, comments, or ideas for me about this.

District Dispatch: Copyright Office should modernize its operation

Fri, 2015-02-27 21:06

Photo by Steve Akers

The U.S. House Judiciary Committee has mulled its way through 16 well-attended and sometimes contentious hearings on comprehensive copyright reform since 2013. Thursday's hearing—“The U.S. Copyright Office: Its Function and Resources”—sounds like one that keen copyright followers might think is mundane enough to skip, but they would be wrong. The Office's functions tremendously affect how libraries, businesses, authors, and other creators also operate, because the Office holds the official record of what works are protected and who holds the copyright. Plus there was a little excitement.

The witnesses were Keith Kupferschmid, Software & Information Industry Association general counsel; Lisa Dunner of Dunner Law on behalf of the American Bar Association; Nancy Mertzel of Schoeman Updike on behalf of the American Intellectual Property Law Association; and Bob Brauneis, George Washington University Law School professor.

In an unusual accord, the witnesses agreed, the Representatives agreed, probably everybody in the room agreed that the time is now for the modernization of the Copyright Office. Stuck in the 1970s, the Copyright Office is ill-equipped to manage its basic function—recording what works are protected by copyright and who holds the rights to those works. Based on their testimony, the panel agreed that the Copyright Office requires more technical expertise and additional resources to build a 21st century digital infrastructure for registration, recordation and search. The Judiciary committee member statements demonstrated that in light of the Office’s role in enabling an effective and efficient structure to enable transactions in the copyright industry— an industry that Representative Deutch said was worth over a trillion dollars—something must be done. A functioning, modern registration system would help alleviate the orphan works problem by making it possible to locate rights holders and track the provenance of copyrighted works. (Perhaps now all stakeholders can agree with the Library Copyright Alliance’s position that a full scale searchable and interoperable system that meets the needs of commerce, creators, and the public is preferable to an orphan works solution legislated by Congress).

Only Representative Zoe Lofgren went off script, asking cynically if modernization could also help the Copyright Office represent “a greater diversity of viewpoints?” Rep. Lofgren continued by saying the Office has made some “bone-headed mistakes.” Its strong endorsement of the Stop Online Piracy Act (SOPA) did not take into account the public interest, and resulted in an unprecedented backlash. A recorded 14 million people contacted Congress to protest SOPA and the world witnessed the first internet blackout campaign. Lofgren continued that it was the Copyright Office that recommended not extending the 1201 exemption for cell phone unlocking, leading to another public outcry, a “We The People” petition to the President, and the need for Congress to pass the “Unlocking Consumer Choice and Wireless Competition Act.” Lofgren was not done. At an earlier 1201 rulemaking, the Office advised against an exemption for circumvention of e-readers to enable text-to-speech functionality for people with print disabilities. Really? Thankfully, this decision was overturned by the Librarian of Congress in his final decision.

Lofgren, who represents constituents in Silicon Valley, has butted heads with the Copyright Office in the past. At an earlier hearing, Register Maria Pallante was quizzed by Lofgren over the purpose of the copyright. In the American Bar Association Landslide Magazine, Pallante was quoted as saying that “copyright is for the author first and the nation second.” Lofgren countered “it seems to me when you look at the Constitution, which empowers Congress to grant exclusive rights in creative works in order, and I quote, “to promote the progress of science and the useful arts.” It seems to me that the Constitution is very clear that copyright does not exist inherently for the author but for the benefit for society at large.”

It is in the public interest to provide the necessary resources and expertise to upgrade the Copyright Office's infrastructure. Whether the Copyright Office will be able to balance the interests of all parties for the benefit of society at large in other areas of copyright review is another, more fundamental matter. Protection of legacy business models and copyright enforcement continue to dominate policy discussions in both the legislative and executive branches of government, so the Copyright Office is not alone in its perspective that the public interest is a secondary matter. Let's hope that the public continues to pay close attention to the House Judiciary copyright review; it has been more than willing to speak up on its own behalf.

The post Copyright Office should modernize its operation appeared first on District Dispatch.

Chris Prom: Rights in the Digital Era

Fri, 2015-02-27 20:31

When working with electronic records, we often feel like we stand on shaky ground.

For example, when I teach my DAS course, some of the most difficult questions that course participants raise are related to rights: “How do we know what is copyrighted?” “How do we identify private materials?” “How can we provide access to ________”  “How can we track rights information?” “What do we do if we get sued?”

As archivists, we like to provide as much access as possible to the great things our repositories hold.  How can we do that and still sleep soundly at night?

These are the kinds of questions that led the SAA Publications Board to commission the newest entry in our Trends in Archives Practice Series, Rights in the Digital Era.    As with the courses in the SAA Digital Archives Specialist curriculum, I see the current and forthcoming works in this series as falling very much within the original spirit of this blog, making digital archives work practical and accessible.

Rights in the Digital Era, for instance, lays out risk management strategies and techniques you can use to provide responsible access to analog and digital collections while meeting legal and ethical obligations, whatever your professional status or repository profile. As Peter B. Hirtle notes in his introduction to the volume, “A close reading of the modules will provide you with the rights information that all archivists should know, whether you are a repository’s director, reference archivist, or processing assistant.”

  • Module 4: Understanding Copyright Law by Heather Briston – provides a short-but-sweet introduction to copyright law and unpublished materials, emphasizing practical steps that can be taken to make archives more accessible and useful.
  • Module 5: Balancing Privacy and Access in Manuscript Collections by Menzi Behrnd-Klodt – navigates the difficult terrain of personal privacy, developing a roadmap to the law and describing risk management strategies applicable for a range of repositories and collections.
  • Module 6: Balancing Privacy and Access in the Records of Organizations by Menzi Behrnd-Klodt – unpacks legal and ethical requirements for records that are restricted under law, highlighting hands-on techniques you can use to develop thoughtful access policies and pathways in a variety of repository settings.
  • Module 7: Managing Rights and Permissions by Aprille McKay – provides methods and steps you can take to control rights information, supplying many useful approaches that you can easily adapt to your own circumstances.

It gives me and the entire SAA Publications Board great pride to see the fruits of our authors’ thinking and of our society’s work emerge in high quality, peer reviewed, and attractively designed books.

But don’t just take my word for it—read Rights in the Digital Era or any of the dozens of other publications that comprise the SAA catalog of titles!

 

 

District Dispatch: School libraries can’t afford to wait

Fri, 2015-02-27 19:46

A student at the ‘Iolani School in Hawaii

Today, we are ending a very busy week of work to include school library provisions in the reauthorization of the Elementary and Secondary Education Act (ESEA). As we move forward on this very important legislative effort to secure federal funding for school libraries, we are asking library advocates, teachers, parents and students to ask their Senators to become co-sponsors of the SKILLS Act (S 312). We are encouraging those lucky few advocates who live in states where one of their Senators is on the Senate HELP Committee to call their Washington Office and ask their Senators to co-sponsor Senator Sheldon Whitehouse’s (D-RI) efforts to include the SKILLS Act in the Committee’s ESEA bill.

I’ve had many meetings in the Senate recently and none of the congressional staff members I’ve met with have heard from any library supporters… so PLEASE MAKE THESE CALLS. AND ASK EVERYONE ELSE YOU KNOW TO CALL TOO.

Students deserve to go to a school with an effective school library program – take action now!

The post School libraries can’t afford to wait appeared first on District Dispatch.

District Dispatch: ALA welcomes Alternative Spring Break student Natalie Yee

Fri, 2015-02-27 19:25

Natalie Yee

Next week, we welcome University of Michigan student Natalie Yee to the American Library Association (ALA) Washington Office, where she will learn about information-related fields as part of her university’s School of Information Alternative Spring Break program. Yee will conduct research on online resources and tools produced by the ALA Washington Office (under my guidance as the press officer).

Yee is currently working to earn a master’s degree in Information, with a focus in Human-Computer Interaction. She previously earned a bachelor’s degree in Anthropology with a minor in Asian Studies from Colorado College.

The University of Michigan Alternative Spring Break program creates the opportunity for students to engage in a service-oriented integrative learning experience; connects public sector organizations to the knowledge and abilities of students through a social impact project; and facilitates and enhances the relationship between the School and the greater community.

In addition to ALA, the students are hosted by other advocacy groups such as the Future of Music Coalition as well as federal agencies such as the Library of Congress, the Smithsonian Institution, and the National Archives. The students get a taste of work life here in D.C. and an opportunity to network with information professionals.

“We are pleased to support students from the University of Michigan’s spring break program,” said Alan S. Inouye, director of ALA’s Office for Information Technology Policy. “We look forward to working collaboratively with Yee in the next week.”

The post ALA welcomes Alternative Spring Break student Natalie Yee appeared first on District Dispatch.

M. Ryan Hess: Three Emerging Digital Platforms for 2015

Fri, 2015-02-27 15:43

‘Twas a world of limited options for digital libraries just a few short years back. Nowadays, however, the options are many more and the features and functionalities are truly groundbreaking.

Before I dive into some of the latest whizzbang technologies that have caught my eye, let me lay out the platforms we currently use and why we use them.

  • Digital Commons for our institutional repository. This is a simple yet powerful hosted repository service. It has customizable workflows built into it for managing and publishing online journals, conferences, e-books, media galleries and much more. And I’d emphasize the “service” aspect: included in the subscription are notable SEO power, robust publishing tools, reporting and stellar customer service; and, of course, you don’t have to worry about the technical upkeep of the platform.
  • CONTENTdm for our digital collections. There was a time when OCLC’s digital collections platform appeared to be on a development trajectory that would take it out of the clunky mire it was in, say, in 2010. They’ve made strides, but this has not kept up.
  • LUNA for restricted image reserve services. You and your faculty can build collections in this system popular with museums and libraries alike. Your collection also sits within the LUNA Commons, which means users of LUNA can take advantage of collections outside their institutions.
  • Omeka.net for online exhibits and digital humanities projects. The limited cousin to the self-hosted Omeka, this version is an easy way to launch multiple sites for your campus without having to administer multiple installs. But it has a limited number of plugins and options, so your users will quickly grow out of it.
The Movers and Shakers of 2015

There are some very interesting developments out there, so here is a brief overview of the three most ground-breaking, in my opinion.

PressForward

If you took Blog DNA and spliced it with Journal Publishing, you’d get a critter called PressForward: a WordPress plug-in that allows users to launch publications that approach publishing from a contemporary web publishing perspective.

There are a number of ways you can use PressForward, but the most basic publishing model it’s intended for starts with treating other online publications (RSS feeds from individuals, organizations, other journals) as sources of submissions. Editors can add external content feeds to their submission feed, which brings that content into their PressForward queue for consideration. Editors can then go through all the content that is brought in automatically from outside and decide whether to include it in their publication. And of course, locally produced content is also included if you’re so inclined.
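PressForward itself is a WordPress (PHP) plugin, but the aggregation model it is built around is easy to picture. The Python sketch below (using the feedparser library purely for illustration, with placeholder feed URLs) shows external feeds being pulled into a single candidate queue that editors would then review:

```python
# Illustration of the "external feeds become a submission queue" model.
# PressForward is a WordPress plugin; this is not its actual implementation.
import feedparser

feeds = [
    "https://example.org/author-blog/feed/",
    "https://example.org/partner-journal/feed/",
]

queue = []
for feed_url in feeds:
    parsed = feedparser.parse(feed_url)
    for entry in parsed.entries:
        queue.append({
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "source": parsed.feed.get("title", feed_url),
        })

# Editors review the queue and promote selected items into the publication.
for item in queue:
    print(item["source"], "->", item["title"])
```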

Examples of PressForward include:

Islandora

Built on Fedora Commons with a Drupal front-end layer, Islandora is a truly remarkable platform that is growing in popularity at a good clip. A few years back, I worked with a local consortium examining various platforms, and we looked at Islandora. At the time, there were no examples of the platform being put into use, and it felt more like an interesting concept than a tool we should recommend for our needs. Had we been looking at this today, I think it would have been our number one choice.

Part of the magic with Islandora is that it uses RDF triples to flatten your collections and items into a simple array of objects that can have unlimited relationships to each other. In other words, a single image can be associated with other objects that all relate as a single object (say a book of images) and that book object can be part of a collection of books object, or, in fact, be connected to multiple other collections. This is a technical way of saying that it’s hyper flexible and yet very simple.
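As a toy illustration of that flexibility (written with Python's rdflib, not Islandora's own code, and using Fedora 3 RELS-EXT-style predicates as an assumption about how the relationships are expressed), the same object can simply be asserted into as many parent objects or collections as needed:

```python
# Toy sketch of the RDF relationship model: one object, many memberships.
from rdflib import Graph, Namespace, URIRef

RELS_EXT = Namespace("info:fedora/fedora-system:def/relations-external#")

g = Graph()
image = URIRef("info:fedora/demo:image-1")
book = URIRef("info:fedora/demo:book-1")

# The image belongs to a book object, and that book belongs to two collections.
g.add((image, RELS_EXT.isMemberOf, book))
g.add((book, RELS_EXT.isMemberOfCollection, URIRef("info:fedora/demo:collection-a")))
g.add((book, RELS_EXT.isMemberOfCollection, URIRef("info:fedora/demo:collection-b")))

print(g.serialize(format="turtle"))
```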

And because Islandora is built on two widely used open source platforms, finding tech staff to help manage it is easy.

But if you don’t have the staff to run a Fedora-Drupal server, Lyrasis now offers hosted options that are just as powerful. In fact, one subscription model they offer allows you to have complete access to the Drupal back end if customization and development are important to you, but you don’t want to waste staff time on updates and monitoring/testing server performance.

Either way, this looks like a major player in this space and I expect it to continue to grow exponentially. That’s a good thing too, because some aspects of the platform are feeling a little “not ready for prime time.” The Newspaper solution pack, for example, while okay, is nowhere near as cool as what Veridian can currently do.

ArtStor’s SharedShelf

Rapid development has taken this digital image collection platform to a new level with promises of more to come. SharedShelf integrates the open web, including DPLA and Google Images, with their proprietary image database in novel ways that I think put LUNA on notice.

Like LUNA, SharedShelf allows institutions to build local collections that can contain copyrighted works to be used in classroom and research environments. But what sets it apart is that it allows users to also build beyond their institutions and push that content to the open web (or not depending on the rights to the images they are publishing).

SharedShelf also integrates with other ArtStor services such as their Curriculum Guides that allow faculty to create instructional narratives using all the resources available from ArtStor.

The management layer is pretty nice and works well with a host of schema.

And, oh, apparently audio and video support is on the way.


CrossRef: Update for CrossCheck Users: iThenticate bibliography exclusion issue fix

Fri, 2015-02-27 14:18

This is an update for CrossRef members who participate in the CrossCheck service, powered by iThenticate.

Users of the iThenticate system had reported that the bibliography/reference exclusion option did not work properly when tables were appended after the bibliography of a document and a cell of the table contained a bibliography keyword (i.e. references, works cited, bibliography).

A fix was released on February 16th 2015 for this issue, so that reference exclusion will work for documents that fit the criteria above. Please note that the enhancement applies to new manuscripts submitted to iThenticate from February 16th.

CrossCheck users should feel free to test out the bibliography exclusion feature, and do contact us if you have any questions, or if you continue to experience this problem on any new submissions to the service.

LITA: Librarians: We Open Access

Fri, 2015-02-27 12:00

Open Access (storefront). Credit: Flickr user Gideon Burton

In his February 11 post, my fellow LITA blogger Bryan Brown interrogated the definitions of librarianship. He concluded that librarianship amounts to a “set of shared values and duties to our communities,” nicely summarized in the ALA’s Core Values of Librarianship. These core values are access, confidentiality / privacy, democracy, diversity, education and lifelong learning, intellectual freedom, preservation, the public good, professionalism, service, and social responsibility. But the greatest of these is access, without which we would revert to our roots as monastic scriptoriums and subscription libraries for the literate elite.

Bryan experienced some existential angst given that he is a web developer and not a “librarian” in the sense of job title or traditional responsibilities–the ancient triad of collection development, cataloging, and reference. In contrast, I never felt troubled about my job, as my title is e-learning librarian (got that buzzword going for me, which is nice) and as I do a lot of mainstream librarian-esque things, especially camping up front doing reference or visiting classes doing information literacy instruction.

Meme by Michael Rodriguez using Imgflip

However, I never expected to become manager of electronic resources, systems, web redesign, invoicing and vendor negotiations, and hopefully a new institutional repository fresh out of library school. I did not expect to spend my mornings troubleshooting LDAP authentication errors, walking students through login issues, running cost-benefit analyses on databases, and training users on screencasting and BlackBoard.

But digital librarians like Bryan and myself are the new faces of librarianship. I deliver and facilitate electronic information access in the library context; therefore, I am a librarian. A web developer facilitates access to digital scholarship and library resources. A reference librarian points folks to information they need. An instruction librarian teaches people how to find and evaluate information. A cataloger organizes information so that people can access it efficiently. A collection developer selects materials that users will most likely desire to access. All of these job descriptions–and any others that you can produce–are predicated on the fundamental tenet of access, preferably open, necessarily free.

Democracy, diversity, and the public good is our vision. Our active mission is to open access to users freely and equitably. Within that mission lie intellectual freedom (open access to information regardless of moralistic or political beliefs), privacy (fear of publicity can discourage people from openly accessing information), preservation (enabling future users to access the information), and other values that grow from the opening of access to books, articles, artifacts, the web, and more.

The Librarians (Fair use – parody)

By now you will have picked up on my wordplay. The phrase “open access” (OA) typically refers to scholarly literature that is “digital, online, free of charge, and free of most copyright and licensing restrictions” (Peter Suber). But when used as a verb rather than an adjective, “open” means not simply the state of being unrestricted but also the action of removing barriers to access. We librarians must not only cultivate the open fields–the commons–but also strive to dismantle paywalls and other obstacles to access. Recall Robert Frost’s Mending Wall:

Before I built a wall I’d ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
Something there is that doesn’t love a wall,
That wants it down.’ I could say ‘Elves’ to him…

Or librarians, good sir. Or librarians.
