
Islandora: Guest Blog: So you want to get started working on CLAW?

planet code4lib - Tue, 2016-05-31 13:54

We have a very exciting entry for this week's Islandora update: a guest blog from Islandora CLAW sprinter Ben Rosner (Barnard College). Ben has been following along with the project for a while now and started his first sprint last month. He has been an awesome addition to the team and he was kind enough to share his experiences as a 'newbie' to CLAW, and explain how you can join in too:

So you want to get started working on CLAW?

Lessons learned from a first-time contributor, hopefully to ease your transition into the land of CLAW.

Foremost, before any listing of resources, pages, and things to understand: please know that IRC is the place to be. Even if the channel is quiet, someone is ALWAYS lurking and will help you through any issue. Without the #islandora channel on irc.freenode.net I'm not sure this beginner would have had the fortitude to stick with what can sometimes be challenging, but very rewarding, work. Also, the weekly CLAW calls are great if you have the time, even if just to listen in. I didn't understand a flippin' thing the first time I joined, but I stuck with it as time allowed and it has paid off. Lastly, check out the Google Groups, both islandora and islandora-dev, just so you can get a 'feel' for what's going on in the community.

The stuff to bookmark now, and remember for later!

If you're going to work on or with the microservices you need this guide: https://github.com/Islandora-CLAW/CLAW/wiki/Using-CLAW-PHP-Microservices. READ THIS GUIDE, LEARN THIS GUIDE, LIVE THIS GUIDE. curl so many times it hurts, then some more. While you're doing all this 'curling', watch and explore http://localhost:8080/fcrepo/rest/, and query Blazegraph with a simple SELECT * WHERE { ?s ?p ?o }. What's happening? Why? HOW!? Is that a camel? No way, get out, there's a camel? Yup. We've got that.
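If it helps to see what all that 'curling' looks like in practice, here is a minimal sketch. The Fedora URL comes straight from the guide above, but the Blazegraph SPARQL endpoint path and port below are assumptions, so check the guide for where your box actually exposes the triplestore.

    # poke at the repository over HTTP
    curl -i http://localhost:8080/fcrepo/rest/

    # ask the triplestore what it knows (endpoint path and port are a guess)
    curl -G http://localhost:8080/bigdata/namespace/kb/sparql \
         --header 'Accept: application/sparql-results+json' \
         --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10'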

Here are some dummy objects I've created to quickly populate my repo that might be handy when you're curling: https://gist.github.com/br2490/310a005b02e70cc9a4b6e3190cf55e50. Note the instructions in my README.md may be outdated as we continue work on the Islandora microservices.

If you've never used Vagrant before, look inside the install folder in the main CLAW repo and follow the README. Note to Windows users: run the command prompt/git shell/whatever and VirtualBox as an ADMIN before typing vagrant up. You've been warned!
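For reference, the happy path on a Unix-ish host looks roughly like the sketch below; it assumes the install folder holds the Vagrantfile, as the README describes, so defer to the README if they disagree.

    git clone https://github.com/Islandora-CLAW/CLAW.git
    cd CLAW/install
    vagrant up    # Windows users: run this from an Administrator shell, with VirtualBox also started as admin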

Have an idea of the PSR coding standards - when you're not in Drupal land you'll be using PSR2 while working on CLAW. Just like in college when you had to write in APA (or any of the lesser styles, muhahaha), PSR2 is a style guide. Here's the guide: https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md, and here's something to fix your code for you: http://cs.sensiolabs.org/ (also see below about picking your editor of choice).
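As a rough illustration, the fixer linked above can apply PSR2 for you from the command line; the package name and flags have shifted between releases, so treat this as a hedged sketch and check the tool's own help output for your installed version.

    # install the fixer globally via Composer, then run it over your working copy
    composer global require friendsofphp/php-cs-fixer
    php-cs-fixer fix --rules=@PSR2 src/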

There's more?! Diego Pino (@DiegoPino) has been amazing and hosted a series of live learning tutorials that are recorded and available on YouTube. There are links in the main CLAW repository's README.md, and here are the slides https://github.com/DiegoPino/clawlessons that were presented (or at least some of 'em).

Finding a comfortable editing space.

Pick your editor of choice - personally I'm a fan of Sublime Text 3 with a few plugins, mainly phpfmt/SublimeLinter (for PSR2 compliance and auto-formatting), BracketHighlighter (distinguishes which block of code you're working on/in), and Markdown Preview (sometimes you want it to look nice before you push that commit). There are a few linters out there - some of them MUCH better than phpfmt (like SublimeLinter) - and they work great, but RTM before you get started using them.

If you prefer a full-fledged IDE, PhpStorm is my recommendation - educational users should be able to swing a free license, though do check with them as I'm not a lawyer... It does your typical IDE stuff (formatting, code completion, IntelliSense, etc.).

If you love yourself some VIM, look through @nruest's repos for his configs and I'm told you'll be all set!

Finally, if you're new to GitHub, don't panic, but learning it is a practical exercise regardless of how many guides you read. Here is one such guide, http://rogerdudler.github.io/git-guide/, but go ahead and Google 'github guide' and you'll be inundated with everyone's "best guide to learn github." Depending on your learning style, all of it is rubbish. GitHub is a PRACTICAL application. Just mess around with it and when you don't know something, speak up in IRC! Also, don't let your IDE do your git stuff; the commands are more powerful than what an IDE exposes.
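If a concrete starting point helps, the fork-and-pull-request loop boils down to a handful of commands; the branch and file names here are purely illustrative.

    git clone https://github.com/YOUR-USERNAME/CLAW.git   # clone your fork
    cd CLAW
    git checkout -b my-first-fix                          # work on a topic branch
    git add path/to/changed/file
    git commit -m "Describe what changed and why"
    git push origin my-first-fix                          # then open a pull request on GitHub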

Okay enough ranting. Here's my "CLAW story."

A nice way to start contributing is through sprints, which are held during the first (or last, I forget) two weeks of each month. There are a number of open issues/tickets that you can peruse. There is a sprint kick-off call where folks discuss what they want to tackle, and if you're not sure, you can be pointed towards those items that will interest you. I've learned this: EVERY LITTLE BIT HELPS. No task is too small. (See the last CLAW Tutorial/Lesson for more about sprints and getting involved, and read the CONTRIBUTING.md.)

So I started in Sprint 06, which ended in April 2016. Here is my "journal entry" from then:

Thus far my #CLAW experience has been using and helping update the Vagrant (for Windows compatibility, sigh) and working on the PHP Microservices (it's an iterative process, still workin' on my first PR).

I'm working primarily on #150 and having some success. Lots of testing using error.log and watching what's happening in Apache/FCRepo/Blazegraph. I think commit 64df023 is gonna crack this nut.

The Vagrant refuses to symlink on Windows machines, and so our Composer architecture, which allows live editing, is failing to work. Since I work in a mixed coding environment (which includes a Windows machine) this is fairly important. It turns out we may need to run VMWare with elevated privs for proper linking to occur. shrug

And my work continued, even outside of any true sprint. Small things like creating a .gitconfig to handle line endings properly, resolving other issues with the Vagrant, and small documentation tasks. Again, little things that I could do as time allowed.
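(For anyone fighting the same line-ending battle, one common approach looks like the sketch below; it is a hedged example, not necessarily the exact settings the CLAW repositories ended up using.)

    # normalize line endings so a Windows checkout doesn't dirty every file
    git config --global core.autocrlf input    # use "true" instead on Windows
    echo "* text=auto" >> .gitattributes       # or enforce it per repository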

Then Sprint 007 (I added the extra 0, because who doesn't love a Bond reference?) came along, and I missed the kickoff call due to other obligations. But I asked for a ticket anyway - and got one, which was like, "woot, this is amazing." However! Due to my own overzealousness and over-commitment I did not get to work on my issue as much as I would have liked :(. And guess what?! No one judged me, everyone understood, and it was just great to be part of the community and realize we really are in it together. I plan to continue my work with CLAW, and to take on as much as I possibly can! It's a great project, with great people (Nick, Diego, Jared - the main committers - are so supportive and collaborative), and something I am happy to be part of.

Ben out!

Wait Ben! Help me get started!

Have you curled yet? Get on IRC! Chat with us! Ping nruest a lot; he loves helping folks get started. Check the issues on GitHub; no one will stop you from trying to tackle one! Just fork and hack away :).

Eric Lease Morgan: Catholic Pamphlets and the Catholic Portal: An evolution in librarianship

planet code4lib - Tue, 2016-05-31 12:34

This blog posting outlines, describes, and demonstrates how a set of Catholic pamphlets were digitized, indexed, and made accessible through the Catholic Portal. In the end it advocates an evolution in librarianship.

A few years ago, a fledgling Catholic pamphlets digitization process was embarked upon. [1] In summary, a number of different library departments were brought together, a workflow was discussed, timelines were constructed, and in the end approximately one third of the collection was digitized. The MARC records pointing to the physical manifestations of the pamphlets were enhanced with URLs pointing to their digital surrogates and made accessible through the library catalog. [2] These records were also denoted as being destined for the Catholic Portal by adding a value of CRRA to a local note. Consequently, each of the Catholic Pamphlet records also made their way to the Portal. [3]

Because the pamphlets have been digitized, and because the digitized versions of the pamphlets can be transformed into plain text files using optical character recognition, it is possible to provide enhanced services against this collection, namely, text mining services. Text mining is a digital humanities application rooted in the counting and tabulation of words. By counting and tabulating the words (and phrases) in one or more texts, it is possible to “read” the texts and gain a quick & dirty understanding of their content. Probably the oldest form of text mining is the concordance, and each of the digitized pamphlets in the Portal is associated with a concordance interface.
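At its crudest, the counting and tabulating a concordance is built upon can be demonstrated in a few lines of shell against a plain text OCR file (pamphlet.txt is a hypothetical filename):

    # list the ten most frequently used words in the text
    tr -cs '[:alpha:]' '\n' < pamphlet.txt |
      tr '[:upper:]' '[:lower:]' |
      sort | uniq -c | sort -rn | head -10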

For example, the reader can search the Portal for something like “is the pope always right”, and the result ought to return a pointer to a pamphlet named Is the Pope always right? of papal infallibility. [4] Upon closer examination, the reader can download a PDF version of the pamphlet as well as use a concordance against it. [5, 6] Through the use of the concordance the reader can see that the words church, bill, charlie, father, and catholic are the most frequently used, and by searching the concordance for the phrase “pope is”, the reader gets a single sentence fragment in the result, “…ctrine does not declare that the Pope is the subject of divine inspiration by wh…” And upon further investigation, the reader can see this phrase is used about 80% of the way through the pamphlet.

The process of digitizing library materials is very much like the workflows of medieval scriptoriums, and the process is well understood. Description of, and access to, digital versions of original materials is well-accommodated by the exploitation of MARC records. The next step for the profession is to move beyond find & get and towards use & understand. Many people can find many things, with relative ease. The next step for librarianship is to provide services against the things readers find so they can more easily learn & comprehend. Save the time of the reader. The integration of the University of Notre Dame’s Hesburgh Libraries’s Catholic Pamphlets Collection into the Catholic Portal is one possible example of how this evolutionary process can be implemented.

Links

[1] digitization process – http://blogs.nd.edu/emorgan/2012/03/pamphlets/

[2] library catalog – http://bit.ly/sw1JH8

[3] Catholic Portal – http://bit.ly/cathholicpamphlets

[4] “Of Papal Infallibility” – http://www.catholicresearch.net/vufind/Record/undmarc_003078072

[5] PDF version – http://repository.library.nd.edu/view/45/743445.pdf

[6] concordance interface – https://concordance.library.nd.edu/app/concordance/?id=743445

HangingTogether: A story about the spirituality of libraries

planet code4lib - Tue, 2016-05-31 09:02

At the AMICAL 2016 conference, I heard an inspiring story about the cyclical destruction and revival of libraries. Dr. Richard Hodges, President of the AU of Rome, began his welcome message as follows: “Unlike what many of you may believe, you don’t come from Silicon Valley, you come from the monks”. He went on to explain that libraries were created at the end of the 8th century by monasteries. It was then that monks started to make books, besides growing food and brewing beer. They crafted the leather skin covers, the straps, the folded leaves of vellum, and all the instrumentation necessary to write books. They created the blueprint of the library. In subsequent years, full-blown libraries developed, like the ones in Saint Denis and Montecassino. As with all things successful, once they grow, you need to sustain them. The monks thus devised a model to attract donors with the lure of a counter-gift: a hand-crafted book. When, in the mid-9th century, the Vikings and Saracens destroyed many monasteries and their libraries, the books survived in the hands of those who had been donors. And so, the spirituality of what libraries stood for, as preservers of intellectual heritage, survived the destruction and was the seed for the new learning of the Renaissance. The storyteller hinted at globalization as a similar wave of destruction, which might leave us without libraries but with the promise that the spirituality of what libraries stand for will resurface in a new guise.

According to keynote speaker Jim Groom, this new guise is the Archiving Movement. He painted the Web landscape as a wonderful space of many small initiatives with do-it-yourself blogs and a Wiki-infra on which one can build an entire curriculum for free, with fascinating open technology and with exciting new learning experiences. In his view, it is all about our individual content, building domains of our own and leaving our personal digital footprints. He advocated the need for individuals to become archivists, reclaiming ownership and control over their data from the “big companies”. In this world of the small against the giants, “rogue Internet archivists” (or morphing librarians, as you wish) are excavating and rescuing the remains of parts of the web that are dying and being destroyed.

In my presentation on “Adapting to the new scholarly record” I talked about shifting trends in the research ecosystem and disturbances which are disrupting the tasks and responsibilities of librarians, as stewards of the record of science. I conveyed the concerns of experts and practitioners in the field, who met during a series of OCLC Research workshops on this matter. They talked about the short-term need for a demonstrable pay-off by universities and funding agencies; the diverse concerns on campus around image, IPR and compliance; the emergence of new digital platforms like ResearchGate and others that lure researchers into providing data to them and bypassing their institutional repositories; etc. All these forces at play are distracting libraries from safeguarding the record for future scholarship. These observations raise the question, which came from the audience: “What can we do about it?” and in particular, “What can we do, as AMICAL libraries?”

I had been impressed by the information literacy (IL) session the day before. AMICAL libraries from Paris to Sharjah presented their efforts to engage faculty and to broaden the understanding of IL within the university. Many of the libraries face challenges with their student population, such as reluctance and resistance to reading, deficiencies in academic writing skills, naive expectations about information retrieval, and ineffective search practices. The session concluded with agreement on the desirability of integrating IL into the curriculum.

So, I answered my audience without hesitation: Please continue the good work you are doing in IL! Why do we hear so little about IL at other library conferences in Europe? Isn’t IL a core part of that spirituality Richard Hodges talked about – a core part of what libraries stand for? The next generation needs to be prepared for the new learning in the digital information age. This requires education and training. People are not born being-digital!

About Titia van der Werf

Titia van der Werf is a Senior Program Officer in OCLC Research based in OCLC's Leiden office. Titia coordinates and extends OCLC Research work throughout Europe and has special responsibilities for interactions with OCLC Research Library Partners in Europe. She represents OCLC in European and international library and cultural heritage venues.


District Dispatch: Finding the “big picture” on big data at the 2016 ALA Annual Conference

planet code4lib - Tue, 2016-05-31 06:50

Photo by Rowan University Publication via Flickr

Every day, technology is making it possible to collect and analyze ever more data about students’ performance and behavior, including their use of library resources. The use of “big data” in the educational environment, however, raises thorny questions and deep concerns about individual privacy and data security. California responded to these concerns by passing the Student Online Personal Information Protection Act, and student data privacy also is now the focus of several bills in Congress.

Participate in a discussion on the big picture on student data privacy at the conference session “Student Privacy: The Big Picture on Big Data,” which takes place during the 2016 American Library Association (ALA) Annual Conference in Orlando, Fla.  During the session, Khaliah Barnes, associate director of the Electronic Privacy Information Center and Director of its Student Privacy Project, will discuss how the growing use of big data threatens student privacy and how evolving state and federal data privacy laws impact school and academic libraries. The session takes place on Monday, June 27, 2016, 10:00-11:30 a.m., in the Orange County Convention Center, room W206A.

As director of the EPIC Student Privacy Project, Khaliah created the Student Privacy Bill of Rights. Khaliah defends student privacy rights before federal regulatory agencies and federal court. She has testified before states and local districts on the need to safeguard student records. Khaliah is a frequent panelist, commentator, and writer on student data collection. Khaliah has provided expert commentary to local and national media, including CBS This Morning, the New York Times, the Washington Post, NPR, Fox Business, CNN, Education Week, Politico, USA Today, and Time Magazine.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Finding the “big picture” on big data at the 2016 ALA Annual Conference appeared first on District Dispatch.

DuraSpace News: Managed by DuraCloud: 150+ Terabytes of Data

planet code4lib - Tue, 2016-05-31 00:00

 

Breakout of content stored in DuraCloud based on storage provider

DuraSpace News: Collaboration Between DSpace Registered Service Providers To Integrate And Deploy a Module On a Tight Deadline

planet code4lib - Tue, 2016-05-31 00:00

From Peter Dietz, Longsight

Independence, Ohio  Longsight manages DSpace for the Woods Hole Open Access System (WHOAS), a repository of marine biology publications and datasets. To provide as much value as possible to this rich data, WHOAS has added a Linked Data module to DSpace to allow researchers to query their data.

DuraSpace News: Fedora at Open Repositories: Hands-on Fedora 4, RepoRodeo, API Extension, State of the CLAW, Hydra at 30

planet code4lib - Tue, 2016-05-31 00:00

Austin, TX  In two weeks the open repository community will gather at the Open Repositories Conference in Dublin, Ireland to share ideas and catch up with old friends and colleagues. The Fedora community will be on hand to participate and offer insights into current and future development of the flexible and extensible open source repository platform used by leading academic institutions.

Introduction to Fedora 4 Workshop

DuraSpace News: Save the Date for Open Repositories 2017

planet code4lib - Tue, 2016-05-31 00:00

From the organizers of Open Repositories 2017

Brisbane, Australia  The Open Repositories (OR) Steering Committee in conjunction with the University of Queensland (UQ), Queensland University of Technology (QUT) and Griffith University are delighted to inform you that Brisbane will host the annual Open Repositories 2017 Conference.

It's exciting to have Open Repositories return to Australia, where it all began in 2006.  

DuraSpace News: New DSpace Repository for The Natural History Museum

planet code4lib - Tue, 2016-05-31 00:00

From Lisa Cardy, Library Services Manager, Natural History Museum

LibUX: Announcing W3 Radio

planet code4lib - Mon, 2016-05-30 18:21

So, LibUX has been a super vehicle for me — hi, I’m Michael — to talk, write, and make friends around the user experience design and development of libraries, non-profits, and the higher-ed web. These are special niches wherein the day-to-day challenges pervading the web are compounded by unique hyperlocal user bases and ethical imperatives largely without parallel on the commercial web.

And because there is so much to mine, so many hours in the day, and the LibUX audience is so varied — designers, developers, enthusiasts, dabblers, directors, big-bosses, students, vendors — I curate against topics that interest me-the-developer but maybe aren’t exactly relevant to me-the-librarian. There is soooo much happening in this space that captures my imagination, so of course I thought I’d start a new podcast: W3 Radio — you know, as in “world wide web.”

Do you need your web design news right now, in ten minutes or less? Well, it's just your luck that soon I, Michael Schofield, am starting a new podcast: W3 Radio – a bite-sized best-of the world wide web. You'll soon be able to tune in to W3 Radio on your podcatcher of choice, and real soon at w3radio.com.

The gimmick is that it’s just a weekly recap of headlines in under ten minutes, which I think makes it perfect for playing catch-up – oh, also, I am pretending to be an old-timey radio anchor, and I am almost positive that won’t get old.

Download the MP3

Availability update (June 1, 2016)

W3 Radio is now available in Google Play, and still pending in iTunes. Of course, you can always subscribe to the direct feed.

So, as of this writing, publishing this announcement generates the feed I use to populate the various podcatchers. It will likely pend for a day or two before they make it available. Stay tuned to this space for links.

The post Announcing W3 Radio appeared first on LibUX.

LITA: Travel Apps!

planet code4lib - Mon, 2016-05-30 17:00

With ALA’s annual conference in Orlando just around the corner, travel is in the plans for many librarians and staff. Fortunately, as I live in Florida, I don’t have that far to go. But if you do, then you’re going to need some good apps.

I travel frequently and have a few of my favorite apps that I use for travel, and I’d like to share them with you:

Airline App of Choice

I personally only use two airlines so I can only speak to their particular apps, but seriously, if you have a smartphone and you aren’t using it to hold tickets or boarding passes, you’re missing out. You can also use your app to check flight times and delays, book future travel, or just to play around (one of my airline’s apps lets you send virtual postcards).

PackPoint

Even if it’s just a weekend trip, this app is great at letting you know what you should bring depending on the weather and your activities. You can adjust the lists according to your preferences as well. Though this is the free version, there is a paid version where you can save your packing lists to Evernote. (Android)

Foursquare

I only use Foursquare when I travel. It got put through its paces in Boston when I needed to find a place to eat near my location or was looking for a historic site I hadn’t been to. It also helped in giving me tips about the place: what to order, when to avoid the place, how the staff was. On top of that, it links with your Map App of Choice (Google Maps FTW!) to give you directions and contact information. It’s not Yelp, but I feel it’s more genuine. (Android)

Waze

Take it from someone who lived in Orlando: driving in that city is not fun. This is why you want Waze: it can show you directions as well as let you input traffic accidents you happen across as you drive (well, maybe after you drive). It even helps out with finding cheap gas. (Android)

Photo-Editing App of Choice

You’re no doubt going to be taking a lot of photos on your trip, so why not spice them up with some creative edits and share them? There are a plethora of photo apps out there to choose from, the most ubiquitous being Instagram (Android), but I love Hipstamatic (paid, iOS only) because you can randomize your filters and get a totally unexpected result every shot. Other apps that are fun are Pixlr (Android) (there’s a desktop version, too!) and Photoshop Express (Android)
What are some travel apps that you cannot live without? Post them in the comments or tweet them my way @LibrarianStevie!

Tim Ribaric: Code4Lib North Strikes Again

planet code4lib - Mon, 2016-05-30 13:57

Hotel had putt-putt; not relevant, but interesting nonetheless.


LibUX: 039 – Jean Felisme

planet code4lib - Mon, 2016-05-30 04:50

I’ve known Jean Felisme for awhile through WordCamp Miami. We see each other quite a bit at meetups and he’s a ton of fun – he’s also been pretty hardcore about evangelizing freelance. Recently he made the switch from freelance into the very special niche that is the higher-ed web, so when he was just six weeks into his new position at the School of Computing and Information Sciences at Florida International University I took the opportunity to pick his brain.

Hope you enjoy.

If you like, you can download the MP3 or subscribe to LibUX on Stitcher, iTunes, Google Play Music, or just plug our feed straight into your podcatcher of choice. Help us out and say something nice. Your sharing and positive reviews are the best marketing we could ask for.

Here’s what we talked about
  • 1:08 – WP Campus is coming up!
  • 2:45 – All about Jean
  • 4:28 – How the trend toward building-out in-house teams will impact freelance
  • 9:38 – What is the day-to-day like just six weeks in?
  • 12:03 – Student-hosted applications and content – scary
  • 13:09 – The makeup of Jean’s team
  • 17:43 – Are you playing with any web technology you haven’t before?
  • 19:37 – The tight relationship with the students
  • 20:31 – On web design curriculum
  • 28:00 – We fail to wrap up and keep talking about freelance for a few more minutes.

The post 039 – Jean Felisme appeared first on LibUX.

District Dispatch: Next CopyTalk on Government Overreach

planet code4lib - Fri, 2016-05-27 16:04

Please join us on June 2 for a free webinar on another form of copyright creep: this one is on recent efforts to copyright state government works.

Issues behind State governments copyrighting government works

The purpose of copyright is to provide incentives for creativity in exchange for a time-limited, government-provided monopoly. When drafting the federal copyright law, Congress explicitly prohibited the federal government, as well as employees of the federal government, from having the authority to create a copyright in government-created works. However, the federal law is silent on state governments’ power to create, hold, and enforce copyrights. This has resulted in a patchwork of varying levels of state copyright laws across all fifty states.

Currently, California favors the approach where a vast majority of works created by the state and local government are by default in the public domain. An ongoing debate is happening now as to whether California should end the public domain status of most state and local government works. The state legislature is contemplating a bill (AB 2880) that would grant copyright authority to all state agencies, local governments, and political subdivisions. In recent years, entities of state government have attempted to rely on copyright as a means to suppress the dissemination of taxpayer-funded research and as a means to chill criticism, but failed in the courts due to a lack of copyright authority. Ernesto Falcon, legislative counsel with the Electronic Frontier Foundation, will review the status of the legislation, the court decisions that led to its creation, and the debate that now faces the California legislature.

Day/Time: Thursday, June 2 at 2pm Eastern/11am Pacific for our hour-long free webinar.

Go to http://ala.adobeconnect.com/copytalk/ and sign in as a guest. You’re in.

This program is brought to you by OITP’s copyright education subcommittee.

The post Next CopyTalk on Government Overreach appeared first on District Dispatch.

FOSS4Lib Recent Releases: VuFind - 3.0.1

planet code4lib - Fri, 2016-05-27 15:44
Package: VuFind
Release Date: Friday, May 27, 2016

Last updated May 27, 2016. Created by Demian Katz on May 27, 2016.

Bug fix release.

Eric Lease Morgan: VIAF Finder

planet code4lib - Fri, 2016-05-27 13:34

This posting describes VIAF Finder. In short, given the values from MARC fields 1xx$a, VIAF Finder will try to find and record a VIAF identifier. [0] This identifier, in turn, can be used to facilitate linked data services against authority and bibliographic data.

Quick start

Here is the way to quickly get started:

  1. download and uncompress the distribution to your Unix-ish (Linux or Macintosh) computer [1]
  2. put a file of MARC records named authority.mrc in the ./etc directory, and the file name is VERY important
  3. from the root of the distribution, run ./bin/build.sh

VIAF Finder will then commence to:

  1. create a “database” from the MARC records, and save the result in ./etc/authority.db
  2. use the VIAF API (specifically the AutoSuggest interface) to identify VIAF numbers for each record in your database, and if numbers are identified, then the database will be updated accordingly [2]
  3. repeat Step #2 but through the use of the SRU interface
  4. repeat Step #3 but limiting searches to authority records from the Vatican
  5. repeat Step #3 but limiting searches to the authority named ICCU
  6. done

Once done, the reader is expected to programmatically loop through ./etc/authority.db to update the 024 fields of their MARC authority data.
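As a minimal sketch of that loop, the id-to-VIAF mapping can be pulled out of the database with awk; per the column listing later in this posting, field #1 is the MARC 001 and field #14 is the selected VIAF identifier. The actual rewriting of 024 fields is better left to MARC::Batch, along the lines of bin/subfield0to240.pl.

    # emit "001<TAB>viafid" pairs for every record that found a match
    awk -F'\t' '$14 != "" { print $1 "\t" $14 }' ./etc/authority.db > id-to-viaf.tsv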

Manifest

Here is a listing of the VIAF Finder distribution:

  • 00-readme.txt – this file
  • bin/build.sh – “One script to rule them all”
  • bin/initialize.pl – reads MARC records and creates a simple “database”
  • bin/make-dist.sh – used to create a distribution of this system
  • bin/search-simple.pl – rudimentary use of the SRU interface to query VIAF
  • bin/search-suggest.pl – rudimentary use of the AutoSuggest interface to query VIAF
  • bin/subfield0to240.pl – sort of demonstrates how to update MARC records with 024 fields
  • bin/truncate.pl – extracts the first n number of MARC records from a set of MARC records, and useful for creating smaller, sample-sized datasets
  • etc – the place where the reader is expected to save their MARC files, and where the database will (eventually) reside
  • lib/subroutines.pl – a tiny set of… subroutines used to read and write against the database

Usage

If the reader hasn’t figured it out already, in order to use VIAF Finder, the Unix-ish computer needs to have Perl and various Perl modules — most notably, MARC::Batch — installed.

If the reader puts a file named authority.mrc in the ./etc directory, and then runs ./bin/build.sh, then the system ought to run as expected. A set of 100,000 records over a wireless network connection will finish processing in a matter of many hours, if not the better part of a day. Speed will be increased over a wired network, obviously.

But in reality, most people will not want to run the system out of the box. Instead, each of the individual tools will need to be run individually. Here's how (a command-line sketch of the whole sequence follows the list):

  1. save a file of MARC (authority) records anywhere on your file system
  2. not recommended, but optionally edit the value of DB in bin/initialize.pl
  3. run ./bin/initialize.pl feeding it the name of your MARC file, as per Step #1
  4. if you edited the value of DB (Step #2), then edit the value of DB in bin/search-suggest.pl, and then run ./bin/search-suggest.pl
  5. if you want to possibly find more VIAF identifiers, then repeat Step #4 but with ./bin/search-simple.pl and with the “simple” command-line option
  6. optionally repeat Step #5, but this time use the “named” command-line option; the possible named values are documented as a part of the VIAF API (e.g., “bav” denotes the Vatican)
  7. optionally repeat Step #6, but with other “named” values
  8. optionally repeat Step #7 until you get tired
  9. once you get this far, the reader may want to edit bin/build.sh, specifically configuring the value of MARC, and running the whole thing again — “one script to rule them all”
  10. done
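Strung together, such a session might look something like the sketch below; the argument syntax is inferred from the descriptions above and assumes the default value of DB was left alone, so check each script's source before leaning on it.

    ./bin/initialize.pl /path/to/authority.mrc   # build ./etc/authority.db
    ./bin/search-suggest.pl                      # AutoSuggest pass
    ./bin/search-simple.pl simple                # SRU pass
    ./bin/search-simple.pl named bav             # SRU pass limited to the Vatican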

A word of caution is now in order. VIAF Finder reads & writes to its local database. To do so it slurps up the whole thing into RAM, updates things as processing continues, and periodically dumps the whole thing just in case things go awry. Consequently, if you want to terminate the program prematurely, try to do so a few steps after the value of “count” has reached the maximum (500 by default). A few times I have prematurely quit the application at the wrong time and blew my whole database away. This is the cost of having a “simple” database implementation.

To do

Alas, search-simple.pl contains a memory leak. Search-simple.pl makes use of the SRU interface to VIAF, and my SRU queries return XML results. Search-simple.pl then uses the venerable XML::XPath Perl module to read the results. Well, after a few hundred queries the totality of my computer’s RAM is taken up, and the script fails. One work-around would be to request the SRU interface to return a different data structure. Another solution is to figure out how to destroy the XML::XPath object. Incidentally, because of this memory leak, the integer fed to search-simple.pl was implemented allowing the reader to restart the process at a different point in the dataset. Hacky.

Database

The use of the database is key to the implementation of this system, and the database is really a simple tab-delimited table with the following columns:

  1. id (MARC 001)
  2. tag (MARC field name)
  3. _1xx (MARC 1xx)
  4. a (MARC 1xx$a)
  5. b (MARC 1xx$b and usually empty)
  6. c (MARC 1xx$c and usually empty)
  7. d (MARC 1xx$d and usually empty)
  8. l (MARC 1xx$l and usually empty)
  9. n (MARC 1xx$n and usually empty)
  10. p (MARC 1xx$p and usually empty)
  11. t (MARC 1xx$t and usually empty)
  12. x (MARC 1xx$x and usually empty)
  13. suggestions (a possible sublist of names, Levenshtein scores, and VIAF identifiers)
  14. viafid (selected VIAF identifier)
  15. name (authorized name from the VIAF record)

Most of the fields will be empty, especially fields b through x. The intention is/was to use these fields to enhance or limit SRU queries. Field #13 (suggestions) is for future, possible use. Field #14 is key, literally. Field #15 is a possible replacement for MARC 1xx$a. Field #15 can also be used as a sort of sanity check against the search results. “Did VIAF Finder really identify the correct record?”

Consider pouring the database into your favorite text editor, spreadsheet, database, or statistical analysis application for further investigation. For example, write a report against the database allowing the reader to see the details of the local authority record as well as the authority data in VIAF. Alternatively, open the database in OpenRefine in order to count & tabulate variations of data it contains. [4] Your eyes will widen, I assure you.
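A quick sanity check of the results is also easy, again assuming the tab-delimited layout listed above:

    # how many of the local records were matched to a VIAF identifier?
    awk -F'\t' 'NF { total++; if ($14 != "") found++ }
                END { printf "%d of %d records have a VIAF identifier\n", found, total }' ./etc/authority.db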

Commentary

First, this system was written during my “artist’s education adventure” which included a three-month stint in Rome. More specifically, this system was written for the good folks at Pontificia Università della Santa Croce. “Thank you, Stefano Bargioni, for the opportunity, and we did some very good collaborative work.”

Second, I first wrote search-simple.pl (SRU interface) and I was able to find VIAF identifiers for about 20% of my given authority records. I then enhanced search-simple.pl to include limitations to specific authority sets. I then wrote search-suggest.pl (AutoSuggest interface), and not only was the result many times faster, but the result was just as good, if not better, than the previous result. This felt like two steps forward and one step back. Consequently, the reader may not ever need nor want to run search-simple.pl.

Third, while the AutoSuggest interface was much faster, I was not able to determine how suggestions were made. This makes the AutoSuggest interface seem a bit like a “black box”. One of my next steps, during the copious spare time I still have here in Rome, is to investigate how to make my scripts smarter. Specifically, I hope to exploit the use of the Levenshtein distance algorithm. [5]

Finally, I would not have been able to do this work without the “shoulders of giants”. Specifically, Stefano and I took long & hard looks at the code of people who have done similar things. For example, the source code of Jeff Chiu’s OpenRefine Reconciliation service demonstrates how to use the Levenshtein distance algorithm. [6] And we found Jakob Voß’s viaflookup.pl useful for pointing out AutoSuggest as well as elegant ways of submitting URL’s to remote HTTP servers. [7] “Thanks, guys!”

Fun with MARC-based authority data!

Links

[0] VIAF – http://viaf.org

[1] VIAF Finder distribution – http://infomotions.com/sandbox/pusc/etc/viaf-finder.tar.gz

[2] VIAF API – http://www.oclc.org/developer/develop/web-services/viaf.en.html

[4] OpenRefine – http://openrefine.org

[5] Levenshtein distance – https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance

[6] Chiu’s reconciliation service – https://github.com/codeforkjeff/refine_viaf

[7] Voß’s viaflookup.pl – https://gist.github.com/nichtich/832052/3274497bfc4ae6612d0c49671ae636960aaa40d2

District Dispatch: Full STEAM ahead: Creating Tomorrowland today

planet code4lib - Fri, 2016-05-27 01:50

Sascha Paladino, Series Creator and Executive Producer

STEAM programming (which includes science, technology, engineering, arts and mathematics) is fast becoming a core service in libraries across the country. From intermixing STEAM activities into family story hour to teen maker spaces and coding camps, public and school libraries provide engaging opportunities for kids of all ages to develop a passion for science, technology, engineering, the arts, and math. Curious to learn who else is experimenting with STEAM programs for kids? The article below comes from Sascha Paladino, who is the creator and executive producer of “Miles from Tomorrowland”, a Disney Junior animated series that weaves science, technology, engineering and mathematics concepts geared towards kids ages 2-7 into its storylines. Paladino will delve deeper into the topic at the 2016 American Library Association Annual Conference, joined by others instrumental in getting Miles off the ground and kids into STEAM.

Six years ago, I came up with an idea for an animated series about a family on an adventure in outer space – from the kid’s perspective. I wanted to explore the universe through the eyes of a seven-year-old. I remembered how I saw outer space when I was young – as the greatest imaginable place for adventure – and I wanted to capture that feeling.

I pitched the idea to Disney, who liked it, and we began developing what would become MILES FROM TOMORROWLAND. Through the ups and downs that are part of any TV show’s journey to the screen, I tried to stay focused on my goals: tell entertaining stories, encourage kids to dream big and inspire viewers to explore STEAM (Science, Technology, Engineering, Arts and Math).

Luckily, with the support of Disney, I was able to surround the MILES creative team with a group of genius scientists. Dr. Randii Wessen of NASA’s Jet Propulsion Laboratory came onboard as an advisor, as did NASA astronaut Dr. Yvonne Cagle, and Space Tourism Society founder John Spencer. They shared their deep knowledge and experience with us, and gave our show some serious scientific street cred.

Along the way, I got a crash course in outer space. I was able to immerse myself in the science of our universe, and learned all about exoplanets, tardigrades, and electromagnetic pulses, for starters. Then, I could sit down with my writing and design teams and figure out ways to work these science facts into engaging stories to share with our audience.

I realized that I was making the show I wished I had as a kid: An exciting adventure that incorporates real science in a way that appeals to viewers whether or not they gravitate towards science. I always loved science, but my career path took me into the arts. In making this show, I learned that the arts can be a route into the sciences – which is why I’m really glad that STEM has expanded to STEAM, to include the “A” for “arts.”

My hope is that by exposing all sorts of kids to concepts such as black holes, coronal mass ejections, and spaghettification (best word ever), they’re inspired to explore further and deeper once the television is turned off.

When we were researching the series, we met with scientists, techies, and space professionals from amazing places such as NASA, SpaceX, Virgin Galactic, and Google. Over and over, we heard that they were inspired to go into their field because of science-fiction TV shows and movies that they saw as kids. Real-life innovations such as the first flip-phone were directly influenced by fantastical creations imagined on STAR TREK. Science fiction becomes science fact. It’s the circle of (sci-fi) life.

Now that MILES FROM TOMORROWLAND is on the air, I’ve been hearing from parents and kids that our vision of the future is giving the scientists of tomorrow some ideas. Nothing could make me happier. We’ve seen kids make their own creative versions of Miles’ tech and gear, such as cardboard spaceships and gadgets made from dried macaroni. As NASA’s Dr. Cagle told me recently, one of our goals should be to encourage kids to “engineer their dreams.” That sums it up perfectly.

I even heard from a kid who loves Miles’ Blastboard – his flying hoverboard – so much that he decided to sit down and design a real one. Whether it works or not is beside the point (although I’m quite sure that it does). What matters to me is that MILES FROM TOMORROWLAND set off a spark that, I hope, will continue to grow, multiply, and eventually inspire a future generation of scientists and innovators.

But mostly, I can’t wait to ride that Blastboard.

Join the “Coding in Tomorrowland: Inspiring Girls in STEM” session at the 2016 American Library Association Annual Conference in Orlando, which takes place on Sunday, June 26, 2016, from 1:00-2:30 p.m. (in the Orange County Convention Center, in room OCCC W303). Session speakers include “Miles from Tomorrowland” creator and executive producer, Sascha Paladino; series consultant and NASA astronaut, Dr. Yvonne Cagle; and Disney Junior executive, Diane Ikemiyashiro. This session will be moderated by Roger Rosen, who is the chief executive officer of Rosen Publishing and a senior advisor for national policy advocacy to ALA’s Office for Information Technology Policy.

Want to attend other policy sessions at the 2016 ALA Annual Conference? View all ALA Washington Office sessions

The post Full STEAM ahead: Creating Tomorrowland today appeared first on District Dispatch.

David Rosenthal: Abby Smith Rumsey's "When We Are No More"

planet code4lib - Thu, 2016-05-26 15:00
Back in March I attended the launch of Abby Smith Rumsey's book When We Are No More. I finally found time to read it from cover to cover, and can recommend it. Below the fold are some notes.

There are four main areas where I have comments on Rumsey's text. On page 144, in the midst of a paragraph about the risks to our personal digital information she writes:
“The documents on our hard disks will be indecipherable in a decade.”

The word "indecipherable" implies not data loss but format obsolescence. As I've written many times, Jeff Rothenberg was correct to identify format obsolescence as a major problem for documents published before the advent of the Web in the mid-90s. But the Web caused documents to evolve from being the private property of a particular application to being published. On the Web, published documents don't know what application will render them, and are thus largely immune to format obsolescence.

It is true that we're currently facing a future in which most current browsers will not render preserved Flash, not because they don't know how to but because it isn't safe to do so. But oldweb.today shows that the technological fix for this problem is already in place. Format obsolescence, were it to occur, would be hard for individuals to mitigate. Especially since it isn't likely to happen, it isn't helpful to lump it in with threats they can do something about by, for example, keeping local copies of their cloud data.

On page 148 Rumsey discusses the problem of the scale of the preservation effort needed and the resulting cost:
“We need to keep as much as we can as cheaply as possible. ... we will have to invent ways to essentially freeze-dry data, to store data at some inexpensive low level of curation, and at some unknown time in the future be able to restore it. ... Until such a long-term strategy is worked out, preservation experts focus on keeping digital files readable by migrating data to new hardware and software systems periodically. Even though this looks like a short-term strategy, it has been working well ... for three decades and more.”

Yes, it has been working well and will continue to do so provided the low level of curation manages to find enough money to keep the bits safe. Emulation will ensure that if the bits survive we will be able to render them, and it does not impose significant curation costs along the way.

The aggressive (and therefore necessarily lossy) compression Rumsey envisages would reduce storage costs, and I've been warning for some time that Storage Will Be Much Less Free Than It Used To Be. But it is important not to lose sight of the fact that ingest, not storage, is the major cost in digital preservation. We can't keep it all; deciding what to keep and putting it some place safe is the most expensive part of the process.

On page 163 Rumsey switches to ignoring the cost and assuming that, magically, storage supply will expand to meet the demand:
“Our appetite for more and more data is like a child's appetite for chocolate milk: ... So rather than less, we are certain to collect more. The more we create, paradoxically, the less we can afford to lose.”

Alas, we can't store everything we create now, and the situation isn't going to get better.

On page 166 Rumsey writes:
“Other than the fact that preservation yields long-term rewards, and most technology funding goes to creating applications that yield short-term rewards, it is hard to see why there is so little investment, either public or private, in preserving data. The culprit is our myopic focus on short-term rewards, abetted by financial incentives that reward short-term thinking. Financial incentives are matters of public policy, and can be changed to encourage more investment in digital infrastructure.”

I completely agree that the culprit is short-term thinking, but the idea that "incentives ... can be changed" is highly optimistic. The work of, among others, Andrew Haldane at the Bank of England shows that short-termism is a fundamental problem in our global society. Inadequate investment in infrastructure, both physical and digital, is just a symptom, and is far less of a problem than society's inability to curb carbon emissions.

Finally, some nits to pick. On page 7 Rumsey writes of the Square Kilometer Array:
“up to one exabyte (10^18 bytes) of data per day”

I've already had to debunk another “exabyte a day” claim. It may be true that the SKA generates an exabyte a day but it could not store that much data. An exabyte a day is most of the world's production of storage. Like the Large Hadron Collider, which throws away all but one byte in a million before it is stored, the SKA actually stores only(!) a petabyte a day (according to Ian Emsley, who is responsible for planning its storage). A book about preserving information for the long term should be careful to maintain the distinction between the amounts of data generated, and stored. Only the stored data is relevant.

On page 46 Rumsey writes:
“our recording medium of choice, the silicon chip, is vulnerable to decay, accidental deletion and overwriting”

Our recording medium of choice is not, and in the foreseeable future will not be, the silicon chip. It will be the hard disk, which is of course equally vulnerable, as any read-write digital medium would be. Write-once media would be somewhat less vulnerable, and they definitely have a role to play, but they don't change the argument.
