One of my daughters graduated from college last week (see pic). Call me a proud Dad, as she graduated with top honors (Summa cum laude) from Tulane University in New Orleans. This, while holding down two jobs in her last semester. So like many people who have college, high school, middle school, or whatever graduations in this season of graduations, my thoughts turn to what I may have wished to have known when I was graduating.
In my case, I’m going to look back at my graduation from library school, which was a Master’s degree from UC Berkeley in 1986. Yes, I really am that old. But let’s not dwell on that.
Here is what I wished I had known back then:
- Don’t ever expect to get anything handed to you. So many of the things that ended up making a difference in my career I had to actively pursue or initiate. Frankly, needing to make sure I could support two children (twins) spurred me to go after things simply for the money. However, they also helped me build a career that I wouldn’t have had otherwise.
- If something does get handed to you, run with it. My best career break came from someone who saw something in me and gave me a chance to prove myself. I ran with it, and never looked back. You should too.
- Don’t let success, should you be lucky enough to experience it, go to your head. My lucky break turned out to be the chance of a lifetime, and for a while I flirted with the idea of quitting my day job and going out on my own as a speaker/consultant. At least for me, that would have been a disaster, as the opportunities started drying up and the recession killed whatever was left. I had a family to support, and a paycheck you can count on is worth all kinds of consulting opportunities upon which you can’t necessarily count.
- Know and be true to yourself. This absurdly general statement is meant to signify knowing who you are willing to work for. As a newly-minted librarian, I flirted with the idea of working for a commercial vendor. But after interviewing, I realized that it really wasn’t for me. Others enjoy it and that is perfectly fine. The point is to know yourself enough to know what is right for you.
- Expect the unexpected. Again, an absurdly general statement that in this case is meant to signify that whatever you learned in library school will likely be not just out of date in 3-5 years, but perhaps even wrong. I would even say the phrase should be welcome the unexpected, as those who do will inherit the future.
- Pursue connections with others. As someone who benefited greatly from mentors, I have turned, in my later career, to mentoring others. So you could say that I’ve seen both sides of making connections and I can tell you that they are more meaningful and helpful than you can even imagine. Perhaps I am an extreme case, as I had one mentor who truly launched my career. Unfortunately, I know that I have not had the same effect on those I mentor. But a major part of what I try to do is to bring together young librarians of like mind to help form peer networks that will take them forward long after I have left the scene. You, as a young professional, can pursue these kinds of situations. Look for a seasoned professional who can introduce you to people you should know. Suggest a mentor/mentee relationship. I doubt you will be disappointed.
- Be good to others. At the end of the day, you need to be able to sleep at night. So whatever life throws at you, try to handle it with grace and be good to your fellow travelers. Besides, you never know when you will need them to lend you a hand.
- Make bridges, don’t burn them. A corollary to the last point is to be good to the organizations you serve. Do your best work, and if they disdain you, then move on. But don’t make a big deal out of it. You never know what the future may bring and it just might be important that you didn’t disrespect your former employer.
- Have fun. I’ve often said in many of the speeches I’ve made over the years that if you’re not having fun you aren’t doing it right. I realize that sounds flip, and assumes that everyone can find a job they enjoy, but I happen to think you are worth it. If you aren’t happy doing what you are doing then you should seek out that which makes you happy. Seriously, it’s worth the extra effort. If you find yourself dragging yourself out of bed in the morning, loathing the day you face, then that’s a pretty good sign you need to find something else. Don’t settle without a fight. You owe yourself at least that much.
I realize that advice is all too easy to give and much more difficult to take to heart. I don’t expect anyone to change their life based on this post. But it makes me feel better to get this down on “paper,” and to be able to point people to it should I ever run into someone who seems like they could use the advice.
But you’re right, I doubt I would have listened back then either. I needed to learn it on my own, one bloody, painful step at a time. I suppose in the end all we ever need is the ability to make good decisions, given the particular realities that face us at any one point in our lives. And that is perhaps the best possible graduation speech: how to make good decisions, as that is what life tends to throw at you — the need to make good decisions, time and time again.
Anna Neatrour is the Digital Metadata Librarian at the Mountain West Digital Library. In that capacity she works with libraries across the western states to support description and discovery of digital collections.
In this post, Anna describes one of her typical days as a metadata librarian aggregating data at the regional level and as part of a Service Hub with DPLA.
What does a Metadata Librarian do? The over ten million records in the Digital Public Library of America represent the work of countless people collecting, digitizing, and describing unique cultural heritage items. Mountain West Digital Library provides access to over 900,000 records, or about 10% of DPLA’s total collection. So, what does it take to be a metadata services librarian at a large DPLA service hub? Let’s find out.
8:30-10:30. Evaluate New Collections
I evaluate new collections from partners throughout Utah, Idaho, Montana, Arizona, and Nevada, and harvest their metadata into the Mountain West Digital Library. The MWDL has a well-established Metadata Application Profile, and I check new collections for conformance with the MWDL community’s shared expectations for descriptive metadata. Sometimes there are adjustments a local collection manager will need to make to field mappings, or values in the metadata that need to be revised or added. MWDL runs on ExLibris’ Primo discovery system, and we harvest collections through OAI-PMH. This means that I spend time checking OAI streams prior to harvesting a new collection. For a new repository I’ll send the collection manager a detailed report with information about what to fix. For long-term, established partners of MWDL, I’ll fire off e-mails with quick suggestions.
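Checking an OAI stream usually starts with issuing the standard protocol verbs against a partner's endpoint. As a rough sketch (the endpoint and set name below are made up for illustration; only the OAI-PMH verbs and parameters are standard):

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL for a given verb (e.g. ListRecords)."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"

# Hypothetical repository endpoint and set name, for illustration only.
url = oai_request_url(
    "https://example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="some_collection",
)
print(url)
```

Fetching that URL and eyeballing the returned records (or feeding them to a validator) is a quick first pass before committing to a harvest.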
10:30-12:00. MWDL Staff Meeting
Once a week, our team checks in about current projects, technical troubleshooting, and the status of new collections we are adding.
12:30-1:30. Web Page Updates for New Collections
I’ve been working recently on harvesting new collections from the University of Idaho Digital Library, which has a wonderfully eclectic collection of materials that covers a variety of topics including jazz history, forestry, and much more.
There’s some great graphic design in the Vandal Football Program Covers Collection, like this one which proclaims “Mashed Idahoes Comin’ Up!”
The International Jazz Collections at the University of Idaho are a unique resource, and many of the digitized materials from those collections are available in the DPLA, like this photo of Joe Williams and Count Basie from the Leonard Feather Jazz Collection.
We’ve also added great collections from the Arizona Memory Project, including the Petrified Forest Historic Photographs collection that adds to our existing materials on national parks and recreation in the region. My favorite item in this collection is a photograph of Albert Einstein touring the park, a detail of which can be seen above in the header image for this post.
One of the things I enjoy the most about harvesting new collections into MWDL is seeing how the information available on a particular topic gets augmented and expanded as more items are digitized. For example, many MWDL partners have photos and documents that tell the story of the Saltair Resort on the shores of the Great Salt Lake.
We have many older photos documenting the history of the resort, but we recently added a selection of color photos from 1965, during the time period after the resort was abandoned, but before it was later destroyed by arson.
All of these collections from MWDL then combine to help researchers find even more resources on these topics in DPLA.
2:00-3:00. Virtual Meeting or Training Support
I enjoy working with librarians from different institutions across our multi-state region, which means meeting online. The meetings might center on the activities of a MWDL Task Force or time with a librarian needing support.
3:00-4:00 Technical Troubleshooting
I check harvested collections after they are imported/ingested into Primo and troubleshoot any issues when necessary. This means checking the PNX (Primo Normalized XML) records in our discovery system to make sure that the harvested metadata will display correctly, and also be available for DPLA to harvest.
4:00-5:30 PLPP Partner Support
MWDL is one of the four service hubs working on the Public Libraries Partnerships Project, and while we support all our partners, we are spending extra time helping public librarians who are new to digitization get their first collections online!
Sharing the digital collections regionally at mwdl.org and nationally through DPLA is extremely rewarding. The next time you find a cool digital item in DPLA, thank your local metadata librarian!
Featured image: Detail of Dr. and Mrs. Albert Einstein visit Rainbow Forest, date unknown. Courtesy of the National Park Service (AZ) via the Arizona Memory Project and Mountain West Digital Library.
All written content on this blog is made available under a Creative Commons Attribution 4.0 International License. All images found on this blog are available under the specific license(s) attributed to them, unless otherwise noted.
Open Knowledge in partnership with the Philippine Center for Investigative Journalism is pleased to announce the launch of Data Journalism Ph 2015. Supported by the World Bank, the program will train journalists and citizen media in producing high-quality, data-driven stories.
In recent years, government and multilateral agencies in the Philippines have published large amounts of data, for example through the government’s recently launched Open Data platform. These were accompanied by other platforms that track the implementation and expenditure of flagship programs, such as Bottom-Up Budgeting via OpenBUB.gov.ph, infrastructure via OpenRoads.ph, and reconstruction platforms including the Foreign Aid Transparency Hub. The training aims to encourage more journalists to use these and other online resources to produce compelling investigative stories.
Data Journalism Ph 2015 will train journalists on the tools and techniques required to gain and communicate insight from public data, including web scraping, database analysis and interactive visualization. The program will support journalists in using data to back their stories, which will be published by their media organization over a period of five months.
Participating teams will benefit from the following:
- A 3-day data journalism training workshop by Open Knowledge and PCIJ in July 2015 in Manila
- A series of online tutorials on a variety of topics from digital security to online mapping
- Technical support in developing interactive visual content to accompany their published stories
Teams of up to three members working with the same print, TV, or online media agencies in the Philippines are invited to submit an application here.
Participants will be selected on the basis of the data story projects they pitch, focused on key datasets including infrastructure, reconstruction, participatory budgeting, procurement and customs. Through Data Journalism Ph 2015 and its trainers, these projects will be developed into data stories to be published by the participants’ media organizations.
Join the launch
Open Knowledge and PCIJ will host a half-day public event for those interested in the program in July in Quezon City. If you would like to receive full details about the event, please sign up here.
To follow the programme as it progresses, go to the Data Journalism Ph 2015 project website.
Last updated May 26, 2015. Created by David Nind on May 26, 2015.
Koha 3.20 is the latest major release. It includes 5 new features, 114 enhancements and 407 bug fixes.
For more details see:
- Koha 3.20.0 - http://koha-community.org/koha-3-20-0-released/ (22 May 2015 - major six-monthly release)
Koha's release cycle:
Today I found the following resources and bookmarked them on Delicious.
- Open Hub, the open source network Discover, Track and Compare Open Source
- Arches: Heritage Inventory & Management System Arches is an innovative open source software system that incorporates international standards and is built to inventory and help manage all types of immovable cultural heritage. It brings together a growing worldwide community of heritage professionals and IT specialists. Arches is freely available to download, customize, and independently implement.
- Learn about Open Source from Me and Infopeople
- Online Presentations
- CIL2008: Open Source Solutions to Offer Superior Service
Tuesday May 26, 2015.
Today we had a lively half hour free webinar presentation by Kimberly Bryant and Lake Raymond from Black Girls CODE about their latest efforts and the exciting LITA preconference they will be giving at ALA Annual in San Francisco. Here’s the link to the recording from today’s session:
For more information check out the previous LITA Blog entry:
Did you attend the webinar, or view the recording? Give us your feedback by taking the Evaluation Survey.
Then register for and attend the LITA preconference at ALA Annual. This opportunity is following up on the 2014 LITA President’s Program at ALA Annual where then LITA President Cindi Trainor Blyberg welcomed Kimberly Bryant, founder of Black Girls Code.
The Black Girls Code vision is to increase the number of women of color in the digital space by empowering girls of color ages 7 to 17 to become innovators in STEM fields, leaders in their communities, and builders of their own futures through exposure to computer science and technology.
“To bring together the records of the past and to house them in buildings where they will be preserved for the use of men and women in the future, a Nation must believe in three things.
It must believe in the past.
It must believe in the future.
It must, above all, believe in the capacity of its own people so to learn from the past that they can gain in judgement in creating their own future.”
– Franklin Roosevelt At the dedication of his library on June 30, 1941
Earlier this month it was announced that President Barack Obama’s Presidential Library will be built on the South Side of Chicago. It will be our 14th Presidential Library.
The idea originated with FDR, who in his second term, “on the advice of noted historians and scholars, established a public repository to preserve the evidence of the Presidency for future generations.”
Then in 1955, Congress passed the Presidential Libraries Act, establishing a system of privately erected and federally maintained libraries.
Here’s a sampling of images from the Digital Public Library of America related to our presidents and their libraries. Enjoy!
In my copious spare time I have hacked together a thing I’m calling the HathiTrust Research Center Workset Browser, a (fledgling) tool for doing “distant reading” against corpora from the HathiTrust. 
The idea is to: 1) create, refine, or identify a HathiTrust Research Center workset of interest — your corpus, 2) feed the workset’s rsync file to the Browser, 3) have the Browser download, index, and analyze the corpus, and 4) enable the reader to search, browse, and interact with the results of the analysis. With varying success, I have done this with a number of worksets on topics ranging from literature and philosophy to Rome and cookery. The best working examples are the ones from Thoreau and Austen. [2, 3] The others are still buggy.
As a further example, the Browser can/will create reports describing the corpus as a whole. This analysis includes the size of a corpus measured in pages as well as words, date ranges, word frequencies, and selected items of interest based on pre-set “themes” — usage of color words, name of “great” authors, and a set of timeless ideas.  This report is based on more fundamental reports such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]
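A minimal sketch of the frequency-table and theme reporting described above, assuming a plain-text corpus; the color-word list here is just a stand-in for the Browser's actual themes:

```python
import re
from collections import Counter

def word_frequencies(text, color_words=("red", "blue", "green", "white", "black")):
    """Tokenize a text and count overall word frequencies plus 'theme' words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    theme = {w: counts[w] for w in color_words if w in counts}
    return counts, theme

sample = "The red house and the blue house stood by the red barn."
counts, colors = word_frequencies(sample)
print(counts.most_common(2))  # the most frequent words
print(colors)                 # usage of color words
```

A full report would run this over every page of a workset and merge the counters.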
The whole thing is written in a combination of shell and Python scripts. It should run on just about any out-of-the-box Linux or Macintosh computer. Take a look at the code. No special libraries needed. (“Famous last words.”) In its current state, it is very Unix-y. Everything is done from the command line. Lots of plain text files and the exploitation of STDIN and STDOUT. Like a Renaissance cartoon, the Browser, in its current state, is only a sketch. Only later will a more full-bodied, Web-based interface be created.
The next steps are numerous and listed in no priority order: putting the whole thing on GitHub, outputting the reports in generic formats so other things can easily read them, improving the terminal-based search interface, implementing a Web-based search interface, writing advanced programs in R that chart and graph the analysis, providing a means for comparing & contrasting two or more items from a corpus, indexing the corpus with a (real) indexer such as Solr, writing a “cookbook” describing how to use the browser to do “kewl” things, making the metadata of corpora available as Linked Data, etc.
‘Want to give it a try? For a limited period of time, go to the HathiTrust Research Center Portal, create (refine or identify) a collection of personal interest, use the Algorithms tool to export the collection’s rsync file, and send the file to me. I will feed the rsync file to the Browser, and then send you the URL pointing to the results.  Let’s see what happens.
Fun with public domain content, text mining, and the definition of librarianship.
Links
- HTRC Workset Browser – http://bit.ly/workset-browser
- Thoreau – http://bit.ly/browser-thoreau
- Austen – http://bit.ly/browser-austen
- Thoreau report – http://ntrda.me/1LD3xds
- Thoreau dictionary (frequency list) – http://bit.ly/thoreau-dictionary
- usage of color words in Thoreau — http://bit.ly/thoreau-colors
- unique words in the corpus – http://bit.ly/thoreau-unique
- Thoreau “catalog” — http://bit.ly/thoreau-catalog
- source code – http://ntrda.me/1Q8pPoI
- HathiTrust Research Center – https://sharc.hathitrust.org
Economists like to say there are no bad people, just bad incentives. The incentives to publish today are corrupting the scientific literature and the media that covers it. Until those incentives change, we’ll all get fooled again.
Earlier this year I saw Tom Stoppard's play The Hard Problem at the Royal National Theatre, which deals with the same issue. The tragedy is driven by the characters being entranced by the prospect of publishing an attention-grabbing result. Below the fold, more on the problem of bad incentives in science.
Back in April, after a Wellcome Trust symposium on the reproducibility and reliability of biomedical science, Richard Horton, editor of The Lancet, wrote an editorial entitled What is medicine’s 5 sigma? that is well worth a read. His focus is also on incentives for scientists:
In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory. Or they retrofit hypotheses to fit their data.
and journal editors:
Our acquiescence to the impact factor fuels an unhealthy competition to win a place in a select few journals. Our love of "significance" pollutes the literature with many a statistical fairy-tale. We reject important confirmations.
and Universities:
in a perpetual struggle for money and talent, endpoints that foster reductive metrics, such as high-impact publication. National assessment procedures, such as the Research Excellence Framework, incentivise bad practices.
Horton points out that:
Part of the problem is that no-one is incentivised to be right. Instead, scientists are incentivised to be productive and innovative.
He concludes:
The good news is that science is beginning to take some of its worst failings very seriously. The bad news is that nobody is ready to take the first step to clean up the system.
Six years ago Marcia Angell, the long-time editor of a competitor to The Lancet, wrote in a review of three books pointing out the corrupt incentives that drug companies provide researchers and Universities:
It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines. I take no pleasure in this conclusion, which I reached slowly and reluctantly over my two decades as an editor of The New England Journal of Medicine.
In most fields, little has changed since then. Horton points to an exception:
Following several high-profile errors, the particle physics community now invests great effort into intensive checking and re-checking of data prior to publication. By filtering results through independent working groups, physicists are encouraged to criticise. Good criticism is rewarded. The goal is a reliable result, and the incentives for scientists are aligned around this goal.Unfortunately, particle physics is an exception. The cost of finding the Higgs Boson was around $13.25B, but no-one stood to make a profit from it. A single particle physics paper can have over 5,000 authors. The resources needed for "intensive checking and re-checking of data prior to publication" are trivial by comparison. In other fields, the incentives for all actors are against devoting resources which would represent a significant part of the total for the research to such checking.
Fixing these problems of science is a collective action problem; it requires all actors to take actions that are against their immediate interests roughly simultaneously. So nothing happens, and the long-term result is, as Arthur Caplan (of the Division of Medical Ethics at NYU's Langone Medical Center) pointed out, a total loss of science's credibility:
The time for a serious, sustained international effort to halt publication pollution is now. Otherwise scientists and physicians will not have to argue about any issue—no one will believe them anyway.
(See also John Michael Greer.) I am not optimistic, based on the fact that the problem has been obvious for many years, and that this is but one aspect of society's inability to deal with long-term problems.
Metadata quality and assessment is a concept that has been around for decades in the library community. Recently it has been getting more interest as new aggregations of metadata become available in open and freely reusable ways such as the Digital Public Library of America (DPLA) and Europeana. Both of these groups make available their metadata so that others can remix and reuse the data in new ways.
I’ve had an interest in analyzing the metadata in the DPLA for a while and have spent some time working on the subject fields. This post continues along those lines by trying to figure out which metrics we can calculate from the DPLA dataset to help define “quality”. Ideally we will be able to turn these assessments and notions of quality into concrete recommendations for how to improve metadata records in the originating repositories.
This post will focus on normalization of subject strings, and how those normalizations might be useful as a way of assessing quality of a set of records.
One of the powerful features of OpenRefine is the ability to cluster a set of data and combine these clusters into a single entry. Oftentimes this will significantly reduce the number of values that occur in a dataset in a quick and easy manner.
OpenRefine has a number of different algorithms that can be used for this work, documented in its Clustering in Depth documentation. Depending on one’s data, one approach may perform better than others for this kind of clustering.
Normalization
Case normalization is probably the easiest kind of normalization to understand. If you have two strings, say “Mark” and “marK”, converting each to lowercase leaves a single value of “mark”. Many more complicated normalizations assume this as a start because it reduces the number of subjects without drastically transforming the original string values.
Case folding is another kind of transformation that is fairly common in the world of libraries. This is the process of taking a string like “José” and converting it to “Jose”. While this can introduce issues when a diacritic is meaningful, distinguishing a word or phrase from the version without it, oftentimes it helps to normalize inconsistently notated versions of the same string.
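Both lowercasing and case folding can be sketched in a few lines of standard-library Python (an illustration, not the exact code any particular library uses):

```python
import unicodedata

def fold_case(value):
    """Lowercase a string and strip diacritics (e.g. 'José' -> 'jose')."""
    lowered = value.lower()
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_case("José"))  # jose
print(fold_case("marK"))  # mark
```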
Beyond case folding and lowercasing, libraries have been normalizing data for a long time, and there have been efforts in the past to formalize algorithms for the normalization of subject strings for use in matching. Often referred to as the NACO normalization rules, they are formally the Authority File Comparison Rules. I’ve always found this work intriguing and have a preference for the work and simplified algorithm that was developed at OCLC in their NACO Normalization Service. In fact we’ve taken the sample Python implementation there and created a stand-alone repository and project called pynaco on GitHub, so that we could add tests and then work to port it to Python 3 in the near future.
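As a rough approximation of what a simplified NACO-style normalization does (the real Authority File Comparison Rules and pynaco handle more cases, such as subfield delimiters and special characters):

```python
import re
import unicodedata

def naco_normalize(value):
    """Rough approximation of simplified NACO normalization:
    strip diacritics, lowercase, drop punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^\w\s]", " ", stripped.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

print(naco_normalize("Musical Instruments."))  # musical instruments
```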
Another common type of normalization that is performed on strings in library land is stemming. This is often done within search applications so that if you search one of the phrases run, runs, running you would get documents that contain each of these.
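The idea behind stemming can be shown with a toy suffix-stripper; note this is NOT the full Porter algorithm (which real search applications use), just an illustration of the concept:

```python
def toy_stem(word):
    """A toy suffix-stripper (not the real Porter algorithm) showing how
    'run', 'runs', and 'running' all reduce to the same stem."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:  # undouble 'runn' -> 'run'
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

for w in ("run", "runs", "running"):
    print(w, "->", toy_stem(w))
```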
What I’ve been playing around with is if we could use the reduction in unique terms for a field in a metadata repository as an indicator of quality.
Here is an example.
If we have the following set of subjects:
- Musical Instruments
- Musical Instruments.
- Musical instrument
- Musical instruments
- Musical instruments,
- Musical instruments.
If you applied the simplified NACO normalization from pynaco you would end up with the following strings:
- musical instruments
- musical instruments
- musical instrument
- musical instruments
- musical instruments
- musical instruments
If you then applied the Porter stemming algorithm to the new set of subjects you would end up with the following:
- music instrument
- music instrument
- music instrument
- music instrument
- music instrument
- music instrument
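The pipeline above can be sketched in a few lines of Python; the punctuation stripping and plural folding here are crude stand-ins for the real pynaco and Porter code, just to show the set of six collapsing to one:

```python
import re

def normalize_subject(subject):
    """Rough sketch: lowercase, drop punctuation, then crudely fold
    plurals (a stand-in for real NACO normalization + stemming)."""
    cleaned = re.sub(r"[^\w\s]", " ", subject.lower())
    words = cleaned.split()
    words = [w[:-1] if w.endswith("s") and not w.endswith("ss") else w
             for w in words]
    return " ".join(words)

subjects = [
    "Musical Instruments", "Musical Instruments.", "Musical instrument",
    "Musical instruments", "Musical instruments,", "Musical instruments.",
]
unique = {normalize_subject(s) for s in subjects}
print(unique)  # {'musical instrument'}
```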
So in effect you have normalized the original set of six unique subjects down to one unique subject string with a NACO transformation followed by a normalization with the Porter stemming algorithm.
Experiment
In some past posts here, here, here, and here, I discussed some aspects of the subject fields present in the Digital Public Library of America dataset. I dusted that dataset off and extracted all of the subjects so that I could work with them by themselves.
I ended up with a text file 23,858,236 lines long that held the item identifier and a subject value for each subject of each item in the DPLA dataset. Here is a short snippet of what that looks like:
d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers
d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States
d8f192def7107b4975cf15e422dc7cf1 Vaults (Strong rooms)
4aea3f45d6533dc8405a4ef2ff23e324 Public works--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 City planning--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 Art, Municipal--Illinois--Chicago
63f068904de7d669ad34edb885925931 Building laws--New York (State)--New York
63f068904de7d669ad34edb885925931 Tenement houses--New York (State)--New York
1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais
Once I had the data in this format I could experiment with different normalizations to see what kind of effect they had on the dataset.
Total vs Unique
The first thing I did was to make the 23,858,236-line text file contain only unique values. I did this with the tried and true method of using unix sort and uniq:
sort subjects_all.txt | uniq > subjects_uniq.txt
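On a tiny demo file (the real subjects_all.txt is of course millions of lines), the same pattern looks like this:

```shell
# Create a small demo file, then reduce it to unique lines and count them.
printf 'Cats\nDogs\nCats\nBirds\n' > /tmp/subjects_demo.txt
sort /tmp/subjects_demo.txt | uniq > /tmp/subjects_demo_uniq.txt
wc -l < /tmp/subjects_demo_uniq.txt
```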
After about eight minutes of waiting I ended up with a new text file, subjects_uniq.txt, that contains the unique subject strings in the dataset. There are a total of 1,871,882 unique subject strings in this file.
Case folding
Using a Python script to perform case folding on each of the unique subjects, I’m able to see that this causes a reduction in the number of unique subjects.
I started out with 1,871,882 unique subjects and after case folding ended up with 1,867,129 unique subjects. That is a difference of 4,753, or a 0.25% reduction in the number of unique subjects. So nothing huge.
Lowercase
The next normalization tested was lowercasing of the values. I chose to do this on the set of subjects that were already case folded to take advantage of the previous reduction in the dataset.
By converting the subject strings to lowercase I reduced the number of unique case-folded subjects from 1,867,129 to 1,849,682, a cumulative reduction of 22,200, or 1.2%, from the original 1,871,882 unique subjects.
NACO Normalization
Next we look at the simple NACO normalization from pynaco. I applied this to the unique lower cased subjects from the previous step.
With the NACO normalization, I end up with 1,826,523 unique subject strings from the 1,849,682 that I started with from the lowercased subjects. This is a cumulative difference of 45,359, or a 2.4% reduction, from the original 1,871,882 unique subjects.
Porter stemming
Moving along, the next normalization I looked at for this work was applying the Porter stemming algorithm to the output of the NACO normalized subjects from the previous step. I used the Porter implementation from the Natural Language Toolkit (NLTK) for Python.
With the Porter stemmer applied, I ended up with 1,801,114 unique subject strings from the 1,826,523 that I started with from the NACO normalized subjects. This is a cumulative difference of 70,768, or a 3.8% reduction, from the original 1,871,882 unique subjects.
Fingerprint
Finally I used a Python port of the fingerprint algorithm that OpenRefine uses for its clustering feature. This will help to normalize strings like “phillips mark” and “mark phillips” into a single value of “mark phillips”. I used the output of the previous Porter stemming step as the input for this normalization.
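The key-collision fingerprint method is simple enough to sketch directly (a simplified version; OpenRefine's actual implementation also transliterates non-ASCII characters):

```python
import re

def fingerprint(value):
    """Simplified OpenRefine-style fingerprint: lowercase, strip
    punctuation, then sort and de-duplicate the whitespace tokens."""
    cleaned = value.lower().strip()
    cleaned = re.sub(r"[^\w\s]", "", cleaned)
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

print(fingerprint("phillips mark"))   # mark phillips
print(fingerprint("Mark  Phillips"))  # mark phillips
```

Because the tokens are sorted and de-duplicated, reorderings and repeated words collapse to the same key, which is why this clusters name-style strings so well.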
With the fingerprint algorithm applied, I ended up with 1,766,489 unique fingerprint-normalized subject strings. This is a cumulative difference of 105,393, or a 5.6% reduction, from the original 1,871,882 unique subjects.
Overview

| Normalization | Reduction | Occurrences | Percent Reduction |
|---------------|-----------|-------------|-------------------|
| Unique        | 0         | 1,871,882   | 0%                |
| Case Folded   | 4,753     | 1,867,129   | 0.3%              |
| Lowercase     | 22,200    | 1,849,682   | 1.2%              |
| NACO          | 45,359    | 1,826,523   | 2.4%              |
| Porter        | 70,768    | 1,801,114   | 3.8%              |
| Fingerprint   | 105,393   | 1,766,489   | 5.6%              |

Conclusion
I think that it might be interesting to apply this analysis to the various Hubs in the whole DPLA dataset to see if there is anything interesting to be seen across the various types of content providers.
I’m also curious if there are other kinds of normalizations that would be logical to apply to the subjects that I’m blanking on. One that I would probably want to apply at some point is the normalization for LCSH that splits a subject into its parts if it has the double hyphen (--) in the string. I wrote about the effect on the subjects for the DPLA dataset in a previous post.
As always feel free to contact me via Twitter if you have questions or comments.
Last week I had the pleasure of presenting a short talk at the second virtual meeting of the NISO effort to reach a Consensus Framework to Support Patron Privacy in Digital Library and Information Systems. The slides from the presentation are below and on SlideShare, followed by a cleaned-up transcript of my remarks.
It looks from the agenda like I’m batting in the clean-up role, and my message might be pithily summarized as “Can’t we all get along?” A core tenet of librarianship — perhaps dating back to the 13th and 14th century when this manuscript was illuminated — is to protect the activity trails of patrons from unwarranted and unnecessary disclosure.
This is embedded in the ethos of librarianship. As Todd pointed out in the introduction, the third principle of the American Library Association’s Code of Ethics states: “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.” Librarians have performed this duty across time and technology, and as both have progressed the profession has sought new ways to protect the privacy of patrons.
For instance, there was once a time when books had a pocket in the back that held a card showing who had checked out the book and when it was due. Upon checkout the card was taken out, had the patron’s name embossed or written on it, and was stored in a date-sorted file so that the library knew when it was due and who had it checked out. When the book was returned, the name was scratched through before putting the card in the pocket and the book on the shelf. Sometimes, as a process shortcut, the name was left “in the clear” on the card, and anyone that picked the book off the shelf could look on the card to see who had checked it out.
When libraries automated their circulation management with barcodes and database records, the card in the back of the book and the information it disclosed was no longer necessary. This was hailed as one of the advantages to moving to a computerized circulation system. While doing away with circulation cards eliminated one sort of privacy leakage — patrons being able to see what each other had checked out — it enabled another: systematic collection of patron activity in a searchable database. Many automation systems put in features that automatically removed the link between patron and item after it was checked in. Or, if that information was stored for a period of time, it was password protected so only approved staff could view the information. Some, however, did not, and this became a concern with the passage of the USA PATRIOT act by the United States Congress.
We are now in an age where patron activity is scattered across web server log files, search histories, and usage analytics of dozens of systems, some of which are under the direct control of the library while others are in the hands of second and third party service providers. Librarians that are trying to do their due diligence in living up to the third principle of the Code of Ethics have a more difficult time accounting for all of the places where patron activity is collected. It has also become more difficult for patrons to make informed choices about what information is collected about their library activity and how it is used.
In the mid-2000s, libraries and content providers had a similar problem: the constant one-off negotiation of license terms was a burden to all parties involved. In order to gain new efficiencies in the process of acquiring and selling licensed content, representatives from the library and publisher communities came together under a NISO umbrella to reach a shared understanding of what the terms of an agreement would be and a registry of organizations that ascribed to those terms. Quoting from the foreword of the 2012 edition: “The Shared Electronic Resource Understanding (SERU) Recommended Practice offers a mechanism that can be used as an alternative to a license agreement. The SERU statement expresses commonly shared understandings of the content provider, the subscribing institution and authorized users; the nature of the content; use of materials and inappropriate uses; privacy and confidentiality; online performance and service provision; and archiving and perpetual access. Widespread adoption of the SERU model for many electronic resource transactions offers substantial benefits both to publishers and libraries by removing the overhead of bilateral license negotiation.”
Today’s web services are filled with social sharing widgets (Facebook, Twitter, and the like), web analytics tools (Google Analytics), and content from advertising syndicates. While these tools provide useful services to patrons, libraries, and service providers, they also become centralized points of data gathering that can aggregate a user’s activity across the web. Does your library catalog page include a Facebook “Like” button? Whether or not the patron clicks on that button, Facebook knows that user has browsed to that web page and can glean details of user behavior from that. Does your service use Google Analytics to understand user behavior and demographics? Google Analytics tracks user behavior across an estimated half of the sites on the internet. Your user’s activity as a patron of your services is commingled with their activity as a general user.
A “filter bubble” is a phrase coined by Eli Pariser to describe a system that adapts its output based on what it knows about a user: location, past searches, click activity, and other signals. The system uses these signals to deliver what it deems to be more relevant information to the user. In order to do this, the system must gather, store, and analyze this information from patrons. However, a patron may not want his or her past search history to affect their search results. Or, even worse, when activity is aggregated from a shared terminal, the results can be wildly skewed.
Simply using a library-subscribed service can transmit patron activity and intention to dozens of parties, all of it invisible to the user. To uphold that third principle in the ALA Code of Ethics, librarians need to examine the patron-activity-capturing practices of their information suppliers, and that can be as unwieldy as negotiating bilateral license agreements between each library and supplier. If we start from the premise that libraries, publishers, and service providers want to serve the patron’s information needs while respecting their desire to do so privately, what is needed is a shared understanding of how patron activity is captured, used, and discarded. A new gathering of librarians and providers could accomplish for patron activity what they did for electronic licensing terms a decade ago. One could imagine discussions around these topics:
What Information Is Collected From the Patron: When is personally identifiable information captured in the process of using the provider’s service? How is activity tagged to a particular patron — both before and after the patron identifies himself or herself? Are search histories stored? Is the patron activity encrypted — both in transit on the network and at rest on the server?
What Activity Can Be Gleaned by Other Parties: If a patron follows a link to another website, how much of the context of the patron’s activity is transferred to the new website? Are search terms included in the URL? Is personally identifiable information in the URL? Does the service provider employ social sharing tools or third-party web analytics that can gather information about the patron’s activity? Such activity could include IP address (and therefore rough geolocation), content of the web page, cross-site web cookies, and so forth.
How Patron Activity Influences Service Delivery: Is relevancy ranking altered based on the past activity of the user? Can the patron modify the search history to remove unwanted entries or segregate research activities from each other?
What Is the Disposition of Patron Activity Data: Is patron activity data anonymized and commingled with that of other patrons? How is that information used and to whom is it disclosed? How long does the system keep patron activity data? Under what conditions would a provider release information to third parties?
It is arguably the responsibility of libraries to protect patron activity data from unwarranted collection and distribution. Service providers, too, want clear guidance from libraries so they can efficiently expend their efforts to develop systems that librarians feel comfortable promoting. To have each library and service provider audit this activity for each bilateral relationship would be inefficient and cumbersome. By coming to a shared understanding of how patron activity data is collected, used, and disclosed, libraries and service providers can advance their educational roles and offer tools to patrons to manage the disclosure of their activity.
I’ve been working hard on making a few changes to a couple of the MarcEdit internal components to improve the porting work. To that end, I’ve posted an update that targets improvements to the Deduping and the Merging tools.
- Update: Dedup tool — improves the handling of qualified data in the 020, 022, and 035.
- Update: Merge Records Tool — improves the handling of qualified data in the 020, 022, and 035.
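The underlying problem with qualified data is easy to illustrate: an 020 (ISBN) value often carries a trailing qualifier such as a binding note, so two records for the same item fail to match on the raw string. Below is a simplified, hypothetical normalization for dedup matching — not MarcEdit’s actual algorithm:

```python
import re

def normalize_isbn(field_value):
    # Strip a trailing parenthetical qualifier such as "(pbk.)" and any
    # hyphens, keeping just the identifier itself for comparison.
    # Simplified illustration only; not MarcEdit's implementation.
    value = re.sub(r"\(.*?\)", "", field_value)
    return value.replace("-", "").strip()

print(normalize_isbn("9780316769488 (pbk.)"))  # -> "9780316769488"
print(normalize_isbn("978-0-316-76948-8"))     # -> "9780316769488"
```

After normalization the two variant forms compare equal, so the dedup and merge tools can treat them as the same identifier.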
Downloads can be picked up using the automated update tool or by going to: http://marcedit.reeset.net/downloads/
From Claire Knowles, Library Digital Development Manager, University of Edinburgh
Edinburgh, Scotland: We are pleased to announce that Repository Fringe returns this year on the 3rd and 4th of August 2015. The event will be held at the University of Edinburgh and once again coincides with the preview week of the Edinburgh Festival Fringe.
From the boardroom to City Hall, powerful negotiation skills make a big difference in advancing library goals. Power up your ability to persuade at the 2015 American Library Association (ALA) Annual Conference interactive session “The Policy Revolution! Negotiating to Advocacy Success!” 1:00 to 2:30 p.m. on Saturday, June 27, 2015. The session will be held at the Moscone Convention Center in room 2016 of the West building.
American Library Association Senior Policy Counsel Alan Fishel will bring nearly 30 years of legal practice and teaching effective and creative negotiation to the program. Bill & Melinda Gates Foundation Senior Program Officer Chris Jowaisas will share his experience advocating for and advancing U.S. and global library services. From securing new funding to negotiating licenses to establishing mutually beneficial partnerships, today’s librarians at all levels of service are brokering support for the programs, policies and services needed to meet diverse community demands. The session will jump off from a new national public policy agenda for U.S. libraries to deliver new tools you can use immediately at the local, state, national and international levels.
The Policy Revolution! initiative aims to advance national policy for libraries and our communities and campuses. The grant-funded effort focuses on establishing a proactive policy agenda, engaging national decision makers and influencers, and upgrading ALA policy capacity.
Speakers include Larra Clark, deputy director, ALA Office for Information Technology Policy; Alan G. Fishel, partner, Arent Fox; and Chris Jowaisas, senior program officer, Bill and Melinda Gates Foundation.
The post Ramping up negotiation skills to advance library agenda appeared first on District Dispatch.
Some time ago I promised I'd keep this space up to date on how my return to grad school was doing. Turns out I'm pretty lazy with doing that.
While working on the migration mappings for fcrepo3->fcrepo4 properties, I documented all known RELS-EXT and RELS-INT predicates in the Islandora 7.x-1.x code base. The predicates came from two namespaces: fedora and islandora.
The fedora namespace has a published ontology that we use -- relations-external -- and that can be referenced. However, the islandora namespace did not have any published ontologies associated with it.
That said, I have worked over the last couple of weeks with some very helpful folks on drafting initial versions of the Islandora RELS-EXT and RELS-INT ontologies, and the Islandora Roadmap Committee voted that they should be published. The published version of the RELS-EXT ontology can be viewed here, and the published version of the RELS-INT ontology can be viewed here. In addition, the ontologies were drafted in RDFS and include a handy rdf2html.xsl to quickly create a publishable HTML version. These are available on GitHub.
What does this all mean?
We have now documented what we have been doing for the last number of years, and we have a referenceable version of our ontologies. In addition, this is extremely helpful for referencing and documenting predicates that will be a part of an fcrepo3->fcrepo4 migration.
The initial versions of each ontology have proposed rdfs comments, ranges, and skos *matches for a number of predicates. However, this is by no means complete, and I would love to see some community input/feedback on rdfs comments, ranges, additional skos *matches, or anything else that you think should be included in the RELS-EXT ontology.
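To make the kind of feedback I’m after concrete, here is a hypothetical Turtle fragment showing the shape of such an entry — the predicate, namespace URI, and match target below are placeholders, not the published values:

```turtle
@prefix islandora: <http://example.org/islandora/relsext#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# Hypothetical predicate entry with a comment, range, and skos match.
islandora:isPageOf a rdf:Property ;
    rdfs:comment "Relates a page object to the compound object it is a page of."@en ;
    rdfs:range rdfs:Resource ;
    skos:closeMatch <http://example.org/other-ontology#isPartOf> .
```

Suggestions on wording for the comments, tighter ranges, or better-fitting *match targets are exactly the sort of issues I’d love to see filed.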
How to provide feedback?
I'd like to have everything handled through 'issues' on the GitHub repo. If you're comfortable with forking and creating pull requests, by all means do so. If you're more comfortable with replying here, that works as well. All contributions are welcome! The key thing -- for me at least -- is to have community consensus around our understanding of these documented predicates :-)
I have not licensed the repository yet. I had planned on using the Apache 2.0 License as is done with PCDM, but I'd like your thoughts/opinions on proceeding before I make a LICENSE commit.
I hope I have covered it all. But if you have any questions, don't hesitate to ask.
It is almost a sure bet that certain NSA programs will expire at the end of the month. The next Senate vote is set for May 31st. We will be sure to provide updates as we hear them.