“To bring together the records of the past and to house them in buildings where they will be preserved for the use of men and women in the future, a Nation must believe in three things.
It must believe in the past.
It must believe in the future.
It must, above all, believe in the capacity of its own people so to learn from the past that they can gain in judgement in creating their own future.”
– Franklin Roosevelt, at the dedication of his library on June 30, 1941
Earlier this month it was announced that President Barack Obama’s Presidential Library will be built on the South Side of Chicago. It will be our 14th Presidential Library.
The idea originated with FDR, who in his second term, “on the advice of noted historians and scholars, established a public repository to preserve the evidence of the Presidency for future generations.”
Then in 1955, Congress passed the Presidential Libraries Act, establishing a system of privately erected and federally maintained libraries.
Here’s a sampling of images from the Digital Public Library of America related to our presidents and their libraries. Enjoy!
In my copious spare time I have hacked together a thing I’m calling the HathiTrust Research Center Workset Browser, a (fledgling) tool for doing “distant reading” against corpora from the HathiTrust. 
The idea is to: 1) create, refine, or identify a HathiTrust Research Center workset of interest (your corpus), 2) feed the workset’s rsync file to the Browser, 3) have the Browser download, index, and analyze the corpus, and 4) enable the reader to search, browse, and interact with the result of the analysis. With varying success, I have done this with a number of worksets on topics ranging from literature and philosophy to Rome and cookery. The best working examples are the ones from Thoreau and Austen. [2, 3] The others are still buggy.
As a further example, the Browser can/will create reports describing the corpus as a whole. This analysis includes the size of a corpus measured in pages as well as words, date ranges, word frequencies, and selected items of interest based on pre-set “themes”: usage of color words, names of “great” authors, and a set of timeless ideas. This report is based on more fundamental reports such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]
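To give a feel for what a frequency table and a "theme" report involve, here is a minimal sketch in Python. It is not the Browser's code: the color word list, the sample sentence, and the function names are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical theme list; the Browser's real list of color words is not shown here.
COLOR_WORDS = {"red", "green", "blue", "white", "black", "yellow"}

def word_frequencies(text):
    """Return a Counter of the lowercased word tokens found in text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def theme_counts(frequencies, theme_words):
    """Pick out the counts for the words that belong to one theme."""
    return {word: frequencies[word] for word in sorted(theme_words) if frequencies[word]}

# Invented sample text standing in for a page of a corpus.
sample = "The blue pond lay under a blue sky, ringed by green pines and white birches."
frequencies = word_frequencies(sample)
colors = theme_counts(frequencies, COLOR_WORDS)
print(colors)
```

The same two functions could be pointed at any plain text file, which is roughly the spirit of the command-line reports described above.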
The whole thing is written in a combination of shell and Python scripts. It should run on just about any out-of-the-box Linux or Macintosh computer. Take a look at the code. No special libraries needed. (“Famous last words.”) In its current state, it is very Unix-y. Everything is done from the command line. Lots of plain text files and the exploitation of STDIN and STDOUT. Like a Renaissance cartoon, the Browser, in its current state, is only a sketch. Only later will a more full-bodied, Web-based interface be created.
The next steps are numerous and listed in no priority order: putting the whole thing on GitHub, outputting the reports in generic formats so other things can easily read them, improving the terminal-based search interface, implementing a Web-based search interface, writing advanced programs in R that chart and graph analysis, providing a means for comparing & contrasting two or more items from a corpus, indexing the corpus with a (real) indexer such as Solr, writing a “cookbook” describing how to use the browser to do “kewl” things, making the metadata of corpora available as Linked Data, etc.
‘Want to give it a try? For a limited period of time, go to the HathiTrust Research Center Portal, create (refine or identify) a collection of personal interest, use the Algorithms tool to export the collection’s rsync file, and send the file to me. I will feed the rsync file to the Browser, and then send you the URL pointing to the results.  Let’s see what happens.
Fun with public domain content, text mining, and the definition of librarianship.
Links
[1] HTRC Workset Browser – http://bit.ly/workset-browser
[2] Thoreau – http://bit.ly/browser-thoreau
[3] Austen – http://bit.ly/browser-austen
[4] Thoreau report – http://ntrda.me/1LD3xds
[5] Thoreau dictionary (frequency list) – http://bit.ly/thoreau-dictionary
[6] usage of color words in Thoreau – http://bit.ly/thoreau-colors
[7] unique words in the corpus – http://bit.ly/thoreau-unique
[8] Thoreau “catalog” – http://bit.ly/thoreau-catalog
[9] source code – http://ntrda.me/1Q8pPoI
[10] HathiTrust Research Center – https://sharc.hathitrust.org
Economists like to say there are no bad people, just bad incentives. The incentives to publish today are corrupting the scientific literature and the media that covers it. Until those incentives change, we’ll all get fooled again.
Earlier this year I saw Tom Stoppard's play The Hard Problem at the Royal National Theatre, which deals with the same issue. The tragedy is driven by the characters being entranced by the prospect of publishing an attention-grabbing result. Below the fold, more on the problem of bad incentives in science.
Back in April, after a Wellcome Trust symposium on the reproducibility and reliability of biomedical science, Richard Horton, editor of The Lancet, wrote an editorial entitled What is medicine’s 5 sigma? that is well worth a read. His focus is also on incentives for scientists:
In their quest for telling a compelling story, scientists too often sculpt data to fit their preferred theory. Or they retrofit hypotheses to fit their data.
and journal editors:
Our acquiescence to the impact factor fuels an unhealthy competition to win a place in a select few journals. Our love of "significance" pollutes the literature with many a statistical fairy-tale. We reject important confirmations.
and Universities:
in a perpetual struggle for money and talent, endpoints that foster reductive metrics, such as high-impact publication. National assessment procedures, such as the Research Excellence Framework, incentivise bad practices.
Horton points out that:
Part of the problem is that no-one is incentivised to be right. Instead, scientists are incentivised to be productive and innovative.
He concludes:
The good news is that science is beginning to take some of its worst failings very seriously. The bad news is that nobody is ready to take the first step to clean up the system.
Six years ago Marcia Angell, the long-time editor of a competitor to The Lancet, wrote in a review of three books pointing out the corrupt incentives that drug companies provide researchers and Universities:
It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines. I take no pleasure in this conclusion, which I reached slowly and reluctantly over my two decades as an editor of The New England Journal of Medicine.
In most fields, little has changed since then. Horton points to an exception:
Following several high-profile errors, the particle physics community now invests great effort into intensive checking and re-checking of data prior to publication. By filtering results through independent working groups, physicists are encouraged to criticise. Good criticism is rewarded. The goal is a reliable result, and the incentives for scientists are aligned around this goal.
Unfortunately, particle physics is an exception. The cost of finding the Higgs Boson was around $13.25B, but no-one stood to make a profit from it. A single particle physics paper can have over 5,000 authors. The resources needed for "intensive checking and re-checking of data prior to publication" are trivial by comparison. In other fields, the incentives for all actors are against devoting resources which would represent a significant part of the total for the research to such checking.
Fixing these problems of science is a collective action problem; it requires all actors to take actions that are against their immediate interests roughly simultaneously. So nothing happens, and the long-term result is, as Arthur Caplan (of the Division of Medical Ethics at NYU's Langone Medical Center) pointed out, a total loss of science's credibility:
The time for a serious, sustained international effort to halt publication pollution is now. Otherwise scientists and physicians will not have to argue about any issue—no one will believe them anyway.
(see also John Michael Greer). I am not optimistic, based on the fact that the problem has been obvious for many years, and that this is but one aspect of society's inability to deal with long-term problems.
Metadata quality and assessment is a concept that has been around for decades in the library community. Recently it has been getting more interest as new aggregations of metadata become available in open and freely reusable ways such as the Digital Public Library of America (DPLA) and Europeana. Both of these groups make available their metadata so that others can remix and reuse the data in new ways.
I’ve had an interest in analyzing the metadata in the DPLA for a while and have spent some time working on the subject fields. This post continues along those lines by trying to figure out which metrics we can calculate from the DPLA dataset and use to define “quality”. Ideally we will be able to turn these assessments and notions of quality into concrete recommendations for how to improve metadata records in the originating repositories.
This post will focus on normalization of subject strings, and how those normalizations might be useful as a way of assessing quality of a set of records.
One of the powerful features of OpenRefine is the ability to cluster a set of data and combine these clusters into a single entry. Often this will significantly reduce the number of values that occur in a dataset in a quick and easy manner.
OpenRefine has a number of different algorithms that can be used for this work, which are documented in its Clustering in Depth documentation. Depending on one’s data, one approach may perform better than others for this kind of clustering.
Normalization
Case normalization is probably the easiest kind of normalization to understand. If you have two strings, say “Mark” and “marK”, and you convert each of them to lowercase, you end up with a single value of “mark”. Many more complicated normalizations assume this as a start because it reduces the number of subjects without drastically transforming the original string values.
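As a quick illustration, lowercasing a list of values and keeping only the distinct results takes a couple of lines of Python. The function name below is made up for this sketch.

```python
def lowercase_unique(values):
    """Lowercase every value and return the set of distinct results."""
    return {value.lower() for value in values}

names = ["Mark", "marK"]
unique_names = lowercase_unique(names)
print(unique_names)
```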
Case folding is another kind of transformation that is fairly common in the world of libraries. This is the process of taking a string like “José” and converting it to “Jose”. While this can introduce issues if a string is meant to have a diacritic and that diacritic makes the word or phrase different than the one without the diacritic, often times it can help to normalize inconsistently notated versions of the same string.
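One common way to do this in Python, and only one of several, is to decompose each character with the standard library's unicodedata module and drop the combining marks. This is a sketch under that assumption, not necessarily the script used for the numbers later in this post, and the function name is invented.

```python
import unicodedata

def strip_diacritics(value):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

folded = strip_diacritics("José")
print(folded)
```

Note that letters with no decomposition, such as "ø", pass through this function unchanged, so it is not a complete transliteration.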
Libraries have been normalizing data for a long time, and in addition to case folding and lowercasing there have been efforts to formalize algorithms for normalizing subject strings so that they can be matched. Often referred to as the NACO normalization rules, these are formally the Authority File Comparison Rules. I’ve always found this work intriguing, and I have a preference for the simplified algorithm that was developed at OCLC for their NACO Normalization Service. In fact, we’ve taken the sample Python implementation there and created a stand-alone repository and project called pynaco on GitHub so that we could add tests and then port it to Python 3 in the near future.
Another common type of normalization that is performed on strings in library land is stemming. This is often done within search applications so that if you search for one of the terms run, runs, or running you get documents that contain any of them.
What I’ve been playing around with is whether we could use the reduction in unique terms for a field in a metadata repository as an indicator of quality.
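To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy stand-in and not the Porter algorithm: the rules and length thresholds are invented just to handle the run, runs, running example, and it will get plenty of real words wrong.

```python
def naive_stem(word):
    """Strip a few English suffixes; a toy stand-in, not the Porter algorithm."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling: "running" -> "runn" -> "run".
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

stems = {naive_stem(w) for w in ["run", "runs", "running"]}
print(stems)
```

For real work a tested implementation such as the Porter stemmer in NLTK is a much better choice.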
Here is an example.
If we have the following set of subjects:
Musical Instruments
Musical Instruments.
Musical instrument
Musical instruments
Musical instruments,
Musical instruments.
If you applied the simplified NACO normalization from pynaco you would end up with the following strings:
musical instruments
musical instruments
musical instrument
musical instruments
musical instruments
musical instruments
If you then applied the Porter stemming algorithm to the new set of subjects you would end up with the following:
music instrument
music instrument
music instrument
music instrument
music instrument
music instrument
So in effect you have normalized the original set of six unique subjects down to one unique subject string with a NACO transformation followed by a normalization with the Porter stemming algorithm.
Experiment
In some past posts here, here, here, and here, I discussed some of the aspects of the subject fields present in Digital Public Library of America dataset. I dusted that dataset off and extracted all of the subjects from the dataset so that I could work with them by themselves.
I ended up with a set of text files that were 23,858,236 lines long that held the item identifier and a subject value for each subject of each item in the DPLA dataset. Here is a short snippet of what that looks like.
d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers
d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States
d8f192def7107b4975cf15e422dc7cf1 Vaults (Strong rooms)
4aea3f45d6533dc8405a4ef2ff23e324 Public works--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 City planning--Illinois--Chicago
4aea3f45d6533dc8405a4ef2ff23e324 Art, Municipal--Illinois--Chicago
63f068904de7d669ad34edb885925931 Building laws--New York (State)--New York
63f068904de7d669ad34edb885925931 Tenement houses--New York (State)--New York
1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais
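A line in this format can be split back into its identifier and subject with a few lines of Python. This sketch assumes the two fields are separated by whitespace and that the identifier itself contains none; the real files may well use a tab, and the function names are made up.

```python
from collections import defaultdict

def parse_subject_line(line):
    """Split one line into (item identifier, subject string)."""
    # Assumption: the identifier is the first whitespace-delimited token.
    identifier, subject = line.rstrip("\n").split(None, 1)
    return identifier, subject

def group_subjects(lines):
    """Collect the subjects for each item identifier, skipping blank lines."""
    subjects_by_item = defaultdict(list)
    for line in lines:
        if line.strip():
            identifier, subject = parse_subject_line(line)
            subjects_by_item[identifier].append(subject)
    return dict(subjects_by_item)

sample_lines = [
    "d8f192def7107b4975cf15e422dc7cf1 Hoggson Brothers\n",
    "d8f192def7107b4975cf15e422dc7cf1 Bank buildings--United States\n",
    "1f9a312ffe872f8419619478cc1f0401 Benedictine nuns--France--Beauvais\n",
]
grouped = group_subjects(sample_lines)
print(len(grouped), "items")
```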
Once I had the data in this format I could experiment with different normalizations to see what kind of effect they had on the dataset.
Total vs Unique
The first thing I did was to reduce the 23,858,236-line text file to only its unique values. I did this with the tried and true method of using unix sort and uniq.
sort subjects_all.txt | uniq > subjects_uniq.txt
After about eight minutes of waiting I ended up with a new text file subjects_uniq.txt that contains the unique subject strings in the dataset. There are a total of 1,871,882 unique subject strings in this file.
Case folding
Using a Python script to perform case folding on each of the unique subjects, I was able to see that it causes a reduction in the number of unique subjects.
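The measurement used at each step of this experiment can be sketched roughly as follows: apply a normalization to every unique value, count the distinct results, and express the drop as a percentage of the starting count. This is a minimal sketch, not the actual script; the function names and the three sample subjects are invented, and the folding function is one assumed way of removing diacritics.

```python
import unicodedata

def fold(value):
    """Drop diacritics, the transformation this post calls case folding."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def unique_after(normalize, values):
    """Apply a normalization and return the distinct results."""
    return {normalize(value) for value in values}

def reduction(before, after):
    """Return (difference, percent reduction) between two unique counts."""
    difference = before - after
    return difference, round(100.0 * difference / before, 2)

# Invented sample data: two spellings of one name plus an unrelated subject.
subjects = {"José Martí", "Jose Marti", "Bank buildings"}
folded = unique_after(fold, subjects)
small_result = reduction(len(subjects), len(folded))

# The same arithmetic applied to the case folding counts reported in this post.
reported = reduction(1871882, 1867129)
print(small_result, reported)
```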
I started out with 1,871,882 unique subjects and after case folding ended up with 1,867,129 unique subjects. That is a difference of 4,753 or a 0.25% reduction in the number of unique subjects. So nothing huge.
Lowercase
The next normalization tested was lowercasing of the values. I chose to do this on the set of subjects that were already case folded to take advantage of the previous reduction in the dataset.
By converting the subject strings to lowercase I reduced the number of unique case folded subjects from 1,867,129 to 1,849,682. That is a difference of 22,200 or a 1.2% reduction from the original 1,871,882 unique subjects.
NACO Normalization
Next we look at the simple NACO normalization from pynaco. I applied this to the unique lower cased subjects from the previous step.
With the NACO normalization, I ended up with 1,826,523 unique subject strings from the 1,849,682 that I started with from the lowercased subjects. This is a difference of 45,359 or a 2.4% reduction from the original 1,871,882 unique subjects.
Porter stemming
Moving along, the next thing I looked at was applying the Porter stemming algorithm to the output of the NACO normalized subjects from the previous step. I used the Porter implementation from the Natural Language Toolkit (NLTK) for Python.
With the Porter stemmer applied, I ended up with 1,801,114 unique subject strings from the 1,826,523 that I started with from the NACO normalized subjects. This is a difference of 70,768 or a 3.8% reduction from the original 1,871,882 unique subjects.
Fingerprint
Finally I used a Python port of the fingerprint algorithm that OpenRefine uses for its clustering feature. This will help to normalize strings like “phillips mark” and “mark phillips” into a single value of “mark phillips”. I used the output of the previous Porter stemming step as the input for this normalization.
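For readers who have not seen it, the fingerprint idea can be sketched in a few lines. This follows the steps as I understand them from OpenRefine's clustering documentation (trim, lowercase, drop punctuation, fold to ASCII, then sort and de-duplicate the tokens); it is an illustrative approximation, not the port used for the numbers below, and it only strips ASCII punctuation.

```python
import re
import string
import unicodedata

# Matches any single ASCII punctuation character.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")

def fingerprint(value):
    """Build an order-insensitive key for a string, in the style of OpenRefine."""
    value = value.strip().lower()
    value = PUNCTUATION.sub("", value)
    # Fold accented characters to their closest ASCII form and drop the rest.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Sort the distinct tokens so word order no longer matters.
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

keys = {fingerprint(s) for s in ["phillips mark", "mark phillips", "Phillips, Mark"]}
print(keys)
```

Because the tokens are sorted, two strings that differ only in word order, case, or punctuation collapse to the same key.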
With the fingerprint algorithm applied, I ended up with 1,766,489 unique fingerprint normalized subject strings. This is a difference of 105,393 or a 5.6% reduction from the original 1,871,882 unique subjects.
Overview
| Normalization | Reduction | Occurrences | Percent Reduction |
|---|---|---|---|
| Unique | 0 | 1,871,882 | 0% |
| Case Folded | 4,753 | 1,867,129 | 0.3% |
| Lowercase | 22,200 | 1,849,682 | 1.2% |
| NACO | 45,359 | 1,826,523 | 2.4% |
| Porter | 70,768 | 1,801,114 | 3.8% |
| Fingerprint | 105,393 | 1,766,489 | 5.6% |
Conclusion
I think that it might be interesting to apply this analysis to the various Hubs in the whole DPLA dataset to see if there is anything interesting to be seen across the various types of content providers.
I’m also curious if there are other kinds of normalizations that would be logical to apply to the subjects that I’m blanking on. One that I would probably want to apply at some point is the normalization for LCSH that splits a subject into its parts if it has the double hyphen (--) in the string. I wrote about the effect on the subjects for the DPLA dataset in a previous post.
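The double hyphen split itself is simple. A minimal sketch, with a made-up function name and one of the subject strings from the snippet earlier in this post:

```python
def split_subject(subject):
    """Split a subject string into its parts on the double hyphen."""
    return [part.strip() for part in subject.split("--") if part.strip()]

parts = split_subject("Building laws--New York (State)--New York")
print(parts)
```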
As always feel free to contact me via Twitter if you have questions or comments.
Last week I had the pleasure of presenting a short talk at the second virtual meeting of the NISO effort to reach a Consensus Framework to Support Patron Privacy in Digital Library and Information Systems. The slides from the presentation are below and on SlideShare, followed by a cleaned-up transcript of my remarks.
It looks like in the agenda that I’m batting in the clean-up role, and my message might be pithily summarized as “Can’t we all get along?” A core tenet of librarianship — perhaps dating back to the 13th and 14th century when this manuscript was illuminated — is to protect the activity trails of patrons from unwarranted and unnecessary disclosure.
This is embedded in the ethos of librarianship. As Todd pointed out in the introduction, the third principle of the American Library Association’s Code of Ethics states: “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.” Librarians have performed this duty across time and technology, and as both have progressed the profession has sought new ways to protect the privacy of patrons.
For instance, there was once a time when books had a pocket in the back that held a card showing who had checked out the book and when it was due. Upon checkout the card was taken out, had the patron’s name embossed or written on it, and was stored in a date-sorted file so that the library knew when it was due and who had it checked out. When the book was returned, the name was scratched through before putting the card in the pocket and the book on the shelf. Sometimes, as a process shortcut, the name was left “in the clear” on the card, and anyone that picked the book off the shelf could look on the card to see who had checked it out.
When libraries automated their circulation management with barcodes and database records, the card in the back of the book and the information it disclosed was no longer necessary. This was hailed as one of the advantages to moving to a computerized circulation system. While doing away with circulation cards eliminated one sort of privacy leakage — patrons being able to see what each other had checked out — it enabled another: systematic collection of patron activity in a searchable database. Many automation systems put in features that automatically removed the link between patron and item after it was checked in. Or, if that information was stored for a period of time, it was password protected so only approved staff could view the information. Some, however, did not, and this became a concern with the passage of the USA PATRIOT act by the United States Congress.
We are now in an age where patron activity is scattered across web server log files, search histories, and usage analytics of dozens of systems, some of which are under the direct control of the library while others are in the hands of second and third party service providers. Librarians that are trying to do their due diligence in living up to the third principle of the Code of Ethics have a more difficult time accounting for all of the places where patron activity is collected. It has also become more difficult for patrons to make informed choices about what information is collected about their library activity and how it is used.
In the mid-2000s, libraries and content providers had a similar problem: the constant one-off negotiation of license terms was a burden to all parties involved. In order to gain new efficiencies in the process of acquiring and selling licensed content, representatives from the library and publisher communities came together under a NISO umbrella to reach a shared understanding of what the terms of an agreement would be and a registry of organizations that subscribed to those terms. Quoting from the foreword of the 2012 edition: “The Shared Electronic Resource Understanding (SERU) Recommended Practice offers a mechanism that can be used as an alternative to a license agreement. The SERU statement expresses commonly shared understandings of the content provider, the subscribing institution and authorized users; the nature of the content; use of materials and inappropriate uses; privacy and confidentiality; online performance and service provision; and archiving and perpetual access. Widespread adoption of the SERU model for many electronic resource transactions offers substantial benefits both to publishers and libraries by removing the overhead of bilateral license negotiation.”
Today’s web services are filled with social sharing widgets (Facebook, Twitter, and the like), web analytics tools (Google Analytics), and content from advertising syndicates. While these tools provide useful services to the patrons, libraries and service providers, they also become centralized points of data gathering that can aggregate a user’s activity across the web. Does your library catalog page include a Facebook “Like” button? Whether or not the patron clicks on that button, Facebook knows that user has browsed to that web page and can glean details of user behavior from that. Does your service use Google Analytics to understand user behavior and demographics? Google Analytics tracks user behavior across an estimated one half of the sites on the internet. Your user’s activity as a patron of your services is commingled with their activity as a general user.
A “filter bubble” is a phrase coined by Eli Pariser to describe a system that adapts its output based on what it knows about a user: location, past searches, click activity, and other signals. The system is using these signals to deliver what it deems to be more relevant information to the user. In order to do this, the system must gather, store and analyze this information from patrons. However, a patron may not want his or her past search history to affect their search results. Or, even worse, when activity is aggregated from a shared terminal, the results can be wildly skewed.
Simply using a library-subscribed service can transmit patron activity and intention to dozens of parties, and all of it invisible to the user. To uphold that third principle in the ALA Code of Ethics, librarians need to examine the patron activity capturing practices of their information suppliers, and that can be as unwieldy as negotiating bilateral license agreements between each library and supplier. If we start from the premise that libraries, publishers and service providers want to serve the patron’s information needs while respecting their desire to do so privately, what is needed is a shared understanding of how patron activity is captured, used, and discarded. A new gathering of librarians and providers could accomplish for patron activity what they did for electronic licensing terms a decade ago. One could imagine discussions around these topics:
What Information is Collected From the Patron: When is personally identifiable information captured in the process of using the provider’s service? How is activity tagged to a particular patron, both before and after the patron identifies himself or herself? Are search histories stored? Is the patron activity encrypted, both in transit on the network and at rest on the server?
What Activity Can Be Gleaned by Other Parties: If a patron follows a link to another website, how much of the context of the patron’s activity is transferred to the new website? Are search terms included in the URL? Is personally identifiable information in the URL? Does the service provider employ social sharing tools or third party web analytics that can gather information about the patron’s activity? Such activity could include IP address (and therefore rough geolocation), content of the web page, cross-site web cookies, and so forth.
How does patron activity influence service delivery: Is relevancy ranking altered based on the past activity of the user? Can the patron modify the search history to remove unwanted entries or segregate research activities from each other?
What is the disposition of patron activity data: Is patron activity data anonymized and co-mingled with others? How is that information used and to whom is it disclosed? How long does the system keep patron activity data? Under what conditions would a provider release information to third parties?
It is arguably the responsibility of libraries to protect patron activity data from unwarranted collection and distribution. Service providers, too, want clear guidance from libraries so they can efficiently expend their efforts to develop systems that librarians feel comfortable promoting. To have each library and service provider audit this activity for each bilateral relationship would be inefficient and cumbersome. By coming to a shared understanding of how patron activity data is collected, used, and disclosed, libraries and service providers can advance their educational roles and offer tools to patrons to manage the disclosure of their activity.
I’ve been working hard on making a few changes to a couple of the MarcEdit internal components to improve the porting work. To that end, I’ve posted an update that targets improvements to the Deduping and the Merging tools.
- Update: Dedup tool — improves the handling of qualified data in the 020, 022, and 035.
- Update: Merge Records Tool — improves the handling of qualified data in the 020, 022, and 035.
Downloads can be picked up using the automated update tool or by going to: http://marcedit.reeset.net/downloads/
From Claire Knowles, Library Digital Development Manager, University of Edinburgh
Edinburgh, Scotland We are pleased to announce that Repository Fringe returns this year on the 3rd and 4th of August 2015. The event will be held at the University of Edinburgh and coincides once again with preview week to the Edinburgh Festival Fringe.
From the boardroom to City Hall, powerful negotiation skills make a big difference in advancing library goals. Power up your ability to persuade at the 2015 American Library Association (ALA) Annual Conference interactive session “The Policy Revolution! Negotiating to Advocacy Success!” 1:00 to 2:30 p.m. on Saturday, June 27, 2015. The session will be held at the Moscone Convention Center in room 2016 of the West building.
American Library Association Senior Policy Counsel Alan Fishel will bring nearly 30 years of legal practice and teaching effective and creative negotiation to the program. Bill & Melinda Gates Foundation Senior Program Officer Chris Jowaisas will share his experience advocating for and advancing U.S. and global library services. From securing new funding to negotiating licenses to establishing mutually beneficial partnerships, today’s librarians at all levels of service are brokering support for the programs, policies and services needed to meet diverse community demands. The session will jump off from a new national public policy agenda for U.S. libraries to deliver new tools you can use immediately at the local, state, national and international levels.
The Policy Revolution! initiative aims to advance national policy for libraries and our communities and campuses. The grant-funded effort focuses on establishing a proactive policy agenda, engaging national decision makers and influencers, and upgrading ALA policy capacity.
Speakers include Larra Clark, deputy director, ALA Office for Information Technology Policy; Alan G. Fishel, partner, Arent Fox; and Chris Jowaisas, senior program officer, Bill and Melinda Gates Foundation.
The post Ramping up negotiation skills to advance library agenda appeared first on District Dispatch.
Some time ago I promised I'd keep this space up to date on how my return to grad school was doing. Turns out I'm pretty lazy with doing that.
While working on the migration mappings for fcrepo3->fcrepo4 properties, I documented all known RELS-EXT and RELS-INT predicates in the Islandora 7.x-1.x code base. The predicates came from two namespaces: fedora and islandora.
The fedora namespace has a published ontology that we use -- relations-external -- and that can be referenced. However, the islandora namespace did not have any published ontologies associated with it.
That said, I have worked over the last couple of weeks with some very helpful folks on drafting initial versions of the Islandora RELS-EXT and RELS-INT ontologies, and the Islandora Roadmap Committee voted that they should be published. The published version of the RELS-EXT ontology can be viewed here, and the published version of the RELS-INT ontology can be viewed here. In addition, the ontologies were drafted in rdfs, and include a handy rdf2html.xsl to quickly create a publishable html version. This is available on GitHub.
What does this all mean?
We have now documented what we have been doing for the last number of years, and we have a referenceable version of our ontologies. In addition, this is extremely helpful for referencing and documenting predicates that will be a part of an fcrepo3-fcrepo4 migration.
The initial versions of each ontology have proposed rdfs comments, ranges and skos *matches for a number of predicates. However, this is by no means complete, and I would love to see some community input/feedback on rdfs comments, ranges, additional skos *matches, or anything else that you think should be included in the RELS-EXT ontology.
How to provide feedback?
I'd like to have everything handled through 'issues' on the GitHub repo. If you are comfortable with forking and creating pull requests, by all means do so. If you're more comfortable with replying here, that works as well. All contributions are welcome! The key thing -- for me at least -- is to have community consensus around our understanding of these documented predicates :-)
I have not licensed the repository yet. I had planned on using the Apache 2.0 License as is done with PCDM, but I'd like your thoughts/opinions on proceeding before I make a LICENSE commit.
I hope I have covered it all. But, if you have any questions, don't hesitate to ask.
It is almost a sure bet that certain NSA programs will expire at the end of the month. The next Senate vote is set for May 31st. We will be sure to provide updates as we hear them.
It’s that time of the year again. That time when the international open data community descends on an unsuspecting city for a jam-packed week of camps, meet-ups, hacks and conference events. Next week, open data enthusiasts will be taking over Ottawa and Open Knowledge will be there in full force! As we don’t want to miss an opportunity to meet with anyone, we have put together a list of events that we will be involved in and ways to get in touch. We have also started collecting this information in a spreadsheet!
The School of Data team is arriving early for the second annual School of Data Summer Camp. Every year we strive to bring the entire School of Data community together for three intense days to plan future activities, to learn from each other, to improve our skills and ways of working and to give new School of Data fellows the opportunity to meet their future collaborators! This year’s School of Data Summer Camp will take place at the HUB Ottawa. We’ll have a meet and greet on one of the evenings for School of Data family and friends – so watch this space for details, or follow @SchoolofData on Twitter.
Wednesday is going to be a busy day as we will be spread out across three events – CKANCon, organised by the CKAN association, the Opening Parliaments Fringe Event and the Open Data Con Research Symposium, where we will be presenting new work on measuring and assessing open data initiatives and on “participatory data infrastructures”.
At the International Open Data Conference, Open Knowledge team members are co-organising or presenting at the following sessions:
- Data & Public Money – Thursday May 28th, 11:00 – 12:15
- Data & Fiscal Transparency – Thursday May 28th, 13:30 – 15:30
- Open Data Advocacy Clinic – Friday 29th, 10:30 – 12:30
- Capacity Building for All – Friday 29th, 10:30 – 12:00
- Measuring Open Data Impacts – Friday 29th, 13:30 – 15:00
- Public Interest Innovation with Open Data – Friday 29th, 13:30 – 15:00
- School of Data Day – Friday 29th, featuring a drop-in data clinic session in the morning, where you can come to us with your data projects, problems and questions, and we’ll talk them through with you, then a short data expedition in the afternoon, 13:30 – 15:00
As you can probably see, the week is going to be a busy one and we are aware that it will be difficult to schedule meetings with everyone! To accommodate, the Open Knowledge team and the entire School of Data family are organising informal drinks at The Brig Pub from 7:30 PM Thursday evening! We would love for you to come say hello in person or you can always find Pavel (Open Knowledge’s new CEO!!!!), Zara, Milena, Jonathan, Mor, Sander, Katelyn, School of Data & of course Open Knowledge on Twitter!
Safe travels and we will see you in Ottawa!
Via Gary Price’s announcement on InfoDocket comes word of a cost-benefit analysis for the wholesale adoption of ORCID identifiers by eight institutions in the U.K. The report, Institutional ORCID, Implementation and Cost Benefit Analysis Report [PDF], looks at the perspectives of stakeholders, a summary of findings from the pilot institutions, a preliminary cost-benefit analysis, and a 10-page checklist of consideration points for higher education institutions looking to adopt ORCID identifiers in their information systems. The most interesting bits of the executive summary came from the part discussing the findings from the pilot institutions.
Perhaps surprisingly, technical issues were not the major issue for most pilot institutions. A range of technical solutions to the storage of researchers’ ORCID iDs were utilised during the pilots. … Of the eight pilot institutions, only one chose to bulk create ORCID iDs for their researchers, the others opted for the ‘facilitate’ approach to ORCID registration.
Most pilot institutions found it relatively easy to persuade senior management about the institutional benefits of ORCID but many found it difficult to articulate the benefits to individual researchers. Several commented that staff saw it as ‘another level of bureaucracy’ and it was also noted that concurrent Open Access (OA), REF and ORCID activities can make the message confused, as they overlap. … Clear and effective messages (as short and precise as possible), creating a well-defined brand for ORCID and the targeting of specific audiences and audience segments were identified as being especially important.
One thing I found surprising in the report was the lack of any mention of the usefulness of ORCID identifiers in the linked data universe. The word “linked” appeared six times in the report; five of the six mentions talk about connections between campus systems and ORCID. It would seem that some of the biggest benefits of ORCID iDs will come when they appear as the object of a subject-predicate-object triple in data published and consumed by various systems on the open web. That is, as part of linked open data.
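As a rough illustration of that last point, here is a minimal sketch of an ORCID iD sitting in the object position of a triple. The article URI is a hypothetical placeholder, the iD is the sample identifier ORCID uses in its own documentation, and the choice of the Dublin Core `creator` predicate is an assumption; none of this comes from the report.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject-predicate-object statement, all three given as IRIs."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize a triple of three IRIs as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# Hypothetical article URI; the ORCID iD is ORCID's documented sample iD.
article_creator = Triple(
    subject="http://example.org/articles/42",
    predicate="http://purl.org/dc/terms/creator",
    obj="https://orcid.org/0000-0002-1825-0097",
)

line = to_ntriples(article_creator)
print(line)
```

Because the object is a resolvable identifier instead of a name string, any system consuming this statement can tell which researcher is meant without matching on spelling.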
A book that a few of our colleagues have been working on for quite some time has now been released: Library Linked Data in the Cloud: OCLC’s Experiments with New Models of Resource Description. You can also preview it on Google Books.
OCLC Research has been working with linked data for years, and we have developed processes for mining our MARC record database into linked and linkable entities. This book reports on a lot of that work, the problems we ran into and some of the solutions we created.
The main sections are:
- Library Standards and the Semantic Web
- Modeling Library Authority Files
- Modeling and Discovering Creative Works
- Entity Identification Through Text Mining
- The Library Linked Data Cloud
There are likely few people who have had as much experience parsing library data into linked data triples as the authors of this book and their OCLC Research colleagues. Therefore, anyone seeking to create or use library linked data would do well to study this book. You can take my word for it.
About Roy Tennant: Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.
[UPDATE: IMLS HAS POSTPONED THIS WEBINAR, AND WILL ANNOUNCE A NEW DATE AND TIME IN THE COMING WEEKS]
Next week, the Institute of Museum and Library Services (IMLS) and U.S. Citizenship and Immigration Services (USCIS) will host a free webinar for public librarians on the topic of immigration and U.S. citizenship. Join in to learn more about what resources are available to assist libraries that provide immigrant and adult education services. The webinar will provide an overview of how libraries can expand these services and even acquire free materials to display.
Date: May 27, 2015
Time: 2:00 – 3:00 p.m. EDT
Click here to register
Participation in previous webinars on this topic is not required. Registration is not required, but the agencies recommend that you check your system for compatibility in advance.
This series was developed as part of a partnership between IMLS and USCIS to ensure that librarians have the necessary tools and knowledge to refer their patrons to accurate and reliable sources of information on immigration-related topics. To find out more about the partnership and the webinar series, visit the Serving New Americans page of the IMLS website or on the USCIS website.
The post IMLS announces new immigration webinar for public libraries appeared first on District Dispatch.
The following is a guest post by Abbie Grotke, Lead Information Technology Specialist on the Web Archiving Team, Library of Congress.
Recently the Library of Congress launched a significant amount of new Web Archive content on the Library’s Web site, as a part of a continued effort to integrate the Library’s Web Archives into the rest of the loc.gov web presence.
This is our first big release since we launched the first iteration of collections into this new interface, back in June 2013. The earlier approach to presenting archived web sites made it difficult to increase the amount of content available, so in a “one step back, two steps forward” move, the interface has been simplified and should be more familiar to those working with Web Archives at other institutions: item records point to archived web sites displayed in an open-source version of the Wayback Machine. This simplification allowed the Library to increase the number of sites available in this interface from just under 1,000 to over 5,800. The most recently harvested sites now publicly available were harvested in March-April 2012. The simplified approach should also allow us to catch up on moving more current content into the online collections.
There are now 21 named collections available in the new interface; some had been available in our old interface but are newly migrated; other content is entirely new. With this launch, we are particularly excited about the addition of the United States Congressional Web Archives, which for the first time allows researchers to access content collected from December 2002 through April 2012. Each record covers those sessions where a particular member of Congress was serving, such as for Barack Obama as senator during two sessions, or the example of Kirsten E. Gillibrand serving in the House and Senate, represented on one record despite a URL change.
Other newly available collections include the Burma/Myanmar General Election 2010 Web Archive, Egypt 2008 Web Archive, Laotian General Election 2011 Web Archive, Thai General Election 2011 Web Archive, Vietnamese General Election 2011 Web Archive and the Winter Olympic Games 2002 Web Archive.
We still have some work to do to move the U.S. Election Web Archives from our old interface, so for the time being researchers interested in those collections will need to refer back to the old site. Eventually we will be combining the separate Election collections into one U.S. Election Archive that will allow better searchability and access, and migrating them over (and then “turning off” the old interface).
We hope researchers will enjoy access to these new web archive collections.
I’ve recently attended some of my first conferences/meetings post-MLIS and I thought I’d pass on the information I learned from my experience navigating them for the first time.
Courtesy of Jatenipit. Pixabay 2014
Always be prepared to promote
This is the most dreaded aspect of networking. It essentially implies schmoozing and self-aggrandizement, but if you consider it as socializing you’ll realize it’s an essential part of getting to know others in the profession and the roles they play in their organization. If you’re new to the information profession, it can be a great opportunity to ask other professionals about the path they took to enter the industry. More often than not when they find that you’re new to the profession, they’ll offer you advice. They’ll be curious to know what your career goals are and why you’re attending. This is a great opportunity to ask for their business card or contact information. If you find that you’ve built a good rapport and want to become more familiar with their work/organization, you should offer your business card (more on this later).
The thing about promotion
If you’re at a conference on behalf of an organization, then you’re on the company dollar. Therefore your mission is to network, learn and share. Since I plan on attending conferences to learn more about the profession and network, I couldn’t talk shop about procedures and management. If you are attending on behalf of an organization you’re expected to create professional networks and trace them back to your institution. It sounds intimidating, but if you allow yourself to soak up as much information as possible, while being open about what works and doesn’t for your information environment, you’ll find others may want to emulate your framework and share theirs in return.
You have leverage too
Believe it or not the pros don’t know everything. Sometimes when you’re new to a profession you can become caught up in what you don’t know and the list of skills you need to get to that ever distant “next level.” I was very surprised to find that many of the resources I was familiar with escaped the purview of individuals working in the digital records management and archives field. I introduced The Signal Digital Preservation and the Cancer Imaging Archive into a conversation and a few individuals took genuine interest in my explanation of their services. While earning your degree or working in different information environments, you are exposed to a variety of resources and ideas that others aren’t aware of. Don’t count yourself out; you have something to add to the conversation.
Think outside the box
There is no need to be intimidated about approaching new acquaintances during a professional conference. Most of the time you’re meeting with people who remember what it’s like to be at the forefront of a new career. It can be exciting and informative to strike up a conversation with a presenter. There is nothing wrong with inquiring about lunch plans and meeting outside of the conference venue during scheduled breaks. The relaxed atmosphere of a restaurant is where funny stories of the trade can be passed along and you’ll get to know each other on a personal level. There are several factors that account for good networking and having an outgoing personality is one of them. Being personable is fine, as long as you remain respectful.
Handy business cards
If you’re using a conference to network for future employment, then you need to have business cards. At larger conferences you can be one of hundreds of attendees. Business cards are a great way to establish that you’re prepared and professional. However, providing an acquaintance with your contact information is not enough. You may also want to ask for their card if you want to continue the conversation after the conference concludes. It’s likely that they’ll never take a look at your business card again, so it’s important to follow up with an e-mail to remind them of the highlights of the conversation you had and how you’d like to collaborate with them going forward.
If you’re hoping to enter a new field post-graduation, at a minimum your business card should include: your name, degree(s) and university, your phone number and e-mail. You can also add a specialization to encompass your career trajectory such as Librarian, Electronic Resource Specialist or Certified Webmaster. For points of contact beyond your phone number and e-mail, providing your website, online portfolio or LinkedIn URL is a great way to showcase your web presence. If you can connect with another professional’s LinkedIn, you will not only increase their awareness of you, but you will be exposed to their extended network as well.
An added bonus
If you are networking for employment, one thing that you don’t want to do is ask outright about potential employment with another attendee’s organization. I’ve seen this happen before and it can be off-putting for the person being asked as well as anyone involved in the conversation. If you’re a new graduate or changing careers, the conversation will naturally flow into questions about your career plans. If the person you’re speaking with feels inclined to mention an upcoming opportunity, then it is an added bonus. Otherwise, enjoy yourself and take advantage of the learning opportunity. You’ll be in a room filled with like-minded professionals and everyone wants to get the most out of their experience.
Are you planning on attending any conferences this year? What takeaways do you have from conferences you’ve attended in the past? Let me know in the comments section.
We’re happy to announce that the WorldShare License Manager API is now available in Production.