The Minnesota Digital Library (MDL) is one of four DPLA Service Hubs to be sub-awarded a grant from the Bill and Melinda Gates Foundation, through the DPLA, for the Public Library Partnership Project (PLPP). The purpose of PLPP is to develop a curriculum for teaching basic digitization concepts and skills and pilot it through workshops for public library staff, encourage and facilitate their participation in their local digital libraries and DPLA, and create collaborative online exhibitions based on materials digitized through this project. At the end of PLPP, we will also be sharing a self-guided version of the curriculum we built.
MDL was very pleased with the success of our implementation of the first stage of PLPP—we offered four digital skills training sessions to thirty-one individuals from twenty-two different public libraries and collaborating historical societies around Minnesota. The training was so well received that we hope to incorporate similar basic group training sessions into our ongoing recruitment and preparation of potential participants.
We are now deep into the second phase of the PLPP in which the organizations propose projects, select appropriate materials from their collections and send them to us for digitization and metadata preparation. An early success was the contribution of a 1930 plat book of Polk County by the Fosston Public Library, the first organization to contribute to MDL from this county.
One of the challenges we face is that, because of a very strong network of local historical societies throughout Minnesota, our public libraries don’t often have significant collections of archival or historic materials (Hennepin County Library being one important exception). However, we have been able to leverage our PLPP resources to encourage and support collaboration between public libraries and other organizations in their communities. In some cases, public libraries made new connections with city or county offices when collaborators realized they had materials that were worth preserving and making accessible, but didn’t know how to go about it and were not aware of MDL. Public library participants in PLPP were able to identify these materials, make the case for online access, facilitate an avenue for digitization, and share description and rights assessment work. Because of the connections made via our PLPP library participants we’ll be digitizing the portraits of Duluth mayors, the master plans for county parks from the Washington County Park Board, and historically significant and previously inaccessible materials from the Minneapolis Parks and Recreation Board, among other projects.
The Gates-funded project will wrap up at the end of September 2015. Between now and then we will be completing additional projects and developing two online exhibitions built in part on materials digitized through this grant.
PLPP has strengthened our relationship with public libraries around the state, improved the digitization knowledge of public library staff, increased our capacity, and brought in materials to which we would otherwise not have had access. MDL has been more than pleased by the outcomes of our participation in the PLPP!
Carla Urban will be co-leading a digitization training session, with Sheila McAllister of the Digital Library of Georgia, at DPLAfest 2015. To learn more about PLPP and lessons learned, come participate in the discussion!
Header image: Detroit Public Library, Detroit, Minnesota, 1913. Courtesy of Becker County Historical Society via Minnesota Digital Library.
If you appreciate the critical roles that libraries play in creating an informed and engaged citizenry, register now for this year’s National Library Legislative Day (NLLD), a two-day advocacy event where hundreds of library supporters, leaders and patrons will meet with their legislators to advocate for library funding.
National Library Legislative Day, which is hosted by the American Library Association (ALA), will be held May 4-5, 2015, in Washington, D.C. Now in its 41st year, National Library Legislative Day focuses on the most pressing issues, including the need to fund the Library Services and Technology Act, support legislation that gives people who use libraries access to federally-funded scholarly journal articles and continued funding that provides school libraries with vital materials.
National Library Legislative Day Coordinators from each U.S. state arrange all advocacy meetings with legislators, communicate with the ALA Washington Office and serve as the contact person for state delegations. The ALA Washington Office will host a special training session on Sunday (May 3rd) afternoon for first-timers. On the first day of the event, participants will receive training and issue briefings to prepare them for meetings with their members of Congress.

Advocate from Home
Advocates who cannot travel to Washington for National Library Legislative Day can still make a difference and speak up for libraries. As an alternative, the American Library Association sponsors Virtual Library Legislative Day, which takes place on May 5, 2015. To participate in Virtual Library Legislative Day, register now for American Library Association policy action alerts.
The post Interested in Natl. Library Legislative Day? Here’s what you need to know appeared first on District Dispatch.
Wanna be a peer reviewer for AccessYYZ? Excellent, because we need some of those.
If you’re interested, please shoot an email to firstname.lastname@example.org by March 27th, 2015 with the following information:
Current Position (including whether you are a student)
Have you been to Access before?
Have you presented at Access before?
Have you done peer review for Access before?
Come to think of it, a CV/resume would be nice. Yes, make sure you include that too.
Every once in a while something really interesting comes up on the listserv, and I try to bring the highlights here to the blog so that it will get exposure with a wider audience. Right now, that interesting thing is the nascent Dev Ops Interest Group.
Interest Groups have been a thing in Islandora for just about a year now. They are a way for members of the Islandora community with similar interests, challenges, and projects to come together to share resources and discuss the direction the project should take in the future. We have one for Preservation, Archives, Documentation, GIS, and Fedora 4 (which has become the guiding group for the Islandora/Fedora 4 upgration project). Following up on some conversations from iCampBC, Mark Jordan has proposed that there might be a need for a Dev Ops Interest Group as well, for the folks who spend their time actually deploying Islandora to come together and talk strategies.
As you can see from the thread, the interest is certainly there, and I expect to be announcing the formation of a new Interest Group in the days to come. But what brings this subject out from the list is the challenge issued by our Release Manager and Upgration Guru, Nick Ruest:
I'm sitting here waiting for the 7.x-1.5RC1 VM to upload to the release server, and I'm thinking...
I propose or challenge the following:
All of those who have expressed interest in the group, would you be willing to collaborate on creating a canonical development and release VM Vagrant file? I think this is probably the most pressing need to grow our developer community.
I can create a shared repo in the Islandora-Labs organization, and add anyone willing to contribute to it.
I can get us started. I'll cannibalize what we have for the Islandora & Fedora integration project.
We could cannibalize Islandora Chef.
We could cannibalize anything y'all are willing to bring to the table.
Things to think about and sort out - CLAs, LSAP, 7.x-1.6 release. Those who are willing to contribute, should be aware that if this is given to the foundation via LSAP, we'll all have to be covered by a CLA, and do you think we could get this finished by the 7.x-1.6 release? I think we could.
Other benefits: by sticking with bash scripts and Vagrant, we can take advantage of other DevOps platforms. I'm thinking specifically of Packer.io and virtually looking toward Kevin Clarke. Wouldn't it be great if we finally had that Docker container whose tires have been kicked a couple of times?
Anyway, let me know what you think. If you think I'm crazy, that's ok too :-)
A crowd-sourced development environment for Islandora. And in fact, the first draft is already out there, just waiting for you to try it out and contribute. And prove to Nick that he's not crazy.
Part 2-ba of Amazon crawl..
This item belongs to: data/ol_data.
This item has files of the following types: Data, Data, Metadata, Text
The other day I made this blog, galencharlton.com/blog/, HTTPS-only. In other words, if Eve wants to sniff what Bob is reading on my blog, she’ll need to do more than just capture packets between my blog and Bob’s computer to do so.
This is not bulletproof: perhaps Eve is in possession of truly spectacular computing capabilities or a breakthrough in cryptography and can break the ciphers. Perhaps she works for any of the sites that host external images, fonts, or analytics for my blog and has access to their server logs containing referrer header information. Currently these sites are Flickr (images), Gravatar (more images), Google (fonts) and WordPress (site stats – I will be changing this soon, however). Or perhaps she’s installed a keylogger on Bob’s computer, in which case anything I do to protect Bob is moot.
Or perhaps I am Eve and I’ve set up a dastardly plan to entrap people by recording when they read about MARC records, then showing up at Linked Data conferences and disclosing that activity. Or vice versa. (Note: I will not actually do this.)
So, yes – protecting the privacy of one’s website visitors is hard; often the best we can do is be better at it than we were yesterday.
To that end, here are some notes on how I made my blog require HTTPS.

Certificates
I got my SSL certificate from Gandi.net. Why them? Their price was OK, I already register my domains through them, and I like their corporate philosophy: they support a number of free and open source software projects; they’re not annoying about up-selling; and they have never (to my knowledge) run sexist advertising, unlike some of their larger and more well-known competitors. But there are, of course, plenty of options for getting SSL certificates, and once Let’s Encrypt is in production, it should be both cheaper and easier for me to replace the certs next year.
I have three subdomains of galencharlton.com that I wanted a certificate for, so I decided to get a multi-domain certificate. I consulted this tutorial by rtCamp to generate the CSR.
After following the tutorial to create a modified version of openssl.conf specifying the subjectAltName values I needed, I generated a new private key and a certificate-signing request as follows:

openssl req -new -key galencharlton.com.key \
    -out galencharlton.com.csr \
    -config galencharlton.com.cnf \
    -sha256
The openssl command asked me a few questions, the most important of which was the value of the common name (CN) field; I used “galencharlton.com” for that, as that’s the primary domain that the certificate protects.
I then entered the text of the CSR into a form and paid the cost of the certificate. Since I am a library techie, not a bank, I purchased a domain-validated certificate. That means that all I had to do was prove to the certificate’s issuer that I had control of the three domains that the cert should cover. That validation could have been done via email to an address at galencharlton.com or by inserting a special TXT record into the DNS zone file for galencharlton.com. I ended up choosing the route of placing a file on the web server whose contents and location were specified by the issuer; once they (or rather, their software) downloaded the test files, they had some assurance that I had control of the domain.
In due course, I got the certificate. I put it and the intermediate cert specified by Gandi in the /etc/ssl/certs directory on my server and the private key in /etc/private/.

Operating System and Apache configuration
Various vulnerabilities in the OpenSSL library or in HTTPS itself have been identified and mitigated over the years: suffice it to say that it is a BEASTly CRIME to make a POODLE suffer a HeartBleed — or something like that.
To avoid the known problems, I wanted to ensure that I had a recent enough version of OpenSSL on the web server and had configured Apache to disable insecure protocols (e.g., SSLv3) and eschew bad ciphers.
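The same "no legacy protocols" policy can be expressed on the client side with Python's standard ssl module. This sketch is mine, not part of the original setup, and it is slightly stricter than the post's Apache config (which still allows TLS 1.1):

```python
import ssl

# Client-side analogue of the server hardening described above: build a
# context that refuses legacy protocol versions outright.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Certificate verification stays on by default with create_default_context();
# a handshake against an SSLv3-only or TLS 1.0-only server will simply fail.
```

A context configured this way fails fast against downgraded servers instead of silently negotiating a weak protocol.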
The server in question is running Debian Squeeze LTS, but since OpenSSL 1.0.x is not currently packaged for that release, I ended up adding Wheezy to the APT repositories list and upgrading the openssl and apache2 packages.
For the latter, after some Googling I ended up adapting the recommended Apache SSL virtualhost configuration from this blog post by Tim Janik. Here’s what I ended up with:

<VirtualHost _default_:443>
    ServerAdmin email@example.com
    DocumentRoot /var/www/galencharlton.com
    ServerName galencharlton.com
    ServerAlias www.galencharlton.com

    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/galencharlton.com.crt
    SSLCertificateChainFile /etc/ssl/certs/GandiStandardSSLCA2.pem
    SSLCertificateKeyFile /etc/ssl/private/galencharlton.com.key

    Header add Strict-Transport-Security "max-age=15552000"

    # No POODLE
    SSLProtocol all -SSLv2 -SSLv3 +TLSv1.1 +TLSv1.2
    SSLHonorCipherOrder on
    SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+AESGCM EECDH EDH+AESGCM EDH+aRSA HIGH !MEDIUM !LOW !aNULL !eNULL !RC4 !MD5 !EXP !PSK !SRP !DSS"
</VirtualHost>
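As a quick sanity check (mine, not from the original post), the Strict-Transport-Security max-age in that config works out to the six-month value it looks like:

```python
# The Apache config sends Strict-Transport-Security: max-age=15552000.
# Convert that to days to confirm the intended HSTS window.
max_age_seconds = 15552000
seconds_per_day = 24 * 60 * 60   # 86400
days = max_age_seconds // seconds_per_day
assert days == 180               # 180 days, i.e. roughly six months
```

Browsers that see this header will refuse plain-HTTP connections to the site for that entire window.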
I also wanted to make sure that folks coming in via old HTTP links would get permanently redirected to the HTTPS site:

<VirtualHost *:80>
    ServerName galencharlton.com
    Redirect 301 / https://galencharlton.com/
</VirtualHost>

<VirtualHost *:80>
    ServerName www.galencharlton.com
    Redirect 301 / https://www.galencharlton.com/
</VirtualHost>

Checking my work
I’m a big fan of the Qualys SSL Labs server test tool, which does a number of things to test how well a given website implements HTTPS:
- Identifying issues with the certificate chain
- Checking whether it supports vulnerable protocol versions such as SSLv3
- Checking whether it supports – and requests – use of sufficiently strong ciphers
- Checking whether it is vulnerable to common attacks
Suffice it to say that I required a couple of iterations to get the Apache configuration just right.

WordPress
To be fully protected, all of the content embedded on a web page served via HTTPS must also be served via HTTPS. In other words, this means that image URLs should require HTTPS – and the redirects in the Apache config are not enough. Here is the sledgehammer I used to update image links in the blog posts:

create table bkp_posts as select * from wp_posts;

begin;
update wp_posts
set post_content = replace(post_content, 'http://galen', 'https://galen')
where post_content like '%http://galen%';
commit;
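The same rewrite can be dry-run outside the database. This is a hypothetical Python equivalent of that SQL, handy for eyeballing what would change before swinging the sledgehammer:

```python
def upgrade_same_site_links(post_content: str) -> str:
    """Mirror of the SQL above: rewrite links to this site's own domain
    from http:// to https://, leaving links to other hosts alone."""
    return post_content.replace('http://galen', 'https://galen')

sample = ('<img src="http://galencharlton.com/img/cover.jpg">'
          ' <a href="http://example.com/">elsewhere</a>')
fixed = upgrade_same_site_links(sample)
# Only the same-site image URL is upgraded; the external link is untouched.
```

The prefix match on 'http://galen' is what keeps external URLs out of the blast radius, exactly as in the SQL `like` clause.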
In the course of testing, I discovered a couple more things to tweak:
- The web server had been using Apache’s mod_php5filter – I no longer remember why – and that was causing some issues when attempting to load the WordPress dashboard. Switching to mod_php5 resolved that.
- My domain ownership proof on keybase.io failed after the switch to HTTPS. I eventually tracked that down to the fact that keybase.io doesn’t have a bunch of intermediate certificates in its certificate store that many browsers do. I resolved this by adding a cross-signed intermediate certificate to the file referenced by SSLCertificateChainFile in the Apache config above.
My blog now has an A+ score from SSL Labs. Yay! Of course, it’s important to remember that this is not a static state of affairs – another big OpenSSL or HTTPS protocol vulnerability could turn that grade to an F. In other words, it’s a good idea to test one’s website periodically.
The Library & Information Technology Association (LITA), a division of the American Library Association (ALA), announces Ed Summers as the 2015 winner of the Frederick G. Kilgour Award for Research in Library and Information Technology. The award, which is jointly sponsored by OCLC, is given for research relevant to the development of information technologies, especially work which shows promise of having a positive and substantive impact on any aspect(s) of the publication, storage, retrieval and dissemination of information, or the processes by which information and data is manipulated and managed. The awardee receives $2,000, a citation, and travel expenses to attend the award ceremony at the ALA Annual Conference in San Francisco, where the award will be presented on June 28, 2015.
Ed Summers is Lead Developer at the Maryland Institute for Technology in the Humanities (MITH), University of Maryland. Ed has been working for two decades helping to build connections between libraries and archives and the larger communities of the World Wide Web. During that time Ed has worked in academia, start-ups, corporations and the government. He is interested in the role of open source software, community development, and open access to enable digital curation. Ed has a MS in Library and Information Science and a BA in English and American Literature from Rutgers University.
Prior to joining MITH Ed helped build the Repository Development Center (RDC) at the Library of Congress. In that role he led the design and implementation of the NEH funded National Digital Newspaper Program’s Web application, which provides access to 8 million newspapers from across the United States. He also helped create the Twitter archiving application that has archived close to 500 billion tweets (as of September 2014). Ed created LC’s image quality assurance service that has allowed curators to sample and review over 50 million images. He served as a member of the Semantic Web Deployment Group at the W3C where he helped standardize SKOS, which he put to use in implementing the initial version of LC’s Linked Data service.
Before joining the Library of Congress Ed was a software developer at Follett Corporation where he designed and implemented knowledge management applications to support their early e-book efforts. He was the fourth employee at CheetahMail in New York City, where he led the design of their data management applications. And prior to that Ed worked in academic libraries at Old Dominion University, the University of Illinois and Columbia University where he was mostly focused on metadata management applications.
Ed likes to use experiments to learn about the Web and digital curation. Examples of this include his work with Wikipedia on Wikistream, which helps visualize the rate of change on Wikipedia, and CongressEdits, which allows Twitter users to follow edits being made to Wikipedia from the Congress. Some of these experiments are social, such as his role in creating the code4lib community, which is an international, cross-disciplinary group of hackers, designers and thinkers in the digital library space.
Notified of the award, Ed said: “It is a great honor to have been selected to receive the Kilgour Award this year. I was extremely surprised since I have spent most of my professional career (so far) as a developer, building communities of practice around software for libraries and archives, rather than traditional digital library research. During this time I have had the good fortune to work with some incredibly inspiring and talented individuals, teams and open source collaborators. I’ve only been as good as these partnerships have allowed me to be, and I’m looking forward to more. I am especially grateful to all those individuals that worked on a free and open Internet and World Wide Web. I remain convinced that this is a great time for library and archives professionals, as the information space of the Web is in need of our care, attention and perspective.”
Members of the 2014-15 Frederick G. Kilgour Award committee are:
- Tao Zhang, Purdue University (chair)
- Erik Mitchell, University of California, Berkeley (past chair)
- Danielle Cunniff Plumer, DCPlumer Associates, LLC
- Holly Tomren, Drexel University Libraries
- Jason Simon, Fitchburg State University
- Kebede Wordofa, Austin Peay State University, and
- Roy Tennant, OCLC liaison
Established in 1966, LITA is the leading organization reaching out across types of libraries to provide education and services for a broad membership of over 3,000 systems librarians, library technologists, library administrators, library schools, vendors and many others interested in leading edge technology and applications for librarians and information providers. For more information, visit www.lita.org.
Founded in 1967, OCLC is a nonprofit, membership, computer library service and research organization dedicated to the public purposes of furthering access to the world’s information and reducing library costs. OCLC Research is one of the world’s leading centers devoted exclusively to the challenges facing libraries in a rapidly changing information environment. It works with the community to collaboratively identify problems and opportunities, prototype and test solutions, and share findings through publications, presentations and professional interactions. For more information, visit www.oclc.org/research.
Questions and Comments
Library & Information Technology Association (LITA)
(800) 545-2433 ext 4267
Today I found the following resources and bookmarked them:
- CardKit A simple, configurable, web based image creation tool
Digest powered by RSS Digest
On the cover of today’s NYTimes (print Washington edition):
BAGHDAD — In those areas of Iraq and Syria controlled by the Islamic State, residents are furtively recording on their cellphones damage done to antiquities by the extremist group. In northern Syria, museum curators have covered precious mosaics with sealant and sandbags….
…There was also the United States invasion in 2003, when American troops stood by as looters ransacked the Baghdad museum, a scenario that, Mr. Shirshab suggested, is being repeated today….
…The Babylon preservation plan also includes new documentation of the site, including brick-by-brick scale drawings of the ruins. In the event the site is destroyed, Mr. Allen said, the drawings can be used to rebuild it….
…The American invasion alerted archaeologists to what needed protecting. After damage and looting at many sites, documentation and preservation accelerated. One result was that the Mosul Museum, attacked by the Islamic State, had been digitally cataloged…
…He oversees an informal team of Syrians he has nicknamed the Monuments Men, many of them his former students. They document damage and looting by the Islamic State, pushing for crackdowns on the black market. Recently, the United Nations banned all trade in Syrian artifacts….
…Now, Iraqi colleagues teach conservators and concerned residents simple techniques to use in areas controlled by the Islamic State, such as turning on a cellphone’s GPS function when photographing objects, to help trace damage or theft, or to add sites to the “no-strike” list for warplanes….
Open Knowledge Foundation: Open Data Day report #1: Highly inspiring activities across the Asia-Pacific
Following the global Open Data Day 2015 event, which took place on February 21 with hundreds of events across the globe, we will do a blog series to highlight some of the great activities that took place. In this first post (of four in total) we start by looking at some of the great events across Asia and the Pacific. Three more posts will bring similar accounts from the Americas, Africa and Europe in the days to come.

Philippines
In the Philippines, Open Knowledge Philippines and the local School of Data group celebrated the International Open Data Day 2015 with back-to-back events on February 20-21, 2015. The extensive event featured talks by Joel Garcia of Microsoft Philippines, Paul De Paula of Drupal Pilipinas, Dr. Sherwin Ona of De La Salle University and Michael Canares of Web Foundation Open Data Labs, Jakarta – alongside community leaders such as Happy Feraren of BantayPH (who is also one of the 2014 School of Data Fellows) and Open Knowledge Ambassador Joseph De Guia. The keynote speaker was Ivory Ong, Outreach Lead of Open Data Philippines, who rightly said that “we need citizens who are ready to use the data, and we need the government and citizens to work together to make the open data initiative successful.”
Talks were followed by an open data hackathon and a data jam. The hackathon used data sets taken from the government open data portal: the General Appropriation Act (GAA) of the Department of Budget and Management (DBM). The students were tasked to develop a web or mobile app that would encourage participation of citizens in the grassroots participatory budgeting program of the national government. The winning team developed a web application containing a dashboard of the Philippine National Budget and a “Do-It-Yourself” budget allocation.

Nepal
Another large event took place in Kathmandu, where Open Knowledge Nepal had teamed up with an impressive coalition of partners including open communities such as the Free and Open Source Software (FOSS) Nepal Community, Mozilla Nepal, Wikimedians of Nepal, CSIT Association of Nepal, Acme Open Source Community (AOSC) and Open Source Ascol Circle (OSAC). The event had several streams of activities including, among others, a Spending Data Party, a CKAN Localization session, a Data Scrapathon, a MakerFest, a Wikipedia Editathon and a community discussion. Each session had teams of facilitators, and over 60 people took part in the day.

Bangladesh
In Dhaka an event was held by Bangladesh Open Source Network (BdOSN) and Open Knowledge Bangladesh. The event featured a series of distinguished speakers: Jabed Morshed Chowdhury, Joint Secretary of BdOSN and Bangla administrator of Google Developer Group; Nurunnaby Chowdhury Hasive, Ambassador of Open Knowledge Bangladesh; Abu Sayed, president of Mukto Ashor; Bayzid Bhuiyan Juwel, General Secretary of Mukto Ashor; Nusrat Jahan, Executive Officer of Janata Bank Limited; and Promi Nahid, BdOSN coordinator. They discussed various topics and issues of open data, including what open data is, how it works, where Bangladesh fits in and more. Moreover, those interested in working with open data were introduced to various Open Knowledge tools.

Tajikistan
A community initiative in Tajikistan took place in partnership with the magazine ICT4D under the banner of “A day of open data in Tajikistan”. The event was held at the Centre for Information Technology and Communications in the Office of Education in Dushanbe, and brought together designers, developers, statisticians and others who had ideas for the use of open data, wanted to find interesting projects to contribute to, or wanted to learn how to visualize and analyze data. With participants both experienced and brand new to the topic, the event aimed to ensure that every citizen had the opportunity to learn and to help the global open data community develop.
Among the activities were basic introductions to open data and discussions about how the local government could contribute to the creation of open data. There were also discussions about the involvement of local non-profit organizations and companies in the use of open data for products and missions, as well as trainings and other hands-on activities for the participants.

India
Open Knowledge India, with support from the National Council of Education Bengal and the Open Knowledge micro grants, organised the India Open Data Summit on February 28. It was the first ever Data Summit of this kind held in India and was attended by Open Data enthusiasts from all over India. Talks and workshops were held throughout the day, revolving around Open Science, Open Education, Open Data and Open GLAM in general, but also zooming in on concrete projects, for instance:
- The Open Education Project, run by Open Knowledge India, which aims to complement the government’s efforts to bring the light of education to everyone. The project seeks to build a platform that would offer the Power of Choice to the children in matters of educational content, and on the matter of open data platforms, [CKAN](http://ckan.org/) was also discussed.
- Opening up research data of all kinds was another point that was discussed. India has recently passed legislation ensuring that all government-funded research results will be in the open.
- Open governance not only at the national level, but even at the level of local governments, was something that was discussed with seriousness. Everyone agreed that in order to reduce corruption, open governance is the way to go. Encouraging the common man to participate in the process of open governance was another key point that was stressed. India is the largest democracy in the world, and a very complex one too. Greater use of the power of the crowd in matters of governance can help the democracy a long way by uprooting corruption from the very core.
Overall, the India Open Data Summit 2015 was a grand success in bringing like-minded individuals together and in giving them a shared platform where they can join hands to empower themselves. The first major Open Data Summit in India ended with the promise of keeping the ball rolling. Hopefully, in the near future we will see many more such events all over India.

Australia
In Australia they had worked for a few weeks in advance to set up a regional Open Data Census instance, which was then launched on Open Data Day. The projects for the day included drafting a Contributor Guide, creating a Google Sheet to allow people to collect census entries prior to entering them online as well as adding Google Analytics to the site – plus of course submission of data sets.
The launch even drew media attention: CIO Magazine published an article where they covered International Open Data Day, the open data movement in Australia, and the importance of open data in helping the community.
Cambodia

The Open Knowledge Cambodia local group, in partnership with Open Development Cambodia and Destination Justice, co-organized a full-day event at Development Innovations Cambodia, with presentations and talks in the morning and a translate-a-thon of the Open Data Handbook into the Khmer language. The event was attended by over 20 participants representing private sector employees, NGO staff, students and researchers.
Watch this space for more Open Data Day reports during the week!
One of the things that I keep coming back to in our digital library system is the set of states that an object can be in and how that affects various aspects of our system. Hopefully this post can explain some of them and how they are currently implemented locally.

Hidden vs Non-Hidden
Our main distinction once an item is in our system is whether or not it is hidden.
Hidden means that it is not viewable by any of our users and that it is only available in our internal Edit system, where a metadata record and basic access to the item exist. If a request for one of these items comes in through our public-facing digital library interfaces, the user will receive a “404 Not Found” response from our system.
If a record is not hidden then it is viewable and discoverable in one of our digital library interfaces. If an end user tries to access this item there may be limitations based on the level of access, or on any embargoes that might be present on the item.
In our metadata scheme, UNTL, we notate whether or not an item is hidden in the following way. If there is a value of <meta qualifier="hidden">True</meta> then the item is considered hidden. If there is a value of <meta qualifier="hidden">False</meta> then the item is considered not hidden. If there is no element with a qualifier of hidden then the system defaults the value to False and the item is considered not hidden.
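In code, that default-to-visible rule reads roughly like this. This is a sketch using Python's standard XML parser, with the element names taken from the description above; real UNTL records are more involved:

```python
import xml.etree.ElementTree as ET

def is_hidden(untl_record: str) -> bool:
    """Apply the visibility rule described above: an item is hidden only if
    a <meta qualifier="hidden"> element contains True; a missing element
    defaults to not hidden."""
    root = ET.fromstring(untl_record)
    for meta in root.iter('meta'):
        if meta.get('qualifier') == 'hidden':
            return (meta.text or '').strip().lower() == 'true'
    return False

hidden_rec  = '<metadata><meta qualifier="hidden">True</meta></metadata>'
visible_rec = '<metadata><title>Example</title></metadata>'
```

Note that both an explicit False and a missing element fall through to the same "not hidden" answer, which matches the default described above.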
This works pretty well for basic situations and with the assumption that nobody will ever make a mistake.
But… People make mistakes.

Deleted Items
The first issue we ran into when we started to scale up our systems is that from time to time we would accidentally load the same resource into the system twice. This happens for a variety of reasons. User error on the part of the ingest technician (me) is the major cause. Also, the same item is sometimes sent through the digitization/processing queue several times because of the amount of time some projects take to complete. In other situations the same item is digitized again because the first instance was poorly scanned, and instead of updating the existing record it is added a second time. For all of these situations we needed a way of suppressing these records.
Right now we add an element to the metadata record, <meta qualifier="recordStatus">deleted</meta>, which designates that this item has been suppressed in the system and should be effectively forgotten. On the technical side this triggers a delete from the Solr index, which holds our metadata, and the item is then gone.
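The delete that gets triggered is just the standard Solr XML update message; a hedged sketch (the ARK-style identifier below is hypothetical, and our production code may do this differently):

```python
from xml.sax.saxutils import escape

def solr_delete_message(identifier):
    """Build the standard Solr XML delete-by-id message."""
    return "<delete><id>%s</id></delete>" % escape(identifier)

msg = solr_delete_message("ark:/67531/metadc12345")
print(msg)  # <delete><id>ark:/67531/metadc12345</id></delete>
# POSTing this to the core's /update handler (followed by a commit)
# removes the document from the index.
```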
When a user requests an item that is deleted she will currently receive a “404 Not Found”, though we have an open ticket to change this behavior to return a “410 Gone” status code for these items. Another limitation of our current process of simply deleting these from our Solr index is that we are not able to mark them as “deleted” in our OAI-PMH repositories, which isn’t ideal. Finally, by purging these items completely from our system we have no way of knowing how many have been suppressed/deleted, and no easy way of making the items visible again.
These suppressed records are only deleted from the Solr index; all of their edit history, and the records themselves, remain. In fact, if you know that an item used to be in a non-suppressed state and remember the ARK identifier, you can still access the full record, remove the recordStatus flag and un-suppress the item. Assuming you remember the identifier.

What does hidden really mean?
So right now we have hidden and non-hidden, deleted and non-deleted. The deleted items are effectively forgotten about, but what about those hidden items? What do they mean?
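Putting those states together, the mapping from record state to public-facing HTTP response can be sketched like this (a toy illustration with dict-shaped records, not our production code; it shows the planned “410 Gone” behaviour rather than the current 404-for-everything):

```python
def http_status(record):
    """Map record state to the HTTP status the public interface returns."""
    if record.get("recordStatus") == "deleted":
        return 410  # planned behaviour; currently these also return 404
    if record.get("hidden", "False") == "True":
        return 404  # hidden items are invisible to the public
    return 200      # visible and discoverable

print(http_status({"hidden": "True"}))           # 404
print(http_status({"recordStatus": "deleted"}))  # 410
print(http_status({}))                           # 200
```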
Here are some of the reasons that we have hidden records vs non-hidden records.

Metadata Missing
We have a workflow for our system that allows us to ingest stub records with minimal descriptive metadata in place, so that items can be edited in our online editing environment by metadata editors around the library, university, and state. These are loaded with minimal title information (usually just the institution’s unique identifier for the item), the partner and collection that the item belongs to, and any metadata that makes sense to set across a large set of records. Once in the editing system these items will have metadata created for them over time and will be made available to the end user.

Hard Embargoes
While our system has built-in functionality for embargoing an item, this functionality always makes the item’s descriptive metadata available to the public. In our UNT Scholarly Works Repository, we work to make the contact information for the creators of the item known so that you can “request a copy” of the item if you discover it while it is still under embargo. Here is an example item that won’t become available until later this year.
Sometimes this is not the desired way of presenting embargoed items to the public. For example, we work with a number of newspaper publishers around Texas who make their PDF print masters available to UNT for archiving and presentation via The Portal to Texas History. They do so with the agreement that we will not make their items available until one, two, or three years after publication. Instead of presenting the end user with an item they aren’t able to access in the Portal, we just keep these items hidden until they are ready to be made available. I have a feeling that this method will change in the near future because it creates a large metadata management problem.
Finally there are items that we are either digitizing or capturing which we do not have the ability to provide access to because of current copyright restrictions. We have these items in a hidden state in the system until either an agreement can be reached with the rights holder, or until the item falls into the public domain.
Right now it is impossible for us to identify how many of these items are being held as “embargoed” by the use of a hidden item flag.

Copyright Challenge, or Personally Identifiable Information
We have another small set of items (fewer than a dozen… I think) that are hidden because there is an active copyright challenge for the item that we are working through, or because the item contained personally identifiable information. Our first step in these situations is to mark the item as hidden until the situation can be resolved. If the situation has been successfully resolved and access to the item restored, it is marked as un-hidden.

Others?
I’m sure there are other reasons that an item can be hidden within a system. I would be interested in hearing the reasons within your collections, especially if they are different from the ones listed above. I’m blissfully unaware of any controlled vocabularies for these kinds of record states within digital library systems, so if there is prior work in this area I’d love to hear about it.
As always feel free to contact me via Twitter if you have questions or comments.
Pakistan is a country with a very high population density. Within 796,096 square kilometres of territory, Pakistan has a population of over 180 million people. Such a large population places immense responsibilities on the government. The majority of the population in Pakistan is uneducated and lives in rural areas, with a growing influx of rural people to the urban areas. Thus we can say that the rate of urbanization in Pakistan is rising rapidly. This is a major challenge to the civic planners and the Government of Pakistan.

Urban population (% of total)
Using our experience from our initial net archive search setup, Thomas Egense and I have been tweaking options and adding patches to the fine webarchive-discovery from UKWA for some weeks. We will be re-starting indexing Real Soon Now. So what have we learned?
- Stored text takes up a huge part of the index: Nearly half of the total index size. The biggest sinner is not surprisingly the content field, but we need that for highlighting and potentially text extraction from search results. As we have discovered that we can avoid storing DocValued fields, at the price of increased document retrieval time, we have turned off storing for several fields.
- DocValue everything! Or at least a lot more than we did initially. Enabling DocValues for a field and getting low-overhead faceting turned out to be a lot disk-space-cheaper than we thought. As every other feature request from the researchers seems to be “We would also like to facet on field X”, our new strategy should make them at least half happy.
- DocValues are required for some fields. Due to internal limits on facet.method=fc without DocValues, it is simply not possible to do faceting if the number of references gets high.
- Faceting on outgoing links is highly valuable. Being able to facet on links makes it possible to generate real-time graphs for interconnected websites. Links with host- or domain granularity are easily handled and there is no doubt that those should be enabled. Based on positive experimental results with document-granularity links faceting (see the section below), we will also be enabling that.
- The addition of performance instrumentation made it a lot easier for us to prioritize features. We simply do not have time for everything we can think of and some specific features were very heavy.
- Face recognition (just finding the location of faces in images, not guessing the persons) was an interesting feature, but with a so-so success rate. Turning it on for all images would triple our indexing time and we have little need for sampling in this area, so we will not be doing it at all for this iteration.
- Most prominent colour extraction was only somewhat heavy, but unfortunately the resulting colour turned out to vary a great deal depending on adjustment of extraction parameters. This might be useful if a top-X of prominent colours were extracted, but for now we have turned off this feature.
- Language detection is valuable, but processing time is non-trivial and rises linearly with the number of languages to check. We lowered the number of detected languages from 20 to 10, pruning the more obscure (relative to Danish) languages.
- Meta-data about harvesting turned out to be important for the researchers. We will be indexing the ID of the harvest-job used for collecting the data, the institution responsible and some specific sub-job-ID.
- Disabling of image-analysis features and optimization of part of the code-base means faster indexing. Our previous speed was 7-8 days/shard, while the new one is 3-4 days/shard. As we have also doubled our indexing hardware capacity, we expect to do a full re-build of the existing index in 2 months and to catch up to the present within 6 months.
- Our overall indexing workflow, with dedicated builders creating independent shards of a fixed size, worked very well for us. Besides some minor tweaks, we will not be changing this.
- We have been happy with Solr 4.8. Solr 5 is just out, but as re-indexing is very costly for us, we do not feel comfortable with a switch at this time. We will do the conservative thing and stick to the old Solr 4-series, which currently means Solr 4.10.4.
The biggest new feature will be document links. This is basically all links present on all web pages at full detail. For a single test shard with 217M documents / 906GB, there were 7 billion references to 640M unique links, the most popular link being used 2.4M times. Doing a full faceted search on *:* was understandably heavy at around 4 minutes, while ad hoc testing of “standard” searches resulted in response times varying from 50 ms to 3500 ms. Scaling up to 25 shards/machine, it will be 175 billion references to 16 billion values. It will be interesting to see the accumulated response time.
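For a rough illustration, the shape of such a facet request might look like the following; the field name "links" and the limits are assumptions, not the project's actual schema:

```python
from urllib.parse import urlencode

# Sketch of Solr query parameters for faceting on a links field
# across all documents, counting facets only (rows=0).
params = {
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "links",
    "facet.limit": 25,
    "facet.method": "fc",  # at this reference count, needs DocValues
}
query = urlencode(params)
# The query string is appended to the core's /select handler URL.
```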
We expect this feature to be used to generate visual graphs of interconnected resources, which can be navigated in real-time. Or at least you-have-to-run-to-get-coffee-time. For the curious, here is the histogram for links in the test-shard:

References   #terms
1            425,799,733
2             85,835,129
4             52,695,663
8             33,153,759
16            18,864,935
32            10,245,205
64             5,691,412
128            3,223,077
256            1,981,279
512            1,240,879
1,024            714,595
2,048            429,129
4,096            225,416
8,192            114,271
16,384            45,521
32,768            12,966
65,536             4,005
131,072            1,764
262,144              805
524,288              789
1,048,576            123
2,097,152             77
4,194,304              1
LDPath can traverse the Linked Data Cloud as easily as working with local resources and can cache remote resources for future access. The LDPath language is also (generally) implementation independent (java, ruby) and relatively easy to implement. The language also lends itself to integration within development environments (e.g. ldpath-angular-demo-app, with context-aware autocompletion and real-time responses). For me, working with the LDPath language and implementation was the first time that linked data moved from being a good idea to being a practical solution to some problems.
Here is a selection from the VIAF record:

<> void:inDataset <../data> ;
    a genont:InformationResource, foaf:Document ;
    foaf:primaryTopic <../65687612> .

<../65687612> schema:alternateName "Bittman, Mark" ;
    schema:birthDate "1950-02-17" ;
    schema:familyName "Bittman" ;
    schema:givenName "Mark" ;
    schema:name "Bittman, Mark" ;
    schema:sameAs <http://d-nb.info/gnd/1058912836>, <http://dbpedia.org/resource/Mark_Bittman> ;
    a schema:Person ;
    rdfs:seeAlso <../182434519>, <../310263569>, <../314261350>, <../314497377>, <../314513297>, <../314718264> ;
    foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Mark_Bittman> .
We can use LDPath to extract the person’s name with a query along these lines:

name = foaf:primaryTopic / schema:name :: xsd:string ;
So far, this is not so different from traditional approaches. But, if we look deeper in the response, we can see other resources, including books by the author.

<../310263569> schema:creator <../65687612> ;
    schema:name "How to Cook Everything : Simple Recipes for Great Food" ;
    a schema:CreativeWork .
We can traverse the links to include the titles in our record, following the rdfs:seeAlso links out to the works, with something like:

books = foaf:primaryTopic / rdfs:seeAlso[rdf:type is schema:CreativeWork] / schema:name :: xsd:string ;
LDPath also gives us the ability to write this query using a reverse property selector, e.g.:

books = foaf:primaryTopic / ^schema:creator[rdf:type is schema:CreativeWork] / schema:name :: xsd:string ;
The resource links out to some external resources, including a link to dbpedia. Here is a selection from the record in dbpedia:

<http://dbpedia.org/resource/Mark_Bittman>
    dbpedia-owl:abstract "Mark Bittman (born c. 1950) is an American food journalist, author, and columnist for The New York Times."@en,
        "Mark Bittman est un auteur et chroniqueur culinaire américain. Il a tenu une chronique hebdomadaire pour le The New York Times, appelée The Minimalist (« le minimaliste »), parue entre le 17 septembre 1997 et le 26 janvier 2011. Bittman continue d'écrire pour le New York Times Magazine, et participe à la section Opinion du journal. Il tient également un blog."@fr ;
    dbpedia-owl:birthDate "1950+02:00"^^<http://www.w3.org/2001/XMLSchema#gYear> ;
    dbpprop:name "Bittman, Mark"@en ;
    dbpprop:shortDescription "American journalist, food writer"@en ;
    dc:description "American journalist, food writer", "American journalist, food writer"@en ;
    dcterms:subject <http://dbpedia.org/resource/Category:1950s_births>,
        <http://dbpedia.org/resource/Category:American_food_writers>,
        <http://dbpedia.org/resource/Category:American_journalists>,
        <http://dbpedia.org/resource/Category:American_television_chefs>,
        <http://dbpedia.org/resource/Category:Clark_University_alumni>,
        <http://dbpedia.org/resource/Category:Living_people>,
        <http://dbpedia.org/resource/Category:The_New_York_Times_writers> ;
LDPath allows us to transparently traverse that link, letting us extract the subjects for the VIAF record with a query along the lines of:

subjects = foaf:primaryTopic / schema:sameAs / dcterms:subject :: xsd:anyURI ;
If you’re playing along at home, note that, as of this writing, VIAF.org fails to correctly implement content negotiation and returns HTML if text/html appears anywhere in the Accept header, e.g.:
curl -H "Accept: application/rdf+xml, text/html; q=0.1" -v http://viaf.org/viaf/152427175/
will return a text/html response. This may cause trouble for your linked data clients.
Librarians interested in intellectual property, public policy and copyright have until June 1, 2015, to apply for the Robert L. Oakley Memorial Scholarship. The annual $1,000 scholarship, which was developed by the American Library Association and the Library Copyright Alliance, supports research and advanced study for librarians in their early-to-mid-careers.
Applicants should provide a statement of intent for use of the scholarship funds. Such a statement should include the applicant’s interest and background in intellectual property, public policy, and/or copyright and their impacts on libraries and the ways libraries serve their communities.
Additionally, statements should include information about how the applicant and the library community will benefit from the applicant’s receipt of scholarship. Statements should be no longer than three pages (1000 words). The applicant’s resume or curriculum vitae should be included in their application.
Applications must be submitted via e-mail to Carrie Russell, firstname.lastname@example.org. Awardees may receive the Robert L. Oakley Memorial Scholarship up to two times in a lifetime. Funds may be used for equipment, expendable supplies, travel necessary to conduct research, conference attendance, release from library duties, or other reasonable and appropriate research expenses.
The award honors the life accomplishments and contributions of Robert L. Oakley. Professor and law librarian Robert Oakley was an expert on copyright law and wrote and lectured on the subject. He served on the Library Copyright Alliance representing the American Association of Law Librarians and played a leading role in advocating for U.S. libraries and the public they serve at many international forums including those of the World Intellectual Property Organization and United Nations Educational Scientific and Cultural Organization.
Oakley served as the United States delegate to the International Federation of Library Associations Standing Committee on Copyright and Related Rights from 1997-2003. Mr. Oakley testified before Congress on copyright, open access, library appropriations and free access to government documents and was a member of the Library of Congress’ Section 108 Study Group. A valued colleague and mentor for numerous librarians, Oakley was a recognized leader in law librarianship and library management who also maintained a profound commitment to public policy and the rights of library users.
The post Call for Nominations: Robert L. Oakley Memorial Scholarship appeared first on District Dispatch.
Check out the brand new LITA web course:
Taking the Struggle Out of Statistics
Instructor: Jackie Bronicki, Collections and Online Resources Coordinator, University of Houston.
Offered: April 6 – May 3, 2015
A Moodle based web course with asynchronous weekly lectures, tutorials, assignments, and group discussion.
Recently, librarians of all types have been asked to take a more evidence-based look at their practices. Statistics is a powerful tool that can be used to uncover trends in library-related areas such as collections, user studies, usability testing, and patron satisfaction studies. Knowledge of basic statistical principles will greatly help librarians achieve these new expectations.
This course will be a blend of learning basic statistical concepts and techniques along with practical application of common statistical analyses to library data. The course will include online learning modules for basic statistical concepts, examples from completed and ongoing library research projects, and also exercises accompanied by practice datasets to apply techniques learned during the course.
Got assessment in your title or duties? This brand new web course is for you!
Jackie Bronicki’s background is in research methodology, data collection and project management for large research projects including international dialysis research and large-scale digitization quality assessment. Her focus is on collection assessment and evaluation and she works closely with subject liaisons, web services, and access services librarians at the University of Houston to facilitate various research projects.
- LITA Member: $135
- ALA Member: $195
- Non-member: $260
Moodle login info will be sent to registrants the week prior to the start date. The Moodle-developed course site will include weekly asynchronous lectures and is composed of self-paced modules with facilitated interaction led by the instructor. Students regularly use the forum and chat room functions to facilitate their class participation. The course web site will be open for 1 week prior to the start date for students to have access to Moodle instructions and set their browser correctly. The course site will remain open for 90 days after the end date for students to refer back to course material.
Register Online page arranged by session date (login required)
Mail or fax form to ALA Registration
Call 1-800-545-2433 and press 5
Questions or Comments?
For all other questions or comments related to the course, contact LITA at (312) 280-4269 or Mark Beatty, email@example.com.
Disney, tanks, Pantone, Bingo and the paperback book.
Tank bookmobile weapon of mass instruction
Library visits vs. major tourist attractions
Portraits with the exact Pantone color of the skin tone set as the background
Composting company has customers collect troublesome fruit stickers on a Bingo card to receive free compost.
The roots of the paperback. Pop into the Grolier Club for a fascinating exhibit.