Austin, TX The most recent Fedora camp in Pasadena, California was hosted by the Caltech Library at the California Institute of Technology's Keck Institute for Space Studies.
Collecting web usage data through services like Google Analytics is a top priority for any library. But what about user privacy?
Most libraries (and websites for that matter) lean on Google Analytics to measure website usage and learn about how people access their online content. It’s a great tool. You can learn about where people are coming from (the geolocation of their IP addresses anyway), what devices, browsers and operating systems they are using. You can learn about how big their screen is. You can identify your top pages and much much more.
Google Analytics is really indispensable for any organization with an online presence.
But then there’s the privacy issue.
Is Google Analytics a Privacy Concern?
The question is often asked: what personal information is Google Analytics actually collecting? And then, how does this data collection jibe with our organization’s privacy policies?
It turns out, as a user of Google Analytics, you’ve already agreed to publish a privacy document on your site outlining the why and what of your analytics program. So if you haven’t done so, you probably should, if only for the sake of transparency.
Personally Identifiable Data
Fact is, if someone really wanted to learn about a particular person, it’s not entirely outside the realm of possibility that they could glean a limited set of personal attributes from the generally anonymized data Google Analytics collects. IP addresses can be loosely linked to people. If you wanted to, you could set up filters in Google Analytics that look at a single IP.
Of course, on the Google side, any user who is logged into their Gmail, YouTube or other Google account is already being tracked and identified by Google. This is a broadly underappreciated fact. And it’s a critical one when it comes to how to approach the privacy issue.
In both the case of what your organization collects with Google Analytics and what all those web trackers, including Google’s trackers, collect, the onus falls entirely on the user.
The Internet is Public
Over the years, the Internet has become a public space and users of the Web should understand it as such. Everything you do is recorded and seen. Companies like Google, Facebook, Microsoft, Yahoo! and many, many others are all in the data mining business. Carriers and Internet Service Providers are also in this game. They deploy technologies in websites that identify you and then sell information about your interests, shopping habits, web searches and other activities to companies interested in selling to you. They’ve made billions on selling your data.
Ever done a search on Google and then seen ads all over the Web trying to sell you that thing you searched last week? That’s the tracking at work.
Only You Can Prevent Data Fires
The good news is that with little effort, individuals can stop most (but not all) of the data collection. Browsers like Chrome and Firefox have plugins like Ghostery, Avast and many others that will block trackers.
Google Analytics can be stopped cold by these plugins. But that won’t solve all the problems. Users also need to set up their browsers to delete the cookies that websites save. And moving off of “free” accounts provided by data mining companies, like Facebook, Gmail and Google.com, can also help.
But you’ll never be completely anonymous. Super cookies are a thing and are very difficult to stop without breaking websites. And some trackers are required in order to load content. So sometimes you need to pay with your data to play.
Policies for Privacy Conscious Libraries
All of this means that libraries wishing to be transparent and honest about their data collection also need to contextualize the information in the broader data mining debate.
First and foremost, we need to educate our users on what it means to go online. We need to let them know it’s their responsibility alone to control their own data. And we need to provide instructions on doing so.
Unfortunately, this isn’t an opt-in model. That’s too bad. It actually would be great if the world worked that way. But don’t expect the moneyed interests involved in data mining to allow the US Congress to pass anything that cuts into their bottom line. This ain’t Germany, after all.
We actually do our users a service by going with the opt-out model. This underlines the larger privacy problems on the Wild Wild Web, which our sites are a part of.
New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.
New This Week
Visit the LITA Job Site for more available jobs and for information on submitting a job posting.
DuraSpace News: VIVO Updates for April 24–Mozilla Open Science Hackathon, VIVO User Group Meeting and More
From Mike Conlon, VIVO Project Director
From the VIVO 2016 Conference organizers, to be held in Denver, CO August 17-19
The VIVO 2016 Planning Committee is excited to announce two invited speakers! We’re looking forward to their talks, and we’re thrilled that, in addition to their invited sessions, both Dr. Ruben Verborgh and Dr. Pedro Szekely will be hosting half-day Workshops on August 17th.
From the Federal Depository Library Program (FDLP):
A live training webinar, “School Librarian’s Workshop: Federal Government Resources for K-12 / Taller para maestros de español: Recursos de gobierno federal para niveles K-12,” will be presented on Tuesday, May 31, 2016.
Click here to register!
- Start time: 2:00 p.m. (Eastern)
- Duration: 60 minutes
- Speaker: Jane Canfield, Coordinator of Federal Documents, Pontifical Catholic University of Puerto Rico
- Learning outcomes: Are you a school librarian? Do you work with school librarians or children? The School Librarian’s Workshop will provide useful information for grades K-12, including Ben’s Guide to the U.S. Government and Kids.gov. The webinar will explore specific agency sites which provide information, in English and Spanish, appropriate for elementary and secondary school students. Teachers and school librarians will discover information on Federal laws and regulations and learn about resources for best practices in the classroom.
- Expected level of knowledge for participants: No prerequisite knowledge required.
Closed captioning will be available for this webinar.
The webinar is free; however, registration is required. Upon registering, a confirmation email will be sent to you. This registration confirmation email includes the instructions for joining the webinar.
Registration confirmations will be sent from sqldba[at]icohere.com. To ensure delivery of registration confirmations, registrants should configure junk mail or spam filter(s) to permit messages from that email address. If you do not receive the confirmation, please notify GPO.
GPO’s eLearning platform presents webinars using WebEx. In order to attend or present at a GPO-hosted webinar, a WebEx plug-in must be installed in your internet browser(s). Download instructions.
Visit FDLP Academy for access to FDLP educational and training resources. All are encouraged to share and re-post information about this free training opportunity.
The post School librarian’s workshop: federal government resources for K-12 appeared first on District Dispatch.
Many web sites have explicit terms of service. For example, here are the terms of service that "govern your use of certain New York Times digital products". They start with this clause:
1.1 If you choose to use NYTimes.com (the “Site”), NYT’s mobile sites and applications, any of the features of this site, including but not limited to RSS, API, software and other downloads (collectively, the "Services"), you will be agreeing to abide by all of the terms and conditions of these Terms of Service between you and The New York Times Company ("NYT", “us” or “we”).
So, just by using the services of nytimes.com, the New York Times claims that I have agreed to a whole lot of legal terms and conditions. I didn't have to click a check-box agreeing to them, or do anything explicit. The terms and conditions are not on the front page itself, they're just linked from it. The link is hard to find, in faint type at the very bottom of the page, wedged blandly between "Privacy" and the eye-glazing "Terms of Sale."
Among the terms that I'm deemed to have agreed to are:
2.3 You may download or copy the Content and other downloadable items displayed on the Services for personal use only, ... Copying or storing of any Content for other than personal use is expressly prohibited ...
So, if the Terms of Service apply, Web archives are clearly violating the terms of service. Interestingly, there is an exception:
5.2 ... THE SERVICES AND ALL DOWNLOADABLE SOFTWARE ARE DISTRIBUTED ON AN "AS IS" BASIS WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE OR IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. YOU HEREBY ACKNOWLEDGE THAT USE OF THE SERVICES IS AT YOUR SOLE RISK.
The New York Times claims not to be liable. Even if you thought "arguing with a man who buys ink by the barrel" was a good idea:
11.1 These Terms of Service have been made in and shall be construed and enforced in accordance with New York law. Any action to enforce these Terms of Service shall be brought in the federal or state courts located in New York City.
Good luck with that.
So the interesting question is whether, in the absence of any explicit action on my (or an archive's crawler's) part, the terms of service bind me (or the archive)? Now, IANAL, and even actual lawyers appear to believe the answer isn't obvious. But writing on the Technology and Law blog a year ago, Venkat Balasubramani suggests that unless there is an explicit action indicating assent, the terms are unlikely to apply:
In place of the flawed browsewrap/clickwrap typology, we can use a simple non-overlapping typology for web interfaces: Category A is a click-through presentation where a user clicks while knowing that the click signals assent to the applicable terms; and Category B is everything else, which is not a contract.
Let us assume for the moment that Balasubramani is correct and if there was no click-through the terms are not binding. In the good old days of Web archiving, this would mean there was no problem because the crawler would not have clicked the "I agree" box. But in today's Web, browser-based crawlers are clicking on things. Lots of things. In fact, they're clicking on everything they can find. Which might well be an "I agree" box. Lawyers will be able to argue whether the crawler clicked on it "knowing that the click signals assent to the applicable terms".
Making this assumption, Jefferson and I argued as follows. Suppose my, or the archive's, browser were configured to include in the HTTP request to nytimes.com, a Link header with "rel=license" pointing to the Terms of Service that apply to the services available from the requesting browser. The New York Times would have been notified of these terms far more directly than I had been of their terms by the faint type link at the bottom of the page that few have ever consciously clicked on. Thus, using exactly the same argument that the New York Times used to bind me to their terms, they would have been bound to my terms.
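As a rough illustration of what such a request might look like, here is a minimal Python sketch that builds (but does not send) a request carrying a Link header with rel="license". The terms URL and the function name are hypothetical, and nothing obliges a server to read or honor a Link header on a request; this only shows how the notification described above could be attached.

```python
from urllib.request import Request

# Hypothetical location where the browser's own terms would be published.
MY_TERMS_URL = "https://example.org/browser-terms"


def build_request(url: str, terms_url: str = MY_TERMS_URL) -> Request:
    """Build, without sending, a GET request that advertises the client's terms."""
    request = Request(url, method="GET")
    # Web Linking header syntax: <target>; rel="relation-type"
    request.add_header("Link", f'<{terms_url}>; rel="license"')
    return request


req = build_request("https://www.nytimes.com/")
print(req.get_header("Link"))
```

Whether a header like this would carry any legal weight is exactly the open question discussed here; the code only makes the mechanics concrete.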
What's sauce for the goose is sauce for the gander. If an explicit action is required, archive crawlers that don't click on an "I agree" box are not bound by the terms. If no explicit action is required, only some form of notification, browsers and browser-based crawlers can bind websites to their terms by providing a suitable notification.
What Terms of Service would be appropriate for using my browser? Based on the New York Times' terms, perhaps they should include:
1.2 We may change, add or remove portions of these Terms of Service at any time, which shall become effective immediately upon posting. It is your responsibility to review these Terms of Service prior to each use of the Browser and by continuing to use this Browser, you agree to any changes.
and:
1.4 We may change, suspend or discontinue any aspect of the Services at any time, including the availability of any Services feature, database, or content. We may also impose limits on certain features and services or restrict your access to parts or all of the Services without notice or liability.
and:
4.1 You may not access or use, or attempt to access or use, the Services to take any action that could harm us or a third party. You may not access parts of the Services to which you are not authorized. You may not attempt to circumvent any restriction or condition imposed on your use or access, or do anything that could disable or damage the functioning or appearance of the Services,
I.e. you and your advertising networks better not send us any malware. And, of course, we need the perennial favorite:
5.2 ... THE SERVICES AND ALL INFORMATION THEY CONTAIN ARE DISTRIBUTED ON AN "AS IS" BASIS WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE OR IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. YOU HEREBY ACKNOWLEDGE THAT USE OF THE SERVICES IS AT YOUR SOLE RISK.
A reverse EULA. Wouldn't you like to be able to do this?
So far, this may sound like a parody or a paranoid fantasy. But many online media companies have begun to target client-side browser information to police content delivery. Sites like Forbes, Wired, and maybe even (gasp) The New York Times are now disallowing access to their sites for those with ad-blocking browser add-ons:
We noticed you still have ad blocker enabled. By turning it off or whitelisting Forbes.com, you can continue to our site and receive the Forbes ad-light experience.
It turns out that the "Forbes ad-light experience" includes free bonus malware!
that using an ad-blocker detector script is basically doing the same sort of thing as a cookie in terms of spying on client-side information within one's web browser, and a letter he received from the EU Commission apparently confirms his assertion.
Thus running a script that collects information from an EU citizen's browser (which is what the vast majority do) apparently requires explicit permission. If Hanff's efforts succeed, anticipate European Web publishers going non-linear.
As the web has grown into a processing environment, it presumes a reciprocal interactivity, the parameters of which are still shifting and unbalanced. In the end the terms of this overall interplay of information exchange and license seem, as they so often do, inequitable. The future is here, it's just not evenly licensed. On one end, media and other corporate content sites target user browsers, inject (accidentally or via 3rd parties) potentially malicious scripts, monitor for plug-in screeners, install browsing trackers, analyze cookies and add all sorts of profiling and monitoring scripts, all generally without any explicit agreement on our part. On the other hand, we, simple users, often are presumed to agree to prolix legalese and verbose, obscure license agreements, all simply so we can read about people doing yoga with their dogs.
Issue #32 of the Code4Lib Journal is now available.
- Editorial Introduction: People, by Meghan Finch
- An Open-Source Strategy for Documenting Events: The Case Study of the 42nd Canadian Federal Election on Twitter by Nick Ruest and Ian Milligan
- How to Party Like it’s 1999: Emulation for Everyone by Dianne Dietrich, Julia Kim, Morgan McKeehan, and Alison Rhonemus
- How We Went from Worst Practices to Good Practices, and Became Happier in the Process by Amanda French, Francis Kayiwa, Anne Lawrence, Keith Gilbertson and Melissa Lohrey
- Shining a Light on Scientific Data: Building a Data Catalog to Foster Data Sharing and Reuse by Ian Lamb and Catherine Larson
- Creation of a Library Tour Application for Mobile Equipment using iBeacon Technology by Jonathan Bradley, Neal Henshaw, Liz McVoy, Amanda French, Keith Gilbertson, Lisa Becksford, and Elisabeth Givens
- Measuring Library Vendor Cyber Security: Seven Easy Questions Every Librarian Can Ask by Alex Caro and Chris Markman
- Building Bridges with Logs: Collaborative Conversations about Discovery across Library Departments by Jimmy Ghaphery, Emily Owens, Donna Coghill, Laura Gariepy, Megan Hodge, Thomas McNulty, and Erin White
This is a guest post by John Scancella, Information Technology Specialist with the Library of Congress, and Tibaut Houzanme, Digital Archivist with the Indiana Archives and Records Administration. BagIt is an internationally accepted method of transferring files via digital containers. If you are new to BagIt, please watch our introductory video.
Bagger is a digital records packaging and validation tool based on the BagIt Specification. This BagIt-compliant software allows creators and recipients of BagIt packages to verify that the files in the bag are complete and valid. This is done by creating manifests of the files that exist in the bag and their corresponding checksum values.
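To make the manifest idea concrete, here is a minimal Python sketch of building and checking a checksum manifest for a bag's payload. This is not Bagger's code (Bagger is a Java application); the function names are hypothetical, SHA-256 is an assumed choice of algorithm, and the sketch relies only on the BagIt convention that payload files live under a "data" directory.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bag_dir: Path, algorithm: str = "sha256") -> Dict[str, str]:
    """Map each payload file (path relative to the bag) to its checksum."""
    data_dir = bag_dir / "data"
    return {
        path.relative_to(bag_dir).as_posix(): file_checksum(path, algorithm)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def verify_manifest(
    bag_dir: Path, manifest: Dict[str, str], algorithm: str = "sha256"
) -> List[str]:
    """Return the listed paths that are missing or whose checksum has changed."""
    problems = []
    for relative_path, expected in manifest.items():
        target = bag_dir / relative_path
        if not target.is_file() or file_checksum(target, algorithm) != expected:
            problems.append(relative_path)
    return problems
```

A recipient would call verify_manifest with the manifest shipped in the bag; an empty list means every listed file is present and unchanged. The sketch does not detect extra files that were added to the payload after the manifest was written.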
Bagger, built in Java, works in a variety of computing environments such as Windows, Linux and Mac. As a graphical user interface application, Bagger is a simpler tool for the average computer user than the text-only command-line interface implementation of BagIt.
Many improvements were made to Bagger recently:
- Added more profiles to give the user and archival communities more options. Users can select from various profiles and fields to decide on their own requirements.
- Bagger’s build system was switched to Gradle. Gradle is quickly becoming the standard build system for Java applications, and its use contributes to future-proofing Bagger’s improvements by giving Bagger the advantage of having a domain-specific language that leads to concise, maintainable and comprehensible builds.
- The lowest compatible version of Java that Bagger can run with now is 1.7. Running Bagger with at least Java 1.7 helps with security and brings a host of new programming language features that allow for easier maintenance and performance improvement.
- General code cleanup was performed for easier maintenance.
- Long standing bugs and issues were fixed.
The Indiana Archives and Records Administration prepared a relatively detailed accession profile that is included with Bagger 2.5.0. A generic version of this profile is also available, where metadata fields are all optional.
These profiles were designed to help facilitate the accessioning of digital records, with preservation actions and management in mind. Overall, intellectual and physical components of digital records’ metadata were targeted. The justifications behind the metadata fields in these new profiles are:
- Consistent metadata fields with simple descriptors. The metadata field names use clear and simple terms. The consistency in the order of the fields on the display screen and in the metadata text file (part of the recent improvements) is also a benefit to data entry and review. The profiles use pre-identified values in drop-down menus that will help reduce typing mistakes and enforce cleaner metadata collection. The Indiana profile also uses pre-populated field entries, such as names and addresses, which help reduce repetitive data entry and save time during accessioning.
- Adaptable to various institutional contexts and practices. IARA requires the collection of metadata that it deems essential for digital records; these are represented in its profile. To make the profile adaptable across institutions, the generic version uses optional fields only. Individual users can edit the metadata fields, delete them or change their optional/required status. Switching from “Required: false” to “Required: true” in the local JSON file will be sufficient to help achieve the desired level of enforcement appropriate for each institution. Additional fields from the main menu can be added that draw from the BagIt specification. Also, custom metadata fields can be created or added on the fly.
- Collection of data points that matter for preservation decisions and actions. Some of the metadata fields added to standard accession fields help to identify records that are available only in digital formats so they can be treated accordingly; others assist with being able to locate records in proprietary digital formats that need migration to open standards formats. Information about sensitive records can also be captured to assist with prioritization and access management.
- Make automation possible through fields mapping. By using consistent and orderly metadata fields in a profile, you will create bags with a well-structured and predictable metadata sequence and value. This makes it easier to map the bag’s fields, values or collected information to a preservation system’s database fields. Investing in this automation opportunity will likely reduce the data entry time when importing bags into a preservation system. This assumes that the preservation system is either BagIt-compliant already (interoperability benefit) or will be made to effectively know what to do with each part of the bag, each metadata field and the captured values (to be achieved through integration).
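The required/optional switch mentioned in the list above can also be scripted. The Python sketch below is only an illustration: it assumes a profile laid out as a JSON object that maps each field name to a settings object, and it uses the key "Required" as written in this post. The actual key spelling and layout in a given Bagger profile may differ, so check the profile file and the Bagger User Guide before relying on it.

```python
import json


def set_required(
    profile_text: str, field_name: str, required: bool, key: str = "Required"
) -> str:
    """Return profile JSON with one field's required flag changed.

    Assumes (hypothetically) a layout like {"Field-Name": {"Required": false}}.
    """
    profile = json.loads(profile_text)
    if field_name not in profile:
        raise KeyError(f"field {field_name!r} is not in the profile")
    profile[field_name][key] = required
    return json.dumps(profile, indent=2)


# Hypothetical two-field profile used only for demonstration.
sample_profile = '{"Accession-Number": {"Required": false}, "Donor": {"Required": false}}'
print(set_required(sample_profile, "Accession-Number", True))
```

Working on the JSON text rather than a file path keeps the sketch free of side effects; in practice one would read the local profile file, apply the change, and write it back after keeping a copy of the original.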
Following are two screenshots of Bagger with the full list of metadata fields for a sample accession:
Figure 1: IARA Profile with Sample Accession Screen 1 of 2
Figure 2: IARA Profile with Sample Accession
In both screenshots, the letter “R” next to a metadata field means that you must enter or select a value, or the right value, before the bag can be finalized. The drop-down selection marked with “???” indicates that a value can be selected through clicks. Question marks (“???”), or a different value in their place, can also be used as a placeholder to be found and replaced later with the correct value. In IARA’s experience, a single accession may come on multiple storage media/carriers. For that reason, the “records/medium carrier” field has been repeated five times (arbitrarily) to allow for multiple choices and entries; it can be further expanded. The number of media received, when entered consistently, makes media counts and inventories easier.
Once completed, Bagger also adds, in the “bag-info.txt” metadata file, the size of the bag in Bytes and in Megabytes. When all the required metadata is entered and the files added, the bag can be completed. After a successful bagging session, this message is displayed: “Bag Saved Successfully.”
The metadata values in the first two figures are fictitious and for demonstration only. The figure below shows them alongside additional metadata, such as hash value and file size:
Figure 3: Metadata Fields and Values in the bag-info.txt File after Bag Creation
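Because bag-info.txt is a plain list of "Label: value" lines, with indented lines continuing the previous value, mapping it onto database fields takes only a few lines of code. The Python sketch below is one hedged way to do that; the column names in FIELD_TO_COLUMN are hypothetical and stand in for whatever a preservation system actually uses.

```python
from typing import Dict, List, Tuple


def parse_bag_info(text: str) -> List[Tuple[str, str]]:
    """Parse bag-info.txt content into (label, value) pairs, keeping order.

    Pairs are used rather than a dict because a label may repeat.
    """
    entries: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and entries:
            # An indented line continues the previous entry's value.
            label, value = entries[-1]
            entries[-1] = (label, f"{value} {line.strip()}")
            continue
        label, _, value = line.partition(":")
        entries.append((label.strip(), value.strip()))
    return entries


# Hypothetical mapping from bag-info labels to database column names.
FIELD_TO_COLUMN = {
    "Source-Organization": "source_org",
    "Bag-Size": "bag_size",
}


def to_record(entries: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect values for mapped labels; unmapped labels are skipped."""
    record: Dict[str, List[str]] = {}
    for label, value in entries:
        column = FIELD_TO_COLUMN.get(label)
        if column is not None:
            record.setdefault(column, []).append(value)
    return record
```

Each column holds a list of values so that repeated fields, such as the medium/carrier field that the IARA profile repeats, are not silently overwritten.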
IARA’s accession profile, the generic version or any profile available in Bagger, can be used as is if it meets the user’s requirements. Or they can be customized to fit institutional needs, such as enforcing certain metadata, field-name modifications, additional fields or drop-down values, and to support other document forms (e.g. audiovisual metadata fields such as linear duration of content). As Bagger’s metadata remain extensible, a profile can be created to fit almost any project. And the more profiles are available directly in Bagger, the better for the archival community who will have choices.
To use the IARA’s profile, its generic version or any other profile in Bagger, download the latest version (as of this writing 2.5.0). To start an accession, select the appropriate profile from the drop-down list. This will populate the screen with profile-specific metadata fields. Select files or folders, enter values and save the bag.
For detailed instructions on how to edit metadata fields and their obligation level, create a new profile, or change an existing profile to meet the project/institution’s requirements, please refer to the Bagger User Guide in the “doc” folder inside the downloaded Bagger.zip file.
BagIt has been adopted for digital preservation by The Library of Congress, the Dryad Data Repository, the National Science Foundation DataONE and the Rockefeller Archive Center. BagIt is also used at Cornell, Purdue, Stanford, Ghent, New York and the University of California. BagIt has been implemented in Python, Ruby, Java, Perl, PHP, and in other programming languages.
We encourage feedback for BagIt. Here are some ways to contribute:
- The Digital Curation Google Group is an open discussion list that reaches many of the contributors to, and users of, this open-source project
- Comments or suggestions on IARA’s profiles are welcome at firstname.lastname@example.org
- For any issues, please open a GitHub ticket
- To contact a developer at The Library of Congress please email email@example.com
- To contribute to the code, please see our Contributing notice and submit a pull request.
Journal of Web Librarianship: Mobile Web Site Ease of Use: An Analysis of Orbis Cascade Alliance Member Web Sites
I’m coming down from the Gender and Sexuality in Library and Information Studies colloquium that Emily Drabinski, Baharak Yousefi and I organized. For me one of the big themes was bodies and embodiment.
Vanessa Richards‘ keynote was amazing. She spoke a bit and facilitated us in singing together. It was powerful, transformative and extremely emotional for me. Some of the instruction she gave us was to pay attention to our bodies, “what do you feel and where in your body do you feel it when I tell you we are going to sing together?” Both my body and my mind are very uncomfortable with singing. At some point in my life someone told me I was a bad singer and ridiculed me and I think I believed them. Vanessa Richards said something like: “Your body is the source code. Your body knows how to sing. All the people who told you that you can’t sing, kick them to the curb. This is your human right.”
For me this was deeply transformative and created magic in the room. We sang 3 songs together, and by the last one there was a beautiful transformation. I observed people’s bodies. People’s shoulders had dropped and their weight was sinking down into their feet. People were taking up more space and looking less self conscious. Also, our voices were much louder and they were beautiful. This was an unconventional and magical way to start the day together.
There were so many excellent presentations. I was so excited to learn about GynePunk, the cyborg witches of DIY gynecology in Spain. James Cheng, Lauren Di Monte, and Madison Sullivan completely blew my mind in their talk titled Makerspace Meets Medicine: Politics, Gender, and Embodiment in Critical Information Practice. This is the most exciting talk I’ve heard about makerspaces, though they argued that because it’s gendered and political we’re unlikely to see this in a library makerspace. GynePunk reminds me of the zine Hot Pantz that starts with:
Patriarchy sucks. It’s robbed us of our autonomy and much of our history. We believe it’s integral for women to be aware and in control of our own bodies.
I also loved Stacy Wood’s talk on Mourning and Melancholia in Archives. She told the story of working in an archive and having cremated ashes fall out of a poorly sealed bag that was in a poorly sealed envelope. I hope I have a chance to read her paper as she had many smart things to say about institutional practice, as well as melancholia.
Marika Cifor presented Blood, Sweat, and Hair: The Archival Potential of Queer and Trans Bodies in three acts: blood, sweat and hair. She used examples of these parts of our bodies that were part of archival objects:
- blood – blood on a menstrual sponge, blood during the AIDS crisis, blood on Harvey Milk’s clothing from when he was shot and killed
- sweat – sweat stains on a tshirt from a gay leather bar
- hair – hair on a lipstick of Victoria Schneider a trans woman, sex worker and activist, and hair samples (both pubic hair and regular hair from your head) in Samuel Steward’s stud file, where he documented his lovers, that is in the Yale Archives
It was so exciting and nourishing to talk about bodies in relation to libraries, archives and information work. I didn’t realize that I was so hungry to have these conversations. I realized that when I’m doing my daily work I’m fairly dissociated. I bike to work, hang up my body on the back of my office door, and then let my brain run around for the day. Then I put my body back on and go about the rest of my life. I’ve been working to try and be my whole self at work, and have realized that the brain/body binary needs to be dismantled.
I’m not really sure what this is going to look like. I fear it might be messy, as bodies often are. I also fear that there will be failure, as is common with trying new things. To start, I think I’m going to go join the Woodward’s Community Singers this Thursday and sing again.
From Tim Donohue, DSpace Tech Lead, on behalf of the DSpace Committers
The DSpace 6.0 Testathon is now underway. We ask that you take a few minutes of your time in these coming weeks to help us fully test this new release. We want to ensure we are maintaining the same level of quality that you come to expect out of a new DSpace release. We'd also love to hear your early feedback on 6.0!
There are of course as many different ways to use social media (or not) as there are people. But I was thinking the other day that most of us who use social media tools such as Twitter and Facebook probably fall into one of three camps:
- Promiscuous — These are the people who share just about everything. You know how many pets they have and of which kind, you know if they have kids, you know how many, how old, and also that you will see every cute or horrible thing that they do as it will be posted by their adoring parent.
- Particular — These people don’t post everything, they post very selectively.
- Private — These are the lurkers. They like to see what is going on with their friends but they only rarely, if ever, share themselves. I know a young person like this. She is a very private person and she has never shared anything on Facebook despite having an account.
Having created these categories, I would guess that many of us go in and out of these categories at different times and for different purposes. But do you consider yourself to be promiscuous, particular, or private when it comes to social media?
Photo by e-codices, Creative Commons License CC BY-NC 2.0
Cynthia Ng: Imagine Living Without Books Part 1: The Importance of Supporting Print Disabled Readers
When it comes to digital preservation, everyone agrees that a little bit is better than nothing. Look no further than these two excellent presentations from Code4Lib 2016, “Can’t Wait for Perfect: Implementing “Good Enough” Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. I highly suggest you go check those out before reading more of this post if you are new to digital preservation, since they get into some technical details that I won’t.
The takeaway from these for me was twofold. First, digital preservation doesn’t have to be hard, but it does have to be intentional, and secondly, it does require institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and what your options are can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, and so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape for digital preservation systems. This won’t be an in-depth exploration, more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.
The committee I chair started by examining born-digital materials, but expanded its focus to all digital materials, since our digitized materials were an easier test case for a lot of our ideas. The committee spent a long time understanding the basic tenets of digital preservation–and in truth, we’re still working on this. For this process, we found working through the NDSA Levels of Digital Preservation an extremely helpful exercise–you can find a helpfully annotated version with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltzman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources.
A few of the tasks we accomplished were to create a rough inventory of digital materials, write a workflow manual, and acquire many terabytes (currently around 8) of secure networked storage space for files to replace all removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have. An inventory and workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. The major advantage of having these is not so much for what we do have, but for identifying gaps in our processes. Through this process, we were able to develop a lengthy (but prioritized) list of tasks that need to be completed before we’ll be satisfied with our processes. For example, one of the major workflow gaps we discovered is that we have many items on obsolete digital media formats, such as floppy disks, that need to be imaged before they can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the completion of this project in limbo. We’re now working on hiring a graduate student who can help work on this and similar projects.
The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding of this below, with several major caveats. First, this is a world that is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing systems, so there is a lot missing from what I will say. Second, none of these solutions are necessarily mutually exclusive. Some by design require various pieces to be used together, some may not require it, but your circumstances may dictate a different solution. For instance, you may not like the access layer built into one system, and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream–as is the case with so many library technology problems.
The closest to such a dream is the end-to-end system. This is something where at one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end have a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller sized JPG appropriate for display to the public) if you so desire–not all digital files should be available to the public, but still need to be preserved.
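The "way to check for bit rot" mentioned above is usually some form of fixity checking: record a checksum for every file when the package is created, then recompute and compare later. The following is a minimal Python sketch of that idea, not the method used by any particular product. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(root)
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

In practice the manifest would be saved alongside the files (hypothetically as a text or JSON file) when they are first processed, and `verify_manifest` would be run on a schedule. Any path it returns is a file that needs to be restored from another copy.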
Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted vended products, but ArchivesDirect is based on open source Archivematica, so it is possible to get some idea of the experience of using it if you are able to install the tools on which it is based. The issues with end-to-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of medium-sized libraries you will be looking at $10,000-$14,000 annual cost. You are pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, so inertia dictates that you will be using that system for the long haul, and a short trial period or a demo may not be enough to tell you whether it’s a good fit. But these systems do offer the potential for more simplicity, and therefore a stronger likelihood that you will actually use them, and they are much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in the current positions for digital preservation work. A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you are submitting for long-term storage. It’s dirt cheap to store on Amazon Glacier and, if you can restore slowly, not too expensive to restore–only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging since you only interact with it through APIs–there’s no way to log in and upload files or download files as with a cloud storage service like Dropbox.
For that reason, when you’re paying a service hundreds of dollars a terabyte that ultimately stores all your material on Amazon Glacier, which costs pennies per gigabyte, you’re paying for the technical infrastructure to get your stuff on and off of there as much as anything else. In another sense, you’re paying for an insurance policy for accessing materials in a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra for such a situation.
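To make the "hundreds of dollars a terabyte" versus "pennies per gigabyte" comparison concrete, here is a small back-of-the-envelope sketch in Python. Both prices are placeholder assumptions chosen only to match those phrases, not quotes from any vendor, and the assumption that raw storage is billed per gigabyte per month is also illustrative. The 8 TB figure is the storage volume mentioned earlier in this post.

```python
GB_PER_TB = 1000  # decimal terabytes, an assumption for simplicity

# Hypothetical prices for illustration only; check current vendor pricing.
ASSUMED_RAW_PRICE_PER_GB_MONTH = 0.01     # "pennies per gigabyte"
ASSUMED_HOSTED_PRICE_PER_TB_YEAR = 300.0  # "hundreds of dollars a terabyte"
COLLECTION_TB = 8                         # storage volume mentioned in the post


def annual_cost_raw(terabytes, price_per_gb_month):
    """Yearly cost of raw storage billed per gigabyte per month."""
    return terabytes * GB_PER_TB * price_per_gb_month * 12


def annual_cost_hosted(terabytes, price_per_tb_year):
    """Yearly cost of a hosted service billed per terabyte per year."""
    return terabytes * price_per_tb_year


raw = annual_cost_raw(COLLECTION_TB, ASSUMED_RAW_PRICE_PER_GB_MONTH)
hosted = annual_cost_hosted(COLLECTION_TB, ASSUMED_HOSTED_PRICE_PER_TB_YEAR)

# Share of the hosted price that pays for something other than raw storage.
service_share = (hosted - raw) / hosted

print(f"Raw storage:   ${raw:,.2f} per year")
print(f"Hosted:        ${hosted:,.2f} per year")
print(f"Service share: {service_share:.0%}")
```

Swapping in real prices shows how much of a hosted fee goes to infrastructure, support, and retrieval coverage, as opposed to the storage itself.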
A related option to an end-to-end system that has some attractive features is to join a preservation network. Examples of these include Digital Preservation Network (DPN) or APTrust. In this model, you pay an annual membership fee (right now $20,000 annually, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. Another larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time to participate in governance may be out of reach of many, but it appears that both DPN and APTrust are investigating new pricing models that may meet the needs of smaller institutions who would like to participate but can’t contribute as much in money or time. This is a world that I would recommend watching closely.
Up until recently, the way that many institutions were achieving digital preservation was through some kind of repository that they created themselves, either with open source repository software such as Fedora Repository or DSpace or some other type of DIY system. With open source Archivematica, and a few other tools, you can build your own end-to-end system that will allow you to process files, store the files and preservation metadata, and provide access as is appropriate for the collection. This is theoretically a great plan. You can make all the choices yourself about your workflows, storage, and access layer. You can do as much or as little as you need to do. But in practice for most of us, this just isn’t going to happen without a strong institutional commitment of staff and servers to maintain this long term, at possibly a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, which is an exciting initiative that is currently in development. The idea is to make it possible for many different sizes of institutions to take advantage of the robust feature sets for preservation in Fedora and workflow management/access in Hydra, but without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.
After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Information System (OAIS) model diagram to explain it in depth, and comes to a point in the workflow that calls for “Sustainable Financing”, and then zooms in on this. For many, this is the crux of the digital preservation problem. It’s possible to do a sort of ok job with digital preservation for nothing or very cheap, but ensuring long-term preservation requires institutional commitment for the long haul, just as any library collection requires. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see this as a priority and start to participate. This may lead to even more options, tools, and knowledge, but it will still require making it a priority and putting in the work.
Learn about Islandora, the open source digital repository, from experts in an afternoon of diving into the framework.
Islandora for Managers: Open Source Digital Repository Training
Friday June 24, 2016, 1:00 – 4:00 pm
Presenters: Erin Tripp, Business Development Manager at discoverygarden inc. and Stephen Perkins, Managing Member of Infoset Digital Publishing
This Islandora for Managers workshop will empower participants to manage digital content in an open source, standards-based, and interoperable repository framework. Islandora combines Drupal, Fedora Commons and Solr software together with additional open source applications. The framework delivers easy-to-configure tools to expose and preserve all types of digital content. The Islandora for Managers workshop will provide an overview of the Islandora software and open source community. It will also feature an interactive ‘how to’ guide for ingesting various types of content, setting permissions, metadata management, configuring discovery, managing embargoes and much more. Participants can choose to follow along using a virtual machine or an online Islandora sandbox.
Erin Tripp
Erin Tripp is currently the Business Development Manager at discoverygarden inc. Since 2011, Erin’s been involved in the Islandora project, a community-supported framework of open source technologies for digital repositories. In that time, Erin’s been involved in more than 40 different Islandora projects ranging from consulting to custom development and data migrations. Prior to coming to discoverygarden inc., Erin graduated from the University of King’s College (BJH), worked as a broadcast journalist with CTV Globemedia, and graduated from the Dalhousie University School of Information Management (MLIS) where she won the Outstanding Service Award in 2011.
Stephen Perkins
Stephen Perkins, an official agent and consultant of discoverygarden, is Managing Member of Infoset Digital Publishing. Infoset provides content and technology solutions for institutions, publishers, and businesses. Stephen has more than 20 years’ experience directing small-to-medium scale IT projects, specializing in digital asset management solutions for the Humanities. He has extensive experience in architecting solutions for cultural heritage institutions, reference publishers, and documentary editing projects.
More LITA Preconferences at ALA Annual
Friday June 24, 2016, 1:00 – 4:00 pm
- Digital Privacy and Security: Keeping You And Your Library Safe and Secure In A Post-Snowden World
- Technology Tools and Transforming Librarianship
LITA Member: $205
ALA Member: $270
Non Member: $335
Questions or Comments?
For all other questions or comments related to the preconference, contact LITA at (312) 280-4269 or Mark Beatty, firstname.lastname@example.org.
Thanks to support from the University of Manitoba, McMaster University, York University, the tireless work of developer Nigel Banks, and Q&A/testing by Project Director Nick Ruest, the Islandora Foundation is very happy to announce the development of a suite of new ways to work with Islandora CLAW:
- Islandora Ansible: Ansible playbooks for setting up Islandora CLAW
- Islandora CLAW: All in One Docker Image: A single container with all the dependencies required to run Islandora CLAW.
- Islandora CLAW Docker: A single repository, which calls on the below repositories to build out Islandora CLAW with individual containers for each component.
These links represent a tremendous amount of work that will make it much easier for you to deploy and develop in Islandora CLAW. Moreover, this work illustrates that the CLAW architecture can be split out into its individual components, and scaled horizontally. If you're interested in learning more about how to develop with CLAW or want to be a part of the project, please consider attending our weekly CLAW Tech Calls, or taking in our run of CLAW development webinars - more info here.