
Feed aggregator

DuraSpace News: FIND OUT How to Take Advantage of DSpaceDirect: Your Fast and Affordable Repository Solution

planet code4lib - Tue, 2015-09-29 00:00

Winchester, MA: Do you need a fast, efficient, and affordable hosted repository solution? DSpaceDirect is a service offered by DuraSpace based on the popular DSpace open source repository software. DSpaceDirect provides access, management, and preservation of any content or file type in a hosted repository environment, making digital resources discoverable by your users and easily managed by you.

District Dispatch: Candidates should focus on community anchors

planet code4lib - Mon, 2015-09-28 21:25

Libraries serve everyone in the community, photo courtesy of Cherry Hill Public Library.

Libraries: The quintessential community organization in the digital age

Given the critical role of local communities to our nation’s economic strength, ALA’s Alan Inouye, director of the Office for Information Technology Policy (OITP), is urging the presidential candidates to make communities, and specifically libraries in their role as community anchors, a central part of the candidates’ campaign conversations.

Libraries are now digitally-enabled community spaces with an array of technology training and resources, Inouye explains, noting that “Libraries, as well as other community anchor institutions, are best positioned to effect positive change towards an economy for the future that works for everyone.”

His op-ed article is published in the Digital Beat Blog of the Benton Foundation.

Take a moment to check it out!

The post Candidates should focus on community anchors appeared first on District Dispatch.

Nicole Engard: Bookmarks for September 28, 2015

planet code4lib - Mon, 2015-09-28 20:30

Today I found the following resources and bookmarked them on Delicious.

  • Zulip A group chat application optimized for software development teams

Digest powered by RSS Digest

The post Bookmarks for September 28, 2015 appeared first on What I Learned Today....

Related posts:

  1. Software Freedom Day in September
  2. Another way to use Zoho
  3. September Workshops

LITA: Teaching Patrons About Privacy, a LITA webinar

planet code4lib - Mon, 2015-09-28 18:00

Attend this important new LITA webinar:

Teaching Patrons about Privacy in a World of Pervasive Surveillance: Lessons from the Library Freedom Project

Tuesday October 6, 2015
1:30 pm – 3:00 pm Central Time
Register Online, page arranged by session date (login required)

In the wake of Edward Snowden’s revelations about NSA and FBI dragnet surveillance, Alison Macrina started the Library Freedom Project as a way to teach other librarians about surveillance, privacy rights, and technology tools that protect privacy. In this 90 minute webinar, she’ll talk about the landscape of surveillance, the work of the LFP, and some strategies you can use to protect yourself and your patrons online. Administrators, instructors, librarians and library staff of all shapes and sizes will learn about the important work of the Library Freedom Project and how they can help their patrons.

Alison's work for the Library Freedom Project, including her classes for patrons and tips on teaching patron privacy classes, can be found at:

Alison Macrina

Alison is a librarian, privacy rights activist, and the founder and director of the Library Freedom Project, an initiative which aims to make real the promise of intellectual freedom in libraries by teaching librarians and their local communities about surveillance threats, privacy rights and law, and privacy-protecting technology tools to help safeguard digital freedoms. Alison is passionate about connecting surveillance issues to larger global struggles for justice, demystifying privacy and security technologies for ordinary users, and resisting an internet controlled by a handful of intelligence agencies and giant multinational corporations. When she’s not doing any of that, she’s reading.

Register for the Webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.


  • LITA Member: $45
  • Non-Member: $105
  • Group: $196

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty.

Mashcat: Mashcat face-to-face event in Boston: call for proposals

planet code4lib - Mon, 2015-09-28 17:15

We are excited to announce that the first face-to-face Mashcat event in North America will be held on January 13th, 2016, at Simmons College in Boston, Massachusetts. We invite you to save the date, and we hope to have registration and a schedule for this low-cost (less than $10), 1-day event announced in November.

At present, we are accepting proposals for talks, panels, workshops, and other session formats for the Mashcat event. We are open to a variety of formats, with the reminder that this will be a one-day, single-track event aiming to support the cross-pollination goals of Mashcat (see more below). We are also open to proposals for sessions led virtually. Please submit your proposals using this form. All proposals must be received by midnight on November 1st, 2015, and we will respond to all proposals by November 8th, 2015.

Not sure what Mashcat is? “Mashcat” was originally an event in the UK in 2012 which was aimed at bringing together people working on the IT systems side of libraries with those working in cataloguing and metadata. Three years later, Mashcat is a loose group of metadata specialists, cataloguers, developers and anyone else with an interest in how metadata in and around libraries can be created, manipulated, used and re-used by computers and software. The aim is to work together and bridge the communications gap that has sometimes gotten in the way of building the best tools we possibly can to manage library data.

Thanks for considering, and we hope to see you in January.

Library of Congress: The Signal: Stewarding Academic and Research Content: An Interview with Bradley Daigle and Chip German about APTrust

planet code4lib - Mon, 2015-09-28 14:20

The following is a guest post by Lauren Work, digital collections librarian, Virginia Commonwealth University.

In this edition of the Insights Interview series for the NDSA Innovation Working Group, I was excited to talk with Bradley Daigle, director of digital curation services and digital strategist for special collections at the University of Virginia, and R. F. (Chip) German Jr., program director of the APTrust, about the Academic Preservation Trust.

Lauren: Tell us about the Academic Preservation Trust and how the organization addresses the needs of member institutions.

Bradley and Chip: The APTrust is a consortium of 17 members who believe that their combined expertise and experience can provide more efficient and effective means to answering the challenges of digital stewardship. The consortium’s objective is to establish new collaborative strategies to help in addressing the complex and daunting issue of preserving the digital scholarly content produced or managed by universities. The group draws upon the deep knowledge of its members to target specific solutions that are content, technological, and administratively focused. Each member has representatives that work locally with their organization and then bring that knowledge back to the larger collective. This dialogic approach provides the methodology by which challenges are identified, analyzed, and then addressed in the best manner possible for the consortium.

The consortium is governed by its members, and it is operated and managed by a small staff based at the University of Virginia Library.  The core APTrust team organizes and deploys the resources of the group in an open, collaborative manner. We work to guide and seek guidance from the consortium itself.

Lauren: You mentioned that members work within their organization and share what they learned with the consortium. Could you talk a bit more specifically about what are members expected to contribute to APTrust? What are some of the resources from which members can benefit?

Chip German. Courtesy of AP Trust.

Bradley and Chip: The APTrust seeks to provide broad, scalable solutions that identify the true costs of preservation. In this manner, we hope to provide the economic and business models for digital preservation that any level of organization can adopt and deploy locally. Working together, we hope to create solutions that anyone can use.

To that end, members play a key role in seeking out both the problems and solutions to specific preservation challenges. For example, we have a current sub group of members who are focused specifically on the requirements for becoming a Trusted Digital Repository. This qualification is highly desired by some members but not necessarily everyone at the same level. Therefore, the ability to form special interest groups who can plumb the depths of a given issue and then bring a condensed version back to the collective is one of the many ways we use engagement and need to move the entire effort forward. We also have groups that are focused on our communications efforts as well as storage security. Some of these groups will disband once the initial work is concluded–others (like the TDR) represent the ongoing need for focused attention.

Lauren: You recently confirmed your mission statement, and the word “innovative” is used. How do you define or hope to define APTrust as an innovator in the field of digital preservation?

Bradley and Chip: The APTrust sees innovation as an ongoing goal. Preservation issues are not easily solved and once solutions are determined the problem set can mutate. Innovation means that we are striving for the best solution we can identify at the time and continue to identify and adapt and solve. Innovation is ongoing and the product of a great deal of collaborative effort on the part of everyone in the APTrust. We never see solutions as final but rather structures that need constant repair.

Lauren: Digital preservation is a daunting topic for many organizations, and the effort sometimes faces a “Why try?” stance. What advice do you have for those attempting to form digital preservation guidelines for their own organizations?

Bradley and Chip: The stewardship of our digital heritage is indeed an overwhelming and daunting task. It is always a matter of perspective–the best being an acknowledgement that we will never be able to accomplish it in its entirety. As with the physical realm, we can only hope to do our best at any given time. Digital preservation requires perspective and humility.

As with most efforts at this scale, often the most effective approach is to define the problem and then create a plan that speaks to what is possible for your organization. Define the scope, choose what is important and start in small but achievable chunks. As with collections, one must define the scope and not try to collect everything. Specialize if it makes sense for your organization–content type, format type, level of preservation. We have found it most useful to create levels of preservation–mapping to what is achievable by your organization and use that as a guide. Start somewhere and you will find you can make a difference, no matter how small.

Lauren: Technology changes quickly, and keeping up with evolving hardware, software and formats is an issue. As APTrust accepts all types of formats from its institutions, what advice do you have for librarians and archivists who need to make the preservation case for funding the technologies and infrastructure to support digital preservation in their organizations?

Bradley Daigle. Photo by Luca DiCecco.

Bradley and Chip: This goes back to defining your organization’s levels of preservation. For example, the lowest level of preservation may simply be a piece of metadata that states something existed at one time but is no longer extant. The highest level may be the management of those digital files in an emulated environment. The crux of sustainability lies in overlapping two mutable matrices: a map of what preservation levels are meant to do overlaid on a technical implementation matrix that defines how that level can be accomplished. This way you can adapt to new trends in technology. The former matrix, that of collecting or preservation levels, should change very little over time. The technical implementation, however, should adapt to evolving trends.

Lauren: Digital preservation benefits are not immediate, and it can be difficult to demonstrate value, even for the immediate future. How did APTrust articulate the value of digital preservation and make the case for allocating current resources to reap long-term benefits?

Bradley and Chip: The APTrust consortium benefits from a shared belief that digital preservation is not a luxury service. We represent organizations whose mission it is to steward our cultural record–in whatever form it takes. The old adage of there being only two kinds of people: those who have lost data – and those who will lose data – applies here. Most organizations have taken on digital preservation in one manner or another. APTrust offers the ability to provide scalable services at cost–with the added benefit of collective problem solving. Certainly there are preservation solutions out there for any level of organization. However, in taking this singular approach, you are also taking the full brunt of solving each preservation solution on your own as it arises.

We believe that a consortial approach leverages the strengths of all its partners which leads to quicker, more efficient (read: cheaper) solutions. Preservation isn’t solved in a day–it is solved in many ways every day. The more people you have scanning the landscape for challenges and solutions the more effective and scalable your solution.

Lauren: There are many advantages to the consortium model for digital preservation. What advantages do you think individual institutions or smaller consortia might have in their approach to digital preservation?

Bradley and Chip: As we mentioned, the advantages to the “many mind” approach can have dramatic benefits. The ability of a group to identify an arising challenge, task a small group to investigate that challenge, and then bring that knowledge back to the collective has been proven repeatedly. Given the scale, complexity, and scope of digital preservation, doing this at any level is critical to moving us all forward in solving these issues.

Lauren: What do you see as the greatest challenge for digital preservation?

Bradley and Chip: The main challenge for preservation has always been the same: it is infrastructure and infrastructure is not sexy. If you are doing your job and doing it well, no one notices. People only notice when you fail. This fact is inculcated in our society. Witness all the home renovation shows. People don’t care about knob and tube wiring – until they have to replace it. No one wants to pay for that work, they would rather have that brushed nickel six burner stove that everyone will notice and love. That is the challenge of preservation – making the case for the cost of this endeavor is difficult because it is so resource intensive. However, the cost of failing is much higher. It is already likely that we will have a gap in our digital cultural heritage as we play catch up to operationalizing enterprise digital preservation. Let’s just hope it is not too late.

Islandora: Back to the Long Tail

planet code4lib - Mon, 2015-09-28 14:09

It's that time again. Let's wag the Long Tail of Islandora and have a look at some of the really great work being done out in the Islandora community and review some modules that might solve a problem you're having:

Streaming Media Solution Pack

Created by UPEI's Rosie LeFaive and born from the Bowing Down Home project, this module allows you to create and manage Islandora objects representing externally hosted streaming resources, which can be catalogued and displayed via an Islandora instance. You can also store a copy of the file as an Islandora object.

EAD Solution Pack

EAD finding aids in Islandora! This module from Chris Clement at Drexel University Libraries lets you ingest EAD and make it browsable. Check out this example.

Islandora Context

From Simon Fraser University's Mark Jordan (Chairman of the Islandora Foundation!), this module provides a set of Context "conditions" and "reactions" for Islandora objects. Think of this module as an "if-this-then-that" configurator for Islandora repositories.

Islandora Custom Solr

A neat little tool from Jared Whiklo at the University of Manitoba, this module will replace SPARQL queries with Solr queries where possible for speed improvements.

Islandora RML

A great module from Frits Van Latum of Delft University, Islandora RML gets triples from XML datastreams (e.g., MODS) and stores these triples in RELS-EXT. Triples consist of the URI of the fedoraObject as subject, a generated predicate, and a generated object.

LITA: Triaging Technologies

planet code4lib - Mon, 2015-09-28 14:00
Flickr/Etienne Valois, CC BY NC ND

I manage digital services and resources at a small academic library with minimal financial and human resources available. For almost a year, I served as solo librarian for fixing and optimizing the library website, library services platform, electronic resources, workflows, documentation, and other elements of technology management vital to back-end operations and front-end services. Coping with practical limitations and a vast array of responsibilities, I resorted to triage. In triage management, the primary consideration is return on investment (ROI) – how stakeholder benefits measure against time and resources expended to realize those benefits.

Condition Black: The technology must be replaced or phased out because it is dysfunctional and impossible to fix. Into this category fell our website, built with the clunky and unusable Microsoft SharePoint; our laptops running Windows XP and too old to upgrade to a more current operating system; and our technology lending service, for which we had no funds to upgrade the dated technologies on offer. Down the road we might write this last item into the budget or solicit donations from the community, but at the time, the patient was DOA.

Condition Blue: The technology is current, optimal for user needs, and can be left essentially to run itself while library technology managers focus on more urgent priorities. Into this category fell the recently upgraded hardware at one of our campus libraries, as well as LibCal, a study room booking system with faultless performance.

Condition Green: The situation requires monitoring but not immediate intervention, not until higher-order priorities have been addressed. This was the situation with OCLC WorldShare Management Services (WMS). This LSP offers only limited functionality—scandalously to my mind, subscribers still have to pull many reports via FTP. But the platform is cheap and handles the core functions of circulation, cataloging, and interlibrary loan perfectly. For us, WMS was low-priority.

Condition Yellow: The situation needs to be salvaged and the system sustained, but it is still not quite the top priority. In this category fell the OCLC knowledge base and WorldCat discovery layer, which in Hodges University’s instance experience incessant link resolution issues and require constant monitoring and frequent repair tickets to OCLC. A screwy discovery layer impacts users’ ability to access resources as well as creating a frustrating user experience. BUT I decided not to prioritize knowledge base optimization because the methodology was already in place for triaging the crisis. For years my colleagues had been steering students directly toward subject databases in lieu of WorldCat.

Condition Red: The system is in dire need of improvement – this is Priority 1. Into this category fell the library’s content management system, LibGuides. My first priority on taking over web services was to upgrade LibGuides to Version 2, which offers responsive design and superior features, and then to integrate the entire library website within this new-and-improved CMS. I would also argue that internal customer service falls into this category – staff must have documentation, training, and other support to do their work well before they can exceed expectations for external customer service. These are the critical priorities.

A few additional points.

1. Library technologists must revisit triage placements periodically and reassess as needed. Movement is the goal – from conditions Red to Blue.

2. Library technologists must eschew using triage as a stopgap measure. Triage is vital to long-range planning in terms of budget allocation, project management, and other responsibilities. Triage is planning.

3. Where each priority is placed in a triage system is contingent on local needs and circumstances. There is no one-size-fits-all generalization.

How do you use triage at your library? Is it a useful approach?

Mark E. Phillips: File Duplication in the UNT Libraries Digital Collections

planet code4lib - Mon, 2015-09-28 13:30

A few months ago I was following a conversation on Twitter that got me thinking about how much bit-for-bit duplication there was in our preservation repository and how much space that duplication amounted to.

I let this curiosity sit for a few months and finally pulled the data from the repository in order to get some answers.

Getting the data

Each of the digital objects in our repository has a METS record that conforms to the UNTL-AIP-METS Profile registered with the Library of Congress. One of the features of this METS profile (like many others) that these files make use of is the fileStruct section; for each file in a digital object, it records the following pieces of information:

Field         Example Value
FileName      ark:/67531/metadc419149
CHECKSUM      bc95eea528fa4f87b77e04271ba5e2d8
CHECKSUMTYPE  MD5
USE           0
MIMETYPE      image/tiff
CREATED       2014-11-17T22:58:37Z
SIZE          60096742
FILENAME      file://data/01_tif/2012.201.B0389.0516.TIF
OWNERID       urn:uuid:295e97ff-0679-4561-a60d-62def4e2e88a
ADMID         amd_00013 amd_00015 amd_00014
ID            file_00005

By extracting this information for each file in each of the digital objects I would be able to get at the initial question I had about duplication at the file level and how much space it accounted for in the repository.

Extracted Data

At the time of writing this post, the Coda Repository that acts as the preservation repository for the UNT Libraries Digital Collections contains 1.3 million digital objects that occupy 285 TB of primary data. These 1.3 million digital objects consist of 151 million files that have fixity values in the repository.

The dataset that I extracted has 1,123,228 digital objects because it was extracted a few months ago. Another piece of information that is helpful to know is that the numbers we report for "files managed by Coda" (the 151 million mentioned above) include both the primary files ingested into the repository and the metadata files added to the Archival Information Packages as they are ingested. The analysis in this post deals only with the primary data files deposited with the initial SIP and does not include the extra metadata files. This dataset contains information about 60,164,181 files in the repository.

Analyzing the Data

Once I acquired the METS records from the Coda repository, I wrote a very simple script to extract information from the File section of each METS record and format that data into a tab-separated dataset that I could use for subsequent analysis work. Because some of the data is duplicated into each row to make processing easier, this resulted in a tab-separated file that is just over 9 GB in size (1.9 GB compressed) and contains 60,164,181 rows, one for each file.
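For illustration, here is a minimal sketch of the kind of extraction script described above, assuming records that follow the METS conventions shown earlier. The element and attribute names come from the METS and XLink schemas, but the directory layout, output file name, and column order are assumptions made for this example rather than the exact script used:

# Sketch: pull per-file attributes out of the fileSec of each METS record
# and write one tab-separated row per file. Paths and columns are illustrative.
import csv
import glob
from lxml import etree

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

with open("mets_dataset.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for path in glob.glob("mets_records/*.aip.mets.xml"):
        tree = etree.parse(path)
        for f in tree.findall(".//mets:fileSec//mets:file", NS):
            flocat = f.find("mets:FLocat", NS)
            href = flocat.get("{http://www.w3.org/1999/xlink}href") if flocat is not None else ""
            writer.writerow([
                path.split("/")[-1],   # METS record file name
                f.get("CHECKSUM"),     # MD5 fixity value
                f.get("CHECKSUMTYPE"),
                f.get("USE"),
                f.get("MIMETYPE"),
                f.get("CREATED"),
                f.get("SIZE"),
                href,
            ])

Each output row carries the METS record name alongside the file-level attributes, which is the per-row duplication of data mentioned above.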

Here is a representation as a table for a few rows of data.

METS File | CHECKSUM | CHECKSUMTYPE | USE | MIMETYPE | CREATION | SIZE | FILENAME
metadc419149.aip.mets.xml | bc95eea528fa4f87b77e04271ba5e2d8 | md5 | 0 | image/tiff | 2014-11-17T22:58:37Z | 60096742 | file://data/01_tif/2012.201.B0389.0516.TIF
metadc419149.aip.mets.xml | 980a81b95ed4f2cda97a82b1e4228b92 | md5 | 0 | text/plain | 2014-11-17T22:58:37Z | 557 | file://data/02_json/2012.201.B0389.0516.json
metadc419544.aip.mets.xml | 0fba542ac5c02e1dc2cba9c7cc436221 | md5 | 0 | image/tiff | 2014-11-17T23:20:57Z | 51603206 | file://data/01_tif/2012.201.B0391.0539.TIF
metadc419544.aip.mets.xml | 0420bff971b151442fa61b4eea9135dd | md5 | 0 | text/plain | 2014-11-17T23:20:57Z | 372 | file://data/02_json/2012.201.B0391.0539.json
metadc419034.aip.mets.xml | df33c7e9d78177340e0661fb05848cc4 | md5 | 0 | image/tiff | 2014-11-17T23:42:16Z | 57983974 | file://data/01_tif/2012.201.B0394.0493.TIF
metadc419034.aip.mets.xml | 334827a9c32ea591f8633406188c9283 | md5 | 0 | text/plain | 2014-11-17T23:42:16Z | 579 | file://data/02_json/2012.201.B0394.0493.json
metadc419479.aip.mets.xml | 4c93737d6d8a44188b5cd656d36f1e3d | md5 | 0 | image/tiff | 2014-11-17T23:01:15Z | 51695974 | file://data/01_tif/2012.201.B0389.0678.TIF
metadc419479.aip.mets.xml | bcba5d94f98bf48181e2159b30a0df4f | md5 | 0 | text/plain | 2014-11-17T23:01:15Z | 486 | file://data/02_json/2012.201.B0389.0678.json
metadc419495.aip.mets.xml | e2f4d1d7d4cd851fea817879515b7437 | md5 | 0 | image/tiff | 2014-11-17T22:30:10Z | 55780430 | file://data/01_tif/2012.201.B0387.0179.TIF
metadc419495.aip.mets.xml | 73f72045269c30ce3f5f73f2b60bf6d5 | md5 | 0 | text/plain | 2014-11-17T22:30:10Z | 499 | file://data/02_json/2012.201.B0387.0179.json

My first step was to extract the column that stores the MD5 fixity value, sort that column, and then count the number of instances of each fixity value in the dataset. The command ends up looking like this:

cut -f 2 mets_dataset.tsv | sort | uniq -c | sort -nr | head

This worked pretty well and resulted in the MD5 values that occurred the most. This represents the duplication at the file level in the repository.

Count   Fixity Value
72,906  68b329da9893e34099c7d8ad5cb9c940
29,602  d41d8cd98f00b204e9800998ecf8427e
3,363   3c80c3bf89652f466c5339b98856fa9f
2,447   45d36f6fae3461167ddef76ecf304035
2,441   388e2017ac36ad7fd20bc23249de5560
2,237   e1c06d85ae7b8b032bef47e42e4c08f9
2,183   6d5f66a48b5ccac59f35ab3939d539a3
1,905   bb7559712e45fa9872695168ee010043
1,859   81051bcc2cf1bedf378224b0a93e2877
1,706   eeb3211246927547a4f8b50a76b31864

There are a few things to note here. First, because of the way that we version items in the repository, there is going to be some duplication resulting from our versioning strategy. If you are interested in understanding the versioning process we use for our system and the overhead that occurs because of this strategy, you can take a look at the whitepaper we wrote in 2014 about the subject.

Phillips, Mark Edward & Ko, Lauren. Understanding Repository Growth at the University of North Texas: A Case Study. UNT Digital Library. Accessed September 26, 2015.

To get a better idea of the kinds of files that are duplicated in the repository, the following table shows additional fields for the ten most repeated files.

Count   MD5                               Bytes   Mimetype             Common File Extension
72,906  68b329da9893e34099c7d8ad5cb9c940  1       text/plain           txt
29,602  d41d8cd98f00b204e9800998ecf8427e  0       application/x-empty  txt
3,363   3c80c3bf89652f466c5339b98856fa9f  20      text/plain           txt
2,447   45d36f6fae3461167ddef76ecf304035  195     application/xml      xml
2,441   388e2017ac36ad7fd20bc23249de5560  21      text/plain           txt
2,237   e1c06d85ae7b8b032bef47e42e4c08f9  2       text/plain           txt
2,183   6d5f66a48b5ccac59f35ab3939d539a3  3       text/plain           txt
1,905   bb7559712e45fa9872695168ee010043  61,192  image/jpeg           jpg
1,859   81051bcc2cf1bedf378224b0a93e2877  2       text/plain           txt
1,706   eeb3211246927547a4f8b50a76b31864  200     application/xml      xml

You can see that most of the files that are duplicated are very small in size: 0, 1, 2, and 3 bytes. The largest were JPEGs that appear 1,905 times in the dataset, each 61,192 bytes. The files in these top examples are txt, xml, and jpg.

Overall we see that for the 60,164,181 rows in the dataset, there are 59,177,155 unique MD5 hashes. This means that 98% of the files in the repository are in fact unique. Among the 987,026 rows in the dataset that are duplicates of other fixity values, there are 666,259 unique MD5 hashes.

So now we know that there is some duplication in the repository at the file level. Next I wanted to know what kind of effect this has on the storage allocated. I took the 666,259 fixity values that had duplicates and went back to pull the number of bytes for those files. I calculated the storage overhead for each of these fixity values as bytes × (instances − 1), which removes the size of the initial copy and shows only the duplication overhead.
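The same calculation can be sketched in a few lines against the tab-separated dataset. The column positions follow the table above (checksum in column 2, size in column 7) and the file name is an assumption; this is an illustration of the arithmetic, not the exact script used:

# Sketch: total up duplicate-file overhead from the TSV described above.
# Column indexes are 0-based here: row[1] is the MD5 value, row[6] is the size.
import csv
from collections import defaultdict

counts = defaultdict(int)   # md5 -> number of occurrences
sizes = {}                  # md5 -> file size in bytes

with open("mets_dataset.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        md5, size = row[1], int(row[6])
        counts[md5] += 1
        sizes[md5] = size

# bytes x (instances - 1): keep one copy, count the rest as overhead
overhead = sum((n - 1) * sizes[md5] for md5, n in counts.items() if n > 1)
print("duplicate overhead: {:,} bytes ({:.2f} TB)".format(overhead, overhead / 1e12))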

Here is the table for the ten most duplicated files to show that calculation.

Count   MD5                               Bytes per File  Duplicate File Overhead (Bytes)
72,906  68b329da9893e34099c7d8ad5cb9c940  1               72,905
29,602  d41d8cd98f00b204e9800998ecf8427e  0               0
3,363   3c80c3bf89652f466c5339b98856fa9f  20              67,240
2,447   45d36f6fae3461167ddef76ecf304035  195             476,970
2,441   388e2017ac36ad7fd20bc23249de5560  21              51,240
2,237   e1c06d85ae7b8b032bef47e42e4c08f9  2               4,472
2,183   6d5f66a48b5ccac59f35ab3939d539a3  3               6,546
1,905   bb7559712e45fa9872695168ee010043  61,192          116,509,568
1,859   81051bcc2cf1bedf378224b0a93e2877  2               3,716
1,706   eeb3211246927547a4f8b50a76b31864  200             341,000

After taking the overhead for each row of duplicates,  I ended up with 2,746,536,537,700 bytes or 2.75 TB of overhead because of file duplication in the Coda repository.


I don’t think there is much surprise that there is going to be duplication of files in a repository. The most common file we have that is duplicated is a txt file with just one byte.

What I will do with this information I don’t really know. I think that the overall duplication across digital objects is a feature and not a bug. I like the idea of more redundancy when reasonable. It should be noted that this redundancy is often over files that from what I can tell carry very little information (i.e. tiff images of blank pages, or txt files with 0, 1, or 2 bytes of data)

I do know that this kind of data can be helpful when talking with vendors that provide integrated "de-duplication services" in their storage arrays, though that de-duplication is often at a smaller unit than the entire file. It might be interesting to take a stab at seeing what the effect of different de-duplication methodologies and algorithms would be on a large collection of digital content, so if anyone has some interest and algorithms I'd be game to give it a try.

That’s all for this post, but I have a feeling I might be dusting off this dataset in the future to take a look at some other information such as filesizes and mimetype information that we have in our repository.

Meredith Farkas: The insidious nature of “fit” in hiring and the workplace

planet code4lib - Mon, 2015-09-28 03:40

Organizational culture is a very real and a very powerful force in every organization. I have worked in a variety of different organizations and each had had its own rituals, norms, values, and assumptions that influenced the way people worked together, shared information, and got things done. Culture is this weird, powerful, unspoken thing that both impacts and is impacted by the people within it. While organizational culture can change over time, it is usually because of major staff turnovers as culture is notoriously difficult to change.

Organizational culture can be positive and healthy or seriously maladaptive, but I think most cultures have a little from column A and a little from column B. Healthy cultures incorporate and adapt to new people and ideas. Maladaptive cultures are notoriously difficult for newcomers to feel welcome in and tend to force them to conform or leave. It’s in organizations with maladaptive cultures where I think the issue of cultural fit can be most problematic.

I know what it feels like to work at a place where you don’t fit. You feel like a second class citizen in just about every interaction. You go from participating in meetings to avoiding speaking at all costs. You feel like your perspective is not taken seriously and the projects you’re involved in are marginalized. There were a few of us at that job to whom it was made painfully clear that we were the odd men out. These were not slackers who did a crappy job, but folks who were passionate about and devoted to their work. Not fitting was torture for my psyche and made me question whether there was something inherently wrong with me.

Based on my experience, you might think I’d be suggesting that people carefully screen their applicants for “fit.” That couldn’t be further from the truth. Screening for cultural fit tends to lead to monocultures that don’t embrace diversity of any kind — racial, gender, perspective, experience, etc. Monocultures are toxic and have difficulty adapting to change. Hiring people in your own image leads to an organization that can’t see clearly beyond its navel. As expressed in the article “Is Cultural Fit a Qualification for Hiring or a Disguise for Bias?” in Knowledge @ Wharton

Diversity in the workplace has long been valued as a way to introduce new ideas, but researchers have found other reasons for cultivating heterogeneity. Information was processed more carefully in heterogeneous groups than homogenous groups, according to “Is the Pain Worth the Gain? The Advantages and Liabilities of Agreeing With Socially Distinct Newcomers,” by Katherine W. Phillips, Katie A. Liljenquist and Margaret A. Neale, published in Personality and Social Psychology Bulletin. Social awkwardness creates tension, and this is beneficial, the study found. “The mere presence of socially distinct newcomers and the social concerns their presence stimulates among old-timers motivates behavior that can convert affective pains into cognitive gains” — or, in other words, better group problem solving.

So perhaps bringing people in who aren’t such a perfect fit, and maybe even challenge the current structure a bit, is very good for the organization. Any time I have worked with someone who has a very different perspective and lived experience than I have, I have learned so much. I remember when we hired an instructional designer at the PSU Library who came from outside of libraries, I found that it was much more difficult to get on the same page, but the ideas he brought to our work more than compensated for any difficulties I had as a manager. He allowed us to see beyond our myopic librarian view. I think hiring people with different cultural, racial, gender, socioeconomic, etc. backgrounds provide similar benefits to the organization.

Whether it is conscious or unconscious, hiring people who are “like you” is bias, and it tends to result in organizations that are less diverse; not only in terms of perspectives, but in terms of race/gender/religion/etc. When you’re on a hiring committee, how often do you find yourself judging candidates based on qualities you value in a colleague rather than the stated qualifications? It probably happens more than we’d all like to admit.

It’s easy to fall into the trap of considering fit without even thinking about it. I remember when I was on my first hiring committee, once we’d weeded out those candidates who didn’t meet the minimum qualifications, I felt myself basing my evaluation of the rest on whether or not they had the traits I value in a colleague. The person we hired ended up becoming a good friend and while he did a fantastic job in his role, part of me wishes I had put my personal biases aside when making that decision. I may still have championed him, but I would have done it for the right reasons.

One thing I feel strongly that we should hire for is shared values. It is critical that the person one hires doesn’t hold values antithetical to the work of the organization. I don’t care anymore if a candidate seems like they could be a friend, but I do care if they evidence and support the goals and values of my library and community college. Just having the required qualifications isn’t enough; being a community college librarian isn’t for everyone.

Unfortunately, in reading this New York Times article, “Guess Who Doesn’t Fit In at Work”, and from my own experiences, people are judged by much more than shared values, which unintentionally biases people doing hiring against folks who have different lived experiences and interests. This is discrimination, plain and simple. When I was looking for an image to use for this post, I found this blog post about how people doing hiring should look at candidates’ social media profiles to scan for cultural fit. That we should look at what restaurants candidates visit and what things they favorite on Twitter frankly scares the crap out of me. Because in doing that, you’re saying that people with different views or outside-of-work activities are not welcome in your organization.

What we need is to embrace diversity in its many forms and value contributions from everyone, but that is easier said than done. I like the suggestions the New York Times article has regarding hiring for fit without bias:

First, communicate a clear and consistent idea of what the organization’s culture is (and is not) to potential employees. Second, make sure the definition of cultural fit is closely aligned with business goals. Ideally, fit should be based on data-driven analysis of what types of values, traits and behaviors actually predict on-the-job success. Third, create formal procedures like checklists for measuring fit, so that assessment is not left up to the eyes (and extracurriculars) of the beholder.

Finally, consider putting concrete limits on how much fit can sway hiring. Many organizations tell interviewers what to look for but provide little guidance about how to weigh these different qualities. Left to their own devices, interviewers often define merit in their own image.

Clearly, the more structured the process and the less leeway there is for making decisions based on aspects of a candidate’s personality, interests, and background, the less likely the bias.

And what of those cultures that may hire for diversity but then treat people with different ideas and experiences like pariahs? Unfortunately, I get the sense that changing culture is nearly impossible without a decent amount of staff turnover. I witnessed a culture shift in my first library job, but it was because my boss had hired over half the staff over a period of about six years and was able to cultivate the right mix of values and diverse characters. I’ve also seen new administrators come into organizations with really strong, entrenched cultures and fail spectacularly at creating any kind of culture change. Fixing the problem of bias in hiring is only half the problem. We also need to embrace diversity in our organizations so that people of color or people with divergent ideas feel valued by the organization.

I feel very lucky that I work at a library that values diversity and diverse perspectives. We have a group of librarians who have different passions, different viewpoints, and very different personalities. Yet I don’t see anyone marginalizing anyone else. I don’t see anyone whose opinions are taken less seriously than anyone else’s. I don’t see people playing favorites or being cliquish. What I see is a diverse group of people who value each other’s opinions and also value consensus-building. We don’t always come to complete agreement, but we accept and respect the way things go. We have a functional adhocracy, where we feel empowered to act and where we alternate taking and sharing leadership roles organically. I feel like everyone is valued for what they bring to the group and everyone brings something very different. Even after one year, it still feels like heaven to me and it’s certainly not because everyone is like me.

We have a long way to go in building diverse libraries, but becoming keenly aware of how our unconscious preferences in hiring and our organizational cultures can help or harm diversity is a good step in the right direction.

Image credit

Library Tech Talk (U of Michigan): The Once and Future Text Encoding Model

planet code4lib - Mon, 2015-09-28 00:00

Lately I’ve been looking back through the past of the Digital Library Production Service (DLPS) -- in fact, all the way back to the time before DLPS, when we were the Humanities Text Initiative -- to see what, if anything, we’ve learned that will help us as we move forward into a world of Hydra, ArchivesSpace, and collaborative development of repository and digital resource creation tools.

DuraSpace News: Telling DSpace Stories at the International Livestock Research Institute (ILRI) with Alan Orth

planet code4lib - Mon, 2015-09-28 00:00

“Telling DSpace Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing details about DSpace implementations for the community and beyond. The following interview includes personal observations that may not represent the opinions and views of  the International Livestock Research Institute (ILRI)  or the DSpace Project.

Tara Robertson: HUMAN: subtitles enhancing access and empathy

planet code4lib - Sun, 2015-09-27 18:28

I came across this video on a friend’s Facebook feed. I’m a chronic multitasker, but by half a minute in I stopped doing whatever else I was doing and just watched and listened. This is the part that grabbed my heart:

This is my star. I had to wear it on my chest, of course, like all the Jews. It’s big, isn’t it? Especially for a child. That was when I was 8 years old.

Francine Christophe’s voice was also very powerful and moved me. She enunciates each word so clearly. My French isn’t great, but she speaks slowly and clearly enough that I can understand her. Also, the subtitles confirm that I’m understanding correctly and reinforce what she’s saying.

I noticed that there was something different about the subtitles. The font is clear and elegant and the words are positioned in the blank space beside her face. I can watch her face and her eyes while I read the subtitles. My girlfriend reminded me of something I had said when I was reviewing my Queer ASL lesson at home. In ASL I learned that when fingerspelling you position your hand up by your face, as your face (especially your eyebrows) are part of the language. Even when we speak English our faces communicate so much.

I’ve seen a bunch of these short videos from this film. They are everyday people telling amazing stories about the huge range of experiences that we experience on this planet. The people who are filmed are from all over the world, and speaking in various languages. The design decision to shoot people with enough space to put the subtitles beside them is really powerful. For me the way the subtitles are done enhances the feeling of empathy.

A couple of weeks ago I was at a screening event of Microsoft’s Inclusive video at OCAD in Toronto. In the audience were many students of the Inclusive Design program who were in the video. One of the students asked if the video included description of the visuals for blind and visually impaired viewers. The Microsoft team replied that it didn’t and that often audio descriptions were distracting for viewers who didn’t need them. The student asked if there could’ve been a way to weave the audio description into the interviews, perhaps by asking the people who were speaking to describe where they were and what was going on, instead of tacking on the audio description afterwards. I love this idea.

HUMAN is very successful in skillfully including captions that are beautiful, enhance the storytelling, provide access to Deaf and Hard of Hearing people, provide a way for people who know a bit of the language to follow along with the story as told in the storyteller’s mother tongue, and make it easy to translate the film into other languages. I’m going to include this example in the work we’re doing around universal design for learning with the BC Open Textbook project.

I can’t wait to see the whole film of HUMAN. I love the stories that they are telling and the way that they are doing it.

Eric Hellman: Weaponization of Library Resources

planet code4lib - Sun, 2015-09-27 01:17
This post needs a trigger warning. You probably think the title indicates that I've gone off the deep end, or that this is one of my satirical posts. But read on, and I think you'll agree with me, we need to make sure that library resources are not turned into weapons. I'll admit that sounds ludicrous, but it won't after you learn about "The Great Cannon" and "QUANTUM".

But first, some background. Most of China's internet connects to the rest of the world through what's known in the rest of the world as "the Great Firewall of China". Similar to network firewalls used for most corporate intranets, the Great Firewall is used as a tool to control and monitor internet communications in and out of China. Websites that are deemed politically sensitive are blocked from view inside China. This blocking has been used against obscure and prominent websites alike. The New York Times, Google, Facebook and Twitter have all been blocked by the firewall.

When web content is unencrypted, it can be scanned at the firewall for politically sensitive terms such as "June 4th", a reference to the Tiananmen Square protests, and blocked at the webpage level. China is certainly not the only entity that does this; many school systems in the US do the same sort of thing to filter content that's considered inappropriate for children. Part of my motivation for working on the "Library Digital Privacy Pledge" is that I don't think libraries and publishers who provide online content to them should be complicit in government censorship of any kind.

Last March, however, China's Great Firewall was associated with an offensive attack. To put it more accurately, software co-located with China's Great Firewall turned innocent users of unencrypted websites into attack weapons. The targets of the attack were GreatFire.org, a website that works to provide Chinese netizens a way to evade the surveillance of the Great Firewall, and GitHub, the website that hosts code for hundreds of thousands of programmers, including those supporting GreatFire.org.

In August, Bill Marczak and co-workers from Berkeley, Princeton and Citizen Lab presented their findings on the Great Cannon at the 5th USENIX Workshop on Free and Open Communications on the Internet. Here's how the Great Cannon operated: it acted as a "man-in-the-middle"[*] to intercept the communications of users outside China with servers inside China. Javascripts that collected advertising and usage data for Baidu, the "Chinese Google", were replaced with weaponized javascripts. These javascripts, running in the browsers of internet users outside China, then mounted the denial-of-service attack on GreatFire.org and GitHub.

China was not the first to weaponize unencrypted internet traffic. Marczak et al. write:

Our findings in China add another documented case to at least two other known instances of governments tampering with unencrypted Internet traffic to control information or launch attacks—the other two being the use of QUANTUM by the US NSA and UK’s GCHQ.[reference] In addition, product literature from two companies, FinFisher and Hacking Team, indicate that they sell similar “attack from the Internet” tools to governments around the world [reference]. These latest findings emphasize the urgency of replacing legacy web protocols like HTTP with their cryptographically strong counterparts, such as HTTPS.

It's worth thinking about how libraries and the resources they offer might be exploited by a man-in-the-middle attacker. Science journals might be extremely useful in targeting espionage scripts at military facilities, for example. A saboteur might alter reference technical information used by a chemical or pharmaceutical company with potentially disastrous consequences. It's easy to see why any publisher that wants its information to be perceived as reliable has no choice but to start encrypting their services now.

The unencrypted services of public libraries are attractive targets for other sorts of mischief, ironically because of their users' trust in them and because they have a reputation for protecting privacy. Think about how many users would enter their names, phone numbers, and last four digits of their social security numbers if a library website seemed to ask for it. When a website is unencrypted, it's possible for "man-in-the-middle" attacks to insert content into an unencrypted web page coming from a library or other trusted website. An easy way for an attacker to get into position to execute such an attack is to spoof a wifi network, for example in a cafe or other public space, such as a library. It doesn't help if only a website's login is encrypted if an attacker can easily insert content into the unencrypted parts of the website.

To be clear, we don't know that libraries and the type of digital resources they offer are being targeted for weaponization, espionage or other sorts of mischief. Unfortunately, the internet offers a target-rich environment of unencrypted websites.

I believe that libraries and their suppliers need to move swiftly to take the possibility off the table and help lead the way to a more secure digital environment for us all.

[note: Technically, the Great Cannon executed a "man-on-the-side" variant of a "man-in-the-middle" attack, not unlike the NSA's "QuantumInsert" attack revealed by Edward Snowden.]

Terry Reese: Automatic Headings Correction–Validate Headings

planet code4lib - Sat, 2015-09-26 21:24

After about a month of working with the headings validation tool, I’m ready to start adding a few enhancements to provide some automated headings corrections. The first change to be implemented will be automatic correction of headings where the preferred heading is different from the in-use heading. This will be implemented as an optional element. If this option is selected, the report will continue to note variants as part of the validation report, but when exporting data for further processing, automatically corrected headings will not be included in the record sets for further action.

Additionally, I’ll continue to look at ways to improve the speed of the process. While there are some limits to what I can do since this tool relies on a web service (outside of providing an option for users to download the ~10 GB worth of LC data locally), there are a few things I can do to continue to ensure that only new items are queried when resolving links.

These changes will be made available on the next update.


FOSS4Lib Recent Releases: Avalon Media System - 4.0.1

planet code4lib - Sat, 2015-09-26 18:50

Last updated September 26, 2015. Created by Peter Murray on September 26, 2015.

Package: Avalon Media System
Release Date: Thursday, September 24, 2015

Karen G. Schneider: The importance of important questions

planet code4lib - Sat, 2015-09-26 12:43

Pull up a chair and set a while: I shall talk of my progress in the doctoral program; my research interests, particularly LGBT leadership; the value of patience and persistence; Pauline Kael; and my thoughts on leadership theory. I include a recipe for  cupcakes. Samson, my research assistant, wanted me to add something about bonita flakes, but that’s really his topic.

My comprehensive examinations are two months behind me: two four-hour closed-book exams, as gruesome as it sounds. Studying for these exams was a combination of high-level synthesis of everything I had learned for 28 months and rote memorization of barrels of citations. My brain was not feeling pretty.

I have been re-reading the qualifying paper I submitted earlier this year, once again feeling grateful that I had the patience and persistence to complete and then discard two paper proposals until I found my research beshert, about the antecedents and consequences of sexual identity disclosure for academic library directors. That’s fancy-talk for a paper that asked, why did you come out, and what happened next? The stories participants shared with me were nothing short of wonderful.

As the first major research paper I have ever completed, it is riddled with flaws. At 60–no, now, 52–pages, it is also an unpublishable length, and I am trying to identify what parts to chuck, recycle, or squeeze into smaller dress sizes, and what would not have to be included in a published paper anyway.

But if there is one thing I’ve learned in the last 28 months, it is that it is wise to pursue questions worth pursuing.  I twice made the difficult decision to leave two other proposals on the cutting-room floor, deep-sixing many months of effort. But in the end that meant I had a topic I could live with through the long hard slog of data collection, analysis, and writing, a topic that felt so fresh and important that I would mutter to myself whilst working, “I’m in your corner, little one.”

As I look toward my dissertation proposal, I find myself again (probably, but not inevitably) drawn toward LGBT leadership–even more so when people, as occasionally happens, question this direction. A dear colleague of mine questioned the salience of one of the themes that emerged from my study, the (not unique) idea of being “the only one.” Do LGBT leaders really notice when they are the only ones in any group setting, she asked? I replied, do you notice when you’re the only woman in the room? She laughed and said she saw my point.

The legalization of same-gender marriage has also resulted in some hasty conclusions by well-meaning people, such as the straight library colleague from a liberal coastal community who asked me if “anyone was still closeted these days.” The short answer is yes. A  2013 study of over 800 LGBT employees across the United States found that 53 percent of the respondents hide who they are at work.

But to unpack my response requires recalling Pauline Kael’s comment about not knowing anyone who voted for Nixon (a much wiser observation than the mangled quote popularly attributed to her): “I live in a rather special world. I only know one person who voted for Nixon. Where they are I don’t know. They’re outside my ken. But sometimes when I’m in a theater I can feel them.” 

In my study, I’m pleased to say, most of the participants came from outside that “rather special world.”  I recruited participants through calls to LGBT-focused discussion lists which were then “snowballed” out to people who knew people who knew people, and to quote an ancient meme, “we are everywhere.” The call for participation traveled several fascinating degrees of separation. If only I could have chipped it like a bird and tracked it! As it was, I had 10 strong, eager participants who generated 900 minutes of interview data, and the fact that most were people I didn’t know made my investigation that much better.

After the data collection period for my research had closed, I was occasionally asked, “Do you know so-and-so? You should use that person!” In a couple of cases colleagues complained, “Why didn’t you ask me to participate?” But I designed my study so that participants had to elect to participate during a specific time period, and they did; I had to turn people away.

The same HRC study I cite above shrewdly asked questions of non-LGBT respondents, who revealed their own complicated responses to openly LGBT workers. “In a mark of overall progress in attitudinal shifts, 81% of non-LGBT people report that they feel LGBT people ‘should not have to hide’ who they are at work. However, less than half would feel comfortable hearing an LGBT coworker talk about their social lives, dating or related subject.” I know many of you reading this are “comfortable.” But you’re part of my special world, and I have too much experience outside that “special world” to be surprised by the HRC’s findings.

Well-meaning people have also suggested more than once that I study library leaders who have not disclosed their sexual identity. Aside from the obvious recruitment issues, I’m far more interested in the interrelationship between disclosure and leadership. There is a huge body of literature on concealable differences, but suffice it to say that the act of disclosure is, to quote a favorite article, “a distinct event in leadership that merits attention.” Leaders make decisions all the time; electing to disclose–an action that requires a million smaller decisions throughout life and across life domains–is part of that decision matrix, and inherently an important question.

My own journey into research

If I were to design a comprehensive exam for the road I have been traveling since April 2013, it would be a single, devilish open-book question to be answered over a weekend: describe your research journey.

Every benchmark in the doctoral program was a threshold moment for my development. Maybe it’s my iconoclast spirit, but I learned that I lose interest when the chain of reasoning for a theory traces back to prosperous white guys interviewing prosperous white guys, cooking up less-than-rigorous theories, and offering prosperous-white-guy advice. “Bring more of yourself to work!” Well, see above for what happens to some LGBT people when they bring more of themselves to work. It’s true that the participants in my study did just that, but it was with an awareness that authenticity has its price as well as its benefits.

The more I poked at some leadership theories, the warier I became. Pat recipes and less-than-rigorous origin stories do not a theory make. (Resonant leadership cupcakes: stir in two cups of self-awareness; practice mindfulness, hope, and compassion; bake until dissonance disappears and renewal is evenly golden.) Too many books on leadership "theory" provide reasonable and generally useful recommendations for how to function as a leader, but are so theoretically flabby that, had they been written by women, they would be labeled self-help books.

(If you feel cheated because you were expecting a real cupcake recipe, here’s one from Cook’s Catalog, complete with obsessive fretting about what makes it a good cupcake.)

I will say that I would often study a mainstream leadership theory and then see it in action at work. I had just finished boning up on Theory X and Theory Y when someone said to me, with an eye-roll no less, "People don't change." Verily, the scales fell from my eyes and I revisited moments in my career where a manager's X-ness or Y-ness had significant implications. (I have also asked myself if "Theory X" managers can change, which is an X-Y test in itself.) But there is a difference between finding a theory useful and pursuing it in research.

I learned even more when I deep-sixed my second proposal, a “close but no cigar” idea that called for examining a well-tested theory using LGBT leader participants. The idea has merit, but the more I dug into the question, the more I realized that the more urgent question was not how well LGBT leaders conform to predicted majority behavior, but instead the very whatness of the leaders themselves, about which we know so little.

It is no surprise that my interest in research methods also evolved toward exploratory models such as grounded theory and narrative inquiry that are designed to elicit meaning from lived experience. Time and again I would read a dissertation where an author was struggling to match experience with predicated theory when the real findings and “truth” were embedded in the stories people told about their lives. To know, to comprehend, to understand, to connect: these stories led me there.

Bolman and Deal’s “frames” approach also helped me diagnose how and why people are behaving as they are in organizations, even if you occasionally wonder, as I do, if there could be another frame, or if two of the frames are really one frame, or even if “framing” itself is a product of its time.

For that matter, mental models are a useful sorting hat for leadership theorists. Schein and Bolman see the world very differently, and the structure of their advice about organizational excellence follows suit. Which brings me back to the question of my own research into LGBT leadership.

In an important discussion about the need for LGBT leadership research, Fassinger, Shullman, and Stevenson get props for (largely) moving the barycenter of LGBT leadership questions from the conceptual framework of being acted upon toward questions about the leaders themselves and their complex, agentic decisions and interactions with others. Their discussion of the role of situation feels like an enduring truth: “in any given situation, no two leaders and followers may be having the same experience, even if obvious organizational or group variables appear constant.”

What I won’t do is adopt their important article on directions for LGBT leadership research as a Simplicity dress pattern for my leadership research agenda. They created a model; well, you see, I am cautious about models. Even my own findings are at best a product of people, time, and place, intended to be valid in the way that all enlightenment is valid, but not deterministic.

So on I go, into the last phase of the program. In this post I have talked about donning and discarding theories as if I had all the time in the world, which is not how I felt in this process at all. It was the most agonizing exercise in patience and persistence I’ve ever had, and I questioned myself along the entire path. I relearned key lessons from my MFA in writing: some topics are more important than others; there is always room for improvement; writing is a process riddled with doubt and insecurity; and there is no substitute for sitting one’s behind in a chair and writing, then rewriting, then writing and rewriting some more.

So the flip side of my self-examination is that I have renewed appreciation for the value of selecting a good question and a good method, and pressing on until done. I have no intention of repeating my Goldilocks routine.

Will my dissertation be my best work? Two factors suggest otherwise. First, I have now read countless dissertations where somewhere midway in the text the author expresses regret, however subdued, that he or she realized too late that the dissertation had some glaring flaw that could not be addressed without dismantling the entire inquiry. Second, though I don’t know that I’ve ever heard it expressed this way, from a writer’s point of view the dissertation is a distinct genre. I have become reasonably comfortable with the “short story” equivalent of the dissertation. But three short stories do not a novel make, and rarely do one-offs lead to mastery of a genre.

But I will at least be able to appreciate the problem for what it is: a chance to learn, and to share my knowledge; another life experience in the “press on regardless” sweepstakes; and a path toward a goal: the best dissertation I will ever write.


Nicole Engard: Bookmarks for September 25, 2015

planet code4lib - Fri, 2015-09-25 20:30

Today I found the following resources and bookmarked them on Delicious.

  • iDoneThis Reply to an evening email reminder with what you did that day. The next day, get a digest with what everyone on the team got done.

Digest powered by RSS Digest

The post Bookmarks for September 25, 2015 appeared first on What I Learned Today....

Related posts:

  1. Another reason I want my MLIS
  2. Get organized
  3. Reminder: Carnival Submissions

SearchHub: Pushing the Limits of Apache Solr at Bloomberg

planet code4lib - Fri, 2015-09-25 17:17
As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Anirudha Jadhav’s session on going beyond the conventional constraints of Solr.

The goal of the presentation is to delve into the implementation of Solr, with a focus on how to optimize Solr for big data search. Solr implementations are frequently limited to 5k-7k ingest rates in similar use cases. I conducted several experiments to increase the ingest rate as well as the throughput of Solr, and achieved a 5x increase in performance, or north of 25k documents per second. Typically, optimizations are limited by the available network bandwidth. I used three key metrics to benchmark the performance of my Solr implementation: time triggers, document size triggers, and document count triggers (see the sketch below). The talk will delve into how I optimized the search engine, and how my peers can coax similar performance out of Solr. This is intended to be an in-depth description of the high-frequency search implementation, with Q&A with the audience. All implementations described here are based on the latest SolrCloud multi-datacenter setups.

Anirudha Jadhav is a big data search expert who has architected and deployed arguably one of the world’s largest Lucene-based search deployments, tipping the scales at a little over 86 billion documents for Bloomberg LP. He has deep expertise in building financial applications, high-frequency trading and search applications, as well as in solving complex search and ranking problems. In his free time, he also enjoys scuba diving, off-road treks with his 18th-century British Army motorbike, building tri-copters, and underwater photography. Anirudha earned his Master’s in Computer Science from the Courant Institute of Mathematical Sciences, New York University.

Never Stop Exploring – Pushing the Limits of Solr: Presented by Anirudha Jadhav, Bloomberg L.P. from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…
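For readers who want a concrete picture of what those three triggers can look like in practice, here is a minimal client-side sketch in Python using the third-party pysolr library. It is an illustration only, not the speaker’s actual implementation: the Solr URL, core name, field names, and threshold values are assumptions, and in a real deployment the same time and size triggers are usually also tuned server-side (autoCommit and ramBufferSizeMB in solrconfig.xml).

# Illustrative batching sketch (not from the talk): buffer documents and flush
# to Solr when any of three triggers fires -- elapsed time, batch byte size,
# or batch document count. Assumes a reachable core and the pysolr client.
import json
import time

import pysolr

SOLR_URL = "http://localhost:8983/solr/collection1"  # assumed URL/core name
MAX_DOCS = 5000                 # document count trigger
MAX_BYTES = 8 * 1024 * 1024     # document (batch) size trigger, ~8 MB
MAX_SECONDS = 10.0              # time trigger

solr = pysolr.Solr(SOLR_URL, timeout=30)

batch = []
batch_bytes = 0
last_flush = time.monotonic()

def flush():
    """Send the buffered batch without forcing a hard commit per request."""
    global batch, batch_bytes, last_flush
    if batch:
        solr.add(batch, commit=False)  # rely on server-side commit policy
        batch = []
        batch_bytes = 0
    last_flush = time.monotonic()

def index(doc):
    """Buffer one document; flush when any trigger condition is met."""
    global batch_bytes
    batch.append(doc)
    batch_bytes += len(json.dumps(doc))
    if (len(batch) >= MAX_DOCS
            or batch_bytes >= MAX_BYTES
            or time.monotonic() - last_flush >= MAX_SECONDS):
        flush()

if __name__ == "__main__":
    for i in range(20000):
        index({"id": str(i), "title_t": "document %d" % i})  # assumed field name
    flush()        # drain anything still buffered
    solr.commit()  # single hard commit at the end of the run

Sending documents in batches and avoiding a hard commit per request is typically the first lever for raising ingest rates; the talk itself goes much further, into SolrCloud and multi-datacenter territory.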

The post Pushing the Limits of Apache Solr at Bloomberg appeared first on Lucidworks.

DPLA: Help the Copyright Office Understand How to Address Mass Digitization

planet code4lib - Fri, 2015-09-25 14:46

Guest post by Dave Hansen, a Clinical Assistant Professor and Faculty Research Librarian at the University of North Carolina’s School of Law, where he runs the library’s faculty research service.

Wouldn’t libraries and archives like to be able to digitize their collections and make the texts and images available to the world online? Of course they would, but copyright inhibits this for most works created in the last 100 years.

The U.S. Copyright Office recently issued a report and a request for comments on its proposal for a new licensing system intended to overcome copyright obstacles to mass digitization. While the goal is laudable, the Office’s proposal is troubling and vague in key respects.

The overarching problem is that the Office’s proposal doesn’t fully consider how libraries and archives currently go about digitization projects, and so it misidentifies how the law should be improved to allow for better digital access. It’s important that libraries and archives submit comments to help the Office better understand how to make recommendations for improvements.

Below is a summary of the Office’s proposal and five specific reasons why libraries and archives should have reservations about it. I strongly encourage you to read the proposal and Notice of Inquiry closely and form your own judgment about it.

For commenting, a model letter is available here (use this form to fill in basic information), but you should tailor it with details that are important to your institution. Comments are due to the Copyright Office by October 9, 2015. The comment submission page is here.

The Copyright Office’s Licensing Proposal

The Copyright Office’s proposal is that Congress enact a five year pilot “extended collective licensing” (ECL) system that would allow collecting societies (e.g., the Authors Guild, or the Copyright Clearance Center) to grant licenses for mass digitization for nonprofit uses.

Societies could, in theory, already grant mass digitization licenses for works owned by their members. The Office’s proposed ECL system would allow collecting societies to go beyond that, and also grant licenses for all works that are similar to those owned by their members, even if the owners of those similar works are not actually members of the collective themselves. That’s the “extended” part of the license; Congress would, by statute, extend the society’s authority to grant licenses on behalf of both members and non-members alike. Such a system would help to solve one of the most difficult copyright problems libraries and archives face: tracking down rights holders. Digitizers would instead need only to negotiate and purchase a license from the collecting societies, simplifying the rights clearance process.

Why the Copyright Office’s Proposal is Troubling

In the abstract, the Office’s proposal sounds appealing. But for digitizing libraries and archives, the details make it troubling for these five reasons:

First, the proposal doesn’t address the types of works that libraries and archives are working hardest to preserve and make available online—unique collections that include unpublished works such as personal letters or home photographs. Instead of focusing on these works for which copyright clearance is hardest to obtain, the proposal applies to only three narrow categories: 1) published literary works, 2) published embedded pictorial or graphic works, and 3) published photographs.

Second, given the large variety of content types in the collections that libraries and archives want to digitize—particularly special collections that include everything from unpublished personal papers, to out-of-print books, to government works—there is no one collecting society that could ever offer a license for mass digitization of entire collections. If seeking a license, libraries and archives would still be forced to negotiate with a large number of parties. And because the proposed ECL pilot would include only published works, large sections of collections would remain unlicensable anyway.

Third, digitization is an expensive investment. Because the system would be a five-year pilot project, few libraries or archives would be able to pay what it will cost to digitize works (not to mention ECL license fees) if those works have to be taken offline in a few years when the ECL system expires.

Fourth, for an ECL system to truly address the costs of clearing rights, it would need to include licensing orphan works (works whose owners cannot be located) alongside all other works. While the Copyright Office acknowledges in one part of its report that licensing of orphan works doesn’t make sense because it would require payment of fees that would never go to owners, it later specifies an ECL system that would do just that. The Society of American Archivists said it best in its comments to the Copyright Office: “[R]epositories that are seeking to increase access to our cultural heritage generally have no surplus funds. . . . Allocating those funds in advance to a licensing agency that will only rarely disperse them would be wasteful, and requiring such would be irresponsible from a policy standpoint.”

Finally, one of the most unsettling things about the ECL proposal is its threat to the one legal tool that is currently working for mass digitization: fair use. To be clear, fair use doesn’t work for all mass digitization uses. But it likely does address many of the uses that libraries and archives are most concerned with, including nonprofit educational uses of orphan works, and transformative use of special collections materials.

The Office recognized concerns about fair use in its report, and in response proposed a “fair use savings clause” that would state that “nothing in the [ECL] statute is intended to affect the scope of fair use.” Even with an effective savings clause, the existence of the ECL system alone could shrink the fair use right because fewer users might rely on it in favor of more conservative licensing. As legal scholars have observed, fair use is like a muscle: its strength depends in part on how it is used.

Rather than focus its energy on creating a licensing system that can only reach a small segment of library and archive collections, the Office should instead promote the use of legal tools that are working, such as fair use, and work to solve the underlying rights-clearance problems by helping to create better copyright information registries and by studying solutions that would encourage rightsholders to make themselves easier for potential users of their works to find.

