Feed aggregator

Islandora: Islandora Fundraising

planet code4lib - Mon, 2015-11-16 15:00

The Islandora Foundation is growing up. As a member-supported nonprofit, we have been very fortunate to have the support of more than a dozen wonderful universities, private companies, and like-minded projects - enough support that within our first year of operation, we were solvent. As of 2015, we now have a small buffer in our budget, which is a comfortable place to be.

But comfortable isn't enough. Not when our mission is to steward the Islandora project and ensure that it is the best software that it can be. With the launch of Fedora 4 last December, we started work on a version of Islandora that would work with this new major upgrade to the storage layer of our sites, recognizing that our community is going to want and need to move on to Fedora 4 someday and we had better be ready with a front-end for them when the time comes. Islandora 7.x-2.x was developed to the prototype stage with special funding from some of our supporters, and development continues by way of volunteer sprints. Meanwhile, Islandora 7.x-1.x (which works with Fedora 3) continues to be supported and improved - also by volunteers.

It's a lot to coordinate, and we have determined through consultation with our interest groups, committees, and the community in general that in order to do this right, we need to have someone with the right skill set dedicated to coordinating these projects. We need a Tech Lead.

Right now, the Islandora Foundation has a single employee (*waves*). I am the Project & Community Manager, which means I work to support community groups and initiatives, organize Islandora events, handle communications (both public and private) and promotions, and just generally do everything I can to help our many wonderful volunteers do the work that keeps this project thriving. We've been getting by with that because many of the duties that would belong to a Tech Lead have been fulfilled by members of the community on a volunteer basis, but we are swiftly outgrowing that model. The Fedora 4 project that inspired us to take on a new major version of Islandora has had great success with a two-person team of employees (plus many groups for guidance): Product Manager David Wilcox (more or less my counterpart) and Tech Lead Andrew Woods.

Now to the point: we need money. We have a confirmed membership revenue of $86,000 per year*, which is plenty for one employee plus some travel and general expenses, but not enough to hire this second position that we need to get the project to the next level. About a month ago I contacted many of the institutions in our community to see if they could consider becoming members of the Islandora Foundation, and we had a gratifying number of hopeful responses (thank you to those folks!), but we're still short of where we need to be. 

And so, the Funding Lobster (or Lobstometre). In the interest of transparency, and perhaps as motivation, this little guy is showing you exactly where things stand with our Tech Lead goal. If we get $160,000 in memberships we can do it (but we'll be operating without a net), $180,000 and we're solid, and if we hit $200,000 or above that's just unmitigated awesome (and would get turned into special projects, events, and other things to support the community). He's the Happy Lobster, and not the Sad Lobster, because we do believe we'll get there with your help, and soon.

How can you help? Become a member. While it would be great if we could frame this as a funding drive and take one-time donations, since the goal is to hire a real live human being who will want to know that they can pay their rent and eat beyond their first year of employment, we need to look for renewable commitments. Our membership levels are as follows:

Institutional Membership:

  • Member - $2000
  • Collaborator - $4000
  • Partner - $10,000

Individual Membership:

  • $10 - $250+ (at your discretion)

There are many benefits to membership, including things like representation on governing committees and discounts at events. Check out the member page or drop me an email if you want to know more.

Many thanks,

- Melissa

* Some of our members were able to allocate more funding to support 7.x-2.x development than their typical membership dues. It is currently unknown how many will be able to maintain that funding level at renewal, but yearly membership revenue could be as high as $122,000. I went with the number we can be sure of.

Mark E. Phillips: Finding figures and images in Electronic Theses and Dissertations (ETD)

planet code4lib - Mon, 2015-11-16 14:17

One of the things that we are working on at UNT is a redesign of The Portal to Texas History’s interface.  In doing so I’ve been looking around quite a bit at other digital libraries to get ideas of features that we could incorporate into our new user experience.

One feature I found pretty nifty was the "peek" interface for the Carolina Digital Repository. They make the code for this interface available to others via the UNC Libraries GitHub in the peek repository. I think this is an interesting interface, but I was still left with the question "how did you decide which images to choose?". I came across the peek-data repository, which suggested that the choosing of images was a manual process, and I also found a PowerPoint presentation titled "A Peek Inside the Carolina Digital Repository" by Michael Daines that confirmed this is the case. These slides are a few years old, so I don't know if the process is still manual.

I really like this idea and would love to try to implement something similar for some of our collections, but the thought of manually choosing images doesn't sound like fun at all. I looked around a bit to see if I could borrow from some prior work that others have done. I know that the Internet Archive and the British Library have released some large image datasets that appear to be the "interesting" images from books in their collections.

Less and More interesting images

I ran across a blog post called "Extracting images from scanned book pages" by Chris Adams, who works on the World Digital Library at the Library of Congress. It seemed to be close to what I wanted to do, but wasn't exactly it either.

I remembered a Code4Lib lightning talk from a few years back by Eric Larson called "Finding image in book page images", and the companion GitHub repository picturepages that contains the code he used. In reviewing the slides and looking at the code, I think I found what I was looking for, or at least a starting point.

Process

What Eric proposed for finding interesting images was that you would take an image, convert it to grayscale, increase the contrast dramatically, convert this new image into a single-pixel-wide image that is 1500 pixels tall, and sharpen it. The resulting image would then be inverted, have a threshold applied to convert every pixel to black or white, and be inverted again. Finally, the resulting black and white pixel values are analyzed to see if there are areas of the image that are 200 or more pixels long that are solid black.

convert #{file} -colorspace Gray -contrast -contrast -contrast -contrast -contrast -contrast -contrast -contrast -resize 1X1500! -sharpen 0x5 miff:- | convert - -negate -threshold 0 -negate TXT:#{filename}.txt

The command above uses ImageMagick to convert an input image to greyscale, calls contrast eight times, resizes the image, and then sharpens the result. It pipes this output into convert again, which inverts the colors, applies a threshold, and inverts the colors back. The output is saved as a text file instead of an image, with one line per pixel. It looks like this:

# ImageMagick pixel enumeration: 1,1500,255,srgb
...
0,228: (255,255,255) #FFFFFF white
0,229: (255,255,255) #FFFFFF white
0,230: (255,255,255) #FFFFFF white
0,231: (255,255,255) #FFFFFF white
0,232: (0,0,0) #000000 black
0,233: (0,0,0) #000000 black
0,234: (0,0,0) #000000 black
0,235: (255,255,255) #FFFFFF white
0,236: (255,255,255) #FFFFFF white
0,237: (0,0,0) #000000 black
0,238: (0,0,0) #000000 black
0,239: (0,0,0) #000000 black
0,240: (0,0,0) #000000 black
0,241: (0,0,0) #000000 black
...

The next step was to loop through each of the lines in the file to see if there was a sequence of 200 black pixels.
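
As a rough illustration of that check (this is not Eric's original code, and the helper name is just illustrative), a few lines of Python can scan the ImageMagick text output for a run of 200 consecutive black pixels:

# Illustrative sketch, not the original script: scan an ImageMagick "TXT:"
# pixel enumeration for a run of 200 or more consecutive black pixels.
def has_long_black_run(txt_path, run_length=200):
    longest = 0
    current = 0
    with open(txt_path) as f:
        for line in f:
            if line.startswith("#"):      # skip the header line
                continue
            if "#000000" in line:         # this pixel is black
                current += 1
                longest = max(longest, current)
            else:                         # any non-black pixel resets the run
                current = 0
    return longest >= run_length

# e.g. has_long_black_run("page.txt") returns True for an "interesting" page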

I pulled a set of images from an ETD that we have in the UNT Digital Library and tried a Python port of Eric's code that I hacked together. Things worked pretty well for me; it was able to identify the images that I would have manually pulled as "interesting" pages on my own.

But I ran into a problem: the process was pretty slow.

I pulled a few more sets of page images from ETDs and found that for those images it would take the ImageMagick convert process up to 23 seconds per image to create the text files that I needed to work with. This made me ask if I could implement this same sort of processing workflow with just Python.

I need a Pillow

I have worked with the Python Imaging Library (PIL) a few times over the years and had a feeling it could do what I was interested in doing. I ended up using Pillow, which is a "friendly fork" of the original PIL library. My thought was to apply the same processing workflow as was carried out in Eric's script and see if doing it all in Python would be reasonable.

I ended up with an image processing workflow that looks like this:

from PIL import Image, ImageOps, ImageEnhance, ImageFilter

# Open image file (filename is the path to a page image)
im = Image.open(filename)
# Convert image to grayscale image
g_im = ImageOps.grayscale(im)
# Create enhanced version of image using aggressive Contrast
e_im = ImageEnhance.Contrast(g_im).enhance(100)
# resize image into a tiny 1x1500 pixel image
# ANTIALIAS, BILINEAR, and BICUBIC work, NEAREST doesn't
t_im = e_im.resize((1, 1500), resample=Image.BICUBIC)
# Sharpen skinny image file
st_im = t_im.filter(ImageFilter.SHARPEN)
# Invert the colors
it_im = ImageOps.invert(st_im)
# If a pixel isn't black (0), make it white (255)
fixed_it_im = it_im.point(lambda x: 0 if x < 1 else 255, 'L')
# Invert the colors again
final = ImageOps.invert(fixed_it_im)
final.show()

I was then able to iterate through the pixels in the final image with the getdata() method and apply the same logic of identifying images that have sequences of black pixels that were over 200 pixels long.
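
A minimal sketch of that pass, assuming the final image produced by the workflow above (the 200-pixel threshold is the same one used with the ImageMagick output; the function name is just illustrative):

# Illustrative sketch, not the full script: apply the run-length test to the
# 1x1500 black-and-white Pillow image ("final") from the workflow above.
def is_interesting(final, run_length=200):
    longest = 0
    current = 0
    for value in final.getdata():    # pixel values are 0 (black) or 255 (white)
        if value == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest >= run_length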

Here are some examples of thumbnails from three ETDs,  first all images and then just the images identified by the above algorithm as “interesting”.

Example One

Thumbnails for ark:/67531/metadc699990/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699990/ with just visually interesting pages shown.


Example Two

Thumbnails for ark:/67531/metadc699999/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699999/ with just visually interesting pages shown.

Example Three

Thumbnails for ark:/67531/metadc699991/ including interesting and less visually interesting pages.

Thumbnails for ark:/67531/metadc699991/ with just visually interesting pages.

So in the end I was able to implement the code in Python with Pillow and a fancy little lambda function.  The speed was much improved as well.  For those same images that were taking up to 23 seconds to process with the ImageMagick version of the workflow,  I was able to process them in a tiny bit over a second with this Python version.

The full script I was using for these tests is below. You will need to download and install Pillow in order to get it to work.

I would love to hear other ideas or methods to do this kind of work. If you have thoughts or suggestions, or if I missed something, please let me know via Twitter.


D-Lib: Reminiscing About 15 Years of Interoperability Efforts

planet code4lib - Mon, 2015-11-16 14:14
Opinion by Herbert Van de Sompel, Los Alamos National Laboratory and Michael L. Nelson, Old Dominion University

D-Lib: MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide

planet code4lib - Mon, 2015-11-16 14:14
Article by Vetle I. Torvik, University of Illinois at Urbana-Champaign

D-Lib: Structured Affiliations Extraction from Scientific Literature

planet code4lib - Mon, 2015-11-16 14:14
Article by Dominika Tkaczyk, Bartosz Tarnawski and Lukasz Bolikowski, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland

D-Lib: PubIndia: A Framework for Analyzing Indian Research Publications in Computer Science

planet code4lib - Mon, 2015-11-16 14:14
Article by Mayank Singh, Soumajit Pramanik and Tanmoy Chakraborty, Indian Institute of Technology, Kharagpur, India

D-Lib: Using Scenarios in Introductory Research Data Management Workshops for Library Staff

planet code4lib - Mon, 2015-11-16 14:14
Article by Sam Searle, Griffith University, Brisbane, Australia

D-Lib: Collaborative Construction of Digital Cultural Heritage: A Synthesis of Research on Online Sociability Determinants

planet code4lib - Mon, 2015-11-16 14:14
Article by Chern Li Liew, Victoria University of Wellington, New Zealand

D-Lib: Semantometrics in Coauthorship Networks: Fulltext-based Approach for Analysing Patterns of Research Collaboration

planet code4lib - Mon, 2015-11-16 14:14
Article by Drahomira Herrmannova, KMi, The Open University and Petr Knoth, Mendeley Ltd.

D-Lib: Efficient Table Annotation for Digital Articles

planet code4lib - Mon, 2015-11-16 14:14
Article by Matthias Frey, Graz University of Technology, Austria and Roman Kern, Know-Center GmbH, Austria

D-Lib: NLP4NLP: The Cobbler's Children Won't Go Unshod

planet code4lib - Mon, 2015-11-16 14:14
Article by Gil Francopoulo, IMMI-CNRS + TAGMATICA, France; Joseph Mariani, IMMI-CNRS + LIMSI-CNRS, France; Patrick Paroubek, LIMSI-CNRS, France

D-Lib: Holiday Reading

planet code4lib - Mon, 2015-11-16 14:14
Editorial by Laurence Lannom, CNRI

D-Lib: Developing Best Practices in Digital Library Assessment: Year One Update

planet code4lib - Mon, 2015-11-16 14:14
Article by Joyce Chapman, Duke University Libraries, Jody DeRidder, University of Alabama Libraries and Santi Thompson, University of Houston Libraries

D-Lib: The OpenAIRE Literature Broker Service for Institutional Repositories

planet code4lib - Mon, 2015-11-16 14:14
Article by Michele Artini, Claudio Atzori, Alessia Bardi, Sandro La Bruzzo, Paolo Manghi and Andrea Mannocci, Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo" -- CNR, Pisa, Italy

LITA: Agile Development: Building an Agile Culture

planet code4lib - Mon, 2015-11-16 14:00

Over the last few months I have described various components of Agile development. This time around I want to talk about building an Agile culture. Agile is more than just a codified process; it is a development approach, a philosophy, one that stresses flexibility and communication. In order for a development team to successfully implement Agile, the organization must embrace and practice the appropriate culture. In this post I will briefly discuss several tips that will help build that culture.

The Right People

It all starts here: as with pretty much any undertaking, you need the right people in place, which is not necessarily the same as saying the best people. Agile development necessitates a specific set of skills that are not intrinsically related to coding mastery: flexibility, teamwork, and the ability to take responsibility for a project's ultimate success are all extremely important. Once the team is formed, management should work to bring team members closer together and create the right environment for information sharing and investment.

Encourage Open Communication

Because of Agile’s quick pace and flexibility, and the lack of overarching structures and processes, open communication is crucial. A team must develop communication pathways and support structures so that all team members are aware of where the project stands at any one moment (the daily scrum is a great example of this). More important, however, is to convince the team to open up and conscientiously share progress individual progress, key roadblocks, and concerns about the path of development. Likewise, management must be proactive about sharing project goals and business objectives with the team. An Agile team is always looking for the most efficient way to deliver results, and the more information they receive about the motivation and goals that lie behind a project the better. Agile managers must actively encourage a culture that says “we’re all in this together, and together we will find the solution to the problem.” Silos are Agile’s kryptonite.

Empower the Team

Agile only works when everyone on the team feels responsible for the success of the project, and management must do its part by encouraging team members to take ownership of the results of their work, and trusting them to do so. Make sure everyone on the team understands the ultimate organizational need, assign specific roles to each team member, and then allow team members to find their own ways to meet the stated goals. Too often in development there is a basic disconnect between the people who understand the business needs and those who have the technical know-how to make them happen. Everyone on the team needs to understand what makes for a successful project, so that wasted effort is minimized.

Reward the Right Behaviors

Too often in development organizations, management metrics are out of alignment with process goals. Hours worked are a popular metric teams use to evaluate members, although often proxies like hours spent at the office, or time spent logged into the system, are used. With Agile, the focus should be on results. As long as a team meets the stated goals of a project, the less time spent working on the solution, the better. Remember, the key is efficiency: developing software that solves the problem at hand with as few bells and whistles as possible. If a team is consistently beating its time estimates by a significant margin, it should recalibrate its estimation procedures. Spending all night at the office working on a piece of code is not a badge of honor, but a failure of the planning process.

Be Patient

Full adoption of Agile takes time. You cannot expect a team to change its fundamental philosophy overnight. The key is to keep working at it, taking small steps towards the right environment and rewarding progress. Management also needs to be transparent about why it considers this change important. A full transition can take years of incremental improvement. Above all, be conscious that the steady state for your team will likely not look exactly like the theoretical ideal. Agile is adaptable, and each organization should create the process that works best for its own needs.

If you want to learn more about building an Agile culture, check out the following resources:

In your experience, how long does it take for a team to fully convert to the Agile way? What is the biggest roadblock to adoption? How is the process initiated and who monitors and controls progress?

“Scrum process” image By Lakeworks (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 4.0-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/4.0-3.0-2.5-2.0-1.0)], via Wikimedia Commons

Conal Tuohy: Taking control of an uncontrolled vocabulary

planet code4lib - Mon, 2015-11-16 13:49

A couple of days ago, Dan McCreary tweeted:

Working on new ideas for NoSQL metadata management for a talk next week. Focus on #NoSQL, Documents, Graphs and #SKOS. Any suggestions?

— Dan McCreary (@dmccreary) November 14, 2015

It reminded me of some work I had done a couple of years ago for a project which was at the time based on Linked Data, but which later switched away from that platform, leaving various bits of RDF-based work orphaned.

One particular piece which sprung to mind was a tool for dealing with vocabularies. Whether it’s useful for Dan’s talk I don’t know, but I thought I would dig it out and blog a little about it in case it’s of interest more generally to people working in Linked Open Data in Libraries, Archives and Museums (LODLAM).

I told Dan:

@dmccreary i did a thing once with an xform using a sparql query to assemble a skos concept scheme, edit it, save in own graph. Of interest?

— Unholy Taco (@conal_tuohy) November 14, 2015

When he sounded interested, I made a promise:

@dmccreary i have the code somewhere. .. will dig it out

— Unholy Taco (@conal_tuohy) November 14, 2015

I know I should find a better home for this and the other orphaned LODLAM components, but for now, the original code can be seen here:

https://github.com/Conal-Tuohy/huni/tree/master/data-store/www/skos-tools/vocabulary.xml

I’ll explain briefly how it works, but first, I think it’s necessary to explain the rationale for the vocabulary tool, and for that you need to see how it fits into the LODLAM environment.

At the moment there is a big push in the cultural sector towards moving data from legacy information systems into the “Linked Open Data (LOD) Cloud” – i.e. republishing the existing datasets as web-based sets of inter-linked data. In some cases people are actually migrating from their old infrastructure, but more commonly people are adding LOD capability to existing systems via some kind of API (this is a good approach, to my way of thinking – it reduces the cost and effort involved enormously). Either way, you have to be able to take your existing data and re-express it in terms of Linked Data, and that means facing up to some challenges, one of which is how to manage “vocabularies”.

Vocabularies, controlled and uncontrolled

What are “vocabularies” in this context? A “vocabulary” is a set of descriptive terms which can be applied to a record in a collection management system. For instance, a museum collection management system might have a record for a teacup, and the record could have a number of fields such as “type”, “maker”, “pattern”, “colour”, etc. The value of the “type” field would be “teacup”, for instance, but another piece in the collection might have the value “saucer” or “gravy boat” or what have you. These terms, “teacup”, “plate”, “dinner plate”, “saucer”, “gravy boat” etc, constitute a vocabulary.

In some cases, this set of terms is predefined in a formal list; this is called a "controlled vocabulary". Usually each term has a description or definition (a "scope note"), and if there are links to other related terms (e.g. "dinner plate" is a "narrower term" of "plate"), as well as synonyms, including in other languages ("taza", "plato", etc.), then the controlled vocabulary is called a thesaurus. A thesaurus or a controlled vocabulary can be a handy guide to finding things. You can navigate your way around a thesaurus, from one term to another, to find related classes of object which have been described with those terms, or the thesaurus can be used to automatically expand your search queries without you having to do anything; you can search for all items tagged as "plate" and the system will automatically also search for items tagged "dinner plate" or "bread plate".
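
As a toy illustration of that kind of query expansion (the little thesaurus dictionary here is made up for the example, not part of any real system), in Python:

# Toy illustration of thesaurus-based query expansion; NARROWER is a
# made-up dict mapping each term to its narrower terms.
NARROWER = {
    "plate": ["dinner plate", "bread plate"],
    "cup": ["teacup"],
}

def expand(term):
    terms = [term]
    for narrower_term in NARROWER.get(term, []):
        terms.extend(expand(narrower_term))   # recurse into narrower terms of narrower terms
    return terms

# expand("plate") -> ['plate', 'dinner plate', 'bread plate']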

In other cases, though, these vocabularies are uncontrolled. They are just tags that people have entered in a database, and they may be consistent or inconsistent, depending on who did the data entry and why. An uncontrolled vocabulary is not so useful. If the vocabulary includes the terms “tea cup”, “teacup”, “Tea Cup”, etc. as distinct terms, then it’s not going to help people to find things because those synonyms aren’t linked together. If it includes terms like “Stirrup Cup” it’s going to be less than perfectly useful because most people don’t know what a Stirrup Cup is (it is a kind of cup).

The vocabulary tool

So one of the challenges in moving to a Linked Data environment is taking the legacy vocabularies which our systems use, and bringing them under control; linking synonyms and related terms together, providing definitions, and so on. This is where my vocabulary tool would come in.

In the Linked Data world, vocabularies are commonly modelled using a system called Simple Knowledge Organization System (SKOS). Using SKOS, every term (a “Concept” in SKOS) is identified by a unique URI, and these URIs are then associated with labels (such as “teacup”), definitions, and with other related Concepts.
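
For concreteness, here is a small sketch of what a single SKOS concept looks like when expressed this way. It uses the rdflib Python library, which is not part of the vocabulary tool itself, and the scope note text is made up:

# Sketch only: one SKOS concept expressed with rdflib (not part of the tool).
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import RDF, SKOS

g = Graph()
teacup = URIRef("http://example.com/tableware/teacup")
cup = URIRef("http://example.com/tableware/cup")

g.add((teacup, RDF.type, SKOS.Concept))
g.add((teacup, SKOS.prefLabel, Literal("teacup", lang="en")))
g.add((teacup, SKOS.altLabel, Literal("taza", lang="es")))
g.add((teacup, SKOS.definition, Literal("A small cup from which tea is drunk.")))  # made-up scope note
g.add((teacup, SKOS.broader, cup))

print(g.serialize(format="turtle"))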

The vocabulary tool is built with the assumption that a legacy vocabulary of terms has been migrated to RDF form by converting every one of the terms into a URI, simply by sticking a common prefix on it and, if necessary, "munging" the text to replace or encode spaces or other characters which aren't allowed in URIs. For example, this might produce a bunch of URIs like this:

  • http://example.com/tableware/teacup
  • http://example.com/tableware/gravy_boat
  • etc.
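
A minimal sketch of that prefix-and-munge step (the quoting rules here are illustrative, not the project's actual ones; spaces become underscores, which is what the SPARQL query further below reverses when it rebuilds labels):

# Illustrative sketch: convert legacy free-text terms into URIs by prefixing
# and percent-encoding anything that isn't URI-safe. Spaces become "_",
# which can later be turned back into spaces for labels.
from urllib.parse import quote

def term_to_uri(term, base="http://example.com/tableware/"):
    munged = term.strip().replace(" ", "_")
    return base + quote(munged, safe="_")

# term_to_uri("gravy boat") -> "http://example.com/tableware/gravy_boat"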

What the tool then does is find all these URIs and give you a web form which you can fill in to describe them and link them together. To be honest I'm not sure how far I got with this tool, but ultimately the idea would be that you would be able to organise the terms into a hierarchy, link synonyms, and standardise inconsistencies by indicating "preferred" and "non-preferred" terms (i.e. you could say that "teacup" is preferred, and that "Tea Cup" is a non-preferred equivalent).

When you start the tool, you have the opportunity to enter a “base URI”, which in this case would be http://example.com/tableware/ – the tool would then find every such URI which was in use, and display them on the form for you to annotate. When you had finished imposing a bit of order on the vocabulary, you would click “Save” and your annotations would be stored in an RDF graph whose name was http://example.com/tableware/. Later, your legacy system might introduce more terms, and your Linked Data store would have some new URIs with that prefix. You would start up the form again, enter the base URI, and load all the URIs again. All your old annotations would also be loaded, and you would see the gaps where there were terms that hadn’t been dealt with; you could go and edit the definitions and click “Save” again.

In short, the idea of the tool was to be able to use, and to continue to use, legacy systems which lack controlled vocabularies, and actually impose control over those vocabularies after converting them to LOD.

How it works

OK here’s the technical bit.

The form is built using XForms technology, and I coded it to use a browser-based (i.e. Javascript) implementation of XForms called XSLTForms.

When the XForm loads, you can enter the common base URI of your vocabulary into a text box labelled “Concept Scheme URI”, and click the “Load” button. When the button is clicked, the vocabulary URI is substituted into a pre-written SPARQL query and sent off to a SPARQL server. This SPARQL query is the tricky part of the whole system really: it finds all the URIs, and it loads any labels which you might have already assigned them, and if any don’t have labels, it generates one by converting the last part of the URI back into plain text.

prefix skos: <http://www.w3.org/2004/02/skos/core#>
construct {
  ?vocabulary a skos:ConceptScheme ;
    skos:prefLabel ?vocabularyLabel .
  ?term a skos:Concept ;
    skos:inScheme ?vocabulary ;
    skos:prefLabel ?prefLabel .
  ?subject ?predicate ?object .
}
where {
  bind(<<vocabulary-uri><!--http://corbicula.huni.net.au/data/adb/occupation/--></vocabulary-uri>> as ?vocabulary)
  {
    optional { ?vocabulary skos:prefLabel ?existingVocabularyLabel }
    bind("Vocabulary Name" as ?vocabularyLabel)
    filter(!bound(?existingVocabularyLabel))
  }
  union
  {
    ?subject ?predicate ?term .
    bind( replace(substr(str(?term), strlen(str(?vocabulary)) + 1), "_", " ") as ?prefLabel )
    optional { ?term skos:prefLabel ?existingPrefLabel }
    filter(!bound(?existingPrefLabel))
    filter(strstarts(str(?term), str(?vocabulary)))
    filter(?term != ?vocabulary)
  }
  union
  {
    graph ?vocabulary { ?subject ?predicate ?object }
  }
}

The resulting list of terms and labels is loaded into the form as a “data instance”, and the form automatically grows to provide data entry fields for all the terms in the instance. When you click the “Save” button, the entire vocabulary, including any labels you’ve entered, is saved back to the server.
