My first session at OSCON this year was hosted by Jono Bacon on Community Management.
We’ve seen remarkable growth in community all over the world – people are getting together to make things, do things, hack, etc. This simple idea of people getting together to form communities makes Jono excited (me too). If you take away the screens, computers, internet, etc., we’re all just people. We all have a basic set of concerns, opportunities, and insecurities. We all want a feeling of self-worth, and to get that we need to contribute to communities (family, friends, etc.). One key to this is the growth of internet connectivity. People in countries that were never connected before are getting connected, and we also have the growth of smartphone use – this means that we as human beings can get together and connect to create communities and contribute to making, sharing, creating, and more.
Open source is powered by communities! Wikipedia is powered by communities sharing knowledge and making it open! There are sustainable farming groups all over the world. We have the maker revolution. We also notice a lot more political activism because people can get together in easier ways.
Despite all of that, we’re inefficient as people – these communities were mostly accidents. We learn about communities by watching others; the renaissance comes when people switch from watching to writing it down and replicating that information.
Jono shared with us his written-down, packaged thoughts on community management in this half-day workshop.
If we want to build strong communities we have to start with a mission. We have to have a point and a focus. In order to assess the type of mission we want we have to look at the world we’re in. First off we’re in the post-Snowden land of privacy, the land of 3D printing and the maker revolution and a world where everyone is getting connected to the internet.
If building a community within or for your business seems like a marketing ploy it will fail. The day was broken up as follows:
- We need a vision – this is the ‘fluffy’ part
- We need requirements – Communities are chaotic, and that makes them fun, but we do need to have some sort of requirements
- We need to make a plan – there are many communities that have naturally sprung up (the ice bucket challenge) but the very best communities have a plan behind them
- We need an infrastructure
- We then need to figure out how to get people involved
- Once we have people join we need to measure the value of the community (especially if you’re at a company)
- The key thing is refinement. We will screw some stuff up – and this is a good thing. Failure is an opportunity to be better
Want to learn more? Sign up at http://communityleadershipforum.com
Community leadership is about taking all the talents you’re surrounded by and bringing them together. Contributions come in many shapes and sizes. Not all contributions are code and documentation – some of it is just ideas!

Strategy (Vision + Mission + Plans):
Vision – what are we going out there to do? The elevator pitch that will get people excited: take a global community of connected people and make them as efficient as possible. Jono breaks communities into two types: read and write. Read communities are user groups – people who need a place to talk and share. Write communities want to get together to change things – open source projects are write communities and the focus of today.
The first thing we need to accept is that people are irrational. We need to use a bit of social engineering or behavioral economics to manage our communities.
Jono brought up the SCARF model (read the full PDF) – this is the core foundation for creating a successful community:
- Status – Clarity in relative importance
- Certainty – Creating a sense of security and predictability
- Autonomy – Building in choice in your environment (even if those choices all lead to the same results – orders don’t work – letting people pick is the key)
- Relatedness – Defining clear social groupings and systems (build strong teams and help them work together)
- Fairness – Reducing unfair opportunity and rewards
Every community is different, but every community that is great is great because of great leadership. Some of the most impactful leaders though can be at the bottom of the food chain.
What is great leadership? It’s broken into two areas:
- Helping people to succeed in their goals
- Helping people to be the best that they can be
The goal with strategy is that we want to build predictable yet surprising results. Instead of trying to convince people who are skeptical – go out there and do it and surprise them. You also have to be honest – you cannot promise success when starting a new community – some things are going to work and some are not.
There are three steps to starting your community within a company or as an extension of your company:

Observe:
- look at your environment
- define requirements
- define expectations
- identify key players – this is really important – you need to find the people you want to influence and that you want to influence you
- assess risks/threats to you and others – when you join a company there are going to be people who are gunning for you and those people will bemoan the work that you’re doing and others will actively try to derail your work – these are the people you want to make friends with
- explore short/long term changes – see how quickly people are joining and leaving a company
- create a mission statement – this isn’t something you create once and never look at again – it’s something people should think about every single day ‘why are we doing this?’
- create a set of values – from the mission statement you can pull out a set of values
- create a longer term roadmap – “in 2 years we want to be here”
- create a staff engagement plan – if you work for a company, how are you going to get out there and engage with people
- create a community engagement plan – find a way to make visiting the community a habit
- create a budget – “pick a budget and don’t spend all of it”
- a strategic plan (for the execs)
- an elevator pitch (for the staff) – max 5 min – better if under 3
- an execution plan (for you)
- relationships (for the teams)
In the end you end up with 4 core documents: mission statement, elevator pitch, strategic plan, implementation plan. Through all of this you want to communicate your strategy, keep people included, and make them feel like they’re part of the process.

Planning:
Collaborative planning is really really hard! We want to build a culture in which people can plan together but not everyone in your community should play a role in how you plan. These people might be loud, but lack the skills to assist in planning. You need to find the best people to contribute to the plan because they have earned it.
There are two types of people in open source communities – hackers and maintainers. Hackers want to create things! Maintainers want to build stable software and fix bugs and do QA.
For the hackers you want to build a culture of chaos so people can join in easily. This is like an on-ramp to the project. You also need project plans in place for the maintainers.
5 areas to consider when planning:
- opinionated – it’s okay to say no to people! If you say yes to everyone the best you can be is average
Objectives and Key Results (OKRs) – a process used at Google. The first step is to plan your next 3-month period – create some measurable objectives (no more than 5). Next you define key results – set these to be deliberately ambitious (on the edge of impossible), but measurable outcomes (no more than 3 for each objective). Next you document the previous two steps and share them with everyone (when you share ambitious goals with the public, you don’t want to look like an idiot by not achieving them). You need to provide updates regularly, and you have to stress that these are ambitious goals that you might not meet. We shouldn’t just seek great results; we should regularly exercise and stretch ourselves to make ourselves better. After the 3-month period you grade yourself from 1 to 10 – 1 being that you didn’t do a thing, 10 being that you finished everything. You should be getting about a 6 or 7 – if you’re getting a 10, then you’re not stretching yourself enough. Finally, you revise and improve your goals for the next period. Because you’re assessing yourself, you get to improve yourself – it’s not designed to be a tool for your boss to grade you.
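The self-grading step above can be sketched in a few lines. This is a hypothetical illustration, not part of the workshop material – the objective names and grades are made up:

```python
# Minimal sketch of the OKR self-grading step: each objective has
# several key-result grades (1-10), and the objective's grade is
# their average. Objective names and scores here are invented.

def grade_okrs(objectives):
    """Average the 1-10 self-grades of each objective's key results."""
    return {
        objective: sum(grades) / len(grades)
        for objective, grades in objectives.items()
    }

okrs = {
    "Grow the forum to 1,000 active members": [7, 6, 8],
    "Ship a contributor on-ramp guide": [5, 7],
}

for objective, grade in grade_okrs(okrs).items():
    # Around 6-7 means the goals were suitably ambitious;
    # a 10 suggests you didn't stretch far enough.
    print(f"{objective}: {grade:.1f}")
```

Run over a quarter’s worth of objectives, grades clustering around 6–7 are the signal you were ambitious enough.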
The next thing we need to do is connect to the hearts and minds of people. A plan that doesn’t have people on board is just words. We want people to be really excited about the work we do – building communities is the way we make the world a better place.

Infrastructure:
Building a community is a collaborative effort.
New people will join your community and won’t know what it is or how they can contribute. They want to see that this is a community that is eager to include them – this is the marketing part of things. Next they’re on the ‘on-ramp’ into your community. To get people on the on-ramp, make it clear that people are critical to what we’re doing and that we want them to participate. The next step is to get those community members to develop skills. This is more than providing tools to help people learn – it includes instructions on how to participate. People don’t want to read reams of information – we live in the time of Twitter and Facebook – so we need to provide efficient instructions: quick bullet points. Once our new members have learned how to contribute, you want them to ‘do something’. To help with this, create a list of bite-sized bugs – easy bugs that new members are encouraged to fix. Then, once they contribute, be sure to provide feedback – people want to feel validated.
For your open source project you’re going to need some basic facilities:
- communication channels
- collaborative editing / knowledge base (wiki)
- code hosting
- issue tracking
- news delivery (blog)
- social media
Jono shared his list of recommendations for these different tools:
The one tool missing on the slide was issue tracking – Jono says Bugzilla is popular and so is Launchpad.

Growth
Growth is about engagement. We want people to become ‘sticky’ – we want them to stick around. Jono’s goal is 66 days – that’s how long it takes to develop a habit. So we want to encourage conversation, creation, communication and conduct to get our communities to grow in a healthy way.

Measuring Impact
“If you’re not measuring it, it didn’t happen”
Aggregate measurements tell a fuller story than KPIs (a single number that tells how well something is working). A KPI is something like “there are 1,000 people on the forum,” but an aggregate measurement is something like levels, where at level 1 you have to spend X amount of time on the site, participate in X topics, etc. So when you say you have 500 level-1 members on your site, you know what that means.
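The level idea above combines several signals into one meaningful label. As a small sketch (the thresholds and signal names below are invented for illustration, not from the workshop):

```python
# Hypothetical aggregate 'level' measurement: a member's level
# combines several activity signals at once, rather than relying
# on a single KPI like raw member count. Thresholds are made up.

def member_level(hours_on_site, topics_participated, posts_written):
    """Assign a level from combined activity thresholds."""
    if hours_on_site >= 20 and topics_participated >= 10 and posts_written >= 50:
        return 2
    if hours_on_site >= 5 and topics_participated >= 3:
        return 1
    return 0

# "500 level-1 members" now has concrete meaning:
print(member_level(hours_on_site=6, topics_participated=4, posts_written=2))    # 1
print(member_level(hours_on_site=30, topics_participated=12, posts_written=80)) # 2
```

The point is that the label is defined by the rule, so reporting “500 level-1 members” carries the whole rule with it.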
What you’re looking for are the stories, the patterns and the trends. If you want to identify a great community member, look at the whole of their contribution – not just how much code they contribute, but how they participate in discussions as well. Come up with a scale for your community.
Quality is way more important than quantity. Having lots of data is not more important than providing quality data. The data is there to show outcomes and outcomes are about patterns and trends not numbers. You want to illustrate the practical ways that you have succeeded in your community.
Our measurements might show that we failed – and that’s okay. You need to fail, learn from it and improve upon things. Don’t let the fear of failure stop you from measuring the impact of your community. Seeing “failure” in your data lets you realign your plans and community to figure out how to succeed at your goals.

Reading recommendations:
Abundance: The Future Is Better Than You Think – by Peter H. Diamandis and Steven Kotler
Art of Community by Jono Bacon (of course)
The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change by Stephen R. Covey
Making Things Happen: Mastering Project Management by Scott Berkun
The Starfish and the Spider: The Unstoppable Power of Leaderless Organizations by Ori Brafman and Rod A. Beckstrom
The following is a guest post by Barrie Howard, IT Project Manager at the Library of Congress.
This post is part of a series about digital preservation training informed by the Library’s Digital Preservation Outreach & Education (DPOE) Program. Today I’ll focus on an exceptional individual, Danielle Spalenka, Project Director for the Digital POWRR Project. Prior to managing Digital POWRR, she was the Curator of Manuscripts for the Regional History Center and University Archives at Northern Illinois University.
Barrie: Danielle, first I’d like to applaud the POWRR Project for all its efforts to provide practical digital preservation solutions for low-resourced institutions. For those that aren’t familiar, can you provide a brief overview and recount some of the highlights of the project?
Danielle: Thank you very much, Barrie! We are really proud of all that has been accomplished. The Digital POWRR project really began because of a failed attempt to apply for a major digitization grant. The two Co-PIs of the project, Lynne Thomas and Drew VandeCreek, wanted to digitize a collection of dime novels in the Rare Books and Special Collections department at Northern Illinois University. They applied for an Institute of Museum and Library Services (IMLS) grant to digitize the novels, only to be rejected because they did not have a digital preservation plan built into their proposal. They realized that they probably weren’t the only medium-sized or under-funded institution with this same problem. What resulted was a National Leadership Grant to investigate the problem of, and potential solutions for, digital preservation at institutions with restricted resources. And that’s how POWRR (Preserving digital Objects with Restricted Resources) was born.
From 2012 to 2014, five institutions in Illinois – Northern Illinois University (serving as the lead), Illinois State University, Western Illinois University, Chicago State University, and Illinois Wesleyan University – participated in the study. We investigated, evaluated, and recommended scalable, sustainable digital preservation solutions for libraries with smaller amounts of data and/or fewer resources. During the course of the study, Digital POWRR Project team members realized that many information professionals were aware of the risk of digital object loss but often failed to move forward because they felt overwhelmed by the scope of the problem.
Team members prepared a workshop curriculum based on the findings of the study and presented it to several groups of information professionals as part of the project’s dissemination phase. Demand for the workshops was high – registration filled up quickly and created a long waiting list of eager professionals trying to get into the workshops. Towards the end of the project, organizations of information professionals were still reaching out to team members to bring the workshop to their area. We applied for a grant from the National Endowment for the Humanities Division of Preservation and Access to continue giving the workshops, and in January 2015 received funding from the NEH to extend the reach of the Digital POWRR workshops. That is when I came on board as the project director, replacing Jaime Schumacher, who is now a Co-PI on the project with Drew and Lynne.
In addition to the workshop, another highlight from the project has been the publication of a white paper that has been widely read. The white paper recently won the Preservation Publication Award from the Society of American Archivists, which we are really excited about. Our project team traveled across the country and around the world to present the findings from the study. We look forward to continuing to travel across the country to provide the workshop (for free!) through the end of 2016, thanks to funding from the NEH.
Barrie: It must be very exciting to be moving POWRR into a new phase. What’s been accomplished to date?
Danielle: Since we received funding from the NEH in January 2015, we have made a few changes to the workshop based on evaluations from previous participants. We have worked with several regional organizations of information professionals who provided letters of support in our grant application to schedule and promote individual workshops. We’ve done workshops in two locations so far, and were able to provide some travel scholarships that allowed institutions with very limited funds to send a representative to the workshop!
Barrie: I know you just wrapped up back-to-back workshops in Portland, OR. What other cities are hosting POWRR “From Theory to Action” workshops?
Danielle: Our next workshop will be another back-to-back workshop in Albany, NY in October, and we’ll be traveling to Deadwood, SD in November. In 2016, so far we have workshops scheduled in Little Rock, AR (April 2016), St. Paul, MN (June 2016) and San Antonio, TX in July 2016. I’m working to confirm dates in a few other locations, including Atlanta. Depending on our budget for 2016, we hope to go to more locations. I’ve had requests to come to Philadelphia, Montana, New York City, California, and even Alaska! As I continue scheduling workshops, I encourage anyone interested in attending to visit our website for updates.
I would like to thank several organizations that have helped us make sure the workshop remains free: the Black Metropolis Research Consortium; Northwest Archivists, Inc.; the Sustainable Heritage Network; Mid-Atlantic Region Conference of Archivists; the East New York Chapter of ACRL; the Midwest Archives Conference; the Digital Curation Interest Group of ACRL; the Oberlin Group; and the American Association for State and Local History.
Barrie: I understand that your team looked at the DPOE Train-the-Trainer Workshop training materials in developing your own curriculum, as well as other digital preservation training offerings. Can you share some of your observations?
Danielle: When we first started developing the workshop, we did look at what was currently being offered. We wanted the workshop to follow best practices and standards presented by digital preservation instruction currently available. Many of our project team members attended workshops and training sessions, including the DPOE Train-the-Trainer and offerings from the Society of American Archivists (SAA). We also talked to several digital preservation instructors, including Chris Prom and Jackie Esposito – who teach some of the Digital Archives Specialist courses offered through SAA – and Liz Bishoff.
Our review of the landscape of digital preservation instruction was that it is largely aimed at an audience beginning to come to grips with the idea that digital objects are subject to loss if we don’t actively care for them. There are lots of offerings discussing the theory of digital preservation – the “why” of the problem – and we found that there were limited opportunities to learn the “how” of digital preservation, both on the advocacy and technical sides. We also found that other great offerings, like the Digital Preservation Management Workshop Series based at MIT, had a tuition fee that was unaffordable for many prospective attendees, especially from under-funded institutions. Our goal in this phase is to make the workshops free to attend.
A major goal of the workshop is to discuss specific tools and provide a hands-on portion so that participants could try a tool that they could apply directly at their own institutions. We found that hands-on instruction for a specific, basic digital preservation tool, and critical overviews of other available tools that we tested, are largely absent from some course offerings. In the case of DPOE’s Train-the-Trainer Workshop, we liked how it focuses on understanding digital preservation conceptually by describing its individual steps, and also clarifying the difference between preservation and access. Our workshop diverges from the DPOE curriculum by directly training front-line practitioners and providing a critical overview of how digital preservation services and tools actually relate to the steps, their effective use in a workflow, and how to advocate for implementation.
Barrie: Another fantastic outcome of your project is the POWRR Tool Grid, which I read covers over 60 tools. Let’s say I’m just getting up to speed and found the grid a little overwhelming, so, what would be a good place to start?
Danielle: I’m glad you mention the tool grid, because a lot of work went into its creation. I want to mention that the POWRR Project is no longer maintaining the tool grid. When the first phase of the project ended in 2014, so did our ability to maintain it. Instead, we have thrown our support behind COPTR (Community Owned digital Preservation Tool Registry). They have produced the POWRR Tool Grid v2, which combines the form and function of our original tool grid with the sustainability provided by the COPTR data feed.
For those just starting out, I recommend first considering what type of tool you might be interested in. Are you looking for a tool that can help process your digital materials? Are you looking for storage options? How about a tool or service that can do everything? Looking at the specific function of a tool might be a good place to understand the wide variety of tools better.
While we don’t endorse or recommend any specific tool or service, I do encourage people to take a look at the tools we cover in-depth in the workshop. The reason being that we are more familiar with these tools from our testing phase of the project. For help with front-end processing, I suggest looking at Data Accessioner and Archivematica. I have heard good things coming from BitCurator, which might be of particular interest for those interested in digital forensics. For those more interested in storage, services like MetaArchive, DuraCloud, and Internet Archive would be good to investigate. There are very few services that pretty much do it all (at least in the price range for our target audience), but Preservica and the new ArchivesDIRECT are two we have investigated and discuss in our workshops as potential options for institutions with restricted resources.
Barrie: Any other advice on developing a skillset for managing digital content you’d like to share with the readers?
Danielle: A number of tools and services offer free webinars and information sessions to learn more about a specific tool. Some also offer free trial versions that allow you to gain hands-on experience to see if it might work at your institution. You can download and play with the many open-source tools out there to gain some hands-on experience.
Remember that digital preservation is an incremental process, and there are small steps you can take now to start digital preservation activities at your own institution. You don’t have to feel like an expert to begin! And finally, remember you are not alone! One thing we’ve learned through the study and by traveling to the various workshops is that there are many practitioners who recognize the need for digital preservation but have yet to engage in these activities. An easy way to get started is to see what others are doing and talking about. You can ask a question on the Digital Preservation Q&A forum. You can also learn about the latest in digital preservation activities through blogs like The Signal and the blog Digital Preservation Matters. And finally, you can attend a free POWRR workshop!
11:00 - Fedora 4
1:00 - GIS /Documentation (two rooms)
2:00 - Dev Ops
3:00 - UI
4:00 - Metadata

These will take place at the Robertson Library in the Language Learning Lab (don't worry, there will be lots of signs!). Unlike the Hackfest, registration is not required.

Workshops

A sign-up page for the workshops is available here. Your choices will not be set in stone, but we would like to get some general numbers to plan for, so your participation is much appreciated. We also have some homework for Wednesday and Thursday, in general and for a few of the specific workshops:

General

Install an Islandora Virtual Machine on your laptop, and bring it with you. You will get far more out of the workshops if you can play along on your own laptop. We recommend a bare minimum of 4GB of RAM to run the standard Islandora VM. Installing it in advance and making sure you can run it is highly recommended. The Islandora 7.x-1.5 VM is available for download here. We can provide help before the workshop to get the VM up and running if you are having difficulties. An informal "installfest" will take place at 4:30 on Tuesday, August 4 for anyone who wants a little guidance or troubleshooting. If you are not able to bring a laptop that can run the VM, please let me know. I can set you up with an online sandbox that will suffice for most of the Admin and Intermediate level workshops.

Specific

If you are planning to attend Islandora Development 101: This workshop (and likely many of the other Developer Track workshops) will work with HEAD, so the 7.x-1.5 VM will not match up. Please install the Islandora Vagrant.

If you are planning to attend Fedora 4: The workshop will include a hands-on section using a Fedora 4 virtual machine image, so please follow these instructions to get the VM up and running on your laptop before the workshop. NOTE: The VM uses 2GB of RAM, so you will need a laptop with at least 4GB of RAM to run it. Depending on your laptop manufacturer, you may also need to enable virtualization in the BIOS.
- Download and install VirtualBox: https://www.virtualbox.org/wiki/Downloads
- Download and install Vagrant: http://www.vagrantup.com/downloads.html
- Download the 4.2.0 release of the Fedora 4 VM: https://github.com/fcrepo4-labs/fcrepo4-vagrant/releases/tag/fcrepo4-vagrant-4.2.0
- Note that you can either clone the repository to your desktop using git or just download the ZIP [https://github.com/fcrepo4-labs/fcrepo4-vagrant/archive/fcrepo4-vagrant-4.2.0.zip] and unzip it
- Using a Command Line Interface, navigate to the VM folder from step 3 and run the command: vagrant up
- Note that this step will take a while as the VM downloads and installs a full virtual environment
- Test the VM by opening your web browser and navigating to: http://localhost:8080/fcrepo
In my last post, I talked about the sprint review meeting; this month we look into planning a sprint. As I said last time, this meeting should be separate from the review, both to differentiate the two and to avoid meeting fatigue.
Sprint planning takes into account the overall project plan and the results of the previous sprint (as presented in the sprint review) and sets out a plan for the next discrete development time period.
The timing of the sprint planning meeting is the subject of much discussion, and different teams adopt different conventions based on what they feel is the best fit for their particular process. Personally, I prefer to hold the planning meeting on the same day as the review. While this puts pressure on the Product Owner to quickly adjust planning materials based on the outcome of the review, it has several important advantages:
- The knowledge acquired during the review meeting is fresh on everyone’s mind. Given that sprints typically end on a Friday, waiting until after the weekend to plan the next iteration can lead to loss of organizational memory.
- During the time between the review and planning meeting, in theory, no work can be performed (because development priorities have not been set), so minimizing that time is crucial to improved productivity.
- Given that Agile philosophy aims to decrease overhead, having all the necessary meetings in one day helps to contain that part of the development process and focus the team on actual development work.
My ideal sprint boundary process is as follows: have the sprint review in the morning, then take a break (the sprint retrospective can happen here). After lunch, reconvene and hold the planning meeting.
The planning meeting should be less open than the review, as it is more concerned with internal team activities rather than disseminating information to as wide an audience as possible. Only team members and the Product Owner should be present, and the Product Owner may be dismissed after requirements have been presented.
Before the meeting begins, the Product Owner should spend some time rearranging the Product Backlog to reflect the current state of the project. This should take into account the results of the review meeting, so if both happen on the same day the PO will need to be quick on her feet (maybe a kind developer can drop by with some takeout for lunch?).
The planning meeting itself can be divided into two major parts. First, the team will move as many user stories from the backlog into the sprint as it thinks it can handle. Initially this will take some guessing in terms of the team’s development velocity, but as sprints come and go the team should acquire a strong sense for how much work it can accomplish in a given time period. Because the PO has updated the backlog priorities, the team should be able to simply take items off the top until capacity is reached. As each item is moved, the team should ask the PO as many questions as necessary to truly understand the scope of the story.
Once the sprint bucket is full, the team will move on to the second part of the exercise, which involves taking each item and breaking it down into tasks. The PO should not be needed for this part, as the team should have collected all the information it needs in the first part of the meeting. When an item has been fully dissected and broken down, individual team members should take responsibility for each of the tasks to complete, and dependencies should be identified and documented.
It’s important to remember that sprint planning is not driven by how much work is left in the backlog, but by how much the team can realistically accomplish. If you have 3 sprints left and there are 45 user stories left in the backlog, but the team’s velocity is 10 stories per sprint, you can’t just put 15 stories in the sprint; at that point the team needs to renegotiate scope and priorities, or rethink deadlines. Pushing a team beyond its comfort zone will result in decreased software quality; a better approach is to question scope and differentiate key features from nice-to-haves.
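The capacity arithmetic above is simple but worth making explicit. A sketch using the example numbers from the text (3 sprints, 45 stories, velocity 10):

```python
# Sketch of the capacity check described above: sprint planning is
# bounded by velocity, not by how much work remains in the backlog.

def remaining_capacity(sprints_left, velocity):
    """Stories the team can realistically finish before the deadline."""
    return sprints_left * velocity

stories_left = 45
capacity = remaining_capacity(sprints_left=3, velocity=10)

if stories_left > capacity:
    overflow = stories_left - capacity
    # Don't cram 15 stories into each sprint; renegotiate instead.
    print(f"Over capacity by {overflow} stories: cut scope or move the deadline")
```

With these numbers the check reports a 15-story overflow, which is exactly the gap the team needs to renegotiate rather than absorb.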
If you want to learn more about sprint planning meetings, you can check out the following resources:
- Vikrama Dhiman’s slideshare presentation.
- Derek Huether’s simple cheat sheet.
- A look at how Atlassian, the company that makes JIRA, does their own sprint planning.
I’ll be back next month to discuss the sprint retrospective.
What are your thoughts on how your organization implements sprint planning? How do you handle the timing of the review/retrospective/planning meeting cycle? What mechanisms do you have in place to handle the tension between what needs to be done and what the team can accomplish?
“BIS-Sprint-Final-24-06-13-05” image By Birkenkrahe (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons.
I recently decided I need to spend some time after each work day writing up my thoughts on work stuff. If nothing else, these can serve as an outlet and time for improved pondering. So please forgive the winding, rambling nature these posts will take. From these, however, I hope to find little nuggets of ideas that I can actually take and make into decent presentations, workshops, articles, whatever. As ever, I welcome your feedback and questions on these thoughts.
Recently, I made an OpenRefine Reconciliation Service for Geonames, and in particular, for reconciling a metadataset using Library of Congress Authority terms (think LCSH and LCNAF accessible via id.loc.gov) and wanting to pull back Geonames identifiers (URIs, rather, as they are in that vocabulary) and coordinates following ISO 6709 standard.
A number of interesting (perhaps only to me) points came up while working on this project and using this reconciliation service (which I do quite often, as we are migrating legacy DC/XML from a variety of platforms to MODS/XML, and part of this involves batch data enhancements). The questions and points below are what in particular I hope to address in this post:
- Can we standardize, for lack of a better word, the process by which someone creates an OpenRefine reconciliation service based off of a REST API on top of any vocabulary? Also, API keys are the devil.
- More specific to geographic terms/metadata, why do I feel the need to use Geonames? Why not just use LC Authorities, considering they’ve ‘pulled in’ Geonames information, matching URIs, in batch?
- Do we really want to store coordinates and a label AND a URI (and whatever else) for a geographic term within a descriptive metadata record element? Does it even matter what we want when we have to consider what we need and what our systems can currently handle?
- As a follow-up, where the heck would we even put all those datapoints within anything other than MODS? What are some of the RDF metadata models doing, and how can folks still working with XML (or even MARC) prepare for conversion to RDF? Some ideas on the best practices I’m seeing put about, as well as a few proposals for our own work.
And other various points that come up while I’m writing this.

Let’s All Make OpenRefine Reconciliation Services!
Professionally, I’m in some weird space between cataloging, general data stuff, and systems. So don’t take my word on anything, as usually I’m just translating what already exists in one subdomain to a different subdomain (despite the fact that library domains just assume they can already talk to each other, often).
I start with this to say I’m not a developer of any sort, yet I was able to pull together the Geonames OpenRefine Reconciliation Service via trial and error, knowledge of how the Geonames REST API works (in particular, how queries are structured and data returned), and by building off all the great community-sourced work that exists. In particular, Ted Lawless wrote a great FAST OpenRefine Reconciliation Service that I used to create something for Geonames. There are some OpenRefine Reconciliation Service templates for others to build off of - in particular, a very simple one in Python, and some other examples written in PHP - and an OpenRefine Reconciliation Service API wiki document that you should take with a grain of salt, as it needs serious revision, updating, and expansion (which, er, maybe I should help with). This is just scratching the surface of OpenRefine reconciliation examples, templates, and documentation.
However, once you get into building or modifying an existing reconciliation service (recon service from this point on, for the sake of my typing hands), you might run into some of the same roadblocks and questions I did. For example, with the Geonames recon service, I particularly wanted to return coordinates for a matched name. However, I did not want to do this by matching a name, pulling the entire record for that name (serialized however - json, xml, doesn’t matter) into a new column, then parsing that records column to find the particular datapoints I wanted to add for each row. This method of ‘reconciliation’ in OpenRefine - seen when someone adds a new column by retrieving URLs generated from the originating column values - takes far longer than using a recon service, is not scalable unless your metadatasets are amazingly consistent, and offers more chances for errors, as you have to parse the records in batch for each datapoint you want to pull out (otherwise you’re spending so much time on each value that you might as well have faceted the column and searched manually in your authority of choice for the datapoints you hope to retrieve). Yet the recon service templates and the OpenRefine recon service metadata (explained somewhat in that wiki page above) did not offer me a place to store and return the coordinates (without a hack I didn’t want to make).
As I’m writing this, I realize that a post detailing all the ways one can use OpenRefine to do ‘reconciliation’ work would be helpful, so we know we are comparing apples to apples when discussing. For example, another way that reconciliation can happen in OpenRefine - using the now unsupported but still viable and immensely useful DERI RDF Extension - is yet another approach that has its merits, but could possibly muddle someone’s understanding of what I’m discussing here: the Reconciliation Service API script/app, in my case built in Python and working with a REST API.
For what it’s worth, I’d really like to have an OpenRefine in LODLAM call on the different reconciliation services, examples, and how to build them. If you’re interested in this, let me know. I’m happy to talk about my own experiences for part of it, but I’d like to have at least one other person talk.
Regardless, back to building the Geonames recon service: I could get a basic recon service running by plugging in the Geonames API information in place of the FAST API information in Lawless’ FAST recon service code, with minor modifications for changes in how Geonames returns data, and the inclusion of an API key. The requirement of an API key made this work that much harder, because it means folks need to go in and add their own (or use the sample one provided and hit the daily API call limit rather quickly) in the core flask app code. I’m sure there are ways to have the user submit their own key via the CLI before firing up the service, or in other ways, but I kept it as simple as possible since this annoyance wasn’t my main concern.
My main concern with this API was getting good matches with metadata using terms taken from the Library of Congress Authorities, in particular LCSH and LCNAF, and returning top matches along with coordinates (and the term and term URI from Geonames, luckily built into the recon service metadata by default). The term matching uses the fuzzywuzzy library, seen in most Python OpenRefine recon apps regardless. The coordinates for a match were simply appended to the matched term with a vertical bar, something easy to split values off of in OpenRefine (or to remove via the substring function if you happen to not want the coordinates).
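A minimal sketch of that vertical-bar trick (this is not the actual recon service code; the field names assume the shape of a Geonames search response):

```python
# Build the display name the recon service returns to OpenRefine:
# the matched label, a vertical bar, then the coordinates. OpenRefine
# users can later split on " | " (or strip it) in a GREL expression.
def label_with_coordinates(geonames_record):
    """Return 'Label | lat, lng' from a Geonames-style result dict."""
    name = geonames_record["name"]
    lat = geonames_record["lat"]
    lng = geonames_record["lng"]
    return f"{name} | {lat}, {lng}"

record = {"name": "Knoxville", "lat": "35.96064", "lng": "-83.92074"}
print(label_with_coordinates(record))  # Knoxville | 35.96064, -83.92074
```

In OpenRefine, a split on `" | "` then gives you the plain label in one column and the coordinates in another.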
But the first tests of the service described above returned really poor results (fewer than 10 direct or above-90% matches per ~100-record test metadataset), considering the test metadatasets were already reconciled, meaning the subject_geographic terms I was reconciling were consistent and in LCNAF or LCSH (as applicable) form. This is when I took a few and searched in Geonames manually. I invite you to try this yourself: search Knoxville (Tenn.) in http://www.geonames.org. You get no matches to records from the Geonames database, and instead (as is the Geonames default) get results from Wikipedia. This is because Geonames doesn’t like that abbreviation - and my sample metadatasets, all taken from actual metadatasets here at work, are all United States-centric, particularly as regards subject_geographic terms. Now search http://www.geonames.org for Knoxville (Tennessee), or Knoxville Tennessee, or Tennessee Knoxville - the first result will be exactly what you’re searching for.
What to do, at least in the context of OpenRefine recon services? Well, write a simple python script that replaces those LC abbreviations for states with the full name of the state, then searches Geonames for matches. See that embarrassingly simple solution here: http://github.com/cmh2166/geonames-reconcile/blob/master/lc_parse.py. Yep, it’s very basic, but all of a sudden, the reconciliation service was returning much, much better results (for my tests, around 80% direct matches). I invite others to try using this recon service and report your results, as well as other odd Library of Congress to Geonames matching roadblocks for more international geographic metadatasets.
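The idea, reduced to a sketch (the real script is linked above; this simplified version covers only a few states, and the mapping is illustrative):

```python
# Expand LC-style state abbreviations so a heading like
# 'Knoxville (Tenn.)' becomes the Geonames-friendly 'Knoxville Tennessee'.
import re

LC_STATE_ABBREVS = {
    "Tenn.": "Tennessee",
    "Va.": "Virginia",
    "Calif.": "California",
}

def expand_lc_heading(heading):
    """Rewrite 'Place (Abbrev.)' as 'Place FullName' for a Geonames query."""
    match = re.match(r"^(.*?)\s*\((.+?)\)$", heading)
    if not match:
        return heading  # no parenthetical qualifier; leave untouched
    place, qualifier = match.groups()
    qualifier = LC_STATE_ABBREVS.get(qualifier, qualifier)
    return f"{place} {qualifier}"

print(expand_lc_heading("Knoxville (Tenn.)"))  # Knoxville Tennessee
```

Trivial, but it is exactly this kind of normalization that took the direct-match rate from under 10% to around 80% in my tests.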
There are other things I wish to improve in the Geonames recon service - some recon services offer the ability, if the top three results returned from reconciliation are not what you wanted at all, to then search the vocabulary further without leaving OpenRefine. I played around a bit with adding this, but had little luck getting it to work. I also want to see if I can expand the OpenRefine recon service metadata to avoid the silly Coordinates hack. I’d love to show folks how to host this somewhere on the web so you do not need to run the Geonames recon service via the simple python server before being able to use it in OpenRefine - however, the API key requirement nips this in the bud.
More to the point, though, I want to figure out how to better improve Geonames matching for other, ‘standard’ library authority sources. It seems to me like something is fundamentally off with library data work when the authority services are, from an external data reconciliation viewpoint, so siloed. That’s not at all what we want if we’re moving towards a library-data-on-the-web, RDF-modeled world. It seems to me, anyway.

Geonames versus Library of Congress Authorities
So this brings me to two questions, both of which I got from various people hearing me talk about this work: why not just reconcile against the Library of Congress Authorities (which have been matched with Geonames terms via some recent batch enhancements and should now have coordinates information, as it is a requirement for geographic name authorities in RDA)? And, alternatively, why not just match with Geonames and use their URI, leaving out LCSH for subject_geographic or other geographic metadata (and using it instead for personal/corporate names that aren’t geographic entities, or topical terms, etc.)?
I think this shows better than anything I could say a fundamental divide in how different parts of library tech world see “Authority” data work.
Here is why I decided to not use the Library of Congress Authorities entirely for geographic reconciliation:
- The reconciliation with Geonames within LCNAF/LCSH is present, but it adds a second level of work that undermines my goal of making a helpful, fast, error-averse OpenRefine recon service. This is not to say linked authorities data shouldn’t have these cross-file links; of course it should, but also read my bit below on descriptive versus authority record contents.
- The hierarchies in LCNAF/LCSH are…lacking. I’d like to know that, for example, Richmond (Va.) is in Virginia (yes, I know it says Va. in that original heading, but where is the link to the id.loc.gov record to Virginia? It’s not there), which is in United States, etc. etc. Geonames has this information captured.
- When there are coordinates, even if matched with Geonames, they are often stored in a mads:citation-note, without machine-readable data on how the coordinates are encoded. I know I want to pull ISO 6709, but I don’t want to have to manually check the coordinates for each record to get the information from the right statement and verify the encoding.
Note: I’d really love to pull the Library of Congress Name Authority File linked open dataset from id.loc.gov and test what my limited experience has led me to believe about LCNAF lacking consistent Geonames matching, coordinates, and hierarchies - particularly for international geographic names, as my own work leads me to work mostly with geographic names from the United States.
Note: Because I don’t think the Library of Congress Authorities are currently the best for geographic metadata DOES NOT MEAN I do not use them all the time, appreciate the work it took to build them, or think they should be deprecated or disappear. What I’d like to see is more open ways for the Library of Congress Authorities to share data and data modeling improvements with 1. the library tech community already working on this stuff and 2. other, non-traditional ‘authorities’ like Geonames that have a lot to offer. Some batch reconciliation work pulling in limited parts of existing, non-traditional ‘authorities’, without a mind to how we can pull that data out in machine-actionable reconciliation processes, hasn’t really helped boost their implementation in the new world of library data work.
Yet, I am really, really appreciative of all the work the Library of Congress folks do, I wish they weren’t so understaffed, and hell, I’d give my left arm to work there except I’m not smart enough.
Alright, moving on…why not just use the Geonames URIs and labels alone, if I feel this way about the Library of Congress Authorities and geographic terms? The simple reason is: facets. Most subject terms are being reconciled against, if they weren’t already created using, the LCNAF and LCSH vocabularies. LCSH and LCNAF make perfect sense as the remaining top choice for topical subjects and names (although there are other players in the non-traditional names authorities field, which I’ll discuss in some other post maybe). Our digital platform discovery interface, as well as our primary library catalog/all-data-sources discovery interface, are not currently built to facet geographic subjects separately from topical subjects (or from names as subjects, etc. etc.). So for the sake of good sorting, intelligible grouping, and the rest, the LC Authorities terms/labels remain the de facto standard.
Additionally, I’m not sold that the Library of Congress won’t catch on to the need to open up more to cross-referencing and using non-traditional data sources in more granular and machine-actionable ways. They seem to be at work on it, so I’d prefer to keep their URIs and labels there so reconciliation can happen with their authorities. One must mention, too, that Geonames does not store references to the corresponding concepts in the Library of Congress Authorities, so keeping just the Geonames term and URI would make later reconciliation with the Library of Congress Authorities a pain (not to mention: search ‘Knoxville Tennessee’, the preferred form for Geonames queries, in id.loc.gov, and see all the results you get that aren’t ‘Knoxville (Tenn.)’. Argh.)
What to do, what to do… well, build a Geonames recon service that takes Library of Congress Authorities headings and returns additional Geonames information, for now.

Descriptive Metadata & Authority ‘Metadata’
Let me start this section by saying that Authorities are within the realm of ‘descriptive metadata’, sure. However, when we say ‘descriptive metadata’, we normally think of what is known in cataloging parlance (for better or for worse) as bibliographic metadata. Item-specific metadata. This digital or physical resource, present in the catalog/digital collection, that we are describing so you can discover, identify, and access it (okay, okay, access metadata isn’t descriptive metadata).
What about authority data? We see a lot of authority files/vocabularies are becoming available as Linked Open Data, but how do we see these interacting with our descriptive metadata beyond generation of ‘authorized’ access point URIs and perhaps some reconciliation, inference, tricks of the discovery interface via the linking and modeling? The Linked Open Data world is quickly blurring the demarcation between authority and non-authority, in my opinion - and I find this really exciting.
So, returning to geospatial metadata, it is not my preference to store coordinates, label, URI(s) - maybe even multiple URIs if I really want to make sure I capture both Geonames and LCNAF - in the descriptive record access point. That’s to say, I’m not terribly excited that, in MODS/XML, this is how I handle the geospatial metadata involved presently:

<mods:mods>
  <mods:subject>
    <mods:geographic authority="naf" valueURI="http://id.loc.gov/authorities/names/n79109786">Knoxville (Tenn.)</mods:geographic>
    <mods:cartographics>
      <mods:coordinates>35.96064, -83.92074</mods:coordinates>
    </mods:cartographics>
  </mods:subject>
</mods:mods>
Sure, that works, but it can be hell to parse for advanced uses in the discovery layer, as well as to reconcile/create in the descriptive metadata records in batch, and to update as information changes. Also, where is the Geonames authority/URI? Can we - and more importantly, should we - repeat the authority and valueURI attributes? Break the MODS validation and apply perhaps an authority attribute to the coordinates element, stating from where we retrieved that data? Where is the attribute on either cartographics or coordinates stating what standard the coordinates follow, so a machine parsing this can know?
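To make the parsing friction concrete, here is a sketch of pulling that one access point out of the MODS snippet above with the standard library; the namespace handling alone adds overhead, and nothing tells the code what standard the coordinates follow:

```python
# Extract the geographic term, its URI, and coordinates from the MODS
# snippet shown above, using only the standard library.
import xml.etree.ElementTree as ET

MODS = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject>
    <mods:geographic authority="naf"
        valueURI="http://id.loc.gov/authorities/names/n79109786">Knoxville (Tenn.)</mods:geographic>
    <mods:cartographics>
      <mods:coordinates>35.96064, -83.92074</mods:coordinates>
    </mods:cartographics>
  </mods:subject>
</mods:mods>"""

NS = {"mods": "http://www.loc.gov/mods/v3"}
root = ET.fromstring(MODS)
for subject in root.findall("mods:subject", NS):
    geo = subject.find("mods:geographic", NS)
    coords = subject.find("mods:cartographics/mods:coordinates", NS)
    # Nothing here says the coordinates are ISO 6709, or where they came from.
    print(geo.get("valueURI"), geo.text, coords.text)
```

And this is the easy case: one subject, one set of coordinates, no repeated authority attributes.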
Also, more fundamentally, how much of this should be statements in an Authority record? Wouldn’t you rather have (particularly if you’re soon going to MODS/RDF or perhaps another model in RDF that is actually working at present) something that just gives the locally-preferred label and a valueURI to 1 authority source that can then link to other authority sources in a reliable and efficient manner? Perhaps link to the URI for the Geonames record, then use the Library of Congress Authorities Geonames batch matching to pull the appropriate, same as Library of Congress Authority record that way.
So this is something I’ve been thinking about and working on a lot lately: creating an intermediate, local datastore that handles authority data negotiation. Instead of waiting for LC Authorities to add missing terms to their database (like Cherokee town names, or Southern Appalachia cross-references for certain regional plants, or whatever), or parseable coordinates from Geonames, or for Geonames to add LC Authorities preferred terms or URIs, or whatever other authority you’d like to work with but has XYZ issues, have a local datastore working based off an ontology that is built to interact with the chosen authorities you want to expand upon, links to their records, but puts in your local authority information too. It is a bit of a pipe dream at the moment, but I’ve had some small luck building such a thing using Skosmos, LCNAF, LCSH, Geonames, and Great Smoky Mountain Regional Project vocabularies. We’ll see if this goes anywhere.
Basically, returning to the point of this post, I want the authority data to store information related to the access point mentioned in the descriptive record, not the descriptive record storing all the information. There are data consistency issues as mentioned, as well as the need then for discovery interfaces being built for ever more complex data models (speaking of XML).
However, for the time being, the systems I work with are not great at this Authority reconciliation, so I put it as consistently as I can all in that MODS element(s).
I should note, as a final note I think for this post, that I do not add these URIs or other identifiers as ‘preparation for RDF’. Sure, it’ll help, but I’m adding these URIs and identifiers because text matching has many flaws, especially when it comes to names.

Things to follow up on:
- Getting an OpenRefine Recon Service call together
- Discussing some of the geographic data models out there, as well as what a person working with something other than MODS can do with geographic or other complex authorized access point data
- A million other things under the sun.
The Senate made great strides this week to ensure needed reform to the Elementary and Secondary Education Act (ESEA). After much debate and across the aisle discussion, yesterday the Senate overwhelmingly passed S. 1177, the Every Child Achieves Act, by a vote of 81-17.
As we discussed in a previous post, the inclusion in the bill of the bipartisan Reed-Cochran amendment makes S. 1177 a monumental step forward for schools, their libraries and the millions of students they serve. Most fundamentally and importantly, the amendment (approved 98-0) makes explicit that ESEA funds may be used to support school libraries and “effective school library programs” in multiple ways.
As detailed in ALA’s recent press statement, “The Every Child Achieves Act of 2015 contains several provisions in support of libraries, including state and local planning requirements related to developing effective school library programs and digital literacy skills; professional development activities for school librarians; partnership opportunities for libraries; and competitive grants for developing and enhancing effective school library programs.”
Now that both the House (H.R. 5) and the Senate have completed their bills, the next step will be the appointment of members from both chambers to a conference committee to reconcile differences between the two pieces of legislation. That new bill then must be approved again by both the House and Senate.
Although we do not anticipate this happening before the fall, please do stay tuned and watch for legislative alerts! Your voices will be needed at that time to remind your Members of Congress about the importance of school libraries and how essential it is that the provisions supporting school libraries remain in the final bill.
The post Victory for school libraries as Senate passes education bill appeared first on District Dispatch.
To buy, or not to buy–that is the question:
Whether ’tis nobler in the end to suffer
The slings and arrows of outrageous vendors
Or to take up coding against a sea of problems
And by opposing end them. To code, to commit–
No, more–and by a bug fix to say we end
The heartache, and the thousand natural shocks
That our code is heir to. ‘Tis a consummation
Devoutly to be wished. — With deepest apologies to William Shakespeare
I bet you didn’t know that Hamlet was a librarian. A librarian who was just as pinned on the horns of a software sourcing dilemma as many of us are today. This was one of the takeaways from the OCLC Research Library Partnership June meeting in San Francisco. Previous posts have summarized other aspects of the discussion.
The classic “build vs. buy” debate is certainly not unique to libraries or the institutions of which they are a part, but we all must wrestle that monster. When are vendor solutions good enough? If there is no good solution, how soon might there be one? What happens if we make a major investment in developing our own software solutions and they are eclipsed by the market in 3-5 years? Do we really want to “build tomorrow’s legacy systems today?” (quoted by attendee David Seaman, who is leaving Dartmouth to become the Dean of Libraries and University Librarian at Syracuse University). What if no vendor understands our problems and no commercial solution will ever be good enough?
It was clear from the presentations that this decision employs a unique calculus that takes a wide variety of factors into account. Here are just a few:
- Local ability (staff with coding skill).
- Available development resources.
- Having a problem worth solving that remains unsolved by any commercial option.
- The potential, or lack thereof, of the commercial market solving your problem.
- Whether keeping control over one’s data is important.
- Whether providing leadership in a new area is important to your institution.
For UCLA, Ginny Steele reported, their answers to questions like those above led them to develop their own academic information and profile system. Dubbed Opus, it is “the information system of record for academic appointees at UCLA” and is just now being rolled out, with additional features to come. It will be interesting to see how it works, and how it compares to any similar commercial systems.
A “middle ground” solution that a number of libraries are adopting, including the Dartmouth Library as reported by David Seaman, is to come together with other institutions to collaborate on software development. One of the most robust of such efforts in the library space is the Hydra Project which Dartmouth chose to join. Such collective efforts have many of the benefits of writing your own solutions while minimizing some of the drawbacks.
Whichever path you choose to take, it was also clear from a number of speakers that identifiers are an important part of a modern technological infrastructure. Identifiers can serve as a kind of “glue” that enables disambiguation and linkages to other data sources, among other benefits. Wouter Gerritsma, Manager Digital Services & Innovation at VU University Amsterdam, probably made these points as well as any of the speakers that day, although the need for identifiers wound like Ariadne’s thread through the meeting.

About Roy Tennant
Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.
The primary conclusion was that the models of FRBR and BIBFRAME, with their separation of bibliographic information into distinct entities, are too inflexible for general use. There are simply too many situations in which either the nature of the materials or the available metadata simply does not fit into the entity boundaries defined in those models. This is not news -- since the publication of FRBR in 1998 there have been numerous articles pointing out the need for modifications of FRBR for different materials (music, archival materials, serials, and others). The report of the audio-visual community to BIBFRAME said the same. Similar criticisms have been aimed at recent generations of cataloging rules, whose goal is to provide uniformity in bibliographic description across all media types. The differences in treatment that are needed by the various communities are not mutually compatible, which means that a single model is not going to work over the vast landscape that is "cultural heritage materials."
At the same time, folks in this week's informal discussion were able to readily cite use cases in which they would want to identify a group of metadata statements that would define a particular aspect of the data, such as a work or an item. The trick, therefore, is to find a sweet spot between the need for useful semantics and the need for flexibility within the heterogeneous cultural heritage collections that could benefit from sharing and linking their data amongst them.
One immediate thought is: let's define a core! (OK, it's been done, but maybe that's a different core.) The problem with this idea is that there are NO descriptive elements that will be useful for all materials. Title? (seems obvious) -- but there are many materials in museums and archives that have no title, from untitled art works, to museum pieces ("Greek vase"), to materials in archives ("Letter from John to Mary"). Although these are often given names of a sort, none have titles that function to identify them in any meaningful way. Creators? From anonymous writings to those Greek vases, not to mention the dinosaur bones and geodes in a science museum, many things don't have identifiable creators. Subjects? Well, if you mean this to be "topic" then again, not everything has a topic; think "abstract art" and again those geodes. Most things have a genre or a type, but standardizing on those alone would hardly reap great benefits in data sharing.
The upshot, at least the conclusion that I reach, is that there are no universals. At best there is some overlap between (A & B) and then between (B & C), etc. What the informal group that met this week concluded is that there is some value in standardizing among like data types, simply to make the job of developers easier. The main requirement overall, though, is to have a standard way to share one's metadata choices, not unlike an XML schema, but for the RDF world. Something that others can refer to or, even better, use directly in processing data you provide.
Note that none of the above means throwing over FRBR, BIBFRAME, or RDA entirely. Each has defined some data elements that will be useful, and it is always better to re-use than to re-invent. But the attempts to use these vocabularies to fix a single view of bibliographic data are simply not going to work in a world as varied as the one we live in. We limit ourselves greatly if we reject data that does not conform to a single definition rather than making use of connections between close but not identical data communities.
There's no solution being offered at this time, but identifying the target is a good first step.
The Digital Public Library of America is pleased to announce the appointment of Niko Pfund to its distinguished Board of Directors. Pfund is the Global Academic Publisher of Oxford University Press, and President of Oxford University Press, USA.
“The principles of education, dissemination, and access that lie at the core of DPLA’s mission are unimpeachable,” Pfund said. “I’m very pleased to be involved in such a valuable, even noble, enterprise.”
Pfund, a graduate of Amherst College, began his career at Oxford in 1987 as an editorial assistant in law and social science before moving to New York University Press in 1990. At NYU Press, he was an editor and then editor in chief before becoming director in 1996. He returned to Oxford in 2000 in the role of Academic Publisher and is responsible for oversight of the Press’s scholarly and research publishing across the humanities, social sciences, science, law, and medicine, spanning the Press’s offices in Oxford, New York, and Delhi. A frequent speaker on publishing, scholarship, and media, he has given talks at the Library of Congress, the World Bank, and many colleges, universities, libraries, scholarly conferences, and publishing institutes. In 2012, Pfund was named a Notable Person of the Year by Publishers Weekly.
“Niko’s broad knowledge of publishing and the media will contribute significantly to advancing DPLA’s commitment to providing content of interest to people of all ages and to everyone from scholars to school age children,” said Amy Ryan, President of the Board of Directors.
“We couldn’t be happier that Niko Pfund has agreed to join the DPLA board,” said Dan Cohen, DPLA’s Executive Director. “To have a publisher of his great experience and range will be a tremendous asset to our organization.”
Working closely with Cohen, the Board seeks to fulfill DPLA’s broad commitment to openness, inclusiveness, and accessibility, and it endeavors towards those ends in the best interest of its stakeholders, employees, future users, and other affected parties. The Board supports the DPLA’s goal of creating and maintaining a free, open, and sustainable national digital library resource.
Full biographies of the entire DPLA Board of Directors can be found at http://dp.la/info/about/board-committees/.
The MARC data structure, and the AACR2 rules that usually accompany it, are strange beasts. Every once in a while I’m asked why I get so frustrated with them, and I explain that there are things — strange things — that I have to deal with by writing lots of code when I could be spending my time trying to improve relevancy ranking or extending the reporting tools my librarians use to make decisions that affect patrons and their access.
This is one of those tales.
I’m a systems librarian, which in my case means that I deal with MARC metadata pretty much all day, every day. Coming from outside the library world, it took me a while to appreciate the MARC format and how we store data in it, where appreciate can be read as hate hate hate hate hate.
I find it frustrating to deal with data typed into free-text fields all willy-nilly with never a thought for machine readability, where a question like what is the title is considered a complicated trap, and where the word unique, when applied to identifiers, has to have air quotes squeezing it so hard that the sarcasm drips out of the bottom of the ‘q’ in a sad little stream of liquid defeat.
One of the most frustrating things, though, is when a cataloger has clearly worked hard to determine useful information about a work and then has nowhere to put those data. To wit: date of publication.
Many programmers have to deal with timestamps, with all the vagaries of time zones, leap years, leap seconds, etc. In contrast, you’d think that the year in which something was published wouldn’t be fraught with ambiguity and intrigue, but you’d be wrong. Dates are spread out over MARC records in several places, often in unparsable free-text minefields (I’m looking at you, enumeration/chronology) and occasionally in different calendars.
The most “reliable” dates (see? there are those air-quotes again!) live in the 008 fixed field. Of course, they mean different things depending on format determination and so on, but generally you get four bytes to put down four ASCII characters representing the year. When you don’t know all the digits of the year exactly, you substitute a u for the unknown numbers.
- 1982 — published in 1982
- 198u — published sometime in the 1980s
- 19uu — published between 1900 and 1999
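The u-substitution at least decodes mechanically. Here is a minimal sketch (the function name and error handling are mine, not anything from the MARC spec) that turns a 008 date into the year range it actually asserts:

```python
def decode_008_date(raw: str) -> tuple[int, int]:
    """Decode a four-character MARC 008 date with 'u' placeholders
    into an inclusive (earliest, latest) year range."""
    if len(raw) != 4:
        raise ValueError(f"expected 4 characters, got {raw!r}")
    # Unknown digits span the full range: 'u' is 0 at the low end, 9 at the high end.
    earliest = int(raw.replace("u", "0"))
    latest = int(raw.replace("u", "9"))
    return earliest, latest

print(decode_008_date("1982"))  # (1982, 1982)
print(decode_008_date("198u"))  # (1980, 1989)
print(decode_008_date("19uu"))  # (1900, 1999)
```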
So, that’s fine. Except that it isn’t. It’s dumb. It made sense to someone at the time to only allow four bytes, because bytes were expensive. But those days have been gone for decades, and we still encode dates like this, despite the fact that having actual start and endpoints for a known range would be better in every way.
Look at what we lose!
- 1982 or 1983 — 198u (ten years vs. two)
- Between 1978 and 1982 — 19uu (one hundred years vs. five)
- Between the Civil War and WWI — 1uuu (one thousand years vs. about fifty)
The other day, in fact, I came across this date: 2uuu.
Yup. The work was published sometime between 2000 and 2099. My guess is that it was narrowed down to, say, 2009-2011 and this is what we were stuck with. I’d bet big money that its date of publication isn’t, say, after 2016, unless time travel gets invented in the next few years.
But the MARC format works against us, and once again we throw data away because we don’t have a good place to store it, and I’m spending my time trying to figure out a reasonable maximum based on the current date or the date of cataloging or whatnot when it could have just been entered at the time.
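That “reasonable maximum” computation can be sketched like so (the function name and the clamping rule are my own illustration, not anything the MARC format specifies): decode the u-digits to a numeric range, then cap the upper bound at the current year, since a work in hand can’t have been published in the future.

```python
from datetime import date

def plausible_range(raw_008_date: str) -> tuple[int, int]:
    """Turn a MARC 008 date like '2uuu' into the narrowest plausible
    year range: u's become 0/9 digits, and the upper bound is capped
    at the current year (a work in hand can't be from the future)."""
    earliest = int(raw_008_date.replace("u", "0"))
    latest = int(raw_008_date.replace("u", "9"))
    return earliest, min(latest, date.today().year)

print(plausible_range("2uuu"))  # (2000, <current year>) instead of (2000, 2999)
```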
As much as we’d like to pretend otherwise, no one is ever going to go back and re-catalog everything. I can almost stomach the idea that we did this thirty years ago. It drives me crazy that we’re still doing it today.
How about it, library-nerd-types? What do you spend your time dealing with that should have been dealt with at another place in the workflow? [Image: Calendary Calculator from Nuremberg, 1588; Germanic National Museum in Nuremberg. By Anagoria (Own work) [GFDL or CC BY 3.0], via Wikimedia Commons.]
Over the past weekend I participated in a Twitter conversation on the topic of meaning, data, transformation, and packaging. The conversation is too long to repost here, but searching for @metadata_maven over July 11-12 should pick most of it up. Aside from my usual frustration at the message limitations in Twitter, there seemed to be a lot of confusion about what exactly we mean by ‘meaning’ and how it gets expressed in data. I had a Skype conversation with @jonphipps about it, and thought I could reproduce that here, in a way that could add to the original conversation, perhaps clarifying a few things. [Probably good to read the Twitter conversation ahead of reading the rest of this.]
Jon Phipps: I think the problem that the people in that conversation are trying to address is that MARC has done triple duty as a local and global serialization (format) for storage, supporting indexing and display; a global data interchange format; and a focal point for creating agreement about the rules everyone is expected to follow to populate the data (AACR2, RDA). If you walk away from that, even if you don’t kill it, nothing else is going to be able to serve that particular set of functions. But that’s the way everyone chooses to discuss bibframe, or schema.org, or any other ‘marc replacement’.
Diane Hillmann: Yeah, but how does ‘meaning’ merely expressed on a wiki page help in any way? Isn’t the idea to have meaning expressed with the data itself?
Jon Phipps: It depends on whether you see RDF as a meaning transport mechanism or a data transport mechanism. That’s the difference between semantic data and linked data.
Diane Hillmann: It’s both, don’t you think?
Jon Phipps: Semantic data is the smart subset of linked data.
Diane Hillmann: Nice tagline
Jon Phipps: Zepheira, and now DC, seem to be increasingly looking at RDF as merely linked data. I should say a transport mechanism for ‘linked’ data.
Diane Hillmann: It’s easier that way.
Jon Phipps: Exactly. Basically what they’re saying is that meaning is up to the receiver’s system to determine. dc:title of ‘Mr.’ is fine in that world–it even validates according to the ‘new’ AP thinking. It’s all easier for the data producers if they don’t have to care about vocabularies. But the value of RDF is that it’s brilliantly designed to transport knowledge, not just data. RDF data is intended to live in a world where any Thing can be described by any Thing, and all of those descriptions can be aggregated over time to form a more complete description of the Thing Being Described. Knowledge transfer really benefits from Semantic Web concepts like inferences and entailments and even truthiness (in addition to just validation). If you discount and even reject those concepts in a linked data world, then you might as well ship your data around as CSV or even SQL files and be done with it.
One of the things about MARC is that it’s incredibly semantically rich (marc21rdf.info) and has also been brilliantly designed by a lot of people over a lot of years to convey an equally rich body of bibliographic knowledge. But throwing away even a small portion of that knowledge in pursuit of a far dumber linked data holy grail is a lot like saying that since most people only use a relatively limited number of words (especially when they’re texting) we have no need for a 50,000 word, or even a 5,000 word, dictionary.
MARC makes knowledge transfer look relatively easy because the knowledge is embedded in a vocabulary every cataloger learns and speaks fairly fluently. It looks like it’s just a (truly limiting) data format so it’s easy to think that replacing it is just a matter of coming up with a fresh new format, like RDF. But it’s going to be a lot harder than that, which is tacitly acknowledged by the many-faceted effort to permanently dumb-down bibliographic metadata, and it’s one of the reasons why I think bibframe.org, bibfra.me, and schema.org might end up being very destructive, given the way they’re being promoted (be sure to Park Your MARC somewhere).
[That’s why we’re so focused on the RDA data model (which can actually be semantically richer than MARC), why we helped create marc21rdf.info, and why we’re working at building out our RDF vocabulary management services.]
Diane Hillmann: This would be a great conversation to record for a podcast
Jon Phipps: I’m not saying proper vocabulary management is easy. Look at us, for instance: we haven’t bothered to publish the OMR vocabs and only one person has noticed (so far). But they’re in active use in every OMR-generated vocab.
The point I was making was that we’re no better, as publishers of theoretically semantic metadata, at making sure the data was ‘meaningful’ by making sure that the vocabs resolved, had definitions, etc.
[P.S. We’re now working on publishing our registry vocabularies.]
DuraSpace News: NEW at the 2015 VIVO Conference–Linked Open Data Contest: Early Registration Ends TODAY
Winchester, MA – VIVO is great at sharing data. Do you have data to share? Enter the VIVO Linked Open Data Contest. The winning team will be selected based on five star linked open data criteria, and will receive podium recognition at the conference.
Here’s the lunch box I received today from Shako Club.
I applied to receive a bento box a couple of months ago. The application process was a slightly odd questionnaire that I had some trouble answering. I don’t often get songs stuck in my head and it’s hard to pick my absolute favourite story from my childhood. We were told that our bento contents would be determined by the answers to this questionnaire.
The theme of land, sea, mountains is represented here with:
– top left (land) – chicken karaage and half a boiled egg on lettuce, with two perfect crunchy cucumber sticks underneath
– top right (sea) – red bean jelly made with kanten, with sansho leaves and a wee piece of candied ginger. There was a sliced strawberry hidden under the paper cup that held the jelly.
– bottom right (mountain) – veggie gyoza made with okara, and spinach goma-ae
– bottom left – rice with umeboshi
It’s in a gorgeous handmade maple box that’s been oiled with a cute Shako Club stamp on the bottom.
I sat down and Tazuko and I introduced ourselves to each other. There was also a translator who I didn’t introduce myself to until halfway through, which I feel was a bit rude of me. Tazuko talked a little bit about the process that they went through to make the bentos and then invited me to take the lid off and look. She explained the different ingredients and elements of this gorgeous lunch box. I was already familiar with the Japanese ingredients: okara (byproduct of making tofu), sansho leaves and kanten (agar agar).
She asked me if I liked Japanese food and I explained that I’m half Japanese and love Japanese food. I told her that karaage is my favourite and that I have really fond memories of the Japanese food that my Grandma used to make when we would visit each summer. Tazuko told me more about her history. She was born between Osaka and Nara in the mountains, and during the war their family fled their home to Yokohama.
She talked about the Japanese Canadian internment and the impact that WWII had on many Japanese and Japanese Canadian people. She talked about only having rice and umeboshi for lunch when she was a kid. I know how poor Japan was after the war and that for many people this is all they could afford, but hearing this truth from someone I had just met was really emotional for me. I was so touched by how much she was sharing about her life with me, a complete stranger. I was also overcome with how lucky and privileged I am right now. I was blinking back tears, and then I really started crying, which didn’t seem to faze her or the translator. I had forgotten this cultural difference: in Japan it’s generally not seen as embarrassing to cry when you are extremely moved. In Canada I find that we don’t know what to do when people cry. We are generally uncomfortable with tears and “negative” emotions.
We chatted a bit more and I learned that she came to Canada 40 years ago and married a Nisei Japanese man. I was curious if she had kids but didn’t want to pry, so I didn’t ask.
We were asked to bring something small to gift back to the person we received the lunch box from. In my questionnaire I said that one of my hobbies is gardening. I ended up with a bunch of volunteer purple shiso plants in my community garden plot, so I repotted one of these, and I also brought one of the first cloves of garlic I had ever grown this past year. After all, who doesn’t like garlic? From living in Japan I also know that gifts that can be consumed are often better. Tazuko and I chatted a bit about the connection between the umeboshi in the bento and the purple shiso that I gave her—purple shiso is what gives umeboshi its colour.
We chatted a bit more. I took a few pictures of Tazuko and the bento she had made and then Cindy Mochizuki came by and said that Tazuko is her mom. Cindy is the artist responsible for this project and someone I’ve been getting to know better over the past year. It was awesome to find out that this amazing woman is her mom. If I had asked if she had kids earlier in the conversation I would have learned this.
I biked down to the seawall, enjoyed my lunch box, and reflected on some relationships with work colleagues over the past month. I’ve delighted in a bunch of work relationships shifting to be more open and honest, with other people demonstrating courage in sharing things about themselves, including mental illness, learning disabilities, gender identity, sexuality, neurodiversity, and personal insecurities that are incongruent with how I see them professionally. None of these people needed to disclose these things about themselves, but doing so made it easier for me to understand how they operate and gave me a glimpse of what they might be going through. To me these are acts of courage because they involve unpacking stigma and shame, which is a revolutionary act that gives us all a little more room to breathe freely.
Rachel Vacek announced, at her LITA President’s program at the ALA Annual Conference in San Francisco, the winners of the annual scholarships LITA sponsors jointly with three organizations: Baker & Taylor, LSSI, and OCLC. These scholarships are for master’s level study, with an emphasis on library technology and/or automation, at a library school program accredited by the American Library Association. LITA, the Library and Information Technology Association, is a division of the American Library Association.
This year’s winner of the LITA/Christian Larew Memorial Scholarship ($3,000), sponsored by Baker & Taylor, is Andrew Meyer, who will pursue his studies at the University of Illinois at Urbana-Champaign. The LITA/LSSI Minority Scholarship ($2,500) winner is Jesus Espinoza, who will pursue his studies at the University of Illinois at Urbana-Champaign. Young-In Kim, the winner of the LITA/OCLC Minority Scholarship ($3,000), will pursue her studies at Syracuse University.
Criteria for the scholarships include previous academic excellence, evidence of leadership potential and a commitment to a career in library automation and information technology. Two of the scholarships, the LITA/LSSI Minority Scholarship and LITA/OCLC Minority Scholarship, also require U.S. citizenship and membership in one of four minority groups: American Indian or Alaskan Native, Asian or Pacific Islander, African-American, or Hispanic.
About LITA
Established in 1966, the Library and Information Technology Association is the leading organization reaching out across types of libraries to provide education and services for a broad membership. The membership includes new professionals, web services librarians, systems librarians, digital initiatives librarians, library administrators, library schools, vendors and anyone else interested in leading edge technology and applications for librarians and information providers.
For more information, visit www.lita.org.
Ask around and you’ll hear that data is the new bacon (or turkey bacon, in my case. Sorry, vegetarians). It’s the hot thing that everyone wants a piece of. It is another medium with which we interact and from which we derive meaning. It is information; potentially valuable and abundant. But much like [turkey] bacon, un-moderated gorging, without balance or diversity of content, can raise blood pressure and give you a heart attack. To understand how best to interact with the data landscape, it is important to look beyond it.
What do academic libraries need to know about data? A lot, but in order to separate the signal from the noise, it is imperative to look at the entire environment. To do this, one can look to job postings as a measure of engagement. The data curation positions, research data services departments, and data management specializations focus almost exclusively on digital data. However, these positions, which are often catch-alls for many other things, do not place the data management and curation activities within the larger frame of digital curation, let alone scholarly communication. Missing from job descriptions is an awareness of digital preservation or archival theory as it relates to data management or curation. In some cases, this omission could be because a fully staffed digital collections department has purview over these areas. Nonetheless, it is important to articulate the need to communicate with those stakeholders in the job description. It may be said that if the job ad discusses data curation, digital preservation should be an assumed skill, yet given the tendencies to have these positions “do-all-the-things,” it is negligent not to explicitly mention it.
Digital curation is an area that has wide appeal for those working in academic and research libraries. The ACRL Digital Curation Interest Group (DCIG) has one of the largest memberships within ACRL, with 1075 members as of March 2015. The interest group was intentionally named “digital curation” rather than “data curation” because the founders (Patricia Hswe and Marisa Ramirez) understood the interconnectivity of the domains and that the work in one area, like archives, could influence the work in another, like data management. For example, the work from Digital POWRR can help inform digital collection platform decisions or workflows, including data repository concerns. This Big Tent philosophy can help frame the data conversations within libraries in a holistic, unified manner, where the various library stakeholders work collaboratively to meet the needs of the community.
The absence of a holistic approach to data can result in the propensity to separate data from the corpus of information for which librarians already provide stewardship. Academic libraries may recognize the need to provide leadership in the area of data management, but balk when asked to consider data a special collection or to ingest data into the institutional repository. While librarians should be working to help the campus community become critical users and responsible producers of data, the library institution must empower that work by recognizing this as an extension of the scholarly communication guidance currently in place. This means that academic libraries must incorporate the work of data information literacy into their existing information literacy and scholarly communication missions, else risk excluding these data librarian positions from the natural cohort of colleagues doing that work, or risk overextending the work of the library.
This overextension is most obvious in the positions that seek a librarian to do instruction in data management, reference, and outreach, and also provide expertise in all areas of data analysis, statistics, visualization, and other data manipulation. There are some academic libraries where this level of support is reasonable, given the mission, focus, and resourcing of the specific institution. However, considering the diversity of scope across academic libraries, I am skeptical that the prevalence of job ads that describe this suite of services is justified. Most “general” science librarians would scoff if a job ad asked for experience with interpreting spectra. The science librarian should know where to direct the person who needs help with reading the spectra, or finding comparative spectra, but it should not be a core competency to have expertise in that domain. Yet experience with SPSS, R, Python, statistics and statistical literacy, and/or data visualization software finds its way into librarian position descriptions, some more specialized than others.
For some institutions this is not an overextension, but just an extension of the suite of specialized services offered, and that is well and good. My concern is that academic libraries, feeling the rush of an approved line for all things data, begin to think this is a normal role for a librarian. Do not mistake me, I do not write from the perspective that libraries should not evolve services or that librarians should not develop specialized areas of expertise. Rather, I raise a concern that too often these extensions are made without the strategic planning and commitment from the institution to fully support the work that this would entail.
Framing data management and curation within the construct of scholarly communication, and its intersections with information literacy, allows for the opportunity to build more of this content delivery across the organization, enfranchising all librarians in the conversation. A team approach can help with sustainability and message penetration, and moves the organization away from the single-position skill and knowledge-sink trap. Subject expertise is critical in the fast-moving realm of data management and curation, but it is an expertise that can be shared and that must be strategically supported. For example, with sufficient cross-training liaison librarians can work with their constituents to advise on meeting federal data sharing requirements, without requiring an immediate punt to the “data person” in the library (if such a person exists). In cases where there is no data point person, creating a data working group is a good approach to distribute across the organization both the knowledge and the responsibility for seeking out additional information.
Data specialization cuts across disciplinary bounds and concerns both public services and technical services. It is no easy task, but I posit that institutions must take a simultaneously expansive yet well-scoped approach to data engagement – mindful of the larger context of digital curation and scholarly communication, while limiting responsibilities to those most appropriate for a particular institution.
Lest the “data-information-knowledge-wisdom” hierarchy (DIKW) torpedo the rest of this post, let me encourage readers to allow for an expansive definition of data: one that allows for the discrete bits of data that have no meaning without context, such as a series of numbers in a .csv file, and for the data that is described and organized, such as those exact same numbers in a .csv file but with column and row descriptors and perhaps an associated data dictionary file. Undoubtedly, the second .csv file is more useful and could be classified as information, but most people will continue to call it data.
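As a concrete (and entirely made-up) illustration of that distinction, here are the same numbers with and without a column descriptor:

```python
import csv
import io

# The same three observations, twice over: the first file is bare "data",
# the second adds a column descriptor and edges toward "information".
bare = "12.1\n13.4\n11.9\n"
described = "temp_c\n12.1\n13.4\n11.9\n"

numbers = [float(v) for (v,) in csv.reader(io.StringIO(bare))]
rows = list(csv.reader(io.StringIO(described)))
header, values = rows[0][0], [float(v) for (v,) in rows[1:]]

print(numbers)         # [12.1, 13.4, 11.9] — but 12.1 of what?
print(header, values)  # temp_c [12.1, 13.4, 11.9]
```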
Yasmeen Shorish is assistant professor and Physical & Life Sciences librarian at James Madison University. She is a past-convener for the ACRL Digital Curation Interest Group and her research focus is in the areas of data information literacy and scholarly communication.
The future of libraries and publishers attracts a lot of debate and writing. But what have we learned overall from the efforts to date? This question of synthesis and looking ahead is the theme of the “Special Reports” in the just-released 2015 Library and Book Trade Almanac (LBTA), formerly known as the Bowker Annual, for which I served as the consulting editor.
In her article “Re-thinking the Roles of U.S. Libraries,” Larra Clark brings together three major activities that focus on the future of libraries. Amy K. Garmer leads the Aspen Institute effort on the future of libraries, whereas Miguel Figueroa is the founding director of the new Center for the Future of Libraries of the American Library Association (ALA). An important thinker on the future of public libraries is consultant Roger E. Levien, who previews some thoughts from his forthcoming book. But however impressive and insightful our learning may be, it is useless unless decision makers understand the roles and capacities of future libraries—active communication and a proactive stance with these decision makers are essential to ensure that libraries may fully contribute to communities and campuses in the decades to come.
School libraries represent an important library segment in the midst of fundamental change. Christopher Harris and Barbara K. Stripling, in “School Libraries Meet the Challenges of Change,” explain how school libraries (and librarians) stand at the digital crossroads of schools. School libraries are well placed to keep up with advances in digital technology and services and disseminate and incorporate them into pedagogical practice. Also, school libraries and librarians are centrally situated to coordinate digital resources to maximize their efficient and effective use, with the goal of empowering all students.
Digital preservation is a topic of great importance to the library community, as discussed by Melissa Goertzen, Robert Wolven, and Jeffrey D. Carroll, who focus on ebooks. While the emphasis remains the long-term stewardship of the nation’s cultural heritage, the digital format changes the rules and creates urgency: unlike analog materials, digital files may not survive long enough for retroactive preservation without proper action now. To move forward, libraries will need to build more effective relationships with publishers and other key stakeholders.
Finally, James LaRue discusses “publishing” in the context of contemporary libraries. In “New Publishing and the Library: E-books, Self-publishing, and Beyond,” he explores a number of the newer directions in libraries such as libraries as publishers; collecting self-published works; 3D printers; and libraries as booksellers. LaRue also reviews the problems with current business models for the library ebook market, especially those centered around high prices for the more popular titles. Despite the challenges, LaRue is optimistic for libraries as they navigate through the disruption in the publishing marketplace.
I appreciate the opportunity to serve as the consulting editor for this edition of the Library and Book Trade Almanac. It is quite the challenge to select the few topics for focus from the many possibilities. It is then a further challenge to identify the best authors for a given topic and then to obtain their agreement to write, as the most knowledgeable people tend to have many demands on their time. I am deeply grateful to the authors, LBTA editor Dave Bogart, and Information Today Inc. for their contributions and support in completing this Special Reports section.
Take a look at your local library!
The post Looking ahead: Special reports in the 2015 Library and Book Trade Almanac appeared first on District Dispatch.
Last updated July 16, 2015. Created by Peter Murray on July 16, 2015.
veraPDF is a purpose-built, open source file-format validator covering all PDF/A parts and conformance levels. veraPDF is designed to meet the needs of digital preservationists and is supported by the PDF software developer community.
Package Type: Data Preservation and Management
License: GPLv3
In-development releases for veraPDF:
- veraPDF - 0.2.0 15-Jul-2015