From Mike Giarlo, software architect, Stanford University Libraries, on behalf of the Hydra-in-a-Box tech team
Palo Alto, CA: Development on the Hydra-in-a-Box repository application continues, and here's our latest demo. Thanks to the Chemical Heritage Foundation and Indiana University for contributing to these sprints!
The next DPLA Board of Directors call is scheduled for Thursday, September 15 at 3:00 PM Eastern. Agenda and dial-in information are included below. This call is open to the public, except where noted. Agenda:
- [Public] Welcome, Denise and Mary
- [Public] General updates from Executive Director
- [Public] DPLAfest 2017
- [Public] Questions/comments from the public
- Executive Session to follow public portion of call
'Running on HEAD,' in the Islandora context, means running Islandora code from the actively developed 7.x branch on GitHub instead of running a stable release.
Islandora has a twice-yearly release cycle, with a new version coming out roughly every April and October (in fact, we're getting started on the next one right now!). A lot of Islandora sites run on those releases, updating when the new code is out, or sticking with an older version until there's some new bug fix or feature they have to have. Others... don't wait for those bug fixes and features. They pick them up as soon as they are merged, by running Islandora on HEAD in production.
It's an approach that has a lot of benefits (and a few pitfalls to be aware of if you're considering it for your installation). I talked with a few members of the Islandora community who take the HEAD approach, to ask them why they're on the bleeding edge:
- TJ Lewis is the COO of discoverygarden, Inc, which is the longest-running Islandora service company and has a deep history with the project. They are big supporters of running on HEAD and do so for most of their clients.
- Mark Jordan is the Head of Library Systems at Simon Fraser University, which has been running on HEAD since they launched Islandora in April 2016.
- Jared Whiklo is a Developer with Digital Initiatives at the University of Manitoba Libraries, and has made the move to HEAD quite recently.
- Jennifer Eustis is the Digital Repository Content Administrator at the University of Connecticut, which supports Islandora for several other institutions. They started running on HEAD in the Fall of 2014 and now operate on a quarterly maintenance schedule.
- Brad Spry handles Infrastructure Architecture and Programming at the University of North Carolina at Charlotte and works his Islandora updates around official releases while still taking advantage of the fixes and improvements available from HEAD.
Why run Islandora on HEAD instead of using a release?
Main advantage: Getting bug fixes, features, and improvements faster, without having to resort to manual patching of release code. It's also handy when developing custom modules, since you do not have to worry about developing against an outdated version of Islandora. Running on HEAD puts you in a position to adopt security fixes more easily, without waiting for backports.
Islandora's GitHub workflows and integrated Travis testing mean that updates to the code are well reviewed and tested before being merged, making HEAD quite stable.
Bottom line: it's safe, it's useful, and it gets you fast access to the latest Islandora goodies.
What are the drawbacks?
Not running on releases can be perceived as more risky, with some justification. New features may not be entirely complete, as Brad found with the Solr-powered collection display introduced in a recent release:
We badly needed the ability to sort certain collections, like photographs, in the exact chronological order they were taken in. To do that, we determined we needed to sort the collection by mods_titleInfo_partNumber_int. The new Solr-powered collection display enables just that, the ability to sort collections using Solr fields. However, as powerful as the new Solr-powered collection display was for solving our photo collection sorting issues, it was not feature-parity with the SPARQL (Legacy) powered collection display. The Solr-powered display omitted collection description display on each collection, and our Archivists spotted the discrepancy immediately... Given the choice of Solr-powered photo collection sorting vs. informative collection descriptions, our Archivists chose collection descriptions. So we had to revert back to SPARQL (Legacy) until the Solr-powered collection description functionality was fully realized.
Adopting new features outside of a release may also mean adopting them without complete documentation, as Jennifer discovered:
Some new features are not always explained in ways that a non developer might understand especially in terms of the consequences of turning on a new module. For us, the batch report is a good example. The batch set report once enabled doesn't really provide information that our users understand. It does store a batch set report in a temporary directory. Because our content users are busy bees ingesting night and day, our batch queue kept growing resulting in a temporary directory that didn't have any more space. This meant we had to find a way to batch delete these set reports. We decided to disable this module.
Dealing with these issues does have a bright side, as Jared noted: There is an inherent risk that code has not been fully tested for your use cases/setup and something may misbehave. With more experience, this means you can fix that issue and submit the fix back to the community to save someone else the problem.
What tools or tricks can you use to manage running on HEAD?
There are a variety of approaches to make running on HEAD safer and easier. At discoverygarden, TJ and his team manage everything with Puppet to ensure consistent development and staging environments that mimic production, so development and QA can happen before changes are pushed to production. Brad uses a set of tools designed for managing Islandora releases, the Islandora Release Manager Helper Scripts, as the basis for his own scripts: one for backing up modules, another for removing them, and a third to install them again fresh from HEAD. Mark and his team at SFU run a bash command that switches every module to the 7.x branch and runs git pull.
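Mark's switch-and-pull approach can be sketched in a few lines of Python. This is a hypothetical illustration, not anyone's actual script: the module paths and the `origin` remote name are assumptions, and a real version would want error handling and a dry-run mode.

```python
import subprocess

def head_update_commands(module_dirs, branch="7.x"):
    """Build the git commands that switch each module checkout to the
    given branch and pull the latest commits from it."""
    commands = []
    for module in module_dirs:
        commands.append((module, ["git", "checkout", branch]))
        commands.append((module, ["git", "pull", "origin", branch]))
    return commands

def run_head_update(module_dirs, branch="7.x"):
    """Execute the update, stopping at the first command that fails."""
    for cwd, args in head_update_commands(module_dirs, branch):
        subprocess.run(args, cwd=cwd, check=True)

# Hypothetical Islandora module checkouts under a Drupal site:
modules = ["sites/all/modules/islandora", "sites/all/modules/islandora_solr"]
planned = head_update_commands(modules)  # inspect before running
```

Splitting the command-building from the command-running makes it easy to review exactly what will happen to each checkout before anything touches production.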
At the University of Connecticut, where Jennifer is managing a multisite, she's found it helpful to create a common theme and module library shared by each of their Drupal instances. When they build and test new functionality independent of running on HEAD, they add those to their core module library that gets updated during their maintenance schedule. This also ensures that those extra modules work with the newest and best Islandora. They also have a dedicated maintenance schedule, complete with rigorous testing at development, staging, and production to ensure that they work out all of the wrinkles.
The University of Manitoba is too new to HEAD to have much procedure worked out, but Jared is working on some small scripts to ensure that the backend and display instances of Islandora are at the same commit points for the various modules, since they share a Fedora repository.
What do you think someone considering a move to HEAD should know before they begin? (or, what do you wish you had known?)
Brad: Learn how to watch modules on Github for changes. Learn how to file and watch issues on Islandora's JIRA. Join the Google Group, Islandora Interest Groups, and conference calls, especially the Committer's Call. Login to IRC (Editor's Note: #islandora irc channel on freenode). Participate as much as you can, build relationships, get to know people; they're human after all (at least most of them are :-) You'll want to have your finger on the pulse of Islandora. And you don't have to be a supreme programmer, you just need to be able to represent and effectively communicate your organization's needs and desires in order to make Islandora better.
Mark: It might be a bit riskier than running on releases, but Islandora development and bug fixing happens at such a fast pace that waiting for releases (even though they are frequent and well tested) seems more appropriate for a commercial product.
Jennifer: When we didn't run on HEAD and decided to run on HEAD, almost 2 years had passed. There had been so many changes. It was pretty much an entirely new Islandora that we were working on when we did our update and decided to 1st run at HEAD. The transition was an extremely difficult one for those who had to perform the update and users who had to get to know an entirely new system. Now that we're running at HEAD, these transitions are easier.
What do you need to know? We're still figuring that out but here are some in the middle thoughts:
- Ensure your administration is on board with maintenance as being required.
- Develop a maintenance schedule with a core team.
- Determine if running on HEAD also means updating Drupal, security patches, server applications, database patches, or the like.
- If you don't have a dedicated person in charge of monitoring when systems need patches, then consider a support contract.
- Realize that things change and this is OK.
- Get involved with the community. Try testing out the new releases as this will get you a first hand view of the changes.
- Have fun.
Jared: Open source code has no warranty, even code that has been “released” does not mean bug free. So a lot of the HEAD versus release discussion is about your comfort. I think your institution should have some comfort with the code; you might encounter a bug and being able to either correct it or help to diagnose it clearly means that it can be resolved, and (running on HEAD) you can update and benefit from those changes.
The Memento protocol is a straightforward extension of HTTP that adds a time dimension to the Web. It supports integrating live web resources, resources in versioning systems, and archived resources in web archives into an interoperable, distributed, machine-accessible versioning system for the entire web. The protocol is broadly supported by web archives. Recently, its use was recommended in the W3C Data on the Web Best Practices where data versioning is concerned, but resource versioning systems have been slow to adopt it. Hopefully, the investment made by the W3C will convince others to follow suit. This is a very significant step towards broad adoption of Memento. Below the fold, some details.
The specifications and the Wiki use different implementation techniques:
Memento support was added to the W3C Wiki pages by deploying the Memento Extension for MediaWiki. Memento support for W3C specifications was realized by installing a Generic TimeGate Server for which a handler was implemented that interfaces with the versioning capabilities offered by the W3C API. Herbert, Harihar Shankar, and Shawn M. Jones also have a much more detailed blog post covering many of the technical details, and the history leading up to this, starting in 2010 when:
Herbert Van de Sompel presented Memento as part of the Linked Data on the Web Workshop (LDOW) at WWW. The presentation was met with much enthusiasm. In fact, Sir Tim Berners-Lee stated "this is neat and there is a real need for it". Later, he met with Herbert to suggest that Memento could be used on the W3C site itself, specifically for time-based access to W3C specifications. Even for its inventor, getting things to happen on the Web takes longer than it takes! They conclude by stressing the importance of Link headers, a point that relates to the Signposting proposal discussed in Signposting the Scholarly Web and Improving e-Journal Ingest (among other things):
Even though the W3C maintains the Apache server holding mementos and original resources, and LANL maintains the systems running the W3C TimeGate software, it is the relations within the Link headers that tie everything together. It is an excellent example of the harmony possible with meaningful Link headers. Memento allows users to negotiate in time with a single web standard, making web archives, semantic web resources, and now W3C specifications all accessible the same way. Memento provides a standard alternative to a series of implementation-specific approaches. Both posts are well worth reading.
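To make the Link-header mechanics concrete: a Memento client asks a TimeGate for a resource as it existed at some datetime by sending an Accept-Datetime header, then follows the relations (original, timemap, memento) in the Link header of the response. Below is a rough Python sketch of the parsing side; the URIs are invented for illustration, and a production client would use a full RFC 8288 Link-header parser rather than this simplified one.

```python
import re

def parse_link_header(value):
    """Map each link relation in an HTTP Link header to its target URIs.
    Anchoring on <uri> avoids tripping over the commas inside Memento's
    datetime attributes."""
    links = {}
    for uri, attrs in re.findall(r'<([^>]+)>([^<]*)', value):
        m = re.search(r'rel="([^"]+)"', attrs)
        if m:
            for rel in m.group(1).split():
                links.setdefault(rel, []).append(uri)
    return links

# A client would send:  GET <timegate URI>  with the header
#   Accept-Datetime: Thu, 15 Aug 2013 09:00:00 GMT
# and the TimeGate's 302 response carries a Link header like this one
# (URIs invented for illustration):
header = (
    '<http://example.org/page>; rel="original", '
    '<http://archive.example/timemap/page>; rel="timemap"; '
    'type="application/link-format", '
    '<http://archive.example/20130815/page>; rel="memento"; '
    'datetime="Thu, 15 Aug 2013 09:00:00 GMT"'
)
links = parse_link_header(header)
```

Whichever system serves the response, those three relations are what let a client find the original resource, the full version history, and the archived snapshot without any implementation-specific knowledge.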
The more I work with faculty and students on integrating new technologies such as 3D printing and virtual reality into the curriculum, the more I think about ways we can measure learning for non-Information Literacy related competencies.
How do we know that students know how to use a 3D printer successfully? How can we measure the learning that occurred when they designed a file for upload into a visualization software package? While the Association of College and Research Libraries (ACRL) has taken the lead on delineating national standards for Information Literacy, and more recently updated them to the Framework for Information Literacy, there isn’t quite as much information available about designing and assessing assignments that are less traditional than the ubiquitous 3-5 page research paper. I’m not sure that we will find one set of competencies to rule them all, simply because there are so many dimensions to these areas. In one seemingly straightforward activity such as creating an online presentation, you might have elements of visual literacy, creativity, and communication, to name a few. But it would be interesting to try, so here goes!
What might an actual competency look like? Measurable learning outcomes are structured similarly no matter what the context. They have to explain:
- What the learner is able to do
- How the learner does it
- To what degree of success
ACRL has a great tool for developing these types of outcomes: http://www.ala.org/acrl/aboutacrl/directoryofleadership/sections/is/iswebsite/projpubs/smartobjectives
Applying that to a digital competency might work like this. Students will be able to create effective online presentations utilizing various free web tools by:
- Selecting appropriate images and visual media aligned with the presentation’s purpose
- Integrating images into projects purposefully, considering meaning, aesthetic criteria, visual impact, and audience
- Editing images as appropriate for quality, layout, and display (e.g., cropping, color, contrast)
- Including textual information as needed to convey an image’s meaning (e.g., using captions, referencing figures in a text, incorporating keys or legends)
- Adapting writing purpose, style, content and format to the appropriate digital context
A sample assignment that includes those competencies might be: Create a 1-3 minute presentation on a given topic, consisting of the following elements:
- Must use one of these presentation tools
- Content must be relevant to the theme
- Visual design must contain at least 3-5 images or video elements. Color scheme, layout and overall design must be consistent with the guidelines mentioned above
- All material created by someone other than the student is given attribution in citations and used according to ethical and legal best practices
A rubric could then be developed to measure how well the presentation integrates the various elements involved:
Goal: Create effective online presentations utilizing various free web tools
Outcome: Select appropriate images and visual media aligned with the presentation’s purpose
Levels:
- 0 (does not meet competency): Visual elements do not lend any value to the content and there is no overarching purpose or structure to their inclusion
- 1 (meets competency): Some images and media elements are integrated well into the presentation and align with its content and purpose
- 2 (exceeds competency): Images and visual media significantly support the content presented and are effectively integrated into the overall presentation
Benchmark: At least 75% of students score a 1 or above
As we continue to forge new digital paths, we are constantly challenged to re-define the notions of instruction, authorship and intellectual property in our ever shifting landscape of learning. I’m excited at the possibilities that digital literacy brings to student learning in this new environment, and I can only imagine the power and complexity these various assignments entail, and how much fun students (and faculty) would have in developing them.
Some additional standards to consider are:
- Digital literacy: http://www.ala.org/news/press-releases/2013/06/ala-task-force-releases-digital-literacy-recommendations
- Visual literacy: http://www.ala.org/acrl/standards/visualliteracy
- International Society for Technology in Education Standards: http://www.iste.org/standards/standards
Last month, I had lunch with two friends who are also in academia. We talked a lot about professional ambitions and “extracurricular” professional involvement. One of them is starting a new book and the other is thinking about doing consulting as a side-job. In every job I’ve had (even before librarianship), I’ve been focused on moving up in my career, whether that was new responsibilities, a promotion, or a job elsewhere, so I was always focused on doing things that might help me get there. When asked about my current professional ambitions, I realized that I didn’t have any. Or, more accurately, I didn’t have any that were not related to my current job.
The fact is, I love my job. I love what I do every day. I love the people I work with in the library, the collegial atmosphere, and their dedication to the students and faculty here. I love the academic community I’m a part of at PCC. I feel a sense of fit that’s uncanny. My major professional ambitions now center around progressing in the work I’m doing, building stronger relationships with faculty, and doing work that really helps our students be successful.
Not having ambitions toward moving up or out has, at times, made me feel weirdly adrift, especially as someone who has always felt like I wasn’t doing enough in any area of my life. I was so engaged professionally over my first decade in the profession — starting with blogging and social media, then professional writing and national service. At Norwich, that kind of engagement wasn’t required, but I did it to connect with other wonderful librarians around the world, to support things I believed in, and to build a professional network. I did a lot of unorthodox things like creating Five Weeks to a Social Library and the ALA Unconference with some amazing partners-in-crime, because I wasn’t hamstrung by a specific vision of what being professionally involved should look like. All that helped me build the professional network I have today.
Then, at Portland State, I was on the tenure track, and was required to contribute to the profession. While there wasn’t a specific list of what we should or should not do to get tenure, the assumption was that ideal involvement included publishing peer-reviewed articles, presenting at major national conferences, and serving on state or national committees. I did all of those things and enjoyed some of what I did, but I kept asking myself what I really would do if I had the freedom to choose.
And then, suddenly, I did again. And it was hard to start saying no to opportunities because, for so long, that was what made me feel good about myself: speaking at conferences, getting published, etc. I based so much of my happiness and self-esteem on things that were not very meaningful in the big picture. And I was so focused on my career, to the detriment of other aspects of my life. The past year has reminded me of what was important. This year has been soul-crushingly hard for me and my family, and I’m lucky that I could step away from a lot of my outside-of-work engagement without repercussions. I think we’re lucky to be in a profession where most librarians are understanding of people’s needs to step away and focus on their family/spouse/child/parent/health. We often have a more difficult time letting ourselves off the hook, I think. I’m working on that myself.
When I came to PCC, what I did stay engaged with was the Oregon Library Association. I love my service at the state level — the librarians in Oregon are so positive and passionate and have such an ethic of sharing and collaboration. They also are very open to new ideas, like when I and another librarian proposed creating a mentoring program. I’ve been administering the OLA mentoring program for the past three years (and this year we launched a resume review program!) and it has been really rewarding and fun. I’ve stepped away from my leadership role in the organization for the coming year, and I feel lucky that I can continue to contribute in a more limited capacity.
I have friends who are engaged professionally in many different ways. Some are loyal committee members in state, national, or international organizations. Some have taken on leadership roles in those organizations. Some are more focused on contributing to the profession through publishing and presenting. I have friends for whom writing is a passion and have published one or more books. I have friends who are annoyed by the poor quality of library research and want to produce more solid evidence-based literature. I have friends who are fantastic speakers and have engaged and inspired so many librarians by sharing their insights. Many do a combination of all these things. Some do big, visible, shiny things and others do vital work that will never get them national recognition. Some do just a little and others do more than seems possible for one person to do. The key is that they do what is a good fit for them; what makes them feel fulfilled. For many, professional involvement ebbs and flows at different points in their career, depending on other priorities. And that’s a good thing. We sometimes need to step back from things to focus on other priorities in our lives and we shouldn’t feel badly about that.
I also have friends who are not professionally involved beyond their day jobs. Many of them are active in other things, like service to their communities, and even if they’re not, that is a reasonable choice. I am involved in service to the profession because I find the work satisfying, not because I feel like it’s my obligation. Finding the things that make us happy in this life can be hard when we are bombarded with the expectations and assumptions of others. I feel like the past 12 years of my professional life have been spent trying to figure out what makes me happy, and untangling that from what I think will make others think well of me.
My advice to new librarians is to ask yourself: what makes you feel like a good librarian? What gives you satisfaction? Don’t feel like you have to follow the same path as your boss or someone you admire, that you have to join the same organizations and serve on similar committees. Find your tribe. Find your happy place. The opportunities for connecting with other librarians and giving back to the profession are only limited by your imagination. If you don’t see the sort of thing you’d like to contribute to (a conference, a service, a publication, etc.), find some like-minded people and create it! I’ve seen so many librarians do just that. If you’re tenure track, you may have to do things that aren’t a perfect fit for you, but, even then, you can usually tailor your service to the profession to things that make you feel fulfilled. I’ve been on too many committees with people who contribute nothing and are clearly only there to say that they served on x committee. Service without engagement is meaningless.
Life is so short that spending time trying to fit a mould or live up to other people’s expectations seems like a tremendous waste of time and energy. Be the professional you want to be.
Happy belated Labor Day (to those in the US) and happy Tuesday to everyone else. On my off day Monday, I did some walking (with the new puppy), a little bit of yard work, and some coding. Over the past week, I’ve been putting the plumbing into the Mac version of MarcEdit to allow the tool to support the OCLC integrations — and today, I’m putting out the first version.
Couple of notes — the implementation is a little quirky at this point. The process works much like the Windows/Linux version — once you get your API information from OCLC, you enter the info and validate it in the MarcEdit preferences. Once validated, the OCLC options will become enabled and you can use the tools. At this point, the integration work is limited to the OCLC records downloader and holdings processing tools. Screenshots are below.
Now, the quirks: when you validate, the Get Codes button (to get the location codes) isn’t working for some reason. I’ll correct this in the next couple of days. The second quirk: after you validate your keys, you’ll need to close MarcEdit and open it again to enable the OCLC menu options. The menu validation isn’t refreshing after close — again, I’ll fix this in the next couple of days. However, I wanted to get this out to folks.
Longer term: right now the direct integration between WorldCat and the MarcEditor isn’t implemented. I have most of the code done, but not all of it is complete. Again, over this week, I hope to have time to get this done.
You can get the Mac update from: http://marcedit.reeset.net/downloads
Don’t get me wrong, Data Science is really cool – trust me. I say this (tongue in cheek of course) because most of the information that you see online about Data Science is (go figure) written by Data “Scientists”. And you can usually tell – they give you verbiage and mathematical formulas that only they understand so you have to trust them when they say things like – “it can easily be proven that …” – that a) it has been proven and b) that it is easy – for them, not for you. That’s not an explanation, it’s a cop out. But to quote Bill Murray’s line from Ghostbusters: “Back off man, I’m a Scientist”.
Science is complex and is difficult to write about in a way that is easy for non-scientists to understand. I have a lot of experience here. As a former scientist, I would have difficulty at parties with my wife’s friends because when they asked me what I do, I couldn’t really tell them the gory details without boring them to death (or so I thought), so I would give them a very high level version. My wife used to get angry with me for condescending to her friends when they wanted more details and I would demur – so OK, it’s hard and I punted – I told her that I was just trying to be nice, not curmudgeonly. But if we are smart enough to conceive of these techniques, we should be smart enough to explain them to dummies. Because if you can explain stuff to dummies, then you really understand it too.

Lucy – you’ve got lots of ‘splainin’ to do
Don’t you love it when an article starts out with seemingly English-like explanations (the promise) and quickly degenerates into data science mumbo jumbo? Your first clue that you are in for a rough ride is when you hit something like this – with lots of Greek letters and other funny looking symbols:
and paragraphs like this:
“In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.” (from Wikipedia article on the “Principle of Maximum Entropy“)
Huh? Phrases like “epistemic modesty” and “stated prior data” are not what I would call “ordinary language”. I’ll take a shot at putting this in plain English later when I discuss Information Theory. “Maximum Ignorance” is a good theme for this blog though – just kidding. But to be fair to the Wikipedia author(s) of the above, this article is really pretty good and if you can get through it – with maybe just a Tylenol or two or a beverage – you will have a good grasp of this important concept. Information Theory is really cool and is important for lots of things besides text analysis. NASA uses it to clean up satellite images, for example. Compression algorithms use it too. It is also important in Neuroscience – our sensory systems adhere to it big time by just encoding things that change – our brains fill in the rest – i.e. what can be predicted or inferred from the incoming data. More on this later.
After hitting stuff like above – since the authors assume that you are still with them – the rest of the article usually leaves you thinking more about pain killers than the subject matter that you are trying to understand. There is a gap between what they know and what you do and the article has not helped much because it is a big Catch-22 – you have to understand the math and their jargon before you can understand the article – and if you already had that understanding, you wouldn’t need to read the article. Jargon and complex formulas are appropriate when they are talking to their peers because communication is more efficient if you can assume prior subject mastery and you want to move the needle forward. When you try that with dummies you crash and burn. (OK, not dummies per se, just ignorant if that helps you feel better. As my friend and colleague Erick Erickson would say, we are all ignorant about something – and you guessed it, he’s a curmudgeon too.) The best way to read these articles is to ignore the math that you would need a refresher course in advanced college math to understand (assuming that you ever took one) and just trust that they know what they’re talking about. The problem is that this is really hard for programmers to do because they don’t like to feel like dummies (even though they’re not).
It reminds me of a joke that my dad used to tell about Professor Norbert Wiener of MIT. My dad went to MIT during WWII (he was 4-F because of a chemistry experiment gone wrong when he was 12 that took his left hand) – and during that time, many of the junior faculty had been drafted into the armed forces to work on things like radar, nuclear bombs, and proximity fuses for anti-aircraft munitions. Wiener was a famous mathematician noted for his work in Cybernetics and non-linear mathematics. He was also the quintessential absent-minded professor. The story goes that he was recruited (most likely ordered by his department chairman) to teach a freshman math course. One day, a student raised his hand and said “Professor Wiener, I was having trouble with problem 4 on last night’s assignment. Could you go over problem 4 for us?” Wiener says, “Sure, can somebody give me the text book?” So he reads the problem, thinks about it for a minute or two, turns around and writes a number on the board and says, “Is that the right answer?” The student replies, “Yes, Professor, that is what is in the answer section in the book, but how did you get it? How did you solve the problem?” Wiener replies, “Oh sorry, right.” He erases the number, looks at the book again, thinks for another minute or so, turns back to the board and writes the same number on the board again. “See”, he says triumphantly, “I did it a different way.”
The frustration that the students must have felt at that point is the same frustration that non-data scientists feel when encountering one of these “explanations”. So let’s do some ‘splainin’. I promise not to use jargon or complex formulas – just English.

OK dummies, so what is Data Science?
Generally speaking, data science is deriving some kind of meaning or insight from large amounts of data. Data can be textual, numerical, spatial, temporal, or some combination of these. Two branches of mathematics that are used to do this magic are Probability Theory and Linear Algebra. Probability is about the chance or likelihood that some observation (like a word in a document) is related to some prediction, such as a document topic or a product recommendation, based on prior observations of huge numbers of user choices. Linear algebra is the field of mathematics that deals with systems of linear equations (remember in algebra class the problem of solving multiple equations with multiple unknowns? Yeah – much more of that). Linear algebra deals with things called Vectors and Matrices. A vector is a list of numbers. A vector in 2 dimensions is a point on a plane – a pair of numbers – it has an X and a Y coordinate and can be characterized by a length and angle from the origin – if that is too much math so far, then you really are an ignorant dummy. A matrix is a set of vectors – for example, a simultaneous equation problem can be expressed in matrix form by lining up the coefficients – math-eze for constant numbers – so if one of the equations is 3x – 2y + z = 5, the coefficients are 3, -2, and 1 – which becomes a row in the coefficient matrix, or 3, -2, 1 and 5 for a row in the augmented or solution matrix. Each column in the matrix of equations is a vector consisting of the sets of x, y and z coefficients.
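The augmented-matrix idea can be made concrete with a toy solver. The sketch below runs Gaussian elimination on the 3x – 2y + z = 5 equation from the text; the other two equations are invented to complete the system, and at real data-science scale you would hand this to a linear algebra library rather than hand-rolled code like this.

```python
def solve(matrix, rhs):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    # Augmented matrix: each row is the coefficients plus the constant.
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column (pivoting).
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Subtract multiples of the pivot row to zero out the column below.
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitute from the last row upward.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

# 3x - 2y + z = 5, plus two made-up equations to round out the system:
coeffs = [[3, -2, 1], [1, 1, 1], [2, -1, 3]]
consts = [5, 4, 6]
solution = solve(coeffs, consts)  # x = 2, y = 1, z = 1
```

Each row of `coeffs` plus its entry in `consts` is exactly one row of the augmented matrix described above; the elimination steps are the same ones you did by hand in algebra class, just automated.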
Linear algebra typically deals with many dimensions that are difficult or impossible to visualize, but the cool thing is that the techniques that have been worked out can be used on any size vector or matrix. It gets much worse than this of course, but the bottom line is that it can deal with very large matrices and has techniques that can reduce these sets to equivalent forms that work no matter how many dimensions you throw at it – allowing you to solve systems of equations with thousands or millions of variables (or more). The types of data sets that we are dealing with are in this class. Pencil and paper won’t cut it here – nor even a single computer – so this is where distributed or parallel analytic frameworks like Hadoop and Spark come in.
Mathematics only deals with numbers, so if a problem can be expressed numerically, you can apply these powerful techniques. This means that the same methods can be used to solve seemingly disparate problems, from semantic topic mapping and sentiment analysis of documents to recommendations of music recordings or movies based on similarity to tunes or flicks that the user likes – as exemplified by sites like Pandora and Netflix.
So for the text analytics problem, the first head scratcher is how to translate text into numbers. This one is pretty simple – just count words in documents and determine how often a given word occurs in each document. This is known as term frequency or TF. The number of documents in a group that contain the word is its document frequency or DF. The ratio of these two, TF/DF – that is, term frequency multiplied by inverse document frequency (1/DF) – is a standard number known affectionately to search wonks as TF-IDF (yeah, jargon has a way of just weaseling its way in, don’t it?). I mention this because even dummies coming to this particular blog have probably heard this one before – Solr used to use it as its default way of calculating relevance. So now you know what TF-IDF means, but why is it used for relevance? (Note that as of Solr 6.0, TF-IDF is no longer the default relevance algorithm; it has been replaced by a more sophisticated method called BM25, which stands for “Best Match 25” – a name that gives absolutely no information at all – it still uses TF-IDF but adds some additional smarts.) The key is to understand why TF-IDF was used, so I’ll try to ‘splain that.
Some terms have more meaning or give you more information about what a document is about than others – in other words, they are better predictors of what the topic is. If a term is important to a topic, it stands to reason that it will be used more than once in documents about that topic, maybe dozens of times (high TF), but it won’t often be used in documents that are not about the subject or topic to which it is important – in other words, it’s a keyword for that subject. Because it is relatively rare in the overall set of documents, it will have low DF, so the inverse 1/DF or IDF will be high. Multiplying these two (and maybe taking the logarithm just for kicks, ‘cause that’s what mathematicians like to do) will yield a high value for TF-IDF. There are other words that will have high TF too, but these tend to be common words that will also have high DF (hence low IDF). Very common words are manually pushed out of the way by making them stop words. So the relevance formula tends to favor more important subject words over common or noise words. A classic wheat-vs-chaff problem.
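The TF-IDF arithmetic described above fits in a few lines. Here is a minimal Python sketch, with a toy corpus invented for illustration (real search engines use fancier weighting, but the shape is the same):

```python
import math

# Toy corpus, invented for illustration: each document is a list of tokens.
docs = [
    ["solr", "search", "relevance", "search"],
    ["solr", "index", "shard"],
    ["cooking", "pasta", "sauce"],
]

def tf_idf(term, doc, docs):
    """TF-IDF in its simplest form: raw term frequency times the
    log of the inverse document frequency."""
    tf = doc.count(term)                    # term frequency in this doc
    df = sum(1 for d in docs if term in d)  # docs containing the term
    idf = math.log(len(docs) / df)          # the log "just for kicks"
    return tf * idf

# "search" appears twice in doc 0 but in only one document overall, so it
# scores high there; "solr" is in two of three docs, so its IDF is lower.
print(tf_idf("search", docs[0], docs))  # ~2.20
print(tf_idf("solr", docs[0], docs))    # ~0.41
```

Note how the keyword beats the common word even though both appear in the same document – that is the wheat-vs-chaff effect in miniature.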
So getting back to our data science problem, the reasoning above is the core concept that these methods use to associate topics or subjects to documents. What these methods try to do is to use probability theory and pattern detection using linear algebraic methods to ferret out the salient words in documents that can be used to best predict their subject areas. Once we know what these keywords are, we can use them to detect or predict the subject areas of new documents (our test set). In order to keep up with changing lingo or jargon, this process needs to be repeated from time to time.
There are two main ways that this is done. The first way, called “supervised learning,” uses statistical correlation – a subject matter expert (SME) selects some documents, called “training documents,” and labels them with a subject. The software then learns to associate or correlate the terms in the training set with the topic that the human expert has assigned to them. Other topic training sets are also in the mix here so we can differentiate their keywords. The second way is called “unsupervised learning” because there are no training sets. The software must do all the work. These methods are also called “clustering” methods because they attempt to find similarities between documents and then label them based on the shared verbiage that caused them to cluster together.
In both cases, the documents are first turned into a set of vectors or a matrix in which each word in the entire document set is replaced by its term frequency in each document, or zero if the document does not have the word. Each document is a row in the matrix, and each column has the TFs for all of the documents for a given word – or to be more precise, token (because some terms like BM25 are not words). Now we have numbers that we can apply linear algebraic techniques to. In supervised learning, the math is designed to find the terms that have a higher probability of being found in the training set docs for that topic than in other training sets. Two tools from probability that come into play are Bayes’ Theorem, which deals with conditional probabilities, and Information Theory. A conditional probability is like the probability that you are a moron if you text while driving (pretty high it turns out – and would be a good source of Darwin awards except for the innocent people that also suffer from this lunacy). In our case, the game is to compute the conditional probabilities of finding a given word or token in a document and the probability that the document is about the topic we are interested in – so we are looking for terms with a high conditional probability for a given topic – aka the key terms.
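The vectorization step described above is mechanical. A minimal sketch, with a toy corpus invented for illustration:

```python
# Build the document-term matrix: one row per document, one column per
# token in the vocabulary, cells holding raw term frequencies (zero if
# the document does not contain the token).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cat and dog",
]

tokens = sorted({t for d in docs for t in d.split()})          # vocabulary
matrix = [[d.split().count(t) for t in tokens] for d in docs]  # rows = docs

print(tokens)
for row in matrix:
    print(row)
```

Each row is now a vector of numbers standing in for a document, which is all the linear algebra machinery needs to get going.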
Bayes’ Theorem states that the probability of event A given that event B has occurred – p(A|B) – is equal to the probability that B has occurred given A – p(B|A) – times the overall probability that A can happen – p(A) – divided by the overall probability that B can happen – p(B). So if A is the probability that a document is about some topic and B is the probability that a term occurs in the document, terms that are common in documents about topic A but rare otherwise will be good predictors. So if we have a random sampling of documents that SMEs have classified (our training set), the probability of topic A is the fraction of documents classified as A. p(B|A) is the frequency of a term in these documents and p(B) is the term’s frequency in the entire training set. Keywords tend to occur together, so even if any one keyword does not occur in all documents classified the same way (maybe because of synonyms), it may be that documents about topic A are the only ones (or nearly) that contain two or three of these keywords. So, the more keywords that we can identify, the better our coverage of that topic will be and the better chance we have of correctly classifying all of them (what we call recall). If this explanation is not enough, check out this really good article on Bayes’ Theorem – it’s also written for dummies.
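Plugging toy numbers into the theorem shows how the pieces fit together (the topic, keyword, and all counts below are invented for illustration):

```python
# Bayes' theorem on a toy training set: suppose SMEs labeled 40 of 100
# documents as "physics"; the term "quark" appears in 30 of those 40,
# and in 32 of the 100 documents overall.
p_topic = 40 / 100            # p(A): fraction of docs classified as physics
p_term_given_topic = 30 / 40  # p(B|A): term's frequency within that class
p_term = 32 / 100             # p(B): term's frequency in the whole set

# p(A|B) = p(B|A) * p(A) / p(B)
p_topic_given_term = p_term_given_topic * p_topic / p_term
print(p_topic_given_term)  # 0.9375: "quark" is a strong topic predictor
```

A document containing "quark" is about physics with probability 0.94, even though only 40% of documents are – the term is common inside the class and rare outside it, which is exactly what makes a key term.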
Information Theory looks at it a little differently. It says that the more rare an event is, the more information it conveys – keywords have higher information than stop words in our case. But other words are rare too – esoteric words that authors like to use to impress you with their command of the vocabulary but are not subject keywords. So there are noise words that are rare and thus harder to filter out. But noise words of any kind tend to have what Information Theorists call high Entropy – they go one way and then the other, whereas keywords have low entropy – they consistently point in the same direction – i.e., give the same information. You may remember Entropy from chemistry or physics class – or maybe not if you were an Arts major who was in college to “find” yourself. In Physics, entropy is the measure of disorder or randomness – the Second Law of Thermodynamics states that all systems tend to a maximum state of disorder – like my daughter’s room. So noise is randomness, both in messy rooms and in text analytics. Noise words can occur in any document regardless of context, and in both cases, they make stuff harder to find (like a matching pair of socks in my daughter’s case). Another source of noise is what linguists call polysemous words – words with multiple meanings in different contexts – like ‘apple’ – is it a fruit or a computer? (That’s an example of the author showing off that I was talking about earlier – ‘polysemous’ has way high IDF, so you probably don’t need to add it to your everyday vocabulary. Just use it when you want people to think that you are smart.) Polysemous words also have higher entropy because they simultaneously belong to multiple topics and therefore convey less information about each one – i.e., they are ambiguous.
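Shannon’s entropy formula makes the keyword/noise distinction concrete. A minimal sketch, with invented topic distributions:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum(p * log2(p)), skipping zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# How a word's occurrences are spread across four topics (toy numbers).
keyword = [0.97, 0.01, 0.01, 0.01]  # concentrated: points one way
noise   = [0.25, 0.25, 0.25, 0.25]  # spread evenly: maximum disorder

print(entropy(keyword))  # ~0.24 bits: low entropy, high information
print(entropy(noise))    # 2.0 bits: the maximum for four outcomes
```

The keyword, concentrated in one topic, carries about a quarter of a bit of uncertainty; the noise word, spread evenly, carries the maximum possible – it tells you nothing about which topic you are looking at.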
The more documents that you have to work with, the better chance you have to detect the words with the most information and the least entropy. Mathematicians like Claude Shannon, who invented Information Theory, have worked out precise mathematical methods to calculate these values. But there is always uncertainty left after doing this, so the Principle of Maximum Entropy says that you should choose a classification model that constrains the predictions by what you know, but hedges its bets, so to speak, to account for the remaining uncertainties. The more you know – i.e., the more evidence you have – the lower the maximum entropy you need to allow to be fair to your “prior stated data,” and the more accurate the model becomes. In other words, the “maximum ignorance” that you should admit to decreases (of course, you don’t have to admit it – especially if you are a politician – you just should). Ok, these are topics for advanced dummies – this is a beginners class. The key here is that complex math has awesome power – you don’t need to fully understand it to appreciate what it can do – just hire Data Scientists to do the heavy lifting – hopefully ones that also minored in English Lit so you can talk to them – (yes, a not so subtle plug for a well rounded Liberal Arts Education!). And on your end of the communication channel, as Shannon would say – knowing a little math and science can’t hurt.
In unsupervised learning, since both the topics and keywords are not known up front, they have to be guessed at, so we need some kind of random or ‘stochastic’ process to choose them (like Latent Dirichlet Allocation or a Markov Chain, for example). After each random choice or “delta”, we see if it gives us a better result – a subset of words that produce good clustering – that is, small sets of documents that are similar to each other but noticeably different from other sets. Similarity between the documents is determined by some distance measure. Distance in this case is like the distance between points in the 2-D example I gave above but this time in the high-dimensional space of the matrix – but it’s the same math and is known as Euclidean distance. Another similarity measure just focuses on the angles between the vectors and is known as Cosine Similarity. Using linear algebra, when we have a vector of any dimension we can compute distances or angles in what we call a “hyper-dimensional vector space” (cool huh? – just take away ‘dimensional’ and ‘vector’ and you get ‘hyperspace’. Now all we need is Warp Drive). So what we need to do is to separate out the words that maximize the distance or angle between clusters (the keywords) from the words that tend to pull the clusters closer together or obfuscate the angles (the noise words).
The next problem is that since we are guessing, we will get some answers that are better than others: as we take words in and out of our guessed keywords list, the results get better and then get worse again, so we go back to the better alternative. (This is what Newton’s method does to find things like square roots.) The rub here is that if we had kept going, the answer might have gotten better again, so what we stumbled on is what is known as a local maximum. So we have to shake it up and keep going till we can find the maximum maximum out there – the global maximum. So we do this process a number of times until we can’t get any better results (the answers “converge”), then we stop. As you can imagine, all of this takes a lot of number crunching, which is why we need powerful distributed processing frameworks such as Spark or Hadoop to get it down to HTT – human tolerable time.
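The local-versus-global maximum trap is easy to see in a toy sketch (the “clustering quality” scores below are invented, standing in for whatever the real algorithm is optimizing):

```python
# Toy "clustering quality" scores over candidate solutions 0..10.
# Index 2 is a local maximum (score 3); index 7 is the global maximum (9).
scores = [0, 1, 3, 2, 1, 2, 5, 9, 5, 2, 0]

def hill_climb(start):
    """Greedy ascent: step to a better neighbor until none exists."""
    x = start
    while True:
        neighbors = [n for n in (x - 1, x + 1) if 0 <= n < len(scores)]
        best_n = max(neighbors, key=lambda n: scores[n])
        if scores[best_n] <= scores[x]:
            return x  # stuck on a peak - local or global, we can't tell
        x = best_n

# A single climb can stall on the local maximum...
print(hill_climb(0))  # 2: the local maximum

# ...so shake things up: climb from every starting point, keep the best peak.
best = max((hill_climb(s) for s in range(len(scores))), key=lambda x: scores[x])
print(best)  # 7: the global maximum
```

Real optimizers can’t afford to try every starting point in a huge search space, so they use random restarts or stochastic jumps instead – which is part of why all that number crunching adds up.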
The end results are pretty good – often exceeding 95% accuracy as judged by human experts (but can we assume that the experts are 100% accurate – what about human error? Sorry about that, I’m a curmudgeon you know – computers as of yet have no sense of humor so you can’t tease them very effectively). But because of this pesky noise problem, machine learning algorithms sometimes make silly mistakes – so we can bet that the Data Scientists will keep their jobs for a while. And seriously, I hope that this attempt at ‘splainin’ what they do entices you dummies to hire more of them. It’s really good stuff – trust me.
As part of its modernization of the Lifeline program in March, the Federal Communications Commission (FCC) charged its Consumer and Government Affairs Bureau (CGB) with developing a digital inclusion plan that addresses broadband adoption issues. Yesterday the ALA filed a letter with the Commission with recommendations for the plan.
ALA called on the Commission to address non-price barriers to broadband adoption by:
- Using its bully pulpit to increase public awareness about the need for and economic value of broadband adoption; highlight effective adoption efforts; and recognize and promote digital literacy providers like libraries to funders and state and local government authorities that can help sustain and grow efforts by these providers.
- Expanding consumer information, outreach and education that support broadband adoption—both through the FCC’s own website and materials and by effectively leveraging aligned government (e.g., the National Telecommunications and Information Administration’s Broadband Adoption Toolkit) and trusted noncommercial resources (e.g., EveryoneOn, DigitalLearn.org, Digital IQ).
- Encouraging and guiding Eligible Telecommunications Carriers (ETCs) in the Lifeline program to support broadband adoption efforts through libraries, schools and other trusted local entities.
- Building and strengthening collaborations with other federal agencies, including the Institute of Museum and Library Services, the National Telecommunications and Information Administration, and the Department of Education.
- Convening diverse stakeholders (non-profit, private and government agencies, representatives from underserved communities, ISPs, and funders) to review and activate the Commission’s digital inclusion plan.
- Holding regional field hearings to extend the conversation and connect digital inclusion partners beyond the Beltway, with mechanisms for public comment and refinement of the plan (e.g., public notice or notice of inquiry).
- Improving data gathering and research to better understand gaps and measure progress over time.
- Exploring how the Universal Service Fund and/or merger obligations can be leveraged to address non-price barriers to broadband adoption. Sustainable funding to support and expand broadband adoption efforts and digital literacy training is a challenge, particularly in light of the need for one-on-one help and customized learning in many cases.
The letter builds on past conversations with CGB staff at the 2016 ALA Annual Conference and the last Schools, Health and Libraries Broadband (SHLB) Coalition Conference. The CGB is due to submit the plan to the Commission before the end of the year.
The post ALA makes recommendations to FCC on digital inclusion plan appeared first on District Dispatch.
My current position at DPLA, especially since we are a remote-first organization, requires me to be on lots of conference calls, both video and audio. While I’ve learned the value of staying muted while I’m not talking, there are a couple of things that make this challenging. First, I usually need the window for the call to have focus to unmute myself with the platform’s designated keystroke. Forget about that working well if you need to bring something up in another window, or switch to another application. Secondly, while we have our own preferred platform internally (Google Hangouts), I have to use countless others, too; each of those platforms has its own separate keystroke to mute.
This all leads to a less than ideal situation, and naturally, I figured there must be a better way. I knew that some folks have used inexpensive USB footpedals for things like Teamspeak, but that ran into the issue where a keystroke would only be bound to a specific application. Nonetheless, I went ahead and bought a cheap PCSensor footswitch sold under another label from an online retailer. The PCSensor footswitches are programmable, but the software that ships with them is Windows-only. However, I also found a command-line tool for programming the switches.
After doing some digging, I came across an application for Mac OS X called Shush, which provides both push-to-talk and push-to-silence modes, which are activated by a keystroke. Once installed, I bound Shush to the Fn keystroke, which would allow me to activate push-to-talk even if I didn’t have the pedal plugged in. However, I couldn’t get the pedal to send the Fn keystroke alone since it’s a modifier key. As a workaround, I put together a device-specific configuration for Karabiner, a flexible and powerful tool to configure input devices for Mac OS. By default, the pedal sends the keycode for b, and the configuration rebinds b for an input device matching the USB vendor and product IDs for the pedal to Fn.
Since I bought and set up my first pedal, I’ve gotten used to using it to quickly mute and unmute myself, making my participation in conference calls much smoother than it was previously. I’ve also just replaced my first pedal, which broke suddenly, with a nearly identical one, but I might make the switch to a more durable version. My Karabiner configuration is available as a gist for your use – I hope this helps you as much as it helped me!
Here we are, ten years after the world started using Evergreen, twelve years after the first line of code was written. Gandhi may not have coined the phrase, but whoever did was right: first they ignore you, then they laugh at you, then they fight you, then you win.
Evergreen won, in many senses, a long time ago. If nothing else, these recent reflections on the last ten years have reminded us that winning takes many forms. Growing from a single-site effort into a self-sustaining project with a positive, inviting community is certainly one type of win. Outlasting the doubt, the growing pains, and the FUD – which Evergreen has – is another. Then, when you get past all that, most of your detractors become your imitators. It’s interesting how that happens. And, we can call that a win as well.
There has been, in just the last couple years, much talk in the ILS industry of openness. Essentially every ILS available in the last two decades has been built on the shoulders of Open Source projects to a greater or lesser degree: the Apache web server; the Perl and PHP languages; the Postgres and MySQL databases. The truth is that while most always have been, it’s only recently become fashionable to admit the fact. Then, of course, there are the claims of “open APIs” that aren’t. Perhaps most interesting, though not at all surprising, is the recent claim by a proprietary vendor that it was taking “the unprecedented step” of trying to “make open source a viable alternative for libraries.” Some seem to have missed the last ten or fifteen years of technology in libraries.
And so here we are, ten years later. Evergreen is in use by some 1,500 libraries, serving millions of patrons. Other ILS’s, new and old, want to do what Evergreen has managed to do. We say more power to them. We hope they succeed. The Evergreen community, and our colleagues here at Equinox, have helped pave the way for libraries to define their own future. That was the promise of Evergreen twelve years ago, and that was the goal of the first, small team on Labor Day weekend ten years ago. We’ve won, and delivered on that promise.
Now it’s time to see what the next ten years will bring. We’re ready, are you?
— Mike Rylander and Grace Dunbar
This is the twelfth and final post in our series leading up to Evergreen’s Tenth birthday. Thank you to all those who have been following along!
As we countdown to the annual Lucene/Solr Revolution conference in Boston this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Brett Hoerner’s talk, “Solr at Scale for Time-Oriented Data”.
This talk will go over tricks used to index, prune, and search over large (>10 billion docs) collections of time-oriented data, and how to migrate collections when inevitable changes are required.
Brett Hoerner lives in Austin, TX and is an Infrastructure Engineer at Rocana where they are helping clients control their global-scale modern infrastructure using big data and machine learning techniques. He began using SolrCloud in 2012 to index the Twitter firehose at Spredfast, where the collection eventually grew to contain over 150 billion documents. He is primarily interested in the performance and operation of distributed systems at scale.
Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana from Lucidworks
Join us at Lucene/Solr Revolution 2016, the biggest open source conference dedicated to Apache Lucene/Solr on October 11-14, 2016 in Boston, Massachusetts. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…
For libraries, proxying user requests is how we provide authenticated access--and some level of anonymized access--to almost all of our licensed resources. Proxying Google Scholar in the past would direct traffic through a campus IP address, which prompted Scholar to automatically include links to the licensed content that we had told it about. It seemed like a win-win situation: we would drive traffic en masse to Google Scholar, while anonymizing our users' individual queries, and enabling them swift access to our library's licensed content as well as all the open access content that Google knows about.
However, in the past few months things changed. Now when Google Scholar detects proxied access it tries to throw up a Recaptcha test--which would be an okay-ish speed bump, except it uses a key for a domain (google.ca of course) which doesn't match the proxied domain and thus dies with a JS exception, preventing any access. That doesn't help our users at all, and it hurts Google too because those users don't get to search and generate anonymized academic search data for them.
Folks on the EZProxy mailing list have tried a few different recipes to evade the Recaptcha, but that seems doomed to failure.
If we don't proxy these requests, then every user would need to set their preferred library (via the Library Links setting) to include convenient access to all of our licensed content. But that setting can be hard to find, and relies on cookies, so behaviour can be inconsistent as users move from browser to browser (as happens in universities with computer labs and loaner laptops). And then the whole privacy thing is lost.
On the bright side, I think a link like https://scholar.google.ca/scholar_setprefs?instq=Laurentian+University+Get+full+text&inst=15149000113179683052 makes it a tiny bit easier to help users set their preferred library in the unproxied world. So we can include that in our documentation about Google Scholar and get our users a little closer to off-campus functionality.
But I really wish that Google would either fix their Recaptcha API key domain-based authentication so it could handle proxied requests, or recognize that the proxy is part of the same set of campus IP addresses that we've identified as having access to our licensed resources in Library Links and just turn off the Recaptcha altogether.
Open Knowledge Foundation: What skills do you need to become a data-driven storyteller? Join a week-long data journalism training #ddjcamp
European Youth Press is organising a week-long intensive training on data journalism funded by Erasmus+. It is aimed at young journalists, developers and human rights activists from 11 countries: Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark and Latvia.
If you have always wanted to learn more about what it means to be a data-driven storyteller, then this is an opportunity not to miss! Our course was designed with wanna-be data journalists in mind and for people who have been following others’ work in this area but are looking to learn more about actually making a story themselves.
You will have classes and workshops along the data pipeline: where to get the data, what to do to make it ‘clean’, and how to find a story in the data. In parallel to the training, you will work in teams and produce a real story that will be published in the national media of one of the participating countries.
The general topic chosen for all the stories produced is migration and refugees. Data journalism has a reputation for being a more objective kind of journalism, as opposed to ‘he said – she said’ narratives. However, there is still great potential to explore data-driven stories about migrants and the effects of migration around the world.
Praising the refugee hunters as national heroes; violence targeting international journalists and migrants; sentimental pleas with a short-term effect – those are a few examples of media coverage of the refugee crisis. The backlash so far to these narratives has mostly been further distrust in the media. What are the ways out of it?
We want to produce more data-driven balanced stories on migrants. For this training, we are inviting prominent researchers and experts in the field of migration. They will help us with relevant datasets and knowledge. We will not fix the world, but we can make a little change together.
So, if you are between 18 and 30 years old and come from Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark or Latvia, don’t wait – apply now (deadline is 11 Sept):
Geoff distinguishes between "DOI-like strings" and "fake DOIs", presenting three ways DOI-like strings have been (ab)used:
- As internal identifiers. Many publishing platforms use the DOI they're eventually going to register as their internal identifier for the article. Typically it appears in the URL at which it is eventually published. The problem is that: the unregistered DOI-like strings for unpublished (e.g. under review or rejected manuscripts) content ‘escape’ into the public as well. People attempting to use these DOI-like strings get understandably confused and angry when they don’t resolve or otherwise work as DOIs. Platforms should use internal IDs that can't be mistaken for external IDs, because they can't guarantee that the internal ones won't leak.
- As spider- or crawler-traps. This is the usage that Eric Hellman identified. Strings that look like DOIs but are not even intended to eventually be registered, but which have bad effects when resolved: When a spider/bot trap includes a DOI-like string, then we have seen some particularly pernicious problems as they can trip up legitimate tools and activities as well. For example, a bibliographic management browser plugin might automatically extract DOIs and retrieve metadata on pages visited by a researcher. If the plugin were to pick up one of these spider-trap DOI-like strings, it might inadvertently trigger the researcher being blocked – or worse – the researcher’s entire university being blocked. In the past, this has even been a problem for Crossref itself. We periodically run tools to test DOI resolution and to ensure that our members are properly displaying DOIs, CrossMarks, and metadata as per their member obligations. We’ve occasionally been blocked when we ran across the spider traps as well. Sites using these kinds of crawler traps should expect a lot of annoyed customers whose legitimate operations caused them to be blocked.
- As proxy bait. These unregistered DOI-like strings can be fed to systems such as Sci-Hub in an attempt to detect proxies. If they are generated afresh on each attempt, the attacker knows that Sci-Hub does not have the content. So it will try to fetch it using a proxy or other technique. The fetch request will be routed via the proxy to the publisher, who will recognize the DOI-like string, know where the proxy is located, and can take action, such as blocking the institution: In theory this technique never exposes the DOI-like strings to the public and automated tools should not be able to stumble upon them. However, recently one of our members had some of these DOI-like strings “escape” into the public and at least one of them was indexed by Google. The problem was compounded because people clicking on these DOI-like strings sometimes ended up having their university’s IP address banned from the member’s web site. ... We think this just underscores how hard it is to ensure DOI-like strings remain private and why we recommend our members not use them. As we see every day, designing computer systems that in the real world never leak information is way beyond the state of the art.
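The first bullet’s advice – internal IDs that can’t be mistaken for DOIs – is something a platform can check mechanically. Below is a minimal Python sketch using a pattern along the lines of Crossref’s published recommendation for matching most modern DOIs; the example strings are invented, and matching the pattern says nothing about whether a string is actually registered (which is exactly the confusion unregistered DOI-like strings cause):

```python
import re

# A loose pattern for DOI-shaped strings: the "10." directory indicator,
# a 4-9 digit registrant prefix, a slash, then a suffix. (Hedge: this is
# an illustrative approximation; registration is the only real test.)
DOI_LIKE = re.compile(r"^10\.\d{4,9}/[-._;()/:a-z0-9]+$", re.IGNORECASE)

def looks_like_doi(s):
    """True if the string is DOI-shaped; says nothing about whether it
    is actually registered with a DOI registration agency."""
    return bool(DOI_LIKE.match(s))

print(looks_like_doi("10.5555/12345678"))   # True: DOI-shaped
print(looks_like_doi("internal-id-98765"))  # False: safe internal ID
```

A platform could run a check like this over its internal identifier scheme to make sure none of its IDs are DOI-shaped before any of them get a chance to leak.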
The following is what we have sometimes called a “fake DOI”
It is registered with Crossref, resolves to a fake article in a fake journal called The Journal of Psychoceramics (the study of Cracked Pots) run by a fictitious author (Josiah Carberry) who has a fake ORCID (http://orcid.org/0000-0002-1825-0097) but who is affiliated with a real university (Brown University).
Again, you can try it.
And you can even look up metadata for it.
Our dirty little secret is that this “fake DOI” was registered and is controlled by Crossref. These "starting with 5" DOIs are used by Crossref to test their systems. They too can confuse legitimate software, but the bad effects of the confusion are limited. And now that the secret is out, legitimate software can know to ignore them, and thus avoid the confusion.