Talis is delighted to be one of the sponsors of the 8th European Summer School on Ontological Engineering and the Semantic Web (SSSW 2011). There will be more about this in coming posts, but just to start off:
We are sponsoring it for a very simple reason. The mix of theoretical, practical and collaboration skills used by all the students involved from across Europe directly corresponds to how we work at Talis. It’s an environment of support and challenge, contribution and connection that has proved beneficial for all involved over the years. Talis is proud to contribute and participate to further the aims of the community.
Talis is a small and ambitious company of likeminded, motivated people. A phrase we often use here is Human Scale. Culturally what we mean by that is we like working closely with people who we all know, whether as employees of Talis or (more likely) over time collaborating as partners in joint endeavours.
We want to grow our company and contribute to the communities we belong to. We know that it is by fostering relationships with others driven by the same passion to collaborate and learn that we can build on the ambitions we have for ourselves and for the communities we belong to. One particular aspect of the Summer School is this same notion of social connectedness, a personal network of trusted relationships that challenge and enhance the experience for everyone.
If you use Ibis Reader, you will have seen the “Get Books” link. This allows you to view OPDS catalogs (lists of web-accessible ebooks). The Feedbooks catalogs are pre-installed, and some of you may have set up a Calibre/Dropbox OPDS catalog of your own library.
On a whim, I used the “Add Your Own Catalog” link to add the WebScription Stanza link,
It worked! I can access the Baen Free Library directly from Ibis Reader! In “Get Books” I click “Baen” (or whatever you named the OPDS link), then “Baen Free Library”, then the “Read” button next to the book I want to add to my Ibis Reader library. Yay!
Edited the above URL to go straight to the free books, bypassing the top-level catalog.
The Stanza catalog is an earlier form of OPDS, so some features like cover images won’t work in Ibis or other OPDS readers. If Baen updates their catalog to OPDS 1.0 then those features will be enabled, and we’d definitely consider adding it as a built-in catalog like the Feedbooks ones.
Please share any other OPDS catalogs that you’d like to see added (or at least listed as optional catalogs users can add themselves).
Add Open Publishing Distribution System (OPDS) to the slew of metadata acronyms to be aware of. Based on the widely implemented Atom Syndication Format, OPDS Catalogs have been developed since 2009 by a group of ebook developers, publishers, librarians, and booksellers interested in providing a lightweight, simple, and easy to use format for developing catalogs of digital books, magazines, and other content.
How this compares to OAI-PMH I'll have to investigate. When would one or the other be most appropriate? What tools are there to create it and use it?
I'm teaching a TEI class this weekend, so I've been pondering it a bit. I've come to the conclusion that calling what we do with TEI "text encoding" is misleading. I think what we're really doing is text modeling.
TEI provides an XML vocabulary that lets you produce models of texts that can be used for a variety of purposes. Not a Model of Text, mind you, but models (lowercase) of texts (also lowercase).
TEI has made the (interesting, significant) decision to piggyback its semantics on the structure of XML, which is tree-based. So XML structure implies semantics for a lot of TEI. For example, paragraph text appears inside <p> tags; to mark a personal name, I surround the name with a <persName> tag, and so on. This arrangement is extremely convenient for processing purposes: it is trivial to transform the TEI <p> into an HTML <p>*, for example, or the <persName> into an HTML hyperlink, which points to more information about the person. It means, however, that TEI's modeling capabilities are to a large extent XML's own. This approach has opened TEI up to criticism. Buzzetti (2002) has argued that its tree structure simply isn't expressive enough to represent the complexities of text, and Schmidt (2010) criticizes TEI for (among other problems) being a bad model of text, because it imposes editorial interpretation on the text itself.
The main disagreement I have with Schmidt's argument is the assumption that there is a text independent of the editorial apparatus. Maybe there is sometimes, but I can point at many examples where there is no text, as such, only readings. And a reading is, must be, an interpretive exercise. So I'd argue that TEI is at least honest in that it puts the editorial interventions front and center where they are obvious.
As for the argument that TEI's structure is inadequate to model certain aspects of text, I can only agree. But TEI has proved good enough to do a lot of serious scholarly work. That, and the fact that its choice of structure means it can bring powerful XML tools to bear on the problems it confronts, means that TEI represents a "worse is better" solution.† It works a lot of the time, doesn't claim to be perfect, and incrementally improves. Where TEI isn't adequate to model a text in the way you want to use it, then you either shouldn't use it, or should figure out how to extend it.
One should bear in mind that any digital representation of a text is ipso facto a model. It's impossible to do anything digital without a model (whether you realize it's there or not). Even if you're just transcribing text from a printed page to a text editor you're making editorial decisions, like what character encoding to use, how to represent typographic features in that encoding, how to represent whitespace, and what to do with things you can't easily type (inline figures or symbols without a Unicode representation, for example).
So why argue that TEI is a language for modeling texts, rather than a language for "encoding" texts? The simple answer is that this is a better way of explaining what people use TEI for. TEI provides a lot of tags to choose from. No-one uses them all. Some are arguably incompatible with one another. We tag the things in a text that we care about and want to use. In other words, we build models of the source text, models that reflect what we think is going on structurally, semantically, or linguistically in the text, and/or models that we hope to exploit in some way.
For example, EpiDoc is designed to produce critical editions of inscribed or handwritten ancient texts. It is concerned with producing an edition (a reading) of the source text that records the editor's observations of and ideas about that text. It does not at this point concern itself with marking personal or geographic names in the text. An EpiDoc document is a particular model of the text that focuses on the editor's reading of that text. As a counterexample, I might want to use TEI to produce a graph of the interactions of characters in Hamlet. If I wanted to do that, I would produce a TEI document that marked people and whom they were addressing when they spoke. This would be a completely different model of the text than a critical edition of Hamlet might be. I could even try to do both at the same time, but that might be a mess—models are easier to deal with when they focus on one thing.
This way of understanding TEI makes clear a problem that arises whenever one tries to merge collections of TEI documents: that of compatibility. Just because two documents are marked up in TEI, that does not mean they are interoperable. This is because each document represents the editor's model of that text. Compatibility is certainly achievable if both documents follow the same set of conventions, but we shouldn't expect it any more than we'd expect to be able to merge any two models that follow different ground rules.
Notes
* With the caveat that the semantics of TEI <p> and HTML <p> are different, and there may be problems. TEI's <p> can contain lists, for example, whereas HTML's cannot.
† See http://www.dreamsongs.com/RiseOfWorseIsBetter.html
Yes, I wrote a blog post with endnotes and bibliography. Sue me.
So, a new year is here. Again. I'm getting a bit sick of this straining repetition, but apparently the rest of society thinks it is quite alright. So.
A lot of stuff has happened. We've sold one house, bought and moved into another (and I'm sure I'll write more on this later), and various events have come and gone. I've gotten a new camera for Christmas which I'm excited about (a Panasonic Lumix G2), and I'm reading Bill Bryson's latest "At Home" which is brilliant as usual. Oh, and Mr Mister have released their album "Pull" after 20 years (!!), and it is AWESOME!
I'm writing a book. And I'm enjoying it, when I get the time to do it. I'm some 70 pages in, and it's about ... uh, part technology, part human and cosmological evolution, some laser shooting which defies the laws of physics, project management, opinions on the strong need for secularity, on music, and some more parts technology, programming and development, syntax and language, lots about language, and about libraries and culture, and then some. Yeah, so not your average book, but some people are interested, and I'm taking advice on publishing, format and schedule from anyone.
I'm opening ThinkPlot again, an organisation for people who care, in an intelligent fashion, about the well-being of the human race and the world we live in, to promote education, science and rationality amongst the people that live near you. Our patron "saint" is the late great Carl Sagan. I'll definitely talk more about this later.
Work is good. It's intranets all the way, interspersed with UCD, IA, UX, hacking, supervision, PMing, and all other goodies, and it's in the health-care system doing important work. So, yeah. Good stuff, and enjoyable. In fact, one of the things I've noticed is that in the few years since my last stints in the Intranet world not much has improved in terms of content and document management. The old systems that sucked have been overtaken by systems that also suck, just in different ways. Enterprise systems of various kinds follow suit. There's so much bad software out there, even from people who should know better. So, yes, I've decided to make something funky from scratch in the Intranet space, using REST, Topic Maps and simpler development tools readily available. We'll see where it takes us.
Kids and wife doing fine. Kids winning awards, playing violin brilliantly, and growing up fine. (Crossing fingers!) Things are chugging along. Oh, and we've just been introduced to Carcassonne and are getting hooked on it, so now you know what we often do in the evenings. The beach is down the road next to the shop and cafe, and the pool in the backyard is a favorite pastime, so do come over. Things are good.
PS. Send more salty liquorice.
There is a code4lib IRC channel for folks who are interested in the convergence of computers and library/information science. The channel is a less formal and more interactive alternative to the code4lib mailing list for the discussion of code, projects, ideas, music, first computers, etc.
del.icio.us: The Code4Lib Journal – A Principled Approach to Online Publication Listings and Scientific Resource Sharing
The Max Planck Institute (MPI) for Psycholinguistics has developed a service to manage and present the scholarly output of their researchers. The PubMan database manages publication metadata and full-texts of publications published by their scholars. All relevant information regarding a researcher’s work is brought together in this database, including supplementary materials and links to the MPI database for primary research data. The PubMan metadata is harvested into the MPI website CMS (Plone). The system developed for the creation of the publication lists allows the researcher to create a selection of the harvested data in a variety of formats.
By Jacquelijn Ringersma, Karin Kastens, Ulla Tschida and Jos van Berkum
I’m trying to really polish off the edges and provide a slick interface in our Blacklight implementation, to contrast with the very hacky legacy OPAC.
Applied to the display of your items out with due dates… In a display of due dates, the user has to do some arithmetic to figure out how far away a given date is. What they really care about is, is this tomorrow? In a week? In a month?
So why not display it to them? There’s a nice Rails helper to calculate human readable time deltas, although I customized it just a bit to handle the fact that some of our due dates have specific times attached (like for reserves; I use a ruby Time object), and others are just dates with no time, before close of business (I use ruby Date object). The built-in method alone works with Dates, but doesn’t always do the most sensible thing with them for this case.

    # handle both Date's without a time, and Time's with hour/minute/second
    # appropriately.
    def relative_due_date(date_or_time)
      if date_or_time.kind_of?(Time)
        distance_of_time_in_words(Time.now, date_or_time)
      elsif date_or_time == Date.today
        "today"
      elsif date_or_time == (Date.today + 1)
        "tomorrow"
      else
        distance_of_time_in_words(Date.today, date_or_time)
      end
    end
There are other corners that it wasn’t really feasible to polish off based on the functionality and business rules of the underlying ILS. Like if the item isn’t renewable, I’d rather not show a renew button at all, instead showing the reason it’s not renewable. But the underlying ILS doesn’t really support that; you have to click ‘renew’ first, and then (maybe, sometimes) get a reason if it wasn’t renewable. So I polished off what I could within reasonable constraints. At least in this interface, if you choose ‘renew all’, it specifically marks which items were renewed and which weren’t, instead of the legacy OPAC that just let you look at the new due dates without telling you if they were actually new or the same old ones due to not being renewable. Oh well.
There are some good spam solutions going these days, e.g., filtering, but spam is a complex problem and here’s one more simple idea that might help. Blog software can require a person to be approved before leaving a comment. Why not use the same approach with email?
It seems the COinS Generator is not working, at least when given a DOI it returns the target not a Content Object in Spans. Is there no alternative tool? I couldn't find one. If that is the case, is it because COinS is pretty useless and no one bothered to create another generator? If COinS are useful, maybe another instance of the generator or a different but similar service would be a good thing.
In principle, I like microformats. Anything that supplies more semantics to information is something I tend to support. COinS seem like a very useful microformat; nothing in RDFa that I know of is a decent substitute. What's happening here?
Have you seen the cover of the new issue of American Libraries?
Just curious: any news of color e-ink readers from either ALA or CES?
TemaTres 1.2 has been released. TemaTres is an open source vocabulary server, web application to manage and exploit vocabularies, thesauri, taxonomies and formal representations of knowledge.... Export in any format: Skos-Core, Zthes, TopicMap, Dublin Core, MADS, BS8723-5, RSS, SiteMap, txt, SQL.
Light pollution is a major issue which not only concerns astronomers and stargazers, but also has serious impacts on the environment and human health. The BMP project is an initiative founded in early 2008 by Francesco Giubbilini and Andrea Giacomelli, two environmental engineers with Tuscan roots, aimed at collecting data on light pollution by non-professionals as a form of environmental awareness raising.
Ordinary citizens, families, and schools can participate. The project also has a scientific aspect, as it allows the collection of valuable data, using a tool, called Sky Quality Meter, which has been on the market for a couple of years now. Measurements can be collected with a user’s instrument, or borrowing one from the BMP group, and subsequently uploaded to the BMP web site.
In addition to collecting new measurements, the BMP team takes care of:
The data uploaded on the BMP web site can then be viewed and downloaded freely (data are available under the Open DataBase Licence). Other contents produced by the BMP team are released under a CC BY-NC-SA license. Furthermore, free and open source geospatial technologies are used for the database and the web mapping engine.
The project has generated considerable interest at the national level, among other things winning an award for innovation in early 2009 and receiving diverse media coverage. The BMP project is interested in:
Or we could save our energy and find untapped sources of content created by our local users and work together to create a single publishing platform and rights-management tool to allow easy creation and access to local content.
That’s the excellent ending of Kathryn Greenhill’s answer to her own question: How do we force publishers to give us ebook content that includes works that our users want and that they find easy to download to their chosen device?
This is such a compelling vision of a way forward for libraries. Not only is it more attainable than forcing publishers to do anything (or even compelling them) but it would result in a much more meaningful public library.
I’m looking forward to the rest of the posts in her series!
GetSatisfaction‘s “How does this make you feel?” intrigues me: why do people answer this? Conventional wisdom says that people don’t classify their posts.
Network diagrams are great ways to illustrate relationships. In such diagrams nodes represent some sort of entity, and lines connecting nodes represent some sort of relationship. Nodes clustered together and sharing many lines denote some kind of similarity. Conversely, nodes whose lines are long and not interconnected represent entities outside the norm or at a distance. Network diagrams are a way of visualizing complex relationships.
Are you familiar with the phrase “in the same breath”? It is usually used to denote the relationship between two or more ideas. “He mentioned both ‘love’ and ‘war’ in the same breath.” This is exactly one of the things I want to do with texts. Concordances provide this sort of functionality. Given a word or phrase, a concordance will find the query in a corpus and display the words on either side of it. As a KWIC (key word in context) index, a concordance makes it easier to read how words or phrases are used in relation to their surrounding words. The use of network diagrams seems like a good way to see, or visualize, how words or phrases are used within the context of surrounding words.
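To make the concordance idea concrete, here is a minimal sketch in Ruby. It is not the code behind anything described in this post; the method name `kwic` and the `width` parameter are made up for illustration, and the sample sentence is invented.

```ruby
# Minimal key-word-in-context sketch (hypothetical helper, not the author's code).
# For every whole-word, case-insensitive match of `query`, return one line
# showing up to `width` characters of context on either side.
def kwic(text, query, width = 30)
  pattern = /\b#{Regexp.escape(query)}\b/i
  lines = []
  text.scan(pattern) do
    match = Regexp.last_match
    start = [match.begin(0) - width, 0].max
    left  = text[start...match.begin(0)]
    right = text[match.end(0), width].to_s
    # Right-align the left context so the keywords line up in a column.
    lines << format("%#{width}s [%s] %s", left, match[0], right)
  end
  lines
end

sample = "He mentioned both love and war in the same breath. Love came first, then war."
puts kwic(sample, "war", 20)
```

The context is cut by character count rather than by word, so the edges of each line may show partial words.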
The implementation of the visualization requires the recursive creation of a term matrix. Given a word (or regular expression), find the query in a text (or corpus). Identify and count the d most frequently used words within b number of characters. Repeat this process d times with each co-occurrence. For example, suppose the text is Walden by Henry David Thoreau, the query is “spring”, d is 5, and b is 50. The implementation finds all the occurrences of the word “spring”, gets the text 50 characters on either side of it, finds the 5 most commonly used words in those characters, and repeats the process for each of those words. The result is the following matrix:

    spring   day     morning  first  winter
    day      days    night    every  today
    morning  spring  say      day    early
    first    spring  last     yet    though
    winter   summer  pond     like   snow
Thus, the most common co-occurrences for the word “spring” are “day”, “morning”, “first”, and “winter”. Each of these co-occurrences is recursively used to find more co-occurrences. In this example, the word “spring” co-occurs with times of day and seasons. These words then co-occur with more times of day and more seasons. Similarities and patterns begin to emerge. Depending on the complexity of a writer’s sentence structure, the value of b (“breath”) may need to be increased or decreased. As the value of d (“detail”) is increased or decreased, so is the number of co-occurrences returned.
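The process described above can be sketched roughly as follows. This is a Ruby illustration under stated assumptions, not the Perl behind the CGI scripts: the names `cooccurrences`, `cooccurrence_matrix` and `STOPWORDS` are made up, the stop word list is an arbitrary short one, the query word itself is left out of its own counts, and the recursion goes only one level deep.

```ruby
# Hypothetical, deliberately short stop word list; a real one would be longer.
STOPWORDS = %w[the and of a to in it is was that i my as with for on at by
               this but not or be have had which were from].freeze

# Return the `d` most frequent words found within `b` characters on either
# side of each occurrence of `query`. Windows are cut by character count,
# so words at the edges of a window may be truncated.
def cooccurrences(text, query, d, b)
  counts = Hash.new(0)
  pattern = /\b#{Regexp.escape(query)}\b/i
  text.scan(pattern) do
    m = Regexp.last_match
    start = [m.begin(0) - b, 0].max
    window = text[start...m.begin(0)] + text[m.end(0), b].to_s
    window.downcase.scan(/[a-z]+/) do |word|
      # Assumption: skip the query itself and any stop words.
      next if word == query.downcase || STOPWORDS.include?(word)
      counts[word] += 1
    end
  end
  # Most frequent first; ties are broken alphabetically to keep output stable.
  counts.sort_by { |word, n| [-n, word] }.first(d).map(&:first)
end

# Build the term matrix: one row for the query, then one row per co-occurrence.
def cooccurrence_matrix(text, query, d, b)
  first_row = cooccurrences(text, query, d, b)
  rows = [[query] + first_row]
  first_row.each { |word| rows << [word] + cooccurrences(text, word, d, b) }
  rows
end

sample = "In spring the morning is bright. Every spring day the pond thaws. " \
         "A winter morning on the pond is cold. The first day of spring came early."
cooccurrence_matrix(sample, "spring", 3, 30).each { |row| puts row.join(" ") }
```

Each row of the result could then be turned into a node with edges to the other words in that row, which is all a network diagram needs as input.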
“spring” in Walden
It is interesting enough to see the co-occurrences of any given word in a text, but it is even more interesting to compare the co-occurrences between texts. Below are a number of visualizations from Thoreau’s Walden. Notice how the word “walden” frequently co-occurs with the words “pond”, “water”, and “woods”. This makes a lot of sense because Walden Pond is a pond located in the woods. Notice how the word “fish” is associated with “pond”, “fish”, and “fishing”. Pretty smart, huh?
“walden” in Walden
“fish” in Walden
“woodchuck” in Walden
“woods” in Walden
Compare these same words with the co-occurrences in a different work by Thoreau, A Week on the Concord and Merrimack Rivers. Given the same inputs the outputs are significantly different. For example, notice the difference in co-occurrences given the word “woodchuck”.
“walden” in Rivers
“fish” in Rivers
“woodchuck” in Rivers
“woods” in Rivers
Give it a try
Give it a try for yourself. I have written three CGI scripts implementing the things outlined above:
In each implementation you are given the opportunity to input your own queries, define the “size of the breath”, and the “level of detail”. The result is an interactive network diagram visualizing the most frequent co-occurrences of a given term.
The root of the Perl source code is located at http://infomotions.com/sandbox/network-diagrams/.
Implications for librarianship
The visualization of co-occurrences obviously has implications for text mining and the digital humanities, but it also has implications for the field of librarianship.
Given the current environment where data and information abound in digital form, libraries have found themselves in an increasingly competitive environment. What are libraries to do? Lest they become marginalized, librarians cannot rest on their “public good” laurels. Merely providing access to information is not good enough. Everybody feels as if they have plenty of access to information. What is needed are methods and tools for making better use of the data and information they acquire. Implementing text mining and visualization interfaces is one way to accomplish that goal within the context of online library services. Do a search in the “online catalog”. Create a subset of interesting content. Click a button to read the content from a distance. Provide ways to analyze and summarize the content, thus saving the time of the reader.
Us librarians have to do something differently. Think like an entrepreneur. Take account of your resources. Examine the environment. Innovate and repeat.