From Tim Donohue, DSpace Tech Lead, on behalf of the DSpace Committers
I'm pleased to announce that the DSpace 6.0 codebase is now ready for Testathon! We will be holding a two-week 6.0 Testathon extending from Monday, April 25 through Friday, May 6.
Please help ensure the success of DSpace 6.0 by helping us to test it during the 6.0 Testathon that will be held April 25 through May 6.
Today, the Schools and Libraries Divisions of the Universal Service Administrative Company (USAC), which administers the E-rate program, announced that it will extend the current form 471 filing window through May 26. For libraries and consortia, a second window will open to extend the filing window for those two groups until July 21. This additional window is in recognition of the difficulty libraries and larger consortia have had in completing the application process. The new and final day to file the 470 will be June 23.
With the new online filing system, there have been numerous issues preventing libraries from moving forward on their applications. While USAC has continually made efforts to provide updated information and fixes to the EPC system, it has proved challenging for many libraries to accurately finish and file an application. ALA requested the Federal Communications Commission (FCC) work with USAC to extend the filing window so libraries that have been struggling would still have an opportunity to apply, especially given the funding available for Category 2 services which most libraries have not been able to touch for most of the life of the E-rate program. We know that USAC and the FCC both heard the concerns of the library community and responded so that libraries are not inadvertently disadvantaged in this year’s filing window.
Regardless of the window extension, libraries will still need to work through the application process and solve any continuing kinks in their EPC accounts. There’s help out there! If you haven’t yet, connect with your state E-rate coordinator, who is there to help you. Check out Libraryerate, a new peer-to-peer portal for E-rate resources, and be sure to sign up for the News Brief from USAC for the latest information. As we get further information, we will make sure the library community is aware of it.
As of this posting you have 41 days, 6 hours, 14 minutes, and 32 seconds till the new May 26 deadline. Someone else who is better at math than I am can figure out the remaining hours before July 21. Or better still, allow yourself a little breathing room on a Friday afternoon and be ready to start fresh and take full advantage of the extra time.
The Access 2016 Program Committee invites proposals for participation in this year’s Access Conference, which will be held on the beautiful campus of the University of New Brunswick in the hip city of Fredericton, New Brunswick from 4-7 October.
There’s no special theme to this year’s conference, but — in case you didn’t know — Access is Canada’s annual library technology conference, so … we’re looking for presentations about cutting-edge library technologies that would appeal to librarians, technicians, developers, programmers, and managers.
Access is a single-stream conference that will feature:
• 45-minute sessions,
• lightning talks (try an Ignite-style talk: five minutes to talk while slides—20 in total—automatically advance every 15 seconds),
• a half-day workshop on the last day of the conference, or
• dazzle us with a bright idea for something different (panel, puppet show, etc.). We’d love to hear it!
To submit your proposal, please fill out the form by 15 April. Deadline extended to April 22!
Please take a look at the Code of Conduct too.
We’re looking forward to hearing from you!
My first library “job” was as “volunteen” for the summer reading program at the Garden Grove Chapman Branch of the Orange County Public Library. I did this during the summers in junior high school and into my freshman year of high school. I spent my time helping smaller kids tally up the number of books they had read, doling out prizes, and making suggestions for books they might enjoy. We helped out with decorations, crafts, story time and puppet shows. We auditioned for and rehearsed for the peak moment of the summer reading program series, the annual teen melodrama. I also did “other duties as assigned” — pasting, cutting, sorting books in preparation for shelving (I had a very tenuous grasp of the Dewey Decimal System), “repairing” cheap paperback books that were near the end of their life, and running small errands. I have always liked to stay busy, so I’m sure I drove the librarians crazy with requests for more tasks. I’m also amazed with the relative autonomy I had. Never overlook the power of the 7th and 8th grade work force!
For National Library Week I helped to pull together an OCLC Next series focusing on OCLC staff “first library job” experiences. I’ve always been impressed with the depth and breadth of my OCLC colleagues’ experience working in libraries before coming to OCLC, and the commitment that they continue to show in working with libraries at OCLC. In reading their responses to our questions I’m struck by how many of my colleagues started working in libraries at a very young age. This should be instructive to all of us who work with young people — they may well stick around!
I note that this year the theme of National Library Week is “libraries transform” — the blog posts help to underscore that theme of transformation, both for the libraries we have worked for and with, and also for the careers we have had. There is a lot of wisdom in these posts, so I hope you read and enjoy them.
- My first library job
- The road to librarianship
- Advice to my younger self
- How my first job prepared me for today’s challenges
- The future of libraries
Thanks to all who participated in the series — we had a great response and more content than we could possibly use. Special thanks to Brad Gauder, who did much of the heavy lifting in helping out with this series.
You can share your own “first library story” in the comments below, or on Twitter (use #NLW16 and #OCLCnext).
About Merrilee Proffitt
With the Code4Lib and PLA Conferences behind us, we’re now looking ahead to the Evergreen International Conference coming up on April 20. As you could probably guess, this is our favorite conference of the year! We love Evergreen and we love sharing things we’ve learned. Plus it gathers some of our favorite Evergreen aficionados in one place!
For 2016, Equinox is proud to be a Platinum Sponsor. We’re also sponsoring the Development Hackfest. In addition to our sponsorship roles, the Equinox team is participating in a combined nineteen presentations out of the forty scheduled for the conference. Here’s a sneak peek into those presentations:
- SQL for Humans (Rogan Hamby, Data and Project Analyst, Equinox Software, Inc.)
- Mashcat in Evergreen (Galen Charlton, Infrastructure and Added Services Manager, Equinox Software, Inc.)
- Introduction to the Evergreen Community (Ruth Frasur, Hagerstown-Jefferson Township Library; Kathy Lussier, MassLNC; Shae Tetterton, Equinox Software, Inc.)
- Digging Deeper: Acquisition Reports in Evergreen (Angela Kilsdonk, Equinox Software, Inc.)
- Staging Migrations and Data Updates for Success (Jason Etheridge, Equinox Software, Inc.)
- A Tale of Two Consortiums (Rogan Hamby, Equinox Software, Inc.)
- A More Practical Serials Walkthrough (Erica Rohlfs, Equinox Software, Inc.)
- SQL for Dummies (John Yorio and Dale Rigney, Equinox Software, Inc.)
- It Turns Out That This is a Popularity Contest, After All (Mike Rylander, Equinox Software, Inc.)
- We Are Family: Working Together to Make Consortial Policy Decisions (Shae Tetterton, Equinox Software, Inc.)
- Encouraging Participation in Evergreen II: Tools and Resources (and badges!) (Grace Dunbar, Equinox Software, Inc.)
- Metadata Abattoir: Prime Cuts of MARC (Mike Rylander, Equinox Software, Inc.)
- Serials Roundtable (Erica Rohlfs, Equinox Software, Inc.)
- Back to the Future: The Historical Evolution of Evergreen’s Code and Infrastructure (Mike Rylander and Jason Etheridge, Equinox Software, Inc.; Bill Erickson, King County Library System)
- Fund Fun: How to Set Up and Manage Funds in Acquisitions (Angela Kilsdonk, Equinox Software, Inc.)
- Not Your High School Geometry Class: How to Develop for the Browser Client with AngularJS (Galen Charlton and Mike Rylander, Equinox Software, Inc.; Bill Erickson, King County Library System)
- To Dream the Impossible Dream: Collaborating to Achieve Shared Vision (Grace Dunbar and Mike Rylander, Equinox Software, Inc.)
- The Catalog Forester: Managing Authority Records in Evergreen, Singly and In Batch (Galen Charlton and Mary Jinglewski, Equinox Software, Inc.; Chad Cluff, Backstage Library Works)
- Many Trees, Each Different: Sprucing Up Your Evergreen TPAC (James Keenan, C/W MARS; John Yorio and Dale Rigney, Equinox Software, Inc.)
The Equinox Team will be loading up in the Party Bus (No, we’re not kidding) on Tuesday, April 19. Hope to see you all in Raleigh! Follow along at home using the #evgils16 hashtag on Twitter/Facebook.
Our staff were recently asked to check thousands of ISBNs to find out if we already have the corresponding books in our catalogue. They in turn asked me if I could run a script that would check it for them. It makes me happy to work with people who believe in better living through automation (and saving their time to focus on tasks that only humans can really achieve).
Rather than taking the approach that I normally would, which would be to just load the ISBNs into a table in our Evergreen database and then run some queries to take care of the task as a one-off, I opted to try for an approach that would enable others to run these sorts of ad hoc reports themselves. As with most libraries, I suspect, we work with spreadsheets a lot, and as our university has adopted Google Apps for Education, we are slowly using Google Sheets more to enable collaboration. So I was interested in figuring out how to build a custom function that would look for the ISBN and then return a simple "Yes" or "No" value according to what it finds.
Evergreen has a robust SRU interface, which makes it easy to run complex queries and get predictable output back, and it normalizes ISBNs in the index so that a search for a 10-digit ISBN will return results for the corresponding 13-digit ISBN. That made figuring out the lookup part of the job easy; after that, I just needed to figure out how to create a custom function in Google Sheets.
Then I just add a column beside the column with ISBN values and invoke the function as (for example) =CheckForISBN(C2).
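The lookup behind such a function can be sketched outside of Sheets. The following is a minimal Python sketch, not the Sheets function itself: it builds an SRU 1.1 searchRetrieve URL and turns the numberOfRecords value of the response into "Yes" or "No". The function names and the base URL passed in are hypothetical, and sending the bare ISBN as the query assumes the server's default index matches ISBNs; a real Evergreen instance may need a specific index in the CQL query.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by SRU 1.1 searchRetrieve responses.
SRU_NS = "{http://www.loc.gov/zing/srw/}"


def build_sru_url(base_url, cql_query):
    """Build a searchRetrieve URL; base_url is whatever your catalogue exposes."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": "1",  # we only need the hit count, not the records
    }
    return base_url + "?" + urllib.parse.urlencode(params)


def answer_from_sru(sru_xml):
    """Return "Yes" if the SRU response reports at least one record, else "No"."""
    root = ET.fromstring(sru_xml)
    count = root.findtext(SRU_NS + "numberOfRecords")
    return "Yes" if count is not None and int(count) > 0 else "No"


def check_for_isbn(base_url, isbn):
    """Fetch the SRU response for one ISBN (assumes the default index finds ISBNs)."""
    with urllib.request.urlopen(build_sru_url(base_url, isbn)) as response:
        return answer_from_sru(response.read())
```

Keeping the XML parsing separate from the HTTP request makes it possible to check the "Yes"/"No" logic against a saved response without hitting the catalogue.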
Given a bit more time, it would be easy to tweak the function to make it more robust, offer variant search types, and contribute it as a module to the Chrome Web Store "Sheet Add-ons" section, but for now I thought you might be interested in it.
Caveats: With thousands of ISBNs to check, occasionally you'll get an HTTP response error ("#ERROR") in the column. You can just paste the formula back in again and it will resubmit the query. The sheet also seems to resubmit the request on a periodic basis, so some of your "Yes" or "No" values might change to "#ERROR" as a result.
I feel that this series is becoming a little long in the tooth. As such, this will be my last post in the series. This series will be aggregated under the following tag: linked data journey.
After spending a good amount of time playing with RDF technologies, reading authoritative literature, and engaging with other linked data professionals and enthusiasts, I have come to the conclusion that linked data, as with any other technology, isn’t perfect. The honeymoon phase is over! In this post I hope to present a high-level, pragmatic assessment of linked data. I will begin by detailing the main strengths of RDF technologies. Next I will note some of the primary challenges that come with RDF. Finally, I will give my thoughts on how the Library/Archives/Museum (LAM) community should move forward to make Linked Open Data a reality in our environment.
Strengths
Modularity. Modularity is a huge advantage RDF modeling has over modeling in other technologies such as XML, relational databases, etc. First, you’re not bound to a single vocabulary, such as Dublin Core, meaning you can describe a single resource using multiple descriptive standards (Dublin Core, MODS, Bibframe). Second, you can extend existing vocabularies. Maybe Dublin Core is perfect for your needs, except you need a more specific “date”. Well, you can create a more specific “date” term and assign it as a sub-property of DC:date. Third, you can say anything about anything: RDF is self-describing. This means that not only can you describe resources, you can describe existing and new vocabularies, as well as create complex versioning data for vocabularies and controlled terms (see this ASIST webinar). Finally, with SPARQL and reasoning, you can perform metadata cross-walking from one vocabulary to another without the need for technologies such as XSLT. Of course, this approach has its limits (e.g. you can’t cross-walk a broader term to a specific term).
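To make the sub-property idea concrete, here is a minimal sketch in plain Python (no RDF library) that treats triples as tuples and applies the RDFS sub-property rule: if a property is declared a sub-property of another, every statement made with the specific property also holds for the broader one. The `ex:` terms and the function name are hypothetical, and a real project would more likely use an RDF toolkit with a reasoner.

```python
SUBPROPERTY = "rdfs:subPropertyOf"


def infer_superproperties(triples):
    """Add (s, broader, o) for every (s, specific, o) where
    specific rdfs:subPropertyOf broader, repeating until nothing new appears."""
    triples = set(triples)
    sub_of = {(s, o) for s, p, o in triples if p == SUBPROPERTY}
    changed = True
    while changed:
        changed = False
        for s, p, o in list(triples):
            for specific, broader in sub_of:
                if p == specific and (s, broader, o) not in triples:
                    triples.add((s, broader, o))
                    changed = True
    return triples


data = [
    # A hypothetical, more specific date term declared as a sub-property of dc:date.
    ("ex:dateDigitized", SUBPROPERTY, "dc:date"),
    # A resource described with the specific term only.
    ("ex:item1", "ex:dateDigitized", "2016-04-14"),
]

inferred = infer_superproperties(data)
# A consumer that only understands dc:date can now find the value as well.
print(("ex:item1", "dc:date", "2016-04-14") in inferred)
```

This also shows the limit mentioned above: the rule only runs from the specific term to the broader one, never the other way around.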
Linking. Linking data is the biggest selling point of RDF. The ability to link data is great for the LAM community, because we’re able to link our respective institutions’ data together without the need for cross-referencing. Eventually, when there’s enough linked data in the LAM community, it will be a way for us to link our data together across institutions, forming a web of knowledge.
Challenges
Identifiers. Uniform Resource Identifiers (URIs) are double-edged swords when it comes to RDF. URIs help us uniquely identify every resource we describe, making it possible to link resources together. They also make it much less complicated to aggregate data from multiple data providers. However, creating a URI for every resource and maintaining stable URIs (which I think will be a requirement if we’re going to pull this off) can be cumbersome for a data provider, as well as rather costly.
Duplication. I have been dreaming of the day when we could just link our data together across repositories, meaning we wouldn’t need to ingest external data into our local repositories. This would relieve the duplication challenges we currently face. Well, we’re going to have to wait a little longer. While there are mechanisms out there that could tackle the problem of data duplication, they are unreliable. For example, with SPARQL you can run what is called a “federated query”. A federated query queries multiple SPARQL endpoints, which presents the potential of de-duplicating data by accessing the data from its original source. However, I’ve been told by linked data practitioners that public SPARQL endpoints are delicate and can crash when too much stress is exerted on them. Public SPARQL endpoints and federated querying are great for individuals doing research and small-scale querying; not-so-much for robust, large-scale data access. For now, best practice is still to ingest external data into local repositories.
Moving forward
Over the past few years I have dedicated a fair amount of research time developing my knowledge of linked data. During this time I have formed some thoughts for moving forward with linked data in the LAM community. These thoughts are my own and should be compared to others’ opinions and recommendations.
Consortia-level data models. Being able to fuse vocabularies together for resource description is amazing. However, it brings a new level of complexity to data sharing. One institution might use DC:title, DC:date, and schema:creator. Another institution might use schema:name (the DC:title equivalent), DC:date, and DC:creator. Even though both institutions are pulling from the same vocabularies, they’re using different terms. This poses a problem when trying to aggregate data from both institutions. I still see consortia such as the Open Archives Initiative forming their own requirements for data sharing. This can be seen now in the Digital Public Library of America (DPLA) and Europeana data models (here and here, respectively).
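The aggregation problem can be illustrated with a small Python sketch. It assumes a hypothetical consortium profile that prefers the DC terms and maps the schema.org terms onto them; the mapping table, record contents, and function name are illustrative only, and treating schema:creator as interchangeable with DC:creator is an assumption made for the example.

```python
# Hypothetical consortium profile: terms on the left are rewritten to terms on the right.
PROFILE_MAPPING = {
    "schema:name": "dc:title",
    "schema:creator": "dc:creator",
}


def normalize(record):
    """Rewrite a record's property names to the consortium's preferred terms."""
    return {PROFILE_MAPPING.get(term, term): value for term, value in record.items()}


# Two institutions describing the same kind of resource with different terms.
institution_a = {"dc:title": "Letters", "dc:date": "1848", "schema:creator": "A. Author"}
institution_b = {"schema:name": "Diaries", "dc:date": "1851", "dc:creator": "B. Writer"}

aggregated = [normalize(institution_a), normalize(institution_b)]
# After normalization both records expose the same three property names.
print(sorted(aggregated[0]) == sorted(aggregated[1]))
```

A shared data model plays the role of `PROFILE_MAPPING` here: it tells every contributor which term to use, so the aggregator does not have to guess.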
LD best practices. Linked data in the LAM community is in the “wild west” stages of development. We’re experimenting, researching, presenting primers to RDF, etc. However, RDF and linked data have been around for a while (a public draft of RDF was presented in 1997, seen here). As such, the larger linked data and semantic web community has formed established best practices for creating RDF data models and linked data. In order to seamlessly integrate into the larger community we will need to adopt and adhere to these best practices.
Linked Open Data. Linked data is not inherently “open”, meaning data providers have to make the effort to put the “open” in Linked Open Data. To maximize linked data, and to follow the “open” movement in libraries, I feel there needs to be an emphasis on data providers publishing completely open and accessible data, regardless of format and publishing strategy.
Conclusion
Linked data is the future of data in the LAM community. It’s not perfect, but it is an upgrade to existing technologies and will help the LAM community promote open and shared data.
I hope you enjoyed this series. I encourage you to venture forward; start experimenting with linked data if you haven’t. There are plenty of resources out there on the topic. As always, I’d like to hear your thoughts, and please feel free to reach out to me in the comments below or through twitter. Until next time.
This year's Drupal in Libraries Birds of a Feather session will be on Wednesday, May 11th from 3:45 to 4:45 in the Cherry Hill BoF Room (291) at the Morial Convention Center.
There is no agenda, so please bring your questions and stories. We would all love to see what you have been up to.
Among the things that we are interested in are the upcoming version of Islandora and summer reading programs.
Last year, the information provider ProQuest decided to discontinue its "Illustrata Technology" and "Illustrata Natural Science" databases. Unfortunately, this represents a preliminary end to ProQuest’s long-standing investment in deep indexing of content.
In a corresponding support article ProQuest states that there "[…] will be no loss of full text and full text + graphics images because of the removal of Deep Indexed content". In addition, they announce plans to "[…] develop an even better way for researchers to discover images, figures, tables, and other relevant visual materials related to their research tasks".
Learn about choosing which data to return using FILTER, OPTIONAL, and UNION
A protester throwing cookies at the parliament.
Here are some things that caught our ear this fine Thursday at the International Internet Preservation Consortium web archiving conference:
- Tom Storrar at the UK Government Web Archive reports on a user research project: ~20 in person interviews and ~130 WAMMI surveys resulting in 5 character vignettes. “WAMMI” replaces “WASAPI” as our favorite acronym.
- How do we integrate user research into day-to-day development? We’ll be chewing more on that one.
- Jefferson Bailey shares the Internet Archive’s learnings, with their ups and downs, from Archive-It Research Services. Projects from the last year include .GOV (100TB of .gov data in a Hadoop cluster donated by Altiscale), the L3S Alexandria Project, and something we didn’t catch with Ian Milligan at Archives.ca.
- You too can learn archive research with Vinay Goel’s Archive Research Services Workshop.
- PLUS Jefferson threw in some amazing stuff we still haven’t quite figured out involving iPython Notebooks with connections to big data sets.
- What the WAT? We hear a lot about WATs this year. Common Crawl has a good explainer.
- Ditte Laursen sets out to answer a big research question: “What does Danish web look like?” What is the shape of .dk? Eld Zierau reports that in a comparison of the Royal Danish Library’s .dk collection with the Internet Archive’s collection of Danish-language sites, only something like 10% were in both.
- Hugo Huurdeman asks an important question: what exactly is a website? Is it a host, a domain, or a set of pages that share the same CSS? To visualize change in whatever that is, he uses ssdeep, a fuzzy hashing mechanism for page comparison.
- Let’s just pause to say how inspiring this all is. It’s at about this point in the day that we started totally rethinking a project we’ve been working on for months.
- Justin Littman shares the Social Feed Manager, his happenin’ stack to harvest tweets and such.
- We learned that TWARC is either twerking for WARCs or a Twitter-harvesting Python package — we’re not entirely sure. Either way it’s our new new favorite acronym. Sorry, WAMMI.
- Nick Ruest and Ian Milligan give a very cool talk about sifting through hashtagged content on Twitter. Did you know that researchers only have 7-9 days to grab tweets under a hashtag before Twitter only makes the full stream available for a fee? (We did not know that.)
- We were also impressed by Canada’s huge amount of political social media engagement. Even though Canada isn’t a huge country [Ian’s words, not ours], 55,000 Tweets were generated in one day with the #elxn42 tag.
- Fernando Melo of Arquivo.pt pointed out that the struggle is real with live-web leaks in his research comparing OpenWayback and pywb. Fernando says in his tests OpenWayback was faster but pywb has higher-quality playbacks (more successes, fewer leaks). Both tools are expected to improve soon. We say it’s time for something like arewefastyet.com to make this a proper competition.
- Nicola Bingham is self-deprecating about the British Library’s extensive QA efforts: “This talk title isn’t quite right because it implies that we have Quality Assurance Practices in the Post Legal Deposit Environment.” They use the Web Curator Tool QA Module, but are having to go beyond that for domain-scale archiving.
- We’re also curious about this paper: Current Quality Assurance Practices in Web Archiving.
- Todd Stoffer demos NC State’s QA tool. A clever blend of tools like Google Forms, Trello, and IFTTT to let student employees provide archive feedback during downtime. Here are Todd’s [snazzy HTML/JS] slides.
TL;DR: lots of exciting things happening in the archiving world. Also exciting: the Icelandic political landscape. On the way to dinner, the team happened upon a relatively small protest right outside of the parliament. There was pot clanging, oil barrel banging, and an interesting use of an active smoke alarm machine as a noise maker. We were also handed “red cards” to wave at the government.
Now we’re off to look for the northern lights!
Don’t miss these amazing speakers at this important LITA preconference to the ALA Annual 2016 conference in Orlando FL.
Digital Privacy and Security: Keeping You And Your Library Safe and Secure In A Post-Snowden World
Friday June 24, 2016, 1:00 – 4:00 pm
Presenters: Blake Carver, LYRASIS and Jessamyn West, Library Technologist at Open Library
Learn strategies on how to make you, your librarians and your patrons more secure & private in a world of ubiquitous digital surveillance and criminal hacking. We’ll teach tools that keep your data safe inside of the library and out — how to secure your library network environment, website, and public PCs, as well as tools and tips you can teach to patrons in computer classes and one-on-one tech sessions. We’ll tackle security myths, passwords, tracking, malware, and more, covering a range of tools from basic to advanced, making this session ideal for any library staff.
Jessamyn West
Jessamyn West is a librarian and technologist living in rural Vermont. She studies and writes about the digital divide and solves technology problems for schools and libraries. Jessamyn has been speaking on the intersection of libraries, technology and politics since 2003. Check out her long running professional blog Librarian.net.
Jessamyn has given presentations, workshops, keynotes and all-day sessions on technology and library topics across North America and Australia. She has been speaking and writing on the intersection of libraries and technology for over a decade. A few of her favorite topics include: Copyright and fair use; Free culture and creative commons; and the Digital divide. She is the author of Without a Net: Librarians Bridging the Digital Divide, and has written the Practical Technology column for Computers in Libraries magazine since 2008.
See more information about Jessamyn at: http://jessamyn.info
Blake Carver
Blake Carver is the guy behind LISNews, LISWire & LISHost. Blake was one of the first librarian bloggers (he created LISNews in 1999) and is a member of Library Journal’s first Movers & Shakers cohort. He has worked as a web librarian, a college instructor, and a programmer at a startup. He is currently the Senior Systems Administrator for LYRASIS Technology Services where he manages the servers and infrastructure that support their products and services.
See more information about Blake at: http://eblake.com/
More LITA Preconferences at ALA Annual
Friday June 24, 2016, 1:00 – 4:00 pm
- Islandora for Managers: Open Source Digital Repository Training
- Technology Tools and Transforming Librarianship
Questions or Comments?
For all other questions or comments related to the preconference, contact LITA at (312) 280-4269 or Mark Beatty, firstname.lastname@example.org.
On 22 March 2016, the Library of Congress announced [pdf] that the subject heading Illegal aliens will be cancelled and replaced with Noncitizens and Unauthorized immigration. This decision came after a couple years of lobbying by folks from Dartmouth College (and others) and a resolution [pdf] passed by the American Library Association.
Among librarians, responses to this development seemed to range from “it’s about time” to “gee, I wish my library would pay for authority control” to the Annoyed Librarian’s “let’s see how many MORE clicks my dismissiveness can send Library Journal’s way!” to Alaskan librarians thinking “they got this change made in just two years!?! Getting Denali and Alaska Natives through took decades!”.
Business as usual, in other words. Librarians know the importance of names; that folks will care enough to advocate for changes to LCSH comes as no surprise.
The change also got some attention outside of libraryland: some approval by lefty activist bloggers, a few head-patting “look at what these cute librarians are up to” pieces in mainstream media, and some complaints about “political correctness” from the likes of Breitbart.
The Librarian of Congress shall retain the headings ‘‘Aliens’’ and ‘‘Illegal Aliens’’, as well as related headings, in the Library of Congress Subject Headings in the same manner as the headings were in effect during 2015.
There’s of course a really big substantive reason to oppose this move by Black: “illegal aliens” is in fact pejorative. To quote Elie Wiesel: “no human being is illegal.” Names matter; names have power: anybody intentionally choosing to refer to another person as illegal is on shaky ground indeed if they wish to not be thought racist.
There are also reasons to simply roll one’s eyes and move on: this bill stands little chance of passing Congress on its own, let alone being signed into law. As electoral catnip to Black’s voters and those of like-minded Republicans, it’s repugnant, but still just a drop in the ocean of topics for reactionary chain letters and radio shows.
Still, there is value in opposing hateful legislation, even if it has little chance of actually being enacted. There are of course plenty of process reasons to oppose the bill:
- There are just possibly a few matters that a member of the House Budget Committee could better spend her time on. For example, libraries in her district in Tennessee would benefit from increased IMLS support, to pick an example not-so-randomly.
- More broadly, Congress as a whole has much better things to do than to micro-manage the Policy and Standards Division of the Library of Congress.
- If Congress wishes to change the names of things, there are over 31,000 post offices to work with. They might also consider changing the names of military bases named after generals who fought against the U.S.
- Professionals of any stripe in civil service are owed a degree of deference in their professional judgments by legislators. That includes librarians.
- Few, if any, members of Congress are trained librarians or ontologists or have any particular qualifications to design or maintain controlled vocabularies.
However, there is one objection that will not stand: “Congress has no business whatsoever advocating or demanding changes to LCSH.”
If cataloging is not neutral, if the act of choosing names has meaning… it has political meaning.
And if the names in LCSH are important enough for a member of Congress to draft a bill about — even if Black is just grandstanding — they are important enough to defend.
If cataloging is not neutral, then negative reactions must be expected — and responded to.
Updated 2016-04-14: Add link to H.R. 4926.
WARC is often thought of as a useful preservation format for websites and Web content, but it can also be a useful tool in your toolbox for Web maintenance work.
At work we are in the process of migrating a custom site developed over 10 years ago to give it a new home on the Web. The content has proven useful and popular enough over time that it was worth the investment to upgrade and modernize the site.
You can see in the Internet Archive that the Early Americas Digital Archive has been online at least since 2003. You can also see that it hasn’t changed at all since then. It may not seem like it, but that’s a long time for a dynamic site to be available. It speaks to the care and attention of a lot of MITH staff over the years that it’s still running, and that it is even possible to conceive of migrating it to a new location using a content management system that didn’t even exist when the website was born.
As a result of the move the URLs for the authors and documents in the archive will be changing significantly, and there are lots of links to the archive on the Web. Some of these links can even be found in books, so it’s not just a matter of Google updating their indexes when they encounter a permanent redirect. Nevertheless, we do want to create permanent redirects from the old location to the new location so these links don’t break.
If you are the creator of a website, itemizing the types of URLs that need to change may not be a hard thing to do. But for me, arriving on the scene a decade later, it was non-trivial to digest the custom PHP code, look at the database and the filesystem, and determine the full set of URLs that might need to be redirected.
So instead of code spelunking I decided to crawl the website, and then look at the URLs that are present in the crawled data. There are lots of ways to do this, but it occurred to me that one way would be to use wget to crawl the website and generate a WARC file that I could then analyze.
The first step is to crawl the site. wget is a venerable tool with tons of command line options. Thanks to the work of Jason Scott and Archive Team a --warc-file command line option was added a few years ago that serializes the results of the crawl as a single WARC file.
In my case I also wanted to create a mirror copy of the website for access purposes. The mirrored content is an easy way to see what the website looked like without needing to load the WARC file into a player of some kind like Wayback…but more on that below.
So here was my wget command:

    wget --warc-file eada --mirror --page-requisites --adjust-extension --convert-links --wait 1 --execute robots=off --no-parent http://mith.umd.edu/eada/
The EADA website isn’t huge, but this ran for about an hour because I decided to be nice to the server with a one second pause between requests. When it was done I had a single WARC file that represented the complete results of the crawl: eada.warc.gz.
With this in hand I could then use Anand Chitipothu and Noufal Ibrahim’s warc Python module to read in the WARC file looking for HTTP responses for HTML pages. The program simply emits the URLs as it goes, and thus builds a complete set of webpage URLs for the EADA website.

    import warc

    from StringIO import StringIO
    from httplib import HTTPResponse


    class FakeSocket():
        def __init__(self, response_str):
            self._file = StringIO(response_str)

        def makefile(self, *args, **kwargs):
            return self._file


    for record in warc.open("eada.warc.gz"):
        if record.type == "response":
            resp = HTTPResponse(FakeSocket(record.payload.read()))
            resp.begin()
            if resp.getheader("content-type") == "text/html":
                print record['WARC-Target-URI']
As you can probably see the hokiest part of this snippet is parsing the HTTP response embedded in the WARC data. Python’s httplib wanted the HTTP response to look like a socket connection instead of a string. If you know of a more elegant way of going from a HTTP response string to a HTTP Response object I’d love to hear from you.
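For anyone on Python 3, here is a minimal sketch of the same fake-socket trick using only the standard library. It is not a more elegant answer, just a port: `http.client.HTTPResponse` still wants something with a `makefile()` method, so the raw bytes get wrapped in `io.BytesIO`. The helper names `parse_http_response` and `is_html` and the sample response are my own inventions for illustration. One difference from the snippet above is an assumption on my part: servers often send `text/html; charset=UTF-8`, so this version compares only the media type rather than the whole header value.

```python
import io
from http.client import HTTPResponse


class FakeSocket:
    """Wraps raw bytes so HTTPResponse can read them as if from a socket."""

    def __init__(self, response_bytes):
        self._file = io.BytesIO(response_bytes)

    def makefile(self, *args, **kwargs):
        return self._file


def parse_http_response(response_bytes):
    """Parse a raw HTTP response (status line, headers, body) from bytes."""
    resp = HTTPResponse(FakeSocket(response_bytes))
    resp.begin()  # reads the status line and headers
    return resp


def is_html(resp):
    """True if the media type is text/html, ignoring parameters like charset."""
    content_type = resp.getheader("Content-Type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


# A made-up response, standing in for a WARC record payload.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)
resp = parse_http_response(raw)
```

In a crawl-analysis loop, `raw` would be the bytes read from each response record's payload instead of a literal.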
I sorted the output and came up with a nice list of URLs for the website. Here is a brief snippet:

    http://mith.umd.edu/eada/gateway/winslow.php
    http://mith.umd.edu/eada/gateway/winthrop.php
    http://mith.umd.edu/eada/gateway/witchcraft.php
    http://mith.umd.edu/eada/gateway/wood.php
    http://mith.umd.edu/eada/gateway/woolman.php
    http://mith.umd.edu/eada/gateway/yeardley.php
    http://mith.umd.edu/eada/guesteditors.php
    http://mith.umd.edu/eada/html/display.php?docs=acrelius_founding.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=alsop_character.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=arabic.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=ashbridge_account.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=banneker_letter.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=barlow_anarchiad.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=barlow_conspiracy.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=barlow_vision.xml&action=show
    http://mith.umd.edu/eada/html/display.php?docs=barlowe_voyage.xml&action=show
The URLs with docs= in them are particularly important because they identify documents in the archive, and seem to form the majority of inbound links. So there still remains work to map the old URLs to the new ones, but now we at least know what they are.
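As a rough sketch of what that mapping step could look like, the following pulls the docs= value out of each old URL and pairs it with a new address. The destination pattern `NEW_URL_TEMPLATE` is purely hypothetical (the real new location isn't settled here), and the helper names `doc_id` and `redirect_map` are made up for illustration. Stripping the `.xml` suffix is likewise an assumption about what the new identifiers might look like.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical destination pattern; the real new URL scheme is not known here.
NEW_URL_TEMPLATE = "https://example.org/eada/doc/{doc_id}"


def doc_id(old_url):
    """Return the document id from a docs= URL, or None if there is none."""
    query = parse_qs(urlsplit(old_url).query)
    values = query.get("docs")
    if not values:
        return None
    name = values[0]
    # Assumption: drop the .xml suffix, so "banneker_letter.xml"
    # becomes "banneker_letter".
    return name[:-4] if name.endswith(".xml") else name


def redirect_map(old_urls):
    """Map each old document URL to its (hypothetical) new URL."""
    mapping = {}
    for url in old_urls:
        ident = doc_id(url)
        if ident is not None:
            mapping[url] = NEW_URL_TEMPLATE.format(doc_id=ident)
    return mapping


old_urls = [
    "http://mith.umd.edu/eada/guesteditors.php",
    "http://mith.umd.edu/eada/html/display.php?docs=banneker_letter.xml&action=show",
]
redirects = redirect_map(old_urls)
```

URLs without a docs= parameter are skipped, so they would still need to be handled by some other rule.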
I mentioned earlier that the mirror copy is an easy way to view the crawled content without needing to start a Wayback or equivalent web archive player. But one other useful thing you can do on your workstation is download Ilya Kreymer’s WebArchivePlayer for Mac or Windows and start it up. It asks you to select a WARC file, which it then lets you view in your browser.
In case you don’t believe me, here’s a demo of this little bit of magic:
As I’ve done with other web preservation work at MITH, I then took the WARC file, the mirrored content, the server side code and the database export and put them in a bag, which I then copied up to MITH’s S3 storage. Will the WARC file stand the test of time? I’m not sure. But the WARC file was useful to me here today. So there’s reason to hope.
I’ve decided to get rid of my CDs, so I’m ripping them all (to FLAC) with Rhythmbox. It can talk to MusicBrainz to get metadata: album title, album artist, song titles, genre, etc. Sometimes that doesn’t work, and then it’s nice to use the feature of EasyTAG (which I use to edit metadata) where it can look up the information on FreeDB based on the raw information about track lengths and such. Almost always, that works. Sometimes, it doesn’t. One time, it presented me with a strange choice: Hmm … how to tell?
Auto-captioned photo of Jack, Genève, and Matt — thanks CaptionBot!
- Andy Jackson kicks the conference off with “Have I accidentally committed international journalism?” — he has contributed to the open source software that was used to review the Panama Papers.
- Andrea Goethals describes the desire for smaller modules in the web archive tool chain, one of her conclusions from Harvard Library’s Environmental Scan of Web Archiving. This was the first of many calls throughout the day for more nimble tools.
- Stephen Abrams shares the California Digital Library’s success story with Archive-It. “Archive-It is good at what it does, no need for us to replicate that service.”
- John Erik Halse encourages folks to contribute code and documentation. Don’t be intimidated and just dive in.
- There seems to be consensus that Heritrix is a tool that everyone needs but no one is in charge of — that’s tough for contributors. A few calls for the Internet Archive to ride in and save the day.
- We’re not naming names, but a number of organizations have had their IT departments, or IT contractors, seek to run virus scanners that would edit the contents of an archive after preservation. (Hint: it’s not easy to archive malware, but “just delete it” isn’t the answer.)
- Some kind member of IIPC reminds us of the amazing Malware Museum hosted by the Internet Archive.
- David Rosenthal notes that Iceland has been called the “Switzerland of bits”. After being in Reykjavik for only a few days, we sort of agree!
- Jefferson Bailey of the Internet Archive echoed concerns about looming web entropy: there is significant growth in web archiving, but a concentration of storage for archives.
- Nicholas Taylor of the Stanford Digital Library is responsible for the most wonderful acronym of all time, WASAPI (“Web Archiving Systems API”).
- The Memento Protocol remains the greatest thing since sliced bread. (Here we refer to the web discovery standard, not the Jason Bourne movie.)
- We chat with Michael Nelson about his projects at ODU, from the Mink browser plugin to the icanhazmemento Twitter bot.
- Hjálmar Gíslason points out that 500 hours of video are uploaded to YouTube each minute. It would take 90,000 employees working full time to watch it all. Conclusion: Google needs to hire some people and get on this.
- Hjálmar also mentions Tim Berners-Lee’s 5-Star Open Data standard. Nice goal to work toward for Free the Law!
- Vint Cerf on Digital Vellum: the Catholic Church has lasted for an awfully long time, and breweries tend to stick around a long time. How could we design a digital archiving institution that could last that long?
- (Perma’s suggestion: how about a TLD for URLs that never change? We were going to suggest .cool, because cool URLs don’t change. But that seems to be taken.)
- Ilya Kreymer shows off the first webpage ever in the first browser ever, running in a simulated NeXT Computer, courtesy of oldweb.today.
- Dragan Espenschied says Rhizome views the web as “performative media” while showing Jan Robert Leegte’s [untitled]scrollbars piece through different browsers in oldweb.today. Sometimes the OS is the artwork.
- Matthew S. Weber and Ian Milligan have been running web archive hackathons to connect researchers to computer programmers. Researchers need this: “It would be dishonest to do a history of the 90s without using web archives.” Cue <marquee> tags here.
- Brewster Kahle pitches the future of national digital collections, using as a model the fictional (but oh-so-cool) National Library of Atlantis. Shows off clever ways to browse a nation’s TV news, books, music, video games, and so much more.
- Brewster encourages folks to recognize that there is no “The Web” anymore: collections will differ based on context and provenance of the curator or crawler. (What is archiving “The Web” if each of us has a different set of sites that are blocked, allowed, or custom-generated for us?)
- Brewster voices the need for broad, high level visualizations in web archives. He highlights existing work and thinks we can push it further.
- And oh by the way, he also shows off Wayback Explorer over at Archive Labs — graph major and minor changes in websites over time.
- Bonus: We’re fortunate enough to grab some whale sushi (or vegan alternatives) with David Rosenthal, Ilya Kreymer, and Dragan Espenschied.
Looking forward to the next couple of days …