Feed aggregator

Dan Scott: Dear database vendor: defending against scraping is going to be very difficult

planet code4lib - Wed, 2014-12-10 16:16

Our library receives formal communications from various content/database vendors about "serious intellectual property infringement" on a reasonably regular basis, that urge us to "pay particular attention to proxy security". Here is part of the response I sent to the most recent such request:

We use the UsageLimit directives that OCLC's EZProxy solution offers to block users who go over certain thresholds. However, the UsageLimit directives are really too coarse to be extremely useful. For example, you can set a limit based on the number of transfers in a given time period, but you can't set different thresholds for content types (such as CSS, JavaScript, HTML, images, or PDFs). The compromised account had gathered a set of URLs that enabled them to directly request a series of PDFs, thus staying below the general threshold for transfers. If EZProxy offered a "transfer threshold by MIME type" directive, then we could easily block users who tried to download more than, say, 100 PDFs in an hour.
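To make that wish concrete, here is a minimal sketch of what a "transfer threshold by MIME type" check could look like. It is not an EZProxy feature and does not use EZProxy's configuration syntax; the class name `MimeThrottle`, the limit of 100 PDFs, and the rolling one-hour window are all assumptions chosen to match the example above.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# ASSUMPTION: hypothetical policy of at most 100 PDFs per user per rolling hour.
PDF_LIMIT = 100
WINDOW = timedelta(hours=1)


class MimeThrottle:
    """Rolling-window transfer counter keyed by (user, MIME type)."""

    def __init__(self, limits, window=WINDOW):
        self.limits = limits            # e.g. {"application/pdf": 100}
        self.window = window
        self.seen = defaultdict(deque)  # (user, mime) -> request timestamps

    def allow(self, user, mime, when):
        """Record a transfer; return False once the user is over the limit."""
        limit = self.limits.get(mime)
        if limit is None:
            return True                 # no threshold set for this content type
        stamps = self.seen[(user, mime)]
        # Drop timestamps that have fallen out of the rolling window.
        while stamps and when - stamps[0] >= self.window:
            stamps.popleft()
        stamps.append(when)
        return len(stamps) <= limit
```

With this shape, CSS, JavaScript, and image requests would pass untouched while the 101st PDF inside an hour is refused, which is the distinction a single transfer count cannot make.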

We also set UsageLimit directives for total bandwidth consumed. Again, however, this is limited by the coarseness of the directives available to us in EZProxy, and by the increasingly rich variety of content available from electronic resources these days. With individual PDFs varying in size from 0.25 MB to 2.5 MB, not to mention streaming audio and video services, finding the right threshold without locking out legitimate users is quite challenging.

I therefore urge you to contact OCLC directly and demand that they add finer-grained directives for UsageLimit throttling to EZProxy. As EZProxy is by far the most common proxy solution deployed by libraries worldwide, this would enable many of your customers to benefit from the enhancement. While OCLC's customers have been requesting functionality like this for years via the EZProxy mailing list, OCLC has been slow to react (having taken months to update EZProxy to address recent SSL vulnerabilities, for example). Perhaps OCLC will listen to an enterprise partner.

For our part, at Laurentian, I have asked our IT Services department (which controls our proxy server) to write a simple script that parses the EZProxy event logs and emails us when a user is blocked for exceeding a threshold. This would have helped us catch the compromised account much earlier, and it should also be a basic feature of EZProxy. Right now, every library has to implement its own solution for this basic requirement, and many do not.
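A script along those lines might look roughly like the following sketch. The log layout is an assumption, not EZProxy's actual format: it supposes one tab-separated event per line (timestamp, event name, username, detail) and a hypothetical event name `BLOCKED`. The function names and addresses are likewise made up for illustration.

```python
from email.message import EmailMessage

# ASSUMPTION: one event per line, tab-separated as
#   <timestamp>\t<event>\t<username>\t<detail>
# "BLOCKED" is a hypothetical event name, not a documented EZProxy value.
BLOCK_EVENT = "BLOCKED"


def find_blocks(lines):
    """Return (timestamp, user, detail) for every block event in the log."""
    blocks = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 4 and parts[1] == BLOCK_EVENT:
            blocks.append((parts[0], parts[2], parts[3]))
    return blocks


def build_alert(blocks, sender, recipient):
    """Build (but do not send) a plain-text alert, or None if nothing to report."""
    if not blocks:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Proxy: {len(blocks)} account(s) blocked"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "\n".join(f"{ts}  {user}  {detail}" for ts, user, detail in blocks)
    )
    return msg
```

Sending is left out on purpose; on a host with a local mail relay, passing the message to `smtplib.SMTP("localhost").send_message(msg)` from a scheduled job would be one way to deliver it.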

All that said, even with finer-grained threshold directives and active monitoring of account blocking events, I have to note that a savvy attacker intent on harvesting your content will, once they have compromised an account, simply slow their requests to a rate that emulates normal human activity, spread the requests across all of the accounts they have compromised, and introduce enough randomness that the requests form no detectable pattern (such as linear requests for only PDFs). No system is going to offer a perfect defence against those efforts.

I'm sympathetic to the content vendors' concerns, but really, even if OCLC does add some of these features to their core EZProxy offering, the content-scraping approaches will simply increase in sophistication. Removing proxy access isn't a real option for our users, even though cutting off proxy access is what the content vendors do. This is a game that nobody is going to win.

John Miedema: PirateBay went down yesterday. Text analysts can take a page from pirates.

planet code4lib - Wed, 2014-12-10 14:55

This post deserves an essay. I’m going to take big leaps with too little explanation, but it’s been rattling in my head for a while and yesterday’s bust of PirateBay compelled me to write something down.

PirateBay went down yesterday after police in Sweden seized computers. This is not the first time the site has gone down, and people expect it to come back up. Torrent technology was invented for just this kind of event. A torrent only stores metadata about files available elsewhere. The entire PirateBay set of magnets can be stored on a USB disk. Cached versions of PirateBay still exist on the web and people can still download files.
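To see why a whole index of magnets fits on a USB disk, it helps to look at how little a magnet link holds. The sketch below pulls apart one magnet URI using only the standard library. The function name `parse_magnet` and the sample hash, name, and tracker used with it are made up for illustration; the `xt`, `dn`, and `tr` parameters and the `urn:btih:` prefix are the usual magnet conventions.

```python
from urllib.parse import urlsplit, parse_qs

# Prefix that marks a BitTorrent info hash inside the "xt" (exact topic) field.
BTIH_PREFIX = "urn:btih:"


def parse_magnet(uri):
    """Split a magnet URI into its info hash, display name, and trackers."""
    parts = urlsplit(uri)
    if parts.scheme != "magnet":
        raise ValueError("not a magnet link: %r" % uri)
    params = parse_qs(parts.query)
    info_hash = None
    for topic in params.get("xt", []):
        if topic.startswith(BTIH_PREFIX):
            info_hash = topic[len(BTIH_PREFIX):]
            break
    return {
        "info_hash": info_hash,                 # identifies the content
        "name": params.get("dn", [None])[0],    # optional display name
        "trackers": params.get("tr", []),       # optional tracker URLs
    }
```

The link carries a hash and a few hints about where to ask for peers, and none of the file content, so each entry costs only a few hundred bytes.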

One might dismiss torrent technology as a hack by pirates unwilling to pay for content, but torrents are driving real-world innovation. In earlier posts, I compared the classical “Hot Water Tank” architecture of a QA system with an alternative “Tank-less” architecture. The Tank approach is solid but cumbersome, while the Tank-less approach is deft. The idea is part of a larger shift in the world of big data processing and a demand for real-time stream processing. One of the technologies in play is torrents.

The pirate flag flies in winter in Wakefield Quebec

Go ahead and question the motives of pirates, but their purpose overlaps with that of freedom of information advocates. Consider PirateBox. PirateBox is a do-it-yourself file sharing system, built with a cheap router and open source software. Bring it to a public space and anyone can anonymously upload and download content. It can be used to share movies. It could also be used to legally share health care information in the aftermath of a natural disaster when the internet is not available. It is no surprise that the technology has been taken up by librarians in the form of LibraryBox.

The fight for net neutrality does not seem to end. A two-tiered internet seems inevitable. Those who seek greater internet surveillance powers keep coming back. What can be done? In 2012 PirateBay experienced downtime. It came back online announcing a plan to move its servers to the sky, tethered to drones. It got me thinking: strap a PirateBox to a drone from BestBuy and you have a flying internet. The cost is low. Build a fleet. A flying internet would deftly sidestep unwanted controls, for geeks wanting the latest Marvel movie, for teachers in Syria.

PirateBay, PirateBox, a drone-based internet. It sounds fantastic but the driver is practical. People want agile access to content. If things get too boxed in then people will invent PirateBoxes to get out. It is the same challenge faced in big data and text analytics today. Faced with an ocean of unstructured content waiting to be mined, traditional database design and top-down programming are simply too rigid. New approaches with Natural Language Processing divide content into fragments and apply bottom-up pattern recognition to extract meaning. You can see the parallel with the pirates: the use of sophisticated techniques to preserve access to distributed content.

I think of Fahrenheit 451 and the character Granger, the leader of a group of exiled drifters. Each has memorized fragments of books in readiness for a time when society will be ready to discover them.

ACRL TechConnect: This Is How I Work (Nadaleen Tempelman-Kluit)

planet code4lib - Wed, 2014-12-10 14:37

Editor’s Note: This post is part of ACRL TechConnect’s series by our regular and guest authors about The Setup of our work.


Nadaleen Tempelman-Kluit @nadaleen

Location: New York, NY

Current Gig: Head, User Experience (UX), New York University Libraries

Current Mobile Device: iPhone 6

Current Computer:

Work: MacBook Pro 13-inch and Apple 27-inch Thunderbolt display

Old Dell PC that I use solely to print and to access our networked resources


I carry my laptop to and from work with me and have an old MacBook Pro at home.

Current Tablet: First generation iPad, supplied by work

One word that best describes how you work: has anyone said frenetic yet?

What apps/software/tools can’t you live without?

Communication / Workflow

Slack is the UX Dept. communication tool in which all our communication takes place, including instant messaging, etc. We create topic channels in which we add links and tools and thoughts, and get notified when people add items. We rarely use email for internal communication.

Boomerang for Gmail: I write a lot of emails early in the morning, so I schedule them to be sent at different times of the day without forgetting.

Pivotal Tracker: A user story-based project planning tool based on agile software development methods. We start with user flows, then integrate them into bite-size user stories in Pivotal, and then point them for development.

Google Drive


Google Hangouts: We work closely with our Abu Dhabi and Shanghai campus libraries, so we do a lot of early morning and late night meetings using Google Hangouts (or GoToMeeting, below) to include everyone.

Wireframing, IA, Mockups

Sketch: A great lightweight design app

OmniGraffle: A more heavy-duty tool for wireframing, IA work, mockups, etc. Compatible with a ton of stencil libraries, including the great Konigi and Google material design icons. Great for interactive interface demos, and for user flows and personas.

Adobe Creative Cloud

Post It notes, Graph paper, White Board, Dry-Erase markers, Sharpies, Flip boards

Tools for User Centered Testing / Methods 

GoToMeeting: Used to broadcast formal usability testing to observers in another room, so they can take notes, view the testing in real time, and send virtual follow-up questions for the facilitator to ask participants.

Crazy Egg: A heat mapping, hot spotting, and A/B testing tool which, when coupled with analytics, really helps us get a picture of where users are going on our site.

Silverback: A screen-capturing usability testing app.

PostitPlus: We do a lot of affinity grouping exercises and interface sketches using Post-it notes, so this app is super cool and handy.

OptimalSort: Online card sorting software.

Personas: Used to think through our user flows when working on a process, service, or interface. We then use these personas to create more granular user stories in Pivotal Tracker (above).

What’s your workspace like?

I’m on the mezzanine of Bobst Library which is right across from Washington Square Park. I have a pretty big office with a window overlooking the walkway between Bobst and the Stern School of Business.

I have a huge old subway map on one wall with an original heavy wood frame, and everyone likes looking at old subway lines, etc. I also have a map sheet of the mountain I’m named after. Otherwise, it’s all white board and I’ve added our personas to the wall as well so I can think through user stories by quickly scanning and selecting a relevant persona.

I’m in an area where many of my colleagues’ mailboxes are, so people stop by a lot. I close my door when I need to concentrate, and on Fridays we try to work collaboratively in a basement conference room with a huge whiteboard.

I have a heavy wooden L shaped desk which I am trying to replace with a standing desk.

Every morning I go to Oren’s, a great coffee shop nearby, with the same colleague and friend, and we usually do “loops” around Washington Square Park to problem solve and give work advice. It’s a great way to start the day.

What’s your best time-saving trick?

Informal (but not happenstance) communication saves so much time in the long run and helps alleviate potential issues that can arise when people aren’t communicating. Though it takes a few minutes, I try to touch base with people regularly.

What’s your favorite to-do list manager?

My whiteboard, supplemented by stickies (mac), and my huge flip chart notepad with my wish list on it. Completed items get transferred to a “leaderboard.”

Besides your phone and computer, what gadget can’t you live without?


What everyday thing are you better at than everyone else?

I don’t think I do things better than other people, but I think my everyday strengths include:  encouraging and mentoring, thinking up ideas and potential solutions, getting excited about other people’s ideas, trying to come to issues creatively, and dusting myself off.

What are you currently reading?

I listen to audiobooks and podcasts on my bike commute. Among my favorites:

In print, I’m currently reading:

What do you listen to while at work?

Classical is the only type of music I can play while working and still be able to (mostly) concentrate. So I listen to the masters, like Bach, Mozart and Tchaikovsky.

When we work collaboratively on creative things that don’t require earnest concentration I defer to one of the team to pick the playlist. Otherwise, I’d always pick Josh Ritter.

Are you more of an introvert or an extrovert?

Mostly an introvert who fakes being an extrovert at work, but as other authors have said (Eric, Nicholas), it’s very dependent on the situation and the company.

What’s your sleep routine like?

Early to bed, early to rise. I get up between 5 and 6 and go to bed around 10.

Fill in the blank: I’d love to see _________ answer these same questions.

@Morville (Peter Morville)

@leahbuley (Leah Buley)

What’s the best advice you’ve ever received?

Show up

LITA: Virtual Machines in a Nutshell

planet code4lib - Wed, 2014-12-10 13:27

Many of you have probably heard the term “virtual machine”, but might not be familiar with what a VM is or does. Virtualization is a complicated topic, as there are many different kinds and it can be difficult for the novice to tell which is which. Today we’re going to talk specifically about OS virtualization and why you should care about this pretty fabulous piece of tech.

Let’s start with a physical computer. For the sake of having a consistent example, we’ll say it’s a Dell laptop running Windows 7. Dual booting is a popular method of installing an additional operating system onto a physical computer in order to have more options and flexibility with what programs you want to run. Lots of Mac users run Boot Camp so they can have both OS X and Windows side by side. While dual booting is a great choice for many, it has limitations. Installing an OS directly onto the hardware is expensive in terms of time and system resources, and doesn’t scale very well if you want to install LOTS of operating systems as a test. What if we want Mac, Windows, and a few flavors of Linux? Bringing more than two operating systems onto the hardware is asking for trouble. Dual booting is also overkill if you are just experimenting with an OS. If you are like me and you like to install things just to see if you like them and then throw them away when you are done, dual booting just takes too long.

Enter OS virtualization. Using virtualization software like VirtualBox, a user can have any number of operating systems running as virtual machines. Our trusty Dell laptop, henceforth referred to as the “host machine”, running Windows 7, henceforth referred to as the “host OS”, downloads a copy of VirtualBox for Windows and installs it just like any other program. Virtualization software is built to manage VMs (each also known as a “guest OS”) just like Microsoft Word manages documents and iTunes manages music. VMs are just files that the virtualization software runs, making it far easier to download, install, back up and destroy any number of operating systems at will. It also allows the host machine to run several operating systems at once; Windows can be running VirtualBox, which is running a Mac OS X guest OS and a Linux guest OS. As you could probably guess, having several VMs running at once can be a drain on memory, so just because you can run several at once doesn’t mean you should.

Now let’s talk about why you would want to virtualize operating systems. The first and most obvious reason is that it’s more convenient than installing a new OS straight onto the hardware. Many libraries are starting to leverage OS virtualization as part of their IT strategy. When you have hundreds of computers to manage, it’s a lot easier to install virtualization software on all of them and then deploy a single managed VM file (called an “image”) to all of them instead of installing the exact same set of programs on each one individually. It’s also a great way for regular users to experiment with new environments without fear of turning their computers into expensive paperweights. Since the host OS is never overwritten, there’s never any danger of accidentally deleting your entire system, and you can always go back to the OS you are familiar with when finished.

If you are a coder, VMs are manna from heaven for many reasons. The first is that they allow you to download whatever you want without mucking up your host machine. I’m constantly downloading new tools and programs to test out, and I don’t keep 95% of them. Testing them out in a virtual machine means that I can just delete the entire VM when I’m done, taking all that junk left over from the installation and any test files I created along with it. I can also play around with configurations in a VM without fear of doing irreparable damage.

Perhaps one of the most useful aspects of a VM for coders is the ability to mimic target environments. Here at FSU, all of our servers are running a specific kind of Linux called Red Hat v6.5. With OS virtualization, I can download a Red Hat v6.5 image and go hog wild installing, deleting and reconfiguring whatever I want without fear of accidentally trashing the server and taking down our website. If I do inadvertently break something in the VM, I just delete it and spin up another instance. This can be a great tool for teaching newbies how to work on your production server without actually letting them anywhere near your production server.
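For the curious, the "break it, delete it, spin up another" cycle can be scripted. The sketch below only builds and prints the VirtualBox command-line calls for that cycle; it assumes VirtualBox's `VBoxManage` tool is installed, and the appliance file name and VM name used with it are hypothetical. Treat it as an outline of the idea rather than a tested deployment script.

```python
import subprocess


def spin_up(appliance, name):
    """Commands to import a prepackaged image and boot it without a window."""
    return [
        ["VBoxManage", "import", appliance, "--vsys", "0", "--vmname", name],
        ["VBoxManage", "startvm", name, "--type", "headless"],
    ]


def tear_down(name):
    """Commands to power the VM off and delete it along with its disk files."""
    return [
        ["VBoxManage", "controlvm", name, "poweroff"],
        ["VBoxManage", "unregistervm", name, "--delete"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for cmd in commands:
        shown.append(" ".join(cmd))
        print(("would run: " if dry_run else "running: ") + shown[-1])
        if not dry_run:
            subprocess.run(cmd, check=True)
    return shown
```

Calling `run_all(spin_up("rhel65-test.ova", "sandbox"))` prints the two commands without touching anything; passing `dry_run=False` would actually run them, so the host OS still never gets modified beyond VirtualBox's own files.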

You can prepackage software on an image as well, which is handy when you and your team want a simple way to play around with some software that might be difficult to install. The Islandora project distributes a virtual machine containing all the necessary parts configured correctly to create a working Islandora instance. This has been a huge boon to the project because it lets newbies who don’t know what they are doing (such as myself) have access to a disposable Islandora to hack on without the pain of setting one up themselves. Catmandu, a bibliographic data processing toolkit, can also be downloaded as a VM for experimentation. Expect to see this trend of software being distributed in a virtual machine continue in the future.

Learning to leverage OS virtualization effectively has changed the way I work. I do almost all of my work inside of disposable VMs now just because it’s so much cleaner and more convenient; it’s like a quarantined area for when you are working on things that may or may not explode. Even if you aren’t a developer, there are plenty of convenient ways to use virtualization in your everyday work environment. Despite the complicated technology running under the hood, getting started with virtualization has never been easier. Give it a shot today and let me know what you think in the comments!

Journal of Web Librarianship: A Review of “Building and Managing E-Book Collections: A How-to-Do-It Manual for Librarians”

planet code4lib - Wed, 2014-12-10 06:24
Volume 8, Issue 4, October-December 2014, pages 418-419
David Gibbs

Journal of Web Librarianship: A Review of “The CSS3 Anthology: Take Your Sites to New Heights, 4th ed.”

planet code4lib - Wed, 2014-12-10 06:24
Volume 8, Issue 4, October-December 2014, pages 419-420
Elizabeth Fronk

Journal of Web Librarianship: A Review of “Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web”

planet code4lib - Wed, 2014-12-10 06:24
Volume 8, Issue 4, October-December 2014, pages 420-421
Kali Davis

Journal of Web Librarianship: Editorial Board/EOV

planet code4lib - Wed, 2014-12-10 06:24
Volume 8, Issue 4, October-December 2014, pages ebi-ebi

Journal of Web Librarianship: Tutorials on Google Analytics: How to Craft a Web Analytics Report for a Library Web Site

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 404-417
Le Yang

Journal of Web Librarianship: Exploring Library Discovery Positions: Are They Emerging or Converging?

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 331-348
Nadine P. Ellero

Journal of Web Librarianship: Assessment of Digitized Library and Archives Materials: A Literature Review

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 384-403
Elizabeth Joan Kelly

Journal of Web Librarianship: A Review of “More Technology for the Rest of Us: A Second Primer on Computing for the Non-IT Librarian”

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 423-424
Dena L. Luce

Journal of Web Librarianship: A Review of “Information Services and Digital Literacy: In Search of the Boundaries of Knowing”

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 422-423
Joseph Grobelny

Journal of Web Librarianship: A Review of “Handbook of Indexing Techniques: A Guide for Beginning Indexers”

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 421-422
Bradford Lee Eden

Journal of Web Librarianship: A Review of “The Transformed Library: E-Books, Expertise, and Evolution”

planet code4lib - Wed, 2014-12-10 06:23
Volume 8, Issue 4, October-December 2014, pages 425-426
Robert J. Vander Hart

Journal of Web Librarianship: A Review of “Optimizing Academic Library Services in the Digital Milieu: Digital Devices and Their Emerging Trends”

planet code4lib - Wed, 2014-12-10 06:22
Volume 8, Issue 4, October-December 2014, pages 424-425
Paula Barnett-Ellis

Coral Sheldon-Hess: Playing with GitHub

planet code4lib - Wed, 2014-12-10 01:33

I had the opportunity, at work (and a bit outside of work), to learn the GitHub API, as wrapped by Python’s github3 module. I found the documentation really hard to follow, maybe because I don’t have a lot of experience reading API docs, or because it wasn’t organized in the way I think about things, or maybe just because my work on this API was part of a larger, much more harrowing project, and I was already discouraged* … who knows?

I made a thing! Maybe it’s helpful!

Ultimately, I ended up documenting the parts of it I needed to understand in an IPython notebook; if you’d like to play with the GitHub API, then, please, feel free to download and run it, after filling in your own username, your chosen repository name, and your API token for GitHub (which you will want to keep secret, of course).

I’m not sure the format I used is going to be helpful to others, but I kept referring back to this notebook over and over as I worked, so, at the very least, I’ve found a format that’s helpful for me! Since I was trying to mock GitHub objects, I was very interested in return types. I hope it’s useful for others who want to understand github3…
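Here is a small sketch of why return types matter when mocking. It uses only the standard library's `unittest.mock`, not github3 itself, and the attribute names `full_name` and `commits()` are assumptions standing in for whatever the real repository object exposes; check them against the wrapper's documentation before relying on them.

```python
from unittest import mock


def summarize_repo(repo):
    """Return a small dict describing a repository-like object."""
    return {
        "name": repo.full_name,
        "commit_count": sum(1 for _ in repo.commits()),
    }


def make_fake_repo(full_name, shas):
    """Build a stand-in repository; only the listed attributes exist on it."""
    # ASSUMPTION: "full_name" and "commits" are illustrative attribute names.
    fake = mock.Mock(spec=["full_name", "commits"])
    fake.full_name = full_name
    # The fake must return the same kind of thing the real call would:
    # here, an iterator of commit-like objects.
    fake.commits.return_value = iter([mock.Mock(sha=s) for s in shas])
    return fake
```

Passing a `spec` list means a typo such as `repo.fullname` raises `AttributeError` instead of silently returning another mock, which catches a good share of mistakes when the real return types are still unfamiliar.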

I made a funnier thing! Maybe you’ll want to play with it?

Now, because I had a funny idea (and I wanted to remind myself that I like programming), I also spent most of a Sunday building a tool to help make GitHub feel less like a game I have to win**. In short, it makes one commit per day on my GitHub account—actually, to the repository in which it is housed, because I was feeling puckish—so that I maintain a perfect streak. See? Perfection, since the day I wrote the script!
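For readers who want the gist without opening the repository, a script of this kind can be very small. The following is a sketch of the general idea, not the code from the repository: the log file name, the commit message, and the function names are all made up, and it assumes `git` is installed and the repository already has a remote to push to.

```python
import subprocess
from datetime import date
from pathlib import Path


def streak_line(day):
    """The line appended to the log so there is something new to commit."""
    return f"Kept the streak alive on {day.isoformat()}\n"


def git_commands(log_name, day):
    """The git calls needed to record and publish one day's commit."""
    message = f"Daily commit for {day.isoformat()}"
    return [
        ["git", "add", log_name],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def commit_today(repo_dir, log_name="streak.log", day=None, dry_run=True):
    """Append today's line to the log, then run git unless dry_run is set."""
    day = day or date.today()
    with (Path(repo_dir) / log_name).open("a") as fh:
        fh.write(streak_line(day))
    commands = git_commands(log_name, day)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands
```

Writing a new line each day guarantees the working tree has a change, so the commit never fails for lack of anything to record.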

The code itself is pretty short. You can tell I had fun writing it, though. :)

The trickier (read: way more time-consuming) part ended up being the “make it run daily” bit. I had to learn about launchd, which Mac seems to prefer to cron. That’s too bad, since I already understood cron, but launchd has some nice features, like running when the machine wakes up if it was sleeping at the appointed time.

Anyway, once I gave in and took the advice of the launchd tutorial linked above and installed LaunchPad, it went better for me.

I’ve included my plist file in the repository, so nobody else has to write their own; I also gave the directory to put it in and listed the one necessary change, in the README. It shouldn’t be too hard to get running if you’ve got a Mac on hand.
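If you are curious what such a plist contains, the sketch below generates a minimal one with Python's `plistlib`. It is not the file from the repository: the label, script path, and time of day are placeholders. `Label`, `ProgramArguments`, and `StartCalendarInterval` are standard launchd keys, and the calendar interval is what gives the run-on-wake behaviour mentioned above.

```python
import plistlib


def daily_job_plist(label, program_args, hour, minute):
    """Return the XML bytes of a launchd job that runs once a day."""
    job = {
        "Label": label,                          # unique name for the job
        "ProgramArguments": list(program_args),  # command and its arguments
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
    }
    return plistlib.dumps(job)


# Placeholder values for illustration only.
EXAMPLE = daily_job_plist(
    "com.example.dailycommit",
    ["/usr/bin/python", "/path/to/daily_commit.py"],
    hour=9,
    minute=0,
)
```

Per-user jobs conventionally live in ~/Library/LaunchAgents, so writing `EXAMPLE` to a `.plist` file there and loading it with `launchctl load` is the usual way to register it.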

I’d like to deploy this on Heroku at some point, rather than having to keep my laptop on all the time; maybe if I find that doable I’ll write a follow-up post, or just edit this one. :)


*We have this big code base, which is built on Flask (which I don’t know) and MongoDB (which I don’t know) and which has a RESTful API (which I haven’t learned yet; that’s my current project, happily!), which requires some internal routing to build and resolve URLs (which I don’t really understand, though I can trace through it with PyCharm). My job was to use mock (which I didn’t know at the time and still have trouble keeping my head wrapped around) to write tests (which I have only minimal experience with) on some internal API routes and permissions stuff for the GitHub addon (which was written in a fairly complicated way, in part by necessity, and which I didn’t understand at all at the outset of this project). This was supposed to be a good learning project, and I don’t doubt the intentions of anyone involved with assigning it, but … it was a pretty terrible, demoralizing experience, made worse because I was instructed not to ask the more-senior devs any questions, and the time estimate I was given was not realistic. (I did finish, pretty much. I have to redo my pull request in the morning.)

** “GitHub displays a lot of useless stats about how many followers you have, and some completely psychologically manipulative stats about how often you commit and how many days it is since you had a day off” – Why GitHub is not your CV, by James Coglan

Ed Summers: Removing Bias

planet code4lib - Wed, 2014-12-10 01:00

Senate Intelligence Committee report on CIA torture Wikipedia article edited anonymously from US Senate

— congress-edits (@congressedits) December 9, 2014

OCLC Dev Network: All Services Unavailable 3 January 2015 for Technology Upgrade

planet code4lib - Tue, 2014-12-09 22:00

On Saturday, 3 January 2015, OCLC has scheduled a technology upgrade to support system performance and reliability. During this upgrade, all OCLC services will be unavailable on 3 January 2015, from 12:01 am to 3:00 pm, U.S. Eastern Standard Time (approximately 15 hours).

