Planet Code4Lib - http://planet.code4lib.org

Peter Murray: Blocking /xmlrpc.php Scans in the Apache .htaccess File

Thu, 2014-09-04 02:41

Someone out there on the internet is repeatedly hitting this blog’s /xmlrpc.php service, probably looking to enumerate the user accounts on the blog as a precursor to a password scan (as described in “Huge increase in WordPress xmlrpc.php POST requests” at Sysadmins of the North). My access logs look like this:

176.227.196.86 - - [04/Sep/2014:02:18:19 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 195.154.136.19 - - [04/Sep/2014:02:18:19 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 176.227.196.86 - - [04/Sep/2014:02:18:19 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 176.227.196.86 - - [04/Sep/2014:02:18:21 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 176.227.196.86 - - [04/Sep/2014:02:18:22 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 176.227.196.86 - - [04/Sep/2014:02:18:24 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 195.154.136.19 - - [04/Sep/2014:02:18:24 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)" 176.227.196.86 - - [04/Sep/2014:02:18:26 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
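To see how many of these requests come from each host, a log excerpt like the one above can be tallied with a few lines of Python. This is only a sketch: it assumes the logs are in Apache's combined log format, and the names `LOG_LINE` and `count_xmlrpc_posts` are made up for this illustration.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# host ident user [time] "method path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def count_xmlrpc_posts(lines):
    """Count POST /xmlrpc.php requests per (client host, user agent) pair."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        if match["method"] == "POST" and match["path"].lower() == "/xmlrpc.php":
            hits[(match["host"], match["agent"])] += 1
    return hits


# Three of the log lines shown above.
AGENT = "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
SAMPLE = [
    '176.227.196.86 - - [04/Sep/2014:02:18:19 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "%s"' % AGENT,
    '195.154.136.19 - - [04/Sep/2014:02:18:19 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "%s"' % AGENT,
    '176.227.196.86 - - [04/Sep/2014:02:18:21 +0000] "POST /xmlrpc.php HTTP/1.0" 200 291 "-" "%s"' % AGENT,
]

if __name__ == "__main__":
    for (host, agent), count in count_xmlrpc_posts(SAMPLE).most_common():
        print(count, host, agent)
```

Grouping by user agent as well as host makes it easy to spot whether the scanners all share one agent string, which is what the .htaccess approach relies on.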

By itself, this is just annoying — but the real problem is that the PHP stack is getting invoked each time to deal with the request, and at several requests per second from different hosts this was putting quite a load on the server. I decided to fix the problem with a slight variation from what is suggested in the Sysadmins of the North blog post. This addition to the .htaccess file at the root level of my WordPress instance rejects the connection attempt at the Apache level rather than the PHP level:

    RewriteCond %{REQUEST_URI} =/xmlrpc.php [NC]
    RewriteCond %{HTTP_USER_AGENT} .*Mozilla\/4.0\ \(compatible:\ MSIE\ 7.0;\ Windows\ NT\ 6.0.*
    RewriteRule .* - [F,L]

Which means:

  1. If the requested path is /xmlrpc.php, and
  2. you are sending this particular agent string, then
  3. send back a 403 error message and don’t bother processing any more Apache rewrite rules.
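
The three steps above can be mirrored in a few lines of Python to try the logic against sample requests. This is a rough model for illustration, not how Apache evaluates rules internally; the function name `rewrite_decision` is invented here, and the user agent pattern is copied from the RewriteCond line.

```python
import re

# Same pattern as the second RewriteCond; it is unanchored, so it can match
# anywhere in the User-Agent header.
UA_PATTERN = re.compile(
    r".*Mozilla\/4.0\ \(compatible:\ MSIE\ 7.0;\ Windows\ NT\ 6.0.*"
)


def rewrite_decision(request_path, user_agent):
    """Return 403 when both conditions hold, otherwise None (request continues)."""
    # "=/xmlrpc.php" with [NC] is a case-insensitive string comparison.
    path_matches = request_path.lower() == "/xmlrpc.php"
    agent_matches = UA_PATTERN.search(user_agent) is not None
    # Consecutive RewriteCond lines are ANDed; [F] answers with 403 Forbidden.
    return 403 if (path_matches and agent_matches) else None


if __name__ == "__main__":
    scanner = "Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"
    print(rewrite_decision("/xmlrpc.php", scanner))
    print(rewrite_decision("/index.php", scanner))
```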

If you need to use this yourself, you might find that the HTTP_USER_AGENT string has changed. You can copy the user agent string from your Apache access logs, but remember to put a backslash in front of each space and each parenthesis.
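
Escaping a copied agent string by hand is easy to get wrong, so here is a small helper that does it. Treat it as a sketch under a few assumptions: the names `escape_user_agent` and `rewrite_cond_line` are hypothetical, and the helper escapes every regex metacharacter (including dots and slashes), which is stricter than the hand-written pattern in the rule above. It does not handle double quotes inside the agent string.

```python
import re

# Characters that are special in a regular expression, plus the space, which
# would otherwise split the RewriteCond arguments.
SPECIAL = set(r"\.^$*+?()[]{}|/ ")


def escape_user_agent(agent):
    """Put a backslash in front of every special character in the agent string."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in agent)


def rewrite_cond_line(agent):
    """Build a RewriteCond line that matches the given agent string."""
    return "RewriteCond %{HTTP_USER_AGENT} " + escape_user_agent(agent)


if __name__ == "__main__":
    print(rewrite_cond_line("Mozilla/4.0 (compatible: MSIE 7.0; Windows NT 6.0)"))
```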


Peter Murray: 2nd Workshop on Sustainable Software for Science: Practice and Experiences — Accepted Papers and Travel Support

Thu, 2014-09-04 02:08

The conference organizers for WSSSPE2 have posted the list of accepted papers and the application for travel support. I was on the program committee for this year’s conference, and I can point to some papers that I think are particularly useful to libraries and the cultural heritage community in general:


William Denton: Moodie's Tale

Thu, 2014-09-04 01:19

Somebody said we need a Moo for libraries. We still do. But I just read Moodie’s Tale by Eric Wright and I think it’s the Moo of Canadian academia. I don’t know Susanna Moodie or The Canterbury Tales so I think I’m missing a fair bit, but I still enjoyed it very much.

There are a few mentions of libraries, like this:

“Here’s an example,” the president continued. “I propose that henceforth you fellows be called ‘deans.’ Most places have deans nowadays. Sound the others out to see if there’s a problem. Now what else? What else does a college have? A proper college.”

“A library?”

“We’ve got one of sorts, haven’t we? In the corner room of the Drug Mart.”

“Just a few shelves, Gravely. Not many of the faculty know about it. It ought to have some standard reference works. Encyclopedias, that kind of thing.”

“We can afford a couple of thousand from the cleaning budget. Draw up a list. But now you’ve mentioned it, what is the real mark of a library?”

“Other than books?”

“Yes. What else?”

“A copying machine?”

“What else?”

It was important to guess right. Cunningham was getting impatient. “I am not sure of your emphasis, Gravely,” he hedged.

“Emphasis? How do you know it is a library?”

“The sign on the door?”

“Exactly. The label, William, the label. Get a sign made. And what do people find inside the door?”

“The librarian?”

“Now you’re on to it. Apart from the sign, the cheapest thing in the library is the librarian, especially since they aren’t unionized. We could put anyone in and call him the librarian. Now who have we got?”

“Beckett?”

Beckett was a religious maniac, a clerk in the maintenance department who spent his hours walking the streets with a billboard, warning of the end. His fellow workers complained constantly of his proselytizing in the storeroom.

“Perfect. He’s a bit more eccentric than most librarians, I suppose, but he’ll do. Is he conscientious?”

“It’s the other thing his colleagues dislike about him.”

“Done, then.”

Islandora: Varnish, Islandora, and Islandnewspapers.ca

Thu, 2014-09-04 00:24
Varnish and Islandora

Below you will find some information on how UPEI's Robertson Library configured Varnish for use with Islandora. Currently we have Varnish running on our Newspaper site and it is working well with the OpenSeadragon viewer, but we have not tested with the IA Bookviewer yet.

Why use Varnish?

At Robertson Library we have been digitizing the Guardian newspaper for a while now. We expected there would be a good amount of traffic to this site when it went live so prior to launch we wanted to do some benchmarks. We also noticed with the stock Islandora Newspaper solution pack that loading the Guardian newspaper page was very slow and we expected we would have to try to optimize things to handle load.

The benchmarks we used were pretty simple and were really just a way to help us determine whether or not an optimization was worth keeping. We used The Grinder, a Java based load testing framework.

We loaded Grinder with a simple scenario: hit the homepage, the main Guardian newspaper page, a newspaper page (in the OpenSeadragon viewer), and the main Guardian page again. The main Guardian page is the one that lists all the issues of the Guardian, and we have almost 20,000 issues so far. Grinder was configured to hit these pages 250 times with 50 threads.
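
For readers without The Grinder at hand, the shape of this scenario can be approximated in plain Python. This is not a Grinder script and not the setup used for the numbers in this post. It is a minimal sketch that assumes "250 times with 50 threads" means 250 runs of the scenario shared across 50 worker threads, and the names `run_scenario` and `http_fetch` are made up for the example.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_fetch(url):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def run_scenario(fetch, urls, runs, threads):
    """Request every URL in `urls`, `runs` times in total, using `threads` workers."""

    def one_run(_):
        errors = 0
        size = 0
        for url in urls:
            try:
                size += len(fetch(url))
            except Exception:
                errors += 1  # count the failure and carry on with the scenario
        return len(urls), errors, size

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(one_run, range(runs)))
    elapsed = time.perf_counter() - start

    return {
        "requests": sum(r[0] for r in results),
        "errors": sum(r[1] for r in results),
        "bytes": sum(r[2] for r in results),
        "seconds": elapsed,
    }


# Example call (the URLs would be your own four scenario pages):
# stats = run_scenario(http_fetch, scenario_urls, runs=250, threads=50)
# print(stats["requests"] / stats["seconds"], "requests/sec")
```

Passing the fetch function in as a parameter keeps the runner testable without a live server.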

Our first run at it was with the stock Islandora Newspaper solution pack.

The numbers were not great with the stock Islandora Newspaper solution pack: we could handle about 1 request per second and we were starting to receive some errors. Total throughput was 1106.59 KB/sec. CPU usage on the server was very high, with all cores pretty steady at or near 100%.

The biggest problem seemed to be hitting the Resource Index over and over again and manipulating the resulting array. So to try to speed things up a little we modified the code to query Solr instead of the Resource Index.

Test results with Solr query.

By querying Solr we were able to speed things up quite a bit. We were now getting close to 5 requests per second, no errors and a throughput of 4874.92 KB/sec. Our CPU usage was still very high, all cores at or near 100%.

We couldn’t see other ways to make the main Guardian page load faster without significantly changing how the Newspaper solution pack worked. Dynamically listing almost 20,000 issues on one page was going to take time no matter how we did it, unless we broke the page up into several requests. Breaking the page up into several requests would not be ideal either, as we would have to make roundtrips to the server to get the list of years available as well as all issues for a selected year. Instead of breaking this page up into several requests we discussed caching it.

So our next step was to install and configure Varnish so that this page would be cached. With Varnish installed and configured we ran the same Grinder tests.

Test with Varnish enabled

By using Varnish our numbers improved again. We were now handling 10 requests per second, with no errors and a throughput of 9808.21 KB/sec. Our CPU usage was way down, with all our cores between 3% and 20% usage (most were closer to 3%). By using Varnish we got a speed boost, but I think the biggest advantage will be in the number of users we can handle, as our most expensive requests now come from the cache with little server overhead.
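
To put the three runs side by side, the throughput figures reported above can be turned into ratios. The numbers are the ones from the test results; the dictionary keys and the `speedup` helper are just labels chosen for this sketch.

```python
# Throughput in KB/sec as reported for each test run.
THROUGHPUT_KB_PER_SEC = {
    "stock": 1106.59,    # stock Newspaper solution pack
    "solr": 4874.92,     # querying Solr instead of the Resource Index
    "varnish": 9808.21,  # Solr query plus Varnish
}


def speedup(before, after):
    """Ratio of throughput between two runs (values above 1 mean faster)."""
    return THROUGHPUT_KB_PER_SEC[after] / THROUGHPUT_KB_PER_SEC[before]


if __name__ == "__main__":
    for before, after in [("stock", "solr"), ("solr", "varnish"), ("stock", "varnish")]:
        print(f"{before} -> {after}: {speedup(before, after):.1f}x")
```

By throughput, the Solr change gives roughly a fourfold gain and Varnish roughly doubles it again.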

Of course, using Grinder to test with Varnish makes Varnish look even better, as we are hitting the same URLs over and over. But the results, especially the low CPU usage, lead us to believe Varnish is worth using on the Islandnewspapers.ca site.

Since we have launched we have had as many as 75 concurrent users and response times are great even under load.

Configuring Drupal and Islandora for Varnish

Configure Drupal Performance

On the Drupal Performance admin page (admin/config/development/performance) we configured Drupal to cache and compress pages. We also aggregate and compress CSS and JavaScript.

Configure Islandora

On the Islandora config page (admin/islandora/configure) we disabled setting the cache headers.

If we enable the Generate/parse datastream HTTP cache headers option, Varnish doesn’t serve the page thumbnail images from its cache. On the plus side, we may get better browser caching of thumbnails.

We seemed to get better performance with Generate/parse datastream HTTP cache headers unchecked, so we have left it off for now.
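
One way to check whether a particular response, such as a thumbnail, came out of Varnish's cache is to look at the response headers. The sketch below assumes Varnish 3's default behaviour, where the X-Varnish header carries two transaction IDs on a cache hit and one on a miss. The function name and the sample header values are invented for illustration.

```python
def looks_like_cache_hit(headers):
    """Guess from response headers whether Varnish served the object from cache."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    transaction_ids = normalized.get("x-varnish", "").split()
    # Assumption: two IDs means a hit (this request, plus the request that
    # originally fetched the object); a single ID means a miss or a pass.
    return len(transaction_ids) == 2


if __name__ == "__main__":
    print(looks_like_cache_hit({"X-Varnish": "1464567531 1464567478", "Age": "12"}))
    print(looks_like_cache_hit({"X-Varnish": "1464567532", "Age": "0"}))
```

The headers for a real page can be collected with any HTTP client and passed in as a dictionary.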

Installing and configuring Varnish

We installed Varnish on Ubuntu with sudo apt-get install varnish. We are currently using Varnish 3.0.2.

Varnish Configuration

We modified the default.vcl in /etc/varnish.

Our vcl file looks like this:

    # This is a basic VCL configuration file for varnish.  See the vcl(7)
    # man page for details on VCL syntax and semantics.
    #
    # Default backend definition.  Set this to point to your content
    # server.
    #
    backend default {
        .host = "127.0.0.1";
        .port = "8090";
        .connect_timeout = 30s;
        .first_byte_timeout = 30s;
        .between_bytes_timeout = 30s;
    }

    sub vcl_recv {
        // Remove has_js and Google Analytics __* cookies.
        set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(__[a-z]+|has_js)=[^;]*", "");
        // Remove a ";" prefix, if present.
        set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
        // Remove empty cookies.
        if (req.http.Cookie ~ "^\s*$") {
            unset req.http.Cookie;
        }

        // In testing, pipe seemed to give us better results than pass.
        if (req.url ~ "^/adore-djatoka") {
            unset req.http.Cookie;
            return (pipe);
        }

        if (req.url ~ "\.(png|gif|jpg|js|css)$") {
            unset req.http.Cookie;
            return (lookup);
        }

        if (req.url ~ "^/search") {
            unset req.http.Cookie;
            return (pass);
        }

        if (req.request == "GET" || req.request == "HEAD") {
            return (lookup);
        }
    }

    sub vcl_pipe {
        # http://www.varnish-cache.org/ticket/451
        # This forces every pipe request to be the first one.
        set bereq.http.connection = "close";
    }
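
The cookie handling at the top of vcl_recv is easier to follow when tried on a sample header. The Python below imitates the two substitutions and the empty-cookie check using the same regular expressions; it is an approximation for experimenting, not Varnish code, and the function name and sample cookie values are hypothetical.

```python
import re


def strip_tracking_cookies(cookie_header):
    """Imitate the cookie cleanup in vcl_recv; return None if nothing is left."""
    # regsuball: remove has_js and Google Analytics style __* cookies everywhere.
    cleaned = re.sub(r"(^|;\s*)(__[a-z]+|has_js)=[^;]*", "", cookie_header)
    # regsub: remove a leading ";" left behind by the first step (first match only).
    cleaned = re.sub(r"^;\s*", "", cleaned, count=1)
    # An empty header is unset entirely, which lets the request be cached.
    if re.search(r"^\s*$", cleaned):
        return None
    return cleaned


if __name__ == "__main__":
    print(strip_tracking_cookies("has_js=1; __utma=123.456; SESSabc=xyz"))
    print(strip_tracking_cookies("has_js=1; __utmz=foo"))
```

The cleanup matters because a request that only carried tracking cookies ends up with no Cookie header at all.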

In /etc/default/varnish (Ubuntu/Debian) or /etc/sysconfig/varnish (CentOS/Fedora) you will have to change your DAEMON_OPTS. Ours look like this:

    DAEMON_OPTS="-a :80 \
                 -T localhost:6082 \
                 -f /etc/varnish/default.vcl \
                 -S /etc/varnish/secret \
                 -s malloc,5g"

You can see from the two config files that we have Varnish listening on port 80 and looking for the backend on port 8090.

Our Apache server is configured to listen on port 8090. Other than that, Apache is using a standard Islandora type setup.

The timeouts in our VCL are pretty high and could probably be set a lot lower. With an earlier version of Varnish we were having some inconsistencies with loading times when using the OpenSeadragon viewer. The higher timeouts were left over from testing with the older version of Varnish and we will adjust them.

We have Varnish configured to use RAM (malloc) for its cache but this could be set to a file.

One thing we decided to do is pipe requests to Djatoka. Since Djatoka is already caching images we decided not to cache them twice.
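
The routing decisions in our vcl_recv, including the Djatoka pipe, can be summarized as a small decision function. This is a Python paraphrase for illustration only: the function name `route` and the example paths are made up, and "default" stands for falling through to Varnish's built-in vcl_recv logic.

```python
import re


def route(method, url):
    """Return the action our vcl_recv would take for a request."""
    if re.search(r"^/adore-djatoka", url):
        return "pipe"  # Djatoka already caches images, so bypass the cache
    if re.search(r"\.(png|gif|jpg|js|css)$", url):
        return "lookup"  # static assets are served from the cache
    if re.search(r"^/search", url):
        return "pass"  # search requests always go to the backend
    if method in ("GET", "HEAD"):
        return "lookup"  # everything else that is safe to cache
    return "default"  # no explicit return: Varnish's built-in logic applies


if __name__ == "__main__":
    for method, url in [
        ("GET", "/adore-djatoka/resolver?example=1"),
        ("GET", "/sites/default/files/logo.png"),
        ("GET", "/search/guardian"),
        ("GET", "/"),
        ("POST", "/user/login"),
    ]:
        print(method, url, "->", route(method, url))
```

The order of the checks matters: a Djatoka URL is piped before the static asset rule is ever considered.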

We have also made some optimizations to Djatoka’s configs. Basically we increased the number of tiles and images Djatoka would keep in its cache.

Note: We are not using the Varnish Drupal module.

There are many great resources for Varnish on the web. Pantheon has a great page regarding Varnish and Drupal.
