Every once in a while, some doofus points a web crawler at our web site and, ignoring the disallowed areas in our robots.txt file, starts crawling through some of our cgi-bin scripts at a rate of 4 to 8 hits a second. This is particularly annoying with some of the more processor- and disk-intensive CGI programs, such as man2html, which also happens to generate lots of links back to itself.
Is there anything I can set up in Apache to throttle back and slow down remote hosts when they start hitting hard on cgi-bin? I don't want to do anything that would adversely affect legitimate users, nor make important things like the manual pages hard to find by removing any public links to them. But when a client starts making 10 or more GET requests on /cgi-bin in a 5-second period, it would be nice if I could get the server to progressively add longer and longer delays before servicing those requests, to keep the load down and prevent the server from thrashing.
I'd appreciate any tips.
Thanks, Gilles
I'm not sure this is possible with Apache alone (anyone correct me if I'm wrong), but I have an idea.
You didn't mention which language the CGI scripts are written in, but assuming that you can edit them, you could create a temporary file at the top of the script, delete it at the end, and skip execution if the file already exists. That way, only one execution can run at a time. I don't know how well it fits your requirements, but that's the first thing that comes to mind.
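For example, something along these lines might work if the script happens to be in Python 2 (this is only a sketch of the idea, and the lock file path is just a placeholder):

#!/usr/bin/env python
# Sketch of the lock-file idea for a Python 2 CGI script; the lock path
# is just a placeholder.
import errno, os, sys

LOCK = "/var/tmp/man2html.lock"

def main():
    print "Content-Type: text/html"
    print
    # ...generate the real page here...
    print "<html><body>OK</body></html>"

try:
    # O_EXCL makes creation atomic, so only one instance gets the lock.
    fd = os.open(LOCK, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
except OSError, e:
    if e.errno == errno.EEXIST:
        # Another instance is already running: bow out instead of piling up.
        print "Status: 503 Service Unavailable"
        print "Retry-After: 5"
        print "Content-Type: text/plain"
        print
        print "Busy, please try again shortly."
        sys.exit(0)
    raise

try:
    main()
finally:
    os.close(fd)
    os.unlink(LOCK)   # release the lock even if main() fails

One caveat: if the process is killed before the cleanup runs, the stale lock file has to be removed by hand or by some timeout check.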
Kind regards, Helgi Hrafn Gunnarsson helgi@binary.is
Thanks for all the suggestions, everyone. In the end I went with Helgi's suggestion. As I already had a wrapper script around man2html, it was pretty easy to add some quick tests, using a count file and a locking link, to do some speed limiting on the script.
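I won't reproduce the wrapper here, but a rough sketch of the general idea in Python 2 would look something like this (the path, window, threshold and delay values are just examples, not what I actually used):

#!/usr/bin/env python
# Rough Python 2 sketch of speed limiting with a count file; the actual
# wrapper isn't shown here, so the path, window, threshold and delays
# below are invented for illustration only.
import fcntl, os, time

COUNT_FILE = "/var/tmp/man2html.count"   # placeholder location
WINDOW = 5        # seconds over which hits are counted
THRESHOLD = 10    # hits allowed per window before delays kick in
MAX_DELAY = 30    # cap the sleep so requests still finish eventually

def throttle():
    fd = os.open(COUNT_FILE, os.O_RDWR | os.O_CREAT, 0644)
    f = os.fdopen(fd, "r+")
    fcntl.flock(f, fcntl.LOCK_EX)        # serialize access to the count file
    try:
        fields = f.read().split()
        if len(fields) == 2:
            start, hits = float(fields[0]), int(fields[1])
        else:
            start, hits = time.time(), 0
        now = time.time()
        if now - start > WINDOW:
            start, hits = now, 0         # window expired: reset the counter
        hits += 1
        f.seek(0)
        f.truncate()
        f.write("%f %d" % (start, hits))
        f.flush()
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)
        f.close()
    if hits > THRESHOLD:
        # The further over the threshold, the longer the delay.
        time.sleep(min(hits - THRESHOLD, MAX_DELAY))

throttle()
# ...then run the real man2html (or other CGI program) as before...

Note that this counts all clients together; a per-client variant could keep a separate count file keyed on REMOTE_ADDR.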
The alternative, which would be more generally useful but apparently takes more time and effort to set up, would be an add-on module for Apache; the most promising of these seemed to be mod_bwshare. Unfortunately, none of these modules is included in the httpd package that comes with RHEL 5 clones, and I don't know how much work it would be to install and configure one, and then maintain it through updates to either its source or the httpd package. I may look into this further if I run into further problems with other CGI programs, or if my patches to the man2html wrapper don't quite do the trick (though they seemed to hold up well to manual testing).
There used to be a third-party module, something like mod_limit or mod_bwlimit, that let you limit connection rates.
That said, the developer in me says "cache". :)
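For instance, a crude sketch of caching the CGI output to disk might look like this in Python 2 (the cache directory, lifetime and man2html path are made up):

#!/usr/bin/env python
# Sketch of caching CGI output to disk (Python 2). Not from the thread:
# the cache directory, lifetime and man2html path are made-up examples.
import hashlib, os, subprocess, sys, time

CACHE_DIR = "/var/cache/man2html"   # must exist and be writable by Apache
MAX_AGE = 3600                      # serve a cached copy for up to an hour

def run_real_cgi():
    # Run the real, expensive CGI program; its output already includes
    # the CGI headers, so the whole thing gets cached.
    p = subprocess.Popen(["/usr/lib/cgi-bin/man2html"], stdout=subprocess.PIPE)
    return p.communicate()[0]

query = os.environ.get("QUERY_STRING", "")
path = os.path.join(CACHE_DIR, hashlib.md5(query).hexdigest())

if os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE:
    output = open(path, "rb").read()          # fresh enough: reuse it
else:
    output = run_real_cgi()
    tmp = path + ".tmp.%d" % os.getpid()
    open(tmp, "wb").write(output)             # write then rename: atomic update
    os.rename(tmp, path)

sys.stdout.write(output)

Repeated crawler hits on the same page then become cheap file reads instead of full man2html runs.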
Sean
You could implement mod_security rules that enforce a delay after a certain number of events. There is an example at the link below that does something similar based on failed logins; the main difference is that in your case you don't care about the response going back to the client.
http://www.packtpub.com/article/blocking-common-attacks-using-modsecurity-2....
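I haven't tested this, but a rough sketch of such rules for ModSecurity 2.x might look something like the following (the variable name, thresholds and pause length are invented, and persistent IP collections need SecDataDir configured):

# Untested sketch: per-IP rate limiting on /cgi-bin with ModSecurity 2.x.
# The thresholds, variable name and 3-second pause are invented examples.

# Persistent collections (the "ip" collection below) need a data directory.
SecDataDir /var/lib/mod_security

# Track each client IP in a persistent collection.
SecAction "phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"

# Count hits on /cgi-bin and let the counter decay by 10 every 5 seconds.
SecRule REQUEST_URI "@beginsWith /cgi-bin/" \
    "phase:1,nolog,pass,setvar:ip.cgihits=+1,deprecatevar:ip.cgihits=10/5"

# Over the threshold: pause for 3 seconds before servicing the request.
SecRule IP:CGIHITS "@gt 10" \
    "phase:1,log,pause:3000,pass,msg:'cgi-bin rate limit exceeded'"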
If you use OSSEC (a spiffy package in its own right), you could get it to block the IP for a while, but that may be a little harsh, in that it wouldn't say *why* the IP was rejected. In any case, the approach I have in mind would be to watch the Apache access log, with a rule configured to punt an IP for a while if the logs show it hitting /cgi-bin too often within a set period.
Tim