.htaccess file: stopping robot with escape character in name
From: Montana Quiring
Sent: Tuesday, April 22, 2025 1:47:31 PM

Hello Folks,

I'm trying to stop a bot from crawling a site using the .htaccess file. The problem is that its name contains a backslash character. Grrr...

It's called: Unknown robot identified by bot\*

This generates an internal server error:

    RewriteCond %{HTTP_USER_AGENT} ("Unknown robot identified by bot\*") [NC]

I tried this, but it didn't help:

    RewriteCond %{HTTP_USER_AGENT} ("Unknown robot identified by bot\\*") [NC]

Any thoughts?

Regards,
-Montana
_______________________________________________
Roundtable mailing list -- roundtable@muug.ca
To unsubscribe send an email to roundtable-leave@muug.ca
Urlencode or octal? Or if it's a regex just use ".".

-Adam
Sorry man, excuse my ignorance, but not sure what you are asking. I got the bot name from AWstats, which I assume is just ASCII.

Regards,
-Montana
I think Adam is suggesting that you use a regex in the RewriteCond, to avoid the problematic characters in the pattern...

https://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewritecond

... states that "CondPattern is usually a perl compatible regular expression, but there is additional syntax available to perform other useful tests against the Teststring:".

So, something like this might work...

    RewriteCond %{HTTP_USER_AGENT} "Unknown robot identified by bot.." [NC]

BTW, I don't think you want parentheses around the string, as that's probably not supported syntax. (Parentheses within the string will have the usual PCRE syntax and semantics.)

Hope this helps.
Gilbert

--
Gilbert E. Detillieux    E-mail: <gedetil@muug.ca>
Manitoba UNIX User Group    Web: http://muug.ca/
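A RewriteCond blocks nothing by itself; it only gates the RewriteRule that follows it. Here is a minimal sketch of a complete .htaccess block built around Gilbert's suggested condition. It assumes mod_rewrite is loaded, that .htaccess overrides are permitted for the directory, and that a 403 Forbidden response is the desired outcome; none of this is stated in the thread.

```apache
# Assumption: mod_rewrite is available and AllowOverride permits its directives here.
RewriteEngine On

# The whole pattern is a single double-quoted argument because it contains spaces.
# Each "." matches any one character, so neither the backslash nor the
# asterisk has to be escaped.
RewriteCond %{HTTP_USER_AGENT} "Unknown robot identified by bot.." [NC]

# "-" means "no substitution"; [F] answers matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

The two dots stand in for the backslash and the asterisk. The rule only has an effect if the bot's User-Agent header really contains this text, which the raw access log would confirm.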
Ahh ok, thanks. I actually had the names of a bunch of bots in there, so wouldn't I need the parentheses? i.e.:

    RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider|"AhrefsBot/6.1"|"Ahrefs"|"Baiduspider"|"BLEXBot"|"SemrushBot"|"claudebot"|"YandexBot/3.0"|Bytespider) [NC]

Regards,
-Montana
Yes, you would then need parentheses within the one quoted string for the entire pattern, rather than quoting the individual substring patterns to be matched...

    RewriteCond %{HTTP_USER_AGENT} "(googlebot|bingbot|Baiduspider|AhrefsBot/6.1|Ahrefs|Baiduspider|BLEXBot|SemrushBot|claudebot|YandexBot/3.0|Bytespider)" [NC]

As for the unknown robot(s), you'd best look at the raw access log files to see what the actual UserAgent string(s) is/are, as Adam suggested.

Gilbert
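Because the condition pattern is an unanchored regular expression, it matches anywhere inside the User-Agent header, so the bot list from the thread can be shortened. The sketch below is one possible tidy-up, not the form posted in the thread. It assumes every version of each listed bot should be blocked (dropping the version suffixes widens the match) and that a 403 response is wanted.

```apache
# Assumption: mod_rewrite is available and AllowOverride permits its directives here.
RewriteEngine On

# One quoted pattern: the parentheses group the alternatives, "|" separates them,
# and [NC] makes the comparison case-insensitive.
# "Ahrefs" already covers "AhrefsBot/6.1", and "YandexBot" covers "YandexBot/3.0",
# so the longer entries and the repeated "Baiduspider" are left out.
RewriteCond %{HTTP_USER_AGENT} "(googlebot|bingbot|Baiduspider|Ahrefs|BLEXBot|SemrushBot|claudebot|YandexBot|Bytespider)" [NC]

# "-" means "no substitution"; [F] answers matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

If only specific versions should be blocked, keep the suffix and escape the dot (for example `YandexBot/3\.0`) so that it matches a literal "." rather than any character.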
Thanks, ya, I'll pore over it tonight. :)

Regards,
-Montana
OK! So we can make a few guesses:

1. The bot name isn't "Unknown robot identified by bot\*", the bot name is just "bot\*". (Actually, even this is highly suspect.)
2. AWStats is telling you it doesn't recognize the bot ("Unknown robot identified by...").
3. The bot name is likely bot<something>, not a literal asterisk. I think this is AWStats telling you it matched a bot by identifying the prefix "bot", i.e. AWStats did a substring match on 'bot\*'.
4. You'll have to go awk'ing and grep'ing your access_log files (or maybe tweaking awstats?) to get the actual bot name.

If the bot name were truly "Unknown robot identified by bot\*", then:

1. You don't need the parentheses. RewriteCond expects PCRE, so ( ) are only needed for grouping.
2. The backslash+asterisk combination is pretty much a worst-case scenario for correct escaping, so I would sidestep the issue by matching "Unknown robot identified by bot.." instead of "Unknown robot identified by bot\*". A single period "." in regex is like a "?" in filename globbing: it matches any single character.

This is not a new thing with AWStats - see https://forums.classicpress.net/t/how-to-block-uknown-robots-identified-by-a... for discussion about what "bot*" actually means.

From that page, however, we can guess that you might be able to just write:

    RewriteCond %{HTTP_USER_AGENT} bot[\s_+:,\.\;\/\\\-] [NC]

-Adam
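To make Adam's last suggestion easier to read, here is the same condition with the character class spelled out in comments and a rule attached. This is only a sketch: it assumes mod_rewrite is enabled and that a 403 response is wanted, and it has not been checked against the real User-Agent strings in the access log.

```apache
# Assumption: mod_rewrite is available and AllowOverride permits its directives here.
RewriteEngine On

# "bot" followed by exactly one separator character from the class:
#   \s          any whitespace character
#   _ + : ,     those literal characters
#   \. \; \/    a literal dot, semicolon, or slash
#   \\ \-       a literal backslash or hyphen
# The match is unanchored and, with [NC], case-insensitive, so it applies
# anywhere in the header, including the "bot/" inside "Googlebot/2.1".
RewriteCond %{HTTP_USER_AGENT} bot[\s_+:,\.\;\/\\\-] [NC]

# "-" means "no substitution"; [F] answers matching requests with 403 Forbidden.
RewriteRule ^ - [F]
```

This is a broad net: it also catches well-known crawlers whose names contain "bot" followed by a separator, and it misses any agent whose header ends in "bot" with nothing after it. Checking the raw access log first, as suggested above, shows what would actually be blocked.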
Ok, interesting, thanks. Ya, you are onto something. I looked at a different screen and see that the asterisk means something:

[hits.png]

Regards,
-Montana
participants (3): Adam Thompson, Gilbert Detillieux, Montana Quiring