Banning bad bots that over-crawl the directory

The main culprits are Amazonbot and Semrushbot, or at least rogue bots that are impersonating them. 😅 I estimate they generate 20k requests per day, from ~400 IP addresses. I decided to block those user agents for now by returning an HTTP 444 No Response. The server can handle that load without breaking a sweat, but crawling that much is unacceptable.
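Since 444 is an nginx-specific status that tells the server to close the connection without sending anything back, the block can live entirely in the web server config. A minimal sketch, assuming nginx (the variable name is mine):

```nginx
# Flag requests whose User-Agent matches an offending bot.
# The ~* prefix makes the regex match case-insensitive.
map $http_user_agent $blocked_bot {
    default        0;
    ~*amazonbot    1;
    ~*semrushbot   1;
}

server {
    # ... existing server configuration ...

    # 444 is nginx-specific: drop the connection without any response.
    if ($blocked_bot) {
        return 444;
    }
}
```

Matching on the user agent catches both the real bots and anything impersonating them, since both announce the same string.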
At The Wild Wild Web, I try very hard not to over-crawl a website. I am not a search engine, so I don't need to gather much information anyway. Here are the current cache lifetimes for the different pieces of information I collect:

- `index.html`, favicon, screenshot: 1 month
- feed articles: every 6 hours
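Expressed as code, that policy is little more than a TTL table. A hypothetical sketch, not the directory's actual implementation (the key names and the per-record `fetched_at` timestamp are my assumptions):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cache policy: key names and fetched_at bookkeeping
# are assumptions, not the directory's actual code.
CACHE_TTL = {
    "index_html": timedelta(days=30),  # ~1 month
    "favicon": timedelta(days=30),
    "screenshot": timedelta(days=30),
    "feed_articles": timedelta(hours=6),
}

def is_stale(kind: str, fetched_at: datetime) -> bool:
    """True when the cached copy is older than its TTL (fetched_at is UTC)."""
    return datetime.now(timezone.utc) - fetched_at > CACHE_TTL[kind]
```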
I believe those lifetimes are very reasonable. I do intend to build a "down site detector" at some point, as websites tend to die, and a directory full of dead links is just sad!
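A minimal sketch of what such a detector could look like, assuming a site counts as alive if it answers with any HTTP status at all, and only gets flagged after a few consecutive failed checks (the threshold and user agent string are made up):

```python
import urllib.error
import urllib.request

# Only flag a site as dead after several consecutive failures,
# to tolerate transient outages. The threshold is an assumption.
FAILURES_BEFORE_DEAD = 3

def is_alive(url: str, timeout: float = 10.0) -> bool:
    """True if the site answers with any HTTP status at all."""
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "TheWildWildWebBot"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server responded, even if with a 4xx/5xx
    except OSError:
        return False  # DNS failure, refused connection, timeout, ...
```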
Last, but not least, the directory now indexes over 300 websites! 🎉