The Wild Wild Web

  • Banning bad bots that over-crawl the directory

    My last blog post did not age well: I am being over-crawled by Amazonbot and Semrushbot, or at least by rogue bots impersonating them. 😅

    I estimate they generate ~20k requests per day from roughly 400 IP addresses. For now, I decided to block those user agents by returning HTTP 444, nginx's non-standard "close the connection without a response" status. The server can handle that load without breaking a sweat, but crawling that much is unacceptable.
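
    Since 444 is an nginx-specific status, a block like this can be expressed with a `map` on the user agent. This is a minimal sketch, not the site's actual configuration; the bot name patterns are taken from the post, everything else is illustrative:

    ```nginx
    # Flag known over-crawling user agents (patterns are illustrative).
    map $http_user_agent $blocked_bot {
        default        0;
        ~*Amazonbot    1;
        ~*SemrushBot   1;
    }

    server {
        # ...
        if ($blocked_bot) {
            # 444 is nginx-specific: close the connection
            # without sending any response at all.
            return 444;
        }
    }
    ```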

    At The Wild Wild Web, I try very hard not to over-crawl a website. I am not a search engine, so I don't need to gather much information anyway. Here are the current cache lifetimes for the different pieces of information I collect:

    - index.html, favicon, screenshot: refreshed once a month

    - feed articles: refreshed every 6 hours
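
    The TTLs above boil down to a simple freshness check before each fetch. A minimal sketch, where the TTL values come from the list above and the resource names and function are hypothetical:

    ```python
    from datetime import datetime, timedelta

    # TTLs from the list above; the resource keys are illustrative.
    TTLS = {
        "index.html": timedelta(days=30),
        "favicon": timedelta(days=30),
        "screenshot": timedelta(days=30),
        "feed": timedelta(hours=6),
    }

    def needs_refresh(resource: str, fetched_at: datetime, now: datetime) -> bool:
        """True when the cached copy is older than its TTL and should be re-fetched."""
        return now - fetched_at > TTLS[resource]
    ```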

    I believe this is very reasonable. I do intend to build a "down site detector" at some point, as websites tend to die and a directory full of dead links is just sad!
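
    The detector doesn't exist yet, but one cautious approach is to probe with a HEAD request and only flag a site as dead after several consecutive failures, so a transient outage doesn't delist a healthy site. Everything here is a hypothetical sketch (function names, thresholds, user agent):

    ```python
    import urllib.request

    def is_reachable(url: str, timeout: float = 10.0) -> bool:
        """Single probe: True if the URL answers with a successful HTTP response.
        Error statuses (404, 410, ...) and connection failures both count as down."""
        req = urllib.request.Request(
            url, method="HEAD",
            headers={"User-Agent": "WildWildWeb-DownSiteDetector/0.1"},  # hypothetical
        )
        try:
            urllib.request.urlopen(req, timeout=timeout)
            return True
        except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
            return False

    def looks_dead(recent_probes: list[bool], threshold: int = 3) -> bool:
        """Flag a site only after `threshold` consecutive failed probes."""
        return len(recent_probes) >= threshold and not any(recent_probes[-threshold:])
    ```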

    Last, but not least, the directory now indexes over 300 websites! 🎉
    - j0d1,