The Wild Wild Web

  • Banning bad bots that over-crawl the directory

    My last blog post did not age well: I am being over-crawled by Amazonbot and Semrushbot, or at least by rogue bots impersonating them. 😅

    I estimate they generate ~20k requests per day from roughly 400 IP addresses. For now, I decided to block those user agents by returning HTTP 444, nginx's non-standard "close the connection without a response" status. The server can handle that load without breaking a sweat, but crawling that much is unacceptable.
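
    Since 444 is an nginx-specific status, a block like this can be expressed with a `map` on the user agent. This is a minimal sketch, not the site's actual configuration; the bot name patterns are taken from the post, everything else is illustrative:

    ```nginx
    # Flag known over-crawling user agents (patterns are illustrative).
    map $http_user_agent $blocked_bot {
        default        0;
        ~*Amazonbot    1;
        ~*SemrushBot   1;
    }

    server {
        # ...
        if ($blocked_bot) {
            # 444 is nginx-specific: close the connection
            # without sending any response at all.
            return 444;
        }
    }
    ```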

    At The Wild Wild Web, I try very hard not to over-crawl a website. I am not a search engine, so I don't need to gather much information anyway. Here are the current cache lifetimes for the different pieces of information I collect:

    - index.html, favicon, screenshot: refreshed once a month

    - feed articles: refreshed every 6 hours
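
    The TTLs above boil down to a simple freshness check before each fetch. A minimal sketch, where the TTL values come from the list above and the resource names and function are hypothetical:

    ```python
    from datetime import datetime, timedelta

    # TTLs from the list above; the resource keys are illustrative.
    TTLS = {
        "index.html": timedelta(days=30),
        "favicon": timedelta(days=30),
        "screenshot": timedelta(days=30),
        "feed": timedelta(hours=6),
    }

    def needs_refresh(resource: str, fetched_at: datetime, now: datetime) -> bool:
        """True when the cached copy is older than its TTL and should be re-fetched."""
        return now - fetched_at > TTLS[resource]
    ```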

    I believe this is very reasonable. I do intend to build a "down site detector" at some point, as websites tend to die and a directory full of dead links is just sad!
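
    The detector doesn't exist yet, but one cautious approach is to probe with a HEAD request and only flag a site as dead after several consecutive failures, so a transient outage doesn't delist a healthy site. Everything here is a hypothetical sketch (function names, thresholds, user agent):

    ```python
    import urllib.request

    def is_reachable(url: str, timeout: float = 10.0) -> bool:
        """Single probe: True if the URL answers with a successful HTTP response.
        Error statuses (404, 410, ...) and connection failures both count as down."""
        req = urllib.request.Request(
            url, method="HEAD",
            headers={"User-Agent": "WildWildWeb-DownSiteDetector/0.1"},  # hypothetical
        )
        try:
            urllib.request.urlopen(req, timeout=timeout)
            return True
        except OSError:  # URLError, HTTPError, and timeouts are all OSError subclasses
            return False

    def looks_dead(recent_probes: list[bool], threshold: int = 3) -> bool:
        """Flag a site only after `threshold` consecutive failed probes."""
        return len(recent_probes) >= threshold and not any(recent_probes[-threshold:])
    ```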

    Last, but not least, the directory now indexes over 300 websites! 🎉
    - j0d1,