Block Crawler Bots

NOTE: This article is still a work in progress, as I haven’t had the time to experiment with a few ideas to see which one works best.

After years of allowing them, I now want to block the majority of crawler bots. Roughly half of all website traffic comes from bots, and that creates problems I otherwise wouldn’t have: most bots are used for illegally training LLMs and respect neither server resources nor copyright. Worst of all, many bots are used for AI profiling of humans.

There are a few possible options to help you block AI bots. I’ll discuss some of them and also explain why Cloudflare should NOT be used.

I’ll start with the most promising one, namely the one which makes the bots WORK!

Proof of Work Challenge

The easiest and best option is to craft a Proof of Work system in which the server issues a random PoW challenge and the client-side JavaScript must “work” for a few seconds to solve it before getting access. Once solved, a cookie is sent and the server keeps track of the IPs which solved the challenge.

The idea is that most bots won’t store cookies, and some won’t run JavaScript at all, so we can tell which clients are bots. If they run JavaScript but don’t store cookies, make them do the challenge on every request, burning their resources. Solved? Great, here’s a cookie, valid for your IP for 12 or 24 hours or so, so you won’t be bothered anymore. This does raise the question: what happens when someone doesn’t have JavaScript enabled (NoScript plugin, etc.)? The solution is simple: provide them with a message.

This solution works as a reverse proxy, so you can integrate it easily into your existing setup without making any major changes.

Luckily, such systems already exist; all you have to do is use them:

Caddy is an easy-to-use, high-performance web server and reverse proxy written in Go which can handle your traffic, automatic TLS/SSL, and so on.
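
To make this concrete, here is a rough sketch in Go of what such a gate could look like. It is not a finished implementation: the cookie name, difficulty, ports and upstream address are my assumptions, the client-side solver is only hinted at, and to keep the sketch stateless the challenge is derived from the client IP rather than stored server-side. The point is the flow: no valid cookie means you get the challenge page, a solved challenge earns a cookie tied to your IP, and everything else is proxied through to the real site.

    // powgate.go - a minimal sketch of a Proof of Work gate, not a finished product.
    package main

    import (
        "crypto/hmac"
        "crypto/rand"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "net"
        "net/http"
        "net/http/httputil"
        "net/url"
        "time"
    )

    var secret = make([]byte, 32) // HMAC key used to sign challenges and cookies

    const difficulty = 20 // required leading zero bits; tune so solving takes a few seconds

    func clientIP(r *http.Request) string {
        host, _, err := net.SplitHostPort(r.RemoteAddr)
        if err != nil {
            return r.RemoteAddr
        }
        return host
    }

    // sign derives a per-IP value so cookies and challenges can't be shared between clients.
    func sign(label, ip string) string {
        mac := hmac.New(sha256.New, secret)
        mac.Write([]byte(label + "|" + ip))
        return hex.EncodeToString(mac.Sum(nil))
    }

    // solved reports whether sha256(challenge+nonce) starts with `difficulty` zero bits.
    func solved(challenge, nonce string) bool {
        sum := sha256.Sum256([]byte(challenge + nonce))
        bits := 0
        for _, b := range sum {
            if b == 0 {
                bits += 8
                continue
            }
            for mask := byte(0x80); mask > 0 && b&mask == 0; mask >>= 1 {
                bits++
            }
            break
        }
        return bits >= difficulty
    }

    func main() {
        rand.Read(secret)
        upstream, _ := url.Parse("http://127.0.0.1:8080") // the real site (assumed address)
        proxy := httputil.NewSingleHostReverseProxy(upstream)

        // The client-side solver POSTs the nonce it found to this endpoint.
        http.HandleFunc("/pow", func(w http.ResponseWriter, r *http.Request) {
            ip := clientIP(r)
            if !solved(sign("challenge", ip), r.FormValue("nonce")) {
                http.Error(w, "invalid proof of work", http.StatusForbidden)
                return
            }
            http.SetCookie(w, &http.Cookie{
                Name:    "pow_pass",
                Value:   sign("cookie", ip),
                Expires: time.Now().Add(12 * time.Hour), // valid for this IP for 12 hours
                Path:    "/",
            })
            w.WriteHeader(http.StatusNoContent)
        })

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            ip := clientIP(r)
            if c, err := r.Cookie("pow_pass"); err == nil && c.Value == sign("cookie", ip) {
                proxy.ServeHTTP(w, r) // challenge already solved: pass through to the real site
                return
            }
            // No valid cookie: serve the interstitial page. The embedded challenge is what
            // the client-side JS grinds on; the noscript text covers NoScript users.
            fmt.Fprintf(w, "<html><body><p>Checking your browser...</p>"+
                "<noscript>Please enable JavaScript and cookies to view this site.</noscript>"+
                "<script>const challenge=%q; /* brute-force a nonce, then POST it to /pow */</script>"+
                "</body></html>", sign("challenge", ip))
        })

        http.ListenAndServe(":8081", nil) // run Caddy in front of this for TLS
    }

Run the real site on 127.0.0.1:8080 and this gate on 8081, with Caddy in front of the gate for TLS; the difficulty and cookie lifetime are knobs you would tune for your own traffic.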

Using Robots.txt to block them

Robots.txt is a file that some bots follow to learn what they are and aren’t allowed to do. Don’t fool yourself: most bots don’t follow it. And if they don’t follow it, you can easily give them a gzip bomb OR block their IPs for 1 hour for following a link to a page which they should NOT follow. This adds some complexity, but it can be thought of as a potential solution.
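
As a sketch of that trap idea (the path name and the one-hour ban are arbitrary choices of mine, not a rule), you disallow a path in robots.txt, link to it invisibly from your pages, and ban any IP that requests it anyway:

    // A sketch of the robots.txt trap: /trap/ is disallowed for everyone, so any
    // client that requests it has ignored robots.txt and gets its IP banned for an hour.
    package main

    import (
        "net"
        "net/http"
        "strings"
        "sync"
        "time"
    )

    // Served at /robots.txt; the trap path should also be linked invisibly from your pages.
    const robots = "User-agent: *\nDisallow: /trap/\n"

    var (
        mu     sync.Mutex
        banned = map[string]time.Time{} // IP -> time the ban expires
    )

    func clientIP(r *http.Request) string {
        host, _, err := net.SplitHostPort(r.RemoteAddr)
        if err != nil {
            return r.RemoteAddr
        }
        return host
    }

    func handler(w http.ResponseWriter, r *http.Request) {
        ip := clientIP(r)

        mu.Lock()
        until, isBanned := banned[ip]
        if isBanned && time.Now().After(until) {
            delete(banned, ip) // the hour is up
            isBanned = false
        }
        if strings.HasPrefix(r.URL.Path, "/trap/") {
            banned[ip] = time.Now().Add(time.Hour) // ignored robots.txt: banned for 1 hour
            isBanned = true
        }
        mu.Unlock()

        switch {
        case r.URL.Path == "/robots.txt":
            w.Write([]byte(robots))
        case isBanned:
            http.Error(w, "go away", http.StatusForbidden) // or serve the gzip bomb from below
        default:
            w.Write([]byte("normal page content")) // in practice: proxy to the real site
        }
    }

    func main() {
        http.ListenAndServe(":8081", http.HandlerFunc(handler))
    }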

Blocking based on IP - How to detect bots and crawlers?

There are plenty of such bots, and they are hard to detect since they can spoof their user agent. Some don’t spoof it (they use the default one), which you can easily detect and block. Use existing IP lists of the major bots, but this still leaves those which use proxies OR simply have malicious intent.

An easy thing to do is to serve them a gzip bomb (more on that below).
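
As a minimal sketch of that kind of filtering, here is a Go middleware that checks the User-Agent header. The list is illustrative and incomplete: it mixes a few well-known AI crawler agents with the default agents of common scraping tools, and anything that spoofs a normal browser agent will slip straight through, which is exactly why the other techniques on this page exist.

    // A sketch of user-agent based blocking as Go middleware.
    package main

    import (
        "net/http"
        "strings"
    )

    // Partial, illustrative list: known AI crawlers plus unspoofed default agents.
    var badAgents = []string{
        "GPTBot", "ClaudeBot", "CCBot", "Bytespider", "PerplexityBot",
        "python-requests", "Scrapy", "Go-http-client", "curl",
    }

    func blockBots(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ua := strings.ToLower(r.UserAgent())
            for _, bad := range badAgents {
                if strings.Contains(ua, strings.ToLower(bad)) {
                    // Either reject outright or send them to the gzip bomb described below.
                    http.Error(w, "bots are not welcome here", http.StatusForbidden)
                    return
                }
            }
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        site := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("normal page content"))
        })
        http.ListenAndServe(":8081", blockBots(site))
    }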

What happens when JavaScript is disabled? - False Positives

There is a chance that someone accessing your site is using a real browser with JavaScript disabled for security and privacy purposes.

Most systems will flag this as a BOT, even though that person might be a real human using an older browser OR a browser built for privacy. This could lock many people out of websites unless they complete complex captchas or whatever.

A solution is to show a message saying that the page is blocked and, well, invite them to enable JavaScript and cookies on the page.

If you do block, make sure you keep some alternative means for people to find you on the internet, or allow a few crawlers like Google to index your website.

Gzip Bomb Them

If bots won’t follow robots.txt and go where they aren’t supposed to, well, too bad for them, because you have a surprise waiting.

You can create a 10GB or even a 1TB (terabyte) file which contains just zeroes and gzip it a few times so it ends up at only 5 ~ 50KB. When the browser/crawler accesses that (gzipped) page, especially if it is a headless browser, it will use huge amounts of resources (RAM, CPU) to decompress the 10GB/1TB file (which is tiny when gzipped), effectively making it run out of memory and crash. This will teach them not to mess with your website if they can’t follow basic instructions.

TODO caddy example to follow!
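
In the meantime, here is a rough sketch in Go of the single-layer version of the idea. One pass of gzip already shrinks 10GB of zeroes down to roughly 10MB; the multi-pass variant described above gets far smaller, but every extra layer would have to be declared in the Content-Encoding header and not every client handles that, so the sketch keeps it to one layer. The trap path and sizes are arbitrary.

    // A sketch of building and serving a single-layer gzip bomb.
    package main

    import (
        "compress/gzip"
        "net/http"
        "os"
    )

    const bombPath = "/tmp/bomb.gz" // roughly 10MB on disk, 10GB once inflated

    // makeBomb streams 10GB of zeroes through gzip once and stores the result.
    func makeBomb() error {
        f, err := os.Create(bombPath)
        if err != nil {
            return err
        }
        defer f.Close()

        zw, _ := gzip.NewWriterLevel(f, gzip.BestCompression)
        zero := make([]byte, 1<<20)    // 1MiB of zeroes
        for i := 0; i < 10*1024; i++ { // 10GiB in total
            if _, err := zw.Write(zero); err != nil {
                return err
            }
        }
        return zw.Close()
    }

    func serveBomb(w http.ResponseWriter, r *http.Request) {
        // Claim the body is gzip-compressed HTML; a client that honours the header
        // will try to inflate the full 10GB in memory.
        w.Header().Set("Content-Encoding", "gzip")
        w.Header().Set("Content-Type", "text/html")
        http.ServeFile(w, r, bombPath)
    }

    func main() {
        if err := makeBomb(); err != nil {
            panic(err)
        }
        http.HandleFunc("/trap/treasure", serveBomb) // reachable only via robots.txt-disallowed links
        http.ListenAndServe(":8081", nil)
    }

Serving roughly 10MB per hit costs you some bandwidth, so reserve the bomb for clients you have already classified as bots (trap paths, bad user agents, failed PoW).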

Cloudflare is NOT the solution against bots

Using a third-party service like Cloudflare for bot protection introduces significant security and privacy concerns by placing a crucial layer of your infrastructure outside of your direct control. By proxying all traffic, Cloudflare acts as a Man-in-the-Middle (MITM), effectively terminating your users’ secure connection (SSL/TLS) at their edge servers before re-establishing a connection to your origin server. While they secure the connection to your server, you are dependent on their security practices and privacy policies for the handling of your users’ data at their global points of presence. This means Cloudflare sees all unencrypted traffic, including request bodies, headers, and the original IP addresses of your users. For organizations handling sensitive data or operating under strict regulatory compliance, this loss of control and visibility over the complete secure chain, including the management of the SSL certificate at the edge, is a non-trivial security and privacy risk.


A common operational drawback of using third-party bot protection is the inadvertent blocking of legitimate users, which directly impacts accessibility and business goals. Cloudflare’s bot detection relies on various techniques, including behavioral analysis and client-side integrity checks, which can be overly aggressive or rely on identifying known, up-to-date browser fingerprints. This often leads to a phenomenon known as “false positives,” where users with older or non-mainstream browsers, niche operating systems, or those using privacy-enhancing tools (like specific VPNs or certain ad/script blockers) are erroneously flagged as malicious bots and subjected to frustrating challenges or outright blocks. This exclusion of real customers, particularly those with older devices or with a strong preference for privacy-focused browsers, can be detrimental to the user experience and can significantly shrink the potential audience for your web application.


The superior alternative is to implement a reverse proxy that you own and operate—such as Nginx, Apache, or Caddy—as your gateway for bot protection. By doing so, you maintain complete ownership and control over all security logs, configurations, and, most importantly, the entire SSL/TLS termination process, ensuring your users’ traffic is decrypted only on infrastructure you trust. You can deploy a Web Application Firewall (WAF) or specialized bot detection modules directly on your self-managed reverse proxy, custom-tuning the rules to minimize false positives and suit your specific audience’s traffic patterns, rather than relying on a one-size-fits-all, opaque, cloud-based solution. This architectural choice enhances security through full control, guarantees maximum privacy for your user base, and offers the flexibility to ensure genuine users, even those with older technology, are never mistakenly blocked.

Various resources

https://robotstxt.com/ai

https://medium.com/swlh/keeping-ai-bots-off-your-blog-how-to-protect-your-intellectual-property-efd1699da980

https://www.theverge.com/24315071/ai-training-chatgpt-gemini-copilot-how-to

https://github.com/dmitrizzle/disallow-ai/blob/main/src/robots.txt

https://github.com/caddy-plugins/nobots
