Software developers have taken to calling AI web-crawling bots the “cockroaches of the internet,” and some are fighting back with inventive, often humorous countermeasures. Any website can suffer from bad crawler behavior, which can occasionally take a site down, but open source developers are hit “disproportionately,” says Niccolò Venerandi, a developer of the Plasma Linux desktop and blogger at LibreNews.
Sites hosting free and open source software (FOSS) projects expose more of their infrastructure publicly by design, and they tend to have fewer resources than commercial products. A core part of the problem is that many AI bots ignore the Robots Exclusion Protocol, the robots.txt file that tells bots which parts of a site are off-limits and which was originally created for search engine crawlers.
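For context, the protocol is nothing more than a plain-text file served at a site’s root, and compliance is entirely voluntary. The sketch below is illustrative: GPTBot and Amazonbot are real crawler user agents (OpenAI’s and Amazon’s, respectively), but the blocked paths are hypothetical examples, not rules from any site mentioned in this story.

```
# Served at https://example.com/robots.txt
# Ask specific AI crawlers to stay away entirely (example user agents).
User-agent: GPTBot
Disallow: /

User-agent: Amazonbot
Disallow: /

# Everyone else may crawl, but not the resource-heavy Git endpoints
# (illustrative path).
User-agent: *
Disallow: /git/
```

A crawler that respects the protocol reads this file before fetching anything else; the complaints in this story are about bots that simply don’t.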
In January, FOSS developer Xe Iaso published a “cry for help” on their blog, describing how AmazonBot relentlessly hammered a Git server website to the point of causing DDoS-style outages. Git servers host FOSS projects so that anyone can download or contribute to the code. Iaso observed that the bot ignored the robots.txt file, hid behind different IP addresses, and pretended to be other users.
According to Iaso, trying to block AI crawler bots is futile: they change their user agent, use residential IP addresses as proxies, and more. They scrape a site until it falls over, following every link again and again.
In response, Iaso built a tool called Anubis. Anubis is a reverse proxy that imposes a proof-of-work check before requests reach a Git server, so that only browsers operated by humans get through while bots are turned away. The name, drawn from the Egyptian god who weighed the souls of the dead, is a joke about the tool weighing the “soul” of each web request. A request judged to be human is greeted with a cute anime drawing, an artistic take on the mythological Anubis; a request judged to be a bot is denied.
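Anubis itself is a full reverse proxy, so the following is only a rough sketch of the proof-of-work idea, not Anubis’s actual code or parameters: the server issues a random challenge and the client must find a nonce whose SHA-256 hash has a required number of leading zero bits. That cost is negligible for one human-driven browser but adds up quickly for a bot hammering thousands of pages.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// hasLeadingZeroBits reports whether SHA-256(challenge || nonce) starts
// with at least `difficulty` zero bits. The server runs this once per
// request, which is cheap; finding a passing nonce is the expensive part.
func hasLeadingZeroBits(challenge string, nonce uint64, difficulty int) bool {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, nonce)
	sum := sha256.Sum256(append([]byte(challenge), buf...))

	zeros := 0
	for _, b := range sum {
		if b == 0 {
			zeros += 8
			continue
		}
		zeros += bits.LeadingZeros8(b)
		break
	}
	return zeros >= difficulty
}

// solve plays the client's role (browser-side JavaScript in a real
// deployment): brute-force nonces until the difficulty target is met.
func solve(challenge string, difficulty int) uint64 {
	for nonce := uint64(0); ; nonce++ {
		if hasLeadingZeroBits(challenge, nonce, difficulty) {
			return nonce
		}
	}
}

func main() {
	challenge := "random-challenge-issued-by-server" // hypothetical value
	difficulty := 16                                 // ~65,000 hashes on average

	nonce := solve(challenge, difficulty)
	fmt.Println("client found nonce:", nonce)
	fmt.Println("server accepts:", hasLeadingZeroBits(challenge, nonce, difficulty))
}
```

The asymmetry is the point: verification is one hash, while solving takes thousands, so the cost lands almost entirely on whoever is making the requests.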
The project quickly gained traction within the FOSS community. Upon being shared on GitHub on March 19, Anubis rapidly accumulated 2,000 stars, 20 contributors, and 39 forks.
The popularity of Anubis indicates that Iaso’s challenges are not isolated and are shared by others. Venerandi recounted numerous similar instances:
– Drew DeVault, Founder and CEO of SourceHut, discussed spending substantial time addressing aggressive Large Language Model (LLM) crawlers, which led to frequent brief outages.
– Jonathan Corbet, an esteemed FOSS developer running the Linux news site LWN, noted his site was slowed by DDoS-level AI scraper bot traffic.
– Kevin Fenzi, sysadmin of the Linux Fedora project, had to block Brazil entirely due to the bots’ aggressiveness.
Venerandi told TechCrunch about the extreme measures developers are forced to take, such as banning entire countries, just to cope with AI bots that ignore robots.txt files, and noted that some developers have come to see vengeance as the best defense.
In a Hacker News discussion, one user suggested loading the pages that robots.txt forbids with misleading content to punish bots that crawl them anyway. A creator known as “Aaron” built a tool called Nepenthes, named after a carnivorous plant, that traps crawlers in a maze of such content, an approach described as aggressive and, potentially, malicious.
Cloudflare, a major commercial player, recently released AI Labyrinth, a tool that misleads AI crawlers by serving them irrelevant content instead of letting them reach a site’s real data.
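The underlying “tarpit” idea behind both Nepenthes and AI Labyrinth can be sketched in a few lines; the code below is not the actual implementation of either tool, and the route, word list, and wiring are illustrative assumptions. The point is simply that every page is generated on the fly and links only to more generated pages, so a crawler that ignores robots.txt wanders indefinitely without ever reaching real content.

```go
package main

// Minimal tarpit sketch (not Nepenthes or AI Labyrinth code): pages under
// /maze/ are generated on demand and link only to further generated pages.

import (
	"fmt"
	"math/rand"
	"net/http"
	"strings"
)

var words = []string{"amber", "basalt", "cipher", "delta", "ember", "fjord"}

func mazeHandler(w http.ResponseWriter, r *http.Request) {
	// Derive a seed from the request path so each URL renders the same
	// page every time, while the space of pages is effectively endless.
	var seed int64
	for _, c := range r.URL.Path {
		seed = seed*31 + int64(c)
	}
	rng := rand.New(rand.NewSource(seed))

	fmt.Fprintln(w, "<html><body>")
	// Filler paragraphs with no informational value.
	for i := 0; i < 5; i++ {
		fmt.Fprintf(w, "<p>%s %s %s</p>\n",
			words[rng.Intn(len(words))],
			words[rng.Intn(len(words))],
			words[rng.Intn(len(words))])
	}
	// Links that only lead deeper into the maze.
	base := strings.TrimSuffix(r.URL.Path, "/")
	for i := 0; i < 3; i++ {
		next := fmt.Sprintf("%s/%s-%d", base, words[rng.Intn(len(words))], rng.Intn(1000))
		fmt.Fprintf(w, "<a href=%q>more</a><br>\n", next)
	}
	fmt.Fprintln(w, "</body></html>")
}

func main() {
	// A real deployment would only route misbehaving bots here, for
	// example those that request robots.txt-disallowed paths; human
	// visitors never see these pages.
	http.HandleFunc("/maze/", mazeHandler)
	http.ListenAndServe(":8080", nil)
}
```

Because the pages cost almost nothing to generate but waste the crawler’s bandwidth, compute, and training data quality, the economics are flipped in the site owner’s favor.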
Drew DeVault of SourceHut weighed in on both tools, saying Nepenthes has a certain appeal in feeding nonsense to crawlers, but that Anubis turned out to be the solution that actually worked for his site. Still, DeVault also issued a public plea for a more permanent fix: stop legitimizing LLMs, AI image generators, GitHub Copilot, and similar technologies altogether, a request made in earnest despite being unlikely to be heeded.
Given the persistent threat, developers, particularly within the FOSS community, continue to devise clever and occasionally humorous strategies to defend against AI crawler bots.