Managing "bad" bots is not always black(list) or white(list)
By: Ken Hardin | June 12, 2024 | developer tools and AI
Managing bot access to your website is one of the least sexy tasks a site manager handles. You just identify the bot’s user agent, edit robots.txt to allow or disallow access, and you’re done. Right?
Of course, nothing’s ever quite that simple when it comes to safeguarding your web identity. Sure, robots.txt is still the industry stalwart – it’s a tactic we often use for our clients here at Mugo Web when we want to block a specific bot for some practical reason.
But there are literally thousands of bots out there, and there are more coming every day. Research by Arkose Labs found that site scraping jumped more than 400 percent in just one quarter last year. Not all that activity is nefarious, but you can bet a lot of it is. In fact, Arkose estimates there are about twice as many bad bots as good bots, and together they can account for 50 percent or more of all internet traffic.
And your definition of a “bad” bot may vary, depending on the business you’re in.
In this post, we’ll look at some issues to consider when managing bot access to your site and the three approaches to handle this mundane but vital task.
Bad bots and their impact on your site
Bad bots have been around for decades now, and they can cause all kinds of mischief.
The most obvious are DDoS (Distributed Denial of Service) and brute force attacks, but credible hosting services have long since implemented rate limiting and other volume-based solutions to thwart robots that try to just overwhelm your site or abuse APIs. Similarly, programmatic advertising platforms have developed filters to weed out click fraud and other old-school bad bot behavior.
Scalper bots are still causing consumers fits, but that’s really more a regulatory issue than a technical one at this point. Most observers agree that retail and ticket-seller sites could use rate limits and other tried-and-true tactics to curtail the practice, but they arguably have little incentive to slow down sales of inventory that’s flying off the shelves.
The real issue these days is scraping – why is this bot reading my site, and what is it going to do with the information it gathers? Social media sites are particularly fertile hunting grounds for scraper bots that are looking for hints about account passwords or leverage points for social engineering scams.
And then there is Generative AI. Not only can these bots consume all your content, they can re-write it and place it on a competitor’s website. And since there’s no real difference in load or rate against your server – on your box, they basically act like a search engine indexer – the only way to stop Gen AI crawlers is to identify them and deny them access.
To block or not to block?
One of our most recent bot-related projects here at Mugo Web was to disallow access for Gen AI bots on the site of one of our clients, Habitat Magazine, which invests a lot of resources in its proprietary, highly specific content. While Habitat certainly wants to be indexed by Google and other search engines, having ChatGPT simply spit out its content in a re-written version, with no referral traffic benefit, might be a real drain on Habitat’s business.
Most news publishers have decided to block bots from OpenAI and other Gen AI engines for the same reason (although, like everything else, the question seems to have become political, according to this piece from Wired.) Google has even introduced a specific tool to block its Bard AI, in response to publishers’ concerns.
However, if you manufacture flooring or run a luxury resort, you may not mind showing up in what amounts to a long-form search result. (ChatGPT and other Gen AI tools are terrible research tools, but consumers are using them to compare products and whip up shopping lists.) Not surprisingly, Yoast has a pretty good walk-through of the pros and cons of blocking AI bots for SEO and visibility purposes.
In Habitat’s case, the decision was pretty cut and dried. We just edited robots.txt to disallow Gen AI bot access, and that was it. We’re confident this tactic will be sufficient, since OpenAI, Google, and other Gen AI platforms are generally trustworthy actors and will comply with the directives in robots.txt.
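To give a sense of what that looks like in practice, here’s a sketch (not necessarily the exact file we deployed; the user agents you target will depend on which platforms you want to exclude, and new AI crawlers appear regularly):
# OpenAI’s crawler
User-agent: GPTBot
Disallow: /

# Google’s token for AI training (separate from Googlebot search indexing)
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose datasets are widely used for AI training
User-agent: CCBot
Disallow: /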
Of course, you don’t have that level of clarity into all of the thousands of bots that may be trying to scan your site. Which brings us to the real question with bot management: blacklist or whitelist?
Just shut them all down? (That’s a C-3PO joke)
In our example with Habitat, we took a “blacklist” approach to managing robots.txt – when we determine that a bot may have a negative impact, we simply block it. All other bots can access the site, until such time as server logs or our customers say they are causing trouble.
This is the way most site managers use robots.txt to block bad bots. It works, but you’ll also see a lot of web admins describe the process as a game of “whack-a-mole.” You’re constantly on the hunt for harmful bots and adding them to your blacklist.
So, you have to ask, why not make a “whitelist” of friendly bots, and then just block everything else?
The “whitelist” approach seems to make sense, at least at first glance. At the very least, it’s definitely a lot less work. You just add all the leading search engines, and the link checkers your partners may be using, and any third-party grammar checkers, and price comparison bots, and…
You get the picture. There are so many bots, good and bad, that it’s going to take a lot of work to identify either a credible “white” or “black” list.
- If you go the whitelist route, you stand to lose out on SEO, visibility, or other benefits from a “good” bot you just don’t know about.
- If you use the blacklist approach, you’re going to be open to new scrapers or other troublemakers until your server logs or general security intelligence alerts you to add them to your block list.
Of course, web admins and security advocates are always on the lookout for threats, and you can find regularly updated lists of bad bots on the internet. (GitHub is a wellspring for this kind of stuff.)
So, it’s really up to you. Most admins take the blacklist approach, since they are monitoring server logs and other threat intelligence anyway and can always spot and block a bad bot after the fact. If you just block everything but a select list of “good” bots, there’s a chance you’ll miss out on some benefit without ever knowing it.
Three methods for bot management
There are three approaches to managing bot access to your site, ranging from the tried-and-true to add-on services that promise to take this chore off your plate.
robots.txt
robots.txt is a text file that lives at the root of your site. Its main advantage is that it’s pretty easy to use. You simply declare the user agent name of the bot and then allow or disallow it access to certain areas of your site.
Technically speaking, robots.txt does not allow for conditional expressions, but most of the major search indexers (Google, etc.) do recognize the * wildcard character, giving you some flexibility in setting a global access rule and then listing exceptions to that rule – either a blacklist or a whitelist.
For example:
# Default rule: any bot without its own group below is blocked entirely
User-agent: *
Disallow: /

# Googlebot may crawl everything except the /private/ directory
User-agent: Googlebot
Allow: /
Disallow: /private/

# Bingbot may crawl only these two pages; each bot obeys only the most
# specific group that matches it, so this whitelist needs its own Disallow: /
User-agent: Bingbot
Allow: /page3.html
Allow: /page4.html
Disallow: /
Tells all robots (*) that they are not allowed to read your site, then lets the whitelisted Googlebot crawl everything except your /private/ directory and restricts Bingbot to just the two listed pages.
You can also use the * wildcard to include or exclude specific file types and subdirectories from bot scans.
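For example, this sketch (the paths are hypothetical, and note that * and the $ end-of-URL anchor are pattern-matching extensions honored by major crawlers like Googlebot and Bingbot, not something every bot supports) blocks a subdirectory and a file type for all bots:
User-agent: *
# Keep bots out of a specific subdirectory
Disallow: /staging/
# Block crawling of any URL that ends in .pdf
Disallow: /*.pdf$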
Note that robots.txt does not include IP addresses or other criteria for identifying potential threats. And it relies on the visiting bot to honor the directives it provides – a genuinely bad robot will just ignore it and go about its nasty business. That’s what firewalls and other layers of security are for.
Server-side access
Web server platforms offer a variety of tools – including IP address blocking and custom response paths – that give you more granular control over bots.
Apache is commonly configured through .htaccess files, in which each directive takes the form
DirectiveName directive_value
For example, the .htaccess file:
RewriteEngine On
# Match requests from the 192.168.x.x address range
RewriteCond %{REMOTE_ADDR} ^192\.168\.[0-9]{1,3}\.[0-9]{1,3}$
# Skip the target page itself so the redirect can't loop
RewriteCond %{REQUEST_URI} !^/301-error-page\.html$
RewriteRule ^(.*)$ http://example.com/301-error-page.html [R=301,L]
Tells Apache to permanently redirect (with a 301) any request from the stated IP range to a designated page.
There are far too many possible directives to cover in this blog post, of course (here’s Apache’s resource, if you need a reference.)
But, needless to say, admins have gotten really creative in the ways they use scripting and .htaccess files to create custom security alerts and enforce “crawl budgets” for the number of pages they will allow a bot to read before it compromises site performance. We particularly like this project at GitHub, which includes a comprehensive and regularly updated list of about 1,000 bad bots, as well as 7,000 bad referrers.
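As a minimal sketch of the user-agent-based blocking these lists are built on (the bot names below are placeholders; a maintained blocklist like the GitHub project above contains far more entries), an .htaccess rule set might look like this:
RewriteEngine On
# Placeholder user-agent patterns; swap in a maintained blocklist
RewriteCond %{HTTP_USER_AGENT} (badbot|nastyscraper|contentgrabber) [NC]
# Return 403 Forbidden with no substitution of the URL
RewriteRule ^ - [F,L]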
Third-party and integrated services
Not surprisingly, a market has grown in web application firewall (WAF) tools that provide enormous flexibility and granularity over how you manage bots. Some of these tools are offered as third-party add-ons to self-hosted sites; others are integrated directly into virtual hosting environments.
These tools promise an antivirus level of protection from nefarious bots. After all, IP addresses can be spoofed, and your team probably has better things to do than constantly update your blacklist.
And, depending on the volume of traffic your site handles, these tools can be remarkably affordable. Amazon Web Services (AWS) Bot Control costs about $10 per 1 million targeted requests. For high-traffic sites that want to carefully manage compute costs and possible data security issues, you can see how this would pay off pretty quickly. And, of course, these tools allow you to custom-configure settings for special use cases, like the Gen AI crawlers we discussed a bit earlier.
Bot management is basic, but critical
In this post, we’ve looked at the various ways you can identify and manage access for the bots that want to crawl your website – often with bad intentions. None of these maintenance tasks are particularly challenging, from a technical standpoint, but they are critical in keeping your site up and running and protecting your intellectual property.