Blocking OpenAI from Crawling Your Website

Keaeye, 2025-03-08

How do I stop OpenAI's GPTBot from crawling my website?

OpenAI recently published technical details about GPTBot, its web crawler. One of the most important points: how to stop the crawler from fetching our sites and feeding the content into its models.

What is GPTBot?

GPTBot is OpenAI’s web crawler and can be identified by the following user agent and string.

User agent token: GPTBot
Full user-agent string:

What is GPTBot used for?

According to OpenAI's official documentation:

Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.

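In short, pages crawled by GPTBot may be used to improve future AI models, and OpenAI says it filters out:

content behind a paywall
pages known to gather personally identifiable information (PII)
content that violates OpenAI's policies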

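However, unlike Google, GPTBot does not link back to your site after using your content. OpenAI may effectively get your content for free while your site gains nothing in return.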

How do I block GPTBot from crawling my site?

1. Edit `robots.txt`

OpenAI states that GPTBot obeys robots.txt rules, so robots.txt can be used to block it.

Block GPTBot from the entire site

Add the following to your site's robots.txt file:

User-agent: GPTBot
Disallow: /
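
A quick way to sanity-check a site-wide blocking rule before deploying it is Python's standard `urllib.robotparser`. The sketch below assumes the rule is `User-agent: GPTBot` followed by `Disallow: /`; the helper name `crawler_allowed` and the sample paths are illustrative only, and the check says nothing about how any real crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block GPTBot everywhere, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if `user_agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # GPTBot is matched by the rule group; Googlebot is not.
    print("GPTBot:", crawler_allowed(ROBOTS_TXT, "GPTBot", "/blog/post-1"))
    print("Googlebot:", crawler_allowed(ROBOTS_TXT, "Googlebot", "/blog/post-1"))
```

Because the rule group names only GPTBot, other crawlers are unaffected by it.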

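Allow crawling of specific directories only

If you want GPTBot to access only a specific directory (for example /example-directory-1/) and nothing else, use:

User-agent: GPTBot
Allow: /example-directory-1/
Disallow: /

The more specific Allow rule takes precedence: even though the Disallow rule also covers that directory, the Allow rule still applies, because under the robots.txt standard the longest matching rule wins.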
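
To make the Allow/Disallow precedence concrete, here is a minimal sketch of the longest-match rule described in RFC 9309: the matching rule with the longest path prefix wins, and on a tie Allow wins. It is a simplification that handles plain prefixes only (no `*` or `$` wildcards), the function name `is_allowed` and the rule list are hypothetical, and whether a given crawler follows exactly this logic is an assumption.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, prefix) rules for one crawler."""
    best_len = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        # Empty prefixes and non-matching prefixes do not apply to this path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Assumed rule group for GPTBot: everything is blocked except one directory.
GPTBOT_RULES = [("Disallow", "/"), ("Allow", "/example-directory-1/")]

if __name__ == "__main__":
    for sample in ("/example-directory-1/page.html", "/private/page.html"):
        print(sample, is_allowed(GPTBOT_RULES, sample))
```

Note that not every parser works this way: Python's own `urllib.robotparser`, for instance, applies rules in file order, so listing the Allow line before the broader Disallow line is the safer habit.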

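2. Block the OpenAI crawler's IP addresses

If you are worried that GPTBot might not honor robots.txt, you can block its IP ranges directly.

OpenAI publishes the following IP ranges for GPTBot (the list can change, so check OpenAI's official list for the current ranges):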

20.15.240.64/28 
20.15.240.80/28 
20.15.240.96/28 
20.15.240.176/28 
20.15.241.0/28 
20.15.242.128/28 
20.15.242.144/28 
20.15.242.192/28 
40.83.2.64/28
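
Each entry above is a CIDR block, and a /28 covers 16 consecutive addresses. As a rough illustration of what "blocking an IP range" means, the sketch below uses Python's standard `ipaddress` module to test whether a client address falls inside any of the listed ranges. The function name `is_gptbot_ip` is hypothetical, and the hard-coded list is only a snapshot copied from this article.

```python
import ipaddress

# Snapshot of the ranges listed in this article; the official list may differ.
GPTBOT_RANGES = [
    "20.15.240.64/28",
    "20.15.240.80/28",
    "20.15.240.96/28",
    "20.15.240.176/28",
    "20.15.241.0/28",
    "20.15.242.128/28",
    "20.15.242.144/28",
    "20.15.242.192/28",
    "40.83.2.64/28",
]

# Parse the CIDR strings once so each lookup is a cheap membership test.
_NETWORKS = [ipaddress.ip_network(cidr) for cidr in GPTBOT_RANGES]


def is_gptbot_ip(addr: str) -> bool:
    """Return True if `addr` lies inside any of the listed CIDR ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in _NETWORKS)


if __name__ == "__main__":
    for sample in ("20.15.240.70", "40.83.2.80", "8.8.8.8"):
        print(sample, is_gptbot_ip(sample))
```

A firewall or edge rule performs the same membership test, just before the request ever reaches your application.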

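Blocking GPTBot with Vercel Edge Config

If your site is hosted on Vercel, you can keep the block list in Edge Config:

Open the Vercel dashboard and select your project
Click Storage -> Edge Config and edit blocked_ips

Add the following content. Note that Edge Config only stores the list; your Middleware still has to read blocked_ips and reject matching requests.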

   {
     "blocked_ips": [
       "20.15.240.64/28",
       "20.15.240.80/28",
       "20.15.240.96/28",
       "20.15.240.176/28",
       "20.15.241.0/28",
       "20.15.242.128/28",
       "20.15.242.144/28",
       "20.15.242.192/28",
       "40.83.2.64/28"
     ]
   }

## Blocking GPTBot on the server
If you run your own server (a VPS or cloud instance), you can block GPTBot in the Nginx or Apache configuration.
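
Server-level blocking of this kind comes down to a case-insensitive match on the User-Agent header. If you cannot edit the web server configuration, the same check can be sketched at the application layer. The example below is a minimal WSGI middleware in Python; `block_gptbot` and `demo_app` are hypothetical names, and this is an illustration of the idea rather than a replacement for a web server rule.

```python
import re

# Case-insensitive match on the crawler token anywhere in the User-Agent header.
BLOCKED_AGENT = re.compile(r"GPTBot", re.IGNORECASE)


def block_gptbot(app):
    """Wrap a WSGI app so requests whose User-Agent contains 'GPTBot' get a 403."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_AGENT.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Placeholder application used only to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_gptbot(demo_app)
```

A User-Agent header is trivial to spoof, so a check like this only stops crawlers that identify themselves honestly.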

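### Blocking with Nginx
Inside the server block of your Nginx configuration file (nginx.conf), add:
```nginx
# Return 403 Forbidden when the User-Agent contains "GPTBot" (case-insensitive match)
if ($http_user_agent ~* "GPTBot") {
    return 403;
}
```

Then restart Nginx with `systemctl restart nginx`.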

### Blocking with Apache

In your .htaccess file, add:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

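Blocking with a Cloudflare WAF rule

If your site uses Cloudflare, you can block GPTBot with a WAF firewall rule:

Open the Cloudflare dashboard
Select your site and go to Security -> WAF
Create a new firewall rule:

Field: User-Agent
Operator: contains
Value: GPTBot
Action: Block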

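Conclusion

OpenAI lets site owners decide whether GPTBot may visit, but the choice is only superficial.
The models have already been trained, so does blocking GPTBot now still make a difference? 🤔

Google's crawler at least links the content it crawls back to your site, which brings SEO value. GPTBot only uses your content and gives you nothing in return.
If you want to protect your site's content, you can use the methods above to block GPTBot.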