AI Crawler Tracker

Why Track Crawlers?

AI companies and search engines regularly crawl websites to train models and index content. Understanding which bots visit your site helps you:

Monitor data collection by AI training bots (GPTBot, ClaudeBot, etc.)
Understand search engine indexing patterns
Identify unwanted or malicious crawlers
Analyze traffic sources beyond traditional analytics

How It Works

This system uses three complementary tracking methods, because crawlers differ in which page resources they fetch:

1x1 Pixel Image: A transparent GIF that is requested when a bot downloads a page's images. Works for crawlers that fetch image resources.
CSS Link Beacon: A stylesheet link that is requested when a bot loads a page's CSS. Catches bots that skip images but load styles.
JavaScript Pixel: A client-side script that creates a tracking image request at runtime. Captures crawlers that execute JavaScript.

Implementation Overview

The tracking system consists of three components:

1. Cloudflare Worker

Handles tracking endpoints (/pixel.gif, /log.css, /api/logs) and stores visit data in Cloudflare KV.

2. Tracking Beacons

Small code snippets embedded in your website that send requests to the Worker endpoints.

3. Dashboard (This Site)

A Next.js application that fetches data from the Worker and displays it in a filterable table.

Setting Up the Worker

Create a Cloudflare Worker with the following endpoints:

/pixel.gif - Returns a 1x1 transparent GIF
/log.css - Returns an empty CSS file
/api/logs - Returns stored visit records as JSON
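
As a rough sketch of how these three routes might respond (an illustration, not the project's actual Worker code), the helper below maps a path to a status, headers, and body. The function names, the `Cache-Control: no-store` header, and the 404 fallback are assumptions made for this example.

```javascript
// A widely used transparent 1x1 GIF, base64-encoded.
const PIXEL_B64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7";

// Decode the base64 string into raw bytes for the response body.
function pixelBytes() {
  return Uint8Array.from(atob(PIXEL_B64), (ch) => ch.charCodeAt(0));
}

// Map a request path to a plain description of the response.
// `records` is whatever list of visit records the caller has loaded.
function routeResponse(pathname, records = []) {
  // Assumption: disable caching so repeat visits still reach the Worker.
  const noStore = { "Cache-Control": "no-store" };
  switch (pathname) {
    case "/pixel.gif":
      return { status: 200, headers: { "Content-Type": "image/gif", ...noStore }, body: pixelBytes() };
    case "/log.css":
      return { status: 200, headers: { "Content-Type": "text/css", ...noStore }, body: "" };
    case "/api/logs":
      return { status: 200, headers: { "Content-Type": "application/json" }, body: JSON.stringify(records) };
    default:
      return { status: 404, headers: { "Content-Type": "text/plain" }, body: "Not found" };
  }
}

// Inside a Worker's fetch handler, the description converts to a Response.
function toResponse(r) {
  return new Response(r.body, { status: r.status, headers: r.headers });
}
```

In a Worker, the fetch handler would call something like `toResponse(routeResponse(new URL(request.url).pathname, records))` after logging the visit.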

For each request, the Worker records: timestamp, IP, user agent, location (country/city), ASN, referer, and requested path.
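
The fields above could be collected along these lines. This is a minimal sketch under stated assumptions: on Cloudflare, the client IP arrives in the `CF-Connecting-IP` header and geolocation and ASN are exposed on `request.cf`, but the record shape, the `page` field, the `visit:` key prefix, and both function names are hypothetical.

```javascript
// Build one visit record from a Fetch API Request.
// `request.cf` is only populated on Cloudflare's edge, so each lookup
// falls back to null when it is missing.
function buildVisitRecord(request, now = new Date()) {
  const url = new URL(request.url);
  const cf = request.cf ?? {};
  return {
    timestamp: now.toISOString(),
    ip: request.headers.get("CF-Connecting-IP"),
    userAgent: request.headers.get("User-Agent"),
    country: cf.country ?? null,
    city: cf.city ?? null,
    asn: cf.asn ?? null,
    referer: request.headers.get("Referer"),
    path: url.pathname,
    // Assumption: the beacons pass the embedding page in a `page` parameter.
    page: url.searchParams.get("page"),
  };
}

// Store the record under a time-ordered key.
// `kv` is any object with a KV-style put(key, value) method.
async function storeVisit(kv, record) {
  const suffix = Math.random().toString(36).slice(2, 10);
  const key = `visit:${record.timestamp}:${suffix}`;
  await kv.put(key, JSON.stringify(record));
  return key;
}
```

In a Worker this might be called as `ctx.waitUntil(storeVisit(env.LOGS, buildVisitRecord(request)))`, where `LOGS` is a hypothetical KV binding name, so the write does not delay the beacon response.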

Adding Tracking to Your Site

Add the tracking beacons to your website's HTML:

<!-- 1. Pixel Image -->
<img src="https://your-worker.workers.dev/pixel.gif?token=YOUR_TOKEN&page=/page-path"
     width="1" height="1" alt="" referrerpolicy="origin-when-cross-origin">

<!-- 2. CSS Beacon -->
<link rel="stylesheet"
      href="https://your-worker.workers.dev/log.css?token=YOUR_TOKEN&page=/page-path"
      referrerpolicy="origin-when-cross-origin">

<!-- 3. JavaScript Beacon -->
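<script>
(function(){
  // Only clients that execute JavaScript will issue this request.
  var p = new Image();
  // Send the current path, URL-encoded, as the page parameter.
  p.src = 'https://your-worker.workers.dev/pixel.gif?token=YOUR_TOKEN&page=' +
          encodeURIComponent(location.pathname);
})();
</script>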
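
If your pages are generated from templates, the pixel and CSS tags above can be built per page rather than pasted by hand. The helper below is a hypothetical sketch (none of these function names come from the project): it URL-encodes the `page` value and escapes the URL for use inside an HTML attribute.

```javascript
// Build a beacon URL with properly encoded query parameters.
function beaconUrl(workerOrigin, endpoint, token, pagePath) {
  const url = new URL(endpoint, workerOrigin);
  url.searchParams.set("token", token);
  url.searchParams.set("page", pagePath);
  return url.toString();
}

// Escape characters that are unsafe inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;").replace(/</g, "&lt;");
}

// Return the image and stylesheet beacon tags for one page.
function beaconTags(workerOrigin, token, pagePath) {
  const pixel = escapeAttr(beaconUrl(workerOrigin, "/pixel.gif", token, pagePath));
  const css = escapeAttr(beaconUrl(workerOrigin, "/log.css", token, pagePath));
  return [
    `<img src="${pixel}" width="1" height="1" alt="" referrerpolicy="origin-when-cross-origin">`,
    `<link rel="stylesheet" href="${css}" referrerpolicy="origin-when-cross-origin">`,
  ].join("\n");
}
```

For example, `beaconTags("https://your-worker.workers.dev", "YOUR_TOKEN", "/blog/post")` returns the two tags with the page path encoded in each URL.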

Viewing the Data

Check the Crawler Visits page to see recent bot activity. Use the filters to find specific crawlers such as GPTBot, PerplexityBot, or ClaudeBot.
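
A filter of this kind can be as simple as a case-insensitive substring match on the user agent. The sketch below assumes each record returned by /api/logs has a `userAgent` field; that field name and both function names are hypothetical.

```javascript
// Fetch the stored visit records from the Worker.
async function fetchLogs(workerOrigin) {
  const res = await fetch(new URL("/api/logs", workerOrigin));
  if (!res.ok) throw new Error(`Log request failed: ${res.status}`);
  return res.json();
}

// Keep only records whose user agent contains the given name.
// An empty filter returns every record.
function filterByCrawler(records, name) {
  const needle = name.trim().toLowerCase();
  if (needle === "") return records;
  return records.filter((r) => (r.userAgent ?? "").toLowerCase().includes(needle));
}
```

Typing "gptbot" into such a filter would match any record whose user agent contains "GPTBot", regardless of case.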

Privacy Considerations

This tracking system records technical metadata (IP address, user agent, approximate location) and does not set cookies. It is designed for bot detection, not user tracking, but the beacons also fire for human visitors, and IP addresses can count as personal data under regulations such as the GDPR. Check your local privacy regulations and add appropriate disclosures if needed.

AI Training Bots

These bots crawl web content to train large language models:

GPTBot - OpenAI's web crawler for collecting model training data
ClaudeBot - Anthropic's crawler for Claude
PerplexityBot - Perplexity AI's search and training crawler
cohere-ai - Cohere's AI crawler

Search Engine Crawlers

Traditional search engines indexing content for search results:

Googlebot - Google's web crawler
Bingbot - Microsoft Bing's crawler
DuckDuckBot - DuckDuckGo's crawler
Baiduspider - Baidu's search crawler
YandexBot - Yandex's search crawler

SEO & Analytics Bots

Commercial services that analyze websites:

AhrefsBot - SEO analysis and backlink tracking
SemrushBot - SEO and competitive analysis
MJ12bot - Majestic SEO crawler

Social Media Bots

Platforms that preview links and fetch metadata:

facebookexternalhit - Facebook link previews
Twitterbot - Twitter/X card previews
LinkedInBot - LinkedIn link previews
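
The categories above can be applied to logged visits by matching the user agent against known tokens. This is a minimal sketch: the function name is hypothetical, matching is a plain case-insensitive substring test, and the tokens (in particular `ClaudeBot` and `cohere-ai`) are assumptions that should be checked against each vendor's published user-agent string.

```javascript
// Category name -> user-agent tokens, mirroring the lists above.
const CRAWLER_CATEGORIES = {
  "AI Training Bots": ["GPTBot", "ClaudeBot", "PerplexityBot", "cohere-ai"],
  "Search Engine Crawlers": ["Googlebot", "Bingbot", "DuckDuckBot", "Baiduspider", "YandexBot"],
  "SEO & Analytics Bots": ["AhrefsBot", "SemrushBot", "MJ12bot"],
  "Social Media Bots": ["facebookexternalhit", "Twitterbot", "LinkedInBot"],
};

// Return the first category whose token appears in the user agent.
function classifyUserAgent(userAgent) {
  const ua = (userAgent ?? "").toLowerCase();
  for (const [category, tokens] of Object.entries(CRAWLER_CATEGORIES)) {
    const token = tokens.find((t) => ua.includes(t.toLowerCase()));
    if (token) return { category, token };
  }
  return { category: "Unknown", token: null };
}
```

User-agent strings can be spoofed, so a match is a hint rather than proof; the ASN recorded with each visit can help confirm that a request really came from the claimed operator.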
