AI Crawler Tracker

Tutorials

Learn how to track AI crawlers and bots visiting your website.

Getting Started with Crawler Tracking

Last updated: November 2024

Why Track Crawlers?

AI companies and search engines regularly crawl websites to train models and index content. Understanding which bots visit your site helps you:

  • Monitor data collection by AI training bots (GPTBot, ClaudeBot, etc.)
  • Understand search engine indexing patterns
  • Identify unwanted or malicious crawlers
  • Analyze traffic sources beyond traditional analytics

How It Works

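This system combines three complementary tracking methods. Each one fires only when the client actually fetches the corresponding resource, so a crawler that downloads the raw HTML and nothing else will not appear in the logs: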

  1. 1x1 Pixel Image: A transparent GIF that is requested whenever a client fetches the page's images. Catches crawlers that download image resources.
  2. CSS Link Beacon: A stylesheet link that is requested when the client loads CSS. Catches bots that skip images but fetch stylesheets.
  3. JavaScript Pixel: A client-side script that creates a tracking image at runtime. Catches crawlers that execute JavaScript.

Implementation Overview

The tracking system consists of three components:

1. Cloudflare Worker

Handles tracking endpoints (/pixel.gif, /log.css, /api/logs) and stores visit data in Cloudflare KV.
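
As a rough sketch of the storage side, the helper below writes one visit record to a Workers KV namespace. The function name, the key layout, and the 30-day retention are assumptions made for illustration; only `put` with the `expirationTtl` option is taken from the Workers KV API.

```javascript
// Hypothetical helper: store one visit record in a Workers KV namespace.
// `kv` is a KV binding passed in by the caller (the binding name is up to you).
async function storeVisit(kv, record) {
  // Keys start with the ISO timestamp, so listing by the "visit:" prefix
  // returns them in chronological order (KV lists keys lexicographically).
  const suffix = Math.random().toString(36).slice(2, 10);
  const key = `visit:${record.timestamp}:${suffix}`;
  // Assumed retention policy: let entries expire after 30 days.
  await kv.put(key, JSON.stringify(record), { expirationTtl: 60 * 60 * 24 * 30 });
  return key;
}
```

The random suffix only keeps two visits with the same timestamp from overwriting each other.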

2. Tracking Beacons

Small code snippets embedded in your website that send requests to the Worker endpoints.

3. Dashboard (This Site)

A Next.js application that fetches data from the Worker and displays it in a filterable table.
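
The data-loading step of the dashboard might look roughly like the following. This is a sketch that assumes /api/logs returns a JSON array of records, each with an ISO 8601 `timestamp` string; the function name and the injectable `fetchImpl` parameter are hypothetical.

```javascript
// Hypothetical loader for the dashboard. fetchImpl defaults to the global
// fetch and can be replaced with a stub in tests.
async function loadVisits(workerOrigin, fetchImpl = fetch) {
  const response = await fetchImpl(`${workerOrigin}/api/logs`);
  if (!response.ok) {
    throw new Error(`Log request failed with status ${response.status}`);
  }
  const records = await response.json();
  // Newest first; ISO 8601 strings in the same time zone sort correctly as text.
  return [...records].sort((a, b) => b.timestamp.localeCompare(a.timestamp));
}
```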

Setting Up the Worker

Create a Cloudflare Worker with the following endpoints:

  • /pixel.gif - Returns a 1x1 transparent GIF
  • /log.css - Returns an empty CSS file
  • /api/logs - Returns stored visit records as JSON
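
A minimal sketch of how the three endpoints could be routed is shown here. The function names, the `loadRecords` callback, and the `no-store` cache header are assumptions; logging and storage are left out so the routing stays readable. In a real Worker this function would be called from the module's `fetch` handler.

```javascript
// A widely used base64 encoding of a 1x1 transparent GIF.
const PIXEL_BASE64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7";

function pixelBytes() {
  // atob yields a binary string; map each character to its byte value.
  return Uint8Array.from(atob(PIXEL_BASE64), (c) => c.charCodeAt(0));
}

// Hypothetical router. loadRecords is a caller-supplied function that
// returns the stored visit records (for example, read from KV).
async function route(request, loadRecords = async () => []) {
  const { pathname } = new URL(request.url);
  // Assumed choice: ask caches not to store beacons so repeat visits reach the Worker.
  const noStore = { "Cache-Control": "no-store" };
  if (pathname === "/pixel.gif") {
    return new Response(pixelBytes(), { headers: { ...noStore, "Content-Type": "image/gif" } });
  }
  if (pathname === "/log.css") {
    return new Response("", { headers: { ...noStore, "Content-Type": "text/css" } });
  }
  if (pathname === "/api/logs") {
    return new Response(JSON.stringify(await loadRecords()), {
      headers: { "Content-Type": "application/json" },
    });
  }
  return new Response("Not found", { status: 404 });
}
```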

For each request, the Worker records the timestamp, IP address, user agent, location (country/city), ASN, referer, and requested path.
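
The captured fields could be assembled along these lines. The record's field names are assumptions, as is the choice to prefer the beacon's `page` parameter over the endpoint path. On Cloudflare, `request.cf` supplies `country`, `city`, and `asn`, and the `CF-Connecting-IP` header supplies the client address.

```javascript
// Hypothetical record builder for one beacon request.
function buildVisitRecord(request, now = new Date()) {
  const url = new URL(request.url);
  const cf = request.cf || {}; // request.cf is undefined outside the Workers runtime
  return {
    timestamp: now.toISOString(),
    ip: request.headers.get("CF-Connecting-IP") || "",
    userAgent: request.headers.get("User-Agent") || "",
    country: cf.country || "",
    city: cf.city || "",
    asn: cf.asn ?? null,
    referer: request.headers.get("Referer") || "",
    // Assumption: use the page path sent by the beacon, else the endpoint path.
    path: url.searchParams.get("page") || url.pathname,
  };
}
```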

Adding Tracking to Your Site

Add the tracking beacons to your website's HTML:

<!-- 1. Pixel Image -->
<img src="https://your-worker.workers.dev/pixel.gif?token=YOUR_TOKEN&page=/page-path"
     width="1" height="1" alt="" referrerpolicy="origin-when-cross-origin">

<!-- 2. CSS Beacon -->
<link rel="stylesheet"
      href="https://your-worker.workers.dev/log.css?token=YOUR_TOKEN&page=/page-path"
      referrerpolicy="origin-when-cross-origin">

<!-- 3. JavaScript Beacon -->
<script>
(function(){
  var p = new Image();
  p.src = 'https://your-worker.workers.dev/pixel.gif?token=YOUR_TOKEN&page=' +
          encodeURIComponent(location.pathname);
})();
</script>
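
If the snippets are generated by a template rather than pasted by hand, the beacon URLs can be built with proper query encoding. The helper below is a hypothetical sketch; its name and parameters are not part of the original system.

```javascript
// Hypothetical helper that builds the beacon URLs for one page.
// URLSearchParams percent-encodes the token and page path.
function beaconUrls(workerOrigin, token, pagePath) {
  const query = new URLSearchParams({ token, page: pagePath }).toString();
  return {
    pixel: `${workerOrigin}/pixel.gif?${query}`,
    css: `${workerOrigin}/log.css?${query}`,
  };
}

const urls = beaconUrls("https://your-worker.workers.dev", "YOUR_TOKEN", "/page-path");
console.log(urls.pixel);
console.log(urls.css);
```

When these URLs are written into an HTML attribute, the `&` separator is best escaped as `&amp;`.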

Viewing the Data

Check the Crawler Visits page to see recent bot activity. Use the filters to find specific crawlers such as GPTBot, PerplexityBot, or ClaudeBot.
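
The filtering itself can be as simple as a substring match on the user agent. The following is a sketch of one way to do it, not the dashboard's actual code; the `userAgent` and `country` field names are assumptions.

```javascript
// Hypothetical filter: case-insensitive substring match on the user agent,
// plus an optional exact match on the country code.
function filterVisits(records, { bot = "", country = "" } = {}) {
  const needle = bot.toLowerCase();
  return records.filter((r) => {
    const ua = (r.userAgent || "").toLowerCase();
    return ua.includes(needle) && (country === "" || r.country === country);
  });
}
```

For example, `filterVisits(records, { bot: "gptbot" })` keeps only the rows whose user agent mentions GPTBot.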

Privacy Considerations

This tracking system records technical metadata (IP address, user agent, approximate location) and does not set cookies. It is intended for bot detection rather than user tracking, but the beacons also fire for human visitors, and IP addresses can count as personal data under regulations such as the GDPR. Check the privacy rules that apply to you and add appropriate disclosures if needed.
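
One way to reduce the sensitivity of stored addresses is to coarsen them before the record is written. The helper below is a hypothetical sketch and not part of the original system: it zeroes the last octet of an IPv4 address and, to stay simple, drops anything else (including IPv6) entirely.

```javascript
// Hypothetical helper: coarsen an IP address before storing it.
function maskIp(ip) {
  const parts = String(ip).split(".");
  const isIpv4 =
    parts.length === 4 && parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  // IPv4: keep the first three octets. Everything else is discarded in this sketch.
  return isIpv4 ? `${parts[0]}.${parts[1]}.${parts[2]}.0` : "";
}
```

The trade-off is that a masked address can no longer be checked against a crawler operator's published IP ranges.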

Understanding Different Bot Types

Last updated: November 2024

AI Training Bots

These bots crawl web content to train large language models:

  • GPTBot - OpenAI's web crawler, which collects content that may be used to train its models
  • ClaudeBot - Anthropic's web crawler, which collects content that may be used to train Claude
  • PerplexityBot - Perplexity AI's search and training crawler
  • cohere-ai - User agent associated with Cohere's AI crawling

Search Engine Crawlers

Traditional search engines indexing content for search results:

  • Googlebot - Google's web crawler
  • Bingbot - Microsoft Bing's crawler
  • DuckDuckBot - DuckDuckGo's crawler
  • Baiduspider - Baidu's search crawler
  • YandexBot - Yandex's search crawler

SEO & Analytics Bots

Commercial services that analyze websites:

  • AhrefsBot - SEO analysis and backlink tracking
  • SemrushBot - SEO and competitive analysis
  • MJ12bot - Majestic SEO crawler

Social Media Bots

Platforms that preview links and fetch metadata:

  • facebookexternalhit - Facebook link previews
  • Twitterbot - Twitter/X card previews
  • LinkedInBot - LinkedIn link previews
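
The categories above can be turned into a simple user-agent classifier. This is a minimal sketch: the substring patterns are derived from the bot names listed in this section, kept deliberately broad, and are not an exhaustive or authoritative list. User-agent strings can be spoofed, so a match identifies what a client claims to be, not what it is.

```javascript
// Substring patterns per category; matching is case-insensitive.
const BOT_CATEGORIES = [
  ["AI training", ["gptbot", "claude", "anthropic", "perplexitybot", "cohere"]],
  ["Search engine", ["googlebot", "bingbot", "duckduckbot", "baiduspider", "yandexbot"]],
  ["SEO & analytics", ["ahrefsbot", "semrushbot", "mj12bot"]],
  ["Social media", ["facebookexternalhit", "twitterbot", "linkedinbot"]],
];

// Returns the first category whose pattern appears in the user agent.
function classifyUserAgent(userAgent) {
  const ua = (userAgent || "").toLowerCase();
  for (const [category, patterns] of BOT_CATEGORIES) {
    if (patterns.some((p) => ua.includes(p))) return category;
  }
  return "Unknown";
}
```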