Getting Started with Crawler Tracking
Last updated: November 2024
Why Track Crawlers?
AI companies and search engines regularly crawl websites to train models and index content. Understanding which bots visit your site helps you:
- Monitor data collection by AI training bots (GPTBot, ClaudeBot, etc.)
- Understand search engine indexing patterns
- Identify unwanted or malicious crawlers
- Analyze traffic sources beyond traditional analytics
How It Works
This system uses three complementary tracking methods, because crawlers differ in which resources they fetch:
- 1x1 Pixel Image: A transparent GIF that loads when bots render images. Works for crawlers that process visual content.
- CSS Link Beacon: A stylesheet link that triggers when CSS is loaded. Catches bots that skip images but load styles.
- JavaScript Beacon: A client-side script that dynamically creates a tracking image. Captures crawlers that execute JavaScript.
Implementation Overview
The tracking system consists of three components:
1. Cloudflare Worker
Handles tracking endpoints (/pixel.gif, /log.css, /api/logs) and stores visit data in Cloudflare KV.
2. Tracking Beacons
Small code snippets embedded in your website that send requests to the Worker endpoints.
3. Dashboard (This Site)
A Next.js application that fetches data from the Worker and displays it in a filterable table.
Setting Up the Worker
Create a Cloudflare Worker with the following endpoints:
- /pixel.gif: Returns a 1x1 transparent GIF
- /log.css: Returns an empty CSS file
- /api/logs: Returns stored visit records as JSON
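The three endpoints could be routed roughly as follows. This is a minimal sketch rather than the actual Worker: the KV binding name `VISITS`, the `visit:` key prefix, and the stored field names are assumptions made for illustration, and token checking is left out because this guide does not describe how tokens are validated.

```javascript
// Widely used base64 encoding of a 1x1 transparent GIF.
const GIF_BASE64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7";

function gifBytes() {
  // Decode the base64 string into raw bytes for the response body.
  return Uint8Array.from(atob(GIF_BASE64), (c) => c.charCodeAt(0));
}

async function storeVisit(request, env) {
  const url = new URL(request.url);
  // Minimal record for this sketch; field names are assumptions.
  const record = {
    timestamp: new Date().toISOString(),
    userAgent: request.headers.get("User-Agent") || "",
    path: url.searchParams.get("page") || url.pathname,
  };
  // Hypothetical key scheme: prefix + timestamp + random suffix to avoid collisions.
  const suffix = Math.random().toString(36).slice(2);
  await env.VISITS.put(`visit:${record.timestamp}:${suffix}`, JSON.stringify(record));
}

async function listVisits(env) {
  // KV list() returns key names only, so each value is fetched separately.
  const { keys } = await env.VISITS.list({ prefix: "visit:", limit: 100 });
  const values = await Promise.all(keys.map((k) => env.VISITS.get(k.name)));
  return values.filter(Boolean).map((v) => JSON.parse(v));
}

const worker = {
  async fetch(request, env) {
    const { pathname } = new URL(request.url);
    // Discourage caching so repeat visits still reach the Worker.
    const noStore = { "Cache-Control": "no-store" };

    if (pathname === "/pixel.gif") {
      await storeVisit(request, env);
      return new Response(gifBytes(), {
        headers: { "Content-Type": "image/gif", ...noStore },
      });
    }
    if (pathname === "/log.css") {
      await storeVisit(request, env);
      return new Response("", {
        headers: { "Content-Type": "text/css", ...noStore },
      });
    }
    if (pathname === "/api/logs") {
      return new Response(JSON.stringify(await listVisits(env)), {
        headers: { "Content-Type": "application/json" },
      });
    }
    return new Response("Not found", { status: 404 });
  },
};
// In a module Worker this object would be the default export:
// export default worker;
```

Keeping the storage calls in `storeVisit` and `listVisits` means the routing logic can be exercised locally with an in-memory stand-in for the KV namespace.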
Each request captures: timestamp, IP, user agent, location (country/city), ASN, referer, and requested path.
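As an illustration of where those fields might come from, the sketch below builds one visit record from an incoming request. It assumes the Cloudflare runtime, where the client IP arrives in the `CF-Connecting-IP` header and `request.cf` carries `country`, `city`, and `asn`; the function name `buildVisitRecord` and the record's field names are hypothetical.

```javascript
// Build a single visit record from an incoming request.
// `now` is injectable so the function is easy to test.
function buildVisitRecord(request, now = new Date()) {
  const url = new URL(request.url);
  // request.cf is populated by Cloudflare at the edge; it may be absent elsewhere.
  const cf = request.cf || {};
  return {
    timestamp: now.toISOString(),
    ip: request.headers.get("CF-Connecting-IP"),
    userAgent: request.headers.get("User-Agent"),
    country: cf.country ?? null,
    city: cf.city ?? null,
    asn: cf.asn ?? null,
    referer: request.headers.get("Referer"),
    // The beacons pass the page path in the `page` query parameter;
    // fall back to the endpoint path if it is missing.
    path: url.searchParams.get("page") ?? url.pathname,
  };
}
```

Missing values are stored as `null` rather than dropped, so every record has the same shape when the dashboard renders it as a table.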
Adding Tracking to Your Site
Add the tracking beacons to your website's HTML:
<!-- 1. Pixel Image -->
<img src="https://your-worker.workers.dev/pixel.gif?token=YOUR_TOKEN&page=/page-path"
width="1" height="1" alt="" referrerpolicy="origin-when-cross-origin">
<!-- 2. CSS Beacon -->
<link rel="stylesheet"
href="https://your-worker.workers.dev/log.css?token=YOUR_TOKEN&page=/page-path"
referrerpolicy="origin-when-cross-origin">
<!-- 3. JavaScript Beacon -->
<script>
(function(){
var p = new Image();
p.src = 'https://your-worker.workers.dev/pixel.gif?token=YOUR_TOKEN&page=' +
encodeURIComponent(location.pathname);
})();
</script>
Viewing the Data
Check the Crawler Visits page to see recent bot activity. Use the filters to find specific crawlers like GPTBot, Perplexity, or Claude.
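The same kind of filtering can be done outside the dashboard. The sketch below fetches records from the Worker's /api/logs endpoint and keeps those whose user agent contains a search term. It assumes each record has a `userAgent` field, which this guide does not specify, and the function names are hypothetical.

```javascript
// Keep only records whose user agent contains the query (case-insensitive).
function filterVisits(records, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return records; // An empty query matches everything.
  return records.filter((r) => (r.userAgent || "").toLowerCase().includes(needle));
}

// Fetch stored visit records from the Worker.
async function loadVisits(workerBase) {
  const res = await fetch(new URL("/api/logs", workerBase));
  if (!res.ok) throw new Error(`Log request failed: ${res.status}`);
  return res.json();
}
```

For example, `filterVisits(await loadVisits("https://your-worker.workers.dev"), "gptbot")` would return only the visits whose user agent mentions GPTBot.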
Privacy Considerations
This tracking system captures technical metadata (IP address, user agent, approximate location) and doesn't use cookies. It's designed for bot detection, not user tracking, but the beacons also load for human visitors, and IP addresses can count as personal data under regulations such as GDPR. Check your local privacy regulations and add appropriate disclosures if needed.