METHODOLOGY

How the scans work

ARC Report runs one scan per brand per day at 02:00 UTC across 1,015 tracked brands. The scanner is HTTP-only and makes roughly 25 requests per brand: no browser automation, and no scraping of page content beyond the homepage and one product page. Everything published is reproducible from the requests described below. Last completed scan: 2026-06-20 06:31 UTC.

Machine-readable version: /methodology.md

The 9 tracked agents

User-Agent	Company	Used by
GPTBot	OpenAI	ChatGPT training
ChatGPT-User	OpenAI	ChatGPT live browsing
ClaudeBot	Anthropic	Claude training
Claude-Web	Anthropic	Claude live browsing
PerplexityBot	Perplexity	Perplexity / Comet
Google-Extended	Google	AI Mode / Gemini
Amazonbot	Amazon	Buy For Me
Bingbot	Microsoft	Copilot / Bing
CCBot	Common Crawl	Open training data

1. robots.txt parsing

We fetch /robots.txt and parse it with a standards-compliant parser, recording the effective rule for each of the 9 agents: explicitly allowed, explicitly blocked (Disallow), or no_rule (no mention; allowed by web convention). robots.txt is a policy declaration, so these verdicts are read directly from the file's text and are high-confidence.
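A minimal sketch of the three-way verdict, using Python's standard urllib.robotparser. The function names (named_agents, classify_agent) are hypothetical, and two simplifications are assumptions rather than ARC Report's actual logic: an agent counts as "mentioned" only if a User-agent line names it exactly, and "blocked" means the site root is disallowed for that agent.

```python
from urllib.robotparser import RobotFileParser


def named_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value in the file, lowercased."""
    names = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def classify_agent(robots_txt: str, agent: str) -> str:
    """Return 'allowed', 'blocked', or 'no_rule' for one agent.

    Assumption: wildcard (*) groups are ignored, so an agent that is
    never named gets 'no_rule' regardless of the wildcard rules.
    """
    if agent.lower() not in named_agents(robots_txt):
        return "no_rule"
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return "allowed" if parser.can_fetch(agent, "/") else "blocked"


SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /private/
"""

for name in ("GPTBot", "ClaudeBot", "CCBot"):
    print(name, classify_agent(SAMPLE, name))
```

In this sample GPTBot is named and disallowed at the root, ClaudeBot is named and allowed, and CCBot is never named.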

2. Live HTTP access tests

Policy and enforcement differ, so we also send real requests with each agent's User-Agent string against the homepage and one product page, comparing each response to a Chrome-baseline request:

HTTP 403/406/429 or a bot-challenge page → the agent is restricted (WAF/CDN enforcement, not robots.txt policy).
Response body under 25% of the Chrome baseline → treated as degraded/challenged → restricted.
Timeouts and network errors → inconclusive. Inconclusive results are never published as changes.
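The three rules above can be sketched as one pure function. This is an illustration under stated assumptions, not the scanner's code: the function name, the use of None for a timeout or network error, the "allowed" label for a passing response, and the challenge_page flag (how a bot-challenge page is recognised is not described here) are all hypothetical.

```python
from typing import Optional

RESTRICTED_STATUSES = {403, 406, 429}
DEGRADED_RATIO = 0.25  # body under 25% of the Chrome baseline


def classify_response(
    status: Optional[int],
    body_len: int,
    baseline_len: int,
    challenge_page: bool = False,
) -> str:
    """Classify one agent request against the Chrome-baseline request.

    status is None when the request timed out or hit a network error.
    """
    if status is None:
        return "inconclusive"
    if status in RESTRICTED_STATUSES or challenge_page:
        return "restricted"
    if baseline_len > 0 and body_len < DEGRADED_RATIO * baseline_len:
        return "restricted"  # degraded or challenged response
    return "allowed"


print(classify_response(403, 1_200, 80_000))
print(classify_response(200, 10_000, 80_000))
print(classify_response(None, 0, 80_000))
```

A body of exactly 25% of the baseline is not "under 25%", so in this sketch it passes.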

3. Structured data, protocol files, infrastructure

Structured data — JSON-LD blocks, Schema.org Product markup, Open Graph tags, sitemap.xml, and product feeds, detected from server-rendered HTML.
Protocol files — llms.txt (recording size and link count), agents.txt variants, and UCP endpoints.
Infrastructure — e-commerce platform, CDN, and WAF fingerprinting from headers, cookies, and markup signatures.
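As an illustration of detection from server-rendered HTML, the sketch below finds JSON-LD blocks, a Schema.org Product type inside them, and Open Graph tags using only the Python standard library. The class and function names are hypothetical, and it assumes Product markup is expressed in JSON-LD; microdata, sitemaps, and feeds are left out.

```python
import json
from html.parser import HTMLParser


class SignalParser(HTMLParser):
    """Collects JSON-LD script bodies and notes any og:* meta tag."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []
        self.has_open_graph = False
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld, self._buf = True, []
        elif tag == "meta" and (a.get("property") or "").lower().startswith("og:"):
            self.has_open_graph = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.jsonld_blocks.append("".join(self._buf))


def _has_type(node, wanted):
    """Recursively look for an @type equal to (or containing) wanted."""
    if isinstance(node, dict):
        t = node.get("@type")
        if wanted in (t if isinstance(t, list) else [t]):
            return True
        return any(_has_type(v, wanted) for v in node.values())
    if isinstance(node, list):
        return any(_has_type(v, wanted) for v in node)
    return False


def detect_signals(html: str) -> dict:
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    product = False
    for block in parser.jsonld_blocks:
        try:
            product = product or _has_type(json.loads(block), "Product")
        except json.JSONDecodeError:
            continue  # malformed block: still counts as a JSON-LD block
    return {
        "json_ld": bool(parser.jsonld_blocks),
        "schema_product": product,
        "open_graph": parser.has_open_graph,
    }


PAGE = (
    '<html><head><meta property="og:title" content="Shoe">'
    '<script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Product", "name": "Shoe"}'
    "</script></head><body></body></html>"
)
print(detect_signals(PAGE))
```

Because the parser only sees the HTML it is given, markup injected later by JavaScript would not appear in these results.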

The two-scan confirmation rule

Published changes follow a two-tier confirmation system.

Tier 1 (immediate): robots.txt rule changes. These are text-file diffs; if the rule changed, it changed.
Tier 2 (requires confirmation): HTTP access verdicts, blocked-agent counts, CDN/WAF detection, and structured-data presence. These are inferences that can flicker with timeouts, WAF moods, or CDN caches, so a Tier 2 change must appear in two consecutive daily scans before it is published to the changelog.

Scanner failures (timeouts, HTTP 429) are never published as brand changes.
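The publication rule can be sketched as a small decision function. The field name "robots_rule", the function name, and the convention that None stands for a scanner failure or inconclusive result are assumptions made for this example only.

```python
from typing import Optional

# Hypothetical field name for the Tier 1 signal (robots.txt rule changes).
TIER1_FIELDS = {"robots_rule"}


def should_publish(
    field: str,
    published: Optional[str],
    previous: Optional[str],
    latest: Optional[str],
) -> bool:
    """Decide whether the latest scan's value is published as a change.

    published: value currently shown in the changelog
    previous:  value from the scan before the latest one
    latest:    value from the latest scan (None = scanner failure)
    """
    if latest is None or latest == published:
        return False  # failures and non-changes are never published
    if field in TIER1_FIELDS:
        return True  # Tier 1: a text diff is published immediately
    return previous == latest  # Tier 2: two consecutive scans must agree


# Tier 1: published on first sight.
print(should_publish("robots_rule", "allowed", "allowed", "blocked"))
# Tier 2: first sighting is held back, second consecutive sighting publishes.
print(should_publish("http_access", "allowed", "allowed", "restricted"))
print(should_publish("http_access", "allowed", "restricted", "restricted"))
```

A value that flickers back to the published state on the second day never reaches the changelog, which is the point of the confirmation step.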

ARC Score v1.0

A 0–100 score summarizing how accessible a brand is to AI agents, computed from the latest scan. The component breakdown is always shown alongside the number.

Component	Points	Computation
Agent access breadth	50	Mean per-agent access over the 9 agents: allowed / no_rule = 1.0, inconclusive = 0.5, restricted = 0.25, blocked = 0 — × 50.
Structured data quality	25	JSON-LD 7 · Schema.org Product 7 · Open Graph 4 · sitemap 4 · product feed 3.
Protocol files	15	llms.txt present 6 (+3 if it contains links) · agents.txt 3 · UCP 3.
Scan stability	10	Share of the 9 per-agent checks returning a conclusive verdict in the latest scan — × 10. Measures confidence, not access.
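A worked sketch of the table above, with a sample brand. The function and key names are hypothetical, and two readings of the table are assumptions: every verdict other than inconclusive counts as conclusive for scan stability, and the +3 link bonus applies only when llms.txt is present. No rounding is applied because none is specified.

```python
ACCESS_WEIGHTS = {
    "allowed": 1.0, "no_rule": 1.0,
    "inconclusive": 0.5, "restricted": 0.25, "blocked": 0.0,
}
STRUCTURED_POINTS = {
    "json_ld": 7, "schema_product": 7,
    "open_graph": 4, "sitemap": 4, "product_feed": 3,
}


def arc_score(verdicts, structured, llms_txt, llms_has_links, agents_txt, ucp):
    """Return the four Score v1.0 components and their total.

    verdicts:   list of per-agent verdict strings (9 in a full scan)
    structured: set of detected keys from STRUCTURED_POINTS
    """
    access = sum(ACCESS_WEIGHTS[v] for v in verdicts) / len(verdicts) * 50
    data = sum(STRUCTURED_POINTS[k] for k in structured)
    protocol = 0
    if llms_txt:
        protocol += 6 + (3 if llms_has_links else 0)
    protocol += 3 if agents_txt else 0
    protocol += 3 if ucp else 0
    conclusive = sum(1 for v in verdicts if v != "inconclusive")
    stability = conclusive / len(verdicts) * 10
    return {
        "access": access, "structured": data,
        "protocol": protocol, "stability": stability,
        "total": access + data + protocol + stability,
    }


example = arc_score(
    verdicts=["allowed"] * 4 + ["no_rule"] * 2
    + ["restricted", "blocked", "inconclusive"],
    structured={"json_ld", "schema_product", "open_graph", "sitemap"},
    llms_txt=True, llms_has_links=True, agents_txt=False, ucp=False,
)
print(example)
```

For this sample the access mean is (6 + 0.25 + 0 + 0.5) / 9 = 0.75, giving 37.5 points; structured data gives 22, protocol files 9, and stability 8/9 × 10 ≈ 8.89, for a total of about 77.4.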

Versioning policy: the formula above is frozen as Score v1.0. Any change to weights or inputs ships as a new version with a changelog entry on this page, and the score version is included in all data downloads (score_version) and MCP responses so historical comparisons stay meaningful.

Known limitations

robots.txt declares policy; enforcement can differ. We measure both and label them separately (blocked vs restricted) — neither alone is the full story.
Structured-data detection reads server-rendered HTML; markup injected by JavaScript can be missed (we mark these signals as lower-confidence in the changelog).
WAF behaviour varies by region, time, and request fingerprint. The two-scan rule reduces flicker but cannot eliminate it; see /reliability.
UA-string tests approximate agent traffic; they don't execute JavaScript or replicate full agent behaviour.
One scan per day means sub-daily changes can be missed or appear as a single combined change.

Corrections & disputes

Think a data point is wrong? See the reliability page for the dispute process, or email hello@arcreport.ai with subject [DATA DISPUTE] your-domain.com.