How the scans work
ARC Report scans each of its 1,015 tracked brands once per day at 02:00 UTC. The scanner is HTTP-only and makes roughly 25 requests per brand. It uses no browser automation and fetches no page content beyond the homepage and one product page. Everything published is reproducible from the requests described below. Last completed scan: 2026-06-20 06:31 UTC.
Machine-readable version: /methodology.md
The 9 tracked agents
| User-Agent | Company | Used by |
|---|---|---|
| GPTBot | OpenAI | ChatGPT training |
| ChatGPT-User | OpenAI | ChatGPT live browsing |
| ClaudeBot | Anthropic | Claude training |
| Claude-Web | Anthropic | Claude live browsing |
| PerplexityBot | Perplexity | Perplexity / Comet |
| Google-Extended | Google | AI Mode / Gemini |
| Amazonbot | Amazon | Buy For Me |
| Bingbot | Microsoft | Copilot / Bing |
| CCBot | Common Crawl | Open training data |
1. robots.txt parsing
We fetch /robots.txt and parse it with a standards-compliant parser, recording the effective rule for each of the 9 agents: explicitly allowed, explicitly blocked (Disallow), or no_rule (no mention, so allowed by web convention). robots.txt is a policy declaration, so these verdicts are read directly from the file text and are high-confidence.
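The three-way classification above can be sketched as follows. This is a minimal illustration, not the scanner's actual parser: the function name `classify_agent` is hypothetical, it only looks at groups that name the agent explicitly, it only treats `Disallow: /` as a block, and it ignores how a `User-agent: *` group should be weighed, which the text above does not specify.

```python
def classify_agent(robots_txt: str, agent: str) -> str:
    """Return 'allowed', 'blocked', or 'no_rule' for `agent` (hypothetical helper)."""
    agent = agent.lower()
    groups = []              # list of (agent_names, rules)
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:        # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))

    # Assumption: only groups that name the agent explicitly count as a mention.
    named = [rs for names, rs in groups if agent in names]
    if not named:
        return "no_rule"
    if any(("disallow", "/") in rs for rs in named):
        return "blocked"
    return "allowed"


sample = """User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /private
"""
print({a: classify_agent(sample, a) for a in ("GPTBot", "ClaudeBot", "CCBot")})
```

In this sample, GPTBot has its own group with a site-wide Disallow, ClaudeBot has its own group with an Allow, and CCBot is not named at all.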
2. Live HTTP access tests
Policy and enforcement differ, so we also send real requests with each agent's User-Agent string against the homepage and one product page, comparing each response to a Chrome-baseline request:
- HTTP 403/406/429 or a bot-challenge page → the agent is restricted (WAF/CDN enforcement, not robots.txt policy).
- Response body under 25% of the Chrome baseline → treated as degraded/challenged → restricted.
- Timeouts and network errors → inconclusive. Inconclusive results are never published as changes.
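The decision rules in the list above can be written as one small function. This is a sketch under assumptions: the function name, its parameters, and the label "accessible" for a response that passes every check are hypothetical, and detecting a bot-challenge page is left to the caller as a boolean.

```python
from typing import Optional

RESTRICTED_STATUSES = {403, 406, 429}
MIN_BODY_RATIO = 0.25   # body under 25% of the Chrome baseline counts as degraded


def classify_response(status: Optional[int], body_len: int,
                      baseline_len: int, challenge_page: bool = False) -> str:
    """Classify one agent request against the Chrome-baseline response.

    `status` is None when the request timed out or hit a network error.
    """
    if status is None:
        return "inconclusive"
    if status in RESTRICTED_STATUSES or challenge_page:
        return "restricted"
    if baseline_len > 0 and body_len < MIN_BODY_RATIO * baseline_len:
        return "restricted"
    return "accessible"   # hypothetical label for "no restriction observed"


print(classify_response(403, 0, 120_000))        # blocked at the WAF/CDN
print(classify_response(200, 20_000, 120_000))   # body far smaller than baseline
print(classify_response(None, 0, 120_000))       # timeout
```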
3. Structured data, protocol files, infrastructure
- Structured data: JSON-LD blocks, Schema.org Product markup, Open Graph tags, sitemap.xml, and product feeds, detected from server-rendered HTML.
- Protocol files: llms.txt (recording size and link count), agents.txt variants, and UCP endpoints.
- Infrastructure: e-commerce platform, CDN, and WAF fingerprinting from headers, cookies, and markup signatures.
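As an illustration of detecting structured data from server-rendered HTML, here is a minimal sketch using Python's standard `html.parser`. The class and function names are hypothetical, and it covers only three of the signals listed above: JSON-LD blocks, a Schema.org Product type inside JSON-LD (microdata and RDFa are not checked), and Open Graph meta tags.

```python
import json
from html.parser import HTMLParser


class SignalParser(HTMLParser):
    """Collects JSON-LD blocks and notes Open Graph tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.jsonld_blocks = []
        self.has_open_graph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.has_open_graph = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            try:
                self.jsonld_blocks.append(json.loads("".join(self.buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: not counted


def has_product(node) -> bool:
    """True if any nested JSON-LD object declares @type Product."""
    if isinstance(node, dict):
        declared = node.get("@type")
        types = declared if isinstance(declared, list) else [declared]
        return "Product" in types or any(has_product(v) for v in node.values())
    if isinstance(node, list):
        return any(has_product(v) for v in node)
    return False


def detect_signals(html: str) -> dict:
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    return {
        "json_ld": bool(parser.jsonld_blocks),
        "product": has_product(parser.jsonld_blocks),
        "open_graph": parser.has_open_graph,
    }
```

Because the parser only sees the HTML returned by the server, markup injected later by JavaScript is invisible to it.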
The two-scan confirmation rule
Published changes follow a two-tier confirmation system:
- Tier 1 (immediate): robots.txt rule changes. These are text-file diffs; if the rule changed, it changed.
- Tier 2 (requires confirmation): HTTP access verdicts, blocked-agent counts, CDN/WAF detection, and structured-data presence. These are inferences that can flicker with timeouts, WAF behaviour, or CDN caches, so a Tier 2 change must appear in two consecutive daily scans before it is published to the changelog.
Scanner failures (timeouts, HTTP 429) are never published as brand changes.
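The publishing rule can be sketched as a single function over a signal's daily history. The function name and the representation of scans as a list of verdict strings (oldest first) are assumptions made for illustration; scanner failures are represented here as "inconclusive".

```python
def verdict_to_publish(published: str, scans: list, tier: int) -> str:
    """Return the value that should be published after the latest scan.

    published: the value currently shown for this signal
    scans:     daily verdicts, oldest first (hypothetical representation)
    tier:      1 for robots.txt rules, 2 for inferred signals
    """
    if not scans:
        return published
    latest = scans[-1]
    if latest == "inconclusive":          # scanner failure: never a brand change
        return published
    if tier == 1:                         # text-file diff: publish immediately
        return latest
    if len(scans) >= 2 and scans[-2] == latest:
        return latest                     # seen in two consecutive daily scans
    return published                      # wait for confirmation


# A single restricted result does not change the published verdict...
print(verdict_to_publish("allowed", ["allowed", "restricted"], tier=2))
# ...but the same result on the next day does.
print(verdict_to_publish("allowed", ["allowed", "restricted", "restricted"], tier=2))
```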
ARC Score v1.0
A 0–100 score summarizing how accessible a brand is to AI agents, computed from the latest scan. The component breakdown is always shown alongside the number.
| Component | Points | Computation |
|---|---|---|
| Agent access breadth | 50 | Mean per-agent access over the 9 agents: allowed / no_rule = 1.0, inconclusive = 0.5, restricted = 0.25, blocked = 0 — × 50. |
| Structured data quality | 25 | JSON-LD 7 · Schema.org Product 7 · Open Graph 4 · sitemap 4 · product feed 3. |
| Protocol files | 15 | llms.txt present 6 (+3 if it contains links) · agents.txt 3 · UCP 3. |
| Scan stability | 10 | Share of the 9 per-agent checks returning a conclusive verdict in the latest scan — × 10. Measures confidence, not access. |
Versioning policy: the formula above is frozen as Score v1.0. Any change to weights or inputs ships as a new version with a changelog entry on this page, and the score version is included in all data downloads (score_version) and MCP responses so historical comparisons stay meaningful.
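The component table can be turned into a short calculation. This is a sketch that assumes the per-agent verdicts and detected signals are already available; the function name and the `signals` dictionary keys are hypothetical, and because the page does not state a rounding rule, the sketch returns unrounded values.

```python
ACCESS_WEIGHT = {
    "allowed": 1.0, "no_rule": 1.0, "inconclusive": 0.5,
    "restricted": 0.25, "blocked": 0.0,
}
STRUCTURED_POINTS = {
    "json_ld": 7, "product": 7, "open_graph": 4, "sitemap": 4, "product_feed": 3,
}


def arc_score(verdicts: dict, signals: dict) -> dict:
    """Compute Score v1.0 components from per-agent verdicts and signals."""
    n = len(verdicts)
    access = 50 * sum(ACCESS_WEIGHT[v] for v in verdicts.values()) / n
    structured = sum(pts for key, pts in STRUCTURED_POINTS.items() if signals.get(key))

    protocol = 0
    if signals.get("llms_txt"):
        protocol += 6
        if signals.get("llms_txt_links", 0) > 0:   # +3 if llms.txt contains links
            protocol += 3
    if signals.get("agents_txt"):
        protocol += 3
    if signals.get("ucp"):
        protocol += 3

    conclusive = sum(1 for v in verdicts.values() if v != "inconclusive")
    stability = 10 * conclusive / n

    components = {
        "agent_access": access,
        "structured_data": structured,
        "protocol_files": protocol,
        "scan_stability": stability,
    }
    return {"score_version": "1.0", "components": components,
            "total": sum(components.values())}
```

As a worked example, a brand with 6 agents allowed or no_rule, 1 inconclusive, 1 restricted, and 1 blocked has a mean access of (6 + 0.5 + 0.25 + 0) / 9 = 0.75, so the access component is 0.75 × 50 = 37.5, and its stability component is 8/9 × 10 ≈ 8.9.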
Known limitations
- robots.txt declares policy; enforcement can differ. We measure both and label them separately (blocked vs restricted) — neither alone is the full story.
- Structured-data detection reads server-rendered HTML; markup injected by JavaScript can be missed (we mark these signals as lower-confidence in the changelog).
- WAF behaviour varies by region, time, and request fingerprint. The two-scan rule reduces flicker but cannot eliminate it; see /reliability.
- UA-string tests approximate agent traffic; they don't execute JavaScript or replicate full agent behaviour.
- One scan per day means sub-daily changes can be missed or appear as a single combined change.
Corrections & disputes
Think a data point is wrong? See the reliability page for the dispute process, or email hello@arcreport.ai with subject [DATA DISPUTE] your-domain.com.