How we score 5,576 stores against the bar AI shoppers actually use
Eight dimensions. One hundred points. Four bands. One disqualifying gate. Published in full so you can replicate every signal we measure — and so the brands at the top earn it.
Stores in cohort: 5,576
Live-audited: 5,535 (100.0%)
Re-running: 40
robots.txt disallow: 100
Band distribution (live-audited only)
80–100: 0 stores (0.0%)
65–79: 19 stores (0.3%)
45–64: 111 stores (2.0%)
0–44: 5,405 stores (97.7%)
What we measure, and why each thing matters to an AI shopping agent
Weights total 100. Each dimension is the weighted sum of multiple sub-signals; the spec is public at scripts/agent-readiness-rubric/spec/rubric.v2.json and is the source of truth for both our offline pipeline (Python) and the live audit service (TypeScript), with parity tests pinning them.
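A minimal sketch of the aggregation described above, assuming every sub-signal is normalised to the range 0 to 1 and that the sub-signal weights inside a dimension sum to 1. The `Dimension` class, the dimension keys and the sub-signal weights are hypothetical illustrations; the real values live in rubric.v2.json. Only the 12-point weight and the four equally weighted nodes of the Rich PDP schema dimension come from this page.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str                      # hypothetical key, not taken from rubric.v2.json
    points: float                  # this dimension's share of the 100-point total
    sub_weights: dict[str, float]  # relative sub-signal weights, assumed to sum to 1.0


def dimension_score(dim: Dimension, signals: dict[str, float]) -> float:
    """Weighted sum of sub-signals, scaled to the dimension's points."""
    total = 0.0
    for key, weight in dim.sub_weights.items():
        # Missing signals count as 0; values are clamped to the 0..1 range.
        value = min(1.0, max(0.0, signals.get(key, 0.0)))
        total += weight * value
    return dim.points * total


def total_score(dims: list[Dimension], signals_by_dim: dict[str, dict[str, float]]) -> float:
    """Sum of all dimension scores; at most 100 when the points sum to 100."""
    return sum(dimension_score(d, signals_by_dim.get(d.name, {})) for d in dims)


# Rich PDP schema: four nodes, each worth 3 of the 12 points.
rich_pdp = Dimension(
    "rich_pdp_schema",
    12,
    {
        "aggregate_rating": 0.25,
        "breadcrumb_list": 0.25,
        "faq_page": 0.25,
        "merchant_return_policy": 0.25,
    },
)
observed = {"aggregate_rating": 1.0, "breadcrumb_list": 1.0}
print(dimension_score(rich_pdp, observed))  # 6.0
```

A store that emits two of the four nodes earns half of the dimension's points under these assumptions.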
1
Product JSON-LD
Whether the product detail page (PDP) emits a schema.org Product node with name, sku/gtin, description and an image[] array.
How we measure: We fetch a PDP from your sitemap, parse every <script type="application/ld+json"> block (including @graph), find the Product node and check each required sub-field.
2
Offer clarity
Whether the Offer attached to the Product has price, priceCurrency, availability and priceValidUntil — and whether any sale window is still valid today.
How we measure: Read the Offer node from the same JSON-LD parse, validate that priceValidUntil parses as a date in the future, and cross-check price against your public catalog feed when available.
3
Catalog feed quality
Whether a structured product feed exists (/products.json on Shopify, /wp-json/wc/store/products on WooCommerce, etc.) and how much agent-relevant metadata it carries: GTIN, google_product_category, brand/vendor, taxonomy depth.
How we measure: GET the platform-canonical feed path; count GTIN coverage across products; check for google_product_category, brand/vendor presence, and ≥2-level categories.
4
Freshness and inventory
Whether the catalog and stock signals are current. Stale prices or stale inventory makes agents downweight you.
How we measure: Last-Modified header on PDP, snapshot recency of our cached catalog, inventory_count > 0 ratio, whether the platform has disable_checkout set.
5
Rich PDP schema
Four schema.org node types beyond Product — AggregateRating (unlocks ratings in answer cards), BreadcrumbList (category resolution), FAQPage (long-tail "does X work with Y" answers), MerchantReturnPolicy (agents weight returns risk into recommendations).
How we measure: Each of the four nodes contributes 3 of the 12 points. Found via the same @graph-aware JSON-LD parse used for dim 1.
6
Image and media
≥3 product images, absolute https URLs, alt text, and a populated schema.org image[] array.
How we measure: Count <img> tags on the PDP; fraction with absolute https src; fraction with alt= attribute; length of schema image array on Product.
7
Agent commerce surface
Depth of your UCP profile (`/.well-known/ucp`). Basic presence is now table stakes (Shopify defaults provide it). What matters: declared capabilities, whether you advertise an MCP service in `ucp.services`, signing keys for RFC 9421 request authentication, merchant-hosted vs platform-hosted, and the presence of an A2A `/.well-known/agent-card.json`.
How we measure: Fetch /.well-known/ucp, parse JSON, count capabilities, scan ucp.services for an mcp-typed endpoint, count signing_keys[], detect merchant-vs-xpay-hosted. Also probe /.well-known/agent-card.json and /.well-known/oauth-protected-resource. Penalised if any deprecated/fictitious well-known URI is served (we maintain a denylist).
8
Agent accessibility
Whether shopping agents can actually reach you. robots.txt rules per tracked AI user-agent, meta robots consistency, no Cloudflare bot-fight challenge, no X-Robots-Tag: noai, no IPTC noai/noimageai on PDP images.
How we measure: Parse robots.txt per RFC 9309, check per-UA Disallow rules against a 12-agent allowlist (GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot, ClaudeBot, Google-Extended, Applebot-Extended, Amazonbot, …). Probe meta tags + response headers.
Blocking shopping agents caps the score
If a store’s robots.txt serves Disallow: / against any of GPTBot, OAI-SearchBot, PerplexityBot or ClaudeBot, the score is capped at 65 (top of largely_ready) and dimension 8 (Agent accessibility) is forced to zero — regardless of how good the rest of the rubric scores.
Rationale: a store that publicly opts out of agentic discovery shouldn’t appear on a leaderboard claiming to predict which brands AI shoppers will find. They’ve chosen to be invisible. 100 of 5,576 stores in this cohort currently hit the gate.
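The gate can be sketched as below. This is a simplified robots.txt reader, not a full RFC 9309 implementation: it ignores path wildcards and Allow overrides and only looks for a literal `Disallow: /`. Falling back to the `*` group when an agent has no group of its own follows RFC 9309 group selection, but whether the rubric counts that case as a gate trigger is an assumption. The `agent_accessibility` key is a hypothetical name for dimension 8.

```python
GATED_AGENTS = ("GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot")
SCORE_CAP = 65  # cap stated on this page


def parse_groups(robots_txt: str) -> list[tuple[list[str], list[tuple[str, str]]]]:
    """Split robots.txt into (user-agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a user-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def blocks_entire_site(robots_txt: str, agent: str) -> bool:
    """True if the group that applies to `agent` contains `Disallow: /`."""
    groups = parse_groups(robots_txt)
    name = agent.lower()
    matching = [rules for agents, rules in groups if name in agents]
    if not matching:  # assumption: fall back to the wildcard group
        matching = [rules for agents, rules in groups if "*" in agents]
    return any(rule == ("disallow", "/") for rules in matching for rule in rules)


def apply_gate(dimension_scores: dict[str, float], robots_txt: str) -> float:
    """Zero dimension 8 and cap the total when any gated agent is blocked."""
    gated = any(blocks_entire_site(robots_txt, a) for a in GATED_AGENTS)
    scores = dict(dimension_scores)
    if gated:
        scores["agent_accessibility"] = 0.0  # hypothetical key for dimension 8
    total = sum(scores.values())
    return min(total, SCORE_CAP) if gated else total
```

With a robots.txt that disallows `/` for GPTBot only, a store whose other dimensions sum to 80 would come out at 65 under this sketch.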
Honest limits, explicit gaps
UCP presence alone is not agent-ready
Every default Shopify store passes the basic UCP probe. The rubric reflects this — UCP only contributes meaningfully via depth signals (capabilities count, MCP advertisement, signing keys). Stores running default Shopify with no other agent-readable surface land in `emerging`, not `agent_ready`.
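A rough sketch of how the depth signals could be read from an already-fetched profile. The JSON layout is an assumption: this page names `ucp.services` and `signing_keys[]` but not their exact shape, so the sketch looks for `signing_keys` at either the top level or under `ucp`, and treats any `mcp` key, or any `type`/`transport` value equal to `mcp`, inside `ucp.services` as an MCP advertisement. The function names are hypothetical.

```python
import json


def has_mcp(node) -> bool:
    """Recursively look for an 'mcp' key or an 'mcp' type/transport value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() == "mcp":
                return True
            if key.lower() in ("type", "transport") and isinstance(value, str):
                if value.lower() == "mcp":
                    return True
            if has_mcp(value):
                return True
    elif isinstance(node, list):
        return any(has_mcp(item) for item in node)
    return False


def ucp_depth_signals(profile_text: str) -> dict:
    """Count the depth signals in the body served at /.well-known/ucp."""
    empty = {"capability_count": 0, "advertises_mcp": False, "signing_key_count": 0}
    try:
        profile = json.loads(profile_text)
    except json.JSONDecodeError:
        return empty
    if not isinstance(profile, dict):
        return empty
    ucp = profile.get("ucp")
    if not isinstance(ucp, dict):
        ucp = {}
    # Assumed locations: the key list may sit at the top level or under "ucp".
    keys = profile.get("signing_keys") or ucp.get("signing_keys") or []
    return {
        "capability_count": len(ucp.get("capabilities") or []),
        "advertises_mcp": has_mcp(ucp.get("services") or {}),
        "signing_key_count": len(keys),
    }
```

A profile with an empty capability list, no MCP service and no signing keys returns zeros on all three signals, which is the case the paragraph above describes as presence without depth.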
Pending-audit stores carry an offline-only baseline
Live audit coverage is currently 5,535 of 5,576 (100.0%). The remaining 40 carry an offline-only score derived from cached catalog signals; we mark them "Live audit pending" and refresh within 24h.
Headless-rendered schema is detected, but treated cautiously
When static HTML returns no JSON-LD, we re-probe via a headless render to catch JS-injected schema. Static-served schema scores higher than JS-injected — crawler reliability differs, and we publish the distinction so merchants moving to SSR see the bump.
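The static pass might look like the following sketch, which uses only the Python standard library. It collects every `application/ld+json` block, walks `@graph` arrays and nested properties, and then applies the 3-points-per-node rule from dimension 5. The class and function names are hypothetical, and descending into nested properties (for example an AggregateRating inside a Product) is an assumption about how the real parser behaves. An empty result is the case that would trigger the headless re-probe.

```python
import json
from html.parser import HTMLParser

RICH_TYPES = ("AggregateRating", "BreadcrumbList", "FAQPage", "MerchantReturnPolicy")


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def iter_nodes(value):
    """Yield every object in a JSON-LD value, including @graph members."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def types_in_html(html: str) -> set[str]:
    """Set of @type names found in the statically served JSON-LD."""
    parser = JsonLdExtractor()
    parser.feed(html)
    found = set()
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than fail the audit
        for node in iter_nodes(data):
            declared = node.get("@type")
            for name in declared if isinstance(declared, list) else [declared]:
                if isinstance(name, str):
                    found.add(name)
    return found


def rich_schema_points(types: set[str]) -> int:
    """3 points for each of the four rich node types that is present."""
    return 3 * sum(1 for t in RICH_TYPES if t in types)
```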
LLM narrative is rubric-coupled, not free-form
The per-store narrative published in the index is grounded in the rubric output; it doesn't make claims the dimension scores can't support. Hand-eval calibration is the prerequisite for the LLM enrichment refresh.
Build the score yourself
Spec is one JSON file
scripts/agent-readiness-rubric/spec/rubric.v2.json holds every weight, threshold, band cutoff, agent allowlist entry and gate trigger. The Python and TypeScript scorers both load it; parity tests pin the outputs.
Live audit endpoint is public
audit.xpay.sh/api/v2/audit?url=<your-store> returns the same RubricResult shape every score on this site is derived from, plus the per-check evidence we extracted.
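A minimal sketch of calling the endpoint from Python. It assumes the endpoint is served over https, accepts an unauthenticated GET and returns a JSON body. None of that is stated on this page beyond the URL itself, and the fields of `RubricResult` are not listed here, so the sketch returns the parsed JSON without interpreting it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AUDIT_ENDPOINT = "https://audit.xpay.sh/api/v2/audit"  # https scheme is assumed


def audit_url(store_url: str) -> str:
    """Build the audit URL with the store address percent-encoded."""
    return f"{AUDIT_ENDPOINT}?{urlencode({'url': store_url})}"


def fetch_audit(store_url: str, timeout: float = 30.0) -> dict:
    """Fetch the audit result for one store and parse the JSON body."""
    with urlopen(audit_url(store_url), timeout=timeout) as response:
        return json.load(response)
```

Encoding the store address matters because it usually contains `:` and `/`, which would otherwise be ambiguous inside a query string.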
Hand-eval calibration is open
The next checkpoint is the Spearman ρ between human hand-scoring and the v2 rubric. Expect the score distribution to shift as calibration builds confidence in the rubric.
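For anyone running their own calibration, Spearman ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below implements that definition in plain Python; the sample scores are invented for illustration and are not calibration results.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation computed on the ranks of xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / sqrt(vx * vy)


# Invented example: five stores scored by a human and by the rubric.
human_scores = [72, 40, 55, 18, 63]
rubric_scores = [68, 35, 58, 22, 49]
rho = spearman_rho(human_scores, rubric_scores)
```

Because ρ depends only on rank order, it checks whether the rubric sorts stores the way a human reviewer would, not whether the absolute scores match.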
