How Quickly Can You Test Which Products AI Agents Prefer? A 6-Step Shopify Experiment
Most Shopify merchants have no idea which of their products AI recommends. They've never checked. If you run one structured week of testing, you'll have data your competitors don't. Build on it.
This isn't theory. It's a repeatable experiment you can run with a free ChatGPT account, Perplexity, and Google search. No developer needed. No paid tools. Just a spreadsheet and about 90 minutes of setup.
Here's how to do it.
Why Does It Matter Which Products AI Agents Recommend?
AI shopping assistants are already directing real purchase decisions, and the share is growing fast.
Gartner projected in early 2024 that traditional search engine volume would drop by 25% before 2026, driven largely by AI-generated answers absorbing queries that used to go to Google (Gartner, 2024). Salesforce reported that generative AI influenced more than $199 billion in global online sales during the 2023 holiday season alone (Salesforce Research, 2024).
When someone asks ChatGPT "what's the best running shoe for wide feet under $150," your store either shows up or it doesn't. There's no page 2 in an AI answer. You're in the response or you're invisible.
The difference between being recommended and being skipped comes down to your product data. But most merchants don't know where they stand. That's exactly what this experiment fixes.
What Do You Need Before You Start?
The setup is simpler than you'd expect. You don't need a dev, a six-figure SaaS subscription, or two weeks of planning.
Here's the full list:
- Shopify admin access with product export permissions
- Google Sheets or Excel
- Free accounts on ChatGPT and Perplexity, plus standard Google search (AI Overviews are on by default for most users)
- 20-30 products you want to test
- About 90 minutes for initial setup
That's it. Seven days of 20-30 minute daily check-ins, then a final analysis session. You'll have real data on which products AI platforms recommend and which ones they ignore.
How Do You Choose Which Products to Test?
Don't test randomly. Structure matters here.
Pick 20-30 products that represent a real spread of your catalog. Then divide them into three groups based on content quality:
| Group | What It Looks Like | Expected AI Visibility |
|---|---|---|
| Group A | Full descriptions, specs, use cases, reviews, structured data, clear category tags | Highest |
| Group B | Decent descriptions, some gaps in attributes, limited reviews | Medium |
| Group C | Thin content, missing specs, no reviews, vague category info | Lowest |
This grouping turns the experiment from a random data dump into a real hypothesis test. If Group A products consistently outperform Group C, you've proven that data quality drives AI visibility in your specific catalog. That's actionable. Random data isn't.
I've run versions of this with a lot of Shopify stores. The Group A vs. Group C gap is almost always bigger than merchants expect.
What's the 6-Step Process for Running the Experiment?
Here's the full process, day by day.
Step 1: Export Your Catalog to CSV
In Shopify admin, go to Products > Export. Download all products as a CSV. Open it in Google Sheets.
The columns you care about: product title, description (body_html), product type, tags, price, and any metafields you've set. Filter for your 20-30 test products and paste them into a new tab called Test Set. Add three more columns: Group (A/B/C), Mention Rate, and Notes.
This CSV becomes your experiment baseline. Keep it. You'll want to compare it after you make improvements.
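If you'd rather script this step than filter by hand (entirely optional; Sheets works fine), here's a minimal Python sketch. The column names match Shopify's standard product export, but the test-product titles are hypothetical placeholders; swap in your own.

```python
import csv

# Titles of the 20-30 test products (hypothetical examples; use your own)
TEST_TITLES = {"Trail Runner Pro", "Everyday Canvas Tote", "Merino Base Layer"}

# Columns from Shopify's standard export, plus the three tracking columns
FIELDS = ["Title", "Body (HTML)", "Type", "Tags", "Variant Price",
          "Group", "Mention Rate", "Notes"]

with open("products_export.csv", newline="", encoding="utf-8") as src, \
     open("test_set.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.DictWriter(dst, fieldnames=FIELDS)
    writer.writeheader()
    for row in csv.DictReader(src):
        # Shopify repeats products across variant rows; Title is only filled
        # on the first row, so this keeps one row per matching product
        if row.get("Title") in TEST_TITLES:
            writer.writerow({f: row.get(f, "") for f in FIELDS})
```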
Step 2: Build Your Test Prompts
Think like a shopper, not a product manager. Real AI queries are conversational and specific.
Some prompt patterns that work well:
- "What's the best [product type] for [use case] under $[price]?"
- "Which [category] would you recommend for someone who [specific need]?"
- "What do experts recommend for [specific problem] in [context]?"
- "Compare the top [product category] options for [audience]."
Write 5-10 prompts per product category. Don't use your store name or brand name in any prompt. You want to see organic recommendations, not branded searches. Branded queries tell you about brand recognition, not AI discoverability.
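If you'd rather generate the prompt list than type each variation by hand, a short sketch like this does it. Every category, need, and price point below is a placeholder; substitute your own catalog's.

```python
from itertools import product

# Hypothetical categories, shopper needs, and budgets; swap in your own
categories = ["running shoe", "rain jacket"]
needs = ["wide feet", "daily commuting", "cold-weather hiking"]
budgets = [100, 150]

templates = [
    "What's the best {cat} for {need} under ${price}?",
    "Compare the top {cat} options for {need}.",
]

# One prompt per template/category/need/budget combination
prompts = [t.format(cat=c, need=n, price=b)
           for t, c, n, b in product(templates, categories, needs, budgets)]

for p in prompts:
    print(p)  # paste these into ChatGPT, Perplexity, and Google one at a time
```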
Step 3: Run Queries Across All Three Platforms
Open ChatGPT, Perplexity, and Google in separate browser tabs. Paste each test prompt and capture the results. Screenshots work. A copy-paste log into your spreadsheet works too.
Do this on three separate days during the week (Monday, Wednesday, and Friday, for example). AI answers aren't perfectly consistent. A single run can miss products that would appear 40% of the time. Three runs give you a real distribution.
One thing to watch: ChatGPT Shopping results may vary based on whether you're logged in and what region you're in. Run from the same browser profile each time.
Step 4: Log What You See
For each query, record these five data points in your spreadsheet:
- Platform (ChatGPT / Perplexity / Google AI)
- Product mentioned? (Yes / No)
- Which product (if yes)
- Position in list (1st, 2nd, 3rd recommendation, etc.)
- Source cited? (Was your website listed as the source?)
Also note if a competitor showed up in your place. That data point is just as useful as tracking your own appearances.
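If you'd prefer a script-friendly log over a spreadsheet tab, a flat CSV with one row per query per platform works. The column names here are assumptions; mirror whatever your sheet uses.

```python
import csv
from datetime import date

# One row per query per platform; column names are assumptions
LOG_FIELDS = ["date", "platform", "prompt", "mentioned",
              "product", "position", "source_cited", "competitor"]

def log_result(path, platform, prompt, mentioned,
               product="", position="", source_cited="", competitor=""):
    """Append one query observation to the running log."""
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if f.tell() == 0:  # brand-new file: write the header row first
            writer.writeheader()
        writer.writerow({
            "date": date.today().isoformat(), "platform": platform,
            "prompt": prompt, "mentioned": mentioned, "product": product,
            "position": position, "source_cited": source_cited,
            "competitor": competitor,
        })

# Example: Perplexity recommended your product second and cited your store
log_result("ai_log.csv", "Perplexity",
           "What's the best running shoe for wide feet under $150?",
           "Yes", product="Trail Runner Pro", position=2, source_cited="Yes")
```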
Step 5: Calculate Mention Rates
After three test days, tally the results for each product on each platform.
The formula: Mention Rate = Mentions / Total Queries
Example: Product A appeared in 9 of 15 total queries across three platforms = 60% mention rate. Product B appeared in 2 of 15 = 13%.
Break this down by platform. A product with a 55% rate on Perplexity and 5% on ChatGPT has a platform-specific problem, and the fix is different from the fix for a product that's low across all three. This distinction matters for where you focus the work.
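If your log lives in the CSV from Step 4, the math is a few lines. A sketch, assuming one log row per query per platform:

```python
import csv
from collections import defaultdict

totals = defaultdict(int)  # platform -> queries run on that platform
hits = defaultdict(int)    # (product, platform) -> times mentioned

with open("ai_log.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        totals[row["platform"]] += 1  # assumes one log row per query
        if row["mentioned"] == "Yes":
            hits[(row["product"], row["platform"])] += 1

# Mention rate = mentions / total queries, broken down per platform
for (product, platform), n in sorted(hits.items()):
    print(f"{product:30} {platform:12} {n / totals[platform]:.0%}")
```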
Step 6: Find the Pattern
This is where the experiment pays off. Pull up your Group A, B, and C products side by side. Look at what the high-scorers have that the low-scorers don't.
Common patterns from stores I've tested:
- Products with longer, more specific descriptions tend to get recommended more often
- Products with clear use-case language ("ideal for X, works well when Y") outperform generic listings
- Products with visible review counts surface more reliably on Perplexity, which pulls from web content
- Google AI Overviews tend to favor products from pages with strong supporting content: buying guides, comparison articles, FAQs
- ChatGPT Shopping relies more heavily on structured product data and feed quality
Each platform has different signals. This experiment shows you which gap you need to close first.
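To put a number on the Group A vs. Group C gap, join your mention rates back to the group labels from Step 1. A sketch, assuming you've filled in the Group column in test_set.csv; the rates below are hypothetical stand-ins for your Step 5 output.

```python
import csv
from collections import defaultdict

# product -> overall mention rate, e.g. from the Step 5 script (hypothetical)
rates = {"Trail Runner Pro": 0.60, "Everyday Canvas Tote": 0.13,
         "Merino Base Layer": 0.07}

by_group = defaultdict(list)
with open("test_set.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row["Title"] in rates:
            by_group[row["Group"]].append(rates[row["Title"]])

for group in sorted(by_group):
    vals = by_group[group]
    print(f"Group {group}: {sum(vals) / len(vals):.0%} average "
          f"across {len(vals)} products")
```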
What Metrics Should You Track to Make This Useful?
Three numbers tell the real story.
Mention Rate is your primary metric: how often does a given product appear when relevant prompts are run? Anything above 30% for your best products is a solid starting point.
Citation Rate measures how often your actual website is linked or referenced as the source of an AI recommendation. A product can get mentioned by name without your store being credited. That's a different kind of visibility gap, and it matters for traffic.
Platform Spread tells you whether your visibility is concentrated on one AI tool or distributed across all three. A merchant showing up well on Perplexity but invisible on ChatGPT Shopping is missing a significant and growing traffic channel (Search Engine Land, 2024).
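Both secondary metrics fall out of the same query log. A sketch, assuming the log columns from Step 4:

```python
import csv
from collections import defaultdict

mentions = defaultdict(int)  # platform -> queries where your product appeared
cited = defaultdict(int)     # platform -> those mentions crediting your site

with open("ai_log.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if row["mentioned"] == "Yes":
            mentions[row["platform"]] += 1
            if row["source_cited"] == "Yes":
                cited[row["platform"]] += 1

for platform, m in sorted(mentions.items()):
    # Citation rate: how often a mention actually links back to your store
    print(f"{platform}: {m} mentions, {cited[platform] / m:.0%} citation rate")
# The distribution of mention counts across platforms is your platform spread
```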
What Does a Good Result Look Like, and What's a Warning Sign?
Here's the data.
Across testing we've done with Shopify stores in several product categories, stores with strong product data (thorough descriptions, attribute tags, solid review volume, structured metadata) hit mention rates of 30-50% for their best listings. Stores with thin content rarely clear 10%.
The median mention rate across all products we've tracked is around 12%. The top quartile averages 38%. The gap between the two isn't brand recognition, ad spend, or domain authority. It's data quality. Almost every time.
A warning sign: if your Group A products score at or below your Group C products, the issue might be your product type rather than your content. Commodity products where price is the only differentiator are harder for AI to form a preference about. That's a different problem requiring a different fix (usually building supporting content around the products rather than just improving the listing itself).
What's on the Results Checklist After Week One?
Before you call the experiment done, work through this list:
- Mention rates calculated for all 20-30 products across all three platforms
- Group A vs. Group C gap measured. Is there a clear difference? How large?
- Platform spread documented. Where are you strong and where are you invisible?
- Citation rate tracked. Are you being recommended without being credited?
- Competitor appearances logged. Which competitors are AI tools recommending in your place?
- Top 5 "fixable" products identified. Which products could move from Group C or B to Group A with realistic effort?
- Baseline saved. Your CSV and spreadsheet are the baseline. You'll need them for round two in 4-6 weeks.
The experiment doesn't end at week one. It becomes a benchmark. Make the fixes, wait 4-6 weeks, rerun the queries, and measure the delta. That's when you move from hypothesis to proof.
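When round two comes around, the delta is a simple before-and-after comparison against the saved baseline. A sketch with hypothetical rates:

```python
# Hypothetical per-product mention rates saved from each round
baseline = {"Trail Runner Pro": 0.20, "Everyday Canvas Tote": 0.13}
round_two = {"Trail Runner Pro": 0.47, "Everyday Canvas Tote": 0.13}

for product, before in baseline.items():
    after = round_two.get(product, 0.0)
    print(f"{product}: {before:.0%} -> {after:.0%} ({after - before:+.0%})")
```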
Frequently Asked Questions
How long does this AI product visibility experiment take to run?
Setup is about 90 minutes. Daily queries take 20-30 minutes per day. Total commitment across 7 days is roughly 4-5 hours. No developer required.
Do I need paid accounts on ChatGPT or Perplexity to run this test?
No. ChatGPT's free tier, Perplexity's free search, and standard Google search (AI Overviews show for most users) are enough. Paid accounts let you run more queries per day but aren't required for a valid experiment.
What should I do if none of my products appear in AI results at all?
Zero appearances is actually the most useful result you can get. It means the gap is structural: missing product attributes, thin descriptions, or no structured data on the page. Start by improving your top 5 products first, then retest in 4-6 weeks. Don't try to fix everything at once.
How often should I repeat this experiment?
Run it before making changes to your product data, then again 4-6 weeks after. The change in mention rate tells you whether your fixes worked. A quarterly testing cadence is reasonable for ongoing monitoring once you've established a baseline.
Does this experiment work for every type of Shopify store?
It works best for stores with specific, describable products: apparel, gear, home goods, supplements, beauty, outdoor equipment, and similar categories. Stores selling commodity products where price is the primary differentiator will see less variation driven by content quality. If AI tools don't have a reason to prefer your version of the product, improving the description alone won't move the needle much.
Want to Know Where Your Store Actually Stands?
We've built a full AI Commerce readiness audit for Shopify stores. See which of your products are showing up in AI recommendations, which are invisible, and what to fix first. No guesswork. Just data.
Get Your AI Visibility Audit
