AI prompt-tracking tools promise to measure how often your brand appears in ChatGPT, Claude, Gemini, or Perplexity, like “SEO rank tracking” for Google.
The idea sounds great, but the approach has problems that rarely get discussed.
TL;DR
I expose the flawed approach of AI prompt-tracking tools like Profound, Otterly.ai, and Keyword.com, which claim to measure brand visibility in AI conversations. These tools fail to capture real user behavior: they test artificial conversations, lack statistical validity, are volatile, favor established brands, and can’t accurately reflect changes in smaller brands’ visibility. Instead, focus on creating high-quality, easily citable content, building authority, and maintaining a consistent brand identity to increase your chances of being cited in AI recommendations. For small brands, spot-checks and tracking actual conversions are more effective, while large enterprises should use these tools for broad trend-watching, not tactical decisions.
How the tools operate
Most prompt-tracking tools follow some version of the same workflow (see the sketch after this list):
- Enter your brand and competitors
- Generate category prompts (e.g., “Best project management tools for remote teams”)
- Run each prompt several times
- Aggregate the responses into a “visibility percentage” and compare over time.
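A minimal sketch of that core loop, with the model call stubbed out, looks roughly like this. The brand names, prompt, and run count are illustrative assumptions, not any vendor’s actual implementation:

```python
# A rough sketch of the tracker's core loop; the model call is a stub.
import random

BRANDS = ["Asana", "Trello", "YourBrand"]  # hypothetical brand set
PROMPT = "Best project management tools for remote teams"

def ask_model(prompt: str) -> str:
    """Stand-in for a real ChatGPT/Claude/Gemini call; returns a fake answer."""
    return ", ".join(random.sample(BRANDS, k=2))

def visibility(prompt: str, brands: list[str], runs: int = 20) -> dict[str, float]:
    """Share of runs in which each brand is mentioned at least once."""
    answers = [ask_model(prompt) for _ in range(runs)]
    return {
        b: sum(b.lower() in a.lower() for a in answers) / runs
        for b in brands
    }

print(visibility(PROMPT, BRANDS))  # e.g. {'Asana': 0.7, 'Trello': 0.65, ...}
```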
It seems logical, but that’s not how users behave or how the AI assistants work.
Here’s why:
1. Prompt-tracking tools simulate conversations that never happen
LLM recommendations emerge through conversational context, not from a single one-shot prompt.
Real usage follows a pattern:
- User asks a vague question
- The AI clarifies
- The user specifies needs (budget, team size, industry)
- Only then does the model recommend specific brands
Most brand mentions happen after that gradual refinement. Prompt trackers test zero-context, one-shot questions. A “blank canvas” test doesn’t reflect how users interact.
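To make the contrast concrete, here is an illustrative comparison of what a tracker sends versus what a real session looks like by the time a brand gets named. All of the content is made up:

```python
# Illustrative message structures only, not real user data.

one_shot = [  # what a prompt tracker tests
    {"role": "user", "content": "Best project management tools for remote teams"},
]

real_session = [  # how brand mentions usually arise
    {"role": "user", "content": "My team keeps missing deadlines. Any ideas?"},
    {"role": "assistant", "content": "How big is the team, and what is your budget?"},
    {"role": "user", "content": "12 people, mostly async across time zones, about $10 per user."},
    # Only at this point does the model tend to name specific products,
    # and which names come up depends on everything said above.
]
```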
2. Real measurement is statistically impossible
To track visibility accurately, you’d need to test:
- Every topic
- Many variations
- Every major model
- Multiple user personas
- Across time
The cost explodes fast.
A conservative setup easily reaches millions of AI calls per month and tens of thousands of dollars in API costs. No tool runs anywhere near that volume.
They produce tiny samples that are later treated as reliable signals.
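For a sense of scale, multiply the dimensions out. Every number below is an assumption, chosen only to show how quickly the volume grows:

```python
# Back-of-the-envelope sketch; all figures are assumptions.
topics     = 50   # categories / intents you care about
variations = 20   # phrasings per topic
models     = 4    # ChatGPT, Claude, Gemini, Perplexity
personas   = 5    # budget, team size, industry profiles
repeats    = 10   # runs per prompt to average out sampling noise
days       = 30   # daily tracking for a month

calls_per_month = topics * variations * models * personas * repeats * days
print(f"{calls_per_month:,} calls/month")  # 6,000,000 under these assumptions
```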
3. LLMs are non-deterministic
Tiny prompt changes can produce completely different brand lists.
Even identical prompts produce different outputs due to how models sample tokens, run on different hardware, or apply safety filters.
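You can see this for yourself by repeating a single prompt a few times. A minimal sketch using the OpenAI Python SDK; the model name and prompt are placeholders, and an API key is assumed to be set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Best project management tools for remote teams"

for run in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you track
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,      # sampling on: identical inputs, varying outputs
    )
    print(f"--- run {run + 1} ---")
    print(response.choices[0].message.content)
```

Across the runs, the recommended brands will typically differ in order and often in membership, even though nothing about the brands has changed.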
4. Tracking a moving target
Models change constantly:
- RLHF tuning shifts preference heuristics
- Safety layers modify output ranking
- Retrieval pipelines swap sources
- New model versions replace old behaviors
A “visibility drop” may reflect model drift rather than any real change in the market.
5. Big brands get bigger
Large companies get mentioned because they:
- Already exist in training data
- Are heavily referenced in trusted sources
- Have dense “knowledge nodes” in model weights
- Can afford high sample sizes
For them, even noisy dashboards can provide useful trend signals.
Small brands don’t exist in the model’s mental universe. They don’t appear in prompts, can’t afford enough sampling, and aren’t cited by authoritative sources. For them, the tools return noise.
What works
Instead of trying to monitor AI responses:
- Create well-structured, deep, easily citable content
- Publish original research and data
- Build up authority through trusted publications
- Maintain consistent brand identity
- Write list-style and structured resources
These increase your chances of being retrieved and cited by the RAG pipelines that real recommendations are drawn from.
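As a toy illustration of why content matters more than prompt scores, here is a simplified retrieval step using a naive keyword-overlap ranker. Real RAG pipelines use embeddings and far larger indexes, and every document below is invented:

```python
# Toy retrieval sketch: an assistant can only cite what its retrieval layer
# surfaces, so being easy to retrieve and quote is what moves the needle.

DOCS = [
    {"source": "yourbrand.com/remote-guide",
     "text": "A structured guide to project management for remote teams, with original benchmark data."},
    {"source": "competitor.com/news",
     "text": "Press release about our latest funding round."},
]

def retrieve(query: str, docs: list[dict], k: int = 1) -> list[dict]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(terms & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

for doc in retrieve("best project management tools for remote teams", DOCS):
    print("Would be quoted and cited:", doc["source"])
```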
Treat prompt trackers as useful for broad trend-watching, not as a precise picture of reality.