Tracking Rankings on LLMs (AI Chatbots)

Chris Green
Jul 13, 2024

Updated: Aug 12, 2024

Most major LLMs (like ChatGPT, Gemini, Co-Pilot etc) can cite their sources when returning answers. This means links in outputs - somewhere new to gain traffic from!

So now you’ll want to know how you rank on LLMs as well as search engines, right? Rank tracking is bread & butter for SEO, but it may not be as simple here.

The answers from LLMs are based on probability & creative variance (temperature) amongst other things. This means the same question from the same person can receive different answers.
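To make the temperature point concrete, here is a toy sketch of how a higher temperature flattens the odds between candidate answers. The businesses and scores are made up for illustration, and this is a simplification of the idea rather than how any particular chatbot works internally.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidates and scores, purely for illustration.
candidates = ["Business A", "Business B", "Business C"]
scores = [2.0, 1.5, 0.5]

low = softmax_with_temperature(scores, 0.2)   # strongly favours the top score
high = softmax_with_temperature(scores, 1.5)  # spreads the odds more evenly

# Sampling at the higher temperature can return different picks on each ask.
rng = random.Random(0)
picks = rng.choices(candidates, weights=high, k=5)
print(low, high, picks)
```

At the low temperature the top-scoring business takes almost all of the probability, while at the higher one the other two become realistic picks, which is one reason repeat asks can disagree.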

Rank tracking isn’t perfect but Google’s results are consistent-enough for this data to be valuable. Is this the case for LLMs? Can this data be useful?

Testing "Rankings" across LLMs (AI ChatBots)

We can test this quite quickly and easily.

5 different questions - angled towards products/solutions
4 openly available LLMs (ChatGPT, Gemini, Co-Pilot & Perplexity)
3 different users asking the same question as a new conversation
Across a 24hr period - not at the same time, but within the same time window as "conventional" SEO rank tracking.
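For a sense of scale, the setup above can be laid out as a simple test matrix. The question and user labels are placeholders (assumptions for illustration); only the counts and LLM names come from the list above.

```python
from itertools import product

# Placeholder labels; the real questions and users are not listed here.
questions = [f"question_{i}" for i in range(1, 6)]
llms = ["ChatGPT", "Gemini", "Co-Pilot", "Perplexity"]
users = ["user_1", "user_2", "user_3"]

# One entry per (question, LLM, user) combination, each asked as a new conversation.
test_matrix = list(product(questions, llms, users))
print(len(test_matrix))  # 5 questions x 4 LLMs x 3 users
```

That works out to 60 responses in total, which is why this can only ever be a small first look.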

Such a small test isn't going to say anything definitive, but hopefully it'll get you thinking and maybe spur some more testing.

Thanks to Oli and Phil for helping to run this!

I'll be looking specifically at:

Local brands/businesses that are mentioned
The % of results that match from the same LLM (across x3 asks)
The % match of these across all LLMs tested

Why "Matches" and Not "Positions"?

I have decided to look at this in terms of how many times each business is mentioned across each of the three asks. The number of businesses that match between responses shows how consistent the mentions are.

The higher % this is, the more valuable the "ranking data" here will be as it is more likely users will get the same results.
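One way to put a number on this is sketched below. It assumes "% match" means the share of all businesses mentioned that appear in every one of the asks. That definition, the name normalisation and the sample businesses are my assumptions for illustration, not necessarily the exact calculation behind the results in this post.

```python
def normalise(name):
    """Lower-case and collapse whitespace so 'Beta Co' and 'beta  co' match."""
    return " ".join(name.lower().split())

def match_percentage(responses):
    """responses: one list of business names per ask of the same question."""
    sets = [{normalise(n) for n in r} for r in responses]
    mentioned = set().union(*sets)       # every business seen in any ask
    if not mentioned:
        return 0.0
    in_every = set.intersection(*sets)   # businesses seen in all asks
    return 100 * len(in_every) / len(mentioned)

# Hypothetical answers from three users asking one LLM the same question.
asks = [
    ["Alpha Ltd", "Beta Co", "Gamma"],
    ["beta co", "Alpha Ltd"],
    ["Alpha Ltd", "Delta", "Beta Co"],
]
print(match_percentage(asks))  # 2 of the 4 businesses appear in all three asks
```

A stricter or looser definition (for example, counting businesses that appear in at least two of the three asks) would give different percentages, so it is worth fixing the definition before comparing LLMs.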

Search rank tracking is based more around how high you are from the top of search results - and with good reason!

The issue with all the LLMs is that the result order was very inconsistent and the lists were usually short and uncluttered (ignoring ads in Co-Pilot, and ChatGPT maybe providing too much data). So the position of the rankings was generally different with every search and (probably) less useful. I'm not arguing against position tracking in the future, though!

Results - % Match Across All Questions and LLMs

Let's look across all questions and all LLM answers. There's no reason to expect these to match - rankings aren't the same across search engines - but it's interesting to take the temperature early on.

The key things here:

No question had more than a 60% match - so not a lot of similarities
Location intent (UK or Colchester) matched very infrequently, so the "rankings" here weren't all that useful
The listed references across each of the LLMs (i.e. the directories the results were found in) were often very similar, but the businesses selected from them still differed

From here it'd be safe to assume that if you are going to monitor LLM rankings for your business, you might want to keep an eye on more than one - you can't assume high visibility in one will mean high visibility in all.
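If you do monitor more than one LLM, a quick way to see how much they agree is to compare the sets of businesses each one mentions. The sketch below uses Jaccard overlap (shared businesses divided by all businesses mentioned by either), which is my choice of measure for illustration, and the data is made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared items divided by all items across both sets (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def pairwise_overlap(mentions_by_llm):
    """Compare every pair of LLMs on the businesses they mentioned."""
    return {
        (x, y): jaccard(mentions_by_llm[x], mentions_by_llm[y])
        for x, y in combinations(sorted(mentions_by_llm), 2)
    }

# Hypothetical mentions for one question.
mentions = {
    "ChatGPT": {"alpha", "beta", "gamma"},
    "Gemini": {"beta", "gamma"},
    "Perplexity": {"gamma", "delta"},
}
overlap = pairwise_overlap(mentions)
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

Low scores between pairs are the signal that visibility in one LLM tells you little about visibility in another.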

Results - Summary % Match Across All Questions and LLMs

If we break it down, you can see that the picture across each is pretty different.

I haven't included some other supporting information like other links/references (beyond the quoted businesses) or other flourishes. Gemini/Co-Pilot are pretty similar in most ways, ChatGPT is the verbose older-brother and Perplexity has a slightly effortless minimalism about it.

Onto the data!

Key Findings:

Perplexity has the most similarity (highest matches) across questions, which should make running a useful Perplexity rank-tracker much easier.
Some questions (CRM and VPN) are more "agreed-on" - maybe this information is easier to identify, or those questions were less contested.
There are a lot of sub-50% matches, which really isn't that useful for tracking purposes. ChatGPT "wins" here (was the worst), but not by a lot. There's a strong chance this would have balanced out with more questions.

Beyond These Initial Results

There are some other observations worth considering from running these quick tests.

ChatGPT tends to give more options, so it is less likely to produce consistent answers. As a personal preference, if I were to set a pre-prompt (or a more detailed prompt), it'd be to ask for more concise answers. One of ChatGPT's responses was 13 sources long, which is too long - for me at least.
Co-Pilot and Gemini were a close second and third in their number of businesses cited per answer - roughly equal to each other. Perplexity's brevity felt quite distinct when comparing them all.
The quality of some of the answers was genuinely poor.
- For example, some LLMs weren’t able to correctly identify CRMs or Newspapers. Maybe these questions were too difficult, but I was surprised.
- Or, local results were almost randomly picked from directories. All the answers suggested you (the user) needed to select based on your need and own research, but that feels like a weak caveat.

Also, outside this test altogether, LLMs are changing so quickly that this picture could change very frequently. Keeping up with these changes (on top of parsing/processing the results) feels like a major undertaking now.

How Could You Do It, if You Wanted to Do Better?

I think there are ways to make accurate LLM rank tracking possible, but the methodology is going to have to be radically different.

My two cents for anyone wondering:

- You create accounts with each LLM as a "persona" for your key audiences - many of the tools can/will use user context to form responses.

- You search X number of times and then the rankings are the % chance of each business appearing.

- The position of results within LLMs may be almost useless, as positioning in responses is arbitrary/random.

- The more LLMs "remember" chats/different interactions, the more responses will diversify and differ from others - so any rank tracking becomes "ideal world" or you're just checking general visibility/direction of travel.

- Tracking traffic through LLMs is still problematic - so understanding the impact is very tricky.
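The "% chance of each business appearing" idea from the list above could look something like this. It is a minimal sketch that assumes you have already collected the businesses named in each repeated ask; the function name and the sample data are hypothetical.

```python
from collections import Counter

def appearance_rates(runs):
    """runs: one list of business names per repeated ask of the same question."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count each business once per ask
    return {name: counts[name] / len(runs) for name in counts}

# Hypothetical results from asking the same question four times.
runs = [
    ["Alpha", "Beta"],
    ["Alpha"],
    ["Alpha", "Gamma"],
    ["Beta", "Alpha", "Alpha"],
]
rates = appearance_rates(runs)

# "Rank" by how often each business appears rather than by list position.
ranking = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for name, rate in ranking:
    print(f"{name}: {rate:.0%}")
```

Running this per persona and per LLM would give a visibility score for each business that ignores the (fairly arbitrary) order within a single response.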

I'm not suggesting that LLM tracking isn't going to be possible. I'd be amazed if there weren't multiple parties running the same/similar tests who are likely further ahead than this blog.

Is it worth it? Maybe. I'd want to test more thoroughly against a client's business needs and see how useful this data is in my day-to-day.

Have you tried this yourself or is this something you're planning with your clients? It'd be great to hear from you!