
AI chatbots and the illusion of “reading an image” - is it worth the risk?

  • Writer: Chris Green
  • 8 hours ago
  • 2 min read

When we hand a URL to an AI chatbot and ask it what’s on the page, we tend to assume something browser-like is happening.


After all, if you upload an image it can "see" it - it can even regenerate it, right? Agents can also click and navigate around a page. So reading the text in a single image must be trivial. Surely?


But what if that assumption is wrong?


To test this, I created two near-identical pages hosted on GitHub Pages.



One contains only standard text and a product image. The other is identical in text, but the image includes a price offer — and that offer appears nowhere in the HTML.


  1. Test A - https://chr156r33n.github.io/ai-ocr-test/a.html - just a page with an image, no offer details

  2. Test B - https://chr156r33n.github.io/ai-ocr-test/b.html - the same page, but this time with a price offer in the image.
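The setup is easy to reconstruct. A minimal sketch of the two pages (hypothetical markup - the real pages may differ, and "20% off" stands in for whatever the actual offer says):

```python
# Hypothetical reconstruction of the two test pages. The key property:
# page B's offer exists only as pixels inside the image file, never as
# text anywhere in the HTML source.
PAGE_A = """<html><body>
<h1>Product</h1>
<p>A standard product description.</p>
<img src="product.png">
</body></html>"""

PAGE_B = """<html><body>
<h1>Product</h1>
<p>A standard product description.</p>
<img src="product-with-offer.png">
</body></html>"""  # the offer text is baked into this image's pixels

# The offer string never appears in either document's source:
assert "20% off" not in PAGE_A
assert "20% off" not in PAGE_B
```

Anything that only parses the HTML of page B has no way to know the offer exists.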


Both pages were given to Gemini and ChatGPT. In each case, I asked what could be seen on the page and then explicitly asked whether the model could see or read the image.


The results suggest something uncomfortable - but possibly expected.


The Results


Even when explicitly asked whether they could "see" the images, both models said they could not.



Gemini hallucinated a fake specification table, and ChatGPT just flat-out said "no". Both could tell it was an OCR test - thanks to the URLs and other code elements - but nothing gave away the offer except the image itself.



Here are the chats from each:


As an aside, ChatGPT spooked me in a subsequent conversation where it "knew" about the text in the image and the exact price. This was just ChatGPT's memory cutting in.




What do we do with this information?


If AI Chatbots are often summarising text extracted from HTML rather than rendering and OCR’ing full pages, then any critical information that exists only in pixels may effectively not exist.


Parsing HTML is cheap; rendering a page and processing an image to "read" its text is expensive.
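That cost difference is easy to picture. A sketch of what a cheap, text-first pipeline "sees", using Python's built-in parser (an assumption about how these pipelines work, not a claim about any vendor's implementation):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect the visible text of a page. An <img> contributes
    nothing unless it carries an alt attribute."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

page = '<h1>Product</h1><p>Great widget.</p><img src="offer.png">'
parser = TextOnly()
parser.feed(page)
print(" ".join(parser.chunks))  # -> "Product Great widget." ; the image is invisible
```

That extraction runs in microseconds; actually downloading the image, rendering it, and running OCR is orders of magnitude more work per page.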


Asking the chatbot to go into "Agent mode" did lead them to "see" the image, but if we're looking at the RAG pipeline itself - i.e. what is good for search - this is almost irrelevant.


This small test does not prove that LLMs can never read images. In other contexts, vision models absolutely can. But it does suggest that in common “fetch a webpage and summarise it” workflows, image content is not guaranteed to be processed.


If pricing, disclaimers, offers, availability, or key claims live only inside images, they may be invisible to AI summaries, AI answers, and AI-driven discovery. This is not meaningfully different to "traditional search".


Technical SEO has ALWAYS been a mission to reduce friction for crawlers - that is still the same mission today.


If it matters, treat it as readable text. Nothing fancy is needed; relying on pixels alone may be a risk you don’t realise you’re taking.
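In practice that can be as simple as duplicating the image's message in the markup itself - alt text plus a visible copy. A sketch (illustrative markup, with a deliberately crude tag-strip standing in for a text-first pipeline):

```python
import re

# Before: the offer lives only in the image.
before = '<p>Great widget.</p><img src="offer.png">'

# After: the same message duplicated as alt text and visible copy.
after = ('<p>Great widget.</p>'
         '<img src="offer.png" alt="Spring offer: 20% off">'
         '<p>Spring offer: 20% off</p>')

def visible_text(html: str) -> str:
    # Crude tag strip. Note it also discards alt attributes, which is
    # why the visible <p> copy matters, not just the alt text.
    return re.sub(r"<[^>]+>", " ", html)

assert "20% off" not in visible_text(before)
assert "20% off" in visible_text(after)
```

The offer now exists as text, so even the cheapest extraction pipeline keeps it.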

© 2025 CG Search Ltd
