How Content Structure Matters for AI Search
- Chris Green
- Jun 13
- 6 min read
Updated: Jun 16
Is "writing for AI" Just optimising content for chunking?
One common recommendation I have started to see when preparing content for AI (breaking it into chunks and vectorisation for semantic matching) is to ensure key pieces of information remain within the same chunk.
Doing so increases the likelihood that a single chunk is recognised as highly relevant, rather than having relevance diluted across multiple chunks.
This logic makes sense, but is it too simplistic? Some writers - those producing conversion-focused marketing copy, for example - will be writing like that anyway; others may have been trained more on long-form writing, on detail and on rigour.
Which is right?
Let's see if we can test it!
TL;DR - Q&A is great for optimising content, but it's not the only option
Q&A format consistently delivered the highest semantic relevance to queries in every scenario.
Dense prose performed the worst across all tests for matching queries.
Structured content (using headings/lists) was almost as effective as Q&A for non-question queries.
This was a deliberately small, controlled test - these results are illustrative, not definitive for “real world” performance.
We don’t know which chunking method Google actually uses; this is just one piece of the retrieval puzzle.
If the goal is maximum match to queries, Q&A should be your default format.
However, well-structured dense content can still perform well - just use proper HTML and structure for both search and users.
What Are We Testing?
To get a steer on the answer here, we are primarily examining two pieces of the puzzle:
How different writing styles affect the strength of semantic matching for search queries.
How various chunking methods impact this semantic matching process.
These elements are interconnected; the effectiveness of a writing style partially depends on the chosen chunking method. By using multiple chunking methods, we aim to determine general trends in how styles perform.
This test is significantly simplified compared to the overall process: the RAG pipelines that generate results - such as AI Overviews and AI Mode - do more than just retrieve chunks of content based on their similarity to a query. But we can certainly test the theory behind this part.
Testing Content Writing Styles
The Method
Create three articles on distinct topics, each rewritten in three different styles:
Dense Prose: Primarily paragraphs, no structure/additional markup.
Structured Content: The dense prose version, but marked up with headings and list items.
Q&A: Format of question followed by direct answers.
Test each article against five queries (mostly top-of-funnel research queries, selected randomly).
"Chunk" the content and Store it as Vectors
To evaluate content effectiveness, we applied four chunking methods:
Token-based: Splits content by a fixed number of tokens.
Recursive text-based: Splitting at natural textual boundaries (paragraphs, sentences, words).
HTML-aware: Utilises HTML tags (like headings and paragraphs) to determine chunk boundaries.
Semantic chunking: AI-driven method, splitting based on significant shifts in topic.
While Google's exact approach to chunking is unknown, token-based is likely overly simplistic. HTML-aware provides a logical and intuitive baseline, while recursive and semantic chunking, borrowed from LangChain (credit to Dawn), offer sophisticated approaches closer to real-world application.
The Results
Despite the limited scale, clear patterns emerged:
Q&A format - overall had the highest semantic relevant to the query for every scenario
Dense Prose - overall matched the least-well across all tests
Structured Content - for non-question queries was very close to Q&A on average
The comparison tables show the score of the highest-relevance chunk for each content type (variant) by chunking method, plus the average across the chunking methods.

The final output contains the cosine similarities of all the vectorised queries against all of the chunks (plus the chunks themselves for reference), so that's a lot of data, and not all of it is useful. In case it is useful to you, I added it all to Git for people to take a look at.
What Do We Do With This Knowledge?
This IS a small test with some deliberately distinct content types to highlight the differences in the approach, not to simulate what you'll find "in the real world". Remember, we don't know which chunking method is the closest to what Google uses AND this process is only part of how Google retrieves content.
That said, Q&A would be my chosen method of writing content if I wanted to really maximise the chances of it being confidently matched to these queries - based on this data.
Truth is, though, if the content is well-structured, it appears that more dense writing styles could still perform well. There will be times when a Q&A format isn't right for the job; just ensure you use correct HTML and structure your work - which is, of course, better for users too.
Limitations
We’re simulating Google’s process - Not identical to how Google chunks or scores content, but close enough to test for patterns
Vector similarity does not equal search ranking - This tests semantic match, not the full set of stages involved in the process. It is only one part of the picture!
Chunking matters, but so does context - Google will use other signals to prioritise different chunks and we can’t be certain which chunking method Google uses.
Different embedding models might behave differently - Gemini vs. OpenAI vs. SBERT all embed text differently. We have used ones Google is likely to use, but it isn’t certain.
Test quality matters - This test requires the same content to be reproduced in multiple ways, with each version equally relevant to the queries run against it. It’s entirely possible that the choice of content and queries affects the performance, clouding the conclusions.
Try it Yourself - The Code
What This Code Does — in Simple Terms
You provide different versions of content
You start with a CSV where each row is a different “variant” of a web page or paragraph (e.g. the same blog post written in different ways). Each version goes through the same process - change what you want in "input_documents.csv" - just ensure you label the variants.
You can have as many or as few as you like here; it depends on what you want to test.
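As a rough illustration, reading that file might look like the sketch below - the "variant" and "content" column names are my assumption, not necessarily the exact schema the repo expects:

```python
# Hypothetical sketch: the column names ("variant", "content") are assumptions --
# check input_documents.csv in the repo for the schema it actually uses.
import pandas as pd

df = pd.read_csv("input_documents.csv")
for variant, rows in df.groupby("variant"):   # e.g. dense_prose / structured / qa
    print(variant, len(rows), "document(s)")
```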
Each version is split into “chunks”
You can edit which chunking techniques you test with in "enabled_chunkers".
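Something along these lines - the chunker labels here are illustrative, so check the script for the exact names it accepts:

```python
# Illustrative only: the accepted labels may differ in the actual script.
enabled_chunkers = ["token", "recursive", "html", "semantic"]
```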
Each chunk is converted into a vector
Each chunk is sent to Google’s Vertex AI, which turns the chunk into a vector embedding. This is a list of numbers that represent the chunk’s meaning - like an AI-friendly summary. You will need to have authenticated with GCP for this.
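A minimal sketch of that step using the langchain_google_vertexai wrapper (the model name matches the settings listed later; the sample chunks are placeholders):

```python
# Assumes you have authenticated with GCP (e.g. `gcloud auth application-default login`)
# and have the Vertex AI API enabled for your project.
from langchain_google_vertexai import VertexAIEmbeddings

chunks = ["First chunk of the article...", "Second chunk of the article..."]  # placeholders
embedder = VertexAIEmbeddings(model_name="text-embedding-005")
chunk_vectors = embedder.embed_documents(chunks)  # one vector (list of floats) per chunk
```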
Each version is indexed using HNSW
You store these vectors using an HNSW index (Hierarchical Navigable Small World graph). It’s just a super-fast way to search for the chunk most similar to a given query.
You test search queries
You provide a list of questions or search queries (like what someone might type into Google). Each query is also turned into a vector.
Then, the AI tries to find the most similar chunk from each version, based on the query.
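Continuing the embedding sketch above (reusing embedder, chunks and chunk_vectors), here is a rough example of indexing with hnswlib and pulling back the closest chunk for a query - the parameter values are illustrative defaults, not necessarily what the repo uses:

```python
import numpy as np
import hnswlib

# Build a small HNSW index over the chunk vectors.
dim = len(chunk_vectors[0])
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(chunk_vectors), ef_construction=200, M=16)
index.add_items(np.array(chunk_vectors), ids=list(range(len(chunk_vectors))))
index.set_ef(50)  # query-time recall/speed trade-off

# Embed an example query and retrieve the single most similar chunk.
query_vector = embedder.embed_query("how do I get started with this topic?")
labels, distances = index.knn_query(np.array([query_vector]), k=1)
best_chunk = chunks[labels[0][0]]
similarity = 1 - distances[0][0]  # hnswlib returns cosine distance; convert to similarity
print(best_chunk, round(float(similarity), 3))
```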
Results Ranked
The script shows:
Which content variant returned the best match
Which chunk from that variant matched
How similar it was to the query (score from 0 to 1)
Ranking per chunking method (so you can compare strategies)
This is saved to CSV files for analysis.
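To give a sense of the shape of the output, it might look something like the snippet below - the column names are my own, not the script's exact schema:

```python
# Illustrative output row only; the real script's columns may differ.
import pandas as pd

results = pd.DataFrame([
    {"query": "how do I get started?", "variant": "qa", "chunker": "html",
     "best_chunk": "To get started, you...", "cosine_similarity": 0.82},
])
results.to_csv("results_by_chunker.csv", index=False)
```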
Details on the Chunking Methods
HTML-aware
Parses HTML structure and chunks by tags like <h1>, <p>, <li>.
How it works:
Merges content inside HTML tags.
Chunk size = 100 tokens with 20 token overlap.
Settings:
Tags used: ['h1', 'h2', 'h3', 'h4', 'p', 'li', 'div']
Min chunk length: 3+ words
Dependencies: BeautifulSoup
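A rough sketch of the idea, using whitespace-separated words as a stand-in for tokens - the repo's implementation will differ in detail:

```python
from bs4 import BeautifulSoup

TAGS = ["h1", "h2", "h3", "h4", "p", "li", "div"]

def html_aware_chunks(html, chunk_size=100, overlap=20, min_words=3):
    """Pull text from the listed tags, then merge it into overlapping word windows."""
    soup = BeautifulSoup(html, "html.parser")
    words = []
    for tag in soup.find_all(TAGS):
        if tag.find(TAGS):  # skip containers whose children are also in TAGS (avoids duplicates)
            continue
        text = tag.get_text(" ", strip=True)
        if len(text.split()) >= min_words:
            words.extend(text.split())
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```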
Token-based
Splits text by tokenizer
How it works:
Overlaps 20 tokens into the next chunk to maintain context.
Settings:
Chunk size: 100
Overlap: 20
Dependencies: None (Python)
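In sketch form, again approximating tokens with whitespace-separated words:

```python
def token_chunks(text, chunk_size=100, overlap=20):
    """Fixed-size chunks with a 20-'token' overlap, using words as a token stand-in."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
```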
Recursive
Uses LangChain’s RecursiveCharacterTextSplitter to chunk by structure.
How it works:
Tries to split by paragraphs → sentences → words.
Good fallback logic for natural splits.
Settings:
Chunk size: 100 tokens
Overlap: 20 tokens
Dependencies: langchain_text_splitters
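A minimal example with LangChain's splitter - using from_tiktoken_encoder so the 100/20 settings are counted in tokens rather than characters (this assumes tiktoken is installed; the repo may count length differently):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = "Your article text goes here..."  # placeholder

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100,    # ~100 tokens per chunk
    chunk_overlap=20,  # 20-token overlap between chunks
)
chunks = splitter.split_text(document_text)
```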
Semantic
Chunks based on semantic change using embedding distances.
How it works:
Embeds entire document.
Finds breakpoints where the topic shifts significantly.
Settings:
Embedding model: text-embedding-005 (Vertex AI)
Breakpoint threshold: 95th percentile
Dependencies:
langchain_experimental
langchain_google_vertexai (for VertexAIEmbeddings)
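Putting those settings together, a sketch might look like this (constructor arguments can vary between langchain_experimental versions, and GCP authentication is assumed):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_google_vertexai import VertexAIEmbeddings

document_text = "Your article text goes here..."  # placeholder

embeddings = VertexAIEmbeddings(model_name="text-embedding-005")
splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split where embedding distance crosses the 95th percentile
)
chunks = splitter.split_text(document_text)
```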
Closing Thoughts
I ended up going a little deeper into this than I had anticipated. I have an interest in IR (information retrieval), but I am far from an expert in the academic side of things. I set out initially to test some concepts - and I think I have taken what I intended from it - but this is far from finished.
The method (code and content) used is far from perfect, and this could do with further refinement and ultimately more testing. Anyone who wants to test, comment or give feedback - please do!
Thanks to Dawn Anderson, Sam Monaghan, Phillip Pratt and Jon Wilks for feedback/input.