How I Audited 100+ Ecommerce Pages for Cannibalization Without Paid SEO Tools

Using semantic similarity, Python, and a reproducible workflow – not gut feel.


Why keyword cannibalization audits break down in ecommerce

Keyword cannibalization is one of those SEO problems everyone knows exists – and almost everyone handles badly.

On ecommerce sites especially, overlap is inevitable:

  • Category pages vs product pages
  • Festival collections vs themed products
  • Multiple products answering the same intent with slightly different framing

The usual process looks like this:

  • Open 30–100 URLs
  • Skim content manually
  • Argue about “intent”
  • Guess which page Google might prefer
  • Make changes that are hard to justify later

The problem isn’t effort.
The problem is subjectivity.

I wanted a way to answer one question cleanly:

Which pages are actually competing in meaning — not just keywords?

And I wanted it to be:

  • Free
  • Local
  • Reproducible
  • Explainable

No paid SEO tools. No APIs. No eyeballing.


The constraints (intentional, not accidental)

This experiment was done on a real WordPress ecommerce site with 100+ product and category pages.

I deliberately restricted myself to:

  • ❌ No paid SEO tools
  • ❌ No Screaming Frog license hacks
  • ❌ No cloud APIs or embeddings services
  • ❌ No “trust me, this feels similar” logic

Everything runs:

  • Locally
  • With open-source libraries
  • In a way that anyone on the team could repeat

The core idea: treat cannibalization as a similarity problem

Instead of asking:

“Which page should rank for this keyword?”

I reframed the problem as:

“Which pages are semantically similar enough that Google could treat them as the same answer?”

That shift changes everything.

Modern search engines don’t evaluate pages by keywords alone.
They evaluate meaning.

So that’s what I measured.


Step 1: Extract the right content (not raw HTML)

From the crawl data, I did not use raw body text as-is.

Each page was normalized into a single column called:

Final_Content

This included:

  • Page title
  • Primary headings
  • Core descriptive content
  • Product descriptions (for PDPs)

What I intentionally ignored:

  • Navigation
  • Boilerplate
  • Cart / checkout
  • Filters and pagination

The goal wasn’t cleanliness for humans – it was signal clarity for embeddings.
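A minimal sketch of this normalization step, assuming a crawl export with `Title`, `H1`, and `Body_Text` columns (your crawler's column names will differ):

```python
import pandas as pd

# Toy crawl export; a real one comes from your crawler's CSV.
crawl = pd.DataFrame({
    "URL": ["/butterfly-themed-cake/"],
    "Title": ["Butterfly Themed Cake"],
    "H1": ["Butterfly Cake"],
    "Body_Text": ["A pastel butterfly cake for birthdays."],
})

def build_final_content(row):
    # Join only the signal-bearing fields. Navigation, boilerplate,
    # cart/checkout and filter text are never pulled into these columns.
    parts = [row["Title"], row["H1"], row["Body_Text"]]
    return " ".join(p.strip() for p in parts if isinstance(p, str) and p.strip())

crawl["Final_Content"] = crawl.apply(build_final_content, axis=1)
print(crawl[["URL", "Final_Content"]])
```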


Step 2: Focus only on Product pages

For similarity analysis, I filtered the dataset to:

Page_Type = Product

Why?

Because:

  • Product ↔ Product overlap is where cannibalization hurts most
  • Category pages need different handling (more on that later)

This left me with a clean products.csv containing:

  • URL
  • Final_Content

Nothing else.
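The filtering itself is one line of pandas; this sketch assumes the `Page_Type` and `Final_Content` columns described above:

```python
import pandas as pd

# Toy normalized crawl; the real one has 100+ rows.
pages = pd.DataFrame({
    "URL": ["/cakes/", "/unicorn-themed-cake/", "/butterfly-themed-cake/"],
    "Page_Type": ["Category", "Product", "Product"],
    "Final_Content": ["All cakes", "Unicorn cake copy", "Butterfly cake copy"],
})

# Keep product pages only, and only the two columns the similarity step needs.
products = pages.loc[pages["Page_Type"] == "Product", ["URL", "Final_Content"]]
products.to_csv("products.csv", index=False)
```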


Step 3: Generate semantic embeddings (locally)


I used a free, local embedding model:

all-MiniLM-L6-v2

Why this model?

  • Lightweight
  • Fast
  • Proven for semantic similarity tasks
  • Runs fully offline once downloaded

Using Python, each product’s Final_Content was converted into a vector representation of its meaning.

No keywords.
No rules.
Just semantics.

The full script is intentionally simple – the value is in how the output is interpreted.

Step 4: Measure similarity between every product page

With embeddings generated, I calculated cosine similarity between every pair of product pages.

The output was a table like this:

URL 1                      URL 2                      Similarity
/butterfly-themed-cake/    /unicorn-themed-cake/      0.91
/janmashtami-cake/         /krishna-themed-cake/      0.94

This immediately surfaced patterns that are invisible in keyword tools.
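On L2-normalized embeddings, the whole pairwise matrix is one matrix product. A sketch with toy vectors standing in for the real embeddings:

```python
import numpy as np

urls = ["/butterfly-themed-cake/", "/unicorn-themed-cake/", "/chocolate-cake/"]

# Toy 3-dimensional stand-ins for the real 384-dimensional embeddings.
emb = np.array([
    [0.9, 0.1, 0.4],
    [0.8, 0.2, 0.5],
    [0.1, 0.9, 0.2],
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# For unit vectors, the dot product IS the cosine similarity.
sim = emb @ emb.T

# Flatten the upper triangle into (URL 1, URL 2, Similarity) rows.
pairs = [
    (urls[i], urls[j], round(float(sim[i, j]), 2))
    for i in range(len(urls))
    for j in range(i + 1, len(urls))
]
for row in pairs:
    print(row)
```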


Step 5: Turn numbers into decisions (RED / YELLOW / WATCH)

Raw similarity scores aren’t useful by themselves.
So I introduced action bands inside Google Sheets.

RED     ≥ 0.90
YELLOW  0.85 – 0.89
WATCH   0.80 – 0.84
SAFE    < 0.80

This single step changed the entire workflow.

Instead of asking “what should we do?”, the data now said:

  • 🚨 These pages are indistinguishable in intent
  • ⚠️ These pages overlap but can be differentiated
  • 👀 These pages are adjacent – monitor, don’t touch

No debates.
No opinions.
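The banding logic is trivial to encode, which is exactly the point – the same thresholds apply every time:

```python
def action_band(score: float) -> str:
    """Map a cosine similarity score to the article's action bands."""
    if score >= 0.90:
        return "RED"      # indistinguishable in intent
    if score >= 0.85:
        return "YELLOW"   # overlapping, but can be differentiated
    if score >= 0.80:
        return "WATCH"    # adjacent – monitor, don't touch
    return "SAFE"

for s in (0.94, 0.87, 0.82, 0.61):
    print(s, action_band(s))
```

In Google Sheets the same bands are a nested IF or a conditional-formatting rule; in Python they become a column you can filter and sort.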


What the data revealed (this surprised me)

1. Category pages often compete with their own products

Examples like:

/janmashtami-cakes/
/janmashtami-cakes/janmashtami-themed-cake/

Semantic similarity: 0.90+

This isn’t a mistake – it’s a role clarity issue.

The fix wasn’t deletion or redirects.
It was:

  • Category pages → act as hubs
  • Product pages → act as destinations

Once that distinction is clear in content, overlap stops being harmful.


2. Not all similarity is bad

Some YELLOW overlaps were healthy.

For example:

  • Anniversary cakes vs birthday cakes
  • Princess cakes vs unicorn cakes

These serve adjacent but distinct emotional use-cases.

The right action wasn’t consolidation – it was intent framing, especially in the first 150–200 words.


3. WATCH pages should often be left alone

This was the hardest mindset shift.

Pages in the 0.80–0.84 range often:

  • Share visual language
  • Share ingredients
  • Serve different buyers

Touching these prematurely would likely cause more harm than good.

Sometimes the correct SEO action is no action.


How this changed my cannibalization workflow

Before:

  • Manual audits
  • Endless reviews
  • Subjective decisions
  • Hard to justify changes later

After:

  • Deterministic system
  • Clear thresholds
  • Explainable actions
  • Easy to revisit as the site grows

Most importantly, this approach scales.

Add 50 new products?
→ Rerun the script.
→ Re-evaluate clusters.
→ Make confident decisions.


What I’d improve next time

  • Cluster pages automatically (instead of pairwise review)
  • Layer in internal link signals
  • Track post-change impact by cluster, not page
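The clustering upgrade doesn't need anything fancy: treat every pair above a threshold as an edge and group connected URLs. A sketch using a tiny union-find (this is my illustration of the idea, not a script from the original workflow):

```python
def cluster_pairs(pairs, threshold=0.90):
    """Group URLs connected by similarity >= threshold into clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for url in parent:
        clusters.setdefault(find(url), set()).add(url)
    return [sorted(c) for c in clusters.values()]

pairs = [
    ("/janmashtami-cake/", "/krishna-themed-cake/", 0.94),
    ("/butterfly-themed-cake/", "/unicorn-themed-cake/", 0.91),
    ("/princess-cake/", "/unicorn-themed-cake/", 0.82),  # below threshold
]
print(cluster_pairs(pairs))
```

Each resulting cluster can then be reviewed (and tracked) as one unit instead of dozens of pairwise rows.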

But even in its current form, this workflow is already a massive upgrade over traditional audits.


Who this approach is for

This is especially useful if you:

  • Run ecommerce or large content sites
  • Struggle with category vs product overlap
  • Want automation that improves decision quality, not just speed
  • Are experimenting with semantic SEO workflows

Final thought

Keyword cannibalization isn’t really an SEO problem.

It’s a decision problem.

Once you stop guessing – and start measuring meaning – everything becomes calmer, clearer, and more defensible.

If you’re experimenting with similar workflows or thinking about automation in SEO, feel free to reach out. I’m refining this into a repeatable system and happy to exchange notes.

Conclusion

Follow me on Medium, X, and LinkedIn for more practical guides and deep dives into Python, AI, and SEO. I share fresh tips every week that can save you time and boost your results.

Got questions or ideas? Drop a comment – I love hearing from readers and sharing insights.

And don’t forget to share this post with your network if you think it’ll help them too!
