How I Audited 100+ Ecommerce Pages for Cannibalization Without Paid SEO Tools

Using semantic similarity, Python, and a reproducible workflow – not gut feel.


Why keyword cannibalization audits break down in ecommerce

Keyword cannibalization is one of those SEO problems everyone knows exists – and almost everyone handles badly.

On ecommerce sites especially, overlap is inevitable:

  • Category pages vs product pages
  • Festival collections vs themed products
  • Multiple products answering the same intent with slightly different framing

The usual process looks like this:

  • Open 30–100 URLs
  • Skim content manually
  • Argue about “intent”
  • Guess which page Google might prefer
  • Make changes that are hard to justify later

The problem isn’t effort.
The problem is subjectivity.

I wanted a way to answer one question cleanly:

Which pages are actually competing in meaning — not just keywords?

And I wanted it to be:

  • Free
  • Local
  • Reproducible
  • Explainable

No paid SEO tools. No APIs. No eyeballing.


The constraints (intentional, not accidental)

This experiment was done on a real WordPress ecommerce site with 100+ product and category pages.

I deliberately restricted myself to:

  • ❌ No paid SEO tools
  • ❌ No Screaming Frog license hacks
  • ❌ No cloud APIs or embeddings services
  • ❌ No “trust me, this feels similar” logic

Everything runs:

  • Locally
  • With open-source libraries
  • In a way that anyone on the team could repeat

The core idea: treat cannibalization as a similarity problem

Instead of asking:

“Which page should rank for this keyword?”

I reframed the problem as:

“Which pages are semantically similar enough that Google could treat them as the same answer?”

That shift changes everything.

Modern search engines don’t evaluate pages by keywords alone.
They evaluate meaning.

So that’s what I measured.


Step 1: Extract the right content (not raw HTML)

From the crawl data, I did not use raw body text as-is.

Each page was normalized into a single column called:

Final_Content

This included:

  • Page title
  • Primary headings
  • Core descriptive content
  • Product descriptions (for PDPs)

What I intentionally ignored:

  • Navigation
  • Boilerplate
  • Cart / checkout
  • Filters and pagination

The goal wasn’t cleanliness for humans – it was signal clarity for embeddings.
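A minimal sketch of this normalization step, assuming a crawl export with `Title`, `H1`, and `Body_Text` columns (your crawler's column names will differ):

```python
import pandas as pd

# Toy crawl export; a real one comes from your crawler's CSV.
crawl = pd.DataFrame({
    "URL": ["/butterfly-themed-cake/"],
    "Title": ["Butterfly Themed Cake"],
    "H1": ["Butterfly Cake"],
    "Body_Text": ["A pastel butterfly cake for birthdays."],
})

def build_final_content(row):
    # Join only the signal-bearing fields. Navigation, boilerplate,
    # cart/checkout and filter text are never pulled into these columns.
    parts = [row["Title"], row["H1"], row["Body_Text"]]
    return " ".join(p.strip() for p in parts if isinstance(p, str) and p.strip())

crawl["Final_Content"] = crawl.apply(build_final_content, axis=1)
print(crawl[["URL", "Final_Content"]])
```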


Step 2: Focus only on Product pages

For similarity analysis, I filtered the dataset to:

Page_Type = Product

Why?

Because:

  • Product ↔ Product overlap is where cannibalization hurts most
  • Category pages need different handling (more on that later)

This left me with a clean products.csv containing:

  • URL
  • Final_Content

Nothing else.
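The filtering itself is one line of pandas; this sketch assumes the `Page_Type` and `Final_Content` columns described above:

```python
import pandas as pd

# Toy normalized crawl; the real one has 100+ rows.
pages = pd.DataFrame({
    "URL": ["/cakes/", "/unicorn-themed-cake/", "/butterfly-themed-cake/"],
    "Page_Type": ["Category", "Product", "Product"],
    "Final_Content": ["All cakes", "Unicorn cake copy", "Butterfly cake copy"],
})

# Keep product pages only, and only the two columns the similarity step needs.
products = pages.loc[pages["Page_Type"] == "Product", ["URL", "Final_Content"]]
products.to_csv("products.csv", index=False)
```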


Step 3: Generate semantic embeddings (locally)


I used a free, local embedding model:

all-MiniLM-L6-v2

Why this model?

  • Lightweight
  • Fast
  • Proven for semantic similarity tasks
  • Runs fully offline once downloaded

Using Python, each product’s Final_Content was converted into a vector representation of its meaning.

No keywords.
No rules.
Just semantics.

The full script is intentionally simple – the value is in how the output is interpreted.

Step 4: Measure similarity between every product page

With embeddings generated, I calculated cosine similarity between every pair of product pages.

The output was a table like this:

URL 1                      URL 2                      Similarity
/butterfly-themed-cake/    /unicorn-themed-cake/      0.91
/janmashtami-cake/         /krishna-themed-cake/      0.94

This immediately surfaced patterns that are invisible in keyword tools.
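On L2-normalized embeddings, the whole pairwise matrix is one matrix product. A sketch with toy vectors standing in for the real embeddings:

```python
import numpy as np

urls = ["/butterfly-themed-cake/", "/unicorn-themed-cake/", "/chocolate-cake/"]

# Toy 3-dimensional stand-ins for the real 384-dimensional embeddings.
emb = np.array([
    [0.9, 0.1, 0.4],
    [0.8, 0.2, 0.5],
    [0.1, 0.9, 0.2],
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# For unit vectors, the dot product IS the cosine similarity.
sim = emb @ emb.T

# Flatten the upper triangle into (URL 1, URL 2, Similarity) rows.
pairs = [
    (urls[i], urls[j], round(float(sim[i, j]), 2))
    for i in range(len(urls))
    for j in range(i + 1, len(urls))
]
for row in pairs:
    print(row)
```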


Step 5: Turn numbers into decisions (RED / YELLOW / WATCH)

Raw similarity scores aren’t useful by themselves.
So I introduced action bands inside Google Sheets.

RED     ≥ 0.90
YELLOW  0.85 – 0.89
WATCH   0.80 – 0.84
SAFE    < 0.80

This single step changed the entire workflow.

Instead of asking “what should we do?”, the data now said:

  • 🚨 These pages are indistinguishable in intent
  • ⚠️ These pages overlap but can be differentiated
  • 👀 These pages are adjacent – monitor, don’t touch

No debates.
No opinions.
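The banding logic is trivial to encode, which is exactly the point – the same thresholds apply every time:

```python
def action_band(score: float) -> str:
    """Map a cosine similarity score to the article's action bands."""
    if score >= 0.90:
        return "RED"      # indistinguishable in intent
    if score >= 0.85:
        return "YELLOW"   # overlapping, but can be differentiated
    if score >= 0.80:
        return "WATCH"    # adjacent – monitor, don't touch
    return "SAFE"

for s in (0.94, 0.87, 0.82, 0.61):
    print(s, action_band(s))
```

In Google Sheets the same bands are a nested IF or a conditional-formatting rule; in Python they become a column you can filter and sort.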


What the data revealed (this surprised me)

1. Category pages often compete with their own products

Examples like:

/janmashtami-cakes/
/janmashtami-cakes/janmashtami-themed-cake/

Semantic similarity: 0.90+

This isn’t a mistake – it’s a role clarity issue.

The fix wasn’t deletion or redirects.
It was:

  • Category pages → act as hubs
  • Product pages → act as destinations

Once that distinction is clear in content, overlap stops being harmful.


2. Not all similarity is bad

Some YELLOW overlaps were healthy.

For example:

  • Anniversary cakes vs birthday cakes
  • Princess cakes vs unicorn cakes

These serve adjacent but distinct emotional use-cases.

The right action wasn’t consolidation – it was intent framing, especially in the first 150–200 words.


3. WATCH pages should often be left alone

This was the hardest mindset shift.

Pages in the 0.80–0.84 range often:

  • Share visual language
  • Share ingredients
  • Serve different buyers

Touching these prematurely would likely cause more harm than good.

Sometimes the correct SEO action is no action.


How this changed my cannibalization workflow

Before:

  • Manual audits
  • Endless reviews
  • Subjective decisions
  • Hard to justify changes later

After:

  • Deterministic system
  • Clear thresholds
  • Explainable actions
  • Easy to revisit as the site grows

Most importantly, this approach scales.

Add 50 new products?
→ Rerun the script.
→ Re-evaluate clusters.
→ Make confident decisions.


What I’d improve next time

  • Cluster pages automatically (instead of pairwise review)
  • Layer in internal link signals
  • Track post-change impact by cluster, not page
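The clustering upgrade doesn't need anything fancy: treat every pair above a threshold as an edge and group connected URLs. A sketch using a tiny union-find (this is my illustration of the idea, not a script from the original workflow):

```python
def cluster_pairs(pairs, threshold=0.90):
    """Group URLs connected by similarity >= threshold into clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in pairs:
        if score >= threshold:
            parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for url in parent:
        clusters.setdefault(find(url), set()).add(url)
    return [sorted(c) for c in clusters.values()]

pairs = [
    ("/janmashtami-cake/", "/krishna-themed-cake/", 0.94),
    ("/butterfly-themed-cake/", "/unicorn-themed-cake/", 0.91),
    ("/princess-cake/", "/unicorn-themed-cake/", 0.82),  # below threshold
]
print(cluster_pairs(pairs))
```

Each resulting cluster can then be reviewed (and tracked) as one unit instead of dozens of pairwise rows.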

But even in its current form, this workflow is already a massive upgrade over traditional audits.


Who this approach is for

This is especially useful if you:

  • Run ecommerce or large content sites
  • Struggle with category vs product overlap
  • Want automation that improves decision quality, not just speed
  • Are experimenting with semantic SEO workflows

Final thought

Keyword cannibalization isn’t really an SEO problem.

It’s a decision problem.

Once you stop guessing – and start measuring meaning – everything becomes calmer, clearer, and more defensible.

If you’re experimenting with similar workflows or thinking about automation in SEO, feel free to reach out. I’m refining this into a repeatable system and happy to exchange notes.

Conclusion

Follow me on Medium, X, and LinkedIn for more practical guides and deep dives into Python, AI, and SEO. I share fresh tips every week that can save you time and boost your results.

Got questions or ideas? Drop a comment – I love hearing from readers and sharing insights.

And don’t forget to share this post with your network if you think it’ll help them too!
