Cosine Similarity, Explained for Everyone


Introduction

If you have heard people say “we compare texts using cosine similarity” and thought “cool… but what does that actually mean?” then this post is for you.

In NLP, we often convert text into vectors (basically lists of numbers). Once we do that, cosine similarity lets us measure how similar two texts are by looking at the angle between those vectors.

The key idea: cosine similarity cares about direction, not size.

Cosine similarity measures how similar two things are by checking whether their vectors point in the same direction, not how big those vectors are.

That’s why it is so common in text retrieval and clustering: a long document should not automatically look “more similar” just because it contains more words.

What’s a vector, in plain English?

A vector is just a list of numbers. We use vectors because computers do not understand words directly but… they understand numbers. So we convert sentences into numbers that capture what the text is about.

A tiny text example

Let’s pretend our entire vocabulary has only 4 words: pizza, sushi, soccer, finance. Now take two sentences:

  • Sentence A: “pizza pizza soccer”
  • Sentence B: “pizza sushi”

If we represent each sentence by counting the words, then:

  • A → [2, 0, 1, 0]
  • B → [1, 1, 0, 0]

Each position records “how much of that word is in the sentence”… so the vector is basically a content profile.
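Counting by hand is easy to sketch in plain Python (the vocabulary and sentences are just the toy example above):

```python
# Toy vocabulary from the example above
vocab = ["pizza", "sushi", "soccer", "finance"]

def count_vector(sentence):
    """Count how often each vocabulary word appears in the sentence."""
    words = sentence.split()
    return [words.count(w) for w in vocab]

a = count_vector("pizza pizza soccer")  # → [2, 0, 1, 0]
b = count_vector("pizza sushi")         # → [1, 1, 0, 0]
```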

With TF-IDF, you don’t just count words: you give higher weight to rare words and lower weight to very common ones. But the outcome is the same: you get a vector.

So what does cosine similarity measure?

Picture each vector as an arrow starting at the origin:

  • the direction of the arrow ≈ what it talks about (topic)
  • the length of the arrow ≈ how much stuff is in it (document length, repetitions)

Cosine similarity looks only at the angle between the arrows.

The intuition

  • If two arrows point in the same direction → they are very similar
  • If they are 90° apart → they are unrelated
  • If they point in opposite directions → they are “opposites” (this cannot happen with non-negative bag-of-words or TF-IDF vectors, but it can with embeddings)
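Those three cases are easy to check numerically. A quick sketch with NumPy, using simple 2-D arrows:

```python
import numpy as np

def cos(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same     = cos(np.array([1.0, 0.0]), np.array([2.0, 0.0]))   # same direction → 1.0
right    = cos(np.array([1.0, 0.0]), np.array([0.0, 3.0]))   # 90° apart → 0.0
opposite = cos(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))  # opposite → -1.0
```

Notice that the lengths (2, 3, …) are irrelevant: only the angle matters.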

Why “cosine”? Because it’s literally the cosine of the angle

The formula is:

\[\cos(A, B) = \frac{A \cdot B}{\|A\| \cdot \|B\|}\]

But you do not need to love the math. Here’s what it means:

  • A · B (dot product) gets bigger when both vectors are strong in the same dimensions
  • dividing by ‖A‖ and ‖B‖ removes the effect of length
  • what remains is basically: how aligned they are
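Plugging the toy sentence vectors from earlier into the formula, step by step (a plain-Python sketch):

```python
import math

a = [2, 0, 1, 0]  # "pizza pizza soccer"
b = [1, 1, 0, 0]  # "pizza sushi"

dot = sum(x * y for x, y in zip(a, b))     # only the shared "pizza" dimension contributes: 2*1 = 2
norm_a = math.sqrt(sum(x * x for x in a))  # sqrt(5)
norm_b = math.sqrt(sum(x * x for x in b))  # sqrt(2)

cos_ab = dot / (norm_a * norm_b)           # 2 / sqrt(10) ≈ 0.632

# Doubling A (a "longer" document with the same content profile) leaves the score unchanged
a2 = [x * 2 for x in a]
cos_a2b = sum(x * y for x, y in zip(a2, b)) / (
    math.sqrt(sum(x * x for x in a2)) * norm_b
)
```

Note how doubling A does not change the score: that is the “direction, not size” property in action.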

A fast mental model

Cosine similarity is like asking:

“Do these two texts have the same shape of content, even if one is longer?”

Think of two playlists:

  • one playlist has 10 songs
  • the other has 100 songs

Different sizes. But if both are 80% chill jazz, they feel similar. Cosine similarity asks:

“Do these two playlists have the same vibe / composition?”
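The playlist analogy works as arithmetic, too. A sketch with made-up song counts (8 jazz / 2 pop versus 80 jazz / 20 pop, the same 80% mix):

```python
import numpy as np

small = np.array([8.0, 2.0])    # 10 songs: 8 jazz, 2 pop
big   = np.array([80.0, 20.0])  # 100 songs: same 80/20 mix

# Same direction, different length → cosine similarity ≈ 1.0
cos = small @ big / (np.linalg.norm(small) * np.linalg.norm(big))
```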

Cosine similarity with text (TF vs TF-IDF)

To compute cosine similarity on text, we first need to transform text into vectors.

Term Frequency (TF)

TF simply counts how often a word appears in a document.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "machine learning is fun",
    "machine learning is a fun subject",
    "machine learning is fun fun",
    "pizza is great in milan",
    "milan is a famous soccer team",
]

X_tf = CountVectorizer(stop_words="english").fit_transform(docs)
S_tf = cosine_similarity(X_tf)

TF can work, but it tends to overemphasize repetition and common words.

TF-IDF (Term Frequency–Inverse Document Frequency)

TF-IDF down-weights common words and up-weights more informative ones.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
S_tfidf = cosine_similarity(X_tfidf)

Why normalization often shows up with cosine similarity

Cosine similarity already focuses on direction (it divides by vector length). So why do we care about Normalizer?

Two practical reasons:

  1. Efficiency/convenience: once vectors are L2-normalized, cosine similarity becomes just a dot product.
  2. Consistency in pipelines: some workflows explicitly normalize to make the “unit-length vectors” assumption obvious and reproducible.
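Point 1 is easy to verify numerically. A sketch with random row vectors (NumPy only; the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((3, 4))  # 3 "documents", 4 "features"

# L2-normalize each row to unit length
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

# For unit-length vectors, cosine similarity reduces to a plain dot product
S_dot = Xn @ Xn.T

# Compare with the explicit cosine formula on the original rows
norms = np.linalg.norm(X, axis=1)
S_cos = (X @ X.T) / np.outer(norms, norms)

assert np.allclose(S_dot, S_cos)
```

This is why search systems often store pre-normalized vectors: at query time, similarity is a single matrix multiplication.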

In scikit-learn, TF-IDF can already do L2 normalization internally (norm="l2" is the default), but it’s useful to know what is happening.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", norm=None),  # turn off built-in normalization
    Normalizer(norm="l2")                              # explicitly normalize rows
)

X_tfidf_norm = pipeline.fit_transform(docs)

Some Tips and Tricks

  • Stopwords: words like “is”, “a”, “the” can dominate counts unless you remove them or use TF-IDF.
  • Empty/zero vectors: if a document becomes empty after preprocessing, similarities can be undefined.
  • High similarity ≠ identical: two texts can be aligned (same topic) but still differ in details.
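The empty-vector pitfall is easy to reproduce: with a zero vector, the cosine formula literally divides zero by zero (the array values here are made up):

```python
import numpy as np

doc = np.array([1.0, 2.0])
empty = np.array([0.0, 0.0])  # e.g. a document whose words were all removed as stopwords

with np.errstate(invalid="ignore"):  # suppress the division warning
    cos = doc @ empty / (np.linalg.norm(doc) * np.linalg.norm(empty))

print(cos)  # nan — the similarity is undefined
```

Libraries handle this case differently (some return 0, some NaN), so it is worth filtering out empty documents before computing similarities.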