Normalizer and TF-IDF in NLP

3 minute read

Introduction

In text analysis and Natural Language Processing (NLP), feature scaling plays a crucial role in improving model performance. While StandardScaler standardizes data by subtracting the mean and scaling to unit variance, Normalizer ensures that each sample (document) has unit norm. This is particularly useful when working with cosine similarity and TF-IDF, where only the direction of vectors matters, not their magnitude.

In this post, we will explore how Normalizer works, compare it with TF-IDF, and demonstrate practical applications using Python and scikit-learn.

1. Understanding Normalizer

1.1 When to Use Normalizer

Normalizer is useful when:

  • The magnitude of features is not important, only their direction.
  • You are using cosine similarity (e.g., in document retrieval or clustering); see the sketch after this list.
  • You work with TF-IDF or kernel methods where normalization improves results.
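
To see why this pairs naturally with cosine similarity, here is a minimal sketch (the array values are made up for illustration): after L2 normalization, cosine similarity reduces to a plain dot product, and two vectors pointing in the same direction compare as identical regardless of magnitude.

from sklearn.preprocessing import Normalizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

A = np.array([[3.0, 4.0], [6.0, 8.0]])  # second row = first row scaled by 2

# Cosine similarity ignores magnitude: both rows point the same way
print(cosine_similarity(A))   # [[1. 1.] [1. 1.]]

# After L2 normalization, a plain dot product gives the same result
A_unit = Normalizer().fit_transform(A)
print(A_unit @ A_unit.T)      # [[1. 1.] [1. 1.]]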

If you remember from this blog post, a very popular choice of transformer is the StandardScaler. However, it is important to note that there is a crucial distinction between the well-known StandardScaler and the Normalizer, summarized in the table below.

| Feature | StandardScaler | Normalizer |
|---|---|---|
| What it does | Standardizes features (zero mean, unit variance) | Normalizes each sample to unit norm |
| Applies to | Columns (features) | Rows (samples) |
| Effect | Adjusts feature distributions | Adjusts vector magnitudes |
| When to use | When features have different scales | When rows represent vectors that should have the same magnitude |

Normalizer therefore transforms each row (sample) in the dataset to have unit norm; with the default L2 norm, this means the sum of squared elements equals 1. The three common norms are:

  • L1 norm: Scales values so their absolute sum is 1.
  • L2 norm: Scales values so their squared sum is 1 (default).
  • Max norm: Scales values by the maximum absolute value.

from sklearn.preprocessing import Normalizer
import numpy as np

X = np.array([[4, 1, 2], [1, 3, 9], [5, 6, 7]])

# Default is the L2 norm: each row is divided by its Euclidean length,
# e.g. [4, 1, 2] / sqrt(4**2 + 1**2 + 2**2) ≈ [0.873, 0.218, 0.436]
norm = Normalizer()
X_normalized = norm.fit_transform(X)
print(X_normalized)
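
The other two norms from the list above only require changing the norm argument; a quick sketch reusing the same X and import:

# L1 norm: each row's absolute values now sum to 1
print(Normalizer(norm="l1").fit_transform(X))

# Max norm: each row is divided by its largest absolute value
print(Normalizer(norm="max").fit_transform(X))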

2. Comparing Normalizer and TF-IDF

2.1 Term Frequency (TF)

Term Frequency (TF) simply counts how often each word appears in a document; it does not consider how informative a word is across the rest of the corpus.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Machine learning is amazing", "Deep learning is part of machine learning"]

# Build the vocabulary and count each word's occurrences per document
vectorizer = CountVectorizer()
X_tf = vectorizer.fit_transform(corpus)
print(X_tf.toarray())
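
To make the count matrix readable, you can map each column back to its word, continuing from the snippet above (get_feature_names_out requires scikit-learn 1.0 or newer):

# Column order of the count matrix (alphabetical by default)
print(vectorizer.get_feature_names_out())
# ['amazing' 'deep' 'is' 'learning' 'machine' 'of' 'part']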

2.2 TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF rescales raw term counts by down-weighting words that appear in many documents and emphasizing rare but meaningful ones.

from sklearn.feature_extraction.text import TfidfVectorizer

# Reweight the counts by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print(X_tfidf.toarray())
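
To see the reweighting at work, pair each word with its weight in the first document, continuing from the snippet above. A word unique to one document, such as "amazing", receives a higher weight than words shared by both documents, such as "learning" or "machine":

# 'amazing' appears only in the first document, so it scores higher
# than 'is', 'learning', and 'machine', which appear in both
for word, weight in zip(tfidf_vectorizer.get_feature_names_out(), X_tfidf.toarray()[0]):
    print(f"{word}: {weight:.3f}")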

2.3 Why Use Normalizer with TF-IDF?

When dealing with text data, raw TF-IDF vectors can have very different magnitudes (longer documents produce larger values), while cosine similarity only considers direction. Using Normalizer ensures:

  • The length of vectors does not influence similarity calculations.
  • Documents with similar word distributions are treated similarly, even if they differ in length.

from sklearn.pipeline import make_pipeline

# Chain vectorization and normalization into a single transformer
pipeline = make_pipeline(TfidfVectorizer(), Normalizer())
X_tfidf_normalized = pipeline.fit_transform(corpus)
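
As a quick sanity check, the sketch below (continuing from the pipeline above) confirms that every document vector has unit L2 norm, which makes cosine similarity a plain dot product. Note that TfidfVectorizer already L2-normalizes its output by default (norm='l2'), so the extra Normalizer here is a harmless safeguard; it becomes essential if you set norm=None or add steps that change vector magnitudes.

import numpy as np

# Each row (document vector) has length 1 after normalization
print(np.linalg.norm(X_tfidf_normalized.toarray(), axis=1))  # [1. 1.]

# With unit-norm rows, the dot-product matrix is the cosine-similarity matrix
print((X_tfidf_normalized @ X_tfidf_normalized.T).toarray())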

3. Conclusion

  • Use TF when raw frequency matters.
  • Use TF-IDF when importance across documents is needed.
  • Use Normalizer when direction matters more than magnitude, especially with cosine similarity.