Multinomial Naive Bayes
Text Classification and Document Analysis
Handling Discrete Count Data
In the previous articles, we explored the foundations of Naive Bayes, its extension to continuous data with Gaussian Naive Bayes, and binary features with Bernoulli Naive Bayes. In this article, we'll examine Multinomial Naive Bayes, a variant designed for discrete count data and particularly well suited to text classification.
Why Multinomial Naive Bayes?
While Bernoulli Naive Bayes considers only feature presence/absence and Gaussian Naive Bayes handles continuous values, Multinomial Naive Bayes works with discrete count data, as the short sketch after this list illustrates. This makes it ideal for:
Text classification (word frequencies)
Document categorization
Sentiment analysis
Email spam detection
Topic modeling
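To make the contrast concrete, here is a minimal sketch (with two made-up documents) of how the same text looks as the count features Multinomial Naive Bayes consumes versus the binary presence features Bernoulli Naive Bayes would use:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free money free prizes now", "meeting notes for the team"]

# Count features (Multinomial NB input): "free" counts twice in the first doc
print(CountVectorizer().fit_transform(docs).toarray())

# Binary features (Bernoulli NB input): repetition information is discarded
print(CountVectorizer(binary=True).fit_transform(docs).toarray())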
The Mathematical Framework
The Multinomial Distribution
The core of this variant relies on the multinomial distribution, which models the probability of observing counts across multiple categories:
P(X|class) = (n! / (x₁! ⋯ xₖ!)) · ∏ pᵢ^xᵢ

Where:
X is a feature vector of counts
xᵢ is the count for feature i
pᵢ is the probability of feature i in the given class
n is the total count (∑xᵢ)
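For intuition, here is a small worked example with invented numbers: suppose a class assigns word probabilities p = (0.5, 0.3, 0.2) to a three-word vocabulary, and we observe the count vector x = (3, 1, 1) in a five-word document:

from scipy.stats import multinomial

# 5!/(3!·1!·1!) · 0.5³ · 0.3¹ · 0.2¹ = 20 · 0.0075 = 0.15
print(multinomial.pmf([3, 1, 1], n=5, p=[0.5, 0.3, 0.2]))  # 0.15

Note that the factorial term depends only on the counts, not on the class, so it cancels when comparing classes; implementations (including the one below) simply drop it.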
Text Classification Model
For document classification:
P(class|document) ∝ P(class) · ∏ P(word|class)^count(word)

Multiplying many small probabilities underflows floating-point arithmetic, so in practice we score each class in log space: log P(class) + Σ count(word) · log P(word|class).
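As a quick sanity check of the scoring rule, here is a tiny hypothetical example (the vocabulary, priors, and word probabilities are all invented):

import numpy as np

# Two-word vocabulary ["free", "meeting"] with invented class parameters
log_p_spam = np.log([0.8, 0.2])   # P(word | spam)
log_p_ham = np.log([0.1, 0.9])    # P(word | ham)
log_prior = np.log(0.5)           # equal priors for both classes

counts = np.array([3, 0])         # document: "free free free"

score_spam = log_prior + counts @ log_p_spam
score_ham = log_prior + counts @ log_p_ham
print("spam" if score_spam > score_ham else "ham")  # -> spam

With the scoring rule in hand, let's build a sophisticated Multinomial Naive Bayes classifier: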
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.validation import check_X_y, check_array
from scipy.sparse import issparse

class AdvancedMultinomialNB(BaseEstimator, ClassifierMixin):
    def __init__(self, alpha=1.0):
        """
        Initialize Multinomial Naive Bayes classifier

        Args:
            alpha (float): Smoothing parameter (Laplace smoothing)
        """
        self.alpha = alpha
        self.feature_log_prob_ = {}
        self.class_log_prior_ = {}

    def fit(self, X, y):
        """
        Fit the Multinomial Naive Bayes model

        Args:
            X: Feature matrix (document-term matrix)
            y: Target labels
        """
        X, y = check_X_y(X, y, accept_sparse='csr')
        self.classes_ = np.unique(y)
        n_samples, n_features = X.shape

        # Calculate class priors and feature probabilities
        for c in self.classes_:
            # Get samples for this class
            X_c = X[y == c]

            # Calculate class prior
            n_c = X_c.shape[0]
            self.class_log_prior_[c] = np.log((n_c + self.alpha) /
                                              (n_samples + len(self.classes_) * self.alpha))

            # Calculate feature probabilities
            if issparse(X_c):
                feature_counts = np.array(X_c.sum(axis=0))[0] + self.alpha
                total_counts = feature_counts.sum()
            else:
                feature_counts = np.sum(X_c, axis=0) + self.alpha
                total_counts = feature_counts.sum()

            self.feature_log_prob_[c] = np.log(feature_counts / total_counts)

        return self

    def predict_proba(self, X):
        """Calculate class probabilities for X"""
        X = check_array(X, accept_sparse='csr')

        # Calculate log probabilities
        log_probs = np.zeros((X.shape[0], len(self.classes_)))
        for i, c in enumerate(self.classes_):
            log_prob = self.class_log_prior_[c]
            if issparse(X):
                log_prob += X.dot(self.feature_log_prob_[c])
            else:
                log_prob += np.dot(X, self.feature_log_prob_[c])
            log_probs[:, i] = log_prob

        # Convert to probabilities
        probs = np.exp(log_probs - np.max(log_probs, axis=1)[:, np.newaxis])
        probs /= np.sum(probs, axis=1)[:, np.newaxis]
        return probs

    def predict(self, X):
        """Predict class labels for X"""
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

Real-World Application: News Classification
Let's implement a news article classifier:
def preprocess_text(text):
    """Basic text preprocessing"""
    import re
    # Convert to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text
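A quick check of the preprocessing on an arbitrary sample string:

print(preprocess_text("Breaking: Stocks SURGE 5%!"))  # -> "breaking stocks surge 5"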
class NewsClassifier:
    def __init__(self, alpha=1.0):
        self.vectorizer = CountVectorizer(
            max_features=5000,
            stop_words='english',
            preprocessor=preprocess_text
        )
        self.classifier = AdvancedMultinomialNB(alpha=alpha)

    def fit(self, texts, labels):
        """Train the classifier"""
        # Transform texts to document-term matrix
        X = self.vectorizer.fit_transform(texts)
        # Train classifier
        self.classifier.fit(X, labels)
        return self

    def predict(self, texts):
        """Predict categories for new texts"""
        X = self.vectorizer.transform(texts)
        return self.classifier.predict(X)

    def predict_proba(self, texts):
        """Get probability estimates"""
        X = self.vectorizer.transform(texts)
        return self.classifier.predict_proba(X)

    def explain_prediction(self, text):
        """Explain why a prediction was made"""
        # Get feature names
        features = self.vectorizer.get_feature_names_out()
        # Transform text
        X = self.vectorizer.transform([text])
        # Get prediction
        pred_class = self.predict([text])[0]
        # Get feature importances
        feature_probs = np.exp(self.classifier.feature_log_prob_[pred_class])
        # Get non-zero features in the text
        if issparse(X):
            present_features = X.tocoo()
            feature_indices = present_features.col
            feature_counts = present_features.data
        else:
            feature_indices = np.where(X[0] > 0)[0]
            feature_counts = X[0][feature_indices]
        # Sort by importance
        importance = feature_probs[feature_indices] * feature_counts
        sorted_idx = np.argsort(importance)[::-1]

        print(f"Predicted class: {pred_class}")
        print("\nTop contributing words:")
        for idx in sorted_idx[:10]:
            word = features[feature_indices[idx]]
            prob = feature_probs[feature_indices[idx]]
            count = feature_counts[idx]
            print(f"{word}: probability={prob:.3f}, count={count}")
def demonstrate_news_classification():
    # Sample news articles
    articles = [
        "The stock market reached record highs today as tech companies reported strong earnings",
        "Scientists discover new species of deep-sea creatures in Pacific Ocean exploration",
        "Local team wins championship in dramatic overtime victory",
        "New cryptocurrency regulations proposed by government agencies",
        "Breakthrough in renewable energy storage announced by researchers"
    ]
    categories = [
        "business",
        "science",
        "sports",
        "business",
        "science"
    ]

    # Create and train classifier
    classifier = NewsClassifier(alpha=1.0)
    classifier.fit(articles, categories)

    # Test with new article
    new_article = """
    Tech giants announced revolutionary AI models that could transform
    the industry, causing their stocks to surge in after-hours trading.
    """

    # Get and explain prediction
    classifier.explain_prediction(new_article)

Multinomial Naive Bayes proves exceptionally well-suited for text classification tasks. By effectively handling discrete count data, it leverages word frequencies to make accurate predictions. Our implementation demonstrates its efficiency in processing sparse document-term matrices, its intuitive probability calculations, and its ability to provide explainable predictions through feature importance analysis.
The news classification example highlights the practical application of Multinomial Naive Bayes in real-world scenarios. Its ability to not only classify but also explain the reasoning behind its decisions makes it a valuable tool, especially in domains where understanding the decision-making process is critical.

