Understanding Logistic Regression: A Beginner's Guide to Binary Classification

Introduction

In my previous blog post, we explored linear regression, a fundamental supervised learning algorithm for predicting continuous values. Today, we'll take the next logical step in our machine learning journey by diving into logistic regression, which, despite its name, is a powerful classification algorithm rather than a regression technique.

Logistic regression is one of the most widely used classification algorithms in data science, serving as a gateway to understanding more complex classification techniques. In this post, we'll explore how logistic regression works, its mathematical foundation, implementation in Python, and practical applications.

From Linear to Logistic: Understanding the Transition

While linear regression helps us predict continuous values (like house prices or temperatures), many real-world problems require predicting discrete categories or classes. For example:

  • Will a customer churn or stay?
  • Is an email spam or not spam?
  • Will a student pass or fail?

These are binary classification problems where the outcome belongs to one of two classes. This is where logistic regression shines.

The Sigmoid Function: The Heart of Logistic Regression

The key difference between linear and logistic regression lies in the transformation function. Linear regression uses a linear equation:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

Logistic regression applies the sigmoid function (also called the logistic function) to this linear equation:

p(x) = 1 / (1 + e^-(β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ))

Or simplified as:

p(x) = 1 / (1 + e^-z)

Where z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

The sigmoid function transforms any real-valued number into a value between 0 and 1, which we can interpret as a probability. This is perfect for binary classification since we can set a threshold (typically 0.5) to decide which class a prediction belongs to.
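
To make this concrete, here is a minimal sketch of the sigmoid function using NumPy (the input values and the 0.5 threshold below are purely illustrative):

import numpy as np

def sigmoid(z):
    """Map any real-valued input to a value in the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Illustrative values of the linear combination z = β₀ + β₁x₁ + ... + βₙxₙ
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)
print(probabilities)         # approximately [0.018 0.269 0.5 0.731 0.982]
print(probabilities >= 0.5)  # [False False True True True] -> class predictions at a 0.5 threshold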

Mathematical Foundation of Logistic Regression

The Odds Ratio and Log-Odds

To understand logistic regression more deeply, we need to introduce the concept of odds:

odds = p/(1-p)

Where p is the probability of the positive class. Taking the natural logarithm of the odds gives us the log-odds or logit:

log-odds = ln(p/(1-p)) = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

This transformation allows us to model a linear relationship between our predictors and the log-odds of the positive class.
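
To see the relationship concretely, here is a small sketch that converts a few made-up probabilities to odds and log-odds, and then inverts the logit to recover the probabilities:

import numpy as np

p = np.array([0.10, 0.50, 0.80, 0.95])  # illustrative probabilities of the positive class
odds = p / (1 - p)                      # [0.111, 1.0, 4.0, 19.0]
log_odds = np.log(odds)                 # [-2.197, 0.0, 1.386, 2.944]

# Applying the sigmoid function to the log-odds recovers the original probabilities
recovered_p = 1 / (1 + np.exp(-log_odds))
print(recovered_p)  # [0.10, 0.50, 0.80, 0.95]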

Maximum Likelihood Estimation

Unlike linear regression, which uses ordinary least squares, logistic regression uses maximum likelihood estimation (MLE) to find the best-fitting coefficients. MLE aims to find the model parameters that maximize the likelihood of observing the data given those parameters.

The likelihood function for logistic regression is:

L(β) = ∏ᵢ p(xᵢ)^yᵢ · (1 - p(xᵢ))^(1-yᵢ)

Where yᵢ is the actual class (0 or 1) and p(xᵢ) is the predicted probability.
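
In practice, maximizing the likelihood is done by minimizing the negative log-likelihood (also called log loss or binary cross-entropy), which optimizers handle more easily. A minimal sketch of that computation, using made-up labels and predicted probabilities, looks like this:

import numpy as np

y = np.array([1, 0, 1, 1, 0])            # actual classes yᵢ
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted probabilities p(xᵢ)

# Negative log-likelihood: -Σ [yᵢ·ln p(xᵢ) + (1 - yᵢ)·ln(1 - p(xᵢ))]
neg_log_likelihood = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(neg_log_likelihood)  # lower values mean the predictions fit the observed classes better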

Implementing Logistic Regression in Python

Let's implement a simple logistic regression model using scikit-learn:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

# Sample dataset: predicting diabetes based on medical predictors
# (Pima Indians Diabetes dataset; outcome is 1 for diabetic, 0 for non-diabetic)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
column_names = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness', 
                'insulin', 'bmi', 'diabetes_pedigree', 'age', 'outcome']
data = pd.read_csv(url, names=column_names)

# Split features and target
X = data.drop('outcome', axis=1)
y = data['outcome']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_prob = model.predict_proba(X_test_scaled)[:, 1]

# Print model coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
})
print("Model Coefficients:")
print(coefficients.sort_values(by='Coefficient', ascending=False))

Evaluating the Logistic Regression Model

Unlike linear regression, which we evaluate with metrics like MSE or R², classification models require different evaluation metrics:

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='blue', lw=2, 
         label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

Key Evaluation Metrics for Classification:

  1. Accuracy: The proportion of correct predictions.
  2. Precision: The proportion of positive identifications that were actually correct.
  3. Recall: The proportion of actual positives that were correctly identified.
  4. F1-Score: The harmonic mean of precision and recall.
  5. ROC Curve and AUC: A graph showing the performance of a classification model at all threshold settings, and the area under this curve.
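
As a sanity check on how these metrics relate to the confusion matrix, here is a rough sketch that derives accuracy, precision, recall, and F1 from the counts in the conf_matrix computed above (it assumes scikit-learn's 2x2 layout for 0/1 labels, with the negative class in the first row):

# conf_matrix has the layout [[TN, FP], [FN, TP]] for binary labels 0/1
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, "
      f"Recall: {recall:.3f}, F1: {f1:.3f}")

These values should match the classification_report output for the positive class.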

Interpreting Logistic Regression Coefficients

One of the advantages of logistic regression is interpretability. The coefficients tell us:

  • Direction: Positive coefficients increase the probability of the positive class, while negative coefficients decrease it.
  • Magnitude: Larger absolute values indicate stronger effects.
  • Odds Ratio: We can convert coefficients to odds ratios using e^coefficient, which tells us how the odds of the positive outcome change when the predictor increases by one unit.

For example, if the coefficient for 'glucose' is 0.5, the odds ratio is e^0.5 ≈ 1.65, meaning the odds of diabetes increase by about 65% for each one-unit increase in glucose, assuming all other variables remain constant. (Since we standardized the features above, a "unit" here corresponds to one standard deviation of glucose.)
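
Continuing with the model fitted earlier, a quick way to inspect all coefficients as odds ratios is to exponentiate them:

import numpy as np

odds_ratios = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0],
    'Odds Ratio': np.exp(model.coef_[0])  # e^coefficient
})
print(odds_ratios.sort_values(by='Odds Ratio', ascending=False))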

Regularization in Logistic Regression

Just like linear regression, logistic regression can suffer from overfitting. To combat this, we can use regularization:

  • L1 Regularization (Lasso): Adds a penalty equal to the absolute value of coefficients, can shrink some coefficients to zero.
  • L2 Regularization (Ridge): Adds a penalty equal to the square of coefficients, shrinks all coefficients but rarely to zero.

In scikit-learn, we can specify the regularization type:

# L1 Regularization
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)

# L2 Regularization
model_l2 = LogisticRegression(penalty='l2', C=0.1)

The parameter C is the inverse of the regularization strength (smaller values = stronger regularization).
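
As a rough illustration of the difference, you can fit both models on the scaled diabetes features from earlier and count how many coefficients L1 drives to exactly zero (C=0.1 here is only an example setting, not a tuned value):

model_l1.fit(X_train_scaled, y_train)
model_l2.fit(X_train_scaled, y_train)

print("Coefficients set to zero by L1:", (model_l1.coef_[0] == 0).sum())
print("Coefficients set to zero by L2:", (model_l2.coef_[0] == 0).sum())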

Multiclass Logistic Regression

While we've focused on binary classification, logistic regression can be extended to multiclass problems using strategies like:

  1. One-vs-Rest (OvR): Train n binary classifiers, one for each class.
  2. Multinomial Logistic Regression: A direct extension that uses the softmax function instead of the sigmoid function.
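
As a brief sketch of the multiclass case, here is how you might fit a multinomial logistic regression on scikit-learn's built-in three-class iris dataset (the dataset and settings are illustrative; in recent scikit-learn versions the multinomial formulation is already the default for multiclass targets, and the multi_class argument may emit a deprecation warning):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

# 'multinomial' uses the softmax formulation; 'ovr' would train one-vs-rest binary classifiers
multi_model = LogisticRegression(multi_class='multinomial', max_iter=1000)
multi_model.fit(X_iris, y_iris)

print(multi_model.predict_proba(X_iris[:3]))  # one probability per class; each row sums to 1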

Advantages and Limitations of Logistic Regression

Advantages:

  • Interpretable coefficients
  • Works well with linearly separable classes
  • Provides probability scores
  • Computationally efficient
  • Less prone to overfitting with small feature sets

Limitations:

  • Assumes linear decision boundary
  • May underperform with complex relationships
  • Often requires manual feature engineering to capture non-linear relationships or feature interactions
  • Struggles with imbalanced datasets without adjustments

Practical Tips for Using Logistic Regression

  1. Feature Scaling: Always scale your features when using regularization.
  2. Handle Imbalanced Data: Use techniques like SMOTE or class weighting (see the sketch after this list).
  3. Feature Selection: Remove highly correlated features to improve model stability.
  4. Check Assumptions: Test for multicollinearity using VIF (Variance Inflation Factor).
  5. Tune Hyperparameters: Optimize the regularization strength and solver.
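
For example, class weighting is built into scikit-learn and requires only a single argument, while SMOTE comes from the separate imbalanced-learn package. A minimal sketch of the weighting approach, reusing the diabetes split from earlier:

# class_weight='balanced' reweights classes inversely proportional to their frequencies
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
weighted_model.fit(X_train_scaled, y_train)

print(classification_report(y_test, weighted_model.predict(X_test_scaled)))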

Conclusion

Logistic regression serves as an excellent introduction to classification algorithms. Despite its simplicity, it remains a powerful tool in a data scientist's arsenal, especially for problems where interpretability is crucial.

In our next post, we'll explore more advanced classification techniques like Support Vector Machines or Decision Trees, building on the foundation we've established with logistic regression.

What classification problems are you working on? Have you used logistic regression in your projects? Share your experiences in the comments below!


Further Reading

  • The mathematical proof of logistic regression
  • Implementing logistic regression from scratch
  • Handling imbalanced datasets in classification
  • Comparing logistic regression with other classification algorithms

Happy modeling!
