Feature Selection for Imbalanced Datasets Using Pearson Distance and KL Divergence
Machine learning models often struggle with highly imbalanced datasets because they overfit the dominant class and miss the minority signals that matter most. This article introduces a lightweight, model-free feature screening method inspired by medical case-control studies. By directly comparing how each feature is distributed between groups using statistical distances like Pearson chi-squared and KL divergence, analysts can identify which variables truly separate outcomes such as churn vs. retention or fraud vs. normal activity. The technique is simple, transparent, computationally efficient, and provably reliable under certain statistical conditions, making it a powerful alternative to traditional model-based feature importance.
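The screening idea described above can be sketched in a few lines: bin each feature over its pooled range, form the class-conditional histograms, and score how far apart they are. The snippet below is a minimal illustration, not the article's exact method; the function names, bin count, and smoothing constant are assumptions, "Pearson distance" is implemented here as the chi-squared distance between the two discrete distributions, and numpy is assumed to be available.

```python
import numpy as np

def class_conditional_hists(x, y, bins=15):
    # Shared bin edges over the pooled feature so both classes are comparable.
    edges = np.histogram_bin_edges(x, bins=bins)
    p, _ = np.histogram(x[y == 1], bins=edges)  # minority / case class
    q, _ = np.histogram(x[y == 0], bins=edges)  # majority / control class
    eps = 1e-9  # smoothing so empty bins don't produce log(0) or divide-by-zero
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return p, q

def kl_divergence(p, q):
    # KL(p || q); note KL is asymmetric, so the direction is a modeling choice.
    return float(np.sum(p * np.log(p / q)))

def pearson_chi2_distance(p, q):
    # Pearson chi-squared distance between two discrete distributions.
    return float(np.sum((p - q) ** 2 / q))

# Toy imbalanced data: ~5% minority class, one informative feature, one noise feature.
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.05).astype(int)
informative = rng.normal(loc=2.0 * y, scale=1.0)  # mean shifts with the class
noise = rng.normal(size=5000)                     # no class signal

for name, x in [("informative", informative), ("noise", noise)]:
    p, q = class_conditional_hists(x, y)
    print(f"{name}: KL={kl_divergence(p, q):.3f}, chi2={pearson_chi2_distance(p, q):.3f}")
```

Features whose class-conditional distributions barely differ score near zero on both measures, while genuinely separating features score high, which gives a simple ranking without fitting any model.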
Source: HackerNoon