4. Decision Trees
Decision Trees are one of the most fundamental yet versatile algorithms in a machine learning toolkit. They are simple enough to explain to non-technical stakeholders, yet powerful enough to serve as the building blocks for state-of-the-art models like Random Forests and XGBoost.
At its heart, a decision tree breaks a complex problem into smaller, simpler ones by asking a series of yes/no questions.
Core Concept: How They Work
Each internal node in a tree represents a decision rule, each branch represents a possible outcome, and each leaf node represents a final prediction.
Unlike linear models, decision trees:
- Model nonlinear relationships naturally.
- Capture feature interactions (e.g., "If Age > 30 AND Income > 70k").
- Are non-parametric, meaning they don't assume a specific distribution (like a Normal distribution) for your data.
Assumptions of Decision Trees
While trees are flexible, they rely on a few implicit principles:
| Assumption | Description |
|---|---|
| Feature Relevance | Features must contain signal; trees can struggle if the data is mostly noise. |
| Axis-Aligned Splits | Splits happen on one feature at a time, creating "staircase" boundaries. |
| i.i.d. Samples | Data points should be independent and identically distributed. |
| Threshold-Based | Relationships are best captured through "greater than/less than" conditions. |
How a Tree is Built (Optimization)
The goal is to find splits that reduce impurity—making the resulting child nodes as "pure" as possible.
1. Splitting Criteria
- Entropy & Information Gain: Measures disorder. We pick the split that maximizes the reduction in entropy.
- Gini Impurity: Used by the CART algorithm. It measures the probability of misclassification.
- Variance Reduction: Used for Regression Trees to minimize the spread of continuous values in each leaf.
2. Preventing Overfitting
A tree that grows too deep will memorize the training data (High Variance). We control this through:
- Pre-pruning (Early Stopping): Setting
max_depth,min_samples_leaf, ormax_leaf_nodes. - Post-pruning (Cost Complexity Pruning): Growing a full tree and then removing branches that provide little predictive power relative to their complexity.
Evaluation Metrics
For Classification
- Accuracy: Best for balanced datasets.
- Precision/Recall/F1-Score: Crucial for imbalanced data (e.g., fraud detection).
- ROC-AUC: Measures the model's ability to distinguish between classes across different thresholds.
For Regression
- MSE (Mean Squared Error): Penalizes large errors heavily.
- MAE (Mean Absolute Error): More robust to outliers.
- \(R^2\) Score: Measures the proportion of variance explained by the model.
Practical Tips and Pitfalls
- Scale Invariance: One of the biggest perks! Trees do not require feature scaling (standardization or normalization).
- Categorical Encoding: Trees cannot handle strings directly. Use One-Hot Encoding for nominal data or Ordinal Encoding for ordered data.
- The Stability Issue: Single trees are unstable—small data changes can flip the entire tree structure.
-
Solution: Use Ensembles (Random Forests or Boosting) to average out this variance.
-
Imbalanced Data: Trees are biased toward the majority class. Use
class_weight='balanced'or resampling techniques like SMOTE. - Feature Importance: Trees naturally rank features based on how much they reduce impurity. This is a powerful tool for feature selection.