3. Support Vector Machines (SVM)
Support Vector Machines are a staple of data science interviews because they combine geometry, optimization, and high-dimensional reasoning. Unlike models that focus on probabilities, SVMs focus on separation.
The Core Intuition: Maximum Margin Thinking
At its simplest, a classifier draws a boundary between classes. However, many lines can separate the same data. SVMs choose the line that provides the most "breathing room"—known as the Margin.
- The Margin: The distance between the decision boundary and the closest data points. A larger margin leads to better generalization and less sensitivity to noise.
- Support Vectors: These are the critical data points that lie closest to the boundary. They "support" the decision surface; if you move them, the boundary moves. Points far away from the margin have no influence on the model.
Hard Margin vs. Soft Margin
In the real world, data is rarely perfectly separable.
- Hard Margin SVM: Assumes the data is perfectly separable. It fails if even one point is misclassified or falls within the margin.
- Soft Margin SVM: Relaxed version that allows some points to violate the margin or be misclassified. It uses Slack Variables to measure how much a point violates the ideal separation.
The Regularization Parameter: \(C\)
The parameter \(C\) acts as a "strictness" lever. It controls the tradeoff between maximizing the margin and minimizing classification errors.
- Large \(C\) (High Penalty): The model is a "perfectionist." It shrinks the margin to classify every training point correctly.
-
Result: Low Bias, High Variance (Risk of Overfitting).
-
Small \(C\) (Low Penalty): The model is more forgiving. It allows some misclassifications to maintain a wider, smoother margin.
- Result: High Bias, Low Variance (Risk of Underfitting).
The Kernel Trick: Handling Non-Linear Data
When data cannot be separated by a straight line, we project it into a higher-dimensional space where a linear boundary can separate it.
The Kernel Trick allows us to calculate the relationship between points in this high-dimensional space without actually performing the expensive transformation.
Common Kernels
- Linear Kernel: Best for high-dimensional data like text classification.
- Polynomial Kernel: Looks for feature interactions.
- RBF (Gaussian) Kernel: The most popular choice. It creates localized, smooth boundaries based on distance.
Gamma: The Influence Parameter
Gamma (\(\gamma\)) is specific to non-linear kernels (like RBF). It defines how far the influence of a single training point reaches.
- High Gamma: Points have a "short reach." The boundary only cares about points very close to it, leading to complex, wiggly boundaries. (Overfitting)
- Low Gamma: Points have a "long reach." The boundary considers points far away, leading to smoother, more global shapes. (Underfitting)
Essential Best Practices
1. Feature Scaling
Scaling is NOT optional. SVMs calculate distances between points. If one feature (e.g., Salary) is in the thousands and another (e.g., Age) is under 100, the Salary feature will completely dominate the model. Always use StandardScaler.
2. Handling Outliers
Soft margin SVMs handle outliers via the \(C\) parameter. If your data is very noisy, a smaller \(C\) will prevent the model from chasing outliers and distorting the boundary.
3. High-Dimensional Data
SVMs excel here because the complexity of the model depends on the number of Support Vectors, not the number of features. This makes them highly effective for genomics or text data.
SVM vs. Logistic Regression
| Feature | Logistic Regression | SVM |
|---|---|---|
| Output | Probabilities (\(0\) to \(1\)) | Class Labels / Distance to Boundary |
| Influence | Every data point influences the weights | Only Support Vectors influence the boundary |
| Sensitivity | Sensitive to outliers | Robust to outliers (if \(C\) is tuned) |
| Best Case | Well-calibrated probabilities needed | Clear margin of separation exists |
Multiclass Classification
Since SVMs are binary by nature, they handle multiple classes using two main strategies:
- One-vs-Rest (OvR): Trains one classifier per class against all others.
- One-vs-One (OvO): Trains a classifier for every possible pair of classes.