Anomaly Detection

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

D.M. Hawkins (1980) Identification of Outliers

Anomaly detection is a powerful set of techniques that belongs in every data scientist's toolset.

What is Anomaly Detection?

Anomaly detection is the identification of unusual data points that deviate significantly from expected patterns or normal behavior.

These unusual observations can indicate:

Data quality issues: Missing values, measurement errors, or corrupted records
Interesting phenomena: Novel discoveries, rare events, or emerging trends
System problems: Equipment failures, security breaches, or process breakdowns
Different data processes: Points generated by mechanisms different from the majority

Understanding your domain is crucial for effective anomaly detection. The definition of “normal” depends entirely on context - what’s anomalous in financial transactions differs from what’s anomalous in network traffic or medical data.

What is an Anomaly Detector?

An anomaly detector is an algorithm that automatically identifies records as anomalies based on statistical, distance-based, or machine learning techniques.

The choice of detector depends on your data characteristics, domain knowledge, computational constraints, and interpretability requirements.

Resources

Resources to learn anomaly detection:

Outlier Detection in Python: Manning book written by Brett Kennedy. A great book, and responsible for most of the content in this lesson. I owe Brett a lot for how important this book has been for me!
PyOD: Python library with 40+ anomaly detection algorithms pyod.readthedocs.io
Numenta Anomaly Benchmark: Real-world time series anomaly detection benchmark github.com/numenta/NAB

Why Learn Anomaly Detection?

Anomaly detection enables you to:

Find Interesting Data: Identify rare events, novel patterns, and unusual behaviors
Clean Data: Find and fix data quality issues
Label Data: Label anomalies for supervised learning

Anomaly detection methods apply across many domains - the same statistical methods that detect credit card fraud can find manufacturing defects or identify unusual patient symptoms.

Applications of anomaly detection include:

Prevent financial fraud: Detect unusual transaction patterns in banking, credit cards, and online payments before losses occur
Monitor system health: Identify server failures, network intrusions, and performance degradation in real-time
Ensure quality control: Find defective products, manufacturing errors, and quality issues in production lines
Enable predictive maintenance: Detect equipment failures before they happen, reducing downtime and maintenance costs
Enhance cybersecurity: Identify malicious activities, data breaches, and security threats in network traffic
Improve healthcare: Detect medical anomalies in patient data, unusual symptoms, and diagnostic outliers

Types of Outliers

Strong vs. Weak Outliers

If the outliers are only outliers because of noise, they are weak outliers. Strong outliers are outliers because they are generated by different data generating processes.

Internal vs. External Outliers

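External outliers are extreme values that fall outside the range of the rest of the data. Internal outliers are unusual values that fall inside that range, and they are less common than extreme values. They appear only in multimodal distributions or distributions with gaps.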

Internal outliers can be detected by the following algorithms, which are more flexible to unusual distributions:

Histograms: Count-based methods
Kernel Density Estimation (KDE): Probability distribution density estimation
Nearest Neighbour methods: Distance-based methods
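As a minimal sketch of the count-based idea, the snippet below bins a small bimodal sample and flags values that land in sparsely populated bins. The sample values, `bin_width` and `min_count` are assumptions chosen for illustration, not recommended settings.

```python
import statistics
from collections import Counter

# Hypothetical bimodal sample: modes near 1 and 9, plus one value in the gap
values = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 5.0,
          8.8, 8.9, 9.0, 9.1, 9.2, 9.0, 9.1]

bin_width = 1.0  # assumed bin width
min_count = 2    # bins holding fewer values than this count as sparse

# Assign every value to a bin and count how many values share each bin
bin_of = [int(v // bin_width) for v in values]
counts = Counter(bin_of)
internal_outliers = [v for v, b in zip(values, bin_of) if counts[b] < min_count]

# A plain z-score sees nothing unusual, because 5.0 sits close to the mean
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_flagged = [v for v in values if abs(v - mean) / std > 2]

print(f"Histogram flags: {internal_outliers}")
print(f"Z-score flags: {z_flagged}")
```

The value 5.0 lies between the two modes, so a method that only looks for extreme values has no reason to flag it, while the bin count does.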

Local vs. Global Outliers

Global outliers are unusual when compared to the entire dataset. These points are far from all other data points and would be considered anomalies regardless of local context.

Local outliers are unusual only within their local neighborhood, but may appear normal when viewed globally. A point might be close to one cluster but far from the specific cluster it should belong to.

Local outliers are also known as in-distribution anomalies, while global outliers are out-of-distribution anomalies.
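The difference can be sketched with a one-dimensional example. The data, `k` and both thresholds below are assumptions for illustration, and the local score is a simplified ratio in the spirit of Local Outlier Factor rather than the full algorithm.

```python
import numpy as np

dense = np.arange(11) / 10            # 0.0 .. 1.0, spacing 0.1
sparse = np.arange(10.0, 21.0, 2.0)   # 10, 12, ..., 20, spacing 2
points = np.concatenate([dense, [2.0], sparse, [100.0]])

k = 2
dist = np.abs(points[:, None] - points[None, :])
np.fill_diagonal(dist, np.inf)        # a point is not its own neighbor
nn_idx = np.argsort(dist, axis=1)[:, :k]

# Global score: mean distance to the k nearest neighbors
knn_dist = np.take_along_axis(dist, nn_idx, axis=1).mean(axis=1)
# Local score: own k-NN distance relative to that of the neighbors
local_ratio = knn_dist / knn_dist[nn_idx].mean(axis=1)

global_outliers = points[knn_dist > 5.0]
local_outliers = points[local_ratio > 5.0]

print(f"Flagged by global score: {global_outliers}")
print(f"Flagged by local score: {local_outliers}")
```

The point at 2.0 is closer to its neighbors than any member of the sparse cluster, so the global score ranks it as ordinary. It only stands out once its distance is compared with the much tighter spacing of the dense cluster next to it.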

Masking & Swamping

Masking

Masking occurs when the presence of outliers causes other outliers to not be detected. This happens because extreme outliers can shift statistical measures (like mean and standard deviation) so much that other outliers appear normal by comparison.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate normal data with two types of outliers
np.random.seed(42)
normal_data = np.random.normal(0, 1, 100)
moderate_outliers = [3.5, 3.7]  # These should be detected
extreme_outliers = [15, 16]     # These mask the moderate ones

data = np.concatenate([normal_data, moderate_outliers, extreme_outliers])

# Z-score detection (affected by masking)
z_scores = np.abs(stats.zscore(data))
outliers_zscore = data[z_scores > 2]

# Robust detection using MAD
mad = np.median(np.abs(data - np.median(data)))
modified_z_scores = 0.6745 * (data - np.median(data)) / mad
outliers_mad = data[np.abs(modified_z_scores) > 2]

print(f"Z-score outliers: {outliers_zscore}")
print(f"MAD outliers: {outliers_mad}")

Z-score outliers: [15. 16.]
MAD outliers: [-1.91328024 -1.72491783  1.85227818 -1.95967012 -1.76304016 -2.6197451
 -1.98756891  3.5         3.7        15.         16.        ]

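The Z-score misses the moderate outliers [3.5, 3.7] because the extreme outliers [15, 16] inflate the mean and standard deviation. The MAD-based score catches them, although at this low threshold of 2 it also flags several points from the normal sample.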

Masking Example

Swamping

Swamping occurs when the presence of outliers causes normal points to be incorrectly identified as outliers. This typically happens when outliers distort the estimate of what is normal, for example by pulling the mean toward themselves, so that normal points on the opposite side appear unusually far from the center.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate clustered data with one extreme outlier
np.random.seed(42)
cluster1 = np.random.normal(0, 0.5, 50)
cluster2 = np.random.normal(5, 0.5, 50)
extreme_outlier = [20]  # This will cause swamping

data = np.concatenate([cluster1, cluster2, extreme_outlier])

# Z-score detection (affected by swamping)
z_scores = np.abs(stats.zscore(data))
outliers_zscore = data[z_scores > 2]

# Robust detection using median and MAD
median_val = np.median(data)
mad = np.median(np.abs(data - median_val))
modified_z_scores = 0.6745 * (data - median_val) / mad
outliers_mad = data[np.abs(modified_z_scores) > 2]

print(f"Z-score outliers (swamping): {len(outliers_zscore)} points")
print(f"MAD outliers (robust): {len(outliers_mad)} points")
print(f"Z-score detected: {outliers_zscore}")
print(f"MAD detected: {outliers_mad}")

Z-score outliers (swamping): 1 points
MAD outliers (robust): 1 points
Z-score detected: [20.]
MAD detected: [20.]

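In this run both methods flag only the extreme outlier 20, so no swamping occurs: the outlier inflates the standard deviation, which makes the Z-score threshold less sensitive rather than more. Swamping would show up as points from the two normal clusters appearing in the flagged set.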

Swamping Example

Labels & Scores

An anomaly label is a binary classification of whether a record is an anomaly or not. Generating a label is often done with a score and a threshold.

An anomaly score is a continuous value that indicates how abnormal a record is. Examples of scores include a z-score, a distance from a cluster, or a density estimate.

Scores can be combined and ranked, although care is needed when combining scores on different scales. Scores can be combined with the sum, maximum or sum of squares.
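As a small sketch of how scores relate to labels and rankings, the snippet below applies a fixed threshold to a handful of scores. The record ids, score values and `threshold` are hypothetical.

```python
# Hypothetical anomaly scores for five records (higher = more anomalous)
scores = {"r1": 0.4, "r2": 3.8, "r3": 1.1, "r4": 0.2, "r5": 2.9}
threshold = 2.5  # assumed cut-off, normally chosen from domain knowledge

# Label: 1 if the score exceeds the threshold, otherwise 0
labels = {record_id: int(score > threshold) for record_id, score in scores.items()}

# Ranking: most anomalous record first
ranking = sorted(scores, key=scores.get, reverse=True)

print(labels)
print(ranking)
```

Keeping the scores around, rather than only the labels, lets you revisit the threshold later and review records in order of how unusual they are.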

Univariate vs. Multivariate Anomaly Detection

Univariate outlier detection uses a single feature to find unusual values. The scores created with univariate anomaly detection can be combined for all features in a row, to generate an estimate for the entire row.

Univariate outlier detection has one source of unusualness - unusual because of the single feature. A univariate outlier detection model cannot understand the relationships between features.

Multivariate outlier detection uses many features, and can find unusual combinations of values as well as individually unusual values.

Multivariate outlier detection has two sources of unusualness - unusual because of the single feature and/or unusual when combined with other features.
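The sketch below illustrates the second source of unusualness. It uses the Mahalanobis distance as one possible multivariate score (a technique chosen here for illustration, not one the lesson prescribes) on made-up data where two features normally move together.

```python
import numpy as np

# Two strongly related features, plus one record that breaks the relationship
x = np.arange(1.0, 11.0)                      # 1 .. 10
noise = np.array([0.1, -0.1] * 5)
X = np.column_stack([x, x + noise])
X = np.vstack([X, [3.0, 8.0]])                # each value is ordinary on its own

# Univariate view: largest absolute z-score per record
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
max_abs_z = z.max(axis=1)

# Multivariate view: Mahalanobis distance accounts for the correlation
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

print(f"Largest |z| for the last record: {max_abs_z[-1]:.2f}")
print(f"Most unusual record by Mahalanobis distance: {int(np.argmax(mahalanobis))}")
```

Neither 3.0 nor 8.0 is unusual within its own column, so per-feature scores stay low, while the combination is the most distant record once the relationship between the features is taken into account.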

Curse of Dimensionality

The curse of dimensionality refers to the problems that arise when working with high-dimensional data.

Most outliers can be detected using only a few features, although some can only be detected in high dimensions.

For outlier detection, high dimensions cause the following issues:

Interpretability: Outliers that relate to many features are less interpretable
Sparsity: High-dimensional data can become sparse, where many combinations of feature values cannot be meaningfully populated.
More ways to be unusual: Increasing dimensionality creates more opportunities for data points to stand out as anomalies
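The sparsity point can be made concrete with a little arithmetic. Assuming each feature is split into 10 bins and the dataset has 10,000 records (both numbers are illustrative), the number of possible cells quickly outgrows the number of records.

```python
n_records = 10_000       # assumed dataset size
bins_per_feature = 10    # assumed discretization of each feature

occupancy = {}
for n_features in (1, 2, 4, 8):
    n_cells = bins_per_feature ** n_features
    # Upper bound: even if every record fell into its own cell,
    # no more than this fraction of cells could be occupied
    occupancy[n_features] = min(1.0, n_records / n_cells)
    print(f"{n_features} features: {n_cells:,} cells, "
          f"at most {occupancy[n_features]:.2%} occupied")
```

With eight features, at least 99.99% of cells must be empty, so almost every record is alone in its region of the space.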

Categorical Features

One common way to detect anomalies with categorical variables is with counts. Anomalies can then be detected by thresholds on the counts.

You can also set a threshold on a cumulative count. Sort the values of a column from rarest to most common and accumulate their counts, flagging records that hold the rarest values and stopping where the running total reaches a threshold such as 1% of all records.

You can also do outlier detection on the counts - for example using MAD or z score on the counts.
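The cumulative-count threshold can be sketched as follows. The column values and the 5% threshold are assumptions for illustration.

```python
from collections import Counter

# Hypothetical categorical column
colours = ["red"] * 60 + ["blue"] * 34 + ["green"] * 4 + ["teal"] + ["pink"]
cumulative_threshold = 0.05  # rarest values covering up to 5% of records (assumed)

counts = Counter(colours)
total = len(colours)

rare_values = set()
running = 0
# Walk from the rarest value upwards and stop before the running
# share of records would exceed the threshold
for value, count in sorted(counts.items(), key=lambda item: item[1]):
    if (running + count) / total > cumulative_threshold:
        break
    running += count
    rare_values.add(value)

flagged = [i for i, value in enumerate(colours) if value in rare_values]
print(f"Rare values: {sorted(rare_values)}")
print(f"Flagged row indices: {flagged}")
```

A cumulative threshold bounds the total share of flagged records, which a per-value count threshold does not when a column has many rare values.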

Marginal Probabilities

Marginal probabilities detect anomalies in categorical features by identifying values with unusually low probability of occurrence. This method calculates the probability of each category value and flags records containing rare combinations.

The marginal probabilities method is particularly effective for:

High-cardinality categorical features: Where many categories have low frequency
Multi-categorical records: Detecting unusual combinations across multiple categorical features
Interpretability: Easy to explain why a record is anomalous based on rare category values

Basic marginal probability calculation:

import pandas as pd
import numpy as np

# Sample categorical data
data = pd.DataFrame({
    'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing', 'Legal'],
    'location': ['NYC', 'NYC', 'SF', 'SF', 'NYC', 'Remote'],
    'level': ['Junior', 'Senior', 'Senior', 'Senior', 'Senior', 'Partner']
})

# Calculate marginal probabilities for each feature
dept_probs = data['department'].value_counts(normalize=True)
location_probs = data['location'].value_counts(normalize=True)
level_probs = data['level'].value_counts(normalize=True)

print("Department probabilities:")
print(dept_probs)
print("\nLocation probabilities:")
print(location_probs)
print("\nLevel probabilities:")
print(level_probs)

Department probabilities:
department
Sales          0.333333
Engineering    0.333333
Marketing      0.166667
Legal          0.166667

Location probabilities:
location
NYC       0.500000
SF        0.333333
Remote    0.166667

Level probabilities:
level
Senior     0.666667
Junior     0.166667
Partner    0.166667

Detecting anomalies using probability thresholds:

# Set threshold for rare categories (probability < 0.2)
threshold = 0.2

# Find records with rare categorical values
anomalies = []
for idx, row in data.iterrows():
    dept_prob = dept_probs[row['department']]
    loc_prob = location_probs[row['location']]
    level_prob = level_probs[row['level']]
    
    # Flag if any category has probability below threshold
    if dept_prob < threshold or loc_prob < threshold or level_prob < threshold:
        anomalies.append({
            'index': idx,
            'department': row['department'],
            'location': row['location'], 
            'level': row['level'],
            'dept_prob': dept_prob,
            'loc_prob': loc_prob,
            'level_prob': level_prob
        })

print(f"Found {len(anomalies)} anomalous records:")
for anomaly in anomalies:
    print(f"Row {anomaly['index']}: {anomaly['department']}, {anomaly['location']}, {anomaly['level']}")
    print(f"  Probabilities: dept={anomaly['dept_prob']:.3f}, loc={anomaly['loc_prob']:.3f}, level={anomaly['level_prob']:.3f}")

Found 3 anomalous records:
Row 0: Sales, NYC, Junior
  Probabilities: dept=0.333, loc=0.500, level=0.167
Row 4: Marketing, NYC, Senior
  Probabilities: dept=0.167, loc=0.500, level=0.667
Row 5: Legal, Remote, Partner
  Probabilities: dept=0.167, loc=0.167, level=0.167

Combined probability scoring:

# Calculate combined probability score for each record
data['combined_prob'] = data.apply(lambda row: 
    dept_probs[row['department']] * 
    location_probs[row['location']] * 
    level_probs[row['level']], axis=1)

# Sort by combined probability (lowest = most anomalous)
data_sorted = data.sort_values('combined_prob')
print("Records sorted by combined probability (most anomalous first):")
print(data_sorted[['department', 'location', 'level', 'combined_prob']])

Records sorted by combined probability (most anomalous first):
    department location    level  combined_prob
5        Legal   Remote  Partner       0.004630
0        Sales      NYC   Junior       0.027778
4    Marketing      NYC   Senior       0.055556
2  Engineering       SF   Senior       0.074074
3  Engineering       SF   Senior       0.074074
1        Sales      NYC   Senior       0.111111

Visualizing Anomalies

Visualization is crucial for understanding your data and validating anomaly detection results. Different chart types work best for different data combinations.

Single Variable Visualizations

Histogram: Shows the distribution and identifies gaps or unusual peaks

data = [1, 2, 3, 4, 5, 100]  # 100 is an outlier
plt.hist(data, bins=10, alpha=0.7, edgecolor='black')

Histogram

Kernel Density Estimation (KDE): Smooth density curves that highlight low-density regions

import seaborn as sns
sns.histplot(data, kde=True, alpha=0.7)

KDE Plot

Boxplot: Identifies outliers using quartiles and IQR

plt.boxplot(data)  # Points beyond whiskers are outliers

Boxplot

Two Variable Visualizations

Scatterplot: Reveals outlying points and clusters in two numeric variables

x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 10]  # (10, 10) is an outlier
plt.scatter(x, y)

Scatterplot

Heatmap: Shows frequency patterns between two categorical variables

df = pd.DataFrame({'cat1': ['A', 'A', 'B'], 'cat2': ['X', 'Y', 'Z']})
heatmap_data = pd.crosstab(df['cat1'], df['cat2'])
sns.heatmap(heatmap_data, annot=True)

Heatmap

Boxplot by category: Identifies outliers within categorical groups

data_by_group = [[1, 2, 3, 4], [10, 11, 12, 100]]  # 100 is outlier in group 2
plt.boxplot(data_by_group, tick_labels=['Group A', 'Group B'])

Boxplot by Category

Advanced Visualizations

Z-score visualization: Color-codes points by their statistical outlier scores

z_scores = np.abs(stats.zscore(data))
plt.scatter(range(len(data)), data, c=z_scores, cmap='Reds')
plt.colorbar(label='|Z-score|')

Z-score Visualization

Density contour plot: Shows bivariate density with outliers highlighted

plt.scatter(normal_points[:, 0], normal_points[:, 1], label='Normal')
plt.scatter(outlier_points[:, 0], outlier_points[:, 1], color='red', label='Outliers')
plt.legend()

Density Contour Plot

Anomaly Detection Methods

Anomaly detection methods can be categorized into three groups:

Rules-based: Simple, interpretable thresholds and conditions
Statistical: Mathematical approaches using distributions and probabilities
Machine Learning: Advanced algorithms that learn patterns from data

Anomaly Detection with Rules

Rules-based anomaly detection uses thresholds and logical conditions to identify outliers:

Interpretable: Easy to understand and explain to stakeholders
Fast: Simple comparisons with minimal computation
Deterministic: Same input always produces same result
Manual effort: Requires domain knowledge to set appropriate thresholds
Rigid: May miss novel anomaly patterns

Univariate Rules

Univariate rules are simple threshold-based rules applied to single variables.

Range rules flag values outside expected ranges:

ages = [25, 30, -5, 150, 40]
outliers = [age for age in ages if age < 0 or age > 120]
print(f"Age outliers: {outliers}")

Age outliers: [-5, 150]

Percentage rules flag extreme percentiles:

import numpy as np
data = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30, 100])
outliers = data[data > np.percentile(data, 95)]
print(f"Top 5% outliers: {outliers}")

Top 5% outliers: [100]

Business logic rules enforce domain-specific constraints:

employees = [{'salary': 80000, 'revenue': 70000}, {'salary': 50000, 'revenue': 200000}]
outliers = [emp for emp in employees if emp['salary'] > emp['revenue']]
print(f"Salary > revenue: {len(outliers)} employees")

Salary > revenue: 1 employees

Multivariate Rules

Multivariate rules are combinations of univariate rules.

These can check for consistency (for example, that a start date is before an end date) or apply thresholds to ratios of features.

Combining multivariate rules is not straightforward, as the number of combinations grows exponentially with the number of features.
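A consistency rule and a ratio rule might look like the sketch below. The order records, field names and `max_unit_price` limit are all hypothetical.

```python
from datetime import date

# Hypothetical order records
orders = [
    {"id": 1, "start": date(2024, 1, 5), "end": date(2024, 1, 9),
     "quantity": 10, "total": 100.0},
    {"id": 2, "start": date(2024, 2, 10), "end": date(2024, 2, 1),
     "quantity": 5, "total": 50.0},     # ends before it starts
    {"id": 3, "start": date(2024, 3, 1), "end": date(2024, 3, 2),
     "quantity": 2, "total": 900.0},    # very high price per unit
]
max_unit_price = 50.0  # assumed business limit


def violated_rules(order):
    """Return the names of the multivariate rules this order breaks."""
    rules = []
    if order["start"] > order["end"]:                       # consistency rule
        rules.append("start_after_end")
    if order["total"] / order["quantity"] > max_unit_price:  # ratio rule
        rules.append("unit_price_too_high")
    return rules


violations = {order["id"]: violated_rules(order) for order in orders}
flagged = {order_id: rules for order_id, rules in violations.items() if rules}
print(flagged)
```

Returning the names of the violated rules, rather than a bare flag, keeps the result easy to explain.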

Anomaly Detection with Statistics

Statistical anomaly detection uses mathematical properties of data distributions to identify outliers.

These methods assume data follows certain statistical patterns and flag points that deviate significantly from these patterns:

Computationally efficient: Simple calculations, fast execution
Threshold interpretability: Scores have statistical meaning
Distribution assumptions: Many methods assume normal distributions
Univariate focus: Don’t capture relationships between variables
Parameter sensitivity: Threshold choice affects performance significantly

Z-Score

The Z-score measures how many standard deviations a point is from the mean.

Points with an absolute Z-score greater than 2 or 3 are typically considered outliers.

Tradeoffs:

Simple
Works well for normal distributions
Sensitive to extreme outliers (masking and swamping)

import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std
outliers = data[np.abs(z_scores) > 2]

Interquartile Range (IQR)

The interquartile range (IQR) identifies outliers based on the spread of the middle 50% of the data.

Points more than 1.5 times the IQR below the first quartile or above the third quartile are flagged as outliers.

Tradeoffs:

Robust to outliers
Doesn’t assume normal distribution
The fixed multiplier (1.5) may not suit all datasets

import numpy as np

data = [1, 2, 3, 4, 5, 100]
Q1, Q3 = np.percentile(data, [25, 75])
IQR = Q3 - Q1
outliers = [x for x in data if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR]

Median Absolute Deviation (MAD)

The Median Absolute Deviation (MAD) is a more robust alternative to the standard deviation. It is the median of the absolute deviations from the median, so it uses the median rather than the mean as the measure of central tendency.

Tradeoffs:

Robust to outliers
Works with skewed distributions
Less intuitive than Z-score

import numpy as np

data = [1, 2, 3, 4, 5, 100]
median = np.median(data)
mad = np.median(np.abs(data - median))
outliers = [x for x in data if abs(x - median) > 3 * mad]

Modified Z-Score

The modified Z-score combines the interpretability of the Z-score with the robustness of MAD. It uses MAD instead of standard deviation for more robust outlier detection.

$$\text{Modified Z} = \frac{0.6745 \times (x - \text{median}(x))}{MAD}$$

For normally distributed data the MAD is approximately 0.6745 times the standard deviation, so the constant 0.6745 puts the modified Z-score on the same scale as the standard Z-score.
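Where the constant comes from can be checked directly. For a standard normal distribution, half of the probability mass lies within one MAD of the median, which makes the MAD equal to the 75th percentile. The sketch below uses only Python's standard library.

```python
from statistics import NormalDist

# MAD of a standard normal = its 75th percentile
mad_of_standard_normal = NormalDist(mu=0, sigma=1).inv_cdf(0.75)
print(round(mad_of_standard_normal, 4))  # 0.6745
```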

Tradeoffs:

Combines interpretability of z-score with robustness of MAD
Constant (0.6745) assumes normal distribution

import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outliers = data[np.abs(modified_z) > 3.5]

Combining Statistical Methods

Statistical scores can be combined across multiple features to create multivariate outlier detection:

Sum of scores: Adding across features
Maximum score: Take the highest absolute score across features
Sum of squares: Combine squared scores
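The three combinations emphasize different things, as the sketch below shows. The per-feature scores are made up and stand in for absolute z-scores that are already on a comparable scale.

```python
import numpy as np

# Hypothetical per-feature absolute z-scores: rows = records, columns = features
feature_scores = np.array([
    [0.5, 0.4, 0.6],   # unremarkable in every feature
    [4.0, 0.1, 0.2],   # extreme in a single feature
    [2.0, 2.0, 2.0],   # moderately unusual in every feature
])

sum_scores = feature_scores.sum(axis=1)
max_scores = feature_scores.max(axis=1)
sum_sq_scores = (feature_scores ** 2).sum(axis=1)

print(f"Sum:            {sum_scores}")
print(f"Maximum:        {max_scores}")
print(f"Sum of squares: {sum_sq_scores}")
```

The sum ranks the record that is moderately unusual everywhere highest, while the maximum and the sum of squares favor the record with one extreme feature.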

Anomaly Detection with Machine Learning

Machine learning anomaly detection methods learn patterns from data. They can handle complex, multivariate relationships.

These algorithms are typically more sophisticated but less interpretable than rules or statistical methods.

We will look at a few different groups of machine learning methods:

Distance-based methods: Identify outliers based on distance to other points
Density-based methods: Identify outliers in low-density regions
Clustering-based methods: Use clustering algorithms to find outliers
Tree-based methods: Use decision trees to isolate anomalies
Probabilistic methods: Model data distributions and flag low-probability points
Histogram-based methods: Use histograms to identify outliers

Tradeoffs:

Multivariate: Capture complex relationships between features
Adaptive: Learn patterns from data without manual threshold setting
Flexible: Handle non-linear relationships and arbitrary distributions
Scalable: Many algorithms work well with large datasets
Less interpretable: Harder to explain why a point is flagged
Computational cost: More expensive than statistical methods
Overfitting risk: May learn noise patterns in training data
Data requirements: Need sufficient training data for reliable patterns

Distance-Based Methods

Identify outliers based on their distance to other points or clusters.

k-Nearest Neighbors (k-NN):

Flag points with large distances to their k nearest neighbors
Can use mean, median, or maximum distance to k neighbors
Works well when normal points form dense clusters

from sklearn.neighbors import NearestNeighbors
import numpy as np

knn = NearestNeighbors(
    # Number of neighbors to consider
    n_neighbors=5,          
    # Algorithm selection (ball_tree, kd_tree, brute, auto)
    algorithm='auto',       
    # Leaf size for tree algorithms
    leaf_size=30,          
    # Distance metric
    metric='minkowski',     
    # Power parameter for Minkowski metric (2=Euclidean)
    p=2                    
)

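# Calling kneighbors() without arguments excludes each point from its own
# neighbor list, so the self-distance of 0 does not dilute the score
distances, indices = knn.fit(data).kneighbors()
anomaly_scores = distances.mean(axis=1)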

Tradeoffs:

Arbitrary cluster shapes: Handles non-spherical clusters well
No distribution assumptions: Works with any data distribution
Parameter sensitivity: Performance highly dependent on k value
Varying density struggles: Poor performance when clusters have different densities

Density-Based Methods

Identify outliers as points in low-density regions.

Local Outlier Factor (LOF):

Compares local density of a point to densities of its neighbors
Points in sparser regions relative to neighbors are flagged
Handles clusters of different densities well

from sklearn.neighbors import LocalOutlierFactor
import numpy as np

lof = LocalOutlierFactor(
    # Number of neighbors to consider
    n_neighbors=20,
    # Algorithm for neighbor search (auto, ball_tree, kd_tree, brute)
    algorithm='auto',
    # Leaf size for tree algorithms
    leaf_size=30,
    # Distance metric
    metric='minkowski',
    # Power parameter for Minkowski metric
    p=2
)

outlier_labels = lof.fit_predict(data)
anomaly_scores = -lof.negative_outlier_factor_

Kernel Density Estimation (KDE):

Estimates probability density function and flags low-probability points
Can capture complex data distributions
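A minimal one-dimensional sketch of KDE-based scoring, written with NumPy only. The sample values and the `bandwidth` are assumptions, and a real application would need to tune the bandwidth.

```python
import numpy as np


def gaussian_kde_density(values, bandwidth):
    """Gaussian kernel density estimate evaluated at each sample point."""
    values = np.asarray(values, dtype=float)
    scaled = (values[:, None] - values[None, :]) / bandwidth
    kernel = np.exp(-0.5 * scaled ** 2) / np.sqrt(2 * np.pi)
    # Average kernel contribution from all points (including the point itself)
    return kernel.mean(axis=1) / bandwidth


values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.1, 4.9, 5.2, 3.0]
density = gaussian_kde_density(values, bandwidth=0.3)

lowest = values[int(np.argmin(density))]
print(f"Lowest-density value: {lowest}")
```

The value 3.0 sits between the two groups, where the estimated density is lowest, so it would be the first point flagged by a low-density threshold.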

Tradeoffs:

Varying cluster densities: Handles clusters with different densities well
Local anomaly detection: Finds outliers relative to local neighborhoods
Complex distributions: Captures non-parametric data patterns
Curse of dimensionality: Performance degrades in high dimensions

Clustering-Based Methods

Use clustering algorithms to identify outliers as points that don’t belong to any cluster or are far from cluster centers.

k-Means Outliers:

Points far from nearest cluster centroid
Points in small or sparse clusters
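The distance-to-centroid idea can be sketched with a small hand-written k-means loop. The data and the initial centroids are assumptions chosen to keep the example deterministic, and in practice a library implementation of k-means would replace the loop.

```python
import numpy as np


def kmeans(X, init_centroids, n_iter=10):
    """Plain Lloyd's algorithm; returns centroids, assignments and distances."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        for j in range(len(centroids)):
            members = X[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1), dists.min(axis=1)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 16]], dtype=float)   # last point belongs to neither group

centroids, assignment, dist_to_centroid = kmeans(X, init_centroids=[X[0], X[4]])

# Anomaly score: distance to the nearest centroid
most_anomalous = int(np.argmax(dist_to_centroid))
print(f"Most anomalous row: {most_anomalous}, point {X[most_anomalous]}")
```

Note that the outlier also pulls its cluster's centroid toward itself, which is one reason clustering-based scores depend on the quality of the clustering.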

DBSCAN Outliers:

Points classified as “noise” by the algorithm
Points not belonging to any dense cluster

from sklearn.cluster import DBSCAN
import numpy as np

dbscan = DBSCAN(
    # Maximum distance between samples in same neighborhood
    eps=0.5,
    # Minimum samples in neighborhood to form core point
    min_samples=5,
    # Distance metric
    metric='euclidean',
    # Algorithm for neighbor search
    algorithm='auto',
    # Leaf size for tree algorithms
    leaf_size=30
)

cluster_labels = dbscan.fit_predict(data)
outliers = data[cluster_labels == -1]

Tradeoffs:

Intuitive concept: Easy to understand outliers as non-clustered points
Clustering quality dependence: Performance tied to underlying clustering success

Tree-Based Methods

Isolation Forest:

Isolates anomalies by randomly partitioning data
Anomalies require fewer partitions to isolate (shorter paths in trees)
The anomaly score is based on the average path length across an ensemble of isolation trees
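The path-length intuition can be sketched in one dimension with the standard library. This is a simplified illustration, not the full Isolation Forest algorithm: it has a single feature, no subsampling, and counts splits directly instead of building trees. The data, `n_trees` and `seed` are assumptions.

```python
import random


def isolation_path_length(values, target, rng, depth=0, max_depth=20):
    """Count random splits until `target` is alone in its partition."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    low, high = min(values), max(values)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Keep only the side of the split that contains the target
    same_side = [v for v in values if (v < split) == (target < split)]
    return isolation_path_length(same_side, target, rng, depth + 1, max_depth)


def mean_path_length(values, target, n_trees=200, seed=0):
    """Average the path length over many independent random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_path_length(values, target, rng) for _ in range(n_trees))
    return total / n_trees


values = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 25.0]
outlier_path = mean_path_length(values, target=25.0)
inlier_path = mean_path_length(values, target=1.5)

print(f"Mean path length for 25.0: {outlier_path:.2f}")
print(f"Mean path length for 1.5:  {inlier_path:.2f}")
```

Most random splits fall in the wide gap between 1.9 and 25.0, so the outlier is usually isolated by the very first split, while a point inside the cluster needs several more.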

from sklearn.ensemble import IsolationForest
import numpy as np

iforest = IsolationForest(
    # Number of base estimators in ensemble
    n_estimators=100,
    # Number of samples to draw to train each base estimator
    max_samples='auto',
    # Proportion of outliers in dataset
    contamination=0.1,
    # Number of features to draw to train each base estimator
    max_features=1.0,
    # Random state for reproducibility
    random_state=42
)

outlier_labels = iforest.fit_predict(data)
anomaly_scores = -iforest.score_samples(data)

Tradeoffs:

Fast execution: Linear time complexity, scales well
No distribution assumptions: Works with any data distribution
Feature independence: Handles mixed data types well
Less interpretable: Difficult to explain why specific points are outliers
Sparse region struggles: May flag normal points in low-density areas

Probabilistic Methods

Gaussian Mixture Models (GMM):

Model data as mixture of Gaussian distributions
Flag points with low likelihood under the learned model
Can capture multiple modes in data
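Scoring under a mixture model can be sketched without a fitting step. The two-component mixture below is hypothetical and stands in for parameters that would normally be learned from data (for example with scikit-learn's `GaussianMixture`, whose `score_samples` method returns per-sample log-likelihoods). The `density_threshold` is also an assumption.

```python
from statistics import NormalDist

# Hypothetical, already-fitted mixture: (weight, component) pairs
mixture = [
    (0.5, NormalDist(mu=0.0, sigma=1.0)),
    (0.5, NormalDist(mu=10.0, sigma=1.0)),
]


def mixture_density(x):
    """Likelihood of x under the weighted sum of Gaussian components."""
    return sum(weight * component.pdf(x) for weight, component in mixture)


points = [0.2, 9.7, 5.0, 10.4]
densities = {x: mixture_density(x) for x in points}

density_threshold = 0.01  # assumed cut-off
flagged = [x for x in points if densities[x] < density_threshold]
print(f"Low-likelihood points: {flagged}")
```

The point at 5.0 lies between the two modes and has very low likelihood under both components, even though it is close to the overall mean of the data.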

One-Class SVM:

Learn boundary around normal data in high-dimensional space
Flag points outside the learned boundary

from sklearn.svm import OneClassSVM
import numpy as np

ocsvm = OneClassSVM(
    # Kernel type (linear, poly, rbf, sigmoid)
    kernel='rbf',
    # Kernel coefficient for rbf, poly, sigmoid
    gamma='scale',
    # Upper bound on the fraction of training errors (outliers)
    nu=0.1,
    # Degree for polynomial kernel (ignored by other kernels)
    degree=3
)

outlier_labels = ocsvm.fit_predict(data)
# decision_function is positive for inliers and negative for outliers
decision_scores = ocsvm.decision_function(data)

Tradeoffs:

Complex distributions: Handles multimodal and non-linear patterns
Uncertainty quantification: Provides confidence scores for predictions
Strong assumptions: Assumes Gaussian mixture distributions
Computationally intensive: Expensive training and inference

Histogram-Based Methods

Histogram-Based Outlier Score (HBOS):

Build histograms for each feature independently
Combine scores across features (assumes feature independence)
Fast and interpretable for high-dimensional data

from pyod.models.hbos import HBOS
import numpy as np

hbos = HBOS(
    # Number of histogram bins
    n_bins=10,
    # Regularization parameter for density estimation
    alpha=0.1,
    # Flexibility for samples that fall outside the histogram bins
    tol=0.1,
    # Contamination proportion
    contamination=0.1
)

outlier_labels = hbos.fit_predict(data)
anomaly_scores = hbos.decision_scores_

Tradeoffs:

Fast execution: Linear time complexity for training and prediction
Scales well: Handles high-dimensional data efficiently
Interpretable contributions: Easy to understand which features drive anomaly scores
Feature independence assumption: Ignores correlations and interactions between features

Practical Tips

Setting Realistic Expected Anomalies

Don’t expect zero anomalies in real data:

Normal anomaly rates: Expect 1-5% anomalies in typical datasets
Threshold accordingly: Set contamination parameters based on expected rates
Domain knowledge: Use business context to validate anomaly rates

Data Separation Strategy

Separate different record types before applying anomaly detection:

Transaction types: Analyze purchases and sales separately, not together
User segments: Apply detection within user groups (new vs. returning customers)
Time periods: Consider seasonal patterns by analyzing periods separately

# Separate analysis by transaction type
purchase_data = transactions[transactions['type'] == 'purchase']
sale_data = transactions[transactions['type'] == 'sale']

# Run anomaly detection on each separately
purchase_outliers = detect_anomalies(purchase_data)
sale_outliers = detect_anomalies(sale_data)

Ensemble Methods

Ensemble approaches combine multiple anomaly detectors for improved performance:

Why use ensembles:

Pattern diversity: Different algorithms detect different types of anomalies
Variance reduction: Averaging reduces impact of individual model errors
Reliability: More stable and robust final anomaly scores

Ensemble strategies:

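import numpy as np
from pyod.models.combination import average, maximization
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.hbos import HBOS
from pyod.utils.utility import standardizer

# Train multiple detectors
clf1 = IForest(contamination=0.1)
clf2 = LOF(contamination=0.1)
clf3 = HBOS(contamination=0.1)

scores1 = clf1.fit(data).decision_scores_
scores2 = clf2.fit(data).decision_scores_
scores3 = clf3.fit(data).decision_scores_

# Stack to shape (n_samples, n_detectors) and standardize each column,
# because the detectors score on different scales
scores = standardizer(np.column_stack([scores1, scores2, scores3]))

# Combine scores using different methods
avg_scores = average(scores)
max_scores = maximization(scores)
# aom (average of maximum) and moa (maximum of average) from the same module
# group detectors into buckets, so they suit larger pools of detectors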

Tradeoffs:

Complementary detection: Statistical + ML methods catch different anomaly types
Reduced false positives: Consensus voting filters spurious detections
Improved recall: Multiple detectors increase chance of catching true anomalies
Complexity: More models to maintain and tune

Explainability & Interpretability

Understanding why an anomaly detector flags records as outliers can be crucial for building trust and actionable insights.

The requirements for explanation vary significantly based on your audience and use case.

Interpretability vs. Explainability

Interpretability generally refers to models that are inherently understandable, where you can directly see how decisions are made with minimal additional tools.

Explainability involves external techniques to understand anomaly detection labels and scores, often applied after the model has made predictions.

The distinction between interpretability and explainability exists on a spectrum rather than as rigid categories. Understanding this spectrum helps choose appropriate techniques for different audiences and use cases.

The spectrum in practice:

High interpretability: Z-score, IQR, simple business rules
Medium interpretability: Decision trees, linear models with feature importance
Lower interpretability: Isolation Forest, neural networks, ensemble methods
Requires explainability tools: Complex ensembles, deep learning models

Gray areas and considerations:

Audience dependency: What’s interpretable to a data scientist may need explanation for business users
Context matters: Tree models show rules but may be complex with many branches
Feature interactions: Some “interpretable” methods miss important relationships
Timing: Interpretability during model design vs. explainability applied post-hoc

Global vs. Local Explanations

Global explanations describe how the model works overall across the entire dataset.

Local explanations describe why a specific record was flagged as an anomaly.
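The two kinds of explanation can be sketched with per-feature robust deviations. The feature names, records and the choice of a MAD-based score are assumptions for illustration.

```python
import numpy as np

feature_names = ["age", "salary", "experience"]  # hypothetical features
X = np.array([
    [30, 50_000, 5],
    [35, 55_000, 8],
    [40, 60_000, 12],
    [45, 65_000, 15],
    [38, 300_000, 10],   # unusual salary
    [70, 58_000, 11],    # unusual age
], dtype=float)

# Per-record, per-feature deviation from the median, in units of MAD
median = np.median(X, axis=0)
mad = np.median(np.abs(X - median), axis=0)
contributions = np.abs(X - median) / mad

# Local explanation: which feature drives the score of one specific record
local_driver = {
    row: feature_names[int(np.argmax(contributions[row]))] for row in (4, 5)
}

# Global explanation: average contribution of each feature over all records
global_importance = dict(zip(feature_names, contributions.mean(axis=0)))
top_global_feature = max(global_importance, key=global_importance.get)

print(f"Local drivers: {local_driver}")
print(f"Top global feature: {top_global_feature}")
```

Globally, salary dominates the scores, yet the local explanation for row 5 points to age. A global summary alone would explain that record incorrectly.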

Interpretable Methods

These methods are inherently understandable without additional explanation tools:

Statistical methods:

Z-score/MAD: Clear threshold-based rules with statistical meaning
IQR: Quartile-based boundaries with intuitive interpretation
Percentile-based: Direct ranking interpretable to any audience

Histogram-based methods:

HBOS: Shows which features have unusual value frequencies
Count-based detection: Simple frequency thresholds easy to understand

Rules-based methods:

Business logic rules: Domain-specific constraints anyone can understand
Range checks: Simple min/max boundaries
Consistency checks: Logical relationships between fields

Explainability Techniques

For complex models that lack inherent interpretability, use these post-hoc explanation methods.

Feature Importance

Shows which features contribute most to anomaly detection globally:

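from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from pyod.models.iforest import IForest

# Train isolation forest
clf = IForest(contamination=0.1)
clf.fit(X_train)

# ROC AUC needs ground-truth labels: this assumes y_test exists (1 = anomaly)
def auc_scorer(estimator, X, y):
    return roc_auc_score(y, estimator.decision_function(X))

# Calculate feature importance via permutation
perm_importance = permutation_importance(
    clf, X_test, y_test, scoring=auc_scorer, n_repeats=10, random_state=0
)
feature_names = ['age', 'salary', 'experience']

for name, importance in zip(feature_names, perm_importance.importances_mean):
    print(f"{name}: {importance:.3f}")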

age: 0.023
salary: 0.156
experience: 0.087

Limitations: Permutation importance is a global measure. It does not explain individual records, and it can mislead when features interact or are strongly correlated.

SHAP (SHapley Additive exPlanations)

Provides both global and local explanations. SHAP values are additive: for each record, the per-feature contributions plus the base value sum to the model's score:

import shap
from pyod.models.iforest import IForest

# Train model and get SHAP explainer
clf = IForest(contamination=0.1)
clf.fit(X_train)
explainer = shap.Explainer(clf.decision_function, X_train)

# Local explanation for specific record
shap_values = explainer(X_test[0:1])
# The base value is stored on the returned Explanation object
print(f"Base value: {shap_values.base_values[0]:.3f}")
print(f"SHAP values: {shap_values.values[0]}")

Base value: 0.102
SHAP values: [-0.025, 0.089, 0.034]
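
To see what the additive property means, the sketch below computes exact Shapley values for a single record by enumerating every feature subset. It is a simplified illustration, not how the `shap` library works internally: `toy_score` is a made-up scoring function, and "missing" features are replaced by a single baseline record rather than averaged over a background dataset.

```python
from itertools import combinations
from math import factorial

def exact_shapley(score_fn, record, baseline):
    """Exact Shapley values for one record.

    Features outside a subset are replaced by their baseline value.
    Cost grows as 2^n, so this only suits a handful of features.
    """
    n = len(record)

    def value(subset):
        mixed = [record[i] if i in subset else baseline[i] for i in range(n)]
        return score_fn(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        contributions.append(phi)
    return contributions

# Hypothetical anomaly score with a salary/experience interaction
def toy_score(r):
    age, salary, experience = r
    return abs(age - 40) / 10 + (salary / 50000) / (1 + experience)

baseline = [40, 60000, 10]   # assumed "typical" record
record = [25, 150000, 2]     # [age, salary, experience]

phi = exact_shapley(toy_score, record, baseline)
base_value = toy_score(baseline)

for name, contribution in zip(['age', 'salary', 'experience'], phi):
    print(f"{name}: {contribution:+.3f}")
print(f"base + contributions = {base_value + sum(phi):.3f}")
print(f"score for record     = {toy_score(record):.3f}")
```

The last two printed numbers match, which is the additive property in action. Because salary and experience interact in `toy_score`, their joint effect is split between the two features instead of being assigned to one.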

Proxy Models

Train simple, interpretable models to mimic complex model predictions:

from sklearn.tree import DecisionTreeClassifier, export_text
from pyod.models.iforest import IForest

# Complex model predictions (labels_ holds the training labels: 0 = normal, 1 = anomaly)
iforest = IForest(contamination=0.1)
iforest.fit(X)
complex_predictions = iforest.labels_

# Simple proxy model
proxy = DecisionTreeClassifier(max_depth=3)
proxy.fit(X, complex_predictions)

# Proxy provides interpretable rules
rules = export_text(proxy, feature_names=['age', 'salary', 'experience'])
print(rules)

|--- salary <= 75000.00
|   |--- age <= 25.00
|   |   |--- class: 0 (normal)
|   |--- age >  25.00
|   |   |--- experience <= 2.00
|   |   |   |--- class: 1 (anomaly)

Visualization Techniques

Partial Dependence Plots: Show how anomaly scores change with individual features:

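from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# scikit-learn's partial dependence tools expect a regressor or classifier,
# so this sketch fits a surrogate regressor to the detector's anomaly scores.
# The plot therefore describes the surrogate, not the detector itself.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(X, clf.decision_function(X))

# Create partial dependence plot for age (column 0) and salary (column 1)
PartialDependenceDisplay.from_estimator(surrogate, X, features=[0, 1], feature_names=['age', 'salary', 'experience'])
plt.show()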

Individual Conditional Expectation (ICE) plots: Show prediction changes for individual records across feature values.
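
ICE curves are simple enough to compute by hand, which also shows how they relate to partial dependence. The sketch below uses plain Python and a hypothetical `toy_score` function in place of a fitted detector; with a real model you would pass something like its `decision_function` instead.

```python
def ice_curves(score_fn, records, feature, grid):
    """One curve per record: the score at each grid value of `feature`,
    with every other feature held at the record's own value."""
    curves = []
    for record in records:
        curve = []
        for value in grid:
            modified = list(record)
            modified[feature] = value
            curve.append(score_fn(modified))
        curves.append(curve)
    return curves

def partial_dependence(curves):
    """Partial dependence is the point-by-point average of the ICE curves."""
    return [sum(column) / len(column) for column in zip(*curves)]

# Hypothetical anomaly score: high salary with little experience looks unusual
def toy_score(r):
    age, salary, experience = r
    return (salary / 50000) / (1 + experience)

records = [[25, 150000, 2], [45, 90000, 20], [30, 60000, 5]]  # [age, salary, experience]
salary_grid = [50000, 100000, 150000]

curves = ice_curves(toy_score, records, feature=1, grid=salary_grid)
pd_curve = partial_dependence(curves)

for record, curve in zip(records, curves):
    print(record, [round(score, 3) for score in curve])
print("partial dependence:", [round(score, 3) for score in pd_curve])
```

The curve for the record with 2 years of experience rises much faster than the curve for the record with 20 years. The averaged partial dependence curve hides that difference, which is the main reason to look at ICE plots.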

Counterfactual Explanations

Show minimum changes needed to change a prediction from anomaly to normal:

import numpy as np

def generate_counterfactual(model, anomalous_record, X_reference, feature_names):
    """Search for a single-feature change that makes the record normal.

    Simplified sketch: each feature in turn is replaced by its median in
    X_reference, an assumed stand-in for a "normal" value.
    """
    record = np.asarray(anomalous_record, dtype=float)
    normal_values = np.median(X_reference, axis=0)

    for i, name in enumerate(feature_names):
        modified = record.copy()
        modified[i] = normal_values[i]

        if model.predict(modified.reshape(1, -1))[0] == 0:  # 0 = normal
            return modified, {name: (record[i], modified[i])}

    return None, {}  # no single-feature change was enough

original = [25, 150000, 2]  # [age, salary, experience] - flagged as anomaly
# clf and X_train are the fitted detector and training data from the earlier examples
counterfactual, changes = generate_counterfactual(
    clf, original, X_train, ['age', 'salary', 'experience'])
for name, (old, new) in changes.items():
    print(f"Change {name} from {old:.0f} to {new:.0f} to be normal")

Change salary from 150000 to 80000 to be normal

Summary

In this lesson we’ve covered:

What is anomaly detection: Identification of unusual data points that deviate significantly from expected patterns or normal behavior
Types of outliers: Strong vs weak, internal vs external, local vs global outliers and their detection characteristics
Masking and swamping: How extreme outliers can hide other outliers (masking) or cause normal points to be flagged (swamping)
Labels vs scores: Binary classifications vs continuous measures of anomaly strength
Univariate vs multivariate: Single feature analysis vs capturing relationships between multiple features
Categorical feature detection: Using counts, thresholds, and marginal probabilities for non-numeric data
Visualization techniques: Histograms, boxplots, scatterplots, and advanced methods for understanding anomaly patterns
Rules-based methods: Simple, interpretable threshold and logical condition approaches
Statistical methods: Z-score, IQR, MAD, and modified Z-score for mathematical anomaly detection
Machine learning methods: Distance-based (k-NN), density-based (LOF), clustering-based (DBSCAN), tree-based (Isolation Forest), probabilistic (One-Class SVM), and histogram-based (HBOS) approaches
Practical considerations: Setting realistic anomaly rates, data separation strategies, and ensemble methods
Explainability: Understanding the difference between interpretable methods and explainability techniques like SHAP and counterfactuals

These techniques enable you to find interesting data, clean datasets, detect fraud, monitor systems, and identify rare events across diverse domains from finance to healthcare to cybersecurity.

Next Steps

Recommended resources:

Outlier Detection in Python: Manning book written by Brett Kennedy. A great book, and responsible for most of the content in this lesson. I owe Brett a lot for how important this book has been for me!
PyOD: Python library with 40+ anomaly detection algorithms pyod.readthedocs.io
Numenta Anomaly Benchmark: Real-world time series anomaly detection benchmark github.com/numenta/NAB
