10. Correlation Analysis
Correlation analysis quantifies the strength and direction of relationships between variables. Understanding correlations prevents spurious assumptions and reveals hidden patterns in data.
Pearson vs Spearman Correlation
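Pearson measures the strength of linear relationships. The coefficient itself does not require normally distributed data, but its standard significance tests assume approximate normality, and it is sensitive to outliers. Spearman measures monotonic relationships using rank order, so it handles monotonic non-linear patterns and outliers better.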
import pandas as pd
import numpy as np
df = pd.read_csv('sales_data.csv')
# Pearson correlation matrix
pearson_corr = df[['revenue', 'marketing_spend', 'customer_count']].corr(method='pearson')
print("Pearson Correlation:")
print(pearson_corr)
# Spearman for monotonic (possibly non-linear) relationships
spearman_corr = df[['revenue', 'marketing_spend', 'customer_count']].corr(method='spearman')
print("\nSpearman Correlation:")
print(spearman_corr)
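To see the difference between the two coefficients on data where the answer is known, the following minimal sketch uses a synthetic, strictly increasing but non-linear relationship. The `demo` frame and its `x` and `y` columns are made up for illustration and are not part of `sales_data.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y increases with x,
# but exponentially rather than linearly.
x = np.arange(1, 21, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# Pearson looks at linear association on the raw values.
pearson_r = demo.corr(method="pearson").loc["x", "y"]

# Spearman looks at association between the ranks of the values.
spearman_r = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because the ranks of `x` and `y` agree exactly, Spearman reports a perfect monotonic relationship, while Pearson comes out lower because the points do not lie on a straight line.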
Visualizing Correlations
Heatmaps reveal correlation structures at a glance.
import matplotlib.pyplot as plt
import seaborn as sns
# Full correlation matrix for all numeric columns
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
Categorical Variable Correlations
Use Cramér's V for categorical-categorical relationships:
from scipy.stats import chi2_contingency
def cramers_v(confusion_matrix):
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    return np.sqrt(phi2 / min(k-1, r-1))
# Example: correlation between product category and customer segment
confusion = pd.crosstab(df['product_category'], df['customer_segment'])
v = cramers_v(confusion)
print(f"Cramér's V: {v:.3f}")
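A quick way to build intuition for the scale of Cramér's V is to apply the function to contingency tables whose answer is known in advance. The sketch below restates `cramers_v` so that it runs on its own, and the two 3x3 tables are invented for illustration: one where each row category maps to exactly one column category, and one where the categories are independent.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(confusion_matrix):
    """Cramér's V for a contingency table given as a DataFrame."""
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    return np.sqrt(phi2 / min(k - 1, r - 1))


# Hypothetical table: every row category maps to exactly one column category.
perfect = pd.DataFrame([[30, 0, 0], [0, 30, 0], [0, 0, 30]])

# Hypothetical table: counts are identical everywhere, so rows and columns
# carry no information about each other.
independent = pd.DataFrame([[10, 10, 10], [10, 10, 10], [10, 10, 10]])

v_perfect = cramers_v(perfect)
v_independent = cramers_v(independent)

print(f"Perfect association: {v_perfect:.3f}")
print(f"Independence:        {v_independent:.3f}")
```

Values on real data fall between these two extremes. Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, which lowers the chi-squared statistic and therefore the resulting V.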
Correlation vs Causation Trap
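A strong correlation does not by itself establish causation. Lagged correlation analysis can show whether changes in one variable tend to precede changes in another, which is consistent with a causal link but does not prove one: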
# Check if marketing spend leads to revenue change
shifted_marketing = df['marketing_spend'].shift(7)  # 7-day lag, assuming one row per day in date order
lagged_corr = df['revenue'].corr(shifted_marketing)  # a single number, not a column
print(f"Lagged correlation (7 days): {lagged_corr:.3f}")
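A single lag is a guess, so it can help to scan a range of lags and see where the correlation peaks. The sketch below is one way to do that on synthetic daily data in which revenue is constructed to follow marketing spend by three days. The series, the `true_lag` value, and the lag range are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 200
true_lag = 3  # hypothetical: revenue responds three days after spend

# Synthetic daily series, one value per day in date order.
marketing = pd.Series(rng.normal(100, 10, n_days))
revenue = 5 * marketing.shift(true_lag) + rng.normal(0, 5, n_days)

# Correlate revenue with marketing spend shifted by 0 to 7 days.
# Series.corr ignores the NaN rows that shifting introduces.
lag_corrs = {lag: revenue.corr(marketing.shift(lag)) for lag in range(0, 8)}

# The lag with the largest absolute correlation is the best candidate.
best_lag = max(lag_corrs, key=lambda lag: abs(lag_corrs[lag]))

for lag, r in lag_corrs.items():
    print(f"lag {lag}: {r:+.3f}")
print(f"Strongest correlation at lag {best_lag}")
```

A clear peak at one lag suggests temporal precedence, but a shared driver such as seasonality can produce the same pattern, so the result is a lead for further investigation rather than evidence of cause.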
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Calculate the correlation matrix for your dataset, keep only the pairs whose absolute correlation exceeds 0.5, then visualize the result as a heatmap with annotations shown only for those high-correlation pairs.
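One possible starting point for the filtering step is sketched below. It uses a synthetic frame with made-up columns `a` to `d` in place of your dataset, and the variable names `high_only` and `high_pairs` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for your dataset (an assumption for illustration).
rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=n),   # strongly positive with a
    "c": rng.normal(size=n),                  # unrelated to the others
    "d": -a + rng.normal(scale=0.5, size=n),  # strongly negative with a
})

corr = demo.corr()
threshold = 0.5

# Matrix form: keep strong off-diagonal cells, set everything else to NaN.
off_diagonal = ~np.eye(len(corr), dtype=bool)
high_only = corr.where((corr.abs() > threshold) & off_diagonal)

# List form: take the upper triangle so each pair appears once,
# then keep the pairs above the threshold, strongest first.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high_pairs = pairs[pairs.abs() > threshold].sort_values(key=np.abs, ascending=False)

print(high_pairs)
```

Passing `high_only` to the `sns.heatmap` call from the visualization section in place of `corr_matrix` should then draw and annotate only the strong pairs, since seaborn leaves NaN cells blank.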