Correlation Analysis — Data Analysis with Local AI (Chapter 10)

Correlation analysis quantifies the strength and direction of relationships between variables. Measuring correlations, rather than assuming them, helps you avoid spurious conclusions and can reveal patterns that are not obvious from the raw data.

Pearson vs Spearman Correlation

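Pearson's r measures the strength of a linear relationship. It is sensitive to outliers, and its significance tests assume approximately normal data. Spearman's rho measures monotonic relationships using rank order, so it captures non-linear patterns that consistently increase or decrease, and it is less affected by outliers.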

import pandas as pd
import numpy as np

df = pd.read_csv('sales_data.csv')

# Pearson correlation matrix
pearson_corr = df[['revenue', 'marketing_spend', 'customer_count']].corr(method='pearson')
print("Pearson Correlation:")
print(pearson_corr)

# Spearman for monotonic (possibly non-linear) relationships
spearman_corr = df[['revenue', 'marketing_spend', 'customer_count']].corr(method='spearman')
print("\nSpearman Correlation:")
print(spearman_corr)
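
To make the difference concrete, here is a minimal sketch on synthetic data (the `demo` frame and its columns are invented for illustration and are not part of sales_data.csv). Because y always increases with x, the ranks agree exactly and Spearman is 1, while Pearson is lower because the relationship is not a straight line.

```python
import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration only
x = np.arange(1, 21, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 2)})  # monotonic but strongly non-linear

pearson_r = demo["x"].corr(demo["y"], method="pearson")
spearman_r = demo["x"].corr(demo["y"], method="spearman")

print(f"Pearson:  {pearson_r:.3f}")   # below 1: the curve is not a straight line
print(f"Spearman: {spearman_r:.3f}")  # 1.000: the rank order is identical
```

A large gap between the two coefficients on real data is a hint to plot the pair before trusting either number.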

Visualizing Correlations

Heatmaps reveal correlation structures at a glance.

import matplotlib.pyplot as plt
import seaborn as sns

# Full correlation matrix for all numeric columns
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0, fmt='.2f')
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)
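
With many columns a heatmap gets crowded, so it can help to list the strongest pairs as well. The sketch below is one possible approach; `top_correlated_pairs` is a hypothetical helper, not a pandas function, and the small `sample` frame is made up so the block runs on its own.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up numbers for illustration
sample = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],   # almost a linear function of a
    "c": [6, 1, 4, 2, 5, 3],       # no clear relation to a or b
})
top = top_correlated_pairs(sample.corr(), n=2)
print(top)
```

In the chapter's workflow you would pass `corr_matrix` instead of `sample.corr()`.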

Categorical Variable Correlations

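Pearson and Spearman do not apply to unordered categories. For the relationship between two categorical variables, use Cramér's V, which ranges from 0 (no association) to 1 (perfect association):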

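from scipy.stats import chi2_contingency

def cramers_v(contingency_table):
    """Cramér's V for a pandas contingency table (0 = no association, 1 = perfect)."""
    # Note: chi2_contingency applies Yates' continuity correction to 2x2 tables by default
    chi2 = chi2_contingency(contingency_table)[0]
    n = contingency_table.to_numpy().sum()  # total number of observations
    phi2 = chi2 / n
    r, k = contingency_table.shape
    return np.sqrt(phi2 / min(k - 1, r - 1))

# Example: association between product category and customer segment
contingency = pd.crosstab(df['product_category'], df['customer_segment'])
v = cramers_v(contingency)
print(f"Cramér's V: {v:.3f}")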
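
A quick sanity check is to run the statistic on tables where the answer is known. The sketch below uses two hand-made tables (assumed data) and a standalone copy of the formula named `cramers_v_demo`. It passes `correction=False` so that no continuity correction is applied; that is a choice made for this check, not necessarily what you want on real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_demo(table):
    """Cramér's V without continuity correction (illustrative copy of the formula)."""
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

# Each row category maps to exactly one column category: perfect association
perfect = pd.DataFrame([[20, 0, 0], [0, 20, 0], [0, 0, 20]])
# Row proportions are identical (1:2 in both rows): no association
independent = pd.DataFrame([[10, 20], [20, 40]])

v_perfect = cramers_v_demo(perfect)
v_independent = cramers_v_demo(independent)
print(f"Perfect association: {v_perfect:.3f}")
print(f"No association:      {v_independent:.3f}")
```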

Correlation vs Causation Trap

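Correlation alone, however strong, does not establish causation. Lagged correlation can show whether changes in one variable tend to precede changes in another, which is consistent with a causal link but does not prove one: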

# Check whether changes in marketing spend precede changes in revenue.
# Assumes one row per day, sorted by date.
shifted_marketing = df['marketing_spend'].shift(7)  # 7-day lag
# corr() returns a single number; rows made NaN by the shift are ignored
lagged_corr = df['revenue'].corr(shifted_marketing)
print(f"Lagged correlation (7 days): {lagged_corr:.3f}")
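
A single lag is a guess, so it is often more informative to scan a range of lags and see where the correlation peaks. The following sketch uses synthetic daily series in which revenue is built to follow marketing spend by three days. The series, the 3-day delay, and the 0 to 7 lag range are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days = 200
marketing = pd.Series(rng.normal(100, 10, n_days))
# Assumed ground truth for the demo: revenue follows marketing with a 3-day delay
revenue = 5 * marketing.shift(3) + pd.Series(rng.normal(0, 5, n_days))

# Correlate revenue with marketing shifted by each candidate lag
lag_corrs = {lag: revenue.corr(marketing.shift(lag)) for lag in range(0, 8)}
best_lag = max(lag_corrs, key=lambda lag: abs(lag_corrs[lag]))

for lag, r in lag_corrs.items():
    print(f"lag {lag}: {r:+.3f}")
print(f"Strongest lag: {best_lag} days")
```

Even a clear peak only shows temporal precedence. Seasonality or a shared driver can produce the same pattern.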

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
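
One way to capture those details is to print them alongside the results. This is a minimal sketch; the `run_record` name and its fields are illustrative, and the data path is simply the file name used earlier in the chapter.

```python
import platform
import sys

import numpy as np
import pandas as pd

# Collect the environment details that affect reproducibility
run_record = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "data_path": "sales_data.csv",  # adjust to the file you actually loaded
}

for key, value in run_record.items():
    print(f"{key}: {value}")
```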