Statistical Analysis — Data Analysis with Local AI (Chapter 8)

Statistical analysis transforms raw data into insights about populations and processes. AI can guide technique selection based on data characteristics, explain statistical outputs in accessible terms, and identify when assumptions are violated.

Common statistical tasks include describing distributions, comparing groups, measuring relationships, and testing hypotheses. Each task has multiple applicable techniques; selection depends on data type, distribution shape, sample size, and analytical goals.
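To make the selection idea concrete for the "comparing groups" task, here is a minimal sketch that picks between two standard two-sample tests based on a normality check. The function name `compare_two_groups` and the `alpha` threshold are illustrative assumptions, not part of the chapter's code, and pre-testing normality is only one of several reasonable ways to choose a test.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha: float = 0.05) -> dict:
    """Pick a two-sample test from a normality check on each group (illustrative)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Shapiro-Wilk: a small p-value is evidence against normality
    looks_normal = all(stats.shapiro(group).pvalue > alpha for group in (a, b))

    if looks_normal:
        # Welch's t-test compares means without assuming equal variances
        result = stats.ttest_ind(a, b, equal_var=False)
        test = "Welch t-test"
    else:
        # Rank-based alternative when normality is doubtful
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        test = "Mann-Whitney U"

    return {
        "test": test,
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
    }

# Two right-skewed samples, so the rank-based test is selected
rng = np.random.default_rng(0)
skewed_a = rng.lognormal(0.0, 1.0, 200)
skewed_b = rng.lognormal(0.5, 1.0, 200)
outcome = compare_two_groups(skewed_a, skewed_b)
print(outcome)
```

The same summary a function like this produces (test name, statistic, p-value) is also a compact input for a model prompt asking for a plain-language explanation.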

Descriptive Statistics Guidance

Describing data requires choosing appropriate summary measures and visualizations. AI can recommend measures suited to distribution characteristics.

import ollama
import pandas as pd
import numpy as np

def recommend_descriptive_stats(df: pd.DataFrame, columns: list = None) -> str:
    """Recommend descriptive statistics for specified columns."""
    
    if columns:
        subset = df[columns]
    else:
        subset = df
    
    # Analyze characteristics
    analysis = {}
    for col in subset.columns:
        if pd.api.types.is_numeric_dtype(subset[col]):
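            # Cast to plain Python numbers so the prompt shows clean values
            # rather than NumPy scalar reprs
            analysis[col] = {
                'mean': round(float(subset[col].mean()), 3),
                'median': round(float(subset[col].median()), 3),
                'std': round(float(subset[col].std()), 3),
                'skew': round(float(subset[col].skew()), 3),
                'missing': int(subset[col].isnull().sum())
            }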
    
    prompt = f"""Data characteristics:
    {analysis}
    
    Recommend appropriate descriptive statistics and visualizations.
    For each column, explain why recommended measures fit the distribution."""
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

# Example usage
df = pd.DataFrame({
    'income': np.random.lognormal(10, 1, 1000),  # Heavily right-skewed
    'age': np.random.normal(40, 15, 1000),  # Approximately normal
    'score': np.random.exponential(5, 1000)  # Right-skewed
})

recommendations = recommend_descriptive_stats(df)
print(recommendations)

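For data like this, the model will typically recommend the median and IQR for the skewed columns (income and score) and the mean and standard deviation for the approximately normal column (age). The exact wording varies between runs and models, so check the recommendation against the computed skew values.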
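The rule behind that recommendation can also be applied directly, which gives a baseline to compare the model's answer against. The sketch below is an assumption-laden illustration: the helper name `summarize_by_shape` and the skew cutoff of 1.0 are hypothetical choices, not values taken from the chapter.

```python
import numpy as np
import pandas as pd

def summarize_by_shape(series: pd.Series, skew_threshold: float = 1.0) -> dict:
    """Return mean/std for roughly symmetric data, median/IQR otherwise."""
    values = series.dropna()
    skew = float(values.skew())

    if abs(skew) < skew_threshold:
        # Roughly symmetric: the classical summaries are representative
        return {
            "center": "mean", "center_value": float(values.mean()),
            "spread": "std", "spread_value": float(values.std()),
            "skew": skew,
        }

    # Skewed: use summaries that are not pulled around by the long tail
    q1, q3 = values.quantile([0.25, 0.75])
    return {
        "center": "median", "center_value": float(values.median()),
        "spread": "iqr", "spread_value": float(q3 - q1),
        "skew": skew,
    }

rng = np.random.default_rng(42)
income_summary = summarize_by_shape(pd.Series(rng.lognormal(10, 1, 1000)))
age_summary = summarize_by_shape(pd.Series(rng.normal(40, 15, 1000)))
print(income_summary["center"], age_summary["center"])
```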

Correlation Analysis

Measuring relationships between variables requires appropriate correlation measures and careful interpretation. AI can guide these decisions.

def recommend_correlation_approach(df: pd.DataFrame, x: str, y: str) -> str:
    """Recommend correlation analysis approach for two variables."""
    
    x_dtype = df[x].dtype
    y_dtype = df[y].dtype
    x_skew = df[x].skew() if pd.api.types.is_numeric_dtype(x_dtype) else 0
    y_skew = df[y].skew() if pd.api.types.is_numeric_dtype(y_dtype) else 0
    x_missing = df[x].isnull().sum()
    y_missing = df[y].isnull().sum()
    
    prompt = f"""Variables:
    - {x}: dtype={x_dtype}, skew={x_skew:.2f}, missing={x_missing}
    - {y}: dtype={y_dtype}, skew={y_skew:.2f}, missing={y_missing}
    
    Recommend correlation measure (Pearson, Spearman, Kendall) and explain why.
    Also recommend visualization to accompany the correlation measure."""
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

# Recommend correlation approach
df = pd.DataFrame({
    'education_years': np.random.normal(14, 3, 100),
    'salary': np.random.lognormal(10.5, 0.5, 100)  # Non-normal
})
recommendation = recommend_correlation_approach(df, 'education_years', 'salary')
print(recommendation)

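Pearson correlation measures the strength of a linear relationship; its significance test additionally assumes roughly normal data, and the coefficient is sensitive to outliers. Spearman correlation measures monotonic relationships using ranks, so it suits ordinal data and non-normal distributions. Kendall's tau measures agreement in the ordering of pairs and is often preferred for small samples or data with many ties.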
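A small synthetic example illustrates the difference. The data here are made up for the purpose: `y` is an exact exponential function of `x`, so the relationship is perfectly monotonic but not linear.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 200)
y = np.exp(x)  # strictly increasing in x, but far from a straight line

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
print(f"Kendall:  {kendall_tau:.3f}")
```

Both rank-based coefficients come out at 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson is noticeably lower because it only credits the linear part of the relationship.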

Implementing AI Recommendations

from scipy import stats

def correlation_with_explanation(df: pd.DataFrame, x: str, y: str) -> dict:
    """Calculate correlation with AI-generated interpretation."""
    
    # Ask the model which measure it would use; the text is returned alongside
    # the result so it can be compared with the rule applied below
    recommendation = recommend_correlation_approach(df, x, y)

    # Keep only rows where both values are present so the two series stay aligned
    pairs = df[[x, y]].dropna()

    # Rule of thumb instead of parsing the model's text:
    # Pearson when both variables have |skew| < 1, otherwise Spearman
    x_skew = abs(pairs[x].skew())
    y_skew = abs(pairs[y].skew())

    if x_skew < 1 and y_skew < 1:
        corr, pval = stats.pearsonr(pairs[x], pairs[y])
        measure = 'Pearson'
    else:
        corr, pval = stats.spearmanr(pairs[x], pairs[y])
        measure = 'Spearman'

    # Get interpretation, naming the measure that was actually computed
    prompt = f"""{measure} correlation = {corr:.3f}, p-value = {pval:.4g}
    n = {len(pairs)}

    Explain what this correlation means in plain language, including:
    - Strength and direction of relationship
    - Statistical significance
    - Practical implications"""

    interpretation = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )['message']['content']

    return {
        'measure': measure,
        'correlation': corr,
        'p_value': pval,
        'recommendation': recommendation,
        'interpretation': interpretation
    }

# Calculate with explanation
result = correlation_with_explanation(df, 'education_years', 'salary')
print(f"Correlation: {result['correlation']:.3f}")
print(f"Interpretation: {result['interpretation']}")
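The function above chooses the measure from skewness rather than from the model's text. If the model's recommendation should drive the choice instead, one possible approach is to ask for a machine-readable final line and parse only that. Everything in this sketch is hypothetical: the `FORMAT_INSTRUCTION` wording, the `MEASURE:` tag, and the helper name are assumptions, and the model is not guaranteed to follow the instruction, which is why a default is kept.

```python
import re

# Hypothetical sentence to append to the recommendation prompt
FORMAT_INSTRUCTION = (
    "End your answer with a line of the form "
    "'MEASURE: Pearson', 'MEASURE: Spearman', or 'MEASURE: Kendall'."
)

def parse_recommended_measure(text: str, default: str = "spearman") -> str:
    """Extract the tagged measure from model output, or fall back to a default."""
    match = re.search(
        r"^\s*MEASURE:\s*(pearson|spearman|kendall)\b",
        text,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return match.group(1).lower() if match else default

# Simulated model output; no model call is made here
sample_reply = "Pearson is a poor fit because salary is skewed.\nMEASURE: Spearman"
print(parse_recommended_measure(sample_reply))
```

Matching only the tagged line avoids picking up a measure that the model mentions in order to reject it, as in the first sentence of the simulated reply.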