AI-Guided EDA — Data Analysis with Local AI (Chapter 2)

Exploratory Data Analysis (EDA) is the foundation of any data project. Without understanding data structure, distributions, and relationships, analysis proceeds on assumptions that may not hold. AI-guided EDA uses local models to recommend systematic exploration strategies tailored to specific datasets.

The core problem EDA solves is uncertainty about data quality and structure. Values may be missing, duplicated, or incorrectly formatted. Distributions may be skewed, bimodal, or contain outliers. Relationships between variables may be linear, nonlinear, or absent. Manual EDA requires cycling through techniques until patterns emerge, often missing important features due to time constraints or limited imagination.
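
The quality problems listed above can be checked mechanically before any model is involved. The following is a minimal sketch, not a complete audit: the function name `quality_report` and the toy data are hypothetical, and "incorrectly formatted" is narrowed here to one common case, numbers stored as text.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing values, duplicate rows, and numeric-looking text columns."""
    # Count missing values per column, keeping only columns that have any
    missing = {col: int(n) for col, n in df.isnull().sum().items() if n > 0}

    # Count rows that exactly repeat an earlier row
    duplicate_rows = int(df.duplicated().sum())

    # Flag text columns whose non-missing values all parse as numbers,
    # which usually means the column was read with the wrong type
    numeric_as_text = []
    for col in df.columns:
        if not (is_object_dtype(df[col]) or is_string_dtype(df[col])):
            continue
        values = df[col].dropna()
        if len(values) > 0 and pd.to_numeric(values, errors="coerce").notna().all():
            numeric_as_text.append(col)

    return {
        "missing": missing,
        "duplicate_rows": duplicate_rows,
        "numeric_as_text": numeric_as_text,
    }


# Hypothetical toy data for illustration only
sample = pd.DataFrame({
    "age": [34, None, 51, 51],
    "spend": ["120.5", "80", "43.2", "43.2"],
    "plan": ["basic", "pro", "pro", "pro"],
})
report = quality_report(sample)
print(report)
```

A report like this can also be pasted into a prompt so the model sees concrete quality facts rather than raw rows.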

AI guidance addresses this by suggesting techniques based on data characteristics. When a model sees a dataset with twenty numeric columns, it can recommend examining correlations first, then distributions, then interactions. This structured approach reduces the chance of missing important patterns.

Structuring the Exploration

Effective EDA progresses from general to specific. Start with overall dataset characteristics: row count, column count, data types, and missing value patterns. Then examine individual variables: distribution shape, central tendency, spread, and outliers. Finally, investigate relationships: correlations, group differences, and temporal patterns.
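
The three stages can be sketched with plain pandas calls. This is an illustrative outline under stated assumptions: the function name `staged_profile` and the toy columns are hypothetical, only numeric columns are profiled in stages two and three, and outlier detection, group differences, and temporal patterns are left out for brevity.

```python
import pandas as pd


def staged_profile(df: pd.DataFrame) -> dict:
    """Collect summary facts in three stages: dataset, variables, relationships."""
    numeric = df.select_dtypes(include="number")

    # Stage 1: overall dataset characteristics
    overview = {
        "rows": len(df),
        "columns": len(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
    }

    # Stage 2: individual numeric variables (center, spread, shape),
    # transposed so each row describes one column
    variables = numeric.agg(["mean", "median", "std", "skew"]).T

    # Stage 3: pairwise linear (Pearson) correlations between numeric variables
    correlations = numeric.corr()

    return {"overview": overview, "variables": variables, "correlations": correlations}


# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "tenure_months": [1, 2, 3, 4, 5],
    "monthly_spend": [10.0, 20.0, 30.0, 40.0, 50.0],
    "plan": ["a", "b", "a", "b", "a"],
})
profile = staged_profile(toy)
print(profile["overview"])
print(profile["variables"])
print(profile["correlations"])
```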

import pandas as pd
import ollama

def ai_guided_eda(df: pd.DataFrame, analysis_goal: str) -> str:
    """Use AI to recommend EDA approach for dataset and goal."""
    
    # Build context about the dataset
    context = f"""Dataset has {len(df)} rows and {len(df.columns)} columns.
    Columns: {df.dtypes.astype(str).to_dict()}
    Missing values: {df.isnull().sum().to_dict()}
    
    Analysis goal: {analysis_goal}
    """
    
    prompt = f"""Based on this dataset and analysis goal, recommend an EDA approach.
    List specific techniques to apply in order with reasoning for each.
    Consider data types, missing patterns, and statistical requirements.
    
    {context}"""
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

# Example usage
df = pd.read_csv('customer_data.csv')
recommendations = ai_guided_eda(df, "Understand customer churn drivers")
print(recommendations)

The model responds with prioritized recommendations, explaining why each technique matters for the stated goal. This explanation builds intuition about analysis strategies.
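
Because the recommendations come back as free text, it can help to pull the ordered steps into a list before acting on them. The sketch below assumes the model formats its steps as a plain numbered list, which is not guaranteed; the helper name `extract_steps` and the sample response are hypothetical and written by hand.

```python
import re


def extract_steps(text: str) -> list[str]:
    """Pull numbered list items such as '1. ...' or '2) ...' out of model output."""
    steps = []
    for line in text.splitlines():
        # Match an optional indent, a number, '.' or ')', then the step text
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            steps.append(match.group(1))
    return steps


# Hypothetical response text, written by hand for illustration
example_response = """Here is a suggested plan:
1. Check missing values in tenure, because gaps bias churn estimates.
2) Plot the distribution of monthly charges.
Some closing remarks."""

steps = extract_steps(example_response)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

If the list comes back empty, the model probably used a different layout (bullets or bold numbering, for example), and the raw text should be read directly.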

Handling Recommendation Failures

AI models sometimes recommend inappropriate techniques. They may suggest parametric tests for non-normal data, ignore missing value patterns, or propose visualizations that misrepresent the data. Effective AI-guided EDA requires verification at each step.

def validate_eda_recommendation(df: pd.DataFrame, technique: str) -> str:
    """Verify AI recommendation is appropriate for this data."""
    
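    # Boolean flags describing the data; each name states what True means
    numeric = df.select_dtypes(include='number')
    checks = {
        'has_missing_values': bool(df.isnull().sum().sum() > 0),
        'small_sample_under_30_rows': len(df) < 30,
        'max_abs_skewness_over_2': bool(numeric.skew().abs().max() > 2)
    }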
    
    prompt = f"""I'm planning to apply: {technique}
    Data characteristics: {checks}
    Is this technique appropriate? If not, what should I use instead?"""
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

# Example: checking if regression is appropriate
result = validate_eda_recommendation(
    df, 
    "Linear regression to predict sales"
)
print(result)

This validation step can catch recommendations that violate statistical assumptions. A useful response explicitly confirms or rejects the technique and explains why.
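
Since the validation itself is performed by a model, a deterministic check can serve as a second opinion. The sketch below is one possible form, not a statistical authority: the function name `rule_based_notes` and the message wording are hypothetical, and the thresholds (30 rows, absolute skewness of 2) simply mirror the checks used in `validate_eda_recommendation`.

```python
def rule_based_notes(has_missing: bool, n_rows: int, max_abs_skew: float) -> list[str]:
    """Return plain-language cautions derived from simple data checks."""
    notes = []
    if has_missing:
        notes.append("Missing values present: decide how to handle them first.")
    if n_rows < 30:
        notes.append("Fewer than 30 rows: large-sample approximations may not hold.")
    if max_abs_skew > 2:
        notes.append("Strong skew: consider a transform or a rank-based method.")
    return notes


# Hypothetical inputs for illustration only
for note in rule_based_notes(has_missing=True, n_rows=20, max_abs_skew=3.1):
    print(note)
```

When these notes disagree with the model's verdict, the disagreement is a prompt to inspect the data directly.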