AI for Data Analysis — Data Analysis with Local AI (Chapter 1)

Data analysis involves decisions at every step: which transformations to apply, which relationships to examine, which visualizations to create, which tests to run. Each decision requires knowledge that accumulates through experience. Practitioners often know what they want to accomplish but stumble on implementation details or choose inappropriate techniques due to unfamiliarity.

Local AI addresses this gap by providing contextual guidance throughout the analysis process. The key distinction is that local AI models running on Ollama process data privately, without sending potentially sensitive information to external services. This matters significantly when analyzing customer data, financial records, health information, or any dataset with confidentiality requirements.

Consider a typical scenario: examining sales data to understand why Q3 revenue declined. Without AI assistance, an analyst might start by plotting revenue over time, compute summary statistics by region, then manually examine outliers. With AI assistance, the model suggests a systematic approach: verify data integrity first, segment by product category and region, test for significant changes using appropriate statistical tests, then identify contributing factors through decomposition. The AI does not replace analytical thinking but structures the exploration.

The workflow for AI-assisted analysis follows a consistent pattern. First, load data and establish context by describing the dataset structure to the model. Second, ask specific questions about techniques or interpretations. Third, apply the model's suggestions while verifying they match the data. Fourth, iterate as understanding deepens.

Setting Up the Analysis Environment

Python provides the foundation for this course with several key libraries. Pandas handles data manipulation. Matplotlib and Seaborn create visualizations. Scipy provides statistical functions. Ollama's Python SDK connects to local models.

# Install required packages
pip install pandas matplotlib seaborn scipy ollama sqlglot

# Verify Ollama connection
import ollama

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'What statistical test compares means of two groups?'}]
)
print(response['message']['content'])

This basic connection test confirms that Ollama is running and accessible. The response validates the setup before attempting more complex interactions.

Model Selection for Analysis Tasks

Different models excel at different tasks. Smaller models like llama3.2 or mistral work well for straightforward questions about techniques and interpretations. Larger models handle complex multi-step reasoning about analysis strategies. Gemma models often provide concise responses useful for quick lookups.

# Test different models for statistical question accuracy
import ollama

question = "Is a chi-square test appropriate for comparing proportions in a 2x2 contingency table?"

models = ['llama3.2', 'mistral', 'gemma2:27b']
for model in models:
    response = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': question}]
    )
    print(f"{model}: {response['message']['content'][:200]}...")

Compare responses for accuracy and helpfulness. The best model for statistical guidance is not necessarily the largest model but rather the one that provides correct, actionable information for the specific domain.