03. Automated Data Profiling
Data profiling extracts summary statistics and quality metrics from datasets. Automated profiling runs the common checks systematically, identifying issues that manual inspection might miss. AI-enhanced profiling adds an interpretation layer on top of the results, explaining what the detected issues mean for downstream analysis.
Manual profiling involves writing descriptive statistics queries, checking distributions, identifying duplicates, and comparing against expected schemas. This process is repetitive and error-prone. Automated profiling handles these checks consistently while AI interpretation explains the implications.
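The manual checks listed above can be sketched in a few lines of pandas. This is a minimal illustration, not a complete profiler: the function name manual_profile, the EXPECTED_SCHEMA mapping, and the sample columns are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype string.
EXPECTED_SCHEMA = {"order_id": "int64", "amount": "float64", "region": "object"}


def manual_profile(df: pd.DataFrame, expected_schema: dict) -> dict:
    """Run the manual checks by hand: statistics, duplicates, schema comparison."""
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return {
        # Descriptive statistics for every column, numeric or not
        "describe": df.describe(include="all"),
        # Fully identical rows beyond their first occurrence
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns the schema expects but the data lacks, and the reverse
        "missing_columns": sorted(set(expected_schema) - set(actual)),
        "unexpected_columns": sorted(set(actual) - set(expected_schema)),
        # Shared columns whose dtype differs: (expected, actual)
        "dtype_mismatches": {
            col: (expected_schema[col], actual[col])
            for col in expected_schema
            if col in actual and actual[col] != expected_schema[col]
        },
    }


# Illustrative data: a repeated row, amounts stored as text, a renamed column.
sample = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": ["10.5", "20.0", "20.0"],
    "channel": ["web", "store", "store"],
})
report = manual_profile(sample, EXPECTED_SCHEMA)
print(report["duplicate_rows"], report["missing_columns"], report["dtype_mismatches"])
```

Every one of these checks has to be rewritten or adapted for each new dataset, which is where the repetition and the errors come from.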
Profile Report Generation
Python's ydata-profiling library (formerly pandas-profiling) generates thorough reports, but AI enhancement adds interpretation that static reports lack. The combination produces reports that both identify issues and explain their significance.
import pandas as pd
from ydata_profiling import ProfileReport
import ollama


def generate_ai_profile_report(df: pd.DataFrame) -> str:
    """Generate profiling report with AI interpretation."""
    # Build the standard profile report. It is computed lazily; call
    # profile.to_file("report.html") to render it next to the AI text.
    profile = ProfileReport(
        df,
        title="Data Profile Report",
        correlations={
            'auto': {'calculate': True},
            'spearman': {'calculate': True}
        }
    )

    # Extract key findings directly from the DataFrame to keep the prompt compact
    missing_counts = df.isnull().sum()
    findings = {
        'missing': missing_counts[missing_counts > 0].to_dict(),
        'duplicates': int(df.duplicated().sum()),
        'skewness': df.select_dtypes(include='number').skew().to_dict(),
        'cardinality': df.nunique().to_dict()
    }

    # Get AI interpretation from a locally served model
    prompt = f"""Interpret these data profile findings:
{findings}
For each issue identified:
1. Explain what it means for data quality
2. Recommend specific remediation steps
3. Flag any issues that should block analysis"""
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    return response['message']['content']


# Usage example
df = pd.read_csv('transactions.csv')
interpretation = generate_ai_profile_report(df)
print(interpretation)
The profile report identifies what exists in the data. The AI interpretation explains what it means and what to do about it.
Interpreting Common Profile Patterns
Certain patterns appear frequently in profile reports. Understanding these patterns helps diagnose issues quickly when AI describes them.
High missing percentages in specific columns often indicate data collection problems. If "email" is 40% missing across thousands of records, the field may be optional in the source system, collected only for certain customer segments, or dropped during prior ETL processing. Investigate the collection mechanism before deciding on imputation strategies.
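One way to investigate the collection mechanism is to check whether missingness concentrates in a particular segment. The sketch below assumes a hypothetical segment column and a helper named missing_rate_by_group; neither comes from a specific source system.

```python
import pandas as pd


def missing_rate_by_group(df: pd.DataFrame, column: str, group_col: str) -> pd.Series:
    """Share of missing values in `column` within each value of `group_col`."""
    return df[column].isna().groupby(df[group_col]).mean()


# Illustrative data: email is only ever captured for business customers.
customers = pd.DataFrame({
    "segment": ["retail", "retail", "retail", "business", "business"],
    "email": [None, None, None, "a@example.com", "b@example.com"],
})
rates = missing_rate_by_group(customers, "email", "segment")
print(rates)
```

A rate near 100% in one group and near 0% in another suggests the field is collected conditionally, in which case imputing values for the missing group would invent data that never existed.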
High cardinality in categorical columns affects memory usage and visualization. Columns with hundreds or thousands of unique values need different treatment than binary or low-cardinality categories. Consider grouping, filtering, or hierarchical encoding based on the analytical goal.
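Grouping is often the simplest of these options. The following sketch collapses infrequent categories into a single label; the helper name group_rare_categories, the min_share threshold, and the "Other" label are assumptions for illustration and should be tuned to the analytical goal.

```python
import pandas as pd


def group_rare_categories(series: pd.Series, min_share: float = 0.05,
                          other_label: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep values that are not rare; missing values pass through unchanged
    return series.where(~series.isin(rare), other_label)


# Illustrative data: two dominant categories and two rare ones.
products = pd.Series(["A"] * 10 + ["B"] * 8 + ["C", "D"])
grouped = group_rare_categories(products, min_share=0.1)
print(grouped.value_counts())
```

Grouping discards detail, so it suits visualization and summary tables better than analyses that depend on the individual rare categories.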
Skewness affects which statistical measures are appropriate. Highly skewed distributions make the mean misleading; the median and percentiles become more informative. Transformations such as log or Box-Cox (both of which require positive values) can bring a distribution closer to normal for techniques that assume normality.
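A small example makes the effect concrete. The values below are made up for illustration, and log1p is used here as one possible transformation because it tolerates zeros; it is not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed amounts: many small values, a few very large ones.
amounts = pd.Series([1, 2, 2, 3, 3, 3, 5, 8, 13, 40, 120, 400], dtype="float64")

mean = amounts.mean()      # pulled upward by the largest values
median = amounts.median()  # reflects the typical value

# log1p computes log(1 + x), so zero values remain valid inputs
logged = np.log1p(amounts)
skew_before = amounts.skew()
skew_after = logged.skew()

print(f"mean={mean:.1f}, median={median:.1f}")
print(f"skew before={skew_before:.2f}, after log1p={skew_after:.2f}")
```

Here the mean is 50 while the median is 4, so reporting the mean alone would badly misrepresent a typical value. The transformed series is less skewed, though not necessarily symmetric.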
Zero-inflation appears when most values are zero with occasional non-zero values. This pattern occurs in count data like number of purchases, support tickets, or insurance claims. Standard statistical techniques may not apply; consider zero-inflated models or separate analysis of zeros versus non-zeros.
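A simple first check for this pattern measures the share of zeros and summarizes the non-zero values separately. In this sketch the helper name zero_inflation_summary and the 50% threshold are assumptions; a high zero share is only a heuristic signal, and a formal test would compare against the zeros expected under a count model.

```python
import pandas as pd


def zero_inflation_summary(series: pd.Series, zero_share_threshold: float = 0.5) -> dict:
    """Summarize the share of zeros and describe the non-zero values separately."""
    values = series.dropna()
    zero_share = float((values == 0).mean()) if len(values) else 0.0
    nonzero = values[values != 0]
    return {
        "zero_share": zero_share,
        # Heuristic flag only; the threshold is an assumption, not a standard
        "is_zero_inflated": zero_share >= zero_share_threshold,
        "nonzero_count": int(len(nonzero)),
        "nonzero_median": float(nonzero.median()) if len(nonzero) else None,
    }


# Illustrative data: insurance claims per customer, mostly zero.
claims = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 2, 5])
summary = zero_inflation_summary(claims)
print(summary)
```

Splitting the analysis this way answers two different questions: how often the event happens at all, and how large it is when it does.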
def interpret_profile_patterns(df: pd.DataFrame) -> dict:
    """Identify and interpret common profile patterns."""
    interpretations = {}
    for col in df.columns:
        notes = []

        # Missing values: flag columns with more than 20% missing
        missing_pct = df[col].isnull().mean() * 100
        if missing_pct > 20:
            notes.append(f"High missing rate ({missing_pct:.1f}%) - investigate collection")

        # Cardinality: flag text columns where most values are unique
        if df[col].dtype == 'object':
            unique_pct = df[col].nunique() / len(df) * 100
            if unique_pct > 50:
                notes.append(f"High cardinality ({df[col].nunique()} unique) - consider grouping")

        # Skewness: flag strongly asymmetric numeric columns
        if pd.api.types.is_numeric_dtype(df[col]):
            skew = df[col].skew()
            if abs(skew) > 2:
                notes.append(f"High skewness ({skew:.2f}) - median more representative than mean")

        # Keep every finding for the column instead of overwriting earlier ones
        if notes:
            interpretations[col] = notes
    return interpretations


# Test on data with various patterns (email addresses are placeholders)
df = pd.DataFrame({
    'email': ['user1@example.com', None, 'user2@example.com', None, None],
    'category': ['A'] * 5,
    'amount': [0, 0, 0, 100, 200]
})
patterns = interpret_profile_patterns(df)
print(patterns)
Generate a profile report for a dataset with known quality issues. Use AI interpretation to prioritize remediation steps. Verify the AI recommendations by examining the raw data.