Text-to-SQL — Data Analysis with Local AI (Chapter 4)

Converting natural language questions to SQL queries enables non-technical users to access data directly. Local models can generate SQL without data leaving the environment, addressing privacy concerns that prevent using external services for query generation.

Text-to-SQL involves understanding the question, mapping concepts to table structures, and constructing valid SQL syntax. Local models can learn schema details and generate queries that respect table relationships and constraints.

Schema-Aware Query Generation

Effective query generation requires the model to understand table structures, column meanings, and relationships. Providing schema context significantly improves query accuracy.

import ollama
import sqlite3

def generate_sql(question: str, schema: str, examples: list = None) -> str:
    """Generate SQL query from natural language question."""
    
    prompt = f"""Given this database schema:
    {schema}
    
    Generate a SQL query for this question:
    {question}
    
    Return ONLY the SQL query, no explanation."""
    
    if examples:
        prompt = f"Examples of correct queries:\n{examples}\n\n{prompt}"
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content'].strip()

# Example schema
schema = """
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    signup_date DATE,
    region TEXT
);

CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    order_date DATE,
    total DECIMAL(10,2),
    status TEXT
);
"""

# Generate query
question = "How many customers signed up in each region last month?"
sql = generate_sql(question, schema)
print(sql)

The model should return something like:

SELECT c.region, COUNT(*) as customer_count
FROM customers c
WHERE c.signup_date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
  AND c.signup_date < DATE_TRUNC('month', CURRENT_DATE)
GROUP BY c.region

Validating Generated Queries

Models sometimes generate syntactically valid but semantically incorrect SQL. Validation catches these errors before they cause problems.

def validate_sql(sql: str, conn: sqlite3.Connection) -> dict:
    """Validate SQL query syntax and execution."""
    
    result = {
        'valid': False,
        'error': None,
        'row_count': 0,
        'sample_results': None
    }
    
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        
        # Verify execution succeeds
        columns = [desc[0] for desc in cursor.description]
        rows = cursor.fetchmany(5)  # Fetch sample results
        
        result['valid'] = True
        result['columns'] = columns
        result['sample_results'] = rows
        
    except Exception as e:
        result['error'] = str(e)
    
    return result

# Test with SQLite
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE test (a INTEGER, b TEXT)")
conn.execute("INSERT INTO test VALUES (1, 'x'), (2, 'y')")

# Validate a generated query
validation = validate_sql("SELECT * FROM test", conn)
print(f"Valid: {validation['valid']}")
print(f"Columns: {validation.get('columns')}")

Handling Query Failures

Common failure modes include incorrect table references, missing JOIN conditions, wrong aggregation logic, and misunderstood column meanings. Debugging involves comparing generated SQL against expected results and providing corrective feedback.

def debug_sql(question: str, sql: str, error: str, schema: str) -> str:
    """Debug failed SQL generation."""
    
    prompt = f"""Question: {question}
    Generated SQL: {sql}
    Error: {error}
    Schema: {schema}
    
    What is wrong with this query? Provide corrected SQL with explanation."""
    
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

Iterative debugging with specific error feedback produces progressively better queries.