Performance Testing — Capstone: Full-Stack AI App (Chapter 8)

Performance testing identifies bottlenecks before production traffic reveals them. Load testing verifies the system handles expected concurrent users. Stress testing finds the breaking point. Both require realistic test scenarios and careful metric collection.

The k6 load testing tool provides JavaScript scripting for complex scenarios:

// k6/load_test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 10 },   // Ramp up to 10 users
    { duration: '5m', target: 10 },   // Stay at 10 users
    { duration: '2m', target: 50 },   // Ramp up to 50 users over 2 minutes
    { duration: '5m', target: 50 },   // Stay at 50 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'],  // 95% under 2 seconds
    http_req_failed: ['rate<0.01'],     // Less than 1% failure rate
  },
};

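const BASE_URL = __ENV.BASE_URL || 'http://localhost:8000';

// open() is only available in k6's init context, so read the file once here
// and not inside the default function. The path is relative to this script.
const pdfData = open('./test.pdf', 'b');

export default function () {
  // Test document upload. Passing an object that contains http.file() as the
  // request body makes k6 send the request as multipart/form-data.
  const uploadRes = http.post(`${BASE_URL}/api/v1/upload`, {
    file: http.file(pdfData, 'test.pdf', 'application/pdf'),
  });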
  
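  // check() returns true only when every assertion passes.
  const uploadOk = check(uploadRes, {
    'upload status 200': (r) => r.status === 200,
    // Parse the body only on success; r.json() throws on a non-JSON body.
    'upload has document_id': (r) => r.status === 200 && r.json('document_id') !== undefined,
  });

  // Without a document there is nothing to ask about, so end this iteration.
  if (!uploadOk) {
    sleep(1);
    return;
  }

  const documentId = uploadRes.json('document_id');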
  
  // Test question asking
  const askRes = http.post(
    `${BASE_URL}/api/v1/ask`,
    JSON.stringify({
      question: 'What is the main topic?',
      document_id: documentId,
    }),
    {
      headers: { 'Content-Type': 'application/json' },
    }
  );
  
  check(askRes, {
    'ask status 200': (r) => r.status === 200,
    'ask response time < 5s': (r) => r.timings.duration < 5000,
  });
  
  sleep(1);
}

Run the load test with:

k6 run k6/load_test.js --out influxdb=http://localhost:8086/k6
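
The stages in the script above model expected load. To look for the breaking point, a stress test can reuse the same stage format but keep raising the target in steps. The following is a minimal sketch in plain JavaScript; the helper name `buildStressStages`, its parameters, and the user counts are all hypothetical, and only the `{ duration, target }` shape is taken from the script above.

```javascript
// Hypothetical helper: builds k6-style stages that raise load step by step.
// Each step ramps to a new target, then holds there so metrics can settle.
function buildStressStages(startUsers, stepUsers, maxUsers, rampDuration, holdDuration) {
  if (stepUsers <= 0) {
    throw new Error('stepUsers must be positive');
  }
  const stages = [];
  for (let target = startUsers; target <= maxUsers; target += stepUsers) {
    stages.push({ duration: rampDuration, target: target }); // ramp to this level
    stages.push({ duration: holdDuration, target: target }); // hold at this level
  }
  stages.push({ duration: rampDuration, target: 0 }); // ramp down at the end
  return stages;
}

// Assumed values: steps of 50 users from 50 up to 200, past the 50-user plateau.
const stressStages = buildStressStages(50, 50, 200, '2m', '5m');
console.log(JSON.stringify(stressStages, null, 2));
```

In a k6 script the array would be assigned with `export const options = { stages: stressStages };`. The step at which the thresholds first fail gives a rough estimate of the breaking point.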

Key metrics to collect include request duration percentiles (p50, p95, p99), error rate, throughput (requests/second), and resource utilization on each service. Memory that keeps growing under steady load suggests a leak. CPU saturation during inference is expected.
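
To make these metrics concrete, the sketch below computes them from raw request records in plain JavaScript. It uses the nearest-rank percentile method, which is one common definition and not necessarily the one k6 applies internally. The record shape `{ durationMs, failed }`, the function names, and the sample data are all hypothetical.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) {
    return NaN;
  }
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// records: [{ durationMs: number, failed: boolean }], a hypothetical shape.
// windowSeconds: length of the measurement window the records come from.
function summarize(records, windowSeconds) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const failures = records.filter((r) => r.failed).length;
  return {
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
    errorRate: records.length === 0 ? 0 : failures / records.length,
    throughput: records.length / windowSeconds, // requests per second
  };
}

// Made-up data for illustration: 100 requests from 10 ms to 1000 ms, 2 failed.
const records = [];
for (let i = 1; i <= 100; i++) {
  records.push({ durationMs: i * 10, failed: i > 98 });
}
const summary = summarize(records, 20);
console.log(summary);
```

Reporting p95 and p99 alongside p50 matters because a healthy median can hide a slow tail, and the tail is what the `p(95)<2000` threshold in the script is checking.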

A common performance failure is connection pool exhaustion, where every database connection is in use and new requests must wait for one to be released. The symptom is request timeouts accompanied by errors about timing out while waiting for a connection. Fix by increasing the pool size or adding connection pooling middleware.
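
When deciding how far to raise the pool size, Little's law gives a rough starting point: the average number of connections in use equals the request rate multiplied by the average time each request holds a connection. The sketch below applies that formula with a headroom factor. `estimatePoolSize`, its parameters, and the workload numbers are hypothetical and are not measurements from this system.

```javascript
// Little's law: average connections in use = arrival rate * average hold time.
// headroom > 1 leaves slack for bursts; the 1.5 default is an arbitrary assumption.
function estimatePoolSize(requestsPerSecond, avgHoldSeconds, headroom = 1.5) {
  if (requestsPerSecond < 0 || avgHoldSeconds < 0 || headroom < 1) {
    throw new Error('inputs must be non-negative and headroom at least 1');
  }
  const avgInUse = requestsPerSecond * avgHoldSeconds;
  return Math.max(1, Math.ceil(avgInUse * headroom));
}

// Assumed workload: 40 requests/second, each holding a connection for 0.25 s.
const poolSize = estimatePoolSize(40, 0.25);
console.log(poolSize);
```

The result is an average-case estimate only. It should be checked against the database's own connection limit and validated by rerunning the load test.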

For model serving, the bottleneck is usually GPU memory. Watch memory usage with nvidia-smi while the load test runs. If memory approaches its limit during concurrent requests, requests queue or fail. Options include smaller batch sizes, a reduced context window, or more GPU memory.
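
One way to turn the nvidia-smi readings into a check is to query memory in CSV form with `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, which prints one "used, total" line per GPU in MiB, and then flag GPUs close to their limit. The sketch below parses that output in plain JavaScript. The function names, the 0.9 cutoff, and the sample text are assumptions for illustration.

```javascript
// Parses output of:
//   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
// which prints one "used, total" line per GPU, in MiB.
function parseGpuMemory(csvText) {
  return csvText
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, index) => {
      const [usedMiB, totalMiB] = line.split(',').map((field) => Number(field.trim()));
      return { gpu: index, usedMiB, totalMiB, usedFraction: usedMiB / totalMiB };
    });
}

// Returns the GPUs whose memory use is at or above the given fraction.
// The 0.9 default is an arbitrary assumption, not a recommended limit.
function gpusNearLimit(csvText, limitFraction = 0.9) {
  return parseGpuMemory(csvText).filter((g) => g.usedFraction >= limitFraction);
}

// Illustrative input only, not a real measurement.
const sampleOutput = '23000, 24576\n8000, 24576\n';
console.log(gpusNearLimit(sampleOutput));
```

Sampling the command at a fixed interval during the k6 run and comparing the readings against the stage timeline shows at which concurrency level memory becomes the constraint.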