RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Capstone: Full-Stack AI App

08. Performance Testing

Chapter 8 of 18 · 15 min
KEY INSIGHT

Performance testing must run continuously in CI so that regressions are caught before they reach production.

Performance testing identifies bottlenecks before production traffic reveals them. Load testing verifies the system handles expected concurrent users. Stress testing finds the breaking point. Both require realistic test scenarios and careful metric collection.

The k6 load testing tool provides JavaScript scripting for complex scenarios:

// k6/load_test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 10 },   // Ramp up to 10 users
    { duration: '5m', target: 10 },   // Stay at 10 users
    { duration: '2m', target: 50 },   // Ramp up to 50 users
    { duration: '5m', target: 50 },   // Stay at 50 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<2000'],  // 95% under 2 seconds
    http_req_failed: ['rate<0.01'],     // Less than 1% failure rate
  },
};

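const BASE_URL = __ENV.BASE_URL || 'http://localhost:8000';

// open() only works in the init context (top level), not inside the default
// function, so the fixture is loaded once here.
const pdfData = open('./test.pdf', 'b');

export default function () {
  // Test document upload. An object body containing http.file() is sent as
  // multipart/form-data.
  const uploadRes = http.post(`${BASE_URL}/api/v1/upload`, {
    file: http.file(pdfData, 'test.pdf', 'application/pdf'),
  });

  const uploaded = check(uploadRes, {
    'upload status 200': (r) => r.status === 200,
    // Guard on status first: r.json() throws if the body is not valid JSON.
    'upload has document_id': (r) => r.status === 200 && r.json('document_id') !== undefined,
  });

  // Without a document there is nothing to ask about, so end this iteration.
  if (!uploaded) {
    sleep(1);
    return;
  }

  const documentId = uploadRes.json('document_id');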
  
  // Test question asking
  const askRes = http.post(
    `${BASE_URL}/api/v1/ask`,
    JSON.stringify({
      question: 'What is the main topic?',
      document_id: documentId,
    }),
    {
      headers: { 'Content-Type': 'application/json' },
    }
  );
  
  check(askRes, {
    'ask status 200': (r) => r.status === 200,
    'ask response time < 5s': (r) => r.timings.duration < 5000,
  });
  
  sleep(1);
}

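Run the load test with the command below. When a threshold fails, k6 exits with a non-zero status, which is what lets a CI job fail on a regression: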

k6 run k6/load_test.js --out influxdb=http://localhost:8086/k6
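
The script above covers the load-testing case. For the stress-testing case, one possible approach is to raise the virtual-user target in steps and watch for the step at which the thresholds start failing. The helper below is a minimal sketch of that idea. `stressStages` is a hypothetical name, and the step size, step count, and durations are illustrative assumptions rather than recommended values.

```javascript
// Hypothetical helper: builds a k6 `stages` array that raises load in steps.
// Each step ramps to a higher VU target, then holds there so metrics can settle.
function stressStages(stepVUs, steps, rampDuration, holdDuration) {
  const stages = [];
  for (let i = 1; i <= steps; i++) {
    const target = stepVUs * i;
    stages.push({ duration: rampDuration, target }); // ramp to this step
    stages.push({ duration: holdDuration, target }); // hold at this step
  }
  stages.push({ duration: rampDuration, target: 0 }); // ramp down at the end
  return stages;
}

// Illustrative values: steps of 50, 100, 150, 200 VUs, 1m ramps, 3m holds.
const stages = stressStages(50, 4, '1m', '3m');
console.log(stages);
```

In a k6 script the function would sit in the init context and feed the options object, for example `export const options = { stages: stressStages(50, 4, '1m', '3m') };`. The step at which error rate or p95 latency degrades sharply is a rough estimate of the breaking point.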

Key metrics to collect include request duration percentiles (p50, p95, p99), error rate, throughput (requests per second), and resource utilization on each service. Memory that keeps growing under steady load usually indicates a leak. CPU saturation during inference is expected and is not by itself a defect.
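
k6 reports the percentiles itself, so the following is only a sketch of what those numbers mean. It uses the nearest-rank method, which is an assumption made for simplicity; k6's own calculation may interpolate differently. The sample durations are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Collects the three percentiles named in the text.
function summarize(durationsMs) {
  return {
    p50: percentile(durationsMs, 50),
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}

// Made-up request durations in milliseconds, for illustration only.
const durations = [120, 135, 140, 150, 160, 180, 210, 250, 900, 4200];
console.log(summarize(durations));
```

With these ten samples the median is 160 ms while p95 and p99 are both 4200 ms. A single slow request dominates the tail, which is why the thresholds in the script target p95 rather than an average.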

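A common performance failure is connection pool exhaustion: every database connection is in use, so new requests wait for one until they time out. The symptom is request timeouts accompanied by error messages about failing to acquire a connection. Fix by increasing the pool size, within the database's own connection limit, or by adding connection pooling middleware.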
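
When choosing a pool size, Little's Law gives a rough starting point: the average number of connections in use equals the query arrival rate multiplied by the average time each query holds a connection. The sketch below applies that formula. The function names, the 1.5 headroom factor, and the example numbers are all assumptions for illustration, and the result should be checked against measurements from the load test.

```javascript
// Little's Law: average items in a system = arrival rate x average time in system.
// Applied here: connections in use ~= queries per second x seconds each query
// holds a connection.
function estimateConnectionsInUse(queriesPerSecond, avgHoldMs) {
  return (queriesPerSecond * avgHoldMs) / 1000;
}

// Hypothetical sizing helper: adds headroom for bursts and rounds up.
function suggestPoolSize(queriesPerSecond, avgHoldMs, headroom = 1.5) {
  return Math.ceil(estimateConnectionsInUse(queriesPerSecond, avgHoldMs) * headroom);
}

// Illustrative numbers: 200 queries/s, each holding a connection for 40 ms.
console.log(estimateConnectionsInUse(200, 40)); // 8
console.log(suggestPoolSize(200, 40));          // 12
```

If the observed hold time rises under load, for example because queries slow down, the required pool size rises with it, so a larger pool can mask a slow query rather than fix it.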

For model serving, the bottleneck is usually GPU memory. Watch memory usage with nvidia-smi while the test runs. If usage approaches the card's limit under concurrent requests, new requests queue or fail. Options include smaller batch sizes, a reduced context window, or more GPU memory.
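
One way to watch GPU memory during a test run is to poll nvidia-smi's CSV query output from a small Node.js script. The sketch below assumes Node.js and an NVIDIA driver are installed on the serving host. The function names, the 0.9 warning threshold, and the one-second interval are illustrative choices.

```javascript
const { execFile } = require('child_process');

// Parses output of:
//   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
// which prints one line per GPU, such as "10240, 24576" (values in MiB).
function parseGpuMemory(csvText) {
  return csvText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line, index) => {
      const [usedMiB, totalMiB] = line.split(',').map((v) => Number(v.trim()));
      return { gpu: index, usedMiB, totalMiB, usedFraction: usedMiB / totalMiB };
    });
}

// Polls nvidia-smi and warns when any GPU passes the given fraction of memory.
// Returns the interval handle so the caller can stop it with clearInterval().
function watchGpuMemory(warnFraction = 0.9, intervalMs = 1000) {
  return setInterval(() => {
    execFile(
      'nvidia-smi',
      ['--query-gpu=memory.used,memory.total', '--format=csv,noheader,nounits'],
      (err, stdout) => {
        if (err) {
          console.error('nvidia-smi failed:', err.message);
          return;
        }
        for (const g of parseGpuMemory(stdout)) {
          if (g.usedFraction >= warnFraction) {
            console.warn(`GPU ${g.gpu}: ${g.usedMiB}/${g.totalMiB} MiB in use`);
          }
        }
      }
    );
  }, intervalMs);
}
```

Calling `watchGpuMemory()` in a separate terminal while k6 runs makes it easier to line up memory warnings with the stage at which request failures begin.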

EXERCISE

Create a k6 load test that simulates 20 concurrent users over 10 minutes. Identify the bottleneck and document the limiting resource.
