Documentation

Documentation serves multiple audiences: developers working on the codebase, operators deploying the system, and users learning its features. Each audience needs different content at a different level of detail.

The README.md provides the starting point. It should cover what the project does, how to set it up locally, how to run tests, and where to find more documentation:

````markdown
# AI Document Q&A

A full-stack application for querying documents using a local LLM.

## Quick Start

```bash
# Clone and configure
git clone https://github.com/example/ai-doc-qa.git
cd ai-doc-qa
cp .env.example .env

# Start services
docker compose up -d

# Run tests
docker compose exec backend pytest
```

## Architecture

[Architecture diagram]

## Requirements

- Docker 24+
- 16GB RAM minimum (32GB recommended)
- NVIDIA GPU with 8GB+ VRAM (optional, for GPU inference)
````

Developer documentation covers the codebase structure, development workflow, and testing approach. It should help a new developer become productive quickly:

````markdown
# Developer Guide

## Project Structure

```
ai-doc-qa/
├── backend/              # FastAPI backend service
│   ├── app/
│   │   ├── api/          # Route handlers
│   │   ├── core/         # Configuration and utilities
│   │   ├── models/       # Database models
│   │   └── services/     # Business logic
│   ├── tests/
│   └── requirements.txt
├── frontend/             # React frontend
│   ├── src/
│   │   ├── features/     # Feature modules
│   │   └── shared/       # Shared components
│   └── package.json
├── model-server/         # Model serving
└── docs/                 # Documentation
```

## Development Workflow

1. Create a branch for your feature
2. Write tests first
3. Implement the feature
4. Run the full test suite
5. Submit a pull request

## Testing

- Unit tests: `pytest backend/tests/unit`
- Integration tests: `pytest backend/tests/integration`
- E2E tests: `playwright test`
````

Architecture Decision Records (ADRs) document significant decisions. They capture context, options considered, and rationale:

```markdown
# ADR-004: Model Serving Approach

## Status
Accepted

## Context
We need to serve a 7B parameter LLM for inference. The serving approach affects latency, throughput, resource usage, and operational complexity.

## Options Considered

### llama.cpp
- CPU or GPU execution
- Excellent quantization support
- Lower throughput for concurrent users
- Simple deployment

### vLLM
- GPU required
- Higher throughput via PagedAttention
- OpenAI-compatible API
- More complex setup

## Decision
We will support both options. Use llama.cpp for CPU deployments and small-scale usage. Use vLLM for GPU deployments with high concurrency. The backend abstracts the serving layer behind a common interface.

## Consequences
- Two separate Docker images for model serving
- Configuration option to select serving layer
- Performance testing required to determine scaling thresholds
```
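
The decision in ADR-004 implies that the backend codes against one interface and picks the concrete serving layer from configuration. The following is a minimal sketch of how that could look, not the project's actual code. The class names (`ModelClient`, `LlamaCppClient`, `VllmClient`), the `create_model_client` factory, the `MODEL_BACKEND` and `MODEL_SERVER_URL` settings, and the default URL are all hypothetical. The HTTP calls to each server are deliberately left unimplemented.

```python
import os
from abc import ABC, abstractmethod
from typing import Mapping, Optional


class ModelClient(ABC):
    """Common interface the backend depends on (hypothetical name)."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        """Return the model's completion for the given prompt."""


class LlamaCppClient(ModelClient):
    """Placeholder for the llama.cpp serving layer."""

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # A real implementation would send a request to the llama.cpp server.
        raise NotImplementedError("llama.cpp request not implemented in this sketch")


class VllmClient(ModelClient):
    """Placeholder for the vLLM serving layer."""

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # A real implementation would send a request to the vLLM server.
        raise NotImplementedError("vLLM request not implemented in this sketch")


# Maps the (assumed) configuration value to a client class.
_BACKENDS = {"llamacpp": LlamaCppClient, "vllm": VllmClient}


def create_model_client(env: Optional[Mapping[str, str]] = None) -> ModelClient:
    """Build the client selected by the MODEL_BACKEND setting."""
    env = os.environ if env is None else env
    name = env.get("MODEL_BACKEND", "llamacpp")
    if name not in _BACKENDS:
        raise ValueError(
            f"Unknown MODEL_BACKEND {name!r}; expected one of {sorted(_BACKENDS)}"
        )
    base_url = env.get("MODEL_SERVER_URL", "http://localhost:8000")
    return _BACKENDS[name](base_url)
```

Route handlers and services would call `create_model_client()` once at startup and use only `generate`, so switching between llama.cpp and vLLM is a configuration change rather than a code change.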