Qwen 3 32B on NVIDIA GeForce RTX 4090
Measured this month.
| Measurement | Value |
| --- | --- |
| tok/s | 36.5 |
| TTFT | 158 ms |
| VRAM used | 22.1 GB |
| RAM used | 5.2 GB |
| Power | 378 W |
| Quant | AWQ-INT4 |
| Context | 32K |
| Run date | 2026-05-03 |
| Source | community |
Qwen 3 32B AWQ-INT4 is slightly heavier than Qwen 2.5 Coder 32B (the newer architecture has slightly larger embedding tables). Same vLLM 0.17.1 settings. The decode-speed delta vs Qwen 2.5 Coder is ~5%, well within run-to-run variance. The reasoning-mode toggle (`<think>` blocks) costs an additional 200-400 tokens per query, and that is the operationally significant detail: the wall-clock cost of reasoning mode dominates the small per-token throughput delta.
Why this confidence tier?
Confidence is rule-based. Every factor below contributed to the tier. We never expose a single numeric score; the tier label is auditable through this explanation alone.
- +Source: community submission
- Reproduce this benchmark → An independent reproduction with matching numbers lifts the tier and reduces single-source risk.
- Read the confidence methodology → Full editorial standards for tiering.
- Why we don't use percentages → Tier labels are auditable; a single opaque score is not.
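The rule-based tiering described above can be sketched as a small decision function. This is an illustrative reconstruction under stated assumptions, not RunLocalAI's actual implementation: the tier names and the `source`/`reproductions` inputs are hypothetical.

```python
def confidence_tier(source: str, reproductions: int) -> str:
    """Map auditable factors to a tier label; no numeric score is exposed."""
    # Factor: who submitted the measurement.
    if source == "editorial":
        tier = "verified"
    elif source == "community":
        tier = "community"  # single-source community submission
    else:
        tier = "unverified"
    # Factor: an independent reproduction with matching numbers lifts the tier.
    if tier == "community" and reproductions >= 1:
        tier = "reproduced"
    return tier

print(confidence_tier("community", 0))  # community
print(confidence_tier("community", 1))  # reproduced
```

Because every branch maps directly to a named factor, the tier label stays auditable from the factor list alone, which is the point of avoiding an opaque score.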
Cohort intelligence
How this measurement compares to the rest of the corpus. Only comparable rows (same model + hardware first, with relaxations labelled) are used. We never average across runtimes or quant formats unless explicitly told to.
Same hardware, different model
6 matching rows: what else this rig can run at the same quant bucket.
- 38.2 tok/s · rtx-4090 · AWQ-INT4 · Editorial
- 38.2 tok/s · rtx-4090 · AWQ-INT4 · Editorial
- 32.5 tok/s · rtx-4090 · AWQ-INT4 · Editorial
- 14.8 tok/s · rtx-4090 · Q4_K_M · Editorial
- 8.0 tok/s · rtx-4090 · Q4_K_M · Editorial
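The comparability rule above (same hardware first, same quant bucket, no averaging across runtimes or quant formats) can be sketched as a simple filter. Field names and the sample rows are assumptions for illustration, not the site's actual schema.

```python
# Illustrative corpus rows; only "hw" and "quant" matter for comparability.
rows = [
    {"model": "model-a", "hw": "rtx-4090", "quant": "AWQ-INT4", "tok_s": 38.2},
    {"model": "model-b", "hw": "rtx-4090", "quant": "Q4_K_M",   "tok_s": 14.8},
    {"model": "model-c", "hw": "rtx-3090", "quant": "AWQ-INT4", "tok_s": 22.0},
]

def comparable(rows, hw: str, quant: str):
    # Match on hardware first, then restrict to the same quant bucket;
    # rows in a different quant format are never mixed in.
    return [r for r in rows if r["hw"] == hw and r["quant"] == quant]

print([r["model"] for r in comparable(rows, "rtx-4090", "AWQ-INT4")])
# ['model-a']
```

Any relaxation (e.g. dropping the quant constraint) would be a second, explicitly labelled query rather than a silent widening of this filter.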
Reproduce this benchmark
Got the same model + hardware combo? Run the same measurement and submit your numbers. We'll pre-fill model, hardware, quant, and context — you just add your tok/s, VRAM, runtime version. If your numbers match within ±15%, this benchmark gets a confidence lift and a reproduction badge.
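The ±15% match rule can be expressed as a one-line check. This is a minimal sketch assuming the tolerance applies to relative deviation from the original measurement; the exact comparison the site uses is not specified here.

```python
def matches(original: float, reproduced: float, tol: float = 0.15) -> bool:
    """True if the reproduced number is within ±tol of the original."""
    return abs(reproduced - original) / original <= tol

print(matches(36.5, 33.0))  # True: ~9.6% below the original 36.5 tok/s
print(matches(36.5, 30.0))  # False: ~17.8% deviation exceeds the 15% band
```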
Related
Drill into the entity pages for this measurement.
Cite or export
Reference this benchmark in your work or paste it into a README. Multiple formats with copy-to-clipboard; licensed CC-BY-4.0 (attribution to RunLocalAI required).
<a href="https://runlocalai.co/benchmarks/329" rel="noopener">RunLocalAI: Qwen 3 32B on NVIDIA GeForce RTX 4090 — 36.5 tok/s</a>
Next recommended step
Got the same model + hardware? Run it and submit your numbers — successful reproductions lift this benchmark's confidence tier.