RUNLOCALAI v38

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.

OP · Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

◯Community submitted
Editorial benchmark

Llama 3.3 70B Instruct on NVIDIA GeForce RTX 4090

Measured this month.

Why trust this benchmark?

Measurement

tok/s: 8.0
TTFT: 2400 ms
VRAM used: not reported
RAM used: not reported
Power: not reported
Quant: Q4_K_M
Context: 4K
Run date: 2026-05-13
Source: community
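
As a rough reading of the tok/s and TTFT figures together, the wall-clock time for one response can be estimated as TTFT plus decode time. This is a minimal sketch, assuming the tok/s figure is a steady decode rate that applies to every output token after the first-token delay. The function name and the 256-token response length are illustrative and not part of the benchmark.

```python
def estimate_response_seconds(ttft_ms: float, tok_per_s: float, output_tokens: int) -> float:
    """Estimate wall-clock seconds for one response: TTFT plus decode time.

    Assumption: tok_per_s is the steady decode rate and applies to every
    output token after the time-to-first-token delay.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_ms / 1000.0 + output_tokens / tok_per_s


# Figures from this page: 2400 ms TTFT, 8.0 tok/s.
# 256 output tokens is a hypothetical response length.
seconds = estimate_response_seconds(ttft_ms=2400, tok_per_s=8.0, output_tokens=256)
print(f"{seconds:.1f} s")
```

Under these assumptions a 256-token answer takes a little over half a minute, most of it spent decoding rather than waiting for the first token.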

V36.52 rigor detail

Protocol →
Steady-state median: 8.00 tok/s
Runs captured: 5
Scenario: Single-stream
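
One plausible way to arrive at a steady-state median from several runs is sketched below. It assumes each run reports generated tokens and decode seconds, and that the median is taken over per-run rates. The linked protocol may define "steady-state" differently (for example by discarding warm-up runs). The run data is hypothetical and is not the submitted measurement.

```python
from statistics import median


def steady_state_median(runs):
    """Median tok/s across runs, each given as (generated_tokens, decode_seconds)."""
    if not runs:
        raise ValueError("at least one run is required")
    rates = [tokens / seconds for tokens, seconds in runs]
    return median(rates)


# HYPOTHETICAL run data, for illustration only; these are not the submitted runs.
example_runs = [(256, 32.0), (256, 31.0), (256, 33.0), (256, 32.0), (256, 34.0)]
print(f"{steady_state_median(example_runs):.2f} tok/s over {len(example_runs)} runs")
```

The median is less sensitive than the mean to a single slow or fast run, which is a common reason to prefer it for small run counts like 5.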
Editorial notes

Public-source seed (V37 2026-05-13). Cross-referenced from the URL above. Tagged 'medium' confidence — we reserve 'high' for owner-run measurements.

Why this confidence tier?

Moderate confidence

Confidence is rule-based. Every factor below contributed to the tier. We never expose a single numeric score; the tier label is auditable through this explanation alone.

Factors
  • + Source: community submission
How to improve this benchmark's confidence
  • Reproduce this benchmark → An independent reproduction with matching numbers lifts the tier and reduces single-source risk.
  • Read the confidence methodology → Full editorial standards for tiering.
  • Why we don't use percentages → Tier labels are auditable; there is no opaque score.

Cohort intelligence

How this measurement compares to the rest of the corpus. Only comparable rows (same model + hardware first, with relaxations labelled) are used. We never average across runtimes or quant formats unless explicitly told to.

Insufficient comparison data: the cohort has 0 comparable measurements. Outlier detection requires ≥5.
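
The cohort-size gate can be sketched as follows. The minimum of 5 comes from this page. The outlier rule itself is not stated here, so the modified z-score based on median absolute deviation (MAD), the 3.5 threshold, and the function name are all assumptions chosen for illustration.

```python
from statistics import median

MIN_COHORT = 5  # from this page: outlier detection requires >= 5 comparable rows


def flag_outliers(values, threshold=3.5):
    """Return one outlier flag per value, or None if the cohort is too small.

    Assumption: modified z-score based on median absolute deviation (MAD);
    the site's actual outlier rule is not stated on this page.
    """
    if len(values) < MIN_COHORT:
        return None  # insufficient cohort, no verdict
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread to measure against
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


print(flag_outliers([]))  # this page's cohort: 0 comparable measurements
```

Returning None rather than an empty list keeps "no verdict possible" distinct from "checked, nothing flagged".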

Same model + hardware, different runtime

1 matching row

Variance here is pure runtime / version drift. Wide spread suggests a runtime regression candidate worth investigating.

Median tok/s: 14.8
Spread: 14.8 – 14.8
  • 14.8 tok/s · rtx-4090 · Q4_K_M · ✓ Editorial

Same model, different hardware

1 matching row

What this model looks like on adjacent hardware. Drives the 'should I upgrade?' question.

Median tok/s: 12.0
Spread: 12.0 – 12.0
  • 12.0 tok/s · apple-m3-ultra · Q4_K_M · ✓ Editorial

Same hardware, different model

5 matching rows

What else this rig can run at the same quant bucket.

Median tok/s: 38.2
Spread: 32.5 – 150.0
CoV: 77%
  • 32.5 tok/s · rtx-4090 · AWQ-INT4 · ✓ Editorial
  • 36.5 tok/s · rtx-4090 · AWQ-INT4 · ✓ Editorial
  • 38.2 tok/s · rtx-4090 · AWQ-INT4 · ✓ Editorial
  • 38.2 tok/s · rtx-4090 · AWQ-INT4 · ✓ Editorial
  • 150.0 tok/s · rtx-4090 · Q4_K_M · ✓ Editorial
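
The summary figures for this cohort can be recomputed from the five rows above. The sketch assumes CoV means population standard deviation divided by the mean; the page does not state which form it uses, but that form reproduces the 77% shown, whereas the sample (n-1) form gives about 86%.

```python
from statistics import fmean, median, pstdev

# tok/s values from the five "same hardware, different model" rows above.
rows = [32.5, 36.5, 38.2, 38.2, 150.0]

med = median(rows)
spread = (min(rows), max(rows))
# Assumption: CoV = population standard deviation / mean.
cov = pstdev(rows) / fmean(rows)

print(f"median={med} spread={spread[0]}-{spread[1]} CoV={cov:.0%}")
```

The single 150.0 tok/s row accounts for most of the variation, which is why the median (38.2) sits well below the mean (about 59.1).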

Reproduce this benchmark

Got the same model + hardware combo? Run the same measurement and submit your numbers. We'll pre-fill model, hardware, quant, and context; you just add your tok/s, VRAM, and runtime version. If your numbers match within ±15%, this benchmark gets a confidence lift and a reproduction badge.
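
Before submitting, a reproduction can be checked against the ±15% window with a few lines. The sketch assumes the tolerance is relative to the published figure; how the site treats values exactly on the boundary is not stated, and the function name is illustrative.

```python
def within_tolerance(published: float, reproduced: float, tolerance: float = 0.15) -> bool:
    """True if reproduced is within +/- tolerance of published (relative)."""
    if published <= 0:
        raise ValueError("published must be positive")
    return abs(reproduced - published) / published <= tolerance


PUBLISHED_TOK_S = 8.0  # this benchmark's headline figure

# 7.5 and 9.0 are hypothetical results; 14.8 is the other-runtime row on this page.
for candidate in (7.5, 9.0, 14.8):
    print(candidate, within_tolerance(PUBLISHED_TOK_S, candidate))
```

For the published 8.0 tok/s the window is roughly 6.8 to 9.2 tok/s, so the 14.8 tok/s result from a different runtime would not count as a matching reproduction.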

Reproduce this benchmark →

Related

Drill into the entity pages for this measurement.

Llama 3.3 70B Instruct model page
NVIDIA GeForce RTX 4090 hardware page
All measurements for this exact pair
Try NVIDIA GeForce RTX 4090 in the build engine

Cite or export

Cite this benchmark in your work or paste it into a README. Multiple formats are available with copy-to-clipboard; the license is CC-BY-4.0 (attribution to RunLocalAI required).

OG card (PNG): 1200x630, social-preview ready
Download SVG: vector card, scales cleanly
Embed this benchmark
Paste into a Reddit thread, blog post, or README — attribution baked in.
<a href="https://runlocalai.co/benchmarks/344" rel="noopener">RunLocalAI: Llama 3.3 70B Instruct on NVIDIA GeForce RTX 4090 — 8.0 tok/s</a>

Direct download: .json · .md · .bib · .svg

Next recommended step

Got the same model + hardware? Run it and submit your numbers — successful reproductions lift this benchmark's confidence tier.

Reproduce this benchmark
Or compare other measurements for Llama 3.3 70B Instruct on NVIDIA GeForce RTX 4090
See the benchmark roadmap
Help keep this page accurate

We read every submission. Editorial review takes 1-7 days.

Submit a benchmark