
iPhone on-device AI stack — Llama 3.2 3B / Phi-3.5 Mini via MLX Swift

App-bundled local LLM inference on iPhone 15 Pro / 16 Pro using MLX Swift + a 3B-class quantized model. The mobile-AI stack you ship in production iOS apps — battery-aware, thermal-aware, App Store reviewable. No fake numbers; honest about the throttle curve.

By Fredoline Eruo · Last reviewed 2026-05-07

What this stack accomplishes

This is the iOS-app-bundled local LLM inference stack for production deployment in 2026. Apple Intelligence reshaped the conversation in 2024-2025; by 2026, shipping a 3B-class on-device model in your iOS app is operationally viable for summarization, classification, voice transcription post-processing, and offline-first features.

The honest framing of what this is and isn't:

  • Is: a production-grade path for a 3B-class model running on-device, app-bundled, no network calls, App Store reviewable.
  • Is not: a replacement for cloud LLMs. iPhone tok/s lags desktop GPU tok/s by 4-10×. Sustained workloads thermal-throttle.
  • Is not: a 7B-class deployment. iPhone RAM (8 GB) bottlenecks anything past 4B.

Hardware required

iPhone 15 Pro or newer (A17 Pro or later Neural Engine) · iPad Pro M4 (38 TOPS NPU + 120 GB/s memory bandwidth) for tablet tier · iOS 17.4+ deployment target for MLX Swift · Mac with Xcode 15.4+ for the build/sign toolchain · ~5 GB of Mac storage for the model + Xcode caches

Components — what to install and why

The stack
  1. apple-a18-pro · Hardware · Target SoC (iPhone 16 Pro)

    A18 Pro: 38 TOPS Neural Engine + 8 GB RAM. The 8 GB floor is what makes 3B-class models viable on-device — the A17 Pro, also at 8 GB, works too, but with tighter KV-cache headroom.

  2. apple-m4-ipad · Hardware · Tablet-tier alternative

    iPad Pro M4 has 120 GB/s memory bandwidth (vs ~60 GB/s on phones) — sustained-load throughput is meaningfully higher. The right target if your app is iPad-first or supports both form factors.

  3. mlx-swift · Tool · On-device runtime (Apple first-party Swift API)

    MLX Swift is Apple's first-party path. Same model checkpoints as desktop MLX-LM (write once, run on Mac + iPhone + iPad). Active Apple maintenance — updated alongside iOS releases. iOS-only is the catch.

  4. llama-3.2-3b-instruct · Model · Primary 3B chat model

    3B at INT4 quantization (~1.9 GB on disk: ~3.2B params × 4 bits ≈ 1.6 GB, plus quantization scales and the embedding table) fits comfortably in 8 GB of iPhone RAM with a 4K context. The Llama Community License permits app-bundling. Apple's MLX Swift example apps demonstrate this exact configuration.

  5. phi-3.5-mini-instruct · Model · Alternative 3.8B model with stronger instruction-following

    Phi-3.5 Mini is 3.8B — slightly heavier than Llama 3.2 3B, but with better instruction-following polish. MIT licensed. Pick it when prompt adherence matters more than raw throughput.

  6. qwen-2.5-3b-instruct · Model · Multilingual 3B alternative

    Qwen 2.5 3B at INT4 is the multilingual choice. Note the Qwen License for the 3B size class (not Apache 2.0). Similar memory footprint to Llama 3.2 3B.

Step-by-step setup (Swift Package + model checkpoint)

1. Add MLX Swift to your Xcode project

// swift-tools-version: 5.9
// Package.swift
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.18.0"),
        .package(
            url: "https://github.com/ml-explore/mlx-swift-examples",
            branch: "main"
        )
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                .product(name: "MLX", package: "mlx-swift"),
                .product(name: "MLXLLM", package: "mlx-swift-examples"),
            ]
        )
    ]
)
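
If your app is a plain Xcode project rather than an SPM package, the equivalent is File → Add Package Dependencies… in Xcode: add both repository URLs above and attach the MLX and MLXLLM products to your app target.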

2. Bundle a quantized model with the app

# On your Mac (model conversion)
pip install mlx-lm
mlx_lm.convert \
    --hf-path meta-llama/Llama-3.2-3B-Instruct \
    --quantize \
    --q-bits 4 \
    --mlx-path ./Llama-3.2-3B-Instruct-mlx-int4

# Output: ~1.9 GB. Add to Xcode project as a folder reference
# under YourApp/Resources/Models/. App Store binary cap is 4 GB, so
# 3B-INT4 fits easily; 7B-INT4 would not.

3. Load + run inference (Swift)

import MLX
import MLXLLM

// Locate the bundled model folder (added as a folder reference in Xcode).
let modelURL = Bundle.main.url(
    forResource: "Llama-3.2-3B-Instruct-mlx-int4",
    withExtension: nil
)!

// Load weights + tokenizer into a container that serializes model access.
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(directory: modelURL)
)

let result = try await modelContainer.perform { context in
    // Tokenize the prompt and apply the model's chat template.
    let input = try await context.processor.prepare(
        input: .init(prompt: "Summarize this in one sentence: ...")
    )
    return try generate(
        input: input,
        parameters: .init(maxTokens: 200, temperature: 0.4),
        context: context
    )
}

print(result.output)
print("Tokens/sec: \(result.tokensPerSecond)")

4. Pre-warm at app launch (avoid first-token cliff)

// In SceneDelegate or App.init:
Task.detached(priority: .userInitiated) {
    try? await modelContainer.perform { context in
        // Warm-up generation of 1 token to page weights into memory
        // and compile the Metal kernels MLX runs on.
        let warmup = try await context.processor.prepare(
            input: .init(prompt: " ")
        )
        _ = try generate(
            input: warmup,
            parameters: .init(maxTokens: 1),
            context: context
        )
    }
}
// First-real-query latency drops from 2-3s cold to <500ms warm.

Thermal + battery reality check

Mobile NPU + GPU inference is thermally bounded, not compute-bounded. The first 2-3 minutes of inference run at peak tok/s; past 5-10 minutes the device throttles 25-50% under sustained load. Plan your UX around this (a thermal-state hook is sketched after the list):

  • Bursty UX wins. 30-second summarization of an article: fast and snappy.
  • Continuous chat falls off. A 20-minute conversational session will visibly slow.
  • Background continuity — iOS aggressively suspends apps. Use a BGProcessingTask (BackgroundTasks framework) for long-running summaries; expect interruptions.
  • Battery: ~3-7% per 10-min active inference session on iPhone 16 Pro (editorial estimate). Measure on your workload.
  • Charging mitigates thermal throttling but adds heat in the other direction. Test the user experience while plugged in vs unplugged.
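
A minimal sketch of that thermal adaptation, using the public ProcessInfo.ThermalState API. The per-state token budgets are illustrative assumptions — calibrate them against your own tok/s instrumentation:

import Foundation

final class ThermalGovernor {
    // Generation budget your UI consults before each request.
    private(set) var maxTokensBudget = 200

    init() {
        // Re-evaluate whenever iOS reports a thermal-state transition.
        NotificationCenter.default.addObserver(
            forName: ProcessInfo.thermalStateDidChangeNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            self?.update(ProcessInfo.processInfo.thermalState)
        }
        update(ProcessInfo.processInfo.thermalState)
    }

    private func update(_ state: ProcessInfo.ThermalState) {
        switch state {
        case .nominal:  maxTokensBudget = 200  // full budget
        case .fair:     maxTokensBudget = 120  // trim long generations
        case .serious:  maxTokensBudget = 60   // bursty replies only
        case .critical: maxTokensBudget = 0    // pause; show a "device warming up" hint
        @unknown default: maxTokensBudget = 60
        }
    }
}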

Expected outcome

Ship an iOS app that loads a 3B-model checkpoint at app start (~2-3 sec on iPhone 16 Pro), serves single-stream LLM inference at editorial-estimated 8-15 tok/s decode (cold), and gracefully degrades when the device thermal-throttles after 5-10 min of sustained load. Battery cost: ~3-7% per 10-min session at peak; verify on your specific device + workload before shipping.

App Store review considerations

  • App size: the 3B-INT4 model is ~1.9 GB. Apple's app size cap on initial install (4 GB) tolerates this; cellular install limits (200 MB without override) do not. Use NSBundleResourceRequest for on-demand resource download if you need cellular installs (see the sketch after this list).
  • Privacy disclosures: on-device inference is the simplest privacy story possible — disclose that AI runs on-device, no data leaves the device.
  • Battery transparency: heavy AI usage will get flagged in iOS Battery settings. Make this clear in your onboarding so users aren't surprised.
  • License compliance: bundling Llama 3.2 3B requires Llama Community License attribution in your About / Settings screen. Phi-3.5 (MIT) and Qwen 2.5 (Qwen License) have different requirements — check before submission.
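
A hedged sketch of that on-demand path, assuming the model folder is tagged "llm-model" as an On-Demand Resource in Xcode's Resource Tags pane. Note that ODR imposes its own per-asset-pack size limits, so a 1.9 GB model may need splitting across tags or self-hosting instead:

import Foundation

func fetchModelResources() async throws {
    // "llm-model" is an assumed ODR tag configured in Xcode.
    let request = NSBundleResourceRequest(tags: ["llm-model"])
    request.loadingPriority = NSBundleResourceRequestLoadingPriorityUrgent

    // Downloads (or pins) the tagged assets; throws on network/storage failure.
    try await request.beginAccessingResources()

    // Assets remain reachable via Bundle.main until endAccessingResources()
    // is called or the request deallocates — keep it alive while the model
    // is loaded.
}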

Failure modes you'll hit

  1. Cold-start latency feels broken. First model load on app launch is 2-3 seconds on iPhone 16 Pro. Without pre-warm, the first user query feels frozen. Always pre-warm at app launch.
  2. Memory pressure crashes. 8 GB of iPhone RAM is shared with iOS, your UI, and any other apps. The 3B-INT4 model + KV cache + activations consume ~3-3.5 GB of working memory; combined with iOS + your app, you can hit memory pressure on the iPhone 15 Pro (8 GB total). Test with os_proc_available_memory() instrumentation (see the guard sketched after this list).
  3. Backgrounding kills inference mid-stream. If the user backgrounds your app during a long generation, iOS suspends the process. Save partial state and resume on foreground.
  4. Thermal throttling looks like the model got dumber. Throttled tok/s drops 30-50% under sustained load. UX-wise this can feel like degraded quality; instrument tok/s and surface a "device warming up" indicator if your UX needs it.
  5. iOS 17.4+ requirement. MLX Swift requires recent iOS. Check deployment target before assuming the API is available.
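
A minimal pre-load guard built on the ~3-3.5 GB working-set estimate from item 2; os_proc_available_memory() (declared in <os/proc.h>) reports the remaining memory budget for the current process. The 3.5 GB threshold is an assumption — tune it from your own instrumentation:

import os

func canLoadModelSafely() -> Bool {
    // Assumed threshold: weights + KV cache + activation headroom.
    let requiredBytes: UInt64 = 3_500 * 1_024 * 1_024

    // Remaining per-process memory budget reported by the kernel.
    let availableBytes = UInt64(os_proc_available_memory())
    return availableBytes >= requiredBytes
}

// Call before loadContainer; when false, fall back to a smaller context
// length or prompt the user to close other apps.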

Troubleshooting

Symptom: model loads but generation is silent / hangs. Check that your model directory is added as a folder reference (blue folder icon in Xcode), not a group (yellow folder). Group references flatten the contents into the bundle root and break the MLX loader's file lookup.

Symptom: tok/s is 2-3× slower than expected. Verify the device isn't thermally throttled — plugged in and cool, or idle long enough to shed heat (no recent heavy CPU/GPU load). Throttled measurements are not representative of cold tok/s numbers.

Symptom: works on simulator, crashes on device. The simulator runs MLX on the Mac's own Apple Silicon; on-device, inference runs on the iPhone's GPU via Metal. Memory mapping and quant kernel coverage differ. Always test on a physical device before committing architecture decisions.

Variations and alternatives

Phi-3.5 Mini variant: swap the model bundle for Phi-3.5 Mini. Slightly heavier (3.8B vs 3B) but better instruction-following polish. MIT license simplifies attribution.

Multilingual variant: swap to Qwen 2.5 3B for stronger non-English support. Note Qwen License requires attribution.

iPad-first deployment: target iPad M4. 120 GB/s memory bandwidth (vs 60 on phones) sustains higher tok/s under load.

Cross-platform alternative: if you need Android too, see the companion Android on-device AI stack guide. MLX Swift is iOS-only.

Who should avoid this stack

  • Cross-platform apps — MLX Swift is iOS-only. Use MLC LLM if you need a shared toolchain.
  • 7B+ model requirements — iPhone RAM doesn't fit. Cloud or device-as-thin-client is the right answer.
  • Continuous-use workloads (live tutoring, real-time translation): thermal throttling will visibly degrade the experience.
  • Apps where cellular install is critical — a 1.9 GB model bundle won't install over cellular without an on-demand-resources rework.

Going deeper