How to troubleshoot local AI job tools
The four common breakages when running a local model for résumé and job-search work — model too big for VRAM, slow generation, repetitive output, broken file ingestion — each with the actual diagnosis and the operator-grade fix.
Answer first
Four breakages account for most of the “my local résumé tool isn't working” complaints we see: the model doesn't fit in VRAM and silently spills to CPU; generation is fine but glacial; the output is repetitive or invents facts; or AnythingLLM-style document ingestion fails to actually find anything you ask about. Each has a specific symptom and a fix that takes 5-15 minutes once you know what you're looking at.
This page walks the four in order from most common to least, with the diagnosis check that confirms it's actually that problem and not something else. The full error catalog (with exact log strings) is at /errors; the toolkit context is in /guides/local-ai-tools-for-resume-optimization.
Symptom 1 — model too big for VRAM
What it looks like. The model loads (no error), generation starts, but tokens come out at 0.5-3 tok/s instead of the 30-60 you expected. Your CPU fan is loud. Your GPU fan is quiet. The chat is technically working but feels like watching paint dry.
Why it happens. The runtime tried to load the model into VRAM, ran out, and silently moved layers to system RAM. CPU offload always works but is roughly 5-30x slower than full-GPU inference. The runtime does not error; it just slows down.
Diagnosis. While generation is running, watch nvidia-smi (Linux/Windows with NVIDIA) or Activity Monitor (macOS) and check two things: GPU memory utilization and GPU compute utilization. If memory is full and compute is near zero, you are CPU-bound. If memory has headroom and compute is high, you have a different problem (see Symptom 2).
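If you'd rather script the check than stare at the nvidia-smi table, here is a minimal sketch (assuming an NVIDIA card with nvidia-smi on PATH; it uses the standard query fields):

```python
# Poll VRAM and compute utilization while a generation is running.
import subprocess
import time

def gpu_snapshot():
    row = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    mem_used, mem_total, util = (int(x) for x in row.split(", "))
    return mem_used, mem_total, util

for _ in range(10):  # sample for about ten seconds mid-generation
    used, total, util = gpu_snapshot()
    print(f"VRAM {used}/{total} MiB, compute {util}%")
    time.sleep(1)
```

Full memory plus near-zero compute while tokens trickle out is the CPU-spill signature; memory headroom plus high compute means the bottleneck is elsewhere.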
Fix. Two options that actually work, and one that doesn't.
- Option A — drop a quantization step. If you're running Qwen 2.5 14B Q5, switch to Q4_K_M. The quality difference at the 14B class is small; the VRAM saved is real (about 1-2 GB on a 14B model). See /systems/quantization-formats for the cliff points.
- Option B — drop a model size. If 14B doesn't fit even at Q4, the right answer is 7-8B. Qwen 2.5 7B handles tailoring and cover-letter drafts well; the gap to 14B on this specific workflow is smaller than people expect.
- What does NOT work — Q2 quantization. Q2 fits a lot of model in a little memory but the quality cliff is severe. Output goes incoherent on hard prompts. Stay at Q4_K_M or above.
Confirm fit ahead of time at /will-it-run/custom; entering your VRAM and a model name returns a per-quantization verdict.
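If you want a rough offline version of that check, the arithmetic is simple. This is a back-of-envelope sketch, not the calculator itself; the bits-per-weight figures are approximations for common GGUF quantizations.

```python
# Approximate weight memory: parameters * bits-per-weight / 8, plus a
# cushion for KV cache and runtime overhead.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}  # approximate

def fits(params_billions: float, quant: str, vram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb <= vram_gb

print(fits(14, "Q4_K_M", 12))  # ~8.4 GB of weights: fits a 12 GB card
print(fits(14, "Q5_K_M", 12))  # ~10 GB of weights: fits on paper, little room left for KV cache
```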
Symptom 2 — slow generation
What it looks like. Model fits in VRAM (you confirmed via Symptom 1 diagnosis), but tok/s is still half what you expected from the hardware. A 12 GB GPU running a 7B model should do 50-90 tok/s; if you're getting 15-25, something is wrong.
Why it happens. Three common causes. First, the runtime fell back to a CPU build because GPU drivers are missing or wrong. Second, you're running with an enormous context window (32K+) and the KV cache is larger than the model. Third, you have another process holding VRAM and the runtime is fighting for it.
Diagnosis. Check three things in order. First, run nvidia-smi and confirm the runtime process is on the GPU at all. If it's not listed under processes, the driver isn't connected. Second, check what context length you set in your runtime config; if it's 32K+ on a 12 GB card, drop it to 8K. Third, kill any other GPU-using process (browsers with hardware acceleration, video editors, other LLM runtimes) and retry.
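The first of those checks is easy to script. A minimal sketch, again assuming an NVIDIA card:

```python
# List every process currently holding GPU memory. If your runtime
# (ollama, LM Studio's server, llama.cpp) is not in this list while
# generating, it is running a CPU build or the driver is not connected.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(out or "No compute processes on the GPU at all.")
```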
Fix.
- For the driver case: reinstall the latest NVIDIA driver (or AMD ROCm package on Linux), then re-pull the runtime binary. Ollama in particular ships separate CPU and CUDA builds — confirm the GPU build is the one running.
- For the context-too-large case: drop num_ctx to 8192 or 16384 (a minimal API sketch follows this list). Most résumé and cover-letter work fits comfortably in 8K; reserve long context for actual long-document tasks.
- For the VRAM-contention case: close the other GPU-using apps. On Windows, the fight is often with Chrome's GPU process. On Linux, sometimes a stale ComfyUI or webui server is still resident.
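Here is that context-size fix against a local Ollama server on its default port (a sketch: the model tag and prompt are placeholders, and other runtimes expose the same knob under a different name):

```python
# Request a generation with an 8K context window instead of the default.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b",          # whatever model you pulled
        "prompt": "Summarize this job description: ...",
        "stream": False,
        "options": {"num_ctx": 8192},   # 8K is plenty for résumé work
    },
    timeout=300,
)
print(resp.json()["response"])
```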
Symptom 3 — repetitive or hallucinated output
What it looks like. The model produces a paragraph that sounds fine for the first sentence, then repeats the same phrase three times, or invents a job experience you never had, or hallucinates a metric you can't back up.
Why it happens. Two distinct causes that need different fixes. Repetition is usually a sampling-temperature issue; the model is being too greedy and gets stuck in a loop. Hallucinated content is usually a prompt issue — the model is filling in plausible-sounding gaps because the prompt didn't pin it to verifiable inputs.
Diagnosis. If the failure is repetition (literally the same phrase 2-3 times), it's sampling. If the failure is plausible-but-wrong content (a degree you don't have, a metric that's not in your master CV), it's prompt design.
Fix for repetition. Set temperature: 0.7, repeat_penalty: 1.1, and top_p: 0.9 in your runtime config or in the LM Studio sliders. This is a stable default that works across most 7B-14B models for résumé work.
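Expressed as request options against a local Ollama server (a sketch with the same shape as the num_ctx example above; in LM Studio, set the equivalent sliders instead):

```python
# Stable sampling defaults for résumé and cover-letter work.
import requests

options = {
    "temperature": 0.7,      # enough randomness to break greedy loops
    "repeat_penalty": 1.1,   # discourages verbatim phrase repetition
    "top_p": 0.9,            # trims the low-probability tail
}
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:7b",
        "messages": [{"role": "user", "content": "Rewrite this bullet: ..."}],
        "stream": False,
        "options": options,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```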
Fix for hallucination. The prompt has to constrain the model to your actual data. Bad: “Write a cover letter for a senior backend engineer role.” Good: “Here is the JD [paste]. Here is my master résumé [paste]. Write a three-paragraph cover letter using only experiences and metrics that appear in the résumé.” The explicit “only experiences and metrics that appear in the résumé” is doing a lot of work. The two-pass rule then catches anything that slipped through — you read the output and verify every factual claim before sending.
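One way to bake that constraint into a reusable prompt, sketched as a small helper (the wording is illustrative, not a canonical template):

```python
# Build a cover-letter prompt pinned to the JD and the master résumé.
def cover_letter_prompt(jd_text: str, master_resume: str) -> str:
    return (
        "Here is the job description:\n"
        f"{jd_text}\n\n"
        "Here is my master résumé:\n"
        f"{master_resume}\n\n"
        "Write a three-paragraph cover letter using ONLY experiences and "
        "metrics that appear in the résumé above. If a claim is not in the "
        "résumé, leave it out rather than inventing it."
    )
```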
Symptom 4 — file ingestion broken (RAG)
What it looks like. You uploaded a JD or your master résumé into AnythingLLM (or another local RAG tool), asked the model a question about it, and the answer either ignores the file entirely or returns “I don't have information about that.”
Why it happens. Three concrete root causes. The PDF was a scanned image so the embedder got nothing. The chunk size is too large for the embedder's context window. The vector retrieval is returning fewer results than the model needs (default is often k=3, which misses long documents).
Diagnosis. Open the file in a text editor or PDF viewer. If you can't copy text out of the PDF, it's a scan and you need to OCR it first. If text is fine, check the chunk size in your RAG config — most embedders work best with 256-512 token chunks, not the 2K-4K some defaults use.
Fix.
- For scanned PDFs: OCR the file first. Tesseract on Linux/macOS, or run the file through a free OCR tool, save it as a searchable PDF, then re-ingest. Most résumés that come back “empty” from RAG are scanned PDFs.
- For chunk-size issues: set chunk size to 384 tokens with 64-token overlap, then re-index the workspace (a minimal chunking sketch follows this list).
- For retrieval-too-narrow: bump top_k from 3 to 6 or 8. This lets more chunks make it into the context the model sees. AnythingLLM exposes this in the workspace settings.
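For the chunk-size fix, here is what 384-token chunks with 64-token overlap look like in miniature. This sketch counts words rather than tokens (real RAG tools count with the embedder's tokenizer), which is close enough to see the effect:

```python
# Split a document into overlapping chunks so each one fits the embedder.
def chunk_text(text: str, chunk_size: int = 384, overlap: int = 64):
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

A 2,000-word master résumé comes out as roughly seven overlapping chunks, each small enough to embed cleanly, instead of one oversized blob the retriever can't match against.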
When to reset and start over
Sometimes the cumulative state of a runtime — half-applied driver updates, partial model downloads, a stale config — passes the threshold where chasing each symptom one at a time is more work than just nuking the install. The reset playbook: uninstall the runtime, delete the model cache directory, reinstall fresh from the website, pull one model, run a basic chat to confirm. Total time: 30-60 minutes, and you start clean.
Do this when you've spent more than two hours on a single symptom without progress, or when you've recently upgraded the OS, the GPU driver, or the runtime version and things stopped working. Live status of each runtime's known issues is at /runtime-health; recurring setup mistakes are catalogued in /guides/common-local-ai-setup-mistakes.
Next recommended step
The full error catalog at /errors: exact log strings, root causes, and operator-grade fixes for every common breakage.
Job-hunting tools that parse PDFs, run multiple inference passes, and cross-reference job boards in parallel hit VRAM harder than a single chat session ever will. A GPU with enough headroom handles these spikes gracefully. One that is already running near its limit will force you into constant restarts and truncated outputs — exactly when you need the tool to be reliable during a time-sensitive application window.
The VRAM floor that prevents mid-scan failures: best budget GPU for local AI.