RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Local AI for Code Generation

05. Autocomplete Setup

Chapter 5 of 18 · 20 min
KEY INSIGHT

Autocomplete tuning balances suggestion frequency, latency, and relevance through configuration parameters and model selection.

Effective autocomplete requires tuning three dimensions: trigger behavior, suggestion quality, and response speed. The goal is suggestions that feel anticipatory without being intrusive.

Trigger behavior controls when suggestions appear. Continue offers several modes:

```json
{
  "autocomplete": {
    "disabled": false,
    "triggerMode": "automatic",
    "maxPrefixLines": 50,
    "maxSuffixLines": 10
  }
}
```

`triggerMode` accepts `automatic` (trigger on keystroke), `manual` (trigger on Ctrl+K), or `always` (continuous streaming). The `always` mode shows partial completions as they are generated, which is useful for seeing long completions but potentially distracting.

The `maxPrefixLines` and `maxSuffixLines` settings control how much context gets sent to the model: the lines before the cursor and the lines after it. Higher values improve suggestion relevance but increase latency and token usage. For typical Python files, 50 prefix lines capture most function and class context. The suffix should include enough lines to capture the closing of the current block, but not so many that it confuses the model.

Debounce settings prevent autocomplete from triggering on every keystroke:

```json
{
  "autocomplete": {
    "debounceDelay": 150,
    "quickShortcuts": ["Tab"],
    "multiline": true,
    "maxTokens": 300
  }
}
```

A `debounceDelay` of 150 ms waits for a 150 ms pause in typing before sending a request, which prevents rapid-fire requests during fast typing. The `quickShortcuts` array lets you accept a suggestion immediately with Tab. `multiline` enables multi-line completions, which many developers prefer for generating complete function bodies. `maxTokens` caps the length of each completion.

Model selection significantly impacts autocomplete quality. Smaller models (3-7B parameters) generate faster but often produce irrelevant suggestions. Larger models (15B+) provide better suggestions but introduce latency. The sweet spot for most hardware is a 7-15B model with fill-in-the-middle (FIM) training.
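
To make the effect of `maxPrefixLines` and `maxSuffixLines` concrete, here is a minimal Python sketch of how an editor could cut the context window around the cursor. It is an illustration only, not Continue's implementation: the function name `split_context`, its parameters, and the assumption that the cursor sits at the start of a row are all hypothetical.

```python
from typing import List, Tuple


def split_context(
    lines: List[str],
    cursor_row: int,
    max_prefix_lines: int = 50,
    max_suffix_lines: int = 10,
) -> Tuple[List[str], List[str]]:
    """Return (prefix, suffix) line lists around a cursor.

    Hypothetical helper, not Continue's implementation. The cursor is
    assumed to sit at the start of `cursor_row` (0-based).
    """
    if not 0 <= cursor_row <= len(lines):
        raise ValueError("cursor_row is outside the file")
    # Keep at most max_prefix_lines lines directly above the cursor.
    start = max(0, cursor_row - max_prefix_lines)
    prefix = lines[start:cursor_row]
    # Keep at most max_suffix_lines lines from the cursor row downward.
    suffix = lines[cursor_row:cursor_row + max_suffix_lines]
    return prefix, suffix


if __name__ == "__main__":
    source = [f"line {i}" for i in range(200)]
    prefix, suffix = split_context(source, cursor_row=120)
    print(len(prefix), len(suffix))  # 50 10
```

Every extra line in either list adds tokens to the request, which is why raising these limits trades latency for relevance.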

Prompt engineering for autocomplete uses a compressed context format:

```
Current file context:
[recent imports and definitions]
[function/class being edited]
[...]
[the partial line being typed]
Suggested continuation:
```

The model receives only the most relevant context, which keeps the token count low and inference fast.

Common autocomplete failures:

1. **Empty suggestions**: The model isn't receiving proper context. Check API connectivity and model availability.
2. **Irrelevant completions**: The prefix/suffix context includes too much noise. Reduce the context line counts.
3. **Truncated completions**: The `maxTokens` limit is too low. Increase it to 300-500 for longer completions.
4. **Slow suggestions**: The model is too large for the hardware. Consider a quantized build or a smaller model.

Quality evaluation involves tracking your suggestion acceptance rate. If you're accepting less than 20% of suggestions, either the model isn't matching your coding style or the trigger settings are too aggressive. If you're accepting 60% or more but productivity feels unchanged, you may be using a verbose model that generates too much boilerplate.
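
The acceptance-rate check can be kept as a simple tally. The sketch below is a hypothetical helper, not part of Continue: the `AcceptanceLog` class and the labels returned by `diagnose` are invented for illustration, while the 20% and 60% cut-offs come from the guidance above.

```python
from dataclasses import dataclass


@dataclass
class AcceptanceLog:
    """Hypothetical tally of accepted versus dismissed suggestions."""

    accepted: int = 0
    dismissed: int = 0

    def record(self, was_accepted: bool) -> None:
        # Count one suggestion as either accepted or dismissed.
        if was_accepted:
            self.accepted += 1
        else:
            self.dismissed += 1

    @property
    def rate(self) -> float:
        # Fraction of suggestions accepted; 0.0 when nothing is logged yet.
        total = self.accepted + self.dismissed
        return self.accepted / total if total else 0.0


def diagnose(rate: float) -> str:
    """Map an acceptance rate to the rule of thumb from the text."""
    if rate < 0.20:
        return "low: check model fit or trigger settings"
    if rate >= 0.60:
        return "high: check whether completions are mostly boilerplate"
    return "typical: no change suggested by acceptance rate alone"


if __name__ == "__main__":
    log = AcceptanceLog()
    for outcome in [True, False, False, False, False, False, False]:
        log.record(outcome)
    print(f"{log.rate:.0%}", "->", diagnose(log.rate))
```

A high rate is only flagged for review here; whether productivity actually feels unchanged is a judgment the tally cannot make.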

EXERCISE

Configure autocomplete with a 300ms debounce, 40 max prefix lines, and multiline enabled. Use the editor for one hour, noting every time you dismiss a suggestion as irrelevant versus accept it. Adjust parameters based on your acceptance rate.
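
To see what the debounce delay changes before you start the timer, the following Python sketch simulates which keystrokes would lead to a completion request. It assumes the common debounce rule (a request fires only after a pause at least as long as the delay); the function name `debounced_requests` and the sample timestamps are hypothetical, and the editor's real scheduling may differ.

```python
from typing import List


def debounced_requests(keystroke_ms: List[int], delay_ms: int = 300) -> List[int]:
    """Return the times (ms) at which a completion request would fire.

    Assumes keystroke_ms is sorted ascending. A request fires delay_ms
    after a keystroke when no later keystroke arrives inside that window.
    """
    fire_times = []
    for i, t in enumerate(keystroke_ms):
        is_last = i == len(keystroke_ms) - 1
        # A pause of at least delay_ms after this keystroke lets the timer expire.
        if is_last or keystroke_ms[i + 1] - t >= delay_ms:
            fire_times.append(t + delay_ms)
    return fire_times


if __name__ == "__main__":
    typing_burst = [0, 100, 300, 900, 950]  # made-up keystroke timestamps
    print(debounced_requests(typing_burst, delay_ms=150))  # [250, 450, 1100]
    print(debounced_requests(typing_burst, delay_ms=300))  # [600, 1250]
```

With these sample timestamps the 300 ms delay sends two requests where 150 ms sends three, at the cost of suggestions arriving later after each pause.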
