How to build a custom AI coding assistant
LLM with code generation capability, editor API
What this does
A custom AI coding assistant extends a foundation model's capabilities with retrieval-augmented generation over a specific codebase. It integrates with an editor via the Language Server Protocol (LSP), retrieves relevant code context from a vector store, and provides inline completions and hover documentation.
Steps
Create a Python package with a class that inherits from lsp4py.LanguageServer. Override the text_document_completion handler to intercept completion requests. When a completion request arrives, extract the current file path, cursor position, and surrounding code context.
Run a batch ingestion process over all source files in the repository. Use tree-sitter to parse each file into a syntax tree, extract function and class definitions, and split them into chunks of up to 512 tokens. Encode each chunk with an embedding model and store the vectors in a vector database such as Chroma or FAISS. Maintain a manifest file mapping each chunk to its source file and line number.
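The chunking and manifest part of the ingestion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the guide's implementation: it swaps in Python's standard-library ast module in place of tree-sitter (so it only handles Python files), counts whitespace-separated words as a rough stand-in for model tokens, and leaves out embedding and vector storage. The names Chunk, split_by_tokens, chunk_python_source, and build_manifest are hypothetical.

```python
import ast
from dataclasses import dataclass

MAX_TOKENS = 512  # chunk size limit named in the step above


@dataclass
class Chunk:
    source_file: str
    start_line: int  # 1-based line number of the chunk's first line
    text: str


def split_by_tokens(lines, first_line, max_tokens):
    """Group consecutive lines so each group stays within max_tokens
    whitespace-separated tokens. A single line that is longer than the
    limit is kept whole rather than split mid-line."""
    groups, current, count, start = [], [], 0, first_line
    for offset, line in enumerate(lines):
        n = len(line.split())
        if current and count + n > max_tokens:
            groups.append((start, "\n".join(current)))
            current, count, start = [], 0, first_line + offset
        current.append(line)
        count += n
    if current:
        groups.append((start, "\n".join(current)))
    return groups


def chunk_python_source(path, source, max_tokens=MAX_TOKENS):
    """Return one or more chunks per top-level function or class definition."""
    tree = ast.parse(source)
    all_lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            body = all_lines[node.lineno - 1 : node.end_lineno]
            for start, text in split_by_tokens(body, node.lineno, max_tokens):
                chunks.append(Chunk(path, start, text))
    return chunks


def build_manifest(chunks):
    """Map each chunk index to its source file and starting line."""
    return {i: {"file": c.source_file, "line": c.start_line} for i, c in enumerate(chunks)}
```

Each Chunk.text would then be passed to the embedding model, with the chunk index used as the vector's identifier so that a search hit can be traced back through the manifest.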
In the completion handler, build a query string from the current document content and the code around the cursor, then query the vector database for the 5 most similar chunks. Append these chunks to the system prompt as context under a heading for relevant code. Construct the final prompt from a template with four parts, in order: system instructions, codebase context, current file content, and the incomplete line at the cursor. Send this enriched prompt to the LLM API and return the completions as a list of CompletionItem objects.
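The retrieval and prompt assembly described above might look roughly like this. It is a sketch only: the in-memory list of (text, vector) pairs stands in for the vector database, the cosine ranking is written out by hand, and the heading strings, the SYSTEM_INSTRUCTIONS wording, and the function names are all hypothetical. The LLM call and the conversion to CompletionItem objects are omitted.

```python
import math

# Hypothetical wording; the guide does not specify the system instructions.
SYSTEM_INSTRUCTIONS = "You are a coding assistant. Complete the code at the cursor."


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=5):
    """indexed is a list of (chunk_text, vector) pairs.
    Returns the k chunk texts most similar to query_vec, best first."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(chunks, file_content, incomplete_line):
    """Assemble the prompt in the order given in the step above."""
    context = "\n\n".join(chunks)
    return (
        f"{SYSTEM_INSTRUCTIONS}\n\n"
        f"## Relevant code\n{context}\n\n"
        f"## Current file\n{file_content}\n\n"
        f"## Complete this line\n{incomplete_line}"
    )
```

Keeping prompt assembly in a pure function makes it easy to log the exact prompt that was sent, which the verification step relies on.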
Publish the LSP server as a local network service on localhost port 5050. Install the editor extension that communicates with this endpoint. Configure the extension manifest to declare LSP capabilities: completion, hover, and definition. Write a configuration file config.toml in the repository root that sets lsp_host, embedding_model, and max_context_tokens so these values are read at startup rather than hardcoded.
Confirm the local starting state. Print the active binary, package version, model name, or configuration path before changing the workflow.
Run the smallest complete path. Execute the minimum command or script that proves the guide works end to end on the local machine.
Compare against expected output. Check the final line, status code, generated artifact, or model response against the verification section before expanding the setup.
Record the local run evidence. Save the exact command, runtime or package version, model name if applicable, and observed output so the result can be reproduced later.
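One possible way to capture this evidence is a small helper that writes a JSON record. This is a sketch under assumptions: the field names, the run_evidence.json file name, and both function names are hypothetical, and the model name and observed output are whatever the caller passes in.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def collect_run_evidence(command, model_name, observed_output):
    """Gather the details needed to reproduce a local run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "python_executable": sys.executable,
        "python_version": platform.python_version(),
        "model_name": model_name,
        "observed_output": observed_output,
    }


def save_run_evidence(record, path="run_evidence.json"):
    """Write the record as indented JSON so it can be diffed between runs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

For example, collect_run_evidence("python -m coding_assistant.server", "example-model", "server started") returns a dictionary that save_run_evidence can write next to the project's other run artifacts.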
Verification
Start the LSP server with python -m coding_assistant.server and confirm it listens on the configured port. Open a Python file in the editor with the extension enabled. Trigger a completion request at an incomplete function definition. Expected output:
- A completion list appears with at least one suggestion matching the surrounding code context.
- The logged prompt sent to the LLM contains the relevant code section with 3 to 5 chunks from the vector store.
- The response latency is below 3 seconds.
- Hovering over a function shows its docstring retrieved from the indexed chunks.
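Two of these checks, the listening port and the latency bound, can be scripted. The helpers below are a sketch that assumes the server accepts TCP connections on the configured host and port; is_listening and timed are hypothetical names, and the completion request itself is left to the caller.

```python
import socket
import time


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

With the server running, is_listening("localhost", 5050) should return True, and wrapping the completion call in timed gives an elapsed value to compare against the 3 second limit.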
Common failures
- LSP handshake failures: A version mismatch between the editor extension and the server causes initialize requests to fail. Verify that both sides implement the same LSP protocol version by comparing server_capabilities in the handshake payload.
- Vector store returning off-topic chunks: Low-quality embedding models produce irrelevant matches. Evaluate retrieval precision by manually inspecting the top 10 results for 10 sample queries and swap the embedding model if average relevance drops below 0.7.
- Context window overflow: The enriched prompt exceeds the LLM's maximum token limit. Enforce a max_context_tokens hard limit in config.toml and truncate chunks from the bottom of the context list first.
- Slow completion latency: The vector search plus LLM round-trip exceeds acceptable response time. Add a caching layer so identical query strings return cached completions for a configurable TTL.
- File permission errors during indexing: The ingestion process cannot read files in nested directories. Run with filesystem permissions that grant recursive read access or explicitly list allowlisted directory patterns.
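The fixes for context window overflow and slow completions can be sketched in a few lines. This is an illustration under assumptions, not a prescribed design: tokens are approximated by whitespace-separated words unless a real tokenizer is passed as count_tokens, the cache lives in process memory, and truncate_chunks and TTLCache are hypothetical names.

```python
import time


def truncate_chunks(chunks, max_context_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in ranked order until the token budget is spent,
    so the lowest-ranked chunks at the bottom of the list are dropped first."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


class TTLCache:
    """Cache completions by query string and expire them after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)
```

In the completion handler, a cache hit would skip both the vector search and the LLM call; on a miss, the retrieved chunks would go through truncate_chunks before the prompt is assembled.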
Related guides
- Set up CI/CD for AI model deployment — the LSP server container can be built and deployed using the CI/CD pipeline.
- Implement agent planning and task decomposition — task decomposition assists the coding assistant by breaking multi-file refactoring into sub-tasks.