Streamlit

Streamlit is an open-source Python framework for turning data scripts into interactive web apps with minimal code. Operators encounter it when building custom UIs for local AI models—e.g., a chat interface or model comparison dashboard—without writing HTML, CSS, or JavaScript. Streamlit reruns the entire script on each user interaction, which matters for local AI because loading a model into VRAM on every click would be impractical; operators typically cache the model with @st.cache_resource to keep it resident.

Deeper dive

Streamlit works by executing a Python script from top to bottom whenever a user interacts with a widget (slider, button, text input). This rerun model makes prototyping fast but requires careful caching for expensive operations such as loading a large language model. The @st.cache_resource decorator runs the decorated function once and keeps the returned object (e.g., a loaded model) in memory across reruns, so the model is not reloaded into VRAM on every interaction. For local AI workflows, Streamlit is often paired with Hugging Face Transformers or with llama.cpp via the llama-cpp-python bindings. Common patterns include a text input for prompts, a slider for temperature, and a button to generate text, all rendered with a few lines of Python. Streamlit's simplicity comes at the cost of fine-grained control. For production-grade serving, operators usually put the model behind a dedicated API server; Gradio is a comparable UI toolkit for demos rather than a serving layer.
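The rerun-plus-cache behavior described above can be illustrated with a toy simulation in plain Python. This is a sketch of the idea only, not Streamlit's implementation: cache_resource, load_model, and run_script here are hypothetical stand-ins, and the "model" is a small dictionary.

```python
from typing import Callable, Dict, List

# Toy stand-in for a resource cache: maps a function name to the object
# that function returned the first time it was called.
_resource_cache: Dict[str, object] = {}
load_count = 0  # how many times the "expensive" load actually ran


def cache_resource(func: Callable[[], dict]) -> Callable[[], dict]:
    """Return a wrapper that runs func once and reuses its result afterwards."""
    def wrapper() -> dict:
        if func.__name__ not in _resource_cache:
            _resource_cache[func.__name__] = func()
        return _resource_cache[func.__name__]
    return wrapper


@cache_resource
def load_model() -> dict:
    """Pretend to load a large model; only counts how often it is invoked."""
    global load_count
    load_count += 1
    return {"name": "hypothetical-model"}


def run_script(temperature: float) -> str:
    """One top-to-bottom rerun, as triggered by a widget change."""
    model = load_model()  # served from the cache on every rerun after the first
    return f"{model['name']} @ temperature={temperature}"


# Simulate three widget interactions, each causing a full rerun.
outputs: List[str] = [run_script(t) for t in (0.2, 0.7, 1.0)]
print(outputs[-1])
print("model loads:", load_count)
```

Every simulated rerun executes run_script in full, yet the load function body runs only once. That is the property the real decorator provides for a model held in VRAM.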

Practical example

An operator building a local Llama 3.1 8B chat UI writes a streamlit_app.py that constructs the llama_cpp.Llama object inside a function decorated with @st.cache_resource, so the model loads only once. The app provides a text area for the system prompt, a slider for max tokens (128–4096), and a chat history display. On an RTX 4090 (24 GB VRAM), the model at Q4_K_M (~5 GB) stays in VRAM, and generation runs at ~40 tok/s. Without caching, every rerun that reaches the load call would reload the model, adding ~10 seconds per interaction and risking VRAM exhaustion if earlier copies are not freed promptly.
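The cost of skipping the cache can be worked out from the figures above. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the 256-token reply length is an assumption chosen for illustration, and seconds_per_reply is a hypothetical helper.

```python
def seconds_per_reply(new_tokens: int, tokens_per_second: float,
                      reload_seconds: float = 0.0) -> float:
    """Wall-clock estimate: optional model reload plus generation time."""
    return reload_seconds + new_tokens / tokens_per_second


REPLY_TOKENS = 256      # assumption: length of a typical reply
TOKENS_PER_SECOND = 40  # generation speed quoted in the example
RELOAD_SECONDS = 10     # reload time quoted in the example

cached = seconds_per_reply(REPLY_TOKENS, TOKENS_PER_SECOND)
uncached = seconds_per_reply(REPLY_TOKENS, TOKENS_PER_SECOND, RELOAD_SECONDS)

print(f"cached:   {cached:.1f} s per reply")
print(f"uncached: {uncached:.1f} s per reply")
```

Under these assumptions the reload costs more time than generating the reply itself, and the gap widens for shorter replies.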

Workflow example

To run the Streamlit app, the operator executes streamlit run streamlit_app.py in the terminal. The browser opens a local URL (typically http://localhost:8501). The script uses st.chat_input for user prompts and st.chat_message to display assistant responses. The model is loaded via llama-cpp-python with Llama(model_path="llama-3.1-8b-instruct-q4_k_m.gguf", n_ctx=4096) inside a function cached with @st.cache_resource. When the operator adjusts the temperature slider, Streamlit reruns the script but skips model loading. The cached Llama object is reused, and the new temperature applies to the next prompt submitted through st.chat_input.

Reviewed by Fredoline Eruo. See our editorial policy.
