Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture designed to model sequential data while avoiding the vanishing gradient problem. It uses a cell state and three gates (input, forget, output) to control information flow, allowing it to retain long-range dependencies. Operators encounter LSTMs in older or smaller language models (e.g., some variants of GPT-2, character-level models) and in specialized tasks like speech recognition or time-series forecasting. In local AI, LSTMs are less common than Transformers for text generation, but they remain relevant for low-latency or low-memory scenarios where a full Transformer is overkill.
Deeper dive
LSTMs were introduced by Hochreiter & Schmidhuber (1997) to address the limitations of vanilla RNNs, which struggle to learn dependencies across many time steps due to vanishing gradients. The LSTM cell maintains a cell state that acts as a conveyor belt, with gates (sigmoid layers) controlling what to forget, what to store, and what to output. The forget gate decides which information to discard, the input gate updates the cell state with new candidate values, and the output gate filters the cell state to produce the hidden state. This gating mechanism allows LSTMs to remember information for thousands of steps. In practice, LSTMs were the dominant architecture for sequence modeling (language, speech, translation) until the Transformer (2017) surpassed them in quality and parallelizability. For local AI, LSTMs are still used in lightweight models (e.g., for on-device speech recognition) and as components in hybrid architectures (e.g., LSTM + attention). They are also common in time-series forecasting tools like Prophet or custom PyTorch models.
Practical example
An operator running a local speech-to-text model on an RTX 3060 (12 GB VRAM) might use a small LSTM-based model like 'vosk-model-small-en-us-0.15' (40 MB) to achieve real-time transcription with ~50 ms latency. This fits entirely in VRAM and uses ~1 GB of memory, leaving room for other tasks. In contrast, a Transformer-based model like Whisper tiny (150 MB) would use more VRAM and have higher latency (~200 ms) on the same hardware.
Workflow example
Reviewed by Fredoline Eruo. See our editorial policy.