Streaming response
A streaming response delivers LLM-generated tokens to the client as they are produced rather than waiting for the full response, typically over Server-Sent Events (SSE) or a WebSocket. Streaming makes long responses feel faster (the user often sees the first words within ~500ms, because time to first token does not depend on total response length) and supports incremental UI updates.
Streaming is table stakes for any LLM application with user-visible latency. Without streaming, a 30-second generation feels like a 30-second hang; with streaming, the user sees output flowing within the first second and can read along. Implementation involves three changes versus non-streaming: the API call uses the streaming endpoint, the server forwards tokens to the client as they arrive, and the UI renders incrementally with smooth append. Error handling becomes more nuanced (a failure mid-stream leaves the client holding a partial response) and structured output is harder (the output cannot be validated against the schema until the last token arrives). Most production frameworks (Vercel AI SDK, the Anthropic and OpenAI SDKs) handle the mechanics; the application focuses on UX.
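The forwarding step and the two complications above (partial responses on failure, structured output that only parses at the end) can be illustrated with a minimal sketch. It does not call any real provider: `fake_token_stream` is a hypothetical stand-in for a streaming endpoint, and the helper names and the `token` / `done` / `error` event names are assumptions made for this example, not part of any SDK.

```python
import json
from typing import Iterable, Iterator, Optional


def fake_token_stream(tokens: Iterable[str],
                      fail_after: Optional[int] = None) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming endpoint."""
    for i, tok in enumerate(tokens):
        if fail_after is not None and i == fail_after:
            raise ConnectionError("upstream stream dropped")
        yield tok


def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Format one SSE frame: optional 'event:' line, a 'data:' line, blank line."""
    head = f"event: {event}\n" if event else ""
    return f"{head}data: {json.dumps(data)}\n\n"


def forward_as_sse(upstream: Iterator[str]) -> Iterator[str]:
    """Forward each token as soon as it arrives; report failures in-band."""
    received = []
    try:
        for tok in upstream:
            received.append(tok)
            yield sse_event({"text": tok}, event="token")
        yield sse_event({"text": "".join(received)}, event="done")
    except ConnectionError as exc:
        # The client has already rendered `received`, so mark the response
        # as partial instead of silently ending the stream.
        yield sse_event({"partial": "".join(received), "error": str(exc)},
                        event="error")


def parse_when_complete(buffer: str) -> Optional[dict]:
    """Return the parsed JSON object, or None while the buffer is still a prefix."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None
```

In a real server, the frames yielded by `forward_as_sse` would be written to an HTTP response with the `text/event-stream` content type. On the client side, `parse_when_complete` shows why streamed structured output needs care: a truncated JSON object such as `{"a": 1` fails to parse until its closing brace arrives.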
Related terms
- Structured output
Structured output is the LLM feature that guarantees the response matches a provided schema (JSON Schema, Zod, Pydantic) — eliminating the parsing failures and format drift that plagued early LLM applications.
- Inference cost
Inference cost is the per-request economic cost of running an LLM — typically billed per million input tokens and per million output tokens, with output tokens often 3-5x more expensive than input.
- Large language model (LLM)
A large language model is a neural network trained on enormous text corpora to predict the next token given preceding tokens — typically with billions to trillions of parameters.
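The per-million-token billing described under Inference cost can be made concrete with a small calculation. This is a sketch only: the function name is hypothetical and the prices in the usage note are made-up placeholders, not any provider's actual rates.

```python
def request_cost_usd(input_tokens: int,
                     output_tokens: int,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float) -> float:
    """Cost of one request when prices are quoted per million tokens."""
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000
```

With hypothetical prices of $3 per million input tokens and $15 per million output tokens (a 5x ratio), a request with 2,000 input tokens and 500 output tokens costs `request_cost_usd(2000, 500, 3.0, 15.0)`, which is (6,000 + 7,500) / 1,000,000 = $0.0135. The 500 output tokens account for more than half of that total.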