Cross-cutting

Streaming response

A streaming response delivers LLM-generated tokens to the client as they are produced instead of waiting for the full response, typically over Server-Sent Events (SSE) or a WebSocket. Streaming makes long responses feel faster, because the user often sees the first words within ~500ms regardless of total response length, and it supports incremental UI updates.

May 23, 2026

Streaming is table stakes for any LLM application with user-visible latency. Without streaming, a 30-second generation feels like a 30-second hang; with streaming, the user sees output flowing within the first second and can read along. Implementation involves three changes compared with a non-streaming call: the API call uses the streaming endpoint, the server forwards tokens to the client as they arrive, and the UI renders incrementally with smooth append. Error handling becomes more nuanced, because a failure partway through leaves the client holding a partial response. Structured output is also harder, because the output cannot be validated against the full schema until the last token arrives. Most production frameworks (Vercel AI SDK, Anthropic SDK, OpenAI SDK) handle the mechanics, so the application can focus on UX.
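The forwarding and error-handling steps can be sketched without any particular SDK. The following is a minimal illustration using only the Python standard library, not a production implementation: `fake_token_stream` is a hypothetical stand-in for a provider's streaming call, and the `token`, `error`, and `done` event names are assumptions chosen for this example rather than part of any real API. It shows tokens being framed as Server-Sent Events as they arrive, and a mid-stream failure being reported inside the stream while the client keeps the partial text it already rendered.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def fake_token_stream(fail_after: Optional[int] = None) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming API call."""
    tokens = ["Streaming", " makes", " long", " answers", " feel", " fast", "."]
    for i, token in enumerate(tokens):
        if fail_after is not None and i == fail_after:
            # Simulate the upstream connection dropping partway through.
            raise ConnectionError("upstream connection dropped")
        yield token


def sse_frame(event: str, payload: dict) -> str:
    """Format one SSE frame: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def forward_as_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Server side: forward each token as soon as it arrives."""
    sent = 0
    try:
        for token in tokens:
            sent += 1
            yield sse_frame("token", {"text": token})
    except Exception as exc:
        # The response has already started, so the failure has to be
        # reported in-band as an event (assumed name: "error").
        yield sse_frame("error", {"message": str(exc), "tokens_sent": sent})
    else:
        yield sse_frame("done", {"tokens_sent": sent})


def render(frames: Iterable[str]) -> Tuple[str, str]:
    """Client side: append token text incrementally.

    Returns the accumulated text and the name of the final event.
    """
    text, last_event = "", ""
    for frame in frames:
        event_line, data_line = frame.strip().split("\n")
        last_event = event_line[len("event: "):]
        payload = json.loads(data_line[len("data: "):])
        if last_event == "token":
            text += payload["text"]  # incremental append
    return text, last_event


if __name__ == "__main__":
    print(render(forward_as_sse(fake_token_stream())))
    print(render(forward_as_sse(fake_token_stream(fail_after=3))))
```

In the failure case the client ends up with the text received so far plus a final `error` event, and the UI then has to decide whether to keep, mark, or retry the partial response. In a real application the SDK's stream iterator would take the place of `fake_token_stream`, and the frames would be written to an HTTP response with the `text/event-stream` content type.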