OpenAPI 3.2 brings native, first‑class ways to describe APIs that send data as a sequence of events instead of a single, monolithic payload. This is a big deal for real‑time apps—LLMs, analytics feeds, chat, logs, and anything that benefits from progressive rendering or lower perceived latency. At a high level, OpenAPI 3.2 formalizes “sequential” media types and lets you specify the schema of each item in the stream using a new itemSchema on a response media type. That means your documentation can be explicit about the shape of each event, not just the overall connection.

  • New capability: Use itemSchema under content to define the structure of each streamed event.

Supported sequential media types:

  • SSE: text/event-stream
  • JSON Lines: application/jsonl
  • JSON Sequences: application/json-seq
  • Multipart Mixed: multipart/mixed

This unlocks consistent tooling, clearer client expectations, and better validation for streaming APIs.

A quick primer: how streaming differs from “normal” responses

  • Normal response: One payload, one schema.
  • Streaming response: Many items over time; each item conforms to the itemSchema. The transport stays open while the server emits items; the client processes incrementally.

Common patterns you’ll see:

  • SSE for text/token streams in browsers
  • JSONL for structured event logs and incremental model outputs
  • Multipart for mixed binary/text chunks (e.g., speech + text)
  • Sentinel events like [DONE] to cleanly signal the end of a stream

Example: describing an LLM’s streaming API (SSE)

Here’s a minimal but realistic OpenAPI 3.2 spec for a text‑generation endpoint that streams tokens via Server‑Sent Events. The server emits events as they become available; clients render tokens progressively.

openapi: 3.2.0
info:
  title: LLM Streaming API
  version: 1.0.0
paths:
  /generate:
    post:
      summary: Stream generated text from the LLM
      description: |
        Streams model output incrementally using Server-Sent Events (SSE).
        Each event contains a token chunk; a sentinel signals the end.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [model, prompt]
              properties:
                model:
                  type: string
                  description: Model identifier (e.g., "gpt-4o-mini")
                prompt:
                  type: string
                  description: The input prompt for text generation
                max_tokens:
                  type: integer
                  minimum: 1
                  default: 256
                temperature:
                  type: number
                  minimum: 0
                  maximum: 2
                  default: 0.7
      responses:
        "200":
          description: Stream of generation events via SSE
          content:
            text/event-stream:
              itemSchema:
                oneOf:
                  - type: object
                    required: [event, data]
                    properties:
                      event:
                        type: string
                        enum: ["token"]
                        description: Event type
                      data:
                        type: object
                        required: [text, index]
                        properties:
                          text:
                            type: string
                            description: Token or text chunk
                          index:
                            type: integer
                            minimum: 0
                            description: Incrementing token index
                          logprobs:
                            type: ["number", "null"]
                            description: Optional per-token log probability
                          finish_reason:
                            type: ["string", "null"]
                            enum: ["stop", "length", "content_filter", null]
                  - type: object
                    required: [event, data]
                    properties:
                      event:
                        type: string
                        enum: ["summary"]
                      data:
                        type: object
                        properties:
                          usage:
                            type: object
                            properties:
                              prompt_tokens: { type: integer, minimum: 0 }
                              completion_tokens: { type: integer, minimum: 0 }
                              total_tokens: { type: integer, minimum: 0 }
                          model:
                            type: string
                  - type: object
                    required: [event, data]
                    properties:
                      event:
                        type: string
                        enum: ["done"]
                      data:
                        type: string
                        enum: ["[DONE]"]
        "400":
          description: Invalid request
          content:
            application/json:
              schema:
                type: object
                required: [error]
                properties:
                  error:
                    type: string
        "429":
          description: Rate limited
        "500":
          description: Server error
  

Notes:

  • text/event-stream matches the SSE transport browsers understand.
  • itemSchema with oneOf captures normal token events, an optional final summary, and the explicit end‑of‑stream sentinel.
  • Errors follow regular non‑streaming JSON shapes with standard HTTP codes.
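
For reference, a response following this schema might look roughly like the following on the wire. This is an illustrative sketch: the token text, usage numbers, and model name are made up, but the event names and JSON fields mirror the itemSchema above.

event: token
data: {"text": "Hello", "index": 0}

event: token
data: {"text": " world", "index": 1, "finish_reason": "stop"}

event: summary
data: {"usage": {"prompt_tokens": 12, "completion_tokens": 2, "total_tokens": 14}, "model": "gpt-4o-mini"}

event: done
data: [DONE]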

Variant: JSON Lines (JSONL) streaming for SDKs and backend clients

If your clients prefer framed JSON instead of SSE, you can offer application/jsonl and describe each line with itemSchema in the same way. Each line is one JSON object.

responses:
  "200":
    description: Stream of generation events via JSON Lines
    content:
      application/jsonl:
        itemSchema:
          oneOf:
            - type: object
              required: [type, token]
              properties:
                type:
                  type: string
                  enum: ["token"]
                token:
                  type: object
                  required: [text, index]
                  properties:
                    text: { type: string }
                    index: { type: integer, minimum: 0 }
            - type: object
              required: [type, data]
              properties:
                type:
                  type: string
                  enum: ["done"]
                data:
                  type: string
                  enum: ["[DONE]"]
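
On the wire, each event is a single line of JSON terminated by a newline. The values below are illustrative; the field names match the itemSchema above.

{"type": "token", "token": {"text": "Hello", "index": 0}}
{"type": "token", "token": {"text": " world", "index": 1}}
{"type": "done", "data": "[DONE]"}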
  

Designing a great streaming contract

  • Be explicit about termination: Include a clear sentinel (e.g., event: "done" with data: "[DONE]") so clients can stop reading deterministically.
  • Separate event kinds: Use oneOf to model different event shapes (tokens vs. summary vs. control).
  • Carry minimal context: Include a token index, optional finish_reason, and a final summary with usage for billing/UX.
  • Document timeouts and reconnection: SSE lets the server suggest a reconnection delay via the retry field; for clients, document expected timeouts and backoff behavior.
  • Error semantics: Prefer failing fast (non‑200) before starting the stream. If you must signal errors mid‑stream, reserve a control event type (e.g., event: "error") and describe it in oneOf.

Client expectations

  • SSE clients: EventSource works in browsers for GET endpoints; for POST endpoints like the example above, use fetch with a streamed response body or an SSE client library, and parse the event and data fields (see the sketch after this list).
  • JSONL clients: Read line‑delimited JSON; process each line as one event.
  • Backpressure: HTTP streaming is flow‑controlled by the underlying TCP connection; keep events small and frequent for smoother UX.
  • Idempotency: If you support retries, include id/cursor fields so clients can resume safely.
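
Below is a minimal TypeScript sketch of an SSE consumer for the POST /generate endpoint described earlier. It is an illustration, not a production client: the event names (token, summary, done) and the [DONE] sentinel come from the example spec above, it uses fetch with a streamed response body because EventSource cannot send a POST, and the SSE parsing is deliberately simplified (it assumes "\n\n" frame separators and single-line data fields).

// Minimal sketch: consume the POST /generate SSE stream from the example spec.
async function streamGeneration(prompt: string): Promise<string> {
  const res = await fetch("/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream" },
    body: JSON.stringify({ model: "gpt-4o-mini", prompt }), // model name is illustrative
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let output = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // SSE frames are separated by a blank line; keep any trailing partial frame.
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? "";

    for (const frame of frames) {
      let eventType = "message";
      let data = "";
      for (const line of frame.split("\n")) {
        if (line.startsWith("event:")) eventType = line.slice(6).trim();
        else if (line.startsWith("data:")) data += line.slice(5).trim();
      }
      if (eventType === "done" || data === "[DONE]") return output; // end-of-stream sentinel
      if (eventType === "token") {
        const chunk = JSON.parse(data) as { text: string; index: number };
        output += chunk.text; // a real UI would render this progressively
      }
      // A "summary" frame could be parsed here for usage/billing telemetry.
    }
  }
  return output;
}

A JSONL client under the variant above is even simpler: read the body line by line and JSON.parse each non-empty line.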

Why this is useful for LLMs

  • Faster first token: Users see output immediately, improving perceived latency.
  • Progressive rendering: Stream tokens as they’re generated; UIs don’t block.
  • Richer telemetry: Emit intermediary signals (e.g., tool_call, reasoning, usage) as distinct event types.
  • Interoperability: A standardized spec makes SDKs and gateways simpler to build and maintain.

For full details, see the official OpenAPI 3.2 specification.