By default a run returns one block of text when it finishes. Two common cases call for something else. If a run takes seconds to minutes, you can stream tokens as they’re produced instead of waiting for the end. And if a downstream system has to consume the result, you can ask for structured JSON that matches a schema rather than parse prose. This page covers both, plus how to control response length. For the exhaustive reference — every stream event, error handling, and full schema examples — see Output Control.

Streaming

Streaming delivers the response incrementally as Server-Sent Events instead of one final payload — the right default for chat UIs, long answers, and anything interactive. Set stream: true and iterate the events, discriminating on each event’s type.
from perplexity import Perplexity

client = Perplexity()

stream = client.responses.create(
    preset="fast-search",
    input="Explain what a model card is and its typical sections.",
    stream=True,
)

for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="")
    elif event.type == "response.completed":
        print(f"\n\nDone. Usage: {event.response.usage}")
The stream emits typed events you discriminate by type. The ones you’ll handle most:
Event type | Meaning
--- | ---
response.created | The initial response object
response.output_text.delta | A chunk of streaming text; append delta
response.output_text.done | The text item is complete
response.reasoning.search_queries / response.reasoning.search_results | Search activity during the run
response.sandbox.results | A sandbox invocation finished
response.completed | Terminal success; carries the final response and usage
response.failed / error | Terminal failure
Set stream_options: {"include_usage": true} to ensure token usage rides along on the final response.completed event. For the complete event catalog and per-event payloads, see the Agent API reference. Within a run, search activity is reported before the text that uses it: the response.reasoning.search_queries and response.reasoning.search_results events arrive ahead of the response.output_text.delta events.
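The event handling described above can be sketched as a small consumer that accumulates text deltas, stops on response.completed, and raises on a terminal failure. This is a minimal sketch, not the SDK's own helper: consume_stream is a hypothetical name, and the events below are stand-in objects so the example runs without a network call. It relies only on the type, delta, and response.usage attributes shown earlier on this page.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional, Tuple


def consume_stream(events: Iterable[Any]) -> Tuple[str, Optional[Any]]:
    """Collect streamed text and the final usage from typed stream events.

    Hypothetical helper: it only reads `type`, `delta`, and `response.usage`.
    """
    chunks = []
    usage = None
    for event in events:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)  # append each text chunk in arrival order
        elif event.type == "response.completed":
            usage = getattr(event.response, "usage", None)
            break  # terminal success: nothing more to read
        elif event.type in ("response.failed", "error"):
            raise RuntimeError(f"stream ended with {event.type}")
        # Other event types (search activity, sandbox results) are ignored here.
    return "".join(chunks), usage


# Stand-in events (not real API payloads) so the sketch runs offline.
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.reasoning.search_queries"),
    SimpleNamespace(type="response.output_text.delta", delta="Hello, "),
    SimpleNamespace(type="response.output_text.delta", delta="world."),
    SimpleNamespace(type="response.output_text.done"),
    SimpleNamespace(
        type="response.completed",
        response=SimpleNamespace(usage={"total_tokens": 12}),
    ),
]

text, usage = consume_stream(fake_events)
print(text)
```

With a real client, the same function would presumably take the object returned by client.responses.create(..., stream=True) in place of fake_events.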

Background runs

Streaming keeps a connection open for the lifetime of the run. For runs that take minutes — deep research, heavy sandbox work — submit with background: true, then poll for the result by ID. The run continues server-side even if your client disconnects.
import time
from perplexity import Perplexity

client = Perplexity()

response = client.responses.create(
    model="openai/gpt-5.5",
    input="Produce a competitive landscape report for the EV charging market.",
    tools=[{"type": "web_search"}, {"type": "sandbox"}],
    background=True,
)

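# Poll until the run leaves its non-terminal states.
while response.status in ("queued", "in_progress"):
    time.sleep(2)  # fixed 2-second interval between polls
    response = client.responses.retrieve(response.id)

# The run has reached a terminal state; print the text it produced.
print(response.output_text)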
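For long runs, a fixed two-second poll with no upper bound may not be what you want. The sketch below shows one possible variation with exponential backoff and an overall timeout. The names wait_for_run, timeout_s, first_delay_s, and max_delay_s are hypothetical, and the retrieve callable is injected so the example runs offline; the "completed" status in the stand-in data is a placeholder, not a documented value.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable

# The two non-terminal statuses used by the polling loop on this page.
PENDING = ("queued", "in_progress")


def wait_for_run(
    retrieve: Callable[[str], Any],
    response_id: str,
    timeout_s: float = 600.0,
    first_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll `retrieve(response_id)` until the status is no longer pending.

    Hypothetical helper: delays double after each poll, capped at max_delay_s.
    """
    deadline = time.monotonic() + timeout_s
    delay = first_delay_s
    response = retrieve(response_id)
    while response.status in PENDING:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {response_id} still {response.status}")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)
        response = retrieve(response_id)
    return response


# Offline demo: a fake retrieve that walks through a scripted status sequence.
statuses = iter(["queued", "in_progress", "in_progress", "completed"])
delays = []  # records the requested sleeps instead of actually sleeping
final = wait_for_run(
    lambda rid: SimpleNamespace(id=rid, status=next(statuses)),
    "resp_demo",
    sleep=delays.append,
)
print(final.status, delays)
```

Against the real API you would pass client.responses.retrieve as retrieve and leave sleep at its default.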

Reconnecting to a durable stream

Background runs are durable, so you can also stream one live and reconnect after a drop. Request GET /v1/responses/{id}?stream=true&starting_after=N to resume from the event after sequence number N. Reconnect is only valid within the response’s reconnect window; once that window expires, the endpoint returns 400, and you fall back to a plain GET /v1/responses/{id} for the final snapshot. See the Agent API reference.
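The resume logic can be sketched as a generator that remembers the last sequence number it saw and reopens the stream from there after a drop. Several things here are assumptions rather than documented behavior: the event attribute name sequence_number, the use of ConnectionError to signal a dropped connection, and the helper names resume_path, stream_with_resume, and open_stream. Only the path and query parameters come from the text above. The sketch does not handle the 400 returned after the reconnect window expires; a caller would catch that and fetch the plain snapshot.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, Iterator, Optional
from urllib.parse import urlencode


def resume_path(response_id: str, last_seq: int) -> str:
    """Build the reconnect path that resumes after sequence number last_seq."""
    query = urlencode({"stream": "true", "starting_after": last_seq})
    return f"/v1/responses/{response_id}?{query}"


def stream_with_resume(
    open_stream: Callable[[str], Iterable[Any]],
    response_id: str,
    max_reconnects: int = 3,
) -> Iterator[Any]:
    """Yield events, reopening the stream after the last seen event on a drop.

    Assumes each event exposes `sequence_number` (an assumed field name).
    """
    last_seq: Optional[int] = None
    path = f"/v1/responses/{response_id}?stream=true"
    attempts = 0
    while True:
        try:
            for event in open_stream(path):
                last_seq = event.sequence_number
                yield event
            return  # stream ended normally
        except ConnectionError:
            attempts += 1
            if attempts > max_reconnects:
                raise
            if last_seq is not None:
                path = resume_path(response_id, last_seq)


# Offline demo: the first connection drops after two events.
calls = []


def fake_open_stream(path: str) -> Iterator[Any]:
    calls.append(path)
    if len(calls) == 1:
        yield SimpleNamespace(sequence_number=0)
        yield SimpleNamespace(sequence_number=1)
        raise ConnectionError("dropped")
    yield SimpleNamespace(sequence_number=2)
    yield SimpleNamespace(sequence_number=3)


seen = [e.sequence_number for e in stream_with_resume(fake_open_stream, "resp_demo")]
print(seen, calls[-1])
```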

Control response length

Two levers shape how much the model writes:
  • max_output_tokens: a hard cap on generated tokens. The run stops when it’s hit. When generation is cut short this way, the response carries an incomplete_details object whose reason explains why. Use it to bound cost and latency.
  • text.verbosity: low, medium, or high, for OpenAI models that support it. A soft preference for terse vs. expansive answers, without a hard cutoff.
response = client.responses.create(
    model="openai/gpt-5.5",
    input="Summarize the latest jobs report.",
    max_output_tokens=400,
    text={"verbosity": "low"},
)
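Because a capped run can stop mid-sentence, it may be worth checking for truncation before using the text. The sketch below reads the incomplete_details object described above. The helper name truncation_reason is hypothetical, and the reason string "max_output_tokens" in the stand-in data is an assumed value, not one confirmed by this page.

```python
from types import SimpleNamespace
from typing import Any, Optional


def truncation_reason(response: Any) -> Optional[str]:
    """Return incomplete_details.reason if present, else None.

    Hypothetical helper; it tolerates responses with no incomplete_details.
    """
    details = getattr(response, "incomplete_details", None)
    if details is None:
        return None
    return getattr(details, "reason", None)


# Stand-in responses (not real API payloads) so the sketch runs offline.
complete = SimpleNamespace(output_text="Full answer.", incomplete_details=None)
cut_off = SimpleNamespace(
    output_text="Partial ans",
    incomplete_details=SimpleNamespace(reason="max_output_tokens"),  # assumed value
)

for resp in (complete, cut_off):
    reason = truncation_reason(resp)
    if reason is None:
        print("complete:", resp.output_text)
    else:
        print(f"truncated ({reason}):", resp.output_text)
```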

Structured output

When another system consumes the result, free-form prose is hard to work with — you end up writing brittle parsers. Structured output makes the model return JSON that conforms to a schema you define, so you can deserialize it directly. Set response_format to a json_schema:
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "your_schema_name",
      "schema": { "type": "object", "properties": { } }
    }
  }
}
name is required (1–64 characters; letters, numbers, underscores, and dashes). schema is a valid JSON Schema object. The response text will conform to the schema unless the output is cut off by max_output_tokens.
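The naming rule can be checked locally before sending a request. This is a minimal sketch that assumes "letters" means ASCII letters, which the text above does not specify; SCHEMA_NAME and is_valid_schema_name are hypothetical names.

```python
import re

# 1-64 characters drawn from ASCII letters, digits, underscores, and dashes.
# ASCII-only letters are an assumption; the documented rule just says "letters".
SCHEMA_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_schema_name(name: str) -> bool:
    """Return True if the whole string satisfies the naming rule."""
    return SCHEMA_NAME.fullmatch(name) is not None


print(is_valid_schema_name("company_summary"))
print(is_valid_schema_name("has space"))
```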
from typing import List, Optional
from pydantic import BaseModel
from perplexity import Perplexity

class CompanySummary(BaseModel):
    name: str
    sector: str
    headquarters: str
    key_products: Optional[List[str]] = None

client = Perplexity()

response = client.responses.create(
    preset="pro-search",
    input="Summarize NVIDIA: sector, headquarters, and key products.",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "company_summary",
            "schema": {
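                # Start from the JSON Schema that Pydantic generates for the model.
                **CompanySummary.model_json_schema(),
                # Mark every field as required, including the optional key_products.
                "required": list(CompanySummary.model_fields.keys()),
                # Reject keys that are not declared on the model.
                "additionalProperties": False,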
            },
        },
    },
)

summary = CompanySummary.model_validate_json(response.output_text)
print(summary.name, summary.sector)
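Since the output may not conform when max_output_tokens cuts it off, a consumer might parse defensively rather than assume valid JSON. The sketch below uses only the standard library instead of Pydantic so it stays self-contained. parse_structured is a hypothetical helper, and the two strings are stand-in outputs, not real API responses.

```python
import json
from typing import Any, Dict, Iterable, Optional


def parse_structured(text: str, required: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return the parsed object, or None if the text is not usable.

    Treats invalid JSON, a non-object result, or a missing required key
    as unusable output (for example, text truncated by a token cap).
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required):
        return None
    return data


REQUIRED = ("name", "sector", "headquarters")

# Stand-in outputs: one complete, one cut off mid-string.
complete_text = '{"name": "NVIDIA", "sector": "Semiconductors", "headquarters": "Santa Clara"}'
truncated_text = '{"name": "NVIDIA", "sector": "Semi'

print(parse_structured(complete_text, REQUIRED))
print(parse_structured(truncated_text, REQUIRED))
```

When the helper returns None, one option is to retry with a larger max_output_tokens rather than trying to repair the partial JSON.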
Reinforce the schema in your prompt (“Return the data as a JSON object matching the schema”) to improve adherence.
Avoid asking for links inside the JSON. A model emitting URLs as part of structured output can produce malformed or fabricated links. Pull links from the citations or search_results items in the response output instead.

Next steps

Keep context

Carry state across multiple turns.

Output Control reference

Every stream event, error handling, and full structured-output examples.