This guide walks through building a complete retrieval-augmented generation (RAG) pipeline using Perplexity’s Embeddings API and Agent API. It covers document chunking, embedding with both standard and contextualized models, building an in-memory vector index, querying for relevant context, and generating grounded answers.
This guide focuses on the end-to-end pipeline. For API reference details on individual embedding types, see Standard Embeddings and Contextualized Embeddings.

Pipeline Overview

A RAG pipeline retrieves relevant information from your own documents before generating an answer, grounding model responses in your data rather than relying solely on parametric knowledge.

[RAG pipeline diagram]

The steps are:
  1. Chunk your source documents into manageable pieces with overlap.
  2. Embed each chunk using a Perplexity embedding model.
  3. Index the embeddings for similarity search.
  4. Query by embedding the user question with the same model.
  5. Retrieve the top-k most similar chunks.
  6. Generate an answer by passing the retrieved context to the Agent API.

Prerequisites

Install the Perplexity SDK:
pip install perplexityai
If you don’t have an API key yet, navigate to the API Keys tab in the API Portal and generate a new key.

Then export your API key as an environment variable:
export PERPLEXITY_API_KEY="your-api-key"

Document Chunking

Split your documents into chunks small enough for the model’s context window while preserving semantic coherence. Overlapping chunks ensure that information at chunk boundaries is not lost.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start += chunk_size - overlap  # advance, re-covering `overlap` chars of context
    return chunks

document = """Retrieval-augmented generation (RAG) is a technique that combines
information retrieval with text generation. Rather than relying solely on a
language model's training data, RAG systems first search a knowledge base for
relevant documents, then use those documents as context when generating a
response. This reduces hallucinations and allows the system to provide answers
grounded in specific, up-to-date sources."""

chunks = chunk_text(document, chunk_size=300, overlap=50)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:60]}...")
A chunk size of 300-500 characters with 50-100 characters of overlap works well for most use cases. For structured documents (markdown, HTML), consider splitting on headings or paragraph boundaries instead of raw character counts.
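For example, a minimal paragraph-boundary splitter might look like the sketch below (not part of the pipeline above; the max_chars budget is an illustrative choice):
def chunk_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, merging consecutive paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks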

Embedding with the Standard Model

Standard embeddings treat each text independently. Use them when chunks are self-contained and don’t rely on surrounding context.
import base64
import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64_string: str) -> np.ndarray:
    """Decode a base64-encoded int8 embedding to a float32 numpy array."""
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

chunks = [
    "RAG combines retrieval with generation to ground responses in real data.",
    "Document chunking splits text into overlapping segments for embedding.",
    "Cosine similarity measures the angle between two embedding vectors.",
]

response = client.embeddings.create(input=chunks, model="pplx-embed-v1-4b")
embeddings = [decode_embedding(emb.embedding) for emb in response.data]
print(f"Embedded {len(embeddings)} chunks, each with {len(embeddings[0])} dimensions")

Embedding with the Contextualized Model

Contextualized embeddings understand that chunks belong to the same document. The model uses cross-chunk attention so that each chunk’s embedding incorporates information from its neighbors. The key API difference is the nested array structure: each inner array contains chunks from a single document.
from perplexity import Perplexity

client = Perplexity()

# Two source documents, each split into chunks
doc1_chunks = [
    "RAG combines retrieval with generation to produce grounded answers.",
    "The retrieval step searches a vector index for chunks similar to the query.",
    "The generation step uses retrieved context to produce a final response."
]
doc2_chunks = [
    "Embedding models convert text into dense vector representations.",
    "Cosine similarity is the standard metric for comparing embeddings."
]

# Pass as nested arrays (one inner array per document)
response = client.contextualized_embeddings.create(
    input=[doc1_chunks, doc2_chunks],
    model="pplx-embed-context-v1-4b"
)

# Nested response: response.data[doc_idx].data[chunk_idx]
for doc in response.data:
    for chunk in doc.data:
        print(f"Doc {doc.index}, Chunk {chunk.index}: {chunk.embedding[:20]}...")
Chunk ordering matters. Chunks within each document must be passed in their original sequential order. The contextualized model uses positional context to relate neighboring chunks, so shuffling them will degrade embedding quality.

Querying a Contextualized Index

When using contextualized embeddings, wrap each query as a single-element inner list (e.g., [[query]]) so the API treats it as a single-chunk document:
from perplexity import Perplexity
import base64, numpy as np

client = Perplexity()

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Index with contextualized model (chunks share cross-chunk attention)
doc_chunks = [
    "RAG combines retrieval with generation to produce grounded answers.",
    "The retrieval step finds chunks similar to the user query.",
    "The generation step uses retrieved context to produce a final response.",
]
ctx_response = client.contextualized_embeddings.create(
    input=[doc_chunks],  # nested array: one inner list per document
    model="pplx-embed-context-v1-4b"
)
index = [
    {"embedding": decode_embedding(chunk.embedding), "text": doc_chunks[chunk.index]}
    for chunk in ctx_response.data[0].data
]

# Query the index
query = "How does retrieval work in RAG?"
q_response = client.contextualized_embeddings.create(
    input=[[query]], model="pplx-embed-context-v1-4b"
)
q_emb = decode_embedding(q_response.data[0].data[0].embedding)
results = sorted(index, key=lambda x: cosine_similarity(q_emb, x["embedding"]), reverse=True)
print(f"Top result: {results[0]['text']}")

Building a Vector Index

This example uses numpy for cosine similarity with a simple in-memory index. For production systems with millions of vectors, use a dedicated vector database (Pinecone, Weaviate, Qdrant, etc.).
import base64
import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64_string: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents to index
documents = {
    "RAG Overview": [
        "Retrieval-augmented generation grounds LLM responses in external data.",
        "RAG reduces hallucinations by providing factual context to the model.",
        "A typical RAG pipeline has three stages: indexing, retrieval, and generation."
    ],
    "Embedding Models": [
        "Embedding models map text to dense vector representations.",
        "Similar texts produce vectors that are close in the embedding space.",
        "Perplexity offers both standard and contextualized embedding models."
    ]
}

# Build index: list of {embedding, text, doc_title} dicts
index = []
for title, chunks in documents.items():
    response = client.embeddings.create(input=chunks, model="pplx-embed-v1-4b")
    for emb_obj in response.data:
        index.append({
            "embedding": decode_embedding(emb_obj.embedding),
            "text": chunks[emb_obj.index],
            "doc_title": title
        })

print(f"Indexed {len(index)} chunks")

Query Pipeline

The full query pipeline embeds the user question, retrieves the top-k most similar chunks, and passes them as context to the Agent API for answer generation.
def rag_query(question: str, index: list[dict], top_k: int = 3, min_score: float = 0.3) -> str:
    """Embed question -> retrieve similar chunks -> generate answer."""
    # Step 1: Embed the question
    query_response = client.embeddings.create(input=[question], model="pplx-embed-v1-4b")
    query_emb = decode_embedding(query_response.data[0].embedding)

    # Step 2: Retrieve top-k chunks above the minimum similarity threshold
    scored = sorted(
        [{"score": cosine_similarity(query_emb, item["embedding"]), **item} for item in index],
        key=lambda x: x["score"], reverse=True
    )[:top_k]
    scored = [item for item in scored if item["score"] >= min_score]

    if not scored:
        return "No relevant context found for this question."

    # Include source attribution alongside each chunk
    context = "\n\n".join(
        f"[Source: {item['doc_title']}]\n{item['text']}" for item in scored
    )

    # Step 3: Generate answer via Agent API
    response = client.responses.create(
        model="openai/gpt-5.4",
        input=question,
        instructions=(
            "Answer based only on the provided context. "
            "Cite sources by name when referencing specific information. "
            "If the context does not contain enough information, say so.\n\n"
            f"Context:\n{context}"
        )
    )
    return response.output_text

answer = rag_query("What are the stages of a RAG pipeline?", index)
print(answer)
Start with top_k=3 and min_score=0.3 for most use cases. Raise top_k to 5–7 for broad questions or short chunks. Raise min_score to 0.5–0.7 if retrieved chunks contain irrelevant information. Lower it toward 0.2 for diverse or ambiguous queries.
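For example (illustrative values; tune against your own data):
# Broad question: cast a wider net with a lower threshold
answer = rag_query("Give me an overview of RAG.", index, top_k=7, min_score=0.2)

# Narrow question: fewer, higher-confidence chunks
answer = rag_query("What are the stages of a RAG pipeline?", index, top_k=3, min_score=0.5)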

Standard vs Contextualized Comparison

Aspect            | Standard (pplx-embed-v1-4b)              | Contextualized (pplx-embed-context-v1-4b)
Input format      | Flat list of texts                       | Nested arrays grouped by document
Context awareness | Each text embedded independently         | Chunks share cross-chunk context within each document
Best for          | FAQ entries, standalone texts, short documents | Document paragraphs, article sections
Chunk ordering    | Order does not matter                    | Must be in original document order
Query embedding   | client.embeddings.create(input=[query])  | client.contextualized_embeddings.create(input=[[query]])
Price (4b model)  | $0.03 / 1M tokens                        | $0.05 / 1M tokens

When to use standard embeddings

  • Chunks are self-contained and do not rely on surrounding context.
  • Your content consists of FAQ pairs, product descriptions, or short independent entries.
  • You need the lowest cost per token.

When to use contextualized embeddings

  • Chunks come from longer documents where meaning depends on neighboring text.
  • A chunk like “This approach improves performance by 20%” only makes sense with its surrounding context.
  • You are embedding paragraphs from articles, reports, or technical documentation.
  • You want higher retrieval accuracy at a modest cost increase.

Matryoshka Dimensions

Perplexity embedding models support Matryoshka Representation Learning (MRL), which concentrates the most important information in the first N dimensions. You can request reduced dimensions directly via the API for faster search and smaller storage.
import base64
import numpy as np
from perplexity import Perplexity

client = Perplexity()

texts = ["Matryoshka embeddings allow dimension reduction without re-embedding."]

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8)

# Full dimensions (2560 for 4b model)
full = client.embeddings.create(input=texts, model="pplx-embed-v1-4b")

# Reduced to 512 dimensions via the API
reduced = client.embeddings.create(input=texts, model="pplx-embed-v1-4b", dimensions=512)

print(f"Full: {len(decode_embedding(full.data[0].embedding))} dimensions")
print(f"Reduced: {len(decode_embedding(reduced.data[0].embedding))} dimensions")
Dimension reduction tradeoffs for the pplx-embed-v1-4b model:
Dimensions  | Storage per Vector | Relative Quality | Use Case
2560 (full) | 2.5 KB             | Highest          | Maximum accuracy, small datasets
1024        | 1 KB               | Very high        | Good balance for most applications
512         | 512 B              | High             | Large-scale retrieval, fast search
256         | 256 B              | Moderate         | Extremely large datasets, coarse filtering
128         | 128 B              | Lower            | First-pass candidate filtering
Use the dimensions parameter in the API call rather than manually truncating vectors. The API applies proper normalization for the requested dimension count. Start with full dimensions and reduce only when storage or latency becomes a bottleneck.
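One practical consequence: vectors are only comparable when produced with the same dimensions value, so embed your corpus and your queries with matching settings. A minimal sketch (reusing the decoding helpers from earlier):
import base64
import numpy as np
from perplexity import Perplexity

client = Perplexity()

def decode_embedding(b64: str) -> np.ndarray:
    # Cast to float32 so the dot product below doesn't overflow int8
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

DIMS = 512  # must match between corpus and query embeddings

doc_resp = client.embeddings.create(
    input=["RAG grounds answers in retrieved context."],
    model="pplx-embed-v1-4b", dimensions=DIMS
)
query_resp = client.embeddings.create(
    input=["How does RAG reduce hallucinations?"],
    model="pplx-embed-v1-4b", dimensions=DIMS
)
print(cosine_similarity(
    decode_embedding(doc_resp.data[0].embedding),
    decode_embedding(query_resp.data[0].embedding),
))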

Batch Processing

When embedding large document collections, process them in batches to stay within API rate limits. The standard API accepts up to 512 texts per request with a combined limit of 120,000 tokens.
import asyncio
import base64
import numpy as np
from perplexity import AsyncPerplexity

def decode_embedding(b64_string: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64_string), dtype=np.int8).astype(np.float32)

async def batch_embed(texts: list[str], batch_size: int = 100) -> list[np.ndarray]:
    """Embed texts in batches with rate limiting."""
    async with AsyncPerplexity() as client:
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = await client.embeddings.create(
                input=batch, model="pplx-embed-v1-4b"
            )
            all_embeddings.extend(decode_embedding(e.embedding) for e in response.data)
            print(f"Embedded {min(i + batch_size, len(texts))}/{len(texts)}")
            if i + batch_size < len(texts):
                await asyncio.sleep(0.1)  # Brief delay between batches
        return all_embeddings

# Usage
texts = [f"Document chunk number {i} with content." for i in range(500)]
embeddings = asyncio.run(batch_embed(texts, batch_size=100))
print(f"Total: {len(embeddings)} embeddings")
For contextualized embeddings, batch at the document level using client.contextualized_embeddings.create(input=batch_of_doc_arrays) with the same pattern. The contextualized API accepts up to 512 documents with 16,000 total chunks per request.
Rate limits: Keep batch sizes well within the API limits (512 texts / 120,000 tokens for standard; 512 documents / 16,000 chunks for contextualized) and add small delays between requests to avoid throttling.
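A minimal sketch of that document-level pattern (the batch size of 50 is an illustrative choice well under the 512-document limit):
import asyncio
import base64
import numpy as np
from perplexity import AsyncPerplexity

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8).astype(np.float32)

async def batch_embed_documents(
    docs: list[list[str]], batch_size: int = 50
) -> list[list[np.ndarray]]:
    """Embed documents (each a list of ordered chunks) in document-level batches."""
    async with AsyncPerplexity() as client:
        results = []
        for i in range(0, len(docs), batch_size):
            batch = docs[i:i + batch_size]
            response = await client.contextualized_embeddings.create(
                input=batch, model="pplx-embed-context-v1-4b"
            )
            # Preserve the nested structure: one list of vectors per document
            results.extend(
                [decode_embedding(c.embedding) for c in doc.data]
                for doc in response.data
            )
            if i + batch_size < len(docs):
                await asyncio.sleep(0.1)  # brief delay between batches
        return results

# Usage: 120 documents of 3 chunks each
docs = [[f"Doc {d} chunk {c}." for c in range(3)] for d in range(120)]
embeddings = asyncio.run(batch_embed_documents(docs))
print(f"Embedded {len(embeddings)} documents")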

Complete Example

A self-contained pipeline that indexes two documents with contextualized embeddings and answers questions against the indexed content.
import base64
import numpy as np
from perplexity import Perplexity

client = Perplexity()

# --- Helpers ---

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunk = text[start:start + chunk_size].strip()
        if chunk:
            chunks.append(chunk)
        start += chunk_size - overlap
    return chunks

def decode_embedding(b64: str) -> np.ndarray:
    return np.frombuffer(base64.b64decode(b64), dtype=np.int8).astype(np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# --- Source documents ---

DOCUMENTS = {
    "Quantum Computing": (
        "Quantum computers use qubits that can exist in superposition, representing "
        "0 and 1 simultaneously. Unlike classical bits, qubits leverage quantum "
        "interference to perform calculations. Quantum entanglement allows qubits to "
        "be correlated, enabling parallel processing at scale. Current quantum computers "
        "from IBM, Google, and others have dozens to hundreds of physical qubits."
    ),
    "Machine Learning": (
        "Machine learning enables computers to learn from data without explicit "
        "programming. Supervised learning uses labeled examples to train models for "
        "classification and regression. Neural networks with many layers (deep learning) "
        "excel at image recognition and language tasks. Training requires large datasets "
        "and significant compute, often using GPUs or TPUs."
    ),
}

# --- Step 1: Index with the contextualized model ---

def build_index(documents: dict[str, str]) -> list[dict]:
    index = []
    for title, text in documents.items():
        chunks = chunk_text(text)
        response = client.contextualized_embeddings.create(
            input=[chunks],
            model="pplx-embed-context-v1-4b"
        )
        for chunk_obj in response.data[0].data:
            index.append({
                "embedding": decode_embedding(chunk_obj.embedding),
                "text": chunks[chunk_obj.index],
                "doc_title": title,
            })
    print(f"Indexed {len(index)} chunks from {len(documents)} documents")
    return index

# --- Step 2: Query the index, retrieve, generate ---

def rag_query(question: str, index: list[dict], top_k: int = 3, min_score: float = 0.3) -> str:
    q_resp = client.contextualized_embeddings.create(
        input=[[question]], model="pplx-embed-context-v1-4b"
    )
    q_emb = decode_embedding(q_resp.data[0].data[0].embedding)

    results = sorted(
        [{"score": cosine_similarity(q_emb, item["embedding"]), **item} for item in index],
        key=lambda x: x["score"], reverse=True
    )[:top_k]
    results = [r for r in results if r["score"] >= min_score]

    if not results:
        return "No relevant context found for this question."

    context = "\n\n".join(f"[{r['doc_title']}]\n{r['text']}" for r in results)

    response = client.responses.create(
        model="openai/gpt-5.4",
        input=question,
        instructions=(
            "Answer based only on the provided context. "
            "Cite the source name in brackets when referencing information. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}"
        )
    )
    return response.output_text

# --- Run ---

if __name__ == "__main__":
    index = build_index(DOCUMENTS)

    questions = [
        "What makes qubits different from classical bits?",
        "What hardware is used to train machine learning models?",
    ]
    for q in questions:
        print(f"\nQ: {q}")
        print(f"A: {rag_query(q, index)}")

Next Steps

Standard Embeddings

API reference for standard embedding parameters and response format.

Contextualized Embeddings

API reference for contextualized embedding parameters and response format.

Best Practices

Encoding formats, similarity metrics, normalization, and error handling.

Agent API

Learn more about the Responses API used for answer generation.